Why Do You Need A Server

I’ve somewhat always had a “server” of sorts for as long as I’ve had a computer.
Be it for Minecraft, web hosting or long-running computations, I’ve always found one useful.
In this case, the server will be used for HPC and file-storage workloads.

Hardware

Sourcing Parts

PCPartPicker Part List

TypeItemPrice
CPUAMD Ryzen 9 7950X 4.5 GHz 16-Core ProcessorPurchased For £350.00
MotherboardMSI PRO B650-P WIFI ATX AM5 Motherboard£150.00
MemoryCrucial CT2K32G48C40U5 64 GB (2 x 32 GB) DDR5-4800 CL40 Memory£110.00
StorageWestern Digital Black SN770 1 TB M.2-2280 PCIe 4.0 X4 NVME Solid State DrivePurchased For £0.00
StorageSeagate IronWolf NAS 4 TB 3.5" 5400 RPM Internal Hard Drive£100.79 @ Amazon UK
Video CardNVIDIA Founders Edition GeForce RTX 3090 24 GB Video Card£580.00
CaseKOLINK Unity Lateral Performance ATX Mid Tower Case£39.95 @ Overclockers.co.uk
Power SupplyAsus ROG THOR P2 Gaming 1200 W 80+ Platinum Certified Fully Modular ATX Power SupplyPurchased For £130.00
Prices include shipping, taxes, rebates, and discounts
Total£1460.74

I opted for the high-risk but low-cost option of sourcing almost everything used, starting with the GPU.
At the time, this was the cheapest 3090 I could get, it’s also the reason the server is watercooled (this went so well…)

The CPU was sourced from CEX as it was the cheapest place to get it whilst remaining reputable.
The case was bought new from Amazon.

Everything else was sourced from eBay (yes, even the PSU).

POST Troubles

I put the server together (mostly) and fired it up, to my dissapointment, it seemed to not POST.
I had bought the motherboard as a broken one assuming that bent CPU socket pins were the only issue, unfortunately it seems that assumption was false, I ended up purchasing a new motherboard which solved this issue.

It lives?

After filling it with coolant, confirming it POSTs and doesn’t leak I attempted to attach a SATA SSD to the motherboard.
The PC shut-off immediately.

After a bit of diagnosing it turned out that the SATA cable that shipped with this PSU was actually for the wrong PSU, not sure why the pump wasn’t affected by this but regardless I rewired it thanks to this handy pinout diagram.

IT LIVES!

the server, standing proudly besides my 3D printer

And so, the server is alive, all’s right with the world.

Hard Drives

To attach hard drives to the server, I opted to re-use a Dell PowerEdge R710 backplane I had lying around, after ["figuring out" the pinout](https://www.reddit.com/r/homelab/comments/eod1ro/comment/jvdfzrp/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) I was able to power the backplane and connect it to an IT-mode flashed PERC H200 card.

I tested this, naturally, by attaching a blu-ray drive to it.

Ok but actually

The hard-drives I used in the end were all “refurbished”, except for two which were taken from the R720.
In total, 4 Seagate 4TB drives and 2 WD 2TB drives.

Did you know that 4TB hard drives have the best price-to-storage ratio on bargainhardware?
Capacity (TB)Cheapest Drive Cost (GBP)Cost Per TB
4 TB£30.00£7.50
6 TB£48.00£8.00
8 TB£72.00£9.00
10 TB£96.00£9.60
12 TB£132.00£11.00
13.2 TB£138.00£10.45
14 TB£144.00£10.29
16 TB£168.00£10.50
18 TB£240.00£13.33
20 TB£324.00£16.20
22 TB£324.00£14.73

Fans

After running the server for a few hours, I noticed the drives would get quite hot (80+ degrees celcius), these were drives designed to be used in a rack-mounted and those usually draw air in from the front, through the drives. Odds were they wanted active cooling, so I devised a “solution”:

a wee bit redneck, but it works

Disaster - A note on picking the right tubes and coolant

At one point, I was fine-tuning a model when I noticed the GPU temperature was abnormally high (90 degrees celcius), I investigated and noticed that one of the tubes I was using had collapsed in on itself, likely from a combination of temperature and pressure from the angle it was at.

(the tube that collapsed, prior to collapsing)

I replaced the segment of tubing and kept going until I had the same issue a week later, so I opted to replace all the tubing with thicker tubing.
Unfortunately, the coolant didn’t seem to be flowing well…

I ended up taking apart the water blocks for the CPU and GPU to clean up gunk accumulation.

In addition to this, I repasted and replaced the pads on the the GPU and repasted the CPU. I also replaced the blue coolant I was using with distilled water.

THERE IS NO BENEFIT TO USING ANYTHING OTHER THAN DISTILLED WATER FOR WATER COOLING

The stuff added to coloured coolants is often detrimental, and distilled water is insanely cheap. Also colour coolant can clog water blocks from what I’ve heard [not fact checked].

After the loop was cleaned, flushed, etc, the server was put back together and has been running just fine ever since.

(Somehow I did manage to kill the RTX 3090’s RGB in this debacle, but the GPU itself seems to be fine.)

(Somehow I did manage to kill the RTX 3090’s RGB in this debacle, but the GPU itself seems to be fine.)

Software

Although rather unconventional, the server runs Arch btw. This was just a personal preference.

ZFS

The drives are managed via ZFS as such:

MATE Terminal

File

Edit

View

Search

Terminal

Help

cortex@CORTEX-Server  ~  sudo zpool list
[sudo] password for cortex: 
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
personal       928G   391G   537G        -         -    47%    42%  1.00x    ONLINE  -
reproducible  3.62T  3.50T   126G        -         -    57%    96%  1.00x    ONLINE  -
srv           14.5T  2.54T  12.0T        -         -     3%    17%  1.00x    ONLINE  -
cortex@CORTEX-Server  ~         

MATE Terminal

File

Edit

View

Search

Terminal

Help

cortex@CORTEX-Server  ~  sudo zfs list  
[sudo] password for cortex: 
NAME                       USED  AVAIL  REFER  MOUNTPOINT
personal                   391G   509G   112K  /mnt/personal
personal/cm-hpc            390G   509G   390G  /mnt/personal/cm-hpc
reproducible              3.50T  11.5G  3.50T  /reproducible
srv                       1.21T  5.71T  10.1M  /srv
srv/cm-hpc                76.7G  5.71T  76.7G  /srv/cm-hpc
srv/docker                 285G  5.71T   209K  /srv/docker
srv/docker/dokuwiki        138M  5.71T   138M  /srv/docker/dokuwiki
srv/docker/forgejo         284G  5.71T   284G  /srv/docker/forgejo
srv/docker/frigate        77.4M  5.71T  77.4M  /srv/docker/frigate
srv/docker/homeassistant   140K  5.71T   140K  /srv/docker/homeassistant
srv/docker/penpot          106M  5.71T   106M  /srv/docker/penpot
srv/hpc                    874G  5.71T   874G  /srv/hpc

The 2 WD drives are essentially striped as the “reproducible” pool for a total of 4TB non-redundant storage for what is essentially a glorified downloads folder, hence the name “reproducible” (it can be redownloaded if really needed).
Wheras the 4 4TB drives are used as a RAIDZ2 array which provides redundancy if 2 disks fail, additionally, datasets are configured for each Docker container.

Drive Wear

The hard drives for the server were being unusually loud, making clicking noises occasionally. It turned out that this noise was the drives parking and unparking.
I disabled this behaviour using OpenSeaChest (specifically openSeaChest_PowerControl) for the Seagate drives as such:

MATE Terminal

File

Edit

View

Search

Terminal

Help

cortex@CORTEX-Server  ~/openSeaChest/Make/gcc/openseachest_exes  ➦ bafb778  sudo ./openSeaChest_PowerControl -d /dev/sdd --changePower --disableMode --powerMode idle_b
...
cortex@CORTEX-Server  ~/openSeaChest/Make/gcc/openseachest_exes  ➦ bafb778  sudo ./openSeaChest_PowerControl -d /dev/sdd --changePower --disableMode --powerMode idle_c
===EPC Settings===
    * = timer is enabled
    C column = Changeable
    S column = Saveable
    All times are in 100 milliseconds

Name       Current Timer Default Timer Saved Timer   Recovery Time C S
Idle A     *1200         *10           *1200         256           Y N
Idle B      2400         *6000          2400         512           Y N
Idle C      18000         18000         18000        7168          Y N
Standby Y   6000          18000         6000         25600         N N
Standby Z   9000          36000         9000         33792         N N

(I opted not to disable idle_a)

I also used WDIDLE to disable the WD equivalent for the reproducible pool drives.

LXC

The server runs many LXCs operating in a veth network type. In effect each LXC is directly connected to my LAN.
Some of the LXCs have the GPU passed-through to them with the following configuration entries:

MATE Terminal

File

Edit

View

Search

Terminal

Help

GNU nano 8.5 - /var/lib/lxc/hpc/config                                 

# Allow cgroup access
lxc.cgroup.devices.allow = c 195:* rwm
lxc.cgroup.devices.allow = c 240:* rwm
lxc.cgroup.devices.allow = c 243:* rwm

# Pass through device files
lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file

(I have totally forgotten how the cgroups work)

The two relevant LXCs that I wil focus on are cortex-hc and cortex-docker.

CORTEX-HC

cortex-hc is an Arch Linux container which I primarily use for running my workloads, it exposes a GUI environment over VNC.

I could go on a long rant about the absolute pain that was getting headless VNC to work whilst the NVIDIA drivers were installed…

But in a nutshell, I found that running:

rm /lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1
rm /usr/lib/libnvidia-egl-gbm.so*

fixed the issue.

As for how the VNC server is started, I merely used a systemd service to run a script with the following:

export DISPLAY=:1
Xvfb $DISPLAY -screen 0 1920x1080x16 &
mate-session &
x11vnc -display $DISPLAY -bg -forever -quiet -listen localhost -rfbauth /home/stamisme/.vnc/passwd -shared

“If it ain’t broke…”

cortex-docker

.
|-- dokuwiki
|   |-- compose.yml
|   |-- logs
|   |-- nginx
|   `-- php
|-- forgejo
|   `-- compose.yml
|-- homeassistant
|   |-- compose.yml
|   `-- config
`-- jellyfin
    |-- cache
    |-- compose.yml
    `-- config

The cortex-docker container uses the Arch linux template (alongside every other container). As you can undoubtedly tell it’s just using Docker compose.

NGINX

The server’s many available sites are handled by a single NGINX reverse proxy running on the Arch host OS.
SSL is managed by certbot.

For example, the configuration file for git.hackerdude.tech is:

server {
    server_name git.hackerdude.tech;
    
    # HTTP configuration
    
    # HTTP to HTTPS
    if ($scheme != "https") {
        return 301 https://$host$request_uri;
    }
    
    # HTTPS configuration
    http2 on;

    location /robots.txt {
       add_header Content-Type text/plain;
       return 200 "User-agent: *\nDisallow: /\n";
    }

    location / {
        client_max_body_size 5G;
        proxy_pass  http://192.168.4.43:3000;
        proxy_redirect                      off;
        proxy_set_header  Host              $http_host;
        proxy_set_header  X-Real-IP         $http_CF_Connecting_IP;
        proxy_set_header  X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header  X-Forwarded-Proto $scheme;
        proxy_read_timeout                  900;

        # WebSocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }


    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/hackerdude.tech/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/hackerdude.tech/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}

server {
    if ($host = git.hackerdude.tech) {
        return 301 https://$host$request_uri;
    } # managed by Certbot


    server_name git.hackerdude.tech;
    listen 80;
    return 404; # managed by Certbot
}

Fin!

To be fair this post mainly exists as notes for me to reference back to every now and then

Todo

  • [DONE] Explore streaming solutions such as Sunshine instead of VNC for latency
  • [DONE] Figure out why UE5 Lightmass is broken in LXCs

UE5 lightmass issues were fixed by removing an assertion from Engine/Source/Runtime/Core/Private/HAL/ThreadingBase.cpp
if it works…

@ -85,7 +85,7 @@ FTaskTagScope::FTaskTagScope(bool InTagOnlyIfNone, ETaskTag InTag) : Tag(InTag),

if (ActiveTaskTag == ETaskTag::EStaticInit)
{
-    checkf(Tag == ETaskTag::EGameThread, TEXT("The Gamethread can only be tagged on the inital thread of the application"));
+    //checkf(Tag == ETaskTag::EGameThread, TEXT("The Gamethread can only be tagged on the inital thread of the application"));
#if !IS_RUNNING_GAMETHREAD_ON_EXTERNAL_THREAD
#	if PLATFORM_WINDOWS && DO_CHECK
    // When RenderDoc injects its DLL it first creates this process in a suspended