As a Data Scientist coming from the Networking and DevOps world, I’m a firm believer in the Homelab philosophy: The best way to learn things is to experiment and build within a lab environment. This was crucial to me getting my CCNA, and how I learned virtualization as well.
Now that I’m transitioning into Data Science and Machine Learning, I recently updated my lab. In this post I explain the specific HW and SW setup I followed to convert my z620 into an Nvidia-Docker workstation. I want to go through my thought process in deciding how to update my lab, and give some recommendations for those who are building a Data Science/Deep Learning homelab.
Intro: Containers vs VM’s vs the Cloud
Before going further, I need to define some of the terms I’ll be using for the rest of this post. When a Data Scientist or Machine Learning engineer wants to build a model or run an experiment, there are 4 primary ways they can do this:
A Workstation or Server: A workstation is just a (desktop) computer that has been designed for professional use, usually with more RAM and components that are meant to last longer. A server, in this context, is a computer whose case is designed to sit among other computers, not people, so the fans are louder, but it can be mounted in a standard rack.
A Virtualization Host: The analogy I always use to explain virtualization is the film The Matrix: If you were in the matrix right now, and signals being sent to your brain told you that this is reality, would it be possible for you to know that you’re in the matrix? A virtualization host (like VMware or Proxmox) is basically the matrix: there is a workstation or server that is the “host”, and the virtualization software simulates reality for the VMs (Virtual Machines), so that they don’t realize they aren’t actually being run on “bare metal” (their own workstation or server). There are a number of benefits to virtualization, such as the ability to “snapshot” a VM before you upgrade or make a change to it, so you can quickly revert if you aren’t happy with the effects of your change.
Containers: If virtualization is The Matrix, then containers can be thought of as a gaming cafe filled with people playing different videogames. Rather than trying to simulate reality, there is a shared reality (the actual space), and every computer is showing a player a specific game. With a container (such as Docker), all of the applications share the host’s kernel and low-level hardware access, and each one brings only its own libraries. This produces performance gains compared to virtualization. Here is a great resource that explains this in more technical detail if you are interested.
The Cloud: The cloud, in the context of data science and machine learning, is easy to explain: It’s a Container or VM running on someone else’s computer that you pay for by the hour. This means that when you aren’t running an experiment or working on a project, the hardware is being utilized by someone else who is.
Part 1: Why my homelab needed a change, the importance of reproducibility in Data Science
While I was working on Anime_Rec, I discovered that I had exceeded my available memory. I solved the immediate problem with better programming practices: I had data in a dataframe and was going to build a pivot table from it, then convert the pivot table to a sparse matrix, when the better solution was to convert directly from the dataframe to a sparse matrix in one step. After that, I had to grapple with three questions:
1. Given my current hardware (I had a virtualization host with 96 GB of RAM that was sitting unused), why was I hitting memory issues on a 16 GB workstation?
2. Why was I working with neural nets on a system that was ‘Bespoke’ or heavily customized? If I wanted to productionize my model, or send my work to someone else for them to reproduce my results, how would that go?
3. Would I be better served entirely by using Containers rather than VMs?
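The one-step conversion mentioned above (dataframe straight to sparse matrix, with no dense pivot table in between) might look something like the sketch below. This is not the actual Anime_Rec code: the column names, the sample data and the `to_sparse` helper are all made up for illustration, and it assumes pandas and SciPy are installed.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical long-format ratings table: one row per (user, title) pair.
ratings = pd.DataFrame({
    "user": ["ann", "bob", "ann", "cho"],
    "anime": ["Akira", "Akira", "Bebop", "Bebop"],
    "rating": [9.0, 7.0, 10.0, 8.0],
})


def to_sparse(df, row_col, col_col, val_col):
    """Build a CSR matrix directly from long-format data.

    Only the observed (row, col, value) triplets are stored, so the dense
    users x titles table that a pivot would create never exists in memory.
    Note: duplicate (row, col) pairs are summed by SciPy.
    """
    # Categorical codes give each distinct label a compact integer index.
    rows = df[row_col].astype("category")
    cols = df[col_col].astype("category")
    matrix = csr_matrix(
        (
            df[val_col].to_numpy(),
            (rows.cat.codes.to_numpy(), cols.cat.codes.to_numpy()),
        ),
        shape=(len(rows.cat.categories), len(cols.cat.categories)),
    )
    # Return the labels too, so matrix indices can be mapped back to names.
    return matrix, rows.cat.categories, cols.cat.categories


matrix, users, titles = to_sparse(ratings, "user", "anime", "rating")
print(matrix.shape, matrix.nnz)
```

The memory saving comes from never materializing the empty cells: with many users and many titles, most user/title pairs have no rating, and a pivot table has to allocate a cell for every one of them.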
The answer to 1 is obvious in retrospect: I had never updated my VM lab when my use case changed. I had a VM lab for networking; I was using the ‘hybrid network lab’ model, in which virtualized routers are given NIC ports to communicate with physical switches.
The answer to 2 is fairly important to anyone working in Data Science: From a ‘Science’ perspective, an experiment or a model should be reproducible by others. From a ‘Business’ perspective, when you find yourself relying on a heavily customized system (one that takes hours of your time to rebuild) to do an important task, you’ve created a problem. This problem is often called “technical debt”: Specifically, you’ve ensured that ‘future you’ (whether this person is you in the future, or your backfill when you move on to something else) is going to spend a lot of (likely unplanned) time reverse engineering or rebuilding your setup whenever you have a HW failure, an update breaks a dependency, or you need to collaborate with someone.
The answer to 3 is important for anyone who wants to work in Deep Learning: Containers offer a small performance increase over full VMs (7-10 percent are the numbers I usually see). Containers also offer (with Nvidia-Docker) a much better way to share a GPU between multiple containers than can be achieved with VMs. (With VMs you need to shut down both VMs and re-assign the GPU passthrough.) In a similar way, containers allow for more ‘flexible’ memory management (every container gets as much memory as it requests unless you restrict it, which can certainly be a double-edged sword), which is exactly what I want.
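To make the memory and GPU points concrete, here is a rough sketch of how a container launch could be assembled with an optional memory cap. It only builds and prints the command; it does not run Docker. It assumes the nvidia-docker 2 style of GPU access (`--runtime=nvidia` plus the `NVIDIA_VISIBLE_DEVICES` variable); newer Docker releases use the `--gpus` flag instead. The image name, the `train.py` script and the `docker_run_command` helper are hypothetical.

```python
import shlex


def docker_run_command(image, memory_limit=None, gpu_ids="all", command=()):
    """Return the argument list for a GPU-enabled `docker run`.

    memory_limit: e.g. "16g". When None, no --memory flag is added and the
    container may use as much host memory as it asks for.
    gpu_ids: "all" or a comma-separated list such as "0,1". Several
    containers can be pointed at the same GPU.
    """
    args = [
        "docker", "run", "--rm",
        "--runtime=nvidia",                        # assumes nvidia-docker 2
        "-e", f"NVIDIA_VISIBLE_DEVICES={gpu_ids}",
    ]
    if memory_limit is not None:
        args.append(f"--memory={memory_limit}")    # hard cap on container RAM
    args.append(image)
    args.extend(command)
    return args


# Hypothetical image and script names, for illustration only.
cmd = docker_run_command(
    "my-lab/experiment:latest",
    memory_limit="16g",
    command=["python", "train.py"],
)
print(shlex.join(cmd))
```

Leaving `memory_limit` unset gives the ‘flexible’ behaviour described above; setting it is the guard rail for when one runaway experiment should not be able to starve everything else on the host.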
So I decided: it was time to convert from a Proxmox host to an Nvidia-Docker server.
Part 2: A brief warning (or an argument for cloud)
Before I continue, I feel obligated to insert a ‘You should consider doing what I say, not what I do’ warning here: AWS, Google Cloud and Azure offer cloud-based, GPU-equipped systems by the hour. While in the long term building your own system is certainly cheaper, you need to do many, many hours of GPU computing before you hit breakeven. Until you’ve used up Google Cloud’s promotional credits and spent at least $100 on GPU cloud fees, I don’t recommend building a system like this if your goal is to save money.
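The break-even point can be estimated with simple arithmetic: divide what the hardware costs by what each hour at home saves compared to renting. In the sketch below every number is a placeholder, not a real quote for hardware, cloud pricing or electricity, so plug in your own figures.

```python
def break_even_hours(hardware_cost, cloud_rate_per_hour, watts, price_per_kwh):
    """Hours of GPU use after which owning becomes cheaper than renting.

    hardware_cost:       up-front cost of the home system
    cloud_rate_per_hour: hourly price of a comparable cloud GPU instance
    watts:               power draw of the home system under load
    price_per_kwh:       local electricity price
    """
    home_rate_per_hour = (watts / 1000) * price_per_kwh
    savings_per_hour = cloud_rate_per_hour - home_rate_per_hour
    if savings_per_hour <= 0:
        # Electricity alone costs more than renting: never breaks even.
        return float("inf")
    return hardware_cost / savings_per_hour


# Placeholder inputs, for illustration only.
hours = break_even_hours(
    hardware_cost=1000.0,
    cloud_rate_per_hour=0.90,
    watts=500,
    price_per_kwh=0.12,
)
print(f"Break-even after roughly {hours:.0f} GPU hours")
```

This ignores resale value, promotional credits and your own setup time, so treat the result as a rough lower bound on the hours needed.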
However, I think there are several benefits that make this worth the cost. As I explained above, ‘there is no such thing as the Cloud, just other people’s computers’. Just as it’s important to understand how the algorithms behind machine learning models work (so you don’t feed unscaled data into a KNN classifier, for example), knowing how Linux servers work with fewer abstractions is always a useful skill.
I would also note that I’m using the z620, which I’ve had for a while (drastically lowering my total project cost), but if someone is buying or building a new system for this purpose there may be better options. The z620 is a dual e5-2670 system. The e5-2670 is a legendary processor among the homelab community. When a large internet company began to retire, en masse, a large number of servers using the processor in 2015, the secondary market price for a chip that once retailed for $1600 fell as low as $70 due to used component recyclers. There have been a number of great resources on how to use these to build home servers. The z620 (along with the z420 and the Dell t3600) are workstations that shipped with a motherboard capable of handling this processor. After the e5-2670 flooded the secondary market, compatible motherboards have been getting harder to find, which is why I prefer working with a compatible workstation rather than building a system from scratch. When it comes to multi-threaded use cases, the e5-2670 is the standard by which I judge all other processors for homelab purposes.
However, the e5-2670 is a 5-year-old chip with a power draw of 115 watts. Its multi-threaded performance is matched by a modern midrange AMD chip, which has half the power draw. This means that by going AMD, you can get smaller motherboards that fit in normal cases, and you can use DDR4 RAM rather than DDR3. You also get a warranty with a new AMD. Also, if any part of your workflow is not optimized for multi-threaded work, then per-thread performance on the e5-2670 is going to be a massive bottleneck, and a modern Intel consumer chip with only 2 or 4 ‘faster’ cores will give better performance.
If you read my next post, you’ll notice that I’m using a 1080ti. TLDR: because of the mining situation, the 1080ti is one of the only cards with good availability that can be found at close to MSRP. There are other resources about choosing an Nvidia GPU for deep learning, but as of right now, the 1060 6gb is probably the cheapest card worth using, and performance per dollar appears to scale linearly up to the 1080ti, so the consensus advice is to buy the best card you can afford from the 10 series consumer line.
Part 3: If I was building this from scratch
If I was building this from scratch, here’s what I would consider:
Cheapest in time and effort: either a Dell t3600 or HP z420 with an e5-2670 and 32 GB of RAM from eBay. Use a ~120 GB SSD for the OS and a 2 TB platter drive to store large datasets. Look for a blower model 1060 6gb at close to MSRP and install that.
For Mid Range I would recommend something like this: https://pcpartpicker.com/b/trD2FT
For any higher end system, I would recommend more research (once you get into multiple GPUs this becomes a lot more complex). Consider exactly what your use case is, what your purchasing budget is, and what HW refresh cycle you plan to be on.
I hope this post was useful to anyone considering a homelab for DeepLearning.
You can contact me at my Contact Page if you have any questions, or if you want to talk about anything involving the intersection of Data Science, Machine Learning, and DevOps.
Also published on Medium.