{"id":10174,"date":"2024-08-30T11:13:31","date_gmt":"2024-08-30T19:13:31","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=10174"},"modified":"2025-06-18T10:51:45","modified_gmt":"2025-06-18T10:51:45","slug":"build-local-llm-server","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/","title":{"rendered":"Building a Low-Cost Local LLM Server to Run 70 Billion Parameter Models"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"585\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1.jpeg\" alt=\"\" class=\"wp-image-17104\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1.jpeg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1-300x171.jpeg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1-768x439.jpeg 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><em>A guest post from Fabr\u00edcio Ceolin, DevOps Engineer at Comet. Inspired by the growing demand for large-scale language models, Fabr\u00edcio engineered a cost-effective local LLM server capable of running models with up to 70 billion parameters. 
In this guide, you&#8217;ll explore how to build a powerful and scalable local LLM environment, enabling you to harness the full potential of these advanced models.<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-introduction\">Introduction<\/h2>\n\n\n\n<p>As the demand for large-scale language models (LLMs) continues to grow, particularly for running local AI agents, developers and researchers face significant challenges in managing the computational requirements of these powerful models. Running LLMs locally presents several hurdles, including the need for substantial hardware resources, high operational costs, and complex software configurations. These challenges are often a barrier for those who wish to experiment with, debug, and optimize LLM code without relying on expensive cloud-based solutions.<\/p>\n\n\n\n<p>This article addresses these challenges by providing a comprehensive guide to building a low-cost local LLM server capable of running models with up to 70 billion parameters. The proposed solution leverages affordable and repurposed hardware, initially intended for Ethereum mining, combined with advanced software tools like Kubernetes and OLLAMA, to create a scalable and efficient environment for LLM development.<\/p>\n\n\n\n<p>Following this guide, readers will learn how to assemble the necessary hardware, configure the software environment, deploy LLMs locally and run basic LLM queries. This approach reduces the cost associated with LLM experimentation and provides greater control over the development process, making it accessible to a broader audience of developers and researchers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-audience\">Audience<\/h2>\n\n\n\n<p>This guide is primarily intended for developers and researchers with some familiarity with hardware setups and software configurations, particularly around GPUs, Docker, and Kubernetes. 
If you are less familiar with these technologies, additional resources and explanations are provided via links to help you follow along.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-hardware-selection\">Hardware Selection<\/h2>\n\n\n\n<p>The first step in building a local LLM server is selecting the proper hardware. Depending on the response speed you require, you can opt for a <a href=\"https:\/\/medium.com\/@ttio2tech_28094\/local-large-language-models-hardware-benchmarking-ollama-benchmarks-cpu-gpu-macbooks-c696abbec613\">CPU, GPU, or even a MacBook<\/a>. For this project, I repurposed components originally intended for Ethereum mining to get a reasonable speed to run LLM agents. This approach provided both relative affordability and the computing power needed to run LLMs around the clock. I combined six GPUs to achieve a total of 96 GB of VRAM, essential for running LLMs with 70 billion parameters, <a href=\"https:\/\/www.perplexity.ai\/page\/how-memory-do-i-need-in-gpu-to-mwAOE4G6QJ69NWlAuJuWiQ\">which requires 84 GB of VRAM using 8-bit quantization<\/a>. 
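These figures follow a simple rule of thumb: weight memory is parameter count times bytes per weight, plus headroom for the KV cache and activations. As a rough sketch (the 20% overhead factor is an assumption for illustration, not an exact figure), you can estimate requirements like this:<\/p>

```python
def estimate_vram_gb(n_params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rule-of-thumb VRAM estimate: model weights plus ~20% headroom
    for KV cache and activations (the overhead factor is an assumption)."""
    weight_gb = n_params_billion * bits_per_weight / 8
    return weight_gb * overhead

# A 70B model at 8-bit quantization: 70 GB of weights * 1.2 ~ 84 GB
print(estimate_vram_gb(70, 8))
# The same model at 4-bit quantization needs roughly half as much
print(estimate_vram_gb(70, 4))
```

<p>With 96 GB of total VRAM in this build, the 8-bit 70B model fits with a little room to spare, while 4-bit quantizations leave headroom for longer contexts. 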
Here\u2019s the hardware I used:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Motherboard:<\/strong> ASUS PRIME H410M-E with two PCI Express slots.<\/li>\n\n\n\n<li><strong>Riser Cards:<\/strong> Two PCIe 1-to-4 expansion cards, each splitting one PCIe x1 slot into four.<\/li>\n\n\n\n<li><strong>Graphics Cards:<\/strong> Six NVIDIA GPUs: four RTX 3060s (12 GB of VRAM each) and two Tesla P40s (24 GB of VRAM each, with custom fans), totaling 96 GB of VRAM.<\/li>\n\n\n\n<li><strong>RAM:<\/strong> 32 GB.<\/li>\n\n\n\n<li><strong>CPU:<\/strong> 10th generation Intel Core i3.<\/li>\n\n\n\n<li><strong>Power Supplies:<\/strong> Three interconnected 750-watt power supplies.<\/li>\n\n\n\n<li><strong>Storage:<\/strong> A 2 TB NVMe drive.<\/li>\n\n\n\n<li><strong>GPU Interconnection:<\/strong> GPUs are connected to an external PCIe x1 slot via a USB cable in the riser card multiplier slot.<\/li>\n<\/ul>\n\n\n\n<p>The total cost was around $2,400. However, you can reduce costs by using an older motherboard with at least two PCIe slots and an older processor that <a href=\"https:\/\/github.com\/ollama\/ollama\/issues\/644\">supports AVX\/AVX2 instructions<\/a>, and a standard spinning hard disk can be used for storage instead of an NVMe drive.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-software-configuration-overview\">Software Configuration Overview<\/h2>\n\n\n\n<p>Running LLMs locally also requires a robust software configuration. For this setup, I chose Kubernetes over Docker Compose. 
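<\/p>

<p>In practice, what Kubernetes needs in order to schedule LLM workloads onto these cards is the NVIDIA device plugin (configured later in this guide), which exposes GPUs as a schedulable resource. As a minimal illustrative sketch (the pod name and command are hypothetical, not part of my deployment; the image tag matches the CUDA base used below), a workload requests a GPU like this:<\/p>

```yaml
# Illustrative only: a pod that asks the NVIDIA device plugin for one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: [nvidia-smi]   # prints the GPUs the container was granted
      resources:
        limits:
          nvidia.com/gpu: 1   # scheduled only onto nodes exposing GPUs
```

<p>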
Docker Compose is suitable for simpler environments, but Kubernetes offers advanced orchestration capabilities, such as dynamic scaling, automated deployment, and load balancing, that are key to managing complex on-premises workloads like LLMs. It also let me deepen my DevOps skills for handling GPU workloads.<\/p>\n\n\n\n<p>Here\u2019s a high-level overview of the software tools and steps involved in setting up the server:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Kubernetes:<\/strong> Manages the execution of OLLAMA models, providing scalability and flexibility.<\/li>\n\n\n\n<li><strong>OLLAMA:<\/strong> A versatile tool that dynamically uses multiple GPUs to load and execute models locally.<\/li>\n\n\n\n<li><strong>Open Web UI:<\/strong> A user-friendly web interface for managing OLLAMA models within a Kubernetes deployment. (A nice-to-have.)<\/li>\n<\/ol>\n\n\n\n<p>This high-level overview helps you understand the structure before we dive into each component in detail.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-detailed-steps\">Detailed Steps<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-prerequisites\">Prerequisites<\/h3>\n\n\n\n<p>First, prepare the host to run the environment:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.perplexity.ai\/page\/how-to-install-nvidia-support-rDA7xqpsTS6ZaudpqAHqNQ\">Enable GPU support on the Linux host<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.perplexity.ai\/page\/how-to-enable-gpu-support-in-d-VzHZl1aiTI20MoMeJ89urg\">Enable GPU support in Docker<\/a><\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-step-1-k3s-setup\">Step 1: k3s Setup<\/h3>\n\n\n\n<p><strong>Download the Dockerfile and Create the YAML Configuration<\/strong><\/p>\n\n\n\n<p>Download the Dockerfile and create the necessary YAML configuration for the NVIDIA device plugin.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Download the Dockerfile:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre 
class=\"wp-block-preformatted\">wget https:\/\/k3d.io\/v5.6.3\/usage\/advanced\/cuda\/Dockerfile<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Create the device-plugin-daemonset.yaml:<\/strong><\/li>\n<\/ul>\n\n\n\n<p>This YAML file configures the NVIDIA device plugin in Kubernetes.<br>\n<script src=\"https:\/\/gist.github.com\/caleb-kaiser\/eacdf97aa62680cad44a45f4b0e9d6ac.js\"><\/script><\/p>\n\n\n\n<p><strong>Build and Run the Docker Image<\/strong><\/p>\n\n\n\n<p>Next, build and run the custom Kubernetes image with GPU support.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Build the Docker Image:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">docker build . -t localhost\/rancher\/k3s:v1.28.8-k3s1-cuda-12.4.1-base-ubuntu22.04<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Run the Docker Container:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">docker run -d --name k3s-controlplane --gpus all -e K3S_KUBECONFIG_OUTPUT=\"\/output\/kubeconfig.yaml\" -e K3S_KUBECONFIG_MODE=\"666\" -v ${PWD}\/k3s:\/output --privileged -v \/usr\/lib\/x86_64-linux-gnu\/:\/usr\/local\/cuda\/lib64 --network host localhost\/rancher\/k3s:v1.28.8-k3s1-cuda-12.4.1-base-ubuntu22.04 server<\/pre>\n\n\n\n<p><strong>Test the Kubernetes Setup<\/strong><\/p>\n\n\n\n<p>After running the Docker container, export the <code>KUBECONFIG<\/code> environment variable and use <code>kubectl<\/code> to test the setup.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Export KUBECONFIG:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">export KUBECONFIG=${PWD}\/k3s\/kubeconfig.yaml<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Test with kubectl:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">kubectl get pods -A<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-step-2-deploy-open-web-ui\">Step 2: Deploy Open Web UI<\/h3>\n\n\n\n<p>With Kubernetes set up, you can deploy a customized version 
of Open Web UI to manage OLLAMA models. I deployed OLLAMA via Open Web UI to serve as a multipurpose LLM server for convenience, though this step is not strictly necessary \u2014 <a href=\"https:\/\/sarinsuriyakoon.medium.com\/deploy-ollama-on-local-kubernetes-microk8s-6ca22bfb7fa3\">you can run OLLAMA directly if preferred<\/a>. My customized version is based on the Open Web UI codebase from before a major project restructuring; any help updating my branch to the latest main is welcome.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Download Open Web UI:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\"># Clone the customized fork\ngit clone https:\/\/github.com\/fabceolin\/open-webui\ncd open-webui<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Deploy Open Web UI:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">helm upgrade --install -f values.yaml open-webui kubernetes\/helm\/<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Start Port Forwarding from Kubernetes to localhost:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">nohup kubectl port-forward svc\/ollama 3000:80 &amp;<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-step-3-install-and-test-ollama-locally-to-download-the-models\">Step 3: Install and Test OLLAMA Locally to Download the Models<\/h3>\n\n\n\n<p>The final step is to install OLLAMA locally and test it with your configured models.<\/p>\n\n\n\n<p><strong>Install OLLAMA<\/strong><\/p>\n\n\n\n<p>Use Homebrew to install OLLAMA, then download and configure your LLM model.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Install OLLAMA with brew:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">brew install ollama\nexport OLLAMA_HOST=http:\/\/localhost:3000\n# This should list the models served at localhost:3000\nollama list<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Pull the Model:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre 
class=\"wp-block-preformatted\">ollama pull llama3.1:70b-instruct-q8_0<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Configure the Model:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">cat &lt;&lt;EOF &gt;Modelfile\nFROM llama3.1:70b-instruct-q8_0\nPARAMETER temperature 0.1\nPARAMETER stop Result\nSYSTEM \"\"\"\"\"\"\nEOF<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Create the Local Model:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">ollama create -f Modelfile fabceolin\/llama3.1:70b-instruct-q8_0<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Test the model:<\/strong><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">curl http:\/\/localhost:3000\/api\/generate -d '{ \"model\": \"fabceolin\/llama3.1:70b-instruct-q8_0\", \"prompt\": \"You`re a kindergarten teacher, and you need to answer a child`s question: Why is the sky blue?\", \"stream\": false }'\n\n{\"model\":\"fabceolin\/llama3.1:70b-instruct-q4_0\",\"created_at\":\"2024-08-17T14:56:09.4438738Z\",\"response\":\"What a great question!\\n\\nYou know how we can see lots of different colors around us, like the green grass and the yellow sunflowers?\\n\\nWell, when sunlight comes from the sun, it's actually made up of all those different colors, like a big ol' rainbow!\\n\\nBut here's the magic part: when that sunlight travels through the air in our atmosphere, it starts to scatter. That means it bounces around all over the place.\\n\\nAnd guess what? The blue light scatters more than any other color! It's like the blue light is playing a game of tag with the air molecules, bouncing off them and flying every which way.\\n\\nSo, when we look up at the sky, we see mostly the blue light because it's scattered in all directions. That's why the sky looks blue to us!\\n\\nIsn't that cool?\\n\\n(And don't worry if you didn't understand everything \u2013 it's a pretty big concept for little minds! 
But I hope this helps you imagine how amazing and magical our world is!)\",\"done\":true,\"done_reason\":\"stop\",\"context\":[128006,882,128007,271,2675,63,265,264,68223,11326,11,323,499,1205,311,4320,264,1716,40929,3488,25,8595,374,279,13180,6437,30,128009,128006,78191,128007,271,3923,264,2294,3488,2268,2675,1440,1268,584,649,1518,10283,315,2204,8146,2212,603,11,1093,279,6307,16763,323,279,14071,7160,89770,1980,11649,11,994,40120,4131,505,279,7160,11,433,596,3604,1903,709,315,682,1884,2204,8146,11,1093,264,2466,8492,6,48713,2268,4071,1618,596,279,11204,961,25,994,430,40120,35292,1555,279,3805,304,1057,16975,11,433,8638,311,45577,13,3011,3445,433,293,31044,2212,682,927,279,2035,382,3112,8101,1148,30,578,6437,3177,1156,10385,810,1109,904,1023,1933,0,1102,596,1093,279,6437,3177,374,5737,264,1847,315,4877,449,279,3805,35715,11,65128,1022,1124,323,16706,1475,902,1648,382,4516,11,994,584,1427,709,520,279,13180,11,584,1518,10213,279,6437,3177,1606,433,596,38067,304,682,18445,13,3011,596,3249,279,13180,5992,6437,311,603,2268,89041,956,430,7155,1980,7,3112,1541,956,11196,422,499,3287,956,3619,4395,1389,433,596,264,5128,2466,7434,369,2697,20663,0,2030,358,3987,420,8779,499,13085,1268,8056,323,24632,1057,1917,374,16715],\"total_duration\":30203319345,\"load_duration\":53928635,\"prompt_eval_count\":32,\"prompt_eval_duration\":186546000,\"eval_count\":207,\"eval_duration\":29918639000}curl https:\/\/localhost:3000\/api\/chat -d '{ \"model\": \"fabceolin\/llama3.1:70b-instruct-q8_0\", \"prompt\": \"Why is the sky blue?\" }'\n<\/pre>\n\n\n\n<p><strong>Testing OLLAMA with a Crew.ai Agent<\/strong><\/p>\n\n\n\n<p>To demonstrate the capability of this setup, here\u2019s an example of running a simple Crew.ai agent:<br>\n<script src=\"https:\/\/gist.github.com\/caleb-kaiser\/09b96975d8b745aa35d5a29fdc2a1364.js\"><\/script><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">python3 -m venv env\nsource env\/bin\/activate\npython -m pip install -r requirements.txt\npython 
example-crew.ai.py\n[DEBUG]: == Working Agent: Math Professor\n[INFO]: == Starting Task: what is 3 + 5\n\n\n&gt; Entering new CrewAgentExecutor chain...\nTo solve this problem, I will simply add 3 and 5 together using basic arithmetic operations.\n\nI will take the numbers 3 and 5 and combine them by counting up from 3, adding 5 units to get the total sum.\n\nThis is a simple addition operation that follows the rules of basic mathematics.\n\n\nFinal Answer: The final answer is 8\n\n&gt; Finished chain.\n[DEBUG]: == [Math Professor] Task output: The final answer is 8\n\n\nThe final answer is 8<\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-10193 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"801\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/1_QUKO5t9yLMh8nBOy-TfJzQ.png\" alt=\"nvtop output with GPU charts\" class=\"wp-image-10193\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/1_QUKO5t9yLMh8nBOy-TfJzQ.png 1200w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/1_QUKO5t9yLMh8nBOy-TfJzQ-300x200.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/1_QUKO5t9yLMh8nBOy-TfJzQ-1024x684.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/1_QUKO5t9yLMh8nBOy-TfJzQ-768x513.png 768w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><figcaption class=\"wp-element-caption\">nvtop output with GPU charts<\/figcaption><\/figure>\n\n\n\n<p>With <a href=\"https:\/\/github.com\/TheR1D\/shell_gpt\">shell-gpt<\/a>, you can generate git commit messages.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Please use my forked LiteLLM branch instead of the main project - PR link: https:\/\/github.com\/BerriAI\/litellm\/pull\/5148\npip install git+https:\/\/github.com\/fabceolin\/litellm.git --upgrade\n\n# Add this configuration to sgpt\ncat &gt;~\/.config\/shell_gpt\/.sgptrc 
&lt;&lt;EOF\nUSE_LITELLM=true\nAPI_BASE_URL=http:\/\/localhost:3000\nDEFAULT_MODEL=ollama_chat\/fabceolin\/llama3.1:70b-instruct-q8_0\nEOF\n\ncurl -o ~\/.config\/shell_gpt\/roles\/CommitMessageGenerator.json https:\/\/raw.githubusercontent.com\/fabceolin\/dotfiles\/master\/sgpt\/roles\/CommitMessageGenerator.json\n# Enter in a directory that you want to generate the diff commit\nsgpt --role CommitMessageGenerator \"$(git diff)\"\n\"feat(ollama_chat.py): add 'follow_redirects' parameter to request configuration for improved handling of redirects in ollama_completion_stream function\"<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-next-steps\">Next Steps<\/h2>\n\n\n\n<p>In the next phase of this project, I plan to integrate ngrok to enable secure remote access to the local LLM server. Ngrok will securely expose the local API over the internet, making it accessible from anywhere while maintaining a strong security posture. This feature is particularly valuable for accessing the server remotely or collaborating with others across different locations.<\/p>\n\n\n\n<p>Additionally, I\u2019ll be running agents continuously and need a reliable tool to monitor their activity. Comet is launching a new product with a user-friendly interface designed specifically for this purpose, and I plan to integrate it to streamline the monitoring process.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\">Conclusion<\/h2>\n\n\n\n<p>Building a local LLM server capable of running models with 70 billion parameters might seem daunting, but it becomes achievable and cost-effective with the proper hardware and software. By repurposing Ethereum mining hardware and leveraging tools like OLLAMA and Kubernetes, you can create a robust, scalable environment for developing and deploying advanced language models right from your own setup.<\/p>\n\n\n\n<p>This project highlights not only the technical feasibility but also the practical benefits of maintaining a local LLM server. 
From cost savings to greater control over your infrastructure, the advantages are clear, especially for developers and researchers looking to push the boundaries of AI without relying on expensive cloud solutions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-references\">References<\/h2>\n\n\n\n<p>1.<a href=\"https:\/\/medium.com\/@ttio2tech_28094\/local-large-language-models-hardware-benchmarking-ollama-benchmarks-cpu-gpu-macbooks-c696abbec613\"> Local Large Language Models Hardware Benchmarking: Ollama Benchmarks (CPU, GPU, MacBooks)<\/a><br>\n2. <a href=\"https:\/\/www.perplexity.ai\/page\/how-memory-do-i-need-in-gpu-to-mwAOE4G6QJ69NWlAuJuWiQ\">How much memory do I need in GPU to run Ollama?<\/a><br>\n3. <a href=\"https:\/\/github.com\/ollama\/ollama\/issues\/644\">Ollama Issue #644: GPU Support<\/a><br>\n4. <a href=\"https:\/\/www.perplexity.ai\/page\/how-to-install-nvidia-support-rDA7xqpsTS6ZaudpqAHqNQ\">How to Install NVIDIA Support for Ollama<\/a><br>\n5. <a href=\"https:\/\/www.perplexity.ai\/page\/how-to-enable-gpu-support-in-d-VzHZl1aiTI20MoMeJ89urg\">How to Enable GPU Support in Docker for Ollama<\/a><br>\n6. <a href=\"https:\/\/github.com\/fabceolin\/open-webui\">Open-WebUI: A Web UI for Open-Source LLMs<\/a><br>\n7.<a href=\"https:\/\/github.com\/fabceolin\/litellm\"> LiteLLM: A Lightweight Framework for LLMs<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A guest post from Fabr\u00edcio Ceolin, DevOps Engineer at Comet. Inspired by the growing demand for large-scale language models, Fabr\u00edcio engineered a cost-effective local LLM server capable of running models with up to 70 billion parameters. 
In this guide, you&#8217;ll explore how to build a powerful and scalable local LLM environment, enabling you to harness [&hellip;]<\/p>\n","protected":false},"author":137,"featured_media":17104,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[8,65,6,7],"tags":[78,52,31],"coauthors":[225],"class_list":["post-10174","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-llmops","category-machine-learning","category-tutorials","tag-kubernetes","tag-llm","tag-llmops"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Tutorial: Build a Low-Cost Local LLM Server to Run 70B Models<\/title>\n<meta name=\"description\" content=\"Learn how to repurpose crypto-mining hardware and other low-cost components to build a home server capable of running 70B models.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Building a Low-Cost Local LLM Server to Run 70 Billion Parameter Models\" \/>\n<meta property=\"og:description\" content=\"Learn how to repurpose crypto-mining hardware and other low-cost components to build a home server capable of running 70B models.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" 
content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-08-30T19:13:31+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-18T10:51:45+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1.jpeg\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"585\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Fabr\u00edcio Ceolin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Fabr\u00edcio Ceolin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Tutorial: Build a Low-Cost Local LLM Server to Run 70B Models","description":"Learn how to repurpose crypto-mining hardware and other low-cost components to build a home server capable of running 70B models.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/","og_locale":"en_US","og_type":"article","og_title":"Building a Low-Cost Local LLM Server to Run 70 Billion Parameter Models","og_description":"Learn how to repurpose crypto-mining hardware and other low-cost components to build a home server capable of running 70B models.","og_url":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-08-30T19:13:31+00:00","article_modified_time":"2025-06-18T10:51:45+00:00","og_image":[{"width":1024,"height":585,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1.jpeg","type":"image\/jpeg"}],"author":"Fabr\u00edcio Ceolin","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Fabr\u00edcio Ceolin","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/"},"author":{"name":"Fabr\u00edcio Ceolin","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/82aca13e0ee4258eb0f0c0faf4cb0e22"},"headline":"Building a Low-Cost Local LLM Server to Run 70 Billion Parameter Models","datePublished":"2024-08-30T19:13:31+00:00","dateModified":"2025-06-18T10:51:45+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/"},"wordCount":1276,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1.jpeg","keywords":["Kubernetes","LLM","LLMOps"],"articleSection":["Comet Community Hub","LLMOps","Machine Learning","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/","url":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/","name":"Tutorial: Build a Low-Cost Local LLM Server to Run 70B Models","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1.jpeg","datePublished":"2024-08-30T19:13:31+00:00","dateModified":"2025-06-18T10:51:45+00:00","description":"Learn how to repurpose crypto-mining hardware and other low-cost components to build a home server capable of running 70B 
models.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1.jpeg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/colorful-image-of-computer-hardware-1024x585-1.jpeg","width":1024,"height":585},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/build-local-llm-server\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Building a Low-Cost Local LLM Server to Run 70 Billion Parameter Models"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/82aca13e0ee4258eb0f0c0faf4cb0e22","name":"Fabr\u00edcio Ceolin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/b2a0c093d93e9097b04a383123d39f98","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/cropped-fabricio-headshot-96x96.jpeg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/08\/cropped-fabricio-headshot-96x96.jpeg","caption":"Fabr\u00edcio Ceolin"},"description":"Fabr\u00edcio Ceolin serves as a Senior DevOps Engineer at Comet, with over two decades of experience in the software industry. 
Passionate about AI, he brings an extensive background in cloud computing, CI\/CD pipelines, and automation, along with deep expertise in Linux, Kubernetes, and Python, which allows him to explore innovative technology initiatives.","url":"https:\/\/www.comet.com\/site\/blog\/author\/fabricioceolin\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/10174","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/137"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=10174"}],"version-history":[{"count":3,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/10174\/revisions"}],"predecessor-version":[{"id":17105,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/10174\/revisions\/17105"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/17104"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=10174"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=10174"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=10174"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=10174"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}