# How to Run Local LLMs with Claude Code (Unsloth)

unsloth.ai · armcat · 19 hours ago
This step-by-step guide shows you how to connect open LLMs and APIs to Claude Code entirely locally, complete with screenshots. Run it with any open model, such as Qwen3.5, DeepSeek, or Gemma. For this tutorial, we'll use Qwen3.5 and GLM-4.7-Flash. Both are among the strongest ~35B MoE agentic and coding models as of Mar 2026 (and work great on a 24GB RAM/unified-memory device), and we'll use them to autonomously fine-tune an LLM with Unsloth. You can swap in any other model; just update the model names in your scripts.
For model quants, we will use Unsloth Dynamic GGUFs to run any LLM quantized while retaining as much accuracy as possible.

**Note:** Claude Code has changed quite a lot since Jan 2026. There are many more settings and necessary features you will need to toggle.

## 📖 LLM Setup Tutorials

Before we begin, we first need to complete setup for the specific model you're going to use. We use llama.cpp, an open-source framework for running LLMs on Mac, Linux, Windows, and other devices. llama.cpp includes llama-server, which allows you to serve and deploy LLMs efficiently. The model will be served on port 8001, with all agent tools routed through a single OpenAI-compatible endpoint.

## Qwen3.5 Tutorial

We'll be using Qwen3.5-35B-A3B with specific settings for fast, accurate coding tasks.

**Note:** Use Qwen3.5-27B if you want a smarter model or if you don't have enough VRAM; it will be ~2x slower than 35B-A3B, however. You can also use other Qwen3.5 variants like 9B, 4B, or 2B, or Qwen3-Coder-Next, which is fantastic if you have enough VRAM.

### 1. Install llama.cpp

We need to install llama.cpp to deploy and serve local LLMs for use in Claude Code. We follow the official build instructions for correct GPU bindings and maximum performance. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and continue as usual; Metal support is on by default.
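Before running the build, you can check whether an NVIDIA driver is visible to decide which -DGGML_CUDA value to pass. This is a small heuristic helper of my own, not part of the official build instructions:

```python
import shutil

def ggml_cuda_flag() -> str:
    """Heuristic: enable the CUDA backend only if nvidia-smi is on PATH."""
    return "ON" if shutil.which("nvidia-smi") else "OFF"

# Assemble the cmake configure command with the detected flag.
cmake_cmd = (
    "cmake llama.cpp -B llama.cpp/build "
    f"-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA={ggml_cuda_flag()}"
)
print(cmake_cmd)
```

On Apple Metal devices this will correctly report OFF, since Metal support is enabled by default without a flag.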
```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git-all -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

### 2. Download and use models locally

Download the model with the huggingface_hub CLI (after installing it via `pip install huggingface_hub hf_transfer`). We use the UD-Q4_K_XL quant for the best size/accuracy balance. You can find all Unsloth GGUF uploads in our Collection. If downloads get stuck, see Hugging Face Hub XET debugging.

```bash
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
    --local-dir unsloth/Qwen3.5-35B-A3B-GGUF \
    --include "*UD-Q4_K_XL*"  # Use "*UD-Q2_K_XL*" for Dynamic 2-bit
```

**Note:** We used unsloth/Qwen3.5-35B-A3B-GGUF, but you can use another variant like 27B, or any other model like unsloth/Qwen3-Coder-Next-GGUF.

### 3. Start the llama-server

To deploy Qwen3.5 for agentic workloads, we use llama-server. We apply Qwen's recommended sampling parameters for thinking mode: temp 0.6, top_p 0.95, top_k 20. Keep in mind these numbers change if you use non-thinking mode or other tasks. Run this command in a new terminal (use tmux or open a new terminal). The command below should fit perfectly in a 24GB GPU (RTX 4090), using about 23GB. --fit on will also auto-offload, but if you see bad performance, reduce --ctx-size.

**Warning:** We used --cache-type-k q8_0 --cache-type-v q8_0 for KV-cache quantization to reduce VRAM usage. For full precision, use --cache-type-k bf16 --cache-type-v bf16. According to multiple reports, Qwen3.5 degrades in accuracy with an f16 KV cache, so do not use --cache-type-k f16 --cache-type-v f16, which is the default in llama.cpp. Note that a bf16 KV cache might be slightly slower on some machines.
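As a rough sanity check on why q8_0 roughly halves KV-cache VRAM relative to bf16, here is a back-of-envelope calculation. The layer count and KV-head dimensions below are illustrative assumptions, not the actual model config:

```python
# Rough KV-cache sizing:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx_len * bytes_per_element
# Illustrative model shape (assumed, NOT the real Qwen3.5-35B-A3B config):
LAYERS, KV_HEADS, HEAD_DIM = 32, 4, 128
CTX = 131072  # the --ctx-size used in this guide

def kv_cache_gib(bytes_per_element: float) -> float:
    total = 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * bytes_per_element
    return total / 2**30

bf16 = kv_cache_gib(2.0)  # bf16: 2 bytes per element
q8 = kv_cache_gib(1.0)    # q8_0: ~1 byte per element (ignoring block scales)
print(f"bf16 KV cache: {bf16:.1f} GiB, q8_0: {q8:.1f} GiB")
```

The ratio is what matters here: whatever the true model shape, switching from bf16 (2 bytes/element) to q8_0 (~1 byte/element) cuts the KV-cache footprint roughly in half at a given context size.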
```bash
./llama.cpp/llama-server \
    --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    --alias "unsloth/Qwen3.5-35B-A3B" \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001 \
    --kv-unified \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on --fit on \
    --ctx-size 131072  # change as required
```

**Tip:** You can also disable thinking for Qwen3.5, which can improve performance for agentic coding. To disable thinking with llama.cpp, add this to the llama-server command: `--chat-template-kwargs "{\"enable_thinking\": false}"`

## GLM-4.7-Flash Tutorial

### 1. Install llama.cpp

This step is identical to step 1 of the Qwen3.5 tutorial above: build llama.cpp with the same commands (change -DGGML_CUDA=ON to -DGGML_CUDA=OFF for CPU-only inference or Apple Metal devices; Metal support is on by default).

### 2. Download and use models locally

Download the model via huggingface_hub in Python (after installing it via `pip install huggingface_hub hf_transfer`). We use the UD-Q4_K_XL quant for the best size/accuracy balance. You can find all Unsloth GGUF uploads in our Collection. If downloads get stuck, see Hugging Face Hub XET debugging.

**Note:** We used unsloth/GLM-4.7-Flash-GGUF, but you can use anything like unsloth/Qwen3-Coder-Next-GGUF; see the Qwen3-Coder-Next guide.

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-4.7-Flash-GGUF",
    local_dir="unsloth/GLM-4.7-Flash-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)
```

### 3. Start the llama-server

To deploy GLM-4.7-Flash for agentic workloads, we use llama-server. We apply Z.ai's recommended sampling parameters (temp 1.0, top_p 0.95). Run this command in a new terminal (use tmux or open a new terminal). The command below should fit perfectly in a 24GB GPU (RTX 4090), using about 23GB. --fit on will also auto-offload, but if you see bad performance, reduce --ctx-size.

**Warning:** We used --cache-type-k q8_0 --cache-type-v q8_0 for KV-cache quantization to reduce VRAM usage. If you see reduced quality, you can instead use bf16, but it will double KV-cache VRAM use: --cache-type-k bf16 --cache-type-v bf16

```bash
./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --alias "unsloth/GLM-4.7-Flash" \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --port 8001 \
    --kv-unified \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on --fit on \
    --batch-size 4096 --ubatch-size 1024 \
    --ctx-size 131072  # change as required
```

**Tip:** You can also disable thinking for GLM-4.7-Flash, which can improve performance for agentic coding. To disable thinking with llama.cpp, add this to the llama-server command: `--chat-template-kwargs "{\"enable_thinking\": false}"`

## Claude Code Tutorial

**Warning:** After installing Claude Code, see "Fixing 90% slower inference in Claude Code" below to fix open models being 90% slower due to KV-cache invalidation.

Once you are done setting up your local LLM, it's time to set up Claude Code. Claude Code is Anthropic's agentic coding tool that lives in your terminal, understands your codebase, and handles complex Git workflows via natural language.
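With llama-server running, you can sanity-check the OpenAI-compatible endpoint directly before wiring up Claude Code. This is an illustrative sketch of my own (the helper names and test prompt are not from this guide); it assumes the server from the steps above is listening on port 8001 and reuses the GLM-4.7-Flash sampling settings:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """OpenAI-style chat payload using the sampling settings
    recommended above for GLM-4.7-Flash (temp 1.0, top_p 0.95)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 128,
    }

def send(payload: dict,
         url: str = "http://localhost:8001/v1/chat/completions") -> str:
    """POST the payload to the local llama-server; requires the server
    from the previous step to be running."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

payload = build_chat_request("unsloth/GLM-4.7-Flash", "Say hello in one word.")
# With the server running, call: print(send(payload))
```

If this round-trips successfully, Claude Code should be able to reach the same server once ANTHROPIC_BASE_URL points at it.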
### Install Claude Code and run it locally

**Mac / Linux setup**

```bash
curl -fsSL https://claude.ai/install.sh | bash
# Or via Homebrew:
brew install --cask claude-code
```

**Configure:** Set the ANTHROPIC_BASE_URL environment variable to redirect Claude Code to your local llama.cpp server. You might also need to set ANTHROPIC_API_KEY, depending on the server. For example:

```bash
export ANTHROPIC_BASE_URL="http://localhost:8001"
```

**Session vs. persistent:** The command above applies to the current terminal only. To persist across new terminals, add the export line to ~/.bashrc (bash) or ~/.zshrc (zsh).

**Warning:** If you see `Unable to connect to API (ConnectionRefused)`, remember to unset ANTHROPIC_BASE_URL via `unset ANTHROPIC_BASE_URL`. If you see a missing API key error, set:

```bash
export ANTHROPIC_API_KEY='sk-no-key-required' ## or 'sk-1234'
```

**Note:** If Claude Code still asks you to sign in on first run, add `"hasCompletedOnboarding": true` and `"primaryApiKey": "sk-dummy-key"` to ~/.claude.json. For the VS Code extension, also enable Disable Login Prompt in settings (or add `"claudeCode.disableLoginPrompt": true` to settings.json).

**Windows setup**

Use PowerShell for all commands below:

```powershell
irm https://claude.ai/install.ps1 | iex
```

**Configure:** Set the ANTHROPIC_BASE_URL environment variable to redirect Claude Code to your local llama.cpp server. You must also set `$env:CLAUDE_CODE_ATTRIBUTION_HEADER=0`; see the warning below.

```powershell
$env:ANTHROPIC_BASE_URL="http://localhost:8001"
```

**Warning:** Claude Code recently prepends a Claude Code Attribution header, which invalidates the KV cache. See this LocalLlama discussion. To solve this, run `$env:CLAUDE_CODE_ATTRIBUTION_HEADER=0` or edit ~/.claude/settings.json with:

```json
{
    ...
    "env": {
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
        ...
    }
}
```

**Session vs. persistent:** The commands above apply to the current terminal only. To persist across new terminals, run `setx ANTHROPIC_BASE_URL "http://localhost:8001"` once, or add the `$env:` line to your `$PROFILE`. The sign-in note from the Mac / Linux setup applies on Windows as well.

### 🕵️ Fixing 90% slower inference in Claude Code

**Warning:** Claude Code recently prepends a Claude Code Attribution header, which invalidates the KV cache, making inference 90% slower with local models. See this LocalLlama discussion. To fix this, edit ~/.claude/settings.json to set CLAUDE_CODE_ATTRIBUTION_HEADER to "0" inside the "env" section. Using `export CLAUDE_CODE_ATTRIBUTION_HEADER=0` DOES NOT work!

For example, run `cat > ~/.claude/settings.json`, then paste the below (after pasting, press ENTER then CTRL+D to save). If you already have a ~/.claude/settings.json file, just add `"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"` to the "env" section and leave the rest of the settings file unchanged.

```json
{
    "promptSuggestionEnabled": false,
    "env": {
        "CLAUDE_CODE_ENABLE_TELEMETRY": "0",
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
    },
    "attribution": { "commit": "", "pr": "" },
    "plansDirectory": "./plans",
    "prefersReducedMotion": true,
    "terminalProgressBarEnabled": false,
    "effortLevel": "high"
}
```

### 🌟 Running Claude Code locally on Linux / Mac / Windows

**Note:** We used unsloth/GLM-4.7-Flash-GGUF below, but you can use anything like unsloth/Qwen3.5-35B-A3B-GGUF.

**Warning:** See "Fixing 90% slower inference in Claude Code" above first, to fix open models being 90% slower due to KV-cache invalidation.

Navigate to your project folder (`mkdir project; cd project`) and run:

```bash
claude --model unsloth/GLM-4.7-Flash
```

To use Qwen3.5-35B-A3B, simply change it to:

```bash
claude --model unsloth/Qwen3.5-35B-A3B
```

To set Claude Code to execute commands without any approvals, run the following (BEWARE: this will let Claude Code write and execute code however it likes, without any approvals!):

```bash
claude --model unsloth/GLM-4.7-Flash --dangerously-skip-permissions
```

Try this prompt to install and run a simple Unsloth finetune:

```text
You can only work in the cwd project/. Do not search for CLAUDE.md - this is it.
Install Unsloth via a virtual environment via uv. Use `python -m venv unsloth_env`
then `source unsloth_env/bin/activate` if possible. See
https://unsloth.ai/docs/get-started/install/pip-install on how (get it and read).
Then do a simple Unsloth finetuning run described in
https://github.com/unslothai/unsloth. You have access to 1 GPU.
```

After waiting a bit, Unsloth will be installed in a venv via uv and loaded, and finally you will see a successfully finetuned model with Unsloth!

**IDE extension (VS Code / Cursor):** You can also use Claude Code directly inside your editor via the official extension (see the installation pages for VS Code and Cursor, and the Claude Code in VS Code docs). Alternatively, press Ctrl+Shift+X (Windows/Linux) or Cmd+Shift+X (Mac), search for Claude Code, and click Install.

Last updated 4 days ago.
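If you already maintain a ~/.claude/settings.json, the attribution-header fix can also be merged in programmatically rather than by hand. A small sketch, assuming the settings layout shown in this guide (the helper function is my own, not an official tool):

```python
import json
import tempfile
from pathlib import Path

def disable_attribution_header(settings_path: Path) -> dict:
    """Merge CLAUDE_CODE_ATTRIBUTION_HEADER=0 into the "env" section,
    leaving every other setting in the file untouched."""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    settings.setdefault("env", {})["CLAUDE_CODE_ATTRIBUTION_HEADER"] = "0"
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings

# Demo on a temporary file; point this at Path.home() / ".claude/settings.json"
# for real use.
demo = Path(tempfile.mkdtemp()) / "settings.json"
demo.write_text(json.dumps({"env": {"CLAUDE_CODE_ENABLE_TELEMETRY": "0"}}))
merged = disable_attribution_header(demo)
print(merged["env"])
```

This avoids the failure mode where pasting a fresh file via `cat > ~/.claude/settings.json` silently discards your existing settings.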