Clone the code
git clone https://github.com/ggml-org/llama.cpp.git
Follow the guide to run the command below to build the docker
docker build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .
Download models from huggingface, for rtx4070 12g, Q3 is a better option
curl.exe -L -o Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q3_K_P.gguf "https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive/resolve/main/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q3_K_P.gguf?download=true"
curl.exe -L -o mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf "https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive/resolve/main/mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf?download=true"
rtx4070 roughly can set 16k context size, run with command below (make sure your docker desktop is running first)
docker run --gpus all -p 8090:8080 -v F:\AI\llama.cpp\models:/models --entrypoint /app/llama-server local/llama.cpp:full-cuda -m /models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q3_K_P.gguf --mmproj /models/mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf -ngl 14 --ctx-size 16384 -c 16384 --parallel 1 -fa 1 --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --port 8080
token speed is around 15 t/s, my desktop env:
intel i13700k, ram ddr5 64gb, graphic rtx4070 12g, 1t ssd

