
LocalLLaMA


Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members. I.e., no name-calling, no generalizing about entire groups of people who make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency. I.e., no comparing the usefulness of models to that of NFTs, no claiming that the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, and no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms. I.e., no statements such as "LLMs are basically just simple text prediction like what your phone keyboard's autocorrect uses, and they're still using the same algorithms as <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.


I'm limited to 24GB of VRAM, and I need pretty large context for my use case (20k+ tokens). I tried "Qwen3-14B-GGUF:Q6_K_XL," but it doesn't seem to like calling tools more than a couple of times, no matter how I prompt it.

I tried "SuperThoughts-CoT-14B-16k-o1-QwQ-i1-GGUF:Q6_K" and "DeepSeek-R1-Distill-Qwen-14B-GGUF:Q6_K_L," but Ollama or LangGraph gives me an error saying these models don't support tool calling.
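For reference, my setup looks roughly like this (the tool and the exact model settings are just illustrative):

```python
# Rough sketch of the LangGraph/Ollama setup (tool and settings illustrative).
from langchain_core.tools import tool
from langchain_ollama import ChatOllama

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    return f"It is sunny in {city}."

llm = ChatOllama(model="qwen3:14b", num_ctx=20480)
llm_with_tools = llm.bind_tools([get_weather])

# After a couple of turns like this, the model stops emitting tool calls.
response = llm_with_tools.invoke("What's the weather in Berlin?")
print(response.tool_calls)
```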

top 5 comments
[–] [email protected] 1 points 14 hours ago* (last edited 14 hours ago) (1 children)

Uhh... Tool calling is built into their tokenizers, but Ollama/LangChain just ignore them because they're spaghetti abstractions. To be blunt, LangChain and Ollama are overhyped, buggy junk trying to reinvent wheels.
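You can see this for yourself with transformers' chat-template support; a quick sketch (the model and the tool schema here are just examples):

```python
# Render a chat template with a tool attached to see the tool-calling
# scaffolding the tokenizer itself injects (model and tool are examples).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Weather in Berlin?"}],
    tools=[weather_tool],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # the tool schema and tool-call instructions are baked into the prompt itself
```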

For any kind of STEM work, I'd run Llama Nemotron 49B exl3 via TabbyAPI, which exposes a generic OpenAI-compatible endpoint anything can use:

https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1

https://huggingface.co/turboderp/Llama-3.3-Nemotron-Super-49B-v1-exl3/tree/3.0bpw

Nemotron models freaking rock at anything STEM-adjacent, and I can squeeze 48K+ context into 24GB of VRAM (depending on your cache quantization settings).
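Once TabbyAPI is up, anything that speaks the OpenAI API can hit it; a minimal sketch, assuming the default local port and whatever API key you configured:

```python
# Minimal sketch: query TabbyAPI's OpenAI-compatible endpoint.
# The URL, port, key, and model name all depend on your config.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="your-tabby-key")

resp = client.chat.completions.create(
    model="Llama-3.3-Nemotron-Super-49B-v1",  # whatever Tabby is serving
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```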

Otherwise, GLM-4 is very good at tool calling, as is Qwen3, and you can more comfortably run them as GGUFs if you don't want to leave the llama.cpp ecosystem, or as exl2s if you have specific trouble with exl3 in TabbyAPI.

[–] [email protected] 1 points 14 hours ago

Wow, this is some awesome information, Brucethemoose, thanks for sharing! I hope you don't mind if I ask some things.

I feel like a lot of people, including myself, only vaguely understand tool calling, how it's supposed to work, and what simple practice exercises there are for using it via scripts and APIs. What's a dead-simple Python script someone could cook up to tool call within the OpenAI-compatible API? (There's a rough sketch at the end of this comment.)

In your own words, what exactly is tool calling, and how does an absolute beginner tap into it? Could you clarify what you mean by "tool calling being built into their tokenizers"?

Would you mind sharing some sources where we can learn more? I'm sure Hugging Face has courses, but maybe you know some harder-to-find sources?

Is TabbyAPI an engine similar to Ollama, llama.cpp, etc.?

What are exl2, exl3, etc.?
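For anyone else curious about the "dead simple script" part, here's a rough sketch of manual tool calling against any OpenAI-compatible endpoint; the endpoint URL, model name, and tool are placeholders, not anything specific from this thread:

```python
# Manual tool calling against an OpenAI-compatible server (all names are
# placeholders): the model asks for a tool, we run it, and feed the result back.
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two numbers.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 2 + 3? Use the tool."}]
resp = client.chat.completions.create(model="local-model", messages=messages, tools=tools)

call = resp.choices[0].message.tool_calls[0]   # the model's requested tool call
args = json.loads(call.function.arguments)
result = args["a"] + args["b"]                 # we execute the tool ourselves

# Return the result so the model can produce a final plain-text answer.
messages.append(resp.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
final = client.chat.completions.create(model="local-model", messages=messages, tools=tools)
print(final.choices[0].message.content)
```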

[–] [email protected] 3 points 2 days ago (1 children)

Devstral was released recently, trained specifically with tool calling in mind. I haven't personally tried it out yet, but people say it works well with VSCode + Roo.

[–] [email protected] 2 points 2 days ago

Hmm, Devstral doesn't call any tools for me in the current stable Ollama version or the current release candidate. Wonder if it's a bug in Ollama or LangChain. I've since tried "QwQ-32B-GGUF:Q3_K_XL", and it's a little better than Qwen3-14B:Q6, but still not quite satisfactory, and it's much slower and "thinks" too much.

[–] [email protected] 1 points 2 days ago

I have not found a good solution for this yet within that amount of VRAM.