Ollama is now powered by MLX on Apple Silicon in preview (ollama.com)

286 points by redundantly 7 hours ago | 20 comments

▲franze2 hours ago

I created "apfel" (https://github.com/Arthur-Ficial/apfel), a CLI for the Apple on-device local foundation model (Apple Intelligence). Yeah, it's super limited with its 4k context window and super common false-positive guardrails (just ask it to describe a color) ... but still ... using it in bash scripts that just work without calling home / out or incurring extra costs feels super powerful.

▲JumpCrisscross9 minutes ago

…is it a reference to apfelwein?

▲LeoDaVibeci1 hour ago

Honestly I can't believe Apple put that foundation model product out the door. I was so excited about it, but when I tried it, it was such a disappointment. Glad to hear you calling that out so I know it wasn't just me.

Looks like they have pivoted completely over to Gemini, thank god.

▲franze53 minutes ago

yeah, it is super limited but also you can now do

  cmd() {
    local x c r a
    # Parse flags: -x offers to run the command, -c copies it to the clipboard.
    while [[ $1 == -* ]]; do
      case $1 in
        -x) x=1; shift ;;
        -c) c=1; shift ;;
        *) break ;;
      esac
    done
    # Ask apfel for a command; drop code fences, comment lines and blank lines; keep the first line.
    r=$(apfel -q -s 'Output only a shell command.' "$*" |
      sed '/^```/d;/^#/d;s/^[[:space:]]*//;/^$/d' | head -1)
    [[ $r ]] || { echo "no command generated"; return 1; }
    printf '\e[32m$\e[0m %s\n' "$r"
    [[ $c ]] && printf %s "$r" | pbcopy && echo "(copied)"
    [[ $x ]] && { printf 'Run? [y/N] '; read -r a; [[ $a == y ]] && eval "$r"; }
    return 0
  }

cmd find all swift files larger than 1MB

cmd -c show disk usage sorted by size

cmd -x what process is using port 3000

cmd list all git branches merged into main

cmd count lines of code by language

without calling home or downloading extra local models

and well, maybe one day they get their local models .... more powerful, "less afraid" and way more context window.

▲AbuAssar2 hours ago

nice project, thanks for sharing.

any plans for providing it through brew for easy installation?

▲woadwarrior0122 minutes ago

There's a very similar afm CLI that can be installed via Homebrew.

https://github.com/scouzi1966/maclocal-api

▲franze1 hour ago

good idea

▲babblingfish6 hours ago

LLMs on device are the future. It's more secure and solves the problem of too much demand for inference compared to data center supply; it also would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier model performance.

▲troad3 hours ago

I very recently installed llama.cpp on my consumer-grade M4 MBP, and I've been having loads of fun poking and prodding the local models. There's now a ChatGPT style interface baked into llama.cpp, which is very handy for quick experimentation. (I'm not entirely sure what Ollama would get me that llama.cpp doesn't, happy to hear suggestions!)

There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)

▲theshrike791 hour ago

Qwen3.5 has tool calling, so you can give it a wikipedia tool which it uses to know what happened in Tiananmen Square without issues =)

▲girvo1 hour ago

I'd recommend it too, because the knowledge cutoff of all the open weight Chinese models (M2.7, Qwen3.5, GLM-5 etc) is earlier than you'd think, so giving it web search (I use `ddgr` with a skill) helps a surprising amount

▲theshrike7947 minutes ago

Yep, having a "stupid" central model with multiple tools is IMO the key to efficient agentic systems.

It needs to be just smart enough to use the tools and distill the responses into something usable. And one of the tools can be "ask claude/codex/gemini" so the local model itself doesn't actually need to do much.

▲zozbot23437 minutes ago

> Yep, having a "stupid" central model with multiple tools is IMO the key to efficient agentic systems.

That doesn't fix the "you don't know what you don't know" problem which is huge with smaller models. A bigger model with more world knowledge really is a lot smarter in practice, though at a huge cost in efficiency.

▲theshrike7922 minutes ago

That's the key, it just needs to be smart enough to 1) know it doesn't know and 2) "know a guy" as they say =) (call a tool for the exact information)

Picking a model that's juuust smart enough to know it doesn't know is the key.

▲WesolyKubeczek17 minutes ago

Cool, I always wanted to invade Belgium. Maybe if my plan is good, I could run a successful gofundme?

▲whackernews3 hours ago

Oh does llama.cpp use MLX or whatever? I had this question, wonder if you know? A search suggests it doesn’t but I don’t really understand.

▲irusensei2 hours ago

>Oh does llama.cpp use MLX or whatever?

No. It runs on macOS but uses Metal instead of MLX.

▲zozbot2341 hour ago

ANE-powered inference (at least for prefill, which is a key bottleneck on pre-M5 platforms) is also in the works, per https://github.com/ggml-org/llama.cpp/issues/10453#issuecomm...

▲OkGoDoIt1 hour ago

Is that better or worse?

▲LoganDark2 hours ago

llama.cpp uses GGML which uses Metal directly.

▲melvinroest5 hours ago

I have journaled digitally for the last 5 years with this expectation.

Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I am asking or the entity extraction process itself, as graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.

It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.

I used MLX and my M1 64GB device. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
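To make the triplet idea concrete, here is a minimal Python sketch of how extracted (entity1, relationship_to, entity2) triplets could be stored and walked. It is not the commenter's actual app: the example triplets and the `related_facts` helper are made up for illustration, and a real graphRAG pipeline would add embeddings, ranking and an LLM on top.

```python
from collections import defaultdict

# Triplets as a small model might extract them from notes (made-up examples).
triplets = [
    ("meditation", "improves", "focus"),
    ("focus", "helps_with", "deep work"),
    ("deep work", "requires", "rest"),
]

# Adjacency list: entity1 -> [(relationship_to, entity2), ...]
graph = defaultdict(list)
for entity1, relationship_to, entity2 in triplets:
    graph[entity1].append((relationship_to, entity2))


def related_facts(entity, depth=2):
    """Collect facts reachable from `entity` within `depth` hops (breadth-first)."""
    facts, frontier, seen = [], [entity], {entity}
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for relation, target in graph.get(node, []):
                facts.append((node, relation, target))
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return facts


# Facts within two hops of "meditation"; these would be handed to the larger model as context.
facts = related_facts("meditation")
```

The retrieved facts are what the answering model would get as grounding, which is also why false positives bubble up: anything within reach of a matched entity comes along.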

▲nkzd4 hours ago

Did you get any insights about yourself from this process? I am thinking of doing the same

▲melvinroest1 hour ago

TL;DR: you don't need to do any treasure hunt on your notes by just typing stuff into the search bar. Having your own graphRAG system + LLM on your notes is basically a "Google" but then on your own notes. Any question you have: if you have a note for it, it will bubble up. The annoying thing is that false positives will also bubble up.

----

Full reaction:

Yes but perhaps not in a way you might expect. Qwen's reasoning ability isn't exactly groundbreaking. But it's good enough to weave a story, provided it has some solid facts or notes. GraphRAG is definitely a good way to get some good facts, provided your notes are valuable to you and/or contain some good facts.

So the added value is that you now have a super charged information retrieval system on your notes with an LLM that can stitch loose facts reasonably well together, like a librarian would. It's also very easy to see hallucinations, if you recognize your own writing well, which I do.

The second thing is that I have a hard time rereading all my notes. I write a lot of notes, and don't have the time to reread any of them. So oftentimes I forget my own advice. Now that I have a super charged information retrieval system on my notes, whenever I ask a question: the graphRAG + LLM search for the most relevant notes related to my question. I've found that 20% of what I wrote is incredibly useful and is stuff that I forgot.

And there are nuggets of wisdom in there that are quite nuanced. For me specifically, I've seen insights in how I relate to work that I should do more with. I'll probably forget most things again but I can reuse my system and at some point I'll remember what I actually need to remember. For example, one thing I read was that work doesn't feel like work for me if I get to dive in, zoom out, dive in, zoom out. Because in the way I work as a person: that means I'm always resting and always have energy for the task that I'm doing. Another thing that it got me to do was to reboot a small meditation practice by using implementation intentions (e.g. "if I wake up then I meditate for at least a brief amount of time").

What also helps is to have a bit of a back and forth with your notes and then copy/paste the whole conversation in Claude to see if Claude has anything in its training data that might give some extra insight. It could also be that it just helps with firing off 10 search queries and finds a blog post that is useful to the conversation that you've had with your local LLM.

▲dwayne_dibley12 minutes ago

This might be how Apple will start to see even more sales, the M series processors are so far ahead of anything else, local LLMs could be their main selling point.

▲AugSun5 hours ago

"Most users don't need frontier model performance" unfortunately, this is not the case.

▲selcuka4 hours ago

Any citations? Because that was my impression, too. I want frontier model performance for my coding assistant, but "most users" could do with smaller/faster models.

ChatGPT free falls back to GPT-5.2 Mini after a few interactions.

▲lxgr3 hours ago

Have you used GPT instant or mini yourself? I think it’s pretty cynical to assume that this is “good enough for most people”, even if they don’t know the difference between that and better models.

▲throwaway274481 hour ago

Say more. Why do you think this?

▲asutekku4 hours ago

Frontier models have much better knowledge and they usually hallucinate less. It's not about the coding capabilities, it's about how much you can trust the model.

▲Barbing3 hours ago

re: trust-

Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.

Is the average person just talking to it about their day or something?

▲theshrike791 hour ago

Even the paid version of ChatGPT tends to use 1000 words when 10 will do.

You can try asking it the same question as Claude and compare the answers. I can guarantee you that the ChatGPT answer won't fit on a single screen on a 32" 4k monitor.

Claude's will.

▲throwaway274481 hour ago

If someone blindly submits chatbot output they deserve to be embarrassed and fired. But I don't think that's going to improve.

▲jychang2 hours ago

The free version of ChatGPT is insanely crippled, so that's not surprising.

▲theshrike791 hour ago

It depends. If they're using a small/medium local model as a 1:1 ChatGPT replacement as-is, they'll have a bad time. Even ChatGPT refers to external services to get more data.

But a local model + good harness with a robust toolset will work for people more often than not.

The model itself doesn't need to know who was the president of Zambia in 1968, because it has a tool it can use to check it from Wikipedia.
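A rough Python sketch of what such a harness does with a tool call, assuming the model emits its call as a small JSON object. The `wikipedia_lookup` function is a hypothetical stub (a real one would fetch the article), and the JSON shape is an assumption, not any particular model's or runtime's format.

```python
import json


def wikipedia_lookup(title: str) -> str:
    """Hypothetical tool: a real harness would fetch the article text here."""
    return f"<summary of the Wikipedia article '{title}'>"


# Registry of tools the harness is willing to run on the model's behalf.
TOOLS = {"wikipedia_lookup": wikipedia_lookup}


def run_tool_call(model_output: str) -> str:
    """Dispatch one tool call that the model emitted as JSON."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    return TOOLS[name](**call["arguments"])


# Stand-in for what a small local model might emit instead of guessing the answer.
model_output = '{"name": "wikipedia_lookup", "arguments": {"title": "President of Zambia"}}'
result = run_tool_call(model_output)
```

The tool result is then appended to the conversation, so the model only has to summarize it rather than recall the fact from its weights.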

▲ZeroGravitas1 hour ago

You can install the complete text of Wikipedia locally too.

They've usually been intended for ereader/off-grid/post-zombie-apocalypse situations but I'd guess someone is working on an llm friendly way to install it already.

Be interesting to know the tradeoffs. The Tiananmen Square example suggests why you'd maybe want the knowledge facts to come from a separate source.

▲zozbot2341 hour ago

The Wikipedia folks are now working on implementing a language-independent representation for their encyclopedic content - one that's intended to be rigorously compositional and semantics-aware, loosely comparable to Universal Meaning Representation (UMR) as known in the linguistics domain, that - if successful - may end up interacting in very interesting ways with multi-language capable LLMs. Very early experiments (nowhere near as capable as UMR as of yet, but experimenting with the underlying software infrastructure) are at https://abstract.wikipedia.org , whilst a direct comparison of the projected design is given by https://commons.wikimedia.org/wiki/File:Abstract_Wikipedia_N... https://elemwala.toolforge.org/static/nlgsig-nov2025.html

▲helsinkiandrew2 hours ago

> unfortunately, this is not the case

Most users are fixing grammar/spelling, summarising/converting/rewriting text, creating funny icons, and looking up simple facts; none of this needs frontier model performance.

I've a feeling that if/when Apple release their onboard LLM/Siri improvements that can call out if needed, the vast majority of people will be happy with what they get for free that's running on their phone.

▲cyanydeez13 minutes ago

eh, it's weird how the tech world wants to build trillions of data centers for...what, escaping the permanent underclass?

I think what "need" you speak of is a bit of a colored statement.

▲blitzar1 hour ago

"Hey dingus, set timer for 30 minutes"

▲ZeroGravitas3 hours ago

It feels like you'll soon need a local llm to intermediate with the remote llm, like an ad blocker for browsers to stop them injecting ads or remind you not to send corporate IP out onto the Internet.

▲tomashubelbauer2 hours ago

I'd like to coin the term "user agent" for this

▲blitzar1 hour ago

"copilot" seems a good term

could also be considered a triage layer

▲jl63 hours ago

Not sure about the using less electricity part. With batching, it’s more efficient to serve multiple users simultaneously.

▲TeMPOraL3 hours ago

Indeed. Data centers have so many ways and reasons to be much more energy-efficient than local compute it's not even funny.

▲karimf3 hours ago

Depending on the use case, the future is already here.

For example, last week I built a real-time voice AI running locally on iPhone 15.

One use case is for people learning to speak English. The STT is quite good and the small LLM is enough for basic conversation.

https://github.com/fikrikarim/volocal

▲Barbing3 hours ago

Brilliant. Hope to see you in the App Store!

▲karimf3 hours ago

Oh thank you! I wasn’t sure if it was worth submitting to the app store since it was just a research preview, but I could do it if people want it.

▲nbenitezl49 minutes ago

But when using it in the cloud, an LLM can consult 50 websites, which is super fast for their datacenters as they sit on the backbone of the internet; on your device you'll have to wait much longer to consult those websites before the LLM gives you its response. Am I wrong?

▲comboy38 minutes ago

As things stand today even when doing research tasks, time spent by model is >> than fetching websites. I don't see it changing any time soon, except when some deals happen behind the scenes where agents get to access CF guarded resources that normally get blocked from automated access.

▲Const-me38 minutes ago

While data centres indeed have awesome internet connectivity, don’t forget the bandwidth is shared by all clients using a particular server.

If you have a 100 Mbit/s internet connection at home and a computer in a data centre has 10 Gbit/s, but that server is serving 200 concurrent clients, your bandwidth is twice as fast.
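The arithmetic behind that comparison, as a quick Python check. It assumes the server's link is split evenly across its clients, which is a simplification.

```python
home_mbit = 100        # home connection, Mbit/s
server_mbit = 10_000   # data centre machine, 10 Gbit/s expressed in Mbit/s
clients = 200          # concurrent clients sharing that machine

# Even split of the server's link across its clients.
per_client_mbit = server_mbit / clients

# How much faster the home connection is than one client's share.
ratio = home_mbit / per_client_mbit
```

Under that assumption each client gets 50 Mbit/s, so the home connection has twice the bandwidth of any one client's share.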

▲iNic1 hour ago

It will probably be a future. My guess is that for many businesses it will still make sense to have more powerful models and to run them centralized in a datacenter. Also, by batching queries you can get efficiencies at scale that might be hard to replicate locally. I can also see a hybrid approach where local models get good at handing off to cloud models for complex queries.

▲niek_pas1 hour ago

> For many businesses it will still make sense to have more powerful models and to run them centralized in a datacenter.

Agree, and I think of it this way: for a lot of businesses, it already makes sense to have a bunch of more powerful computers and run them centralized in a datacenter. Nevertheless, most people at most companies do most of their work on their Macbook Air or Dell whatever. I think LLMs will follow a similar pattern: local for 90% of use cases, powerful models (either on-site in a datacenter or via a service) for everything else.

▲zozbot2341 hour ago

> Most users don't need frontier model performance.

SSD weights offload makes it feasible to run SOTA local models on consumer or prosumer/enthusiast-class platforms, though with very low throughput (the SSD offload bandwidth is a huge bottleneck, mitigated by having a lot of RAM for caching). But if you only need SOTA performance rarely and can wait for the answer, it becomes a great option.

▲goldenarm1 hour ago

It's more secure, but it would make supply much much worse.

Data centers use GPU batching, much higher utilisation rates, and more efficient hardware. It's borderline two orders of magnitude more efficient than your desktop.

▲amelius2 hours ago

LLM in silicon is the future. It won't be long until you can just plug an LLM chip into your computer and talk to it at 100x the speed of current LLMs. Capability will be lower but their speed will make up for it.

▲jillesvangurp21 minutes ago

You can always delegate sub agents to cloud based infrastructure for things that need more intelligence. But the future indeed is to keep the core interaction loop on the local device always ready for your input.

A lot of stuff that we ask of these models isn't all that hard. Summarize this, parse that, call this tool, look that up, etc. 99.999% really isn't about implementing complex algorithms, solving important math problems, working your way through a benchmark of leet programming exercises, etc. You also really don't need these models to know everything. It's nice if it can hallucinate a decent answer to most questions. But the smarter way is to look up the right answer and then summarize it. Good enough goes a long way. Speed and latency are becoming a key selling point. You need enough capability locally to know when to escalate to something slower and more costly.
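One way to picture "knowing when to escalate" is a local-first router, sketched here in Python. Everything in it is an assumption for illustration: the idea that the local model reports a confidence score, the 0.7 threshold, and the two stub models.

```python
from typing import Callable, Tuple


def answer_with_escalation(
    prompt: str,
    local_model: Callable[[str], Tuple[str, float]],
    cloud_model: Callable[[str], str],
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Try the local model first; escalate when its self-reported confidence is low."""
    answer, confidence = local_model(prompt)
    if confidence >= threshold:
        return "local", answer
    return "cloud", cloud_model(prompt)


# Stub models for illustration only.
def fake_local(prompt: str) -> Tuple[str, float]:
    # Pretend short prompts are easy and long ones are hard.
    return "local answer", 0.9 if len(prompt) < 40 else 0.3


def fake_cloud(prompt: str) -> str:
    return "cloud answer"


easy = answer_with_escalation("Summarize this paragraph.", fake_local, fake_cloud)
hard = answer_with_escalation(
    "Prove this conjecture about prime gaps in full detail.", fake_local, fake_cloud
)
```

The hard part in practice is the confidence signal itself; a small model that cannot tell when it is out of its depth will never take the cloud branch.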

This will drive an overdue increase in memory size of phones and laptops. Laptops especially have been stuck at the same common base level of 8-16GB for about 15 years now. Apple still sells laptops with just 8GB (their new Neo). I had a 16 GB MacBook Pro in 2012. At the time that wasn't even that special. My current one has 48GB; enough for some of the nicer models. You can get as much as 256GB today.

▲zozbot23415 minutes ago

> This will drive an overdue increase in memory size of phones and laptops.

DRAM costs are still skyrocketing, so no, I don't think so. It's more likely that we'll bring back wear-resistant persistent memory as formerly seen with Intel Optane.

▲theshrike791 hour ago

I'm expecting someone to come up with an LLM version of the Coral USB Accelerator: https://www.coral.ai/products/accelerator

Just plug in a stick in your USB-C port or add an M.2 or PCIe board and you'll get dramatically faster AI inference.

▲pezgrande5 hours ago

You could argue that the only reason we have good open-weight models is because companies are trying to undermine the big dogs, and they are spending millions to make sure they don't get too far ahead. If the bubble pops then there won't be incentive to keep doing it.

▲aurareturn5 hours ago

I agree. I can totally see in the future that open source LLMs will turn into paying a lump sum for the model. Many will shut down. Some will turn into closed source labs.

When VCs inevitably ask their AI labs to start making money or shut down, those free open source LLMs will cease to be free.

Chinese AI labs have to release free open source models because they distill from OpenAI and Anthropic. They will always be behind. Therefore, they can't charge the same prices as OpenAI and Anthropic. Free open source is how they can get attention and how they can stay fairly close to OpenAI and Anthropic. They have to distill because they're banned from Nvidia chips and TSMC.

Before people tell me Chinese AI labs do use Nvidia chips, there is a huge difference between using older gimped Nvidia H100 (called H20) chips or sneaking around Southeast Asia for Blackwell chips and officially being allowed to buy millions of Nvidia's latest chips to build massive gigawatt data centers.

▲pezgrande4 hours ago

> have to release free open source models because they distill from OpenAI and Anthropic

They don't really have to though, they just need to be good enough and cheaper (even if distilled). That being said, it is true they are gaining a lot of visibility (especially Qwen) because of being open-source(weight).

Hardware-wise it seems they will catch up in 3-5 years (Nvidia is kind of irrelevant, what matters is the node).

▲aurareturn3 hours ago

I highly doubt they can catch up in 3-5 years to Nvidia.

Chips take about 3 years to design. Do you think China will have Feynman-level AI systems in 3 years?

I think in 3 years, they'll have H200-equivalent at home.

▲spiderfarmer4 hours ago

“They will always be behind”

Car manufacturers said the same.

▲aurareturn4 hours ago

It did take decades to catch and surpass US car makers right?

▲seanmcdirmid4 hours ago

About 2.5 decades from the start of the JVs, but they did it. Semiconductors and jet turbines are really the last two tech trees that China has yet to master.

▲aurareturn3 hours ago

Right. When I said "they'll always be behind", I meant in the next 5-10 years. They're gated by EUV tech. And once they have EUV tech, they need to scale up chip manufacturing.

▲Barbing3 hours ago

Which might they master first?

▲Lio4 hours ago

This seems to be somewhat similar to web browsers.

I could see the model becoming part of the OS.

Of course Google and Microsoft will still want you to use their models so that they can continue to spy on you.

Apple, AMD and Nvidia would sell hardware to run their own largest models.

▲mirekrusin3 hours ago

You can have a viable business model around open weight models where you offer fine tuning at a fee.

▲thih93 hours ago

> it also would use less electricity

How would it use less electricity? I’d like to learn more.

▲jychang3 hours ago

That's completely not true. LLM on device would use MORE electricity.

Service providers that do batch>1 inference are a lot more efficient per watt.

Local inference can only do batch=1 inference, which is very inefficient.
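A back-of-the-envelope Python sketch of why batching helps, assuming token generation is limited by reading the weights from memory and that one pass over the weights serves every sequence in the batch. The 16 GB figure is an assumption (roughly an 8B-parameter model at 16-bit precision), and the sketch ignores KV cache traffic and compute limits, which cap the gain in practice.

```python
def weight_bytes_per_token(model_bytes: float, batch_size: int) -> float:
    """Weight traffic per generated token when one pass over the weights serves the whole batch."""
    return model_bytes / batch_size


model_bytes = 16e9  # assumption: ~8B parameters * 2 bytes each

single = weight_bytes_per_token(model_bytes, 1)    # batch=1, the local case
batched = weight_bytes_per_token(model_bytes, 32)  # batch=32, a hypothetical server

# Factor by which batching cuts weight traffic per token.
saving = single / batched
```

With these numbers the batched server moves 32 times fewer weight bytes per generated token, which is where most of the per-watt advantage comes from.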

▲overfeed3 hours ago

> It's just a matter of getting the performance good enough.

Who will pay for the ongoing development of (near-)SoTA local models? The good open-weight models are all developed by for-profit companies - you know how that story will end.

▲miki1232113 hours ago

> would use less electricity

Sorry to shatter your bubble, but this is patently false, LLMs are far more efficient on hardware that simultaneously serves many requests at once.

There's also the (environmental and monetary) cost of producing overpowered devices that sit idle when you're not using them, in contrast to a cloud GPU, which can be rented out to whoever needs it at a given moment, potentially at a lower cost during periods of lower demand.

Many LLM workloads aren't even that latency sensitive, so it's far easier to move them closer to renewable energy than to move that energy closer to you.

▲zozbot2341 hour ago

> LLMs are far more efficient on hardware that simultaneously serves many requests at once.

The LLM inference itself may be more efficient (though this may be impacted by different throughput vs. latency tradeoffs; local inference makes it easier to run with higher latency) but making the hardware is not. The cost for datacenter-class hardware is orders of magnitude higher, and repurposing existing hardware is a real gain in efficiency.

▲Tepix1 hour ago

Seems doubtful. The utilisation will be super high for data center silicon whereas your PC or phone at home is mostly idle.

▲zozbot2341 hour ago

> your PC or phone at home is mostly idle

If you're purely repurposing hardware that you need anyway for other uses, that doesn't really matter.

(Besides, for that matter, your utilization might actually rise if you're making do with potato-class hardware that can only achieve low throughput and high latency. You'd be running inference in the background, basically at all times.)

▲ysleepy2 hours ago

I'm actually not sure that's true. Apart from people buying the device with or without the neural accelerator, the perf/watt could be on par or better with the big iron. The efficiency sweet-spot is usually below the peak performance point, see big.little architectures etc.

▲kortilla3 hours ago

Well this is an article about running on hardware I already have in my house. In the winter that’s just a little extra electricity that converts into “free” resistive heating.

▲nikanj3 hours ago

That also means sending every user a copy of the model that you spend billions training. The current model (running the models at the vendor side) makes it much easier to protect that investment

▲gedy5 hours ago

Man I really hope so, as, as much as I like Claude Code, I hate the company paying for it and tracking your usage, bullshit management control, etc. I feel like I'm training my replacement. Things feel like they are tightening vs more power and freedom.

On device I would gladly pay for good hardware - it's my machine and I'm using as I see fit like an IDE.

▲aurareturn5 hours ago

When local LLMs get good enough for you to use delightfully, cloud LLMs will have gotten so much smarter that you'll still use it for stuff that needs more intelligence.

▲gedy5 hours ago

True, but I'm already producing code/features faster than the company knows what to do with (even though every company says "omg we need this yesterday", etc). Even coding before AI was basically the same.

Code tools that free up my time are very nice.

▲aurareturn5 hours ago

It isn't going to replace cloud LLMs since cloud LLMs will always be faster in throughput and smarter. Cloud and local LLMs will grow together, not replace each other.

I'm not convinced that local LLMs use less electricity either. Per token at the same level of intelligence, cloud LLMs should run circles around local LLMs in efficiency. If it doesn't, what are we paying hundreds of billions of dollars for?

I think local LLMs will continue to grow and there will be a "ChatGPT" moment for it when good enough models meet good enough hardware. We're not there yet though.

Note, this is why I'm big on investing in chip manufacture companies. Not only are they completely maxed out due to cloud LLMs, but soon, they will be double maxed out having to replace local computer chips with ones that are suited for inferencing AI. This is a massive transition and will fuel another chip manufacturing boom.

▲raincole4 hours ago

Yep. People were claiming DeepSeek was "almost as good as SOTA" when it came out. Local will always be one step away like fusion.

It's just wishful thinking (and hatred towards American megacorps). Old as the hills. Understandable, but not based on reality.

▲kortilla2 hours ago

Don’t try to draw trend lines for an industry that has existed for <5 years.

▲virtue35 hours ago

We are 100% there already. In browser.

the webgpu model in my browser on my m4 pro macbook was as good as chatgpt 3.5 and doing 80+ tokens/s

Local is here.

▲AndroTux4 hours ago

Sir, ChatGPT 3.5 is more than 3 years old, running on your bleeding edge M4 Pro hardware, and only proves the previous commenter's point.

▲AugSun4 hours ago

It works really well for "You're helpful assistant / Hi / Hello there. how may I help you today?" Anything else (esp in non-EN language) and you will see the limitations yourself. just try it.

▲mirekrusin3 hours ago

Local RTX 5090 is actually faster than A100/H100.

▲aurareturn2 hours ago

It's a $4,000 GPU with 32GB of VRAM and needs a 1,000 watt PSU. It's not realistic for the masses.

If it has something like 80GB of VRAM, it'll cost $10k.

The actual local LLM chip is Apple Silicon starting at the M5 generation with matmul acceleration in the GPU. You can run a good model using an M5 Max 128GB system. Good prompt processing and token generation speeds. Good enough for many things. Apple accidentally stumbled upon a huge advantage in local LLMs through unified memory architecture.

Still not for the masses and not cheap and not great though. Going to be years to slowly enable local LLMs on general mass local computers.

▲hrmtst938374 hours ago

You're assuming throughput sets the value, but offline use and privacy change the tradeoff fast.

▲aurareturn3 hours ago

Yea I get that there will always be demand for local waifus. I never said local LLMs won't be a thing. I even said it will be a huge thing. Just won't replace cloud.

▲AugSun5 hours ago

Looking at downvotes I feel good about SDE future in 3-5 years. We will have a swamp of "vibe-experts" who won't be able to pay 100K a month to CC. Meanwhile, people who still remember how to code in Vim will (slowly) get back to pre-COVID TC levels.

▲QuantumNomad_5 hours ago

What is CC and TC? I have not heard these abbreviations (except for CC to mean credit card or carbon copy, neither of which is what I think you mean here).

▲Ericson23144 hours ago

I figured it out from context clues

CC: Claude Code

TC: total comp(ensation)

▲AugSun4 hours ago

Thank you for clarifying! (I had no idea it needs to be explained, sorry.)

▲Yukonv3 hours ago

Good to see Ollama is catching up with the times for inference on Mac. MLX powered inference makes a big difference, especially on M5 as their graphs point out. What really has been a game changer for my workflow is using https://omlx.ai/ that has SSD KV cold caching. No longer have to worry about a session falling out of memory and needing to prefill again. Combining that with the M5 Max prefill speed means more time is spent on generation than waiting for a 50k+ context window to process.
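The general idea of a cold cache on disk can be sketched in a few lines of Python. This is a generic illustration, not how oMLX implements it: `expensive_prefill` is a stand-in that returns a fake "KV state", and keying the cache on a hash of the full prompt is an assumption (a real engine would cache reusable prefixes).

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # stand-in for a cache directory on the SSD
prefill_calls = 0


def expensive_prefill(prompt: str) -> list:
    """Stand-in for real prefill; returns a fake per-token state."""
    global prefill_calls
    prefill_calls += 1
    return [len(token) for token in prompt.split()]


def kv_state_for(prompt: str) -> list:
    """Return the cached state from disk if present, otherwise prefill and store it."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    state = expensive_prefill(prompt)
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return state


first = kv_state_for("a long system prompt and chat history")   # prefill runs
second = kv_state_for("a long system prompt and chat history")  # served from disk
```

The second call never touches `expensive_prefill`, which is the effect described above: a session that fell out of memory is read back instead of recomputed.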

▲robotswantdata2 hours ago

Why are people still using Ollama? Serious.

Lemonade or even llama.cpp are much better optimised and arguably just as easy to use.

▲vorticalbox11 minutes ago

i like ollama, mostly because the cli is pretty nice. its desktop app has stupid choices like if a model can support tools then the ui should give me the "search" option but it only shows for cloud models.

i have run lmstudio for a while but i don't really use local models that much other than to mess about.

▲zozbot2349 minutes ago

You can also use OpenWebUI locally which should give you a nice friendly UX once you set it up.

▲LuxBennu5 hours ago

Already running qwen 70b 4-bit on m2 max 96gb through llama.cpp and it's pretty solid for day to day stuff. The mlx switch is interesting because ollama was basically shelling out to llama.cpp on mac before, so native mlx should mean better memory handling on apple silicon. Curious to see how it compares on the bigger models vs the gguf path

▲zozbot2342 hours ago

They initially messed up this launch and overwrote some of the GGUF models in their library, making them non-downloadable on platforms other than Apple Silicon. Hopefully that gets fixed.

▲goldenarm1 hour ago

How many tokens per second?

▲daveorzach28 minutes ago

What are significant differences between Ollama and LM Studio now? I haven’t used Ollama because it was missing MLX when I started using LLM GUIs.

▲codelion5 hours ago

How does it compare to some of the newer mlx inference engines like optiq that support turboquantization - https://mlx-optiq.pages.dev/

▲janandonly1 hour ago

> Please make sure you have a Mac with more than 32GB of unified memory.

Yeah, I can still save money by buying a cheaper device with less RAM and just paying my PPQ.AI or OpenRouter.com fees.

▲zozbot23453 minutes ago

> Please make sure you have a Mac with more than 32GB of unified memory.

The lack of proper support for SSD offload (via mmap or otherwise) is really the worst part about this. There's no underlying reason why a 3B-active model shouldn't be able to run, however slowly, on a cheap 8GB MacBook Neo with active weights being streamed in from SSD and cached. (This seems to be in the works for GGML/GGUF as part of upgrading to newer upstream versions; no idea whether MLX inference can also support this easily.)
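For readers unfamiliar with the mmap approach, a toy Python sketch of the mechanism: the weights file is mapped into memory and only the slice for the currently active expert is read, so the OS pages in just the pages that are touched. The file layout (8 experts of 1024 float32 values each) is invented for the example and has nothing to do with the GGUF or MLX formats.

```python
import mmap
import os
import struct
import tempfile

NUM_EXPERTS = 8
FLOATS_PER_EXPERT = 1024
EXPERT_BYTES = FLOATS_PER_EXPERT * 4  # float32 is 4 bytes

# Write a toy "weights file": expert i is filled with the value float(i).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    for i in range(NUM_EXPERTS):
        f.write(struct.pack(f"<{FLOATS_PER_EXPERT}f", *([float(i)] * FLOATS_PER_EXPERT)))


def load_expert(mm: mmap.mmap, index: int) -> tuple:
    """Read one expert's weights; only the pages backing this slice are touched."""
    start = index * EXPERT_BYTES
    return struct.unpack_from(f"<{FLOATS_PER_EXPERT}f", mm, start)


with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        active = load_expert(mm, 3)  # only expert 3's slice is read
```

Pages that were read stay in the OS page cache while RAM allows, which is the "streamed in from SSD and cached" behaviour; throughput is then bounded by how often the active weights miss that cache.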

▲harel2 hours ago

What would be the non-Mac computer to run these models locally at the same performance profile? Any similar Linux ARM-based computers that can reach the same level?

▲dabinat38 minutes ago

Intel’s doing interesting things with their Arc GPUs. They’re offering GPUs that aren’t super fast for gaming but are relatively low power and have a boatload of VRAM. The new B70 is half the retail price of a 5090 (probably more like 1/3rd or 1/4 of actual 5090 selling prices) but has the same amount of memory and half the TDP. So for the same price as a 5090 you could get several and use them together.

▲sgt2 hours ago

Not even close. If you want to run this on PCs you need a GPU like a 5090, but that's still not the same cost per token, and it will be less reliable and use a lot more power. Right now the Apple Silicon machines are the most cost-effective per token and per watt.

▲harel1 hour ago

It's odd that no manufacturer has jumped on this bandwagon to offer a competitive alternative.

▲hu338 minutes ago

Is there even enough market for this?

These models are dumber and slower than API SoTA models and will always be.

My time and sanity are much more expensive than insurance against any risk of sending my garbage code to companies worth hundreds of billions of dollars.

For most, it's a downgrade to use local models on multiple fronts: total cost of ownership, software maintenance, electricity bill, losing performance on the machine doing the inference, having to deal with more hallucinations/bugs/lower quality code and slower iteration speed.

▲zozbot23427 minutes ago

> These models are dumber and slower than API SoTA models and will always be.

Sure but you're paying per-token costs on the SoTA models that are roughly an order of magnitude higher than third-party inference on the locally available models. So when you account for per-token cost, the math skews the other way.

▲theshrike791 hour ago

Framework Desktop is the closest one with the MAX 385/395 chip. It's mostly about the memory being fast enough rather than just CPU/GPU oomph.

The 64GB model is 2240€ base and the 128GB is 3069€ base + all the stuff you need to add to make it an actual computer.

As a comparison the 64GB Mac Mini is 2499€ here and a 128GB Mac Studio is 4274€.

▲eigenspace18 minutes ago

Note though that a MAX 395 has half the memory bandwidth of an M4 Max chip, and the memory bandwidth is going to be the biggest limiting factor, so you'll likely be getting around half the tokens/second with that Framework Desktop.
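
The bandwidth argument can be turned into a back-of-envelope bound: if every generated token has to read the active weights once, decode speed is at most bandwidth divided by the bytes of active weights. The sketch below assumes peak figures of 546 GB/s for a top-spec M4 Max and 256 GB/s for the MAX 395, plus a 3B-active model at roughly 4.5 bits per weight; all of these are assumptions, and real throughput will be lower once the KV cache and compute overhead are included.

```python
def decode_tokens_per_second_bound(bandwidth_gb_s, active_params_billions, bits_per_weight):
    """Upper bound on decode speed if every token reads all active weights once."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed peak memory bandwidths in GB/s (vendor figures, not measurements).
machines = {
    "M4 Max (assumed 546 GB/s)": 546.0,
    "MAX 395 (assumed 256 GB/s)": 256.0,
}

for name, bandwidth in machines.items():
    bound = decode_tokens_per_second_bound(
        bandwidth, active_params_billions=3.0, bits_per_weight=4.5
    )
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Because the model term cancels, the ratio between the two machines is just the ratio of their bandwidths, which is where the "around half" estimate comes from.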

▲rubymamis32 minutes ago

I wonder if the Snapdragon X Elite has already caught up with Apple's M series in that regard - does anybody know?

▲dial9-15 hours ago

still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16gb of ram

▲rubymamis31 minutes ago

Doesn't OpenCode support local models?

▲gedy5 hours ago

How close is this? It says it needs 32GB min?

▲HDBaseT5 hours ago

You can run Qwen3.5-35B-A3B on 32GB of RAM, sure. But 'Claude Code' performance, by which I assume he means Sonnet- or Opus-level models in 2026, is likely still a few years away from being runnable locally (with reasonable hardware).

▲Foobar85685 hours ago

I fully agree. I run that one with Q4 on my MBP, and the performance (including quality of response) is a letdown.

I am wondering how people rave so much about local "small device" LLMs compared to what Codex or Claude Code are capable of.

Sadly there is too much hype around local LLMs; they look great in 5-minute tests and that's it.

▲brcmthrowaway5 hours ago

Just train it better with AGENTS.md

▲mfa19995 hours ago

How does this compare to llama.cpp in terms of performance?

▲solarkraft4 hours ago

MLX is a bit faster (low double digit percentage), but uses a bit more RAM. Worthwhile tradeoff for many.

▲ysleepy2 hours ago

On my M4 Pro, MLX gets almost 2x the tok/s.

▲AugSun5 hours ago

"We can run your dumbed down models faster":

> The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
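
The quoted ratios can be sanity-checked with a small calculation. The sketch below assumes NVFP4 stores 4-bit values with one shared 8-bit scale per 16-value block and ignores any per-tensor scale as negligible; that layout is an assumption, not something stated in the quote.

```python
def effective_bits_per_weight(value_bits, scale_bits, block_size):
    """Average storage per weight when each block shares one scale factor."""
    return value_bits + scale_bits / block_size


# Assumed NVFP4 layout: 4-bit values, one 8-bit scale per 16-value block.
nvfp4_bits = effective_bits_per_weight(value_bits=4, scale_bits=8, block_size=16)

print(f"effective bits per weight: {nvfp4_bits}")
print(f"reduction vs FP16: {16 / nvfp4_bits:.2f}x")
print(f"reduction vs FP8:  {8 / nvfp4_bits:.2f}x")
```

Under that assumption the effective cost is 4.5 bits per weight, which gives about 3.56x versus FP16 and 1.78x versus FP8, in line with the quoted 3.5x and 1.8x.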

▲puskuruk3 hours ago

Finally! My local infra has been waiting for this for months!

▲brcmthrowaway5 hours ago

What is the difference between Ollama, llama.cpp, ggml and gguf?

▲benob4 hours ago

Ollama is a user-friendly UI for LLM inference. It is powered by llama.cpp (or a fork of it) which is more power-user oriented and requires command-line wrangling. GGML is the math library behind llama.cpp and GGUF is the associated file format used for storing LLM weights.
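
To make the "file format" part concrete, here is a minimal sketch of reading the fixed-size start of a GGUF file. It assumes the version 2 or later layout (4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata counts, all little-endian); `parse_gguf_header` is a hypothetical helper and the header bytes are synthetic, not taken from a real model.

```python
import struct


def parse_gguf_header(data):
    """Parse the fixed-size start of a GGUF file (little-endian, version 2 or later)."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file: bad magic")
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for illustration; with a real file you would pass the
# first 24 bytes, e.g. open(path, "rb").read(24).
fake_header = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(parse_gguf_header(fake_header))
```

The metadata key/value pairs and tensor descriptions follow this header; parsing those is left out here.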

▲redmalang3 hours ago

i've found llama.cpp (as i understand it, ollama now uses their own version of this) to work much better in practice, faster and much more flexible.

▲xiconfjs5 hours ago

Ollama on macOS is a one-click solution with stable one-click updates. Happy so far. But the MLX support was the only missing piece for me.

▲yard20102 hours ago

Can you please write about your hardware?

▲darshanmakwana2 hours ago

Really nice to see this!