I built an expense tracker that runs entirely on-device
How Gastos uses Vision, Speech, and Natural Language frameworks to bring AI to expense tracking — without touching a server.
Most expense apps are built on a trade-off you’re not supposed to notice. The fast, smart ones — the ones with receipt scanning and search that actually works — they’re sending your data to a server. The private ones are usually pretty basic. Spreadsheets with better marketing.
I wanted to find out if that trade-off was real, or if it was just the path of least resistance for most teams. So I built Gastos: a private, local-first expense tracker for iPhone that runs AI entirely on-device.
Here’s what I learned.
Why not just use a server?
Server-side AI is easier to build. You send an image, get structured data back, and pay per API call. The models are more capable. The accuracy is higher. For most apps, it’s the obvious call.
But for an expense tracker, I kept coming back to a few things that bothered me.
Your spending data is genuinely personal — not in a hypothetical future-breach way, but right now. Where you eat, what you drink, which pharmacies you go to, which cities you travel to. That data, in aggregate, is a detailed picture of your life. I didn’t want to be the person responsible for holding it on a server.
I also kept thinking about reliability. Offline reliability, specifically. You log an expense the moment you spend — at a market in a country with bad data, on a subway, in a restaurant with terrible signal. If the AI call depends on the network, you lose it exactly when you need it. That friction compounds. You stop logging. The app becomes useless.
And then there was the engineering constraint I found interesting: could Apple’s on-device frameworks actually handle it? Or would I be shipping a half-baked experience and calling it a feature?
The architecture
Gastos is built on SwiftData with no server component. There’s no account, no sign-up, no sync — your expenses live in a local database on your iPhone. That constraint is not a compromise; it’s the foundation every other decision sits on.
For AI, I used three frameworks:
VisionKit for receipt OCR. Point your camera at a receipt and VisionKit handles the text extraction. From there, I parse the structured data — amount, merchant name, currency, date — using a combination of pattern matching and Foundation Models where available. The results are good on clean receipts and inconsistent on thermal paper from small vendors. That’s not a VisionKit problem; it’s a receipt format problem. I haven’t found a solution that doesn’t involve a server.
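To make the "pattern matching" half concrete, here's a minimal sketch of the kind of parser that could sit downstream of VisionKit's text extraction. It assumes the scanner has already handed you an array of recognized text lines; the struct, function, and regex heuristics (merchant on the first line, last "total"-style line wins) are illustrative, not Gastos's actual implementation.

```swift
import Foundation

// Illustrative shape for the structured data pulled out of OCR text.
struct ParsedReceipt {
    var merchant: String?
    var amount: Decimal?
    var currency: String?
}

// `lines` is the text VisionKit already extracted, top to bottom.
func parseReceipt(lines: [String]) -> ParsedReceipt {
    var result = ParsedReceipt()
    // Heuristic: the merchant name is usually the first non-empty line.
    result.merchant = lines.first { !$0.trimmingCharacters(in: .whitespaces).isEmpty }
    // Look for "TOTAL ... $12.34" style lines. Later matches overwrite
    // earlier ones, since subtotals tend to appear before the grand total.
    let amountPattern = try! NSRegularExpression(
        pattern: "(?i)total\\D*?([$€£]|SGD|USD|EUR)?\\s*([0-9]+[.,][0-9]{2})"
    )
    for line in lines {
        let range = NSRange(line.startIndex..., in: line)
        guard let m = amountPattern.firstMatch(in: line, options: [], range: range)
        else { continue }
        if m.range(at: 1).location != NSNotFound,
           let cr = Range(m.range(at: 1), in: line) {
            result.currency = String(line[cr])
        }
        if let ar = Range(m.range(at: 2), in: line) {
            // Normalize "12,34" to "12.34" before parsing.
            let normalized = line[ar].replacingOccurrences(of: ",", with: ".")
            result.amount = Decimal(string: normalized)
        }
    }
    return result
}
```

The heuristics are exactly where thermal-paper receipts fall apart: no "total" keyword, merchant logo instead of a text name, and the parser returns partial results.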
SFSpeechRecognizer for voice input. Say “lunch twenty-two dollars” and the app parses it into a tagged expense. This works, and it’s the fastest input path when you don’t want to look at your screen. I’ll be honest: voice is the newest of the three input modes, and it shows. It’s functional, but it hasn’t had the same testing time as text and receipt input. If you try it and hit a rough edge, that’s expected — it’ll improve.
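The parsing step after transcription can be sketched like this. It assumes SFSpeechRecognizer has already produced a transcript string and handles only the simple "description amount dollars" phrasing from the example; real speech is messier, which is exactly where the rough edges show up. All names here are illustrative.

```swift
import Foundation

// Parse a transcript like "lunch twenty-two dollars" into a note and amount.
// Returns nil when no number is found in the transcript.
func parseSpokenExpense(_ transcript: String) -> (note: String, amount: Decimal)? {
    let units: [String: Int] = [
        "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
        "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
        "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
        "fourteen": 14, "fifteen": 15, "sixteen": 16,
        "seventeen": 17, "eighteen": 18, "nineteen": 19
    ]
    let tens: [String: Int] = [
        "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90
    ]
    var noteWords: [String] = []
    var amount = 0
    var sawNumber = false
    // Split on spaces and hyphens so "twenty-two" parses like "twenty two".
    let words = transcript.lowercased()
        .split(whereSeparator: { $0 == " " || $0 == "-" })
        .map(String.init)
    for word in words {
        if word == "dollars" || word == "dollar" { continue }
        if let v = tens[word] { amount += v; sawNumber = true }
        else if let v = units[word] { amount += v; sawNumber = true }
        else if let v = Int(word) { amount += v; sawNumber = true }
        else { noteWords.append(word) }
    }
    guard sawNumber else { return nil }
    return (noteWords.joined(separator: " "), Decimal(amount))
}
```

Even this toy version shows why non-round amounts are harder: "twenty-two fifty" is ambiguous in a way "twenty-two dollars" isn't.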
NLEmbedding for semantic search. This one is still on the roadmap. The plan is to index your expenses against semantic embeddings so that searching “food last week” or “transport in Tokyo” matches results by meaning rather than just keyword. NLEmbedding runs on-device, so it would fit the architecture cleanly. I haven’t shipped it yet because I want the indexing to be fast enough that it doesn’t feel like a penalty. Right now it’s the next significant AI feature in the queue.
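Since this is still on the roadmap, here is only the shape of the idea: assume NLEmbedding has already mapped each expense note and the search query to a vector, and ranking is then plain cosine similarity. The vectors below are stand-ins; the function names are mine, not a shipped API.

```swift
import Foundation

// Cosine similarity between two embedding vectors of equal length.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let magA = (a.reduce(0) { $0 + $1 * $1 }).squareRoot()
    let magB = (b.reduce(0) { $0 + $1 * $1 }).squareRoot()
    guard magA > 0, magB > 0 else { return 0 }
    return dot / (magA * magB)
}

// Rank indexed expense notes against a query vector, best match first.
func rank(query: [Double],
          expenses: [(note: String, vector: [Double])]) -> [String] {
    expenses
        .sorted { cosineSimilarity(query, $0.vector) > cosineSimilarity(query, $1.vector) }
        .map { $0.note }
}
```

The indexing-speed concern is the interesting part: embedding every note up front is what makes the search instant, and it's the step that must not feel like a penalty at log time.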
What worked
The receipt scanning is the one that surprised me most. I went in expecting VisionKit to be a rough approximation — something I’d have to apologize for. On most printed receipts, it extracts the right amount, the merchant name, and the date without any intervention. The currency detection is solid for common formats.
The offline architecture also turned out to be a better user experience than I expected. There’s no loading state for anything core. Search is instant. Analytics render immediately. When you remove network latency from the loop, the app just feels faster in a way that’s hard to attribute to any single decision.
SwiftData was the right call for storage. It integrates cleanly with SwiftUI, handles relationships well, and doesn’t add abstraction I don’t need. For an app with no sync requirement, it’s close to ideal.
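For a sense of what "integrates cleanly and handles relationships" looks like in practice, here is a minimal SwiftData sketch. The field names and models are illustrative, not Gastos's actual schema.

```swift
import Foundation
import SwiftData

// A minimal expense model. @Model gives you persistence with no
// mapping layer; the optional Category is a relationship SwiftData
// manages for you.
@Model
final class Expense {
    var amount: Decimal
    var currencyCode: String
    var note: String
    var date: Date
    var category: Category?

    init(amount: Decimal, currencyCode: String, note: String, date: Date = .now) {
        self.amount = amount
        self.currencyCode = currencyCode
        self.note = note
        self.date = date
    }
}

@Model
final class Category {
    var name: String
    init(name: String) { self.name = name }
}
```

With no sync requirement, the whole storage layer is roughly this plus a `ModelContainer` at app launch.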
What didn’t
Foundation Models — Apple’s on-device LLM for generating natural language output, like tidying up a receipt description — has a locale bug I haven’t been able to work around. On en-SG (Singapore English), the model fails silently. AI-assisted receipt descriptions fall back to regex-extracted text, which is functional but less readable. I’ve disclosed this because it’s the kind of thing that would frustrate a user in Singapore who expects the AI to work the same way as someone in the US. It doesn’t, and I haven’t shipped a fix yet.
Receipt accuracy on low-quality thermal paper is inconsistent. Small vendors, faded ink, non-standard layouts — VisionKit gives you what it can, but sometimes what it can is a partial result. I surface the extracted data and let you edit it immediately, which helps, but it’s not a solution. It’s an acknowledgment that the problem is hard.
Voice input still needs more real-world testing. It works in quiet environments. It gets shakier with background noise, unusual phrasing, or expenses with amounts in non-round numbers. I’d rather say that plainly than ship it with confident marketing copy and let you figure it out.
The real tradeoffs
Local-first is not a free lunch.
No sync means no second device. Your expenses are on your iPhone. If you switch phones, you export and import. If you lose your phone without a backup, you lose your data. These are real constraints, not theoretical edge cases. I export to JSON and CSV specifically so the data is yours to move, but the responsibility for that move is yours too.
There are no collaborative features. You can’t share an expense log with a partner or split costs automatically. The architecture doesn’t support it, and I haven’t found a way to add it without introducing a server.
The on-device AI is genuinely good, but it’s not the same as what a well-resourced cloud endpoint can do. Receipt parsing on unusual formats is weaker. Voice accuracy in noisy environments is weaker. I’ve traded accuracy ceiling for privacy and offline reliability. Whether that’s the right trade depends on what you care about.
Why it was worth building
The thing I keep coming back to is that the trade-off I started with — smart and convenient vs. private and offline — turned out to be smaller than I expected. Not gone. Smaller.
VisionKit handles the vast majority of receipts well enough that most users will never notice the edge cases. The offline experience is genuinely better than cloud apps in the situations where it matters most. And knowing that none of this touches a server is not a theoretical reassurance — it’s a structural fact about how the app is built.
If you’ve been looking for an expense tracker that doesn’t require an account, works on a plane, and doesn’t phone home when you scan a receipt — that’s what I built.
Gastos is a local-first expense tracker for iPhone. Log expenses by text, receipt photo, or voice. On-device AI, Travel Mode, and everything stays on your phone.
Frequently asked questions
- Can an app use AI without sending data to a server?
- Yes. Apple provides on-device frameworks — Vision and VisionKit for image analysis and OCR, Speech (SFSpeechRecognizer) for voice transcription, and Natural Language (NLEmbedding) for semantic text understanding — that run entirely on the iPhone's processor. Gastos uses these for receipt scanning, voice input, and text search. No API keys, no server calls, no data leaving your device.
- What frameworks does Apple provide for on-device AI?
- Vision and VisionKit for image analysis and OCR, Speech (SFSpeechRecognizer) for voice transcription, and Natural Language (NLEmbedding) for semantic text understanding. All run locally on Apple Silicon.
- How do you build an app that works completely offline?
- Use on-device storage (SwiftData), on-device AI frameworks, and cache any external data — like exchange rates — locally. The key constraint: no feature can depend on a network call to function. If it needs the internet, it needs a fallback.
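The exchange-rate case above can be sketched as a small fallback pattern. The fetch closure stands in for a network call, and every name here is illustrative: refresh the cache when you can, but never let the feature depend on the refresh succeeding.

```swift
import Foundation

// Convert an amount using a fresh rate when available, else the cached one.
// `fetchFresh` returning nil models a failed or absent network connection.
func convert(
    amount: Decimal,
    pair: String,
    cache: inout [String: Decimal],
    fetchFresh: () -> Decimal?
) -> Decimal? {
    // Opportunistically refresh the cache, but never require it.
    if let fresh = fetchFresh() {
        cache[pair] = fresh
    }
    // Offline: fall back to the last cached rate. A nil return means
    // there is no rate at all, and the UI has to ask the user.
    guard let rate = cache[pair] else { return nil }
    return amount * rate
}
```

The key property is that the offline path and the online path share the same code: going offline only means the rate is a little staler.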