KuraDB watches a folder, parses every document, and embeds it automatically — then serves semantic and CJK-aware keyword search over a strictly read-only local API.
curl -fsSL https://kuradb.agenvoy.com/scripts/install.sh | bash
Watch, parse, embed, and serve — a single Go binary with zero external vector database.
Drop a file in the watched folder. KuraDB sniffs, parses, chunks, and queues it for embedding — no commands, no schemas.
Two-stage cosine search: coarse source-level ranking, then a parallel chunk-level scan across CPU cores. Self-implemented, no ToriiDB.
gse-powered Chinese segmentation with stopwords. Scores rows by keyword hit count for precise, language-aware matching.
The HTTP API has no mutation endpoints by design. Data enters one way only — through the watcher pipeline. A trust boundary, not a config flag.
Every chunk and vector lives in SQLite. The in-memory vector cache is rebuildable — never authoritative, never lost on restart.
Upsert only invalidates a vector when content actually changed. Identical files keep their embeddings — no wasted OpenAI round-trips.
Query embeddings are cached in a global SQLite store. Repeated searches skip the embedding call and go straight to the vector scan.
Removed files are dismissed, not dropped. Every semantic and keyword query strictly filters dismissed content from results.
Binds a random local port and publishes its URL to an endpoint file, so Agenvoy discovers and consumes it as a child process automatically.
Data enters through a single path. The database package is the only write entry point — nothing else touches SQLite.
Poll the inbox every 10s. Detect changes via a persisted size + mtime snapshot.
walkFiles →Sniff text, dispatch by type, split into chunks — PDF, Office, sheets, markdown.
parser →Tick every 5s, batch of 64. Embed pending rows, write vectors back to SQLite.
openai →Read-only HTTP over localhost. Two-stage vector + keyword search over fresh rows.
ginSemantic for meaning, keyword for exact terms. Both filter dismissed content and never leak internal scoring fields.
Query cache → embed on miss → two-stage cosine search → drop hits below 0.3 → hydrate full rows from SQLite, grouped by source.
Tokenize the query with CJK-aware segmentation, score each row by how many keyword clauses match, order by hit count — always filtering dismissed rows.
Text sniffing skips binaries and media automatically. Everything else is parsed, chunked, and embedded.
Full text extraction, page-aware chunking
Word and PowerPoint document bodies
Tabular sheets flattened to rows
Markdown and plain UTF-8 text
Read-only HTTP for any consumer
Four endpoints to query, four subcommands to manage. No mutation surface anywhere.
All under /api, bound to localhost on a random free port. limit defaults to 10, max 100.
Run kura with no args to start the server. Subcommands manage the database registry.
Personal-scale data needs a linear scan and a good index — not a distributed vector cluster.
Install once, register a database, drop your files, and start querying.
Bootstraps Go and builds the kura binary. macOS and Linux.
curl -fsSL https://kuradb.agenvoy.com/scripts/install.sh | bash
Creates an inbox and a ~/Kura_notes symlink.
kura add notes
Copy documents into the symlinked inbox folder.
cp ~/Documents/*.pdf ~/Kura_notes/
Server auto-loads, watches, and embeds.
kura
One binary. One folder. Agent-ready search.