Admittedly, it may not have been my brightest idea to use BigQuery as the search backend for poketto.me. But since Firebase doesn’t have built-in full-text search, I would have needed to add another tool to my stack anyway. I figured BigQuery would be easier than managing something external like Apache Lucene or Elasticsearch. Plus, BigQuery has a built-in SEARCH(...) function, so why not give it a try?

As it turns out, SEARCH(...) is really more of a token-based text analyzer than a true search function:
1️⃣ It splits both the search term and the text into tokens (words).
2️⃣ It matches full tokens only.
3️⃣ It returns just TRUE or FALSE—no offsets, no match lengths.
🤯 And worst of all: It only works with ASCII. Yes. Like it’s 1979.
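To make those semantics concrete, here is a rough Python model of how SEARCH(...) behaves. This is my own illustrative sketch, not BigQuery's actual tokenizer (which is configurable and more sophisticated), but it captures the three limitations above:

```python
import re

def search_like(text: str, term: str) -> bool:
    """Crude model of SEARCH(): tokenize both sides, keep only
    ASCII tokens, then require every query token to appear as a
    *full* token. Returns just True/False, no offsets or lengths."""
    def tokenize(s: str) -> set[str]:
        return {t for t in re.findall(r"\w+", s.lower()) if t.isascii()}
    query = tokenize(term)
    return bool(query) and query <= tokenize(text)

text = "Der Finanzminister reist nach Österreich"
print(search_like(text, "Finanzminister"))  # full token   -> True
print(search_like(text, "Finanzmin"))       # partial word -> False
print(search_like(text, "Österreich"))      # non-ASCII    -> False
```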

With these limitations, my first iteration of full-text search in poketto.me was basically unusable:
😟 Searching for entire words worked (“Finanzminister” returned results).
😟 Partial words didn’t (“Finanzmin” returned nothing).
😟 Non-ASCII characters broke it completely (“Österreich” yielded no results).

So what did I do? I fell back on the bluntest tool in my toolbox: REGEXP_CONTAINS(...).

I know this isn’t a long-term solution, but for now, it works surprisingly well (and fast):
1️⃣ I implemented my own tokenization logic—splitting search terms into words, keeping quoted phrases together.
2️⃣ I construct a long SQL query using REGEXP_CONTAINS to check titles, full text, site names, and bylines:

REGEXP_CONTAINS(LOWER(text), LOWER(@search_term_{idx}))

3️⃣ For each “found” save, I scan the full text again for offsets and lengths to highlight matches in the UI.
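Stitched together, the three steps look roughly like this. It's a simplified sketch: the helper names (`tokenize`, `build_predicates`, `highlight_spans`) and the column list are my stand-ins, not the actual poketto.me code, but the `@search_term_{idx}` parameter pattern matches the snippet above:

```python
import re

def tokenize(query: str) -> list[str]:
    # Step 1: split the search term into words, but keep
    # "quoted phrases" together as a single term.
    return [phrase or word
            for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query)]

def build_predicates(terms: list[str], columns: list[str]) -> tuple[str, dict]:
    # Step 2: one REGEXP_CONTAINS check per term, OR-ed across the
    # columns, AND-ed across the terms. Each term is regex-escaped
    # and passed as a named query parameter, never interpolated.
    params, clauses = {}, []
    for idx, term in enumerate(terms):
        params[f"search_term_{idx}"] = re.escape(term.lower())
        per_column = " OR ".join(
            f"REGEXP_CONTAINS(LOWER({col}), @search_term_{idx})"
            for col in columns)
        clauses.append(f"({per_column})")
    return " AND ".join(clauses), params

def highlight_spans(text: str, terms: list[str]) -> list[tuple[int, int]]:
    # Step 3: rescan the full text of each hit for (offset, length)
    # pairs, which the UI uses to highlight the matches.
    return [(m.start(), len(m.group()))
            for term in terms
            for m in re.finditer(re.escape(term), text, re.IGNORECASE)]

terms = tokenize('Österreich "green budget"')
where_clause, params = build_predicates(
    terms, ["title", "text", "site_name", "byline"])
spans = highlight_spans("Österreich plans a green budget", terms)
```

The predicates slot into the query's WHERE clause, and using BigQuery query parameters keeps user input out of the SQL string itself.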

Clumsy? *Yes.* Sustainable? *Probably not.* But it works—for now—until I replace it with a real full-text search backend.

🔗 https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#search
🔗 https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_contains