When I talk about “Conscientious AI,” I keep circling back to proportionality: is an LLM really the right tool for the task, or can something simpler, cheaper, and more predictable get the job done just as well?

This question arose once again while I was working on parlametrics, a data platform that I developed based on the Austrian Parliament’s Open Data offering. Parliamentary inquiries in particular are a valuable source of data: They enable parliamentarians to extract detailed information about how the government operates. In the process, “keywords” are eventually assigned to each inquiry so that they can be grouped and searched more easily, but this often takes days or even weeks. To speed up this process, parlametrics tries to predict the most likely keywords for new inquiries based on their title, enabling users to discover similar inquiries without having to wait for manual tagging.

Of course, I could have thrown an LLM at the problem. A prompt like this would have worked with Gemini, Mistral, Claude, whatever you prefer:

Here's the title of a parliamentary inquiry from the Austrian Parliament: {title}. Which of the following keywords fits this inquiry best?
* Verkehr und Infrastruktur
* Bildung
* Soziales
* ...

But that comes with several caveats:

  • LLMs are stochastic: there’s no guarantee that the same prompt will yield the same result twice.
  • LLMs are costly: even if you pay only cents per million tokens, that adds up. Plus, there’s a non-negligible environmental footprint.
  • LLMs are slow: whether you host the model yourself or call an API, you have to factor in a few seconds for each round trip.

So, instead of doing that, I trained a tiny machine learning model on the ~80,000 historical inquiries with their manually assigned keywords. Training only takes a few minutes, and the model gives me the same result every time, in a fraction of a second.

The data pipeline is almost embarrassingly simple:

  • The scikit-learn Python package (imported as sklearn)
  • A CSV with 80,000 titles and their assigned keywords
  • A Vectorizer to turn text into sparse features:
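from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer="word",
    ngram_range=(1, 2),   # unigrams + bigrams
    max_features=50_000,  # keep only the 50,000 most frequent terms
    sublinear_tf=True,    # log-scale TF
    min_df=2,             # drop terms that appear in fewer than 2 titles
)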
  • A Classifier that learns one model per keyword:
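from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(
    LogisticRegression(C=5, max_iter=1000, solver="lbfgs"),
    n_jobs=-1,  # fit the per-keyword models on all available CPU cores
)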
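
To make "one model per keyword" concrete, here is a minimal sketch on a handful of invented titles. The toy titles and labels are assumptions for illustration only (the real input is the CSV of historical inquiries), and min_df is lowered to 1 and max_features left out because the toy corpus is tiny. MultiLabelBinarizer turns the keyword lists into one 0/1 column per keyword, and OneVsRestClassifier then fits one logistic regression per column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy data (assumption): the real training set has ~80,000 rows.
titles = [
    "Ausbau der Bahnstrecke",
    "Sanierung der Autobahn",
    "Lehrermangel an Schulen",
    "Digitalisierung der Schulen",
    "Erhöhung der Pensionen",
    "Pflegegeld und Pensionen",
    "Schulbus und Bahnverbindungen",
]
keywords = [
    ["Verkehr und Infrastruktur"],
    ["Verkehr und Infrastruktur"],
    ["Bildung"],
    ["Bildung"],
    ["Soziales"],
    ["Soziales"],
    ["Bildung", "Verkehr und Infrastruktur"],  # an inquiry can carry several keywords
]

# One binary column per keyword: shape (n_titles, n_keywords).
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(keywords)

# min_df=1 here only because the toy corpus is so small.
vectorizer = TfidfVectorizer(
    analyzer="word", ngram_range=(1, 2), sublinear_tf=True, min_df=1
)
X_train = vectorizer.fit_transform(titles)

# One logistic regression is fitted per column of Y_train.
clf = OneVsRestClassifier(LogisticRegression(C=5, max_iter=1000, solver="lbfgs"))
clf.fit(X_train, Y_train)

print("keywords:", list(mlb.classes_))
print("label matrix shape:", Y_train.shape)
print("number of per-keyword models:", len(clf.estimators_))
```

The number of fitted estimators equals the number of distinct keywords, which is what keeps each keyword's decision independent of the others.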

Training is as easy as clf.fit(X_train, Y_train). Once the weights are ready, I serialize the vectorizer, classifier, and MultiLabelBinarizer with joblib. Whenever the app needs a prediction, it just loads the artifacts and runs the following steps:

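import joblib

# Load the serialized artifacts for the requested model
vectorizer = joblib.load(f"{self.models_dir}/{model}/vectorizer.joblib")
clf = joblib.load(f"{self.models_dir}/{model}/classifier.joblib")
mlb = joblib.load(f"{self.models_dir}/{model}/mlb.joblib")

# Turn the title into TF-IDF features and score every keyword
X = vectorizer.transform([title])
proba = clf.predict_proba(X)

# Pair each keyword with its probability, highest first
scores = dict(zip(mlb.classes_, proba[0]))
sorted_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Keep keywords at or above the threshold, capped at top_n
predicted = [label for label, prob in sorted_scores if prob >= threshold][:top_n]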
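
The serialize-then-load round trip can be sketched end to end as follows. This is an illustration under assumptions, not the parlametrics code: the toy titles, the temporary directory, the helper name predict_keywords, and the default threshold and top_n values are all invented for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy data (assumption), standing in for the historical inquiries.
titles = [
    "Ausbau der Bahnstrecke",
    "Sanierung der Autobahn",
    "Lehrermangel an Schulen",
    "Digitalisierung der Schulen",
    "Erhöhung der Pensionen",
    "Pflegegeld und Pensionen",
]
keywords = [
    ["Verkehr und Infrastruktur"],
    ["Verkehr und Infrastruktur"],
    ["Bildung"],
    ["Bildung"],
    ["Soziales"],
    ["Soziales"],
]

mlb = MultiLabelBinarizer()
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
clf = OneVsRestClassifier(LogisticRegression(C=5, max_iter=1000))
clf.fit(vectorizer.fit_transform(titles), mlb.fit_transform(keywords))


def predict_keywords(title, vectorizer, clf, mlb, threshold=0.3, top_n=3):
    """Hypothetical helper: up to top_n keywords scoring at least threshold."""
    proba = clf.predict_proba(vectorizer.transform([title]))
    scores = dict(zip(mlb.classes_, proba[0]))
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [label for label, prob in ranked if prob >= threshold][:top_n]


# Serialize all three artifacts, then load them back as the app would.
with tempfile.TemporaryDirectory() as model_dir:
    artifacts = {"vectorizer": vectorizer, "classifier": clf, "mlb": mlb}
    for name, obj in artifacts.items():
        joblib.dump(obj, os.path.join(model_dir, f"{name}.joblib"))

    loaded_vectorizer = joblib.load(os.path.join(model_dir, "vectorizer.joblib"))
    loaded_clf = joblib.load(os.path.join(model_dir, "classifier.joblib"))
    loaded_mlb = joblib.load(os.path.join(model_dir, "mlb.joblib"))

title = "Lehrermangel an Schulen"
before = predict_keywords(title, vectorizer, clf, mlb, threshold=0.0)
after = predict_keywords(
    title, loaded_vectorizer, loaded_clf, loaded_mlb, threshold=0.0
)
print("ranking before saving:", before)
print("ranking after loading:", after)
```

All three artifacts have to be saved together: the vectorizer fixes the feature columns, and the MultiLabelBinarizer fixes which output column belongs to which keyword.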

The upside compared to spinning up an LLM is clear:

  1. Testability: Every deployment produces the same output for the same input.
  2. Predictability: I control the evaluation and can validate against historical ground truth before shipping.
  3. Efficiency: It runs on CPUs, uses very little memory, and avoids the latency and cost of an API call.
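
Validating against historical ground truth and checking for repeatable output could look roughly like the sketch below. The training and held-out titles, the 0.3 threshold, and the choice of micro-averaged F1 are all assumptions for illustration; the post does not say which metric or split parlametrics uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy data (assumption): a training set and a small held-out set.
train_titles = [
    "Ausbau der Bahnstrecke",
    "Sanierung der Autobahn",
    "Lehrermangel an Schulen",
    "Digitalisierung der Schulen",
    "Erhöhung der Pensionen",
    "Pflegegeld und Pensionen",
]
train_keywords = [
    ["Verkehr und Infrastruktur"],
    ["Verkehr und Infrastruktur"],
    ["Bildung"],
    ["Bildung"],
    ["Soziales"],
    ["Soziales"],
]
test_titles = ["Neue Bahnstrecke nach Graz", "Mehr Lehrer für Schulen"]
test_keywords = [["Verkehr und Infrastruktur"], ["Bildung"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_keywords)
Y_test = mlb.transform(test_keywords)  # same column order as Y_train

vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
clf = OneVsRestClassifier(LogisticRegression(C=5, max_iter=1000))
clf.fit(vectorizer.fit_transform(train_titles), Y_train)

# Score the held-out titles and apply a hypothetical probability cut-off.
X_test = vectorizer.transform(test_titles)
threshold = 0.3
Y_pred = (clf.predict_proba(X_test) >= threshold).astype(int)

# Micro-averaging counts every (title, keyword) decision equally.
micro_f1 = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
print(f"micro-F1 on held-out titles: {micro_f1:.2f}")

# Repeatability check: scoring the same input twice gives identical numbers.
same_output = np.array_equal(clf.predict_proba(X_test), clf.predict_proba(X_test))
print("identical output on repeated calls:", same_output)
```

With real data, the held-out set would come from historical inquiries that were excluded from training, so the score reflects how the model handles titles it has not seen.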

So if you ever find yourself debating whether an LLM is really necessary, try the old-school text classification route first. Odds are you already have enough data to train something that works, and it might just be the most conscientious AI choice.

PS: I also learned something about transparency here: when building user interfaces that render AI-generated data, regardless of whether an LLM or any other algorithm produced it, make it clear that the information the user is looking at was not produced by a human.