I want to build a small site which acts as a broad, searchable FAQ for a certain topic.
Suppose I have this FAQ entry:
What is the approximate mass of Earth?
It's 5.9722 × 10^24 kilograms, wow!
I want the user to have a chance of finding this FAQ by asking "How heavy is our planet?"
On the face of it, the two questions share only one word, "is", which is an extremely common word. So something really simple like word comparison, or even stemming/lemmatization alone, won't help.
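To make that concrete, here's what naive word overlap sees for these two questions (plain Python, purely illustrative):

```python
# Purely illustrative: what naive word comparison sees for these two questions.
faq = "What is the approximate mass of Earth?"
query = "How heavy is our planet?"

def words(text):
    # Lowercase and drop the trailing question mark before splitting.
    return set(text.lower().rstrip("?").split())

print(words(faq) & words(query))  # {'is'} -- the only shared word is a stopword
```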
At the other end of the spectrum, a search engine's AI feature can interpret this effectively, rephrase the question, and give a similar answer. So, what strategies are in between these two extremes?
- A few people will be adding questions to the site regularly.
- If possible, no external services; just self-hosting on an affordable server.
- Simpler and lighter solutions are preferred.
Are any of the features in OpenSearch (the Elasticsearch/Lucene fork) able to do this? Is it overkill?
Since the site will have new questions to match regularly, will a solution require repeated, wasteful retraining of NLP models to create weights? Or is training so efficient for small-scale text datasets that it's responsible and reasonable to do on a cheap low-end server?
edit: Just spitballing here: I could try a solution that does the bulk of the work at insert time rather than at query time, by asking a general pre-trained language model to rephrase the question many different ways, or to generate keywords, then using those responses to generate tags for a basic keyword search to match against (rough sketch after the example below). This would avoid building a heavy search function or retraining any model on the server.
Example result (from GPT-4o mini):
Here’s a list of synonyms for the keywords in "What is the approximate mass of Earth?" formatted as an array of strings:
```json
[
  "weight",
  "heaviness",
  "bulk",
  "load",
  "volume",
  "estimated",
  "rough",
  "approximal",
  "near",
  "close to",
  "planet Earth",
  "the globe",
  "the world",
  "Terra",
  "our planet"
]
```
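Roughly what I have in mind, as a sketch (all names made up; the tag set comes from the synonym list above, lowercased and split into single words so multi-word phrases still match):

```python
# Rough sketch of the insert-time idea: tags are generated once by the LLM
# when a question is added, then queries hit a plain keyword match.
faqs = [
    {
        "question": "What is the approximate mass of Earth?",
        "answer": "It's 5.9722 x 10^24 kilograms, wow!",
        # Generated at insert time from the synonym list above.
        "tags": {"mass", "weight", "heaviness", "bulk", "load", "volume",
                 "estimated", "rough", "near", "close", "planet", "earth",
                 "globe", "world", "terra"},
    },
]

def search(query):
    q = set(query.lower().rstrip("?").split())
    # Rank FAQs by how many query words land in each entry's tag set.
    hits = [(len(q & f["tags"]), f) for f in faqs]
    return [f for n, f in sorted(hits, key=lambda h: h[0], reverse=True) if n > 0]

print(search("How heavy is our planet?")[0]["question"])  # matches via "planet"
```

Pairing this with stemming would presumably also let "heavy" hit "heaviness".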
I'm not an expert, but it sounds like you want embeddings plus a vector database. This essentially extracts the part of an LLM that "understands" (a loaded term, hence the quotes) the text you put in, and then does a lookup directly on that "understanding", so it's very good at finding alternate phrasings or slightly different questions.
There's no actual text generation involved, and no need to retrain anything when adding new questions.
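If it helps, here's a minimal sketch of the idea using the sentence-transformers Python library (not anything OpenSearch-specific, and the model name is just one popular small default):

```python
# Minimal embedding-lookup sketch with sentence-transformers
# (pip install sentence-transformers). Just the core idea, not production code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small; runs fine on CPU

# Embed each FAQ question once, at insert time -- no retraining, ever.
faq_questions = ["What is the approximate mass of Earth?"]
faq_vectors = model.encode(faq_questions)

# At query time, embed the user's question and compare by cosine similarity.
query_vector = model.encode("How heavy is our planet?")
print(util.cos_sim(query_vector, faq_vectors))  # high despite ~zero word overlap
```

Adding a new question is just one more encode call plus storing the vector; a vector database is what keeps the lookup fast once you have many of them.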
OpenSearch has an implementation (which I learned about just now while writing this comment and thus cannot vouch for); you could start there.
Thanks. Lots of the documentation was difficult to parse for a beginner, so your explanation of those features is very helpful.
Yeah, even though I have a bit of background I can't really make heads or tails of that OpenSearch doc at a glance, it's dense stuff.
In my experience, knowing the keywords to stick in a search engine is often half the battle; there are plenty of resources out there on "vector databases". "Semantic search", from the lede of the OpenSearch doc, might be another good term to keep on hand.
Feel free to ask me any other questions and I can try to answer to the best of my abilities, though again, not an expert and honestly I've never actually used these myself beyond toy examples.