If you're running nginx I am using the following:
if ($http_user_agent ~* "SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot|ClaudeBot|Bytespider|ImagesiftBot|Barkrowler|DataForSeoBo|Amazonbot|facebookexternalhit|meta-externalagent|FriendlyCrawler|GoogleOther|PetalBot|Applebot") { return 403; }
That will block those that actually use recognisable user agents. I add any I find as I go on. It will catch a lot!
I also have a huuuuuge IP based block list (generated by adding all ranges returned from looking up the following AS numbers):
AS45102 (Alibaba cloud) AS136907 (Huawei SG) AS132203 (Tencent) AS32934 (Facebook)
Since these guys run or have run bots that impersonate real browser agents.
There are various tools online to return prefix/ip lists for an autonomous system number.
I put both into a single file and include it into my web site config files.
EDIT: Just to add, keeping on top of this is a full time job! EDIT 2: Removed Mojeek bot as it seems to be a normal web crawler.