this post was submitted on 15 Oct 2024
45 points (97.9% liked)

Fediverse


I've had real issues trying to search the fediverse. I've had bad luck with the search function of both Lemmy and Mbin, and while https://fedi-search.com/ exists, the Whoogle server is down, and either way the search just seems to be a list of various fediverse instances and nothing fancier (which also means that it's not a complete search?). Other than that, it's quite the hassle to list all the instances you'd like to search for every search. What's the best way to search the fediverse? What works for you? And is it somehow possible to add a shortcut to e.g. DDG that searches specific sites without having to type, for example, site:lemmy.dbzer0.com and all the other instances every time?
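For the shortcut part, here is a rough sketch of what such a helper could look like: it takes the search terms, appends one `site:` filter per instance, and builds a search URL. The function name, the instance list, and `example.social` are made up for illustration, and whether the search engine actually honours several `site:` filters joined with `OR` is an assumption that would need testing.

```python
from urllib.parse import urlencode

# Hypothetical list of instances to search; only lemmy.dbzer0.com comes from
# the post, example.social is a placeholder.
INSTANCES = ["lemmy.dbzer0.com", "example.social"]


def build_search_url(terms, instances=INSTANCES, base="https://duckduckgo.com/"):
    """Return a search URL that restricts `terms` to the given instances.

    Assumption: the engine accepts several site: filters joined by OR.
    """
    site_filter = " OR ".join(f"site:{host}" for host in instances)
    query = f"{terms} {site_filter}"
    # urlencode percent-encodes the query so it is safe to paste into a browser.
    return base + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_search_url("fediverse search"))
```

The resulting URL could be saved as a bookmark or wrapped in a shell alias, so the instance list only has to be maintained in one place.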

[–] JupiterRowland 1 points 3 days ago (1 children)

It's technologically impossible for any search to cover all of the Fediverse. Like, absolutely 100% of it.

That's because it's technologically impossible for anything in or outside the Fediverse to be aware of the full extent of the Fediverse and know all its instances, all its actors, all its (public) content in real-time.

It would only be possible if there was a fully centralised search engine. And that search engine had been hard-coded into all Fediverse server apps for years so that even instances that haven't been upgraded in two or three years know it.

If Joe Übergeek spun up his own personal CherryPick or (streams) or Forte instance or whatever on his own Raspi, that instance would immediately have to announce its existence to that centralised search engine. Otherwise, the search engine wouldn't have any way of knowing this new instance exists. If Joe Übergeek sent his first test post into the void because he has no connections yet, it would immediately have to be pushed to that search engine. And if Joe Übergeek decided to turn off ActivityPub on his (streams) channel, his instance would immediately have to notify the search engine which would immediately have to list that channel as formerly but no longer available.
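To make the requirements concrete, here is a toy model of such a centralised index. Everything in it (the class, the method names, the host names) is hypothetical and only mirrors the three steps described: an instance announces itself, pushes its posts, and withdraws. The point it illustrates is that an instance which never announces itself simply doesn't exist as far as the index is concerned.

```python
from dataclasses import dataclass, field


@dataclass
class CentralIndex:
    """Toy centralised search index (hypothetical, for illustration only)."""
    status: dict = field(default_factory=dict)   # host -> "active" or "gone"
    posts: list = field(default_factory=list)    # (host, text) pairs

    def announce(self, host):
        # A new instance has to report its own existence.
        self.status[host] = "active"

    def push_post(self, host, text):
        # Posts from hosts the index has never heard of cannot be attributed.
        if self.status.get(host) != "active":
            raise LookupError(f"unknown or withdrawn instance: {host}")
        self.posts.append((host, text))

    def withdraw(self, host):
        # Listed as formerly, but no longer, available.
        self.status[host] = "gone"

    def search(self, term):
        return [(h, t) for h, t in self.posts
                if term.lower() in t.lower() and self.status.get(h) == "active"]


index = CentralIndex()
index.announce("joe.example")
index.push_post("joe.example", "First test post into the void")
hits_before = index.search("test post")
index.withdraw("joe.example")
hits_after = index.search("test post")
```

Here `hits_before` contains Joe's post and `hits_after` is empty. An instance that skips `announce` never shows up in either.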

Now imagine such a search being decentralised, e.g. built into Fediverse server apps like Mastodon or Lemmy. In this case, all server apps would have to know all instances out there with Fediverse-wide search. And immediately so.

Imagine Mastodon had such search built-in. Imagine Alice started up her own personal Mastodon instance with this search at 10:30. Imagine Bob installed his own personal (streams) instance from source at 10:31.

In order for the search on Alice's Mastodon instance to actually cover 100% of the Fediverse, it would require Bob's (streams) instance to push all necessary information to it. In order for this to work, Bob's (streams) instance would have to know of the existence of Alice's Mastodon instance from the moment it's installed.

This couldn't be done via any form of discovery, for where would (streams) go look for search instances?

So an automatically generated list of search instances would be necessary. It would have to be delivered with the code upon installation.

This means that Alice's Mastodon instance would have to add itself to the list of search instances in the streams repository (https://codeberg.org/streams/streams) as a pull request and then immediately merge that PR into both dev and release, the latter past dev, both without Mike Macgirvin's permission, so that Bob's new (streams) instance knows about Alice's less-than-a-minute-old Mastodon instance with search the very moment that Bob installs it, so that Bob's (streams) instance knows that it will have to report everything that happens to it in public to Alice's Mastodon instance with built-in Fediverse search.

Whenever someone spins up a new instance that has Fediverse search built in, this would cause a PR in the code repositories of all Fediverse server applications that adds this instance to the initial list of search instances, and it'd cause that PR to immediately be merged into all active branches with no consent by the maintainers. And each shutdown of an instance with Fediverse search would cause a PR and an automated merge because that instance would have to be removed from the initial list of search instances.
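The timing problem can be shown with a few lines. This is only a sketch under made-up assumptions: the host names are placeholders, and the 09:00 "release cut" is an invented time standing in for whenever the list was last baked into the code. A list shipped with the code can only contain search instances that existed when the release was cut, so Alice's 10:30 instance is missing from what Bob installs at 10:31.

```python
from datetime import time


def bundled_search_list(start_times, release_cut_at):
    """Hosts that can appear in a list shipped with a release cut at that time."""
    return {host for host, started in start_times.items()
            if started <= release_cut_at}


# Hypothetical search-capable instances and when they came online.
start_times = {
    "old-search.example": time(8, 0),
    "alice.example": time(10, 30),   # Alice's Mastodon instance from the example
}

# Assumption: the release Bob installs at 10:31 was cut at 09:00.
bundled = bundled_search_list(start_times, release_cut_at=time(9, 0))
unknown_to_bob = set(start_times) - bundled
```

`unknown_to_bob` ends up as `{"alice.example"}`. The only way to empty it is to cut a new release every time a search instance appears, which is exactly the automated-merge scenario above.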

I guess it should be obvious what an outlandish idea this is.

[–] [email protected] 1 points 1 day ago (1 children)

What if search itself were a federated function? Although I'm a software dev I really don't know much about the mechanics of large-scale search engines such as Google, but I know their server farms somehow share the load of performing searches and maintaining whatever database they maintain to optimize searching. Seems like the fediverse could do search in a similar way. I'm just saying your critique of the idea, although well thought out, seems like a critique of a particular strategy. It's not obvious to me that the very idea of federated search is outlandish.

[–] JupiterRowland 1 points 2 hours ago

Still, the issue would be to find all instances of all Fediverse server applications.

I mean, the idea was to cover the whole Fediverse with that search. Literally everything.

Like, imagine I spin up my own instance of Forte on a home server to try it out and see if it already works.

How's a Fediverse search engine supposed to know about my brand-new Forte instance? Clairvoyance? Hah. A crawler? Yeah, right, as if any crawler out there was fast enough to discover a brand-new instance of something that doesn't have a running instance at all yet. At least not beyond enclosed, experimental instances detached from the rest of the Fediverse.

I mean, instead of Forte, I could also install what Forte was forked from, namely something colloquially referred to as (streams). Something that intentionally doesn't have a name, doesn't have a brand identity, doesn't have a unified server identifier. Unlike Mastodon whose instances all identify as "mastodon" and Lemmy whose instances all identify as "lemmy" and Hubzilla whose instances all identify as "hubzilla", (streams) instances don't all identify the same. That field is customisable. And it has been customised for as long as (streams) has been around. You can't reliably crawl (streams) instances. Instead of "streams", they can identify as "y" (because Y is not X) or "get ready to rumbly" (public instance actually) or "bunny of doom" or "diversi spiritus".

In fact, crawlers would have to be able to identify any kind of Fediverse server software. Even if someone has only just forked something, a crawler would have to be able to recognise it as Fediverse server software. If you hard-code server identifiers into the crawler, it'd be out-of-date as soon as someone decides to fork Mastodon or Misskey or Firefish or Sharkey or whatever again. And, as mentioned above, you can't crawl (streams) instances by identifier.
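A minimal sketch of why the hard-coded approach breaks, assuming the crawler reads a NodeInfo-style document with a `software.name` field from each host (that mechanism is my assumption here, and the host names and sample documents are invented). The allowlist catches the well-known names and silently drops everything else.

```python
# Identifiers a hypothetical crawler has hard-coded.
KNOWN_SOFTWARE = {"mastodon", "lemmy", "hubzilla"}


def recognised(nodeinfo):
    """True if the document's software name is on the hard-coded allowlist."""
    name = nodeinfo.get("software", {}).get("name", "")
    return name.lower() in KNOWN_SOFTWARE


# Invented sample documents, shaped like parsed NodeInfo JSON.
documents = {
    "mastodon.example": {"software": {"name": "mastodon"}},
    "streams.example": {"software": {"name": "y"}},          # customised identifier
    "fork.example": {"software": {"name": "brandnewfork"}},  # fork created yesterday
}

missed = sorted(host for host, doc in documents.items() if not recognised(doc))
```

`missed` contains both `fork.example` and `streams.example`. Adding "brandnewfork" to the allowlist fixes the first until the next fork comes along; nothing fixes the second, because the name is whatever the admin typed in.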

It simply is impossible to discover and index the whole Fediverse by crawling, Google-style. And if a Fediverse search engine can't discover a (streams) instance that identifies as "y", it can't index the content coming from the man who created (streams) and Forte and still occasionally develops both. The man who created the oldest still existing Fediverse project, Friendica, as well as the Swiss Army knife of the Fediverse, Hubzilla, and the very concept of nomadic identity. One of the most competent and experienced Fediverse devs ever. A crawler couldn't find him.

Still, the search engine needs to know all Fediverse instances, right?

Well, if crawling fails, and crawling does fail, there's only one way to achieve that: Each instance would have to announce its presence to anything that's supposed to be able to search the Fediverse.

But in order to be able to do that, each instance would have to know everything that can search the Fediverse. And all instances of it. Every single one of them.

And if it shall announce its existence when it spins up for the first time, it will have to know all these search instances immediately before spinning up.

How can it possibly know them all before even going online itself?

Two options. Either a centralised list of all search instances that's updated as soon as a new one is spun up.

But you said, "federated." As in not centralised.

Or the list would have to be built into the source code as it's being git pulled from the code repository. In fact, the list would have to be git pulled from the code repository immediately before the server spins up so that it's up-to-date when the server spins up. This would mean that the whole server software would have to be updated before start-up.

Of course, each Fediverse server software project that's started from scratch would have to implement this list, otherwise its instances couldn't be found.

But how is this list supposed to be kept up-to-date?

I mean, let's suppose what has been spun up here is something that has Fediverse search built in. It itself would have to be added to this list so that other new instances can announce themselves to this new instance, so that it can find them and index their content.

So how is this new search-equipped instance supposed to be added to the list of search instances?

Shall it add itself to the list by manipulating the production code of all Fediverse server applications that have Fediverse search built in? Past the maintainers and without their consent?

Perfect search that covers 100% of the Fediverse has to rely on lists of some kind, that's clear. The Fediverse changes too quickly to be crawlable. It's too diverse to be crawlable. And it has server software which itself is inherently uncrawlable because it's undiscoverable by design.

But such lists are impossible to keep up-to-date at all times, too.