Curated Tumblr

4952 readers

553 users here now

For preserving the least toxic and most culturally relevant Tumblr heritage posts.

The best transcribed post each week will be pinned and receive a random bitmap of a trophy superimposed with the author's username and a personalized message. Here are some OCR tools to assist you in your endeavors:

web
iOS
FOSS Android Recs per u/[email protected]: 1 , 2

Don't be mean. I promise to do my best to judge that fairly.

founded 2 years ago

MODERATORS

Apytele

[email protected]

358

Take that, atheists (files.catbox.moe)

submitted 6 months ago by [email protected] to c/curatedtumblr

71 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 6 points 6 months ago (1 children)

The genomes have likely been indexed to make finding results faster. Google doesn't search the entire internet when you make a query :P

[–] [email protected] 3 points 6 months ago* (last edited 6 months ago) (2 children)

I know that similar computational problems use indexing and vector-space representation but how would you build an index of TiBs of almost-random data that makes it faster to find the strictly closest match of an arbitrarily long sequence? I can think of some heuristics, such as bitmapping every occurrence of any 8-pair sequence across each kibibit in the list. A query search would then add the bitmaps of all 8-pair sequences within the query including ones with up to 2 errors, and using the resulting map to find "hotspots" to be checked with brute force. This will decrease the computation and storage access per query but drastically increase the storage size, which is already hard to manage.

However, efficient fuzzy string matching in giant datasets is an interesting problem that computer scientists must have encountered before. Can you find a good paper that works well with random, non-delimited data instead of just using the approach of word-based indices for human languages like Lucene and OpenFTS?

[–] [email protected] 5 points 6 months ago* (last edited 6 months ago)

As per my other post, this person isn't doing any of that.

But, since you asked for papers on generic matching algorithms, I found this during the silent conniption fit you sent me into after suggesting that some random tumblr user plugged a tumblr bot directly into a state of the art genomics db.

https://link.springer.com/article/10.1007/s11227-022-04673-3

Please note that while, yes, they ran this test on a standard office computer, they were only searching against 12 million characters.

A single tebibyte of characters would be more like 1 trillion characters. A pebibyte would be more like 1 ~~quintillion~~ quadrillion.

... much, much, much longer processing times.

Edit: Used the wrong word for stupendously large numbers that start with q.

[–] [email protected] 1 points 6 months ago

Yeah good point, not a trivial undertaking. I'm not an expert in that area but maybe elasticsearch or similar technology is able to find matches. Although I have no idea how that works under the hood