this post was submitted on 25 Oct 2023

110 points (97.4% liked)

why do & ampersands never display properly in titles? (self.asklemmy)

submitted 1 year ago* (last edited 1 year ago) by bernieecclestoned to c/[email protected]

but work in body text &

[–] [email protected] 46 points 1 year ago* (last edited 1 year ago) (6 children)

The API sanitizes them, so they're stored encoded (&amp;) in the database.

Some frontends correct for this when posts are rendered, some don't. Voyager and Tesseract, at least, seem to correct them. Not sure about others.
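A minimal sketch of that behaviour, using Python's standard `html` module as a stand-in for whatever the Lemmy backend and the apps actually do. The function names here are hypothetical, not Lemmy's API:

```python
import html


def store_title(raw_title: str) -> str:
    """Hypothetical server side: entity-encode the title before storing it."""
    return html.escape(raw_title)


def render_title_naive(stored: str) -> str:
    """A frontend that shows the stored string exactly as received."""
    return stored


def render_title_corrected(stored: str) -> str:
    """A frontend that decodes entities before showing plain text."""
    return html.unescape(stored)


stored = store_title("why do & ampersands break?")
print(render_title_naive(stored))      # why do &amp; ampersands break?
print(render_title_corrected(stored))  # why do & ampersands break?
```

The naive frontend is the one where titles show a literal `&amp;`; the corrected one undoes the server-side encoding before display.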

[–] [email protected] 15 points 1 year ago

Working fine on Sync.

[–] [email protected] 10 points 1 year ago (1 children)

That's the problem, then. You shouldn't store entities in the db. The table is likely already utf8, which supports all characters.

[–] [email protected] 7 points 1 year ago (3 children)

I think 0.19 is reverting that behaviour, because it was indeed a certified bad idea.

I think the idea was to bulletproof potentially crappy clients, especially after the XSS incident, but the problem is that the content isn't always rendered in a web context, which makes the processing kind of a pain.

Wouldn't surprise me if it becomes double and triple encoded too at times because of the federation. Do you encode again, or trust that the remote already sent you entity-encoded data?

The best format is the original format, transformed as late as possible, ideally in clients where there's awareness of which characters are special. That awareness exists in a web client, not so much in an Android or terminal app.

I don't think the Lemmy devs are particularly experienced web developers in general. There have been a fair number of dubious API design decisions, like passing auth as a GET parameter... Thankfully they also fixed that one in 0.19.

[–] [email protected] 1 points 1 year ago (2 children)

What exactly makes storing it encoded a bad idea? A waste of space perhaps.

[–] [email protected] 3 points 1 year ago (1 children)

Because then you need to take care everywhere to decode it as needed and also make sure you never double-encode it.

For example, do other servers receive it pre-encoded? What if the remote instance doesn't do that, how do you ensure what other instances send you is already encoded correctly? Do you just encode whatever you receive, at risk of double encoding it? And generally, what about use cases where you don't need it, like mobile apps?

Data should be transformed where it needs it, otherwise you always add risks of messing it up, which is exactly what we're seeing. That encoding is reversible, but then it's hard to know how many times it may have been encoded. For example, if I type &amp; which is already an entity, do you detect that and decode it even though I never intended to because I'm posting an HTML snippet?

Right now it's so broken that if you edit a post, you get an editor... with escaped HTML entities. What happens if you save your post after that? It's double encoded! Now everyone and every app has to make sure to decode HTML entities and it leads to more bugs.

There is exactly one place where it needs to be encoded, and that's in web clients, more precisely, when it's being displayed as HTML. That's where it should be encoded. Mobile apps don't care; they don't even render HTML to begin with. Bots and most things using the API don't care. They shouldn't have to care just because it may be rendered as HTML somewhere. It just creates more bugs and more work for pretty much everyone involved. It sucks.

Now we have an even worse problem: we don't know which post is encoded which way, so once 0.19 rolls out and there are version mismatches it's going to be a shitshow and may very well lead to another XSS incident.
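To make the double-encoding point concrete, here is a small sketch with Python's standard `html` module. It only models the encoding steps described above (save, reopen in the editor, save again); it is not taken from Lemmy's code:

```python
import html

original = "Tom & Jerry"

# First save: the server entity-encodes the text.
stored = html.escape(original)    # "Tom &amp; Jerry"

# The editor shows the stored (already encoded) text and the user
# saves again without changing anything: it gets encoded a second time.
resaved = html.escape(stored)     # "Tom &amp;amp; Jerry"

# Decoding once no longer gets back to what the user typed.
print(html.unescape(resaved))     # Tom &amp; Jerry

# The ambiguity: a double-encoded "&" is byte-for-byte identical to a
# single-encoded literal "&amp;" that someone typed on purpose.
double_encoded_amp = html.escape(html.escape("&"))
single_encoded_literal = html.escape("&amp;")
print(double_encoded_amp == single_encoded_literal)  # True
```

Because both strings end up as `&amp;amp;`, no decoder can tell from the stored value alone how many times it should decode.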

[–] [email protected] 0 points 1 year ago (1 children)

That's a problem of not conforming to any standard. Not with it being a bad idea in general, like say storing passwords in plaintext is.

[–] [email protected] 1 points 1 year ago (1 children)

It still leads to unsolvable problems like: what is expected when two instances federate content with each other? What if you use a web app to access a third-party instance and it spits out unsanitized data?

If you assume it's part of the API contract, then an evil instance can send you unescaped content and you've got an exploit. If you escape it, you'll double-escape content from well-behaved instances. This applies to apps too: what if Voyager, for example, starts expecting pre-sanitized data from the API and then makes an API call to an evil instance that doesn't sanitize? Bam, you've got yourself potential XSS. There's nothing they can do to prevent it. Either it's inherently unsafe, or it's safe but will double-escape.

You end up making more vulnerabilities through edge cases than you solve by doing that. Now all an attacker needs to do is find a way to trick you into thinking they have sanitized data when it's not.

The only safe transport for user data is raw. You can never assume any user/remote input is pre-sanitized. Apps, even web ones, shouldn't assume the data is sanitized; they should sanitize it themselves, because only then can you guarantee that it will come out correctly and safely.

This would only work if you own both the server and the UI that serves it. It immediately falls apart when you don't control the entire pipeline from submission to display, and on the fediverse with third party clients and apps and instances, you inherently can't trust anything.
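The dilemma can be sketched in a few lines of Python. Both render functions are hypothetical stand-ins for a web client, and `html.escape` stands in for whatever sanitizer a real client would use:

```python
import html


def render_trusting(pre_sanitized: str) -> str:
    """Client that assumes the API already escaped the text (unsafe)."""
    return f"<h1>{pre_sanitized}</h1>"


def render_escaping(untrusted: str) -> str:
    """Client that escapes at the last moment, where text becomes HTML."""
    return f"<h1>{html.escape(untrusted)}</h1>"


evil = "<script>alert(1)</script>"   # sent by an instance that never escapes
polite = html.escape("Q&A")          # sent by an instance that pre-escapes

# Trusting client + evil instance: the script tag reaches the page.
print(render_trusting(evil))     # <h1><script>alert(1)</script></h1>

# Escaping client + evil instance: safe.
print(render_escaping(evil))     # <h1>&lt;script&gt;alert(1)&lt;/script&gt;</h1>

# Escaping client + pre-escaping instance: safe, but double-escaped.
print(render_escaping(polite))   # <h1>Q&amp;amp;A</h1>
```

If every instance transports raw text instead, `render_escaping` is both safe and correct, which is the argument being made above.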

[–] [email protected] 0 points 1 year ago

Sorry for the late reply, but the point is that there is no trivial way to detect whether and how many times something has been encoded. You may end up with multiple levels of encoding in multiple systems and everything becomes intractable. Moreover, as I said, this doesn't have to be a problem, as you can just decode everything as much as you can BEFORE you put it in the db, as the db can handle all of that by itself. Just let it do its job. Paradoxically, if you use only channels that support utf8 and don't apply any transformation, your data is already perfect as it is. Then it is the job of the client to do what it needs to be able to render properly, but for instance a non-HTML client shouldn't need to use HTML libraries to strip HTML stuff from the text before it can be displayed.

[–] [email protected] 0 points 1 year ago

Sorry for the late reply, it's been a week... but yeah, passing creds in a GET parameter is very bad for multiple reasons. For instance, if you pass the creds on a page that contains ads or trackers, they are probably going to store the URL AND your credentials and propagate them to a million third-party systems. That's. Not. Good.
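A rough illustration of the difference, using only Python's standard library. The host, path, and token are made up for the example, and a bearer `Authorization` header is assumed as the alternative; nothing is actually sent over the network:

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://lemmy.example/api/posts"  # hypothetical endpoint
TOKEN = "secret-jwt"                      # placeholder credential

# Credential as a GET parameter: it becomes part of the URL, so it can end
# up in server logs, browser history, and anything else that records URLs.
leaky_url = f"{BASE}?{urlencode({'auth': TOKEN, 'limit': 10})}"
print(leaky_url)  # https://lemmy.example/api/posts?auth=secret-jwt&limit=10

# Credential in a header: the URL carries no secret.
safer = Request(
    f"{BASE}?{urlencode({'limit': 10})}",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(safer.full_url)  # https://lemmy.example/api/posts?limit=10
```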

[–] bernieecclestoned 5 points 1 year ago

Thanks

[–] [email protected] 3 points 1 year ago

Works fine in connect

[–] [email protected] 1 points 1 year ago

Does not work on MacOS Firefox.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

imma quicky test on thunder

edit: displays &

[–] [email protected] 35 points 1 year ago (2 children)

There was some scare in lemmy development circles recently about script injection vulnerabilities. The various apps and frontend developers "solved" the problem by peppering untrusted user input with escape sequences all over the place. User submits post? Escape title! Receive new post from a federated instance? Escape title!

Obviously if you escape the title twice and display once, it will show up weird. The problem is that the various devs haven't agreed yet which parts of the messaging protocol are supposed to be already escaped and which are not. Ideally all user input should be stored and transmitted in raw form, and only escaped right before displaying. But due to various zealously-cautious devs we get this instead:

[–] [email protected] 11 points 1 year ago (2 children)

There's a difference between cautious and incorrect. It's broken. If they're that concerned, where are the unit tests?

[–] [email protected] 5 points 1 year ago

They incorrectly broke it because they were overzealous.

[–] [email protected] 4 points 1 year ago

Content showing up weird in federation sounds like a good use of integration tests to me
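A toy version of such a test, in Python. The pipeline is a deliberately simplified model (each instance applies an `ingest` step, then a web client escapes once and the browser decodes once); all names are hypothetical and none of this is Lemmy's actual test suite:

```python
import html
from typing import Callable


def displayed_text(title: str, ingest: Callable[[str], str], hops: int) -> str:
    """Text a reader sees after `hops` instances ingest the title."""
    stored = title
    for _ in range(hops):
        stored = ingest(stored)
    rendered = html.escape(stored)   # web client escapes once for HTML
    return html.unescape(rendered)   # browser decodes once for display


def survives(title: str, ingest: Callable[[str], str], hops: int = 3) -> bool:
    """True if the reader sees exactly what the author typed."""
    return displayed_text(title, ingest, hops) == title


def keep_raw(text: str) -> str:
    """Instance that stores and forwards text untouched."""
    return text


escape_on_ingest = html.escape       # instance that escapes on every hop

print(displayed_text("R&D", escape_on_ingest, 1))  # R&amp;D
print(survives("R&D", keep_raw))                   # True
print(survives("R&D", escape_on_ingest))           # False
print(survives("plain", escape_on_ingest))         # True
```

The last line is why the test data matters: a title with no special characters passes either way, so a test that only uses plain titles would never catch the bug.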

[–] [email protected] 5 points 1 year ago

This was a really informative comment, thanks!

[–] [email protected] 15 points 1 year ago (1 children)

For me (using Sync), it shows a different font for each &

[–] [email protected] 2 points 1 year ago

That has nothing to do with the ampersand, it's just that post titles and bodies in general have different fonts. It's just easier to notice in the ampersand since it's so different between the fonts.

[–] Zeppo 10 points 1 year ago (1 children)

They show as &amp; on the mobile web interface for various instances. I would say it’s something improperly done with what are called HTML entities. HTML entities are a way of encoding various elements that have meaning in HTML so they can be displayed without being interpreted as HTML by the browser, which could not only break a layout but also have security implications. So the titles are sanitized to prevent injection attacks but somehow are not stored/output in a way that lets them display properly.

[–] bernieecclestoned 1 points 1 year ago

Thanks for the explanation

[–] [email protected] 9 points 1 year ago* (last edited 1 year ago) (2 children)

Looks fine to me. It works.

Using Voyager on Android

[–] bernieecclestoned 9 points 1 year ago* (last edited 1 year ago) (2 children)

Thanks, I'm on Jerboa Android & they show as & amp;

[–] [email protected] 1 points 1 year ago

Same on Boost

[–] [email protected] 1 points 1 year ago

Works on Sync.

[–] [email protected] 3 points 1 year ago (1 children)

Fine for me too with Eternity

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

I have Eternity on Android and it says "& ampersands". Broken on boost.

[–] [email protected] 7 points 1 year ago (2 children)

I believe it's been fixed for the next version of Lemmy. But for now, small ampersand (U+FE60) works as a substitute: ﹠
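A quick check of why that substitute slips through, sketched with Python's standard library (using `html.escape` as a stand-in for the server-side sanitizer, which is an assumption):

```python
import html
import unicodedata

small_amp = "\uFE60"  # the substitute character suggested above

print(unicodedata.name(small_amp))            # SMALL AMPERSAND

# It is a different code point from "&", so entity-encoding leaves it alone.
print(html.escape(small_amp) == small_amp)    # True

# Compatibility normalization folds it back to a regular ampersand.
print(unicodedata.normalize("NFKC", small_amp) == "&")  # True
```

Anything that applies NFKC normalization to titles would turn it back into a plain `&`, so the workaround depends on the text being passed through untouched.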

[–] [email protected] 2 points 1 year ago

Small ampersand I love it

[–] [email protected] 1 points 1 year ago

Also I wonder if your username and my own, (head tilt) may share the same meaning

[–] [email protected] 3 points 1 year ago

On Eternity we have the opposite (work in title but not body), but when I click into the post it looks fine.

[–] [email protected] 2 points 1 year ago

They're also broken in code blocks