this post was submitted on 02 Sep 2024

Rust

So I'm trying to parse my school's website for some info, pulling out a few values with XPath. I found an HTML5 parser, and it can't properly parse the first line. Then I figured out it's actually XHTML and not HTML. After a quick Google search I found out XHTML can be properly parsed using any XML parser, so I found one and... it can't parse the first line either. So I asked Llama 3.1 (like a real programmer) why I can't parse the first line with any parser. It explained so nicely that I did not destroy my keyboard when I was told that this document is "XHTML 1.0 Transitional", that it's a mix of HTML 4 and XHTML, and that it can't be parsed with either an HTML or an XML parser. I hate the guy that invented that so much...

So far I can't find a crate that parses XHTML 1.0 Transitional. Is there one? Or a crate to convert XHTML to something else? Any advice?

[–] [email protected] 1 points 2 months ago* (last edited 2 months ago)

I wish I could be more help. My advice is that you need a more robust general-purpose HTML parsing library, possibly even a browser emulator, rather than a lib specifically for XHTML 1.0 Transitional or a converter.

In my Python web automation course in college we used BeautifulSoup and I think maybe mechanize. I think either of those would probably be robust enough to do what you’re trying to do, but if it has to be Rust I’m not sure what’s out there. Otherwise you could upgrade to Selenium or something.

Or, if you’re trying to do something fairly simple where you don’t need to parse the whole document, but it’s still a little too complex for plain old regular expressions, you might be able to build a small parser with the Rust pest crate. Of course, I would absolutely not recommend trying to build your own full-featured XHTML parser.
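
To give a rough idea of what "fairly simple" could look like, here is a minimal sketch that uses only the Rust standard library (not pest) to pull the text out of the first occurrence of a given element. The function name `first_element_text` and the sample page are made up for illustration. It assumes lowercase tag names, no nested elements of the same name, no self-closing form of the tag, and no `>` inside attribute values, so it is nowhere near a real XHTML parser.

```rust
/// Hypothetical helper: returns the trimmed text between the first
/// `<tag ...>` and the next `</tag>`. Plain substring search only; it does
/// not understand nesting, comments, CDATA, or self-closing tags.
fn first_element_text<'a>(doc: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut search_from = 0;
    loop {
        // Locate the next candidate opening tag; give up if there is none.
        let start = search_from + doc[search_from..].find(&open)?;
        let after_name = start + open.len();
        // Check that the whole tag name matched, e.g. `<td` and not `<tdata`.
        match doc[after_name..].chars().next()? {
            '>' | ' ' | '\t' | '\n' | '\r' => {
                // Skip any attributes up to the end of the opening tag.
                let content_start = after_name + doc[after_name..].find('>')? + 1;
                let content_end = content_start + doc[content_start..].find(&close)?;
                return Some(doc[content_start..content_end].trim());
            }
            // Some other, longer tag name: keep searching after it.
            _ => search_from = after_name,
        }
    }
}

fn main() {
    // Made-up sample page with an XHTML 1.0 Transitional prolog.
    let page = r#"<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Timetable</title></head>
<body><table><tr><td class="room"> B204 </td></tr></table></body>
</html>"#;

    println!("{:?}", first_element_text(page, "title")); // Some("Timetable")
    println!("{:?}", first_element_text(page, "td")); // Some("B204")
}
```

Because it never looks at the XML declaration or the DOCTYPE, the first line that trips up the parsers doesn't matter here. The trade-off is that anything beyond grabbing a couple of known fields gets fragile fast, which is where a grammar-based approach or a proper parser starts to pay off.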