XHTML 1.0 Transitional parser?

@[email protected] · 4 months ago

XHTML 1.0 Transitional parser?

@gsfraley · 4 months ago

I would try another HTML5 parser. HTML5 is somewhat of a unification of HTML and XHTML, so getting into the syntax specifics between the two with XML parsing is probably going to be an uphill battle. That said, I’m curious what the first line is; the document could just be malformed entirely.

@[email protected] · 4 months ago

That’s the first line:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

I thought it was HTML because everything on the web is HTML. Then, because of the first line, I figured out it was XHTML, which should be parsed with an XML parser. What I did not know is that Transitional is a mix that can’t be parsed with anything.

@gsfraley · 4 months ago

Hmm, doctype declarations are sort of like the markup equivalent of headers. Usually parsers read them to know what flavor to expect and then go parse the rest of the page separately. You shouldn’t have to do this, but if you chop off that first line and run it through a standard HTML parser it might work fine.
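To illustrate the point about doctypes being read separately, here is a minimal sketch using Python's standard-library `html.parser` (not the OP's parser, and not Rust). The class name `DoctypeAwareParser` and the sample markup are made up for the example; only the doctype string comes from this thread. A tolerant HTML parser like this one reports the doctype as a declaration and carries on with the rest of the page, so there is no need to chop off the first line.

```python
from html.parser import HTMLParser


class DoctypeAwareParser(HTMLParser):
    """Hypothetical helper: records the doctype and every start tag seen."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.tags = []

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text between "<!" and ">".
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br />.
        self.tags.append(tag)


# Assumed sample page; only the doctype is taken from the thread.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    "<html><body><p>Hello<br />world</p></body></html>"
)

parser = DoctypeAwareParser()
parser.feed(SAMPLE)
parser.close()

print(parser.doctype)
print(parser.tags)
```

If the real page still fails with a parser of this kind, the doctype is probably not the culprit and the problem is further down in the markup.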

@[email protected] · edit-2 4 months ago

That’s the first thing I tried, and it still fails somewhere deep in the HTML, where I probably shouldn’t skip a line.

@[email protected] · 4 months ago

Have you tried some tag soup parser? That should work as a last resort even if the ones building a tree structure don’t.
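As a rough sketch of what "tag soup" parsing looks like, here is an event-based approach with Python's standard-library `html.parser`, which emits tags as it sees them instead of insisting on a well-formed tree. The `LinkCollector` class, the `extract_links` helper, and the broken sample markup are all hypothetical, chosen only to show that unclosed and mismatched tags do not stop the parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(markup):
    """Feed markup to the event-based parser and return the hrefs found."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Deliberately broken sample: unclosed <p>, stray <td>, unquoted attribute,
# and closing tags that match nothing.
BROKEN = (
    "<p>Unclosed paragraph"
    "<a href=one.html>one</a>"
    '<td><a href="two.html">two</b></div>'
)

print(extract_links(BROKEN))
```

The trade-off is that you get a stream of events rather than a tree, so any nesting logic you need has to be tracked by hand.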

@calcopiritus · 4 months ago

HTML is hard to parse because it allows mistakes.

I don’t know the answer to your question. But if it were me, I’d run the HTML parser until it encounters an error, manually fix the error, then try to parse again, repeating until it parses correctly.
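One way to make that fix-and-retry loop less painful is to have the parser tell you where it gave up. The sketch below uses Python's standard-library `xml.etree.ElementTree`, whose `ParseError` carries a `position` attribute with the line and column. The function name `first_xml_error` and both sample documents are assumptions for illustration, not the OP's data.

```python
import xml.etree.ElementTree as ET


def first_xml_error(markup):
    """Return (line, column, message) for the first XML error, or None."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Assumed sample: <br> is valid HTML but not well-formed XML.
BAD = "<html>\n<body>\n<p>one<br>two</p>\n</body>\n</html>"
# The same document after the manual fix.
GOOD = "<html>\n<body>\n<p>one<br />two</p>\n</body>\n</html>"

print(first_xml_error(BAD))
print(first_xml_error(GOOD))
```

Each run only reports the first problem, so on a messy page this can take many rounds, which is the main cost of the approach.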

@[email protected] · edit-2 4 months ago

I wish I could be more help. My advice is you need a better grade of general purpose HTML parsing library, possibly even a browser emulator, rather than a lib specifically for XHTML 1.0 transitional or a converter.

In my Python web automation course in college we used BeautifulSoup and I think maybe mechanize. I think either of those would probably be robust enough to do what you’re trying to do, but if it has to be Rust I’m not sure what’s out there. Otherwise you could upgrade to Selenium or something.

Or, if you’re trying to do something fairly simple and you don’t need to parse the whole thing, but it’s still a little too complex for plain old regular expressions, you might be able to build a simple parser with the Rust pest crate. Of course, I would absolutely not recommend trying to build your own full-featured XHTML parser.