• @[email protected]
    37 hours ago

    I love that PDFs are so difficult to transform into HTML, too

    FYI, if that’s relevant to your field, every new article published on arxiv.org now has an HTML rendering as well.

    And for many older publications, changing “arxiv.org” to “ar5iv.org” in the URL leads to an HTML rendering, a best-effort experiment they ran for a while.
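If you want to do that swap in bulk, here is a rough sketch in Python. The function name `to_ar5iv` and the example paper ID are just for illustration, and it assumes the only change needed is the hostname, as described above. Whether a given paper actually has a rendering on the other side is not something this checks.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Swap the arxiv.org host for ar5iv.org, leaving the rest of the URL alone.

    Hypothetical helper: it only rewrites the hostname and does not
    verify that an HTML rendering exists for the paper.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {url}")
    # Keep scheme, path, query and fragment; replace only the host.
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


# Example paper ID chosen for illustration only.
print(to_ar5iv("https://arxiv.org/abs/1706.03762"))
# prints https://ar5iv.org/abs/1706.03762
```

Parsing the URL instead of doing a plain string replace avoids touching an "arxiv.org" that happens to appear in the path or query.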

    • JackbyDev
      26 hours ago

      That’s really cool! What I’d really like is a tool that converts PDFs to semantic HTML files. I took a peek there and it seems easier for them because they have the original LaTeX source.

      I think for arbitrary PDF files the information just isn’t there. I’ve looked into it a bit and the results are sort of all over the place. A tool called pdf2htmlEX is pretty good, but it makes the HTML look exactly like the PDF.

      • @[email protected]
        25 hours ago

        Yes, PDFs are much more permissive and may not have any semantic information at all. Hell, some old publications are just scanned images!

        PDF -> semantic seems to be a hard problem that basically requires OCR, like these people are doing

        • JackbyDev
          11 hours ago

          Oh nice, thanks for sharing that project. I haven’t heard of it before!