Brin (brin_bellway) wrote 2021-07-22 04:45 pm (UTC)

A bit of poking at search engines turns up nothing. A bit of poking at Reddit turns up a 16-day-old post with no responses and a 4-month-old post complaining that--among other things--wget has never been any good at downloading embedded images with off-site hosting. Depending on how strict its definition of "off-site" is, that might explain the problem. In any case, if true, it suggests an upper bound on how useful a solution can be.
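To make the "how strict" question concrete, here is a minimal Python sketch of two possible definitions of "off-site". It is not wget's actual logic, which isn't established here. The function name `is_offsite`, the `strict` flag, and the example.org URLs are all made up for illustration, and it assumes both URLs are absolute.

```python
from urllib.parse import urlsplit


def is_offsite(page_url, resource_url, strict=True):
    """Guess whether resource_url is hosted "off-site" relative to page_url.

    Hypothetical helper for illustration only; not how wget decides.
    """
    page_host = (urlsplit(page_url).hostname or "").lower()
    res_host = (urlsplit(resource_url).hostname or "").lower()
    if strict:
        # Strict: any difference in hostname counts as off-site,
        # so an image subdomain of the same site is already "off-site".
        return page_host != res_host
    # Loose: compare only the last two labels of each hostname.
    # Naive on purpose; it gives wrong answers for suffixes like co.uk.
    return page_host.split(".")[-2:] != res_host.split(".")[-2:]


page = "https://blog.example.org/post"
image = "https://img.example.org/a.png"
print(is_offsite(page, image, strict=True))
print(is_offsite(page, image, strict=False))
```

Under the strict definition, images served from a separate subdomain of the same site would already count as off-site, which is the case where the complaint would bite hardest.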

Found a basic WARC-to-HTML converter, with a reasonably reputable creator. Can't find a way to get it to convert the links and embed the page requisites, but it means I can store things in WARC format and produce Recoll-indexable versions locally as needed.

...it looks like it may be possible to run warcat *through Termux*, creating crude-but-not-useless(-and-no-worse-than-current-wget-outputs) mobile-readable versions locally as needed. BRB.

[...]

Okay, it's expensive if you weren't otherwise going to have Python installed on Termux (warcat itself seems to be tiny, but Python consumes 327 MB of storage space), but I *did* manage to extract HTML files from the WARC scrape of my Dreamwidth using only my phone. (Well, I looked at stuff on the Termux wiki on my laptop, but I *could* have done that part on my phone too if necessary.) For some reason Material Files isn't offering me the option of opening HTML files in my browser, but I can get raw HTML directly through Material Files or open a somewhat-less-raw version in my ebook reader. The formatting sucks, and I hope I am never desperate enough to need to resort to this sort of thing, but the option *is* there.
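For anyone curious what "extracting HTML from a WARC" involves, below is a rough sketch using only the Python standard library. It is not warcat and not how warcat works internally. It is a simplified reader that assumes well-formed records with a Content-Length header, and HTTP responses that are neither chunked nor compressed. The names `open_warc`, `iter_warc_records`, and `extract_html` are hypothetical.

```python
import gzip


def open_warc(path):
    """Open a .warc or .warc.gz file for binary reading."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def iter_warc_records(stream):
    """Yield (headers, block) for each record in a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the record header
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record block is exactly Content-Length bytes long.
        yield headers, stream.read(int(headers["content-length"]))


def extract_html(stream):
    """Return {target URI: HTML bytes} for every text/html response record."""
    pages = {}
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # A response block is a raw HTTP response: headers, blank line, body.
        head, sep, body = block.partition(b"\r\n\r\n")
        if not sep:
            continue
        for http_line in head.split(b"\r\n")[1:]:  # skip the status line
            name, _, value = http_line.partition(b":")
            if (name.strip().lower() == b"content-type"
                    and value.strip().lower().startswith(b"text/html")):
                pages[headers.get("warc-target-uri", "")] = body
                break
    return pages
```

Usage would look something like `pages = extract_html(open_warc("scrape.warc.gz"))`, where the filename is a placeholder. Like the output described above, the result is raw HTML with unconverted links and no embedded page requisites.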
