Brin (brin_bellway) wrote 2021-07-21 12:15 pm
(no subject)
[arguably cw: amnesia]
(part 1)
---
Older scrapes seem to have their requisites intact. One from August 2019 is intact; so is one from March 2020; one from September 2020 is not.
The obvious thing that changed between March 2020 and September 2020 is that I changed laptops. Something is wrong with this laptop's setup. I'm not sure how to tell what.
---
I went and tried it in a Lubuntu 18.04 virtual machine (which comes with wget 1.19.4 rather than 1.20.3), and that one got the stylesheet but not the images.
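One way to narrow down which requisites a given scrape is missing is to walk each saved page and check whether the stylesheets, images, and scripts it references actually exist on disk. The following is a minimal sketch, assuming the scrape is a directory of HTML files whose links were converted to relative local paths; `RequisiteCollector` and `check_requisites` are hypothetical names made up for this illustration, not part of wget.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse, unquote


class RequisiteCollector(HTMLParser):
    """Collect URLs of stylesheets, images, and scripts referenced by a page."""

    def __init__(self):
        super().__init__()
        self.requisites = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            url = attrs.get("href")
        elif tag in ("img", "script"):
            url = attrs.get("src")
        else:
            url = None
        if url:
            self.requisites.append(url)


def check_requisites(html_path):
    """Return (present, missing, remote) URL lists for one saved HTML file."""
    html_path = Path(html_path)
    collector = RequisiteCollector()
    collector.feed(html_path.read_text(encoding="utf-8", errors="replace"))

    present, missing, remote = [], [], []
    for url in collector.requisites:
        parsed = urlparse(url)
        if parsed.scheme == "data":
            continue  # inline data: URIs have nothing to download
        if parsed.scheme or parsed.netloc:
            remote.append(url)  # still points at the network, not a local copy
            continue
        # Assumption: relative links resolve against the page's own directory.
        local = html_path.parent / unquote(parsed.path)
        (present if local.is_file() else missing).append(url)
    return present, missing, remote
```

Running `check_requisites` over the same page from an intact scrape and a broken one, then comparing the `missing` and `remote` lists, would at least show whether the files were never fetched or were fetched but never linked.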
no subject
Found a basic WARC-to-HTML converter, with a reasonably reputable creator. Can't find a way to get it to convert the links and embed the page requisites, but it means I can store things in WARC format and produce Recoll-indexable versions locally as needed.
...it looks like it may be possible to run warcat *through Termux*, creating crude-but-not-useless(-and-no-worse-than-current-wget-outputs) mobile-readable versions locally as needed. BRB.
[...]
Okay, it's expensive if you weren't otherwise going to have Python installed on Termux (warcat itself seems to be tiny, but Python consumes 327 MB of storage space), but I *did* manage to extract HTML files from the WARC scrape of my Dreamwidth using only my phone. (Well, I looked at stuff on the Termux wiki on my laptop, but I *could* have done that part on my phone too if necessary.) For some reason Material Files isn't offering me the option of opening HTML files in my browser, but I can get raw HTML directly through Material Files or open a somewhat-less-raw version in my ebook reader. The formatting sucks, and I hope I am never desperate enough to need to resort to this sort of thing, but the option *is* there.
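For a sense of what pulling HTML out of a WARC file involves, here is a rough standard-library sketch. It is not warcat and does not reproduce how warcat works; it assumes a plain or gzipped WARC file, keeps only `response` records whose HTTP headers declare `text/html`, and ignores chunked or compressed HTTP bodies. The function names and the `page_0000.html` naming scheme are made up for the example.

```python
import gzip
from pathlib import Path


def read_warc_records(stream):
    """Yield (headers, block) for each record in a decompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record block is exactly Content-Length bytes long.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def extract_html(warc_path, out_dir):
    """Write each HTML response body to out_dir; return (uri, path) pairs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if str(warc_path).endswith(".gz") else open
    written = []
    with opener(warc_path, "rb") as stream:
        for headers, block in read_warc_records(stream):
            if headers.get("warc-type") != "response":
                continue
            # A response block is the raw HTTP response: headers, blank line, body.
            http_head, sep, body = block.partition(b"\r\n\r\n")
            if not sep or b"content-type: text/html" not in http_head.lower():
                continue
            target = out_dir / f"page_{len(written):04d}.html"
            target.write_bytes(body)
            written.append((headers.get("warc-target-uri"), target))
    return written
```

Like the converter described above, this leaves links unconverted and does not embed page requisites, so the output is raw pages only.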
no subject
Why are you trying to do the scraping on mobile only?
no subject
The download was under 100 MB, but it takes up more space after installing. Apparently Python was also the reason BorgBackup was so expensive, and if I'm going to have Python anyway the marginal space cost of BorgBackup is only about 7 MB. I might tinker around with that.
---
>>Why are you trying to do the scraping on mobile only?
I'm not trying to *do* the scraping on mobile, but in a pinch it would be nice to have the option of *accessing* scrape results on mobile. One of the main downsides of the WARC file format as things stand is that nobody's written a WARC-reader app for Andr...I wonder if any of *those* can be run through Termux. Might only be the raw-data ones, since those are the only ones that can output directly to a command line. What can Termux do regarding pulling up GUIs?
...a lot, potentially. Huh. I will have to look into that. I haven't done much with Termux, but it seems to be a very powerful tool once you've started to wrap your head around it.
Anyway, my point was that I like to have the *option* of reading my files on Android even if I might never make use of that option, and I would feel more comfortable about switching to grab-site as my primary web scraper if I had a way to read its output on Android.
Back when my handheld computer was a Sansa, I kept a file backup on it (and successfully recovered from an abrupt laptop loss this way), but for most types of files it was purely storage: you couldn't *do* anything without another computer to hook it up to, and for a while there in my early-mid teens my laptop was broken and I was writing diary entries by borrowing other people's computers and hooking my Sansa up to them to access the diary file. One of the selling points of upgrading to a smartphone, for me, was a self-accessing backup drive.
no subject