S3C

Yeah, it is their own archival tool, but my server takes a hit for each page they crawl. I don't know if there's some kind of rate-limiting built in or if my server just gets overloaded with too many requests and starts giving those 503s; it is a shared server after all. Ran it again yesterday and it just isn't working for me: 503s all the way through on new captures.
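For what it's worth, if the 503s really are an overload thing, the usual workaround is to space the requests out and back off whenever a 503 comes back. A minimal sketch of that idea, not how their tool actually works: `capture_all` and `submit` are hypothetical names, and `submit` stands in for whatever sends one capture request and returns the HTTP status code.

```python
import time


def capture_all(urls, submit, delay=5.0, max_retries=4, sleep=time.sleep):
    """Submit each URL in turn, pausing between requests and backing off on 503.

    `submit` is a hypothetical callable: it takes a URL and returns an
    HTTP status code. Returns the URLs that still got 503 after every retry.
    """
    failed = []
    for url in urls:
        wait = delay
        for _ in range(max_retries):
            status = submit(url)
            if status != 503:
                # Anything other than 503 is treated as "done" in this sketch.
                break
            sleep(wait)   # server is struggling: wait before trying again
            wait *= 2     # exponential backoff: double the wait each time
        else:
            # Every attempt returned 503; give up on this URL for now.
            failed.append(url)
        sleep(delay)      # fixed pause between URLs to keep the load down
    return failed
```

The list it returns can be fed straight back in later, so only the captures that failed get retried.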

The Google Sheet I'm running at the moment is an index of all my posts that haven't been indexed before, ca. 8,000 lines of:

https://cyberd.org/a-small-haiku.html
https://cyberd.org/a-little-haiku.html
https://cyberd.org/hardcore-henry-2015.html
https://cyberd.org/cd2k16.html
https://cyberd.org/bullet-to-the-head-2012.html
https://cyberd.org/the-tournament-2009.html
https://cyberd.org/the-island-2005.html
https://cyberd.org/jason-bourne-2016.html
https://cyberd.org/teenage-mutant-ninja-turtles-2-2016.html
https://cyberd.org/something-there-for-you.html
https://cyberd.org/week-36-37-summer-recap.html
https://cyberd.org/musicalish-128.html
https://cyberd.org/musicalish-127.html
https://cyberd.org/musicalish-126.html
https://cyberd.org/skiptrace-2016.html

...you just add a list of URLs in the first column and it runs through them, capturing outlinks on each page too if you so desire. I don't think it crawls indefinitely, just the outlinks on each URL you list, though that can easily be around a hundred each with media/scripts.
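To put rough numbers on that, here's a back-of-the-envelope sketch. The 8,000 pages and ~100 outlinks per page are the figures above; the one-request-per-second rate is purely an assumption for illustration, and the total is an upper bound that assumes every outlink is served from the same server.

```python
def estimate_requests(pages, outlinks_per_page):
    """Upper bound on requests: each page itself plus all of its outlinks."""
    return pages * (1 + outlinks_per_page)


def hours_at_rate(total_requests, requests_per_second):
    """How long the run takes if requests are spread evenly at a given rate."""
    return total_requests / requests_per_second / 3600


total = estimate_requests(8000, 100)
print(total)  # 808000

# Assumed rate of 1 request per second (hypothetical, not a known limit).
print(round(hours_at_rate(total, 1.0)))
```

So with outlinks switched on, the run is on the order of hundreds of thousands of requests rather than eight thousand, which might be part of why a shared server struggles.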

Ooh, don't think I ran into that one before, maybe useful!