A searchable index of Hacker News “Who is hiring?” job postings.
May 2018 thread
Internet Archive
Internet Archive | Web Crawl Engineer, Archive-It - San Francisco, CA or remote - Full Time
Running large-scale web harvests on global and national domain levels and focused and
specialized crawls using Heritrix, our open-source crawler, as well as other open-source
technologies developed internally, including Umbra, Brozzler, warcprox and others.
Configuration, monitoring, and improvement of large-scale web crawls to ensure their
quality and timely completion. Processing, analysis and quality assurance of archived web
content to ensure it is complete and of the highest quality. Contribute to development of
tools for automated analysis and reporting of crawl material, and to development projects
focused on crawling, processing, and access. Manage both large ingests and exports of web
data, derivatives, logs, and reports. Demonstrated experience delivering on commitments
with deadlines and project timelines, and working in a collaborative team of engineers and
project/product managers.
Skills & Requirements
Experience in Unix shell scripting and Python coding required.
Experience with web crawlers or scrapers, especially Heritrix.
Solid experience in Internet protocols (HTTP is a must).
Strong knowledge of HTML, JavaScript, and Web technologies in general.
Ability to work in, and enjoy, a loosely structured work environment.
To Apply: Please email a cover letter, salary expectations, and résumé to
jobs+crawlengineer@archive.org with the subject line "Web Crawl Engineer."