Common Crawl

A searchable index of Hacker News “Who is hiring?” job postings.

Company	Common Crawl
Website	commoncrawl.org ↗
Role taxonomy	Data / Analytics
Specialties	Data Science
Location	SF or Remote
Salary	—
Apply via	Application link — http://commoncrawl.org/jobs/ · jobs@commoncrawl.org
Hiring notes	—
Tech	Python Java AWS
Posted by	Smerity
Posted	Oct 2, 2015
Source	View on Hacker News ↗

Original posting

Common Crawl (http://commoncrawl.org/)
Position: Crawl Engineer / Data Scientist
Location: SF or Remote
Email: jobs@commoncrawl.org

Common Crawl is the non-profit organization that builds and maintains the single largest publicly accessible dataset of the world's knowledge, encompassing petabytes of web crawl data. Anyone can download and use the data for free, and it has been used for a wide variety of purposes.

As the crawl engineer, you'll run a crawl that spans hundreds of millions of domains and billions of pages each month. You'll command a fleet of machines on AWS that use Nutch to capture the web data and then Hadoop to turn it into a better-structured dataset for others to use. This is a rewarding role as you're really giving back to the open data community :)

Requirements:
- Fluent in Java (Nutch and Hadoop are core to our mission)
- Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
- Knowledge of the Amazon Web Services (AWS) ecosystem
- Experience with Python
- Basic command line Unix knowledge
- BS in Computer Science or equivalent work experience

Full details: http://commoncrawl.org/jobs/