HN Jobs

A searchable index of Hacker News “Who is hiring?” job postings.


Common Crawl

Company: Common Crawl
Website: commoncrawl.org
Role taxonomy: Data / Analytics
Specialties: Data Science
Location: Remote
Salary: not listed
Apply via: http://commoncrawl.org/jobs/ · jobs@commoncrawl.org
Hiring notes: none
Tech: Python, Java, AWS
Posted by: Smerity
Posted: Oct 2, 2015
Source: Hacker News

Original posting

Common Crawl (http://commoncrawl.org/)
Position: Crawl Engineer / Data Scientist
Location: SF or Remote
Email: jobs@commoncrawl.org

Common Crawl is the non-profit organization that builds and maintains the single largest publicly accessible dataset of the world's knowledge, encompassing petabytes of web crawl data. Anyone can download and use the data for free, and it has been used for a wide variety of purposes.

As the crawl engineer, you'll run a crawl that spans hundreds of millions of domains and billions of pages each month. You'll command a fleet of machines on AWS that use Nutch to capture the web data and then Hadoop to turn it into a better-structured dataset for others to use. This is a rewarding role as you're really giving back to the open data community :)

Requirements:
- Fluent in Java (Nutch and Hadoop are core to our mission)
- Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
- Knowledge of the Amazon Web Services (AWS) ecosystem
- Experience with Python
- Basic command line Unix knowledge
- BS Computer Science or equivalent work experience

Full details: http://commoncrawl.org/jobs/