Over the past few months, we have been working with a company called Statec (a data science company from Brazil) to design features for predictive algorithms. One of the first considerations when working with predictive algorithms is choosing relevant data to train on. We decided quite naively to put together a list of web page features that we thought might offer some value. Our goal was simply to see if, from the features available, we could come close to predicting a web page’s rank in Google.

With GetStat, we simply loaded keyword combinations of the top 25 service industries with the locations of the 200 largest cities in the United States. This resulted in 5,000 unique search terms (e.g., “Charlotte Accountant” from Charlotte, NC). Our company, Consultwebs, focuses on legal marketing, but we wanted the model to be more universal. After loading all 5,000 terms and waiting a day, we had around 500,000 search results that we could use to build our dataset. With that step proving so simple, we moved on to collecting the rest of the data.
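As a rough illustration of how 25 industries and 200 cities multiply out to 5,000 search terms, here is a minimal Python sketch. The sample lists, the `build_search_terms` helper, and the dictionary layout are hypothetical, chosen only for illustration; they are not GetStat's input format or the script we actually used.

```python
from itertools import product

# Hypothetical sample inputs; the real lists held 25 industries and 200 cities.
industries = ["Accountant", "Plumber", "Dentist"]
cities = [("Charlotte", "NC"), ("Austin", "TX")]


def build_search_terms(industries, cities):
    """Pair every city with every industry and keep the search location."""
    terms = []
    for (city, state), industry in product(cities, industries):
        terms.append({
            "keyword": f"{city} {industry}",   # e.g. "Charlotte Accountant"
            "location": f"{city}, {state}",    # e.g. "Charlotte, NC"
        })
    return terms


terms = build_search_terms(industries, cities)
print(len(terms))  # 2 cities x 3 industries = 6 terms
print(terms[0])
```

With the full lists, the same pairing yields 25 × 200 = 5,000 terms.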

This ignores what’s available in Python, but it was surprisingly useful for our purposes.

Text Statistics – Useful for getting data on sentence length, reading level, etc.

Majestic – I started by exploring their API through a custom script, but they delivered the data in a single batch, which was very nice. Thanks, Dixon!

Cheerio – An easy-to-use library for parsing DOM elements using jQuery-style selectors.

IPInfo – Not really a library, but a great API for getting server information.

The crawl process was very slow, mainly due to rate limits from API providers and our proxy service. We would have created a cluster, but the expense limited us to hitting a few APIs about once per second. Slowly, we completed a crawl of all 500,000 URLs.

Here are some notes on my experience with URL crawling for data collection:

Use APIs whenever possible.
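To make the once-per-second pacing concrete, here is a minimal Python sketch of a blocking throttle wrapped around a crawl loop. It is not the crawler we ran. The `Throttle` class, the `crawl` helper, the `fake_fetch` stand-in, and the example URLs are all hypothetical, and a real crawler would also need retries and error handling.

```python
import time


class Throttle:
    """Block so that calls to wait() are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the timing can be tested
        self._sleep = sleep
        self._last_call = None   # time of the previous permitted call

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def crawl(urls, fetch, throttle):
    """Fetch each URL in order, pausing as the throttle requires."""
    results = {}
    for url in urls:
        throttle.wait()
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    # Hypothetical stand-in for a real API or HTTP request.
    return {"url": url, "status": "ok"}


# Short interval so the demo finishes quickly; use 1.0 for one call per second.
demo = crawl(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    Throttle(min_interval=0.05),
)
print(len(demo))
```

At one request per second, 500,000 URLs works out to roughly 5.8 days of requests against a single API, which is why the crawl took so long without a cluster.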

 

 
