Reading-Notes

Project maintained by eslamakram Hosted on GitHub Pages — Theme by mattgraham

How to Web Scrape with Python

Web scraping, also called web harvesting or web data extraction, is a technique for automatically accessing and extracting large amounts of information from a website, which can save a huge amount of time and effort.

How?

Download the turnstile data from the MTA developers page (the URL in the next step). Python code: we start by importing the following libraries.
    import requests
    import urllib.request
    import time
    from bs4 import BeautifulSoup
Request the page and parse the HTML with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure.
    url = 'http://web.mta.info/developers/turnstile.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
Use find_all (spelled findAll in older BeautifulSoup code) to locate all of the <a> tags.
    soup.find_all('a')
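
The steps above can be sketched as one small script. This is only a minimal sketch: the helper names `extract_links` and `fetch_links`, the 10-second timeout, and the sample HTML snippet are assumptions made for illustration, not part of the original tutorial. The parsing step is kept separate from the download so it can be tried offline.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL = "http://web.mta.info/developers/turnstile.html"


def extract_links(html, base_url):
    """Return the absolute URL of every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without a link target.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def fetch_links(url=URL):
    """Download the page and list its links (performs a network request)."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return extract_links(response.text, url)


# Offline demonstration on a hypothetical snippet (not real page content).
sample_html = """
<a href="data/sample_a.txt">Sample file</a>
<a name="top">Anchor without href</a>
"""
links = extract_links(sample_html, URL)
print(links)
```

Calling `fetch_links()` would run the same extraction against the live page, assuming the URL is still reachable.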

5 Tips For Web Scraping Without Getting Blocked or Blacklisted

IP Rotation: The number one way sites detect web scrapers is by examining their IP address, so much of scraping without getting blocked comes down to spreading requests across many IP addresses so that no single address gets banned. To avoid sending all of your requests through the same IP address, you can use an IP rotation service like ScraperAPI or another proxy service to route your requests through a series of different IP addresses. This will allow you to scrape the majority of websites without issue.
Set a Real User Agent: The User-Agent is an HTTP request header that tells the website you are visiting which browser you are using. Some websites examine the User-Agent and block requests whose value doesn’t belong to a major browser.
Set Other Request Headers: Real web browsers set a whole host of headers, any of which can be checked by careful websites to block your web scraper. To make your scraper appear to be a real browser, you can navigate to https://httpbin.org/anything and copy the headers that you see there (they are the headers that your current web browser is sending). Setting headers like “Accept”, “Accept-Encoding”, “Accept-Language”, and “Upgrade-Insecure-Requests” makes your requests look like they are coming from a real browser, so your scraper is less likely to be blocked.
Set Random Intervals In Between Your Requests: It is easy to detect a web scraper that sends exactly one request each second, 24 hours a day. No real person would ever use a website like that, and such an obvious pattern is easily detectable. Use randomized delays (anywhere between 2 and 10 seconds, for example) to build a web scraper that is harder to detect.
Set a Referrer: The Referer header is an HTTP request header that tells the site which site you are arriving from. It is generally a good idea to set it so that it looks like you are arriving from Google. You can do this with the header:

"Referer": "https://www.google.com/"
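
The header and timing tips so far can be combined in a `requests` session. The following is a rough sketch under stated assumptions: the header values are example strings only (copy the real ones from your own browser), and `make_session`, `random_delay`, `polite_get`, and the `proxy_url` parameter are hypothetical names introduced here. A single proxy URL stands in for whatever rotation service you use.

```python
import random
import time

import requests

# Example values only (assumption); replace with your own browser's headers.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Referer": "https://www.google.com/",
}


def make_session(proxy_url=None):
    """Create a session that sends browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    if proxy_url:
        # Route both HTTP and HTTPS traffic through the given proxy.
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def random_delay(low=2.0, high=10.0):
    """Pick a random pause length in seconds between low and high."""
    return random.uniform(low, high)


def polite_get(session, url):
    """Wait a random interval, then fetch the URL (network request)."""
    time.sleep(random_delay())
    return session.get(url, timeout=10)
```

A scraper would create one session with `make_session()` and call `polite_get(session, url)` for each page, so consecutive requests are separated by irregular pauses instead of a fixed interval.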

Use a Headless Browser: The trickiest websites to scrape may detect subtle tells like web fonts, extensions, browser cookies, and JavaScript execution to determine whether or not a request is coming from a real user. To scrape these websites, you may need to deploy your own headless browser.
Avoid Honeypot Traps: A lot of sites try to detect web crawlers by putting in invisible links that only a robot would follow. You need to detect whether a link has the “display: none” or “visibility: hidden” CSS properties set, and if it does, avoid following it. Otherwise a site will be able to correctly identify you as a programmatic scraper, fingerprint the properties of your requests, and block you quite easily. Honeypots are one of the easiest ways for smart webmasters to detect crawlers, so make sure that you perform this check on each page that you scrape.
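
The honeypot check can be approximated with BeautifulSoup. This sketch makes a simplifying assumption: it only inspects inline `style` attributes on a link and its ancestors, so links hidden through an external stylesheet or a CSS class would not be caught (that would need a real browser to compute styles). The names `is_hidden` and `visible_links` and the sample HTML are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Matches the two inline CSS rules named in the tip above.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE
)


def is_hidden(tag):
    """True if the tag or any ancestor is hidden by an inline style."""
    node = tag
    while node is not None:
        if HIDDEN_STYLE.search(node.attrs.get("style", "")):
            return True
        node = node.parent
    return False


def visible_links(html):
    """Return hrefs of links that are not hidden by inline styles."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if not is_hidden(a)]


# Hypothetical page fragment with one normal link and two traps.
sample = """
<a href="/page1">Visible</a>
<a href="/trap1" style="display: none">Hidden</a>
<div style="visibility:hidden"><a href="/trap2">Hidden by parent</a></div>
"""
print(visible_links(sample))  # ['/page1']
```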