What is a web crawler?
Web crawlers, also known as web spiders or web robots, are programs or scripts that automatically collect information and data from the network according to certain rules. In plain terms, a web crawler simulates human behavior: it replaces manual operations with a program, jumping from one link to the next and traversing web pages much like a spider crawling across the Internet. Crawlers jump to, open, and browse pages far faster than humans do, and they reach deeper levels of a website, hence the name web crawler.
In 1993, Matthew Gray, a student at the Massachusetts Institute of Technology, wrote a program called the "World Wide Web Wanderer" to count the servers on the Internet and retrieve website domain names; it is generally regarded as the world's first web crawler. As the Internet grew rapidly and the number of web pages exploded, fast and accurate retrieval became increasingly difficult, and developers made many improvements and optimizations on top of the "Wanderer" in order to search the entire Internet. At the same time, the popularity of search engines pushed web crawlers toward multi-strategy scheduling, load balancing, and large-scale incremental crawling.
According to their system structure and implementation technology, web crawlers can be divided into four categories: general-purpose web crawlers, used by search engines and for large-scale data collection; focused web crawlers, aimed at collecting pages on specified topics and target sites; incremental web crawlers, which only collect pages that have been updated or changed; and deep web crawlers, capable of collecting the constantly changing information that cannot be reached through static links and is hidden behind search forms.
Three stages of malicious crawling and technical anti-crawling
Malicious crawling and anti-crawling evolve together with technology in an ongoing contest of attack and defense. Judging by the development of web crawlers and the changes in malicious crawling behavior, this contest has generally gone through three stages.
The first stage: restricting IPs and accounts, intercepting with verification codes
At first, a website's anti-crawling measure was simply to reject any visit that did not originate from a browser. When a malicious web crawler accessed the site, it received a 403 error response code or a "sorry, unable to access" prompt.
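The logic of this early defense can be sketched in a few lines. The example below is a hypothetical illustration (Python with Flask, and a made-up allow-list of browser tokens), not an excerpt from any real platform: requests whose User-Agent does not look like a browser are answered with a 403.

```python
from flask import Flask, request, abort

app = Flask(__name__)

# Hypothetical allow-list: tokens that appear in common browser User-Agent strings.
BROWSER_TOKENS = ("Mozilla", "Chrome", "Safari", "Firefox", "Edge")

@app.before_request
def reject_non_browsers():
    ua = request.headers.get("User-Agent", "")
    # Visits that do not look like they come from a browser are rejected with 403.
    if not any(token in ua for token in BROWSER_TOKENS):
        abort(403)

@app.route("/")
def index():
    return "Hello, human visitor."
```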
To bypass this anti-crawling mechanism, web crawlers began to set the Headers of their requests to simulate a browser, and to crawl static pages maliciously on a large scale with multiple threads.
Headers are the core of HTTP requests and responses, carrying the main information about a user's visit to a web page, including Cookie (user name, password, or session state), Host (the requested server host), User-Agent (browser, browser kernel, vendor, etc.), and Referer (the browsing trail, such as the previous page).
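As a concrete illustration, and purely as a sketch with placeholder values (the URL, cookie, and User-Agent string below are not from the original article), a crawler written with Python's requests library can masquerade as a browser simply by filling in those header fields:

```python
import requests

# Placeholder header values, for illustration only.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Referer": "https://example.com/previous-page",  # pretend browsing trail
    "Host": "example.com",                           # requested server host
    "Cookie": "sessionid=xxxxxxxx",                  # placeholder session cookie
}

# With these headers the request looks, at first glance, like an ordinary browser visit.
response = requests.get("https://example.com/page", headers=headers, timeout=10)
print(response.status_code)
```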
To counter such malicious crawling, websites and platforms restrict and block accounts and devices that frequently change their User-Agent (the simulated browser) or frequently switch proxy IPs: when the same IP or the same device visits the site too often within a certain period, the system automatically limits its access and browsing; when a visitor's requests become excessive, the request is automatically redirected to a verification code page, and the visit can only continue after the correct verification code is entered.
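A much-simplified sketch of such a per-IP limit with a verification-code redirect might look like the following (the window, threshold, and route names are all hypothetical):

```python
import time
from collections import defaultdict
from flask import Flask, request, redirect

app = Flask(__name__)

WINDOW_SECONDS = 60           # hypothetical observation window
MAX_REQUESTS_PER_WINDOW = 30  # hypothetical per-IP threshold

visits = defaultdict(list)    # client IP -> timestamps of recent requests

@app.before_request
def throttle_by_ip():
    if request.path == "/captcha":
        return  # never throttle the verification page itself
    now = time.time()
    ip = request.remote_addr
    # Keep only the timestamps that fall inside the current window.
    visits[ip] = [t for t in visits[ip] if now - t < WINDOW_SECONDS]
    visits[ip].append(now)
    # Too many requests: send the visitor to a verification code page instead.
    if len(visits[ip]) > MAX_REQUESTS_PER_WINDOW:
        return redirect("/captcha")

@app.route("/")
def index():
    return "Normal page content."

@app.route("/captcha")
def captcha():
    return "Please complete the verification code to continue."
```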
The second stage: dynamic web page technology protects information
Faced with these upgraded anti-crawling techniques, web crawlers upgraded accordingly. A crawler can automatically recognize and fill in verification codes, bypassing this second layer of interception, and it can use multiple accounts together with IP proxy tools to get around the platform's restrictions on accounts and IP addresses.
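The proxy-rotation part of this evasion can be illustrated with a short, hypothetical sketch (the proxy addresses and target URL are placeholders):

```python
import itertools
import requests

# Hypothetical proxy pool; real crawlers maintain much larger lists.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    proxy = next(proxy_cycle)
    # Each request goes out through a different proxy, so the server's
    # per-IP limits see only a fraction of the real traffic.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

resp = fetch("https://example.com/listing?page=1")
print(resp.status_code)
```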
In response, many websites and platforms adopted dynamic web page technology. With dynamic pages, the URL of a page is not fixed, and the backend interacts with front-end users in real time to complete queries, submissions, and other actions; visiting the same URL at different times or as different users produces different pages. Compared with traditional static pages, dynamic pages effectively protect important data and noticeably curb the malicious crawling behavior of web crawlers.
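A toy illustration of the idea (the route names and payload are invented for this sketch): the page served at the URL is only an empty shell, and the real data comes from a backend endpoint whose response depends on the session and the time of the request, so the same URL never yields a fixed page that can simply be crawled.

```python
import time
from flask import Flask, jsonify, session

app = Flask(__name__)
app.secret_key = "replace-me"  # hypothetical secret for signing sessions

@app.route("/")
def shell():
    # The static shell contains no data; JavaScript on the page
    # fetches the real content from /api/data after the page loads.
    return ("<html><body><div id='app'></div>"
            "<script src='/static/app.js'></script></body></html>")

@app.route("/api/data")
def data():
    # The payload depends on who is asking and when, so the same URL
    # does not return a fixed page.
    user = session.get("user", "anonymous")
    return jsonify({"user": user, "generated_at": time.time(), "items": ["..."]})
```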
The third stage: full-process prevention and control of malicious data theft
To bypass these new anti-crawling measures, web crawlers turned to Selenium and PhantomJS to simulate human operations completely.
Selenium is a tool for testing web applications that runs directly in the browser. It supports all mainstream browsers and can make the browser load pages automatically according to the developer's instructions, extract the required data, take screenshots of pages, or check whether certain actions on a website have occurred. Because Selenium has to be used together with a third-party browser, developers pair it with the PhantomJS tool (a "virtual" or headless browser) instead of a real one.
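A minimal Selenium sketch is shown below, using headless Chrome in the role PhantomJS used to play (PhantomJS itself is no longer maintained); the URL and the element selector are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome without a visible window, like a "virtual browser".
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")                    # placeholder URL
    driver.save_screenshot("page.png")                   # capture the rendered page
    title = driver.find_element(By.TAG_NAME, "h1").text  # placeholder selector
    print(title)
finally:
    driver.quit()
```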
With the continuous iteration of web crawlers, no single prevention and control measure is effective on its own. Platforms and enterprises need multi-layered defense measures to deal effectively with malicious crawling.