In 2023, the Hangzhou Intermediate People's Court concluded two unfair-competition cases involving "store relocation" software. The plaintiff was the operator of a well-known large-scale domestic e-commerce platform; the defendant had developed a copycat tool, described as a "certain-brand store-moving, quick-listing, batch-release" product, which was accused of illegally obtaining platform product information and selling it in other service markets.
According to the plaintiff's claim, it and its merchants had invested heavily in the product, transaction, and logistics data on the platform and had taken various measures to protect these data resources, prohibiting unauthorized access, copying, storage, and use. The defendant's software had long provided paid services that illegally captured the plaintiff platform's product links, titles, images, details, parameters, prices, inventory, and other information, and advertised one-click copying and transfer of listings to other platforms, generating substantial sales.
After trial, the court found that the defendant had, without the plaintiff's authorization, illegally obtained platform product information and uploaded it to competing shopping platforms, violating the relevant provisions of the PRC Anti-Unfair Competition Law and constituting unfair competition on the internet. On learning that the plaintiff was willing to mediate, the court explained the applicable law and its consequences to the defendant, making the defendant clearly aware of the seriousness of its infringement.
Ultimately, the parties voluntarily reached a settlement agreement: the defendant undertook to delete the relevant data and all derivative data, guaranteed that the software would no longer be able to illegally obtain such data, and paid the plaintiff 100,000 yuan in compensation for economic losses.
How Website Scraping Is Used to Steal E-commerce Data
As a data acquisition tool, web scraping is becoming a major hazard for e-commerce platforms. It not only harvests crucial merchant information and fuels the proliferation of counterfeit websites, but also collects sensitive user information, posing a serious threat to users' financial security and privacy. Scraping attacks can additionally disrupt normal promotional activities and cause irreversible damage to merchant credibility.
Illegal actors utilize website scraping to steal e-commerce data, primarily through the following steps:
1. Selecting target websites and platforms:
The first step in scraping e-commerce data is choosing the target website. The attackers then carefully analyze the target's request characteristics, including headers, cookies, and parameters, to construct the subsequent crawler requests.
2. Constructing requests and executing scraping:
Using tools such as Python's requests library or Selenium, the scraper's operators construct and send requests designed to retrieve the platform's product data, bypassing basic anti-crawler defenses to obtain the required information (a minimal sketch of steps 1-3 follows this list).
3. Data retrieval and storage:
Once product data is successfully scraped, the spider saves the data to local files or databases for subsequent analysis and use. Common data storage methods include CSV files, JSON files, and MySQL databases.
4. Data cleaning and processing:
Scraped data often contains noise and redundancy and must be cleaned and processed before use; once cleaned, numeric fields such as prices and sales volumes can be analyzed statistically and visually to reveal market dynamics and consumer behavior (see the cleaning sketch below).
5. Bypassing anti-crawler mechanisms:
During scraping, crawlers run into anti-crawler mechanisms such as IP blocking and CAPTCHAs. To get past them, operators use proxy servers, adjust request frequency, and apply CAPTCHA-recognition technology; some go further, using distributed IP proxy pools, simulating human behavior, and randomizing the intervals between requests (see the evasion sketch below).
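For illustration, here is a minimal Python sketch of steps 1-3, using the requests library mentioned above. The endpoint URL, query parameters, and field names are hypothetical stand-ins rather than any real platform's API; in a real attack they would come from the traffic analysis described in step 1.

```python
# Minimal sketch of steps 1-3: construct a request with browser-like
# headers, fetch product data, and persist it to CSV.
# NOTE: shop.example.com, /api/products, and the field names are
# hypothetical placeholders, not a real platform's API.
import csv

import requests

HEADERS = {
    # Copied from a real browser session so the request blends in
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://shop.example.com/",
}


def fetch_products(page: int) -> list[dict]:
    resp = requests.get(
        "https://shop.example.com/api/products",  # hypothetical endpoint
        params={"page": page, "size": 50},
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["items"]  # assumed response field


def save_csv(rows: list[dict], path: str = "products.csv") -> None:
    # Step 3: store scraped records locally for later analysis
    fields = ["title", "price", "stock"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for row in rows:
            writer.writerow({k: row.get(k) for k in fields})


if __name__ == "__main__":
    save_csv(fetch_products(page=1))
```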
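Step 4 often amounts to a few lines of pandas; the column names and the "¥1,299.00"-style price format below are assumed purely for illustration.

```python
# Sketch of step 4: deduplicate scraped rows and coerce messy price
# strings into numbers so they can feed statistical analysis.
import pandas as pd

df = pd.read_csv("products.csv")

# Drop duplicate listings captured across repeated crawl passes
df = df.drop_duplicates(subset=["title"])

# Turn strings like "¥1,299.00" into floats; unparseable values become NaN
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
df = df.dropna(subset=["price"])

# Simple statistics that feed the market analysis described above
print(df["price"].describe())
```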
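The evasion tactics in step 5 typically reduce to code like the following sketch: pick a random exit proxy for each request and sleep a random interval so the traffic looks less mechanical. The proxy addresses are placeholders from the TEST-NET documentation range, and CAPTCHA recognition is deliberately omitted.

```python
# Sketch of step 5: proxy rotation plus randomized request pacing.
import random
import time

import requests

PROXY_POOL = [
    "http://203.0.113.10:8080",  # placeholder addresses (TEST-NET range)
    "http://203.0.113.11:8080",
]


def fetch_with_rotation(url: str) -> requests.Response:
    proxy = random.choice(PROXY_POOL)  # new exit IP on every request
    time.sleep(random.uniform(2.0, 8.0))  # irregular, human-like pacing
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```

Point 2 of the detection list below works precisely because naive versions of this pacing remain statistically distinguishable from human browsing.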
How Can E-commerce Platforms Detect Website Scraping?
Modern website scraping programs rotate random IP addresses, hide behind anonymous proxies, modify their identities, and emulate human behavior, making them very difficult to detect and block. Detection and analysis therefore need to combine signals across multiple dimensions.
1. Targeted Access:
Malicious website scraping aims to obtain core information from websites and apps, such as user data, product prices, and review content. It therefore typically accesses only the pages containing this information while ignoring irrelevant pages.
2. Access Behavior:
Website scraping is automated, following preset processes and rules. Its behavior therefore exhibits noticeable regularity, rhythm, and consistency, differing significantly from the randomness, flexibility, and diversity of normal user behavior (see the interval-regularity sketch after this list).
3. Device Access Patterns:
Malicious website scraping seeks to gather the maximum amount of information in the shortest time possible, so it uses the same device for large numbers of access operations, including browsing, querying, and downloading, producing abnormal access frequency, duration, and depth.
4. IP Address Access:
To avoid detection and banning, malicious website scraping changes IP addresses by various means, such as cloud services, routers, and proxy servers. This results in inconsistencies in the IP address's geographic origin, ISP, and network type, or in noticeable deviations from the distribution of normal users.
5. Access Timeframes:
To minimize the risk of detection, malicious website scraping typically operates when site traffic is low and monitoring is weak, such as late at night or early in the morning, leading to abnormal access volume and bandwidth usage in those windows (see the timeframe sketch after this list).
6. Analysis and Mining:
By collecting, processing, mining, and modeling data on the access patterns of both normal users and scrapers, a platform can construct scraper-recognition models tailored to its own traffic, improving recognition accuracy and efficiency.
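To make the regularity signal in point 2 concrete, the sketch below flags a client whose inter-request intervals show an abnormally low coefficient of variation. The 0.1 threshold and the minimum sample size are assumed starting points that a platform would tune against its own traffic.

```python
# Sketch of point 2: near-constant inter-request intervals indicate a
# machine rhythm rather than human browsing.
import statistics


def looks_automated(timestamps: list[float], cv_threshold: float = 0.1) -> bool:
    """timestamps: one client's request arrival times, in seconds."""
    if len(timestamps) < 10:
        return False  # too little traffic to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap <= 0:
        return True  # bursts of simultaneous requests are also suspicious
    cv = statistics.stdev(gaps) / mean_gap  # low variation => automation
    return cv < cv_threshold


# A client requesting every ~3 seconds with tiny jitter is flagged
print(looks_automated([i * 3.0 + 0.05 * (i % 2) for i in range(20)]))  # True
```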
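The timeframe signal in point 5 can be checked the same way: measure each client's share of requests that fall in an assumed low-traffic window. Both the 01:00-06:00 window and the 0.5 threshold here are illustrative and would be calibrated against the platform's real traffic curve.

```python
# Sketch of point 5: clients that do most of their traffic during the
# site's quiet hours deviate from the normal-user time distribution.
from datetime import datetime

QUIET_HOURS = range(1, 6)  # assumed low-traffic window, 01:00-05:59


def off_hours_ratio(request_times: list[datetime]) -> float:
    if not request_times:
        return 0.0
    quiet = sum(1 for t in request_times if t.hour in QUIET_HOURS)
    return quiet / len(request_times)


def suspicious_timeframe(request_times: list[datetime],
                         threshold: float = 0.5) -> bool:
    return off_hours_ratio(request_times) > threshold
```

In practice, signals like these two would be combined with the device and IP features above in the recognition models described in point 6.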
Dingxiang: Assisting E-commerce Platforms in Intercepting Malicious Website Scraping
The attack methods of website scraping are becoming increasingly intelligent and complex. Relying solely on limiting access frequency or encrypting frontend pages is no longer sufficient for effective defense. It is necessary to enhance human-machine recognition technology and increase the ability to identify and intercept abnormal behaviors, thereby limiting website scraping access and raising the cost of malicious data theft. Dingxiang provides comprehensive solutions for enterprises to effectively prevent malicious website scraping behaviors.
Dingxiang atbCAPTCHA, based on AIGC technology, counters threats such as AI-driven brute-force attacks, automated attacks, and phishing attacks, effectively blocking unauthorized access and intercepting scraping-based theft. It integrates 13 verification methods and multiple prevention-and-control strategies, supports frictionless verification for legitimate users, and reduces real-time response and handling time to within 60 seconds, further enhancing the convenience and efficiency of login services.
Dingxiang Device Fingerprinting generates a unified, unique fingerprint for each device by linking multi-dimensional device information. Using identification strategies that span device, environment, and behavior, it flags risky devices such as virtual machines, proxy servers, and emulators manipulated by malicious actors, analyzes whether a device behaves abnormally or inconsistently with its user's habits (for example, logins from many accounts or frequent IP changes), and quickly determines whether page accesses originate from malicious scraping devices.
Dingxiang Dinsight assists enterprises with risk assessment, anti-fraud analysis, and real-time monitoring, improving the efficiency and accuracy of risk control. Dinsight's everyday risk-control strategies process in under 100 milliseconds on average, and the platform supports the configuration and accumulation of multi-party data while enabling self-monitoring and self-iteration of risk-control performance based on mature indicators, strategies, models, and deep-learning technology.
Paired with Dinsight, the Xintell intelligent modeling platform automatically optimizes security strategies for known risks and supports configuring risk-control strategies for different scenarios based on risk-control logs and data mining. Using association networks and deep-learning technology, it standardizes the complex pipeline of data processing, feature derivation, model construction, and model deployment into a one-stop modeling service, effectively identifying potential scraping threats and further strengthening both the recognition of malicious data-theft behavior and the interception of malicious website scraping.