How to Scrape Amazon Reviews with Python and Proxies: A Step-by-Step Tutorial with Code

Amazon is one of the world’s largest online marketplaces, with millions of products listed on the platform. Many customers rely on reviews to make informed buying decisions. As an Amazon seller, you might want to scrape reviews to gather feedback on your products, monitor competitors, or perform market research.

In this tutorial, we will show you how to scrape Amazon reviews using Python and proxies. Proxies are necessary because Amazon might block your IP address if you make too many requests in a short period of time. By using proxies, you can distribute your requests across different IP addresses to avoid detection.

Prerequisites

Before we start, make sure you have the following installed on your system:

  • Python 3
  • requests library
  • BeautifulSoup library
  • csv library (built into Python)

You can install the third-party libraries using the following commands in your command prompt:

pip install requests
pip install beautifulsoup4
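
The csv module ships with Python’s standard library, so it needs no separate install. You can confirm that everything imports cleanly with a quick one-liner:

python -c "import requests, bs4, csv; print('all imports OK')"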

Step 1: Import Libraries

First, you need to import the required libraries into your Python script. Here is the code to import the libraries:

import requests
from bs4 import BeautifulSoup
import csv
import random
import time

  • requests: for making HTTP requests to Amazon
  • BeautifulSoup: for parsing HTML of the review pages
  • csv: for writing the scraped data to a CSV file
  • random: for randomly choosing a proxy and user-agent string
  • time: for delaying between requests to avoid detection

Step 2: Set up Proxies

Now, you need to set up a list of proxies that you will use to scrape Amazon reviews. You can find free proxy lists online, though free proxies are often slow and short-lived, so a paid proxy service is more reliable for sustained scraping. Here is an example list of proxies (the addresses below are placeholders):

proxies = ['http://10.10.1.10:3128', 'https://10.10.1.11:1080', 'http://10.10.1.10:80']
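
The requests library accepts proxies as a dictionary mapping the URL scheme to the proxy address. Before scraping, it is worth checking that a proxy actually works. Here is a minimal sketch that sends a test request through a randomly chosen proxy to httpbin.org (an echo service used here only for testing, not part of the scraper itself):

# Sanity check: httpbin.org/ip echoes the IP it sees, so a working
# proxy should report the proxy's address rather than your own.
proxy_url = random.choice(proxies)
proxy = {'http': proxy_url, 'https': proxy_url}

try:
    response = requests.get('https://httpbin.org/ip', proxies=proxy, timeout=10)
    print(response.json())
except requests.RequestException as exc:
    print(f'Proxy failed: {exc}')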

Step 3: Set up User-Agent

Amazon might block your request if you use the default requests user-agent. To avoid this, rotate realistic browser user-agent strings, choosing one at random for each request. Here is an example list of user-agent strings:

user_agents = [
 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36',
 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36',
 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393',
]
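
If you’d rather not repeat the two random.choice calls before every request, you can wrap them in a small helper. This is just a convenience sketch built on the proxies and user_agents lists defined above; the function in Step 4 makes the same choices inline:

def random_request_settings():
    """Pick a fresh proxy and user-agent pair for a single request."""
    proxy_url = random.choice(proxies)
    # requests expects a mapping from URL scheme to proxy URL
    proxy = {'http': proxy_url, 'https': proxy_url}
    headers = {'User-Agent': random.choice(user_agents)}
    return proxy, headers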

Step 4: Scrape Reviews

Now you can start scraping Amazon reviews. The example function below takes the URL of a product’s reviews page (the /product-reviews/<ASIN> URL) and fetches the first num_pages pages via the pageNumber query parameter, writing each review’s text, star rating, and date to a CSV file. Note that the CSS class names used here match Amazon’s review markup at the time of writing and may change:

def scrape_reviews(reviews_url, num_pages):
    # Open CSV file for writing
    with open('reviews.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Review', 'Rating', 'Date'])

        # Loop through pages of reviews
        for i in range(1, num_pages + 1):
            # Choose a random proxy and user-agent for each request
            proxy_url = random.choice(proxies)
            proxy = {'http': proxy_url, 'https': proxy_url}
            headers = {'User-Agent': random.choice(user_agents)}

            # Request the review page (e.g. .../product-reviews/<ASIN>?pageNumber=2)
            response = requests.get(f'{reviews_url}?pageNumber={i}',
                                    proxies=proxy, headers=headers)
            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract each review's text, star rating, and date, and save to the CSV file
            reviews = soup.find_all('div', {'class': 'a-section review aok-relative'})
            for review in reviews:
                review_text = review.find('span', {'class': 'a-size-base review-text review-text-content'}).text.strip()
                rating = review.find('span', {'class': 'a-icon-alt'}).text[:3]
                date = review.find('span', {'class': 'a-size-base a-color-secondary review-date'}).text.strip()
                writer.writerow([review_text, rating, date])

            # Pause between requests to avoid detection
            time.sleep(random.uniform(2, 5))
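
With the function defined, you can call it with a product’s reviews URL and a page count. The ASIN below is a placeholder; substitute the one from the URL of the product you want to scrape:

# Hypothetical ASIN; replace with a real one from the product's URL
scrape_reviews('https://www.amazon.com/product-reviews/B0EXAMPLE01', num_pages=3)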

Alternative Solution

If you don’t know how to use Python, you can use a web scraper to scrape Amazon reviews instead. A web scraper is a tool that extracts data from websites automatically, and there are many available, both free and paid.

There are lots of Amazon scrapers online. One popular option is ParseHub, a free web scraper that allows you to extract data from websites using a point-and-click interface. Here’s how you can use ParseHub to scrape Amazon reviews:

  1. Go to the ParseHub website and create an account.
  2. Install the ParseHub browser extension.
  3. Open Amazon and go to the product page you want to scrape.
  4. Click on the ParseHub browser extension and select “Create New Project”.
  5. Use the point-and-click interface to select the review text, rating, and date.
  6. Run the scraper and download the data as a CSV file.

Note that scraping Amazon reviews without permission may violate Amazon’s terms of service. Use this method at your own risk and always make sure to follow ethical and legal guidelines.

Conclusion

Scraping Amazon reviews can be a useful tool for sellers to monitor feedback on their products and gain insights into market trends. However, it is important to use proxies to avoid detection and ensure that your requests are not blocked by Amazon.

In this tutorial, we showed you how to scrape Amazon reviews using Python and proxies. We used the requests library to make HTTP requests, BeautifulSoup to parse the HTML of the review pages, and csv to write the scraped data to a CSV file.

We also showed you how to set up a list of proxies and randomly choose one for each request to avoid detection by Amazon. Additionally, we used a list of user-agent strings to prevent Amazon from blocking our requests.

Remember that scraping Amazon reviews without permission may violate Amazon’s terms of service and could lead to legal consequences. Use this tutorial at your own risk and always make sure to follow ethical and legal guidelines.
