Introduction
Hello 👋
Today, we will see how to shop on Amazon like a pro 😎
As a computer engineer, I always prefer automating a task for hours rather than wasting minutes doing it by hand.
So, when I found myself checking the price of my favourite keyboard on Amazon every day, I decided to spend hours writing code that does the same rather than wasting 5 minutes every day doing it manually.
Deciding on a Tech Stack
The Scraper
I first tried using only the Python requests library to fetch the HTML and scrape it with BeautifulSoup. All I got back was:
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.
~ Amazon
So no BeautifulSoup. I had to use a web driver for automation to make Amazon think I was human. So I moved to Selenium, wrote the scraping script again, and this time hit a different error. Yay! Progress!
With the Chrome driver, Amazon still did not load the pages consistently, so I could not reliably grab the product name and price by their CSS selectors. So I moved to the Firefox web driver. With that, Amazon reliably displayed the content, and I could just grab the first element with the class a-price and the element with the ID productTitle. (I don't need to specify which is which)
The Server
Since I had the web scraping script in Python, I decided to go with the same for the server and picked FastAPI. But here, I overlooked one thing: Python is terrible when it comes to running code in parallel. Sure, it has the subprocess module and the schedule library, but that still got me stuck because schedule involves a while True loop that checks whether any scheduled actions are pending. I could not keep that loop within my server, and if I started a subprocess, I would need to implement a message queue to check whether I had a response yet. All of that was too much work, so instead I shifted to Go for the server and the parallel programming part and kept Python for the scraping. This way, I could write the server in Go, start a new goroutine, and call the Python script from there.
Hence, I finalised on:
Server: Go and Gin
Scraping: Python and Selenium
Database: SQLite3 (a simple DB, nothing crazy.)
API Endpoints
The server currently exposes 3 routes:
/all to get everything from the DB
/price to fetch the product price from Amazon, given the URL in the request body
/track to start tracking the product price, given the URL in the request body, and notify the user in case it changes
The first 2 are just for testing purposes.
Wait, but what if I buy the product and want to stop receiving the price updates?
I plan on working on a /cancel route for the same in the future.
The Scripts
For the above endpoints, I needed 2 scripts.
get_price.py for /price
get_info.py to fetch the product name and price to be added to the DB for each /track request
Both scripts share the same basic structure:
Import dependencies
Initialise the Firefox web driver
Open the URL
Get the required element(s)
Quit the driver (driver.quit())
Print the required output
I log errors to a separate file because the Go server reads each script's stdout for the final output, and I didn't want error messages mixed into it.
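The idea of keeping stdout clean can be sketched with a dedicated file logger. This is only an illustration: make_file_logger and the logger name are hypothetical, and the actual scripts use logging.basicConfig with a filename instead.

```python
import logging


def make_file_logger(name: str, log_path: str) -> logging.Logger:
    """Return a logger that writes only to log_path, never to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not hand records to the root logger, which may have console handlers.
    logger.propagate = False
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With a logger like this, a call such as log.error("Error fetching price from URL.") ends up in the file, and the only thing the caller finds on stdout is whatever the script prints as its result.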
get_price.py
import logging
import sys

# configure logging first so that import errors can be logged too
logging.basicConfig(filename='./scripts/get_price.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

try:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
except ImportError:
    logging.error("Dependencies missing. Please run 'pip install -r ./scripts/requirements.txt'")
    sys.exit(1)

try:
    # initialise Firefox webdriver with options
    options = webdriver.FirefoxOptions()
    # don't open a window
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
except Exception:
    logging.error("Error initialising Firefox webdriver.")
    sys.exit(1)

try:
    driver.get(sys.argv[1])
except Exception:
    logging.error("Error fetching page from URL.")
    # close the browser so it doesn't linger after a failure
    driver.quit()
    sys.exit(1)

try:
    # get the price of the product
    price_string = driver.find_element(By.CLASS_NAME, 'a-price').text
except Exception:
    logging.error("Error fetching price from URL.")
    driver.quit()
    sys.exit(1)

driver.quit()

# print the price without the currency symbol or commas
print(float(price_string.replace('₹', '').replace(',', '').replace('\n', '.')), end="")
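The string clean-up in the script's last line can be pulled into a small function that is easy to test without a browser. This is a minimal sketch: parse_price is a hypothetical name, and it assumes (as the replace('\n', '.') call suggests) that the element text puts the whole and fractional parts on separate lines.

```python
import re


def parse_price(text: str) -> float:
    """Convert the visible text of a price element to a float."""
    # Drop thousands separators; the newline stands in for the decimal point.
    cleaned = text.strip().replace(",", "").replace("\n", ".")
    # Remove anything that is not a digit or a decimal point, e.g. "₹" or "$".
    cleaned = re.sub(r"[^\d.]", "", cleaned)
    if not cleaned:
        raise ValueError(f"no digits found in price text: {text!r}")
    return float(cleaned)
```

Stripping every non-numeric character, rather than only the rupee sign, means the same function would also cope with other currency symbols, and an empty result raises an error instead of silently producing a wrong number.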
get_info.py
import json
import logging
import sys

# configure logging first so that import errors can be logged too
logging.basicConfig(filename='./scripts/get_info.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

try:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
except ImportError:
    logging.error("Dependencies missing. Please run 'pip install -r ./scripts/requirements.txt'")
    sys.exit(1)

try:
    # initialise Firefox webdriver with options
    options = webdriver.FirefoxOptions()
    # don't open a window
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
except Exception:
    logging.error("Error initialising Firefox webdriver.")
    sys.exit(1)

try:
    driver.get(sys.argv[1])
except Exception:
    logging.error("Error fetching page from URL.")
    # close the browser so it doesn't linger after a failure
    driver.quit()
    sys.exit(1)

try:
    # get the price of the product
    price_string = driver.find_element(By.CLASS_NAME, 'a-price').text
    price = float(price_string.replace("₹", "").replace(",", "").replace("\n", "."))
    # get product name
    name_string = driver.find_element(By.ID, 'productTitle').text
except Exception:
    logging.error("Error fetching product info from URL.")
    driver.quit()
    sys.exit(1)

driver.quit()

# json.dumps escapes quotes in the product name, so the output is always valid JSON
print(json.dumps({"name": name_string, "url": sys.argv[1], "price": price}))
Program Flow
The user (or front end in this case) sends a POST request with a URL to start tracking a product. I then call the get_info script to add the product name, URL and price to the DB.
Then, I start a goroutine to run the get_price script periodically. If the price changes, I notify the user accordingly.
I have also tried to stick as much as possible to the Clean Architecture I discussed in my previous blogs.
Key Functions
I won't paste the code for the entire server here, just the key components.
The entire code can be found on my GitHub.
main.go
package main

import (
	"price-tracker/database"
	"price-tracker/handler"
	"price-tracker/router"
)

func main() {
	// Initialize repository
	productDB := database.NewDB()

	// Initialize use case
	productHandler := handler.NewHandler(productDB)

	// Set up router
	router := router.SetupRouter(productHandler)

	// Run the server
	router.Run(":8080")
}
Here, I have used dependency injection so that even if I switch out the implementation of one of the components, it does not affect the other components as long as the names of exported functions are consistent.
For each request, the router calls the handler, which coordinates the scripting and database functions.
Handler
func (h *Handler) TrackPrice(url string) (*entities.Product, error) {
	outputChannel := make(chan entities.Product)
	errorChannel := make(chan error)

	go fetchProductInfo(url, outputChannel, errorChannel)

	select {
	case err := <-errorChannel:
		fmt.Println("Error fetching price")
		return nil, err
	case product := <-outputChannel:
		h.db.AddProduct(&product)
		go startTracking(&product, h.db)
		return &product, nil
	}
}
Here, we can see there are 2 goroutines.
fetchProductInfo gets the initial product name and price from Amazon.
If the details are found, the product is added to the DB, and startTracking then periodically compares the price on Amazon to the one stored in the DB.
Since startTracking runs as a goroutine, I can have an infinite for loop (Go's version of while(True)) in there without halting the execution of my server.
startTracking()
func startTracking(product *entities.Product, db *database.DB) {
	for {
		// Buffered channels let each goroutine send its result and exit
		// even if nobody receives it (e.g. after an early return).
		scriptOutputChannel := make(chan float64, 1)
		scriptErrorChannel := make(chan error, 1)
		dbOutputChannel := make(chan float64, 1)
		dbErrorChannel := make(chan error, 1)

		// Scrape the current price from Amazon.
		go func(outputChannel chan float64, errorChannel chan error) {
			newPrice, err := getProductPrice(product.URL)
			if err != nil {
				errorChannel <- err
				return
			}
			outputChannel <- newPrice
		}(scriptOutputChannel, scriptErrorChannel)

		// Read the stored price from the DB.
		go func(outputChannel chan float64, errorChannel chan error) {
			oldPrice, err := db.GetPrice(product.ID)
			if err != nil {
				errorChannel <- err
				return
			}
			outputChannel <- oldPrice
		}(dbOutputChannel, dbErrorChannel)

		// Wait for both results; stop tracking if either lookup fails.
		var newPrice, oldPrice float64
		select {
		case err := <-scriptErrorChannel:
			fmt.Println("Error fetching price: ", err)
			return
		case newPrice = <-scriptOutputChannel:
		}
		select {
		case err := <-dbErrorChannel:
			fmt.Println("Error fetching price: ", err)
			return
		case oldPrice = <-dbOutputChannel:
		}

		if newPrice != oldPrice {
			db.UpdatePrice(product.ID, newPrice)
			fmt.Printf("Price updated.\n%s\n%f -> %f\n", product.Name, oldPrice, newPrice)
		}

		// Sleep for a day before checking again.
		time.Sleep(24 * time.Hour)
	}
}
Here, I fetch the current price from Amazon and the old price from the DB concurrently.
If there's a mismatch, I update the DB and notify the user (for now, just a print to the console; later, an e-mail or an SMS).
I execute this comparison, wait a day, and then try again.
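Stripped of the channels, one iteration of that loop is just compare, store, notify. Here is a rough sketch of that single step, written in Python rather than Go to keep it short. The names check_price, fetch_current_price and save_price are hypothetical stand-ins for the scraper call and the DB update, and the message format is made up for illustration.

```python
from typing import Callable, Optional


def check_price(
    stored_price: float,
    fetch_current_price: Callable[[], float],
    save_price: Callable[[float], None],
) -> Optional[str]:
    """Run one comparison; return a notification message if the price moved."""
    current_price = fetch_current_price()
    if current_price == stored_price:
        # Nothing changed, so there is nothing to store or report.
        return None
    # Persist the new price before building the notification text.
    save_price(current_price)
    direction = "dropped" if current_price < stored_price else "rose"
    return f"Price {direction}: {stored_price:.2f} -> {current_price:.2f}"
```

Passing the fetch and save steps in as functions keeps the comparison testable without Amazon or a database, which is the same idea as the dependency injection in main.go.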
Future Work
Implement the notification part (SMS or E-Mail)
Expose another endpoint to cancel the tracking of a product.