How to Scrape Products for E-commerce with Sanity CMS and Cheerio

Introduction
Building e-commerce websites has always been an exciting challenge for me, especially when working on high-stakes projects like selling cosmetics from a well-known brand. Recently, I was tasked with building an online store for a client who was an official distributor for a cosmetic brand called "Loreal." (just kidding) The project seemed straightforward at first, but there was a twist — I needed to automate the process of importing products directly into the Sanity CMS from an external WordPress-based site.
Having worked on many e-commerce websites, I’ve always embraced new challenges, but this one was unique. With a source website built in WordPress and a public API at hand, I saw an opportunity to create a seamless automation process that could save the client a lot of time and effort. In this post, I’ll walk you through how I approached the task, the tools I used, and how I built an automated product import system that checks all the right boxes: efficient, scalable, and low-maintenance.
Scraping the Data
To kick things off, I needed to gather all the product information from the source website. Since the site was powered by WordPress, its public REST API exposed (be aware!) the products' names and links. At first glance, it seemed like an easy task: I could use a simple loop with query parameters like per_page and page to paginate through all the product listings.
const API_URL = 'https://example.com/wp-json/wp/v2/products'

async function fetchAllProductLinks() {
  let allProductDetails = []
  let page = 1
  let products = []
  try {
    // Keep fetching as long as there are products in the response
    do {
      const response = await fetch(`${API_URL}?per_page=20&page=${page}`)
      // WordPress answers with a 400 once the page number runs past the last page
      if (!response.ok) break
      products = await response.json()
      // Push an object containing both link and name to the result array
      allProductDetails = allProductDetails.concat(
        products.map((product) => ({
          link: product.link,
          name: product.name
        }))
      )
      page++
    } while (products.length > 0) // Stop if no products are returned
    return allProductDetails
  } catch (error) {
    console.error('Error fetching product links and names:', error)
  }
}
With this code, I was able to easily collect all the product links. But the real work started when I needed to scrape detailed information from each individual product page. That's where Cheerio came in.
Cheerio is a fast and lightweight library that mimics jQuery, making it perfect for scraping HTML content from a page. I used it to load each product page and extract key details, such as the product name, description, images, and more.
const cheerio = require('cheerio')

async function scrapeProductDetails(url) {
  try {
    // Route the request through a CORS proxy to get around cross-origin restrictions
    const response = await fetch('https://cors-anywhere.herokuapp.com/' + url)
    const data = await response.text()
    const $ = cheerio.load(data)
    // Here's where you need your browser-console skills to find the right selectors
    const productName = $('h1').text()
    const productDescription = $('meta[name="description"]').attr('content')
    const productImage = $('img.product-image').attr('src')
    return { productName, productDescription, productImage }
  } catch (error) {
    console.error('Error scraping product details:', error)
  }
}
Now that I had a way to scrape product details, I ran into a familiar problem: CORS errors. Since the requests were going out from the browser to an external site, they were blocked by cross-origin restrictions. Fortunately, I was able to get around this with CORS Anywhere (hosted on Heroku), which acted as a proxy and returned the responses with permissive CORS headers.
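A caveat worth adding: the public cors-anywhere.herokuapp.com instance is heavily rate-limited and requires a manual opt-in these days, so for anything beyond a quick experiment it is safer to host your own proxy. Here's a minimal sketch using the cors-anywhere npm package (the host and port are placeholder values):
const corsAnywhere = require('cors-anywhere')

// Any URL prefixed with http://localhost:8080/ gets proxied with permissive CORS headers
corsAnywhere
  .createServer({
    originWhitelist: [], // allow all origins; tighten this in production
    requireHeader: ['origin', 'x-requested-with'],
    removeHeaders: ['cookie', 'cookie2']
  })
  .listen(8080, 'localhost', () => {
    console.log('CORS Anywhere proxy running on localhost:8080')
  })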
Validating and Enriching the Data
At this point, I had most of the product details I needed, but there was one crucial piece of information still missing — the price. My client provided me with a PDF that contained the distributor’s pricing information. To make sure I had accurate data, I needed to cross-reference the prices with the ones in the PDF.
Here, I turned to ChatGPT to help me extract the necessary details from the PDF. With a bit of prompt engineering (an article coming soon), I was able to create a process that automatically extracted the product names, reference codes, and prices from the PDF.
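The prompt details deserve their own post, but the upshot was a clean JSON array of rows like { name, referenceCode, price }. With that in hand, enriching the scraped products became a simple lookup. The sketch below assumes each scraped product also carries a referenceCode pulled from its page, which is an assumption about my setup rather than a given:
// priceList is the ChatGPT output: [{ name, referenceCode, price }, ...]
function enrichWithPrices(products, priceList) {
  // Index the PDF rows by reference code for constant-time lookups
  const priceByRef = new Map(priceList.map((row) => [row.referenceCode, row]))
  return products.map((product) => {
    const match = priceByRef.get(product.referenceCode)
    if (!match) console.warn(`No price found for: ${product.name}`)
    return { ...product, price: match ? match.price : null }
  })
}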
Integrating with Sanity CMS
With the product data now validated and enriched, it was time to push everything into Sanity CMS. I used the Sanity client's createIfNotExists() method, which only creates a document when none with the same _id exists yet. This let me re-run the import safely without creating duplicates or clobbering products that were already in the dataset.
const { createClient } = require('@sanity/client')

const client = createClient({
  projectId: 'your-project-id',
  dataset: 'production',
  token: 'your-token', // keep write tokens out of source control
  useCdn: false // mutations and fresh reads should hit the live API, not the CDN
})
async function createProductInSanity(product) {
  try {
    // there were a lot more fields, but you get the idea
    await client.createIfNotExists({
      // createIfNotExists requires a stable _id and a _type
      // (the _id here is derived from an assumed slug field)
      _id: `product-${product.slug}`,
      _type: 'product',
      name: product.name,
      productDescription: product.description,
      productImage: product.image
    })
  } catch (error) {
    console.error('Error creating product in Sanity:', error)
  }
}
However, I needed to make sure the Sanity mutations didn’t overload the server with too many requests. To address this, I used p-limit, a small library that limits the number of concurrent promises, ensuring I didn’t hit the rate limits.
const pLimit = require('p-limit')

const limit = pLimit(5) // Limit to 5 concurrent requests
const productPromises = products.map((product) =>
  limit(() => createProductInSanity(product))
)
await Promise.all(productPromises)
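For completeness, here is roughly how the whole pipeline hung together (fetch links, scrape, enrich, import), with the same concurrency cap applied to the scraping step as well. This is a sketch assuming the helpers defined above; the field-name plumbing between them is simplified:
async function runImport(priceList) {
  // 1. Collect every product link from the WordPress API
  const links = await fetchAllProductLinks()
  // 2. Scrape each product page, capped at 5 at a time to be polite to the source site
  const scraped = await Promise.all(
    links.map(({ link, name }) =>
      limit(async () => {
        const details = await scrapeProductDetails(link)
        return { name, description: details.productDescription, image: details.productImage }
      })
    )
  )
  // 3. Attach the prices extracted from the distributor's PDF
  //    (referenceCode scraping omitted here for brevity)
  const enriched = enrichWithPrices(scraped, priceList)
  // 4. Create the documents in Sanity, again with the concurrency cap
  await Promise.all(enriched.map((product) => limit(() => createProductInSanity(product))))
}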
Sanity Actions
To make the process smoother for the client (the end user), I created a couple of Sanity document actions to streamline product management. One of these actions triggers a manual scrape directly from the product document in Sanity Studio.
import { toast } from 'sonner' // toast from sonner ❤

export function findInfo(context) {
  const client = context.getClient({ apiVersion: '2022-11-29' })
  const asyncFindInfo = (props) => {
    const { published, draft } = props
    // fall back to the draft in case the product is not published yet
    const slug = draft ? draft.slug : published ? published.slug : null
    return {
      label: 'Buscar Información', // "Find Info": the client's Studio is in Spanish
      onHandle: async () => {
        try {
          if (!slug) {
            toast.error('Error: you must have a slug')
            return
          }
          const product = await fetch(
            `https://cors-anywhere.herokuapp.com/https://apple.com/${slug.current}/` // just kidding, it's not apple