Modern Web Scraping: Using JavaScript and Smartproxy for Data Recovery
Metacritic is a review platform covering games, movies, TV series, and music. It aggregates reviews into a single score per title, using a metric developed by the platform itself, and it also publishes user scores, which it treats as valuable for anyone seeking guidance beyond YouTube videos. The site’s mission boils down to helping consumers decide how to spend their time and money on entertainment by presenting many opinions in a clear, scored manner.
Our aim is only to capture data dynamically as a use case for a proxy infrastructure solution. The collected data will be neither sold nor altered, in keeping with the site’s terms of use.
Initial Settings
To start the project, we need a runtime for executing JavaScript locally, that is, Node.js. You can download it at https://nodejs.org/en. Keep in mind that this article was sponsored by Smartproxy, who provided the proxy infrastructure that keeps our application from being tracked or blocked.
Continuing with the development of our solution, we create a folder with the command mkdir js_scraping, move into it with cd js_scraping, and run npm init --yes. This initializes the project that will contain our files.
mkdir js_scraping
cd js_scraping
npm init --yes
We will also need a library for handling HTML (Cheerio) and another to render pages and control browser actions (Playwright). The commands for this are npm install cheerio and npm init playwright@latest. You can visit the official Playwright page at https://playwright.dev/docs/intro for more information about the framework.
Residential Proxies
In activities involving data scraping, we often face various challenges, one of which is remaining anonymous and secure. Residential proxies help here by hiding our actual location and preventing tracking.
A residential proxy is essentially a server that routes traffic through an IP address assigned by an ISP rather than by a data center. Each such address is tied to a physical location. When you connect to the Internet, your IP address reveals your approximate location, while your browser sends additional information such as headers and cookies. Residential IPs are registered by Internet service providers in public databases, allowing websites to identify the provider, network, and approximate device location. Online services typically treat traffic from residential IPs as coming from real people, unlike traffic from data center IPs. Using these services can speed up development, leaving software engineers free to focus on the solution. Many residential proxy services also rotate the IP automatically, which lets us worry less about restrictions.
For more details, you can check out https://smartproxy.com/blog/what-is-a-residential-proxies-network.
Smartproxy and Proxy Setup
Smartproxy is a provider of proxy solutions and internet data collection, with one of their primary objectives being to help unlock publicly available data. In addition, they offer user-friendly proxy management tools accessible to anyone.
They primarily offer residential proxies, meaning the IPs provided are from real devices in homes, as opposed to data centers. A major advantage of rotating proxies is that they change the IP address with every request or after a specific time interval. This is beneficial for many tasks, such as web scraping, as it prevents the scraper from being detected or blocked.
They have an extensive network of proxies across many countries and cities, allowing users to specify geographic locations for their online activities. They provide an intuitive control panel where users can manage their plans, view usage statistics, and adjust other settings. The company is also known for offering comprehensive documentation and customer support, assisting users in effectively setting up and utilizing their services.
To create an account, simply visit Smartproxy, sign up, select the product that best matches your business profile, and get to work. For our example, we chose the Residential Proxies with 5GB of traffic, which is more than sufficient for our purposes and offers good value for money.
Smartproxy offers flexible pricing plans, allowing you to pay as you go. This is especially useful when we need a more scalable architecture.
After logging in, we reach a dashboard with two areas: one for managing the proxy username and password, and another for generating the endpoint that we will connect to from our script.
In the endpoint generator, it’s possible to define the connection method. In our case, we will use the connection via user. Additionally, we have the option to select the location, and for this example, we’ll choose Brazil. The session type selected is Rotation, which provides a new IP for each requested session in the chosen region. If the website we are gathering information from is located in Brazil or another region, it’s strategic to use an endpoint with a nearby location. There’s also the alternative of using fixed IPs for a set period, known as sticky. This option is ideal for tasks requiring more human interaction with the browser or that take longer to complete.
A highly intuitive feature on the dashboard is the integration with other languages, accessible via the code examples button. For more details, they have excellent documentation with very clear explanations about other products. You can check it out here: https://help.smartproxy.com/docs/residential-authentication-methods.
Browser Instance Creation
In the previous section, we delved into the workings of the configuration dashboard. Now, we will begin building our scraper. To do this, we will set up the proxy by creating a JavaScript file named config.js. This file will contain three variables that will be exported as a module. For security reasons, the username and password details have been hidden.
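As a rough illustration, config.js might look like the sketch below. The three variable names (server, username, password), the environment variable names, and the placeholder endpoint are all assumptions made for this example; the real endpoint is the one produced by the endpoint generator in the dashboard.

```javascript
// config.js - proxy settings shared by the rest of the scraper.
// Assumption: credentials are read from environment variables so they
// never end up in version control. All names below are illustrative.
const server = process.env.PROXY_SERVER || 'http://proxy.example.com:7000'; // placeholder endpoint
const username = process.env.PROXY_USERNAME || 'YOUR_USERNAME';
const password = process.env.PROXY_PASSWORD || 'YOUR_PASSWORD';

// Export the three values as a module.
module.exports = { server, username, password };
```

Keeping the credentials outside the source file is one way to respect the "hidden for security reasons" constraint mentioned above.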
Let’s proceed with creating a file named instancePlaywright.js, which will serve as our default browser. This file will also be exported as a module, allowing us to use it in subsequent steps. Within it, we will import the Playwright and the proxy configuration module we mentioned earlier.
We will use default browser switch arguments, which are very common in the world of data scraping, as shown in the image below.
We’ve now created our dictionary with the browser arguments as well as the proxy configuration. If the useProxy parameter is true, the proxy is activated; otherwise the browser connects directly.
Finally, we’ve set up the remaining configurations to dictate the behavior of our default browser. Our function will return the variables “browser”, “page”, “code”, and “html” that we plan to use. In conclusion, we’ll export all of this as a module.
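The steps described for instancePlaywright.js can be sketched roughly as follows. This is not the original file: the specific Chromium switches, the buildLaunchOptions helper, the placeholder proxy values, and the (url, useProxy) parameter order are assumptions. Only the returned names (browser, page, code, html) come from the text.

```javascript
// instancePlaywright.js - a sketch of a reusable browser factory.
// In the article the proxy settings are imported from config.js; they are
// inlined here with placeholder values to keep the example self-contained.
const proxyConfig = {
  server: process.env.PROXY_SERVER || 'http://proxy.example.com:7000',
  username: process.env.PROXY_USERNAME || 'YOUR_USERNAME',
  password: process.env.PROXY_PASSWORD || 'YOUR_PASSWORD',
};

// Chromium switches often seen in scraping setups (illustrative selection).
const browserArgs = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-blink-features=AutomationControlled',
];

// Build the launch options; the proxy is attached only when useProxy is true.
function buildLaunchOptions(useProxy) {
  const options = { headless: true, args: browserArgs };
  if (useProxy) options.proxy = proxyConfig;
  return options;
}

// Open `url` and return the four values mentioned in the text.
async function instancePlaywright(url, useProxy = false) {
  // Required lazily so buildLaunchOptions works without Playwright installed.
  const { chromium } = require('playwright');
  const browser = await chromium.launch(buildLaunchOptions(useProxy));
  const page = await browser.newPage();
  const response = await page.goto(url, { waitUntil: 'domcontentloaded' });
  const code = response ? response.status() : null; // HTTP status, if any
  const html = await page.content();
  return { browser, page, code, html };
}

module.exports = { instancePlaywright, buildLaunchOptions };
```

The caller is responsible for closing the returned browser with browser.close() once it is done with the page.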
Below is our code that will utilize our browser instance.
Testing Using the Smartproxy Service
Before we begin building the data scraper, it’s crucial to test and ensure that the configured proxy is operating correctly. For this, we’ve created a function that will use cheerio and the browser we pre-configured earlier. This function aims to access the website https://api.ipify.org/?format=json, which will provide us with the current IP address, and will return the HTML in a dictionary format. Below, an image simplifies the code for illustration.
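A hedged sketch of such a check is shown here. To stay dependency-free it swaps Cheerio for a small regular expression that pulls the JSON payload out of the HTML, and it receives the browser factory as an argument (openPage, a hypothetical name standing in for the pre-configured browser function). Both choices are assumptions of this example, not the original code.

```javascript
// Pull the "ip" field out of the page HTML returned by the browser.
// ipify answers with a small JSON document such as {"ip":"203.0.113.7"},
// which the browser wraps in HTML, so we look for the first {...} block.
function parseIpFromHtml(html) {
  const match = html.match(/\{[^{}]*\}/);
  if (!match) throw new Error('No JSON payload found in the page');
  return JSON.parse(match[0]).ip;
}

// `openPage(url, useProxy)` must resolve to { browser, code, html }.
// It is injected so the sketch does not depend on one implementation.
async function checkIp(openPage, useProxy) {
  const target = 'https://api.ipify.org/?format=json';
  const { browser, code, html } = await openPage(target, useProxy);
  try {
    return { status: code, ip: parseIpFromHtml(html) };
  } finally {
    await browser.close(); // always release the browser
  }
}
```

Calling checkIp once with useProxy set to false and once with it set to true should show two different addresses if the proxy is wired correctly.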
Running this script in the terminal using the VPN service first, we get the following result.
Querying this IP address we have its location below:
To use the proxy services we’ve implemented, we need to set the useProxy parameter to true, as explained in the image below. It takes two parameters and returns four variables.
Executing our script we have the following IPs using the Rotation configuration defined in the Smartproxy dashboard.
Both addresses are located in the USA but in different cities. It’s worth noting that while the region string should be specified, the username and password will remain constant.
Using residential proxies offers several advantages. The first is the freedom not to worry about network infrastructure when running applications. The second benefit is their legitimate appearance since the IPs come from real devices, making them less susceptible to blocks compared to IPs originating from data centers, as previously mentioned.
Rotation is also a feature, as with each request, we’ll get a new IP address or one for a specific period. Since residential connections are spread worldwide, you can obtain proxies from various geographical locations. Residential proxies are widely used for activities that require a more “human” appearance, that is, common usage. We must also mention that some online services have stricter restrictions on traffic from data center proxies. Therefore, preferring residential proxies might be a way to bypass these constraints. Generally, residential proxies tend to be pricier due to their nature and perceived advantages. However, depending on the volume of information you’re seeking, the investment might be worth it rather than facing a permanent ban. And our partner for this article has great plans to start your data extraction journey.
In the next topic, we’ll build our data scraper and finally begin the information extraction.
Data Scraper Construction
Now that we have our proxy infrastructure set up, we can proceed with the development of the application that will carry out the data collection.
Analyzing the Metacritic site, we need to think about how to gather this information. The site’s structure is defined by pages, and in our application’s internal architecture, we will need to use the technique of navigating to the next page as long as the button exists. Another challenge is closing the popup that accepts the cookie policy.
We also need to identify the collection objects, such as the title, game platform, publication date, description, and review score. The image below highlights what we just mentioned, marked in red.
Upon accessing the developer tools, we can retrieve the selector objects to obtain information from the tags we need. This is where JavaScript shines, especially in terms of performance. Being a language that interacts directly with the browser greatly aids in development.
The table below displays the main target tags and the description each one represents. You can access the developer tools and check each of these tags using the browser’s console.
Let’s establish some essential functions to handle recurring situations. One of these functions is designed to make our algorithm identify a button and, if it’s present, click on it repeatedly until it disappears. Thus, we developed the function called nextPage, which takes a page object and two selectors, initial and final. Essentially, this function checks for the presence of a button that leads to the next page. If the button is identified, the algorithm will press it until it’s no longer available; otherwise, an error will be returned. There are multiple approaches to this challenge, but we chose this one. The following code illustrates our choice.
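One possible shape for this pagination helper is sketched below. It is deliberately simplified: it takes a single selector instead of the two described above, and it returns false rather than raising an error when the button is missing. The forEachPage wrapper and its maxPages cap are additions made for this example.

```javascript
// Click the "next page" button if it exists.
// Returns true when a click happened and false when the button is gone,
// so the caller can loop until the last page.
async function nextPage(page, nextSelector) {
  const button = await page.$(nextSelector); // null when nothing matches
  if (!button) return false;
  await button.click();
  await page.waitForLoadState('domcontentloaded'); // let the new page load
  return true;
}

// Visit every page, calling `onPage` once per page before moving on.
// `maxPages` is a safety cap against endless loops.
async function forEachPage(page, nextSelector, onPage, maxPages = 1000) {
  let visited = 0;
  do {
    await onPage(page);
    visited += 1;
  } while (visited < maxPages && (await nextPage(page, nextSelector)));
  return visited;
}
```

Returning a boolean keeps the stop condition in the caller, which makes it easy to run the extraction step once per page.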
Another important issue is that every time we access the page, a popup appears requesting that we accept the terms. However, this isn’t useful for us, so we’ll create a function that, based on a selector, identifies the popup and closes it immediately. The code is provided below.
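A minimal sketch of a popup-closing helper, assuming the popup exposes a clickable accept or close button that can be targeted by a selector. The function name closePopup and the 5-second default timeout are choices made for this example.

```javascript
// Wait briefly for the popup button and click it if it shows up.
// Returns true when the popup was closed, false when it did not appear
// within the timeout or could not be clicked.
async function closePopup(page, selector, timeout = 5000) {
  try {
    const button = await page.waitForSelector(selector, { timeout });
    if (!button) return false;
    await button.click();
    return true;
  } catch (error) {
    // waitForSelector rejects on timeout: treat that as "no popup".
    return false;
  }
}
```

Because a missing popup is not treated as a failure, the same call can be made on every page load without interrupting the scraper.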
Two more helper functions deal with identifying and storing the collected items. The first assigns a hash ID to each downloaded item, using Node.js’s built-in crypto module.
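A hashing helper along these lines could look like the sketch below. The 40-character ids in the sample output are consistent with SHA-1, but the algorithm, the function name generateId, and the choice of input fields are assumptions of this example.

```javascript
const crypto = require('node:crypto');

// Build a deterministic ID from one or more string fields.
// The same inputs always produce the same 40-character hex digest.
function generateId(...parts) {
  return crypto.createHash('sha1').update(parts.join('|')).digest('hex');
}
```

For instance, generateId(item.title, item.platform) would give the same game on two platforms two distinct IDs.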
The second function, named appendToJsonFile, is dedicated to data recording. It takes on the responsibility of checking for the existence of the file and analyzing its content. In the face of an error, we have the choice to either abandon the function or proceed and overwrite the file. In this example, I chose to continue. Subsequently, we’ll append the data files to our array, and ultimately save the information to the file, updating it.
Below, you can review the implementation of these two functions.
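The recording logic described above can be sketched as follows, assuming the file holds a single JSON array. The synchronous fs calls and the warning message are choices made for this example rather than details taken from the original implementation.

```javascript
const fs = require('node:fs');

// Append `newItems` to the JSON array stored in `filePath`.
// If the file is missing, unreadable, or not a JSON array, we continue
// with an empty list and overwrite the file, as chosen in the text.
function appendToJsonFile(filePath, newItems) {
  let items = [];
  if (fs.existsSync(filePath)) {
    try {
      const parsed = JSON.parse(fs.readFileSync(filePath, 'utf8'));
      if (Array.isArray(parsed)) items = parsed;
    } catch (error) {
      console.warn(`Could not parse ${filePath}, starting fresh: ${error.message}`);
    }
  }
  items.push(...newItems);
  fs.writeFileSync(filePath, JSON.stringify(items, null, 2));
  return items.length; // total number of stored items
}
```

Rewriting the whole file on each call is simple but grows slower as the file gets larger; for a few thousand records it is usually acceptable.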
Let’s conclude with the function responsible for extracting the data. First, we establish variables that instantiate the browser, incorporating the function to close a popup if it emerges, and loading its HTML. Using the map function, native to JavaScript, we add all the tags specified in the tag table and then return a dictionary containing this data. This information is grouped into a list of nodes. Here, we also integrate the nextPage function, which will continue to click until the next button is no longer detected. This structure allows us to gather information from each page, continuously adding to our JSON file.
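The mapping step could be sketched as below. Every selector here is a placeholder invented for this example (the real ones come from the tag table and the developer tools), and the function receives an already loaded Cheerio instance, for example the result of cheerio.load(html), instead of loading the HTML itself.

```javascript
// Placeholder selectors: replace them with the real ones from the page.
const selectors = {
  item: '.game-item',
  title: '.game-title',
  link: 'a.game-link',
  date: '.game-date',
  platform: '.game-platform',
  score: '.game-score',
  summary: '.game-summary',
};

// `$` is a Cheerio instance for the current page.
// Returns one plain object per matched item node.
function extractGames($, sel = selectors) {
  return $(sel.item)
    .map((_, element) => {
      const node = $(element);
      const text = (selector) => node.find(selector).text().trim();
      return {
        title: text(sel.title),
        url: node.find(sel.link).attr('href') || '',
        date: text(sel.date),
        platform: text(sel.platform),
        score: text(sel.score),
        summary: text(sel.summary),
      };
    })
    .get(); // convert the Cheerio collection into a plain array
}
```

The resulting array can be given an ID per item and passed to the file-writing helper once per page, before moving on to the next page.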
Below, you’ll see the complete code, encompassing all the mentioned functions. Although straightforward, this code is sturdy for extracting data from the targeted page and adaptable for production environments.
Results
In this section, we will delve into the results obtained and our reflections regarding the service used.
Within the Smartproxy dashboard, we can see some statistics for the residential proxy. For our tests, we purchased a 5 GB plan; our total consumption was 0.34 GB, which the dashboard reports as 6.76% of the plan. In terms of transfers (download/upload), there were 0.338 GB of downloads and 0.018 GB of uploads, across a total of 10,566 requests on the day the script was run. These numbers depend on the algorithm’s design and on what we aim to retrieve. Hence, it’s vital to sketch out the solution before jumping into coding, and to understand the problem and break it down into parts on the way to the final solution.
The image below displays the results in detail regarding consumption.
We can view the consumption by targets and see which websites are utilizing our scrapers. Besides the Metacritic site, other platforms appear to consume more gigabytes, which is interesting to evaluate. Many sites might be laden with SEO marketing applications, slowing down the data collection process. Hence, we should pay close attention to these aspects. Below, we can see in detail what we just discussed.
For our IP rotation settings, we can see in the images below that they were successful, and with each request, the addresses were automatically switched by the Smartproxy platform, anonymizing our addresses. It’s as if each time we accessed a different page, it would be through a different IP. This is the advantage of having a tool to perform this kind of operation.
In conclusion, we can see a small sample of the data obtained. Various use cases can be applied here, such as Machine Learning with game classification and evaluation algorithms on the various platforms analyzed. This will not be covered here, as it is not the focus of this article, but with the data obtained, we can use the information strategically in such a competitive market.
For more information, you can consult the GitHub repository with the code.
{
"title": "The Legend of Zelda: Ocarina of Time",
"url": "/game/nintendo-64/the-legend-of-zelda-ocarina-of-time",
"date": "November 23, 1998",
"platform": "Nintendo64",
"score": "99",
"summary": "As a young boy, Link is tricked by Ganondorf, the King of the Gerudo Thieves. The evil human uses Link to gain access to the Sacred Realm, where he places his tainted hands on Triforce and transforms the beautiful Hyrulean landscape into a barren wasteland. Link is determined to fix the problems he helped to create, so with the help of Rauru he travels through time gathering the powers of the Seven Sages.",
"ids": "9a4ed74f8e8d093fd24e57372040fff6ab6d4277"
},
{
"title": "Tony Hawk's Pro Skater 2",
"url": "/game/playstation/tony-hawks-pro-skater-2",
"date": "September 20, 2000",
"platform": "PlayStation",
"score": "98",
"summary": "As most major publishers' development efforts shift to any number of next-generation platforms, Tony Hawk 2 will likely stand as one of the last truly fantastic games to be released on the PlayStation.",
"ids": "c241e2c0b15d3e08d27e090144c56f9c946cba7e"
},
{
"title": "Grand Theft Auto IV",
"url": "/game/playstation-3/grand-theft-auto-iv",
"date": "April 29, 2008",
"platform": "PlayStation3",
"score": "98",
"summary": "[Metacritic's 2008 PS3 Game of the Year; Also known as \"GTA IV\"] What does the American Dream mean today? For Niko Belic, fresh off the boat from Europe. It's the hope he can escape his past. For his cousin, Roman, it is the vision that together they can find fortune in Liberty City, gateway to the land of opportunity. As they slip into debt and are dragged into a criminal underworld by a series of shysters, thieves and sociopaths, they discover that the reality is very different from the dream in a city that worships money and status, and is heaven for those who have them an a living nightmare for those who don't. [Rockstar Games]",
"ids": "b7d0c26e30c44c321d724f137c8e8297750561a6"
},
{
"title": "SoulCalibur",
"url": "/game/dreamcast/soulcalibur",
"date": "September 8, 1999",
"platform": "Dreamcast",
"score": "98",
"summary": "This is a tale of souls and swords, transcending the world and all its history, told for all eternity... The greatest weapons-based fighter returns, this time on Sega Dreamcast. Soul Calibur unleashes incredible graphics, fantastic fighters, and combos so amazing they'll make your head spin!",
"ids": "cede055ef55649111e6696f7c74345475af23193"
},
{
"title": "Grand Theft Auto IV",
"url": "/game/xbox-360/grand-theft-auto-iv",
"date": "April 29, 2008",
"platform": "Xbox360",
"score": "98",
"summary": "[Metacritic's 2008 Xbox 360 Game of the Year; Also known as \"GTA IV\"] What does the American Dream mean today? For Niko Belic, fresh off the boat from Europe. It's the hope he can escape his past. For his cousin, Roman, it is the vision that together they can find fortune in Liberty City, gateway to the land of opportunity. As they slip into debt and are dragged into a criminal underworld by a series of shysters, thieves and sociopaths, they discover that the reality is very different from the dream in a city that worships money and status, and is heaven for those who have them an a living nightmare for those who don't. [Rockstar Games]",
"ids": "55035b9d61a6f9ef72f6f8f6dfd125b4ccf362bb"
},
{
"title": "Super Mario Galaxy",
"url": "/game/wii/super-mario-galaxy",
"date": "November 12, 2007",
"platform": "Wii",
"score": "97",
"summary": "[Metacritic's 2007 Wii Game of the Year] The ultimate Nintendo hero is taking the ultimate step ... out into space. Join Mario as he ushers in a new era of video games, defying gravity across all the planets in the galaxy. When some creature escapes into space with Princess Peach, Mario gives chase, exploring bizarre planets all across the galaxy. Mario, Peach and enemies new and old are here. Players run, jump and battle enemies as they explore all the planets in the galaxy. Since this game makes full use of all the features of the Wii Remote, players have to do all kinds of things to succeed: pressing buttons, swinging the Wii Remote and the Nunchuk, and even pointing at and dragging things with the pointer. Since he's in space, Mario can perform mind-bending jumps unlike anything he's done before. He'll also have a wealth of new moves that are all based around tilting, pointing and shaking the Wii Remote. Shake, tilt and point! Mario takes advantage of all the unique aspects of the Wii Remote and Nunchuk controller, unleashing new moves as players shake the controller and even point at and drag items with the pointer. [Nintendo]",
"ids": "bf33e0eadaf538352069ac7fee0ec58b4c47ff94"
},
{
"title": "Super Mario Galaxy 2",
"url": "/game/wii/super-mario-galaxy-2",
"date": "May 23, 2010",
"platform": "Wii",
"score": "97",
"summary": "Super Mario Galaxy 2, the sequel to the galaxy-hopping original game, includes the gravity-defying, physics-based exploration from the first game, but is loaded with entirely new galaxies and features to challenge players. On some stages, Mario can pair up with his dinosaur buddy Yoshi and use his tongue to grab items and spit them back at enemies. Players can also have fun with new items such as a drill that lets our hero tunnel through solid rock. [Nintendo]",
"ids": "473b5ee831ab054cf7e4754b072e0755288f462b"
},
{
"title": "Red Dead Redemption 2",
"url": "/game/xbox-one/red-dead-redemption-2",
"date": "October 26, 2018",
"platform": "XboxOne",
"score": "97",
"summary": "Developed by the creators of Grand Theft Auto V and Red Dead Redemption, Red Dead Redemption 2 is an epic tale of life in America’s unforgiving heartland. The game’s vast and atmospheric world also provides the foundation for a brand new online multiplayer experience. America, 1899. The end of the Wild West era has begun. After a robbery goes badly wrong in the western town of Blackwater, Arthur Morgan and the Van der Linde gang are forced to flee. With federal agents and the best bounty hunters in the nation massing on their heels, the gang has to rob, steal and fight their way across the rugged heartland of America in order to survive. As deepening internal fissures threaten to tear the gang apart, Arthur must make a choice between his own ideals and loyalty to the gang that raised him. [Rockstar]",
"ids": "e8b079b3c758208b603efde2b6a06050c1d07bb9"
},
{
"title": "Grand Theft Auto V",
"url": "/game/xbox-one/grand-theft-auto-v",
"date": "November 18, 2014",
"platform": "XboxOne",
"score": "97",
"summary": "Grand Theft Auto 5 melds storytelling and gameplay in unique ways as players repeatedly jump in and out of the lives of the game's three protagonists, playing all sides of the game's interwoven story.",
"ids": "c05b5d7b59ba98be4e481ede342c0d457a71a0a2"
},
{
"title": "Grand Theft Auto V",
"url": "/game/playstation-3/grand-theft-auto-v",
"date": "September 17, 2013",
"platform": "PlayStation3",
"score": "97",
"summary": "Los Santos is a vast, sun-soaked metropolis full of self-help gurus, starlets and once-important, formerly-known-as celebrities. The city was once the envy of the Western world, but is now struggling to stay afloat in an era of economic uncertainty and reality TV. Amidst the chaos, three unique criminals plot their own chances of survival and success: Franklin, a former street gangster in search of real opportunities and serious cheddar; Michael, a professional ex-con whose retirement is a lot less rosy than he hoped it would be; and Trevor, a violent maniac driven by the chance of a cheap high and the next big score. Quickly running out of options, the crew risks it all in a sequence of daring and dangerous heists that could set them up for life.",
"ids": "0ba98aa810f2d6ed6a88274c03a98754cbf83b33"
},
{
"title": "Disco Elysium: The Final Cut",
"url": "/game/pc/disco-elysium-the-final-cut",
"date": "March 30, 2021",
"platform": "PC",
"score": "97",
"summary": "Disco Elysium - The Final Cut is the definitive edition of the smash-hit RPG. Pursue your political dreams in new quests, meet and question more of the city's locals, and explore a whole extra area. Full voice-acting, controller support, and expanded language options also included. Get even more out of this award-winning open world. You're a detective with a unique skill system at your disposal and a whole city block to carve your path across. Interrogate unforgettable characters, crack murders, or take bribes. Become a hero or an absolute disaster of a human being.",
"ids": "3b2e04f114596590f9c17b51a336602f0cac5217"
},
{
"title": "Grand Theft Auto V",
"url": "/game/xbox-360/grand-theft-auto-v",
"date": "September 17, 2013",
"platform": "Xbox360",
"score": "97",
"summary": "Los Santos is a sprawling sun-soaked metropolis full of self-help gurus, starlets and once-important stars. The city was once the envy of the Western world, but is now struggling to stay relevant in an era of economic uncertainty and reality TV. Amidst the chaos, three very different criminals chart their own chances of survival and success: Franklin, a former street gangster, now looking for real opportunities and fat stacks of cash; Michael, a professional ex-con whose retirement is significantly less rosy than he hoped it would be; and Trevor, a violent maniac driven by the chance of a cheap high and the next big score. Rapidly running out of options, the crew risks everything in a series of bolt and dangerous heists that could set them up for the long haul.",
"ids": "c5b2eea5de4b607be80dd21ddaf1282af9afd268"
},
{
"title": "Tony Hawk's Pro Skater 2",
"url": "/game/dreamcast/tony-hawks-pro-skater-2",
"date": "November 6, 2000",
"platform": "Dreamcast",
"score": "97",
"summary": "Hawk's back - with new technology, new pros and new tricks! THPS2, the legend rides on! Skate as legendary Tony Hawk or any one of 12 other pro skaters. Create your own custom skaters. Multiple play modes including 1-Player, Career and Free Skate modes, as well as 2-player modes such as Trick Attack, Graffiti Tag and Horse. Build your own custom skate parks with the real-time 3D park editor.",
"ids": "79878c5ab307a6f2fc8cf648c6f7ef32d6931965"
},
{
"title": "The Legend of Zelda: Breath of the Wild",
"url": "/game/switch/the-legend-of-zelda-breath-of-the-wild",
"date": "March 3, 2017",
"platform": "Switch",
"score": "97",
"summary": "Forget everything you know about The Legend of Zelda games. Step into a world of discovery, exploration and adventure in The Legend of Zelda: Breath of the Wild, a boundary-breaking new game in the acclaimed series. Travel across fields, through forests and to mountain peaks as you discover what has become of the ruined kingdom of Hyrule in this open-air adventure. Explore the wilds of Hyrule any way you like - Climb up towers and mountain peaks in search of new destinations, then set your own path to get there and plunge into the wilderness. Along the way, you'll battle towering enemies, hunt wild beasts and gather ingredients for the food and elixirs you'll need to sustain you on your journey. More than 100 Shrines of Trials to discover and explore - Shrines dot the landscape, waiting to be discovered in any order you want. Search for them in various ways, and solve a variety of puzzles inside. Work your way through the traps and devices inside to earn special items and other rewards that will help you on your adventure.\n* Be prepared and properly equipped - With an entire world waiting to be explored, you'll need a variety of outfits and gear to reach every corner. You may need to bundle up with warmer clothes or change into something better suited to the desert heat. Some clothing even has special effects that, for example, can make you faster and stealthier.\n* Battling enemies requires strategy - The world is inhabited with enemies of all shapes and sizes. Each one has its own attack method and weaponry, so you must think quickly and develop the right strategies to defeat them.\n* amiibo compatibility - Tap the Wolf Link amiibo (sold separately) to make Wolf Link appear in game. Wolf Link will attack enemies on his own and help you find items you're searching for.",
"ids": "c9872a98e6ef0220ee26f2bfeaeceae8f9bf20f1"
},
{
"title": "Tony Hawk's Pro Skater 3",
"url": "/game/playstation-2/tony-hawks-pro-skater-3",
"date": "October 28, 2001",
"platform": "PlayStation2",
"score": "97",
"summary": "Challenge up to four friends in online competitions over a LAN or the Internet. Take them on in both Trick Attack and Graffiti modes. [Activision]",
"ids": "6ab7b42f8846bc738e80ec9133e5d34503ed9dcb"
}
Sources
https://playwright.dev/docs/intro
https://help.smartproxy.com/docs/residential-authentication-methods
https://api.ipify.org/?format=json