Last week, AddThis and Oracle Data Cloud had the opportunity to host the June 2019 meetup of Women Who Code DC. During the meetup, Jenny Ching, one of our software engineers, gave us an informative presentation on “Enterprise Solutions for Scraping the Internet.”
Although it was a fairly advanced topic, she presented the material in an engaging, accessible way. As a developer at AddThis, I came away with a few key takeaways.
What is “internet scraping”?
First, we covered the basics of what internet scraping is:
It’s the process of extracting information from HTML web pages.
Web pages are then classified by a machine learning model using the extracted data. The same model is used to filter out inappropriate websites so that no data is extracted from them. This is an important step in the process and affects how the extracted information is used later.
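To make those two steps concrete, here is a minimal sketch in Python using only the standard library. It is not the pipeline Jenny described: the names `TextExtractor`, `extract`, `is_appropriate`, and `BLOCKED_WORDS` are made up for illustration, and the keyword check is only a simple stand-in for the machine learning model mentioned above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the page title and visible text, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        if self._in_title:
            self.title += stripped
        elif self._skip_depth == 0:
            self.text_parts.append(stripped)


def extract(html):
    """Return the title and visible text of an HTML document as a dict."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": " ".join(parser.text_parts)}


# Hypothetical blocklist: a stand-in for a real classification model.
BLOCKED_WORDS = {"casino", "gambling"}


def is_appropriate(page):
    """Flag a page as inappropriate if it contains any blocked word."""
    words = set((page["title"] + " " + page["text"]).lower().split())
    return not (words & BLOCKED_WORDS)


sample = (
    "<html><head><title>Cat Facts</title><style>p{color:red}</style></head>"
    "<body><p>Cats sleep a lot.</p><script>var x=1;</script></body></html>"
)
page = extract(sample)
print(page["title"], "|", page["text"], "|", is_appropriate(page))
```

A real system would fetch pages over the network and swap the keyword check for a trained classifier, but the shape is the same: parse the HTML, pull out the useful text, then decide what to do with the page.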
The tools behind the magic
Before the presentation, I had no idea what the process of scraping the internet looked like. Now I realize it takes a suite of tools to make it work, and each tool serves a specific purpose.
Jenny showed us an intricate diagram of how everything fits together. Most of the tools require exclusive access to servers. Internally, we use VPNs to access some parts of our development stacks, so it made sense to me that her team uses them for the same purpose.
Apart from the tools needed for the scraping service to work, there are tools surrounding the stack that are also valuable. They’re used for alerts, logging, containerization, orchestration, and continuous integration and delivery. Common applications such as Docker, Kubernetes, Kibana, Jenkins, and Git apply here.
Fun facts about internet scraping
Did you know that three million unique HTML webpages are scraped every day? That’s a lot of pages! Even with everything I had learned so far, that number was still difficult to fathom. It means the team relies on some heavy-duty technologies to accommodate all that data.
Overall, my favorite part about Jenny’s presentation was that I learned something new that was completely out of my wheelhouse. I still had follow-up questions for her after the presentation, but I was able to understand the general concepts with ease.
A diverse community
I also enjoyed the opportunity to meet and learn with other women who came from varying professional backgrounds. In the meetup, we had designers, front-end web developers, veteran back-end engineers, and even a local teacher. It just shows that when you’re passionate about something and explain it well, it can resonate with a broader audience.
Want to learn more about Women Who Code and future meetups? Check out their website here: https://www.womenwhocode.com/.
Interested in joining our team? Check out our open job listings!