Rethinking my approach to scraping dynamic content with Python and Selenium

Currently I am working on a project that will scrape content from various similarly designed websites containing dynamic content. My end goal is to aggregate all this data into one application or report. I made some progress pulling the needed data from one page, but my lack of experience and knowledge in this realm has left me thinking I went down the wrong path.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
The above link is the perfect example of the type of page I will be pulling from.
In my initial attempt I was able to have the page scroll to the bottom, collecting data from the various elements along the way (plus a manual scroll), using:
cards = driver.find_elements_by_css_selector("div[class^='product-card__Content']")
This allowed me to pull all the data points I needed on the fly, minus the overarching category, which happens to be a parent element. That is something I can map manually in Excel, but I would prefer to pull it alongside everything else.
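For reference, a simplified sketch of that scroll-and-collect loop (the two-second pause is a guess, and the approach assumes cards stay in the DOM once rendered):

import time

seen = set()
while True:
    for card in driver.find_elements_by_css_selector("div[class^='product-card__Content']"):
        if card.id not in seen:  # each WebElement carries a session-unique id
            seen.add(card.id)
            print(card.text)  # grab the card's visible text as we go
    last_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy-loaded cards time to render
    if driver.execute_script("return document.body.scrollHeight") == last_height:
        break  # page height stopped growing, so we have reached the bottom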
This got me thinking that maybe I should have taken a top-down approach, rather than what I am seeing as a bottom-up approach. But no matter how hard I tried, based on the advice of others, I could not get it working as intended where I can pull the category from the parent div, due to my lack of understanding.
Based on input from others I was able to make a pivot of sorts, and using the code below I was able to get the category as well as the product name, without any need to scroll the page at all, which went against every experience I have had with this project so far - I am unclear how/why this is possible.
for product_group_name in driver.find_elements_by_css_selector("div[class^='products-grid__ProductGroupTitle']"):
    for product in driver.find_elements_by_xpath("//div[starts-with(@class,'products-grid__ProductGroup')][./div[starts-with(@class,'products-grid__ProductGroupTitle')][text()='" + product_group_name.text + "']]//div[starts-with(@class,'consumer-product-card__InViewContainer')]"):
        print(product_group_name.text, product.text)
The problem with this code, which is much quicker since it does not rely on scrolling, is that no matter how I approach it I am unable to pull the additional data points of brand and price. Obviously it is something in my approach, but it is outside my knowledge level currently.
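For reference, here is the kind of relative lookup I have been attempting, without success (the leading dot scopes each XPath to the current card; the Brand and Price class prefixes are guesses on my part, not names I have confirmed in the DOM):

for product in driver.find_elements_by_css_selector("div[class^='consumer-product-card__InViewContainer']"):
    # the leading "." scopes each search to this card instead of the whole page
    name = product.find_element_by_xpath(".//div[starts-with(@class,'product-information__Title')]").text
    brand = product.find_element_by_xpath(".//div[starts-with(@class,'product-information__Brand')]").text  # guessed class prefix
    price = product.find_element_by_xpath(".//div[starts-with(@class,'product-information__Price')]").text  # guessed class prefix
    print(name, brand, price)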
Any specific or general advice would be appreciated, as I would like to scale this into something more robust as my knowledge builds. Eventually I would like to have this scanning multiple different URLs at set points in the day. That is a long way off, but I want to make sure I start on the right path if possible. Based on what I have provided, is the top-down approach better in this case? Bottom-up? Is this subjective?
I have noticed comments about pulling the entire source code of the page and working with that. Would that be a valid approach, possibly better suited to my needs? Would it even be possible given the dynamic nature of the page?
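What I have in mind is something like this sketch, assuming Selenium has already rendered and scrolled the page (beautifulsoup4 installed separately):

from bs4 import BeautifulSoup

# hand the fully rendered DOM to BeautifulSoup for offline parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
for card in soup.select("div[class^='product-card__Content']"):
    print(card.get_text(" ", strip=True))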
Thank you.

Related

Python Report Writing with flexible templates for report body rows

I have a need to create reports from Python. Our existing reporting system for my Volunteer Fire Company is being deprecated, and it had some idiosyncrasies anyway. Everything I'm looking at that seems feasible uses a template for the report body. Before I get too far down a dead-end path trying different methods, I'd like to know if anyone knows of anything out there that can do something like this - specifically, conditional formatting of a row in the body. Below is what we get currently; today I do the bolding of the rows manually based on our criterion of 33% or better. In my code I'll obviously know if a row deserves to be bolded, but everything I'm looking at that uses a template wouldn't allow this. The end result will be a PDF, but if I have to go through Excel, Word, HTML, or whatever to get there, I'll check it out. Sorry for the non-specific question, but I could potentially churn a lot more wasted time looking for something when somebody may already know what to use. Thanks.
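To illustrate the kind of per-row logic I mean, here is a rough openpyxl sketch if I went through Excel (the data is made up; 33% is our criterion):

from openpyxl import Workbook
from openpyxl.styles import Font

rows = [("Smith", 0.41), ("Jones", 0.28), ("Brown", 0.35)]  # made-up sample data

wb = Workbook()
ws = wb.active
ws.append(["Member", "Attendance"])
for member, pct in rows:
    ws.append([member, pct])
    if pct >= 0.33:  # our criterion: bold the row at 33% or better
        for cell in ws[ws.max_row]:
            cell.font = Font(bold=True)
wb.save("report.xlsx")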

How do I scrape content from a dynamically generated page using selenium and python?

I have made many attempts, and all fail to record the data I need in a reliable and complete manner. I understand the extreme basics of Python and Selenium for automating simple tasks, but in this case the content is dynamically generated and I am unable to find the correct way to access and subsequently record all the data I need.
The URL I am looking to scrape content from is structured similar to the following:
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
In particular I am trying to grab all the info using something like:
browser.find_elements_by_xpath('//*[@id="products-container"]')
Is this the right approach? How do I access specific sub-elements of this element (and all elements along the same path)?
I have read that I might need beautifulsoup4, but I am unsure of the best way to approach this.
Would the best approach be to use XPaths? If so, is there a way to iterate through all the elements and record all the data within, or do I have to specify each and every data point that I am after?
Any assistance to point me in the right direction would be extremely helpful as I am still learning and have hit a roadblock in my progress.
My end goal is a list of all product names, prices, and any other data points that I deem relevant based on the specific exercise at hand. If I could find the correct way to access the data points, I could then store them and compare/report on them as needed.
Thank you!
I think you are looking for something like
browser.find_elements_by_css_selector('[class*="product-information__Title"]')
This should find all elements whose class contains that string.
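For example, paired with an explicit wait so the dynamically generated elements exist before you read them (the 15-second timeout is arbitrary):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until at least one matching element has been rendered
titles = WebDriverWait(browser, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="product-information__Title"]'))
)
for title in titles:
    print(title.text)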

Looking for the best way to automate scraping values off of a CMS to build reports

first post so go easy on me :)
The situation is that I'm trying to scrape information off of a web-based CMS (customer-management system) that has sales information on it, and then get those values into Excel or Google Sheets to ultimately build a report, thus saving the time and errors of flipping through all of them manually.
I remember once using a solution (multiple tools) that would basically go through the pages, take values from defined fields on those pages, and then throw that information into columns on a sheet that we'd then manipulate manually. I'm pretty sure it was Python-based and (I think) used the Tampermonkey extension to get the information in a dev/debugger version of Chrome.
The process looked something like this:
Already logged into the CMS -> execute the tool/script, which would then automatically open an order in a new window
It'd then go through that order, take values from specific fields, and copy those values into a sheet
It'd then close the window and proceed to the next order in the specified range
Once it completed the specified (date) range, the columns would be something like salesperson / order number / sale amount / attachment amount / etc., to be manipulated manually from there - no further automation needed (beyond the formulas in the sheet)
Anyone have any ideas on how to get this done or any guides anyone knows of for this specific type of task? Trying to automate this as much as possible - Thanks in advance.
Python should be a good choice, as it provides you with many different tools. Depending on the functionality of the CMS, you can choose different packages.
Simple HTML scraping
For simple scraping of static HTML content, Scrapy or Beautiful Soup should be enough.
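A minimal Beautiful Soup sketch for the static case (the URL and field selector are placeholders; this only works if the values are present in the raw HTML):

import requests
from bs4 import BeautifulSoup

# placeholder URL and selector -- substitute whatever the CMS actually uses
html = requests.get("https://cms.example.com/orders/1000").text
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("#sale-amount").get_text(strip=True))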
Scraping including executable content
For these cases you can use Selenium, which you can combine with Beautiful Soup. More details can be found in this related question and this one.
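And a rough sketch of the Selenium route applied to the order-by-order workflow described above (every URL pattern and field id here is a placeholder for whatever the CMS actually uses):

import csv
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order", "salesperson", "sale_amount"])
    for order_id in range(1000, 1010):  # placeholder order-number range
        driver.get(f"https://cms.example.com/orders/{order_id}")  # placeholder URL pattern
        soup = BeautifulSoup(driver.page_source, "html.parser")
        writer.writerow([
            order_id,
            soup.select_one("#salesperson").get_text(strip=True),  # placeholder field id
            soup.select_one("#sale-amount").get_text(strip=True),  # placeholder field id
        ])
driver.quit()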

I need a starting point to code an app to extract text from pdf to excel

To start I just want to state that I'm an Electrical Engineer with basic knowledge of programming.
My requirement is as follows:
I want to create an app where I can load and view PDF files that contain tables.
The tables in these PDFs have irregular shapes and sit in a different position on every page (that's why tools like Tabula couldn't help me).
Each table entry is multiline and of irregular dimensions. I cannot select a whole row at a time; it has to be each element alone. Simply copying the lines into Excel won't work either, because it would need a lot of formatting.
So I want to be able to select each table entry individually (like a selection or cropping box over the required text), deleting any newlines in the text and keeping just the spaces.
The generated Excel file (or Access database, I don't really mind which) should be reviewable and saveable (if those are even words XD).
I have a good knowledge of Python and a very elementary knowledge of Django, and I'm seeking an expert who can tell me what I really need to learn (and if possible where to learn it) to execute my project.
Is it too much for me to execute, and if I can dedicate 10 hours a week, how long would it take me to complete such a project?
Thanks all for your help in advance.
Don't use Python, use Word. Open the PDF in Word, then step through the Tables collection to collect the data and put it into Excel. See this for an example.
Here is the advice I can offer:
First of all, search the internet for your question:
https://lmddgtfy.net/?q=python%20library%20tabular%20pdf
-> Camelot, which is mentioned multiple times, seems to be relevant.
For working with Excel sheets, I present one of the most famous libraries for manipulating DataFrames: pandas.
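A minimal sketch of that combination (the file name is made up; given the irregular layouts described, Camelot's automatic detection may still fail, so treat this only as a starting point):

import camelot

# flavor="stream" handles tables without ruling lines
tables = camelot.read_pdf("tables.pdf", pages="all", flavor="stream")
for i, table in enumerate(tables):
    df = table.df  # each detected table is exposed as a pandas DataFrame
    df = df.replace(r"\n", " ", regex=True)  # drop embedded newlines, keep spaces
    df.to_excel(f"table_{i}.xlsx", index=False)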
Short courses on the internet will quickly give you the ability to manage your project more easily.
For the application itself, you can easily find YouTube courses where someone explains how to build a basic application; that could offer you the entry point you are talking about. Then you can just ask yourself what else you need, or simply want, to make it better.
As for the time needed, it depends on how much time you need to understand the basics and how much you spend on deeper comprehension. I think in one week, working during your free time with real interest, you could have it working (not perfect, but working, which is a good beginning).
PS: I am not sure if your question fits the aims of Stack Overflow. I suggest you read this guide: https://stackoverflow.com/help/how-to-ask

Does anyone know of a hello world website?

I'm learning a practice called 'web scraping' using Python. From what I can tell so far, the idea is to send out a request to load the site data from a server, store the DOM HTML in a variable, and then basically data mine the s*** out of the resulting string until you are able to quickly access exactly and only the information you need.
Well, I'm ready to start fiddling with statements that might help me do the actual data mining, but first I need to see and understand all of the HTML in my string. After I've got the hang of it I won't care what the HTML looks like, but right now I need to be able to reference it to properly analyze my output. So far I've tried Google, python.net, YouTube, various blogs, etc., but they all look like alienese.
I'm just looking for the typical stuff you know?
<html><head><meta><script src=""><style src=""><title></title></head><body><div class=""><img src=""></div><div><h1>my page</h1><li></li><li></li><li></li><li></li><li></li><li></li><p>click here</p></div></body></html>
You get what I'm saying? Just a website... that uses like... html... to render some simple structured data.
P.S. This is kind of neat. I went to give this post some tags and I discovered 'simple-html-dom'. So I googled it. Apparently it's some kind of library that lets you parse HTML from online sources in exactly the way I am trying to. I may check that out later, but I still want to figure out how to do this with Python.
EDIT: Actually, something like this would work fine, but it's just so big. I would prefer something smaller to work with.
While it would probably be nice to build your own web pages to use, you can also try looking for pages "optimized for lynx". Lynx is a text-only browser with which "simple" pages naturally work best.
Most of the links you'll find will be dead already, but I found this list for instance, which still has many alive and equally simple pages: http://www.put.com/dead.html (please ignore the content itself... there is no particular reason I chose this example other than that it probably works nicely for your purposes!)
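To get a readable view of whatever you pull back, a small sketch using requests and beautifulsoup4 (example.com is a deliberately tiny page published for exactly this kind of testing):

import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com").text
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())  # indented, human-readable dump of the DOM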
