Copy a table from a website using selenium and flask - python

For my local sports club website it would be nice to copy and sync the rankings and results from the official league website. I'm using flask and selenium for python 3.5.
So far I'm using
driver.find_element_by_class_name("table")
to locate the tables. Is there an efficient way to store this and pass this on to the jinja templates all at once? Or do I have to store and process all the different parts of the table (header, rows, elements) separately?

As you have the information in a <table>, you should just extract it based on <tr>, <td> (and possibly <th>) and store it in a CSV or other structured file (YAML, JSON), then take the data for the jinja templates from there.
If you only update your file when the data changes, this is one of the more efficient ways. You can check, e.g. every hour, whether the input (the official league table) has changed.
This decoupling is also important for when the league site changes its markup to use e.g. <div> and <span>, and your input processing needs to be adapted.
@Will's suggestion to use BeautifulSoup is a good one, especially if the amount of data to process is large: retrieving the HTML from selenium once and processing it with BeautifulSoup is much faster. If you are not willing to invest time in that, look at using full CSS selectors (not just the class) for selecting elements in selenium (using .find_element_by_css_selector()); those are most easily translated to BeautifulSoup (using .select()) once you make the transition.
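As a rough sketch of the extract-then-store idea: the snippet below turns a table into a list of dicts that can be dumped to JSON and later handed to a template. To stay dependency-free it uses the standard library's html.parser instead of BeautifulSoup. The sample markup, the team names and the function name table_to_records are made up for illustration, and it assumes well-formed rows whose first row is the header.

```python
import json
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> in the HTML fed to the parser."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a <td> or <th>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Return one dict per data row, keyed by the cells of the header row."""
    parser = TableExtractor()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical markup standing in for driver.page_source
SAMPLE_HTML = """
<table class="table">
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Lions</td><td>12</td></tr>
  <tr><td>Tigers</td><td>9</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
snapshot = json.dumps(records, indent=2)  # write this to a file when it changes
```

In a flask view the stored records could then be passed in one go, e.g. render_template("rankings.html", rows=records), where the template name is again only an example.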

Related

Looking for the best way to automate scraping values off of a CMS to build reports

first post so go easy on me :)
The situation is that I'm trying to scrape the information off of a web based (customer) CMS (Customer-Management System) that has sales information on it to have it then get those values into excel or Google sheets to ultimately build a report, thus saving time/errors from flipping through all of them manually.
I remember using a solution (multiple tools) once that would basically go through the pages and take values from defined fields on those pages and then throw that information into columns on a sheet that we'd then manipulate manually. I'm pretty sure it was python based and (I think) used tampermonkey extension to get the information on a dev/debugger version of chrome.
The process looked something like this:
Already logged into the CMS -> Execute the tool/script that would then automatically open an order in a new window
It'd then go through that order and take values from specific fields and then copy those values in a sheet
It'd then close the window and proceed on to the next order in the specified range
Once it completes the specified (date) range, the columns would be something like salesperson / order number / sale amount / attachment amount / etc - to then be manually manipulated, no further automation needed (beyond the formulas in the sheet)
Anyone have any ideas on how to get this done or any guides anyone knows of for this specific type of task? Trying to automate this as much as possible - Thanks in advance.
Python should be a good choice as it provides you with many different tools. Depending on the functionality of the CMS you can choose different packages.
Simple HTML scraping
For simple scraping of static HTML content scrapy or Beautiful Soup should be enough.
Scraping including executable content
For these cases you can use Selenium, which you can combine with Beautiful Soup. More details can be found in this related question and this one.
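Whichever tool ends up fetching the orders, the last step described in the question (one row per order, ready for Excel or Google Sheets) can be sketched with the standard csv module. The column names and the sample orders below are assumptions made up for illustration, not fields of any real CMS.

```python
import csv
import io

# Hypothetical column layout, loosely following the question
COLUMNS = ["salesperson", "order_number", "sale_amount", "attachment_amount"]


def orders_to_csv(orders, out):
    """Write one CSV row per scraped order to the file-like object `out`."""
    # extrasaction="ignore" silently drops scraped keys that are not in COLUMNS
    writer = csv.DictWriter(out, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for order in orders:
        writer.writerow(order)


# Made-up values standing in for what the scraper collected
orders = [
    {"salesperson": "Kim", "order_number": "1001",
     "sale_amount": "250.00", "attachment_amount": "40.00"},
    {"salesperson": "Lee", "order_number": "1002",
     "sale_amount": "99.50", "attachment_amount": "0.00", "notes": "dropped"},
]

buffer = io.StringIO(newline="")
orders_to_csv(orders, buffer)
csv_text = buffer.getvalue()
```

Writing to a real file works the same way with open("orders.csv", "w", newline=""); both spreadsheet programs can import the result directly.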

Web scraping: finding element after a DOM Tree change

I am relatively new to web scraping/crawlers and was wondering about 2 issues in the event where a parsed DOM element is not found in the fetched webpage anymore:
1- Is there a clever way to detect if the page has changed? I have read that it's possible to store and compare hashes but I am not sure how effective it is.
2- In case a parsed element is not found in the fetched webpage anymore, if we assume that we know that the same DOM element still exists somewhere in the DOM Tree in a different location, is there a way to somehow traverse the DOM Tree efficiently without having to go over all of its nodes?
I am trying to find out how experienced developers deal with those two issues and would appreciate insights/hints/strategies on how to manage them.
Thank you in advance.
I didn't see this in your tag list so I thought I'd mention this before anything else: a tool called BeautifulSoup, designed specifically for web-scraping.
Web scraping is a messy process. Unless there's some long-standing regularity or direct relationship with the web site, you can't really rely on anything remaining static in the web page - certainly not when you scale to millions of web pages.
With that in mind:
There's no one-size-fits-all solution. Some ideas:
Use RSS, if available.
Split your scraping into crude categories where some categories have either implied or explicit timestamps (eg: news sites) you can use to trigger an update on your end.
You already mentioned this, but hashing works quite well and is relatively cheap in terms of storage. Another idea here is to hash not the entire page but only the dynamic parts or the elements of interest.
Fetch HEAD, if available.
Download and store previous and current version of the files, then use a utility like diff.
Use a 3rd party service to detect a change and trigger a "refresh" on your end.
Obviously each of the above has its pros and cons in terms of processing, storage, and memory requirements.
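To make the hashing idea a bit more concrete, here is a minimal sketch that fingerprints only the fragment of interest and compares it with the value stored on the previous run. The function names and the sample fragments are illustrative, and collapsing whitespace before hashing is an assumption about what should count as "unchanged".

```python
import hashlib


def fingerprint(fragment):
    """Return a SHA-256 hex digest of the fragment, ignoring whitespace layout."""
    normalized = " ".join(fragment.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(fragment, previous_fingerprint):
    """True when the fragment no longer matches the stored fingerprint."""
    return fingerprint(fragment) != previous_fingerprint


# Made-up fragments: the same content re-indented, then a real content change
stored = fingerprint("<td>Lions</td> <td>12</td>")
same_content = "<td>Lions</td>\n   <td>12</td>"
new_content = "<td>Lions</td> <td>15</td>"

unchanged_detected = not has_changed(same_content, stored)
change_detected = has_changed(new_content, stored)
```

Only the digest has to be kept between runs, which is why this is cheap in storage compared with keeping full copies for a diff.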
As of version 4.x, BeautifulSoup can use different HTML parsers, including lxml. BeautifulSoup itself does not support XPath, though, so for XPath queries use lxml directly. Either way, this will be more efficient than traversing the entire tree manually in a loop.
An alternative (and likely an even more efficient one) is to use CSS selectors. These are more flexible because they don't depend on the content being in the same place; of course this assumes the content you're interested in retains its CSS attributes.
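A small sketch of why matching on an attribute survives layout changes where a positional path does not. It uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup; treat it as a stand-in for lxml or BeautifulSoup's .select(), which cope with real-world HTML. The markup and the id "price" are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page: the <span> moved deeper into the tree
OLD_LAYOUT = "<html><body><div><span id='price'>42</span></div></body></html>"
NEW_LAYOUT = (
    "<html><body><section><div><p>"
    "<span id='price'>42</span>"
    "</p></div></section></body></html>"
)


def find_price(markup):
    """Locate the element by its id attribute, wherever it sits in the tree."""
    root = ET.fromstring(markup)
    node = root.find(".//span[@id='price']")
    return None if node is None else node.text


def find_price_by_position(markup):
    """Brittle variant: a fixed path that only matches the old layout."""
    root = ET.fromstring(markup)
    node = root.find("./body/div/span")
    return None if node is None else node.text


old_by_id = find_price(OLD_LAYOUT)
new_by_id = find_price(NEW_LAYOUT)
new_by_position = find_price_by_position(NEW_LAYOUT)
```

The attribute-based lookup keeps working after the move, while the positional path comes back empty, which is the point made about selectors above.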
Hope this helps!

Getting potentially large amounts of data from a website: Should I use Scrapy or urllib2?

I'm not new to programming—but am (very) new to web-scraping. I'd like to get data from this website in this manner:
Get the team-data from the given URL and store it in some text file.
"Click" the links of each of the team members and store that data in some other text file.
Click various other specific links and store data in its own separate text file.
Again, I'm quite new to this. I have tried opening the specified website with urllib2 (in hopes of being able to parse it with BeautifulSoup), but opening it resulted in a timeout.
Ultimately, I'd like to do something like specify a team's URL to a script, and have said script update associated text files of the team, its players, and various other things in different links.
Considering what I want to do, would it be better to learn how to create a web-crawler, or directly do things via urllib2? I'm under the impression that a spider is faster, but will basically click on links at random unless told to do otherwise (I do not know whether or not this impression is accurate).

How do I scrape a specific table from a web page and display it in Excel? The table goes horizontally?

I am trying to scrape information from the tables at this website >>Here<<
I want to be able to get the scores when I want, I want to be able to get it and export it into Excel, also, I would like the data to come under the hole no. as well. The data that I want is wrapped in a <table> tag with a class of "scoreboard", that is the bit that I want. I would also like the players name.
Is this possible, if so how?
Please answer.
Excel has its own import data from website feature. It has a nice GUI and can let you easily make dynamic web queries in your excel sheet so that the data will update every time you open the book. This might be the easiest and most efficient way for you to go.
Scrapy is much better for use in Python, especially for larger projects, but if you're going to put the data back into Excel it might not be worth the extra effort.
Check out the official Excel docs on creating dynamic web queries here.
You might wanna take a look at Scrapy. It's a web scraper framework written in Python. It's powerful and easily extensible and customizable.

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of javascript on it, and I'm interested in parsing through the javascript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script',{'type' : 'text/javascript'})
which yields an array of scripts, of which I can use a for loop to search for text I'd like:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering if there was a better way for me to just use regex to find the pattern in the soup itself, searching only through the scripts themselves? My implementation works, but it just seems really clunky so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites keep their client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
Another option is to check with your browser what sort of AJAX requests the application makes. Quite often these also return structured data in JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of web sites include this to improve a page's SEO and to make it integrate better with social network sharing.
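The first suggestion, pulling a JSON literal out of a script block, might look roughly like the sketch below. The script text and the variable name pageData are invented for illustration, and it assumes the embedded value is strict JSON (double-quoted keys, no trailing commas), which is not true of every JavaScript object literal.

```python
import json
import re

# Hypothetical contents of one <script> element
SCRIPT_TEXT = """
var tracking = true;
var pageData = {"player": "A. Smith", "scores": [4, 3, 5]};
init(pageData);
"""

# Matches up to (and including) the whitespace before the JSON value starts
ASSIGNMENT = re.compile(r"var\s+pageData\s*=\s*")


def extract_page_data(script_text):
    """Return the decoded JSON assigned to pageData, or None if it is absent."""
    match = ASSIGNMENT.search(script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores the JavaScript that follows it
    data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return data


page_data = extract_page_data(SCRIPT_TEXT)
```

Letting the JSON decoder find the end of the value avoids writing a regex that has to balance nested braces.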
