Download Multiple Linked CSV files from a site - python

The sample site I am using is: http://stats.jenkins.io/jenkins-stats/svg/svgs.html
There are a ton of CSVs linked on this site. Obviously I could go through each link, click it, and download, but I know there is a better way.
I was able to put together the following Python script using BeautifulSoup but all it does is print the soup:
from bs4 import BeautifulSoup
import urllib2
jenkins = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
page = urllib2.urlopen(jenkins)
soup = BeautifulSoup(page)
print soup
Below is a sample I get when I print the soup, but I am still missing how to actually download the multiple CSV files from this detail.
<td>
<a alt="201412-jobs.svg" class="info" data-content="<object data='201412-jobs.svg' width='200' type='image/svg+xml'/>" data-original-title="201412-jobs.svg" href="201412-jobs.svg" rel="popover">SVG</a>
<span>/</span>
<a alt="201412-jobs.csv" class="info" href="201412-jobs.csv">CSV</a>
</td>

Just use BeautifulSoup to parse this webpage, get all the URLs of the CSV files, and then download each one using urllib.request.urlretrieve().
This is a one-time task, so I don't think you need anything like Scrapy for it.
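A minimal sketch of that approach (assuming Python 3, since urllib.request is mentioned; the CSV links on that page are relative, so each one is joined back onto the page URL first):

from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

page_url = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
soup = BeautifulSoup(urlopen(page_url), "html.parser")

# The CSV links on this page are relative (e.g. "201412-jobs.csv"),
# so resolve each one against the page URL before downloading.
for a in soup.find_all("a", href=True):
    href = a["href"]
    if not href.endswith(".csv"):
        continue
    csv_url = urljoin(page_url, href)
    filename = href.rsplit("/", 1)[-1]
    urlretrieve(csv_url, filename)   # saves the file next to the script
    print("downloaded", filename)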

I totally get where you're coming from, I have wanted to do the same myself. Luckily, if you are a Linux user there is a super easy way to do what you want (see the wget note below). On the web-scraper side, I'm familiar with bs4, but Scrapy is my life (sadly); as far as I recall bs4 has no real built-in way to download files without using urllib/requests, but all the same!
As to your current bs4 spider: first you should probably pick out only the links that end in .csv and extract them cleanly. I imagine it would look something like:
for link in soup.select('a[href]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.fileformatetcetc']):
        continue
This is like doing find_all but limiting the results to... well, only the ones with .csv (or whatever extension you want).
Then you would join those hrefs to the base URL (if they are incomplete/relative). If needed, you would then use the csv module to read the CSV files (from the responses, right!?) and write them out to new files.
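Roughly, carrying that snippet through to the end, it could look like this (a sketch, assuming the requests library is available; I have not checked the column layout of the Jenkins files):

import csv
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
soup = BeautifulSoup(requests.get(base).text, "html.parser")

for link in soup.select('a[href]'):
    href = link.get('href')
    if not href.endswith('.csv'):
        continue
    full_url = urljoin(base, href)                       # complete the relative link
    rows = csv.reader(requests.get(full_url).text.splitlines())
    with open(href.rsplit('/', 1)[-1], 'w', newline='') as out:
        csv.writer(out).writerows(rows)                  # write a local copy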
For the lols I'm going to create a Scrapy version.
As for that easy method... why not just use wget?
Found this, which sums up the whole CSV read/write process: https://stackoverflow.com/a/21501574/3794089

Related

Want to extract links and titles from a certain website with lxml and python but can't

I am Yasa James, 14, and I am new to web scraping.
I am trying to extract titles and links from this website.
As a so-called "Utako" and a want-to-be programmer, I want to create a program that extracts links and titles at the same time. I am currently using lxml because I can't download Selenium (limited, very slow internet, since I'm from a province in the Philippines) and I think lxml is faster than the other modules I've used.
Here's my code:
from lxml import html
import requests
url = 'https://animixplay.to/dr.%20stone'
page = requests.get(url)
doc = html.fromstring(page.content)
anime = doc.xpath('//*[@id="result1"]/ul/li[1]/p[1]/a/text()')
print(anime)
One thing I've noticed is that whenever I want to grab the value of an element from any of the divs, it gives an empty list as the output.
I hope you can help me with this my Seniors. Thank You!
Update:
I used requests-html to fix my problem and now it's working. Thank you!
The reason this does not work is that the site you're trying to fetch uses JavaScript to generate the results, which means you need something that actually executes that JavaScript, such as Selenium (or a rendering library like requests-html, as you found), if you want to scrape the HTML. Static fetching and parsing libraries like lxml and BeautifulSoup simply do not have the ability to see the result of JavaScript calls.
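A minimal Selenium sketch along those lines (assuming Chrome and a matching chromedriver are installed; the XPath is the one from the question, with the li index dropped so it grabs every entry, and it may still need tweaking against the rendered HTML):

import time
from lxml import html
from selenium import webdriver

url = 'https://animixplay.to/dr.%20stone'

driver = webdriver.Chrome()        # a real browser, so the site's JavaScript actually runs
driver.get(url)
time.sleep(5)                      # crude wait for the JS-generated results to appear
doc = html.fromstring(driver.page_source)
driver.quit()

titles = doc.xpath('//*[@id="result1"]/ul/li/p[1]/a/text()')
links = doc.xpath('//*[@id="result1"]/ul/li/p[1]/a/@href')
print(list(zip(titles, links)))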

BeautifulSoup4 output with JS Filters

Newbie here. I'm trying to scrape some sports statistics off a website using BeautifulSoup4. The script below does output a table, but it's not the specific data that appears in the browser (the data in the browser is what I'm after: goalscorer data for a single season, not the all-time records).
#import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
#specify the url
stat_page = 'https://www.premierleague.com/stats/top/players/goals?se=79'
# query the website and return the html to the variable ‘page’
page = urlopen(stat_page)
#parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
# Take out the <div> of name and get its value
stats = soup.find('tbody', attrs={'class': 'statsTableContainer'})
name = stats.text.strip()
print(name)
It appears there is some filtering of data going on behind the scenes but I am not sure how I can filter the output with BeautifulSoup4. It would appear there is some Javascript filtering happening on top of the HTML.
I have tried to identify what this specific filter is, and it appears the filtering is done here.
<div class="current" data-dropdown-current="FOOTBALL_COMPSEASON" role="button" tabindex="0" aria-expanded="false" aria-labelledby="dd-FOOTBALL_COMPSEASON" data-listen-keypress="true" data-listen-click="true">2017/18</div>
I've had a read of the below link, but I'm not entirely sure how to apply it to my answer (again, beginner here).
Having problems understanding BeautifulSoup filtering
I've tried installing, importing and applying the different parsers, but I always get the same error (Couldn't find a Tree Builder). Any suggestions on how I can pull data off a website that appears to be using a JS filter?
Thanks.
In these cases, it's usually useful to track the network requests using your browser's developer tools, since the data is usually retrieved using AJAX and then displayed in the browser with JS.
In this case, it looks like the data you're looking for can be accessed at:
https://footballapi.pulselive.com/football/stats/ranked/players/goals?page=0&pageSize=20&compSeasons=79&comps=1&compCodeForActivePlayer=EN_PR&altIds=true
It has a standard JSON format so you should be able to parse and extract the data with minimal effort.
However, note that this endpoint requires the Origin HTTP header to be set to https://www.premierleague.com in order for it to serve your request.
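A quick sketch with requests (the key names used when walking the JSON are assumptions based on how such APIs are usually shaped; check the real response in your browser's network tab):

import requests

url = ("https://footballapi.pulselive.com/football/stats/ranked/players/goals"
       "?page=0&pageSize=20&compSeasons=79&comps=1"
       "&compCodeForActivePlayer=EN_PR&altIds=true")

# The endpoint refuses requests that don't appear to come from the official site.
headers = {"Origin": "https://www.premierleague.com"}

data = requests.get(url, headers=headers).json()

# NOTE: 'stats' and 'content' are assumed key names; inspect the actual JSON
# in the network tab to confirm the structure before relying on it.
for entry in data.get("stats", {}).get("content", []):
    print(entry)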

Download with export button through Python

I am interested in downloading financial statements from the website Morningstar. Here there is an example of a page:
http://financials.morningstar.com/cash-flow/cf.html?t=PIRC&region=ita&culture=en-US
On the top right there is the export-to-CSV button, and I would like to click it with Python. Opening the element inspector, I see this HTML tag:
<div class="exportButton">
<span class="icon_1_span">
<a href="javascript:SRT_stocFund.Export()" class="rf_export">
</a>
My idea was to use bs4 - BeautifulSoup to parse (not sure at all whether I need to parse it) the page and find the button to click it. Something like:
quote_page = pageURL
page = urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
bs = soup.find(href="javascript:SRT_stocFund.Export()", attrs={"class":"rf_export"})
Obviously, this returns nothing. Do you have any suggestion on how I could tell Python to export the data in the table? I.e. to automate the process of downloading the CSV file instead of going to the webpage and doing it myself.
Thank you very much!!
With the Google Chrome extension "HTTP Trace" you can see that the export button is really just a link:
Export
That can be fetched with the requests library.
Example
I think this is the easy way (and I think that if you modify the URL parameters you can shape the exported file as you want).
Regards!!!
I would do it with Selenium WebDriver in "headless" mode. Try Selenium, it's quite easy to understand and use. :)
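A rough sketch of that idea (assuming Chrome plus chromedriver; whether a headless session actually saves the file depends on the Chrome version and its download settings, so treat this as a starting point):

import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Ask Chrome to drop downloads into the current directory (headless download
# support varies between Chrome releases, so this pref may need adjusting).
options.add_experimental_option("prefs", {"download.default_directory": os.getcwd()})

driver = webdriver.Chrome(options=options)
driver.get("http://financials.morningstar.com/cash-flow/cf.html?t=PIRC&region=ita&culture=en-US")
time.sleep(5)                                   # let the statement table render

# The export control from the question: <a class="rf_export" href="javascript:SRT_stocFund.Export()">
driver.find_element(By.CSS_SELECTOR, "a.rf_export").click()
time.sleep(5)                                   # give the download time to finish
driver.quit()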

BeautifulSoup4: Missing Parsed Table Data

I'm trying to extract the Earnings Per Share data through BeautifulSoup 4 from this page.
When I parse the data, the table information is missing using the default, lxml and HTML 5 parsers. I believe this has something to do with Javascript and I have been trying to implement PyV8 to transform the script into readable HTML for BS4. The problem is I don't know where to go from here.
Do you know if this is in fact my issue? I have been reading many posts and it's been a very big headache for me today. Below is a quick example. The financeWrap includes the table information, but beautifulSoup shows that it is empty.
import requests
from bs4 import BeautifulSoup
url = "http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US"
response = requests.get(url)
soup_key_ratios = BeautifulSoup(response.content, 'html5lib')
financial_tables = soup_key_ratios.find("div", {"id":"financeWrap"})
print financial_tables
# Output: <div id="financeWrap">
# </div>
The issue is that you're trying to get data that comes in through Ajax on the website. If you go to the link you provided and look at the source via the browser, you'll see that the data simply isn't there.
However, if you use a console manager, such as Firebug, you will see that there are Ajax requests made to the following URL, which is something you can parse via BeautifulSoup (perhaps - I haven't tried it or looked at the structure of the data).
Keep in mind that this is quite possibly against the website's ToS.
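For what it's worth, a small sketch of that idea (the endpoint URL below is a placeholder to be copied out of Firebug / the network panel, since I haven't reproduced the actual request here; also note the response may be JSON wrapping an HTML fragment rather than plain HTML):

import requests
from bs4 import BeautifulSoup

# Placeholder: paste the Ajax request URL you see in the network panel here.
ajax_url = "http://financials.morningstar.com/..."

response = requests.get(ajax_url)

# If the response is an HTML fragment, BeautifulSoup can parse it directly;
# if it is JSON, pull out the HTML field first with response.json().
soup = BeautifulSoup(response.content, "html.parser")
print(soup.get_text())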

how to get all the urls of a website using a crawler or a scraper?

I have to get many URLs from a website and then copy these into an Excel file.
I'm looking for an automatic way to do that. The website is structured with a main page containing about 300 links, and inside each link there are 2 or 3 links that are interesting for me.
Any suggestions ?
If you want to develop your solution in Python then I can recommend Scrapy framework.
As far as inserting the data into an Excel sheet is concerned, there are ways to do it directly, see for example here: Insert row into Excel spreadsheet using openpyxl in Python, but you can also write the data into a CSV file and then import it into Excel.
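For instance, a small sketch along those lines (assuming Python 3 with requests, BeautifulSoup and openpyxl installed; the main URL is a placeholder, and deciding which inner links are "interesting" is obviously site-specific):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from openpyxl import Workbook

main_url = "http://example.com/main-page"          # placeholder for your site

def links_on(url):
    # Return every absolute link found on the given page.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

wb = Workbook()
ws = wb.active
ws.append(["page", "link"])                        # header row

for page in links_on(main_url):                    # the ~300 first-level links
    for inner in links_on(page):                   # the 2-3 links inside each page
        ws.append([page, inner])

wb.save("links.xlsx")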
If the links are in the HTML... you can use Beautiful Soup. This has worked for me in the past.
import urllib2
from bs4 import BeautifulSoup
page = 'http://yourUrl.com'
opened = urllib2.urlopen(page)
soup = BeautifulSoup(opened)
for link in soup.find_all('a'):
print (link.get('href'))
Have you tried Selenium or urllib? urllib is faster than Selenium.
http://useful-snippets.blogspot.in/2012/02/simple-website-crawler-with-selenium.html
You can use Beautiful Soup for parsing:
http://www.crummy.com/software/BeautifulSoup/
More information in the docs here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
I won't suggest Scrapy because you don't need it for the work you described in your question.
For example, this code uses the urllib2 library to open the Google homepage and find all the links in that output, in the form of a list:
import urllib2
from bs4 import BeautifulSoup
data=urllib2.urlopen('http://www.google.com').read()
soup=BeautifulSoup(data)
print soup.find_all('a')
For handling excel files take a look at http://www.python-excel.org
