I would like to parse the URL https://www.horsedeathwatch.com/index.php and dump the data into a pandas DataFrame
with columns such as horse, date, course, and cause of death.
I tried pandas read_html to read this URL directly, but it didn't find the table even though the page has a table tag.
I tried using:
import requests
from bs4 import BeautifulSoup
url = 'https://www.horsedeathwatch.com/index.php'
# Create a handle, page, to handle the contents of the website
page = requests.get(url)
# print(page.text)
soup = BeautifulSoup(page.content, 'lxml')
and then the find_all('tr') method, but for some reason I can't get it to work.
The second thing I would like to do: each horse (the first column in the web page table) has a hyperlink with additional attributes.
Any suggestions on how I can retrieve those additional attributes into a pandas DataFrame?
Looking at the site I can see the data is loaded using a POST request to /loaddata.php passing the page number. Combining this with pandas.read_html:
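import io
import requests
import pandas
# The table rows are served by a POST endpoint that takes the page number
res = requests.post('https://www.horsedeathwatch.com/loaddata.php', data={'page': '3'})
# read_html returns a list of DataFrames, one per <table> it finds
tables = pandas.read_html(io.BytesIO(res.content))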
BeautifulSoup might give you a richer data structure, though: to extract the further attributes for each horse, you would need to get each anchor element's 'href' and perform another request. That one is a GET request, and you need to parse the content of <div class="view"> in the response.
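The two-step approach described above (read the table rows, then follow each horse's link) could start from something like the sketch below. It is only a sketch under assumptions: the sample markup, the `horse.php?horseid=1` link format, and the names `RowLinkParser` and `parse_rows` are hypothetical and not taken from the site. It uses the standard library's HTMLParser so it runs without extra installs; BeautifulSoup's find_all('tr') and find('a') would do the same job.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.horsedeathwatch.com/"  # site root from the question


class RowLinkParser(HTMLParser):
    """Collect the cell texts and the first link of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows: {"cells": [...], "href": ...}
        self._row = None       # row currently being parsed
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"cells": [], "href": None}
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row["cells"].append("")
        elif tag == "a" and self._in_cell and self._row["href"] is None:
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the site root
                self._row["href"] = urljoin(BASE_URL, href)

    def handle_data(self, data):
        if self._in_cell:
            self._row["cells"][-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            if self._row["cells"]:  # skip rows without any cells
                self._row["cells"] = [c.strip() for c in self._row["cells"]]
                self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    parser = RowLinkParser()
    parser.feed(html)
    return parser.rows


# Hypothetical sample standing in for the body of the POST response.
sample = """
<table>
<tr><td><a href="horse.php?horseid=1">Example Horse</a></td><td>01/01/2020</td><td>Example Course</td><td>Fall</td></tr>
</table>
"""
rows = parse_rows(sample)
print(rows[0]["cells"], rows[0]["href"])
```

Each row's "cells" list can be passed to pandas.DataFrame, and each "href" is the address to fetch with requests.get when you want the extra attributes for that horse.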
I need to get the data from a toll table in Argentina. The information is on the website whose URL appears in the code below.
I am trying to write code that pulls the information inside the "tbody", "tr", and "td" elements so that I can build a list and then a DataFrame from it, but every time I try to read the "tbody" I get very little data back. I cannot get the table rows at all. Here is the code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.argentina.gob.ar/obras-publicas/vialidad-nacional/institucional/informacion-publica/tarifas-de-peajes'
page = requests.get(url)
soup = BeautifulSoup(page.content,'html.parser')
rows = soup.find('table',attrs={"class":"table table-spaced table-align-middle table-mobile"}).find_all('tbody')
rows
This code returns the following:
[<tbody class="list" id="show-data"></tbody>]
I've tried fetching the "tr" and "td" elements directly but can't fetch the rows. Does anyone know what I could be doing wrong?
Thank you very much in advance!
I'm very new to Python and am attempting my first web scraping project. I'm attempting to extract the data following a tag within an XML data source. I've attached an image of the data I'm working with. My issue is that no matter what tag I try to extract, I get no results. I am able to return the entire data source, so I know the connection is not the issue.
My ultimate goal is to loop through all of the data and return the data following a particular tag. I think if I can understand why I'm unable to print a singular particular tag I should be able to figure out how to loop through all of the data. I've looked through similar posts but I think the tree in my set of data is particularly troublesome (that and my inexperience).
My Code:
from bs4 import BeautifulSoup
import requests
#Assign URL to scrape
URL = "http://api.powertochoose.org/api/PowerToChoose/plans?zip_code=78364"
#Fetch the raw HTML Data
Data = requests.get(URL)
Soup = BeautifulSoup(Data.text, "html.parser")
tags = Soup.find_all('fact_sheet')
print (tags)
Check the response of your example first: it is JSON, not XML, so no BeautifulSoup is needed here. Simply iterate over the data list to pick out your fact_sheets:
for plan in Data.json()['data']:
    print(plan['fact_sheet'])
Out:
https://rates.cleanskyenergy.com:8443/rates/DownloadDoc?path=a70e9298-5537-481a-985c-c7a005b2e4f3.html&id_plan=223344
https://texpo-prod-api.eroms.works/api/v1/document/ViewProductDocument?type=efl&rateCode=SRCPLF24PTC&lang=en
https://www.txu.com/Handlers/PDFGenerator.ashx?comProdId=TCXSIMVL1212AR&lang=en&formType=EnergyFactsLabel&custClass=3&tdsp=AEPTCC
https://signup.myvaluepower.com/Home/EFL?productId=32653&Promo=16410
https://docs.cloud.flagshippower.com/EFL?term=36&duns=007924772&product=galleon&lang=en&code=FPSPTC2
...
As you've already realized by now, you're getting the data as json, so doing something like:
fact_sheet_links = [d['fact_sheet'] for d in Data.json()['data']]
would get you the data you want.
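As a slightly more defensive variant of the list comprehension above, the sketch below skips entries whose fact_sheet is missing or empty. The payload is a hypothetical stand-in shaped like the response described here (a top-level "data" list); the `plan_id` field, the example.com URL, and the function name `fact_sheet_links` are made up for illustration.

```python
import json

# Hypothetical payload: a "data" list whose items may carry a "fact_sheet" URL.
payload = json.loads("""
{"data": [
  {"plan_id": 1, "fact_sheet": "https://example.com/efl/1"},
  {"plan_id": 2, "fact_sheet": null},
  {"plan_id": 3}
]}
""")


def fact_sheet_links(response_json):
    """Return every non-empty fact_sheet value, skipping missing ones."""
    return [
        plan["fact_sheet"]
        for plan in response_json.get("data", [])
        if plan.get("fact_sheet")
    ]


links = fact_sheet_links(payload)
print(links)
```

With the real request you would call fact_sheet_links(Data.json()) instead of passing the sample payload.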
But also, if you'd prefer to work with the xml, you can add headers to the request:
Data = requests.get(URL, headers={ 'Accept': 'application/xml' })
and get an xml response. When I did this, Soup.find_all('fact_sheet') still did not work (although I've seen this method used in some tutorials, so it might be a version problem - and it might still work for you), but it did work when I used find_all with lambda:
tags = Soup.find_all(lambda t: 'fact_sheet' in t.name)
and the results after altering your code looked like this. That just gives you the tags though, so if you want a list of the contents instead, one way would be to use list comprehension:
fact_sheet_links = [t.text for t in tags]
so that you get them like this.
To preface: the website I've been trying to scrape seems to use JavaScript (I'm unsure about the jargon for web development and the like), and I've had varying success scraping different tables on different pages.
For instance on this page: http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic I was easily able to 'inspect element' then go to Network find the correct 'Name' of the script and then find the Request URL I needed to get the table that I wanted. The code I used for this was:
url = 'http://www.minorleaguesplits.com/tennisabstract/cgi-bin/frags/NovakDjokovic.js'
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find('table', id='tour-years', attrs= {'class':'tablesorter'})
dfs = pd.read_html(str(table))
df = pd.concat(dfs)
However, now when I'm looking at a different page on the same site, say this one http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html, I'm unable to find the Request URL that will allow me to eventually get the table that I want. I repeat the same process as I did above, but there's no .js script under the Network tab that has the table. I do see the table when I'm looking at the html elements, but of course I can't get it without the correct url.
So my question would be, how can I get the table from this page http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html ?
TIA!
Looking at the source of the HTML page, you can see that all the data is already loaded in the script tag. All you need to do is extract the variable values and load them into BeautifulSoup.
The following code gets the part of the script tag that holds the variables and their values:
import requests, re
from bs4 import BeautifulSoup
res = requests.get("http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html")
soup = BeautifulSoup(res.text, "lxml")
script = soup.find("script", attrs={"language":"JavaScript"}).text
var_only = script[:script.index("$(document)")].strip()
Next you can use regex to get the variable values - https://regex101.com/r/7cE85A/1
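As a rough illustration of that regex step, here is a minimal sketch. It assumes each variable is declared as `var name = '...';` with a single-quoted value; that layout, the sample text, and the variable names `serve` and `rally` are assumptions for the example, not copied from the page or from the linked pattern.

```python
import re

# Hypothetical stand-in for the `var_only` text extracted from the script tag.
var_only = """
var serve = '<table id="serve"><tr><td>1st</td></tr></table>';
var rally = '<table id="rally"><tr><td>short</td></tr></table>';
"""

# Group 1 is the variable name, group 2 is everything between the quotes.
# The non-greedy (.*?) stops at the first "';", so a value that itself
# contains "';" would be cut short.
pattern = re.compile(r"var\s+(\w+)\s*=\s*'(.*?)';", re.DOTALL)
variables = dict(pattern.findall(var_only))

for name, value in variables.items():
    print(name, len(value))
```

Each value is then an HTML fragment that can be handed to BeautifulSoup or pandas.read_html.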
I'm still learning how to use BeautifulSoup. I've managed to use tags and whatnot to pull the data from the Depth Chart table at https://fantasydata.com/nfl-stats/team-details/CHI
But now I'm trying to pull the Full Roster table. I can't quite figure out the tags for that. I do notice in the source, though, that the info is in a list of dictionaries, as seen:
vm.Roster = [{"PlayerId":16236,"Name":"Cody Parkey","Team":"CHI","Position":"K","FantasyPosition":"K","Height":"6\u00270\"","Weight":189,"Number":1,"CurrentStatus":"Healthy","CurrentStatusCol
...
What's an elegant way to pull that Full Roster table? My thought was if I could just grab that list/dictionary, I could just convert to a dataframe. But not sure how to grab that, or if there is a better way to do that to put that table in a dataframe in python.
One possible solution is to use a regular expression to extract the raw JSON object which then can be loaded using the json library.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import json
html_page = urlopen("https://fantasydata.com/nfl-stats/team-details/CHI")
soup = BeautifulSoup(html_page, "html.parser")
raw_data = re.search(r"vm.Roster = (\[.*\])", soup.text).group(1)
data = json.loads(raw_data)
print(data[0]["Name"]) # Cody Parkey
It should be noted that scraping data from that particular site in this fashion most likely violates their terms of service and might even be illegal in some jurisdictions.
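One caveat with a greedy pattern such as `(\[.*\])` is that it can run past the end of the array when another `]` follows on the same line, and json.loads then fails. A possible alternative, sketched below, is to locate the assignment with a regex and let json.JSONDecoder.raw_decode read exactly one JSON value from that position. The sample text, the `vm.Other` variable, and the helper name `extract_js_array` are hypothetical.

```python
import json
import re

# Hypothetical page text with a second array after the one we want.
page_text = (
    'vm.Roster = [{"PlayerId": 1, "Name": "Player A"}, '
    '{"PlayerId": 2, "Name": "Player B"}]; vm.Other = [1, 2];'
)


def extract_js_array(text, variable):
    """Return the JSON value assigned to `variable` in `text`."""
    match = re.search(re.escape(variable) + r"\s*=\s*", text)
    if match is None:
        raise ValueError(f"{variable} not found")
    # raw_decode parses one JSON document starting at the given index
    # and ignores whatever comes after it.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


roster = extract_js_array(page_text, "vm.Roster")
print(roster[0]["Name"])
```

The resulting list of dictionaries can be passed straight to pandas.DataFrame.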
I am trying to extract numeric data from a website. I tried using a simple web scraper to retrieve the data:
from mechanize import Browser
from bs4 import BeautifulSoup
mech = Browser()
url = "http://www.oanda.com/currency/live-exchange-rates/"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html, 'html.parser')
data1 = soup.find(id='EUR_USD-b-int')
print(data1)
This kind of approach normally gives the line of data from the website, including the contents of the element I am trying to extract. Here, however, it gives everything but the contents, which is the part I need. I have tried .contents and it returns []. I've also tried .child and it returns 'none'. Does anyone know another method that could work? I have looked through the Beautiful Soup documentation but can't seem to find a solution.
The value on this page is updated using Javascript by making a request to
GET http://www.oanda.com/lfr/rates_lrrr?tstamp=1392757175089&lrrr_inverts=1
Referer: http://www.oanda.com/currency/live-exchange-rates/
(Be aware that I was blocked 4 times just looking at this, they are extremely block-happy. This is because they sell this data commercially as a subscription service.)
The request is made and the response parsed in http://www.oanda.com/jslib/wl/lrrr/liverates.js. The response is "encrypted" with RC4 (http://en.wikipedia.org/wiki/RC4)
The RC4 decrypt method is coming from http://www.oanda.com/wandacache/rc4-ea63ca8c97e3cbcd75f72603d4e99df48eb46f66.js. It looks like this file is refreshed often so you'll need to grab the latest link from the homepage and extract the var key=<value> to fully decrypt the value.
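For reference, RC4 itself is short enough to write in plain Python. The sketch below is a textbook implementation of the cipher, not the site's own code; how the response is encoded and what the key is are not shown here, and you would have to work them out from the JavaScript files mentioned above. The key and message in the example are arbitrary demo values.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt `data` with RC4 (the operation is symmetric)."""
    # Key-scheduling algorithm: permute the state array using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation: XOR each input byte with a keystream byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


# Round trip with an arbitrary demo key: decrypting is the same call.
ciphertext = rc4(b"Key", b"Plaintext")
print(rc4(b"Key", ciphertext))
```

Because RC4 is a stream cipher, applying the same function with the same key to the ciphertext returns the original bytes.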