I need to get the data from a toll table in Argentina. The website where this information is located is here:
I'm trying to put together code that pulls the information inside the "tbody" (the "tr" and "td" elements) so I can build a list and then a DataFrame, but every time I try to read the "tbody" it returns very little data. I can't get the information out of the table rows. Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.argentina.gob.ar/obras-publicas/vialidad-nacional/institucional/informacion-publica/tarifas-de-peajes'
page = requests.get(url)
soup = BeautifulSoup(page.content,'html.parser')
rows = soup.find('table',attrs={"class":"table table-spaced table-align-middle table-mobile"}).find_all('tbody')
rows
This code returns the following:
[<tbody class="list" id="show-data"></tbody>]
I've tried fetching the "tr" and "td" elements directly but can't fetch the rows. Does anyone know what I could be doing wrong?
Thanks in advance!
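A minimal offline reproduction of what's happening (stand-in HTML; the live page appears to fill the tbody client-side with JavaScript, so a plain requests fetch only ever sees the empty placeholder):

```python
from bs4 import BeautifulSoup

# What requests receives is only the placeholder: the rows are filled in
# later by JavaScript in the browser, which BeautifulSoup never executes
static_html = '<table class="table"><tbody class="list" id="show-data"></tbody></table>'
soup = BeautifulSoup(static_html, "html.parser")
rows = soup.find("table").find_all("tr")
print(rows)  # an empty list - the static HTML contains no rows
```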
To preface: the website I've been trying to scrape seems to use JavaScript (I'm unsure about the jargon relating to web development and the like), and I've had varying success trying to scrape different tables on different pages.
For instance, on this page: http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic I was easily able to 'inspect element', go to the Network tab, find the correct 'Name' of the script, and then find the Request URL I needed to get the table that I wanted. The code I used for this was:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.minorleaguesplits.com/tennisabstract/cgi-bin/frags/NovakDjokovic.js'
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find('table', id='tour-years', attrs={'class': 'tablesorter'})
dfs = pd.read_html(str(table))
df = pd.concat(dfs)
However, now that I'm looking at a different page on the same site, say this one http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html, I'm unable to find the Request URL that will let me get the table that I want. I repeat the same process as above, but there's no .js script under the Network tab that contains the table. I do see the table when I inspect the HTML elements, but of course I can't get it without the correct URL.
So my question would be, how can I get the table from this page http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html ?
TIA!
Looking at the source code of the HTML page, you can see that all the data is already loaded in a script tag. All you need to do is extract the variable values and load them into BeautifulSoup.
The following code pulls the variable declarations out of the script tag:
import requests, re
from bs4 import BeautifulSoup
res = requests.get("http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html")
soup = BeautifulSoup(res.text, "lxml")
script = soup.find("script", attrs={"language":"JavaScript"}).text
var_only = script[:script.index("$(document)")].strip()
Next, you can use a regex to extract the variable values - https://regex101.com/r/7cE85A/1
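As a rough sketch of that regex step (the variable names and values here are illustrative stand-ins, not the page's actual script):

```python
import re
import json

# Illustrative stand-in for the var_only string extracted above; the real
# script on the page defines its own variables
var_only = (
    "var match = 'Federer vs Djokovic';\n"
    'var chartlog = [["1st set", "0-0"], ["1st set", "15-0"]];\n'
)

# Capture each "var name = value;" declaration, one per line
pairs = dict(re.findall(r"var\s+(\w+)\s*=\s*(.+?);\s*$", var_only, flags=re.M))

title = pairs["match"].strip("'")          # plain string value
chartlog = json.loads(pairs["chartlog"])   # a JS array literal parses as JSON
print(title, chartlog[1])
```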
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.espn.in/nba/stats/player?stat=assists&season=2017&seasontype=2&position=center&conference=5"
page=requests.get(url)
soup=BeautifulSoup(page.content, 'html.parser')
elem=soup.find_all("table",class_="flex")
len(elem)
After executing this code, it shows zero elements in the list. Is there any way to fix it?
There is no table element with flex as its class. On the page there are two tables: a left-aligned one and a right-aligned one. The left table contains the player names, and the right table contains the stats corresponding to those names.
You can get the left table using:
soup.find("table",class_="Table--fixed-left")
And the right table using:
soup.find("table",class_="Table Table--align-right")
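Since the names and the stats live in two separate tables, one way to put them back together is to read both and join them side by side. An offline sketch with stand-in HTML (the real page's columns will differ, but the class names are the ones above):

```python
from io import StringIO

import pandas as pd

# Stand-ins for the two halves of the stats page: a fixed left table with
# the names and a right-aligned table with the stat columns
left_html = """<table class="Table--fixed-left">
<tr><th>Name</th></tr><tr><td>Player A</td></tr><tr><td>Player B</td></tr>
</table>"""
right_html = """<table class="Table Table--align-right">
<tr><th>AST</th></tr><tr><td>8.1</td></tr><tr><td>7.4</td></tr>
</table>"""

left = pd.read_html(StringIO(left_html))[0]
right = pd.read_html(StringIO(right_html))[0]

# The rows line up positionally, so join the two frames side by side
df = pd.concat([left, right], axis=1)
print(df)
```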
I have blocking software that prevents me from browsing this site, so I can't check it live.
But it is worth checking that this class is on the page itself and not inside a frame (or similar).
I would like to parse the URL https://www.horsedeathwatch.com/index.php and dump the data into a pandas DataFrame.
Columns like horse/date/course/cause of death.
I tried pandas read_html to read this URL directly, and it didn't find the table even though the page has a table tag.
I tried using :
import requests
from bs4 import BeautifulSoup

url = 'https://www.horsedeathwatch.com/index.php'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#print(page.text)
soup = BeautifulSoup(page.content, 'lxml')
and then the find_all('tr') method, but for some reason I can't get it to work.
The second thing I would like to do: each horse (the first column in the web page's table) has a hyperlink with additional attributes.
Any suggestions on how I can retrieve those additional attributes into a pandas DataFrame?
Looking at the site, I can see the data is loaded via a POST request to /loaddata.php, passing the page number. Combining this with pandas.read_html:
import requests
import pandas
res = requests.post('https://www.horsedeathwatch.com/loaddata.php', data={'page': '3'})
html = pandas.read_html(res.content)
Although perhaps BeautifulSoup would give you a richer data structure, because if you want to extract the further attributes for each horse you would need to get each anchor element's 'href' and perform another request - a GET request this time - and parse the response content from <div class="view"> in the response.
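A sketch of that follow-up step, using stand-in HTML for one table row and one detail page (the real markup and href format may differ; they are assumptions here for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for one row returned by loaddata.php; the href format is an
# assumption for illustration
row_html = '<tr><td><a href="horse.php?id=1234">Some Horse</a></td></tr>'
row = BeautifulSoup(row_html, "html.parser")
detail_url = "https://www.horsedeathwatch.com/" + row.find("a")["href"]

# In live code you would now fetch: requests.get(detail_url).text
# Stand-in for the detail page's content block
detail_html = '<div class="view"><p>Cause of death: unknown</p></div>'
detail = BeautifulSoup(detail_html, "html.parser")
attrs = detail.find("div", class_="view").get_text(strip=True)
print(detail_url, attrs)
```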
I'm still learning how to use BeautifulSoup. I've managed to use tags and whatnot to pull the data from the Depth Chart table at https://fantasydata.com/nfl-stats/team-details/CHI
But now I'm trying to pull the Full Roster table, and I can't quite figure out the tags for that. I do notice in the source, though, that the info is in a list of dictionaries, as seen here:
vm.Roster = [{"PlayerId":16236,"Name":"Cody Parkey","Team":"CHI","Position":"K","FantasyPosition":"K","Height":"6\u00270\"","Weight":189,"Number":1,"CurrentStatus":"Healthy","CurrentStatusCol
...
What's an elegant way to pull that Full Roster table? My thought was that if I could just grab that list of dictionaries, I could convert it to a DataFrame. But I'm not sure how to grab it, or whether there's a better way to get that table into a DataFrame in Python.
One possible solution is to use a regular expression to extract the raw JSON object which then can be loaded using the json library.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import json
html_page = urlopen("https://fantasydata.com/nfl-stats/team-details/CHI")
soup = BeautifulSoup(html_page, "html.parser")
raw_data = re.search(r"vm\.Roster = (\[.*\])", soup.text).group(1)
data = json.loads(raw_data)
print(data[0]["Name"]) # Cody Parkey
It should be noted that scraping data from that particular site in this fashion most likely violates their terms of service and might even be illegal in some jurisdictions.
I am trying to scrape the data on some pages with BeautifulSoup, but I cannot seem to get the data that I want; I am having trouble splitting it. I'll post my code below, but what I am trying to do is grab each address and split it into its parts. If you run the code below, it gets the data I want, but I can't figure out how to split it on the tags. The output I am after is address = ['2 Warriston's Close', 'High Street, Edinburgh EH1 1PG', 'United Kingdom'].
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.hauntedplaces.org/item/mary-kings-close/'
page = requests.get(url)
soup = bs(page.text, 'lxml')
region = soup.select('dd.data')[0]
# Need something here to split the region variable so I can separate for csv file.
# Trying to use soup.select('dd.data')[0].split() but no avail.
print(region)
So, instead of the HTML, you want the text inside the tags. BeautifulSoup elements have a .text attribute. In this case, to get what you want you can just add the line:
print(region.text.split('\n')[:3])
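If the line breaks in dd.data come from <br> tags rather than literal newlines, .text alone can run the lines together; get_text with an explicit separator handles both cases. An offline sketch with stand-in markup (the real page's element contents may differ):

```python
from bs4 import BeautifulSoup

# Stand-in for the dd.data element; here the address lines are separated
# by <br> tags, as they might be on the live page
html = ("<dd class='data'>2 Warriston's Close<br/>"
        "High Street, Edinburgh EH1 1PG<br/>United Kingdom</dd>")
soup = BeautifulSoup(html, "html.parser")
region = soup.select("dd.data")[0]

# get_text with a separator is robust whether the breaks come from
# <br> tags or from newlines in the source
address = region.get_text(separator="\n").split("\n")[:3]
print(address)
```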