Python BeautifulSoup HTML parse - python

Hi guys, I have a question about parsing HTML with BeautifulSoup.
My question is how to parse this HTML:
<div class="time_table show_today" id="monday_schedule">
<h3>January 20, 2014</h3>
<table>
<tbody>
<tr>
<th>Time</th>
<th>Program</th>
</tr>
<tr>
<td class="time_part"> 0:00 </td>
<td class="show_content">
<h4>
First Up
</h4>
<p>
Bloomberg Television's award winning morning show takes a look at market openings in Asia and analyzes all the breaking news stories essential for your business day ahead. </p>
</td>
</tr>
<tr>
<td class="time_part"> 2:00 </td>
<td class="show_content">
<h4>
On the Move with Rishaad Salamat
</h4>
<p>
Rishaad Salamat brings you comprehensive coverage of market openings from Asia and live reporting on the stories most impacting business around the globe. </p>
</td>
</tr>
<tr>
<td class="time_part"> 4:00 </td>
<td class="show_content">
<h4>
Asia Edge
</h4>
<p>
Get to the bottom of the days major issues influencing business decisions with Rishaad Salamat. Asia Edge gives viewers a deeper perspective through extended interviews with the region's newsmakers as well as fast-paced panel discussions featuring Bloomberg's market reporters, business experts and influential guests. Stay ahead of the business day with Asia Edge. </p>
</td>
</tr>
My code looks like:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.bloomberg.com/tv/schedule/europe/'
response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
for line in soup.findAll('div', {'td', 'h4', 'p'}):
    print line
What am I doing wrong in this code? Some advice would be great.
The problem is that the date heading, e.g. <h3>January 20, 2014</h3>, changes for each day of the week, but my loop only picks up one date, and I can't get it printed on every line together with all the other tags.

I'm not sure what you're trying to achieve with {'td', 'h4', 'p'} as a second argument. That's a set, not a dict (as you may be thinking it is).
If you want to obtain the date, a simple soup.find('h3') should be good here:
>>> print soup.find('h3')
<h3>January 20, 2014</h3>
>>> print soup.find('h3').text
January 20, 2014
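For the rest of the schedule (not just the date), one approach is to walk each `<tr>` and pair the time cell with the show title. A minimal sketch in Python 3 / bs4 terms, with an inlined copy of the snippet above standing in for the urllib2 fetch:

```python
from bs4 import BeautifulSoup

html = """<div class="time_table show_today" id="monday_schedule">
<h3>January 20, 2014</h3>
<table>
<tr><th>Time</th><th>Program</th></tr>
<tr><td class="time_part"> 0:00 </td>
<td class="show_content"><h4>First Up</h4><p>Morning show.</p></td></tr>
<tr><td class="time_part"> 2:00 </td>
<td class="show_content"><h4>On the Move with Rishaad Salamat</h4><p>Market openings.</p></td></tr>
</table></div>"""

soup = BeautifulSoup(html, "html.parser")
date = soup.find("h3").text            # the schedule's date heading
schedule = []
for row in soup.find_all("tr"):
    time_cell = row.find("td", class_="time_part")
    title = row.find("h4")
    if time_cell and title:            # skip the header row, which lacks these cells
        schedule.append((time_cell.text.strip(), title.text.strip()))
```

Each tuple in `schedule` then carries the time and the program name, and `date` can be prepended to every row.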

Related

Using Following-sibling in Xpath with Scrapy

I am trying to scrape the year from the html below (https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/punjab-kings-vs-delhi-capitals-64th-match-1304110/full-scorecard). Due to the way the site is coded I have to first identify the table cell that contains the word "Season" then get the year (2022 in this example).
I thought this would get it but it doesn't. There are no errors, just no results. I've not used the following-sibling approach before so I'd be grateful if someone could point out where I've messed up.
l.add_xpath(
    'Season',
    "//td[contains(text(),'Season')]/following-sibling::td[1]/a/text()")
html:
<tr class="ds-border-b ds-border-line">
<td class="ds-min-w-max ds-border-r ds-border-line">
<span class="ds-text-tight-s ds-font-medium">Season</span>
</td>
<td class="ds-min-w-max">
<span class="ds-inline-flex ds-items-center ds-leading-none">
<a href="https://www.espncricinfo.com/ci/engine/series/index.html?season2022" class="ds-text-ui-typo ds-underline ds-underline-offset-4 ds-decoration-ui-stroke hover:ds-text-ui-typo-primary hover:ds-decoration-ui-stroke-primary ds-block">
<span class="ds-text-tight-s ds-font-medium">2022</span>
</a>
</span>
</td>
</tr>
Try:
//span[contains(text(),"Season")]/../following-sibling::td/span/a/span/text()
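The reason the original expression returns nothing is that "Season" sits inside a `<span>`, so `td[contains(text(),'Season')]` only tests the `<td>`'s own (whitespace) text nodes and never fires. A quick way to check both expressions outside Scrapy is lxml, which Scrapy's selectors are built on; this sketch inlines a trimmed copy of the snippet above:

```python
from lxml import html

snippet = """<table><tr class="ds-border-b ds-border-line">
<td class="ds-min-w-max ds-border-r ds-border-line">
<span class="ds-text-tight-s ds-font-medium">Season</span>
</td>
<td class="ds-min-w-max">
<span class="ds-inline-flex ds-items-center ds-leading-none">
<a href="https://www.espncricinfo.com/ci/engine/series/index.html?season2022">
<span class="ds-text-tight-s ds-font-medium">2022</span>
</a>
</span>
</td>
</tr></table>"""

tree = html.fromstring(snippet)

# Original expression: matches nothing, because "Season" is inside a <span>
old = tree.xpath("//td[contains(text(),'Season')]/following-sibling::td[1]/a/text()")

# Anchor on the <span>, then step up (..) to its parent <td>
new = tree.xpath('//span[contains(text(),"Season")]/../following-sibling::td/span/a/span/text()')
```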

How do I loop over this outerHTML code to get out certain data? (I don't know how to webscrape it so I want to try this)

I am trying to get a list that matches India's districts to their district codes as they were during the 2011 population census. Below is a small subset of the outerHTML I copied from a government website. I am trying to loop over it and extract a string and an int from each little HTML box and store these, ideally in a pandas DataFrame, on the same row. The HTML blocks look like this; I reproduce 2 of them here, and there are around 700 in my txt file:
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">**NICOBARS**</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">**638**</td>
<td align="left">
Not Covered
</td>
<td width="5%" align="center"><i class="fa fa-eye" aria-hidden="true"></i>
</td>
<td width="5%" align="center"><i class="fa fa-history" aria-hidden="true"></i>
</td>
<td width="5%" align="center">
</td>
<td width="3%" align="center">
<!-- Merging issue revert beck 05/10/2017 -->
<i class="fa fa-map-marker" aria-hidden="true"></i>
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">**NORTH AND MIDDLE ANDAMAN**</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">**639**</td>
<td align="left">
Not Covered
I have put ** around ** the values that I want to get from the text file. I was wondering how I could loop through this text to extract that data. I thought about counting cells each time I encounter a new row and then extracting the 1st and 6th values, but I don't know how to code this. I hope someone is willing to help out. Or if anyone already has this list, that would be great!
If you're able to get the text of the entire HTML table, you can use df = pd.read_html(html_text_string). 50% of the time, it works every time!
pd.read_html <-- docs
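As a sketch of that suggestion (read_html needs lxml or html5lib/bs4 installed; the column positions below are a trimmed-down assumption, not the real table's full layout):

```python
from io import StringIO
import pandas as pd

html_text = """<table>
<tr><td>1</td><td>603</td><td>NICOBARS</td><td>638</td></tr>
<tr><td>2</td><td>632</td><td>NORTH AND MIDDLE ANDAMAN</td><td>639</td></tr>
</table>"""

# read_html returns one DataFrame per <table> it finds in the input
df = pd.read_html(StringIO(html_text))[0]

# With no <th> header row, columns are just integer positions
name_and_code = df[[2, 3]]
```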

Scraping data from table with unique ID rows

I am trying to scrape from this website. My objective is to collect the most recent 10 results (win/loss/draw) of ANY team, I am just using this specific team as an example. The source for an individual row is:
<tr class="odd match no-date-repetition" data-timestamp="1515864600" id="page_team_1_block_team_matches_3_match-2463021" data-competition="8">
<td class="day no-repetition">Sat</td>
<td class="full-date" nowrap="nowrap">13/01/18</td>
<td class="competition">PRL</td>
<td class="team team-a ">
<a href="/teams/england/tottenham-hotspur-football-club/675/" title="Tottenham Hotspur">
Tottenham Hotspur
</a>
</td>
<td class="score-time score">
<a href="/matches/2018/01/13/england/premier-league/tottenham-hotspur-football-club/everton-football-club/2463021/" class="result-win">
4 - 0
</a>
</td>
<td class="team team-b ">
<a href="/teams/england/everton-football-club/674/" title="Everton">
Everton
</a>
</td>
<td class="events-button button first-occur">
View events
</td>
<td class="info-button button">
More info
</td>
</tr>
You can see that the result is stored in the <td class="score-time score"> cell.
My knowledge of Python and web crawling is pretty limited, so my current code is:
import requests
import bs4

res2 = requests.get(soccerwayURL)
soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
elems2 = soup2.select('#page_team_1_block_team_matches_3_match-2463021 > td.score-time.score')
print(elems2[0].text.strip())
This prints out '4-0'. That is good, but the issue arises when I try to access a different row. The 7-digit number (2463021 in the example above) is unique to that row. That means that if I want to get the score from a different row, I would have to find that unique 7-digit number and place it in the CSS selector '#page_team_1_block_team_matches_3_match-******* > td.score-time.score' where the asterisks are the unique number.
The online course I took only showed how to reference things by the CSS Selector, so I am unsure how I can go about retrieving the scores without manually taking the CSS Selector for each row.
Within the <td class="score-time score"> cell, the anchor has another class that reads class="result-win". Ideally I would like to be able to pull just that "result-win", because I am not looking for the score of the game, only the outcome of win, loss, or draw.
I hope this is a little clearer. I greatly appreciate your patience with me.
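One way around per-row CSS selectors is to select every score cell by class, ignoring the row ids entirely, and read the anchor's own class attribute. The result-win value is confirmed by the snippet above; names like result-loss or result-draw for the other outcomes are an assumption. A minimal bs4 sketch over the row shown:

```python
from bs4 import BeautifulSoup

row_html = """<tr class="odd match no-date-repetition" id="page_team_1_block_team_matches_3_match-2463021">
<td class="score-time score">
<a href="/matches/2018/01/13/england/premier-league/tottenham-hotspur-football-club/everton-football-club/2463021/" class="result-win">
4 - 0
</a>
</td>
</tr>"""

soup = BeautifulSoup(row_html, "html.parser")
outcomes = []
for cell in soup.select("td.score-time"):   # every score cell, whatever the row id
    link = cell.find("a")
    if link:
        # bs4 exposes the class attribute as a list, e.g. ['result-win']
        outcomes.append(link["class"][0])
```

Run over the full page instead of one row, `outcomes` would collect one win/loss/draw marker per match.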

How to auto extract data from a html file with python?

I'm beginning to learn Python (2.7) and would like to extract certain information from HTML code stored in a text file. The code below is just a snippet of the whole HTML. In the full HTML text file the structure is the same for all the other firms' data as well, and these HTML code "blocks" are positioned underneath each other (if that helps).
The html snippet code:
<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
<div class="card-header">
<strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
<span class="tel" title="Phone contacts">Phone contacts</span>
</div>
<div class="card-content">
<table>
<tbody>
<tr>
<td colspan="4">
<label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
</td>
</tr>
<tr>
<td width="20"> </td>
<td width="245"> </td>
<td width="50"> </td>
<td width="80"> </td>
</tr>
<tr>
<td colspan="2">
59 Wall St</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="2">NJ 07105
<label class="downdrill-sbi" title="New York">New York</label>
</td>
<td></td>
<td></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
<tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
<tr>
<td colspan="2"> www.liberty.edu </td>
<td>Active:</td>
<td>Yes</td>
</tr>
</tbody>
</table>
</div>
</div></div></body>
Right now I'm using the following script to extract the desired information:
from lxml import html
str = open('html1.txt', 'r').read()
tree = html.fromstring(str)
for variable in tree.xpath('/html/body/div/div'):
    company_name = variable.xpath('/html/body/div/div/div[1]/strong/text()')
    location = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()')
    website = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()')
    print(company_name, location, website)
Printed result:
('"Liberty Associates LLC"', 'New York', 'www.liberty.edu')
So far so good. However, when I use the script above to scrape the whole HTML file, the results are all printed right after each other. I would like to print the data (the HTML code "blocks") under each other like this:
Liberty Associates LLC | New York | +1 973-344-8300 | www.liberty.edu
Company B | Los Angeles | +1 213-802-1770 | perchla.com
I know I can use [0], [1], [2], etc. to get the data under each other the way I'd like, but doing this manually for thousands of HTML "blocks" is just not feasible.
So my question: how can I automatically extract the data "block by block" from the html code and print the results under each other like illustrated above?
I think what you want is
print(company_name, location, website,'\n')
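The '\n' only changes spacing, though. The reason every block prints the first firm's data is that the XPaths inside the loop are absolute: /html/body/... always resolves from the document root, not from `variable`. Making them relative (starting with .//) fixes that. A sketch over a simplified two-block document (the second block's contents are an assumption for illustration):

```python
from lxml import html

doc = """<html><body>
<div><div>
  <div class="card-header"><strong>"Liberty Associates LLC"</strong></div>
  <div class="card-content"><label title="New York">New York</label></div>
</div></div>
<div><div>
  <div class="card-header"><strong>Company B</strong></div>
  <div class="card-content"><label title="Los Angeles">Los Angeles</label></div>
</div></div>
</body></html>"""

tree = html.fromstring(doc)
rows = []
for block in tree.xpath('/html/body/div/div'):
    # .// paths are evaluated relative to *this* block;
    # absolute /html/body/... paths would always hit the first block only
    name = block.xpath('.//strong/text()')[0]
    loc = block.xpath('.//label/text()')[0]
    rows.append((name, loc))

for name, loc in rows:
    print(name, '|', loc)
```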

python- is beautifulsoup misreporting my html?

I have two machines each, to the best of my knowledge, running python 2.5 and BeautifulSoup 3.1.0.1.
I'm trying to scrape http://utahcritseries.com/RawResults.aspx, using:
from BeautifulSoup import BeautifulSoup
import urllib2
base_url = "http://www.utahcritseries.com/RawResults.aspx"
data=urllib2.urlopen(base_url)
soup=BeautifulSoup(data)
i = 0
table=soup.find("table",id='ctl00_ContentPlaceHolder1_gridEvents')
#table=soup.table
print "begin table"
for row in table.findAll('tr')[1:10]:
    i = i + 1
    col = row.findAll('td')
    date = col[0].string
    event = col[1].a.string
    confirmed = col[2].string
    print '%s - %s' % (date, event)
print "end table"
print "%s rows processed" % i
On my Windows machine I get the correct result, which is a list of dates and event names. On my Mac, I don't. Instead, I get:
3/2/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
3/23/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
4/2/2002 - Rocky Mtn Raceway Criterium
None - Saltair Time Trial
4/9/2002 - Rocky Mtn Raceway Criterium
None - DMV Criterium
4/16/2002 - Rocky Mtn Raceway Criterium
What I'm noticing is that when I
print row
on my Windows machine, the tr data looks exactly the same as the source HTML. Note the style attribute on the second table row. Here are the first two rows:
<tr>
<td>
3/2/2002
</td>
<td>
<a href="Event.aspx?id=226">
Rocky Mtn Raceway Criterium
</a>
</td>
<td>
Confirmed
</td>
<td>
<a href="Event.aspx?id=226">
Points
</a>
</td>
<td>
<a disabled="disabled">
Results
</a>
</td>
</tr>
<tr style="color:#333333;background-color:#EFEFEF;">
<td>
3/16/2002
</td>
<td>
<a href="Event.aspx?id=227">
Rocky Mtn Raceway Criterium
</a>
</td>
<td>
Confirmed
</td>
<td>
<a href="Event.aspx?id=227">
Points
</a>
</td>
<td>
<a disabled="disabled">
Results
</a>
</td>
</tr>
On my Mac, when I print the first two rows, the style information is removed from the tr tag and moved into each td field. I don't understand why this is happening. I'm getting None for every other date value because BeautifulSoup is putting a font tag around every other date. Here's the Mac's output:
<tr>
<td>
3/2/2002
</td>
<td>
<a href="Event.aspx?id=226">
Rocky Mtn Raceway Criterium
</a>
</td>
<td>
Confirmed
</td>
<td>
<a href="Event.aspx?id=226">
Points
</a>
</td>
<td>
<a disabled="disabled">
Results
</a>
</td>
</tr>
<tr bgcolor="#EFEFEF">
<td>
<font color="#333333">
3/16/2002
</font>
</td>
<td>
<font color="#333333">
<a href="Event.aspx?id=227">
Rocky Mtn Raceway Criterium
</a>
</font>
</td>
<td>
<font color="#333333">
Confirmed
</font>
</td>
<td>
<font color="#333333">
<a href="Event.aspx?id=227">
Points
</a>
</font>
</td>
<td>
<font color="#333333">
<a disabled="disabled">
Results
</a>
</font>
</td>
</tr>
My script displays the correct result under Windows. What do I need to do to get my Mac to work correctly?
There are documented problems with version 3.1 of BeautifulSoup.
You might want to double-check that this is in fact the version you are using, and if so, downgrade.
I suspect the problem is in the urllib2 request, not BeautifulSoup:
It might help if you show us the same section of the raw data as returned by this command on both machines:
urllib2.urlopen(base_url)
This page looks like it might help:
http://bytes.com/groups/python/635923-building-browser-like-get-request
The simplest solution is probably just to detect which environment the script is running in and change the parsing logic accordingly.
>>> import os
>>> os.uname()
('Darwin', 'skom.local', '9.6.0', 'Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386', 'i386')
Or get microsoft to use web standards :)
Also, didn't you use mechanize to fetch the pages? If so, the problem may be there.
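For what it's worth, on modern bs4 (BeautifulSoup 4) the usual fix for two machines producing different trees is to name the parser explicitly rather than letting bs4 pick whichever backend happens to be installed:

```python
from bs4 import BeautifulSoup

html_doc = '<table><tr bgcolor="#EFEFEF"><td>3/16/2002</td></tr></table>'

# Pinning the parser makes the tree identical on every machine; omitting
# it lets bs4 choose among installed backends (lxml, html5lib,
# html.parser), which can rewrite messy markup differently per machine.
soup = BeautifulSoup(html_doc, 'html.parser')
dates = [td.get_text() for td in soup.find_all('td')]
```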
