I have been trying in vain for a few hours now to extract the text from a specific cell in the following table:
<tbody class="table-body">
<tr class=" " data-blah="25293454534534513" data-currency="1">
<td class="action-cell no-sort">
</td>
<td class="col1 id">
<a class="alert-ico " data-tooltip=""></a>
<a class="isin-btn " data-tooltip="" id="isin" data-portfolioid="2423424" data-status="0">US3</a>
</td>
<td class="col2 name hide">4%</td>
<td class="col9 colNo.9" title="Bid: 101.23; Mid: 101.28; Ask: 101.33;
Liquidity Score: -*/5*; Merit: -/4;" data-bprice="101.28" data-uprice="101.28">101.28<span class="estim-star">*</span></td>
<td class="col10 price_change" nowrap="" data-sort="0.02"><span class="positive-change">0.02%</span><span class="change-sign positive-change">↑</span></td>
<td class="col11 yield yield-val" title="" data-sort="3.33" data-byield="3.33" data-uyield="3.34%">3.33%</td>
<td class="col12 purchase_price" data-bprice="101.28" data-uprice="101.28" data-sort="101.28"><input type="text" name="purchase_price" class="positive-num-only default" value="101.28"></td>
<td class="col13 margin_bond" data-bond="sec" data-sort="0"><input type="text" name="margin_bond" maxlength="3" class="positive-num-only default" value="0"></td>
</tr>
</tbody>
I'm trying to extract the text from the 'Price Change' column (col 10) using lxml.html, which lets me extract data from big tables in a matter of seconds. This is how I'm doing it:
import lxml.html
import pandas as pd
root = lxml.html.fromstring(self.driver.page_source)
data = []
for row in root.xpath(".//*[@id='main']/div[5]/div[2]/table/tbody/tr"):
    cells = row.xpath('.//td/text()')
This way I managed to extract the whole table, and the only exception is column 10 ('price change'). I tried the following, and each attempt returned an empty string (""):
row.xpath(".//tr[1]/td[11][@data-sort]/text()")
row.xpath(".//*[@id='main']/div[5]/div[2]/table/tbody/tr[1]/td[11]/span/text()")
row.xpath(".//*[@id='main']/div[5]/div[2]/table/tbody/tr[1]/td[11]/text()")
I don't want to extract the text using WebElement, only with the lxml.html library.
Thank you!
There are two problems:
The sample row has 8 td elements in total, not 11, and the td you are interested in is the 5th, not the 11th.
The td you are interested in contains two span elements, and you are not specifying which span you want.
This code works:
html_code = """
<tbody class="table-body">
<tr class=" " data-blah="25293454534534513" data-currency="1">
<td class="action-cell no-sort">
</td>
<td class="col1 id">
<a class="alert-ico " data-tooltip=""></a>
<a class="isin-btn " data-tooltip="" id="isin" data-portfolioid="2423424" data-status="0">US3</a>
</td>
<td class="col2 name hide">4%</td>
<td class="col9 colNo.9" title="Bid: 101.23; Mid: 101.28; Ask: 101.33;
Liquidity Score: -*/5*; Merit: -/4;" data-bprice="101.28" data-uprice="101.28">101.28<span class="estim-star">*</span></td>
<td class="col10 price_change" nowrap="" data-sort="0.02">
<span class="positive-change">0.02%</span>
<span class="change-sign positive-change">↑</span></td>
<td class="col11 yield yield-val" title="" data-sort="3.33" data-byield="3.33" data-uyield="3.34%">3.33%</td>
<td class="col12 purchase_price" data-bprice="101.28" data-uprice="101.28" data-sort="101.28"><input type="text" name="purchase_price" class="positive-num-only default" value="101.28"></td>
<td class="col13 margin_bond" data-bond="sec" data-sort="0"><input type="text" name="margin_bond" maxlength="3" class="positive-num-only default" value="0"></td>
</tr>
</tbody>
"""
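from lxml import html  # import needed for html.fromstring

tree = html.fromstring(html_code)
# Select the cell by class, then take the first span (the percentage)
print("price change is %s" % tree.xpath(".//td[contains(@class,'col10')]/span[1]/text()")[0])
# Same cell selected by position: it is the 5th td in the row
print("price change is %s" % tree.xpath(".//td[5]/span[1]/text()")[0])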
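To apply the same idea to every row of a table, a minimal sketch is shown here. It assumes a trimmed, hypothetical two-row version of the table (the second row and its values are invented for illustration) and uses `text_content()`, which collects text from nested elements such as `span`, whereas a plain `td/text()` step returns only the text directly inside the `td`.

```python
import lxml.html

# Trimmed, hypothetical two-row sample; the second row is invented for illustration.
SAMPLE = """
<table>
<tbody class="table-body">
<tr>
  <td class="col2 name">4%</td>
  <td class="col10 price_change" data-sort="0.02"><span class="positive-change">0.02%</span><span class="change-sign positive-change">&#8593;</span></td>
  <td class="col11 yield">3.33%</td>
</tr>
<tr>
  <td class="col2 name">5%</td>
  <td class="col10 price_change" data-sort="-0.10"><span class="negative-change">-0.10%</span><span class="change-sign negative-change">&#8595;</span></td>
  <td class="col11 yield">4.10%</td>
</tr>
</tbody>
</table>
"""

root = lxml.html.fromstring(SAMPLE)

data = []
for row in root.xpath(".//tbody/tr"):
    # text_content() includes the text of nested <span> elements.
    cells = [td.text_content().strip() for td in row.xpath("./td")]
    # Relative, class-based lookup of the first span in the price-change cell.
    change = row.xpath("./td[contains(@class, 'price_change')]/span[1]/text()")
    data.append((cells, change[0] if change else None))

for cells, change in data:
    print(cells, change)
```

Because each XPath starts with `./` and is evaluated on `row`, it only looks inside that row and does not depend on the row's position in the page.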
I'm parsing this page
I collect the links with class number2. Then, in a loop, I go through each number2 element and try to read the result from the cell with class 'center bold table-odds'. To do this I look up the parent of each link, but every time I get the result of the first element (in this example, 31:25).
<table class="table-main odds prediction-table" id="prediction-table-1">
<tbody>
<tr class="odd">
<td rowspan="3" class="center status-text-won">W</td>
<td rowspan="3" id="status-IwnElQet" class="table-time center datet t1570978800-6-1-0-0 ">Today<br>15:00</td>
<td rowspan="3" colspan="1" class="table-participant">
<a class="number2" href="/handball/europe/challenge-cup/vogosca-sviesa-IwnElQet/#1X2;2">1X2</a>
</td>
<td rowspan="3" class="center bold table-odds">31:25</td>
<td class="center table-odds result-ok">1.50</td>
</tr>
<tr class="even">
<td rowspan="3" class="center status-text-lost">L</td>
<td rowspan="3" id="status-0IZCD4u8" class="table-time center datet t1570978800-6-1-0-0 ">Today<br>15:00</td>
<td rowspan="3" colspan="2" class="table-participant">
<a class="number2" href="/volleyball/italy/serie-a2-women/marignano-talmassons-0IZCD4u8/#ah;2;-14.50;3">AH -14.5 Points</a>
</td>
<td rowspan="3" class="center bold table-odds">3:1</td>
<td class="center table-odds result-ok">2.01</td>
</tr>
</tbody>
</table>
odds = driver.find_elements_by_class_name('number2')
for odd in odds:
    print(odd.get_attribute('href'))
    print(odd.find_element_by_xpath('../..').find_element_by_class_name('center bold table-odds').text)
Your way to do it:
odds = driver.find_elements_by_class_name('number2')
for odd in odds:
    print(odd.get_attribute('href'))
    # climb to the link's own row, then search only inside that row
    print(odd.find_element_by_xpath('./ancestor::tr[1]').find_element_by_css_selector('.center.bold.table-odds').text)
    # or
    # print(odd.find_element_by_xpath('./ancestor::tr[1]//td[4]').text)
    # or
    # print(odd.find_element_by_xpath("./ancestor::tr[1]//td[contains(@class,'bold')]").text)
Second way:
# find_elements (plural) returns a list of rows that can be iterated
rows = driver.find_elements_by_css_selector('#prediction-table-1 > tbody > tr')
for row in rows:
    print(row.find_element_by_css_selector('.number2').get_attribute('href'))
    print(row.find_element_by_css_selector('.center.bold.table-odds').text)
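The relative `ancestor::tr[1]` lookup can be checked offline, without a browser. The sketch below swaps Selenium for lxml (an assumption made only so the example is self-contained) and uses a simplified copy of the table with the rowspan attributes and the status and time cells removed.

```python
import lxml.html

# Simplified copy of the table from the question.
SAMPLE = """
<table class="table-main odds prediction-table" id="prediction-table-1">
<tbody>
<tr class="odd">
  <td class="table-participant"><a class="number2" href="/handball/europe/challenge-cup/vogosca-sviesa-IwnElQet/#1X2;2">1X2</a></td>
  <td class="center bold table-odds">31:25</td>
  <td class="center table-odds result-ok">1.50</td>
</tr>
<tr class="even">
  <td class="table-participant"><a class="number2" href="/volleyball/italy/serie-a2-women/marignano-talmassons-0IZCD4u8/#ah;2;-14.50;3">AH -14.5 Points</a></td>
  <td class="center bold table-odds">3:1</td>
  <td class="center table-odds result-ok">2.01</td>
</tr>
</tbody>
</table>
"""

root = lxml.html.fromstring(SAMPLE)

results = []
for link in root.xpath(".//a[@class='number2']"):
    # Start at the link, climb to its nearest <tr>, then search only that row.
    score = link.xpath("./ancestor::tr[1]/td[contains(@class, 'bold')]/text()")
    results.append((link.get("href"), score[0] if score else None))

for href, score in results:
    print(href, score)
```

Each link yields the score from its own row because the expression begins with `./`, so it is evaluated relative to the current link and not from the top of the document.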
You have a typo
find_element_by_class_name
should be
find_elements_by_class_name
Make it plural to get them all. Read more here
With the singular form you get only one element, so your loop iterates only once.
odds = driver.find_elements_by_class_name('number2')
I'm trying to find multiple tables using their CSS class names, but initially I only get the CSS class in the output. I want to loop over each of the small tables; each row contains player info, with the td cells holding the attributes of each player. Why doesn't what I have print the table contents to begin with? I want to confirm I have done this first step right before I go on into the tr and td elements of each mini table. I think part of the issue is that the first table.
My program -
import requests
from bs4 import BeautifulSoup
#url = 'https://www.skysports.com/premier-league-table'
base_url = 'https://www.skysports.com'
# Squad Data
squad_url = base_url + '/liverpool-squad'
squad_r = requests.get(squad_url)
print(squad_r.status_code)
premier_squad_soup = BeautifulSoup(squad_r.text, 'html.parser')
premier_squad_table = premier_squad_soup.find_all = ('table', {'class': 'table -small no-wrap football-squad-table '})
print(premier_squad_table)
HTML -
each table looks like the following but with a different title
<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
<colgroup>
<col class="" style="">
<col class="digit-4 -bp30-hdn">
<col class="digit-3 ">
<col class="digit-3 ">
<col class="digit-3 ">
</colgroup>
<thead>
<tr class="text-s -interact text-h6" style="">
<th class=" text-h4 -txt-left" title="">Goalkeeper</th>
<th class=" text-h6" title="Played">Pld</th>
<th class=" text-h6" title="Goals">G</th>
<th class=" text-h6" title="Yellow Cards ">YC</th>
<th class=" text-h6" title="Red Cards">RC</th>
</tr>
</thead>
<tbody>
<tr class="text-h6 -center">
<td>
<a href="/football/player/141016/alisson-ramses-becker">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Alisson Ramses Becker</h6></span>
</div>
</a>
</td>
<td>
13 (0) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="text-h6 -center">
<td>
<a href="/simon-mignolet">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Simon Mignolet</h6></span>
</div>
</a>
</td>
<td>
1 (0) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="text-h6 -center">
<td>
<a href="/football/player/153304/kamil-grabara">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Kamil Grabara</h6></span>
</div>
</a>
</td>
<td>
1 (1) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
Output -
200
('table', {'class': 'table -small no-wrap football-squad-table '})
Had to find the div first to then get the table inside the div
premier_squad_div = premier_squad_soup.find('div', {'class': '-bp30-box col span1/1'})
premier_squad_table = premier_squad_div.find_all('table', {'class': 'table -small no-wrap football-squad-table '})
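The output in the question, a tuple rather than tables, is consistent with the stray `=` after `find_all`: that line assigns the tuple instead of calling the method. A minimal sketch of the difference follows, assuming an inline sample with two tables (the "Defender" table and its "Example Player" row are hypothetical placeholders).

```python
from bs4 import BeautifulSoup

# Inline sample; the second table and its player are hypothetical placeholders.
SAMPLE = """
<div>
<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
<tbody><tr><td>Alisson Ramses Becker</td><td>13 (0)</td></tr></tbody>
</table>
<table class="table -small no-wrap football-squad-table " title="Defender">
<tbody><tr><td>Example Player</td><td>0 (0)</td></tr></tbody>
</table>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Assigning with "=" just stores the tuple; no search is performed.
not_a_search = ("table", {"class": "football-squad-table"})

# Calling find_all performs the search; class_ matches one CSS class
# regardless of the other classes on the element.
tables = soup.find_all("table", class_="football-squad-table")
titles = [table.get("title") for table in tables]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for table in tables
    for tr in table.find_all("tr")
]

print(not_a_search)
print(titles)
print(rows)
```

Matching on the single class `football-squad-table` avoids depending on the exact spacing and order of the full class attribute string.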
Please see the below html table
<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'>
<tr >
<td style='border-bottom:1px solid silver;background:#ffffff;'>
<input checked type=checkbox name=jobs[] value='610974'>
<table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
<tr>
<td style='background:lightgreen;' valign=top>
<img src='../images/checkwhite.png' style='width:30px;'>
</td>
<td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td>
<tr>
<td>Your Input</td>
<td>123 CHARTER RD WETHERSFIELD CT 06109</td>
</tr>
</table>
<br clear=all>
<div style='margin-left:40px;'>09/11/2018
<br>Exterior BPO - Light Photo Set (3 photos*)
<br>$9.00 We found a rep 6.2 miles from job.
<span style='color:silver'> 640x480 Add Datestamp, </span>
<br clear=all>
<div style=float:left;'></div>
</div>
</td>
</td>
<tr >
<td style='border-bottom:1px solid silver;background:#ffffff;'>
<div style='color:red; font-weight:bold; '>Warning... Duplicate Found!</div>
<input checked type=checkbox name=jobs[] value='610975'>
<table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
<tr>
<td style='background:lightgreen;' valign=top>
<img src='../images/checkwhite.png' style='width:30px;'>
</td>
<td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td>
<tr>
<td>Your Input</td>
<td>123 CHARTER RD WETHERSFIELD CT 06109</td>
</tr>
</table>
<br clear=all>
<div style='margin-left:40px;'>09/11/2018
<br>Exterior BPO - Light Photo Set (3 photos*)
<br>$9.00 We found a rep 6.2 miles from job.
<span style='color:silver'> 640x480 Add Datestamp, </span>
<br clear=all>
<div style=float:left;'>
I need the output as :
id="610974" and Address="123 CHARTER RD WETHERSFIELD CT 06109" [1st checkbox value is the id, followed by the corresponding address]
id="610975" and Address="123 CHARTER RD WETHERSFIELD CT 06109" [2nd checkbox value is the id, followed by the corresponding address]
etc....
soup = BeautifulSoup(bodystrip, "lxml")
for tr in response.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text)
    jobid = tds[0].find('input')
    print(jobid)
This raises an error, and the addresses are not extracted properly.
With Scrapy:
for input_node in response.xpath('//input[@name="jobs[]"]'):
    # the checkbox value is the job id
    id = input_node.xpath('./@value').extract_first()
    # the address is the cell next to "Your Input" in the table that follows the checkbox
    address = input_node.xpath('./following-sibling::table[1]//td[.="Your Input"]/following-sibling::td[1]/text()').extract_first()
With BeautifulSoup this should work:
for job in soup.find_all('input',attrs={"type":"checkbox"}):
    print(job['value'])
    print(job.parent.find_all('td',attrs={'style':True})[1].text)
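If the address next to the "Your Input" label is the one wanted, as in the requested output, a BeautifulSoup variant can start from each checkbox and search forward for that label. This is a sketch under stated assumptions: the sample is a simplified, well-formed version of the markup with quoted attributes, and `html.parser` is used so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed version of the markup from the question.
SAMPLE = """
<table>
<tr><td>
<input checked type="checkbox" name="jobs[]" value="610974">
<table>
<tr><td>icon</td><td> 123 Charter Rd Wethersfield CT 06109 </td></tr>
<tr><td>Your Input</td><td>123 CHARTER RD WETHERSFIELD CT 06109</td></tr>
</table>
</td></tr>
<tr><td>
<input checked type="checkbox" name="jobs[]" value="610975">
<table>
<tr><td>icon</td><td> 123 Charter Rd Wethersfield CT 06109 </td></tr>
<tr><td>Your Input</td><td>123 CHARTER RD WETHERSFIELD CT 06109</td></tr>
</table>
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

jobs = []
for box in soup.find_all("input", attrs={"name": "jobs[]"}):
    # First "Your Input" label cell that follows this checkbox in document order.
    label = box.find_next("td", string="Your Input")
    # The address is the cell right after the label.
    address = label.find_next_sibling("td").get_text(strip=True) if label else None
    jobs.append((box["value"], address))

for job_id, address in jobs:
    print('id="%s" and Address="%s"' % (job_id, address))
```

On the real page the markup is malformed (unclosed `tr` tags), so the parser choice may affect the tree; the forward search with `find_next` is meant to be less sensitive to that than positional indexing.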
I am trying to define a function which extracts all rows of the 'Basisdaten' table on the website https://de.wikipedia.org/wiki/Stuttgart and return a dictionary whose keys and values correspond to the first and second cells in each row of the table.
The 'Basisdaten' table is part of a much larger table, as shown through the result of the following code:
import re  # needed for re.compile below
from bs4 import BeautifulSoup
import requests
r = requests.get("https://de.wikipedia.org/wiki/Stuttgart")
soup = BeautifulSoup(r.text, "html.parser")
soup.find('th', text=re.compile('Basisdaten')).find_parent('table')
Unfortunately, there is no unique ID which I can use to only select those rows making up the 'Basisdaten' table. These are the rows which I hope to extract in HTML format:
<tr>
<th colspan="2">Basisdaten
</th></tr>
<tr class="hintergrundfarbe2">
<td>Bundesland:</td>
<td>Baden-Württemberg
</td></tr>
<tr class="hintergrundfarbe2">
<td>Regierungsbezirk:
</td>
<td>Stuttgart
</td></tr>
<tr class="hintergrundfarbe2">
<td>Höhe:
</td>
<td>247 m ü. NHN
</td></tr>
<tr class="hintergrundfarbe2">
<td>Fläche:
</td>
<td>207,35 km<sup>2</sup>
</td></tr>
<tr class="hintergrundfarbe2">
<td>Einwohner:
</td>
<td style="line-height: 1.2em;">628.032 <small><i>(31. Dez. 2016)</i></small><sup class="reference" id="cite_ref-Metadaten_Einwohnerzahl_DE-BW_1-0">[1]</sup>
</td></tr>
<tr class="hintergrundfarbe2">
<td>Bevölkerungsdichte:
</td>
<td>3029 Einwohner je km<sup>2</sup>
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Postleitzahlen:
</td>
<td>70173–70619
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Vorwahl:
</td>
<td>0711
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Kfz-Kennzeichen:
</td>
<td>S
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Gemeindeschlüssel:
</td>
<td>08 1 11 000
</td></tr>
<tr class="hintergrundfarbe2 metadata">
<td>LOCODE:
</td>
<td>DE STR
</td></tr>
<tr class="hintergrundfarbe2 metadata">
<td>NUTS:
</td>
<td>DE111
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Stadtgliederung:
</td>
<td>23 Stadtbezirke<br/>mit 152 Stadtteilen
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Adresse der<br/>Stadtverwaltung:
</td>
<td>Marktplatz 1<br/>70173 Stuttgart
</td></tr>
<tr class="hintergrundfarbe2" style="vertical-align: top;">
<td>Webpräsenz:
</td>
<td style="max-width: 10em; overflow: hidden; word-wrap: break-word;"><a class="external text" href="//www.stuttgart.de/" rel="nofollow">www.stuttgart.de</a>
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Oberbürgermeister:
</td>
<td>Fritz Kuhn (Bündnis 90/Die Grünen)
</td></tr>
I have succeeded in writing this code which gives me the desired result in dictionary form:
data = []
def extractDict(y):
    results = y.find("th", {"colspan" : "2"}).find_parent('table').select('td')[3:35]
    for row in results:
        data.append(row.text.strip().replace('\xa0', '').replace(':', '').replace('[1]', ''))
    return dict(zip(data[::2], data[1::2]))
basisdaten=extractDict(soup)
basisdaten
Result:
{'Adresse derStadtverwaltung': 'Marktplatz 170173 Stuttgart',
'Bevölkerungsdichte': '3029Einwohner je km2',
'Bundesland': 'Baden-Württemberg',
'Einwohner': '628.032 (31.Dez.2016)',
'Fläche': '207,35km2',
'Gemeindeschlüssel': '08111000',
'Höhe': '247m ü.NHN',
'Kfz-Kennzeichen': 'S',
'LOCODE': 'DE STR',
'NUTS': 'DE111',
'Oberbürgermeister': 'Fritz Kuhn (Bündnis 90/Die Grünen)',
'Postleitzahlen': '70173–70619',
'Regierungsbezirk': 'Stuttgart',
'Stadtgliederung': '23 Stadtbezirkemit 152 Stadtteilen',
'Vorwahl': '0711',
'Webpräsenz': 'www.stuttgart.de'}
However I am looking for a better solution which does not involve simply picking the 4th to 35th row from the parent table. I subsequently intend to use this code on other similar wikipedia urls and the 'Basisdaten' tables may vary across websites in terms of number of rows.
The similarity amongst all 'Basisdaten' tables is that they are all embedded within the first table and that they all have two columns, hence all start with 'th colspan="2"'. The parent table contains other subtables, for example in this case the subtable 'Lage der Stadt Stuttgart in Baden-Württemberg' comes after 'Basisdaten'.
Is it possible to write a loop which searches for the 'Basisdaten' subtable header and takes all rows thereafter, but stops when it reaches the next subtable header ('th colspan="2"')?
I have only gotten as far as to find the row which contains the start of the Basisdaten table:
soup.find('th', text=re.compile('Basisdaten'))
Hope that made sense! I am very new to Beautifulsoup and Python and this is a very challenging problem for me.
This should do:
from bs4 import BeautifulSoup
import requests
data = requests.get("https://de.wikipedia.org/wiki/Stuttgart").text
soup = BeautifulSoup(data, "lxml")
trs = soup.select('table[id*="Infobox"] tr')
is_in_basisdaten = False
data = {}
clean_data = lambda x: x.get_text().strip().replace('\xa0', '').replace(':', '')
for tr in trs:
    if tr.th:
        # get_text() is used instead of .string, which is None when the th has nested tags
        if "Basisdaten" in tr.th.get_text():
            is_in_basisdaten = True
        # the next header row after 'Basisdaten' ends the section
        if is_in_basisdaten and "Basisdaten" not in tr.th.get_text():
            break
    elif is_in_basisdaten:
        key, val = tr.select('td')
        data[clean_data(key)] = clean_data(val)
print(data)
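The loop asked for in the question (start at the 'Basisdaten' header, stop at the next header row) can also be written by walking the sibling rows of the header. The following is a sketch on a small inline sample, not the live page: the rows before 'Basisdaten' and after the next header are hypothetical, and `extract_section` is a helper name invented for this example.

```python
from bs4 import BeautifulSoup

# Inline sample; the first two rows and the last row are hypothetical filler.
SAMPLE = """
<table>
<tr><th colspan="2">Stuttgart</th></tr>
<tr><td>Wappen</td><td>Deutschlandkarte</td></tr>
<tr><th colspan="2">Basisdaten</th></tr>
<tr class="hintergrundfarbe2"><td>Bundesland:</td><td>Baden-Württemberg</td></tr>
<tr class="hintergrundfarbe2"><td>Regierungsbezirk:</td><td>Stuttgart</td></tr>
<tr class="hintergrundfarbe2"><td>Vorwahl:</td><td>0711</td></tr>
<tr><th colspan="2">Lage der Stadt Stuttgart in Baden-Württemberg</th></tr>
<tr><td colspan="2">Karte</td></tr>
</table>
"""


def extract_section(soup, title):
    """Return {first cell: second cell} for the rows following the header `title`."""
    header = soup.find(lambda tag: tag.name == "th" and title in tag.get_text())
    if header is None:
        return {}
    result = {}
    for tr in header.find_parent("tr").find_next_siblings("tr"):
        if tr.find("th"):
            break  # the next sub-table header ends the section
        cells = tr.find_all("td")
        if len(cells) == 2:
            key = cells[0].get_text(" ", strip=True).rstrip(":")
            result[key] = cells[1].get_text(" ", strip=True)
    return result


basisdaten = extract_section(BeautifulSoup(SAMPLE, "html.parser"), "Basisdaten")
print(basisdaten)
```

Since the stop condition is "a row that contains a th", the number of rows in the section does not need to be known in advance.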
I am scraping a page using Selenium, Python and Beautiful Soup, and I want to output the rows of a table as comma-delimited values. Unfortunately the HTML of the page is all over the place. So far I have managed to extract two columns by using the IDs of their elements. The rest of the values are just contained in td elements without an identifier such as class or id. Here is a sample of the results.
<table id="tblResults" style="z-index: 102; left: 18px; width: 956px;
height: 547px" cellspacing="1" width="956" border="0">
<tr style="color:Black;background-color:LightSkyBlue;font-family:Arial;font-weight:normal;font-style:normal;text-decoration:none;">
<td> </td>
<td> </td>
<td>Select</td>
<td>T</td>
<td>Party</td>
<td>Opposite Party</td>
<td style="width:50px;">Type</td>
<td style="width:100px;">Book-Page</td>
<td style="width:70px;">Date</td>
<td>Town</td>
</tr>
<tr style="font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnView" value="View" id="ContentPlaceHolder1_grdResults_btnView_0" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_0" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_0" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
">MOSES ALBERT G</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
"></span>
</td>
<td valign="top">MAP</td>
<td valign="top">- </td>
<td valign="top">01/16/1953</td>
<td valign="top">TOWN OF BINGHAMTON</td>
</tr>
<tr style="background-color:Gainsboro;font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnView" value="View*" id="ContentPlaceHolder1_grdResults_btnView_1" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_1" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_1" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">MOSES ALEXANDRA/GDN</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">GOODRICH MERLE L</span>
</td>
</table>
This is the script that I have written so far, which works for two columns:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = open('searched.html')
bsObj = BeautifulSoup(html)
myTable = bsObj.findAll("tr",{ "style":re.compile("font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;")} )
for table_ in myTable:
    party = table_.find("span", {"id": re.compile("Party1_*")})
    oppositeParty= table_.find("span", {"id": re.compile("Party2_*")})
    print(party.get_text()+ "," + oppositeParty.get_text())
I have tried using the children of myTable as follows:
myTable.children
If all you want is to just dump out the content, something like this should do:
for table_ in bsObj.find_all("table"):
    for row_ in table_.find_all("tr"):
        columns = row_.find_all("td")
        # print the comma-delimited text of the columns; print() also ends the row
        print(",".join(column_.get_text(strip=True) for column_ in columns))
If you're really wanting to scrape specific information, you'll need to provide us with more instructions about what your ultimate goal is.
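For comma-delimited output, cell text that itself contains commas or quotes needs escaping, which the standard `csv` module handles. The sketch below assumes a shortened inline copy of the table (the button and checkbox columns and the `title` attributes are left out) and writes to an in-memory buffer instead of a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the results table from the question.
SAMPLE = """
<table id="tblResults">
<tr><td>T</td><td>Party</td><td>Opposite Party</td><td>Type</td><td>Date</td><td>Town</td></tr>
<tr>
  <td>1</td>
  <td><span id="ContentPlaceHolder1_grdResults_lblParty1_0">MOSES ALBERT G</span></td>
  <td><span id="ContentPlaceHolder1_grdResults_lblParty2_0"></span></td>
  <td valign="top">MAP</td>
  <td valign="top">01/16/1953</td>
  <td valign="top">TOWN OF BINGHAMTON</td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

buffer = io.StringIO()
writer = csv.writer(buffer)
for tr in soup.find("table", id="tblResults").find_all("tr"):
    # One CSV row per table row; cells without text become empty fields.
    writer.writerow(td.get_text(" ", strip=True) for td in tr.find_all("td"))

csv_text = buffer.getvalue()
print(csv_text)
```

Cells are taken by position, so columns that have no class or id are included as well.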