Selecting specific values from a table using BeautifulSoup - python

I have searched similar questions and given it some thought, but I am new to Python and can't seem to figure this out. I am trying to scrape data from the player table on this page:
http://www.rotoworld.com/teams/depth-charts/mlb.aspx
The HTML for each entry (player) is for example:
<td><b>3B</b></td><td>1. <a href='/player/mlb/6242/manny-machado'>Manny Machado</a></td>
So I can run
players=soup.select('td > a')
to get a list of all players. However I would like to select only players of a specific position, i.e. all the 3B, SS etc. The position is just another text string, and I can't seem to differentiate by it. Does anybody have any idea where I might be able to start with this?
Edit: of course this would be simple if the same positions were always in the same rows, e.g. 1B always rows 2-3 but as can be seen from the table this is not the case.

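You can loop over the rows of data and check the sibling of each position cell:
for row in soup.find_all('tr'):
    cell = row.find('td')  # the first cell of the row holds the position
    if cell is not None and cell.text == '3B':
        # the next <td> holds the player link; .text gives just the name
        print(cell.find_next_sibling('td').find('a').text)
Which will output:
Manny Machado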
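If you want every position at once rather than just 3B, the same row-by-row idea can be sketched as below. The sample HTML only imitates the markup from the question: the second and third rows, their player names and their hrefs are made up, and players_by_position is a hypothetical helper name, not part of BeautifulSoup.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Made-up sample in the same shape as the markup in the question.
# Only the Manny Machado row comes from the question itself.
SAMPLE_HTML = """
<table>
<tr><td><b>3B</b></td><td>1. <a href='/player/mlb/6242/manny-machado'>Manny Machado</a></td></tr>
<tr><td><b>SS</b></td><td>1. <a href='/player/mlb/0/example-shortstop'>Example Shortstop</a></td></tr>
<tr><td><b>3B</b></td><td>2. <a href='/player/mlb/0/example-backup'>Example Backup</a></td></tr>
</table>
"""


def players_by_position(html):
    """Group player names by the position label in the first cell of each row."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = defaultdict(list)
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # not a position/player row
        position = cells[0].get_text(strip=True)
        link = cells[1].find("a")
        if position and link is not None:
            grouped[position].append(link.get_text(strip=True))
    return dict(grouped)


positions = players_by_position(SAMPLE_HTML)
print(positions.get("3B"))
```

This assumes each row holds exactly one position cell followed by one player cell; if the real table packs several teams into one row, you would need to walk the cells in pairs instead.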

Related

Choose XPATH based on <th> string value with selenium

There is a table that I want to get the XPATH of; however, the number of rows and columns is inconsistent across results, so I can't just right-click and copy the full XPATH.
My current code:
result_priority_number = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[6]/div/div[2]/table/tbody/tr[18]/td[2]")
The table header names, though, are always consistent. How do I get the value of an element where the table header specifically says something (e.g. "Priority Number")?
Regarding "I can't just right click and copy the full XPATH": never rely on a copied absolute path. XPath can also search by content; it isn't just for nested paths. For example:
//td[contains(text(),'header value')]
or, if the page has many tables and you want only one of them:
//table[@id='id_of_table']//td[contains(text(),'header value')]
or, if the table has no id or class:
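(//table)[2]//td[contains(text(),'header value')]
where 2 is the 1-based position of the table in the page. The parentheses matter: //table[2] would instead match any table that is the second table child of its parent.
XPath has many other features for searching HTML nodes.
In your case, to get the value next to "Filing language":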
//td[contains(text(),'Filing language')]/following-sibling::td
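To try an expression like this without driving a browser, here is a minimal sketch that runs the same XPath through lxml on a small inline table. The table contents ("English", "12345") are invented for illustration, and value_for_label is a hypothetical helper name. The label is passed as an XPath variable ($label), which lxml supports through keyword arguments to xpath().

```python
from lxml import html

# Invented sample table: a label cell followed by a value cell in each row.
SAMPLE = """
<table>
  <tr><td>Filing language</td><td>English</td></tr>
  <tr><td>Priority Number</td><td>12345</td></tr>
</table>
"""


def value_for_label(doc, label):
    """Return the text of the cell right after the cell containing `label`."""
    cells = doc.xpath(
        "//td[contains(text(), $label)]/following-sibling::td[1]",
        label=label,
    )
    return cells[0].text_content().strip() if cells else None


doc = html.fromstring(SAMPLE)
print(value_for_label(doc, "Priority Number"))
```

With Selenium the same expression string can be passed to driver.find_element(By.XPATH, ...), with the label written directly into the expression.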

getting txt from multiple spans with python selenium

I would like to get the text value of a span class "currency-coins value" to be used in a comparison.
Basically I want to check the market value of a specific player. I get the player listed 20 times in a container. So the "currency-coins value" is shown 20 times on the page.
Now I need to get the "200" as shown in the screenshot of the HTML code above as value I can work with. And this for all 20 results on the page. The value might be different for all 20 results.
After I got all 20 values, I want to check which one is the lowest.
I will then afterwards use the lowest value as price to list my element on the market.
Is there a way to do this? I have been learning Python for a little more than a week, so I can't figure it out myself.
The idea is to iterate over the player containers first (usually these are table rows) and, for each container, locate the price element within it. For instance:
from selenium.webdriver.common.by import By

for row in driver.find_elements(By.CSS_SELECTOR, "table tbody > tr"):
    # current Selenium releases use find_element(By..., ...) instead of find_element_by_*
    coin_value = float(row.find_element(By.CSS_SELECTOR, ".currency-coins.value").text)
    print(coin_value)
Note that table tbody > tr is only an example; your locator for the table rows or player containers is likely different.
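Once you have the text of each price element, finding the lowest one is plain Python. A minimal sketch, assuming you collect the .text of every .currency-coins.value element into a list first; lowest_coin_value is a hypothetical helper, the sample strings are made up, and treating "," as a thousands separator is an assumption about how the site formats prices.

```python
def lowest_coin_value(price_texts):
    """Return the smallest price among the scraped strings, or None if none parse."""
    values = []
    for text in price_texts:
        # Assumption: prices may contain "," as a thousands separator.
        cleaned = text.replace(",", "").strip()
        if cleaned.isdigit():
            values.append(int(cleaned))
    return min(values) if values else None


# Made-up texts, standing in for what element.text might return for each result.
sample_texts = ["200", "1,500", " 350 ", "", "250"]
print(lowest_coin_value(sample_texts))  # 200
```

Skipping strings that do not parse keeps one empty or odd-looking result from crashing the whole comparison.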

Python xpath to get text from a table

Using requests and lxml, I have been trying to build a small API that, given certain parameters, downloads a timetable from a certain website (the URL is in the code below). I am a complete newbie at this, and aside from the hours I can't seem to get anything else.
I've been messing around with xpath code but mostly what I get is a simple []. I've been trying to get the first line of classes that correspond to the first line of hours (8.00-8.30) which should probably appear as something like this [,,,Introdução à Gestão,].
page = requests.get('https://fenix.iscte-iul.pt/publico/siteViewer.do?method=roomViewer&roomName=2E04&objectCode=4787574275047425&executionPeriodOID=4787574275047425&selectedDay=1542067200000&contentContextPath_PATH=/estudante/consultar/horario&_request_checksum_=ae083a3cc967c40242304d1f720ad730dcb426cd')
tree = html.fromstring(page.content)
class_block_one = tree.xpath('//table[@class="timetable"]/tbody/tr[1]/td[@class=*]/a/abbr//text()')
print(class_block_one)
To get the required text from the first data row (actually the second row of the table), you can try the XPath below:
'//table[@class="timetable"]//tr[2]/td/a/abbr//text()'
You can also get the values from all rows:
for row in tree.xpath('//table[@class="timetable"]//tr'):
    print(row.xpath('./td/a/abbr//text()'))
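One possible reason for the empty list is the /tbody/ step: browsers insert a tbody element when they display a table, but lxml does not add one if it is missing from the raw HTML. The sketch below shows the difference on a tiny made-up table; the real page may be structured differently, so treat this as an illustration of the effect rather than a diagnosis.

```python
from lxml import html

# Made-up timetable with no <tbody> in the source, as often served by a site.
SAMPLE = """
<table class="timetable">
  <tr><th>Hours</th><th>Monday</th></tr>
  <tr><td>8.00-8.30</td><td><a href="#"><abbr>Introdução à Gestão</abbr></a></td></tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# A path copied from browser dev tools usually contains /tbody/ and finds nothing here.
with_tbody = tree.xpath('//table[@class="timetable"]/tbody/tr')

# Using // between table and tr works whether or not a tbody is present.
without_tbody = tree.xpath('//table[@class="timetable"]//tr')
names = tree.xpath('//table[@class="timetable"]//tr[2]/td/a/abbr//text()')

print(with_tbody)
print(len(without_tbody))
print(names)
```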

How to get data-timestamp using python/selenium

Below is the html of the table I want to extract the data-timestamp from.
The webpage is at https://nl.soccerway.com/national/argentina/primera-division/20182019/regular-season/r47779/matches/?ICID=PL_3N_02
So far I have tried various variants I found on here, but nothing seemed to work. Can someone help me extract (for example) 1536962400? In other words, I want to extract every data-timestamp value in the table. Any suggestions are more than welcome! I have used selenium/python to extract table data from the website, but data-timestamp always gives errors.
data-timestamp is an attribute of the tr element, so you can try this:
from selenium.webdriver.common.by import By

element_list = driver.find_elements(By.XPATH, "//table[contains(@class,'matches')]/tbody/tr")
for item in element_list:
    print(item.get_attribute('data-timestamp'))
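If you already have the page source (for example from driver.page_source), the attribute values can also be selected directly with an XPath ending in /@data-timestamp. A small sketch with lxml on a made-up table; the second timestamp and the cell contents are invented, and reading the values as Unix timestamps in seconds is an assumption based on the attribute name and the example value.

```python
from datetime import datetime, timezone

from lxml import html

# Made-up rows; only the value 1536962400 comes from the question.
SAMPLE = """
<table class="matches table">
  <tbody>
    <tr data-timestamp="1536962400"><td>Home A</td><td>Away A</td></tr>
    <tr data-timestamp="1537048800"><td>Home B</td><td>Away B</td></tr>
  </tbody>
</table>
"""

doc = html.fromstring(SAMPLE)

# Selecting the attribute itself returns a list of strings.
raw = doc.xpath("//table[contains(@class, 'matches')]//tr/@data-timestamp")

# Assumption: the values are Unix timestamps in seconds.
kickoffs = [datetime.fromtimestamp(int(value), tz=timezone.utc) for value in raw]

for value, kickoff in zip(raw, kickoffs):
    print(value, kickoff.isoformat())
```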

Scraping Text from table using Soup / Xpath / Python

I need help in extracting data from : http://agmart.in/crop.aspx?ccid=1&crpid=1&sortby=QtyHigh-Low
Using the filter, there are about 4 pages of data (Under rice crops) in tables I need to store.
I'm not quite sure how to proceed with it. been reading up all the documentation possible. For someone who just started python, I'm very confused atm. Any help is appreciated.
Here's a code snippet I'm basing it on:
Example website : http://www.uscho.com/rankings/d-i-mens-poll/
from urllib.request import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())
for section in tree.xpath('//section[@id="rankings"]'):
    # heading and date on one line, followed by a blank line
    print(section.xpath('h1[1]/text()')[0], end=' ')
    print(section.xpath('h3[1]/text()')[0])
    print()
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print('%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td')))
    print()
I can't seem to understand any of the code above. Only understood that the URL is being read. :(
Thank you for any help!
Just as CSS has selectors like .window or #rankings, XPath is used to navigate through elements and attributes in XML (and HTML).
So in the outer for loop, you first search for an element called "section", on the condition that it has an id attribute whose value is rankings. That section contains the heading "Final USCHO.com Division I Men's Poll", the date, and the table. There is only one such element, so this loop runs only once. Inside it you extract the text (everything within the tags) of h1 (the heading) and h3 (the date).
The next part selects the rows of the table inside the section, with a condition on each row's class: it must be even or odd. Since you need all the rows in this table, that condition is not doing anything here.
You could replace the line
for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
with
for row in section.xpath('table/tr'):
Inside the loop, row.xpath('td') returns each td element, i.e. each cell in that row. When you iterate over them you get one cell element per value, e.g. 1, Providence, 49, 26-13-2, 997, 15 for the first line of the table on the web page.
Try this for yourself. Replace the last loop block with this much easier to read alternative:
for row in section.xpath('table/tr'):
    print(row.xpath('td//text()'))
You will see that it presents all the table data as Python lists, each list item containing one cell's text. Your code is just a fancier way of printing those items as one formatted string with padding between them. The xpath() method returns Element objects, which represent XML/HTML elements, when the expression selects elements; an expression ending in text(), such as xpath('something//text()'), returns the actual text content within that tag instead.
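To see the difference between elements and text without fetching the page, here is a small sketch on an inline document. The heading, date and markup are made up; only the cell values of the row (1, Providence, 49, 26-13-2, 997, 15) are taken from the example above.

```python
from lxml import etree

# Made-up stand-in for the rankings section; not the real page markup.
SAMPLE = """
<section id="rankings">
  <h1>Sample Poll</h1>
  <h3>January 1, 2000</h3>
  <table>
    <tr class="odd"><td>1</td><td>Providence</td><td>49</td><td>26-13-2</td><td>997</td><td>15</td></tr>
  </table>
</section>
"""

tree = etree.HTML(SAMPLE)
section = tree.xpath('//section[@id="rankings"]')[0]
row = section.xpath('table/tr')[0]

cells = row.xpath('td')          # a list of Element objects, one per cell
texts = row.xpath('td//text()')  # a list of plain strings, one per text node

# The original format string pads each cell's text to a fixed column width.
line = '%-3s %-20s %10s %10s %10s %10s' % tuple(
    ''.join(col.xpath('.//text()')) for col in cells)

print(cells[0].tag)
print(texts)
print(line)
```

The ''.join(...) around each cell's text nodes matters when a cell contains nested tags such as links, because then a single cell can yield more than one text node.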
Here're a few helpful references:
Easy to understand tutorial :
http://www.w3schools.com/xpath/xpath_examples.asp
Stackoverflow question : Extract text between tags with XPath including markup
Another tutorial : http://www.tutorialspoint.com/xpath/
