Scraping Text from table using Soup / Xpath / Python - python

I need help in extracting data from : http://agmart.in/crop.aspx?ccid=1&crpid=1&sortby=QtyHigh-Low
Using the filter, there are about 4 pages of data (Under rice crops) in tables I need to store.
I'm not quite sure how to proceed with it. I've been reading all the documentation I can find, but as someone who just started Python, I'm very confused at the moment. Any help is appreciated.
Here's a code snippet I'm basing it on:
Example website : http://www.uscho.com/rankings/d-i-mens-poll/
from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[@id="rankings"]'):
    print section.xpath('h1[1]/text()')[0],
    print section.xpath('h3[1]/text()')[0]
    print
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print
I can't seem to understand any of the code above; I only understood that the URL is being read. :(
Thank you for any help!

Just like we have CSS selectors such as .window or #rankings, XPath is used to navigate through the elements and attributes of an XML/HTML document.
So in the for loop, you're first searching for an element called section, with the condition that it has an attribute id whose value is rankings. There is only one such element on this page, so the loop runs only once. This section also contains the heading "Final USCHO.com Division I Men's Poll", the date, and the table. That's where you're extracting the text (everything within the tags) of h1 (the heading) and h3 (the date).
The next part selects the table and iterates over its rows, with a condition on each row's class: it can be even or odd. Because you need all the rows in this table anyway, that condition isn't doing anything here.
You could replace the line
for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
with
for row in section.xpath('table/tr'):
Inside the loop, row.xpath('td') returns each td element in that row, i.e. each cell. When you iterate over them, you receive one cell element each for 1, Providence, 49, 26-13-2, 997, 15 - check the first row of the table on the webpage.
Try this for yourself. Replace the last loop block with this much easier to read alternative:
for row in section.xpath('table/tr'):
    print row.xpath('td//text()')
You will see that it prints all the table data as Python lists, each list item containing one cell. Your original code is just a fancier way of writing those list items joined into a single string with spaces between them. The xpath() method returns objects of type Element, which represent XML/HTML elements; xpath('something//text()') produces the actual text content within that tag.
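To see this pattern without hitting the network, here is a minimal sketch on a made-up one-row table; the markup is illustrative, not the real page:

```python
from lxml import etree

# a tiny stand-in for one ranking row (hypothetical markup)
tree = etree.HTML(
    "<table><tr class='odd'>"
    "<td>1</td><td>Providence</td><td>49</td>"
    "</tr></table>")

rows = []
for row in tree.xpath('//table/tr'):
    # td//text() collects the text of every cell in the row
    rows.append(row.xpath('td//text()'))
print(rows)  # [['1', 'Providence', '49']]
```

Each row comes back as a plain Python list of cell strings, which is usually the easiest shape to work with before any formatting.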
Here are a few helpful references:
Easy to understand tutorial :
http://www.w3schools.com/xpath/xpath_examples.asp
Stackoverflow question : Extract text between tags with XPath including markup
Another tutorial : http://www.tutorialspoint.com/xpath/

Related

I have created a list using find_all in Beautiful Soup based on an attribute. How do I return the node I want?

I have an MS Word document template that has structured document tags, including repeating sections. I am using a Python script to pull out the important parts and send them to a dataframe. My script works as intended on 80% of the documents I have attempted, but it often fails on the rest. The issue is in finding the first repeating section; I have been doing the following:
from bs4 import BeautifulSoup as BS

soup = BS(f, 'xml')           # parse the entire XML file
soupdocument = soup.document  # document is the only child node of soup
soupbody = soupdocument.body  # body is the only child node of document
ODR = soupbody.contents[5]
This often works; however, some users have managed to press Enter in places in the document that are not locked down, which shifts the index. I know the issue should be resolved by not hard-coding the 5th element of soupbody.
soupbody.find_all('tag')
<w:tag w:val="First Name"/>,
<w:tag w:val="Last Name"/>,
<w:tag w:val="Position"/>,
<w:tag w:val="Phone Number"/>,
<w:tag w:val="Email"/>,
<w:tag w:val="ODR Repeating Section"/>,
The above is a partial list of what is returned; the actual list has several dozen tags, and some are repeated. The section I want is the last one listed above, and it is usually, but not always, found by the first code block. I believe I can pass a filter like find_all({'tag': SOMETHING}); I have tried cutting and pasting all different parts of "ODR Repeating Section", but it doesn't work. What is the correct way to find this section?
Hi, perhaps specify the attribute you're searching for in addition to the tag name:
tags = soup.find_all('tag', {'val': 'ODR Repeating Section'})
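As a minimal, namespace-free sketch of that attribute filter (the real file uses the prefixed attribute w:val, so the dictionary key may need adjusting there):

```python
from bs4 import BeautifulSoup

# simplified stand-in for the document's content controls (invented markup)
xml = """<body>
  <tag val="First Name"/>
  <tag val="ODR Repeating Section"><item>repeating data</item></tag>
</body>"""

soup = BeautifulSoup(xml, "xml")
# find_all / find accept a dict mapping attribute names to required values
odr = soup.find("tag", {"val": "ODR Repeating Section"})
print(odr.item.text)  # repeating data
```

Filtering by attribute value this way is robust against users inserting extra paragraphs, because it no longer depends on the element's position in .contents.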

Python xpath to get text from a table

With requests and lxml I have been trying to create a small API that, given certain parameters, downloads a timetable from a certain website (this one). I am a complete newbie at this kind of thing, and aside from the hours I can't seem to get anything else.
I've been messing around with XPath expressions, but mostly what I get is an empty []. I've been trying to get the first row of classes that corresponds to the first row of hours (8.00-8.30), which should probably appear as something like [,,,Introdução à Gestão,].
import requests
from lxml import html

page = requests.get('https://fenix.iscte-iul.pt/publico/siteViewer.do?method=roomViewer&roomName=2E04&objectCode=4787574275047425&executionPeriodOID=4787574275047425&selectedDay=1542067200000&contentContextPath_PATH=/estudante/consultar/horario&_request_checksum_=ae083a3cc967c40242304d1f720ad730dcb426cd')
tree = html.fromstring(page.content)
class_block_one = tree.xpath('//table[@class="timetable"]/tbody/tr[1]/td[@class=*]/a/abbr//text()')
print(class_block_one)
To get the required text from the first (actually second) row, you can try the XPath below:
'//table[@class="timetable"]//tr[2]/td/a/abbr//text()'
You can also get values from all rows:
for row in tree.xpath('//table[@class="timetable"]//tr'):
    print(row.xpath('./td/a/abbr//text()'))
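A small self-contained sketch of that row loop, using invented timetable markup rather than the real page, shows why header rows come back as empty lists:

```python
from lxml import html

# minimal fragment standing in for the timetable (hypothetical markup)
table = html.fromstring(
    "<table class='timetable'>"
    "<tr><th>Hours</th><th>Mon</th></tr>"
    "<tr><td>8.00-8.30</td><td><a><abbr>IG</abbr></a></td></tr>"
    "</table>")

# rows with only <th> cells, or cells without <a>/<abbr>, yield []
cells = [row.xpath('./td/a/abbr//text()') for row in table.xpath('//tr')]
print(cells)  # [[], ['IG']]
```

Note also that browsers insert a tbody element when rendering, but it is often absent from the raw HTML that lxml parses, which is why the answer's expression uses // between table and tr instead of /tbody/.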

XPath for LXML with Intermediary Element

I'm trying to scrape some pages with python and LXML. My test page is http://www.sarpy.com/oldterra/prop/PDisplay3.asp?ParamValue1=010558233
I'm having good luck with most of the XPaths. For example,
tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../tr[3]/td[1]/text()')
successfully gets me the date of the first sale listed. I have several other pieces too. However, I cannot get the B&P listed under the sale date. For example the B&P of the first sale is 200639333.
I notice in the page structure that there is a form element preceding the tr of the B&P item. Since it's the next table row, I tried incrementing the tr index as follows:
tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../tr[4]/td[1]/text()')
That returns:
['\r\n ']
Because of the line breaks and sub element of br and input within the field, I tried making text() into text()[1], text()[2], etc., but no luck.
I tried to base the path off of the adjacent form like this:
tree.xpath('/html/body/table[7]/form[@action="../rod/ImageDisplay.asp"]/following-sibling::tr/td[1]/text()')
No luck.
I figure there are two potential issues: the intermediary form elements that may be breaking the indexing patterns, and the whitespace. I'd appreciate any help in correcting this xpath.
The <tr> you are looking for is a child of the <form>, not its sibling, so try:
tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../form[1]/tr[1]/td[1]/text()')
This may get you 200639333 surrounded by a lot of whitespace.
Or -
tree.xpath('/html/body/table[7]/form[@action="../rod/ImageDisplay.asp"]/tr[1]/td[1]/text()')
For all such elements.
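A minimal sketch with an invented fragment (assuming the <tr> really is nested inside the <form>, as described above) shows the difference between the sibling and child axes:

```python
from lxml import etree

# hypothetical structure mirroring the page: the <tr> sits inside the <form>
doc = etree.fromstring(
    "<table>"
    "<form action='../rod/ImageDisplay.asp'>"
    "<tr><td>200639333</td></tr>"
    "</form>"
    "</table>")

# sibling axis finds nothing, because <tr> is not a sibling of <form>
print(doc.xpath("form/following-sibling::tr/td/text()"))  # []
# child axis reaches the cell
print(doc.xpath("form[@action='../rod/ImageDisplay.asp']/tr/td/text()"))  # ['200639333']
```

Be aware that real HTML parsers sometimes relocate a form element that appears between table rows, so it is worth inspecting the parsed tree (e.g. with etree.tostring) rather than trusting the page source.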

[Python]Get a XPath Value from Steam and print it

I want to get an XPath value from a Steam store page, e.g. http://store.steampowered.com/app/234160/. On the right side are two boxes. The first one contains Title, Genre, Developer, and so on; I just need the Genre here. The count differs for every game: some have 4 genres, some just one. Then there is another block where the game features are listed (like Singleplayer, Multiplayer, Co-op, Gamepad, ...).
I need all those values.
Also, sometimes there is an image in between (PEGI/USK):
http://store.steampowered.com/app/233290.
import requests
from lxml import html

page = requests.get('http://store.steampowered.com/app/234160/')
tree = html.fromstring(page.text)
blockone = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[1]")
blocktwo = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[2]")
print "Detailblock:", blockone
print "Featureblock:", blocktwo
This is the code I have so far. When I run it, it just prints:
Detailblock: [<Element div at 0x2ce5868>]
Featureblock: [<Element div at 0x2ce58b8>]
How do I make this work?
xpath returns a list of matching elements. You're just printing out that list.
If you want the first element, you need blockone[0]. If you want all elements, you have to loop over them (e.g., with a comprehension).
And meanwhile, what do you want to print for each element? The direct inner text? The HTML for the whole subtree rooted at that element? Something else? Whatever you want, you need to use the appropriate method on the Element type to get it; lxml can't read your mind and figure out what you want, and neither can we.
It sounds like what you really want is just some elements deeper in the tree. You could xpath your way there. (Instead of going through all of the elements one by one and relying on index as you did, I'm just going to write the simplest way to get to what I think you're asking for.)
genres = [a.text for a in blockone[0].xpath('.//a')]
Or, really, why even get that blockone in the first place? Why not just xpath directly to the elements you wanted in the first place?
gtags = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[1]//a")
genres = [a.text for a in gtags]
Also, you could make this a lot simpler—and a lot more robust—if you used the information in the tags instead of finding them by explicitly walking the structure:
gtags = tree.xpath(".//div[@class='glance_tags popular_tags']//a")
Or, since there don't seem to be any other app_tag items anywhere, just:
gtags = tree.xpath(".//a[@class='app_tag']")
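A self-contained sketch on made-up markup (class names taken from the answer, not verified against the live page) shows why the class-based approach is more robust than a chain of positional div indexes:

```python
from lxml import html

# simplified stand-in for the Steam tag block (hypothetical markup)
doc = html.fromstring(
    "<div class='glance_tags popular_tags'>"
    "<a class='app_tag'>Action</a>"
    "<a class='app_tag'>Indie</a>"
    "</div>")

# matching on the class attribute survives layout changes that would
# break a path like div[4]/div[3]/div[2]/...
genres = [a.text for a in doc.xpath("//a[@class='app_tag']")]
print(genres)  # ['Action', 'Indie']
```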

XPath for selecting multiple HTML `a` elements

I'm pretty new to XPath and couldn't figure it out looking at other solutions.
What I'm trying to do is select all the a elements inside a given td (td[2] in the example) and then run a for loop to output the text contained within those a elements.
Source code:
multiple = HTML.ElementFromURL(url).xpath('//table[contains(@class, "mg-b20")]/tr[3]/td[2]/*[self::a]')
for item in multiple:
    Log("text = %s" % item.text)
Any pointer in how I can make this work?
Thanks!
The XPath you need is pretty close:
//table[contains(@class, "mg-b20")]/tr[3]/td[2]//a
I don't know what library you're using, but I suspect it is the Plex Parsekit API. If so, Parsekit uses lxml.etree as its underlying library, so you can simplify your code even further. Note that xpath('string(...)') returns a single Python string (built from the first matching node), not a list of elements, so there is nothing to loop over:
element = HTML.ElementFromURL(url)
alltext = element.xpath('string(//table[contains(@class, "mg-b20")]/tr[3]/td[2]//a)')
Log("text = %s" % alltext)
This will even take care of corner cases like mixed content, e.g. this:
I am anchor text <span>But I am too and am not in Element.text</span> and I am in Element.tail
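A minimal sketch of that corner case (using lxml directly, since Parsekit wraps it) shows what .text misses and what string() collects:

```python
from lxml import html

# mixed content: the text is split between .text, a child <span>, and .tail
doc = html.fromstring(
    "<p><a>I am anchor text <span>But I am too and am not in Element.text"
    "</span></a> and I am in Element.tail</p>")

a = doc.xpath("//a")[0]
print(repr(a.text))              # only the first chunk: 'I am anchor text '
print(doc.xpath("string(//a)"))  # all text inside <a>, span included
```

The tail text after the closing </a> belongs to the anchor's .tail and is not part of string(//a) either; it would only appear when taking the string value of the enclosing <p>.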
