I want to get an XPath value from a Steam store page, e.g. http://store.steampowered.com/app/234160/. On the right side there are two boxes. The first one contains Title, Genre, Developer, and so on; I just need the Genre here. The number of genres differs from game to game: some have four, some just one. The second block lists the game features (Singleplayer, Multiplayer, Coop, Gamepad, ...).
I need all of those values.
Also, sometimes there is an image in between (PEGI/USK), e.g. http://store.steampowered.com/app/233290.
import requests
from lxml import html
page = requests.get('http://store.steampowered.com/app/234160/')
tree = html.fromstring(page.text)
blockone = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[1]")
blocktwo = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[2]")
print("Detailblock:", blockone)
print("Featureblock:", blocktwo)
This is the code I have so far. When I try it it just prints:
Detailblock: [<Element div at 0x2ce5868>]
Featureblock: [<Element div at 0x2ce58b8>]
How do I make this work?
xpath returns a list of matching elements. You're just printing out that list.
If you want the first element, you need blockone[0]. If you want all elements, you have to loop over them (e.g., with a comprehension).
And meanwhile, what do you want to print for each element? The direct inner text? The HTML for the whole subtree rooted at that element? Something else? Whatever you want, you need to use the appropriate method on the Element type to get it; lxml can't read your mind and figure out what you want, and neither can we.
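To make those choices concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in, since lxml elements follow the same basic API (.text, itertext(), tostring()). The fragment and its details_block class are made up for illustration and are not Steam's real markup.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for the store page (not Steam's real markup).
doc = ET.fromstring(
    "<div id='main_content'>"
    "<div class='details_block'><b>Genre:</b> "
    "<a href='/genre/Action'>Action</a>, <a href='/genre/Indie'>Indie</a></div>"
    "</div>"
)

# Like lxml's xpath(), findall() returns a list of elements, not text.
blocks = doc.findall(".//div[@class='details_block']")
first = blocks[0]

# Option 1: the direct text of one child element.
label = first.find("b").text

# Option 2: all text in the subtree, joined together.
all_text = "".join(first.itertext())

# Option 3: the markup of the whole subtree.
markup = ET.tostring(first, encoding="unicode")

# Option 4: loop over descendants with a comprehension.
genres = [a.text for a in first.findall(".//a")]

print(label)
print(all_text)
print(genres)
```

Which of the four you want depends on the question you are asking of the page; printing the list itself only shows the element objects.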
It sounds like what you really want is just some elements deeper in the tree. You could xpath your way there. (Instead of going through all of the elements one by one and relying on index as you did, I'm just going to write the simplest way to get to what I think you're asking for.)
genres = [a.text for a in blockone[0].xpath('.//a')]
Or, really, why even get blockone at all? Why not just xpath directly to the elements you want?
gtags = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[1]//a")
genres = [a.text for a in gtags]
Also, you could make this a lot simpler—and a lot more robust—if you used the information in the tags instead of finding them by explicitly walking the structure:
gtags = tree.xpath(".//div[@class='glance_tags popular_tags']//a")
Or, since there don't seem to be any other app_tag items anywhere, just:
gtags = tree.xpath(".//a[@class='app_tag']")
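As a rough illustration of why selecting by class is more robust than walking positions, here is a small sketch using the standard library's ElementTree in place of lxml. Both layouts are invented; the second one has an extra block ahead of the tags, much like the rating image mentioned in the question.

```python
import xml.etree.ElementTree as ET

def genre_links(markup):
    """Return link texts selected by class rather than by position."""
    root = ET.fromstring(markup)
    return [a.text for a in root.findall(".//a[@class='app_tag']")]

# Two made-up layouts; the second has an extra block before the tags.
plain = ("<body><div><a class='app_tag'>Action</a>"
         "<a class='app_tag'>Indie</a></div></body>")
shifted = ("<body><div><img src='rating.png'/></div>"
           "<div><a class='app_tag'>Action</a></div></body>")

print(genre_links(plain))
print(genre_links(shifted))

# A positional path now lands on the image block and finds no links.
positional = ET.fromstring(shifted).findall("./div[1]//a")
print(len(positional))
```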
I am working on a bot for a website, and it requires a color and a keyword to find the item. I am using Selenium to look for the item by keyword and then pick a color option (some items on the website are offered in multiple colors). I am having trouble looking for both the keyword and the color at the same time. Once the correctly colored version of the item is found from the user's color and keyword input, I want to click on that option.
Formula I am trying to make in Python:
If the first Xpath(keyword) is found and the 2nd Xpath(color) is found
Then select on the item that contains those 2 properties.
This is the current code I have:
Item = driver.find_element_by_xpath('//*[contains(text(), "MLK")]' and contains ("Black")]')
if (item != None):
    actions.moveToElement(item).click()
I've tried the code above and it doesn't work.
Here are the 2 pieces of code that I want to merge to find the item:
driver.find_element_by_xpath('//a[contains(text(), "MLK")]')
driver.find_element_by_xpath('//a[contains(text(), "Black")]')
The keyword is MLK.
The color is Black.
After merging, I want to find that exact element (keyword MLK, color version Black).
The combined item should be clicked on; I only know how to use .click().
If there is a better way, please let me know.
The website I am using to make a bot for: supremenewyork.com
The item I am using as an example, to pick a certain color (It's the Sweatshirt with MLK on it): http://www.supremenewyork.com/shop/all/sweatshirts
It took me a second to realize that there are 3 A tags for each shirt... one for the image, one for the name of the shirt, and one for the color. Since the last two A tags are the ones you are wanting to text search, you can't look for both strings in the same A tag. I've tested the XPath below and it works.
//article[.//a[contains(.,'MLK')]][.//a[.='Black']]//a
ARTICLE is the container for the shirt. This XPath is looking for an ARTICLE tag that contains an A tag that contains 'MLK' and then another A tag that contains 'Black' then finds the A tags that are descendants of the ARTICLE tag. You can click on any of them, they are all the same link.
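If the nested predicates are hard to read, the same logic can be spelled out in plain Python. The sketch below uses the standard library's ElementTree on an invented listing (the product name and hrefs are made up, not the site's real markup) and is only meant to show what the XPath is checking.

```python
import xml.etree.ElementTree as ET

# Made-up listing in the spirit of the shop page.
listing = ET.fromstring(
    "<div>"
    "<article><a href='/mlk-black'><img/></a>"
    "<a href='/mlk-black'>MLK Hooded Sweatshirt</a>"
    "<a href='/mlk-black'>Black</a></article>"
    "<article><a href='/mlk-red'><img/></a>"
    "<a href='/mlk-red'>MLK Hooded Sweatshirt</a>"
    "<a href='/mlk-red'>Red</a></article>"
    "</div>"
)

def matching_links(root, keyword, color):
    """Mimic //article[.//a[contains(.,kw)]][.//a[.=color]]//a."""
    found = []
    for article in root.iter("article"):
        texts = ["".join(a.itertext()) for a in article.iter("a")]
        has_keyword = any(keyword in t for t in texts)  # contains(., kw)
        has_color = any(t == color for t in texts)      # . = color
        if has_keyword and has_color:
            found.extend(article.iter("a"))
    return found

links = matching_links(listing, "MLK", "Black")
hrefs = [a.get("href") for a in links]
print(hrefs)
```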
BTW, your code has a problem. The first line below will throw an exception if there is no match so the next line will never be reached to test for None.
Item = driver.find_element_by_xpath('//*[contains(text(), "MLK")]' and contains ("Black")]')
if (Item != None):
    actions.moveToElement(item).click()
A better practice is to use .find_elements() (plural) and check for an empty list. If the list is empty, that means there was no element that matched the locator.
Putting the pieces together:
items = driver.find_elements_by_xpath("//article[.//a[contains(.,'MLK')]][.//a[.='Black']]//a")
if items:
    items[0].click()
I'm assuming you will be calling this code repeatedly so I would suggest that you put this in a function and pass the two strings to be searched for. I'll let you take it from here...
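One possible shape for such a function is sketched below. The names item_xpath and click_item are my own, and the sketch assumes neither search string contains a single quote. It keeps the find_elements_by_xpath call used above; newer Selenium releases use driver.find_elements(By.XPATH, ...) instead.

```python
def item_xpath(keyword, color):
    """Build the XPath from the answer for a given keyword and color.

    Assumes neither string contains a single quote.
    """
    return ("//article[.//a[contains(.,'%s')]][.//a[.='%s']]//a"
            % (keyword, color))


def click_item(driver, keyword, color):
    """Click the first matching link; return True if one was found."""
    # Same call as in the answer; adapt it for newer Selenium versions.
    items = driver.find_elements_by_xpath(item_xpath(keyword, color))
    if not items:
        return False
    items[0].click()
    return True

# Hypothetical usage with an already created driver:
# click_item(driver, "MLK", "Black")
```

Returning a boolean lets the caller decide what to do when the item is not listed, instead of relying on an exception.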
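Try the union operator "|" to combine two XPath expressions.
Example:
//p[@id='para1'] | //p[@id='para2']
Applied to your case, the operator goes inside a single XPath string:
'//a[contains(text(), "MLK")] | //a[contains(text(), "Black")]'
Note that a union matches elements that satisfy either expression, not only those that satisfy both.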
You can use a full XPath to select the item you want based on two conditions, you just need to start from a parent node and then apply the conditions on the child nodes:
//div[contains(./h1/a/text(), "MLK") and contains(./p/a/text(), "Black")]/a/@href
You first need to select the element itself, after that you need to get the attribute @href from the element, something like this:
Item = driver.find_element_by_xpath('//div[contains(./h1/a/text(), "MLK") and contains(./p/a/text(), "Black")]/a')
href = Item.get_attribute("href")
I have been struggling with this for a while now.
I have tried various ways of finding the XPath for the following highlighted HTML.
I am trying to grab the dollar value listed under the highlighted Strong tag.
Here is what my last attempt looks like below:
try:
    price = browser.find_element_by_xpath(".//table[@role='presentation']")
    price.find_element_by_xpath(".//tbody")
    price.find_element_by_xpath(".//tr")
    price.find_element_by_xpath(".//td[@align='right']")
    price.find_element_by_xpath(".//strong")
    print(price.get_attribute("text"))
except:
    print("Unable to find element text")
I attempted to access the table and all nested elements but I am still unable to access the highlighted portion. Using .text and get_attribute('text') also does not work.
Is there another way of accessing the nested element?
Or maybe I am not using XPath as it properly should be.
I have also tried the below:
price = browser.find_element_by_xpath("/html/body/div[4]")
UPDATE:
Here is the Full Code of the Site.
The Site I am using here is www.concursolutions.com
I am attempting to automate booking a flight using selenium.
When you reach the end of the process of booking and receive the price I am unable to print out the price based on the HTML.
It may have something to do with the HTML being generated by JavaScript that is executed as you proceed.
Looking at the structure of the html, you could use this xpath expression:
//div[@id="gdsfarequote"]/center/table/tbody/tr[14]/td[2]/strong
Making it work
There are a few things keeping your code from working.
price.find_element_by_xpath(...) returns a new element.
Each time, you're not saving it to use with your next query. Thus, when you finally ask it for its text, you're still asking the <table> element—not the <strong> element.
Instead, you'll need to save each found element in order to use it as the scope for the next query:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tbody = table.find_element_by_xpath(".//tbody")
tr = tbody.find_element_by_xpath(".//tr")
td = tr.find_element_by_xpath(".//td[@align='right']")
strong = td.find_element_by_xpath(".//strong")
find_element_by_* returns the first matching element.
This means your call to tbody.find_element_by_xpath(".//tr") will return the first <tr> element in the <tbody>.
Instead, it looks like you want the third:
tr = tbody.find_element_by_xpath(".//tr[3]")
Note: XPath is 1-indexed.
get_attribute(...) returns HTML element attributes.
Therefore, get_attribute("text") will return the value of the text attribute on the element.
To return the text content of the element, use element.text:
strong.text
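The three points above can be tried without a browser. The sketch below uses the standard library's ElementTree as a stand-in for Selenium, on an invented fare table (the rows and the amount are made up). It saves each found element, picks the third row, and reads the text rather than an attribute.

```python
import xml.etree.ElementTree as ET

# Made-up fare table; the real page's markup and values are not shown here.
page = ET.fromstring(
    "<div id='gdsfarequote'><table role='presentation'><tbody>"
    "<tr><td>Airline</td><td align='right'>Example Air</td></tr>"
    "<tr><td>Class</td><td align='right'>Economy</td></tr>"
    "<tr><td>Base Fare</td>"
    "<td align='right'><strong>$100.00</strong></td></tr>"
    "</tbody></table></div>"
)

# Each lookup is saved and used as the scope for the next one.
table = page.find(".//table[@role='presentation']")
tbody = table.find(".//tbody")
tr = tbody.find(".//tr[3]")  # positions are 1-indexed
td = tr.find(".//td[@align='right']")
strong = td.find(".//strong")

# The text content, not an attribute named "text".
print(strong.text)
print(strong.get("text"))
```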
Cleaning it up
But even with the code working, there’s more that can be done to improve it.
You often don't need to specify every intermediate element.
Unless there is some ambiguity that needs to be resolved, you can ignore the <tbody> and <td> elements entirely:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tr = table.find_element_by_xpath(".//tr[3]")
strong = tr.find_element_by_xpath(".//strong")
XPath can be overkill.
If you're just looking for an element by its tag name, you can avoid XPath entirely:
strong = tr.find_element_by_tag_name("strong")
The fare row may change.
Instead of relying on a specific position, you can scope using a text search:
tr = table.find_element_by_xpath(".//tr[contains(., 'Base Fare')]")
Other <table> elements may be added to the page.
If the table had some header text, you could use the same text search approach as with the <tr>.
In this case, it would probably be more meaningful to scope to the #gdsfarequote <div> rather than something as ambiguous as a <table>:
farequote = browser.find_element_by_id("gdsfarequote")
tr = farequote.find_element_by_xpath(".//tr[contains(., 'Base Fare')]")
But even better, capybara-py provides a nice wrapper on top of Selenium, helping to make this even simpler and clearer:
fare_quote = page.find("#gdsfarequote")
base_fare_row = fare_quote.find("tr", text="Base Fare")
base_fare = base_fare_row.find("strong").text
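The idea of scoping a row by its label rather than its position can also be sketched with only the standard library. ElementTree has no contains(), so this version matches the label cell exactly, which is a stricter test than the text search above. The table and the helper name amount_for are made up.

```python
import xml.etree.ElementTree as ET

# Made-up table; row order is deliberately different from the question.
page = ET.fromstring(
    "<div id='gdsfarequote'><table>"
    "<tr><td>Taxes</td><td><strong>$20.00</strong></td></tr>"
    "<tr><td>Base Fare</td><td><strong>$100.00</strong></td></tr>"
    "</table></div>"
)

def amount_for(root, label):
    """Return the <strong> text of the row whose label cell equals label."""
    # [td='...'] keeps rows that have a td child with exactly that text.
    row = root.find(".//tr[td='%s']" % label)
    if row is None:
        return None
    return row.find(".//strong").text

print(amount_for(page, "Base Fare"))
print(amount_for(page, "Total"))
```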
First of all, I'm creating an XML document with Python and BeautifulSoup.
What I'm currently trying to create is very similar to this example.
<options>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
</options>
Notice that each opt element should have only one attribute, called name.
Since there can be many more options, and they can differ as well, I decided to write a small Python function to help me create such a result.
array = ['FirstName','SecondName','ThirdName']
# This list tells the function how many options the result will contain and what the options will be called.
def create_options(array):
    soup.append(soup.new_tag('options'))
    if len(array) > 0: # Small error handling, so you can see whether the given array is empty for any reason. Optional.
        for i in range(len(array)):
            soup.options.append(soup.new_tag('opt'))
            # With BeautifulSoup methods, we create opt tags inside the options tag, exactly as many as there are items in the array.
        counter = 0
        # The Python range() function could be used here, but for testing purposes the current approach is sufficient.
        for tag in soup.options.find_all():
            soup.options.find('opt')['name'] = str(array[counter])
            # Notice that in this part the name is assigned only to the first opt element. We'll discuss this next.
            counter += 1
        print(len(array), ' options were created.')
    else:
        print('No options were created.')
As you can see, in the function the name assignment is handled by a for loop which, unfortunately, assigns all the different names to the first option in the options element.
BeautifulSoup has .next_sibling and .previous_sibling, which can help me with this task.
As their names suggest, they give access to the next or previous sibling of an element. So, with this example:
soup.options.find('opt').next_sibling['name'] = str(array[counter])
we can access the second child of the options element. So, if we add .next_sibling to each soup.options.find('opt'), we can move from the first element to the next.
The problem is that by finding the option element in options with:
soup.options.find('opt')
we access the first option each time. But my function needs to move on to the next option with each item in the list. That means the more items there are in the list, the more .next_sibling calls it must add to the first option.
As a result, with the logic I constructed, accessing the relevant option for the 4th or a later item in the list to assign its name should look like this:
soup.options.find('opt').next_sibling.next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
And now we are ready for my questions:
1st. As I didn't find any other way to do this with Python BeautifulSoup methods, I'm not sure whether my approach is the only way. Is there any other method?
2nd. How could I achieve the result with this approach if, as my experiments show, I can't put a variable inside the method chain (so that I could multiply the methods)?
#Like this
thirdoption = .next_sibling.next_sibling.next_sibling
#As well, it's not quite possible, but it's just example.
soup.options.find('opt').next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
3rd. Maybe I read the BeautifulSoup documentation badly and just didn't find a method that could help me with this task?
I managed to achieve the result while ignoring BeautifulSoup methods.
Python has element tree methods, which were sufficient to work with.
So, let me show the example code and explain what it does. The comments provide a more precise explanation.
"""
Before this code comes the generation of the soup XML document. Except for the part I mentioned in the topic, we just create empty options tags in the document, thus creating an almost finished document.
Right after that, with this little script, we will use the basic element tree methods that Python provides.
"""
import xml.etree.ElementTree as ET
ET_tree = ET.parse("exported_file.xml")
# Here we import exactly the same file we opened with soup. Exporting can be done to a different file, if you wish.
ET_root = ET_tree.getroot()
# Assumes <options> is a child of the root element.
for position, opt in enumerate(ET_root.find('options')):
    # Position is important, as it replaces the 'counter' variable I was using in the for loop in the first example. Position is used to pick the exact item from array, which works like a template for our option names.
    opt.set('name', str(array[position]))
    opt.text = 'text'
    # In the same way, with position, we can get data from another relevant array, provided that it is ordered in the same way.
ET_tree.write('exported_file.xml', encoding="UTF-8", xml_declaration=True)
# This part was something I researched quite a lot before. This code saves the XML document with UTF-8 encoding, which is very important.
This approach is pretty inefficient, since I could use ET for everything to achieve the same result.
Though BeautifulSoup produces the document as nicely formatted output, which is very neat, whereas element tree creates files with a software-friendly look only.
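For comparison, here is a minimal sketch of what using ET for everything could look like. The function name build_options and the default content string are my own choices. ET.indent, which produces readable output, only exists from Python 3.9 on, so the call is guarded.

```python
import xml.etree.ElementTree as ET

def build_options(names, content="ContentString"):
    """Create <options> with one <opt name=...> child per entry in names."""
    options = ET.Element("options")
    for name in names:
        # Each opt gets its own name attribute as it is created,
        # so no counter or sibling navigation is needed.
        opt = ET.SubElement(options, "opt", name=str(name))
        opt.text = content
    return options

options = build_options(["FirstName", "SecondName", "ThirdName"])

# Pretty-print in place where the helper is available (Python 3.9+).
if hasattr(ET, "indent"):
    ET.indent(options)

print(ET.tostring(options, encoding="unicode"))
```

Setting the attribute at creation time sidesteps the original problem of always landing on the first opt element.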
I have trouble getting nested Selectors to work as described in the documentation of Scrapy (http://doc.scrapy.org/en/latest/topics/selectors.html)
Here's what I got:
sel = Selector(response)
level3fields = sel.xpath('//ul/something/*')
for element in level3fields:
    site = element.xpath('/span').extract()
When I print out "element" in the loop I get < Selector xpath='stuff seen above' data="u'< span class="something">text< /span>>
Now I have two problems:
Firstly, within the element there should also be an "a" node (as in <a href), but it doesn't show up in the printout; it only shows up if I extract it directly. Is that just a printing error, or doesn't the element Selector hold the a node (without extraction)?
Secondly, when I print out "site" above, it should show a list with the span nodes. However, it doesn't; it only prints an empty list.
I tried a combination of changes (multiple to no slashes and stars (*) in different places), but none of it brought me any closer.
Essentially, I just want to get a nested Selector which gives me the span-node in the second step (the loop).
Anyone got any tips?
Regarding your first question, it's just a print "error". __repr__ and __str__ methods on Selectors only print the first 40 characters of the data (element represented as HTML/XML or text content). See https://github.com/scrapy/scrapy/blob/master/scrapy/selector/unified.py#L143
In your loop on level3fields you should use relative XPath expressions. Using /span will look for span elements directly under the root node, that's not what you want I guess.
Try this:
sel = Selector(response)
level3fields = sel.xpath('//ul/something')
for element in level3fields:
    site = element.xpath('.//span').extract()
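To see the effect of the leading dot without running a spider, here is a small sketch using the standard library's ElementTree as a stand-in for Scrapy selectors. The markup is invented, with li playing the role of the "something" elements.

```python
import xml.etree.ElementTree as ET

# Made-up page: two list items, plus one span outside the list.
root = ET.fromstring(
    "<html><body><ul>"
    "<li><span class='something'>first</span><a href='/1'>one</a></li>"
    "<li><span class='something'>second</span><a href='/2'>two</a></li>"
    "</ul><span>outside the list</span></body></html>"
)

per_item = []
for element in root.findall(".//ul/li"):
    # ".//span" starts at the current element, so each pass
    # only sees the span nested inside that element.
    spans = [s.text for s in element.findall(".//span")]
    per_item.append(spans)

print(per_item)
```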
I'm pretty new to XPath and couldn't figure it out looking at other solutions.
What I'm trying to do is select all the a elements inside a given td (td[2] in example) and running a for statement to output the text contained within the a elements.
Source code:
multiple = HTML.ElementFromURL(url).xpath('//table[contains(@class, "mg-b20")]/tr[3]/td[2]/*[self::a]')
for item in multiple:
    Log("text = %s" % item.text)
Any pointer in how I can make this work?
Thanks!
The XPath you need is pretty close:
//table[contains(@class, "mg-b20")]/tr[3]/td[2]//a
I don't know what library you're using, but I suspect it is the Plex Parsekit API. If so, parsekit uses lxml.etree as its underlying library, so you can simplify your code even further:
element = HTML.ElementFromURL(url)
for item in element.xpath('//table[contains(@class, "mg-b20")]/tr[3]/td[2]//a'):
    # string() returns the full text of one element, including nested tags
    Log("text = %s" % item.xpath('string()'))
This will even take care of corner cases like mixed content, e.g. this:
I am anchor text <span>But I am too and am not in Element.text</span> and I am in Element.tail
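The difference is easy to check on that very snippet. The sketch below wraps it in an a element and uses the standard library's ElementTree, where joining itertext() plays the role of XPath's string() for a single element.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring(
    "<a>I am anchor text "
    "<span>But I am too and am not in Element.text</span>"
    " and I am in Element.tail</a>"
)

# .text stops at the first child element.
print(repr(a.text))

# The trailing text hangs off the child as its .tail.
print(repr(a.find("span").tail))

# Joining every text node gives the full visible text.
full_text = "".join(a.itertext())
print(full_text)
```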