I would like to search through child elements for specific attributes using BeautifulSoup. From what I can see, using the method below each child is a string (child['value'] gives me "string indices must be integers"), which does not allow selection based on attributes or returning those attributes, which is incidentally exactly what I need to do.
def get_value(container):
    html_file = open(html_path)
    html = html_file.read()
    soup = BeautifulSoup(html)
    values = {}
    container = soup.find(attrs={"name": container})
    if (container.contents != []):
        for child in container.children:
            value = unicode(child['value'])  # I would like to be able to search through these children based on their attributes, and return one or more of their values
    return value
I could probably get around this with a further child_soup = BeautifulSoup(child) and then another find, but that seems really horrible. Has anyone got a better solution?
container.children is a generator that yields Tag objects for the element children (and NavigableString objects for any text in between, which is what raises the "string indices must be integers" error), so you can operate on the Tags normally.
You also might want to try element.find_all(..., recursive=False) in order to look for an element's direct children with some traits.
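For example, a minimal sketch of searching direct children by attribute (the markup and names here are made up for illustration):

from bs4 import BeautifulSoup

html = '''
<form name="settings">
  <input name="a" value="1"/>
  <input name="b" value="2"/>
</form>
'''

soup = BeautifulSoup(html, "html.parser")
container = soup.find(attrs={"name": "settings"})

# recursive=False restricts find_all() to direct children;
# attrs={"value": True} matches any tag that has a value attribute at all
children = container.find_all(attrs={"value": True}, recursive=False)
values = [child["value"] for child in children]
print(values)  # ['1', '2']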
I am trying to select all table elements from a div parent node by using a customized function.
This is what I've got so far:
import requests
from bs4 import BeautifulSoup

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510'

def getTables(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    div_component = soup.find('div', attrs={'class': 'td-post-content'})
    tables = div_component.find_all('table', attrs={'class': 'listas'})
    return tables
However, when called as getTables(url), the output is an empty list [].
I expect this function to return all HTML table elements inside the div node, given its specific attributes.
How could I adjust this function?
Is there any other library I could use to accomplish this task?
Taking what the other commenters have said, and expanding on it.
Your div_component holds only 1 element and doesn't contain tables, but using find_all() yields 8 elements:
len(soup.find_all('div', attrs={'class':'td-post-content'}))
So you can't just use find(); you'll need to iterate through the list to find a div that contains tables.
Another way is to go straight after the tables you want; you can just use
tables = soup.find_all('table', attrs={'class':'listas'})
where tables is a list with 6 elements. If you know which table you want, you can iterate through the list until you find it.
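Putting that together, a minimal sketch of the iteration approach (assuming the page structure described above):

import requests
from bs4 import BeautifulSoup

def getTables(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # find_all() returns every matching div, not just the first
    for div in soup.find_all('div', attrs={'class': 'td-post-content'}):
        tables = div.find_all('table', attrs={'class': 'listas'})
        if tables:  # return the tables from the first div that has any
            return tables
    return []

tables = getTables('https://www.salario.com.br/profissao/abacaxicultor-cbo-612510')
print(len(tables))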
The first problem is that find() returns only the first such match, and the first td-post-content <div> does not contain any tables; I think you want find_all(). Second, you can use CSS selectors with BeautifulSoup via select(), so you can search with soup.select('div.td-post-content') without using the attributes parameter.
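For example, with the soup object from the question, the CSS-selector route is a one-liner (select() always returns a list):

# every <table class="listas"> nested under any <div class="td-post-content">
tables = soup.select('div.td-post-content table.listas')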
For parsing information from this URL: http://py4e-data.dr-chuck.net/comments_42.xml
import ssl
import urllib.request
import xml.etree.ElementTree as ET

# ctx as defined in the py4e examples: an SSL context that skips certificate verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = "http://py4e-data.dr-chuck.net/comments_42.xml"
fhandle = urllib.request.urlopen(url, context=ctx)
string_data = fhandle.read()
xml = ET.fromstring(string_data)
Why does
lst = xml.findall("./commentinfo/comments/comment")
not put anything into lst, while
lst = xml.findall("comments/comment")
creates a list of elements?
Thanks!
Element.findall uses a subset of the XPath specification (see XPath support) based on the element you are referencing. When you loaded the document, you referenced the root element <commentinfo>. An XPath of comments/comment selects all of that element's child elements named "comments", then selects all of their children named "comment".
./comments/comment is identical to comments/comment. "." is the current node (<commentinfo>) and the following "/comments" selects its child nodes as above.
./commentinfo/comments/comment is the same as commentinfo/comments/comment. It's easy to see the issue: since you are already on the <commentinfo> node, there aren't any child elements also named "commentinfo". Some XPath processors would let you reference from the root of the tree, as in //commentinfo/comments/comment, but ElementTree doesn't do that.
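ElementTree does, however, support searching all descendants with the './/' prefix, which sidesteps the intermediate path entirely:

# finds every <comment> element anywhere under the root
lst = xml.findall('.//comment')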
'.' in the XPath already means the top-level element, here <commentinfo>. So your path is looking for a <commentinfo> child of that, which doesn't exist.
You can see this by cross-referencing the example from the documentation with the corresponding XML. Notice how none of the example XPaths mention the root element, <data>.
You want just './comments/comment'.
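A minimal sketch of the working version, assuming the usual layout of this py4e feed (the <name> and <count> children of each <comment> are an assumption from the exercises):

# xml is the <commentinfo> root element parsed above
lst = xml.findall('./comments/comment')
print(len(lst))

# hypothetical follow-up, assuming each <comment> has a <count> child
counts = [int(c.find('count').text) for c in lst]
print(sum(counts))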
I have a block of xml content that varies depending on a code I submit through an API. The fetching process works fine. The xml tags I want to extract vary depending on the code. The function that determines the list of tags also works correctly.
I have been successful in extracting information from a block of XML content; however, in a certain report that I fetch there are multiple items in the content block with the same tags I wish to extract. I have split the content into several items by the tag <item> and removed the first index, as it is not useful to me.
Now I want to search each item by my list of tags (which previously worked fine, until I introduced multiple items and looped over them).
I have checked that each item i can be 'seen' in the for n in list loop by printing i, and it appears correctly. But when it comes to searching the string, nothing seems to be recognised, since printing each var just shows 'None' (on that note, I have confirmed that each i is a string). The terms I am searching for are 100% in the content; this process worked until I introduced the for i in items loop.
def parser(content, report_code):
    list = list_type(report_code)
    items = content.split('<item>')
    items.pop(0)
    for i in items:
        arr = []
        for n in list:
            print(i)
            var = BeautifulSoup(i, "xml").find(n)
            var = str(var).split('>')[1].split('<')[0].strip()
            print(var)
            arr.append(var)
    return arr
In case anyone comes across this, I have found a solution to my problem.
The above code worked for me when I wasn't splitting content by <item>. It turned out content had type bytes, and the BeautifulSoup parser worked with that. However, when I split content by <item> and looped over each i as above, the input to BeautifulSoup (each i in items) was now of type str. I was surprised that this didn't work in BeautifulSoup, to be honest.
After further research I came across bytes() and bytearray(), two built-in functions that convert strings to bytes, among other things. But I was also having trouble with these. Maybe someone with more knowledge can explain how to use them properly.
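For reference, a quick sketch of those conversions in Python 3 (str.encode() is the usual idiom):

s = '<item><price>10</price></item>'

b1 = s.encode('utf-8')       # str -> bytes
b2 = bytes(s, 'utf-8')       # equivalent: bytes() needs an encoding for str input
ba = bytearray(s, 'utf-8')   # mutable variant
back = b1.decode('utf-8')    # bytes -> str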
In the end I figured out another structure for the above code. I instead looped over each var in the output of findAll in BeautifulSoup. I also included an exception handler for cases where nothing was found, which had also been throwing an error.
def parser(content, report_code):
    list = list_type(report_code)
    data = []
    for n in list:
        arr = []
        try:
            for var in BeautifulSoup(content, "lxml-xml").findAll(n):
                var = str(var).split('>')[1].split('<')[0].strip()
                arr.append(var)
        except IndexError:
            pass
        data.append(arr)
    return data
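As an aside, a sketch of a slightly more idiomatic variant: BeautifulSoup tags expose their text content directly, which avoids the fragile str(var).split('>') parsing (findAll and find_all are the same method):

def parser(content, report_code):
    soup = BeautifulSoup(content, "lxml-xml")  # parse once, reuse for every tag
    data = []
    for n in list_type(report_code):
        # tag.get_text() returns the element's text content directly
        data.append([tag.get_text().strip() for tag in soup.find_all(n)])
    return data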
I have been struggling with this for a while now.
I have tried various ways of finding the XPath for the following highlighted HTML.
I am trying to grab the dollar value listed under the highlighted Strong tag.
Here is what my last attempt looks like below:
try:
    price = browser.find_element_by_xpath(".//table[@role='presentation']")
    price.find_element_by_xpath(".//tbody")
    price.find_element_by_xpath(".//tr")
    price.find_element_by_xpath(".//td[@align='right']")
    price.find_element_by_xpath(".//strong")
    print(price.get_attribute("text"))
except:
    print("Unable to find element text")
I attempted to access the table and all nested elements but I am still unable to access the highlighted portion. Using .text and get_attribute('text') also does not work.
Is there another way of accessing the nested element?
Or maybe I am not using XPath as it properly should be.
I have also tried the below:
price = browser.find_element_by_xpath("/html/body/div[4]")
UPDATE:
Here is the Full Code of the Site.
The Site I am using here is www.concursolutions.com
I am attempting to automate booking a flight using selenium.
When you reach the end of the process of booking and receive the price I am unable to print out the price based on the HTML.
It may have something to do with the HTML being generated by JavaScript that executes as you proceed.
Looking at the structure of the HTML, you could use this XPath expression:
//div[@id="gdsfarequote"]/center/table/tbody/tr[14]/td[2]/strong
Making it work
There are a few things keeping your code from working.
price.find_element_by_xpath(...) returns a new element.
Each time, you're not saving it to use with your next query. Thus, when you finally ask it for its text, you're still asking the <table> element—not the <strong> element.
Instead, you'll need to save each found element in order to use it as the scope for the next query:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tbody = table.find_element_by_xpath(".//tbody")
tr = tbody.find_element_by_xpath(".//tr")
td = tr.find_element_by_xpath(".//td[@align='right']")
strong = td.find_element_by_xpath(".//strong")
find_element_by_* returns the first matching element.
This means your call to tbody.find_element_by_xpath(".//tr") will return the first <tr> element in the <tbody>.
Instead, it looks like you want the third:
tr = tbody.find_element_by_xpath(".//tr[3]")
Note: XPath is 1-indexed.
get_attribute(...) returns HTML element attributes.
Therefore, get_attribute("text") will return the value of the text attribute on the element.
To return the text content of the element, use element.text:
strong.text
Cleaning it up
But even with the code working, there’s more that can be done to improve it.
You often don't need to specify every intermediate element.
Unless there is some ambiguity that needs to be resolved, you can ignore the <tbody> and <td> elements entirely:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tr = table.find_element_by_xpath(".//tr[3]")
strong = tr.find_element_by_xpath(".//strong")
XPath can be overkill.
If you're just looking for an element by its tag name, you can avoid XPath entirely:
strong = tr.find_element_by_tag_name("strong")
The fare row may change.
Instead of relying on a specific position, you can scope using a text search:
tr = table.find_element_by_xpath(".//tr[contains(., 'Base Fare')]")
Other <table> elements may be added to the page.
If the table had some header text, you could use the same text search approach as with the <tr>.
In this case, it would probably be more meaningful to scope to the #gdsfarequote <div> rather than something as ambiguous as a <table>:
farequote = browser.find_element_by_id("gdsfarequote")
tr = farequote.find_element_by_xpath(".//tr[contains(., 'Base Fare')]")
But even better, capybara-py provides a nice wrapper on top of Selenium, helping to make this even simpler and clearer:
fare_quote = page.find("#gdsfarequote")
base_fare_row = fare_quote.find("tr", text="Base Fare")
base_fare = base_fare_row.find("strong").text
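Putting the plain-Selenium advice together, a minimal sketch (assuming the page from the question is already loaded in browser):

farequote = browser.find_element_by_id("gdsfarequote")
# contains(., ...) matches against the row's full text content
base_fare_row = farequote.find_element_by_xpath(".//tr[contains(., 'Base Fare')]")
print(base_fare_row.find_element_by_tag_name("strong").text)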
I want to get an XPath value from a Steam store page, e.g. http://store.steampowered.com/app/234160/. On the right side there are two boxes. The first one contains Title, Genre, Developer... I just need the Genre here. The count differs for every game: some have 4 genres, some just one. And then there is another block where the game features are listed (like Singleplayer, Multiplayer, Coop, Gamepad, ...).
I need all those values.
Also, sometimes there is an image in between (PEGI/USK):
http://store.steampowered.com/app/233290.
import requests
from lxml import html
page = requests.get('http://store.steampowered.com/app/234160/')
tree = html.fromstring(page.text)
blockone = tree.xpath(".//*[#id='main_content']/div[4]/div[3]/div[2]/div/div[1]")
blocktwo = tree.xpath(".//*[#id='main_content']/div[4]/div[3]/div[2]/div/div[2]")
print "Detailblock:" , blockone
print "Featureblock:" , blocktwo
This is the code I have so far. When I try it, it just prints:
Detailblock: [<Element div at 0x2ce5868>]
Featureblock: [<Element div at 0x2ce58b8>]
How do I make this work?
xpath returns a list of matching elements. You're just printing out that list.
If you want the first element, you need blockone[0]. If you want all elements, you have to loop over them (e.g., with a comprehension).
And meanwhile, what do you want to print for each element? The direct inner text? The HTML for the whole subtree rooted at that element? Something else? Whatever you want, you need to use the appropriate method on the Element type to get it; lxml can't read your mind and figure out what you want, and neither can we.
It sounds like what you really want is just some elements deeper in the tree. You could xpath your way there. (Instead of going through all of the elements one by one and relying on index as you did, I'm just going to write the simplest way to get to what I think you're asking for.)
genres = [a.text for a in blockone[0].xpath('.//a')]
Or, really, why even get that blockone in the first place? Why not just xpath directly to the elements you wanted in the first place?
gtags = tree.xpath(".//*[#id='main_content']/div[4]/div[3]/div[2]/div/div[1]//a")
genres = [a.text for a in gtags]
Also, you could make this a lot simpler—and a lot more robust—if you used the information in the tags instead of finding them by explicitly walking the structure:
gtags = tree.xpath(".//div[#class='glance_tags popular_tags']//a")
Or, since there don't seem to be any other app_tag items anywhere, just:
gtags = tree.xpath(".//a[#class='app_tag']")
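With either of those, note that a.text only returns the text directly inside the element, so text_content() is sometimes the safer way to pull the visible strings out (a small sketch using the gtags from above):

# text_content() gathers all nested text; strip() drops surrounding whitespace
genres = [a.text_content().strip() for a in gtags]
print(genres)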