I am trying to scrape reviews from this website: https://www.goodreads.com/book/show/4865.How_to_Win_Friends_and_Influence_People?from_search=true&from_srp=true&qid=zsfs3jEPvd&rank=1
The reviews are nested inside several levels of elements, and I am having trouble reaching them. I am fairly new to Selenium. So far, I have tried:
'''
a = driver.find_element("class name", "BookPage__reviewsSection")
for i in a.find_element("xpath", "//*[@id='ReviewsSection']").find_elements("class name", 'lazyload-wrapper '):
    print(i.find_element("xpath", "//div[@class='ReviewsList']").text)
'''
The print statement outputs:
Friends & Following
Create a free account to discover what your friends think of this book!
Friends & Following
Create a free account to discover what your friends think of this book!
According to the output it just finds 'BookPage__reviewsSection' class and then 'ReviewsList' class which explains the output. Why doesn't it find 'lazyload-wrapper' class and then 'ReviewsList' class inside it?
I appreciate the help.
@nikhil bhati, you can try the following code. I took the XPath for all the review comments directly. Let me know if this helps or if you wanted some other output. Sorry, I have not tried your way of finding the element.
driver.get("https://www.goodreads.com/book/show/4865.How_to_Win_Friends_and_Influence_People?from_search=true&from_srp=true&qid=zsfs3jEPvd&rank=1")
allReviewTexts = driver.find_elements("xpath", "//div[@id='other_reviews']//div[@id='bookReviews']//span[contains(@id, 'reviewTextContainer')]//span[contains(@id, 'freeTextContainer')]")
print(len(allReviewTexts))
for i in allReviewTexts:
    print(i.text)
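As for why the loop in the question prints the same outer text every time: an XPath that starts with // is evaluated from the document root even when it is called on an element, so the inner find_element returns the first match on the whole page rather than the one inside the current wrapper. Prefixing the expression with a dot (.//) makes it relative to that element. The following is only a minimal sketch: relative_xpath is a hypothetical helper name, and whether a ReviewsList div really sits inside each lazyload-wrapper on the live page is an assumption.

```python
def relative_xpath(xpath):
    """Turn a root-anchored '//...' XPath into one relative to the context element."""
    if xpath.startswith("//"):
        return "." + xpath
    return xpath


# Usage with Selenium (needs a live driver, so shown as comments only):
# section = driver.find_element("class name", "BookPage__reviewsSection")
# for wrapper in section.find_elements("class name", "lazyload-wrapper"):
#     # find_elements returns an empty list instead of raising when nothing matches
#     for reviews in wrapper.find_elements("xpath", relative_xpath("//div[@class='ReviewsList']")):
#         print(reviews.text)
```

Writing ".//div[@class='ReviewsList']" by hand has the same effect; the helper only makes the difference explicit.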
So I'm trying to build a tool to transfer tickets that I sell. A sale comes into my POS, and I do an API call for the section, row, and seat numbers ordered (as well as other information, obviously). Using the section, row, and seat number, I want to plug those values into a contains(text()) statement in order to find and select the right tickets on the host site.
Here is a sample of how the tickets are laid out:
And here is a screenshot (sorry if this is inconvenient) of the DOM related to one of the rows above:
Given this, how should I structure my contains(text) statement so that it is able to find and select the correct seats? I am very new/inexperienced with automation. I messed around with it a few months ago with some success and have managed to get a tool that gets me right up to selecting the seats but the "div" path confuses me when it comes to searching for text that is tied to other text.
I tried the following structure:
for i in range(int(lowseat), int(highseat)):
    web.find_element_by_xpath('//*[contains (text(), "'+section+'")]/following-sibling::[contains text(), "'+row+'")]/following-sibling::[contains text(), "'+str(i)+'")]').click()
to no avail. Can someone explain how to structure these statements so that the search matches the section, row, and seat number correctly?
Thanks!
Also, in case it's needed, here is a screenshot with more context of the button. The button is highlighted in sky blue:
You can't use text() for that because the text is spread across nested elements. You probably want to map all of these into dicts and select with a filter.
Update
Here's an idea for a lazy way to do this (untested):
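# Raw string so that \b reaches JavaScript as a regex word boundary
# instead of being turned into a backspace character by Python.
button = driver.execute_script(r'''
    return [...document.querySelectorAll('button')].find(b => {
        return b.innerText.match(/Section 107\b.*Row P.*Seat 10\b/)
    })
''')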
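The same matching idea can also be done on the Python side, which makes it easier to plug in the section, row, and seat values from the API call. This is a rough sketch under assumptions: the label format ("Section 107 ... Row P ... Seat 10") is taken from the regex above rather than from the real page, and seat_pattern and find_seat are hypothetical helper names.

```python
import re


def seat_pattern(section, row, seat):
    """Build a regex matching 'Section <s> ... Row <r> ... Seat <n>' in that order."""
    parts = [
        r"Section\s+" + re.escape(str(section)) + r"\b",
        r"Row\s+" + re.escape(str(row)) + r"\b",
        r"Seat\s+" + re.escape(str(seat)) + r"\b",
    ]
    # DOTALL lets '.' span the line breaks that separate nested elements' text
    return re.compile(".*?".join(parts), re.DOTALL)


def find_seat(labels, section, row, seat):
    """Return the index of the first label matching the seat, or None."""
    pattern = seat_pattern(section, row, seat)
    for index, label in enumerate(labels):
        if pattern.search(label):
            return index
    return None
```

With Selenium, the labels could be collected with [b.text for b in driver.find_elements("tag name", "button")], and the returned index then identifies the button to click. The \b after each value keeps seat 1 from matching seat 10.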
I'm new to Python, and to WebDriver in particular, and I'm trying to find a text box. The source code looks like this:
I've tried this :
box = driver.find_element_by_class_name('_3F6QL._2WovP')
but with no success.
I'll be happy to add more information if needed. As I said, I'm new here. I appreciate the help.
The problem you have, I think, is that the class is compound: it consists of two classes, _3F6QL and _2WovP.
Selenium doesn't allow finding elements by a compound class name.
Try this:
box = driver.find_element_by_xpath("//*[contains(@class, '_3F6QL') and contains(@class, '_2WovP')]")
or:
box = driver.find_element_by_xpath("//*[contains(@class, '_3F6QL') and contains(@tabindex, '-1')]")
(Not sure about the latter, though).
Also this should work:
box = driver.find_element_by_xpath("//*[contains(@class, '_1Plpp')]/div")
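One more option, sketched here with hypothetical helper names: build the selector from the individual class names. A CSS selector such as ._3F6QL._2WovP matches elements that carry both classes in any order, and the XPath variant below matches each class as a whole token, which avoids the partial matches that a plain contains(@class, ...) allows. Note also that the find_element_by_* methods were removed in Selenium 4.3, where the call becomes driver.find_element("css selector", ...) or driver.find_element("xpath", ...).

```python
def css_for_classes(*class_names):
    """Return a CSS selector such as '.a.b' that requires every listed class."""
    return "".join("." + name for name in class_names)


def xpath_for_classes(*class_names, tag="*"):
    """Return an XPath that requires each class as a whole space-separated token."""
    tests = [
        "contains(concat(' ', normalize-space(@class), ' '), ' {} ')".format(name)
        for name in class_names
    ]
    return "//{}[{}]".format(tag, " and ".join(tests))


# Usage with Selenium (needs a live driver, so shown as comments only):
# box = driver.find_element("css selector", css_for_classes("_3F6QL", "_2WovP"))
# box = driver.find_element("xpath", xpath_for_classes("_3F6QL", "_2WovP"))
```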
I have built a web crawler for a forum game in which players use specific keywords in [b] bold [/b] tags to issue their commands. The bot's job is to traverse the thread and keep a record of all players' commands. However, I'm running into a problem: if player A quotes a post from player B, the bot reads player B's command in the quote and updates the table for player A.
I have found the specific class name of the quote box, but I cannot figure out how to remove elements with that class from the post body.
I tried converting the post to text using get_attribute('innerHTML') and successfully removed the quote with a regex, but then the code I wrote to extract the bold tags (find_element_by_tag_name) no longer works, because the result is a string rather than an element.
I have two questions for the geniuses that post here:
1. Is there a way I can delete a specific element from the post body? I searched all over Google and could not find a working solution.
2. Otherwise, is there a way I can convert the HTML I get from get_attribute('innerHTML') back to an element?
def ScrapPosts(driver):
    posts = driver.find_elements_by_class_name("postdetails")
    print("Total number of posts on this page:", len(posts))
    for post in posts:
        # print("username:", post.find_element_by_tag_name("Strong").text)
        username = post.find_element_by_tag_name("Strong").text.upper()
        # remove the quote boxes before sending to check command?
        post_txt = post.find_element_by_class_name("content")
        CheckCommand(post_txt, username)
Selenium doesn't have a built in method for deleting elements. However, you can execute some javascript code that can remove the quote box elements. See related question at: https://stackoverflow.com/a/22519967/7880461
This code will delete all elements with the class name quoteBox which I think would work for you if you just change the class name.
driver.execute_script('''
    var element = document.getElementsByClassName("quoteBox"), index;
    for (index = element.length - 1; index >= 0; index--) {
        element[index].parentNode.removeChild(element[index]);
    }
''')
For the second question the answer is the same: there is no built-in way of doing that, but you can use JavaScript. This approach would probably be a lot more complicated than the first one.
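An alternative for the second question is to skip the conversion back to an element and parse the innerHTML string directly in Python. The sketch below uses only the standard library's html.parser and collects bold text that is not nested inside a quote box. It rests on assumptions: the quote box class is called quoteBox (taken from the snippet above), the markup is reasonably well nested, and BoldOutsideQuotes and bold_commands are hypothetical names.

```python
from html.parser import HTMLParser


class BoldOutsideQuotes(HTMLParser):
    """Collect text inside <b>/<strong> tags, skipping anything within a quote box."""

    # Tags that never get a closing tag, so they must not affect nesting depth
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, quote_class="quoteBox"):
        super().__init__()
        self.quote_class = quote_class
        self.quote_depth = 0   # > 0 while inside a quote box
        self.bold_depth = 0    # > 0 while inside bold text outside any quote
        self.commands = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.quote_depth or self.quote_class in classes:
            self.quote_depth += 1
        elif tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.quote_depth:
            self.quote_depth -= 1
        elif tag in ("b", "strong") and self.bold_depth:
            self.bold_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.bold_depth and not self.quote_depth:
            self.commands.append(text)


def bold_commands(inner_html, quote_class="quoteBox"):
    """Return the bold snippets of a post, ignoring quoted posts."""
    parser = BoldOutsideQuotes(quote_class)
    parser.feed(inner_html)
    parser.close()
    return parser.commands
```

In the loop above this could be called as bold_commands(post_txt.get_attribute('innerHTML')), so the page itself is left unmodified.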
I am trying to migrate a forum to phpbb3 with python/xpath. Although I am pretty new to python and xpath, it is going well. However, I need help with an error.
(The source file has been downloaded and processed with tagsoup.)
Firefox/Firebug show xpath: /html/body/table[5]/tbody/tr[position()>1]/td/a[3]/b
(in my script without tbody)
Here is an abbreviated version of my code:
from lxml import etree

forumfile="morethread-alte-korken-fruchtweinkeller-89069-6046822-0.html"
XPOSTS = "/html/body/table[5]/tr[position()>1]"
t = etree.parse(forumfile)
allposts = t.xpath(XPOSTS)
XUSER = "td[1]/a[3]/b"
XREG = "td/span"
XTIME = "td[2]/table/tr/td[1]/span"
XTEXT = "td[2]/p"
XSIG = "td[2]/i"
XAVAT = "td/img[last()]"
XPOSTITEL = "/html/body/table[3]/tr/td/table/tr/td/div/h3"
XSUBF = "/html/body/table[3]/tr/td/table/tr/td/div/strong[position()=1]"
for p in allposts:
    unreg=0
    username = None
    username = p.find(XUSER).text #this is where it goes haywire
When the loop hits user "tompson" / position()=11 at the end of the file, I get
AttributeError: 'NoneType' object has no attribute 'text'
I've tried a lot of try except else finallys, but they weren't helpful.
I am getting much more information later in the script such as date of post, date of user registry, the url and attributes of the avatar, the content of the post...
The script works for hundreds of other files/sites of this forum.
This is no encode/decode problem. And it is not "limited" to the XUSER part. I tried to "hardcode" the username, then the date of registry will fail. If I skip those, the text of the post (code see below) will fail...
#text of getpost
text = etree.tostring(p.find(XTEXT),pretty_print=True)
Now, this whole error would make sense if my xpath were wrong. However, all the other files and the first ten users in this file work. It is only this "one" at position()=11.
Is position() incapable of going >10? I don't think so?
Am I missing something?
Question answered!
I have found the answer...
I must have been very tired when I tried to fix it and came here to ask for help. I did not see something quite obvious...
The way I posted my problem, it was not visible either.
The HTML I downloaded and processed with tagsoup had an additional tag at position 11. It was not visible on the website, and it broke my xpath.
(It is probably poor HTML generated by the forum, combined with tagsoup's attempt to make it parseable.)
Out of more than 20000 files, fewer than 20 are affected; this one just happened to be the first.
Additionally, the information is sometimes in table[4] and at other times in table[5]. I had accounted for this with a function that determines the correct table. Although I tested the function a LOT and thought it worked correctly (hence I did not include it above), it did not.
So I made a better xpath:
'/html/body/table[tr/td[@width="20%"]]/tr[position()>1]'
and, although this is not related, I ran into another problem with unexpected encoding in the html file (not utf-8), which was fixed by adding:
parser = etree.XMLParser(encoding='ISO-8859-15')
t = etree.parse(forumfile, parser)
I am now confident that after adjusting for strange additional and multiple , and tags my code will work on all files...
Still I will be looking into lxml.html, as I mentioned in the comment, I have never used it before, but if it is more robust and may allow for using the files without tagsoup, it might be a better fit and save me extensive try/except statements and loops to fix the few files screwing with my current script...
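For the handful of files where a row does not have the expected structure, the AttributeError itself can be avoided by checking the result of find() before touching .text, since find() returns None when the path matches nothing. A minimal sketch follows; it uses the standard library's xml.etree.ElementTree, whose find() behaves the same way as lxml's for this purpose, and the two-row sample markup and the extract_usernames helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second row lacks the <b> element holding the username
SAMPLE = """
<table>
  <tr><td><a/><a/><a><b>alice</b></a></td></tr>
  <tr><td><a/><a/><a>no bold tag here</a></td></tr>
</table>
"""

XUSER = "td[1]/a[3]/b"


def extract_usernames(rows):
    """Return one username per row, or None where the path matches nothing."""
    names = []
    for row in rows:
        node = row.find(XUSER)  # None when the row's structure differs
        names.append(node.text if node is not None else None)
    return names


rows = ET.fromstring(SAMPLE).findall("tr")
usernames = extract_usernames(rows)
```

Rows that come back as None can then be logged and inspected by hand instead of stopping the whole run.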
Does anybody have experience in getting a wikipedia page using wikitools for python (and django)? I am trying to get the article but I get a few first lines and that's it. I need to fetch the whole article and I can't seem to figure it out. The documentation is not very helpful either. My code is:
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php?title=Some_Title&action=raw&maxlag=-1")
wikipage = page.Page(wikiobj, url, section='content')
wikidata = wikipage.getWikiText(True).decode('utf-8', 'replace')
Any help will be appreciated.
I'm using wikitools in my project, not for getting the text of a page, but I initialize the wiki object in a different way:
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php")
wikipage = page.Page(wikiobj, title="Some_Title")
You don't need to supply any query string after api.php in the Wiki class.
Next, look at the definition of Page class:
__init__(self, site, title=False, check=True, followRedir=True, section=False, sectionnumber=False, pageid=False, namespace=False)
So you need to supply title to the constructor of the Page class (you supplied some unknown url param).