Parse a web page with a "Load More" button with Python

I am trying to extract all the comments on a movie from this page https://www.imdb.com/title/tt0114709/reviews?ref_=tt_ql_3 but some of them are hidden behind a "Load More" button. I have tried to click on this button with Selenium, but it doesn't seem to work. Here is my code and the error message, in case someone has an idea on how to achieve this.
h = httplib2.Http("./docs/.cache")
resp, content = h.request(url, "GET")
soup = bs4.BeautifulSoup(content, "html.parser")
divs = soup.find_all("div")
driver = webdriver.Chrome(executable_path='C:\Program Files\Intel\iCLS Client\chromedriver.exe')
driver.get(url)
html = driver.page_source.encode('utf-8')
while driver.find_elements_by_class_name("load-more-data"):
    driver.find_elements_by_name("Load More").click()
Traceback (most recent call last):
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 567, in <module>
Mat()
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 518, in Mat
dicoCam =testC.extract_data()
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 368, in extract_data
self.extract_comment(movie, url)
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 469, in extract_comment
driver.find_elements_by_name("Load More").click()
AttributeError: 'list' object has no attribute 'click'

As you can see in the error message, a list is returned when doing:
driver.find_elements_by_name("Load More")
That's why I suggest doing this:
driver.find_elements_by_name("Load More")[0].click()
You have to make sure that there is only one element named Load More.
If that is not the case, increase the list index [0] by 1 until you reach the Load More element you want.
Hope that helped.
EDIT: If you still get error messages, like list index out of range, then driver.find_elements_by_name() isn't finding the elements you want. I'm not an expert at automating the web with Python, but you should look for functions like driver.find_elements_by_innerhtml() or driver.find_elements_by_text(). Is there any function like that?
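For reference, Selenium does not actually have a find_elements_by_text() method, but the same idea can be expressed with an XPath that matches the visible label; a minimal sketch reusing the driver from the question (the button tag and the label "Load More" are assumptions to verify against the page):

# No find_elements_by_text() exists, but an XPath can match on visible text.
# The tag name and label below are assumptions about the IMDb markup.
buttons = driver.find_elements_by_xpath("//button[contains(., 'Load More')]")
if buttons:
    buttons[0].click()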

The reason for the error is that you searched with find_elements_by_name; note the plural elements, so it returns a list because you asked it to find multiple elements. If you want to keep clicking the "Load More" button until there is nothing left to load, I suggest:
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        driver.find_element_by_class_name("load-more-data").click()
    except NoSuchElementException:
        # stop once the button can no longer be found
        break
I'm not sure the class names are correct, though, since they are based on your example; I didn't inspect the web page you gave. You can adapt the code to your situation if it doesn't work as is.
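Putting the pieces together, here is a minimal sketch of the click-then-parse flow; the load-more-data class name is taken from the question and the fixed sleep is a crude stand-in for a proper wait, so both are assumptions to check against the live page:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import (ElementNotInteractableException,
                                        NoSuchElementException)

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/title/tt0114709/reviews?ref_=tt_ql_3")

# Keep clicking until the button is gone; the class name comes from the
# question and should be verified in the browser's inspector.
while True:
    try:
        driver.find_element_by_class_name("load-more-data").click()
        time.sleep(2)  # crude wait for the next batch of reviews to load
    except (NoSuchElementException, ElementNotInteractableException):
        break

# Once everything is loaded, hand the rendered HTML to BeautifulSoup.
soup = BeautifulSoup(driver.page_source, "html.parser")
divs = soup.find_all("div")
driver.quit()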

Related

Why can't I interact (fill, click, etc) with this element using Playwright in my Python code?

I'm using Playwright to access and interact with a website and it was going perfectly until I found myself on a page where I can't interact with any button or search bar to apply a filter. I can use .locator('xpath') to find the element, but when I try .click('xpath'), .fill('xpath') or even .locator('xpath').click(), I receive the error below:
Traceback (most recent call last):
File "c:\Users\Usuario\Desktop\Python Files\join\necessidades\join.py", line 24, in <module>
pagina.locator('//*[@id="jrhFrm:barFiltro:filtros:nomeDoCurso_hinput"]').click()
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\sync_api\_generated.py", line 13670, in click
self._sync(
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_sync_base.py", line 104, in _sync
return task.result()
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_locator.py", line 146, in click
return await self._frame.click(self._selector, strict=True, **params)
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_frame.py", line 489, in click
await self._channel.send("click", locals_to_params(locals()))
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_connection.py", line 44, in send
return await self._connection.wrap_api_call(
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_connection.py", line 419, in wrap_api_call
return await cb()
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_connection.py", line 79, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//*[@id=\"jrhFrm:barFiltro:filtros:nomeDoCurso_hinput\"]")
Here's the inspection of the page to ~maybe~ help understand the context. I don't know why the search bar is inside a tag.
Example of my code so far, with the Codegen "suggestion"
from playwright.sync_api import sync_playwright
from time import sleep
with sync_playwright() as p:
    navegador = p.chromium.launch(headless=False)
    pagina = navegador.new_page()
    pagina.goto("page_url")
    pagina.fill('full_xpath from Username input','USERNAME')
    pagina.fill('full_xpath from Password input', 'Password')
    pagina.click('full_xpath from Enter button')
    try:
        pagina.click('full_xpath from a boring pop-up that sometimes shows up')
    except:
        pass
    sleep(10)  # waiting the page to fully load
    pagina.click('full_xpath from the title of a Menu Item called Trainings')
    pagina.click('full_xpath from an Item called Course List that appeared from the Menu List')
    # HERE'S WHERE I'M HAVING PROBLEM
    sleep(5)  # waiting the page to fully load
    pagina.locator('full_xpath from the search bar that I want to fill').fill('text I need to insert to search the Training')
    # THE BELOW CODE WAS GENERATED BY codegen
    pagina.frame_locator("#embedJoinRhJsf").locator("[id=\"jrhFrm\\:barFiltro\\:filtros\\:cursoPesquisa\"]").fill("TEXT")  # raise an exception that I posted above in the comments
Have you tried using a general CSS selector?
I'm not sure why your XPath selector doesn't work, and there may be a specific reason you are using XPaths, but by default I use general CSS selectors rather than XPaths when I can. Right-click the element and Copy > Selector.
From your pic it looks like it will be "#jrhFRM:barFiltro:filtros:cursoPesquisa"
Thus
pagina.locator("#jrhFRM:barFiltro:filtros:cursoPesquisa").click()
Hope this works.
Also, check that you have copied the correct xpath, as the id in your error message...
"jrhFrm:barFiltro:filtros:nomeDoCurso_hinput"
...does not look like it matches the id in your screenshot...
"#jrhFRM:barFiltro:filtros:cursoPesquisa"
Let us know.
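One more detail worth checking with the CSS approach: a bare : in a CSS selector starts a pseudo-class, so an id containing colons has to be escaped, and the codegen line in the question suggests the field sits inside the #embedJoinRhJsf iframe. A sketch under those assumptions, reusing pagina and the id from the codegen output:

# Escape the colons in the JSF-generated id and go through the iframe that
# codegen pointed at; both details are taken from the question's codegen line.
campo = pagina.frame_locator("#embedJoinRhJsf").locator(
    "#jrhFrm\\:barFiltro\\:filtros\\:cursoPesquisa")
campo.click()
campo.fill("text to search")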

Can't print only text using Beautiful soup

I am struggling with one of my first projects in Python 3. When I use the following code:
def scrape_offers():
    r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc", cookies=all_cookies)
    soup = BeautifulSoup(r.text,"html.parser")
    offers = soup.find_all("div",{'class':'offer-wrapper'})
    for offer in offers:
        offer_name = offer.findChildren("a", {'class':'marginright5 link linkWithHash detailsLink'})
        print(offer_name.text.strip())
I get the following error:
Traceback (most recent call last):
File "scrape_products.py", line 45, in <module>
scrape_offers()
File "scrape_products.py", line 40, in scrape_offers
print(offer_name.text.strip())
File "/usr/local/lib/python3.7/site-packages/bs4/element.py", line 2128, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I've read many similar cases on Stack Overflow but I still can't figure it out. If someone has any ideas, please help :)
P.S.: If I run the code without .text it shows the entire <a class=...> ... </a>
findChildren returns a list. Sometimes you get an empty list, sometimes you get a list with one element.
You should add an if statement to check that the length of the returned list is at least 1, and then print the text of the first element.
import requests
from bs4 import BeautifulSoup

def scrape_offers():
    r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc")
    soup = BeautifulSoup(r.text,"html.parser")
    offers = soup.find_all("div",{'class':'offer-wrapper'})
    for offer in offers:
        offer_name = offer.findChildren("a", {'class':'marginright5 link linkWithHash detailsLink'})
        if (len(offer_name) >= 1):
            print(offer_name[0].text.strip())

scrape_offers()
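Alternatively, as the error message itself suggests, you can use find (which returns a single Tag or None) instead of findChildren; a sketch of that variant:

import requests
from bs4 import BeautifulSoup

def scrape_offers():
    r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc")
    soup = BeautifulSoup(r.text, "html.parser")
    for offer in soup.find_all("div", {'class': 'offer-wrapper'}):
        # find() returns a single Tag or None, so there is no list to index
        link = offer.find("a", {'class': 'marginright5 link linkWithHash detailsLink'})
        if link is not None:
            print(link.text.strip())

scrape_offers()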

Trouble getting the trade-price using "Requests-HTML" library

I've written a script in Python to get the price of the last trade from a JavaScript-rendered web page. I can get the content if I choose to go with Selenium. My goal here is not to use any browser simulator like Selenium, because the latest release of Requests-HTML is supposed to be able to parse JavaScript-generated content. However, I haven't been able to make it work. When I run the script, I get the following error. Any help with this will be highly appreciated.
Site address : webpage_link
The script I've tried with:
import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.gdax.com/trade/LTC-EUR')
    js = r.html.render()
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
    print(item)
This is the complete traceback:
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
self._callback(*self._args)
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
self._timeout)
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\experiment.py", line 6, in <module>
item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
AttributeError: 'NoneType' object has no attribute 'find'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\shutil.py", line 387, in _rmtree_unsafe
os.unlink(fullname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\ar\\.pyppeteer\\.dev_profile\\tmp1gng46sw\\CrashpadMetrics-active.pma'
The price I'm after is available at the top of the page and is displayed like this: 177.59 EUR Last trade price. I wish to get 177.59, or whatever the current price is.
You have several errors. The first is a 'navigation' timeout, showing that the page didn’t complete rendering:
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
self._callback(*self._args)
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
self._timeout)
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
This traceback is not raised in the main thread, your code was not aborted because of this. Your page may or may not be complete; you may want to set a longer timeout or introduce a sleep cycle for the browser to have time to process AJAX responses.
Next, response.html.render() returns None. It loads the HTML into a headless Chromium browser, leaves JavaScript rendering to that browser, then copies the rendered page HTML back into the response.html data structure in place, so nothing needs to be returned. That's why js is set to None rather than a new HTML instance, which causes your next traceback.
Use the existing response.html object to search, after rendering:
r.html.render()
item = r.html.find('.MarketInfo_market-num_1lAXs', first=True)
There is most likely no such CSS class, because the last 5 characters are generated on each page render, after JSON data is loaded over AJAX. This makes it hard to use CSS to find the element in question.
Moreover, I found that without a sleep cycle, the browser has no time to fetch AJAX resources and render the information you wanted to load. Give it, say, 10 seconds of sleep to do some work before copying back the HTML. Set a longer timeout (the default is 8 seconds) if you see network timeouts:
r.html.render(timeout=10, sleep=10)
You could set the timeout to 0 too, to remove the timeout and just wait indefinitely until the page has loaded.
Hopefully a future API update also provides features to wait for network activity to cease.
You can use the included parse library to find the matching CSS classes:
# search for CSS suffixes
suffixes = [r[0] for r in r.html.search_all('MarketInfo_market-num_{:w}')]
for suffix in suffixes:
    # for each suffix, find all matching elements with that class
    items = r.html.find('.MarketInfo_market-num_{}'.format(suffix))
    for item in items:
        print(item.text)
Now output is produced:
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
Your last traceback shows that the Chromium user data path could not be cleaned up. The underlying Pyppeteer library configures the headless Chromium browser with a temporary user data path, and in your case the directory contains some still-locked resource. You can ignore the error, although you may want to try and remove any remaining files in the .pyppeteer folder at a later time.
Do you need it to go through Requests-HTML? On the day you posted, the repo was 4 days old and in the 3 days that have passed there have been 50 commits. It's not going to be completely stable for some time.
See here:
https://github.com/kennethreitz/requests-html/graphs/commit-activity
OTOH, there is an API for gdax.
https://docs.gdax.com/#market-data
Now if you're dead set on using Py3, there is a python client listed on the GDAX website. Upfront I'll mention that it's the unofficial client; however, if you use this you'd be able to quickly and easily get responses from the official GDAX api.
https://github.com/danpaquin/gdax-python
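For illustration, here is a minimal sketch of querying the public market-data ticker endpoint directly with requests; the exact URL and response fields are assumptions to verify against the docs linked above:

import requests

# Ticker endpoint for the LTC-EUR product (check the market-data docs above
# for the exact URL and response schema).
resp = requests.get("https://api.gdax.com/products/LTC-EUR/ticker")
resp.raise_for_status()
print(resp.json()["price"])  # last trade price as a string, e.g. '177.59'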
In case you want another way, by scraping with Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException

chrome_path = r"C:\Users\Mike\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.gdax.com/trade/LTC-EUR")
item = driver.find_element_by_xpath('''//span[@class='MarketInfo_market-num_1lAXs']''')
item = item.text
print(item)
driver.close()
result:177.60 EUR

Write data scraped to text file with python script

I am a newbie to data scraping. This is the first program I am writing in Python to scrape data and store it in a text file. I have written the following code to scrape the data.
from bs4 import BeautifulSoup
import urllib2
text_file = open("scrape.txt","w")
url = urllib2.urlopen("http://ga.healthinspections.us/georgia/search.cfm?1=1&f=s&r=name&s=&inspectionType=&sd=04/24/2016&ed=05/24/2016&useDate=NO&county=Appling&")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
type = soup.find('span',attrs={"style":"display:inline-block; font- size:10pt;"}).findAll()
for found in type:
    text_file.write(found)
However, when I run this program from the command prompt, it shows me the following error.
c:\PyProj\Scraping>python sample1.py
Traceback (most recent call last):
File "sample1.py", line 9, in <module>
text_file.write(found)
TypeError: expected a string or other character buffer object
What am I missing here, or is there anything I haven't added? Thanks.
You need to check whether type is None, i.e. whether soup.find actually found what you searched for.
Also, don't use the name type, it's a builtin.
find returns a single Tag object, much like find_all returns a list of them. If you call print on a Tag you see its string representation, but that automatic conversion isn't invoked by file.write. You have to decide what attribute of found you want to write.
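For example, a minimal sketch of the final loop, writing either the visible text or the full markup of each tag (which one you want is up to you); this assumes Python 2, as in the question:

for found in type:
    # write only the visible text of the tag, encoded for the Python 2 file object
    text_file.write(found.get_text().encode("utf-8") + "\n")
    # ...or keep the full markup instead:
    # text_file.write(str(found) + "\n")
text_file.close()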

Python Mechanize won't properly handle a redirect

I'm working on a scraper using Mechanize and Beautiful Soup in Python and for some reason redirects aren't working. Here's my code (I apologize for naming my variables "thing" and "stuff"; I don't normally do that, trust me):
stuff = soup.find('div', attrs={'class' : 'paging'}).ul.findAll('a', href=True)
for thing in stuff:
    pageUrl = thing['href']
    print pageUrl
    req = mechanize.Request(pageUrl)
    response = browser.open(req)
    searchPage = response.read()
    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString
Anyway, products on Kraft's website that have more than one page of search results display a link to go to the next page(s). The source code lists, for example, this as the next page for Kraft's line of steak sauces and marinades, which redirects to this
Anyway, thing['href'] has the old link in it because it scrapes the web page for it; one would think that doing browser.open() on that link would cause mechanize to go to the new link and return that as a response. However, running the code gives this result:
http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
response = browser.open(req)
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out
I get a time-out; I imagine it's because, for some reason, mechanize is looking for the old URL and isn't being redirected to the new one (I also tried this with urllib2 and received the same result). What's going on here?
Thanks for the help and let me know if you need any more information.
Update: Alright, I enabled logging; now my code reads:
req = mechanize.Request(pageUrl)
print logging.INFO
When I run it I get this:
url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2'
20
Update 2 (which occurred while writing the first update): It turns out that it was the spaces in my string! All I had to do was this: pageUrl = thing['href'].replace(' ', "+") and it works perfectly.
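For reference, a more general fix than hand-replacing spaces is to percent-encode the URL before handing it to mechanize; a sketch assuming Python 2, as used in the question:

import urllib

import mechanize

pageUrl = ("http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx"
           "?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2")
# quote() percent-encodes characters that are illegal in a URL (such as the
# spaces here), while the safe set keeps the URL structure intact.
safeUrl = urllib.quote(pageUrl, safe="%/:=&?~#+!$,;'@()*[]")

browser = mechanize.Browser()
response = browser.open(mechanize.Request(safeUrl))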
Both urllib2 and mechanize openers include a handler for redirect responses by default (you can check by looking at the handlers attribute), so I don't think the problem is that a redirect response isn't being correctly followed.
To troubleshoot the problem, you should capture the traffic in your web browser (in Firefox, Live HTTP Headers and HttpFox are useful for this) and compare it with the logs from your script (I'd recommend subclassing urllib2.BaseHandler to create your own handler that logs all the information you need for every request, and adding the handler to your opener object using the add_handler method).
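As a concrete starting point for that suggestion, here is a rough sketch of such a logging handler for plain urllib2 (the class and what it prints are hypothetical; adjust them to whatever you need to compare against the browser capture):

import urllib2

class LoggingHandler(urllib2.BaseHandler):
    """Print every outgoing request URL and every response status/URL."""
    handler_order = 100  # run early, before the default handlers

    def http_request(self, request):
        print "REQUEST: ", request.get_full_url()
        return request

    def http_response(self, request, response):
        print "RESPONSE:", response.code, response.geturl()
        return response

    https_request = http_request
    https_response = http_response

opener = urllib2.build_opener(LoggingHandler())
print opener.open("http://www.kraftrecipes.com/").code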
