This is a somewhat strange question, but I hope somebody can solve it anyway, because I've already tried so much and just couldn't get any further. Thanks in advance.
I have a problem with my Python script. To be precise, it's not just one problem. The aim of the script is to automatically search for keywords from a list on a search engine (www.startpage.com) and then count how often each search term appears on the results page. If a term occurs more than 14 times, it is saved to one list; if it occurs 14 times or fewer, to another.
My problem now is that errors keep occurring. The program runs about 20 times successfully, but then some error appears, and the error doesn't even seem right. For example, an error occurs saying that a variable is undefined even though it is defined. Or the error looks like this:
File "webcrawler.py", line 42, in <module>
email_count=get_results(liste[x])
File "webcrawler.py", line 28, in get_results
search_box.send_keys(search_term)
AttributeError: 'str' object has no attribute 'send_keys'
But this error doesn't really make sense, since the program had already run 20 times without it.
My code looks like this:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

# opens the text file with the list of all search terms
with open("list1.txt") as infile:
    list1 = [line.strip() for line in infile]

# this function searches for "search_term" on the website www.startpage.com
def get_results(search_term):
    url = "https://www.startpage.com"
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    browser = webdriver.Chrome(chrome_options=options)
    browser.get(url)
    try:
        search_box = browser.find_element_by_id("q")
    except NoSuchElementException:
        print("An error occurred!")
    search_box.send_keys(search_term)
    search_box.submit()
    time.sleep(3)
    source_code = browser.page_source.strip().lower()
    browser.close()
    time.sleep(1)
    email = search_term.lower()
    return source_code.count(email)
# text files for the results
f = open("works.txt", "a")
g = open("works_not.txt", "a")

x = 0
while x < len(list1):
    email_count = get_results(list1[x])
    # saves the list item to a result file (sorted by how often the search term appeared on the results page)
    if email_count < 15:
        g.write(list1[x])
        g.write("\n")
        g.flush()
        time.sleep(1)
    else:
        f.write(list1[x])
        f.write("\n")
        f.flush()
        time.sleep(1)
    x = x + 1
Is there something wrong with the code, or should I add sleep() somewhere? I'm sorry that this is such a strange question, but I hope someone sees the problem.
A thousand thanks in advance.
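Whatever produces the odd 'str' message, there is one hole in get_results() worth closing: when find_element_by_id("q") raises NoSuchElementException, the except branch only prints a message and execution falls through to search_box.send_keys(...) anyway, even though search_box was never successfully assigned. A minimal, more defensive sketch, assuming the same element id "q" and headless-Chrome setup as in the question:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

def get_results(search_term):
    # same headless-Chrome setup as in the question
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    browser = webdriver.Chrome(chrome_options=options)
    try:
        browser.get("https://www.startpage.com")
        try:
            search_box = browser.find_element_by_id("q")
        except NoSuchElementException:
            print("Search box not found for:", search_term)
            return None  # signal failure to the caller instead of falling through
        search_box.send_keys(search_term)
        search_box.submit()
        time.sleep(3)
        return browser.page_source.lower().count(search_term.lower())
    finally:
        browser.quit()  # quit() also ends the chromedriver process

With that shape, the loop at the bottom of the script can skip or retry a keyword whenever None comes back, instead of crashing twenty keywords into the run.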
I want to create a program that can read all the messages from my WhatsApp and print them to the screen using Python.
To do that I tried using the whatsapp-web library, https://pypi.org/project/whatsapp-web/.
But when I tried to run their code example, I got a timeout error.
This is the code:
import time
from selenium import webdriver
from simon.accounts.pages import LoginPage
from simon.header.pages import HeaderPage
from simon.pages import BasePage
# Creating the driver (browser)
driver = webdriver.Firefox()
driver.maximize_window()
login_page = LoginPage(driver)
login_page.load()
login_page.remember_me = False
time.sleep(7)
base_page = BasePage(driver)
base_page.is_welcome_page_available()
base_page.is_nav_bar_page_available()
base_page.is_search_page_available()
base_page.is_pane_page_available()
base_page.is_chat_page_available()
# 3. Logout
header_page = HeaderPage(driver)
header_page.logout()
# Close the browser
driver.quit()
And this is the error:
base_page.is_welcome_page_available()
File "D:\zoom\venv\lib\site-packages\simon\pages.py", line 18, in wrapper
    return func(*args, **kwargs)
File "D:\zoom\venv\lib\site-packages\simon\pages.py", line 51, in is_welcome_page_available
    if self._find_element(WelcomeLocators.WELCOME):
File "D:\zoom\venv\lib\site-packages\simon\pages.py", line 77, in _find_element
    lambda driver: self.driver.find_element(*locator))
File "D:\zoom\venv\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Your code is not what needs correcting; the error is coming from the imported library's pages.
Try increasing the time limit for the page to load, e.g. use time.sleep(15) instead of time.sleep(7).
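If a fixed sleep still races the page load, an explicit wait is usually more robust. A minimal sketch with plain Selenium, sidestepping the simon page objects; the CSS selector here is only an assumed placeholder you would replace after inspecting the live page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://web.whatsapp.com/")

# Poll for up to 60 seconds for some landmark element instead of sleeping a fixed time.
WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#side")))  # assumed selector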
You can automate WhatsApp Web yourself without using the whatsapp-web pip package, because I suspect that package is no longer updated.
The code gives that error because WhatsApp Web has changed its elements' class names; because of that, the code cannot find the welcome page within the given time limit, so the program execution breaks.
I have done the same thing without the whatsapp-web pip package.
I hope it will work for you.
A complete code reference example:
from selenium import webdriver
import time

# You can use any web browser supported by selenium which can run WhatsApp Web.
# For Google Chrome:
web_driver = webdriver.Chrome("Chrome_Driver_Path/chromedriver.exe")
web_driver.get("https://web.whatsapp.com/")

# For Firefox:
# web_driver = webdriver.Firefox(executable_path=r"C:/Users/Pascal/Desktop/geckodriver.exe")
# web_driver.get("https://web.whatsapp.com/")

time.sleep(25)  # time to scan the QR code

# Please make sure that the QR code scan was successful.
confirm = int(input("Press 1 to proceed if successfully logged in, or press 0 to retry: "))
if confirm == 1:
    print("Continuing...")
elif confirm == 0:
    web_driver.close()
    exit()
else:
    print("Sorry, please try again")
    web_driver.close()
    exit()

while True:
    unread_chats = web_driver.find_elements_by_xpath("//span[@class='_38M1B']")
    # In the line above, replace the class name with the current one by inspecting the span element
    # that contains the number of unread messages (the green circle on the contact card).

    # Open each chat in a loop and read the messages.
    for chat in unread_chats:
        chat.click()
        time.sleep(2)
        # Get the messages in order to act on them.
        message = web_driver.find_elements_by_xpath("//span[@class='_3-8er selectable-text copyable-text']")
        # In the line above, replace the class name with the current one by inspecting the span element
        # that contains the received text message of a chat room.
        for i in message:
            try:
                print("Message received: " + str(i.text))
                # Here you can add your own code to act on the message as needed.
            except:
                pass
Please make sure that the indentation is the same in the code blocks if you are copying them.
See the following link for more info about automating WhatsApp Web with Python:
https://stackoverflow.com/a/68288416/15284163
I am developing a WhatsApp bot using Python.
For contributions you can contact me at: anurag.cse016@gmail.com
Please give a star on https://github.com/4NUR46 if this answer helps you.
I'm working on a web scraper that needs to open several thousand pages and get some data from each.
Since the data field I need most is only loaded after all of the site's JavaScript has run, I'm using requests-html to render the page and then extract the data.
I want to know what the best way to do this is:
1- Open a session at the start of the script, do the whole scrape, and close the session when the script finishes, after thousands of "clicks" and several hours?
2- Or open a session every time I open a link, render the page, get the data, and then close the session, repeating n times in a cycle?
Currently I'm doing the 2nd option, but I'm running into a problem. This is the code I'm using:
def getSellerName(listingItems):
    for item in listingItems:
        builtURL = item['href']
        try:
            session = HTMLSession()
            r = session.get(builtURL, timeout=5)
            r.html.render()
            sleep(1)
            sellerInfo = r.html.search("<ul class=\"seller_name\"></ul></div><a href=\"{user}\" target=")["user"]
            ##
            ## Do some stuff with sellerInfo
            ##
            session.close()
        except requests.exceptions.Timeout:
            log.exception("TimeOut Ex: ")
            continue
        except:
            log.exception("Gen Ex")
            continue
        finally:
            session.close()
        break
This works pretty well and is quite fast. However, after about 1.5 or 2 hours, I start getting OS exception like this one:
OSError: [Errno 24] Too many open files
And then that's it, I just get this exception over and over again, until I kill the script.
I'm guessing I need to close something else after every get and render, but I'm not sure what or if I'm doing it correctly.
Any help and/or suggestions, please?
Thanks!
You should create the session object outside the loop:

def getSellerName(listingItems):
    session = HTMLSession()
    for item in listingItems:
        # ... your existing scraping code, reusing the same session ...
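A slightly fuller sketch of that idea, assuming the same HTMLSession, log, and sleep names as in the question. The session (and the single headless browser that render() spawns behind it) is created once and closed in a finally block, which is what keeps the number of open files bounded:

import logging
import requests
from time import sleep
from requests_html import HTMLSession

log = logging.getLogger(__name__)

def getSellerName(listingItems):
    session = HTMLSession()
    try:
        for item in listingItems:
            builtURL = item['href']
            try:
                r = session.get(builtURL, timeout=5)
                r.html.render()  # reuses the session's one rendering browser
                sleep(1)
                sellerInfo = r.html.search(
                    "<ul class=\"seller_name\"></ul></div><a href=\"{user}\" target=")["user"]
                # ... do some stuff with sellerInfo ...
            except requests.exceptions.Timeout:
                log.exception("TimeOut Ex: ")
            except Exception:
                log.exception("Gen Ex")
    finally:
        session.close()  # closes the pooled connections and the browser once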
I am trying to extract all the comments on a movie from this page: https://www.imdb.com/title/tt0114709/reviews?ref_=tt_ql_3. Some of them are hidden behind a "Load More" button. I have tried to click on this button with Selenium, but it doesn't seem to work. Here are my code and the error message, in case someone has an idea of how to achieve this.
h = httplib2.Http("./docs/.cache")
resp, content = h.request(url, "GET")
soup = bs4.BeautifulSoup(content, "html.parser")
divs = soup.find_all("div")

driver = webdriver.Chrome(executable_path='C:\Program Files\Intel\iCLS Client\chromedriver.exe')
driver.get(url)
html = driver.page_source.encode('utf-8')

while driver.find_elements_by_class_name("load-more-data"):
    driver.find_elements_by_name("Load More").click()
Traceback (most recent call last):
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 567, in <module>
    Mat()
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 518, in Mat
    dicoCam =testC.extract_data()
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 368, in extract_data
    self.extract_comment(movie, url)
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 469, in extract_comment
    driver.find_elements_by_name("Load More").click()
AttributeError: 'list' object has no attribute 'click'
As you can see in the error message, a list is returned when doing:
driver.find_elements_by_name("Load More")
That's why I suggest doing this:
driver.find_elements_by_name("Load More")[0].click()
You have to make sure that there is only one element named Load More.
If that is not the case, increase the list index [0] by 1 for each additional element named Load More.
Hope that helped.
EDIT: If you still get error messages, like list index out of range, the driver.find_elements_by_name() function isn't working the way you want it to.
I'm not an expert when it comes to driving the web with Python, but you could look for functions like driver.find_elements_by_innerhtml() or driver.find_elements_by_text(). Is there any function like that?
The reason for the error is that you searched with find_elements_by_name (note the plural "elements"), so it returns a list, since you are asking it to find multiple elements. If you want to keep clicking the "Load More" button until it is gone, I suggest:

from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        driver.find_element_by_class_name("load-more-data").click()
    except NoSuchElementException:
        break
I'm not sure if the class names are still accurate, though, since they are based on your example; I didn't inspect the web page you gave. You can adapt my code to your situation if it doesn't work.
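As a variant, if the button is injected asynchronously, a bare find_element call can raise even while more reviews remain. Here is a sketch that waits for a clickable button each round, using the driver from the question; the ipl-load-more__button class name is an assumption about the IMDb page, not something I have verified:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
while True:
    try:
        # wait until a clickable "Load More" button shows up, then click it
        button = wait.until(EC.element_to_be_clickable(
            (By.CLASS_NAME, "ipl-load-more__button")))  # assumed class name
        button.click()
    except TimeoutException:
        break  # no button within 10 s: assume all comments are loaded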
I've seen a few instances of this question, but I was not sure how to apply the changes to my particular situation. I have code that monitors a webpage for changes and refreshes every 30 seconds, as follows:
import sys
import ctypes
from time import sleep
from Checker import Checker

USERNAME = sys.argv[1]
PASSWORD = sys.argv[2]

def main():
    crawler = Checker()
    crawler.login(USERNAME, PASSWORD)
    crawler.click_data()
    crawler.view_page()
    while crawler.check_page():
        crawler.wait_for_table()
        crawler.refresh()
    ctypes.windll.user32.MessageBoxW(0, "A change has been made!", "Attention", 1)

if __name__ == "__main__":
    main()
The problem is that Selenium will always show an error stating it is unable to locate the element after the first refresh has been made. The element in question, I suspect, is a table from which I retrieve data using the following function:
def get_data_cells(self):
    contents = []
    table_id = "table.datadisplaytable:nth-child(4)"
    table = self.driver.find_element(By.CSS_SELECTOR, table_id)
    cells = table.find_elements_by_tag_name('td')
    for cell in cells:
        contents.append(cell.text)
    return contents
I can't tell if the issue is in the above function or in the main(). What's an easy way to get Selenium to refresh the page without returning such an error?
Update:
I've added a wait function and adjusted the main() function accordingly:
def wait_for_table(self):
    table_selector = "table.datadisplaytable:nth-child(4)"
    delay = 60
    try:
        wait = ui.WebDriverWait(self.driver, delay)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, table_selector)))
    except TimeoutException:  # selenium's TimeoutException, not the built-in TimeoutError
        print("Operation timeout! The requested element never loaded.")
Since the same error is still occurring, either my timing function is not working properly or it is not a timing issue.
I've run into the same issue while doing web scraping before and found that re-sending the GET request (instead of refreshing) seemed to eliminate it.
It's not very elegant, but it worked for me.
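In terms of the question's code, that amounts to re-issuing the GET for the current URL instead of calling refresh(), roughly:

def refresh(self):
    # re-send the GET request rather than using the browser's refresh
    self.driver.get(self.driver.current_url)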
I appear to have fixed my own problem.
My refresh() function was written as follows:
def refresh(self):
    self.driver.refresh()
All I did was switch frames right after the refresh() call. That is:
def refresh(self):
    self.driver.refresh()
    self.driver.switch_to.frame("content")
This took care of it. I can see that the page is now refreshing without issues.
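A related detail, in case someone lands here with the same symptom: the wait itself also has to happen after switching into the frame, and Selenium ships an expected condition that combines the two steps. A sketch, assuming the frame name "content" and the table selector from above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def refresh(self):
    self.driver.refresh()
    wait = WebDriverWait(self.driver, 60)
    # wait for the frame to exist, then switch into it, in one step
    wait.until(EC.frame_to_be_available_and_switch_to_it("content"))
    # only now can elements inside the frame be located
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "table.datadisplaytable:nth-child(4)")))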
Simple script here: I'm just trying to get the number of people in a gym from a webpage every 15 minutes and save the result in a text file. However, the script keeps outputting the result from the first time I ran it (39), as opposed to the updated number of 93 (which can be seen by refreshing the webpage). Any ideas why this is? Note: I set the sleep time to 10 seconds in case you want to run it yourself.
from lxml import html
import time
import requests

x = 'x'
while x == x:
    time.sleep(10)
    page = requests.get('http://www.puregym.com/gyms/holborn/whats-happening')
    string = html.fromstring(page.content)
    people = string.xpath('normalize-space(//span[@class="people-number"]/text()[last()])')
    print people
    # printing it for debug purposes
    f = open("people.txt", "w")
    f.write(people)
    f.write("\n")
Cheers
You are not closing the people.txt file after each loop; it is better to use Python's with statement to handle this, as follows:
from lxml import html
import time
import requests

x = 'x'
while x == 'x':
    time.sleep(10)
    page = requests.get('http://www.puregym.com/gyms/holborn/whats-happening')
    string = html.fromstring(page.content)
    people = string.xpath('normalize-space(//span[@class="people-number"]/text()[last()])')
    print people
    # printing it for debug purposes
    with open("people.txt", "w") as f:
        f.write('{}\n'.format(people))
If you want to keep a log of all entries, you would need to move the with statement outside your while loop, as sketched below. Also, I think you meant while x == 'x'. Currently the site is showing 39, which is what appears in people.txt.
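A sketch of that log-keeping variant, with the file opened once in append mode outside the loop (same Python 2 style as the question):

from lxml import html
import time
import requests

with open("people.txt", "a") as f:  # append mode: one line per poll
    while True:
        time.sleep(10)
        page = requests.get('http://www.puregym.com/gyms/holborn/whats-happening')
        tree = html.fromstring(page.content)
        people = tree.xpath('normalize-space(//span[@class="people-number"]/text()[last()])')
        print people  # debug output
        f.write('{}\n'.format(people))
        f.flush()  # make each entry visible without waiting for close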