I'm trying to do some web scraping on a site, but I can't get to the next page in Safari.
The site is: https://www.emol.com/todas/
The code just gives me the results of the first page twice; I need the first 3 pages.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common import exceptions
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
browser = webdriver.Safari()
browser.get("https://www.emol.com/todas/")
noticias = []
i = 0
while i < 2:
    try:
        nav = browser.find_elements_by_class_name("cont_bus_txt_detall_2")
        for value in nav:
            noticias.append(value.text)
        browser.find_element_by_css_selector("a[href*='javascript:Next();']").click()
        i += 1
    except exceptions.StaleElementReferenceException:
        pass
The HTML below is what I see when I inspect the next-page button in Safari:
<a class="next current-page-next-prev" href="javascript:Next();"><span class="txt_siguiente">Siguiente</span> <i class="fa fa-chevron-right"></i></a>
The problem here is that you are selecting the element with a CSS selector, but that selector can match multiple elements and is static (the same) on every page, which either raises an error or, as you said, keeps landing you on the same content.
Try this (change N to the page number you want to go to; second page == 2):
browser.find_element_by_xpath("/html/body/div[4]/div/div/div/div[3]/div/nav[1]/ul/li[N]/a").click()
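For example, a minimal sketch that walks the first three pages by substituting N in a loop; the li index matching the page number is taken from the answer above, and the time.sleep after each click is my own addition since the list reloads via AJAX:

from selenium import webdriver
import time

browser = webdriver.Safari()
browser.get("https://www.emol.com/todas/")

noticias = []
for n in range(1, 4):                 # pages 1, 2 and 3
    if n > 1:                         # page 1 is already loaded
        browser.find_element_by_xpath(
            "/html/body/div[4]/div/div/div/div[3]/div/nav[1]/ul/li[%d]/a" % n
        ).click()
        time.sleep(2)                 # crude wait for the AJAX reload
    for value in browser.find_elements_by_class_name("cont_bus_txt_detall_2"):
        noticias.append(value.text)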
The simplest way to do this would be by getting the text of the pagination elements. Here is an example for the above:
>>> from selenium import webdriver
>>> driver=webdriver.Chrome()
>>> driver.get('https://www.emol.com/todas/')
>>> pagination_elements = [driver.find_element_by_xpath('//ul[@id="listPages"]/li/a[text()=%s]' % n) for n in range(1,4)]
>>> len(pagination_elements)
# 3
>>> pagination_elements[2].click() # to view page 3
Notice how much cleaner this is:
//ul[@id="listPages"]/li/a[text()=%s]
The "cleaner" you can make the xpaths the more resilient your scraping becomes to changes in the html. And believe me, the html changes all the time for a live site... Notice how we can easily get all the pagination elements you want here with a single line of code as well.
Finally, a much better way to scrape the page would be to inspect the network tab and get the actual data that is being emitted on that ajax call. For example, in Chrome dev tools it will give you something like this:
https://cache-elastic-pandora.ecn.cl/emol/noticia/_search?q=publicada:true+AND+ultimoMinuto:true+AND+seccion:+AND+temas.id:&sort=fechaModificacion:desc&size=15&from=45 (<== Note, StackOverflow doesn't markup the whole link so you'll need to copy-paste it).
This will give you json of size 15 starting from the 45th result. You can play around with the parameters there to grab the data much more easily. For example, try changing the size to "1000" and see what happens. Good luck!
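For example, a hedged sketch that pulls that JSON directly with requests, using the URL above verbatim; the hits/_source layout is standard Elasticsearch, but the "titulo" field name is a guess on my part:

import requests

url = ("https://cache-elastic-pandora.ecn.cl/emol/noticia/_search"
       "?q=publicada:true+AND+ultimoMinuto:true+AND+seccion:+AND+temas.id:"
       "&sort=fechaModificacion:desc&size=15&from=45")
data = requests.get(url).json()
for hit in data.get("hits", {}).get("hits", []):
    # "titulo" is an assumption about the document schema; inspect the JSON first
    print(hit.get("_source", {}).get("titulo"))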
I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results despite having 100 search results loaded.
Ideally, I would like for it to scrape all search results.
Is there a reason for this?
from selenium import webdriver
import time
import re
import pandas as pd
PATH = r'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)
time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
# print(emails)
driver.close()
The code is working as expected: it scrapes 10 results, which is the default for a Google search. You can use methods like find_element_by_xpath to find the next button and click it.
This operation needs to be repeated in a loop until enough results have been collected. Refer to this for more details: selenium locating elements
For how to use the Selenium commands, you can look it up on the web. I found a similar question which can provide some reference.
Following up on Bijendra's answer,
you could update the code as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd
PATH = r'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
for i in range(2):
    html = driver.page_source
    for e in re.findall(email_pattern, html):
        emails.append(e)
    a_attr = driver.find_element(By.ID, "pnnext")
    a_attr.click()
    time.sleep(2)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv',index=False)
driver.close()
You could either change the range value passed to the for loop or replace the for loop entirely with a while loop. So instead of:
for i in range(2):
You could do:
while len(emails) < 100:
Make sure to manage the timing: wait for the next page to load before extracting the available emails, and only then click the next button on the search results page.
Make sure to refer to the docs to get a clear idea of what you need to do to achieve what you want. Happy Hacking!!
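For example, a hedged sketch of that waiting logic using an explicit wait instead of the fixed sleep, continuing from the variables (driver, emails, email_pattern, re) defined in the code above; the 10-second timeout is an arbitrary choice:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)          # give each page up to 10 seconds
while len(emails) < 100:
    for e in re.findall(email_pattern, driver.page_source):
        emails.append(e)
    # only click "Next" once the link is actually clickable;
    # raises TimeoutException if there is no further page
    next_link = wait.until(EC.element_to_be_clickable((By.ID, "pnnext")))
    next_link.click()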
Selenium loads its own clean browser profile, so your Google setting for 100 results needs to be set in the code; the default is 10 results, which is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL.
If you need more information on the query parameter that achieves this, it's the second method described here:
tldevtech.com/how-to-show-100-results-per-page-in-google-search
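For example, a hedged sketch reusing target_url, driver, email_pattern and re from the question's code, assuming the num query parameter described in that article still controls results per page:

# num=100 asks Google for 100 results per page; Google may ignore or change
# this parameter at any time, so treat it as an assumption
target_url = target_url + "&num=100"
driver.get(target_url)
emails = re.findall(email_pattern, driver.page_source)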
I am trying to do some web crawling through Selenium. However, when I run the code, it does not show the result.
Here is my code:
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
import pandas as pd
driver = webdriver.Chrome()
url = 'https://vimeo.com/510879223'
driver.get(url)
#head > meta:nth-child(14)
#/html/head/meta[8]
title = driver.find_element(By.CSS_SELECTOR,"head > meta:nth-child(14)")
print (title.text)
description = driver.find_element(By.XPATH,"//meta[@property='og:description']").text
print (description)
Result:
Process finished with exit code 0
In this case, what should I add or delete? Does this happen because the site I want to scrape does not support the XPath scraping option?
If I do print (title), the result is:
<selenium.webdriver.remote.webelement.WebElement (session="6f182a4afb7c1173f1e74f1cd6a40d87", element="e10f1407-3a09-4f3e-96e4-19071cda7d8e")>
It feels like there is a result, but I cannot read it as text. In this case, what is the best way to fix it? Thank you!
In your case I would recommend finding the title by XPath, since the CSS selector you are trying to use points me to the description tag. Notice that the text you are looking for is not stored as visible text on the page but rather in the content attribute. Using the .get_attribute() method should help.
title = driver.find_element(By.XPATH,"//meta[@property='og:title']").get_attribute('content')
print (title)
description = driver.find_element(By.XPATH,"//meta[@property='og:description']").get_attribute('content')
print (description)
Hi, sometimes errors occur when copying an XPath, and it may not match the element you want to act on. You can try these commands instead:
find_element_by_css_selector
find_element_by_class_name
My example project may help you.
https://github.com/kaayaumutt/instagramBotApp/blob/main/instagramBotApp/instagramBotApp.py
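For example, a hedged sketch of the same idea with a CSS selector instead of XPath, reusing the driver from the question; the og:title and og:description property names come from the earlier answer:

from selenium.webdriver.common.by import By

# the meta tags store their text in the content attribute, not as visible text
title = driver.find_element(By.CSS_SELECTOR, "meta[property='og:title']").get_attribute('content')
description = driver.find_element(By.CSS_SELECTOR, "meta[property='og:description']").get_attribute('content')
print(title)
print(description)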
I am trying to identify a JavaScript button on a website and press it to extend the page.
The website in question is the Tencent appstore after performing a basic search. At the bottom of the page is a button ("div.load-more-new") which, when pressed, extends the page with more apps.
The HTML is as follows:
<div data-v-33600cb4="" class="load-more-btn-new" style="">
<a data-v-33600cb4="" href="javascript:void(0);">加载更多
<i data-v-33600cb4="" class="load-more-icon">
</i>
</a>
</div>
At first I thought I could identify the button using BeautifulSoup, but all calls to find come back empty.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
url = 'https://webcdn.m.qq.com/webapp/homepage/index.html#/appSearch?kw=%25E7%2594%25B5%25E5%25BD%25B1'
WebDriver = webdriver.Chrome('/chromedriver')
WebDriver.get(url)
time.sleep(5)
# Find using BeautifulSoup
soup = BeautifulSoup(WebDriver.page_source,'lxml')
button = soup.find('div',{'class':'load-more-btn-new'})
[0] []
After looking around here, it became apparent that even if I could find it with BeautifulSoup, it would not help in pressing the button. Next I tried to find the element in the driver and use .click():
driver.find_element_by_class_name('div.load-more-btn-new').click()
[1] NoSuchElementException
driver.find_element_by_css_selector('.load-more-btn-new').click()
[2] NoSuchElementException
driver.find_element_by_class_name('a.load-more-new.load-more-btn-new[data-v-33600cb4]').click()
[3] NoSuchElementException
But all return the same error: NoSuchElementException.
Your selections won't work because they do not point at the <a>.
This one selects by class name and you try to click the <div> that holds your <a>:
driver.find_element_by_class_name('div.load-more-btn-new').click()
This one is very close but is missing the a in selection:
driver.find_element_by_css_selector('.load-more-btn-new').click()
This one tries find_element_by_class_name but is a wild mix of tag, attribute and class:
driver.find_element_by_class_name('a.load-more-new.load-more-btn-new[data-v-33600cb4]').click()
How to fix?
Select your element more specifically, close to your second approach:
driver.find_element_by_css_selector('.load-more-btn-new a').click()
or
driver.find_element_by_css_selector('a[data-v-33600cb4]').click()
Note:
When working with newer Selenium versions you will get a DeprecationWarning: find_element_by_* commands are deprecated. Please use find_element() instead:
from selenium.webdriver.common.by import By
driver.find_element(By.CSS_SELECTOR, '.load-more-btn-new a').click()
I am trying to scrape a public Facebook group using BeautifulSoup. I am using the mobile site because it lacks JavaScript. This script is supposed to follow the link behind the 'More' keyword and get the text from the p tag there, but it only gets the text from the current page's p tag. Can someone point out the problem to me? I am new to Python and everything in this code.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import requests
browser = webdriver.Firefox()
browser.get('https://mobile.facebook.com/groups/22012931789?refid=27')
for elem in browser.find_elements_by_link_text('More'):
    page = requests.get(elem.get_attribute("href"))
    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup.find_all('p')[0].get_text())
It's always useful to see what your script is actually doing; a quick way of doing this is by printing your results at certain steps along the way.
For example, using your code:
for elem in browser.find_elements_by_link_text('More'):
    print("elem's href attribute: {}".format(elem.get_attribute("href")))
You'll notice that the first one's blank. We should test for this before trying to get requests to fetch it:
for elem in browser.find_elements_by_link_text('More'):
    if elem.get_attribute("href"):
        print("Trying to get {}".format(elem.get_attribute("href")))
        page = requests.get(elem.get_attribute("href"))
        soup = BeautifulSoup(page.content, 'html.parser')
        print(soup.find_all('p')[0].get_text())
Note that an empty elem.get_attribute("href") returns an empty unicode string, u'', but Python considers an empty string to be false, which is why that if works.
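A quick check in the interpreter shows why (a minimal illustration, not part of the original code):

>>> bool(u'')                                   # an empty href comes back as u'', which is falsy
False
>>> bool(u'https://mobile.facebook.com/...')    # a real href is truthy
True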
That version works fine on my machine. Hope that helps!
I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.
I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:
from selenium import webdriver
import codecs
filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times; however, it only works as planned on some webpages (it also makes the script a LOT slower).
I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.
Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is that I wanted JavaScript-rendered text.
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean
url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    with open('/tmp/source.html', 'w') as f:
        f.write(content.encode('utf-8'))
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                words = words.encode('utf-8')
                f.write(words + '\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
Here's a variation on @unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing
import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug
cache = FileSystemCache('.cachedir', threshold=100000)
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"
# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7)  # week in seconds
# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>,<style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print(root.text_content())  # extract text
I've separated your task into two parts:
get page (including elements generated by javascript)
extract text
The two parts of the code are connected only through the cache. You can fetch pages in one process and extract text in another process, or defer doing it until later using a different algorithm.
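For example, a minimal sketch of running the extraction step on its own later, reading the cached page source back; this assumes the same '.cachedir' directory and a URL that was already fetched, and it uses the same werkzeug FileSystemCache API as above:

import sys
import lxml.html as html
from lxml.html.clean import Cleaner
from werkzeug.contrib.cache import FileSystemCache

cache = FileSystemCache('.cachedir', threshold=100000)   # same cache directory as above
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"

page_source = cache.get(url)
if page_source is None:
    sys.exit("page not cached yet; run the fetch step first")

root = html.document_fromstring(page_source)
Cleaner(kill_tags=['noscript'], style=True)(root)   # strip <script>, <style>, etc.
print(root.text_content())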