Python Selenium - Get everything and write to file - python

Noob here who just managed to get actively refused by the remote server - too many connection attempts, I suspect.
...and really, I should not be reconnecting every time I want to try some new code, so that got me to this question:
How can I grab everything off the page, save it to a file, and then just load the file offline to search for the fields I need?
I was in the process of testing the code below when I was refused, so I don't know what works - there are probably typos in it :/
Could anyone please offer any suggestions or improvements?
print ("Get CSS elements from page")
parent_elements_css = driver.find_elements_by_css_selector("*")
driver.quit()
print ("Saving Parent_Elements to CSV")
with open('ReadingEggs_BookReviews_Dump.csv', 'w') as file:
file.write(parent_elements_css)
print ("Open CSV to Parents_Elements")
with open('ReadingEggs_BookReviews_Dump.csv', 'r') as file:
parent_elements_css = file
print ("Find the children of the Parent")
# Print stuff to screen to quickly find the css_selector 'codes'
# A bit brute force ish
for css in parent_elements_css:
print (css.text)
child_elements_span = parent_element.find_element_by_css_selector("span")
child_elements_class = parent_element.find_element_by_css_selector("class")
child_elements_table = parent_element.find_element_by_css_selector("table")
child_elements_tr = parent_element.find_element_by_css_selector("tr")
child_elements_td = parent_element.find_element_by_css_selector("td")
These other pages looked interesting:
python selenium xpath/css selector
Get all child elements
Locating Elements
xpath-partial-match-tr-id-with-python-selenium (ah cos I asked this one :D..but the answer by Sers is awesome)
My previous file save used a dictionary and json, but I could not use that approach here because of this error: "TypeError: Object of type WebElement is not JSON serializable". I had not saved files before that.

You can get the HTML of the whole page via driver.page_source. You can then parse the HTML with Beautiful Soup:
from bs4 import BeautifulSoup
# navigate to page
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'html.parser')
child_elements_span = soup.find_all('span')
child_elements_table = soup.find_all('table')
The BeautifulSoup documentation is a good reference for parsing the HTML: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
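To cover the offline part of the question: page_source is just a string, so you can save it to disk once and then parse the saved file as often as you like without touching the server again. A minimal sketch, assuming a driver that has already loaded the page (the file name is arbitrary):

from bs4 import BeautifulSoup

# One-time: fetch the page and save the raw HTML to disk
html_doc = driver.page_source
driver.quit()
with open('page_dump.html', 'w', encoding='utf-8') as f:
    f.write(html_doc)

# Later, fully offline: load the saved HTML and search it
with open('page_dump.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for span in soup.find_all('span'):
    print(span.get_text())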

Related

Python web scraping javascript table using selenium

Hello to all, this is my first post - I hope you are well!
I'm trying to extract the content of the table located on this site: https://pamestoixima.opap.gr/ . I want to extract the table as is. I have tried several approaches, such as BeautifulSoup, pandas, and Selenium, but with no success! The latest code I have tried is this:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:path to /chromedriver.exe')
driver.get('https://pamestoixima.opap.gr/')
soup = BeautifulSoup(driver.page_source, 'lxml')
author_element = soup.find("table", class_="results-table")
print(author_element.text)
driver.quit()
The error message I get is this:
USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection:
Thank you in advance for any help!
Do you specifically want to use Selenium? I haven't looked at it in detail, but I think you can find all the data in this file: https://api.opap.gr/sb/sport/soccer/coupon?locale=el&onlyLive=false&marketIds=1%2C2%2C31%2C21%2C18%2C18%2C18%2C14&fromDate=2022-05-03&toDate=2022-05-03
In many cases you can use the Developer Tools in your browser to find a direct link to the data; that is usually much easier than scraping the page.
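If that endpoint does return what you need, a plain requests call is enough and no browser is involved. A minimal sketch (the structure of the response is an assumption - print it and inspect it to find the table fields you want):

import requests

url = ("https://api.opap.gr/sb/sport/soccer/coupon"
       "?locale=el&onlyLive=false"
       "&marketIds=1%2C2%2C31%2C21%2C18%2C18%2C18%2C14"
       "&fromDate=2022-05-03&toDate=2022-05-03")

response = requests.get(url, timeout=10)
response.raise_for_status()
data = response.json()

# Inspect the top-level structure first, then drill down
print(type(data))
print(str(data)[:500])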

Python beautiful soup web scraper doesn't return tag contents

I'm trying to scrape matches and their respective odds from a local bookie site, but on every site I try, my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents?
I have tried all the sites below for almost a month, but with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code, as shown:
import requests
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds, and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0", nothing else.
It looks like the base site gets loaded in two phases:
- Load some HTML structure for the page,
- Use JavaScript to fill in the contents.
You can prove this to yourself by right-clicking on the page, choosing "view page source", and then searching for "events-container" (it is not there).
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
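A quick way to confirm that two-phase loading from Python, just as a sketch: fetch the page with requests and check whether the class name you saw in the browser's element inspector appears in the raw HTML at all.

import requests

response = requests.get("https://www.betpawa.ug/", timeout=5)
# If this prints False, the element is injected by JavaScript after
# page load, so requests + bs4 alone will never see it
print("events-container" in response.text)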
You could also try urllib.request with an explicit User-Agent header instead of requests:
from urllib.request import Request, urlopen
- build your request:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
- retrieve the document:
res = urlopen(req)
- parse it using bs4:
html = BeautifulSoup(res, 'html.parser')
Like Chris Curvey described, the problem is that requests can't execute the JavaScript on the page. If you print your content variable, you can see that the page displays a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=chrome_options)
url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
The command print(match.text.strip()) in the last line simply extracts the text of each match div which has the class attribute "events-container".
If you want to extract more specific content, you can access each match through the match variable.
You need to know:
- which of the available pieces of information you want, and how to identify them inside the structure of the match divs
- in which data type you need this information
To make it easy: run the program, open the developer tools of Chrome with the F12 key; in the top left corner you now see the icon for "select an element...". If you click on the icon and then click on the desired element in the browser, you see the equivalent source in the area under the icon.
Analyse it carefully to get the info you need, for example:
- The title of the football match is the first h3 tag in the match div, and it is a string.
- The odds shown are span tags with the class event-odds, and each is a number (float).
Search for the function you need in Google or in the reference of the package you use (BeautifulSoup4).
Let's try to get it quick and dirty by using the BeautifulSoup functions on the match variable, so we don't get elements from the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")  # use on the match variable
if len(title_tags) > 0:  # at least one found?
    title = title_tags[0].getText()  # get the text of the first one
    print("Title: ", title)  # show it
else:
    print("no h3 tags found")
    exit()

# (2) let's try to get some odds as numbers, in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:  # at least three found?
    odds = []  # create a list
    for tag in odds_tags:  # loop over the odds_tags we found
        odd = tag.getText()  # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number
        # in Python and expect a good result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()  # remove empty spaces
        odd = float(clean_odd)  # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)  # keep it
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")

Errors with Beautiful Soup output

I'm trying to scrape data from a webpage on GameSpot using BeautifulSoup. However, the result is very different from what I get in the page source viewer. First off, a lot of errors are produced. For instance, we have
r = requests.get(link)
soup = bs4.BeautifulSoup(r.text)
And yet soup.title gives
<title>404: Not Found - GameSpot</title>.
The data I actually want to scrape does not even appear. Is it because the webpage contains JavaScript as well? If so, how can I get around this?
You're only sending an HTTP request to the server. You need to process the JavaScript to get the content.
A headless browser with JavaScript support, like Ghost, would be a good choice:
from ghost import Ghost
ghost = Ghost()
ghost.open(link)
page, resources = ghost.evaluate('document.documentElement.innerHTML;')
soup = BeautifulSoup(page)
.evaluate('document.documentElement.innerHTML') will show the dynamically generated content, not the static source you'd see by viewing the page source.
Your connection error is: socket.error: [Errno 54] Connection reset by peer
When you connect to http://www.gamespot.com for the first time, you must capture the cookie and send it in the headers of your subsequent requests.
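If you are using requests, a Session object handles exactly that cookie round-trip for you. A minimal sketch (the second URL path is a placeholder, not a real page):

import requests

session = requests.Session()  # persists cookies across requests
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# First request: the server sets its cookies on the session
r = session.get('http://www.gamespot.com')
print(session.cookies.get_dict())

# Later requests automatically send those cookies back
r2 = session.get('http://www.gamespot.com/some-page')  # placeholder path
print(r2.status_code)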

How to read source web site already open in browser

I'm wondering whether there is any way to open a URL in a browser and read the source of the opened page.
I'm trying to check whether my XPath selector gets the right value for the captcha img src. I can't do this by making two connections to the URL, because the captcha reloads every single time I connect.
For reading the source I'm using:
url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
To open the URL in a browser I'm using:
if sys.platform == 'win32':
    os.startfile(url)
elif sys.platform == 'darwin':
    subprocess.Popen(['open', url])
else:
    try:
        subprocess.Popen(['xdg-open', url])
    except OSError:
        print 'Please open a browser on: ' + url
Do any of you know how to solve this?
Thanks
I found a solution. To see the URL in a browser and, at the same time, see the source code of the page, just use this code:
from selenium import webdriver
from lxml import etree, html

url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
adres_prefix = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"
adres_sufix = etree.XPath('string(//img[@class="captcha"]/@src)')

browser = webdriver.Firefox()
browser.get(url)
html_source = browser.page_source  # source code of the opened url
root = etree.HTML(html_source)
result = etree.tostring(root, pretty_print=True, method="html")
result2 = adres_sufix(root)
www = adres_prefix + result2
print www  # now I can see whether the XPath gives the right value
Hope it helps others.
Thanks anyway for any help.
Most of the cross-platform Python GUI toolkits, such as wxPython, PySide, etc., have an HTML display widget that you can use to show the HTML source from within your Python code. I would recommend using one of those to display your content.
You are probably going to need to make more than one request to get the CAPTCHA. Get yourself a copy of Fiddler 2 (free): http://fiddler2.com/get-fiddler. It will allow you to see the "conversation" between the server and your browser. Once you see that, you will probably know what you need.

python selenium - how can I get all visible text on the page (i.e., not the page source) [duplicate]

I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.
I'm trying to get all the visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is grabbed multiple times:
from selenium import webdriver
import codecs

filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times; however, it only works as planned on some webpages (and it also makes the script a lot slower).
I'm guessing the reason for my problem is that when asking for the inner text of an element, I also get the inner text of the elements nested inside it.
Is there any way around this? Is there some sort of master element whose inner text I could grab? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated, as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is that I wanted JavaScript-rendered text.
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    with open('/tmp/source.html', 'w') as f:
        f.write(content.encode('utf-8'))
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags:
                continue
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                words = words.encode('utf-8')
                f.write(words + '\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
Here's a variation on @unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing

import lxml.html as html  # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox  # pip install selenium
from werkzeug.contrib.cache import FileSystemCache  # pip install werkzeug

cache = FileSystemCache('.cachedir', threshold=100000)

url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"

# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7)  # week in seconds

# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>, <style>, etc.
Cleaner(kill_tags=['noscript'], style=True)(root)  # lxml >= 2.3.1
print root.text_content()  # extract text
I've separated your task into two:
- get the page (including elements generated by JavaScript)
- extract the text
The two parts are connected only through the cache. You can fetch pages in one process and extract the text in another, or defer the extraction and do it later with a different algorithm.
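One more note on the "master element" idea from the question (a sketch, not part of the answers above): in Selenium, WebElement.text already returns the visible text of an element and all of its descendants, each piece only once, so grabbing it from the body element avoids the duplication without any bookkeeping:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")

# .text on a single element returns its visible (rendered) text,
# including all nested elements, without repeating anything
body_text = driver.find_element_by_xpath("//body").text
print(body_text)
driver.quit()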
