Alternative to pandas.read_html where url is not unique? - python

I want to access data from an HTML table in the section "ERGEBNIS" with Python 3.7.
The problem is that the results for each combination of the drop-down values are only shown after clicking on submit. This does not change the URL, however, so I have no idea how I can access the results table after updating the input values of the drop-downs.
Here is what I've done so far:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pandas as pd
import time

browser = webdriver.Chrome()  # initialize the WebDriver (Chrome assumed)
browser.get('https://daten.ktbl.de/feldarbeit/entry.html')
# Fix values of the drop-down fields:
fertilizer = Select(browser.find_element_by_name("hgId"))
fertilizer.select_by_value("2")
fertilizer = Select(browser.find_element_by_name("gId"))
fertilizer.select_by_value("193")
fertilizer = Select(browser.find_element_by_name("avId"))
fertilizer.select_by_value("383")
fertilizer = Select(browser.find_element_by_name("hofID"))
fertilizer.select_by_value("2")
fertilizer = Select(browser.find_element_by_name("flaecheID"))
fertilizer.select_by_value("5")
fertilizer = Select(browser.find_element_by_name("mengeID"))
fertilizer.select_by_value("60")
# Submit changes to show the results of this particular combination of values
button = browser.find_element_by_xpath("//*[@type='submit']")
button.click()
Submitting the changes does not, however, change the URL, so I don't know how I can access the results table (here "ERGEBNIS").
Otherwise my approach would have been to use pd.read_html somehow like this:
...
url = browser.current_url
time.sleep(1)
df_list = pd.read_html(url, match = "Dieselbedarf")
But since the URL isn't unique for each result, this doesn't work. The same issue would arise with BeautifulSoup, or at least I don't understand how I can do it without a unique URL.
Any ideas how I can access the html table otherwise?
EDIT: The answer of @bink1time solved my problem of how to access the table without the URL, using the raw HTML string instead:
html_source = browser.page_source
df_list = pd.read_html(html_source, match = "Dieselbedarf")

You can probably just get the html source:
html_source = browser.page_source
According to the docs:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html
read_html takes a URL, a file-like object, or a raw string containing HTML.
In this case you pass the raw string.
html_source = browser.page_source
df_list = pd.read_html(html_source, match = "Dieselbedarf")
Just a note: you don't need to sleep.
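Putting it together, here is a minimal sketch of the full flow. The explicit wait is an assumption on my part (waiting for text containing "Dieselbedarf" to appear is just one possible readiness signal); adapt it to whatever the results table actually contains:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()  # Chrome assumed; any WebDriver works
browser.get('https://daten.ktbl.de/feldarbeit/entry.html')
# ... select the drop-down values and click submit as shown above ...

# Wait until the results table is rendered instead of sleeping
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.XPATH, "//*[contains(text(), 'Dieselbedarf')]"))
)

# Parse the rendered page source, not the URL
df_list = pd.read_html(browser.page_source, match="Dieselbedarf")
print(df_list[0].head())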

Related

Use a csv to fill out a web form with python

I am trying to send several items from a CSV file to a web form using Python so I don't have to type it all in by hand, especially when I update the sheet later. I tried using the answer to this question; the page comes up and seems to "submit", but I am told the import failed.
My Code
from bs4 import BeautifulSoup
from requests import get
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
# using Pandas to read the csv file
source_information = pd.read_csv('C:/chrome_driver/test_csv.csv', header=None, skiprows=[0])
print(source_information)
# setting the URL for BeautifulSoup to operate in
url = "https://www.roboform.com/filling-test-all-fields"
my_web_form = get(url).content
soup = BeautifulSoup(my_web_form, 'html.parser')
# creating a procedure to fill the form
def fulfill_form(first, email):
    # Setting parameters for selenium to work
    path = r'C:/chrome_driver/chromedriver.exe'
    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(path, options=options)
    driver.get(url)
    # use Chrome Dev Tools to find the names or IDs for the fields in the form
    input_first = driver.find_element_by_name('02frstname')
    input_email = driver.find_element_by_name('24emailadr')
    submit = driver.find_element_by_name('Reset')
    # input the values and hold a bit for the next action
    input_first.send_keys(first)
    time.sleep(1)
    input_email.send_keys(email)
    time.sleep(5)
    submit.click()
    time.sleep(7)

# creating a list to hold any entries should they result in error
failed_attempts = []
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information:
    try:
        fulfill_form(str(source_information[0]), str(source_information[1]))
    except:
        failed_attempts.append(source_information[0])
        pass

if len(failed_attempts) > 0:
    print("{} cases have failed".format(len(failed_attempts)))

print("Procedure concluded")
This tells me that "2 cases have failed"
I checked the output of my "source_information" and it shows the following
        0                 1
0   Corey    corey@test.com
1  Breana  breana@hello.org
Where am I going wrong?
Maybe:
submit = driver.find_element_by_name('Reset')
Should be...
submit = driver.find_element_by_xpath("//input[@type='reset' and @value='Reset']")
Based on the page source (the input doesn't have a name)...
<input type="reset" value="Reset">
...and note the type reset vs the value Reset.
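Equivalently, a hedged alternative is a CSS selector keyed on the same two attributes, just another way to express the locator above:

# same element, located via CSS attribute selectors instead of XPath
submit = driver.find_element_by_css_selector("input[type='reset'][value='Reset']")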
Then you have source_information as a DataFrame, so you probably want to change...
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information:
    try:
        fulfill_form(str(source_information[0]), str(source_information[1]))
    except:
        failed_attempts.append(source_information[0])
        pass
To something like...
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information.iterrows():
    try:
        fulfill_form(customer[1][0], customer[1][1])
    except:
        failed_attempts.append(customer[1][0])
        pass
I'd also suggest changing all your time.sleep(5) and time.sleep(7) to 1 or 2 so it runs a little quicker.
Obviously this is all from looking at the code without running your data and seeing what happens.
Additional:
I reread the question and you do have an example of test data for the failures. Running this with the changes shown above works.
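For reference, a minimal consolidated sketch of the corrected loop (the column labels 0 and 1 follow from reading the CSV with header=None; treat this as a sketch, not a drop-in replacement):

failed_attempts = []

for _, row in source_information.iterrows():
    try:
        fulfill_form(str(row[0]), str(row[1]))
    except Exception:
        failed_attempts.append(row[0])

if failed_attempts:
    print("{} cases have failed".format(len(failed_attempts)))
print("Procedure concluded")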

Python beautiful soup web scraper doesn't return tag contents

I'm trying to scrape matches and their respective odds from a local bookie site, but for every site I try my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents?
I have tried all of the following sites for almost a month with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code shown below:
import requests
from bs4 import BeautifulSoup
url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds, and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0", nothing else.
It looks like the base site gets loaded in two phases:
1. Load some HTML structure for the page,
2. Use JavaScript to fill in the contents.
You can prove this to yourself by right-clicking on the page, doing "view page source", and then searching for "events-container" (it is not there).
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
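As a quick check of that claim (URL taken from the question), you can confirm that the container class never shows up in the raw response that requests sees; a minimal sketch:

import requests

# Fetch the raw HTML exactly as requests sees it, before any JavaScript runs
html = requests.get("https://www.betpawa.ug/", timeout=5).text

# If this prints False, the table markup is injected client-side and a
# JavaScript-capable tool (e.g. Selenium) is needed
print("events-container" in html)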
You could also try urllib from the standard library instead of requests:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# build your request
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# retrieve the document
res = urlopen(req)
# parse it using bs4
html = BeautifulSoup(res, 'html.parser')
Like Chris Curvey described, the problem is that requests can't execute the JavaScript of the page. If you print your content variable you can see that the page displays a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options = chrome_options)
url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
The command print(match.text.strip()) in the last line simply extracts only the text elements of each match div which has the class attribute "events-container".
If you want to extract more specific content, you can access each match via the match variable.
You need to know:
- which of the available pieces of information you want,
- how to identify this information inside the match div's structure,
- which data type you need this information in.
To make it easy, run the program, open Chrome's developer tools with F12, and in the top left corner you will see the icon for "select an element...".
If you click on that icon and then click on the desired element in the browser, you will see the corresponding source in the area below the icon.
Analyse it carefully to get the information you need, for example:
- The title of the football match is the first h3 tag in the match div and is a string.
- The odds shown are span tags with the class event-odds and are numbers (float/double).
Search for the function you need in Google or in the reference of the package you use (BeautifulSoup4).
Let's try it quick and dirty by using the BeautifulSoup functions on the match variable, so that we don't get the elements of the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")         # use on the match variable
if len(title_tags) > 0:                  # at least one found?
    title = title_tags[0].getText()      # get the text of the first one
    print("Title: ", title)              # show it
else:
    print("no h3-tags found")
    exit()
# (2) let's try to get some odds as numbers in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:                   # at least three found?
    odds = []                            # create a list
    for tag in odds_tags:                # loop over the odds_tags we found
        odd = tag.getText()              # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a good result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()          # remove empty spaces
        odd = float(clean_odd)           # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)                 # collect the converted odds in the list
else:
    print("something went wrong with the odds")
    exit()
input("Press enter to try it on the next match!")

BeautifulSoup returns empty span elements?

I'm trying to pull prices from Binance's home page and BeautifulSoup returns empty elements for me. Binance's home page is at https://www.binance.com/en/, and the interesting block I'm trying to get text from is:
<div class="sc-62mpio-0-sc-iAyFgw iQwJlO" color="#999"><span>"/" "$" "35.49"</span></div>
On Binance's home page is a table and one of the columns is titled "Last Price". Next to the last price is the last USD price in a faded gray color and I'm trying to pull every one of those. Here's my code so far.
import requests
from bs4 import BeautifulSoup

def grabPrices():
    page = requests.get("https://www.binance.com/en")
    soup = BeautifulSoup(page.text, "lxml")
    prices = soup.find_all("span", {"class": None})
    print(prices)
But the output is just a large array of "–" tags.
Selenium should be one way of scraping the table content you want from this Binance page. Google Selenium for its setup (pretty much: download a driver and place it on your local disk; if you are a Chrome user, see the ChromeDriver download page). Here is my code to access the content you are interested in:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from parsel import Selector  # assumption: Selector comes from the parsel package (scrapy's Selector works the same way)
import time

driver = webdriver.Chrome(executable_path=r'C:\chromedriver\chromedriver.exe')
time.sleep(3) # Allow time to launch the controlled web
driver.get('https://www.binance.com/en/')
time.sleep(3) # Allow time to load the page
sel = Selector(text=driver.page_source)
Table = sel.xpath('//*[@id="__next"]/div/main/div[4]/div/div[2]/div/div[2]/div/div[2]/div')
Table.extract()  # This basically gives you all the content of the table
Then if you further process the entire table content with something like:
tb_rows = Table.xpath('.//div/a//div//div//span/text()').extract()
tb_rows  # this narrows the result down to the text of the table cells
At this point, the result is narrowed down to pretty much what you are interested in, but notice that the last price's two components (number and dollar price) are stored in two separate tags in the source page, so we can do the following to combine them and reach the destination:
for n in range(0, len(tb_rows), 2):
    LastPrice = tb_rows[n] + tb_rows[n+1]
    print(LastPrice)  # For sure, other than print, you could store each element in a list

driver.quit()  # don't forget to quit the driver at the end
The final output is then one combined last price (number plus dollar value) per row.

python selenium - get (ctrl-u) equivalent page_source

I need to get the ctrl-u equivalent of browser.page_source for comparative purposes.
Is this possible with browser.execute_script or another method?
I've tried various methods like browser.get('view-source:https://www.example.com') but haven't found a solution.
It works fine for me; I guess it's a problem with the quotes:
browser.get('https://www.example.com')
browser.page_source
You can also achieve the same using browser.execute_script()
browser.execute_script('return document.documentElement.outerHTML')
If I'm not wrong, you want to compare the original HTML (Ctrl+U) and the rendered HTML (browser.page_source); for that you can use requests:
import requests
originalHTML = requests.get('http://...').text
print(originalHTML)
or you can create another tab for view-source:
url = 'https://..../'
browser.get(url)
renderedHTML = browser.page_source
# open blank page because JS cannot open special URL like `view-source:`
browser.execute_script("window.open('about:blank', '_blank')")
# switch to tab 2
browser.switch_to_window(browser.window_handles[1])
browser.get("view-source:" + url)
originalHTML = browser.find_element_by_css_selector('body').text
# switch to tab 1
#browser.switch_to_window(browser.window_handles[0])
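However you obtain the two strings, here is a minimal sketch of the actual comparison using difflib from the standard library (the URL is a placeholder and Chrome is assumed):

import difflib
import requests
from selenium import webdriver

url = 'https://www.example.com'  # placeholder URL

browser = webdriver.Chrome()     # Chrome assumed; any WebDriver works
browser.get(url)
renderedHTML = browser.page_source     # HTML after JavaScript has run
originalHTML = requests.get(url).text  # HTML as served, roughly what Ctrl+U shows

# Print a unified diff of the two versions, line by line
diff = difflib.unified_diff(
    originalHTML.splitlines(),
    renderedHTML.splitlines(),
    fromfile='original', tofile='rendered', lineterm=''
)
for line in diff:
    print(line)

browser.quit()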

Scraping hidden jquery values using python

HERE is the site; I believe it's hosted on GitHub.
I am having trouble scraping the values in the input fields, specifically the private and public keys. I tried using Selenium and BeautifulSoup, but they would give empty values, or rather None (the HTML doesn't contain the keys).
I checked the page source and it seems that the input value is empty (not contained within the HTML), but when you load the page it is visible and present in the input box.
Here is my code:
def openit(browser):
    browser.get('file:///Users/Aha/Desktop/Code/english/index.html')
    time.sleep(5)
    nav = browser.find_element_by_id("addr")
    print(nav.text)
    return browser.page_source

soupdata = openit(browser)
soup = BeautifulSoup(soupdata, 'html.parser')
val = soup.find('input', {'id': 'addr'}).get('value')
print(val)
You can retrieve that value via the execute_script method of the Selenium WebDriver:
print(browser.execute_script("return $('#addr').val();"))
Output:
14ropRunS5iY9sx9d9mpCRNEsXj7RtTtuS
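If the page ever stops bundling jQuery, a hedged alternative is to read the value through plain DOM access or Selenium's own attribute lookup (the element id addr is taken from the question):

# Plain JavaScript, no jQuery required
print(browser.execute_script("return document.getElementById('addr').value;"))

# Or let Selenium read the live value of the input directly
print(browser.find_element_by_id('addr').get_attribute('value'))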
