Need help web scraping table with beautifulsoup and selenium webdriver - python

I am trying to scrape https://data.bls.gov/cgi-bin/surveymost?bls and have figured out how to click through the page with Selenium to get to a table.
The selection that I am practicing on is after you select the checkbox associated with " Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A" under Compensation and then select "Retrieve data".
Once those two are processed a table shows. This is the table I am trying to scrape.
Below is the code that I have as of right now.
Note that you have to put your own path for your browser driver where I have put < browser driver >.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import numpy as np
import requests
import lxml.html as lh
from selenium import webdriver
url = "https://data.bls.gov/cgi-bin/surveymost?bls"
ChromeSource = r"<browser driver>"
# Open up a Chrome browser and navigate to web page.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless') # will run without opening browser.
driver = webdriver.Chrome(ChromeSource, chrome_options=options)
driver.get(url)
driver.find_element_by_xpath("//input[@type='checkbox' and @value = 'CIU1010000000000A']").click()
driver.find_element_by_xpath("//input[@type='Submit' and @value = 'Retrieve data']").click()
i = 2
def myTEST(i):
    xpath = '//*[@id="col' + str(i) + '"]'
    TEST = driver.find_elements_by_xpath(xpath)
    num_page_items = len(TEST)
    for i in range(num_page_items):
        print(TEST[i].text)
myTEST(i)
# Clean up (close browser once completed task).
driver.close()
Right now this only is looking at the headers. I would like to also get the table content as well.
If I make i = 0, it produces "Year". i = 1, it produces "Period". But if I select i = 2 I get two variables which have the same col2 id for "Estimated Value" and "Standard Error".
I tried to think of a way to work around this and can't seem to get anything that I have researched to work.
In essence, it would be better to start at the point where I am done clicking and am at the table of interest, then look at the xpath of the header row and pull in the text for all of its child <th> elements.
<tr>
<th id="col0"> Year </th>
<th id="col1"> Period </th>
<th id="col2">Estimated Value</th>
<th id="col2">Standard Error</th>
</tr>
I am not sure how to do that. I also tried to loop through i, but the two headers sharing one id obviously cause an issue.
Once I am able to get the header, I want to get the contents. I could use some insight on whether I am on the right path, overthinking it, or whether there is a simpler way to do all of this. I am learning and this is my first attempt at using the selenium library for clicks. I just want to get it to work so I can try it again on a different table and make it as automated and reusable (with tweaking) as possible.

Actually, you don't need selenium. You can inspect the form data that the browser sends in its POST request and send the same data in your own POST request.
Then you can load the table using Pandas easily.
import requests
import pandas as pd
data = {
"series_id": "CIU1010000000000A",
"survey": "bls"
}
def main(url):
    r = requests.post(url, data=data)
    df = pd.read_html(r.content)[1]
    print(df)
main("https://data.bls.gov/cgi-bin/surveymost")
Explanation:
Open the site.
Select Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A.
Open your browser's Developer Tools and navigate to the Network Monitor section, e.g. press Ctrl + Shift + E (Command + Option + E on a Mac).
Click "Retrieve data". You will now see that a POST request has been made.
Navigate to its Params tab to see the form data that was sent.
Now you can make the same POST request yourself. Since the table is present in the HTML source and is not loaded via JavaScript, you can parse it with bs4 or read it in a nice format using pandas.read_html().
Note: You can read the table this way as long as it's not loaded via JavaScript. Otherwise you can try to track the XHR request (check previous answer), or you can use selenium or requests_html to render the JS, since requests is an HTTP library and can't render it for you.
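On the duplicate col2 id from the question: reading cells by their position within each row sidesteps the problem, because the id is never consulted. Below is a minimal sketch of that idea. It uses only the standard library's html.parser so that it runs without extra packages; the class name TableRows is hypothetical, and the sample HTML (including the data row values) is made up to mimic the header shown in the question.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

# Made-up sample: two header cells share id="col2", as in the question.
SAMPLE = """
<table>
<tr><th id="col0"> Year </th><th id="col1"> Period </th>
<th id="col2">Estimated Value</th><th id="col2">Standard Error</th></tr>
<tr><td>2020</td><td>Qtr1</td><td>1.0</td><td>0.1</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
header, *body = parser.rows
print(header)
print(body)
```

pandas.read_html() in the answer above also works by position, which is why it is not affected by the repeated id either.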

Related

Gathering data from table using Pandas and Beautiful Soup after logging in using Selenium

I'm trying to scrape data from a paginated table. The table can only be accessed by logging in to a user account. I've decided to approach this using Selenium to log in. I then hope to be able to read this into a Pandas DataFrame. I plan on using BeautifulSoup as a go between.
Here is my code:
from selenium import webdriver
import time
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.example.com/userarea"
driver = webdriver.Chrome()
time.sleep(6)
driver.get(url)
time.sleep(6)
username = driver.find_element_by_id("user")
username.clear()
username.send_keys("xyz@email.com")
password = driver.find_element_by_id("password")
password.clear()
password.send_keys('password')
driver.find_element_by_xpath('//button[]').click()
driver.find_element_by_xpath('//button[text()="Log in"]').click()
time.sleep(6)
driver.find_element_by_xpath('//span[text()="Text"]').click()
driver.find_element_by_xpath('//span[text()="Text"]').click()
html = driver.page_source
soup = BeautifulSoup(html,'html.parser')
try:
    tables = soup.find_all('th')
    print(tables) #Returns an empty list
    df = pd.read_html(str(tables))
    df.head()
except:
    driver.close()
driver.close()
Unfortunately, this is only printing an empty list. I've tried using lxml too but no joy.
Using the inspection tools it does seem that there aren't any table tags, so I tried to find all <th> tags instead (which definitely are present). Again no joy. I've not yet tried to work through the individual pages. I only mention the pagination in case it offers a clue to the issue.
Any idea what I'm missing?
Thank you to those that offered suggestions. In the end furas' suggestion was best placed and it turned out the script was running too quickly. I paused Python for 6 seconds after clicking on the page with the table on. Seems to run on javascript and I can actually see the values pop into place now as the script works through the pagination.
import time
#Navigate to page, then let it load using:
time.sleep(6)
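A fixed time.sleep(6) always waits the full six seconds, even when the table appears sooner, and still fails when the page is slower than that. Polling for the content is the usual alternative; Selenium ships its own version of this as WebDriverWait. The sketch below shows the idea with the standard library only. The names wait_until and table_ready are hypothetical, and table_ready is a stand-in for a real check against the page.

```python
import time

def wait_until(condition, timeout=6.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Stand-in for "are the <th> cells in the page yet?": succeeds on the third poll.
attempts = []

def table_ready():
    attempts.append(1)
    return ["Year", "Period"] if len(attempts) >= 3 else []

cells = wait_until(table_ready, timeout=2.0, poll=0.01)
print(cells, len(attempts))
```

With a live driver, the condition could be something like lambda: driver.find_elements_by_tag_name("th"), which returns an empty (falsy) list until the cells exist.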

Using Python and Selenium to scrape hard-to-find web tables

I've been using Python and Selenium to scrape data from specific state health web pages and output the table to a local CSV.
I've had a lot of success on several other states using similar code. But, I have hit a state that is using what appears to be R to create dynamic dashboards that I can't really access using my normal methods.
I've spent a great deal of time combing through StackOverflow . . . I've checked to see if there's an iframe to switch to, but, I'm just not seeing the data I want located in the iframe on the page.
I can find the table info easily enough using Chrome's "Inspect" feature. But, starting from the original URL, the data I need is not on that page and I can't find a source URL for the table. I've even used Fiddler to see if there's a call somewhere.
So, I'm not sure what to do. I can see the data--but, I don't know where it is to tell Selenium and BS4 where to access it.
The page is here: https://coronavirus.utah.gov/case-counts/
The page takes a while to load . . . I've had other states have this issue and Selenium could work through it.
The table I need looks like this:
Any help or suggestions would be appreciated.
Here is the code I've been using . . . it doesn't work here, but, the structure is very similar to that which has worked for other states.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located()((By.ID, "total-number-of-lab-confirmed-covid-19-cases-living-in-utah")))
# Now, scrape table
html = driver.find_element_by_id("total-number-of-lab-confirmed-covid-19-cases-living-in-utah")
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='#DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
I found another way to the get the information I needed.
Thanks to Julian Stanley for letting me know about the Katalon Recorder product. That allowed me to see what the iframe was where the table was.
Using my old method of finding an element by CSS or XPATH was causing a Pickle error due to locked thread. I have no clue how to deal with that . . . but, it caused the entire project to just hang.
But, I was able to get the text/HTML of the table via attribute. After that, I just read it with BS4 as usual.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
#driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout)
# Get name of frame (or use index=0)
frames = [frame.get_attribute('id') for frame in driver.find_elements_by_tag_name('iframe')]
# Switch to frame
#driver.switch_to_frame("coronavirus-dashboard")
driver.switch_to_frame(0)
# Now, scrape table
html = driver.find_element_by_css_selector('#DataTables_Table_0_wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
counts = df[0]
# Build the timestamped output path directly instead of assembling code for exec()
timestamp = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
counts.to_csv(rf'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_{timestamp}.csv', index=False)
# Close the chrome web driver
driver.close()
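When it is unclear which iframe holds the table, one option is to try each frame in turn instead of hard-coding index 0. The sketch below is an untested-against-this-site illustration of that idea. It assumes Selenium's generic driver.find_elements(by, value) method plus driver.switch_to.frame() and driver.switch_to.default_content(); the helper name find_frame_with is hypothetical, and it only looks at top-level iframes, not nested ones.

```python
def find_frame_with(driver, css_selector):
    """Return the index of the first top-level iframe containing css_selector.

    Returns None if no iframe contains a match. On success the driver is left
    switched into that frame, ready for further find_element calls.
    """
    # "tag name" and "css selector" are the string values of Selenium's
    # By.TAG_NAME and By.CSS_SELECTOR constants.
    frames = driver.find_elements("tag name", "iframe")
    for index, frame in enumerate(frames):
        driver.switch_to.frame(frame)
        if driver.find_elements("css selector", css_selector):
            return index
        # Nothing here: go back to the top-level document and try the next frame.
        driver.switch_to.default_content()
    return None

# Possible usage with a live driver (not executed here):
# index = find_frame_with(driver, "#DataTables_Table_0_wrapper")
```

Because the page takes a while to load, a call like this would still need to happen after the frame content has rendered.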

Next page on Selenium/Safari

I'm trying to do some web scraping on a site, but I can't get to the next page in Safari.
The site is: https://www.emol.com/todas/
The code just gives me the results of the first page twice; I need the first 3 pages.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common import exceptions
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
browser = webdriver.Safari()
browser.get("https://www.emol.com/todas/")
noticias = []
i = 0
while i < 2:
    try:
        nav = browser.find_elements_by_class_name("cont_bus_txt_detall_2")
        for value in nav:
            noticias.append(value.text)
        browser.find_element_by_css_selector("a[href*='javascript:Next();']").click()
        i += 1
    except exceptions.StaleElementReferenceException:
        pass
Below is what I see when I inspect the next-page button in Safari:
<a class="next current-page-next-prev" href="javascript:Next();"><span class="txt_siguiente">Siguiente</span> <i class="fa fa-chevron-right"></i></a>
<span class="txt_siguiente">Siguiente</span>
<i class="fa fa-chevron-right"></I>
The problem here is that you are locating the button with a CSS selector, but the same selector can match multiple elements, which can lead to an error or, as you said, to landing on the same page again, because the selector is static (the same) on every page.
Try this (replace N with the number of the page you want to go to; the second page is 2):
browser.find_element_by_xpath("/html/body/div[4]/div/div/div/div[3]/div/nav[1]/ul/li[N]/a").click()
The simplest way to do this would be by getting the text of the pagination elements. Here is an example for the above:
>>> from selenium import webdriver
>>> driver=webdriver.Chrome()
>>> driver.get('https://www.emol.com/todas/')
>>> pagination_elements = [driver.find_element_by_xpath('//ul[@id="listPages"]/li/a[text()=%s]' % n) for n in range(1,4)]
>>> len(pagination_elements)
# 3
>>> pagination_elements[2].click() # to view page 3
Notice how much cleaner this is:
//ul[@id="listPages"]/li/a[text()=%s]
The "cleaner" you can make the xpaths the more resilient your scraping becomes to changes in the html. And believe me, the html changes all the time for a live site... Notice how we can easily get all the pagination elements you want here with a single line of code as well.
Finally, a much better way to scrape the page would be to inspect the network tab and get the actual data that is being emitted on that ajax call. For example, in Chrome dev tools it will give you something like this:
https://cache-elastic-pandora.ecn.cl/emol/noticia/_search?q=publicada:true+AND+ultimoMinuto:true+AND+seccion:+AND+temas.id:&sort=fechaModificacion:desc&size=15&from=45 (<== Note, StackOverflow doesn't markup the whole link so you'll need to copy-paste it).
This will give you json of size 15 starting from the 45th result. You can play around with the parameters there to grab the data much more easily. For example, try changing the size to "1000" and see what happens. Good luck!
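Since the question asks for the first three pages, the size and from parameters can be generated instead of edited by hand. This is a small sketch under two assumptions: that from is a zero-based offset into the results (the answer describes from=45 as "starting from the 45th result"), and that the endpoint still responds as it did when the answer was written. The helper name page_url is hypothetical.

```python
# Query string copied from the URL in the answer, minus the size/from part.
BASE = ("https://cache-elastic-pandora.ecn.cl/emol/noticia/_search"
        "?q=publicada:true+AND+ultimoMinuto:true+AND+seccion:+AND+temas.id:"
        "&sort=fechaModificacion:desc")

def page_url(page, size=15):
    """URL for a zero-based page number with `size` results per page."""
    return f"{BASE}&size={size}&from={page * size}"

# First three pages: offsets 0, 15 and 30.
urls = [page_url(p) for p in range(3)]
for u in urls:
    print(u)
```

Each URL could then be fetched with an HTTP client, for example requests.get(u).json(), and the news items read from the returned JSON.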

BeautifulSoup not returning everything in Facebook

I'm trying to extract all the pages liked by a given person on Facebook. Therefore, I'm using Python with BeautifulSoup and selenium to automate the connection.
However, even though my code works, it doesn't actually return all the results (on my own profile, for instance, it only returns about 20% of all pages).
I read that it might be the parser used in BeautifulSoup, but I tried a bunch of them (html.parser, lxml...) and it's always the same thing.
Could that be because Facebook is dynamically generating the pages with AJAX? But then I have Selenium, which should correctly interpret it..!
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
id_user = ""
driver = webdriver.Chrome()
driver.get('https://facebook.com')
driver.find_element_by_id('email').send_keys('')
driver.find_element_by_id('pass').send_keys('')
driver.find_element_by_id('loginbutton').click()
time.sleep(2)
pages_liked = "https://www.facebook.com/search/" + id_user + "/pages-liked"
driver.get(pages_liked)
soup = BeautifulSoup(driver.page_source, 'html.parser')
likes_divs = soup.find_all('a', class_="_32mo")
for div in likes_divs:
    print(div['href'].split("/?")[0])
    print(div.find('span').text)
Thank you very much,
Loïc
Facebook is famous for making web scrapers' lives difficult... That said, it looks like you did your homework correctly; the snippet looks right and to the point.
Start by looking into driver.page_source, which is what Selenium gets. If the information is in there, the problem is within BeautifulSoup; if it's not, Facebook has found a strategy to hide the page (looking at browser signature or fingerprint - yes, these are different concepts).
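One way to run that check is to count the matching links in the raw page source without BeautifulSoup and compare the number with what the browser shows. The sketch below uses only the standard library; the class ClassCounter, the helper count_in_source and the sample markup are all made up for illustration, with the _32mo class taken from the question.

```python
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count start tags of one kind that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1

def count_in_source(page_source, tag, css_class):
    """Return how many <tag> elements in page_source have css_class."""
    counter = ClassCounter(tag, css_class)
    counter.feed(page_source)
    return counter.count

# Made-up stand-in for driver.page_source.
page_source = """
<a class="_32mo" href="/pageA/?ref=br_rs"><span>Page A</span></a>
<a class="_32mo other" href="/pageB/?ref=br_rs"><span>Page B</span></a>
<a class="unrelated" href="/pageC/"><span>Page C</span></a>
"""
print(count_in_source(page_source, "a", "_32mo"))
```

If this count already falls short of the number of pages visible in the browser, the missing entries never reached page_source, and changing the BeautifulSoup parser will not help.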

How to retrieve the values of dynamic html content using Python

I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url);
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm trying to find a particular value, I'm finding a template instead e.g."{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html)
print("Finding...")
for tag in soup.find_all('formatPrice median'):
    print(tag.text)
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).
This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
parse the ajax JSON directly
use an offline Javascript interpreter to process the request (e.g. SpiderMonkey, crowbar)
use a browser automation tool (e.g. splinter)
This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and its pretty handy.
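A quick way to tell whether a fetched page still needs JavaScript rendering is to look for leftover template placeholders in it. The sketch below assumes Handlebars-style {{...}} placeholders like the one in the question; the helper name unrendered_placeholders and the two HTML fragments are made up for illustration, using the values quoted in the question.

```python
import re

# Matches {{ ... }} placeholders as used by Handlebars-style templates.
PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}")

def unrendered_placeholders(html):
    """Return the expressions of any {{...}} placeholders left in html."""
    return PLACEHOLDER.findall(html)

# Hypothetical fragments: what urllib fetches vs. what the browser ends up showing.
raw = '<td class="price">{{formatPrice median}}</td>'
rendered = '<td class="price">4.48</td>'

print(unrendered_placeholders(raw))
print(unrendered_placeholders(rendered))
```

A non-empty result means the HTML was captured before the templates ran, so the value has to come from the underlying JSON or from a rendered page.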
EDIT
From your comments it looks like it is a handlebars driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)
I used selenium + chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
Building off another answer. I had a similar issue. wget and curl do not work well anymore to get the content of a web page. It's particularly broken with dynamic and lazy content. Using Chrome (or Firefox or Chromium version of Edge) allows you to deal with redirects and scripting.
Below will launch an instance of Chrome, increase the timeout to 5 sec, and navigate this browser instance to a url. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Example of grabbing dynamic content and screenshots of the anchored (hence the "a" tag) HTML object, another name for hyperlink:
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2-x1, y2-y1])
locations.sort(key = lambda x: -x[-2] - x[-1])
locations = [ (el,x1,y1,x2,y2, width,height)
    for el,x1,y1,x2,y2,width,height in locations
    if not (
        ## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
        x2 <= x1 or y2 <= y1 or x2<0 or y2<0
        ## Further restrict if you expect the objects to be around a specific size
        ## or width<200 or height<100
    )
]
for el,x1,y1,x2,y2,width,height in tqdm(locations[:10]):
    try:
        print('-'*100,f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)
    except Exception as err:
        print(err)
Installation for mac+chrome:
pip install selenium
brew cask install chromedriver
brew cask install google-chrome
I was using Mac for the original answer and Ubuntu + Windows 11 preview via WSL2 after updating. Chrome ran from Linux side with X service on Windows to render the UI.
Regarding responsibility, please respect robots.txt on each site.
I know this is an old question, but sometimes there is a better solution than using heavyweight selenium.
This request module for python comes with JS support (in the background it is still Chromium), and you can still use BeautifulSoup as normal.
Though if you have to click elements or something similar, I guess selenium is the only option.
