How to webscrape old school website that uses frames - python

I am trying to webscrape a government site that uses frameset.
Here is the URL - https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm
I've tried using splinter/selenium
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)
for i in tree:
print(i.text)
It just returns an empty string.
I've tried using the requests library.
import requests
from lxml import HTML
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
# get response object
response = requests.get(url)
# get byte string
data = response.content
print(data)
And it returns this
b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-
8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n<FRAMESET rows='20%,
*'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n<FRAMESET cols='20%, *'>\r\n<FRAME
src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n</FRAMESET>
\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"
I've also tried using beautiful soup and it gave me the same thing. Is there another python library I can use in order to get the data that's inside the second table?
Thank you for any feedback.

As mentioned you could go for the frames and its src:
BeautifulSoup(r.text).select('frame')[1].get('src')
or directly to the menu.htm:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/menu.htm')
link_list = ['https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults'+a.get('href') for a in BeautifulSoup(r.text).select('a')]
for link in link_list[:1]:
r = requests.get(link)
soup = BeautifulSoup(r.text)
###...scrape what is needed

Related

how to remove unwanted text from retrieving title of a page using python

Hi All I have written a python program to retrieve the title of a page it works fine but with some pages, it also receives some unwanted text how to avoid that
here is my program
# importing the modules
import requests
from bs4 import BeautifulSoup
# target url
url = 'https://atlasobscura.com'
# making requests instance
reqs = requests.get(url)
# using the BeaitifulSoup module
soup = BeautifulSoup(reqs.text, 'html.parser')
# displaying the title
print("Title of the website is : ")
for title in soup.find_all('title'):
title_data = title.get_text().lower().strip()
print(title_data)
here is my output
atlas obscura - curious and wondrous travel destinations
aoc-full-screen
aoc-heart-solid
aoc-compass
aoc-flipboard
aoc-globe
aoc-pocket
aoc-share
aoc-cancel
aoc-video
aoc-building
aoc-clock
aoc-clipboard
aoc-help
aoc-arrow-right
aoc-arrow-left
aoc-ticket
aoc-place-entry
aoc-facebook
aoc-instagram
aoc-reddit
aoc-rss
aoc-twitter
aoc-accommodation
aoc-activity-level
aoc-add-a-photo
aoc-add-box
aoc-add-shape
aoc-arrow-forward
aoc-been-here
aoc-chat-bubbles
aoc-close
aoc-expand-more
aoc-expand-less
aoc-forum-flag
aoc-group-size
aoc-heart-outline
aoc-heart-solid
aoc-home
aoc-important
aoc-knife-fork
aoc-library-books
aoc-link
aoc-list-circle-bullets
aoc-list
aoc-location-add
aoc-location
aoc-mail
aoc-map
aoc-menu
aoc-more-horizontal
aoc-my-location
aoc-near-me
aoc-notifications-alert
aoc-notifications-mentions
aoc-notifications-muted
aoc-notifications-tracking
aoc-open-in-new
aoc-pencil
aoc-person
aoc-pinned
aoc-plane-takeoff
aoc-plane
aoc-print
aoc-reply
aoc-search
aoc-shuffle
aoc-star
aoc-subject
aoc-trip-style
aoc-unpinned
aoc-send
aoc-phone
aoc-apps
aoc-lock
aoc-verified
instead of this I suppose to receive only this line
"atlas obscura - curious and wondrous travel destinations"
please help me with some idea all other websites are working only some websites gives these problem
Your problem is that you're finding all the occurences of "title" in the page. Beautiful soup has an attribute title specifically for what you're trying to do. Here's your modified code:
# importing the modules
import requests
from bs4 import BeautifulSoup
# target url
url = 'https://atlasobscura.com'
# making requests instance
reqs = requests.get(url)
# using the BeaitifulSoup module
soup = BeautifulSoup(reqs.text, 'html.parser')
title_data = soup.title.text.lower()
# displaying the title
print("Title of the website is : ")
print(title_data)

Get XHR info from URL

I have this website https://www.futbin.com/22/player/7504 and I want to know if there is a way to get the XHR url for the information using python. For example for the URL above I know the XHR I want is https://www.futbin.com/22/playerPrices?player=231443 (got it from inspect element -> network).
My objective is to get the price value from https://www.futbin.com/22/player/1 to https://www.futbin.com/22/player/10000 at once without using inspect element one by one.
import requests
URL = 'https://www.futbin.com/22/playerPrices?player=231443'
page = requests.get(URL)
x = page.json()
data = x['231443']['prices']
print(data['pc']['LCPrice'])
print(data['ps']['LCPrice'])
print(data['xbox']['LCPrice'])
You can find the player-resource id and build the url yourself. I use beautifulsoup. It's made for parsing websites, but you can take the requests content and throw that into an html parser as well if you don't want to install beautifulsoup
With it, read the first url, get the id and use your code to pull the prices. To test, change the 10000 to 2 or 3 and you'll see it works.
import re, requests
from bs4 import BeautifulSoup
for i in range(1,10000):
url = 'https://www.futbin.com/22/player/{}'.format(str(i))
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
player_resource = soup.find(id=re.compile('page-info')).get('data-player-resource')
# print(player_resource)
URL = 'https://www.futbin.com/22/playerPrices?player={}'.format(player_resource)
page = requests.get(URL)
x = page.json()
# print(x)
data = x[player_resource]['prices']
print(data['pc']['LCPrice'])
print(data['ps']['LCPrice'])
print(data['xbox']['LCPrice'])

How do I log data from a live website using beautiful soup

Hello I am trying to use beautiful soup and requests to log the data coming from an anemometer which updates live every second. The link to this website here:
http://88.97.23.70:81/
The piece of data I want to scrape is highlighted in purple in the image :
from inspection of the html in my browser.
I have written the code bellow in to try to print out the data however when I run the code it prints: None. I think this means that the soup object doesnt infact contain the whole html page? Upon printing soup.prettify() I cannot find the same id=js-2-text I find when inspecting the html in my browser. If anyone has any ideas why this might be or how to fix it I would be most grateful.
from bs4 import BeautifulSoup
import requests
wind_url='http://88.97.23.70:81/'
r = requests.get(wind_url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print(soup.find(id='js-2-text'))
All the best,
Brendan
The data is loaded from external URL, so beautifulsoup doesn't need it. You can try to use API URL the page is connecting to:
import requests
from bs4 import BeautifulSoup
api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}
soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
_, direction, metres_per_second, *_ = soup.csv.text.split(",")
knots = float(metres_per_second) * 1.9438445
print(direction, metres_per_second, knots)
Prints:
210 006.58 12.79049681

Read data from URL / XML with python

this is my first question.
Im trying to learn some python, so.. i have this problem
how i can get data from this url that shows info in XML:
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/v01/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup)
output:
Output
but i wanna get access to this type of data, < RUTEmisor> data for example:
linkurl_invoice
hope guys you can try to advice me with the code and how to read xml docs.
By examining the URL you gave, it seems that the data is actually held a few links away at the following URL: http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49
As such, you can access it directly as follows:
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find('RUTEmisor').text)

How to scrape a specific table form website using python (beautifulsoup4 and requests or any other library)?

https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to website and I want to scrape table: Fortune top 10 E.U. corporations by revenue (2016).
Please, share the code for the same:
import requests
from bs4 import BeautifulSoup
def web_crawler(url):
page = requests.get(url)
plain_text = page.text
soup = BeautifulSoup(plain_text,"html.parser")
tables = soup.findAll("tbody")[1]
print(tables)
soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
following what #FanMan said , this is simple code to help you get started, keep in mind that you will need to clean it and also perform the rest of the work on your own.
import requests
from bs4 import BeautifulSoup
url='https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r=requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
temp_datastore=list()
for text in soup.findAll('p'):
w=text.findAll(text=True)
if(len(w)>0):
temp_datastore.append(w)
Some documentation
beautiful soup:https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
You're first issue is that your url is not properly defined. After that you need to find the table to extract and it's class. In this case the class was "wikitable" and it was a the first table. I have started your code for you so it gives you the extracted data from the table. Web-scraping is good to learn but if your are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup
def webcrawler():
url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
page = requests.get(url)
soup = BeautifulSoup(page.text,"html.parser")
tables = soup.findAll("table", class_='wikitable')[0]
print(tables)
webcrawler()

Categories

Resources