This is my first post so I apologize if it is a duplicate but I could not find an answer relevant to mine. If there is one please let me know and I will check it out.
I am attempting to scrape a website (below) to find Berkeley rent ceilings. The trouble I'm having is that I cannot seem to figure out how to insert an address into the search box and scrape the info from the next page. In the past the URLs I've worked with changed with the search input, but not on this website. I thought my best bet would be using bs4 to scrape the info and requests.Session and requests.post to get to each subsequent address.
#Berkeley Rent Scrape
from bs4 import BeautifulSoup
import sys
import requests
import openpyxl
import pprint
import csv
#wb = openpyxl.load_workbook('workbook.xlsx', data_only=True)
#sheet = wb.get_sheet_by_name('worksheet')
props_payload={'aspnetForm':'1150 Oxford St'}
URL = 'http://www.ci.berkeley.ca.us/RentBoardUnitSearch.aspx'
s = requests.session()
p = s.post(URL, data=props_payload)
soup = BeautifulSoup(p.text, 'html.parser')
data = soup.find_all('td', class_='gridItem')
UPDATE: How do you get the info from the new webpage once the POST has been sent? Or in other words, what is the framework for using a requests.post and then a requests.get or bs4 scrape when the URL does not change?
I was thinking it would look something like the code above, but I'm sure I need a GET request somewhere in there; I just don't understand how sessions work when the URL doesn't change.
I will be exporting the info to a CSV file and to an Excel sheet, but I can deal with that later. Just want to get the meat out of the way.
Thank you for any help!
As you can see from the link, this search does not work through redirection, so you can't pass your query in the URL.
I'm not sure how you can work directly with the ASP.NET backend (it might be tricky due to authentication/validation on the backend).
I think an automation (testing) tool can help you here (e.g. PhantomJS and/or CasperJS). It gives you control over the rendered web page, so you can programmatically type the query into the input and grab the data after the response arrives.
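For example, here is a rough sketch of that idea in Python with Selenium (which can drive a headless browser in much the same way as PhantomJS/CasperJS). The element names below are assumptions, not the page's real markup, so inspect the form in devtools and substitute the actual names of the search box and button:
# Sketch only: the NAME locators are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('http://www.ci.berkeley.ca.us/RentBoardUnitSearch.aspx')
driver.find_element(By.NAME, 'txtAddress').send_keys('1150 Oxford St')  # hypothetical field name
driver.find_element(By.NAME, 'btnSearch').click()                       # hypothetical button name

# Once the results page has rendered, hand the HTML to BeautifulSoup as usual.
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all('td', class_='gridItem')
driver.quit()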
Related
I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I tried using the soup.find_all() method, and to see if it works I printed the result, but it gives me an empty list. Can you help me fix it? Click here to see the specific part of the html code.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and contains scripts to load data. Whenever you make a request using requests.get("https://www.youtube.com/feed/explore"), it loads only the initial source, which contains things like the head, meta tags, and scripts. In a real browser you would then wait while those scripts load data from the server, but BeautifulSoup does not execute JavaScript or catch its interactions with the DOM. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: in the raw response there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For YouTube, you can use its API as well.
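For instance, a minimal Selenium sketch (assuming Chrome and chromedriver are installed). The tag/class combination is taken from your snippet, and YouTube's markup changes often, so verify it in devtools:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/feed/explore")
time.sleep(5)  # crude wait for the scripts to load the video list

# Parse the rendered DOM instead of the bare initial response.
soup = BeautifulSoup(driver.page_source, "lxml")
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(len(trendings))
driver.quit()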
I am trying to scrape the table from this link. There are no table tags on the page so I am trying to access it using the class "rt-table". When I inspect the table in developer tools, I can see the html I need in the elements section (I am using Chrome). However, when I view the source code using requests, this part of the code is now missing. Does anyone know what the problem could be?
If you just want those stats you can access the hidden api like this:
import requests
import pandas as pd
url = 'https://api.nhle.com/stats/rest/en/team/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22points%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22wins%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20seasonId%3C=20072008%20and%20seasonId%3E=20072008'
data = requests.get(url).json()
df = pd.DataFrame(data['data'])
df.to_csv('nj devils data.csv',index=False)
To see where I got that URL, go to the page you're trying to scrape in your browser, open Developer Tools -> Network -> Fetch/XHR, and hit refresh. You'll see a bunch of network requests fire; if you click on one you can explore the data it returns in the "Preview" tab, and then recreate the request like above. It isn't always this easy: sometimes you need to send certain headers or a payload of information to authenticate a request to an API, but in this case it works with a plain GET.
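Just to illustrate that last point: if an endpoint does reject bare requests, you can pass along the headers (or payload) you see attached to the request in the Network tab. The header values below are generic placeholders, not something this NHL endpoint is known to require, and url is the same variable defined in the snippet above:
# Placeholder headers copied-from-devtools style; adjust them to whatever
# the Network tab shows for the request you are recreating.
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/json',
}
data = requests.get(url, headers=headers).json()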
I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the tag I need is near the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content but it's too long).
My best guess from reading this answer is that the website has JavaScript that loads the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to load first.
However, if you look at the website, it's very simple, so as a novice part of me is thinking that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is WordPress-generated, and WordPress can set up delayed JavaScript loading even on simple sites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when web pages will require a web driver and when Requests will not be enough?
2) I'm thinking one step ahead here - if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
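As a hedged sketch of that approach: drive a real browser, scroll until the page stops growing, then parse the fully rendered HTML. The CSS classes are the ones from the question; the scroll-and-wait loop is a generic pattern, not something tuned to this particular site:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://oregonsheriffs.org/about-ossa/meet-your-sheriffs')

# Keep scrolling until the document height stops increasing.
last_height = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy loader time to fetch the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, 'html.parser')
sheriff_names = soup.find_all('a', class_='eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_='eg-sheriff-list-skin-element-2')
driver.quit()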
I am trying to scrape the live data at the top of this page:
https://www.wallstreet-online.de/devisen/euro-us-dollar-eur-usd-kurs/realtime
My current method:
import time
import re
from bs4 import BeautifulSoup as soup
import requests
while (1==1):
    con = requests.request('get', 'https://www.wallstreet-online.de/devisen/euro-us-dollar-eur-usd-kurs/realtime', stream=True)
    page = con.text
    kursSoup = soup(page, "html.parser")
    kursDiv = kursSoup.find("div", {"class": "pull-left quoteValue"})
    print(kursDiv.span)
    del con
    del page
    del kursSoup
    del kursDiv
    #time.sleep(2)
    print("end")
This works but is not in sync with the data on the website. I don't really get why, because I delete all the variables at the end of the loop, so the result should change when the data on the website changes, but it seems to stay the same for a fixed number of iterations. Does anyone know why, or have a better way of doing this? (I'm a bloody beginner and have no idea how the site even works; that's why I'm parsing the HTML.)
It looks like that web page may be using JavaScript to populate and update that number. I'm not familiar with BeautifulSoup but I don't think it will run the JavaScript on the page to update that number.
You may want to use something like Chrome Developer Tools to keep an eye on the Network tab. When I looked, there was a websocket connection to wss://push.wallstreet-online.de/lightstreamer going on behind the scenes. You may want to use a websocket client library for Python to read from this socket, and either find some API docs or reverse engineer the data that comes over it. Good luck!
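A very rough sketch with the websocket-client package (pip install websocket-client): this only opens the socket and dumps whatever frames arrive. Lightstreamer normally expects its own session/subscription handshake first, so you would have to watch the WS frames in devtools and replay them; treat this as a starting point, not a working feed:
from websocket import create_connection

# Connect and print raw frames; the Lightstreamer handshake messages are
# not included here and would need to be reverse engineered from devtools.
ws = create_connection("wss://push.wallstreet-online.de/lightstreamer")
try:
    while True:
        print(ws.recv())
finally:
    ws.close()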
In this video, I give you a look at the dataset I want to scrape/take from the web. Very sorry about the audio, but I did the best with what I have. It is hard for me to describe what I am trying to do: I see a page with thousands upon thousands of what are obviously tables, but pd.read_html doesn't work! Until it hit me: this page has a form to be filled out first...
https://opir.fiu.edu/instructor_eval.asp
Going to this link will allow you to select a semester, and in doing so, will show thousands upon thousands of tables. I attempted to use the URL after selecting a semester hoping to read HTML, but no such luck.. I still don't know what I'm even looking at (like, is it a webpage, or is it ASP? What even IS ASP?). If you follow the video link, you'll see that it gives an ugly error if you select spring semester, copy the link, and put it in the search bar. Some SQL error.
So this is my dilemma. I'm trying to GET this data... all these tables. In my last post, I did a brute-force attempt to get them by just clicking and dragging for 10+ minutes, then pasting into Excel. That's an awful way of doing it, and it wasn't even particularly useful when I imported that Excel sheet into Python, because the data was very difficult to work with. Very unstructured. So I thought, hey, why not scrape with bs4? Not that easy either, it seems, as the URL won't work. After filtering to spring semester, the URL just won't work, not for you, and not if you paste it into Python for bs4 to use...
So I'm sort of at a loss here of how to reasonably work with this data. I want to scrape it with bs4, and put it into dataframes to be manipulated later. However, as it is ASP or whatever it is, I can't find a way to do so yet :\
ASP stands for Active Server Pages; it's a page running a server-side script (usually VBScript), so this shouldn't concern you, as you want to scrape data from the rendered page.
In order to get a valid response from /instructor_evals/instr_eval_result.asp you have to submit a POST request with the form data of /instructor_eval.asp, otherwise the page returns an error message.
If you submit the correct data with urllib you should be able to get the tables with bs4.
from urllib.request import urlopen, Request
from urllib.parse import urlencode
from bs4 import BeautifulSoup
url = 'https://opir.fiu.edu/instructor_evals/instr_eval_result.asp'
data = {'Term':'1171', 'Coll':'%', 'Dept':'','RefNum':'','Crse':'','Instr':''}
r = urlopen(Request(url, data=urlencode(data).encode()))
html = r.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
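Since the end goal was DataFrames, one way to hand those tables to pandas from here (assuming pandas and lxml are installed; pandas.read_html parses table markup directly and returns a list of DataFrames):
import pandas as pd

# Convert each bs4 table element into a DataFrame for later manipulation.
frames = [pd.read_html(str(table))[0] for table in tables]
print(frames[0].head())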
By the way, this error message is a strong indication that the page is vulnerable to SQL injection, which is a very nasty bug, and I think you should inform the admin about it.