I'm trying to get the prices from this URL.
Can someone tell me what I'm doing wrong?
https://stockcharts.com/h-hd/?%24GSPCE
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://stockcharts.com/h-hd/?%24GSPCE'
resp = requests.get(BASE_URL).content
soup = BeautifulSoup(resp, 'html.parser')
prices = soup.find('div', {'class': 'historical-data-descrip'})
content = str(prices)
print(content)
StockCharts only provides historical data to StockCharts members, so you probably need to pass some kind of authentication.
Or use a market-data API instead.
You must be logged in to your StockCharts.com account to see this information.
Are you signed in in your browser?
Get soup.contents, open it in a notepad document, and search for the message above to see whether you're simply getting a login error.
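A quick way to check for that login wall in code (a sketch; `resp_text` stands in for the HTML you fetched with requests in the question above):

```python
# Sketch: detect a members-only / login wall in fetched HTML.
# resp_text stands in for requests.get(BASE_URL).text from the question.
resp_text = ('<div class="historical-data-descrip">You must be logged in to your '
             'StockCharts.com account to see this information.</div>')

if 'must be logged in' in resp_text.lower():
    print('Login wall detected - authentication is required')
```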
I am new to web scraping and I am trying to scrape wind data from a website. Here is the website: https://wx.ikitesurf.com/spot/507.
I understand that I can do this using Selenium to find elements, but I think I may have found a better way; please correct me if I am wrong. In developer tools I can find this request by going to Network -> JS -> getGraph?
https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881
This page contains all the data I need and it is constantly updating. Here is my code:
import time

import requests
from bs4 import BeautifulSoup

url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
time.sleep(3)
wind = soup.find("last_ob_wind_desc")
print(wind)
I tried using Beautiful Soup to scrape it, but I always get None back. Does anyone know how I can scrape this page, and what I am doing wrong? Thanks for any help!
Removing callback=jQuery17200020271765600428093_1619158293267& from the api url will make it return proper json:
import requests
url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url).json()
response is now a dictionary with the data. last_ob_wind_desc can be retrieved with response['last_ob_wind_desc'].
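If you can't modify the URL, you can also strip the JSONP wrapper yourself before parsing. A sketch (the sample payload below is made up for illustration):

```python
import json
import re

def strip_jsonp(text):
    # Remove a "callback( ... )" wrapper, leaving the bare JSON payload;
    # if there is no wrapper, parse the text as-is
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    return json.loads(match.group(1) if match else text)

sample = 'jQuery17200020271765600428093({"last_ob_wind_desc": "NNW 5 mph"})'
data = strip_jsonp(sample)
print(data['last_ob_wind_desc'])  # NNW 5 mph
```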
You can also save the data to csv or other file formats with pandas:
import pandas as pd
df = pd.json_normalize(response)
df.to_csv('filename.csv')
I need to scrape the only table from this website: https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu
I used Beautiful Soup and requests but was not successful. Can you suggest where I am going wrong?
import bs4
import pandas as pd
import requests

mandal_url = "https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu"
r = requests.get(mandal_url, verify=False).content
soup = bs4.BeautifulSoup(r, 'lxml')
# read_html returns a list of DataFrames; take the first
df = pd.read_html(str(soup.find('table', {"id": "gvAgricultureVillage"})))[0]
I am getting 'Page not found' in the data frame, and I don't know where I am going wrong!
The page likely requires some sort of sign-in. Viewing it myself by clicking on the link, I get a 'Page not found' error.
You would need to add the cookies / some other headers to the request to appear "signed in".
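A sketch of attaching such headers and cookies with a requests session; the cookie name and value here are placeholders, so copy the real ones your browser sends when signed in (visible in developer tools):

```python
import requests

s = requests.Session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0',
    # Placeholder - replace with the cookie your browser sends when signed in
    'Cookie': 'ASP.NET_SessionId=<your-session-id>',
})
# r = s.get(mandal_url, verify=False)  # then parse r.content as before
print(s.headers['User-Agent'])
```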
Try clicking the link you're trying to scrape; it appears to be invalid. When I click the link you provide (the same one you store in mandal_url), it returns a 'Page not found' page. So you are scraping in the correct way, but the URL you give the scraper is invalid or no longer up.
I couldn't get access to the website, but you can read the table(s) on a webpage directly using:
dfs = pd.read_html(your_url, header=0)
If the URL requires authentication, you can get the page with:
r = requests.get(url_requiring_authentication, auth=('myuser', 'mypasswd'))
pd.read_html(r.text, header=0)[1]
This will simplify your code. Hope it helps!
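To see pd.read_html in action without access to the site, here is a self-contained sketch on a stand-in table (the table id is taken from the question above; the values are made up):

```python
from io import StringIO

import pandas as pd

# Stand-in for the page's table markup; the real page uses id="gvAgricultureVillage"
html = """
<table id="gvAgricultureVillage">
  <tr><th>Village</th><th>Crop Area</th></tr>
  <tr><td>Agali</td><td>120</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table found
df = pd.read_html(StringIO(html), header=0)[0]
print(df)
```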
I want to extract the username from Facebook posts without the API. I've already succeeded in extracting the timestamp, but the same approach is not working for the username.
As input I have a list of links like these:
https://www.facebook.com/barackobama/photos/a.10155401589571749/10156901908101749/?type=3&theater
https://www.facebook.com/photo.php?fbid=391679854902607&set=gm.325851774772841&type=1&theater
https://www.facebook.com/FisherHouse/photos/pcb.10157433176029134/10157433170239134/?type=3&theater
I've already tried searching via pageTitle, but it does not work as expected because the title contains a lot of extraneous information:
facebook = BeautifulSoup(req.text, "html.parser")
facebookusername = str(facebook.select('[id="pageTitle"]'))
My code now is:
import requests
from bs4 import BeautifulSoup

req = requests.get(url)  # url is one of the post links above
facebook = BeautifulSoup(req.text, "html.parser")
divs = facebook.find_all('div', class_="_title")
for div in divs:
    if 'title' in str(div):
        print(div)
I need only the username as output.
As WizKid said, you should use the API. But to answer your question: the page name seems to be nested within an h5 tag, so extract the h5 first and then get the name.
x = facebook.find('h5')
title = x.find('a').getText()
I can't try it at the moment, but that should do the trick.
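A self-contained sketch of that approach on stand-in markup (the structure is assumed from the answer above, not verified against Facebook's current HTML):

```python
from bs4 import BeautifulSoup

# Stand-in for the post header markup the answer describes
html = '<h5 class="_title"><a href="/barackobama">Barack Obama</a></h5>'

facebook = BeautifulSoup(html, 'html.parser')
x = facebook.find('h5')
title = x.find('a').get_text()
print(title)  # Barack Obama
```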
I am new to web scraping, and I have been given a task to extract data from https://www.kaggle.com/hacker-news/hacker-news.
I am choosing the "comments" dataset. Below is my code for scraping.
import requests
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/hacker-news/hacker-news'
headers = {'User-Agent' : 'Mozilla/5.0'}
response = requests.get(url, headers = headers)
response.status_code
response.content
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('tbody', class_ = 'TableBody-kSbjpE jGqIxa')
When I try to execute the last command it returns : [].
So I am stuck here. I know we can get the data from the kernel, but just for practice purposes: where am I going wrong? Am I choosing the wrong class? I want to scrape the data and save it to a CSV file or to a NoSQL database, preferably Cassandra.
You are getting this [] because the data you want to scrape is loaded from an API after the web page loads, so the page you are requesting does not contain that class.
You can open your browser console and check the Network tab (as shown in the screenshot); there you will find the request that returns the data you want, so you have to make your request to that URL instead.
If you open that URL, you can see all the data in the Preview tab.
Also, if you have a good knowledge of Python, you can use Scrapy to scrape the data:
https://doc.scrapy.org/en/latest/intro/overview.html
Even though you were able to see 'tbody', class_ = 'TableBody-kSbjpE jGqIxa' in the element inspector, the response to the request you make does not contain this class; see for yourself with print(soup.prettify()). This is most likely because you're not requesting the correct URL.
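You can demonstrate the mismatch with a minimal stand-in for what a JavaScript-rendered page typically looks like before any scripts run (the markup below is illustrative, not Kaggle's actual HTML):

```python
from bs4 import BeautifulSoup

# Raw HTML of a JS-rendered page: an empty mount point plus a script tag,
# with no table markup at all
html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tbody', class_='TableBody-kSbjpE jGqIxa')
print(rows)  # [] - the class is simply not in the HTML the server sent
```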
This may not be something you're aware of, but as an FYI:
You don't actually need to scrape with BeautifulSoup; you can get a list of all the available datasets from the Kaggle API. Once you have it installed and configured, you can download the dataset with: kaggle datasets download -d . Here's more info if you wish to proceed with the API instead: https://github.com/Kaggle/kaggle-api
I am new to Python and working on a scraping project. I am using Firebug to copy the CSS path of the links I need. I am trying to collect the links under the "UPCOMING EVENTS" tab from http://kiascenehai.pk/, just to learn how to get specific links.
I am looking for a fix for this problem, and also for suggestions on how to retrieve specific links using CSS selectors.
from bs4 import BeautifulSoup
import requests

url = "http://kiascenehai.pk/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for link in soup.select("html body div.body-outer-wrapper div.body-wrapper.boxed-mode div.main-outer-wrapper.mt30 div.main-wrapper.container div.row.row-wrapper div.page-wrapper.twelve.columns.b0 div.row div.page-wrapper.twelve.columns div.row div.eight.columns.b0 div.content.clearfix section#main-content div.row div.six.columns div.small-post-wrapper div.small-post-content h2.small-post-title a"):
    print(link.get('href'))
First of all, that page requires a city selection to be made (in a cookie). Use a Session object to handle this:
s = requests.Session()
s.post('http://kiascenehai.pk/select_city/submit_city', data={'city': 'Lahore'})
response = s.get('http://kiascenehai.pk/')
Now the response gets the actual page content, not redirected to the city selection page.
Next, keep your CSS selector no larger than needed. In this page there isn't much to go on as it uses a grid layout, so we first need to zoom in on the right rows:
soup = BeautifulSoup(response.text, 'html.parser')
upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')
for link in upcoming_events_row.select('h2 a[href]'):
    print(link['href'])
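This zoom-in approach can be tried on stand-in markup mirroring that structure (a sketch; the class names are taken from the answer above, and the event link is made up):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the page: a featured-event header followed by a row of post titles
html = """
<div class="featured-event">UPCOMING EVENTS</div>
<div class="row">
  <h2 class="small-post-title"><a href="/event/first-event">First Event</a></h2>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')
links = [link['href'] for link in upcoming_events_row.select('h2 a[href]')]
print(links)  # ['/event/first-event']
```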
This is the co-founder of KiaSceneHai.pk; please don't scrape the website. A lot of effort goes into collecting the data. We offer access through our API; you can use the contact form to request access. Thank you.