Why can't I scrape using beautiful soup? - python

I need to scrape the only table from this website: https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu
I used Beautiful Soup and requests but wasn't successful. Can you suggest where I'm going wrong?
import bs4
import pandas as pd
import requests
mandal_url = "https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu"
r = requests.get(mandal_url, verify=False).content
soup = bs4.BeautifulSoup(r, 'lxml')
df = pd.read_html(str(soup.find('table', {"id": "gvAgricultureVillage"})))
I am getting 'Page not found' in the data frame. I don't know where I am going wrong!

The page likely requires some sort of sign-in. Viewing it myself by clicking on the link, I get a 'Page not found' error as well.
You would need to add the cookies / some other headers to the request to appear "signed in".
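A minimal sketch of what that could look like, assuming you copy a valid session cookie out of your browser's developer tools (the cookie name and header values below are hypothetical):
import requests
mandal_url = "https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu"
# Hypothetical values: copy the real cookie and headers from the
# browser's Network tab while signed in on the site.
cookies = {"ASP.NET_SessionId": "your-session-id-here"}
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(mandal_url, cookies=cookies, headers=headers, verify=False)
print(r.status_code)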

The link you're trying to scrape from is an invalid link. When I click the link you provided, or the link you store in mandal_url, both return a 'Page not found' page. So you are scraping in the correct way, but the URL you give the scraper is invalid / no longer up.

I couldn't get access to the website. But you can read the table(s) on a webpage directly by using:
dfs = pd.read_html(your_url, header=0)
In case the URL requires authentication, you can fetch the page first:
r = requests.get(url_needing_authentication, auth=('myuser', 'mypasswd'))
pd.read_html(r.text, header=0)[1]
This will simplify your code. Hope it helps!
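Note that read_html can also pick out a specific table by its HTML attributes, which fits the gvAgricultureVillage table from the question. A sketch, assuming the page actually loads for you:
import pandas as pd
import requests
mandal_url = "https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu"
r = requests.get(mandal_url, verify=False)
# read_html returns a list of DataFrames; attrs narrows the match
# to tables whose HTML attributes contain the given id.
df = pd.read_html(r.text, attrs={"id": "gvAgricultureVillage"}, header=0)[0]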

Related

Scraping stock price data

I'm trying to get the prices from this URL.
Can someone tell me what I'm doing wrong?
https://stockcharts.com/h-hd/?%24GSPCE
import requests
from bs4 import BeautifulSoup
BASE_URL = "https://stockcharts.com/h-hd/?%24GSPCE"
resp = requests.get(BASE_URL).content
soup = BeautifulSoup(resp, 'html.parser')
prices = soup.find('div', {'class': 'historical-data-descrip'})
content = str(prices)
print(content)
Stockcharts only provides historical data to StockChart members, so you probably need to pass some kind of authentication.
Or use an API, like this one.
You must be logged in to your StockCharts.com account to see this information.
Are you signed in on your browser?
Dump str(soup) to a text file, open it in a text editor, and search for the message above to see if you're simply getting a login error.
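A quick sketch of that check, using the same URL as the code above:
import requests
from bs4 import BeautifulSoup
BASE_URL = "https://stockcharts.com/h-hd/?%24GSPCE"
soup = BeautifulSoup(requests.get(BASE_URL).content, 'html.parser')
# Save the fetched HTML so it can be inspected in a text editor.
with open('page_dump.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))
# Or test for the login wall directly.
if "You must be logged in" in str(soup):
    print("Getting the login wall, not the data.")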

Scraping Webpage With Beautiful Soup

I am new to web scraping and I am trying to scrape wind data from a website. Here is the website: https://wx.ikitesurf.com/spot/507.
I understand that I can do this using Selenium to find elements, but I think I may have found a better way; please correct me if I am wrong. In developer tools, I can find this request by going to Network -> JS -> getGraph?
https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881
This page contains all the data I need and it is constantly updating. Here is my code:
import time
import requests
from bs4 import BeautifulSoup
url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
time.sleep(3)
wind = soup.find("last_ob_wind_desc")
print(wind)
I tried using Beautiful Soup to scrape it, but I always receive "None". Does anyone know how I can scrape this page? I would like to know what I am doing wrong. Thanks for any help!
Removing callback=jQuery17200020271765600428093_1619158293267& from the API URL will make it return proper JSON:
import requests
url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url).json()
response is now a dictionary with the data. last_ob_wind_desc can be retrieved with response['last_ob_wind_desc'].
You can also save the data to csv or other file formats with pandas:
import pandas as pd
df = pd.json_normalize(response)
df.to_csv('filename.csv')
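As an aside, if you ever needed to keep the callback parameter, the endpoint would return JSONP (the JSON wrapped in a function call), and you could strip the wrapper yourself. A rough sketch, assuming the server simply echoes the callback name around the JSON body:
import json
import requests
jsonp_url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=cb&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868'
text = requests.get(jsonp_url).text
# JSONP looks like cb({...}); take what is between the outermost
# parentheses and parse it as JSON.
body = text[text.index('(') + 1 : text.rindex(')')]
data = json.loads(body)
print(data.get('last_ob_wind_desc'))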

request.get url of second page of a search result

I am trying to use requests.get(url) to get the response for a URL from a server.
The following code works for the URL of the first page of a search result:
r = requests.get("https://www.epocacosmeticos.com.br/perfumes")
soup = BeautifulSoup(r.text)
However, when I try to use the same code for the URL of the second page of the search result, which is "https://www.epocacosmeticos.com.br/perfumes#2",
r = requests.get("https://www.epocacosmeticos.com.br/perfumes#2")
soup = BeautifulSoup(r.text)
it returns the response of the first page; the '#2' at the end of the URL is ignored. How can I get the response for the second page of a search result?
You can use a web proxy like BurpSuite to view the requests made by the page. When you click on the "Page 2" button, this is what is being sent in the background:
GET /buscapagina?fq=C%3a%2f1000001%2f&PS=16&sl=f804bbc5-5fa8-4b8b-b93a-641c059b35b3&cc=4&sm=0&PageNumber=2 HTTP/1.1
Therefore, this is the URL you will need to query if you want to properly scrape the website.
BurpSuite also allows you to play with the requests, so you can try changing the request (like changing the 2 for a 3) and see if you get the expected result.
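A sketch of querying that endpoint directly with requests, using the query string captured above (the fq and sl values come from that capture and may differ for you):
import requests
from bs4 import BeautifulSoup
# Parameters copied from the captured request; PageNumber controls paging.
params = {
    'fq': 'C:/1000001/',
    'PS': 16,
    'sl': 'f804bbc5-5fa8-4b8b-b93a-641c059b35b3',
    'cc': 4,
    'sm': 0,
    'PageNumber': 2,
}
r = requests.get('https://www.epocacosmeticos.com.br/buscapagina', params=params)
soup = BeautifulSoup(r.text, 'html.parser')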
It seems this website uses dynamic HTML. Because of this, the second results page is not a "new page", but the same page with the search content reloaded; the '#2' is a URL fragment, which the browser never even sends to the server.
You probably won't be able to scrape it using requests alone; this job needs a browser. Selenium with PhantomJS or headless Chrome is a good choice, and once the page has rendered you can parse it with BeautifulSoup.
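A minimal Selenium sketch along those lines, assuming headless Chrome and a matching chromedriver are installed (whether the site honours the #2 fragment on initial load would need checking; if not, click the page-2 button via driver.find_element instead):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.epocacosmeticos.com.br/perfumes#2')
# Give the dynamic content a moment to render; an explicit wait on a
# known element would be more robust than an implicit wait.
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()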

CSS selectors to be used for scraping specific links

I am new to Python and working on a scraping project. I am using Firebug to copy the CSS path of the required links. I am trying to collect the links under the "UPCOMING EVENTS" tab of http://kiascenehai.pk/, but this is just for learning how to get specific links.
I am looking for a fix for this problem, and also for suggestions on how to retrieve specific links using CSS selectors.
from bs4 import BeautifulSoup
import requests
url = "http://kiascenehai.pk/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.select("html body div.body-outer-wrapper div.body-wrapper.boxed-mode div.main- outer-wrapper.mt30 div.main-wrapper.container div.row.row-wrapper div.page-wrapper.twelve.columns.b0 div.row div.page-wrapper.twelve.columns div.row div.eight.columns.b0 div.content.clearfix section#main-content div.row div.six.columns div.small-post-wrapper div.small-post-content h2.small-post-title a"):
    print(link.get('href'))
First of all, that page requires a city selection to be made (in a cookie). Use a Session object to handle this:
s = requests.Session()
s.post('http://kiascenehai.pk/select_city/submit_city', data={'city': 'Lahore'})
response = s.get('http://kiascenehai.pk/')
Now the response gets the actual page content, not redirected to the city selection page.
Next, keep your CSS selector no larger than needed. On this page there isn't much to go on, as it uses a grid layout, so we first need to zoom in on the right rows:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text)
upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')
for link in upcoming_events_row.select('h2 a[href]'):
    print(link['href'])
This is the co-founder of KiaSceneHai.pk; please don't scrape websites, a lot of effort goes into collecting the data. We offer access through our API; you can use the contact form to request access. Thank you.

Using urllib2 and BeautifulSoup not receiving the data I view in browser

I am trying to scrape a website:
http://www.gabar.org/membersearchresults.cfm?start=26&id=E640EC74-9C8E-9913-79DB5D9C376528C0
I know the link above will show that there are no search results, but when I do the search manually there are results.
The problem I am having is that when I open this link in my browser I see the page as expected; however, when I fetch it for Beautiful Soup, the output says something along the lines of the search not being available.
I am new to this so not quite sure how this works, do websites have things built in that make things like this (urllib2/beautifulsoup) not work?
import urllib2
from bs4 import BeautifulSoup
File = urllib2.urlopen("http://www.gabar.org/membersearchresults.cfm?start=26&id=E640EC74-9C8E-9913-79DB5D9C376528C0")
Html = File.read()
File.close()
soup = BeautifulSoup(Html)
AllLinks = soup.find_all("a")
lawyerlinks = []
for link in soup.find_all("a"):
    lawyerlinks.append(link.get('href'))
lawyerlinks = lawyerlinks[76:100]
print(lawyerlinks)
That's fascinating. Going to the first page of results works, and then clicking "Next" works, and all it does is take you to the URL you posted. But if I visit that URL directly, I get no results.
Note that urllib2.urlopen is indeed behaving exactly like a browser here. If you open a browser directly to that page, you get no results - which is exactly what you get with urlopen.
What you want to do is mimic a browser, visit the first page of results, and then mimic clicking 'next' just like a browser would. The best library I know of for this is mechanize.
import mechanize
br = mechanize.Browser()
br.open("http://www.gabar.org/membersearchresults.cfm?id=ED162783-9C8E-9913-79DBE86CBE9FB115")
response1 = br.follow_link(text_regex=r"Next", nr=0)
Html = response1.read()
#rest is the same
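For completeness, the "rest" is the same parsing as in the question's code, picking up from the mechanize snippet above (Python 2, like mechanize itself):
from bs4 import BeautifulSoup
soup = BeautifulSoup(Html)
lawyerlinks = []
for link in soup.find_all("a"):
    lawyerlinks.append(link.get('href'))
print(lawyerlinks)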
