Python Requests Getting HTML Code from Wrong Page - python

So I started using python requests to look at the html code from some websites. I would do r = requests.get(url) to get all the information I need.
However, I noticed that this doesn't work sometimes. For example, I'm using steamcommunity.com to get some market data, so I used the url: http://steamcommunity.com/market/search. This brings up the first page of items out of over 7,000 pages. To get a different page, I used this url: http://steamcommunity.com/market/search#p4_quantity_desc. This should take you to the 4th page of the website, and it does if you put it in your browser. But, when I go to read the html code, I get the same code from both urls, even though they should point to different pages with different code.
Any help is greatly appreciated! Thanks!
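Worth knowing here: everything after the # in a URL is a fragment, which the browser handles client-side and which requests never sends to the server - so both URLs produce the exact same HTTP request. A quick way to see what actually gets sent:

```python
from urllib.parse import urlsplit

# The browser keeps the fragment (the part after '#') to itself;
# an HTTP client only sends scheme, host, path and query to the server.
url = "http://steamcommunity.com/market/search#p4_quantity_desc"
parts = urlsplit(url)

print("sent to server:", parts.path)      # /market/search - same for both URLs
print("kept by client:", parts.fragment)  # p4_quantity_desc
```

Since both URLs result in an identical request, the later pages are presumably fetched by the page's JavaScript; you would need to find that underlying request in your browser's devtools rather than vary the fragment.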

Related

Stuck with infinite scrolling using Python, Requests and BeautifulSoup

I've been scraping (Python) articles from a couple of news websites from my country successfully, basically by parsing the main page, fetching the hrefs and accessing them to parse the articles. But I just hit a wall with https://www.clarin.com/. I am only getting a very limited number of elements because of the infinite scrolling. I researched a lot but couldn't find the right resource to overcome this, though of course it is more than likely that I am doing it wrong.
From what I see in the devtools, the URL request that loads more is a JSON file, but I don't know how to fetch it automatically in order to parse it. I would like some quick guidance on what to learn to do this. I hope I made some sense; this is my base code:
import requests
from bs4 import BeautifulSoup

source = requests.get("https://www.clarin.com/")
html = BeautifulSoup(source.text, "lxml")
This is an example request url I am seeing in chrome devtools.
https://www.clarin.com/ondemand/eyJtb2R1bGVDbGFzcyI6IkNMQUNsYXJpbkNvbnRhaW5lckJNTyIsImNvbnRhaW5lcklkIjoidjNfY29sZnVsbF9ob21lIiwibW9kdWxlSWQiOiJtb2RfMjAxOTYyMjQ4OTE0MDgzIiwiYm9hcmRJZCI6IjEiLCJib2FyZFZlcnNpb25JZCI6IjIwMjAwNDMwXzAwNjYiLCJuIjoiMiJ9.json
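For what it's worth, that long path segment looks like Base64-encoded JSON, so you can decode it to see the parameters the page passes when loading more content. A sketch of decoding it (the meaning of fields such as "n" is a guess on my part, not anything the site documents):

```python
import base64
import json

# Token copied from the /ondemand/<token>.json URL seen in devtools.
token = ("eyJtb2R1bGVDbGFzcyI6IkNMQUNsYXJpbkNvbnRhaW5lckJNTyIsImNvbnRhaW5lcklkIjoi"
         "djNfY29sZnVsbF9ob21lIiwibW9kdWxlSWQiOiJtb2RfMjAxOTYyMjQ4OTE0MDgzIiwiYm9h"
         "cmRJZCI6IjEiLCJib2FyZFZlcnNpb25JZCI6IjIwMjAwNDMwXzAwNjYiLCJuIjoiMiJ9")

# Restore '=' padding in case the site strips it, then decode.
padded = token + "=" * (-len(token) % 4)
params = json.loads(base64.b64decode(padded))
print(params)
# If a field like "n" turns out to be a page/offset counter, you could
# change it, re-encode with base64.b64encode, and request the next chunk.
```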

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data, as well as their info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page, it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
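Picking out just that table could be sketched like this - the HTML string here is a tiny stand-in with made-up contract data so the snippet runs offline; on the real page you would pass page.text to BeautifulSoup instead:

```python
from bs4 import BeautifulSoup

# Stand-in for page.text; the real page has a <table id="contracts">.
html = """
<html><body>
  <div class="ad">advertisement</div>
  <table id="contracts">
    <tr><th>Player</th><th>2019-20</th></tr>
    <tr><td>Victor Oladipo</td><td>$21,000,000</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="contracts")  # only the table we care about
rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
print(rows)
```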
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the GET request on the terminal won't be very helpful, as the HTML content returned is long and your terminal will truncate the printed response. I'm assuming that in your case the website reuses parts of the homepage in other pages as well, so it can look confusing.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
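Dumping the response to a file takes only a couple of lines; the body string below is a stand-in for page.text so this runs without a network call:

```python
from pathlib import Path

# Stand-in for page.text from requests; on the real page use page.text.
body = "<html><body><h1>Indiana Pacers Contracts</h1></body></html>"

out = Path("page_dump.html")
out.write_text(body, encoding="utf-8")  # open this file in your browser
print("wrote", out, len(body), "bytes")
```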

Script cannot fetch data from a web page

I am trying to write a program in Python that takes the name of a stock and its price and prints them. However, when I run it, nothing is printed. It seems like the data is having a problem being fetched from the website. I double-checked that the path from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests
page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
here is the website I am trying to get the data from
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page itself you see when you manually visit the website. It seems that the website was smart enough to see that your request to this URL was from a script and not manually from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special Captcha page. I get the same when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
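A cheap sanity check before parsing is to look for anti-bot markers in the response body - the marker strings here are guesses, to be adjusted to whatever your page.content actually shows:

```python
def looks_blocked(html: str) -> bool:
    """Heuristic: does the response look like a captcha/anti-bot page?"""
    markers = ("captcha", "are you a robot", "access denied")
    lowered = html.lower()
    return any(marker in lowered for marker in markers)

# Stand-ins for page.text from a blocked and a normal response:
print(looks_blocked("<title>Bloomberg - Are you a robot?</title>"))  # True
print(looks_blocked("<span class='priceText'>7,000.00</span>"))      # False
```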

Scraping a paginated website: Scraping page 2 gives back page 1 results

I am using the get method of the requests library in python to scrape information from a website which is organized into pages (i.e paginated with numbers at the bottom).
Page 1 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian
I am able to extract the data that I need from the first page but when I feed my code the url for the second page, I get the same data from the first page. Now after carefully analyzing my code, I am certain the issue is not my code logic but the way the second page url is structured.
So my question is how can I get my code to work as I want. I suspect it is a question of parameters but I am not 100% percent sure. If indeed it is parameters that I need to pass to request, I would appreciate some guidance on how to break down the parameters. My page 2 link is attached below.
Thanks.
Page 2 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
Note: The pages are not really links per se.
It looks like the platform is ASP.NET and the pagination links are operated by JS. I seriously doubt you will have it easy with Python, since BeautifulSoup is an HTML parser/extractor, so if you really want to use this site, I would suggest looking into Selenium or even PhantomJS, since they fully replicate the browser.
But in this particular case you are lucky, because there's a legacy website version which doesn't use modern bells and whistles :)
http://legacy.realfood.tesco.com/recipes/search.html?st=vegetarian&cr=False&page=3&srt=searchRelevance
It looks like the pagination of this site is handled by the query parameters passed in the second URL you posted, ie:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
The query string is url encoded. %3D is = and %26 is &. It might be more readable like this:
q='selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian'
For example, if you wanted to pull back the fifth page of Vegetarian Recipes the URL would look like this:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D5%26perpage%3D30%26DietaryOption%3DVegetarian'
You can keep incrementing the page number until you get a page with no results.
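The page-N URL can be built in code with urllib.parse.quote, which produces exactly the %3D/%26 encoding described above:

```python
from urllib.parse import quote

BASE = "https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"

def page_url(page: int) -> str:
    # Plain query string first, then percent-encode '=' as %3D and '&' as %26.
    q = f"selectedobjecttype=RECIPES&page={page}&perpage=30&DietaryOption=Vegetarian"
    return f"{BASE}#!q='{quote(q, safe='')}'"

print(page_url(5))
```

Bear in mind that everything after the # is a URL fragment that requests does not send to the server, so for actual scraping the legacy URL with its plain page= parameter may be the more reliable route.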
What about this?
from bs4 import BeautifulSoup
import urllib.request

# Use the legacy site mentioned above, whose page= parameter works without JS.
for numb in range(1, 11):  # pages 1 through 10; adjust the range as needed
    url = ("http://legacy.realfood.tesco.com/recipes/search.html"
           "?st=vegetarian&cr=False&page={}&srt=searchRelevance".format(numb))
    resp = urllib.request.urlopen(url)
    soup = BeautifulSoup(resp, "html.parser",
                         from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])
Hopefully it works for you. I can't test it because my office blocks these kinds of things. I'll try it when I get home tonight to see if it does what it should do...

How to scrape ASP webpage in Python?

In this video, I give you a look at the dataset I want to scrape/take from the web. Very sorry about the audio, but I did the best with what I have. It is hard for me to describe what I am trying to do: I see a page with thousands of entries that obviously contains tables, but pd.read_html doesn't work! Until it hit me: this page has a form to be filled out first....
https://opir.fiu.edu/instructor_eval.asp
Going to this link will allow you to select a semester, and in doing so, will show thousands upon thousands of tables. I attempted to use the URL after selecting a semester hoping to read HTML, but no such luck.. I still don't know what I'm even looking at (like, is it a webpage, or is it ASP? What even IS ASP?). If you follow the video link, you'll see that it gives an ugly error if you select spring semester, copy the link, and put it in the search bar. Some SQL error.
So this is my dilemma. I'm trying to GET this data... All these tables. Last post I made, I did a brute force attempt to get them by just clicking and dragging for 10+ minutes, then pasting into excel. That's an awful way of doing it, and it wasn't even particularly useful when I imported that excel sheet into python because the data was very difficult to work with. Very unstructured. So I thought, hey, why not scrape with bs4? Not that easy either, it seems, as the URL won't work. After filtering to spring semester, the URL just won't work, not for you, and not if you paste it into python for bs4 to use...
So I'm sort of at a loss here of how to reasonably work with this data. I want to scrape it with bs4, and put it into dataframes to be manipulated later. However, as it is ASP or whatever it is, I can't find a way to do so yet :\
ASP stands for Active Server Pages and is a page running a server-side script (usually VBScript), so this shouldn't concern you, as you want to scrape data from the rendered page.
In order to get a valid response from /instructor_evals/instr_eval_result.asp you have to submit a POST request with the form data of /instructor_eval.asp, otherwise the page returns an error message.
If you submit the correct data with urllib you should be able to get the tables with bs4.
from urllib.request import urlopen, Request
from urllib.parse import urlencode
from bs4 import BeautifulSoup
url = 'https://opir.fiu.edu/instructor_evals/instr_eval_result.asp'
data = {'Term': '1171', 'Coll': '%', 'Dept': '', 'RefNum': '', 'Crse': '', 'Instr': ''}
r = urlopen(Request(url, data=urlencode(data).encode()))
html = r.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
By the way, this error message is a strong indication that the page is vulnerable to SQL injection, which is a very nasty bug, and I think you should inform the admin about it.
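Once find_all('table') has given you the tables, flattening one into rows is mechanical; the HTML below is a small stand-in with invented instructor data so the snippet runs without hitting the site:

```python
from bs4 import BeautifulSoup

# Stand-in for the html fetched from instr_eval_result.asp.
html = """
<table>
  <tr><th>Instructor</th><th>Course</th></tr>
  <tr><td>Smith</td><td>CHM 1045</td></tr>
  <tr><td>Jones</td><td>PHY 2048</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")]
print(rows)
# Feed rows[1:] (with rows[0] as the column names) to pandas.DataFrame
# to get the dataframes you were after.
```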
