Scraping Google news search - python

I am trying to get the number of results from a Google News search for a specific day. In a browser this is easy: do a Google search, click the "News" tab, click "Tools", change the time period to the date you want, then click "Tools" again to see a count of how many stories it found.
The start and end dates can be seen in the URL. For example here is a search for "stack overflow" over the past week - https://www.google.com/search?q=stack+overflow&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F3%2F2018%2Ccd_max%3A1%2F10%2F2018&tbm=nws
The problem is that when I request one of these URLs programmatically, it gives me the current results and ignores the date range I specify. I can change these parameters in my browser and the results change as expected; it just doesn't work programmatically.
I have tried several ways in both Python and C#, always with the same results.
For example -
import requests
response = requests.get('https://www.google.com/search?q=stack+overflow&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F10%2F2018&tbm=nws')
print(response.content)

I finally found a working method using a headless web browser and Selenium. I suppose it has something to do with a simple request not being able to get the content generated by JavaScript. I would still be interested in hearing an explanation or other ways to do this, though.

Related

Why does this search URL redirect to a different search URL when copied and pasted?

Web-scraping adjacent question about URLs acting whacky.
If I go to Glassdoor job search and enter 6 fields (Austin, "engineering manager", full-time, exact city, etc.), I get a results page with 38 results. This is the link I get. Ideally I'd like to save this link with its search criteria and reference it later.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00
However, if I copy that exact link and paste it into a new tab, it doesn't act as desired.
It redirects to this different link, maintaining some of the criteria but losing the location criteria, bringing up thousands of results from around the country instead of just Austin.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&fromAge=30&radius=0&minRating=4.0
I understand I could use Selenium to select all 6 fields; I'd just like to understand what's going on here and know if there is a solution involving just a URL.
The change of URL seems to happen on the server handling the request. I would guess the server-side endpoint is configured to trim out extra parameters and redirect you to another URL. There's nothing you can do about this: however you pass it, it will always resolve into the second URL format.
I have also tried a URL shortener, but the same behavior persists.
The only way around this is to use automation such as Selenium to select the fields and display the same results as the first URL.
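To see exactly which criteria get lost, the two query strings can be compared with the standard library. This is only a diagnostic sketch (the helper names are made up); it shows what the redirect drops and does not explain why the server drops it.

```python
from urllib.parse import parse_qs, urlsplit

# The two URLs quoted in the question.
ORIGINAL = (
    "https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22"
    "&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime"
    "&fromAge=30&radius=0&minRating=4.00"
)
REDIRECTED = (
    "https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22"
    "&fromAge=30&radius=0&minRating=4.0"
)


def query_params(url):
    """Return the query string of url as a dict of name -> list of values."""
    return parse_qs(urlsplit(url).query)


def dropped_params(before, after):
    """Names present in the first URL's query but missing from the second."""
    return sorted(set(query_params(before)) - set(query_params(after)))


if __name__ == "__main__":
    print(dropped_params(ORIGINAL, REDIRECTED))
    print(query_params(ORIGINAL)["locT"])
```

One thing this makes visible: the original link contains a second "?" (locT=C?jobType=fulltime), so a standard parser reads jobType as part of the locT value rather than as its own parameter. Whether that has anything to do with the redirect is a guess, not something confirmed here.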

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data, as well as their info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page, it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
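As a rough illustration of "selecting the table and its child elements", here is a sketch that pulls the cells out of a table by its id using only the standard library. The TableById class and the sample HTML are invented for the example; only the id contracts comes from the page. In practice BeautifulSoup or pandas.read_html would likely be more convenient.

```python
from html.parser import HTMLParser


class TableById(HTMLParser):
    """Collect the cell text of the <table> whose id matches table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0        # > 0 while inside the target table
        self.in_cell = False  # True while inside a <td> or <th>
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth:
                self.depth += 1  # nested table inside the target
            elif dict(attrs).get("id") == self.table_id:
                self.depth = 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th"):
            self.in_cell = True
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up stand-in for the real page: navigation, the target table, another table.
SAMPLE_HTML = """
<div id="nav">Home</div>
<table id="contracts"><tr><th>Player</th><th>Salary</th></tr>
<tr><td>Player A</td><td>$1</td></tr></table>
<table id="other"><tr><td>ignore</td></tr></table>
"""


def extract_table(html_text, table_id):
    parser = TableById(table_id)
    parser.feed(html_text)
    return parser.rows


if __name__ == "__main__":
    print(extract_table(SAMPLE_HTML, "contracts"))
```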
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the get request on the terminal won't be very helpful, as the HTML page content returned is long - your terminal will truncate the printed response. I'm assuming in your case maybe the website has parts of the homepage reused in other pages as well, so it might get confusing.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
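A minimal sketch of that suggestion follows. The helper names and the output file name page.html are arbitrary choices for illustration; requests is imported inside the function so the file-writing part works even where it is not installed.

```python
from pathlib import Path


def save_html(html_text, path="page.html"):
    """Write html_text to path and return the absolute path of the file."""
    out = Path(path)
    out.write_text(html_text, encoding="utf-8")
    return out.resolve()


def fetch_and_save(url, path="page.html"):
    """Download url and save the body so it can be opened in a browser."""
    import requests  # third-party; assumed to be installed

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return save_html(response.text, path)


if __name__ == "__main__":
    saved = fetch_and_save("https://www.basketball-reference.com/contracts/IND.html")
    print("Open this file in a browser:", saved)
```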

Script cannot fetch data from a web page

I am trying to write a program in Python that can take the name of a stock and its price and print it. However, when I run it, nothing is printed. It seems like the data is not being fetched from the website. I double-checked that the path from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests
page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
Here is the website I am trying to get the data from.
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page itself you see when you manually visit the website. It seems that the website was smart enough to see that your request to this URL was from a script and not manually from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special Captcha page. I get the same when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
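Before blaming the XPath, it can help to check what kind of page actually came back. Below is a small heuristic sketch; the function name diagnose and its return labels are made up, and looking for the word "captcha" is only a guess at how a block page identifies itself.

```python
def diagnose(page_text, expected_marker="priceText__1853e8a5"):
    """Classify a response body with a crude text check.

    Returns "ok" if the expected class name is present, "captcha" if the
    page mentions a captcha instead, and "marker-missing" otherwise.
    """
    lowered = page_text.lower()
    if expected_marker.lower() in lowered:
        return "ok"
    if "captcha" in lowered:
        return "captcha"
    return "marker-missing"


if __name__ == "__main__":
    # Hypothetical response bodies, for illustration only.
    print(diagnose('<span class="priceText__1853e8a5">7,700.00</span>'))
    print(diagnose("<html><title>Please complete the CAPTCHA</title></html>"))
    print(diagnose("<html><body>Something else entirely</body></html>"))
```

With a real response you would pass page.text (or page.content.decode()) from the requests call in the question.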

scrape data from URL into pandas

I am trying to scrape data from a URL. The data is not in HTML tables, so pandas.read_html() is not picking it up.
The URL is:
https://www.athlinks.com/event/1015/results/Event/638761/Course/988506/Results
The data I'd like to get is a table of gender, age, and time for the past 5k races (name is not really important). The web page presents the data 50 rows at a time, across around 25 pages.
It uses various JavaScript frameworks for the UI (Node.js, React). I found this out using the "What Runs" add-on in the Chrome browser.
Here's the real reason I'd like to get this data. I'm a new runner and will be participating in this 5k next weekend, and I would like to explore some of the distribution statistics for past races (it's an annual race, and the data goes back to the 1980s).
Thanks in advance!
The data comes from socket.io, and there are python packages for it. How did I find it?
If you open Network panel in your browser and choose XHR filter, you'll find something like
https://results-hub.athlinks.com/socket.io/?EIO=3&transport=polling&t=MYOPtCN&sid=5C1HrIXd0GRFLf0KAZZi
Look at the content of those responses; it is what we need.
Luckily, this site has source maps.
Now you can go to More tools -> Search and search for this domain.
Then find resultsHubUrl in settings.
This property is used inside setUpSocket.
And setUpSocket is used inside IndividualResultsStream.js and RaseStreams.js.
Now you can press CMD + P and dig into these files.
So... I spent around five minutes finding it. You can go ahead! Now you have all the necessary tools. Feel free to use breakpoints and read more about Chrome developer tools.
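The polling URL quoted above can also be picked apart to see what a Python client would have to match. This is a sketch with a made-up helper name; as far as I know, EIO is the Engine.IO protocol revision, t is a cache-busting value, and sid is a per-session id handed out by the server, so a copied sid should not be expected to keep working.

```python
from urllib.parse import parse_qs, urlsplit

# The URL as it appeared in the Network panel.
POLLING_URL = (
    "https://results-hub.athlinks.com/socket.io/"
    "?EIO=3&transport=polling&t=MYOPtCN&sid=5C1HrIXd0GRFLf0KAZZi"
)


def describe_polling_url(url):
    """Extract the connection details a socket.io client would need."""
    parts = urlsplit(url)
    params = {name: values[0] for name, values in parse_qs(parts.query).items()}
    return {
        "host": parts.netloc,
        "path": parts.path,
        "engineio_version": int(params["EIO"]),
        "transport": params["transport"],
        "session_id": params.get("sid"),  # issued per session by the server
    }


if __name__ == "__main__":
    for key, value in describe_polling_url(POLLING_URL).items():
        print(f"{key}: {value}")
```

It is worth checking that whichever Python socket.io package you pick supports that protocol revision.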
You actually need to render the JS in a browser engine before crawling the generated HTML. Have you tried https://github.com/scrapinghub/splash, https://github.com/miyakogi/pyppeteer, or https://www.npmjs.com/package/spa-crawler ? You can also inspect the page (F12 -> Network) while it is loading the data relevant to you (from a RESTful API, I suppose), and then make the same calls from the command line using curl or the requests Python library.

How to get elements of webpage that load after initial webpage?

I'm trying to download stock option data from Yahoo Finance (here's Google as an example) with requests.get, which doesn't seem to be downloading everything. I'm trying to get the dropdown of dates with an XPath but even //option doesn't return anything even though Chrome DevTools says there are 13 instances!
I expect this has something to do with the fact that the parts of the site that actually matter are being loaded after all the navigation bars and such, and I don't know how to get all of it. Could you please suggest a method for getting the text of each item in the date dropdown menu?
If you open the dev console and refresh the page again (caches might need to be purged), you can see some requests with type xhr.
They are usually initiated by JavaScript and load data in addition to what the HTML body provides.
That's what you can look into.
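To confirm that the dropdown really is missing from what requests.get returns (rather than the XPath being wrong), you can count the tags in the raw HTML. A sketch using only the standard library; the two HTML snippets are invented stand-ins for the downloaded page and the page as DevTools shows it after scripts have run.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tag(html_text, tag):
    """Return the number of <tag> elements opened in html_text."""
    parser = TagCounter()
    parser.feed(html_text)
    return parser.counts[tag]


# Hypothetical examples: the static HTML has an empty <select> that a script
# fills in later; the rendered DOM contains the <option> elements.
STATIC_HTML = "<html><body><select id='dates'></select><script src='app.js'></script></body></html>"
RENDERED_HTML = "<html><body><select id='dates'><option>Jan 19</option><option>Jan 26</option></select></body></html>"

if __name__ == "__main__":
    print("static:", count_tag(STATIC_HTML, "option"))
    print("rendered:", count_tag(RENDERED_HTML, "option"))
```

If the count is zero for the text you downloaded, the options are being filled in by one of those XHR requests, and that request is the thing to reproduce.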
