I don't have much experience scraping data from websites. I normally use Python "requests" and "BeautifulSoup".
I need to download the table from here https://publons.com/awards/highly-cited/2019/
I do the usual with right click and Inspect, but the format is not the one I'm used to working with. I did a bit of reading and it seems to be JavaScript, where I could potentially extract the data from https://publons.com/static/cache/js/app-59ff4a.js. I read other posts that recommend Selenium and PhantomJS, but I can't modify the paths as I'm not an admin on this computer (I'm using Windows). Any idea on how to tackle this? Happy to go with R if Python isn't an option.
Thanks!
If you monitor the web traffic via dev tools, you will see the API calls the page makes to update content. The info returned is in JSON format.
For example: page 1
import requests
r = requests.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
You can alter the page param in a loop to get all results.
The total number of results is already indicated in the first response via r['count'], so it's easy enough to calculate the number of pages to loop over at 10 results per page. Just be sure to be polite in how you make your requests.
Outline:
import math, requests
with requests.Session() as s:
    r = s.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
    # do something with the JSON: parse items of interest into a list and add to a final list? Convert to a dataframe at the end?
    number_pages = math.ceil(r['count'] / 10)
    for page in range(2, number_pages + 1):
        # perhaps have a delay after X requests
        r = s.get(f'https://publons.com/awards/api/2019/hcr/?page={page}&per_page=10').json()
        # do something with the JSON, as above
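If it helps, here is a fuller sketch under one assumption: that each page's records sit under a key such as 'results' (check the actual JSON in dev tools and adjust the field name accordingly):

import math
import time
import requests
import pandas as pd

base = 'https://publons.com/awards/api/2019/hcr/?page={}&per_page=10'
records = []

with requests.Session() as s:
    r = s.get(base.format(1)).json()
    records.extend(r.get('results', []))  # 'results' is an assumption; inspect the real payload
    number_pages = math.ceil(r['count'] / 10)
    for page in range(2, number_pages + 1):
        time.sleep(1)  # be polite between requests
        r = s.get(base.format(page)).json()
        records.extend(r.get('results', []))

df = pd.DataFrame(records)  # columns follow whatever fields the JSON exposes
df.to_csv('hcr_2019.csv', index=False)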
I'm trying to scrape all pages with different ids from a site whose URLs are formatted like url.com/page?id=1, but there are millions of ids, so even at 1 request per second it will take weeks to get them all.
I am a total noob at this, so I was wondering if there is a better way than going one by one, such as some kind of bulk request, or whether I should just increase the requests per second to whatever I can get away with.
I am currently using requests and BeautifulSoup in Python to scrape the pages.
The grequests library is one possible approach you could take. The results are returned in the order they are obtained (which is not the same order as event_list).
import grequests
event_list = [grequests.get(f"http://url.com/page?id={req_id}") for req_id in range(1, 100)]
for r in grequests.imap(event_list, size=5):
    print(r.request.url)
    print(r.text[:100])
    print()
Note: You are likely to be blocked if you attempt this on most sites. A better approach would be to see whether the website has a suitable API you could use to obtain the same information; you can often find one by watching your browser's network tools while using the site.
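If no API exists and concurrent requests get you blocked, a slower fallback is a single throttled session. This is only a sketch; the id range, delay, and parsing step are placeholders:

import time
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    for req_id in range(1, 100):  # placeholder range; the real site has millions of ids
        r = s.get(f"http://url.com/page?id={req_id}")
        if r.status_code != 200:
            print(f"id {req_id} returned {r.status_code}")
            continue
        soup = BeautifulSoup(r.text, 'html.parser')  # parse whatever fields you need here
        time.sleep(1)  # fixed delay; tune it to whatever the site tolerates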
I'm using the method from the post linked below to scrape Instagram profiles.
Can I change the number of images I retrieve? In the JSON response I saw the 'has_next_page' parameter, but I'm not sure how to use it.
Thanks in advance.
Post link:
What is the new instagram json endpoint?
Used code:
import json
import re
import requests
from bs4 import BeautifulSoup

profile = 'someprofile'  # the profile name to scrape
r = requests.get('https://www.instagram.com/' + profile + '/')
soup = BeautifulSoup(r.content, 'html.parser')
scripts = soup.find_all('script', type="text/javascript",
                        text=re.compile('window._sharedData'))
stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
data = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
You can find the Instagram API here: https://www.instagram.com/developer/
The documentation is pretty neat, I think; you just have to register to get an access token.
Your problem is the following: In your code you scrape data from the profile page, which means you only get the images which have been loaded already.
That's why you can't just set a larger number for it to get you more images.
I'd recommend one of the following:
1. Use Instagram's API, which comes with already built methods to do exactly what you seem to want to achieve (don't reinvent the wheel).
2. If instead you want to do most of the work yourself (let's say as an exercise), I'd recommend that you use Selenium, which is a browser automation tool.
In your code you use BeautifulSoup, which is great for retrieving data from HTML, but you need to do something more: scroll, so that more pictures are allowed to load. This way you can get as many pictures as you like (a rough sketch follows below).
In case you need an example, you can check out something similar I wrote for Twitter here.
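As a rough sketch of the scrolling approach in option 2, assuming Chrome and chromedriver are available (the profile name and number of scrolls are placeholders):

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.instagram.com/someprofile/')  # placeholder profile

for _ in range(5):  # each scroll lets the page load another batch of images
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the new content time to load

html = driver.page_source  # now contains the extra images; parse it with BeautifulSoup as before
driver.quit()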
Currently, I am trying to gather data from my realtor from the listings she sends me. It always comes through a link from the main site "http://v3.torontomls.net". I think only realtors can go into this site and filter on houses, but when she sends it to me I can see a list of houses.
I am wondering if it is possible to create a Python script that:
1) opens Gmail
2) filters on her emails
3) opens one of her emails
4) clicks on the link
5) Scrapes the house data into a CSV format
I am not sure about the feasibility of this; I have never used Python to scrape web pages. I can see that step 5 is doable, but how do I go about steps 1 to 4?
Yes, this is possible, but you need to do some requirements gathering beforehand to determine which parts of the process can be eliminated. For instance, if your realtor is sending you the same link each time, you can just target that web address directly. If the link changes but is parameterized by month, for instance, you can just adjust the web address each month when you want to process the results.
To make the requests, I would suggest using the requests package along with bs4 (BeautifulSoup 4) to target elements. For creating CSV files, you may choose to use csv, but there are many alternatives if you require something that's more specific to your use case.
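A minimal sketch of steps 4 and 5, assuming you already have the listing URL from the email; the URL, CSS selectors, and column names below are placeholders you would replace after inspecting the real page:

import csv
import requests
from bs4 import BeautifulSoup

url = 'http://v3.torontomls.net/some-listing-link'  # placeholder; paste the link from the email here
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

rows = []
for listing in soup.select('.listing'):  # placeholder selector; inspect the real markup
    address = listing.select_one('.address').get_text(strip=True)  # placeholder selectors
    price = listing.select_one('.price').get_text(strip=True)
    rows.append([address, price])

with open('listings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['address', 'price'])
    writer.writerows(rows)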
I am trying to grab information from the Chicago Transit Authority bustracker website. In particular, I would like to quickly output the arrival ETAs for the top two buses. I can do this rather easily with Splinter; however I am running this script on a headless Raspberry Pi model B and Splinter plus pyvirtualdisplay results in a significant amount of overhead.
Something along the lines of
from bs4 import BeautifulSoup
import requests
url = 'http://www.ctabustracker.com/bustime/eta/eta.jsp?id=15475'
r = requests.get(url)
s = BeautifulSoup(r.text,'html.parser')
does not do the trick. All of the data fields are empty (well, they contain &nbsp;). For example, when the page looks like this:
the code snippet s.find(id='time1').text gives me u'\xa0', whereas the analogous search with Splinter gives "12 MINUTES".
I'm not wedded to BeautifulSoup/requests; I just want something that doesn't require the overhead of Splinter/pyvirtualdisplay since the project requires that I obtain a short list of strings (e.g. for the image above, [['9','104th/Vincennes','1158','12 MINUTES'],['9','95th','1300','13 MINUTES']]) and then exits.
The bad news
So the bad news is the page you are trying to scrape is rendered via Javascript. Whilst tools like Splinter, Selenium, PhantomJS can render this for you and give you the output to easily scrape, Python + Requests + BeautifulSoup don't give you this out of the box.
The good news
The data pulled in from the Javascript has to come from somewhere, and usually this will come in an easier to parse format (as it's designed to be read by machines).
In this case your example loads this XML.
Now, an XML response isn't as nice to work with as JSON, so I'd recommend reading this answer about integrating XML parsing with the requests library. But it will still be a lot more lightweight than Splinter.
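As a rough sketch of parsing such an XML response with the standard library; the endpoint URL and tag names below are placeholders, so take the real ones from the request you see in the network tab:

import requests
import xml.etree.ElementTree as ET

# placeholder URL: use the XML request the eta.jsp page makes, visible in the browser's network tab
url = 'http://www.ctabustracker.com/bustime/eta/some-xml-endpoint?id=15475'
r = requests.get(url)

root = ET.fromstring(r.content)
etas = []
for prediction in root.iter('prediction'):  # placeholder tag names throughout
    route = prediction.findtext('route')
    destination = prediction.findtext('destination')
    minutes = prediction.findtext('minutes')
    etas.append([route, destination, minutes])

print(etas[:2])  # e.g. the top two arrivals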
I am new to Python. I am trying to scrape data from a website, and the data I want cannot be seen via View > Source in the browser; it comes from another file. Is it possible to scrape the actual data on the screen with BeautifulSoup and Python?
example site www[dot]catleylakeman[dot]co(dot)uk/cds_banks.php
If not, is this possible using another route?
Thanks
The "other file" is http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369145707664 - you can find this out (and I suspect you already have) by using chrome's developer tools, network tab (or the equivalent in your browser).
This format is easier to parse than the final HTML would be; generally, HTML scrapers should be used as a last resort, when the website does not publish raw data like the above.
My guess is, the url you are actually looking for is:
http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122
I found it using the developer toolbar and looking at the network traffic (built in to Chrome and Firefox, also available via Firebug). It gets called in with Ajax. You do not even need Beautiful Soup to parse that one, as it seems to be a long string separated with *| and sometimes **|. The following should get you initial access to that data:
import urllib2
f = urllib2.urlopen('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
try:
    data = f.read().split('*|')
finally:
    f.close()

print data
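If the **| markers separate records and *| separates fields within a record (an assumption worth checking against the raw response), a follow-up sketch in Python 3 with requests might look like this:

import requests

url = 'http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122'
raw = requests.get(url).text

# assumption: '**|' splits records, '*|' splits fields within each record
records = [chunk.split('*|') for chunk in raw.split('**|') if chunk]

for record in records[:5]:
    print(record)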