I'm trying to use Python 3 requests.get to retrieve data from this page using its API. I'm interested in retrieving it using the data found here and saving the entire table into my own JSON.
Here's my attempt so far:
import json
import requests

source = requests.get("https://www.mwebexplorer.com/api/mwebblocks").json()
with open('mweb.json', 'w') as json_file:
    json.dump(source, json_file)
I've looked through other questions in regards to pagination and all the other problems are able to write for loops to iterate through all the pages, but in my specific case, the link does not change when clicking next to go to the next page of data. I also can't use scrapy's xpath method to click next due to the entire table and its pagination not being accessible through HTML or XML.
Is there something I can add to my requests.get to include the entire JSON of all pages of the table?
Depending on what browser you're using it might be different, but in Chrome I can go to the Network tab in DevTools and view the full details of the request. This reveals that it's actually a POST request, not a GET request. If you look at the payload, you can see a bunch of key-value pairs, including a start and a length.
So, try something like
requests.post("https://www.mwebexplorer.com/api/mwebblocks", data={"start": "50", "length": "50"})
or similar. You might need to include the other parts of the form data, depending on the response you get.
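A minimal sketch of paging through the table this way, assuming the endpoint accepts start and length as form fields and returns JSON with the rows under a "data" key. The "data" key, the stop condition (an empty page) and the fetch_all_rows helper are assumptions for illustration, not something confirmed for this site, so check the real response in the Network tab first.

```python
def fetch_all_rows(session, url, page_size=50, max_pages=1000):
    """Collect rows from a start/length paginated POST endpoint.

    `session` is anything with a requests-style .post() method,
    for example requests.Session().
    """
    rows = []
    # Hard cap on pages so a server that ignores `start` cannot loop forever.
    for page in range(max_pages):
        payload = {"start": str(page * page_size), "length": str(page_size)}
        resp = session.post(url, data=payload, timeout=30)
        resp.raise_for_status()
        # Assumption: the rows live under "data"; adjust to the real response.
        batch = resp.json().get("data", [])
        if not batch:
            break
        rows.extend(batch)
    return rows


# Usage (not run here because it hits the network):
#   import json, requests
#   rows = fetch_all_rows(requests.Session(),
#                         "https://www.mwebexplorer.com/api/mwebblocks")
#   with open("mweb.json", "w") as json_file:
#       json.dump(rows, json_file)
```

If the server rejects the short payload, the other form fields seen in the browser's request would need to be merged into payload as well.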
Keep in mind that sites frequently don't like it when you try to scrape them like this.
Related
A web-scraping-adjacent question about URLs acting wacky.
If I go to Glassdoor job search and enter 6 fields (Austin, "engineering manager", full-time, exact city, etc.), I get a results page with 38 results. This is the link I get. Ideally I'd like to save this link with its search criteria and reference it later.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00
However, if I copy that exact link and paste it into a new tab, it doesn't act as desired.
It redirects to this different link, maintaining some of the criteria but losing the location criteria, bringing up thousands of results from around the country instead of just Austin.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&fromAge=30&radius=0&minRating=4.0
I understand I could use selenium to select all 6 fields, I'd just like to understand what's going on here and know if there is a solution involving just using a URL.
The change of URL seems to happen on the server that is handling the request. I would think the server-side endpoint is configured to trim out extra parameters and redirect you to another URL. There's nothing you can do about this, since however you pass it, it will always resolve into the second URL format.
I have also tried a URL shortener, but the same behavior persists.
The only way around this is to use automation such as Selenium to reproduce the same behaviour, selecting the fields and displaying the results from the first URL.
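To see exactly which parameters get lost in the redirect, one option is to parse both query strings and compare them. This is only a sketch using the standard-library urllib.parse; diff_query_params is a made-up helper name, and the two URLs are the ones quoted in the question.

```python
from urllib.parse import parse_qs, urlsplit


def diff_query_params(original_url, redirected_url):
    """Return (dropped, changed) query parameters between two URLs."""
    before = parse_qs(urlsplit(original_url).query, keep_blank_values=True)
    after = parse_qs(urlsplit(redirected_url).query, keep_blank_values=True)
    # Parameters present before the redirect but missing afterwards.
    dropped = sorted(k for k in before if k not in after)
    # Parameters present in both URLs whose values differ.
    changed = {k: (before[k], after[k])
               for k in before if k in after and before[k] != after[k]}
    return dropped, changed


original = ("https://www.glassdoor.com/Job/jobs.htm"
            "?sc.generalKeyword=%22engineering+manager%22"
            "&sc.locationSeoString=austin&locId=1139761"
            "&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00")
redirected = ("https://www.glassdoor.com/Job/jobs.htm"
              "?sc.generalKeyword=%22engineering+manager%22"
              "&fromAge=30&radius=0&minRating=4.0")

dropped, changed = diff_query_params(original, redirected)
print(dropped)   # ['locId', 'locT', 'sc.locationSeoString']
print(changed)   # {'minRating': (['4.00'], ['4.0'])}
```

Note that jobType never shows up as its own parameter: in the original URL it is preceded by ? rather than &, so it is parsed as part of the locT value ("C?jobType=fulltime").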
I'm trying to use the requests module to retrieve data from this website:
https://toelatingen.ctgb.nl/
I want to receive the found data when I put "11462" into the "Zoekterm" field, for example.
import requests

data = {"searchTerm": "11462"}
session = requests.Session()
r = session.post('https://toelatingen.ctgb.nl/', data=data)
body_data = r.text
The content of the body_data does not, unfortunately, contain the information searched for.
Thanks for helping me.
The reason you're not getting the response data is that the site doesn't do the search at that URL. Instead it makes a call to https://toelatingen.ctgb.nl/nl/admissions/overview.
When you're trying to get information off the internet the first thing you want to do is check how your web browser is getting the data. If you open up whatever inspection tool comes with your browser of choice (typically the hotkey is ctrl+shift+i), you should be able to find a Network tab that tracks the requests and responses the browser makes. Once you have that open, get your browser to display the information you want and watch the Network Tab while it's doing so. Check whatever responses come up to find the one that has the information you want and then replicate the request your browser used.
In your case:
The root page loads an empty page first from https://toelatingen.ctgb.nl/
It then loads a bunch of static files (mostly woff and js; these are used for styling the webpage and handling different procedures)
Then it makes a call to https://toelatingen.ctgb.nl/nl/admissions/overview. We can be pretty sure this is the call we want at this point because the response is JSON which contains the information that we see displayed on the screen.
We then copy out all the information (headers and forms, line for line) from that request, plug it in, and see if the requests module returns the same JSON.
If it doesn't, then that most likely means we're missing something (most often a CSRF token or a special Accept-Encoding) and we need to do some more tinkering.
I would also recommend taking a little bit of time to prune out parts of the request data/headers: most of the time they contain extra terms that the server doesn't actually need. This will save space and give you a better idea of what parts of the request you can change.
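The pruning step can be done by hand, but here is a rough sketch of automating it: drop one header at a time and keep it out if the response doesn't change. prune_headers and the send callback are hypothetical names; send is whatever function replays your copied request (whether that is a GET or a POST for this site is something to read off the Network tab, not something assumed here).

```python
def prune_headers(send, headers):
    """Drop headers one at a time, keeping only those the response depends on.

    `send(headers)` must perform the request and return something that can
    be compared with ==, for example the decoded JSON body.
    """
    baseline = send(headers)
    needed = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in needed.items() if k != name}
        try:
            same = send(trial) == baseline
        except Exception:
            same = False  # an error without this header means it is needed
        if same:
            needed = trial
    return needed


# Usage (not run here because it hits the network; the method is an assumption):
#   import requests
#   def send(h):
#       return requests.get("https://toelatingen.ctgb.nl/nl/admissions/overview",
#                           headers=h, timeout=30).json()
#   minimal = prune_headers(send, copied_headers)
```

This only works when the response is stable between calls; if the body contains timestamps or similar, compare just the part you care about inside send. The same idea applies to form fields.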
As we all know, web applications have a GET method and a POST method.
My problem is with POST data.
For example, I want my Python code to access a website's search bar by inserting some values and submitting them (the website's button), and then check the resulting page.
What would the code look like, and is there any documentation about these Python concepts?
I am totally confused.
Note: I am just a beginner in Python.
If the website relies on JavaScript, you're going to need to use something like Selenium, which will emulate a typical browser and allow you to insert information onto a page and execute JavaScript commands.
If, however, the search bar simply posts data to a URL, you can determine that URL and then use requests to post the data and retrieve the result.
import requests
resp = requests.post('http://website/search', data={'term': 'value'})
I come from a world of scientific computing and number crunching.
I am trying to interact with the internet to compile data so I don't have to. One task is to auto-fill searches on Marriott.com so I can see what the best deals are all on my own.
I've attempted something simple like
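# Python 2 code (urllib2 does not exist in Python 3)
import urllib
import urllib2

url = "http://marriott.com"
values = {'Location': 'New York'}
data = urllib.urlencode(values)  # form-encode the dictionary
website = urllib2.Request(url, data)  # passing data makes this a POST
response = urllib2.urlopen(website)
stuff = response.read()
# Save the response body so it can be inspected in a browser
with open('test.html', 'w') as f:
    f.write(stuff)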
My questions are the following:
How do you know how the website receives information?
How do I know a simple "Post" will work?
If it is simple, how do I know what the names of the dictionary should be for "Values?"
How to check if it's working? The write lines at the end are an attempt for me to see if my inputs are working properly but that is insufficient.
You may also have a look at splinter for cases where urllib is not enough (JS, AJAX, etc.)
For finding out the form parameters, Firebug could be useful.
You need to read and analyze the HTML code of the site in question. Every browser has decent tools for introspecting the DOM of a site and analyzing the network traffic and requests.
Usually you want to use the mechanize module for automated interactions with a web site. There is no guarantee that this will work in every case. Nowadays many websites use AJAX or more complex client-side programming, which makes it hard to "emulate" a human user using Python.
Apart from that: the marriott.com site does not contain an input field "Location", so are you guessing URL parameters without having analyzed their forms and functionality?
What I do to check is use a web-debugging proxy to view the request being sent.
First send a real request with your browser and compare it to the request that your script sends. Try to make the two requests match.
What I use for this is Charles Proxy.
Another way is to open the HTML file you saved (in this case test.html) in your browser and compare it to the actual response.
To find out what the dictionary should contain, look at the page source and find the names of the form fields you're trying to fill. In your case the "Location" should actually be "destinationAddress.destination".
Here is a picture:
So look in the HTML code to get the names of the form fields, and that is what should be in the dictionary. I know that Google Chrome and Mozilla Firefox both have tools to view the structure of the HTML (in the picture I used Inspect Element in Google Chrome).
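If you'd rather not read the source by eye, here is a rough Python 3 sketch that lists each form's action, method and field names using the standard-library html.parser. FormFieldLister and list_form_fields are made-up names, and the sample markup is hypothetical (it is not the real Marriott page); only the field name comes from above.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect each form's action/method and the names of its fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action"),
                             "method": (attrs.get("method") or "get").lower(),
                             "fields": []}
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea"):
            # Only named fields inside a form are submitted with it.
            if self._current is not None and attrs.get("name"):
                self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_form_fields(html):
    parser = FormFieldLister()
    parser.feed(html)
    return parser.forms


# Hypothetical markup for illustration only.
sample = """
<form action="/search" method="POST">
  <input type="text" name="destinationAddress.destination">
  <input type="submit" value="Find">
</form>
"""
print(list_form_fields(sample))
```

The names in "fields" are the keys to use in the values dictionary, and "action" is the URL to send it to. This only sees forms present in the saved HTML, not ones built later by JavaScript.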
For more info on urllib2, read here.
I really hope this helps :)
If you visit this link right now, you will probably get a VBScript error.
On the other hand, if you visit this link first and then the above link (in the same session), the page comes through.
The way this application is set up, the first page is meant to serve as a frame in the second (main) page. If you click around a bit, you'll see how it works.
My question: How do I scrape the first page with Python? I've tried everything I can think of -- urllib, urllib2, mechanize -- and all I get is 500 errors or timeouts.
I suspect the answer lies with mechanize, but my mechanize-fu isn't good enough to crack this. Can anyone help?
It always comes down to the request/response model. You just have to craft a series of HTTP requests such that you get the desired responses. In this case, you also need the server to treat each request as part of the same session. To do that, you need to figure out how the server is tracking sessions. It could be a number of things, from cookies to hidden inputs to form actions, post data, or query strings. If I had to guess I'd put my money on a cookie in this case (I haven't checked the links). If this holds true, you need to send the first request, save the cookie you get back, and then send that cookie along with the second request.
It could also be that the initial page will have buttons and links that get you to the second page. Those links will have something like <A href="http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b="> where a lot of the gobbledygook is generated by the first page.
The "Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b=" part encodes some session information that you must get from the first page.
And, of course, you might even need to do both.
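A minimal sketch of the cookie case, assuming the session really is tracked with a cookie (which is only a guess above). fetch_in_session is a made-up helper, and the two URLs are placeholders for the main page and the framed page from the question. With the requests library, a Session object keeps the cookies from the first response and sends them with the next request.

```python
def fetch_in_session(session, setup_url, target_url, timeout=30):
    """Visit setup_url first so the server starts a session, then fetch target_url.

    `session` is anything with a requests-style .get() method,
    for example requests.Session().
    """
    first = session.get(setup_url, timeout=timeout)
    first.raise_for_status()
    # A requests.Session stores cookies set by the first response and
    # sends them automatically with the second request.
    second = session.get(target_url, timeout=timeout)
    second.raise_for_status()
    return second.text


# Usage (not run here because it hits the network; URLs are placeholders):
#   import requests
#   html = fetch_in_session(requests.Session(), MAIN_PAGE_URL, FRAME_PAGE_URL)
```

If the session information is carried in the links instead, the first response would have to be parsed for the generated href and that URL used as the target.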
You might also try BeautifulSoup in addition to Mechanize. I'm not positive, but you should be able to parse the DOM down into the framed page.
I also find Tamper Data to be a rather useful plugin when I'm writing scrapers.