I'm trying to use the requests module to retrieve data from this website:
https://toelatingen.ctgb.nl/
I want to receive the data that is found when I put "11462" in the "Zoekterm" field, for example.
data = { "searchTerm": "11462"}
session = requests.Session()
r = session.post('https://toelatingen.ctgb.nl/',data=data)
body_data = r.text
The content of the body_data does not, unfortunately, contain the information searched for.
Thanks for helping me.
The reason you're not getting the response data is that the site doesn't perform the search at that URL. Instead, it makes a call to https://toelatingen.ctgb.nl/nl/admissions/overview.
When you're trying to get information off the internet the first thing you want to do is check how your web browser is getting the data. If you open up whatever inspection tool comes with your browser of choice (typically the hotkey is ctrl+shift+i), you should be able to find a Network tab that tracks the requests and responses the browser makes. Once you have that open, get your browser to display the information you want and watch the Network Tab while it's doing so. Check whatever responses come up to find the one that has the information you want and then replicate the request your browser used.
In your case:
The root page first loads an empty page from https://toelatingen.ctgb.nl/.
It then loads a bunch of static files (mostly woff and js; these are used for styling the webpage and handling different procedures).
Then it makes a call to https://toelatingen.ctgb.nl/nl/admissions/overview. We can be pretty sure this is the call we want, because the response is JSON containing the information that we see displayed on the screen.
We then copy out all the information (headers and form data, line for line) from that request, plug it in, and see if the requests module returns the same JSON.
If it doesn't, that most likely means we're missing something (most often a CSRF token or a special Accept-Encoding) and we need to do some more tinkering.
I would also recommend taking a little bit of time to prune out parts of the request data/headers: most of the time they contain extra terms that the server doesn't actually need. This will save space and give you a better idea of what parts of the request you can change.
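As a rough sketch of what that might look like with requests (the payload key, the header, and the need for a prior GET are all assumptions; copy the real values from the Network tab):

import requests

session = requests.Session()
# Load the root page first in case the server sets cookies it expects back.
session.get('https://toelatingen.ctgb.nl/')

# The payload key and header below are guesses -- replace them with whatever
# the Network tab shows for the call to /nl/admissions/overview.
r = session.post(
    'https://toelatingen.ctgb.nl/nl/admissions/overview',
    data={'searchTerm': '11462'},
    headers={'X-Requested-With': 'XMLHttpRequest'},
)
print(r.json())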
I'm trying to use Python 3 requests.get to retrieve data from this page using its API. I'm interested in retrieving it using the data found here and saving the entire table into my own JSON.
Here's my attempt so far
import requests
import json

source = requests.get("https://www.mwebexplorer.com/api/mwebblocks").json()
with open('mweb.json', 'w') as json_file:
    json.dump(source, json_file)
I've looked through other questions about pagination, and in those cases the askers can write for loops to iterate through all the pages, but in my specific case the link does not change when clicking Next to go to the next page of data. I also can't use Scrapy's xpath method to click Next, because the table and its pagination aren't accessible through the HTML or XML.
Is there something I can add to my requests.get to include the entire JSON of all pages of the table?
Depending on which browser you're using it might look different, but in Chrome I can go to the Network tab in DevTools and view the full details of the request. This reveals that it's actually a POST request, not a GET request. If you look at the payload, you can see a bunch of key-value pairs, including a start and a length.
So, try something like
requests.post("https://www.mwebexplorer.com/api/mwebblocks", data={"start": "50", "length": "50"})
or similar. You might need to include the other parts of the form data, depending on the response you get.
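As a rough sketch of paging through by bumping start until the server stops returning rows (the "data" key is an assumption based on typical DataTables-style responses; check the actual JSON you get back):

import json
import requests

all_rows = []
start, length = 0, 50

while True:
    resp = requests.post(
        "https://www.mwebexplorer.com/api/mwebblocks",
        data={"start": str(start), "length": str(length)},
    )
    rows = resp.json().get("data", [])  # "data" is an assumption -- check the real key
    if not rows:
        break
    all_rows.extend(rows)
    start += length

with open("mweb.json", "w") as json_file:
    json.dump(all_rows, json_file)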
Keep in mind that sites frequently don't like it when you try to scrape them like this.
I need to get some numbers from this website
http://www.preciodolar.com/
But the data I need takes a little time to load, and the page shows a 'wait' message until it has completely loaded.
I used findall and some regular expressions to get the data I need, but when I run it, Python gives me the 'wait' message that appears before the data loads.
Is there a way to make python 'wait' until all data is loaded?
My code looks like this:
import urllib.request
from re import findall
def divisas():
    pag = urllib.request.urlopen('http://www.preciodolar.com/')
    html = str(pag.read())
    brasil = findall('<td class="usdbrl_buy">(.*?)</td>', html)
    return brasil
This is because the page is generated with JavaScript. You're getting the full HTML, but the JavaScript handles changing the DOM and showing the information.
You have two options:
Try to interpret the JavaScript (not easy). There are a lot of questions about this on Stack Overflow already.
Find the URL the page is hitting with AJAX to get the actual data and use that.
It really just depends on what you need the page for. It looks like you are trying to parse the data, so the second option lets you make a single request to get just the raw data.
You should find the AJAX or JSONP request instead.
In this case, it's JSONP: http://api.preciodolar.com/api/crossdata.php?callback=jQuery1112024555979575961828_1442466073980&_=1442466073981
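A minimal sketch of using that endpoint with urllib; it assumes the server accepts an arbitrary callback name and wraps plain JSON in it, so verify against the actual response:

import json
import urllib.request

url = ('http://api.preciodolar.com/api/crossdata.php'
       '?callback=mycallback&_=1442466073981')
raw = urllib.request.urlopen(url).read().decode('utf-8')

# Strip the "mycallback(...)" JSONP wrapper to leave the JSON payload.
payload = raw[raw.index('(') + 1:raw.rindex(')')]
data = json.loads(payload)
print(data)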
I am trying to read a sudoku through a URL and want to open the same page that contains the sudoku I've just read through the URL.
I am able to read sudokus by reading the "show.websudoku.com" URL, but when I try to open the same page I've just read, I get a different sudoku than the one I read through the URL, because the site refreshes the sudoku each time. I looked at some cookie libraries but do not understand how to use them. Should I use cookies (and how, in this case) or is there something easier?
import urllib.request
import webbrowser

url = "http://show.websudoku.com/"
response = urllib.request.urlopen(url)
webbrowser.open(url)  ## This page displays a different sudoku than print(html)
html = response.read()
print(html)
I'm not sure there is such a mechanism on this website to enable fetching the same sudoku over and over again. The decision of which sudoku to send is taken server-side.
One solution to your problem would be to save the output of the first page and access that data over and over again to your liking.
If your goal is only to validate your finished sudoku, you don't need to reload the page, just send the right form values. Get them from the first page, change the necessary fields and issue a POST request.
For this particular site, however, there is no need to query the server again: the answer is already included in the page. Look at the field with the id "cheat". It contains all the digits, organized by lines. And the field with the id "editmask" controls which digit is displayed (0) and which cell should be turned into an editable field (1).
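A minimal sketch of reading those two fields, assuming they are plain input elements whose id attribute comes before the value attribute; adjust the patterns after looking at the HTML you actually receive:

import re
import urllib.request

html = urllib.request.urlopen("http://show.websudoku.com/").read().decode("utf-8")

# These patterns assume markup like <input id="cheat" ... value="...">.
cheat = re.search(r'id="cheat"[^>]*value="(\d+)"', html).group(1)
editmask = re.search(r'id="editmask"[^>]*value="([01]+)"', html).group(1)

# cheat should hold the full solution digits; editmask marks editable cells (1).
for row in range(9):
    print(cheat[row * 9:(row + 1) * 9])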
I come from a world of scientific computing and number crunching.
I am trying to interact with the internet to compile data so I don't have to do it by hand. One task is to auto-fill searches on Marriott.com so I can see what the best deals are all on my own.
I've attempted something simple like
import urllib
import urllib2

url = "http://marriott.com"
values = {'Location': 'New York'}
data = urllib.urlencode(values)

website = urllib2.Request(url, data)
response = urllib2.urlopen(website)
stuff = response.read()

f = open('test.html', 'w')
f.write(stuff)
f.close()
My questions are the following:
How do you know how the website receives information?
How do I know a simple "Post" will work?
If it is simple, how do I know what the names of the dictionary should be for "Values?"
How do I check whether it's working? The write lines at the end are an attempt to see if my inputs are working properly, but that is insufficient.
You may also have a look at splinter for cases where urllib is not enough (JS, AJAX, etc.).
For finding out the form parameters, Firebug could be useful.
You need to read and analyze the HTML code of the relevant site. Every browser has decent tools for inspecting the DOM of a site and for analyzing its network traffic and requests.
Usually you want to use the mechanize module for performing automated interactions with a website. There is no guarantee that this will work in every case. Nowadays many websites use AJAX or more complex client-side programming, making it hard to "emulate" a human user with Python.
Apart from that: the marriott.com site does not contain an input field named "Location"... so you are guessing URL parameters without having analyzed its forms and functionality?
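If you do go the mechanize route, a minimal sketch might look like the following; the form index and field name are placeholders you would have to confirm by inspecting the page first:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # be aware of the site's robots.txt and terms

br.open("http://www.marriott.com/")

# Placeholders: pick the right form and field name after inspecting the page.
br.select_form(nr=0)
br["searchField"] = "New York"

response = br.submit()
print(response.read()[:500])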
What I do to check is use a web-debugging proxy to view the request you send.
First send a real request with your browser and compare that request to the request your script sends; try to make the two requests match.
What I use for this is Charles Proxy.
Another way is to open the HTML file you saved (in this case test.html) in your browser and compare it to the actual response.
To find out what the dictionary should contain, look at the page source and find the names of the form fields you're trying to fill. In your case, "Location" should actually be "destinationAddress.destination".
So look in the HTML code to get the names of the form fields; that is what should be in the dictionary. I know that Google Chrome and Mozilla Firefox both have tools to view the structure of the HTML (I used Inspect Element in Google Chrome).
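Tying that back to the code in the question, the main change is the key in the values dictionary; this is only a sketch, the URL below is a placeholder for whatever the form's action attribute points to, and the real form may require additional fields:

import urllib
import urllib2

url = "http://www.marriott.com/search"  # placeholder -- use the form's actual "action" URL
values = {'destinationAddress.destination': 'New York'}
data = urllib.urlencode(values)

response = urllib2.urlopen(urllib2.Request(url, data))
with open('test.html', 'w') as f:
    f.write(response.read())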
For more info on urllib2, read here.
I really hope this helps :)
If you visit this link right now, you will probably get a VBScript error.
On the other hand, if you visit this link first and then the above link (in the same session), the page comes through.
The way this application is set up, the first page is meant to serve as a frame in the second (main) page. If you click around a bit, you'll see how it works.
My question: How do I scrape the first page with Python? I've tried everything I can think of (urllib, urllib2, mechanize) and all I get is 500 errors or timeouts.
I suspect the answers lies with mechanize, but my mechanize-fu isn't good enough to crack this. Can anyone help?
It always comes down to the request/response model. You just have to craft a series of http requests such that you get the desired responses. In this case, you also need the server to treat each request as part of the same session. To do that, you need to figure out how the server is tracking sessions. It could be a number of things, from cookies to hidden inputs to form actions, post data, or query strings. If I had to guess I'd put my money on a cookie in this case (I haven't checked the links). If this holds true, you need to send the first request, save the cookie you get back, and then send that cookie along with the 2nd request.
It could also be that the initial page has buttons and links that get you to the second page. Those links will have something like <A href="http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b="> where a lot of the gobbledygook is generated by the first page.
The "Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b=" part encodes some session information that you must get from the first page.
And, of course, you might even need to do both.
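As a sketch of the cookie idea with the tools already tried (the first URL is a placeholder for the frame page from the question; the second is the example link above):

import cookielib
import urllib2

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Placeholders -- substitute the frame page and detail page from the question.
frame_url = "http://cad.chp.ca.gov/"
detail_url = "http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b="

# Hit the frame page first so the server can set its session cookie,
# then request the detail page with that cookie attached.
opener.open(frame_url).read()
page = opener.open(detail_url).read()
print(page[:500])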
You might also try BeautifulSoup in addition to Mechanize. I'm not positive, but you should be able to parse the DOM down into the framed page.
I also find Tamper Data to be a rather useful plugin when I'm writing scrapers.
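Building on the BeautifulSoup suggestion, here is a small sketch for pulling the frame URLs out of the outer page so you can request the framed page directly; the starting URL is a placeholder for the first link in the question, and it assumes the page uses frame or iframe tags:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://cad.chp.ca.gov/").read()  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

# List every frame/iframe source so you can fetch the framed page yourself.
for frame in soup.find_all(["frame", "iframe"]):
    print(frame.get("src"))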