As we all know, web applications use the GET method and the POST method. My problem is with POST data.
For example, I want my Python code to access a website's search bar, insert some value, submit it (click the website's button), and then check the resulting page.
What would that code look like, and is there any documentation on these Python concepts? I am totally confused.
Note: I am just a beginner in Python.
If the website relies on JavaScript, you're going to need to use something like Selenium, which will emulate a typical browser and allow you to insert information onto a page and execute JavaScript commands.
If, however, the search bar simply posts data to a URL, you can determine that URL and then use requests to post the data and retrieve the result.
resp = requests.post('http://website/search', data={'term': 'value'})
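If it turns out you do need the browser route, here is a minimal Selenium sketch of the same idea; the URL and the input's name attribute are placeholders to be replaced with the real ones from the page (newer Selenium releases would use find_element(By.NAME, ...) instead of the older helper shown here):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://website/")                      # placeholder URL

search_box = driver.find_element_by_name("term")   # placeholder name attribute
search_box.send_keys("value")
search_box.submit()                                # submits the form the box belongs to

print(driver.page_source[:500])                    # peek at the result page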
I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is, I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start extracting the price information from this website?
I know Python, but not familiar with websites/HTML that much. So I would appreciate if you explain the website related info like you are talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using JavaScript AFTER the page has loaded, so you only get the initial page and not the JavaScript-rendered parts. You can try to increase your sleep() time, but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver

browser = webdriver.Firefox()          # start a Firefox session
browser.get("http://example.com")      # load the page
html_source = browser.page_source      # HTML after JavaScript has run
With Selenium, you can also use an XPath to locate the element you want (the price information, in this case); you can see a tutorial on that here. Alternatively,
once you extract the HTML code, you can also use a parser such as bs4 (BeautifulSoup) to extract the required data.
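If you go the bs4 route, here is a small sketch of handing html_source from the snippet above to BeautifulSoup; the tag and class names are hypothetical, and you would need to inspect the page's HTML to find where the price actually lives:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, "html.parser")

# hypothetical selector; inspect the page to see which element holds the price
for tag in soup.find_all("span", class_="price"):
    print(tag.get_text(strip=True))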
I'm very new to web scraping. Using XPath selectors, I am trying to get data from this webpage: https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml
The problem is that whenever you change the date or the power plant name, the URL does not change, so when you fetch the response you always get the same, wrong answer. Is there a way to find the correct URL, or anything else related to the HTML markup, etc.?
For a scraping operation like this, you'll need to do a bit more than just load the document and then grab the content. The document in question relies on JavaScript to load new information from some other resource after the user has defined a particular set of parameters and updated the form.
After loading the document, you'll need to define your search parameters. You can do this via JavaScript injection or via your browser's console. For example, if you were trying to define the value for the first date field, you could use
document.querySelectorAll('#j_idt199 input')[1].value = "Some/New/Date";
Repeat this process for the other fields you wish to define in your search, and then run the following code to programmatically execute your search:
document.querySelector('#j_idt199 button').click();
After that, you can either grab the information you want using plain JS query selectors, or you can implement a scraping library like artoo.js to help you interpret the data and export it.
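If you would rather drive all of this from Python instead of the browser console, the same JavaScript can be injected through Selenium's execute_script. A rough sketch follows; the selectors and the date value are copied from the console examples above, and the auto-generated j_idt IDs may well differ on the live page:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml")

# fill the first date field and trigger the search, as in the console example above
driver.execute_script(
    "document.querySelectorAll('#j_idt199 input')[1].value = 'Some/New/Date';")
driver.execute_script("document.querySelector('#j_idt199 button').click();")

html_after_search = driver.page_source  # HTML after the new results have loaded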
If given a website like e.g. http://www.barchart.com/historicaldata.php, is there a way to fill in the text box and then click the submit button to download the data?
I'm used to using urllib to download entire pages, but I can't seem to figure out how to submit text into the text box and then click the button from my script.
There are two paths I can think of:
Selenium
It's possible to directly simulate filling in data and clicking the button using a great library called Selenium WebDriver.
Using Selenium, you can open up a programmatic browser session and do all manner of things that a user would do. Combined with a headless ("ghost") browser, this can be done behind the scenes in a browser-independent way (useful if this is going to run on a server, which won't have Chrome installed).
While an awesome library (fantastic for testing web pages), Selenium requires learning quite a bit. It's required if you specifically want to perform the action of filling out and clicking. But I think there might be an easier way to accomplish what you're trying to do using Python requests.
Requests
Python's requests library is another library for requesting data from pages. You can use it to submit a GET request (what the browser does when you simply visit the page) or a POST request (how the browser sends off your form data after you click submit).
To know which fields you want to send off data to, look at the page's HTML for each form field, and grab the "name" attribute.
If it weren't for the fact that your content seems to be paywalled, you could accomplish this pretty easily. For example, let's say your form has 3 fields to fill in, with name attributes consisting of 'start_date', 'end_date', and 'type'. You could accomplish this with the following:
import requests

url = "http://www.barchart.com/historicaldata.php/"

# field names must match the "name" attributes in the page's HTML
r = requests.post(url, data={
    'start_date': 'some start date',
    'end_date': 'some end date',
    'type': 'some report type',
})

# save the response body to a file
with open("~~DESIRED FILE LOCATION~~", "wb") as code:
    code.write(r.content)
Because of the paywall, you'll have to log in first, and retain that session data. I defer explanation of how to do that to this excellent answer
EDIT:
Possibly one more thing to note regarding where you should be submitting your data. The URL you should submit your POST data to might be the same as the barchart URL that you gave, but it also might not be. To find out, look at the "action" attribute of the HTML form element itself. 9 times out of 10, that's where the data is getting submitted. If the site does something wonky with JavaScript, you might have to open up a console and examine where exactly the data is getting sent upon submission. But that bridge can be crossed if/when needed.
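As a rough, untested sketch of that whole workflow with requests and bs4 (the login URL, field names, and credentials below are placeholders; the real ones have to be read out of the site's own HTML):

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# log in first so the session cookie is retained for the later POST
# (the login URL and field names here are hypothetical placeholders)
session.post("http://www.barchart.com/login.php",
             data={"username": "you", "password": "secret"})

# fetch the form page and read where the form actually submits to
page = session.get("http://www.barchart.com/historicaldata.php")
soup = BeautifulSoup(page.text, "html.parser")
form = soup.find("form")
action = form.get("action")                                    # where to POST
field_names = [i.get("name") for i in form.find_all("input")]  # what to fill in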
I come from a world of scientific computing and number crunching.
I am trying to interact with the internet to compile data so I don't have to do it by hand. One task is to auto-fill searches on Marriott.com so I can see what the best deals are all on my own.
I've attempted something simple like
import urllib
import urllib2
url = "http://marriott.com"
values = {'Location':'New York'}
data = urllib.urlencode(values)
website = urllib2.Request(url, data)
response = urllib2.urlopen(website)
stuff = response.read()
f = open('test.html','w')
f.write(stuff)
My questions are the following:
How do you know how the website receives information?
How do I know a simple POST will work?
If it is simple, how do I know what the keys of the values dictionary should be?
How do I check if it's working? The write lines at the end are an attempt to see if my inputs are working properly, but that is insufficient.
You may also have a look at splinter, for cases where urllib may not be useful (JS, AJAX, etc.).
For finding out the form parameters, Firebug could be useful.
You need to read and analyze the HTML code of the relevant site. Every browser has decent tools for inspecting the DOM of a site and analyzing the network traffic and requests.
Usually you want to use the mechanize module for performing automatized interactions with a web site. There is no guarantee given that this will work in every case. Nowadays many websites use AJAX or more complex client-side programming making it hard to "emulate" a human user using Python.
Apart from that: the marriott.com site does not contain an input field "Location"... so you are guessing URL parameters without having analyzed their forms and functionality?
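Here is a minimal mechanize sketch of that kind of automated interaction, assuming the form and field can be identified in the page's HTML; the field name used below is the one mentioned in another answer here and may of course change:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)        # the site may disallow robots
br.open("http://www.marriott.com/")

# pick the first form on the page and fill a field by its HTML name attribute
br.select_form(nr=0)
br["destinationAddress.destination"] = "New York"

response = br.submit()             # submits the form the way a browser would
html = response.read()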
What I do to check is use a web-debugging proxy to view the request you send.
First, send a real request with your browser and compare that request to the request that your script sends; try to make the two requests match.
What I use for this is Charles Proxy
Another way is to open the HTML file you saved (in this case test.html) in your browser and compare it to the actual response.
To find out what the dictionary should contain, look at the page source and find the names of the form fields you're trying to fill. In your case, "Location" should actually be "destinationAddress.destination".
So look in the HTML code to get the names of the form fields, and that is what should be in the dictionary. Both Google Chrome and Mozilla Firefox have tools to view the structure of the HTML (I used Inspect Element in Google Chrome).
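Putting that together with the original snippet, a hedged guess at the corrected call might look like this; the form's actual "action" URL and field name still need to be confirmed against the live page:

import urllib
import urllib2

url = "http://marriott.com"   # check the form's "action" attribute; it may point elsewhere
values = {'destinationAddress.destination': 'New York'}   # name from the form's HTML, not 'Location'

data = urllib.urlencode(values)
website = urllib2.Request(url, data)
response = urllib2.urlopen(website)
stuff = response.read()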
For more info on urllib2, read here.
I really hope this helps :)
I want to crawl a website that has multiple pages, where the content is loaded dynamically when a page number is clicked. How do I screen-scrape it?
That is, since the URL is not present as an href, how do I crawl to the other pages?
I would be grateful if someone helped me with this.
PS: the URL remains the same when a different page is clicked.
You should also consider Ghost.py, since it allows you to run arbitrary JavaScript commands, fill forms, and take snapshots very quickly.
If you are using Google Chrome, you can check the URL that is being called dynamically under Network -> Headers in the developer tools.
Based on that, you can identify whether it is a GET or a POST request.
If it is a GET request, you can find the parameters straight away from the URL.
If it is a POST request, you can find the parameters under Form Data in Network -> Headers of the developer tools.
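Once you have that endpoint and its parameters, you can replay the request directly with requests. Here is a sketch under the assumption that the pagination is a POST with a page parameter (the URL and parameter name are placeholders you would copy from the Network tab):

import requests

# hypothetical endpoint and parameter copied from the developer tools
ajax_url = "http://example.com/listing/page"
for page in range(1, 6):
    resp = requests.post(ajax_url, data={"page": page})
    print(resp.status_code, len(resp.text))   # each response holds one page of results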
You could look for the data you want in the JavaScript code instead of the HTML. This is usually a pain, but you can do fun things with regular expressions.
Alternatively, some of the browser-testing libraries like splinter work by loading the page up in an actual browser like Firefox or Chrome before scraping. One of those would work if you are running this on a machine with a browser installed.
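For instance, if the page embeds its data in an inline script as a JSON blob (a common pattern, though by no means guaranteed here), a hedged sketch of the regex approach:

import json
import re
import requests

html = requests.get("http://example.com/page").text

# hypothetical: the page defines something like  var items = [...];  in a <script> tag
match = re.search(r"var\s+items\s*=\s*(\[.*?\]);", html, re.DOTALL)
if match:
    items = json.loads(match.group(1))   # parse the captured JSON array
    print(items[:3])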
Since this post has been tagged with python and web-crawler, Beautiful Soup has to be mentioned: http://www.crummy.com/software/BeautifulSoup/
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html
You cannot do that easily, even with mechanize, since it is AJAX pagination. Instead, open the page source and try to find out which URL request is used for the AJAX pagination. Then you can craft that request yourself and process the returned data your own way.
If you don't mind using gevent, GRobot is another good choice.