I'm writing an API for a Craigslist-like website, and I've finished the part that retrieves data, using the html module from lxml.
Now I want to submit data (login info, items to be posted, ...) to the website.
Can I do that with lxml, or do I have to use another module?
As @furas mentioned, you can use the requests module to post the data (the login, in your case).
Here is a simple example that uses the requests module for both GET and POST:
import requests
# get the token
resp = requests.get("https://www.botoxcosmetic.com/sc/api/findclinic/GetFadToken?_=1556315102966")
# print the token
print(resp.json())
# Store all the input data in dataI. In your case, replace these with the username and password (check the API documentation or devtools to make sure the parameter names are correct).
dataI = {'ZipCode': '10022', 'MileRadius': '1', 'PerPage': '5', 'Token': resp.json()}
# Post the data (you don't have to click any submit button) and capture the response to the POST.
resp = requests.post("https://www.botoxcosmetic.com/sc/api/findclinic/FindSpecialists", data=dataI)
# print the response from post call
print(resp.json())
Can anyone help me grab the data from this website?
I want to get the data from the new page that loads after I click "Submit Query". That needs a POST request, because a form is being submitted:
https://henke.lbl.gov/optical_constants/pert_form.html
I tried several POST-request approaches I found online, but they all failed, and I don't know why.
Many thanks!
If you want to grab the text content of the page, for example, try:
import requests
r = requests.get('https://henke.lbl.gov/optical_constants/pert_form.html')
print(r.text)
For more, see https://docs.python-requests.org/en/master/
I have a task to crawl all Pulitzer winners, and I found that this page has everything I want: https://www.pulitzer.org/prize-winners-by-year/2018.
But I ran into the following problems:
Problem 1: How do I crawl a dynamic page? I used Python's urllib2.urlopen to get the page's content, but for this dynamic page it doesn't return the real content.
Problem 2: I then found an API URL in devtools: https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json. But when I send a GET request with urllib2.urlopen, I always get null. Why does that happen, and how can I handle it?
If this is too naive for you, please suggest some keywords so that I can look them up on Google.
Thanks in advance!
One way to handle this is to create a session with the requests module, so that the session details needed by the next API call are passed along automatically. You also have to send one more header, Referer, which tells the API which year you are looking for.
import requests
s = requests.session()
url = "https://www.pulitzer.org/prize-winners-by-year/2017"
resp1 = s.get(url)
headers = {'Referer': 'https://www.pulitzer.org/prize-winners-by-year/2017'}
api = "https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json"
data = s.get(api,headers=headers)
Now you can extract the data from the response stored in data.
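As a rough standard-library counterpart to the Referer part of this answer, the sketch below builds the API request with the extra header and shows how a JSON body of null decodes in Python. The helper names build_api_request and decode_payload are hypothetical, nothing is actually sent, and the cookie handling that the requests session provides is not covered here.

```python
import json
import urllib.request


def build_api_request(api_url, referer):
    """Build (without sending) a GET request that carries a Referer header."""
    return urllib.request.Request(api_url, headers={"Referer": referer})


def decode_payload(raw_bytes):
    """Decode a JSON response body; a literal `null` body becomes None."""
    return json.loads(raw_bytes.decode("utf-8"))


req = build_api_request(
    "https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json",
    "https://www.pulitzer.org/prize-winners-by-year/2017",
)
print(req.get_method(), req.get_header("Referer"))
print(decode_payload(b"null"))
```

Checking for None after decoding is a simple way to notice that the server answered with null rather than with the expected data.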
I'm trying to scrape multiple financial websites (Wells Fargo, etc.) to pull my transaction history for data analysis purposes. I can do the scraping part once I get to the page I need; the problem I'm having is getting there. I don't know how to pass my username and password and then navigate from there. I would like to do this without actually opening a browser.
I found Michael Foord's article "HOWTO Fetch Internet Resources Using The urllib Package" and tried to adapt one of the examples to meet my needs but can't get it to work (I've tried adapting to several other search results as well). Here's my code:
import bs4
import urllib.request
import urllib.parse
##Navigate to the website.
url = 'https://www.wellsfargo.com/'
values = {'j_username':'USERNAME', 'j_password':'PASSWORD'}
data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
soup = bs4.BeautifulSoup(the_page, "html.parser")
The 'j_username' and 'j_password' both come from inspecting the text boxes on the login page.
I just don't think I'm pointing to the right place or passing my credentials correctly. The URL I'm using is just the login page, so is it actually logging me in? When I print the URL from the response, it returns https://wellsfargo.com/. If I'm ever able to log in successfully, it just takes me to a summary page of my accounts. I would then need to follow another link to my checking, savings, etc.
I really appreciate any help you can offer.
I've had problems accessing www.bizi.si or more specifically
http://www.bizi.si/BALMAR-D-O-O/ for instance. If you look at it without registering you won't see any financial data. But if you use the free registration, I used username: Lukec, password: lukec12345, you can see some of the financial data. I've used this next code:
import urllib.parse
import urllib.request
import re
import csv
username = 'Lukec'
password = 'lukec12345'
url = 'http://www.bizi.si/BALMAR-D-O-O/'
values = {'username':username, 'password':password}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url,data,values)
resp = urllib.request.urlopen(req,data)
respData = resp.read()
paragraphs = re.findall('<tbody>(.*?)</tbody>',str(respData))
And my len(paragraphs) is zero. I would be really grateful if any of you could tell me how to access the page correctly. I know that a length of zero isn't the best indicator, but len(respData) is also the same whether I pass values as in my code or leave it out, so I know I have not accessed the page with the username and password.
Thank you for your help in advance, and have a nice day.
There are two issues here:
You are not using POST, but GET for the request.
There is no <tbody> element in the HTML produced; any such tags have been automatically added by your browser, do not rely on them being there.
To create a POST request, use:
req = urllib.request.Request(url, data, method='POST')
resp = urllib.request.urlopen(req)
Note that I removed the values argument (those are not headers, which is what the third positional argument to Request() expects), and that you don't pass a separate data argument to urlopen() when using a Request object.
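To make the point about the third positional argument concrete, here is a small sketch that builds both variants without sending anything. The URL and credentials are placeholders, not the real site's values.

```python
import urllib.parse
import urllib.request

url = "http://www.example.com/login"  # placeholder URL, not the real site
values = {"username": "user", "password": "secret"}  # placeholder credentials
data = urllib.parse.urlencode(values).encode("utf-8")

# Passing the dict as the third positional argument registers it as headers.
wrong = urllib.request.Request(url, data, values)
print(wrong.headers)

# Passing only the encoded body keeps the form fields out of the headers.
right = urllib.request.Request(url, data, method="POST")
print(right.headers, right.get_method())
```

Inspecting the headers attribute before calling urlopen() is a quick way to check what a Request object will carry.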
The HTML returned does not necessarily include the same data that is sent to a browser; you probably need to maintain a session here and return the cookies that the site sets.
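One possible way to keep such a session with only the standard library is an opener that stores and returns cookies. This is a minimal sketch: make_session_opener and post_form are hypothetical helper names, and the form fields a given site expects are not known here.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session_opener():
    """Return an opener that remembers cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def post_form(opener, url, fields):
    """POST form fields through the opener so that any cookies set are kept."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(url, data=body) as resp:
        return resp.read()


opener, jar = make_session_opener()
print(len(jar))  # no cookies are stored before the first request
```

Later requests made through the same opener send back whatever cookies earlier responses set, which is the behaviour a logged-in browser session relies on.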
It is far easier to do this with better tools such as the requests library and BeautifulSoup (the latter lets you parse HTML without having to resort to regular expressions), which can be combined with the robobrowser project to help you fill and submit forms on websites.
Note however that the page forms and state are managed by ASP.NET JavaScript code and are not easily reverse-engineered even by robobrowser. When you log in with a browser (which has run the JavaScript code for you), the POST looks like this:
{'__EVENTTARGET': ['ctl00$ctl00$loginBoxPopup$loginBox1$ButtonLogin'],
'__VSTATE': [''],
'ctl00$ctl00$SearchAdvanced1$ActivitiesAndProductsSearch1$RadioButtonList1': ['TSMEDIA '
'dejavnost'],
'ctl00$ctl00$SearchAdvanced1$DropDownListYearSelection': ['2013'],
'ctl00$ctl00$SearchAdvanced1$SteviloZaposlenihDo': ['do'],
'ctl00$ctl00$SearchAdvanced1$SteviloZaposlenihOd': ['od'],
'ctl00$ctl00$SearchAdvanced1$ddlLegalEvents': ['0'],
'ctl00$ctl00$loginBoxPopup$loginBox1$Password': ['lukec12345'],
'ctl00$ctl00$loginBoxPopup$loginBox1$UserName': ['Lukec'],
'ctl00_ctl00_ScriptManager1_HiddenField': [';;AjaxControlToolkit, '
'Version=3.5.40412.0, '
'Culture=neutral, '
'PublicKeyToken=28f01b0e84b6d53e:sl:1547e793-5b7e-48fe-8490-03a375b13a33:475a4ef5:effe2a26:3ac3e789:5546a2b:d2e10b12:37e2e5c9:5a682656:12bbc599:1d3ed089:497ef277:a43b07eb:751cdd15:dfad98a5:3cf12cf1'],
'hiddenInputToUpdateATBuffer_CommonToolkitScripts': ['1']}
That's a lot more information than a simple username / password combination.
See "post request using python to asp.net page" for approaches to handling such pages instead.
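Part of the extra information in a POST like the one above comes from hidden form inputs. As a hedged sketch of one common approach, the code below collects the hidden inputs from a page and merges the credentials into them. The sample HTML and its field values are invented for illustration, and values that the site's JavaScript fills in at runtime would still be missing.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Invented sample page; a real ASP.NET page carries many more fields.
sample = """
<form method="post" action="/login">
  <input type="hidden" name="__VSTATE" value="abc123">
  <input type="hidden" name="__EVENTTARGET" value="">
  <input type="text" name="UserName">
</form>
"""
payload = hidden_fields(sample)
payload["UserName"] = "Lukec"
print(payload)
```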
As part of my quest to become better at Python I am now attempting to sign in to a website I frequent, send myself a private message, and then sign out. So far, I've managed to sign in (using urllib, cookiejar and urllib2). However, I cannot work out how to fill in the required form to send myself a message.
The form is located at /messages.php?action=send. There are three things that need to be filled in for the message to send: three text fields named name, title and message. Additionally, there is a submit button (named "submit").
How can I fill in this form and send it?
import urllib
import urllib2
name = "name field"
data = {
    "name" : name
}
encoded_data = urllib.urlencode(data)
content = urllib2.urlopen("http://www.abc.com/messages.php?action=send",
                          encoded_data)
print content.readlines()
Just replace http://www.abc.com/messages.php?action=send with the URL your form is submitted to.
In reply to your comment: if that URL is only where the form is located, and you need to do this for just one website, look at the page's source code and find
<form method="POST" action="some_address.php">
and pass that address to urllib2.urlopen.
You also have to understand what the submit button does:
it just sends an HTTP request to the URL defined by the form's action attribute.
So what you need to do is simulate that request with urllib2.
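Instead of reading the page source by hand, the form's method and action can also be pulled out programmatically. The following is a minimal Python 3 sketch using only the standard library; FormFinder and form_targets are hypothetical names, and the HTML snippet reuses the form tag shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFinder(HTMLParser):
    """Record the method and action of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attr = dict(attrs)
            method = (attr.get("method") or "GET").upper()
            self.forms.append((method, attr.get("action") or ""))


def form_targets(page_url, html):
    """Return (method, absolute URL) pairs for each form on the page."""
    finder = FormFinder()
    finder.feed(html)
    return [(method, urljoin(page_url, action)) for method, action in finder.forms]


html = '<form method="POST" action="some_address.php"><input name="name"></form>'
print(form_targets("http://www.abc.com/messages.php?action=send", html))
```

Resolving the action against the page URL matters because the action attribute is often a relative address.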
You can use mechanize to make this easier; it simplifies submitting the form. Don't forget to check the parameter names (name, title, message) against the source code of the HTML form.
import mechanize
br = mechanize.Browser()
br.open("http://mywebsite.com/messages.php?action=send")
br.select_form(nr=0)
br.form['name'] = 'Enter your Name'
br.form['title'] = 'Enter your Title'
br.form['message'] = 'Enter your message'
req = br.submit()
You want the mechanize library. This lets you easily automate the process of browsing websites and submitting forms/following links. The site I've linked to has quite good examples and documentation.
Try to work out the requests that are made (e.g. using the Chrome web developer tool or with Firefox/Firebug) and imitate the POST request containing the desired form data.
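As a sketch of imitating that POST with Python 3's standard library, the code below builds (but does not send) a request with the fields from the question: name, title, message and submit. The value "Send" for the submit button and the helper name build_message_request are assumptions; the real values should be read from the browser's developer tools.

```python
import urllib.parse
import urllib.request


def build_message_request(base_url, name, title, message):
    """Build (without sending) the form POST for the message page."""
    fields = {
        "name": name,
        "title": title,
        "message": message,
        "submit": "Send",  # assumed button value; check the real form
    }
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        urllib.parse.urljoin(base_url, "/messages.php?action=send"),
        data=body,
        method="POST",
    )


req = build_message_request("http://www.abc.com/", "me", "Test", "Hello there")
print(req.full_url)
print(urllib.parse.parse_qs(req.data.decode("utf-8")))
```

To actually deliver the message, the request would have to be opened with the same cookie-aware opener that performed the login, so that the session cookies are sent along.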
In addition to the great mechanize library mentioned by Andrew, I'd also suggest you use BeautifulSoup to parse the HTML.
If you don't want to use mechanize but still want an easy, clean solution to create HTTP requests, I recommend the excellent requests module.
To post data to a web page, you can use cURL, something like this:
curl -d Name="Shrimant" -d title="Hello world" -d message="Hello, how are you" -d Form_Submit="Send" "http://www.example.com/messages.php?action=send"
The "-d" option tells cURL that the next item is data to be sent to the server at http://www.example.com/messages.php?action=send
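For comparison, here is a rough Python sketch of the same form body, built with the standard library and not sent anywhere. One difference to be aware of: curl's -d sends the text as given, while urlencode percent-encodes spaces and punctuation.

```python
import urllib.parse
import urllib.request

# Same field names and values as in the curl command above.
fields = {
    "Name": "Shrimant",
    "title": "Hello world",
    "message": "Hello, how are you",
    "Form_Submit": "Send",
}
body = urllib.parse.urlencode(fields)
print(body)

# Build the POST request; urllib.request.urlopen(req) would send it.
req = urllib.request.Request(
    "http://www.example.com/messages.php?action=send",
    data=body.encode("utf-8"),
    method="POST",
)
print(req.get_method())
```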