Need to download the PDF, NOT the content of the webpage - python

As it stands, I am able to get the content of the webpage at the PDF link (EXAMPLE OF THE LINK HERE), but I don't want the content of the webpage; I want the content of the PDF itself, so I can save it as a PDF file in a folder on my computer.
I have been successful in doing this on sites that I don't need to log into and without a proxy server.
Relevant CODE:
import os
import urllib2
import time
import requests
import urllib3
from random import uniform

s = requests.Session()

data = {"Username": "username", "Password": "password"}
url = "https://login.url.com"
print "doing things"
r2 = s.post(url, data=data, proxies={'https': 'https://PROXYip:PORT'}, verify=False)
# I get a response 200 from printing r2
print r2

download_url = "http://msds.walmartstores.com/client/document?productid=1000527&productguid=54e8aa24-0db4-4973-a81f-87368312069a&DocumentKey=undefined&HazdocumentKey=undefined&MSDS=0&subformat=NAM"
# maxCounter is a loop counter defined elsewhere in the script
file = open(r"F:\my_filepath\document" + str(maxCounter) + ".pdf", 'wb')
temp = s.get(download_url, proxies={'https': 'https://PROXYip:PORT'}, verify=False)
# This prints out the response from the proxy server (i.e. 200)
print temp
something = uniform(5, 6)
print something
time.sleep(something)
# This gets me the content of the web page, not the content of the PDF
print temp.content
file.write(temp.content)
file.close()
I need help figuring out how to "download" the content of the PDF

try this:
import requests
url = 'http://msds.walmartstores.com/client/document?productid=1000527&productguid=54e8aa24-0db4-4973-a81f-87368312069a&DocumentKey=undefined&HazdocumentKey=undefined&MSDS=0&subformat=NAM'
pdf = requests.get(url)
with open('walmart.pdf', 'wb') as file:
    file.write(pdf.content)
Edit
Try again with a requests session to manage cookies (assuming the site sends you some after login), and maybe also a different proxy:
proxy_dict = {'https': 'ip:port'}

with requests.Session() as session:
    # Authentication request, use GET/POST whatever is needed
    # data variable should hold user/password information
    auth = session.get(login_url, data=data, proxies=proxy_dict, verify=False)
    if auth.status_code == 200:
        print(auth.cookies)  # Tell me if you got anything
        # We're continuing the same session, so its cookies are sent automatically
        pdf = session.get(download_url, proxies=proxy_dict, verify=False)
        with open('walmart.pdf', 'wb') as file:
            file.write(pdf.content)
    else:
        print('No go, got {0} response'.format(auth.status_code))
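One way to tell whether the session is actually authenticated before writing the file is to inspect the Content-Type of the download response; if the server bounced you to a login or error page you will see text/html rather than application/pdf. A small sketch of that check, reusing the session, download_url and proxy_dict names from the snippet above:
resp = session.get(download_url, proxies=proxy_dict, verify=False)
content_type = resp.headers.get('Content-Type', '')
print(content_type)

if 'pdf' in content_type.lower():
    # real PDF bytes, safe to write in binary mode
    with open('walmart.pdf', 'wb') as file:
        file.write(resp.content)
else:
    # probably an HTML login/error page came back instead of the document
    print('Got {0} instead of a PDF'.format(content_type))
    print(resp.text[:500])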

Related

How to scrape data behind a login

I am going to extract posts from a forum named "positive wellbeing during isolation" on HealthUnlocked.com.
I can extract posts without logging in, but I cannot extract the posts that require a login. I used "url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page)" to extract posts, but I don't know how to connect it to the login, as that URL returns JSON.
I would appreciate it if you could help me.
import requests, json
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep

url = "https://healthunlocked.com/private/programs/subscribed?user-id=1290456"
payload = {
    "username": "my username goes here",
    "Password": "my password goes here"
}

s = requests.Session()
p = s.post(url, data=payload)

headers = {"user-agent": "Mozilla/5.0"}
pages = 2
data = []
listtitles = []
listpost = []
listreplies = []
listpostID = []
listauthorID = []
listauthorName = []

for page in range(1, pages):
    url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page)
    r = requests.get(url, headers=headers)
    posts = json.loads(r.text)
    for post in posts:
        sleep(3.5)
        listtitles.append(post['title'])
        listreplies.append(post["totalResponses"])
        listpostID.append(post["postId"])
        listauthorID.append(post["author"]["userId"])
        listauthorName.append(post["author"]["username"])
        url = 'https://healthunlocked.com/positivewellbeing/posts/{0}'.format(post['postId'])
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        listpost.append(soup.select_one('div.post-body').get_text('|', strip=True))

## save to CSV
df = pd.DataFrame(list(zip(*[listpostID, listtitles, listpost, listreplies, listauthorID, listauthorName]))).add_prefix('Col')
df.to_csv('out1.csv', index=False)
print(df)
sleep(2)
For most websites, you first have to get a token by logging in. Most of the time, this is a cookie. Then, in authorized requests, you send that cookie along. Open the network tab in developer tools and log in with your username and password; you'll be able to see how the login request is formatted and where it is sent. From there, just try to replicate it in your code.
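A minimal sketch of that approach, assuming a hypothetical login endpoint and field names (copy the real URL and form fields from the network tab); the session object stores the cookie and sends it on the later JSON request:
import requests

LOGIN_URL = "https://healthunlocked.com/login"   # hypothetical - use the real one from the network tab
POSTS_URL = "https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}"

with requests.Session() as s:
    s.headers.update({"user-agent": "Mozilla/5.0"})
    # Field names are assumptions; use exactly what the browser sends
    login = s.post(LOGIN_URL, data={"email": "my username", "password": "my password"})
    print(login.status_code, s.cookies.get_dict())   # check that a session cookie came back

    # The same session (and therefore the cookie) is reused for the JSON endpoint
    r = s.get(POSTS_URL.format(1))
    posts = r.json()
    print(len(posts))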

python can't send post request image

I'm trying to decode a QR image from a website with Python: https://zxing.org/w/decode.jspx
I don't know why my POST request fails and I don't get any response.
import requests
url ="https://zxing.org/w/decode.jspx"
session = requests.Session()
f = {'f':open("new.png","rb")}
response = session.post(url,files = f)
f = open("page.html","w")
f.write(response.text)
f.close()
session.close()
Even when I do it with a GET request it still fails ... :/
url ="https://zxing.org/w/decode.jspx"
session = requests.Session()
data = {'u':'https://www.qrstuff.com/images/default_qrcode.png'}
response = session.post(url,data = data)
f = open("page.html","w")
f.write(response.text)
f.close()
session.close()
Maybe because the website contains two forms? ...
Thanks for helping.
You can do this:
import urllib
url ="https://zxing.org/w/decode?u=https://www.qrstuff.com/images/default_qrcode.png"
response = urllib.urlopen(url)
f = open("page.html","w")
f.write(response.read())
f.close()
If you want to send a URL, the action is a GET; if you want to post the data as a file, the action is a POST.
You can check it with the HackBar add-on in Firefox.
Well, I just saw my mistake ...
The website is https://zxing.org/w/decode.jspx,
but once you do a POST or a GET it becomes
https://zxing.org/w/decode without ".jspx", so I just removed it and everything worked!!
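For reference, a sketch of the corrected upload, assuming the form's file field is still named 'f' as in the original snippet (verify the field name against the decode form):
import requests

url = "https://zxing.org/w/decode"   # note: no ".jspx" - the form posts here

with requests.Session() as session:
    with open("new.png", "rb") as image:
        response = session.post(url, files={"f": image})

with open("page.html", "w") as page:
    page.write(response.text)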

Mintos.com login with python requests

I'm trying to write a tiny piece of software that logs into mintos.com and saves the account overview page (which is displayed after a successful login) to an HTML file. I tried some different approaches, and this is my current version.
import requests
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
username = 'abc'
password = '123'
loginUrl = 'https://www.mintos.com/en/login'
resp = requests.get(loginUrl, auth=(username, password))
file = codecs.open("mint.html", "w", "UTF-8")
file.write(resp.text)
file.close()
When I run the code, I only save the original page, not the one I should get when logged in. I guess I'm messing up the login (I mean...there's not much else to mess up). I spent an embarrassing amount of time on this problem already.
Edit:
I also tried something along the lines of:
import requests
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
loginUrl = "https://www.mintos.com/en/login"
username = "abc"
password = "123"
payload = {"username": username, "password": password}
with requests.session() as s:
    resp = s.post(loginUrl, data=payload)
    file = codecs.open("mint.html", "w", "UTF-8")
    file.write(resp.text)
    file.close()
Edit 2: Another non-working version, this time with _csrf_token:
with requests.session() as s:
    resp = s.get(loginUrl)
    toFind = '_csrf_token" value="'
    splited = resp.text.split(toFind)[1]
    _csrf_token = splited.split('"', 1)[0]
    # _username and _password hold the credentials from above
    payload = {"_username": _username, "_password": _password, "_csrf_token": _csrf_token}
    final = s.post(loginUrl, data=payload)
    file = codecs.open("mint.html", "w", "UTF-8")
    file.write(final.text)
    file.close()
But I still get the same result. The downloaded page has the same token as the one I extract, though.
Final Edit: I made it work, and I feel stupid now. I needed to use 'https://www.mintos.com/en/login/check' as my loginUrl.
The auth parameter is just a shorthand for HTTPBasicAuth, which is not what most websites use. Most of them use cookies or session data to store your login info so they can check who you are while you browse their pages.
If you want to log in on the website, you'll have to make a POST request to the login form and then store (and send back every time) the cookies they give you. Also, this assumes they don't have any kind of "anti-bot filter" (which would make you unable to log in without a real browser, or at least not that easily).
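Putting the pieces from the question together, a minimal sketch of that cookie-based flow; the _username, _password and _csrf_token field names come from Edit 2 and the /en/login/check endpoint from the Final Edit (the exact markup around the token may differ):
import requests

login_page = 'https://www.mintos.com/en/login'
check_url = 'https://www.mintos.com/en/login/check'   # endpoint from the Final Edit

with requests.Session() as s:
    # GET the login page first so the session holds the initial cookies
    resp = s.get(login_page)
    # Pull the CSRF token out of the form (same string-splitting trick as in Edit 2)
    token = resp.text.split('_csrf_token" value="')[1].split('"', 1)[0]

    payload = {"_username": "abc", "_password": "123", "_csrf_token": token}
    final = s.post(check_url, data=payload)

    with open("mint.html", "w", encoding="UTF-8") as f:
        f.write(final.text)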

Python append json to json file in a while loop

I'm trying to get all users information from GitHub API using Python Requests library. Here is my code:
import requests
import json
url = 'https://api.github.com/users'
token = "my_token"
headers = {'Authorization': 'token %s' % token}
r = requests.get(url, headers=headers)
users = r.json()
with open('users.json', 'w') as outfile:
    json.dump(users, outfile)
I can dump first page of users into a json file by now. I can also find the 'next' page's url:
next_url = r.links['next'].get('url')
r2 = requests.get(next_url, headers=headers)
users2 = r2.json()
Since I don't know how many pages yet, how can I append 2nd, 3rd... page to 'users.json' sequentially in a while loop as fast as possible?
Thanks!
First, you need to open the file in 'a' mode, otherwise each subsequent write will overwrite everything:
import requests
import json

url = 'https://api.github.com/users'
token = "my_token"
headers = {'Authorization': 'token %s' % token}

outfile = open('users.json', 'a')
while True:
    r = requests.get(url, headers=headers)
    users = r.json()
    json.dump(users, outfile)
    # I don't know what GitHub returns when there are no more users, so double-check
    # this yourself; requests omits the 'next' entry when the Link header has none
    next_link = r.links.get('next')
    if not next_link:
        break
    url = next_link.get('url')
outfile.close()
Append the data you get from each requests query to a list and move on to the next query.
Once you have all of the data you want, concatenate it and write it to a file (or build whatever object you need) in one go. You could also use threading to run multiple queries in parallel, but most likely the API is rate-limited.
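A sketch of that list-first approach, assuming you still stop when the Link header has no 'next' entry; the file is written once at the end, so it stays a single well-formed JSON array rather than several concatenated arrays:
import requests
import json

url = 'https://api.github.com/users'
headers = {'Authorization': 'token my_token'}

all_users = []
while url:
    r = requests.get(url, headers=headers)
    all_users.extend(r.json())          # accumulate this page in memory
    next_link = r.links.get('next')     # None when GitHub sends no further page
    url = next_link['url'] if next_link else None

with open('users.json', 'w') as outfile:
    json.dump(all_users, outfile)       # one write, one JSON array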

Make a post request with File content via Python

I have a very large POST body (over 100 MB) with one cookie. I want to send it to a server through Python; the POST data is in a file, like this:
a=true&b=random&c=2222&d=pppp
The following code only sends the cookie, but not the POST content.
import requests
import sys

count = len(sys.argv)
if count < 3:
    print 'usage a.py FILE URL LOGFILE'
else:
    url = sys.argv[2]
    data = {'file': open(sys.argv[1], 'rb')}
    cookies = {'session': 'testsession'}
    r = requests.get(url, data=data, cookies=cookies)
    f = open(sys.argv[3], 'w')
    f.write(r.text)
    f.close()
The code takes the FILE which holds the POST data, then the URL to send it to, then the LOGFILE where the response is to be stored.
Note: I am not trying to upload a file, but to send the POST content that is inside the file.
Firstly, you should be using requests.post. Secondly, if you want to post just the data inside the file, you need to read the contents of the file and parse it into a dict, since that is the format requests.post expects the data to come in.
Example: (Note: Just showing the relevant parts)
import urlparse
import requests
import sys

with open(sys.argv[1], 'rb') as f:
    data = urlparse.parse_qs('&'.join(f.read().splitlines()))

r = requests.post(url, data=data, cookies=cookies)
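Since the question says the payload is over 100 MB, one possible alternative (not what the answer above does) is to skip the dict entirely and stream the already-encoded body straight from the file, setting the Content-Type yourself; requests accepts a file-like object for data and streams it:
import requests
import sys

cookies = {'session': 'testsession'}
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

# The file already contains "a=true&b=random&..." so it can be sent as-is
with open(sys.argv[1], 'rb') as body:
    r = requests.post(sys.argv[2], data=body, headers=headers, cookies=cookies)

with open(sys.argv[3], 'w') as log:
    log.write(r.text)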
