I'm trying to download some pages as PDF files. However, the pages require me to log in, so I simply sent some cookies along with my request (using the requests module), and that worked. However, I'm not sure how to send cookies with PDFKit to achieve the same thing.
Here is the code I tried. I also tried to incorporate headers (to prevent a 403 error), but that didn't work either. I can't seem to find this in the documentation. Does anyone know how I can send cookies to download the pages?
import pdfkit
url = r'www.someurl.com'
cookies = {
    "cookie1": "cookie"
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
pdfkit.from_url(url, 'out.pdf', configuration=config, options=cookies)
According to the pdfkit project description, you can set cookies using this approach:
options = {'cookie': [('cookie-name1', 'cookie-value1'),
                      ('cookie-name2', 'cookie-value2')]}
pdfkit.from_url('http://google.com', 'out.pdf', options=options)
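Putting that together with the configuration and User-Agent from the question: a minimal sketch, where the cookie name/value is a placeholder and it's assumed your wkhtmltopdf build supports the --custom-header flag (pdfkit passes repeatable options as lists of tuples):

import pdfkit

config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
options = {
    # session cookies as (name, value) pairs; placeholders here
    'cookie': [('cookie1', 'cookie')],
    # wkhtmltopdf's --custom-header flag may help avoid a 403
    'custom-header': [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36')],
}
pdfkit.from_url('https://www.someurl.com', 'out.pdf', configuration=config, options=options)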
Related
I'm new to Python. I have to download some images from the web and save them to my local file system. I've noticed that the response content does not contain any image data.
The problem only occurs with this specific URL; with every other image URL the code works fine.
I know the easiest solution would be to just use another URL, but I'd still like to ask whether someone has had a similar problem.
import requests
url = 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png'
filename = "bitcoin.png"
response = requests.get(url, stream=True)
response.raw.decode_content = True
with open(f'images/{filename}', 'wb') as outfile:
    outfile.write(response.content)
First, look at the content of the response with response.text; you'll see the website blocked your request with this message:
Please turn JavaScript on and reload the page.
Then check whether changing the User-Agent of your request fixes the issue.
response = requests.get(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    },
    stream=True,
)
If it doesn't, you may need to fetch the data with something that can execute JavaScript, such as Selenium or Puppeteer.
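For completeness, here is a minimal end-to-end sketch with the spoofed User-Agent, streaming the image to disk in chunks; it assumes the images/ directory already exists:

import requests

url = 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}

response = requests.get(url, headers=headers, stream=True)
response.raise_for_status()  # fail fast if the site still blocks the request

with open('images/bitcoin.png', 'wb') as outfile:
    for chunk in response.iter_content(chunk_size=8192):  # write the stream in chunks
        outfile.write(chunk)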
I am trying to download the file from the URL:
https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf
I tried using the Python requests library, but the request just timed out. I tried specifying the 'User-Agent' from my browser as a header, but it still timed out, even when I copied every single header from my browser into my Python script. I tried setting allow_redirects=True, but that did not help. I've also tried wget and curl; everything fails apart from actually opening the browser, visiting the URL, and downloading the file.
I'm wondering what the actual difference is between the requests my browser makes and the Python requests where I set the headers to match my browser's. Is there any way I can download this file using Python?
Code snippet:
import requests
requests.get("https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf") # hangs
Check this; it worked for me.
import requests
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'
}
response = requests.get(
    "https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf",
    headers=headers,
)
with open("Chadv20-239.pdf", 'wb') as pdf:
    pdf.write(response.content)
It is difficult to understand what might be going wrong without a code snippet. How is the file being downloaded? Are you getting the raw response content and saving it as a PDF? The official docs (https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content) suggest a chunk-based approach to save streamed/raw content. Did you try that approach?
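A minimal sketch of that chunk-based approach, assuming the same User-Agent header from the previous answer is still required:

import requests

url = "https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"}

with requests.get(url, headers=headers, stream=True) as response:
    response.raise_for_status()
    with open("Chadv20-239.pdf", "wb") as fd:
        # iter_content avoids loading the whole PDF into memory at once
        for chunk in response.iter_content(chunk_size=8192):
            fd.write(chunk)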
I have a question about the requests module in Python.
So far I have been using it to scrape, and it's been working well.
However, against one particular website (code below), it just doesn't complete the task: the Jupyter Notebook cell shows [*] forever.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
page = requests.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets', verify=False)
soup = BeautifulSoup(page.content, 'html.parser')
Some users also suggest headers like the ones below to speed it up, but that doesn't work for me either:
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = requests.get(url=url, headers=headers)
Not sure what's going on (this is a first for me), but I might be missing something obvious. Can someone explain why this is not working? Or, if it works on your machine, please do let me know!
The page attempts to set a cookie the first time you visit it. Using the requests module without supplying that cookie prevents you from connecting to the page.
I've modified your script to include my cookie, which should work; if it doesn't, copy your own cookie (for this host domain) from the browser into the script.
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
cookies = {
    'TS01e58ec0': '01a1c9e334eb0b8b191d36d0da302b2bca8927a0ffd2565884aff3ce69db2486850b7fb8e283001c711cc882a8d1f749838ff59d3d'
}
req = requests.get(url=url, headers=headers, cookies=cookies)
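As a hedged alternative: if the site delivers the cookie through an ordinary Set-Cookie header rather than JavaScript, a requests.Session may capture it automatically on a first visit, so you don't have to hard-code it:

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'})

session.get('https://www.stoneisland.com/')  # first visit; the Session stores any cookies set here
req = session.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets')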
I am trying to get some data from a page. I open Chrome's developer tools and successfully find the data I want. It's an XHR request with the GET method (sorry, I don't know how to describe it better). I then copy the params and headers and pass them all to requests.get(). The response I get is totally different from what I saw in the developer tools.
Here is my code
import requests
# note: requests omits query parameters whose value is None
queryList = {
    "category": "summary",
    "subcategory": "all",
    "statsAccumulationType": "0",
    "isCurrent": "true",
    "playerId": None,
    "teamIds": "825",
    "matchId": "1103063",
    "stageId": None,
    "tournamentOptions": None,
    "sortBy": None,
    "sortAscending": None,
    "age": None,
    "ageComparisonType": None,
    "appearances": None,
    "appearancesComparisonType": None,
    "field": None,
    "nationality": None,
    "positionOptions": None,
    "timeOfTheGameEnd": None,
    "timeOfTheGameStart": None,
    "isMinApp": None,
    "page": None,
    "includeZeroValues": None,
    "numberOfPlayersToPick": None,
}
header = {
    'modei-last-mode': 'JL7BrhwmeqKfQpbWy6CpG/eDlC0gPRS2BCvKvImVEts=',
    'Referer': 'https://www.whoscored.com/Matches/1103063/LiveStatistics/Spain-La-Liga-2016-2017-Leganes-Real-Madrid',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    "x-requested-with": "XMLHttpRequest",
}
url = 'https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics'
test = requests.get(url=url, params=queryList, headers=header)
print(test.text)
I followed the post below, but it is already two years old and I believe the structure has changed.
XHR request URL says does not exist when attempting to parse it's content
Background info:
I am scraping Amazon. I need to set up the session cookies before using requests.session.get() to fetch the final version of a URL's page source.
Code:
import requests
# I am currently working in China, so it's .cn.
# Use the homepage to get cookies, then use them later to scrape data.
homepage = 'http://www.amazon.cn'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = requests.get(homepage, headers=headers)
cookies = response.cookies

# set up the Session object so as to preserve the cookies between requests
session = requests.Session()
session.headers = headers
session.cookies = cookies

# now download the source code
url = 'https://www.amazon.cn/TCL-%E7%8E%8B%E7%89%8C-L65C2-CUDG-65%E8%8B%B1%E5%AF%B8-%E6%96%B0%E7%9A%84HDR%E6%8A%80%E6%9C%AF-%E5%85%A8%E6%96%B0%E7%9A%84%E9%87%8F%E5%AD%90%E7%82%B9%E6%8A%80%E6%9C%AF-%E9%BB%91%E8%89%B2/dp/B01FXB0ZG4/ref=sr_1_2?ie=UTF8&qid=1476165637&sr=8-2&keywords=L65C2-CUDG'
response = session.get(url)
Desired Result:
When I navigate to the Amazon homepage in Chrome, the cookies look something like the developer-tools screenshot I took (not reproduced here).
As you can see in the cookies section of that screenshot (underlined in red), part of what the homepage response sets is 'ubid-acbcn', which also appears in the request header, probably left over from my last visit.
That is the cookie I want, and it is what the code above attempts to get.
In Python code, it should be a CookieJar or a dictionary. Either way, its content should contain 'ubid-acbcn' and 'session-id':
{'ubid-acbcn':'453-7613662-1073007','session-id':'455-1363863-7141553','otherparts':'otherparts'}
What I am getting instead:
The 'session-id' is there, but the 'ubid-acbcn' is missing.
>>> homepage = 'http://www.amazon.cn'
>>> headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
>>> response = requests.get(homepage, headers=headers)
>>> cookies = response.cookies
>>> print(cookies.get_dict())
{'session-id': '456-2975694-3270026', 'otherparts': 'otherparts'}
Related Info:
OS: WINDOWS 10
PYTHON: 3.5
requests: 2.11.1
I am sorry for being a bit verbose.
What I tried and what I figure:
I googled certain keywords, but nobody seems to be facing this problem.
I figure it might have something to do with Amazon's anti-scraping measures, but other than changing my headers to disguise myself as a human, there isn't much I know to do.
I have also entertained the possibility that it might not be a case of a missing cookie, but rather that I have not set up requests.get(homepage, headers=headers) properly, so response.cookies is not as expected. Given this, I tried copying the request headers from my browser, leaving out only the cookie part, but the response cookies still miss the 'ubid-acbcn' part. Maybe some other parameter has to be set up?
You're trying to get cookies from a plain, one-off GET request. If you instead send it on behalf of a Session, you get the required ubid-acbcn value:
session = requests.Session()
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = session.get(homepage, headers=headers)
cookies = response.cookies
print(cookies.get_dict())
Output:
{'ubid-acbcn': '456-2652288-5841140' ...}
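A likely explanation, offered as an assumption: response.cookies only holds the cookies set on the final response, whereas a Session also accumulates cookies set on intermediate responses (for example, during redirects). A small sketch to inspect both:

import requests

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = session.get('http://www.amazon.cn', headers=headers)

print(response.cookies.get_dict())  # cookies set by the final response only
print(session.cookies.get_dict())   # cookies accumulated across the whole exchange
for hop in response.history:        # inspect any redirect hops and their cookies
    print(hop.status_code, hop.url, hop.cookies.get_dict())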
The cookies being set come from other pages/resources, probably loaded by JavaScript code, so you probably need to use the Selenium web driver. Check out the link below for a detailed discussion.
not getting all cookie info using python requests module