I am trying to download a PDF from the internet. I have a battery of links to pull PDFs from.
I have this block of code:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
url = 'http://webapps.rrc.texas.gov/CMPL/viewPdfReportFormAction.do?method=cmplG1FormPdf&packetSummaryId=2928'
opts = Options()
opts.headless = True
assert opts.headless # Operating in headless mode
browser_detail = Firefox(options=opts)
browser_detail.get(url)
print(browser_detail.page_source)
with open('temp/metadata.pdf', 'wb') as fd:
    fd.write(browser_detail.page_source.encode('utf-8'))  # page_source is a str, so it must be encoded for binary mode
browser_detail.close()
I have also tried requests, with the same result:
import requests
url = 'http://webapps.rrc.texas.gov/CMPL/viewPdfReportFormAction.do?method=cmplG1FormPdf&packetSummaryId=2928'
r = requests.get(url, stream=True)
with open('temp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(2000):
        fd.write(chunk)
The problem is that if I put the URL into a browser, the PDF comes up, but when I fetch it with this code, the page_source is HTML. This makes me think there is some forwarding or server-side processing involved.
How do I get the PDF down?
Thanks!
I was able to pull down the PDF file using requests.
The page is looking for a proper User-Agent, so I set it to a Chrome-on-macOS one.
h = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8","User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" }
r = requests.get(url, stream=True, headers=h)
And it worked.
tmp/project/1> file metadata.pdf
metadata.pdf: PDF document, version 1.4
with open('temp/metadata.pdf', 'wb') as fd:
fd.write(r.content)
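If you want to be sure the server really returned a PDF rather than an HTML error page before writing the file, one cheap guard (my own addition, continuing from the snippet above) is to check the Content-Type header:
content_type = r.headers.get('Content-Type', '')
if 'pdf' not in content_type.lower():
    # 'text/html' here usually means the server sent back an error page instead of the PDF
    raise RuntimeError('Expected a PDF but got Content-Type: ' + content_type)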
I want to download an Excel file from this link via Python:
https://www.tfex.co.th/tfex/historicalTrading.html?locale=en_US&symbol=S50Z21&decorator=excel&series=&page=4&locale=en_US&locale=en_US&periodView=A
Here is my code:
import requests

url = 'https://www.tfex.co.th/tfex/historicalTrading.html?locale=en_US&symbol=S50Z21&decorator=excel&series=&page=4&locale=en_US&locale=en_US&periodView=A'
resp = requests.get(url)
with open('file.xls', 'wb') as f:
    f.write(resp.content)
But file.xls is actually an HTML text file rather than a real spreadsheet.
I've tried adding headers:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
resp = requests.get(url, headers=headers)
But it didn't help. Thank you in advance.
Edit:
Found a way using pandas.
import pandas as pd
url = r'https://www.tfex.co.th/tfex/historicalTrading.html?locale=en_US&symbol=S50Z21&decorator=excel&series=&page=4&locale=en_US&periodView=A'
# read all HTML tables on the page into a list of DataFrames
tables = pd.read_html(url)
# concatenate the tables into a single DataFrame
merged = pd.concat(tables)
# write the result to a real Excel file
merged.to_excel("output.xlsx")
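A note on dependencies (my own addition, not from the original answer): pd.read_html needs an HTML parser installed (lxml, or beautifulsoup4 with html5lib), and to_excel writing .xlsx needs openpyxl (or xlsxwriter); pandas raises an ImportError naming the missing package if either is absent.
pip install lxml openpyxl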
Hope this helps :)
Ignore below, this was before edit:
I know this is still problematic depending on your downstream application: the code below still downloads the data in HTML format, but Excel can open that file regardless.
import requests
url = r'https://www.tfex.co.th/tfex/historicalTrading.html?locale=en_US&symbol=S50Z21&decorator=excel&series=&page=4&locale=en_US&periodView=A'
r = requests.get(url, allow_redirects=False)
with open('out.xls', 'wb') as f:
    f.write(r.content)
When I open this in Excel I get a warning, which I click through.
I am trying to do a project where I search for similar images using Google's search-by-image and Google's Custom Search API. From that, I get the correct URL that leads to similar images; the results page looks like this: LINK. I just want the HTML of the page this leads to. But I tried this:
r = requests.get(fetchUrl)
print(r.text)
This returns just the HTML of a really old Google main page, and I am not sure where it is coming from. I also tried adding a header to ensure that Google doesn't block me from scraping.
Entire code:
import requests
filePath = 'Initial_Img/a/frame1.jpg'
searchUrl = 'http://www.google.com/searchbyimage/upload'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']
print(fetchUrl)
Do you have any ideas? Any help is truly appreciated.
The problem is something about the way Google renders the page: you would have to use Selenium and drive an actual web browser to get the HTML. To solve your problem:
Run: sudo apt install firefox-geckodriver (and make sure Firefox itself is installed)
Run: pip install selenium
Change your code to this:
import requests
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

filePath = 'Initial_Img/a/test.jpg'
searchUrl = 'http://www.google.com/searchbyimage/upload'
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']
options = Options()
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox") # linux only
options.add_argument("--headless")
options.headless = True # also works
nav = webdriver.Firefox(options=options)
nav.get(fetchUrl)
print(nav.page_source)
nav.page_source gets you the HTML of the final page. I hope this helps. I don't know why the plain requests method doesn't work; if anyone knows the reason, please comment below.
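One small addition worth making (it is not in the snippet above): shut the browser down when you are done, otherwise headless Firefox processes pile up in the background.
nav.quit()  # terminate the headless Firefox instance once the page source has been captured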
I am trying to download a large number of HTML pages from a certain website with the following Python code, using the "requests" package:
import time

import requests

FROM = 547495
TO = 570000
for page_number in range(FROM, TO):
    url = DEFAULT_URL + str(page_number)  # DEFAULT_URL is defined elsewhere in my script
    response = requests.get(url)
    if response.status_code == 200:
        with open(str(page_number) + ".html", "wb") as file:
            file.write(response.content)
    time.sleep(0.5)
I put the sleep(0.5) in so that the web server would not mistake this for a DDoS attack.
After about 20,000 pages I started getting only 403 Forbidden HTTP status codes, and I can't download pages anymore.
But if I try to open the same pages in my browser, they open fine, so I guess the web server did not block me.
Does someone have an idea what caused this, and how I can handle it?
thank you
Make it look like the request comes from your browser by using headers, and set a cookie ID if the site requires a session; here is an example. You can retrieve the header values from the "Network" tab of your browser's developer tools when visiting the pages.
import time

import requests

with requests.session() as sess:
    sess.headers["User-Agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0"
    # make one initial request so the server issues a session cookie
    sess.get(DEFAULT_URL + str(FROM))
    sess.headers["Cookie"] = "eZSESSID={}".format(sess.cookies.get("eZSESSID"))
    for page_number in range(FROM, TO):
        # this request was missing from the original snippet
        response = sess.get(DEFAULT_URL + str(page_number))
        if response.status_code == 200:
            with open(str(page_number) + ".html", "wb") as file:
                file.write(response.content)
        time.sleep(0.5)
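If 403s still show up after a while, it may also help to jitter the delay and back off when the server starts refusing, rather than retrying at a fixed rate. This is a minimal sketch of my own; the 60-second back-off and the delay range are assumptions, not values from the original answer.
import random
import time

def fetch_page(sess, url):
    """Fetch one page, backing off once on 403 instead of hammering the server."""
    response = sess.get(url)
    if response.status_code == 403:
        time.sleep(60)             # assumed back-off period; tune it for the target server
        response = sess.get(url)   # a single retry after the pause
    if response.status_code == 200:
        time.sleep(random.uniform(0.5, 1.5))  # jittered delay looks less bot-like than a fixed 0.5s
        return response.content
    return None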
When I access this website, my browser opens a box to download a zip file.
I am trying to download the zip file through a Python script (I am a beginner in coding). I would like to automate downloading a batch of similar links in the future, but I am testing with only one link for now. Here is my code:
import requests
url = 'https://sigef.incra.gov.br/geo/exportar/vertice/shp/454698fd-6dfa-49a1-8096-bd9bb57b62ca'
r = requests.get(url, verify=False, allow_redirects=True)
open('verticeshp454698fd-6dfa-49a1-8096-bd9bb57b62ca.zip', 'wb').write(r.content)
As output I get a broken zip file, not the one I wanted. I also get the following message in the command prompt:
C:\Users\joaop\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py:979: InsecureRequestWarning: Unverified HTTPS request is being made to host 'sigef.incra.gov.br'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
What steps am I missing here?
Thanks in advance for your help.
I got it working by adding a / at the end of the URL:
import requests

# the `/` at the end is important
url = 'https://sigef.incra.gov.br/geo/exportar/vertice/shp/454698fd-6dfa-49a1-8096-bd9bb57b62ca/'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36",
}
# verify=False skips TLS certificate verification; it is what triggers the InsecureRequestWarning above
r = requests.get(url, headers=headers, verify=False, allow_redirects=True)
# get the filename from the headers: `454698fd-6dfa-49a1-8096-bd9bb57b62ca_vertice.zip`
filename = r.headers['Content-Disposition'].split("filename=")[-1]
with open(filename, 'wb') as f:
    f.write(r.content)
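One caveat from my side (not in the original answer): some servers wrap the filename in quotes inside Content-Disposition, so stripping them after the split is a safer parse.
filename = r.headers['Content-Disposition'].split("filename=")[-1].strip('"')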
Here is the code I use for downloading any image; it always works fine except for this site (www.pexels.com). It actually downloads the image, but the file is corrupted when it comes from this site. I wonder why?
url = "https://images.pexels.com/photos/844297/pexels-photo-844297.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=650&w=940"
response = requests.get(url , stream = True)
file= open("Hello.jpg" , 'wb')
for chunk in response.iter_content(10000):
file.write(chunk)
file.close()
You need to add a user-agent to your request headers.
The following code works:
import requests

url = "https://images.pexels.com/photos/844297/pexels-photo-844297.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=650&w=940"
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
}
response = requests.get(url, stream=True, headers=headers)
with open("Hello.jpg", 'wb') as file:
    for chunk in response.iter_content(10000):
        file.write(chunk)
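As a small extra safeguard (my own addition, not part of the original answer), calling response.raise_for_status() right after the request raises immediately on a 4xx/5xx response instead of silently writing an error page into Hello.jpg:
response = requests.get(url, stream=True, headers=headers)
response.raise_for_status()  # fail fast if the site still rejects the request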