Write contents of URL request to file - python

I am trying to fetch a list from a php file using python and save it to a file:
import urllib.request
page = urllib.request.urlopen('http://crypto-bot.hopto.org/server/list.php')
f = open("test.txt", "w")
f.write(str(page))
f.close()
print(page.read())
Output on screen (divided onto four lines for readability):
ALF\nAMC\nANC\nARG\nBQC\nBTB\nBTE\nBTG\nBUK\nCAP\nCGB\nCLR\nCMC\nCRC\nCSC\nDGC\n
DMD\nELC\nEMD\nFRC\nFRK\nFST\nFTC\nGDC\nGLC\nGLD\nGLX\nHBN\nIXC\nKGC\nLBW\nLKY\n
LTC\nMEC\nMNC\nNBL\nNEC\nNMC\nNRB\nNVC\nPHS\nPPC\nPXC\nPYC\nQRK\nSBC\nSPT\nSRC\n
STR\nTRC\nWDC\nXPM\nYAC\nYBC\nZET\n
Output in file:
<http.client.HTTPResponse object at 0x00000000031DAEF0>
Can you tell me what I am doing wrong?

Use urllib.urlretrieve (urllib.request.urlretrieve in Python 3).
In the console:
>>> import urllib
>>> urllib.urlretrieve('http://crypto-bot.hopto.org/server/list.php','test.txt')
('test.txt', <httplib.HTTPMessage instance at 0x101338050>)
This results in a file, test.txt, being saving in the current working directory with the contents
ALF
AMC
ANC
ARG
...etc...

You need to read from the file object before writing to the file. Also you should the same object to both file and screen.
Do this:
import urllib.request
page = urllib.request.urlopen('http://crypto-bot.hopto.org/server/list.php')
f = open("test.txt", "w")
content = page.read()
f.write(content)
f.close()
print(content)

You're not reading the content from the urlopen file-like when you write to the file.
Also, shutil.copyfileobj().

Related

Python: downloading xml files in batch returns a damaged zip file

Drawing inspiration from this post, I am trying to download a bunch of xml files in batch from a website:
import urllib2
url='http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
data = f.read()
with open("C:\Users\MyName\Desktop\data.zip", "wb") as code:
code.write(data)
The zip file is created within seconds, but as I attempt to access it, an error window comes up:
Windows cannot open the folder.
The Compressed (zipped) Folder "C:\Users\MyName\Desktop\data.zip" is invalid.
What am I doing wrong here?
you are not opening file handles inside the zip file:
import urllib2
from bs4 import BeautifulSoup
import zipfile
url='http://ratings.food.gov.uk/open-data/'
fileurls = []
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')
tablewrapper = soup.find(id='openDataStatic')
for table in tablewrapper.find_all('table'):
for link in table.find_all('a'):
fileurls.append(link['href'])
with zipfile.ZipFile("data.zip", "w") as code:
for url in fileurls:
print('Downloading: %s' % url)
f = urllib2.urlopen(url)
data = f.read()
xmlfilename = url.rsplit('/', 1)[-1]
code.writestr(xmlfilename, data)
You are doing nothing to encode this as zip file. If instead you choose to open it in a plain text editor such as notepad it should show you the raw xml.

Python reading CSV file using URL giving error

I am trying to read a CSV file using the requests library but I am having issues.
import requests
import csv
url = 'https://storage.googleapis.com/sentiment-analysis-dataset/training_data.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
I then tried
for row in reader:
print(row)
but it gave me this error:
Error: iterator should return strings, not bytes (did you open the file in text mode?)
How should I fix this?
What you probably want is:
text = r.iter_lines(decode_unicode=True)
This will return a strings-iterator instead of a bytes-iterator. (See here for documentation.)

zip downloaded csv file

I'm trying to download a csv file and zip it before saving it.
The code I'm using is:
req = urllib2.Request(url)
fh = urllib2.urlopen(req)
with contextlib.closing(ZipFile("test.csv.zip", "w", zipfile.ZIP_STORED)) as f:
f.write(fh.read())
f.close()
What this does is to print the contents of the csv file to stdout and create an empty zipfile.
Any ideas of what could be wrong?
Thanks,
Isaac
Take a look at the documentation for ZipFile.write(). Here is the function signature:
ZipFile.write(filename[, arcname[, compress_type]])
The first argument should be the file name of the file that you are adding to the zip archive, not the contents of the file. Instead you are passing the entire contents of the downloaded resource as the file name and, because that will likely be illegal (too long), you see the file contents dumped as part of the error message of the raised exception.
To fix this you need to use is ZipFile.writestr():
req = urllib2.Request(url)
fh = urllib2.urlopen(req)
with ZipFile("test.csv.zip", "w", zipfile.ZIP_STORED) as f:
f.writestr('test.csv', fh.read())
If it is your intention to compress only a single file, you probably don't need to use a zip archive, and you might be better off using gzip or bzip2.

Download and save PDF file with Python requests module

I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
In [1]: import requests
In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
In [3]: response = requests.get(url)
In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
...: f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2 f.write(response.text)
3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
In [5]: import codecs
In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
...: f.write(response.text)
...:
I know it is a codec problem of some kind but I can't seem to get it to work.
You should use response.content in this case:
with open('/tmp/metadata.pdf', 'wb') as f:
f.write(response.content)
From the document:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
So that means: response.text return the output as a string object, use it when you're downloading a text file. Such as HTML file, etc.
And response.content return the output as bytes object, use it when you're downloading a binary file. Such as PDF file, audio file, image, etc.
You can also use response.raw instead. However, use it when the file which you're about to download is large. Below is a basic example which you can also find in the document:
import requests
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)
with open('/tmp/metadata.pdf', 'wb') as fd:
for chunk in r.iter_content(chunk_size):
fd.write(chunk)
chunk_size is the chunk size which you want to use. If you set it as 2000, then requests will download that file the first 2000 bytes, write them into the file, and do this again, again and again, unless it finished.
So this can save your RAM. But I'd prefer use response.content instead in this case since your file is small. As you can see use response.raw is complex.
Relates:
How to download large file in python with requests.py?
How to download image using requests
In Python 3, I find pathlib is the easiest way to do this. Request's response.content marries up nicely with pathlib's write_bytes.
from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
You can use urllib:
import urllib.request
urllib.request.urlretrieve(url, "filename.pdf")
Please note I'm a beginner. If My solution is wrong, please feel free to correct and/or let me know. I may learn something new too.
My solution:
Change the downloadPath accordingly to where you want your file to be saved. Feel free to use the absolute path too for your usage.
Save the below as downloadFile.py.
Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension
Remember to add an extension!
Example usage: python downloadFile.py http://www.google.co.uk google.html
import requests
import sys
import os
def downloadFile(url, fileName):
with open(fileName, "wb") as file:
response = requests.get(url)
file.write(response.content)
scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]
print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, downloadPath + fileName)
print('file downloaded...')
print('exiting program...')
Generally, this should work in Python3:
import urllib.request
..
urllib.request.get(url)
Remember that urllib and urllib2 don't work properly after Python2.
If in some mysterious cases requests don't work (happened with me), you can also try using
wget.download(url)
Related:
Here's a decent explanation/solution to find and download all pdf files on a webpage:
https://medium.com/#dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
regarding Kevin answer to write in a folder tmp, it should be like this:
with open('./tmp/metadata.pdf', 'wb') as f:
f.write(response.content)
he forgot . before the address and of-course your folder tmp should have been created already

Replace and overwrite instead of appending

I have the following code:
import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()
where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the file "test.xml" is appended, i.e. I have the old content follwed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?
You need seek to the beginning of the file before writing and then use file.truncate() if you want to do inplace replace:
import re
myfile = "path/test.xml"
with open(myfile, "r+") as f:
data = f.read()
f.seek(0)
f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
f.truncate()
The other way is to read the file then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
data = f.read()
with open(myfile, "w") as f:
f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really related to Python. The interpreter calls the corresponding low level API. The method truncate() works the same in the C programming language: See http://man7.org/linux/man-pages/man2/truncate.2.html
file='path/test.xml'
with open(file, 'w') as filetowrite:
filetowrite.write('new content')
Open the file in 'w' mode, you will be able to replace its current text save the file with new contents.
Using truncate(), the solution could be
import re
#open the xml file for reading:
with open('path/test.xml','r+') as f:
#convert to string:
data = f.read()
f.seek(0)
f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
f.truncate()
import os#must import this library
if os.path.exists('TwitterDB.csv'):
os.remove('TwitterDB.csv') #this deletes the file
else:
print("The file does not exist")#add this to prevent errors
I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I was appending to a new file on each run of my code.
See from How to Replace String in File works in a simple way and is an answer that works with replace
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")
for line in fin:
fout.write(line.replace('pyton', 'python'))
fin.close()
fout.close()
in my case the following code did the trick
with open("output.json", "w+") as outfile: #using w+ mode to create file if it not exists. and overwrite the existing content
json.dump(result_plot, outfile)
Using python3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
Similar method using different approach to backups:
from pathlib import Path
filepath = Path("/tmp/test.xml")
filepath.rename(filepath.with_suffix('.bak')) # different approach to backups
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))

Categories

Resources