Images downloaded with Python are corrupted? - python

I am trying to download images, but they come out corrupted for some reason. For example, this is an image I want to get, and this is the result.
My test code is:
import urllib2
def download_web_image(url):
    request = urllib2.Request(url)
    img = urllib2.urlopen(request).read()
    with open('test.jpg', 'w') as f:
        f.write(img)
download_web_image("http://upload.wikimedia.org/wikipedia/commons/8/8c/JPEG_example_JPG_RIP_025.jpg")
Why is this and how do I fix this?

You are opening the 'test.jpg' file in the default (text) mode, which causes Python to use the "correct" newlines on Windows:
In text mode, the default when reading is to convert platform-specific
line endings (\n on Unix, \r\n on Windows) to just \n. When writing in
text mode, the default is to convert occurrences of \n back to
platform-specific line endings.
Of course, JPEG files are not text files, and 'fixing' the newlines will only corrupt the image. Instead, open the file in binary mode:
with open('test.jpg', 'wb') as f:
    f.write(img)
For more details, see the documentation.
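The same fix carries over to Python 3, where urllib2 no longer exists. The following is a minimal sketch, assuming Python 3 and urllib.request; the dest parameter and the returned byte count are illustrative additions, not part of the original code:

```python
import urllib.request


def download_web_image(url, dest):
    """Fetch url and store the raw bytes in dest without any translation."""
    with urllib.request.urlopen(url) as response:
        img = response.read()
    # 'wb' writes the bytes exactly as received; no newline conversion occurs.
    with open(dest, "wb") as f:
        f.write(img)
    return len(img)
```

It would be called as download_web_image(url, 'test.jpg') with the image URL from the question. Note that Python 3 refuses to write bytes to a text-mode file at all (it raises TypeError), so this particular mistake is caught early there.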

Related

Bytes written and actual file size differ when using Python's file.write()

Here is my code; it's really simple. I download a file (with the requests library) and save it to disk, but the size I get from the response is different from the size actually written to disk:
mus_resp = r.get("http://audio.xmcdn.com/group7/M07/21/73/wKgDWlbmOa3TD0D_AArDQp_Mj5Y641.m4a", headers=headers, stream=True)
# print len(mus_resp.content)  # here it is 705346 bytes
fd = open("file", 'w')
fd.write(mus_resp.content)
fd.flush()
fd.close()
exit()
print os.path.getsize('file')  # here it is 708677 bytes
Your data is binary data, not text, and it likely contains \n characters semi-randomly (they don't mean newlines, it's just the same byte as ASCII newline). When you write them to a text mode file on Windows, it's seamlessly converting to \r\n (Windows standard line endings), bloating the final file. Open the file in binary mode and you'll disable line ending conversions:
fd = open("file", 'wb') # 'wb' means write binary mode
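If newline translation is the only change, the size difference should equal the number of 0x0A bytes in the data; the 3331 extra bytes reported in the question (708677 - 705346) would be consistent with that. A small sketch of the effect, assuming Python 3: it simulates Windows text mode on any platform by passing newline='\r\n' explicitly, and uses latin-1 so that every byte maps to exactly one character. The written_sizes helper and the synthetic payload are hypothetical, for illustration only:

```python
import os
import tempfile


def written_sizes(data):
    """Return (binary_size, simulated_windows_text_size) for the given bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        bin_path = os.path.join(tmp, "binary.out")
        txt_path = os.path.join(tmp, "text.out")
        # Binary mode: bytes go to disk unchanged.
        with open(bin_path, "wb") as fd:
            fd.write(data)
        # Text mode with newline='\r\n': every '\n' is expanded to '\r\n',
        # which is what Windows text mode does by default.
        with open(txt_path, "w", encoding="latin-1", newline="\r\n") as fd:
            fd.write(data.decode("latin-1"))
        return os.path.getsize(bin_path), os.path.getsize(txt_path)


payload = bytes(range(256)) * 4  # synthetic data; contains 0x0A four times
binary_size, text_size = written_sizes(payload)
print(binary_size, text_size, payload.count(b"\n"))  # prints: 1024 1028 4
```

The text-mode file grows by one byte per 0x0A byte, while the binary-mode file matches len(data) exactly.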

What is os.linesep for?

Python's os module contains a value for a platform specific line separating string, but the docs explicitly say not to use it when writing to a file:
Do not use os.linesep as a line terminator when writing files opened in text mode (the default); use a single '\n' instead, on all platforms.
Docs
Previous questions have explored why you shouldn't use it in this context, but then what context is it useful for? When should you use the line separator, and for what?
the docs explicitly say not to use it when writing to a file
Not exactly. The doc says not to use it in text mode.
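The os.linesep is relevant when you want to iterate through the lines of a text file. In text mode, the internal scanner recognises line endings and replaces them by a single \n. With universal newlines, the default for io.open, this covers \n, \r and \r\n, so it includes os.linesep on the common platforms.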
For illustration, we write a binary file which contains 3 lines separated by \r\n (Windows delimiter):
import io
filename = "text.txt"
content = b'line1\r\nline2\r\nline3'
with io.open(filename, mode="wb") as fd:
    fd.write(content)
The content of the binary file is:
with io.open(filename, mode="rb") as fd:
    for line in fd:
        print(repr(line))
NB: I used the "rb" mode to read the file as a binary file.
I get:
b'line1\r\n'
b'line2\r\n'
b'line3'
If I read the content of the file using the text mode, like this:
with io.open(filename, mode="r", encoding="ascii") as fd:
    for line in fd:
        print(repr(line))
I get:
'line1\n'
'line2\n'
'line3'
The delimiter is replaced by \n.
The os.linesep is also used in write mode. Any \n character is converted to the system default line separator: \r\n on Windows, \n on POSIX, etc.
With the io.open function you can force the line separator to whatever you want.
Example: how to write a Windows text file:
with io.open(filename, mode="w", encoding="ascii", newline="\r\n") as fd:
    fd.write("one\ntwo\nthree\n")
If you read this file in binary mode like this:
with io.open(filename, mode="rb") as fd:
    content = fd.read()
    print(repr(content))
You get:
b'one\r\ntwo\r\nthree\r\n'
As you know, reading and writing files in text mode in Python converts the platform-specific line separator to '\n' and vice versa. But if you read a file in binary mode, no conversion takes place. Then you can explicitly convert the line endings using string.replace(os.linesep, '\n'). This can be useful if a file (or stream or whatever) contains a combination of binary and text data.
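The explicit conversion might look like the sketch below, assuming Python 3, where data read in binary mode is bytes and os.linesep is a str that has to be encoded first. The normalize_newlines helper and its sep parameter are hypothetical names; sep defaults to os.linesep and is overridden here only so that the result is the same on every platform:

```python
import os


def normalize_newlines(raw, sep=os.linesep):
    """Replace every occurrence of sep in raw bytes with a single b'\\n'."""
    return raw.replace(sep.encode("ascii"), b"\n")


windows_text = b"line1\r\nline2\r\nline3"
print(normalize_newlines(windows_text, sep="\r\n"))  # b'line1\nline2\nline3'
```

This should only be applied to the portion of the data known to be text; running it over the binary part would corrupt it in the same way text mode does.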

Python urllib2 Images Distorted

I'm making a program using the website http://placekitten.com, but I've run into a bit of a problem. Using this:
im = urllib2.urlopen(url).read()
f = open('kitten.jpeg', 'w')
f.write(im)
f.close()
The image turns out distorted with mismatched colors, like this:
http://imgur.com/zVg64Kn.jpeg
I was wondering if there was an alternative to extracting images with urllib2. If anyone could help, that would be great!
You need to open the file in binary mode:
f = open('kitten.jpeg', 'wb')
Python will otherwise translate line endings to the native platform form, a transformation that breaks binary data, as documented for the open() function:
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability.
When copying data from a URL to a file, you could use shutil.copyfileobj() to handle streaming efficiently:
from shutil import copyfileobj
im = urllib2.urlopen(url)
with open('kitten.jpeg', 'wb') as out:
    copyfileobj(im, out)
This will read data in chunks, avoiding filling memory with large blobs of binary data. The with statement handles closing the file object for you.
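To make the chunked reading concrete, here is a rough sketch of the kind of loop involved, assuming Python 3. It is not the actual shutil implementation; copy_in_chunks and the 16 KiB chunk size are illustrative choices, and in-memory streams stand in for the HTTP response and the output file:

```python
import io


def copy_in_chunks(src, dst, chunk_size=16 * 1024):
    """Copy from src to dst chunk by chunk and return the byte count."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"\xff\xd8\n\r\n" * 10000)  # stand-in for a response
target = io.BytesIO()  # stand-in for a file opened with 'wb'
print(copy_in_chunks(source, target))  # prints: 50000
```

Only one chunk is held in memory at a time, regardless of how large the download is.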
Change
f = open('kitten.jpeg', 'w')
to read
f = open('kitten.jpeg', 'wb')
See http://docs.python.org/2/library/functions.html#open for more information. What's happening is that the newlines in the jpeg are getting modified in the process of saving, and opening as a binary file will prevent this.
If you're using Windows, you have to open the file in binary mode:
f = open('kitten.jpeg', 'wb')
Or more Pythonically:
import urllib2
url = 'http://placekitten.com.s3.amazonaws.com/homepage-samples/200/140.jpg'
image = urllib2.urlopen(url).read()
with open('kitten.jpg', 'wb') as handle:
    handle.write(image)

Read a zip and write it to another file in Python

I want to read a file and write it back out. Here's my code:
file = open( zipname , 'r' )
content = file.read()
file.close()
alt = open('x.zip', 'w')
alt.write(content )
alt.close()
This doesn't work, why?????
Edit:
The rewritten file is corrupt
(python 2.7.1 on windows)
Read and write in the binary mode, 'rb' and 'wb':
f = open(zipname , 'rb')
content = f.read()
f.close()
alt = open('x.zip', 'wb')
alt.write(content )
alt.close()
The reason text mode didn't work on Windows is that the newline translation (from '\r\n' to '\n' when reading, and from '\n' back to '\r\n' when writing) mangled the binary data in the zip file.
From this bit of the manual:
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
If I run this program on my OS X or Linux box, it works exactly as you would expect. The file x.zip has exactly the same checksum as the original zip file and is not corrupt. I believe that Windows is one of the platforms where you need to explicitly open files in binary mode; try:
file = open(zipname, 'rb')
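The checksum comparison mentioned above can be reproduced with a short script. This is a sketch assuming Python 3; the helper names, the temporary archive and its single member are made up for the demonstration, and SHA-256 is just one possible checksum:

```python
import hashlib
import os
import tempfile
import zipfile


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at path."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def copy_binary(src_path, dst_path):
    """Copy a file byte for byte using binary mode on both sides."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read())


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.zip")
    copy = os.path.join(tmp, "x.zip")
    # Build a small archive to copy (illustrative content only).
    with zipfile.ZipFile(original, "w") as archive:
        archive.writestr("hello.txt", "line1\r\nline2\n")
    copy_binary(original, copy)
    same_checksum = sha256_of(original) == sha256_of(copy)
    with zipfile.ZipFile(copy) as archive:
        first_bad_member = archive.testzip()  # None when all CRCs match
    print(same_checksum, first_bad_member)  # prints: True None
```

With 'rb' and 'wb' on both sides, the copy is identical on every platform, Windows included.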

python opens text file with a space between every character

Whenever I try to open a .csv file with the python command
fread = open('input.csv', 'r')
it always opens the file with spaces between every single character. I'm guessing it's something wrong with the text file because I can open other text files with the same command and they are loaded correctly. Does anyone know why a text file would load like this in python?
Thanks.
Update
Ok, I got it with the help of Jarret Hardie's post
this is the code that I used to convert the file to ascii
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
mytext = mytext.encode('ascii', 'ignore')
fwrite = open('input-ascii.csv', 'wb')
fwrite.write(mytext)
Thanks!
The post by recursive is probably right... the contents of the file are likely encoded with a multi-byte charset. If this is, in fact, the case you can likely read the file in python itself without having to convert it first outside of python.
Try something like:
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
The 'b' flag ensures the file is read as binary data. You'll need to know (or guess) the original encoding... in this example, I've used utf-16, but YMMV. This will convert the file to unicode. If you truly have a file with multi-byte chars, I don't recommend converting it to ascii as you may end up losing a lot of the characters in the process.
EDIT: Thanks for uploading the file. There are two bytes at the front of the file which indicate that it does, indeed, use a wide charset. If you're curious, open the file in a hex editor as some have suggested... you'll see something in the text version like 'I.D.|.' (etc). The dot is the extra byte for each char.
The code snippet above seems to work on my machine with that file.
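The check for those two leading bytes can also be done in code. The sketch below assumes Python 3 and assumes the two bytes are a UTF-16 byte order mark, which the uploaded file itself is not available here to confirm; decode_wide_text and the sample data are hypothetical:

```python
import codecs


def decode_wide_text(raw):
    """Decode raw bytes as UTF-16 if they start with a UTF-16 byte order mark."""
    if raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")  # the codec consumes the BOM itself
    raise ValueError("no UTF-16 byte order mark found")


sample = "ID|Name\n".encode("utf-16")  # the encoder prepends a BOM
print(repr(decode_wide_text(sample)))  # prints: 'ID|Name\n'
```

Raising an error when no mark is found avoids silently guessing an encoding; a file without the mark would need its encoding determined some other way.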
The file is encoded in some unicode encoding, but you are reading it as ascii. Try to convert the file to ascii before using it in python.
Isn't CSV a simple text file with values separated by commas?
Just try to open it with a text editor to see if the file is correctly formed.
To read an encoded file, you can simply replace open with codecs.open.
fread = codecs.open('input.csv', 'r', 'utf-16')
It never occurred to me, but as truppo said, there must be something wrong with the file.
Try to open the file in Excel/BrOffice Calc and save it as CSV again.
If the problem persists, try a subset of the data: the first 10, last 10, or an intermediate 10 lines of the file.
Open the file in binary mode, 'rb'. Check it in a HEX Editor and check for null padding '00'. Open the file in something like Scintilla Text Editor to check the characters present in the file.
Here's the quick and easy way, esp if python won't parse the input correctly
sed 's/ \(.\)/\1/g'
