Does read(size) have a built-in pointer? - Python

I found this code, which monitors the progress of a download:
import urllib2

url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()
I do not see the block size being modified at any point in the while loop, so, to me, buffer = u.read(block_sz) should keep reading the same thing over and over again.
We know this doesn't happen; is it because read() in the while loop has a built-in pointer that starts reading from where you left off last time?
What about write()? Does it keep appending after where it last left off, even though the file is not opened in append mode?

File objects, network sockets, and other forms of I/O communication are data streams. Reading from them always produces the next section of data; calling .read() with a buffer size does not restart the stream from the beginning.
So yes, there is a virtual 'position' in a stream marking where you are reading from.
For files, that position is called the file pointer, and it advances both for reading and writing. You can alter the position by seeking, or simply by re-opening the file. You can also ask a file object to tell you the current position.
For network sockets, however, you can only go forward; the data is received from outside your computer, and reading consumes it.
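To make the file-pointer behavior concrete, here is a minimal sketch (the file name demo.txt is just for illustration) showing how tell() reports the position and seek() moves it:

import os

# create a small demo file
with open('demo.txt', 'wb') as f:
    f.write(b'hello world')

f = open('demo.txt', 'rb')
print(f.read(5))   # 'hello' - the first 5 bytes
print(f.tell())    # 5 - the file pointer advanced past what was read
print(f.read(6))   # ' world' - continues where the last read stopped
f.seek(0)          # move the pointer back to the start
print(f.read(5))   # 'hello' again
f.close()
os.remove('demo.txt')  # clean up the demo file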

Related

exit a for statement and write stream into a new file in python

I am a Python beginner trying to fetch a web radio stream and save it to a file. I want to flush the content after a while (for example, only keep one hour of the stream), so rather than writing the whole stream to one file I try to store it in many files (output_1.bin for one minute, output_2.bin for the next one, and so on).
But I am not able to correctly exit from the "for". Isn't exit() made for this purpose?
import requests

def download_file(url):
    r = requests.get(url, stream=True)
    i = 1
    while True:
        local_filename = "output_" + str(i) + ".bin"
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    i = i + 1
                    print("iteration", i, "and modulo result :", i % 10, "\n")
                    if i % 10 == 0:
                        exit()
            f.close()
        print("Am I out of the for ?")
    return local_filename

download_file('http://direct.franceinfo.fr/live/franceinfo-lofi.mp3')
exit() quits the whole Python application. What you want is break.
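For instance, here is a minimal sketch of the rotation using break (the chunk count and max_files limit are illustrative, not from the question). Because iter_content() is a single generator, each new file simply continues reading the stream where the previous one stopped:

import requests

def download_stream(url, chunks_per_file=10, max_files=3):
    r = requests.get(url, stream=True)
    chunks = r.iter_content(chunk_size=1024)  # one generator for the whole stream
    for i in range(1, max_files + 1):
        with open("output_" + str(i) + ".bin", 'wb') as f:
            written = 0
            for chunk in chunks:
                if chunk:  # filter out keep-alive chunks
                    f.write(chunk)
                    written += 1
                if written >= chunks_per_file:
                    break  # leaves only the for loop; the with block closes the file
        print("finished output_" + str(i) + ".bin")

download_stream('http://direct.franceinfo.fr/live/franceinfo-lofi.mp3')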

Differentiating between compressed .gz files and archived tar.gz files properly?

What is the proper way to differentiate between a plain compressed file in gzip or bzip2 format (e.g. .gz) and a tarball compressed with gzip or bzip2 (e.g. .tar.gz)? Identification using suffix extensions is not a reliable option, as files may end up renamed.
Now on the command line I am able to do something like this:
bzip2 -dc test.tar.bz2 |head|file -
So I attempted something similar in Python with the following function:
def get_magic(self, store_file, buffer=False, look_deeper=False):
    # see what we're indexing
    if look_deeper == True:
        m = magic.Magic(mime=True, uncompress=True)
    else:
        m = magic.Magic(mime=True)

    if buffer == False:
        try:
            file_type = m.from_file(store_file)
        except Exception, e:
            raise e
    else:
        try:
            file_type = m.from_buffer(store_file)
        except Exception, e:
            raise e

    return file_type
Then when trying to read a compressed tarball I'll pass in the buffer from elsewhere via:
file_buffer = open(file_name).read(8096)
archive_check = self.get_magic(file_buffer, True, True)
Unfortunately this then becomes problematic with the uncompress flag in python-magic, because it appears that python-magic expects me to pass in the entire file even though I only want it to read the buffer. I end up with the exception:
bzip2 ERROR: Compressed file ends unexpectedly
Seeing as the files I am looking at can be 2MB to 20GB in size, this becomes rather problematic. I don't want to read the entire file.
Could it be hacked by chopping the end off the compressed file and appending it to the buffer? Is it better to drop the idea of uncompressing the file with python-magic, and instead do it myself before passing in a buffer to identify, via:
file_buffer = open(file_name, "r:bz2").read(8096)
Is there a better way?
It is very likely a tar file if the uncompressed data at offset 257 is "ustar", or if the uncompressed data in its entirety is 1024 zero bytes (an empty tar file).
You can read just the first 1024 bytes of the uncompressed data using z = zlib.decompressobj() or z = bz2.BZ2Decompressor(), and z.decompress().
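A minimal sketch of that check for the gzip case (the 16 + MAX_WBITS argument tells zlib to expect a gzip header; for bzip2 you would substitute bz2.BZ2Decompressor()):

import zlib

def looks_like_tar(path):
    # decompress only the first 1024 bytes, so huge archives are never fully read
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # accept a gzip wrapper
    data = b''
    with open(path, 'rb') as f:
        while len(data) < 1024:
            chunk = f.read(8192)
            if not chunk:
                break
            data += d.decompress(chunk, 1024 - len(data))
    # "ustar" at offset 257, or 1024 zero bytes (an empty tar file)
    return data[257:262] == b'ustar' or data == b'\0' * 1024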
I'm actually going to mark Mark's answer as the correct one, as it gave me the hint.
I ended up shelving the project to do other things for a good six months, and was stumped because bz2.BZ2Decompressor didn't seem to be doing what it was supposed to. It turns out the problem isn't solvable in 1024 bytes.
#!/usr/bin/env python

import os
import bz2
import magic

store_file = "10mb_test_file.tar.bz2"
m = magic.Magic(mime=True)
file_buffer = open(store_file, "rb").read(1000000)
buffer_chunk = ""
decompressor = bz2.BZ2Decompressor()

print("encapsulating bz2")
print(type(file_buffer))
print(len(file_buffer))
file_type = m.from_buffer(file_buffer)
print("file type: %s :" % file_type)

buffer_chunk += decompressor.decompress(file_buffer)

print("compressed file contents")
print(type(buffer_chunk))
print(len(buffer_chunk))
file_type = m.from_buffer(buffer_chunk)
print("file type: %s :" % file_type)
Strangely, with a 20MB tar.bz2 file I can use a value of 200,000 bytes rather than 1,000,000 bytes, but this value won't work on the 10MB test file. I don't know if it is specific to the tar.bz2 archive involved, and I haven't looked into the algorithms to see whether the required amount falls at specific points, but reading roughly 10MB of data has so far worked on every archive file up to 5GB. An open().read(buffer) will read up to the size of the buffer or to EOF, so this is okay.
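Wrapped up as a helper, the approach looks roughly like this (a sketch assuming the ~10MB heuristic above; the function name is mine, not from any library):

import bz2
import magic

def inner_mime_type(path, max_bytes=10 * 1024 * 1024):
    # read() stops at EOF, so smaller files are simply read whole
    compressed = open(path, 'rb').read(max_bytes)
    # decompress whatever complete bz2 blocks fit in the buffer
    decompressed = bz2.BZ2Decompressor().decompress(compressed)
    return magic.Magic(mime=True).from_buffer(decompressed)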

File size not updated after write()?

In the code below I am trying to receive a file over a Python socket and write it to a local file. I have the following code:
chunk = clientDtSocket.recv(1024)
while chunk:
    print("In Chunk" + str(chunk))
    incomingFile.write(chunk)
    chunk = clientDtSocket.recv(1024)
I get the following:
In Chunkb'Sohail Khan'
But the file size remains the same.
Also, how can I count the number of bytes I have received?
Make sure the file is closed after the loop.
You can check the received byte count using the len function:
chunk = clientDtSocket.recv(1024)
while chunk:
    print("received {} bytes".format(len(chunk)))  # <-----
    print("In Chunk " + str(chunk))
    incomingFile.write(chunk)
    chunk = clientDtSocket.recv(1024)
incomingFile.close()  # <----
Instead of manually closing the file, consider using a with statement:
with open('/path/to/localfile', 'wb') as incomingFile:
    ....
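To count the total across the whole transfer, a minimal sketch (using the same socket and file names as the question) that accumulates the chunk lengths:

total = 0
with open('/path/to/localfile', 'wb') as incomingFile:
    while True:
        chunk = clientDtSocket.recv(1024)
        if not chunk:  # empty result means the sender closed the connection
            break
        total += len(chunk)  # running byte count
        incomingFile.write(chunk)
print("received {} bytes in total".format(total))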
When you open the file, set the write buffer size to 0:
bufsize = 0
incomingFile = open('....', 'w', bufsize)
It is normal behavior for data not to be saved to the file immediately after the write call, but only once the write buffer is completely filled. If you set the buffer size to 0 as in the example above, your data will be written immediately. Writing data from the write buffer to the file is often called "flushing".
Flushing also occurs when you close the file:
incomingFile.close()
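Alternatively, a sketch that keeps the default buffering but flushes explicitly after each write (the file name is illustrative):

incomingFile = open('/path/to/localfile', 'wb')
for chunk in [b'some', b'data']:  # stand-in for the recv loop
    incomingFile.write(chunk)
    incomingFile.flush()  # push Python's buffer to the OS so the file size updates
incomingFile.close()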

Downloading .Jar File using Python

I am attempting to download a file from the internet using Python along with the sys and urllib2 modules. The general idea behind the program is for the user to input the version of the file they want to download, 1_4 for example. The program then adds the user input and the "/whateverfile.jar" to the URL and downloads the file. My problem arises when the program inserts the "/whateverfile.jar": instead of inserting it onto the same line, the program inserts the "/whateverfile.jar" onto a new line, which causes the program to fail to download the .jar properly.
Can anyone help me with this? The code and output is below.
Code:
import sys
import urllib2

print('Type version of file you wish to download.')
print('To download 1.4 for instance type "1_4" using underscores in place of the periods.')

W = ('http://assets.file.net/')
X = sys.stdin.readline()
Y = ('/file.jar')
Z = X+Y
V = W+X
U = V+Y
T = U.lstrip()
print(T)
def JarDownload():
    url = "T"
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,
    f.close()
Output:
Type version of file you wish to download.
To download 1.4 for instance type "1_4" using underscores in place of the periods.
1_4
http://assets.file.net/1_4
/file.jar
I am currently not calling the JarDownload() function at all, and won't until the URL prints to the screen as a single line.
When you type the input and hit Return, the sys.stdin.readline() call will append the newline character to the string and return it. To get the desired effect, you should strip the newline from the input before using it. This should work:
X = sys.stdin.readline().rstrip()
As a side note, you should probably give more meaningful names to your variables. Names like X, Y, Z, etc. say nothing about a variable's content and make even simple operations, like your concatenations, unnecessarily hard to understand.
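Putting both points together, a minimal sketch of the same concatenation with descriptive names (the base URL and jar path are taken from the question's placeholders):

import sys

base_url = 'http://assets.file.net/'
version = sys.stdin.readline().rstrip()  # strip the trailing newline
jar_path = '/file.jar'
url = base_url + version + jar_path
print(url)  # e.g. http://assets.file.net/1_4/file.jar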

Best way to download all artifacts listed in an HTML5 cache.manifest file?

I am attempting to look at how an HTML5 app works, and any attempt to save the page inside the WebKit browsers (Chrome, Safari) includes some, but not all, of the cache.manifest resources. Is there a library or set of code that will parse the cache.manifest file and download all the resources (images, scripts, CSS)?
(original code moved to answer... noob mistake >.<)
I originally posted this as part of the question... (no newbie Stack Overflow poster EVER does this ;) since there was a resounding lack of answers. Here you go:
I was able to come up with the following Python script to do so, but any input would be appreciated =) (This is my first stab at Python code, so there might be a better way.)
import os
import urllib2
import urllib

cmServerURL = 'http://<serverURL>:<port>/<path-to-cache.manifest>'

# download file code taken from stackoverflow
# http://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python
def loadURL(url, dirToSave):
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(dirToSave, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,
    f.close()

# download the cache.manifest file
# since this request doesn't include the Content-Length header we will use a different api =P
urllib.urlretrieve(cmServerURL + 'cache.manifest', './cache.manifest')

# open the cache.manifest and go through it line-by-line checking for the existence of files
f = open('cache.manifest', 'r')
for line in f:
    filepath = line.split('/')
    if len(filepath) > 1:
        fileName = line.strip()
        # if the file doesn't exist, let's download it
        if not os.path.exists(fileName):
            print 'NOT FOUND: ' + line
            dirName = os.path.dirname(fileName)
            print 'checking directory: ' + dirName
            if not os.path.exists(dirName):
                os.makedirs(dirName)
            else:
                print 'directory exists'
            print 'downloading file: ' + cmServerURL + line,
            loadURL(cmServerURL + fileName, fileName)
