I currently have a script that requests a file via requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to set stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?
----Adding current code----
if not os.path.exists(output_path):
    os.makedirs(output_path)

memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

outFile = open('output/tempfile', 'wb')
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)

f = open('output/tempfile', 'rb').read().split('\r\n\r\n')

arf = open('output/recording.arf', 'wb')
arf.write(f[3])
os.remove('output/tempfile')
Okay, I was bored and wanted to figure out the best way to do this. Turns out that my initial way in the comments above was overly complicated (unless considering some scenario where time is absolutely critical, or memory is severely constrained). A buffer is a much simpler way to achieve this, so long as you take two or more blocks at a time. This code emulates the question's scenario for demonstration.
Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/byte conversions, since using a regex would require casting each block of bytes to a string. The approach below requires no string conversions, instead operating solely on the bytes returned from requests.post(), and in turn writing those same bytes to file, without conversions.
from pprint import pprint
someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''
n = 16
# emulate a stream by creating 37 blocks of 16 bytes
byteBlocks = [bytearray(someString[i:i+n], 'utf-8') for i in range(0, len(someString), n)]
pprint(byteBlocks)

# this string is present twice, but both times it is split across two bytearrays
matchBytes = bytearray('requests.post()', 'utf-8')

# our buffer
buff = bytearray()

count = 0
for bb in byteBlocks:
    buff += bb
    count += 1

    # every two blocks
    if (count % 2) == 0:
        if count == 2:
            start = 0
        else:
            start = len(matchBytes)

        # check the bytes starting from block (n - 2 - len(matchBytes)) to (len(buff) - len(matchBytes))
        # this will check all the bytes only once...
        if matchBytes in buff[((count-2)*n)-start : len(buff)-len(matchBytes)]:
            print('Match starting at index:', buff.index(matchBytes), 'ending at:', buff.index(matchBytes)+len(matchBytes))
Update:
So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response, but you should be able to figure out any bugs yourself.
Since you aren't actually working with a stream directly, i.e. you're given the finished response object from requests.post(), you don't have to worry about using chunks in the networking sense. The "chunks" that requests refers to is really its way of dishing out the bytes it already has. You can access the bytes directly using r.raw.read(n), but as far as I can tell the response object doesn't let you see how many bytes there are in r.raw, so you're more or less forced to use the iter_content method.
Anyway, this code should copy all the bytes from the response object into a single bytes object, and then you can search and split that as before.
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

match = b'\r\n\r\n'
data = b''
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk

f = data.split(match)

arf = open('output/recording.arf', 'wb')
arf.write(f[3])
arf.close()
I am currently doing a project where I turn my Pi (Model 4 2GB) into a sort of NAS archive. I decided to learn a bit of Python along the way and wrote a small console app to "manage" my database. One function I added hashes the files in the database so it knows when files are corrupted.
To achieve this I hash a file like this:
from hashlib import sha256

with open(file, "rb") as f:
    rbytes = f.read()
    readable_hash = sha256(rbytes).hexdigest()
Now when I run this on smaller files it works just fine but on large files like videos it spits out a MemoryError - I presume this is because it doesn't have enough RAM to hold the file?
I've seen that you can break the read up into chunks but does this also work for hashing? If so, how?
Also I'm not a programmer. I want to learn in the process, so the simpler the solution the better - I want to actually understand the code I use. :) Doesn't need to be a super fast algorithm that squeezes out every millisecond either, as long as it gets the job done.
Thanks for any help in advance!
One solution is to hash each chunk of the file together with the hash computed so far; the final hash still depends on the whole file, there are just a few extra steps.
import hashlib

def hash_file(filename, chunk_size):
    hashed = b""  # running hash value, starts empty
    with open(filename, 'rb') as f:  # read from the file in binary mode
        while True:  # read chunk_size bytes until the file is exhausted
            chunk = f.read(chunk_size)
            if chunk:  # as long as "chunk" is not empty
                # hash the new chunk together with the previous hash
                hashed = hashlib.md5(chunk + hashed).digest()
            else:
                break  # stop the loop at end of file
    return hashed.hex()

print(hash_file('file.txt', 1000))
By hashing the contents over and over again we always create a value that originates from the previous hash, so the running value stays small (because MD5 hashes always have the same size) while still depending on the whole file.
PS: the chunk_size argument can be anything, but more bytes per chunk means more memory while fewer bytes mean longer compute time; try what fits your needs. 1000–9000 seems to be a good spot.
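For comparison, hashlib can also be fed the file in chunks directly via update(), which keeps memory use constant and covers the sha256 case from the question. This is just a minimal sketch (the 65536-byte chunk size and the function name are illustrative, and it is untested against your database):

import hashlib

def sha256_of_file(path, chunk_size=65536):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            h.update(chunk)  # feed each chunk into the running hash
    return h.hexdigest()

print(sha256_of_file("file.txt"))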
I am trying to print visible feedback for the user in the terminal while my application downloads a file from the web and writes it to the hard drive, but I could not find out how to do this by reading the documentation or googling it.
This is my code:
res = requests.get(url_to_file)
with open("./downloads/%s" % (file_name), 'wb') as f:
    f.write(res.content)
I was expecting to figure out how to make something like this:
Downloading file ........
# it keeps going until the download is finished and the file is written
Done!
I am really struggling even to start, because none of the methods returns a "promise" (like in JS).
Any help would be very appreciated!
Thanks!
requests.get by default downloads the entirety of the requested resource before it gets back to you. However, it has an optional argument stream, which allows you to invoke .iter_content or .iter_lines on the Response object. This allows you to take action every N bytes (or as each chunk of data arrives), or at every line, respectively. Something like this:
chunks = []
chunk_size = 16384 # 16Kb chunks
# alternately
# chunk_size = None # whenever a chunk arrives
res = requests.get(url_to_file, stream=True)
for chunk in res.iter_content(chunk_size):
    chunks.append(chunk)
    print(".", end="")
data = b''.join(chunks)
This still blocks though, so nothing else will be happening. If you want more of the JavaScript style, per Grismar's comment, you should run under Python's async loop. In that case, I suggest using aiohttp rather than requests, as it is designed with async use in mind.
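For reference, here is a rough aiohttp sketch of the same dot-per-chunk idea, reusing the url_to_file name from the question; the chunk size is arbitrary and this is untested against your URL:

import asyncio
import aiohttp

async def download(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            chunks = []
            async for chunk in resp.content.iter_chunked(16384):
                chunks.append(chunk)
                print(".", end="", flush=True)
            return b"".join(chunks)

data = asyncio.run(download(url_to_file))
print("\nDone!")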
Here's a version that will download the file into a bytearray in a separate thread.
As mentioned in other answers and comments, there are other alternatives that are developed with async operations in mind, so don't read too much into the decision to go with threading; it's just to demonstrate the concept (and chosen for convenience, since threading comes with Python).
In the below code, if the size of the file is known, each . will correspond to 1%. As a bonus, the downloaded and total number of bytes will be printed at the start of the line like (1234 B / 1234567 B). If size is not known, the fallback solution is to have each . represent a chunk.
import requests
import threading
def download_file(url: str):
    headers = {"<some_key>": "<some_value>"}
    data = bytearray()
    with requests.get(url, headers=headers, stream=True) as request:
        if file_size := request.headers.get("Content-Length"):
            file_size = int(file_size)
        else:
            file_size = None

        received = 0
        for chunk in request.iter_content(chunk_size=2**15):
            received += len(chunk)
            data += chunk
            try:
                num_dots = int(received * 100 / file_size)
                print(
                    f"({received} B/{file_size} B) " + "." * num_dots,
                    end="\r"
                )
            except TypeError:
                print(".", end="")
    print("\nDone!")
url = "<some_url>"
thread = threading.Thread(target=download_file, args=(url,))
thread.start()
# Do something in the meantime
thread.join()
Do keep in mind that I've left out the lock to protect against simultaneous access to stdout to reduce the noise. I've also left out writing the bytearray to file at the end (or writing the chunks to file as they are received if the file is large), but keep in mind that you may want to use a lock for that as well if you read and/or write to the same file in any other part of your script.
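As a minimal sketch of the lock idea mentioned above (the helper name is made up), guarding stdout so progress lines from several threads don't interleave:

import threading

print_lock = threading.Lock()  # share one lock between all download threads

def report_progress(message: str):
    # only one thread at a time may write to stdout
    with print_lock:
        print(message, end="\r")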
I have a huuuuuge CSV online and I want to read it line by line without downloading it. But this file is behind a proxy.
I wrote this code :
import requests
import pandas as pd
import io
from requests_ntlm import HttpNtlmAuth

cafile = 'mycert.crt'
proxies = {"http":"http://ipproxy:port", "https":"http://ipproxy:port"}
auth = HttpNtlmAuth('Username','Password')
url = 'http://myurl/ressources.csv'
content = requests.get(url, proxies=proxies, auth=auth, verify=cafile).content
csv_read = pd.read_csv(io.StringIO(content.decode('utf-8')))
pattern = 'mypattern'

for row in csv_read:
    if row[0] == pattern:
        print(row)
        break
The code above works, but the line 'content = requests.get(...' takes so much time because of the size of the CSV file.
So my question is:
Is it possible to read an online CSV line by line through a proxy?
Ideally, I wish to read the first row, check if it equals my pattern, and if yes, break; if not, read the second line, and so on.
Thanks for your help
You can pass stream=True to requests.get to avoid fetching the entire result immediately. In that case you can access a pseudo-file object through response.raw, and you can build your CSV reader based on that (alternatively, the response object has iter_content and iter_lines methods, but I don't know how easy it is to feed those to a CSV parser).
However, while the stdlib's csv module simply yields a sequence of lists or dicts and can therefore easily be lazy, pandas returns a dataframe, which is not lazy, so you need to pass some extra parameters to get a dataframe per chunk instead.
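A minimal sketch of the response.raw approach described above, assuming a UTF-8 CSV and reusing the url and pattern names from the question; untested against your server:

import csv
import io
import requests

response = requests.get(url, stream=True)
response.raw.decode_content = True  # undo any gzip/deflate transfer encoding
# wrap the raw byte stream as text so the csv module can read it lazily
lines = io.TextIOWrapper(response.raw, encoding="utf-8")
for row in csv.reader(lines):
    if row and row[0] == pattern:
        print(row)
        break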
The requests.get call will get you the whole file anyway. You'd need to implement your own HTTP code, down to the socket level, to be able to process the content as it comes in with a plain HTTP GET.
The only way to get partial results and slice the download is to add HTTP "Range" request headers, if the server providing the file supports them. (requests can let you set these headers.)
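For illustration, setting a Range header with requests could look like this, reusing the url from the question; the byte span is arbitrary, and it only helps if the server answers with 206 Partial Content:

import requests

# ask the server for only the first 100 KB of the file
resp = requests.get(url, headers={"Range": "bytes=0-99999"})
print(resp.status_code, len(resp.content))  # 206 means the server honoured the range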
Enter requests' advanced usage:
The good news is that requests can do that for you under the hood: you can set the stream=True parameter when calling requests, and it will even let you iterate the contents line by line. Check the documentation on that part.
Here is more or less what requests does under the hood so that you can get your contents line by line:
It will get reasonably sized chunks of your data, but certainly not request one line at a time (think ~80 bytes versus 100,000 bytes), because otherwise it'd need a new HTTP request for each line, and the overhead for each request is not trivial, even if made over the same TCP connection.
Anyway, since CSV is a text format, neither requests nor any other software could know the size of the lines, much less the exact size of the "next" line to be read, before setting the range headers accordingly.
So, for this to work, there has to be Python code to (see the sketch after this list):
- accept a request for a "new line" of the CSV; if there are buffered text lines, yield the next line,
- otherwise make an HTTP request for the next 100 KB or so,
- concatenate the downloaded data to the remainder of the last downloaded line,
- split the downloaded data at the last line-feed in the binary data, and save the remainder of the last line,
- convert the binary buffer to text (you'd have to take care of multi-byte character boundaries in a multi-byte encoding like utf-8, but cutting at newlines may save you that),
- yield the next text line.
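A rough sketch of the buffering logic in the list above, layered on requests' stream=True and iter_content rather than manual Range requests; the chunk size and function name are illustrative:

def iter_text_lines(response, chunk_size=100_000, encoding="utf-8"):
    """Yield decoded lines from a streaming response, keeping partial lines buffered."""
    remainder = b""
    for chunk in response.iter_content(chunk_size=chunk_size):
        buf = remainder + chunk
        # everything after the last newline is an incomplete line; keep it for later
        head, sep, remainder = buf.rpartition(b"\n")
        if sep:
            for line in head.split(b"\n"):
                yield line.decode(encoding)
    if remainder:
        yield remainder.decode(encoding)  # last line without a trailing newline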
According to Masklinn's answer, my code looks like this now:
import requests
from requests_ntlm import HttpNtlmAuth

cafile = 'mycert.crt'
proxies = {"http":"http://ipproxy:port", "https":"http://ipproxy:port"}
auth = HttpNtlmAuth('Username','Password')
url = 'http://myurl/ressources.csv'
pattern = 'mypattern'

r = requests.get(url, stream=True, proxies=proxies, verify=cafile)
if r.encoding is None:
    r.encoding = 'ISO-8859-1'

for line in r.iter_lines(decode_unicode=True):
    if line.split(';')[0] == pattern:
        print(line)
        break
I would like to read a fixed number of bytes from the stdin of a Python script and output it to a temporary file batch by batch for further processing. Therefore, when the first N bytes have been written to the temp file, I want it to execute the subsequent processing and then read the next N bytes from stdin. I am not sure what to iterate over in the top loop around the while True. This is an example of what I tried.
import sys
While True:
    data = sys.stdin.read(2330049) # Number of bytes I would like to read in one iteration
    if data == "":
        break
    file1 = open('temp.fil','wb') #temp file
    file1.write(data)
    file1.close()
    further_processing on temp.fil (I think this can only be done after file1 is closed)
Two quick suggestions:
- You should pretty much never do While True
- Use Python 3
Are you trying to read from a file? or from actual standard in? (Like the output of a script piped to this?)
Here is an answer I think will work for you, if you are reading from a file, that I pieced together from some other answers listed at the bottom:
with open("in-file", "rb") as in_file, open("out-file", "wb") as out_file:
    data = in_file.read(2330049)        # read the first batch of bytes
    while data != b"":                  # keep going until end of file
        out_file.write(data)
        data = in_file.read(2330049)    # read the next batch
If you want to read from actual standard in, I would read all of it in, then split it up by bytes. The only way this won't work is if you are trying to deal with constant streaming data...which I would most definitely not use standard in for.
The .encode('UTF-8') and .decode('hex') methods might be of use to you also.
Sources: https://stackoverflow.com/a/1035360/957648 & Python, how to read bytes from file and save it?
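If the data really does come from standard input, a minimal Python 3 sketch of the batch loop from the question could look like this (batch size and temp-file name taken from the question; the walrus operator needs Python 3.8+):

import sys

BATCH_SIZE = 2330049  # bytes per batch, as in the question

while (data := sys.stdin.buffer.read(BATCH_SIZE)):  # binary-safe read from stdin
    with open("temp.fil", "wb") as batch_file:
        batch_file.write(data)
    # further processing of temp.fil goes here, after the file has been closed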
I've been supplied with a zipped file containing multiple individual streams of compressed XML. The compressed file is 833 MB.
If I try to decompress it as a single object, I only get the first stream (about 19 KB).
I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:
import zlib

outfile = open('output.xml', 'w')

def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print "got it"
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            dat = zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1

zipstreams('payload')
outfile.close()
This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to work!
Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833 MB (an estimated 3 GB raw) suggests that I'm doing something very wrong.
Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write---repeat bottleneck that I'm stuck with?
Thanks for any pointers or suggestions you have!
It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic, which is not a good thing when you've got tens of thousands of them. So let's see what we know:
You say that the vendor states that they are
using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.
From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: https://www.rfc-editor.org/rfc/rfc1950
So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).
Python's zlib documentation provides a Decompress class that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
So, this is what you can do: Write a loop that reads through the data, say, one block at a time (no need to even read the entire 800 MB file into memory). Push each block to the Decompress object, and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).
Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.
import zlib
# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF+FLG
def findstart(header, buf, source):
    """Find `header` in str `buf`, reading more from `source` if necessary"""
    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return ''
        buf += more
    offset = buf.find(header)
    return buf[offset:]
You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:
source = open(datafile, 'rb')
skip_ = source.read(32)  # Skip non-zlib header

buf = ''
while True:
    decomp = zlib.decompressobj()

    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print "Spurious match(?) at output offset %d." % outfile.tell(),
        print "Skipping 2 bytes"
        buf = buf[2:]
        continue

    # Read until zlib decides it's seen a complete file
    while decomp.unused_data == '':
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break  # We've reached EOF

    outfile.write(stream)
    buf = decomp.unused_data  # Save for the next stream
    if len(block) == 0:
        break  # EOF

outfile.close()
PS 1. If I were you I'd write each XML stream into a separate file.
PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.
Decompressing 833 MB should take about 30 seconds on a modern processor (e.g. a 2 GHz i7). So yes, you are doing something very wrong. Attempting to decompress at every byte offset to see if you get an error is part of the problem, though not all of it. There are better ways to find the compressed data. Ideally you should find or figure out the format. Alternatively, you can search for valid zlib headers using the RFC 1950 specification, though you may get false positives.
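For example, a candidate two-byte zlib header can be screened with the RFC 1950 rules (compression method 8, and a header value divisible by 31); this only filters candidates and, as noted above, can still produce false positives:

def looks_like_zlib_header(cmf, flg):
    """Screen two bytes against the RFC 1950 zlib header rules."""
    # low nibble of CMF is the compression method (8 = deflate),
    # and CMF*256 + FLG must be a multiple of 31
    return (cmf & 0x0F) == 8 and ((cmf << 8) + flg) % 31 == 0

print(looks_like_zlib_header(0x78, 0x9C))  # True: 0x789C is a common zlib header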
More significant may be that you are reading the entire 833 MB into memory at once, and decompressing the 3 GB to memory, possibly in large pieces each time. How much memory does your machine have? You may be thrashing to virtual memory.
If the code you show works, then the data is not zipped. zip is a specific file format, normally with the .zip extension, that encapsulates raw deflate data within a structure of local and central directory information intended to reconstruct a directory in a file system. You must have something rather different, since your code is looking for and apparently finding zlib streams. What is the format you have? Where did you get it? How is it documented? Can you provide a dump of, say, the first 100 bytes?
The way this should be done is not to read the whole thing into memory and decompress entire streams at once, also into memory. Instead, make use of the zlib.decompressobj interface, which allows you to provide a piece at a time and get the resulting available decompressed data. You can read the input file in much smaller pieces, find the decompressed data streams by using the documented format or looking for zlib (RFC 1950) headers, and then run those a chunk at a time through the decompression object, writing out the decompressed data where you want it. decomp.unused_data can be used to detect the end of the compressed stream (as in the example you found).
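A minimal sketch of that incremental approach, assuming the streams sit back-to-back (it ignores the 32-byte headers discussed elsewhere in this thread, which you would need to skip or search past); file names and chunk size are illustrative:

import zlib

CHUNK = 1 << 16  # read the compressed input 64 KB at a time

with open('payload', 'rb') as src, open('output.xml', 'wb') as dst:
    decomp = zlib.decompressobj()
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(decomp.decompress(block))
        # unused_data becomes non-empty once the current stream has ended;
        # start a fresh decompressobj on the leftover bytes for the next stream
        while decomp.unused_data:
            leftover = decomp.unused_data
            decomp = zlib.decompressobj()
            dst.write(decomp.decompress(leftover))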
From what you've described in the comments, it sounds like they're concatenating together the individual files they would have sent you separately. Which means each one has a 32-byte header you need to skip.
If you don't skip those headers, it would probably have exactly the behavior you described: If you get lucky, you'll get 32 invalid-header errors and then successfully parse the next stream. If you get unlucky, the 32 bytes of garbage will look like the start of a real stream, and you'll waste a whole lot of time parsing some arbitrary number of bytes until you finally get a decoding error. (If you get really unlucky, it'll actually decode successfully, giving you a giant hunk of garbage and eating up one or more subsequent streams.)
So, try just skipping 32 bytes after each stream finishes.
Or, if you have a more reliable way of detecting the start of the next stream (this is why I told you to print out the offsets and look at the data in a hex editor, and why alexis told you to look at the zlib spec), do that instead.