I have a database with some of the data is binary (blob datatype in MySQL), which was actually webpages scrapped and gzipped. Now I want to extract them and write each record into a gzip file, which I'd assume to be doable - after all they are gzipped-data right?
The question is, however, how would I do that? By searching I could find a million of examples on how to write gzip file from original data, not gzipped one. Writing the gzipped string directly into a file doesn't result in a gzip file, not to mention I got a load of "ordinal not in range" exceptions.
Could you guys help? Thanks in advance. I'm a newbie to Python...
Edit: Here is the method I used:
def store_cache(self, content, news_id):
if not content:
return
# some of the records may contain normal data (not gzipp-ed), hence this try block
try:
content = self.gunzip(content)
except:
return
import gzip
with gzip.open('static/cache/%s' % (self.base36encode(news_id), ), 'wb') as f:
f.write(content)
f.close()
This causes an exception:
<type 'exceptions.UnicodeEncodeError'> at /migrate
'ascii' codec can't encode character u'\u1edb' in position 186: ordinal not in range(128)
And this is the innermost traceback:
E:\Python27\lib\gzip.py in write
self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
You said it yourself: extract them and then write them into a gzip file. There is nothing special about writing "from gzipped data": you un-gzip the data to get the original data, and then write the original data as if it were original data (because it is). The documentation shows you how to do these things.
However, gzip is just a compression format, not an archive format. It is not built to handle multiple files, so you must use something else to create a single file from the multiple inputs. Typically this is done by making a tar archive which is then gzipped. You can do this in Python using the tarfile module. Since your data will come from gzip-decompression streams, you will want to use the TarFile.addfile(tarinfo, fileobj) method to add them to the archive. You should be able to use the gzip.GzipFile instance as the fileobj to add this way.
Related
I have come across an error while parsing json with ijson.
Background:
I have a series(approx - 1000) of large files of twitter data that are compressed in a '.bz2' format. I need to get elements from the file into a pd.DataFrame for further analysis. I have identified the keys I need to get. I am cautious putting twitter data up.
Attempt:
I have managed to decompress the files using bz2.decompress with the following code:
## Code in loop specific for decompressing and parsing -
with open(file, 'rb') as source:
# Decompress the file
json_r = bz2.decompress(source.read())
json_decom = json_r.decode('utf-8') # decompresses one file at a time rather than a stream
# Parse the JSON with ijson
parser = ijson.parse(json_decom)
for prefix, event, value in parser:
# Print selected items as part of testing
if prefix=="created_at":
print(value)
if prefix=="text":
print(value)
if prefix=="user.id_str":
print(value)
This gives the following error:
IncompleteJSONError: parse error: trailing garbage
estamp_ms":"1609466366680"} {"created_at":"Fri Jan 01 01:59
(right here) ------^
Two things:
Is my decompression method correct and giving the right type of file for ijson to parse (ijson takes both bytes and str)?
Is is a JSON error? // If it is a JSON error is it possible to develop some kind of error handler to move to the next file - if so any suggestion would be appreciated?
Any assistance would be greatly appreciated.
Thank you, James
To directly answer your two questions:
The decompression method is correct in the sense that it yields JSON data that you then feed to ijson. As you point out, ijson works both with str and bytes inputs (although the latter is preferred); if you were giving ijson some non-JSON input you wouldn't see an error showing JSON data in it.
This is a very common error that is described in ijson's FAQ. It basically means your JSON document has more than one top-level value, which is not standard JSON, but is supported by ijson by using the multiple_values option (see docs for details).
About the code as a whole: while it's working correctly, it could be improved on: the whole point of using ijson is that you can avoid loading the full JSON contents in memory. The code you posted doesn't use this to its advantage though: it first opens the bz-compressed file, reads it as a whole, decompresses that as a whole, (unnecessarily) decodes that as a whole, and then gives the decoded data as input to ijson. If your input file is small, and the decompressed data is also small you won't see any impact, but if your files are big then you'll definitely start noticing it.
A better approach is to stream the data through all the operations so that everything happens incrementally: decompression, no decoding and JSON parsing. Something along the lines of:
with bz2.BZ2File(filename, mode='r') as f:
for prefix, event, value in ijson.parse(f):
# ...
As the cherry on the cake, if you want to build a DataFrame from that you can use DataFrame's data argument to build the DataFrame directly with the results from the above. data can be an iterable, so you can, for example, make the code above a generator and use it as data. Again, something along the lines of:
def json_input():
with bz2.BZ2File(filename, mode='r') as f:
for prefix, event, value in ijson.parse(f):
# yield your results
df = pandas.DataFrame(data=json_input())
I'm attempting to use python's gzip library to streamline some python scripts that create csv output files. I've tried a number of different methods of creating the gzip file, but no matter which method I've tried, I'm running into the same issue.
My python script runs successfully, but when I try to decompress the gzip file in Finder (using MacOS 10.15.6), I'm prompted with the following error:
Unable to expand "file.csv.gz" into "Documents". (Error 79 - Inappropriate file type or format.)
After some debugging, I've narrowed down the cause of the error to the file content containing line break (\n) characters.
This simple example code triggers the above error on gzip expansion:
import gzip
content = b'Id,Food\n1,Spam\n2,Eggs\n'
f = gzip.open('file.csv.gz', 'wb')
f.write(content)
f.close()
When I remove all \n characters from the content variable, everything works fine:
import gzip
content = b'Id,Food,1,Spam,2,Eggs'
f = gzip.open('file.csv.gz', 'wb')
f.write(content)
f.close()
Does gzip want me to use a different line break mechanism? I'm sure I'm missing some sort of foundational knowledge about gzip or binaries, so any info that helps get me back on track would be much appreciated.
It has nothing to do with Python's gzip. It is, arguably, a bug in macOS where it sometimes detects the resulting uncompressed data as an mtree by the Archive Utility, but then finds the uncompressed data violates the mtree format.
The solution is to not double-click to decompress. Use gzip to decompress.
I'm currently pulling from an FTP server to run a parser on a group of files when parsing the file from downloading in-browser and feeding the file into the parser, I get no errors since it's a plain TXT file.
But, when I pull the data from the server through Python's ftplib, I get differently formatted data. An example is the extra \x00 and 'b' characters inside my data. It's important that the data has integrity since I'm working with a COBOL database structure which can't be manipulated in the slightest else it will ruin the entire store.
I am successfully downloading using the following functions:
data = []
def handle_binary(more_data):
data.append(str(more_data))
def get_file(filename):
resp = ftp.retrbinary("RETR {0}".format(filename), callback=handle_binary)
file = "".join(data)
# returning the data as well as saving it to inspect
save = open("save.txt", "w+")
save.write(file)
save.close()
return file
I tried to change my store from:
data.append(str(more_data))
To
data.append(more_data)
As well as changing my join function to be b"" to indicate a byte join, but I got errors following that.
An example of a long string of data that wasn't in the original download:
\x00\x00\x00'b'\x00\n
Edit: Upon comparison, it seems that the FTP-downloaded data has no newlines (which makes sense since there's a missed newline in the above) as the downloaded copy does.
Thanks for any help regarding this question.
I have a csv file compressed into a bz2 file that I'm trying to load from a website, decompress, and write to a local csv file by
# Get zip file from website
archive = StringIO()
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# Extract the training data
data = bz2.decompress(archive.read())
# Write to csv
output_file = open('dataset_' + mode + '.csv', 'w')
output_file.write(data)
On the decompress call, I get IOError: invalid data stream. As a note, the csv file contained in the archive has quite a few characters that could be causing some issues. Particularly, if I try putting the file contents in unicode, I get an error about not being able to decode 0xfd. I only have the single file within the archive, but I'm wondering if something could also be going on due to not extracting a specific file.
Any ideas?
I suspect you are getting this error because the stream you are feeding the decompress() function is not a valid bz2 stream.
You must also "rewind" your StringIO buffer after writing to it. See the notes below in comments. The following code (same as yours with the exception of imports, and the seek() fix) works if the URL points to a valid bz2 file.
from StringIO import StringIO
import urllib2
import bz2
# Get zip file from website
url = "http://www.7-zip.org/a/7z920.tar.bz2" # just an example bz2 file
archive = StringIO()
# in case the request fails (e.g. 404, 500), this will raise
# a `urllib2.HTTPError`
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# will print how much compressed data you have buffered.
print "Length of file:", archive.tell()
# important!... make sure to reset the file descriptor read position
# to the start of the file.
archive.seek(0)
# Extract the training data
data = bz2.decompress(archive.read())
# Write to csv
output_file = open('output_file', 'w')
output_file.write(data)
re: encoding issues
Generally, character encoding errors will generate UnicodeError (or one of its cousins), but not IOError. IOError suggests something is wrong with the input, like truncation, or some error that would prevent the decompressor to do its work completely.
You have omitted the imports from your question, and one of the subtle differences between the StringIO and cStringIO (according to the docs ) is that cStringIO cannot work with unicode strings that cannot be converted to ascii. That no longer seems to hold (in my tests at least), but it may be at play.
Unlike the StringIO module, this module (cStringIO) is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
I just tried to update a program i wrote and i needed to add another pickle file. So i created the blank .pkl and then use this command to open it(just as i did with all my others):
with open('tryagain.pkl', 'r') as input:
self.open_multi_clock = pickle.load(input)
only this time around i keep getting this really weird error for no obvious reason,
cPickle.UnpicklingError: invalid load key, 'Γ'.
The pickle file does contain the necessary information to be loaded, it is an exact match to other blank .pkl's that i have and they load fine. I don't know what that last key is in the error but i suspect that could give me some incite if i know what it means.
So have have figured out the solution to this problem, and i thought I'd take the time to list some examples of what to do and what not to do when using pickle files. Firstly, the solution to this was to simply just make a plain old .txt file and dump the pickle data to it.
If you are under the impression that you have to actually make a new file and save it with a .pkl ending you would be wrong. I was creating my .pkl's with notepad++ and saving them as .pkl's. Now from my experience this does work sometimes and sometimes it doesn't, if your semi-new to programming this may cause a fair amount of confusion as it did for me. All that being said, i recommend just using plain old .txt files. It's the information stored inside the file not necessarily the extension that is important here.
#Notice file hasn't been pickled.
#What not to do. No need to name the file .pkl yourself.
with open('tryagain.pkl', 'r') as input:
self.open_multi_clock = pickle.load(input)
The proper way:
#Pickle your new file
with open(filename, 'wb') as output:
pickle.dump(obj, output, -1)
#Now open with the original .txt ext. DONT RENAME.
with open('tryagain.txt', 'r') as input:
self.open_multi_clock = pickle.load(input)
Gonna guess the pickled data is throwing off portability by the outputted characters. I'd suggest base64 encoding the pickled data before writing it to file. What what I ran:
import base64
import pickle
value_p = pickle.dumps("abdfg")
value_p_b64 = base64.b64encode(value_p)
f = file("output.pkl", "w+")
f.write(value_p_b64)
f.close()
for line in open("output.pkl", 'r'):
readable += pickle.loads(base64.b64decode(line))
>>> readable
'abdfg'