how to convert string into byte array in this particular scenario - python

I am following a tutorial that was written for Python 2.x, but I am using Python 3.5. I have updated the code to make it compatible, but I am falling at the last hurdle.
Python 2.x switches between text and binary implicitly, whereas Python 3.x draws a clear line between the two.
I am trying to produce a CSV file for submission, but it requires that the data be in int32 format.
test_file = open('/Users/williamneal/Scratch/Titanic/test.csv', 'rt')
test_file_object = csv.reader(test_file)
header = next(test_file_object)  # call next() so the header row is actually consumed
This opens a file object before making it a reader object. I modified the original code ('wb' --> 'wt') to account for the fact that Python 3 returns strings by default.
prediction_file = open("/Users/williamneal/Scratch/Titanic/genderbasedmodel.csv", "wt")
prediction_file_object = csv.writer(prediction_file)
This opens a file object before making it a writer object. As before, I modified the mode.
prediction_file_object.writerow([bytearray(b"PassengerId"), bytearray(b"Survived")])
for row in test_file_object:
    if row[3] == 'female':
        prediction_file_object.writerow([row[0],int(1)])
    else:
        prediction_file_object.writerow([row[0],int(0)])
test_file.close()
prediction_file.close()
I changed the writers in the for-loop into integers but as I attempt to cast the string into binary I get the following error at the point of submission:
I am flummoxed, I do not see how I can submit my file which needs the headers "PassengerId" and "Survivors" and also be type int32.
Any suggestions?
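One possible reading of the requirement is that the submission is an ordinary text CSV whose Survived column contains integer values, in which case no bytearray is needed. The sketch below works under that assumption only. It uses in-memory io.StringIO objects in place of the real files, and the sample rows are made up for illustration. In Python 3 the csv module works with str, so the header is written as plain strings.

```python
import csv
import io

# Hypothetical stand-in for test.csv; the real file has more columns and rows.
test_csv = io.StringIO(
    "PassengerId,Pclass,Name,Sex\n"
    "892,3,Kelly,male\n"
    "893,3,Wilkes,female\n"
)
# Stand-in for open(path, "w", newline=""), the mode the csv docs recommend.
out = io.StringIO()

reader = csv.reader(test_csv)
header = next(reader)  # consume the header row so it is not treated as data

writer = csv.writer(out)
writer.writerow(["PassengerId", "Survived"])  # plain str, not bytearray
for row in reader:
    # row[3] is the Sex column in this made-up layout
    writer.writerow([row[0], 1 if row[3] == "female" else 0])

print(out.getvalue())
```

With real files, replacing the two StringIO objects with open(..., "r", newline="") and open(..., "w", newline="") should behave the same way.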

Related

Python ijson - parse error: trailing garbage // bz2.decompress()

I have come across an error while parsing json with ijson.
Background:
I have a series (approx. 1000) of large files of Twitter data compressed in '.bz2' format. I need to get elements from the files into a pd.DataFrame for further analysis. I have identified the keys I need. I am cautious about posting Twitter data.
Attempt:
I have managed to decompress the files using bz2.decompress with the following code:
## Code in loop specific for decompressing and parsing -
with open(file, 'rb') as source:
    # Decompress the file
    json_r = bz2.decompress(source.read())
    json_decom = json_r.decode('utf-8') # decompresses one file at a time rather than a stream
    # Parse the JSON with ijson
    parser = ijson.parse(json_decom)
    for prefix, event, value in parser:
        # Print selected items as part of testing
        if prefix=="created_at":
            print(value)
        if prefix=="text":
            print(value)
        if prefix=="user.id_str":
            print(value)
This gives the following error:
IncompleteJSONError: parse error: trailing garbage
estamp_ms":"1609466366680"} {"created_at":"Fri Jan 01 01:59
(right here) ------^
Two things:
Is my decompression method correct and giving the right type of file for ijson to parse (ijson takes both bytes and str)?
Is it a JSON error? If it is, is it possible to develop some kind of error handler to move on to the next file? If so, any suggestions would be appreciated.
Any assistance would be greatly appreciated.
Thank you, James
To directly answer your two questions:
The decompression method is correct in the sense that it yields JSON data that you then feed to ijson. As you point out, ijson works both with str and bytes inputs (although the latter is preferred); if you were giving ijson some non-JSON input you wouldn't see an error showing JSON data in it.
This is a very common error that is described in ijson's FAQ. It basically means your JSON document has more than one top-level value, which is not standard JSON, but is supported by ijson by using the multiple_values option (see docs for details).
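To make "more than one top-level value" concrete without depending on ijson, here is a small sketch that uses only the standard library json module. It is not how ijson works internally; it only reproduces the same situation. The helper name iter_json_values and the sample string are made up for illustration. A strict parser rejects the input, while decoding one value at a time succeeds.

```python
import json

def iter_json_values(text):
    """Yield each top-level JSON value found in `text`, in order."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value

# Two objects back to back, like the data in the error message above
sample = '{"created_at": "Fri Jan 01", "text": "a"} {"created_at": "Sat Jan 02", "text": "b"}'

try:
    json.loads(sample)
    strict_error = None
except json.JSONDecodeError as exc:
    strict_error = exc.msg  # the strict parser stops at the second object

tweets = list(iter_json_values(sample))
print(strict_error, len(tweets))
```

With ijson itself, the multiple_values option mentioned above plays the role of this loop.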
About the code as a whole: while it's working correctly, it could be improved on: the whole point of using ijson is that you can avoid loading the full JSON contents in memory. The code you posted doesn't use this to its advantage though: it first opens the bz-compressed file, reads it as a whole, decompresses that as a whole, (unnecessarily) decodes that as a whole, and then gives the decoded data as input to ijson. If your input file is small, and the decompressed data is also small you won't see any impact, but if your files are big then you'll definitely start noticing it.
A better approach is to stream the data through all the operations so that everything happens incrementally: decompression, no decoding and JSON parsing. Something along the lines of:
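with bz2.BZ2File(filename, mode='r') as f:
    # multiple_values=True because the file holds several top-level values
    for prefix, event, value in ijson.parse(f, multiple_values=True):
        ...  # handle each event here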
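The streaming idea can be tried out with the standard library alone, under one loud assumption: that each line of the decompressed file holds exactly one JSON object. That may or may not match the real Twitter files. The file name, the sample records, and the helper iter_records are all hypothetical. The point is only that bz2.open decompresses lazily as the file is iterated, so the whole file is never held in memory.

```python
import bz2
import json
import os
import tempfile

# Build a tiny .bz2 file so the sketch is runnable (made-up data).
path = os.path.join(tempfile.mkdtemp(), "sample.json.bz2")
with bz2.open(path, "wt", encoding="utf-8") as out:
    out.write('{"created_at": "Fri Jan 01", "text": "first"}\n')
    out.write('{"created_at": "Sat Jan 02", "text": "second"}\n')

def iter_records(filename):
    """Yield one decoded JSON object per non-empty line, decompressing lazily."""
    with bz2.open(filename, "rt", encoding="utf-8") as f:
        for line in f:  # lines are decompressed incrementally
            if line.strip():
                yield json.loads(line)

texts = [record["text"] for record in iter_records(path)]
print(texts)
```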
As the cherry on the cake, if you want to build a DataFrame from that you can use DataFrame's data argument to build the DataFrame directly with the results from the above. data can be an iterable, so you can, for example, make the code above a generator and use it as data. Again, something along the lines of:
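def json_input():
    with bz2.BZ2File(filename, mode='r') as f:
        # multiple_values=True because the file holds several top-level values
        for prefix, event, value in ijson.parse(f, multiple_values=True):
            # yield your results; here, as a placeholder, the raw events
            yield prefix, event, value
df = pandas.DataFrame(data=json_input())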

Python: pickle: No code suggestion after extracting string object from pickle file

for example, this is my code:
import pickle
#extract the object from "lastringa.pickle" and save it
extracted = ""
with open("lastringa.pickle","rb") as f:
    extracted = pickle.load(f)
Where "lastringa.pickle" contains a string object with some text.
So if I type extracted. before the opening of the file, I'm able to get the code suggestion as shown in the picture:
But then, after this operation extracted = pickle.load(f), if I type extracted. I don't get code suggestion anymore.
Can somebody explain why that is and how to solve it?
Pickle reads and writes objects as binary files. You can confirm this from the open('lastringa.pickle', 'rb') call, where the rb mode means read binary.
Your IDE cannot know the type of the object that pickle.load() will return, so it has no basis for suggesting string methods (e.g. .split(), .upper()).
In the first screenshot, on the other hand, your IDE knows that extracted is a string (it was just assigned "") and so it knows what to suggest.
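One way to get suggestions back is to tell the editor which type to expect. The sketch below assumes an IDE that honors variable annotations (many do, but this is not guaranteed for every editor). The temporary file path and its contents are made up so the example can run on its own.

```python
import os
import pickle
import tempfile

# Create a small pickle file to load from (stand-in for lastringa.pickle)
path = os.path.join(tempfile.mkdtemp(), "lastringa.pickle")
with open(path, "wb") as f:
    pickle.dump("some text", f)

with open(path, "rb") as f:
    # The annotation is a hint for the editor and type checkers only;
    # Python does not verify it at run time.
    extracted: str = pickle.load(f)

# An explicit run-time check, in case the file holds something else
if not isinstance(extracted, str):
    raise TypeError("expected a str in the pickle file")

shouted = extracted.upper()
print(shouted)
```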

Why doesn't pickling something in python save the exact thing I save?

I have a problem that when I pickle something it doesn't save the exact text to what I saved. The thing I wanted to save was a list, but in the code it adds stuff to the list (I don't know if that's what affects it).
import pickle
inventoryFile = 'inventory.file'
with open(inventoryFile, "wb") as fi:
    pickle.dump(inventory, fi)
with open(inventoryFile, "rb") as fi:
    inventory = pickle.load(fi)
inventory was a list that I kept adding to. When I looked into the inventory.file file, all it said was
�]q�.
I don't know what it means. Also I'm a beginner at Python, so I'm not too good.
Pickling something doesn't create a plain-text representation of an object; it creates a file that can be opened to produce the same object in Python. This representation isn't designed to be human-readable (hence the need to "unpickle" it with pickle.load()). So you can't open the pickled file with a text editor and expect to see the list.
If pickle.load(fi) is producing the same list that you saved, then the pickling is going fine. If you want to create a human-readable file instead, try:
converting the list to a string (for example ', '.join(seq)) and saving that
another module for data storage (like json or csv or pandas)
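As a sketch of the second option, the json module writes a list as text that can be read in any editor and loaded back later. The inventory contents and the file name are made up for illustration, and this only works for data JSON can represent (lists, dicts, strings, numbers, booleans, None).

```python
import json
import os
import tempfile

inventory = ["sword", "shield", "potion"]  # hypothetical contents
path = os.path.join(tempfile.mkdtemp(), "inventory.json")

# Save as human-readable text
with open(path, "w", encoding="utf-8") as fi:
    json.dump(inventory, fi, indent=2)

# This is what a text editor would show
with open(path, encoding="utf-8") as fi:
    raw_text = fi.read()
print(raw_text)

# Load it back into a Python list
restored = json.loads(raw_text)
```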
Pickling saves a representation of the object to the file. That representation is not human-readable; it is one that can be reloaded into memory by unpickling it. We can see this if, after pickling the inventory, we set it to None and then load it from the file with pickle: we get exactly the same list back.
import pickle
inventoryFile = 'inventory.file'
inventory = [1,2,3]
print(inventory)
with open(inventoryFile, "wb") as fi:
    pickle.dump(inventory, fi)
inventory = None
print(inventory)
with open(inventoryFile, "rb") as fi:
    inventory = pickle.load(fi)
print(inventory)
OUTPUT
[1, 2, 3]
None
[1, 2, 3]
If we look in the file from just a plain text editor we see
�]q (KKKe.
but this is not designed to be human readable.

The python command-line file handling doesn't work? am i working correctly?

I am new to Python and have just started on file handling.
I tried the solutions I found for my problem but they failed, so I am posting my question. Before marking this as a duplicate, please consider my question.
I tried to create a file, and it worked.
Writing to the file also worked.
But when I try to read the text or values in the file, it returns empty.
I use the command-line terminal to work with Python, running on Ubuntu.
The code I tried is given below. The file is created in the desired location and the written text is also present.
f0=open("filehandling.txt","wb")
f0.write("my second attempt")
s=f0.read(10);
print s
I also tried with wb+, r+. But it just returns as empty
edit 1:
I have attached the coding below. I entered one by one in command line
fo = open("samp.txt", "wb")
fo.write( "Text is here\n");
fo.close()
fo = open("samp.txt", "r+")
str = fo.read(10);
print "Read String is : ", str
fo.close()
First of all, if you open a file with the wb flag, it is opened for writing only. If you want to both read and write, you need the wb+ flag. If you don't want the file to be truncated each time you open it, you need rb+.
Now files are streams with pointers pointing at a certain location inside the file. If you write
f0.write("my second attempt")
then the pointer ends up at [pointer before writing] (in your case the beginning of the file, i.e. 0) plus [length of written bytes] (in your case 17, which is the end of the file). In order to read the whole file you have to move that pointer back to the beginning and then read:
f0.seek(0)
data = f0.read()
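The pointer behaviour can be watched directly with tell(). This is a Python 3 sketch (the question uses Python 2 syntax), so the data is written as bytes. The temporary path is made up so the example does not touch real files.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "filehandling.txt")

with open(path, "wb+") as f0:          # wb+ allows both writing and reading
    f0.write(b"my second attempt")
    position_after_write = f0.tell()   # pointer now sits at the end of the file
    read_at_end = f0.read(10)          # nothing left to read from here
    f0.seek(0)                         # move the pointer back to the start
    first_ten = f0.read(10)

print(position_after_write, read_at_end, first_ten)
```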

Open a file for RawIOBase python

I need to read in the binary contents of a file for a function, and from this link https://docs.python.org/2/library/io.html, it looks like I should be using a RawIOBase object to read it in. But I can't find anywhere how to open a file to use with RawIOBase. Right now I have tried this to read the binary into a string
with (open(documentFileName+".bin", "rb")) as binFile:
    document = binFile.RawIOBase.read()
    print document
but that throws the error AttributeError: 'file' object has no attribute 'RawIOBase'
So with no open attribute in RawIOBase, how do I open the file for it to read from?
Don't delve into the implementation details of the io thicket unless you need to code your own peculiar file-oid-like types! In your case,
with open(documentFileName+".bin", "rb") as binFile:
    document = binFile.read()
will be perfectly fine!
Note in passing that I've killed the superfluous parentheses you were using -- "no unneeded pixels!!!" -- but, while important!, that's a side issue to your goal here.
Now, assuming Python 2, document is a str -- an immutable array of bytes. It may be confusing that displaying document shows it as a string of characters, but that's just Py2's confusion between text and byte strings (in Py3, the returned type would be bytes).
If you prefer to work with (e.g) a mutable array of ints, use e.g
theints = map(ord, document)
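or, for a compact typed array of bytes that displays numerically (note that array.array is itself mutable, and that 'b' means signed bytes, so values above 127 show as negative; 'B' gives 0-255),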
import array
thearray = array.array('b', document)
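For comparison, here is a rough Python 3 version of the same ideas, where read() on a file opened in "rb" mode returns bytes. The file name and its four sample bytes are made up so the sketch is runnable on its own.

```python
import array
import os
import tempfile

# Write a few sample bytes to a temporary .bin file (made-up data)
path = os.path.join(tempfile.mkdtemp(), "document.bin")
with open(path, "wb") as f:
    f.write(bytes([0, 65, 200, 255]))

with open(path, "rb") as bin_file:
    document = bin_file.read()         # bytes: an immutable sequence of ints

the_ints = list(document)              # mutable list of ints in range 0-255
mutable = bytearray(document)          # mutable byte buffer
mutable[0] = 1                         # item assignment works on bytearray

unsigned = array.array("B", document)  # typecode 'B': unsigned, 0..255
signed = array.array("b", document)    # typecode 'b': signed, -128..127

print(the_ints, list(unsigned), list(signed))
```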
