I'm using automatic data acquisition software that exports the data as .txt files. I then import the file into Python (using the pandas package and turning the columns into arrays), but I'm facing a problem: the acquisition software exported the numbers in a format like 7,025985E-36, with a comma as the decimal separator, so Python can't "read" the data and treats each entry of the array as a string instead of a number.
Is there any way I can "teach" Python to read my data, or to automatically rewrite the entries in the array so they're read as numbers?
You can simply replace the comma in the string with a dot and use float() to parse the result.
# Swap the decimal comma for a dot, then parse the string as a float
number = float('7,025985E-36'.replace(',', '.'))
print(number)
print(type(number))
The above code would print:
7.025985e-36
<class 'float'>
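Since your data lives in pandas columns, you can apply the same replace-then-parse idea to a whole column at once instead of looping over entries. A minimal sketch; the DataFrame and the column name 'value' here are made up:
import pandas as pd

df = pd.DataFrame({'value': ['7,025985E-36', '1,5E+02']})
# Vectorized: swap the decimal comma for a dot, then cast to float
df['value'] = df['value'].str.replace(',', '.').astype(float)
print(df['value'].dtype)  # float64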
You can try it this way:
>>> value = '7,025985e-36'
>>> value2 = value.replace(',', '.')
>>> float(value2)
7.025985e-36
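If you're loading the .txt file with pandas anyway, the conversion can also happen at read time: read_csv has a decimal parameter for exactly this. A sketch, where the file name and the tab separator are assumptions about your export format:
import pandas as pd

# decimal=',' tells the parser to treat the comma as the decimal separator;
# 'data.txt' and sep='\t' are placeholders for your actual file and delimiter
df = pd.read_csv('data.txt', sep='\t', decimal=',')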
I am working through a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json, which has one JSON record per line. I have made a smaller version of the JSON file, with only 100 records, for testing.
I can read the entire test file into a pandas dataframe and examine it.
The complete dataset file, however, has about 6 million lines. The recommendation is to use chunksize and build a json reader. I'm hitting errors, even with my test input.
My code currently looks like this:
path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=10)
type(review_reader)
The type call returns
pandas.io.json.json.JsonReader
which looks good.
Then I try
for chunk in review_reader:
    print(chunk)
as referenced in the pandas user guide,
and I get an error:
ValueError: Unexpected character found when decoding 'false'
Update - it has been suggested that the issue is caused by embedded (quoted) "\n" characters in the data file; that pandas is seeing the JSON records not as one per line, but as spanning multiple lines.
The error message is VERY opaque, if that's the case. Also, with 6 million lines, how should I tell pd.read_json to ignore quoted "\n" sequences and only look at actual newlines in the data?
Update
It's been suggested that I fix my typo (it was a typo in this post, not a typo in my code) and use a plain Unix file path instead of a URL (the JSON reader doesn't care: see the docs).
When I do this but keep StringIO(), I get a different ValueError.
When I do this but remove StringIO(), the code works.
This seems to be very fragile. :-(
Note: the tutorial has an answer key. I've tried that code. The answer key uses
review_reader = pd.read_json(filename, lines=True, chunksize=10)
which throws the TypeError
sequence item 0: expected str instance, bytes found
Adding StringIO() seems to have solved that.
Input: a sample JSON record; there is one record per line of the input file.
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
Firstly, your
path 'file://localhost/Users/.../DSC_Intro/'
is not valid python code. If you try to execute that as such, you will get an invalid syntax error. I assume, then, that this is just showing the value of the path variable. I don't know whether those ellipses are literal or the result of your environment truncating the display of path. I'll assume here that your path is a valid file URL for your system as it doesn't seem germane here to consider an incorrect path.
Either way, yes, read_json can read JSON from a file URL specified like that (I learned something new there) if you read it in one go:
pd.read_json(fname, lines=True)
But if you try to create a reader from this, by specifying
pd.read_json(fname, lines=True, chunksize=...)
then you get
TypeError: sequence item 0: expected str instance, bytes found
Secondly, yes, wrapping your file-like argument with StringIO makes this error go away, but it isn't helping for the reason you might think, and its use here is based on a misreading of the pandas docs you point to.
I'll quote a couple of bits from the read_json doc here:
Signature: pd.read_json(
path_or_buf=None, ...
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
So with read_json, you can either give it an actual string that is valid JSON, or you can give it a file-like object that points to a file that contains JSON.
Notice in the pandas docs that you cite:
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:
is JSON, not a path. When their example then does:
df = pd.read_json(jsonl, lines=True)
it is merely parsing the JSON in the string - no files are involved here.
When it then wants to demonstrate reading from a file in chunks, it does
# reader is an iterator that returns `chunksize` lines each iteration
In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
In other words, they are wrapping a JSON string, not a path, in StringIO(). This is just for the purposes of the documented example, so you can see that if you treated the JSON string as if it were being read from a file, you could read it in chunks. That's what StringIO() does. So when you wrap the string that describes your file URL in StringIO(), read_json tries to interpret that string as JSON being read from a file, and to parse it. It understandably falls over, because the string isn't JSON.
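That presumably also explains the opaque mention of 'false' in your ValueError: your path begins with the letter f, so the JSON parser apparently assumed it was looking at the literal false and gave up at the next character. As a quick illustration, with a made-up path in place of yours:
from io import StringIO
import pandas as pd

# StringIO makes the *path string itself* the "file contents", so
# read_json tries to parse 'file://...' as JSON and raises ValueError
pd.read_json(StringIO('file://localhost/tmp/review_100.json'), lines=True)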
This brings us back to why read_json cannot read your file URL in chunks. I don't have an immediate good answer to that. I suspect it lies in the internals of how read_json opens file URLs, or in what function underlies this. If you were intent upon, or forced into, doing this chunking from a file URL, then I suspect you'd be looking at controlling the mode in which the file is opened, or perhaps somehow giving read_json explicit guidance on how to interpret the bytestream it gets. Libraries such as urllib2 may be useful here, I'm not sure.
But let's cut to the best fix here. Why are we trying to specify the path as a file URL? Simply specify your path as an OS path, e.g.
path = '/path/to/my/data/'
and then
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(filename, lines=True, chunksize=10)
And I betcha it works as intended! (It does for me, as it always has).
Caveat: Windows doesn't use forward-slash path delimiters, and constructing paths by concatenating strings as above can be fragile, but usually, if you use 'proper' forward-slash delimiters (smile), decent languages understand that internally. It's constructing paths using backslashes that is guaranteed to cause you pain. Just keep an eye on that.
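If you'd rather sidestep string concatenation entirely, pathlib builds paths portably across operating systems. A sketch, with a made-up directory in place of your real one:
from pathlib import Path
import pandas as pd

path = Path('/Users/me/DSC_Intro')  # hypothetical location
filename = path / 'yelp_dataset' / 'review_100.json'
# read_json accepts path-like objects, so this works directly
review_reader = pd.read_json(filename, lines=True, chunksize=10)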
I am very new to Python and I am trying to read in a file that partially contains binary data. There is a header with some information about the data, and after the header the binary data follow. If one opens the file in a text editor it looks like this:
>>> Begin of header <<<
value1: 5
value2: 7
...
value65: 9
>>> End of header <<<
���ÄI›C¿���†¨¨v#���ÄW]c¿��� U⁄z#���#¬P\¿����∂:q#���#Ò˚U¿���†÷Us#���`ªw4¿��� :‘m#���#À›9#���ÄAs#���¿‹ ¿����ır#���¿#&%#���†„bq#����*˙-#��� [q#����ÚN8#����
Òo#���#√·T#���†‰zm#����9\#����ÃÜq#����€dZ#���`Ëäs#���†∏8I#���¿¬Ot#���†�6
An additional problem is that I did not create the file myself and do not know whether those are double or float data.
So how can I interpret those data?
So first, thanks to all for the help. Basically the problem is the header. I can read in the data quite well when I remove the header from the file. This can be done with
x = numpy.fromfile(f, dtype=numpy.complex128, count=-1)
quite easily. The problem is that I cannot find any option for the fromfile function that skips lines (one can skip bytes, but the header size may differ from file to file).
In this great thread I found out how to convert a binary string to a numpy array:
convert binary string to numpy array
With this I could overcome the problem by reading in the data file line by line and then merging every line after the end-of-header line into one string.
This string was then converted into a nice array, exactly as I wanted.
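For reference, a minimal sketch of that approach, assuming the end-of-header marker shown in the question and the complex128 dtype from the fromfile call above ('data.bin' is a placeholder for the real file name):
import numpy

with open('data.bin', 'rb') as f:
    # consume header lines until the end-of-header marker
    line = f.readline()
    while line and line.strip() != b'>>> End of header <<<':
        line = f.readline()
    binary = f.read()  # everything after the header

x = numpy.frombuffer(binary, dtype=numpy.complex128)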
I'm kinda new to Python. I need to read some text from one file (A), compare it with the text from another file (B), change a part of the earlier-mentioned file, and write the result to a third one (C). The problem is that the A and B files have an unusual notation that involves this symbol: "¶".
So, I managed to bypass it (ignore it) by reading (or writing) in the following way:
import codecs
input = codecs.open('bla.txt', 'r', 'ascii', 'ignore')
But that's no good. I NEED to read it precisely, compare it, and write it out successfully.
So, content of my B file is: "Sugar=[Sugar#Butter¶Cherry]"
but when I read it, my variable has the value Sugar=[Sugar#ButterÂ¶Cherry].
You can see, there is an additional "Â".
Then my A file contains a lot of text which needs to be copied to the C file, except for a certain part that follows the above-mentioned text in B. That part needs to be changed and then written, BUT the strings are not the same: my program never enters the IF condition in which I am comparing the "Sugar=[Sugar#Butter¶Cherry]" from A and the "Sugar=[Sugar#ButterÂ¶Cherry]" from B.
Is there a way I can read the text so that this symbol "¶" appears as it is?
Yes.
Use the correct encoding.
input = codecs.open('bla.txt', 'r', 'UTF-8', 'ignore')
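In Python 3 you don't need the codecs module for this at all; the built-in open takes an encoding argument directly:
input = open('bla.txt', 'r', encoding='UTF-8')
Once the encoding is right you can usually drop 'ignore' as well, so that any remaining decoding problem surfaces as an error instead of silently losing characters.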
I have a file which I read into memory as a list, then split into sublists based on some rule, say list1, list2, ..., listn. Now I want to get the size of each list, where size means the file size that the list would have when written to a file. The following is code I have; the file name is 'temp', whose size is 744 bytes.
from os import stat, remove
from sys import getsizeof
print(stat('temp').st_size) # we get exactly 744 here.
# Now read file into a list and use getsizeof() function:
with open('temp', 'r') as f:
    chunks = f.readlines()
print(getsizeof(chunks))  # here I get 240, which is quite different from 744
Since I can't use getsizeof() to directly get the file size (on disk), once I get the split lists I have to write each one to a tmp file:
open('tmp', 'w').write("".join(list1))
print(stat('tmp').st_size)  # here is the value I want
remove('tmp')
This solution is very slow and requires a lot of writing/reading to disk. Is there any better way to do this? Thanks a lot!
Instead of writing a series of bytes to a file and then looking at the file length [1], you could just check the length of the string that you would have written to the file:
print(len("".join(list1)))
Here, I'm assuming that your list contains byte strings. If it doesn't, you can always encode a byte string from your unicode string:
print(len("".join(list1).encode(your_codec)))
which I think you would need for write to work properly anyway in your original solution.
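As a quick sanity check with made-up data, the length of the encoded string is exactly the number of bytes the file would occupy:
list1 = ['alpha\n', 'beta\n']  # made-up sample chunk
data = ''.join(list1).encode('utf-8')  # codec assumed to be utf-8
print(len(data))  # 11, the on-disk size of a file containing these lines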
[1] Your original code could also give flaky (wrong!) results, since you never close the file. It's not guaranteed that all the contents of the string will be written to the file when you use os.stat on it, due to buffering.