I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
When you use it without a sep parameter, numpy assumes you are reading a binary file, hence the unexpected length. When you specify a separator, fromfile switches to text parsing, but it most likely stops at the first token it cannot convert to a number (the a,b header line written by to_csv), which is why you get a length of 0.
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
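For the file generated above, a minimal sketch with either function might look like this (skiprows/names handle the one-line a,b header that to_csv writes):
import numpy as np
# loadtxt: skip the 'a,b' header and get a plain (1000000, 2) float array
arr = np.loadtxt('my_file.csv', delimiter=',', skiprows=1)
print(arr.shape)
# genfromtxt: names=True reads the header and returns a structured array with fields 'a' and 'b'
arr2 = np.genfromtxt('my_file.csv', delimiter=',', names=True)
print(arr2['a'][:5])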
I have a JSON file that I want to convert into a DataFrame. Since the dataset is pretty large (~30 GB), I found that I need to set a chunksize to limit memory use. The code is like this:
import pandas as pd
pd.options.display.max_rows
datas = pd.read_json('/Users/xxxxx/Downloads/Books.json', chunksize = 1, lines = True)
datas
Then when I run it the result is
<pandas.io.json._json.JsonReader at 0x15ce38550>
Is this an error?
I also found that if I loop over datas it works. Is there a way to do this the standard way?
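For reference, the loop approach mentioned above looks roughly like this (a minimal sketch; the path and chunksize are the ones from the snippet above, and concatenating every chunk only makes sense if the result fits in memory):
import pandas as pd

reader = pd.read_json('/Users/xxxxx/Downloads/Books.json', chunksize=1, lines=True)

# JsonReader is not a DataFrame; it is an iterator that yields one DataFrame per chunk
chunks = []
for chunk in reader:
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)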
I don't think pandas is the way to go when reading giant json files.
First you should check whether your file is actually valid JSON (the whole file is wrapped in one dictionary) or whether it is a JSONL file (each row is one dictionary in JSON format, but the rows are not connected to each other).
For example, the Amazon Review Dataset has huge files like this: they are all JSONL but named .json.
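One hedged way to tell the two apart (assuming the Books.json path from the question) is to try parsing just the first line on its own:
import json

with open('/Users/xxxxx/Downloads/Books.json') as f:
    first_line = f.readline()

try:
    json.loads(first_line)
    print("first line is a complete JSON value -> most likely JSONL")
except json.JSONDecodeError:
    print("first line alone does not parse -> most likely one big JSON document")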
Two packages I can recommend for parsing huge JSON/JSONL files:
Pretty fast: ujson
import ujson
with open(file_path) as f:
    file_contents = ujson.load(f)
Even faster: ijson
import ijson
file_contents = [t for t in ijson.items(open(file_path), "item")]
This allows you to use a progress bar like tqdm as well:
import ijson
from tqdm import tqdm
file_contents = [t for t in tqdm(ijson.items(open(file_path), "item"))]
This is useful for seeing how fast it is going, since it shows how many items it has already read. Since your file is 30 GB it might take quite a while to read it all, and it's good to know whether it's still going, the memory crashed, or whatever else could have gone wrong.
You can then try to create a DataFrame from the dictionaries by using pandas.DataFrame.from_dict(file_contents), but I think 30 GB of content is more than pandas can comfortably hold in memory as a single DataFrame. Not quite sure though. Generally I would really recommend working with dictionaries for this amount of content, as it's much faster. Then, only when you need to display some part of it for visualization or analysis, convert that part into a DataFrame.
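For that last step, a minimal sketch (assuming file_contents is the list of dictionaries read above; the 100000-row slice is an arbitrary choice, and the plain DataFrame constructor also accepts a list of dicts):
import pandas as pd

# only materialize a manageable slice as a DataFrame; keep the rest as dictionaries
df = pd.DataFrame(file_contents[:100000])
print(df.head())
print(df.dtypes)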
I am attempting to read MapInfo .dat files into .csv files using Python. So far, I have found the easiest way to do this is through xlwings and pandas.
When I do this (below code) I get a mostly correct .csv file. The only issue is that some columns are appearing as symbols/gibberish instead of their real values. I know this because I also have the correct data on hand, exported from MapInfo.
import xlwings as xw
import pandas as pd
app = xw.App(visible=False)
tracker = app.books.open('./cable.dat')
last_row = xw.Range('A1').current_region.last_cell.row
data = xw.Range("A1:AE" + str(last_row))
test_dataframe = data.options(pd.DataFrame, header=True).value
test_dataframe.columns = list(schema)
test_dataframe.to_csv('./output.csv')
When I compare to the real data, I can see that the symbols do actually map to the correct numbers (meaning that 1 = Â?, 2 = #, 3 = #, etc.).
Below is the first part of the 'dictionary' as to how they map:
My question is this:
Is there an encoding that I can use to turn these series of symbols into their correct representation? The floats aren't the only column affected by this, but they are the most important to my data.
Any help is appreciated.
import pandas as pd
from simpledbf import Dbf5

# simpledbf reads the dBase table directly and returns a pandas DataFrame
dbf = Dbf5('path/filename.dat')
df = dbf.to_dataframe()
MapInfo .dat files are dBase files underneath (https://www.loc.gov/preservation/digital/formats/fdd/fdd000324.shtml), so you can just use that method.
Then just write the data out:
df.to_csv('outpath/filename.csv')
EDIT
If I understand correctly, you are using xlwings to load the .dat file into Excel and then reading it into a pandas DataFrame to export it to a CSV file.
Somewhere along the way it seems that some binary data is indeed not interpreted, or incorrectly interpreted, and is then written as text to your CSV file.
directly read dBase file
My first suggestion would be to try to read the input file directly into Python without the use of an Excel instance.
According to Wikipedia, MapInfo .dat files are actually dBase III files. These you can parse in Python using a library like dbfread.
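A minimal sketch with dbfread, assuming the cable.dat file from the question (the encoding argument is a guess and may need adjusting for your data):
import pandas as pd
from dbfread import DBF

# dbfread parses the dBase table itself, so no Excel instance is involved
table = DBF('cable.dat', encoding='cp1252')  # encoding is an assumption
df = pd.DataFrame(list(table))  # each record is an ordered dict of field -> value

print(df.dtypes)
df.to_csv('output.csv', index=False)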
inspect data before writing to csv
Secondly, I would inspect the 'corrupted' columns in Python instead of immediately writing them to disk; a quick check is sketched below. There are two likely cases:
Either something is going wrong in the Excel import and the data in these columns gets imported as text instead of some binary number format,
Or this data is correctly read into memory as a byte array (instead of a float), and when you write it to csv it just gets dumped byte-wise to disk, instead of being interpreted as a number and turned into a text representation.
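A minimal inspection sketch (test_dataframe is the DataFrame from the question's snippet; 'length' is a hypothetical column name standing in for one of the corrupted columns):
# dtype 'object' on a column that should be numeric hints it was imported as text/bytes
print(test_dataframe.dtypes)

# look at the raw Python value of one suspect cell instead of its rendered form
cell = test_dataframe['length'].iloc[0]  # 'length' is a placeholder column name
print(type(cell), repr(cell))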
note
Small remark about your initial question regarding mapping text to numbers:
Probably it will not be possible to create a straightforward map of characters to numbers:
These numbers could have any encoding and might not be stored as decimal text values like you now seem to assume.
The text representations you see are just a decoding using some character encoding (UTF-8, UTF-16). E.g. for UTF-8, several bytes might map to one character, and the question marks or squares you see might indicate that one or more characters could not be decoded.
In any case you will be losing information if you start from the text; you must start from the binary data and decode that.
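To illustrate that last point, here is a small sketch (not your actual data) showing why decoding the bytes of a float as text produces gibberish, while unpacking the same bytes as binary recovers the number:
import struct

raw = struct.pack('<f', 1.0)        # the 4 bytes of a little-endian float: b'\x00\x00\x80?'
print(raw.decode('latin-1'))        # decoding as text gives unprintable junk plus '?'
print(struct.unpack('<f', raw)[0])  # interpreting the same bytes as a float gives back 1.0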
I have a dataset in CSV containing lists of values as strings in a single field that looks more or less like this:
Id,sequence
1,'1;0;2;6'
2,'0;1'
3,'1;0;9'
In the real dataset I'm dealing with, the sequence lengths vary greatly and can contain from one up to a few thousand observations. There are many columns containing sequences, all stored as strings.
I'm reading those CSVs and parsing the strings into lists nested inside a Pandas DataFrame. This takes some time, but I'm ok with it.
However, later when I save the parsed results to pickle the read time of this pickle file is very high.
I'm facing the following:
Reading a raw ~600mb CSV file of such structure to Pandas takes around ~3 seconds.
Reading the same (raw, unprocessed) data from pickle takes ~0.1 second.
Reading the processed data from pickle takes 8 seconds!
I'm trying to find a way to read processed data from disk in the quickest possible way.
Already tried:
Experimenting with different storage formats but most of them can't store nested structures. The only one that worked was msgpack but that didn't improve the performance much.
Using structures other than Pandas DataFrame (like tuple of tuples) - faced similar performance.
I'm not very tied to the exact data structure. The thing is I would like to quickly read parsed data from disk directly to Python.
This might be a duplicate of this question.
HDF5 is quite a bit quicker at handling nested pandas dataframes. I would give that a shot.
An example usage borrowed from here shows how you can chunk it efficiently when dumping:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

store = pd.HDFStore('test.h5')
nrows = store.get_storer('df').nrows
chunksize = 100
for i in range(nrows // chunksize + 1):
    chunk = store.select('df',
                         start=i * chunksize,
                         stop=(i + 1) * chunksize)
    # process each chunk here
store.close()
When reading it back, you can do it in chunks like this, too:
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', start=0, stop=300000, chunksize=3000):
    print(df.info())
    print(df.head(5))
I have a big 17 GB JSON file stored in HDFS. I need to read that file and convert it into a numpy array, which is then passed to a K-Means clustering algorithm. I tried many ways but the system slows down and I get a memory error or the kernel dies.
The code I tried is:
from hdfs3 import HDFileSystem
import pandas as pd
import numpy as nm
import json
hdfs = HDFileSystem(host='hostname', port=8020)
with hdfs.open('/user/iot_all_valid.json/') as f:
    for line in f:
        data = json.loads(line)
df = pd.DataFrame(data)
dataset = nm.array(df)
I tried using ijson, but I am still not sure which is the right way to do this faster.
I would stay away from both numpy and Pandas, since you will get memory issues in both cases. I'd rather stick with SFrame or the Blaze ecosystem, which are designed specifically to handle these kinds of "big data" cases. Amazing tools!
Because the data types are all going to be different per column, a pandas DataFrame would be a more appropriate data structure to keep the data in. You can still manipulate the data with numpy functions.
import pandas as pd
data = pd.read_json('/user/iot_all_valid.json', dtype={<express converters for the different types here>})
In order to avoid the crashing issue, try running the k-means on a small sample on the data set. Make sure that works like expected. Then you can increase the data size till you feel comfortable with the whole data set.
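For example, a hedged sketch of the sampling idea (assuming scikit-learn is available and data is the DataFrame from the snippet above, with already-numeric columns):
from sklearn.cluster import KMeans

# fit on a small random sample first; scale n up once this works as expected
sample = data.sample(n=100000, random_state=0)
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(sample.to_numpy())
print(km.cluster_centers_)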
In order to deal with a numpy array potentially larger than the available RAM, I would use a memory-mapped numpy array. On my machine ujson was 3.8 times faster than the built-in json. Assuming rows is the number of lines of JSON:
from hdfs3 import HDFileSystem
import numpy as np
import ujson as json

# connect to HDFS as in the question
hdfs = HDFileSystem(host='hostname', port=8020)

rows = int(1e8)
columns = 4
# 'w+' will overwrite any existing output.npy
out = np.memmap('output.npy', dtype='float32', mode='w+', shape=(rows, columns))
with hdfs.open('/user/iot_all_valid.json/') as f:
    for row, line in enumerate(f):
        data = json.loads(line)
        # convert data to a numerical array of length `columns` before assigning
        out[row] = data
out.flush()
# memmap closes on delete.
del out
I have several different data files that I need to import using genfromtxt. Each data file has different content. For example, file 1 may have all floats, file 2 may have all strings, and file 3 may have a combination of floats and strings, etc. Also, the number of columns varies from file to file, and since there are hundreds of files, I don't know which columns are floats and which are strings in each file. However, all the entries in each column are the same data type.
Is there a way to set up a converter for genfromtxt that will detect the type of data in each column and convert it to the right data type?
Thanks!
If you're able to use the Pandas library, pandas.read_csv is much more generally useful than np.genfromtxt, and will automatically handle the kind of type inference mentioned in your question. The result will be a dataframe, but you can get out a numpy array in one of several ways. e.g.
import pandas as pd
data = pd.read_csv(filename)
# get a numpy array; this will be an object array if data has mixed/incompatible types
arr = data.values
# get a record array; this is how numpy handles mixed types in a single array
arr = data.to_records()
pd.read_csv has dozens of options for various forms of text inputs; see more in the pandas.read_csv documentation.
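For example, a quick way to see what types were inferred (using the data frame from the snippet above):
# each column gets its own inferred dtype (float64, int64, object for strings, ...)
print(data.dtypes)

# to_records() preserves the per-column types in a structured/record array
print(data.to_records(index=False).dtype)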