I am loading data whose size is comparable to my memory limits, so I am conscious of indexing efficiently and not making copies. I need to work on columns 3:8 and 9: (they are also labeled), but combining the two ranges does not seem to work. Rearranging the columns in the underlying data would be needlessly costly (an IO operation), and referencing two dataframes and combining them also sounds like something that would make copies. What is an efficient way to do this?
import numpy as np
import pandas as pd
data = pd.read_stata('S:/data/controls/lasso.dta')
X = pd.concat([data.iloc[:, 3:8], data.iloc[:, 9:888]], axis=1)  # still copies both slices
By the way, if I could read in only half of my data (even a random half), that would also help, but again I do not want to open the original data and save a separate, smaller copy just for this.
import numpy as np
import pandas as pd
data = pd.read_stata('S:/data/controls/lasso.dta')
cols = np.zeros(len(data.columns), dtype=bool)
cols[3:8] = True
cols[9:888] = True
X = data.iloc[:, cols]
del data
This still makes a copy (but just one). It does not seem to be possible to return a view instead of a copy for this sort of selection (source).
Another suggestion is to convert the .dta file to a .csv file (howto). pandas read_csv is much more flexible: you can specify the columns you are interested in (usecols) and how many rows you would like to read (nrows). Unfortunately this requires keeping a converted copy of the file on disk.
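For example, a rough sketch of reading only the wanted columns from such a converted CSV, and optionally a random half of the rows via a skiprows callable (the file path, the seed, and the 888 upper bound are placeholders taken from the code above):
import numpy as np
import pandas as pd

# usecols keeps only the wanted column positions (3:8 and 9:888 here).
wanted = list(range(3, 8)) + list(range(9, 888))

# Optionally skip a random half of the data rows; row 0 is the header, so keep it.
rng = np.random.default_rng(0)
def skip_random_half(i):
    return i > 0 and rng.random() < 0.5

X = pd.read_csv('S:/data/controls/lasso.csv', usecols=wanted, skiprows=skip_random_half)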
Related
I have a file bigger than 7 GB. I am trying to load it into a dataframe using pandas, like this:
df = pd.read_csv('data.csv')
But it takes too long. Is there a better way to speed up the dataframe creation? I was considering changing the parameter engine='c', since it says in the documentation:
"engine{‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete."
But I don't see much gain in speed.
If the problem is that you cannot create the dataframe at all because the file size makes the operation fail, you can check how to chunk the read in this answer.
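A rough sketch of chunked reading (the chunk size and the per-chunk filtering step are placeholders; the idea is that only the reduced pieces are kept in memory):
import pandas as pd

pieces = []
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    # Filter or aggregate each chunk here so that what you keep fits in memory.
    pieces.append(chunk)

df = pd.concat(pieces, ignore_index=True)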
If the dataframe does get created at some point but you find the read too slow, you can use datatable to read the file, convert the result to pandas, and continue with your operations:
import pandas as pd
import datatable as dt
# Read with datatable
datatable_df = dt.fread('myfile.csv')
# Then convert the dataframe into pandas
pandas_df = datatable_df.to_pandas()
How can I write only the first N rows, or rows P to Q, from a pandas dataframe to CSV without subsetting the df first? I cannot subset the data I want to export because of memory issues.
I am thinking of a function which writes to csv row by row.
Thank you
Use head, which returns the first n rows.
Ex.
import pandas as pd
import numpy as np
date = pd.date_range('20190101',periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=date, columns=list('ABCD'))
# write only the top two rows to a csv file
df.head(2).to_csv("test.csv")
Does this work for you?
df.iloc[:N, :].to_csv('file.csv')
Or
df.iloc[P:Q, :].to_csv('file.csv')
I believe df.iloc generally produces references to the original dataframe rather than copying the data.
If this still doesn't work, you might also try setting the chunksize in the to_csv call. It may be that pandas is able to create the subset without using much more memory, but then it makes a complete copy of the rows written to each chunk. If the chunksize is the whole frame, you would end up copying the whole frame at that point and running out of memory.
If all else fails, you can loop through df.iterrows(), df.iloc[P:Q, :].iterrows(), or df.iloc[P:Q, :].itertuples() and write each row using the csv module (possibly writer.writerows(df.iloc[P:Q, :].itertuples())).
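A rough sketch of that last approach (assuming df is your dataframe; P, Q and the output file name are placeholders):
import csv

P, Q = 100, 200  # placeholder row range

with open('rows_P_to_Q.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['index'] + list(df.columns))   # header
    writer.writerows(df.iloc[P:Q, :].itertuples())  # each tuple starts with the index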
Maybe you can select the row indices that you want to write to your CSV file, like this:
df[df.index.isin([1, 2, ...])].to_csv('file.csv')
Or, for a contiguous range of row labels:
df.loc[P:Q].to_csv('file.csv')
I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
When you use it without a sep parameter, numpy assumes you are reading a binary file, hence the unexpected length. When you do specify a separator, I guess the function simply fails to parse the text and returns an empty array.
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
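For the two-column file generated above, a quick sketch of both (skipping the header row is the key detail):
import numpy as np

# genfromtxt can use the header row as field names (returns a structured array).
arr_named = np.genfromtxt('my_file.csv', delimiter=',', names=True)

# loadtxt returns a plain 2D float array; skip the header explicitly.
arr = np.loadtxt('my_file.csv', delimiter=',', skiprows=1)
print(arr.shape)  # (1000000, 2)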
I have a big 17 GB JSON file in HDFS. I need to read that file and convert it into a numpy array, which is then passed to a K-Means clustering algorithm. I have tried many ways, but the system slows down and I get a memory error, or the kernel dies.
The code I tried is:
from hdfs3 import HDFileSystem
import pandas as pd
import numpy as nm
import json
hdfs = HDFileSystem(host='hostname', port=8020)
with hdfs.open('/user/iot_all_valid.json/') as f:
    for line in f:
        data = json.loads(line)
df = pd.DataFrame(data)
dataset = nm.array(df)
I tried using ijson, but I am still not sure which is the right way to do this faster.
I would stay away from both numpy and pandas, since you will get memory issues in both cases. I'd rather stick with SFrame or the Blaze ecosystem, which are designed specifically to handle this kind of "big data" case. Amazing tools!
Because the data types are going to differ per column, a pandas dataframe would be a more appropriate data structure to keep the data in. You can still manipulate it with numpy functions.
import pandas as pd
data = pd.read_json('/user/iot_all_valid.json', dtype={<express converters for the different types here>})
To avoid the crashing issue, try running the k-means on a small sample of the data set and make sure that works as expected. Then you can increase the data size until you are comfortable with the whole data set.
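For illustration, a rough sketch of clustering a small sample first (the 1% fraction, the seed, and scikit-learn's KMeans are my assumptions, not part of the setup above):
import numpy as np
from sklearn.cluster import KMeans

sample = data.sample(frac=0.01, random_state=0)  # 1% of the rows
X = sample.to_numpy(dtype='float64')             # assumes all-numeric columns

kmeans = KMeans(n_clusters=8, random_state=0).fit(X)
print(np.bincount(kmeans.labels_))               # cluster sizes on the sample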
To deal with a numpy array potentially larger than the available RAM, I would use a memory-mapped numpy array. On my machine, ujson was 3.8 times faster than the builtin json. Assuming rows is the number of lines of JSON:
from hdfs3 import HDFileSystem
import numpy as np
import ujson as json

rows = int(1e8)
columns = 4

# 'w+' will overwrite any existing output.npy
out = np.memmap('output.npy', dtype='float32', mode='w+', shape=(rows, columns))

hdfs = HDFileSystem(host='hostname', port=8020)
with hdfs.open('/user/iot_all_valid.json/') as f:
    for row, line in enumerate(f):
        data = json.loads(line)
        # convert data to a numerical row of length `columns`
        out[row] = data

out.flush()
# memmap closes on delete.
del out
I understand that one of the reasons why pandas can be relatively slow importing csv files is that it needs to scan the entire content of a column before guessing the type (see the discussions around the mostly deprecated low_memory option for pandas.read_csv). Is my understanding correct?
If it is, what would be a good format in which to store a dataframe, and which explicitly specifies data types, so pandas doesn't have to guess (SQL is not an option for now)?
Any option in particular from those listed here?
My dataframes have floats, integers, dates, strings and Y/N, so formats supporting numeric values only won't do.
One option is to use numpy.genfromtxt with delimiter=',', names=True, and then to initialize the pandas dataframe from the resulting numpy array. The array will be structured, so the pandas constructor should automatically pick up the field names.
In my experience this performs well.
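A rough sketch of that approach (the file name is a placeholder; dtype=None lets genfromtxt infer a dtype per column):
import numpy as np
import pandas as pd

# Structured array: field names come from the CSV header, dtypes are inferred per column.
arr = np.genfromtxt('data.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')

# The DataFrame constructor picks up the field names as column names.
df = pd.DataFrame(arr)
print(df.dtypes)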
You can improve the efficiency of importing from a CSV file by specifying column names and their datatypes to your call to pandas.read_csv. If you have existing column headers in the file, you probably don't have to specify the names and can just use those, but I like to skip the header and specify names for completeness:
import pandas as pd
import numpy as np
col_names = ['a', 'b', 'whatever', 'your', 'names', 'are']
col_types = {k: np.int32 for k in col_names} # create the type dict
col_types['a'] = 'object' # can change whichever ones you like
df = pd.read_csv(fname,
                 header=None,      # since we are specifying our own names
                 skiprows=[0],     # if you *do* have a header row, skip it
                 names=col_names,
                 dtype=col_types)
On a large sample dataset comprising mostly integer columns, this was about 20% faster than specifying dtype='object' in the call to pd.read_csv for me.
I would consider either the HDF5 format or the Feather format. Both of them are pretty fast (Feather might be faster, but HDF5 is more feature rich - for example, it supports reading from disk by index), and both of them store the column types, so they don't have to guess dtypes and they don't have to convert data types (for example strings to numbers or strings to datetimes) when loading data.
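A rough sketch of both round trips (file names and the HDF5 key are placeholders; HDF5 needs the tables package and Feather needs pyarrow):
import pandas as pd

# HDF5 round trip; the key names the table inside the file.
df.to_hdf('data.h5', key='df', mode='w')
df_hdf = pd.read_hdf('data.h5', key='df')

# Feather round trip; may require a default integer index (reset_index() first if needed).
df.to_feather('data.feather')
df_feather = pd.read_feather('data.feather')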
Here are some speed comparisons:
which is faster for load: pickle or hdf5 in python
What is the fastest way to upload a big csv file in notebook to work with python pandas?