Suppose I have a CSV file with 400 columns. I cannot load the entire file into a DataFrame (it won't fit in memory). However, I only really want 50 columns, and those will fit in memory. I don't see any built-in pandas way to do this. What do you suggest? I'm open to using the PyTables interface, or pandas.io.sql.
The best-case scenario would be a function like pandas.read_csv(..., columns=['name', 'age', ..., 'income']), i.e. we pass a list of column names (or numbers) to be loaded.
Ian, I implemented a usecols option which does exactly what you describe. It will be in the upcoming pandas 0.10; a development version will be available soon.
Since 0.10, you can use usecols like:
df = pd.read_csv(..., usecols=['name', 'age', ..., 'income'])
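usecols also accepts column positions, and in more recent pandas versions a callable over the column names. A rough sketch, with "data.csv" and the selected columns as placeholders:
import pandas as pd

df = pd.read_csv('data.csv', usecols=[0, 3, 7])                                  # select by position
df = pd.read_csv('data.csv', usecols=lambda c: c in {'name', 'age', 'income'})   # select by name predicate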
There's no built-in way to do this right now. I would suggest reading the file in chunks, iterating over the chunks, and discarding the columns you don't want from each one.
So something like pd.concat([chunk.loc[:, cols_to_keep] for chunk in pd.read_csv(..., chunksize=200)])
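If it helps, here is a minimal runnable version of that chunked approach (the file name and column list are placeholders):
import pandas as pd

cols_to_keep = ['name', 'age', 'income']   # hypothetical subset of the 400 columns
chunks = pd.read_csv('data.csv', chunksize=200)
df = pd.concat(chunk.loc[:, cols_to_keep] for chunk in chunks)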
Being new to Python, I am looking for some help reshaping this data; I already know how to do it in Excel, but I want a Python-specific solution.
I want it to be in this format.
The entire dataset is 70k rows with different vc_firm_name values; any help would be great.
If you care about performance, then I suggest you take a look at other methods (such as using numpy, or sorting the table):
https://stackoverflow.com/a/42550516/17323241
https://stackoverflow.com/a/66018377/17323241
https://stackoverflow.com/a/22221675/17323241 (look at second comment)
Otherwise, you can do:
import pandas as pd

# load data from the csv file
df = pd.read_csv("example.csv")

# aggregate: one list of investment industries per firm
df.groupby("vc_firm_name")["investment_industry"].apply(list)
Assuming the original file is "original.csv" and you want to save it as "new.csv", I would do:
(pd.read_csv("original.csv")
   .groupby(by=["vc_firm_name"], as_index=False)
   .aggregate(lambda x: ','.join(x))
   .to_csv("new.csv", index=False))
As a sort of follow-on to my previous question [1], is there a way to open an hdf5 dataset in vaex, perform operations, and then store the results in the same dataset?
I tried the following:
import vaex as vx
vxframe = vx.open('somedata.hdf5')
vxframe = some_transformation(vxframe)
vxframe.export_hdf5('somedata.hdf5')
This results in the error OSError: Unable to create file (unable to truncate a file which is already open), so h5py can't write to the file while it is open. Is there another workflow to achieve this? I can write to another file as a workaround, but that seems quite inefficient as (I imagine) it has to copy all the data that has not changed as well.
[1] Convert large hdf5 dataset written via pandas/pytables to vaex
Copying to a new file would not be less efficient than writing to the file itself (at least not for this example), since it has to write the same number of bytes either way. I also would not recommend overwriting the file in place, since if you make a mistake, you will mess up your data.
Exporting data is actually quite efficient, but even better, you can also choose to just export the columns you want:
df = vaex.open('somedata.hdf5')
df2 = some_transformation(df)
df2[['new_column1', 'new_column2']].export('somedata_extra.hdf5')
...
# next time
df = vaex.open('somedata.hdf5')
df2 = vaex.open('somedata_extra.hdf5')
df = df.join(df2)  # joining without a key column merges on a row basis
We used this approach a lot to create auxiliary datasets on disk that were precomputed. Joining them back (on a row basis) is instant; it does not take any time or memory.
It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CSV/Excel file. This would be a problem for Excel because I'd run out of columns (the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, non-transposed data file has 9000+ rows and 20ish columns. I'm using Excel 2003, which supports up to 256 columns.
EDIT: I figured out a solution that works for me. It's a lot simpler than I expected. I ended up using CSV instead of Excel (I found no serious difference for my project). Here it is for whoever may have the same problem:
import pandas as pd

selectionList = (2, 43, 792, 4760)  # rows to select

df = pd.read_csv(your_csv_file, index_col=0).T

selection = {}
for item in selectionList:
    selection[item] = df[item]

selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)
I think you can use the skiprows and nrows arguments in pandas.read_csv to pick out individual rows to read in.
With skiprows, you can give it a list (0-indexed) of row numbers not to import, e.g. [0, 5, 6, 10]. That might end up being a huge list, though. If you give it a single integer instead, it will skip that many rows at the start of the file and begin reading from there. Set nrows to the number of rows you want to read from the point where it starts.
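For example, a rough sketch along those lines (the file name and row numbers are placeholders; recent pandas versions also accept a callable for skiprows, which makes cherry-picking rows easier):
import pandas as pd

# Skip data rows 1..1000 and then read the next 5 rows; the header row (line 0) is kept.
df = pd.read_csv('data.csv', skiprows=range(1, 1001), nrows=5)

# Or keep only specific rows: skip every line except the header and the wanted data rows.
wanted = {2, 43, 792, 4760}
df = pd.read_csv('data.csv', skiprows=lambda i: i != 0 and i not in wanted)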
If I've misunderstood the issue, let me know.
I understand that one of the reasons why pandas can be relatively slow importing csv files is that it needs to scan the entire content of a column before guessing the type (see the discussions around the mostly deprecated low_memory option for pandas.read_csv). Is my understanding correct?
If it is, what would be a good format in which to store a dataframe, and which explicitly specifies data types, so pandas doesn't have to guess (SQL is not an option for now)?
Any option in particular from those listed here?
My dataframes have floats, integers, dates, strings and Y/N, so formats supporting numeric values only won't do.
One option is to use numpy.genfromtxt with delimiter=',', names=True, and then initialize the pandas DataFrame from the resulting numpy array. The array will be structured, and the pandas constructor automatically picks up the field names as column names.
In my experience this performs well.
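A minimal sketch of that approach, assuming a comma-separated file "data.csv" with a header row (the file name is a placeholder):
import numpy as np
import pandas as pd

# dtype=None lets genfromtxt infer a type per column; names=True takes names from the header.
arr = np.genfromtxt('data.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')
df = pd.DataFrame(arr)  # the structured array's field names become the column names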
You can improve the efficiency of importing from a CSV file by specifying column names and their datatypes to your call to pandas.read_csv. If you have existing column headers in the file, you probably don't have to specify the names and can just use those, but I like to skip the header and specify names for completeness:
import pandas as pd
import numpy as np

col_names = ['a', 'b', 'whatever', 'your', 'names', 'are']
col_types = {k: np.int32 for k in col_names}  # create the type dict
col_types['a'] = 'object'  # can change whichever ones you like

df = pd.read_csv(fname,
                 header=None,    # since we are specifying our own names
                 skiprows=[0],   # if you *do* have a header row, skip it
                 names=col_names,
                 dtype=col_types)
On a large sample dataset comprising mostly integer columns, this was about 20% faster than specifying dtype='object' in the call to pd.read_csv for me.
I would consider either the HDF5 format or the Feather format. Both are pretty fast (Feather might be faster, but HDF5 is more feature-rich, for example reading from disk by index), and both store the column types, so pandas doesn't have to guess dtypes or convert data (for example strings to numbers or strings to datetimes) when loading.
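A rough sketch of both round trips (file names are placeholders; Feather needs pyarrow and HDF5 needs PyTables installed):
import pandas as pd

df = pd.DataFrame({'x': range(5000), 'y': ['a'] * 5000})  # stand-in dataframe

# Feather: preserves dtypes; fast whole-frame round trips.
df.to_feather('data.feather')
df = pd.read_feather('data.feather')

# HDF5 in table format: also preserves dtypes and lets you read just a slice of rows.
df.to_hdf('data.h5', key='df', format='table')
part = pd.read_hdf('data.h5', 'df', start=1000, stop=2000)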
Here are some speed comparisons:
which is faster for load: pickle or hdf5 in python
What is the fastest way to upload a big csv file in notebook to work with python pandas?
I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated. I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df')
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), 'df')
You might need to install PyTables first.
Both approaches store the data along with its types, so this should solve your problem.
The warning appears because pandas has detected conflicting data types in those columns. If you wish, you can silence it by specifying the dtypes in your read_csv / read_table call, e.g.:
dtype={'FIELD': int, 'FIELD2': str}
and so on for the other columns.
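For instance, a rough sketch mirroring the read_table call above (the column names in the dtype dict are placeholders for whichever columns triggered the warning):
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',
                         index_col=[0,1], dtype={'FIELD': int, 'FIELD2': str})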