How to speed up importing dataframes into pandas - python

I understand that one of the reasons why pandas can be relatively slow importing csv files is that it needs to scan the entire content of a column before guessing the type (see the discussions around the mostly deprecated low_memory option for pandas.read_csv). Is my understanding correct?
If it is, what would be a good format in which to store a dataframe, and which explicitly specifies data types, so pandas doesn't have to guess (SQL is not an option for now)?
Any option in particular from those listed here?
My dataframes have floats, integers, dates, strings and Y/N, so formats supporting numeric values only won't do.

One option is to use numpy.genfromtxt with delimiter=',', names=True, then to initialize the pandas dataframe with the numpy array. The numpy array will be structured, and the pandas constructor should automatically set the field names.
In my experience this performs well.
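For example, a minimal sketch (assuming a comma-separated file with a header row; "data.csv" is a placeholder name):
import numpy as np
import pandas as pd

# names=True takes the column names from the header row; dtype=None lets
# genfromtxt infer a type per column; encoding avoids bytes strings in Python 3
arr = np.genfromtxt("data.csv", delimiter=",", names=True, dtype=None, encoding="utf-8")

# the structured array's field names become the DataFrame's column names
df = pd.DataFrame(arr)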

You can improve the efficiency of importing from a CSV file by specifying column names and their datatypes in your call to pandas.read_csv. If you have existing column headers in the file, you probably don't have to specify the names and can just use those, but I like to skip the header and specify names for completeness:
import pandas as pd
import numpy as np

col_names = ['a', 'b', 'whatever', 'your', 'names', 'are']
col_types = {k: np.int32 for k in col_names}  # create the type dict
col_types['a'] = 'object'  # can change whichever ones you like

df = pd.read_csv(fname,
                 header=None,     # since we are specifying our own names
                 skiprows=[0],    # if you *do* have a header row, skip it
                 names=col_names,
                 dtype=col_types)
On a large sample dataset comprising mostly integer columns, this was about 20% faster than specifying dtype='object' in the call to pd.read_csv for me.

I would consider either the HDF5 format or the Feather format. Both of them are pretty fast (Feather might be faster, but HDF5 is more feature-rich - for example, reading from disk by index) and both of them store the column types, so they don't have to guess dtypes and they don't have to convert data (for example strings to numbers or strings to datetimes) when loading.
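A minimal sketch of both round-trips (assuming df is your dataframe, pyarrow is installed for Feather and PyTables for HDF5; the file names are placeholders):
import pandas as pd

# Feather: column types travel with the file, so nothing is re-inferred on load
df.to_feather("data.feather")
df = pd.read_feather("data.feather")

# HDF5: key names the object inside the store
df.to_hdf("data.h5", key="df", mode="w")
df = pd.read_hdf("data.h5", key="df")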
Here are some speed comparisons:
which is faster for load: pickle or hdf5 in python
What is the fastest way to upload a big csv file in notebook to work with python pandas?

Related

What is the fastest way to retrieve header names from excel files using pandas

I have some large Excel files and I'm collecting their column names into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook
wb = load_workbook("E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in wb.worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns[sheet.title] = value  # header row of each sheet, keyed by sheet name
Assuming you only have one sheet, columns will hold a single tuple of column names.
If you want faster reading, I suggest switching to another file type. Excel files, while convenient, are binary files, so for pandas to read and parse them correctly it has to load the full file. Using nrows or skipfooter to work with less data only takes effect after the full file has been loaded, so it doesn't really reduce the waiting time. By contrast, a .csv file is plain text with no significant metadata, so you can pull out just its first rows as an iterable using the chunksize parameter of pd.read_csv() (see the sketch below).
Other than that, calling list() on a DataFrame already returns a list of its column names, so my only suggestion for the code you are using is:
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl'))
The stronger suggestion, if you specifically want to address the speed issue, is to change the file type.
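As a rough sketch of the CSV route described above (assuming the same data has been exported to a hypothetical "dbo.csv"):
import pandas as pd

# chunksize returns an iterator of DataFrames; pull only the first, tiny chunk
first_chunk = next(pd.read_csv("dbo.csv", chunksize=1))

# list() on a DataFrame returns its column names
get_col = list(first_chunk)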

Trying to import all columns from a csv with an object data type with pandas

I'm trying to read a csv into a new dataframe with pandas. A number of the columns may only contain numeric values, but I still want to have them imported in as strings/objects, rather than having columns of float type.
I'm trying to write some python scripts for data conversion/migration. I'm not an advanced Python programmer and mostly learning as I come across a problem that needs solving.
The csvs I am importing have a varying number of columns, and even different column titles, in any order, over which I have no control, so I can't explicitly specify data types using the dtype parameter with read_csv. I just want every column imported to be treated as an object data type so I can analyse it further for data quality.
Examples would be the 'Staff ID' and 'License Number' columns on one CSV I tried, which should be string fields holding 7-digit IDs but were imported as type float64.
I have tried using astype with read_csv, and applymap on the dataframe after import.
Note, there is no hard and fast rule on the contents of the type or quality of the data which is why I want to always import them as dtype of object.
Thanks in advance for anyone who can help me figure this out.
I've used the following code to read it in.
import pandas as pd
df = pd.read_csv("agent.csv",encoding="ISO-8859-1")
This creates the 'License Number' column in df with a type of float64 (among others).
Here is an example of License Number which should be a string:
'1275595' being stored as 1275595.0
Converting it back to a string/object in df after the import just gives '1275595.0'.
Tell pandas not to convert the data at all:
pd.read_csv(..., dtype=str)
Doc: read_csv
dtype: ... Use str or object together with suitable na_values settings
to preserve and not interpret dtype.
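A minimal sketch applied to the file from the question (keep_default_na=False is the kind of na_values tweak the docs mention, so empty cells come back as empty strings instead of NaN):
import pandas as pd

# every column is read as str, so '1275595' is never coerced to 1275595.0
df = pd.read_csv("agent.csv", encoding="ISO-8859-1", dtype=str, keep_default_na=False)
print(df.dtypes)  # all columns show as object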
I recommend you split your csv reading process into multiple, specific-purpose functions.
For example:
import pandas as pd

# Base function for reading a csv. All the parsing/formatting is done here.
def read_csv(file_content, header=0, columns=None, encoding='utf-8'):
    df = pd.read_csv(file_content, header=header, encoding=encoding)
    if columns is not None:
        df.columns = columns
    return df

# Function with a specific purpose as stated in the name.
def read_csv_license_plates(file_content, encoding='utf-8'):
    columns = ['col1', 'col2', 'col3']
    return read_csv(file_content, header=0, columns=columns, encoding=encoding)

read_csv_license_plates('agent.csv', encoding='ISO-8859-1')

Numpy CSV fromfile()

I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
When you use it without a sep parameter, numpy assumes you are reading a binary file, hence the unexpected length. When you specify a separator, parsing stops at the first token that cannot be read as a number, most likely the header text 'a,b', so you get back zero elements.
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
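For example, a minimal sketch for the file created in the question:
import numpy as np

# loadtxt has to be told to skip the header row
arr = np.loadtxt("my_file.csv", delimiter=",", skiprows=1)
print(arr.shape)  # (1000000, 2)

# genfromtxt can consume the header itself and returns a structured array
rec = np.genfromtxt("my_file.csv", delimiter=",", names=True)
print(rec["a"][:5])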

Proper way of writing and reading Dataframe to file in Python

I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated... I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck with the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df')
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), key='df')
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
The warning is because pandas has detected conflicting value types within those columns. You can specify the datatypes in the read call if you wish, by adding e.g.
dtype={'FIELD': int, 'FIELD2': str}
and so on.
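For example, a sketch of the read from the question with explicit dtypes ('FIELD' and 'FIELD2' are hypothetical names standing in for columns 7, 17 and 28 from the warning, and path stands in for self.get_local_file_path(hash, dataset_name)):
import pandas as pd

path = "dataset.tsv"  # placeholder for self.get_local_file_path(hash, dataset_name)
df_final = pd.read_table(path, encoding='utf8', index_col=[0, 1],
                         dtype={'FIELD': int, 'FIELD2': str})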

Auto convert strings and float columns using genfromtxt from numpy/python

I have several different data files that I need to import using genfromtxt. Each data file has different content. For example, file 1 may have all floats, file 2 may have all strings, and file 3 may have a combination of floats and strings etc. Also the number of columns vary from file to file, and since there are hundreds of files, I don't know which columns are floats and strings in each file. However, all the entries in each column are the same data type.
Is there a way to set up a converter for genfromtxt that will detect the type of data in each column and convert it to the right data type?
Thanks!
If you're able to use the Pandas library, pandas.read_csv is much more generally useful than np.genfromtxt, and will automatically handle the kind of type inference mentioned in your question. The result will be a dataframe, but you can get out a numpy array in one of several ways, e.g.
import pandas as pd
data = pd.read_csv(filename)
# get a numpy array; this will be an object array if data has mixed/incompatible types
arr = data.values
# get a record array; this is how numpy handles mixed types in a single array
arr = data.to_records()
pd.read_csv has dozens of options for various forms of text inputs; see more in the pandas.read_csv documentation.
