Why is there no native string folding in Pandas? - python

I'm building an application to read from a SQL database into Pandas for analysis. The data is 'medium data' - too big for one computer (8 GB RAM) to hold in memory. I really didn't want the cost and hassle of constantly spinning up AWS instances, and getting beefier hardware was difficult (I work at a non-profit), so I looked to optimize the memory cost of my own data reads.
I've spent a long time implementing this solution from mobify: http://www.mobify.com/blog/sqlalchemy-memory-magic/
Specifically method 3: they use a dictionary that stores all unique string values. This avoids duplication of objects holding the same string value, by instead passing a reference to the same string. I took their code and implemented it and the results have been very impressive (reduction of memory use by 2-10x, depending on data slice).
It was straightforward enough that I'm confused as to why Pandas doesn't have this natively. I'm a noob to the world of Pandas, but it seems like duplication of strings within large datasets is a given these days. Is there any drawback to string folding being the default in DataFrames? Am I missing something here?
TL;DR: A weakness of pandas is its high memory cost. String folding seems like a straightforward way to significantly reduce the memory overhead. Why isn't it built in?
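For reference, here's a minimal sketch of the string-folding idea (illustrative only, not the mobify code):
def make_string_folder():
    # Keep one canonical copy of each distinct string and hand out references
    # to it, so identical values share a single Python object.
    seen = {}
    def fold(value):
        if isinstance(value, str):
            return seen.setdefault(value, value)
        return value
    return fold

fold = make_string_folder()
raw_rows = [(f"user_{i % 3}", i) for i in range(10)]   # stand-in for rows from SQL
rows = [(fold(name), i) for name, i in raw_rows]
assert rows[0][0] is rows[3][0]                        # both rows reference the same "user_0" object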

Pandas does have something similar built in, in the form of categoricals. They probably only work well for relatively small numbers of unique strings, but they do save on memory usage by mapping each unique string to a numeric code and then storing those codes, e.g.:
import pandas as pd
import random
df = pd.DataFrame({'strs': [random.choice(['banana', 'pineapple', 'orange']) for i in range(100000)]})
df['catted'] = pd.Categorical(df['strs'])
df.memory_usage()
Out[10]:
strs 800000
catted 100024
dtype: int64

Related

Speeding up PyArrow Parquet to Pandas for dataframe with lots of strings

I have a pandas DataFrame I want to query often (in ray via an API). I'm trying to speed up the loading of it, but it takes significant time (3+s) to cast it into pandas. For most of my datasets it's fast, but this one is not. My guess is that it's because roughly 90% of the data is strings.
[742461 rows x 248 columns]
Which is about 137MB on disk. To eliminate disk speed as a factor I've placed the .parq file in a tmpfs mount.
Now I've tried:
Reading the parquet using PyArrow Parquet (read_table) and then casting it to pandas (reading into the table is immediate, but using to_pandas takes 3s)
Playing around with pretty much every setting of to_pandas I can think of in pyarrow/parquet
Reading it using pd.read_parquet
Reading it from Plasma memory store (https://arrow.apache.org/docs/python/plasma.html) and converting to pandas. Again, reading is immediate but to_pandas takes time.
Casting all strings as categories
Does anyone have any good tips on how to speed up the pandas conversion when dealing with strings? I have plenty of cores and RAM.
My end result needs to be a pandas DataFrame, so I'm not bound to the parquet file format, although it's generally my favourite.
Regards,
Niklas
In the end I reduced the time by more carefully handling the data, mainly by removing blank values, making sure we had as many NA values as possible (instead of blank strings etc.) and making categories of all text data with less than 50% unique content.
I ended up generating the schemas via PyArrow so I could create categorical values with a custom index size (int64 instead of int16) so my categories could hold more values. The data size was reduced by 50% in the end.
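For illustration, a minimal sketch of reading a string-heavy Parquet column as a dictionary-encoded (categorical) column; 'dataset.parq' and 'text_col' are placeholder names:
import pyarrow.parquet as pq

# read_dictionary asks the Parquet reader to decode that column as a
# DictionaryArray, which to_pandas() turns into a pandas Categorical, avoiding
# per-row Python string construction for heavily repeated values.
table = pq.read_table('dataset.parq', read_dictionary=['text_col'])
df = table.to_pandas()
print(df['text_col'].dtype)  # category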

Read only Pandas dataset in Dask Distributed

TL;DR
I want to allow workers to use a scattered Pandas Dataframe, but not allow them to mutate any data. Look below for sample code. Is this possible? (or is this a pure Pandas question?)
Full question
I'm reading a Pandas Dataframe, and scattering it to the workers. I then use this future when I submit new tasks, and store it in a Variable for easy access.
Sample code:
import pyarrow.parquet as pq
from dask.distributed import Variable

# client is an existing dask.distributed Client; dataset is the dataset name
df = pq.read_table('/data/%s.parq' % dataset).to_pandas()
df = client.scatter(df, broadcast=True, direct=True)
v = Variable('dataset')
v.set(df)
When I submit a job I use:
def top_ten_x_based_on_y(dataset, column):
    return dataset.groupby(dataset['something'])[column].mean().sort_values(ascending=False)[0:10].to_dict()
a = client.submit(top_ten_x_based_on_y, df, 'a_column')
Now, I want to run 10-20 QPS on this dataset, which all workers have in memory (data < RAM), but I want to protect against accidental changes to the dataset, such as one worker "corrupting" its own memory, which can lead to inconsistencies. Preferably, any attempt to modify would raise an exception.
The data set is roughly 2GB.
I understand this might be problematic since a Pandas DataFrame itself is not immutable (although a NumPy array can be made immutable).
Other ideas:
Copy the dataset on each query, but 2GB copy takes time even in RAM (roughly 1.4 seconds)
Devise a way to hash a dataframe (problematic in itself, even though hash_pandas_object now exists), and check before and after (or every minute) whether the dataframe is the same as expected. Running hash_pandas_object takes roughly 5 seconds.
Unfortunately Dask currently offers no additional features on top of Python to avoid mutation in this way. Dask just runs Python functions, and those Python functions can do anything they like.
Your suggestions of copying or checking before running operations seem sensible to me.
You might also consider raising this as a question or feature request to Pandas itself.
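As a concrete illustration of the hashing idea above, a minimal sketch using pandas.util.hash_pandas_object on a local DataFrame; the scattered copy on each worker would be checked the same way (column names are taken from the sample code):
import pandas as pd
from pandas.util import hash_pandas_object

# Compute a per-row hash once, then compare periodically to detect mutation.
df = pd.DataFrame({'something': ['a', 'b', 'a'], 'a_column': [1.0, 2.0, 3.0]})
baseline = hash_pandas_object(df).values    # one uint64 hash per row

def dataset_unchanged(current: pd.DataFrame) -> bool:
    current_hashes = hash_pandas_object(current).values
    return len(current_hashes) == len(baseline) and bool((current_hashes == baseline).all())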

Large Pandas Dataframe parallel processing

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.
Eg.
from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())
Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).
The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow and also requires many times the memory of the DataFrame.
One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to select subsets of data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.
An alternative would be to explore numba.vectorize with target='parallel'. This would require the use of NumPy arrays not Pandas objects, so it also has some complexity costs.
In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.
Python multiprocessing is typically done using separate processes, as you noted, meaning that the processes don't share memory. There's a potential workaround if you can get things to work with np.memmap as mentioned a little farther down the joblib docs, though dumping to disk will obviously add some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
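For illustration, a hedged sketch of that memmapping workaround with joblib; the array shape, file path, and per-row work are placeholders:
import numpy as np
from joblib import Parallel, delayed, dump, load

# Persist the numeric data once, then let each worker open it as a read-only
# memory map instead of receiving a pickled copy of the DataFrame.
values = np.random.rand(100_000, 10)        # stand-in for the numeric part of df
dump(values, '/tmp/shared_values.joblib')
shared = load('/tmp/shared_values.joblib', mmap_mode='r')

def process(i, data):
    return data[i].sum()                    # stand-in for the real per-row work

results = Parallel(n_jobs=8)(delayed(process)(i, shared) for i in range(len(shared)))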

Fastest way to parse large CSV files in Pandas

I am using pandas to analyse large CSV data files. They are around 100 megs in size.
Each load from csv takes a few seconds, and then more time to convert the dates.
I have tried loading the files, converting the dates from strings to datetimes, and then re-saving them as pickle files. But loading those takes a few seconds as well.
What fast methods could I use to load/save the data from disk?
As @chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find something better to parse the csv (as a note, read_csv is not a 'pure python' solution, as the CSV parser is implemented in C).
But, if you have to load/query the data often, a solution would be to parse the CSV only once and then store it in another format, eg HDF5. You can use pandas (with PyTables in background) to query that efficiently (docs).
See here for a comparison of the io performance of HDF5, csv and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
And a possibly relevant other question: "Large data" work flows using pandas
Posting this late, in response to a similar question which found that simply using modin out of the box fell short. The answer will be similar with dask - use all of the below strategies in combination for best results, as appropriate for your use case.
The pandas docs on Scaling to Large Datasets have some great tips which I'll summarize here:
Load less data. Read in a subset of the columns or rows using the usecols or nrows parameters to pd.read_csv. For example, if your data has many columns but you only need the col1 and col2 columns, use pd.read_csv(filepath, usecols=['col1', 'col2']). This can be especially important if you're loading datasets with lots of extra commas (e.g. the rows look like index,col1,col2,,,,,,,,,,,,). In this case, read a small sample first with nrows to check which columns you actually need, then pass those to usecols for the full read.
Use efficient datatypes. By default, pandas stores all integer data as signed 64-bit integers, floats as 64-bit floats, and strings as objects or string types (depending on the version). You can convert these to smaller data types with tools such as Series.astype or pd.to_numeric with the downcast option.
Use Chunking. Parsing huge blocks of data can be slow, especially if your plan is to operate row-wise and then write it out or to cut the data down to a smaller final form. You can use the chunksize and iterator arguments to loop over chunks of the data and process the file in smaller pieces. See the docs on Iterating through files chunk by chunk for more detail. Alternately, use the low_memory flag to get Pandas to use the chunked iterator on the backend, but return a single dataframe.
Use other libraries. There are a couple great libraries listed here, but I'd especially call out dask.dataframe, which specifically works toward your use case, by enabling chunked, multi-core processing of CSV files which mirrors the pandas API and has easy ways of converting the data back into a normal pandas dataframe (if desired) after processing the data.
Additionally, there are some csv-specific things I think you should consider:
Specifying column data types. Especially if chunking, but even if you're not, specifying the column types can dramatically reduce read time and memory usage and highlight problem areas in your data (e.g. NaN indicators or flags that don't meet one of pandas's defaults). Use the dtype parameter with a single data type to apply to all columns or a dict of column name, data type pairs to indicate the types to read in. Optionally, you can provide converters to format dates, times, or other numerical data if it's not in a format recognized by pandas.
Specifying the parser engine - pandas can read csvs in pure python (slow) or C (much faster). The python engine has slightly more features (e.g. currently the C parser can't read files with complex multi-character delimiters and it can't skip footers). Try using the argument engine='c' to make sure the C engine is being used. If your file can't be read by the C engine, I'd try fixing the file(s) first manually (e.g. stripping out a footer or standardizing the delimiters) and then parsing with the C engine, if possible.
Make sure you're catching all NaNs and data flags in numeric columns. This can be a tough one, and specifying specific data types in your inputs can be helpful in catching bad cases. Use the na_values, keep_default_na, date_parser, and converters arguments to pd.read_csv. Currently, the default list of values interpreted as NaN is ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null']. For example, if your numeric columns have non-numeric values coded as notANumber then this would be missed and would either cause an error (if you had dtype specified) or would cause pandas to re-categorize the entire column as an object column (suuuper bad for memory and speed!). A combined read_csv sketch follows at the end of these tips.
Read the pd.read_csv docs over and over and over again. Many of the arguments to read_csv have important performance considerations. pd.read_csv is optimized to smooth over a large amount of variation in what can be considered a csv, and the more magic pandas has to be ready to perform (determine types, interpret nans, convert dates (maybe), skip headers/footers, infer indices/columns, handle bad lines, etc) the slower the read will be. Give it as many hints/constraints as you can and you might see performance increase a lot! And if it's still not enough, many of these tweaks will also apply to the dask.dataframe API, so this scales up further nicely.
Additionally, if you have the option, save the files in a stable binary storage format. Apache Parquet is a good columnar storage format with pandas support, but there are many others (see that Pandas IO guide for more options). Pickles can be a bit brittle across pandas versions (of course, so can any binary storage format, but this is usually less a concern for explicit data storage formats rather than pickles), and CSVs are inefficient and under-specified, hence the need for type conversion and interpretation.
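A combined sketch pulling several of these tips together; the file name, column names, dtypes, and NaN flag are illustrative assumptions, not taken from the question:
import pandas as pd

# Illustrative column names, dtypes, and NaN flag; adjust for your data.
dtypes = {'col1': 'float32', 'col2': 'float32', 'flag': 'category'}

chunks = pd.read_csv(
    'big_file.csv',
    usecols=list(dtypes),      # load only the columns you need
    dtype=dtypes,              # skip type inference
    engine='c',                # make sure the fast C parser is used
    na_values=['notANumber'],  # catch custom NaN flags
    chunksize=1_000_000,       # process the file in pieces
)
df = pd.concat(chunks, ignore_index=True)
df.to_parquet('big_file.parquet')  # binary format for faster reloads (needs pyarrow or fastparquet)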
One thing to check is the actual performance of the disk system itself. Especially if you use spinning disks (not SSD), your practical disk read speed may be one of the explaining factors for the performance. So, before doing too much optimization, check if reading the same data into memory (by, e.g., mydata = open('myfile.txt').read()) takes an equivalent amount of time. (Just make sure you do not get bitten by disk caches; if you load the same data twice, the second time it will be much faster because the data is already in RAM cache.)
See the update below before believing what I write underneath
If your problem is really parsing of the files, then I am not sure if any pure Python solution will help you. As you know the actual structure of the files, you do not need to use a generic CSV parser.
There are three things to try, though:
Python csv package and csv.reader
NumPy genfromtxt
NumPy loadtxt
The third one is probably fastest if you can use it with your data. At the same time it has the most limited set of features. (Which actually may make it fast.)
Also, the suggestions given you in the comments by crclayton, BKay, and EdChum are good ones.
Try the different alternatives! If they do not work, then you will have to write something in a compiled language (either compiled Python or, e.g., C).
Update: I do believe what chrisb says below, i.e. the pandas parser is fast.
Then the only way to make the parsing faster is to write an application-specific parser in C (or other compiled language). Generic parsing of CSV files is not straightforward, but if the exact structure of the file is known there may be shortcuts. In any case parsing text files is slow, so if you ever can translate it into something more palatable (HDF5, NumPy array), loading will be only limited by the I/O performance.
Modin is an early-stage project at UC Berkeley’s RISELab designed to facilitate the use of distributed computing for Data Science. It is a multiprocess Dataframe library with an identical API to pandas that allows users to speed up their Pandas workflows.
Modin accelerates Pandas queries by 4x on an 8-core machine, only requiring users to change a single line of code in their notebooks.
pip install modin
If using dask:
pip install modin[dask]
Import modin by typing:
import modin.pandas as pd
It uses all CPU cores to import the csv file, and it is almost identical to pandas.
Most of the solutions here are helpful. I would like to add that parallelizing the loading can also help. Simple code below:
import os
import glob
import pandas as pd
from multiprocessing import Pool

path = r'C:\Users\data'  # or whatever your path is
data_files = glob.glob(os.path.join(path, "*.psv"))  # list of all the files to be read

def read_psv_all(file_name):
    return pd.read_csv(file_name,
                       delimiter='|',    # change this as needed
                       low_memory=False)

pool = Pool(processes=3)  # change 3 to the number of processors you want to utilize
df_list = pool.map(read_psv_all, data_files)
df = pd.concat(df_list, ignore_index=True, axis=0, sort=False)
Note that if you are using Windows/Jupyter, it might be a sinister combination to use parallel processing. You might need to guard the pool creation with if __name__ == '__main__': in your code.
Along with this, do utilize usecols and dtype, which would definitely help.

Store data series in file or database if I want to do row level math operations?

I'm developing an app that handles sets of financial series data (input as csv or open document); one set could be, say, 10's x 1000's of double-precision numbers (simplifying, but that's what matters).
I plan to do operations on that data (e.g. sum, difference, averages etc.), including generating, say, another column based on computations on the input. This will be between columns (row-level operations) within one set and also between columns across many (potentially all) sets, at the row level. I plan to write it in Python and it will eventually need an intranet-facing interface to display the results/graphs etc. For now, csv output based on some input parameters will suffice.
What is the best way to store the data and manipulate? So far I see my choices as being either (1) to write csv files to disk and trawl through them to do the math or (2) I could put them into a database and rely on the database to handle the math. My main concern is speed/performance as the number of datasets grows as there will be inter-dataset row level math that needs to be done.
-Has anyone had experience going down either path and what are the pitfalls/gotchas that I should be aware of?
-What are the reasons why one should be chosen over another?
-Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?
-Is there any project or framework out there to help with this type of task?
-Edit-
More info:
The rows will all be read in order, BUT I may need to do some resampling/interpolation to match the differing input lengths as well as differing timestamps for each row. Since each dataset will always have a differing length that is not fixed, I'll have some scratch table/memory somewhere to hold the interpolated/resampled versions. I'm not sure if it makes more sense to try to store this (and try to upsample/interpolate to a common higher length) or just regenerate it each time it's needed.
"I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input."
This is the standard use case for a data warehouse star-schema design. Buy Kimball's The Data Warehouse Toolkit. Read (and understand) the star schema before doing anything else.
"What is the best way to store the data and manipulate?"
A Star Schema.
You can implement this as flat files (CSV is fine) or RDBMS. If you use flat files, you write simple loops to do the math. If you use an RDBMS you write simple SQL and simple loops.
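For illustration, a minimal sketch of the "simple loops over a flat file" approach; the file and column names are placeholders:
import csv

# Sum and average one fact column with a plain loop over a CSV file.
total = 0.0
count = 0
with open('facts.csv', newline='') as f:
    for row in csv.DictReader(f):
        total += float(row['amount'])
        count += 1
print('sum:', total, 'average:', total / count)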
"My main concern is speed/performance as the number of datasets grows"
Nothing is as fast as a flat file. Period. RDBMS is slower.
The RDBMS value proposition stems from SQL being a relatively simple way to specify SELECT SUM(), COUNT() FROM fact JOIN dimension WHERE filter GROUP BY dimension attribute. Python isn't as terse as SQL, but it's just as fast and just as flexible. Python competes against SQL.
"pitfalls/gotchas that I should be aware of?"
DB design. If you don't get the star schema and how to separate facts from dimensions, all approaches are doomed. Once you separate facts from dimensions, all approaches are approximately equal.
"What are the reasons why one should be chosen over another?"
RDBMS slow and flexible. Flat files fast and (sometimes) less flexible. Python levels the playing field.
"Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?"
Star Schema: central fact table surrounded by dimension tables. Nothing beats it.
"Is there any project or framework out there to help with this type of task?"
Not really.
For speed optimization, I would suggest two other avenues of investigation beyond changing your underlying storage mechanism:
1) Use an intermediate data structure.
If maximizing speed is more important than minimizing memory usage, you may get good results out of using a different data structure as the basis of your calculations, rather than focusing on the underlying storage mechanism. This is a strategy that, in practice, has reduced runtime in projects I've worked on dramatically, regardless of whether the data was stored in a database or text (in my case, XML).
While sums and averages will require runtime in only O(n), more complex calculations could easily push that into O(n^2) without applying this strategy. O(n^2) would be a performance hit that would likely have far more of a perceived speed impact than whether you're reading from CSV or a database. An example case would be if your data rows reference other data rows, and there's a need to aggregate data based on those references.
So if you find yourself doing calculations more complex than a sum or an average, you might explore data structures that can be created in O(n) and would keep your calculation operations in O(n) or better. As Martin pointed out, it sounds like your whole data sets can be held in memory comfortably, so this may yield some big wins. What kind of data structure you'd create would be dependent on the nature of the calculation you're doing.
2) Pre-cache.
Depending on how the data is to be used, you could store the calculated values ahead of time. As soon as the data is produced/loaded, perform your sums, averages, etc., and store those aggregations alongside your original data, or hold them in memory as long as your program runs. If this strategy is applicable to your project (i.e. if the users aren't coming up with unforeseen calculation requests on the fly), reading the data shouldn't be prohibitively long-running, whether the data comes from text or a database.
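A minimal sketch of the pre-cache idea, assuming pandas is available; the file name is a placeholder:
import pandas as pd

# Compute the aggregates once at load time and keep them next to the raw data.
df = pd.read_csv('series.csv')
precomputed = {
    'sums': df.sum(numeric_only=True),
    'means': df.mean(numeric_only=True),
}
# Later queries read from precomputed instead of re-scanning the raw data.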
What matters most is whether all data will fit simultaneously into memory. From the size that you give, it seems that this is easily the case (a few megabytes at worst).
If so, I would discourage using a relational database, and do all operations directly in Python. Depending on what other processing you need, I would probably rather use binary pickles, than CSV.
Are you likely to need all rows in order or will you want only specific known rows?
If you need to read all the data there isn't much advantage to having it in a database.
edit: If the data fits in memory then a simple CSV is fine. Plain text data formats are always easier to deal with than opaque ones if you can use them.
