I have a few Pandas dataframes with several millions of rows each. The dataframes have columns containing JSON objects each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON) and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON which is then usable for my purposes.
I am wondering what the best ways to speed up performance for this dataset. A few things I have considered and read up on:
It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
I read up on some of the Pandas documentation and a few options indicated are Cython (may be tricky to convert the string edit distance to Cython, especially since I am using an external Python package) and Numba/JIT (but this is mentioned to be best for numerical computations only).
Possibly controlling the number of threads could be an option. The 24 functions can mostly operate without any dependencies on each other.
You are asking for advice and this is not the best site for general advice but nevertheless I will try to point a few things out.
The ideas you have already considered are not going to be helpful - neither Cython, Numba, nor threading are not going to address the main problem - the format of your data that is not conductive for performance of operations on the data.
I suggest that you first "unpack" the JSONs that you store in the column(s?) of your dataframe. Preferably, each field of the JSON (mandatory or optional - deal with empty values at this stage) ends up being a column of the dataframe. If there are nested dictionaries you may want to consider splitting the dataframe (particularly if the 24 functions are working separately at separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs.
Convert to the data format that gives you the best performance. JSON stores all the data in the textual format. Numbers are best used in their binary format. You can do that column-wise on the columns that you suspect should be converted using df['col'].astype(...) (works on the whole dataframe too).
Update the 24 functions to operate not on JSON strings stored in dataframe but on the fields of the dataframe.
Recombine the JSONs for storage (I assume you need them in this format). At this stage the implicit conversion from numbers to strings will occur.
Given the level of details you provided in the question, the suggestions are necessarily brief. Should you have any more detailed questions at any of the above points, it would be best to ask maximally simple question on each of them (preferably containing a self-sufficient MWE).
I'm handling some CSV files with sizes in the range 1Gb to 2Gb. It takes 20-30 minutes just to load the files into a pandas dataframe, and 20-30 minutes more for each operation I perform, e.g. filtering the dataframe by column names, printing dataframe.head(), etc. Sometimes it also lags my computer when I try to use another application while I wait. I'm on a 2019 Macbook Pro, but I imagine it'll be the same for other devices.
I have tried using modin, but the data manipulations are still very slow.
Is there any way for me to work more efficiently?
Thanks in advance for the responses.
The pandas docs on Scaling to Large Datasets have some great tips which I'll summarize here:
Load less data. Read in a subset of the columns or rows using the usecols or nrows parameters to pd.read_csv. For example, if your data has many columns but you only need the col1 and col2 columns, use pd.read_csv(filepath, usecols=['col1', 'col2']). This can be especially important if you're loading datasets with lots of extra commas (e.g. the rows look like index,col1,col2,,,,,,,,,,,. In this case, use nrows to read in only a subset of the data to make sure that the result only includes the columns you need.
Use efficient datatypes. By default, pandas stores all integer data as signed 64-bit integers, floats as 64-bit floats, and strings as objects or string types (depending on the version). You can convert these to smaller data types with tools such as Series.astype or pd.to_numeric with the downcast option.
Use Chunking. Parsing huge blocks of data can be slow, especially if your plan is to operate row-wise and then write it out or to cut the data down to a smaller final form. Alternately, use the low_memory flag to get Pandas to use the chunked iterator on the backend, but return a single dataframe.
Use other libraries. There are a couple great libraries listed here, but I'd especially call out dask.dataframe, which specifically works toward your use case, by enabling chunked, multi-core processing of CSV files which mirrors the pandas API and has easy ways of converting the data back into a normal pandas dataframe (if desired) after processing the data.
Additionally, there are some csv-specific things I think you should consider:
Specifying column data types. Especially if chunking, but even if you're not, specifying the column types can dramatically reduce read time and memory usage and highlight problem areas in your data (e.g. NaN indicators or Flags that don't meet one of pandas's defaults). Use the dtypes parameter with a single data type to apply to all columns or a dict of column name, data type pairs to indicate the types to read in. Optionally, you can provide converters to format dates, times, or other numerical data if it's not in a format recognized by pandas.
Specifying the parser engine - pandas can read csvs in pure python (slow) or C (much faster). The python engine has slightly more features (e.g. currently the C engine can't read files with complex multi-character delimeters and it can't skip footers). Try using the argument engine='c' to make sure the C engine is being used. If you need one of the unsupported file types, I'd try fixing the file(s) first manually (e.g. stripping out a footer) and then parsing with the C engine, if possible.
Make sure you're catching all NaNs and data flags in numeric columns. This can be a tough one, and specifying specific data types in your inputs can be helpful in catching bad cases. Use the na_values, keep_default_na, date_parser, and converters argumentss to pd.read_csv. Currently, the default list of values interpreted as NaN are ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'].For example, if your numeric columns have non-numeric values coded as notANumber then this would be missed and would either cause an error (if you had dtypes specified) or would cause pandas to re-categorieze the entire column as an object column (suuuper bad for memory and speed!).
Read the pd.read_csv docs over and over and over again. Many of the arguments to read_csv have important performance considerations. pd.read_csv is optimized to smooth over a large amount of variation in what can be considered a csv, and the more magic pandas has to be ready to perform (determine types, interpret nans, convert dates (maybe), skip headers/footers, infer indices/columns, handle bad lines, etc) the slower the read will be. Give it as many hints/constraints as you can and you might see performance increase a lot! And if it's still not enough, many of these tweaks will also apply to the dask.dataframe API, so this scales up further nicely.
This may or may not help you, but I found that storing data in HDF files has significantly improved IO speed. If you are ultimately the source of CSV files, I think you should try to store them as HDF instead. Otherwise what Michael has already said may be the way to go.
Consider using polars. It is typically orders of magnitudes faster than pandas. Here are some benchmarks backing that claim.
If you really want full performance, consider using the lazy API. All the filters you describe can maybe even be done at scan level. We can also parralellize over all files easily with pl.collect_all().
Based on your description you could be fine with processing these csv files as streams instead of fully loading them into memory/swap to filter and calling head.
There's a Table (docs) helper in convtools library (github), which helps with streaming csv-like files, applying transforms and of course you can pipe the resulting stream of rows to another tool of your choice (polars / pandas).
For example:
import pandas as pd
from convtools import conversion as c
from convtools.contrib.tables import Table
pd.DataFrame(
Table.from_csv("input.csv", header=True)
.take("a", "c")
.update(b=c.col("a") + c.col("c"))
.filter(c.col("b") < -2)
.rename({"a": "A"})
.drop("c")
.into_iter_rows(dict) # .into_csv("out.csv") if passing to pandas is not needed
)
I have a pandas DataFrame I want to query often (in ray via an API). I'm trying to speed up the loading of it but it takes significant time (3+s) to cast it into pandas. For most of my datasets it's fast but this one is not. My guess is that it's because 90% of these are strings.
[742461 rows x 248 columns]
Which is about 137MB on disk. To eliminate disk speed as a factor I've placed the .parq file in a tmpfs mount.
Now I've tried:
Reading the parquet using pyArrow Parquet (read_table) and then casting it to pandas (reading into table is immediate, but using to_pandas takes 3s)
Playing around with pretty much every setting of to_pandas I can think of in pyarrow/parquet
Reading it using pd.from_parquet
Reading it from Plasma memory store (https://arrow.apache.org/docs/python/plasma.html) and converting to pandas. Again, reading is immediate but to_pandas takes time.
Casting all strings as categories
Anyone has any good tips on how to speed up pandas conversion when dealing with strings? I have plenty of cores and ram.
My end results wants to be a pandas DataFrame, so I'm not bound to the parquet file format although it's generally my favourite.
Regards,
Niklas
In the end I reduced the time by more carefully handling the data, mainly by removing blank values, making sure we had as much NA values as possible (instead of blank strings etc) and making categories on all text data with less than 50% unique content.
I ended up generating the schemas via PyArrow so I could create categorical values with a custom index size (int64 instead of int16) so my categories could hold more values. The data size was reduces by 50% in the end.
I am working on a large dataset that is stored as ndjson where each row of the data is a json object, I read this in line by line and use pandas json_normalise() to flatten each one and save it in a list as a dataframe, I then concat this list afterwards.
The whole process takes ~2 hours on a high powered machine so I would like to save the result so I dont have to repeat it, however, I have tried using to_hdfs and to_parquet but both have been failing and I believe it is due to the majority of columns having mixed data types where there could be strings, floats and ints which is an unavoidable consequence of a messy data collection system.
What would be the most appropriate way of storing this unprocessed data prior to cleaning it?
I think here should help pickle.
For write DataFrame/Series use to_pickle .
For read back use read_pickle.
I am writing a data-harvesting code in Python. I'd like to produce a data frame file that would be as easy to import into R as possible. I have full control over what my Python code will produce, and I'd like to avoid unnecessary data processing on the R side, such as converting columns into factor/numeric vectors and such. Also, if possible, I'd like to make importing that data as easy as possible on the R side, preferably by calling a single function with a single argument of file name.
How should I store data into a file to make this possible?
You can write data to CSV using http://docs.python.org/2/library/csv.html Python's csv module, then it's a simple matter of using read.csv in R. (See ?read.csv)
When you read in data to R using read.csv, unless you specify otherwise, character strings will be converted to factors, numeric fields will be converted to numeric. Empty values will be converted to NA.
First thing you should do after you import some data is to look at the ?str of it to ensure the classes of data contained within meet your expectations. Many times have I made a mistake and mixed a character value in a numeric field and ended up with a factor instead of a numeric.
One thing to note is that you may have to set your own NA strings. For example, if you have "-", ".", or some other such character denoting a blank, you'll need to use the na.strings argument (which can accept a vector of strings ie, c("-",".")) to read.csv.
If you have date fields, you will need to convert them properly. R does not necessarily recognize dates or times without you specifying what they are (see ?as.Date)
If you know in advance what each column is going to be you can specify the class using colClasses.
A thorough read through of ?read.csv will provide you with more detailed information. But I've outlined some common issues.
Brandon's suggestion of using CSV is great if your data isn't enormous, and particularly if it doesn't contain a whole honking lot of floating point values, in which case the CSV format is extremely inefficient.
An option that handled huge datasets a little better might be to construct an equivalent DataFrame in pandas and use its facilities to dump to hdf5, and then open it in R that way. See for example this question for an example of that.
This other approach feels like overkill, but you could also directly transfer the dataframe in-memory to R using pandas's experimental R interface and then save it from R directly.