So, I work at a place where I use a lot of Python (pandas), and the data keeps getting bigger and bigger. Last month I was working with a few hundred thousand rows, a few weeks later with a few million, and now I am working with 42 million rows. Most of my work is taking a dataframe and, for each row, looking up its "equivalent" in another dataframe and processing the data; sometimes it is just a merge, but more often I need to apply a function to the equivalent data. Back when I had a few hundred thousand rows it was fine to just use apply with a simple filter, but now it is EXTREMELY SLOW. Recently I switched to Vaex, which is faster than pandas in every aspect except apply, and after some searching I found that apply is a last resort and should be used only if you have no other option. So, is there another option? I really don't know.
Some code to explain how I have been doing this the whole time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
I have a master dataframe with anywhere between 750 and 3000 rows of data.
I have a daily order dataframe with anywhere from 3000 to 5000 rows of data.
If the product code from the daily order dataframe is found in the master dataframe, I get the item cost; otherwise the row is marked as invalid and deleted.
I currently do this via two for loops, but I will have to do many more such comparisons and data updates (other fields to compare, other values to copy).
What is the most efficient way to do this?
I cannot make the column I am comparing the index column of the master dataframe.
In this case, the product code may be unique in the master and I could do a merge, but there are other cases where I may have to compare other values like supplier city which may not be unique.
I seem to be doing this repeatedly in all my Python codes and I want to learn the most efficient way to do this.
Order DF:
(screenshot: Order csv from which the Order DF is created)
Master DF:
(screenshot: Master csv from which the Master DF is created)
def fillVol(orderDF, mstrDF, paramC, paramF, notFound):
    orderDF['ttlVol'] = 0
    for i in range(len(orderDF)):
        found = False
        for row in mstrDF.itertuples():
            if orderDF.loc[i, paramC] == getattr(row, paramC):
                orderDF.loc[i, paramF[0]] = getattr(row, paramF[0])  # mtrl cbf
                found = True
                break
        if not found:
            notFound.append(orderDF.loc[i, paramC])
    orderDF['ttlVol'] = orderDF[paramF[0]] * orderDF[paramF[2]]
    return notFound
I am passing in the column names I am comparing and the column names I am filling with data because there are minor variations in naming across the csv files. In the data I have shared, the material volume is CBF; in some cases it is CBM.
The data columns cannot be the index because no single column contains unique data; it is always a combination of values that makes a row unique.
The data, in this case, is a float and numpy could be used, but in other cases, like copying city names from a master, the data is a string. numpy was the suggestion given to other people with a similar issue.
I don't know if this is the most efficient way of doing it. As someone who started programming with Fortran and then C, I am always for basic datatypes, and this solution does not use basic datatypes. It is definitely a highly Pythonic solution, though.
# keep only the order rows whose product code exists in the master
orderDF = orderDF[orderDF[paramC].isin(mstrDF[paramC])]
orderDF = orderDF.reset_index(drop=True)
I use a left merge on the orderDF and mstrDF dataframes to copy all the relevant values:
orderDF = orderDF.merge(mstrDF.drop_duplicates(paramC, keep='last')[[paramC, paramF[0]]], on=paramC, how='left', validate='m:1')
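Since a single column is not always unique (the supplier-city case mentioned above), the same merge pattern extends to a combination of columns passed as a list. The following is only a sketch with hypothetical names ('prodCode', 'city', 'CBF' are examples, not the real CSV headers):

keys = ['prodCode', 'city']   # hypothetical combination of columns that is unique together
fill = 'CBF'                  # hypothetical column to copy from the master

orderDF = orderDF.merge(
    mstrDF.drop_duplicates(keys, keep='last')[keys + [fill]],
    on=keys, how='left', validate='m:1')
# rows with no match in the master come back with NaN and can be treated as invalid
orderDF = orderDF.dropna(subset=[fill]).reset_index(drop=True)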
I have a pandas dataframe with one column containing string IDs. I am using idxmax() to return the index of the found IDs, but since the data is over a million rows, the search takes a lot of time. Is there an efficient way to search that can reduce this time? I have not sorted the IDs currently.
Please add a sample of the data you are working on.
The following code snippet will sort the dataframe based on the column and select the first row:
df.sort_values('column_name').iloc[0,0]
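As an illustration of how sorting or indexing speeds up repeated ID lookups, here is a small sketch with made-up data (the column name 'id' and the sample values are assumptions, not the poster's data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['c9', 'a1', 'b7', 'a1'], 'value': [10, 20, 30, 40]})

# option 1: hash-based lookups after a one-off set_index
by_id = df.set_index('id')
by_id.loc['a1']                                # all rows whose id is 'a1'

# option 2: binary search on the sorted ID column
s = df.sort_values('id').reset_index()
pos = np.searchsorted(s['id'].values, 'a1')    # first position of 'a1' in the sorted column
s.loc[pos, 'index']                            # original row label of that match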
I have some problems with pandas' HDFStore being far too slow, and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly float and some integer columns, which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30 min). Each row has a timestamp associated with it. I would like to save some intermediate steps to an HDF file, so that the user can redo a single step iteratively without starting from scratch each time.
Additionally, the user should be able to plot certain columns from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely, the user should get a list of all columns of all dataframes stored in the HDF, then select which columns they would like to see, after which I use matplotlib to present them the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, that's why I'm confused... (Might get far more rows over time, columns should stay fixed)
In the first step I iteratively append rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory, since pandas seems to be rather fast and I/O is slower (the HDF file is on a different physical server, I think)
I use a datetime index and automatically selected float or integer columns
I save the steps with hdf.put('/name', df, format='fixed') since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictionary with one function per column. This is incredibly slow as well.
For plotting, I have to read in the entire dataframe and then get the columns: hdfstore.get('/name').columns
Question
How can I retrieve all columns without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
Does it matter in terms of efficiency if the index is a datetime index? Does there exist a more efficient format in general (e.g. all columns the same, fixed dtype)?
Is there a faster way to aggregate than groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))?
similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!
You may simply load 0 rows of the DataFrame by specifying the same start and stop attributes, and leave all the internal index/column processing to pandas itself:
idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>>                    I        II       III
>>> Alpha Int
>>> A     0    -0.472412  0.436486  0.354592
>>>       1    -0.095776 -0.598585 -0.847514
>>> B     0     0.107897  1.236039 -0.196927
>>>       1    -0.154014  0.821511  0.092220
The following works for both fixed and table formats:
with pd.HDFStore('test.h5') as store:
    store.put('df', df, format='f')
    meta = store.select('df', start=1, stop=1)

meta
meta.index
meta.columns
>>> I II III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>> codes=[[], []],
>>> names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for the other questions:
As long as your data is mostly homogeneous (almost all float columns, as you mentioned) and you are able to store it in a single file without needing to distribute the data across machines, HDF is the first thing to try.
If you need to append/delete/query data, you must use the table format. If you only need to write once and read many times, the fixed format will improve performance.
As for the datetime index, I think the same idea as in the first point applies: if you are able to convert all data into a single type, it should increase your performance.
Nothing beyond what was proposed in the comments to your question comes to mind.
For an HDFStore hdf and a key (from hdf.keys()) you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
Note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 2, 3, 4, 5], 'b': [2, 3, 4, 1, 3, 2, 1]})

with pd.HDFStore(path='store.h5', mode='a') as hdf:
    hdf.put('/DATA/fixed_store', data, format='fixed')
    hdf.put('/DATA/table_store', data, format='table', data_columns=True)
    for key in hdf.keys():
        try:
            # column names of a table store
            print(hdf.get_node('{}/table'.format(key)).description._v_names)
        except AttributeError:
            try:
                # column names of a fixed store
                print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
            except AttributeError:
                # e.g. a dataset created by h5py instead of pandas
                print('unknown node in HDF.')
Columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
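A quick usage sketch, reusing the test.h5 / 'df' store from the example above; the counts come from the stored metadata, so no data is loaded:

with pd.HDFStore('test.h5') as store:
    storer = store.get_storer('df')
    print(storer.ncols, storer.nrows)  # column and row counts read from metadata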
From the docs (fixed format, table format): (important points in bold)
[fixed] These types of stores are **not appendable** once written (though you can simply remove them and rewrite). **Nor are they queryable**; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table **may be appended to** in the same or other sessions. In addition, **delete and query type operations** are supported.
You may try to use epochms (or epochns) (milliseconds or nanoseconds since epoch) in place of datetimes. This way, you are just dealing with integer indices.
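A minimal sketch of that conversion, assuming df has a DatetimeIndex:

df.index = df.index.astype('int64')   # epochns: nanoseconds since the Unix epoch
# for epochms instead: df.index = df.index.astype('int64') // 10**6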
You may have a look at this answer if what you need is grouping by on large data.
One piece of advice: if you have 4 questions to ask, it may be better to ask 4 separate questions on SO. This way, you'll get more (and higher quality) answers, since each one is easier to tackle, and each will deal with a specific topic, making it easier to find for people who are looking for those specific answers.
Hey all,
I have two databases. One has 145,000 rows and approx. 12 columns; the other has around 40,000 rows and 5 columns. I am trying to compare them based on the values of two columns. For example, if in CSV#1 column one says 100-199 and column two says Main St (meaning that this row covers the 100 block of Main Street), how would I go about comparing that with a similar pair of columns in CSV#2? I need to compare every row in CSV#1 to each single row in CSV#2. If there is a match, I need to append the 5 columns of each matching row to the end of the row of CSV#2, so CSV#2's number of columns will grow significantly and it will have repeat entries; it doesn't matter how the columns are ordered. Any advice on how to compare two columns with another two columns in a separate database and then iterate across all rows? I've been using Python and the csv module so far for the rest of the work, but this part of the problem has me stumped.
Thanks in advance
-John
A csv file is NOT a database. A csv file is just rows of text-chunks; a proper database (like PostgreSQL or MySQL or SQL Server or SQLite or many others) gives you proper data types, table joins, indexes, row iteration, proper handling of multiple matches, and many other things which you really don't want to rewrite from scratch.
How is it supposed to know that Address("100-199")==Address("Main Street")? You will have to come up with some sort of knowledge-base which transforms each bit of text into a canonical address or address-range which you can then compare; see Where is a good Address Parser but be aware that it deals with singular addresses (not address ranges).
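For illustration, a toy sketch of such a canonicalisation step (the column layout and the abbreviation handling here are pure assumptions; real data needs a proper parser and far more rules):

def canonical_block(number_range, street):
    # "100-199" -> (100, 199); a single number becomes (n, n)
    low, _, high = number_range.partition("-")
    # crude normalisation of the street name
    street = street.strip().lower().replace("street", "st").replace("st.", "st")
    return int(low), int(high or low), street

canonical_block("100-199", "Main St")    # (100, 199, 'main st')
canonical_block("150", "Main Street")    # (150, 150, 'main st')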
Edit:
Thanks to Sven; if you were using a real database, you could do something like
SELECT
User.firstname, User.lastname, User.account, Order.placed, Order.fulfilled
FROM
User
INNER JOIN Order ON
User.streetnumber=Order.streetnumber
AND User.streetname=Order.streetname
if streetnumber and streetname are exact matches; otherwise you still need to consider point #2 above.
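If you do go the database route from Python, a minimal sketch of the same join using the standard-library sqlite3 together with pandas (the file names and the street_number / street_name column names are assumptions, not the real headers):

import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
pd.read_csv("csv1.csv").to_sql("big", con, index=False)    # ~145,000 rows, 12 columns
pd.read_csv("csv2.csv").to_sql("small", con, index=False)  # ~40,000 rows, 5 columns

matches = pd.read_sql_query(
    """
    SELECT small.*, big.*
    FROM small
    INNER JOIN big
        ON  small.street_number = big.street_number
        AND small.street_name   = big.street_name
    """,
    con,
)
matches.to_csv("csv2_with_matches.csv", index=False)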