If I have a simple table, such as:
index  location  col1   col2  col3  col4
1      a         TRUE   yes      1     4
2      a         FALSE  null     2     6
3      b         TRUE   null     6     3
4      b         TRUE   no       3     4
5      b         FALSE  yes      4     6
6      c         TRUE   no      57     8
7      d         FALSE  null    74     9
To aggregate the duplicate records in location, i.e. the two a's or the three b's, I have been using a basic groupby. This works well for simple tables.
However, is it possible to expand this functionality to allow a rule per column when aggregating? For example, in col1 a TRUE would trump any FALSE value; col3 would sum the values; and col4 would take the average. Is it possible to define these rules per column and then apply them when using groupby?
I have searched online, but not found anything that seems to cover this, however I may be barking up the wrong tree.
Thanks.
Use groupby and agg
funcs = dict(
    col1=dict(Trump=lambda x: x.any()),
    col3='sum',
    col4=dict(Avg='mean')
)
df.groupby('location').agg(funcs)
When using agg on a groupby object with multiple columns, you can pass a dict that defines which functions to apply to which column.
In this outer dictionary (funcs), the keys are existing column names and the values are the functions to apply to those columns.
For example:
agg({'col1': lambda x: x.any(), 'col2': 'sum'})
Says to use any() on col1 and sum on col2. If col1 or col2 did not exist in the dataframe, this would fail.
Further, we don't have to live with the default column names that come from this aggregation. I'll run the mini example above to illustrate.
df.groupby('location').agg({'col1': lambda x: x.any(), 'col3': 'sum'})
There isn't much description of what we've done. We can describe the functions however we'd like if we pass a dictionary as the function instead, with the key being the description and the value being the function. I'll use the same example but expand it with better descriptions.
df.groupby('location').agg(
    {'col1': {'All I need is one True': lambda x: x.any()},
     'col3': {'SUMMATION': 'sum'}})
Armed with that information, hopefully my solution makes perfect sense.
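One caveat: the nested-dict renaming syntax used above was deprecated in pandas 0.20 and removed in pandas 1.0. On a newer pandas, a rough equivalent uses named aggregation; this is only a sketch, with the question's sample data rebuilt by hand and output column names of my own choosing:

import pandas as pd

# Sample data from the question, rebuilt by hand (nulls in col2 as None).
df = pd.DataFrame({
    'location': list('aabbbcd'),
    'col1': [True, False, True, True, False, True, False],
    'col2': ['yes', None, None, 'no', 'yes', 'no', None],
    'col3': [1, 2, 6, 3, 4, 57, 74],
    'col4': [4, 6, 3, 4, 6, 8, 9],
})

# Named aggregation: output_name=(input column, function).
out = df.groupby('location').agg(
    any_true=('col1', 'any'),    # a single True trumps any False
    col3_sum=('col3', 'sum'),
    col4_avg=('col4', 'mean'),
)
print(out)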
Related
I am trying to de-duplicate rows in pandas. I have millions of rows of duplicates, which isn't suitable for what I'm trying to do.
From this:
col1 col2
0 1 23
1 1 47
2 1 58
3 1 9
4 1 4
I want to get this:
col1 col2
0 1 [23, 47, 58, 9, 4]
I have managed to do it manually by writing individual scripts for each spreadsheet, but it would really be great to have a more generalized way of doing it.
So far I've tried:
def remove_duplicates(self, df):
    ids = df[self.key_field].unique()
    numdicts = []
    for i in ids:
        instdict = {self.key_field: i}
        for col in self.deduplicate_fields:
            xf = df.loc[df[self.key_field] == i]
            instdict[col] = str(list(xf[col]))
        numdicts.append(instdict)
    for n in numdicts:
        print(pd.DataFrame(data=n, index=self.key_field))
    return df
But unbelievably, this returns the same thing I started with.
The only way I've managed it so far is to manually create a list for each column, loop through the unique index keys from the dataframe, add all of the duplicates to the lists, then zip the lists together and create a dataframe from them.
However, this doesn't seem to work when there is an unknown number of columns that need to be de-duplicated.
Any better way of doing this would be appreciated!
Thanks in advance!
Is this what you are looking for when you need one column only:
df.groupby('col1')['col2'].apply(lambda x: list(x)).reset_index()
To do this for all other columns at once, use agg:
df.groupby('col1').agg(lambda x: list(x)).reset_index()
With agg you can also specify which columns to use:
df.groupby('col1')[['col2', 'col3']].agg(lambda x: list(x)).reset_index()
You can try the following:
df.groupby('col1').agg(lambda x: list(x))
For multiple columns it should look like this instead to avoid errors:
df.groupby('col1')[['col2','col3']].agg(lambda x: list(x)).reset_index()
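To tie this back to the question's sample frame, a minimal sketch (the frame is rebuilt by hand here):

import pandas as pd

# The question's example: five rows sharing the same col1 value.
df = pd.DataFrame({'col1': [1, 1, 1, 1, 1],
                   'col2': [23, 47, 58, 9, 4]})

out = df.groupby('col1').agg(lambda x: list(x)).reset_index()
# out now has one row per col1 value, with col2 collapsed to [23, 47, 58, 9, 4]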
I have a use case where I need to validate each row in a df and mark whether it is correct or not. The validation rules are in another df.
Main
col1 col2
0 1 take me home
1 2 country roads
2 2 country roads take
3 4 me home
Rules
col3 col4
0 1 take
1 2 home
2 3 country
3 4 om
4 2 take
A row in Main is marked as Pass if the following condition holds for any row in Rules:
col1 == col3 and col4 is a substring of col2
Main
col1 col2 result
0 1 take me home Pass
1 2 country roads Fail
2 2 country roads take Pass
3 4 me home Pass
My initial approach was to parse the Rules df, build a function out of it dynamically, and then run:
def action_function(row) -> object:
    if self.combined_filter()(row):  # combined_filter() is the lambda equivalent of the Rules df
        return success_action(row)   # mark as Pass
    return fail_action(row)          # mark as Fail

Main["result"] = self.df.apply(action_function, axis=1)
This turned out to be very slow because apply is not vectorized. The Main df has about 3 million rows and the Rules df has around 500 entries; it takes around 3 hours.
I am trying to use pandas merge for this, but substring matching is not supported by the merge operation, and I cannot split the words by space or anything similar.
This will be used as part of a system, so I cannot hardcode anything; I need to read the dfs from Excel every time the system starts.
Can you please suggest an approach for this?
Merge and then apply the condition using np.where, i.e.
import numpy as np

temp = main.merge(rules, left_on='col1', right_on='col3')
temp['results'] = temp.apply(lambda x: np.where(x['col4'] in x['col2'], 'Pass', 'Fail'), axis=1)
no_dupe_df = temp.drop_duplicates('col2', keep='last').drop(['col3', 'col4'], axis=1)
col1 col2 results
0 1 take me home Pass
2 2 country roads Fail
4 2 country roads take Pass
5 4 me home Pass
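Note that keep='last' happens to give the expected result here because a passing rule sorts last within each group; it is not guaranteed in general. A more robust variant, sketched with the sample frames rebuilt by hand, collapses the merged rows back to one result per Main row with groupby and any():

import pandas as pd

main = pd.DataFrame({'col1': [1, 2, 2, 4],
                     'col2': ['take me home', 'country roads',
                              'country roads take', 'me home']})
rules = pd.DataFrame({'col3': [1, 2, 3, 4, 2],
                      'col4': ['take', 'home', 'country', 'om', 'take']})

# Keep the original row id so several candidate rules collapse back to one row.
temp = main.reset_index().merge(rules, left_on='col1', right_on='col3', how='left')

# A rule matches when col4 is a substring of col2; rows with no candidate rule fail.
temp['match'] = [isinstance(sub, str) and sub in text
                 for text, sub in zip(temp['col2'], temp['col4'])]

# Pass if any candidate rule matched; the grouped index aligns back with main.
main['result'] = (temp.groupby('index')['match'].any()
                      .map({True: 'Pass', False: 'Fail'}))
print(main)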
Is it possible to reorder pandas.DataFrame rows (based on an index) inplace?
I am looking for something like this:
df.reorder(idx, axis=0, inplace=True)
Where idx is not equal to df.index but is of the same type; it contains the same elements, just in a different order. The output is df reordered according to the new idx.
I have not found anything in the documentation, and I have not managed to do it with reindex_axis. What made me hope it was possible is this:
A new object is produced unless the new index is equivalent to the
current one and copy=False
I might have misunderstood what equivalent index means in this context.
Try using the reindex function (note that this is not inplace):
>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['test', 'hi', 'hello']})
>>> df
   col1   col2
0     1   test
1     2     hi
2     3  hello
>>> df = df.reindex([2, 0, 1])
>>> df
   col1   col2
2     3  hello
0     1   test
1     2     hi
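Since reindex returns a new object, reassigning to the same name is usually as close to in-place as you get. A rough equivalent, assuming idx is a permutation of the existing index, is label-based selection:

idx = [2, 0, 1]    # a permutation of df.index
df = df.loc[idx]   # same row order as df.reindex(idx)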
I basically have a dataset that looks as follows
Col1 Col2 Col3 Count
A B 1 50
A B 1 50
A C 20 1
A D 17 2
A E 5 70
A E 15 20
Suppose it is called data. I basically do data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum(), which should give me this:
Col1 Col2 Col3 Count
A B 1 100
A C 20 1
A D 17 2
A E 5 70
A E 15 20
However, this returns an empty dataset, which has the columns I want but no rows. The only caveat is that the by parameter is computed dynamically rather than fixed (that's because the columns might change, although Count will always be there).
Any ideas on why this could be failing, and how to fix it?
EDIT: Further searching revealed that pandas' groupby drops rows that have NULL in any of the grouping columns. This is a problem for me because every single column might be NULL. Hence, the actual question is: is there any reasonable way to deal with NULLs and still use groupby?
I would love to be corrected here, but I'm not sure there is a clean way to handle missing data. As you noted, pandas will simply exclude rows from groupby whose grouping keys contain NaN values.
You could fill the NaN values with something beyond the range of your data:
data = pd.read_csv("c:/Users/simon/Desktop/data.csv")
data.fillna(-999, inplace=True)
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum()
This is messy because it won't add those values to the correct groups for the summation, but there's no real way to group by something that's missing.
Another method might be to fill each column separately with a placeholder value that is appropriate for that variable.
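If you are on pandas 1.1 or later, groupby also accepts dropna=False, which keeps NaN keys as their own groups instead of silently dropping those rows; a minimal sketch, reusing the column names above:

# Requires pandas >= 1.1, where groupby gained the dropna parameter.
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False,
                   sort=False, dropna=False).sum()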
Is there a more efficient way to use pandas groupby (or a pandas.core.groupby.DataFrameGroupBy object) to create a unique list, series or dataframe of the unique combinations of 2 of N columns? E.g., if I have the columns Date, Name and Item Purchased, and I just want the unique Name and Date combinations, this works fine:
y = x.groupby(['Date','Name']).count()
y = y.reset_index()[['Date', 'Name']]
but I feel like there should be a cleaner way using
y = x.groupby(['Date','Name'])
but y.index gives me an error, although y.keys works. This leads me to the more general question: what are pandas.core.groupby.DataFrameGroupBy objects actually convenient for?
Thanks!
You don't need to use -- and in fact shouldn't use -- groupby here. You could use drop_duplicates to get unique rows instead:
x.drop_duplicates(['Date','Name'])
Demo:
In [156]: x = pd.DataFrame({'Date':[0,1,2]*2, 'Name':list('ABC')*2})
In [158]: x
Out[158]:
Date Name
0 0 A
1 1 B
2 2 C
3 0 A
4 1 B
5 2 C
In [160]: x.drop_duplicates(['Date','Name'])
Out[160]:
Date Name
0 0 A
1 1 B
2 2 C
You shouldn't use groupby here because x.groupby(['Date','Name']).count() performs a count of the number of elements in each group, but that count is never used -- it's a wasted computation. It also raises an AttributeError if x has only the Date and Name columns. And drop_duplicates is much, much faster for this purpose.
Use groupby when you want to perform some operation on each group, such as counting the number of elements in each group, or computing some statistic (e.g. a sum or mean, etc.) per group.
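For instance, if per-group counts really were the goal, a minimal sketch reusing the demo frame x above:

# One row per (Date, Name) combination, with the number of occurrences of each.
counts = x.groupby(['Date', 'Name']).size().reset_index(name='count')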