I have data that looks like
Name,Report_ID,Amount,Flag,Actions
Fizz,123,5,,A
Fizz,123,10,Y,A
Buzz,456,10,,B
Buzz,456,40,,C
Buzz,456,70,,D
Bazz,678,100,Y,F
From these individual operations, I'd like to create a new dataframe that captures various summary statistics per Report_ID, mostly sums, counts of items, and counts of unique entries. I'd like the output dataframe to look like the following:
Report_ID,Number of Flags,Number of Entries, Total,Unique Actions
123,1,2,15,1
456,0,3,120,3
678,1,1,100,1
I've tried using groupby, but I cannot merge all of the individual groupby objects back together correctly. So far I've tried
totals = raw_data.groupby('Report_ID')['Amount'].sum()
event_count = raw_data.groupby('Report_ID').size()
num_actions = raw_data.groupby('Report_ID').Actions.nunique()
output = pd.concat([totals,event_count,num_actions])
When I try this I get TypeError: cannot concatenate a non-NDFrame object. Any help would be appreciated!
You can use agg on the groupby
f = dict(Flag=['count', 'size'], Amount='sum', Actions='nunique')
df.groupby('Report_ID').agg(f)
           Flag      Amount Actions
          count size    sum nunique
Report_ID
123           1    2     15       1
456           0    3    120       3
678           1    1    100       1
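If flat column names like the ones in the question are wanted, named aggregation is one option. The following is a minimal sketch, assuming the frame is called raw_data as in the question and that empty Flag cells are read in as missing values; the dict-unpacking form is used only because the target names contain spaces.

```python
import pandas as pd

# Sample data from the question; empty Flag cells are assumed to be missing values
raw_data = pd.DataFrame({
    "Name": ["Fizz", "Fizz", "Buzz", "Buzz", "Buzz", "Bazz"],
    "Report_ID": [123, 123, 456, 456, 456, 678],
    "Amount": [5, 10, 10, 40, 70, 100],
    "Flag": [None, "Y", None, None, None, "Y"],
    "Actions": ["A", "A", "B", "C", "D", "F"],
})

# Named aggregation: output column name -> (input column, aggregation)
summary = raw_data.groupby("Report_ID").agg(
    **{
        "Number of Flags": ("Flag", "count"),      # count ignores missing values
        "Number of Entries": ("Flag", "size"),     # size counts every row
        "Total": ("Amount", "sum"),
        "Unique Actions": ("Actions", "nunique"),
    }
).reset_index()

print(summary)
```

count and size differ only in how they treat missing values, which is why the same Flag column can supply both the flag count and the row count.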
You just need to specify axis=1 when concatenating:
event_count.name = 'Event Count' # Name the Series, as you did not group on one.
>>> pd.concat([totals, event_count, num_actions], axis=1)
Amount Event Count Actions
Report_ID
123 15 2 1
456 120 3 3
678 100 1 1
I have a dataframe which I want to group, filter columns by regex, and then sum.
My code looks like this:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,2,2,3,3],
'Invasive' : [12,1,1,0,1,0],
'invasive': [1,4,5,3,4,6],
'Wild':[4,7,1,0,0,0],
'wild':[0,0,9,8,3,2],
'Crop':[0,0,0,0,0,0],
'Crop_2':[2,3,2,2,1,2]})
df.groupby(['ID']).filter(regex='(Invasive)|(invasive)|(Wild)|(wild)').sum()
The error message I get is:
DataFrameGroupBy.filter() missing 1 required positional argument: 'func'
I get the same error message if groupby comes after filter.
Why does this happen? Where do I input the func argument?
EDIT:
My Expected output is one column that has summed across the filtered columns and is grouped by ID. E.g.:
ID Output
0 1 29
1 2 27
2 3 16
What you are trying doesn't work because groupby.filter filters rows (it keeps or drops whole groups) and requires a function; it should not be confused with DataFrame.filter, which selects columns.
You likely want to filter the columns, then to aggregate:
df.filter(regex='(?i)(Invasive|Wild)').groupby(df['ID']).sum()
NB. I replaced (Invasive)|(invasive)|(Wild)|(wild) with (?i)(Invasive|Wild), which means 'Invasive OR Wild', regardless of case.
Output:
Invasive invasive Wild wild
ID
1 13 5 11 0
2 1 8 1 17
3 1 10 0 5
edit:
the output that you show needs a further summation per row:
out = (df.filter(regex='(?i)(Invasive|Wild)')
.groupby(df['ID']).sum()
.sum(axis=1)
.reset_index(name='Output')
)
# or with summation before:
out = (df.filter(regex='(?i)(Invasive|Wild)')
.sum(axis=1)
.groupby(df['ID']).sum()
.reset_index(name='Output')
)
Output:
ID Output
0 1 29
1 2 27
2 3 16
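To illustrate the distinction between the two filter methods mentioned at the start of this answer, here is a small sketch using the question's data. The condition passed to groupby.filter (keep groups whose Wild total is positive) is a hypothetical example chosen only to show where the func argument goes.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'Invasive': [12, 1, 1, 0, 1, 0],
                   'invasive': [1, 4, 5, 3, 4, 6],
                   'Wild': [4, 7, 1, 0, 0, 0],
                   'wild': [0, 0, 9, 8, 3, 2],
                   'Crop': [0, 0, 0, 0, 0, 0],
                   'Crop_2': [2, 3, 2, 2, 1, 2]})

# DataFrame.filter selects COLUMNS by label (here with a case-insensitive regex)
cols = df.filter(regex='(?i)(Invasive|Wild)')

# groupby(...).filter takes a function and keeps or drops whole GROUPS of rows;
# the condition below is a hypothetical example
rows = df.groupby('ID').filter(lambda g: g['Wild'].sum() > 0)

print(cols.columns.tolist())
print(rows['ID'].tolist())
```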
I have the following dataframe in Pandas:
name  value  in  out
A     50     1   0
A     -20    0   1
B     150    1   0
C     10     1   0
D     500    1   0
D     -250   0   1
E     800    1   0
There are at most 2 observations for each name: one for in and one for out.
If a name has only an in record, there is only one observation for it.
You can create this dataset with this code:
data = {
'name': ['A','A','B','C','D','D','E'],
'values': [50,-20,150,10,500,-250,800],
'in': [1,0,1,1,1,0,1],
'out': [0,1,0,0,0,1,0]
}
df = pd.DataFrame.from_dict(data)
I want to sum the value column for each name, but only if the name has both an in and an out record. In other words, only when a unique name has exactly 2 rows.
The result should look like this:
name  value
A     30
D     250
If I run the following code, I get all the results without any filtering based on in and out.
df.groupby('name').sum()
name  value
A     30
B     150
C     10
D     250
E     800
How can I add the aforementioned filtering based on columns?
Maybe you can try something with groupby, agg, and query (like below):
df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]
Output:
values
name
A 30
D 250
You could also use .query('name==2') above if you prefer, but since each name occurs at most twice, .query('name>1') returns the same result.
IIUC, you could filter before aggregation:
# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(axis=1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()
Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):
(df.groupby('name', as_index=False).sum()
.loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)
output:
name values
0 A 30
1 D 250
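Since the question also describes the condition as "exactly 2 rows per name", another possible variant is to drop the incomplete groups with groupby.filter before summing. This is a sketch under that assumption, not a replacement for the answers above; the name paired is illustrative.

```python
import pandas as pd

data = {
    'name': ['A', 'A', 'B', 'C', 'D', 'D', 'E'],
    'values': [50, -20, 150, 10, 500, -250, 800],
    'in': [1, 0, 1, 1, 1, 0, 1],
    'out': [0, 1, 0, 0, 0, 1, 0],
}
df = pd.DataFrame(data)

# Keep only the rows of names that have exactly two records (one in, one out)
paired = df.groupby('name').filter(lambda g: len(g) == 2)

# Sum the values of the remaining names
out = paired.groupby('name', as_index=False)['values'].sum()
print(out)
```

groupby.filter calls the function once per group, so on frames with very many groups the transform-based mask shown above may be faster.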
I am currently working on a dataset which has information on total sales for each product id and product sub-category. For example, suppose there are three products 1, 2 and 3, and three product sub-categories A, B and C; each product may include one, two, or all of these sub-categories. For instance, I have included a sample table below:
Now, I would like to add a column 'Flag' that assigns 1 or 0 to each product id depending on whether that product id contains a record of product sub-category 'C'. If it does contain 'C', assign 1 to the flag column. Otherwise, assign 0. Below is the desired output.
I have not been able to do this in pandas so far. Could you help me out? Thank you so much!
Use str.contains to find the IDs that have a 'C' sub-category, then use apply to flag every row whose ID is in that list.
from io import StringIO
import pandas as pd

txt="""ID,Sub-category,Sales
1,A,100
1,B,101
1,C,102
2,B,100
2,C,101
3,A,102
3,B,100"""
df = pd.read_table(StringIO(txt), sep=',')
#print(df)
# IDs that have at least one row with sub-category 'C'
list_id=list(df[df['Sub-category'].str.contains('C')]['ID'])
# flag every row whose ID is in that list
df['flag']=df['ID'].apply(lambda x: 1 if x in list_id else 0 )
print(df)
output:
ID Sub-category Sales flag
0 1 A 100 1
1 1 B 101 1
2 1 C 102 1
3 2 B 100 1
4 2 C 101 1
5 3 A 102 0
6 3 B 100 0
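Try this:
Flag = []
for i in dataFrame["Product sub-category"]:
    if i == "C":
        Flag.append(1)
    else:
        Flag.append(0)
This gives you a list called "Flag" that you can add to your dataframe. Note that it marks only the rows whose own sub-category is "C", not every row of a product that contains "C".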
You can add a temporary column, isC to check for your condition. Then check for the number of isC's inside every "Product Id" group (with .groupby(...).transform).
check = (
df.assign(isC=lambda df: df["Product Sub-category"] == "C")
.groupby("Product Id").isC.transform("sum")
)
df["Flag"] = (check > 0).astype(int)
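The same group-wise idea can be sketched without the temporary column. The example below is a self-contained illustration using the sample data and column names (ID, Sub-category) from the earlier answer in this thread rather than the original table, which is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3],
    "Sub-category": ["A", "B", "C", "B", "C", "A", "B"],
    "Sales": [100, 101, 102, 100, 101, 102, 100],
})

# True on rows whose own sub-category is 'C'
is_c = df["Sub-category"].eq("C")

# Broadcast "does any row in this ID have 'C'?" back to every row of the ID
df["Flag"] = is_c.groupby(df["ID"]).transform("any").astype(int)
print(df)
```

transform returns a result aligned with the original rows, which is why it can be assigned directly as a new column.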
I have a dataframe with individuals and their household IDs and I would like to create a variable that contains the household size.
I am using Python 3.7. I tried to use the groupby function combined with the size function (I tried count as well). The idea is that, for each observation about an individual, I want to count the number of observations in the dataframe with the same household ID and store it in a new variable.
Consider that each observation has a household ID (hh_id) and that I would like to store the household size in the hh_size variable.
I tried the following:
df['hh_size'] = df.groupby('hh_id').size
I expect hh_size variable to contain for each observation the household size. However, I get a column with only nan.
When I use df.groupby('hh_id').size alone, I get the expected result but I cannot manage to store it in the hh_size variable.
For example:
individual hh_id hh_size
1 1 2
2 1 2
3 2 1
4 3 1
Thanks,
Julien
If I understand correctly, you have to convert the result to a new DataFrame with .to_frame(name='hh_size'), and you may have to reset the index.
import pandas as pd
df = pd.DataFrame({
'individual': [1,1,2,2,3,4],
'hh_id': [1,1,1,1,2,3],
})
sizes = df.groupby(['individual', 'hh_id']).size()
new_df = sizes.to_frame(name='hh_size').reset_index()
print(new_df)
Result:
individual hh_id hh_size
0 1 1 2
1 2 1 2
2 3 2 1
3 4 3 1
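If the goal is to keep the original rows and simply add the column, as described in the question, groupby with transform('size') is a common way to do it. This is a minimal sketch using the question's example values.

```python
import pandas as pd

df = pd.DataFrame({
    'individual': [1, 2, 3, 4],
    'hh_id': [1, 1, 2, 3],
})

# transform('size') returns one value per original row (the size of that row's group),
# so it aligns with df and can be assigned directly
df['hh_size'] = df.groupby('hh_id')['hh_id'].transform('size')
print(df)
```

The attempt in the question assigned the size method itself (no parentheses), and even when called, size() is indexed by hh_id rather than by the original row index, so it would not line up row by row.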
I have two problems. Here is the first:
I have a dataframe df in which the same userid occurs many times, along with a date and some other unimportant columns:
userid date
0 243 2014-04-01
1 234 2014-12-01
2 234 2015-11-01
3 589 2016-07-01
4 589 2016-03-01
I am currently trying to group them by userid, sort the dates in descending order, and keep the first twelve rows per group. My code looks like this:
df = df.groupby(['userid'], group_keys=False).agg(lambda x: x.sort_values(['date'], ascending=False, inplace=False).head(12))
And I get this error:
ValueError: cannot copy sequence with size 6 to array axis with dimension 12
At the moment my aim is to avoid splitting the dataframe into individual ones.
My second problem is more complex:
I am trying to find out whether the sorted dates (per userid group) are consecutive months. That is, if a group has an entry with userid: 234 and date: 2014-04-01, the next entry below it must be userid: 234 and date: 2014-03-01. The day does not matter; only the year and month are important.
Only these 12 consecutive dates should be copied to another dataframe.
A second dataframe df2 contains the same userids, each only once, along with a 'code' column. Here is an example:
userid code
0 433805 1
24 5448 0
48 3434 1
72 34434 1
96 3202 1
120 23766 1
153 39457 0
168 4113 1
172 3435 5
374 34093 1
To summarize: I am trying to check whether there are 12 consecutive months per userid and copy every valid sequence to another dataframe. For this I also have to compare the 'code' from df2.
This is a version of my code:
df['YearMonthDiff'] = df['date'].map(lambda x: 1000*x.year + x.month).diff()
df['id_before'] = df['userid'].shift()
final_df = pd.DataFrame()
for group in df.groupby(['userid'], group_keys=False):
    fi = group[1]
    if (fi['userid'] <> fi['id_before']) & group['YearMonthDiff'].all(-1.0) & df.loc[fi.userid]['code'] != 5:
        final_df.append(group['userid','date', 'consum'])
First I computed an integer from the date and applied diff(). In other posts I saw people shift a column to compare the current row with the previous one. Then I used groupby('userid') to iterate over the individual groups. Now it gets extra ugly: I try to find the beginning of each userid group, check that it contains only consecutive months, and check for the correct 'code'. Finally I append the group to the final dataframe.
One of the biggest problems is comparing a row with the following row. I can iterate over rows with iterrows(), but I cannot compare them without shift(). There exists a calendar function, but I will take a look at that on the weekend. Sorry for the mess, I am new to pandas.
Does anyone have an idea how to solve my problem?
For your first problem, try this:
df.groupby(by='userid').apply(lambda x: x.sort_values(by='date',ascending=False).iloc[[e for e in range(12) if e <len(x)]])
Using groupby and nlargest, we get the index values of those largest dates. Then we use .loc to get just those rows
df.loc[df.groupby('userid').date.nlargest(12).index.get_level_values(1)]
Consider the dataframe df
import numpy as np
import pandas as pd
dates = pd.date_range('2015-08-08', periods=10)
df = pd.DataFrame(dict(
userid=np.arange(2).repeat(4),
date=np.random.choice(dates, 8, False)
))
print(df)
date userid
0 2015-08-12 0 # <-- keep
1 2015-08-09 0
2 2015-08-11 0
3 2015-08-15 0 # <-- keep
4 2015-08-13 1
5 2015-08-10 1
6 2015-08-17 1 # <-- keep
7 2015-08-16 1 # <-- keep
We'll keep the latest 2 dates per user id
df.loc[df.groupby('userid').date.nlargest(2).index.get_level_values(1)]
date userid
0 2015-08-12 0
3 2015-08-15 0
6 2015-08-17 1
7 2015-08-16 1
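An alternative that some may find easier to read is to sort once and then take the first rows of each group with groupby.head. The sketch below uses fixed dates matching the printed example above (instead of the random draw) so that the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime([
        '2015-08-12', '2015-08-09', '2015-08-11', '2015-08-15',
        '2015-08-13', '2015-08-10', '2015-08-17', '2015-08-16',
    ]),
    'userid': [0, 0, 0, 0, 1, 1, 1, 1],
})

# Sort newest first, then keep the first 2 rows of every userid group
latest = df.sort_values('date', ascending=False).groupby('userid').head(2)
print(latest.sort_index())
```

Replacing head(2) with head(12) gives the twelve most recent dates per userid; groups with fewer rows are returned in full.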
In case someone is interested, I solved my second problem as follows:
I cast the date to an int, calculated the difference, and shifted the userid by one row, as in my example. The rest follows a solution I found on Stack Overflow:
gr_ob = df.groupby('userid')
gr_dict = gr_ob.groups
final_df = pd.DataFrame(columns=['userid', 'date', 'consum'])
for group_name in gr_dict.keys():
    new_df = gr_ob.get_group(group_name)
    # `<>` only exists in Python 2; `!=` works in both Python 2 and 3
    if (new_df['userid'].iloc[0] != new_df['id_before'].iloc[0]) & (new_df['YearMonthDiff'].iloc[1:] == -1.0).all() & (len(new_df) == 12):
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        final_df = pd.concat([final_df, new_df[['userid', 'date', 'consum']]])
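One caveat with encoding dates as 1000*year + month is that the difference between December and the following January is not 1, so a run that crosses a year boundary would not be recognised as consecutive. The following is a hedged sketch of a loop-free variant that uses year*12 + month instead; the sample data and the names month_index, step and consecutive_by_user are illustrative, and it assumes the dates are already sorted in descending order within each userid.

```python
import pandas as pd

df = pd.DataFrame({
    "userid": [234, 234, 234, 589, 589, 589],
    "date": pd.to_datetime([
        "2015-02-01", "2015-01-01", "2014-12-01",   # consecutive across a year boundary
        "2016-07-01", "2016-05-01", "2016-04-01",   # June is missing
    ]),
})

# Running month number: consecutive months always differ by exactly 1
month_index = df["date"].dt.year * 12 + df["date"].dt.month

# Difference to the previous row within the same userid (NaN on each group's first row)
step = month_index.groupby(df["userid"]).diff()

# A group qualifies if every step (ignoring the first row) is -1
is_consecutive = step.isna() | step.eq(-1)
consecutive_by_user = is_consecutive.groupby(df["userid"]).all()

# Keep only the rows of users whose months are consecutive
result = df[df["userid"].map(consecutive_by_user)]
print(result)
```

A length requirement such as exactly 12 rows per userid could be added with a groupby size check, and the 'code' condition from df2 would need a merge or map on userid; both are left out of this sketch.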