Cumsum with nan values - pandas - python

I want to write a running count of value changes (a cumulative sum that increases each time a new run of values starts) to a separate column. However, I want to disregard NaN values, so that those rows are skipped and the count continues with the next valid row.
import numpy as np
import pandas as pd
d = {'Item': [np.nan, "Blue", "Blue", np.nan, "Red", "Blue", "Blue", "Red"]}
df = pd.DataFrame(data=d)
df['count'] = df.Item.ne(df.Item.shift()).cumsum()
Intended output:
Item count
0 NaN NaN
1 Blue 1
2 Blue 1
3 NaN NaN
4 Red 2
5 Blue 3
6 Blue 3
7 Red 4

Try:
df['count'] =(df.Item.ne(df.Item.shift()) & df.Item.notna()).cumsum().mask(df.Item.isna())
Or, as suggested by @SeanBean:
df['count'] =df.Item.ne(df.Item.shift()).mask(df.Item.isna()).cumsum()
Output of df:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0

Here's one way (you just need to add the where condition):
df['count'] = df.Item.ne(df.Item.shift()).where(~df.Item.isna()).cumsum()
OUTPUT:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0
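The answers above return floats (1.0, 2.0, ...) because NaN forces a float column. If integer counts like those in the intended output are wanted, one option is pandas' nullable "Int64" dtype. This is a minimal sketch, assuming a pandas version that supports that dtype; the name is_new_run is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Item": [np.nan, "Blue", "Blue", np.nan, "Red", "Blue", "Blue", "Red"]}
)

# True where a non-NaN value differs from the previous row (start of a new run).
is_new_run = df["Item"].ne(df["Item"].shift()) & df["Item"].notna()

# Running count of runs, blanked out on NaN rows, stored as nullable integers.
df["count"] = is_new_run.cumsum().where(df["Item"].notna()).astype("Int64")
print(df)
```

The NaN rows show up as <NA> in the count column, and the remaining rows keep whole-number counts.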

Related

Pandas: Remove values that meet condition

Let's say I have data like this:
df = pd.DataFrame({'category': ["blue","red","blue", "blue","green"], 'val1': [5, 3, 2, 2, 5], 'val2':[1, 3, 2, 2, 5], 'val3': [2, 1, 1, 4, 3]})
print(df)
category val1 val2 val3
0 blue 5 1 2
1 red 3 3 1
2 blue 2 2 1
3 blue 2 2 4
4 green 5 5 3
How do I remove (or replace with, for example, NaN) values that meet a certain condition without removing the entire row or shifting the columns?
Let's say my condition is that I want to remove all values below 3 from the above data, the result would have to look like this:
  category val1 val2 val3
0     blue    5
1      red    3    3
2     blue
3     blue              4
4    green    5    5    3
Use mask:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.iloc[:, 1:] < 3)
print(df)
Output
category val1 val2 val3
0 blue 5.0 NaN NaN
1 red 3.0 3.0 NaN
2 blue NaN NaN NaN
3 blue NaN NaN 4.0
4 green 5.0 5.0 3.0
If you want to set particular value, for example 0, do:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.iloc[:, 1:] < 3, 0)
print(df)
Output
category val1 val2 val3
0 blue 5 0 0
1 red 3 3 0
2 blue 0 0 0
3 blue 0 0 4
4 green 5 5 3
If you just need a few columns, you could do:
df[['val1', 'val2', 'val3']] = df[['val1', 'val2', 'val3']].mask(df[['val1', 'val2', 'val3']] < 3)
print(df)
Output
category val1 val2 val3
0 blue 5.0 NaN NaN
1 red 3.0 3.0 NaN
2 blue NaN NaN NaN
3 blue NaN NaN 4.0
4 green 5.0 5.0 3.0
One approach is to create a mask of the values that don't meet the removal criterion (that is, the values to keep).
mask = df[['val1','val2','val3']] >= 3
You can then create a new df that contains just the non-removed values.
updated_df = df[['val1','val2','val3']][mask]
You need to add back in the unaffected columns.
updated_df['category'] = df['category']
You can also apply applymap or transform to the columns containing integers:
df[df.iloc[:,1:].transform(lambda x: x>=3)].fillna('')
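When the numeric columns are not known by position, they can be picked by dtype instead. The following is only a sketch of that idea, assuming the threshold of 3 from the question; num_cols and result are illustrative names. It uses where, which keeps the values that satisfy the condition (the inverse of mask).

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["blue", "red", "blue", "blue", "green"],
    "val1": [5, 3, 2, 2, 5],
    "val2": [1, 3, 2, 2, 5],
    "val3": [2, 1, 1, 4, 3],
})

# Select the numeric columns by dtype rather than by position or name.
num_cols = df.select_dtypes(include="number").columns

# Keep values >= 3, replace everything else with NaN; 'category' is untouched.
result = df.copy()
result[num_cols] = df[num_cols].where(df[num_cols] >= 3)
print(result)
```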

How to split rows in stacks and perform `select` for every single in Pandas?

I have 100 rows in my CSV file. I need to split them into stacks of 10 rows and run some queries on every single stack, because some rows have a value in a specific column and some do not.
How can I do it in Pandas?
1 blue 1 NaN
2 yellow 0 NaN
3 yellow 1 NaN
4 blue 1 NaN
5 blue 1 NaN
6 blue 0 NaN
7 yellow 1 NaN
8 yellow 1 NaN
9 yellow 1 NaN
10 blue 0 NaN
11 yellow NaN 1
12 blue NaN 1
13 yellow NaN 1
14 yellow NaN 0
15 blue NaN 1
16 yellow NaN 1
17 yellow NaN 0
18 blue NaN 1
19 blue NaN 0
20 blue NaN 1
I used PsychoPy for a neuroscience task, and the task has 10 trials. Because of that, PsychoPy stores the RTs (reaction times) in 10 different columns, so I need to access them to evaluate, for example, blue circles in the first trial whose RT is 1, or yellow circles in the second trial whose RT is 0.
I would just reorganize your data for easier access. Given that all but one of the reaction time columns are null in each row, just sum them to get the non-null value in a single reaction_time column. Then assign a new trial column via 1 + df.index // trials.
Now you can use .loc to access the reaction_time data by trial and color.
# Sample data.
import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame(
{'color': ['blue', 'yellow', 'yellow', 'blue', 'blue', 'blue', 'yellow', 'yellow', 'yellow', 'blue', 'yellow', 'blue', 'yellow', 'yellow', 'blue', 'yellow', 'yellow', 'blue', 'blue', 'blue'],
'num_1': [1, 0, 1, 1, 1, 0, 1, 1, 1, 0] + [nan] * 10,
'num_2': [nan] * 10 + [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]}
)
# Summary data result.
trials = 10
reaction_time = df.iloc[:, 1:].sum(axis=1)
df2 = df[['color']].assign(
reaction_time=reaction_time,
trial=1 + df.index // trials,
)
>>> df2
color reaction_time trial
0 blue 1.0 1
1 yellow 0.0 1
2 yellow 1.0 1
3 blue 1.0 1
4 blue 1.0 1
5 blue 0.0 1
6 yellow 1.0 1
7 yellow 1.0 1
8 yellow 1.0 1
9 blue 0.0 1
10 yellow 1.0 2
11 blue 1.0 2
12 yellow 1.0 2
13 yellow 0.0 2
14 blue 1.0 2
15 yellow 1.0 2
16 yellow 0.0 2
17 blue 1.0 2
18 blue 0.0 2
19 blue 1.0 2
# Query data.
>>> df2.loc[df2.trial.eq(2) & df2.color.eq('yellow')]
color reaction_time trial
10 yellow 1.0 2
12 yellow 1.0 2
13 yellow 0.0 2
15 yellow 1.0 2
16 yellow 0.0 2
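Alternatively, split the DataFrame into batches of 10 rows and run your query on each batch (iloc is used because loc slices include the end label and would return 11 rows):
batches = [df.iloc[i:i+10] for i in range(0, df.shape[0], 10)]
for batch in batches:
    select_function(batch)  # select_function stands for your own query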
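The two ideas above (one reaction_time column plus per-trial batches) can be combined with groupby. This is a rough sketch using the sample data from the first answer; the summary dict and the "count yellow rows with RT 1" query are only illustrative. Passing min_count=1 to sum is an optional addition: it makes a row with no reaction time at all come out as NaN rather than 0.0, which would otherwise look like a real RT of 0.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "color": ["blue", "yellow", "yellow", "blue", "blue", "blue", "yellow",
              "yellow", "yellow", "blue", "yellow", "blue", "yellow", "yellow",
              "blue", "yellow", "yellow", "blue", "blue", "blue"],
    "num_1": [1, 0, 1, 1, 1, 0, 1, 1, 1, 0] + [nan] * 10,
    "num_2": [nan] * 10 + [1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})

trials = 10  # rows per trial

# Collapse the per-trial RT columns; all-NaN rows stay NaN thanks to min_count=1.
reaction_time = df[["num_1", "num_2"]].sum(axis=1, min_count=1)

# Trial number derived from the row position: rows 0-9 -> 1, rows 10-19 -> 2.
trial = np.arange(len(df)) // trials + 1

# Run one query per batch: here, count yellow rows whose reaction time is 1.
summary = {}
for trial_no, batch in df.assign(reaction_time=reaction_time).groupby(trial):
    yellow = batch.loc[batch["color"].eq("yellow"), "reaction_time"]
    summary[int(trial_no)] = int(yellow.eq(1).sum())
print(summary)
```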

How to sum or count groups of multiple columns in pandas

I'm trying to group several groups of columns to count or sum the rows in a pandas DataFrame.
I've checked many questions already, and the most similar one I found is "Groupby sum and count on multiple columns in python", but as far as I understand it, I would have to go through many steps to reach my goal. I was also looking at this link.
As an example, I have the dataframe below:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(5, 7)), columns=["grey2","red1","blue1","red2","red3","blue2","grey1"])
grey2 red1 blue1 red2 red3 blue2 grey1
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
I want to group here, all the columns by colour, for example, and what I would expect is:
If I sum the numbers,
blue 15
grey 22
red 34
If I count ( x > 0 ) then I will get,
blue 7
grey 10
red 13
This is what I have achieved so far. I would now have to sum and then create a DataFrame with the results, but if I have 100 groups, this would be very time-consuming.
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='sum', margins=True)
red1 red2 red3
0 3 2 4
1 2 4 0
2 1 1 1
3 4 4 1
4 4 0 3
All 14 11 9
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='count', margins=True)
But this also counts the zeros:
red1 red2 red3
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
All 5 5 5
I'm not sure how to alter the function to get my results, and I've already spent hours on this. Hopefully you can help.
NOTE:
I only use colours in this example to simplify the case, but I could have many columns, and they are called col001 to col300, etc.
So, the groups could be:
blue = col131, col254, col005
red = col023, col190, col053
and so on.....
You can use pd.wide_to_long:
data= pd.wide_to_long(df.reset_index(), stubnames=['grey','red','blue'],
i='index',
j='group',
sep=''
)
Output:
# data
grey red blue
index group
0 1 2.0 3 0.0
2 4.0 2 0.0
3 NaN 4 NaN
1 1 1.0 2 0.0
2 4.0 4 3.0
3 NaN 0 NaN
2 1 1.0 1 3.0
2 1.0 1 3.0
3 NaN 1 NaN
3 1 1.0 4 1.0
2 4.0 4 1.0
3 NaN 1 NaN
4 1 1.0 4 1.0
2 3.0 0 3.0
3 NaN 3 NaN
And:
data.sum()
# grey 22.0
# red 34.0
# blue 15.0
# dtype: float64
data.gt(0).sum()
# grey 10
# red 13
# blue 7
# dtype: int64
Update: wide_to_long is just a convenient shortcut for merge and rename. So if you have a dictionary {cat: [col_list]}, you could resort to that:
groups = {'blue' : ['col131', 'col254', 'col005'],
'red' : ['col023', 'col190', 'col053']}
# create the inverse dictionary (column name -> group) for mapping;
# the lists themselves cannot be dict keys, so unpack them column by column
inv_group = {col: cat for cat, cols in groups.items() for col in cols}
data = df.melt()
# map the original columns to group
data['group'] = data['variable'].map(inv_group)
# from now on, it's similar to other answers
# sum
data.groupby('group')['value'].sum()
# count
data['value'].gt(0).groupby(data['group']).sum()
The complication here is that you want to collapse both by rows and columns, which is generally difficult to do at the same time. We can melt to go from your wide format to a longer format, which then reduces the problem to a single groupby
# Get rid of the numbers + reshape
df.columns = pd.Index(df.columns.str.rstrip('0123456789'), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#blue 15
#grey 22
#red 34
df.value.gt(0).groupby(df.color).sum()
#color
#blue 7.0
#grey 10.0
#red 13.0
#Name: value, dtype: float64
With names that are less simple to group, we'd need to have the mapping somewhere, the steps are very similar:
# Unnecessary in this case, but more general
d = {'grey1': 'color_1', 'grey2': 'color_1',
'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
'blue1': 'color_3', 'blue2': 'color_3'}
df.columns = pd.Index(df.columns.map(d), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#color_1 22
#color_2 34
#color_3 15
Use:
df.groupby(df.columns.str.replace(r'\d+', '', regex=True),axis=1).sum().sum()
Output:
blue 15
grey 22
red 34
dtype: int64
This works regardless of the number of digits contained in the column names:
df=df.add_suffix('2222')
print(df)
grey22222 red12222 blue12222 red22222 red32222 blue22222 grey12222
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
df.groupby(df.columns.str.replace(r'\d+', '', regex=True),axis=1).sum().sum()
blue 15
grey 22
red 34
dtype: int64
You could also do something like this for the general case:
colors = {'blue':['blue1','blue2'], 'red':['red1','red2','red3'], 'grey':['grey1','grey2']}
orig_columns = df.columns
df.columns = [key for col in df.columns for key in colors.keys() if col in colors[key]]
print(df.groupby(level=0,axis=1).sum().sum())
df.columns = orig_columns
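Newer pandas releases deprecate groupby(..., axis=1), so the column-wise groupby answers above may warn or fail there. One possible alternative is to transpose and group the rows with a column-to-group mapping. This is only a sketch, assuming the fixed example data from the question; col_to_group is an illustrative name and could equally be built from an explicit dict such as {'col131': 'blue', ...}.

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 3, 0, 2, 4, 0, 2],
     [4, 2, 0, 4, 0, 3, 1],
     [1, 1, 3, 1, 1, 3, 1],
     [4, 4, 1, 4, 1, 1, 1],
     [3, 4, 1, 0, 3, 3, 1]],
    columns=["grey2", "red1", "blue1", "red2", "red3", "blue2", "grey1"],
)

# Map every column to its group; here the group is the name without digits.
col_to_group = {col: col.rstrip("0123456789") for col in df.columns}

# Transpose so the columns become rows, group those rows, then add everything up.
sums = df.T.groupby(col_to_group).sum().sum(axis=1)

# Same idea on a boolean frame to count the values greater than zero.
counts = df.gt(0).T.groupby(col_to_group).sum().sum(axis=1)

print(sums)
print(counts)
```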

How to expand a df by a list of different dict?

My problem is linked to my other question here (How to expand a df by different dict as columns?):
I have a df with A LIST (!) of different dicts as entries in a column, in my case column "information". I would like to expand the df by all possible dict.keys() within that list, something like that:
df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5]),
'name': pd.Series(['banana',
'apple',
'orange',
'strawberry' ,
'toast']),
'information': pd.Series([[{'shape':'curve','color':'yellow'}],
[{'color':'red'},{'color':'green'}],
[{'shape':'round'}],
[{'amount':500}],
np.nan]),
'cost': pd.Series([1,2,2,10,4])})
id name information cost
0 1 banana [{'shape': 'curve', 'color': 'yellow'}] 1
1 2 apple [{'color': 'red'}, {'color': 'green'}] 2
2 3 orange [{'shape': 'round'}] 2
3 4 strawberry [{'amount': 500}] 10
4 5 toast NaN 4
Should look like this:
id name shape color amount cost
0 1 banana curve yellow NaN 1
1 2 apple NaN red NaN 2
2 2 apple NaN green NaN 2
3 3 orange round NaN NaN 2
4 4 strawberry NaN NaN 500.0 10
5 5 toast NaN NaN NaN 4
(Note the additional row at index 2)
We can use explode, available from pandas 0.25.0:
df1=df.explode('information').reset_index(drop=True)
df1=pd.concat([df1,pd.DataFrame(df1.information.dropna().tolist())],axis=1)
Thank you for your answer, WeNYoBen, but I found something strange:
If you consider the following df:
df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5]),
'name': pd.Series(['banana',
'apple',
'orange',
'strawberry' ,
'toast']),
'information': pd.Series([[{'shape':'curve','color':'yellow'}],
[{'color':'red'},{'color':'green'}],
np.nan,
[{'shape':'round'}],
[{'amount':500}]]),
'cost': pd.Series([1,2,2,10,4])})
id name information cost
0 1 banana [{'shape': 'curve', 'color': 'yellow'}] 1
1 2 apple [{'color': 'red'}, {'color': 'green'}] 2
2 3 orange NaN 2
3 4 strawberry [{'shape': 'round'}] 10
4 5 toast [{'amount': 500}] 4
(we shifted the np.nan to "orange")
You will get the following result:
id name cost shape color amount
0 1 banana 1 curve yellow NaN
1 2 apple 2 NaN red NaN
2 2 apple 2 NaN green NaN
3 3 orange 2 round NaN NaN
4 4 strawberry 10 NaN NaN 500.0
5 5 toast 4 NaN NaN NaN
Your answer skips the np.nan of "orange" and fills "toast" with np.nan.
How can I avoid this?
I found a workaround:
a = {'shape':np.nan}
df['information'] = df['information'].apply(lambda d: d if isinstance(d, list) else [a])
id name information cost
0 1 banana [{'shape': 'curve', 'color': 'yellow'}] 1
1 2 apple [{'color': 'red'}, {'color': 'green'}] 2
2 3 orange [{'shape': nan}] 2
3 4 strawberry [{'shape': 'round'}] 10
4 5 toast [{'amount': 500}] 4
df1=df.explode('information').reset_index(drop=True)
df1=pd.concat([df1,pd.DataFrame(df1.information.dropna().tolist())],axis=1)
df1 = df1.drop(columns='information')
id name cost shape color amount
0 1 banana 1 curve yellow NaN
1 2 apple 2 NaN red NaN
2 2 apple 2 NaN green NaN
3 3 orange 2 NaN NaN NaN
4 4 strawberry 10 round NaN NaN
5 5 toast 4 NaN NaN 500.0
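The shift seen above appears to come from dropna().tolist() discarding the row positions, so the new columns are renumbered from 0 before the concat. A possible alternative to the placeholder-dict workaround is to keep the original index when building the expanded frame and then join on it. This is a sketch under that assumption; exploded, info, expanded and result are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["banana", "apple", "orange", "strawberry", "toast"],
    "information": [
        [{"shape": "curve", "color": "yellow"}],
        [{"color": "red"}, {"color": "green"}],
        np.nan,
        [{"shape": "round"}],
        [{"amount": 500}],
    ],
    "cost": [1, 2, 2, 10, 4],
})

# One row per dict; rows without a list keep NaN in 'information'.
exploded = df.explode("information").reset_index(drop=True)

# Expand only the real dicts, but keep their row labels for alignment.
info = exploded["information"].dropna()
expanded = pd.DataFrame(info.tolist(), index=info.index)

# join aligns on the index, so the NaN row simply gets NaN in the new columns.
result = exploded.drop(columns="information").join(expanded)
print(result)
```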

merging dataframes on the same index

I can't find the answer to this in here.
I have two dataframes:
index, name, color, day
0 Nan Nan Nan
1 b red thu
2 Nan Nan Nan
3 d green mon
index, name, color, week
0 c blue 1
1 Nan Nan Nan
2 t yellow 4
3 Nan Nan Nan
And I'd like the result to be one dataframe:
index, name, color, day, week
0 c Blue Nan 1
1 b red thu Nan
2 t yellow Nan 4
3 d green mon Nan
Is there a way to merge the dataframes on their indexes, while adding new columns?
You can use DataFrame.combine_first:
df = df1.combine_first(df2)
print (df)
color day name week
0 blue NaN c 1.0
1 red thu b NaN
2 yellow NaN t 4.0
3 green mon d NaN
For a custom column order, build the column names with numpy.concatenate and pd.unique, then apply reindex (reindex_axis has been removed from recent pandas versions):
cols = pd.unique(np.concatenate([df1.columns, df2.columns]))
df = df1.combine_first(df2).reindex(columns=cols)
print (df)
name color day week
0 c blue NaN 1.0
1 b red thu NaN
2 t yellow NaN 4.0
3 d green mon NaN
EDIT:
Use rename columns:
df = df1.combine_first(df2.rename(columns={'week':'day'}))
print (df)
name color day
0 c blue 1
1 b red thu
2 t yellow 4
3 d green mon
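For reference, a self-contained version of the combine_first approach might look like the sketch below. The two frames are reconstructed from the tables in the question, so their exact dtypes are an assumption, and dict.fromkeys is used here simply as one way to get the column names in first-seen order.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "name": [np.nan, "b", np.nan, "d"],
    "color": [np.nan, "red", np.nan, "green"],
    "day": [np.nan, "thu", np.nan, "mon"],
})
df2 = pd.DataFrame({
    "name": ["c", np.nan, "t", np.nan],
    "color": ["blue", np.nan, "yellow", np.nan],
    "week": [1, np.nan, 4, np.nan],
})

# Column names in order of first appearance across both frames.
cols = list(dict.fromkeys([*df1.columns, *df2.columns]))

# Fill the gaps in df1 with values from df2, then restore the column order.
merged = df1.combine_first(df2).reindex(columns=cols)
print(merged)
```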
