Replacing null value in Python with next available value by group - python

df = pd.DataFrame({
    'group': [1,1,1,2,2,2],
    'value': [None,None,'A',None,'B',None]
})
I would like to replace missing values with the next non-missing value within each group. The desired result is:
df = pd.DataFrame({
    'group': [1,1,1,2,2,2],
    'value': ['A','A','A','B','B',None]
})

You can try this (backfill is an alias of bfill; newer pandas versions deprecate the alias):
df['value'] = df.groupby(by=['group'])['value'].backfill()
print(df)
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN

The easiest way, as @Erfan mentioned, is to use the backfill method DataFrameGroupBy.bfill.
Solution 1)
>>> df['value'] = df.groupby('group')['value'].bfill()
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 2)
DataFrameGroupBy.bfill with the limit parameter works perfectly here as well.
The pandas documentation's section on limiting the amount of filling is worth reading: if we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword.
>>> df['value'] = df.groupby(['group']).bfill(limit=2)
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 3)
With groupby() we can also use fillna() with method='bfill' and the limit parameter.
>>> df.groupby('group').fillna(method='bfill',limit=2)
  value
0     A
1     A
2     A
3     B
4     B
5  None
Solution 4)
Another way is to use the DataFrame.transform function to fill the value column after the groupby, calling DataFrameGroupBy.bfill inside it.
>>> df['value'] = df.groupby('group')['value'].transform(lambda v: v.bfill())
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2  None
Solution 5)
You can use DataFrame.set_index to append the group column to the index, do a simple bfill() via groupby() on that index level, and then use reset_index to restore the original state.
>>> df.set_index('group', append=True).groupby(level=1).bfill().reset_index(level=1)
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 6)
If you are strictly not going for groupby(), then the below would be the easiest:
>>> df['value'] = df['value'].bfill()
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2  None
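Note, however, that without groupby() a plain bfill() can pull values across group boundaries; it only happens to be safe here because each group's gaps are closed by a value from the same group. A minimal illustration of the pitfall (my own example, not from the thread):
>>> leaky = pd.DataFrame({'group': [1, 1, 2, 2],
...                       'value': ['A', None, 'B', None]})
>>> leaky['value'] = leaky['value'].bfill()
>>> leaky  # row 1 belongs to group 1 but borrowed 'B' from group 2
   group value
0      1     A
1      1     B
2      2     B
3      2  None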

Related

How can I groupby a DataFrame at the same time I count the values and put in different columns?

I have a DataFrame that looks like the one below
Index  Category  Class
0      1         A
1      1         A
2      1         B
3      2         A
4      3         B
5      3         B
And I would like to get an output data frame that groups by category and has one column for each class, counting the occurrences of that class in each category, such as the one below
Index  Category  A  B
0      1         2  1
1      2         1  0
2      3         0  2
So far I've tried various combinations of the groupby and agg methods, but I still can't get what I want. I've also tried df.pivot_table(index='Category', columns='Class', aggfunc='count'), but that returns a DataFrame without the columns I expect. Any ideas of what could work in this case?
You can use aggfunc="size" to achieve your desired result:
>>> df.pivot_table(index='Category', columns='Class', aggfunc='size', fill_value=0)
Class     A  B
Category
1         2  1
2         1  0
3         0  2
Alternatively, you can use .groupby(...).size() to get the counts, and then unstack to reshape your data as well:
>>> df.groupby(["Category", "Class"]).size().unstack(fill_value=0)
Class     A  B
Category
1         2  1
2         1  0
3         0  2
Assign a dummy value to count:
out = df.assign(val=1).pivot_table('val', 'Category', 'Class',
                                   aggfunc='count', fill_value=0).reset_index()
print(out)
# Output
Class  Category  A  B
0             1  2  1
1             2  1  0
2             3  0  2
import pandas as pd
df = pd.DataFrame({'Index': [0,1,2,3,4,5],
                   'Category': [1,1,1,2,3,3],
                   'Class': ['A','A','B','A','B','B'],
                   })
df = df.groupby(['Category', 'Class']).count()
df = df.pivot_table(index='Category', columns='Class')
print(df)
output:
         Index
Class        A    B
Category
1          2.0  1.0
2          1.0  NaN
3          NaN  2.0
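The NaN cells (and the resulting float dtype) come from (Category, Class) combinations that never occur. Passing pivot_table's standard fill_value parameter should fill them with zeros (a small tweak on the snippet above, not part of the original answer):
df = df.pivot_table(index='Category', columns='Class', fill_value=0)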
Use crosstab:
pd.crosstab(df['Category'], df['Class']).reset_index()
output:
Class  Category  A  B
0             1  2  1
1             2  1  0
2             3  0  2

Groupby selected rows by a condition on a column value and then transform another column

This seems like it should be easy, but I couldn't find a working solution for it:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0,0,2,2,2],
                   'B': [1,1,2,2,3],
                   'C': [1,1,2,3,4]})
   A  B  C
0  0  1  1
1  0  1  1
2  2  2  2
3  2  2  3
4  2  3  4
I want to select rows based on the values of column A, then group by the values of column B, and finally transform the values of column C into their sum, something along the lines of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for above example is:
   A  B  C
0  0  1  1
1  0  1  1
2  2  2  5
3  2  3  4
I also know how to split the dataframe and do it that way. I am looking for a solution that does it without the need for split+concat/merge. Thank you.
Is it just
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A','B'])['C'].sum().reset_index(),
           df[~s]))
Output:
   A  B  C
0  2  2  5
1  2  3  4
0  0  1  1
1  0  1  1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
   .assign(D=(~df['A'].isin([2])).cumsum())
   .groupby(['D','A','B'])['C'].sum()
   .reset_index('D', drop=True)
   .reset_index())
Output:
   A  B  C
0  0  1  1
1  0  1  1
2  2  2  5
3  2  3  4
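For completeness, here is one more split-free variant (my own sketch, not from the answer above): give every row that fails the condition its own synthetic group key, so the aggregation only ever merges the rows where A equals 2.
import numpy as np

m = df['A'].eq(2)
# rows failing the condition get unique negative keys and form singleton groups
key = pd.Series(np.where(m, df['B'], -df.index - 1), index=df.index, name='key')
out = (df.groupby(['A', key], sort=False)
         .agg(B=('B', 'first'), C=('C', 'sum'))
         .droplevel('key')
         .reset_index())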

pandas groupby apply returning a dataframe

Consider the following code:
>>> df = pd.DataFrame(np.random.randint(0, 4, 16).reshape(4, 4), columns=list('ABCD'))
... df
...
   A  B  C  D
0  2  1  0  2
1  3  0  2  2
2  0  2  0  2
3  2  1  2  0
>>> def grouper(frame):
... return frame
...
... df.groupby('A').apply(grouper)
...
   A  B  C  D
0  2  1  0  2
1  3  0  2  2
2  0  2  0  2
3  2  1  2  0
As you can see, the results are identical.
Here is the documentation of apply:
The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.
Groupby will divide group into small dataframes like this:
   A  B  C  D
2  0  2  0  2

   A  B  C  D
0  2  1  0  2
3  2  1  2  0

   A  B  C  D
1  3  0  2  2
apply documentation says that it combines the dataframes back into a single dataframe. I am curious how it combined them in a way that the final result is the same as the original dataframe. If it had used concat, the final dataframe would have been equal to:
   A  B  C  D
2  0  2  0  2
0  2  1  0  2
3  2  1  2  0
1  3  0  2  2
I am curious how this concatenation has been done.
If you look at the source code, you will see that there is a parameter not_indexed_same that checks whether the index remains the same after the groupby. If it is the same, then groupby reindexes the dataframe back to the original order before returning the result. I do not know why this was implemented.
The change was made on Aug 21, 2011 and Wes made no comments on the change: https://github.com/pandas-dev/pandas/commit/00c8da0208553c37ca6df0197da431515df813b7#diff-720d374f1a709d0075a1f0a02445cd65
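A quick way to observe both code paths (a minimal sketch; details may vary across pandas versions):
>>> df.groupby('A').apply(lambda g: g).equals(df)  # index preserved: original order restored
True
>>> out = df.groupby('A').apply(lambda g: g.reset_index(drop=True))  # index changed
>>> out.index.nlevels  # plain concatenation, with the group key 'A' prepended as an index level
2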

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows whose groupby(['Type']) count is less than 3, up to a count of 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated twice at the end. This is only an example deck: the real data has approximately 20 million rows and 400K unique Types, so an efficient method is desired.
>>> df
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
>>> df1
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
I thought about using something like the following, but I don't know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
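For reference, one way the func hinted at above could look (my own sketch, not from the thread; groupby().apply() will be slow at 20 million rows, so the vectorized answer below is preferable):
def pad_group(g, target=3):
    # repeat the group's last row until the group has `target` rows
    if len(g) >= target:
        return g
    pad = pd.concat([g.iloc[[-1]]] * (target - len(g)))
    return pd.concat([g, pad])

df1 = (df.groupby('Type', group_keys=False)
         .apply(pad_group)
         .reset_index(drop=True))
# caveat: the padded rows land inside their group rather than at the end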
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
Note: sort=False for append requires pandas>=0.23.0; remove it if you are using a lower version.
EDIT: If the data contains multiple value columns, set all of them as the index except the one to repeat, then repeat and reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
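Since DataFrame.append was removed in pandas 2.0, the same idea can be written with pd.concat on newer versions (an equivalent sketch, not from the original answer):
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)[['Type', 'Val']]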

Start counting at zero by group

Consider the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('aaabbabc')})
>>> df
  group
0     a
1     a
2     a
3     b
4     b
5     a
6     b
7     c
I want to count the cumulative number of times each group has occurred. My desired output looks like this:
>>> df
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
My initial approach was to do something like this:
df['n'] = df.groupby('group').apply(lambda x: list(range(x.shape[0])))
Basically, that assigns a zero-indexed array of length n to each group. But that has proven difficult to transpose and join.
You can use groupby + cumcount, and horizontally concat the new column:
>>> pd.concat([df, df.group.groupby(df.group).cumcount()], axis=1).rename(columns={0: 'n'})
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
Simply use groupby on the column name, in this case group, then apply cumcount, and finally add a column to the dataframe with the result.
df['n']=df.groupby('group').cumcount()
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
You can use the apply method, passing a lambda expression as the parameter.
The idea is that the count for a row is the number of appearances of its group among the previous rows.
df['n'] = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)
Output
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
Note: the cumcount method is essentially built with the help of the apply function.
You can read this in the pandas documentation.
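The cumcount docstring spells out that equivalence: numbering each group's rows from zero with apply looks roughly like this (my restatement of the documented equivalence):
import numpy as np
df['n'] = df.groupby('group').apply(lambda x: pd.Series(np.arange(len(x)), x.index))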
