Add missing rows within groups - python

Suppose I have this dataframe:
df = pd.DataFrame({
"id": [1, 1, 1, 2, 2, 3],
"day": [1, 2, 3, 1, 3, 2],
"value": [1, 2, 3, 4, 5, 6],
})
I need every id to have the same set of days. How can I add rows for the missing days?

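IIUC, pivot to wide form, fill the gaps with 0, then stack back to long form:
# pivot takes keyword-only arguments since pandas 2.0, so df.pivot(*df.columns) no longer works
df=df.pivot(index='id', columns='day', values='value').fillna(0).stack().reset_index()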
id day 0
0 1 1 1.0
1 1 2 2.0
2 1 3 3.0
3 2 1 4.0
4 2 2 0.0
5 2 3 5.0
6 3 1 0.0
7 3 2 6.0
8 3 3 0.0

For a solution that also handles multiple value columns, filling the new rows with 0: move id and day into a MultiIndex with DataFrame.set_index, reshape with DataFrame.unstack and DataFrame.stack, then convert the index back to columns with DataFrame.reset_index:
df = df.set_index(['id','day']).unstack(fill_value=0).stack().reset_index()
print (df)
id day value
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 4
4 2 2 0
5 2 3 5
6 3 1 0
7 3 2 6
8 3 3 0
Another solution uses DataFrame.reindex with MultiIndex.from_product: again set a MultiIndex with DataFrame.set_index, reindex to the full product of its levels, then convert back to columns with DataFrame.reset_index:
df = df.set_index(['id','day'])
m = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(m, fill_value=0).reset_index()
print (df)
id day value
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 4
4 2 2 0
5 2 3 5
6 3 1 0
7 3 2 6
8 3 3 0
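Note that df.index.levels only contains days that occur at least once in the data. If the full set of days is known up front, a minimal sketch along the same lines is to build the target index explicitly. The all_days range below is an assumption for illustration and is not part of the question:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "day": [1, 2, 3, 1, 3, 2],
    "value": [1, 2, 3, 4, 5, 6],
})

# Assumed calendar: days 1..4 (day 4 never appears in the data)
all_days = range(1, 5)

# Every (id, day) combination that should exist in the result
full = pd.MultiIndex.from_product(
    [df["id"].unique(), all_days], names=["id", "day"]
)

# Existing rows keep their value, new rows get 0
out = df.set_index(["id", "day"]).reindex(full, fill_value=0).reset_index()
print(out)
```

This requires the (id, day) pairs in the original frame to be unique, since reindex cannot work on a duplicated index.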

Related

groupby and find the first non-zero value

I have the following df; the last column is the desired output. Thanks!
group date value desired_first_nonzero
1 jan2019 0 2
1 jan2019 2 2
1 feb2019 3 2
1 mar2019 4 2
1 mar2019 5 2
2 feb2019 0 4
2 feb2019 0 4
2 mar2019 0 4
2 mar2019 4 4
2 apr2019 5 4
I want to group by "group" and find the first non-zero value.
You can use GroupBy.transform with a custom function that looks up the first non-zero value via idxmax (which returns the index of the first True value here):
df['desired_first_nonzero'] = (df.groupby('group')['value']
.transform(lambda s: s[s.ne(0).idxmax()])
)
Alternatively, using an intermediate Series:
s = df.set_index('group')['value']
df['desired_first_nonzero'] = df['group'].map(s[s.ne(0)].groupby(level=0).first())
Output:
group date value desired_first_nonzero
0 1 jan2019 0 2
1 1 jan2019 2 2
2 1 feb2019 3 2
3 1 mar2019 4 2
4 1 mar2019 5 2
5 2 feb2019 0 4
6 2 feb2019 0 4
7 2 mar2019 0 4
8 2 mar2019 4 4
9 2 apr2019 5 4
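The two variants differ when a group has no non-zero value at all. A small sketch of that edge case, using made-up data that is not from the question:

```python
import pandas as pd

# Hypothetical data: group 2 contains only zeros
df = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "value": [0, 7, 0, 0],
})

# idxmax on an all-False mask returns the first label, so group 2 gets 0
via_idxmax = df.groupby("group")["value"].transform(
    lambda s: s[s.ne(0).idxmax()]
)

# Filtering first drops group 2 entirely, so map leaves NaN for its rows
s = df.set_index("group")["value"]
via_map = df["group"].map(s[s.ne(0)].groupby(level=0).first())

print(via_idxmax.tolist())
print(via_map.tolist())
```

Which behaviour is preferable depends on whether 0 or a missing value is the better marker for "no non-zero value found".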
This should do the job:
import pandas as pd
# the given example
d = {'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'value': [0, 2, 3, 4, 5, 0, 0, 0, 4, 5]}
df = pd.DataFrame(data=d)
# drop the zeros, then keep the first remaining row of each group
first_non_zero = df[df['value'] != 0].groupby('group').head(1)
print(first_non_zero)
Output:
group value
1 1 2
8 2 4
You can then distribute this value to every row of its group as needed.
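One way to do that last step, sketched here with Series.map (the lookup name is only for illustration):

```python
import pandas as pd

d = {'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
     'value': [0, 2, 3, 4, 5, 0, 0, 0, 4, 5]}
df = pd.DataFrame(data=d)

# First non-zero row per group, as in the answer above
first_non_zero = df[df['value'] != 0].groupby('group').head(1)

# Turn it into a group -> value lookup and map it onto every row
lookup = first_non_zero.set_index('group')['value']
df['desired_first_nonzero'] = df['group'].map(lookup)
print(df)
```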

How to expand a dataframe by assigning to each value in column 3 values

Let's say I have a dataframe:
index day
0 21
1 2
2 7
and to each day I want to assign 3 values: 0, 1, 2. In the end the dataframe should look like this:
index day value
0 21 0
1 21 1
2 21 2
3 2 0
4 2 1
5 2 2
6 7 0
7 7 1
8 7 2
Does anyone have any idea?
You could introduce a column containing (0, 1, 2)-tuples and then explode the dataframe on that column:
import pandas as pd
df = pd.DataFrame({'day': [21, 2, 7]})
df['value'] = [(0, 1, 2)] * len(df)
df = df.explode('value')
df.index = range(len(df))
print(df)
day value
0 21 0
1 21 1
2 21 2
3 2 0
4 2 1
5 2 2
6 7 0
7 7 1
8 7 2
Try:
N = 3
df = df.assign(value=[range(N) for _ in range(len(df))]).explode("value")
print(df)
Prints:
index day value
0 0 21 0
0 0 21 1
0 0 21 2
1 1 2 0
1 1 2 1
1 1 2 2
2 2 7 0
2 2 7 1
2 2 7 2
A reindex option:
df = (
df.reindex(index=pd.MultiIndex.from_product([df.index, [0, 1, 2]]),
level=0)
.droplevel(0)
.rename_axis(index='value')
.reset_index()
)
df:
value day
0 0 21
1 1 21
2 2 21
3 0 2
4 1 2
5 2 2
6 0 7
7 1 7
8 2 7
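Another possible route, assuming pandas 1.2 or later, is a cross join against a small helper frame. The name values is hypothetical and only used for this sketch:

```python
import pandas as pd

df = pd.DataFrame({"day": [21, 2, 7]})

# Helper frame holding the values to attach to every day (hypothetical name)
values = pd.DataFrame({"value": [0, 1, 2]})

# how="cross" pairs every row of df with every row of values
out = df.merge(values, how="cross")
print(out)
```

The result already has a fresh 0..n-1 index, so no reset_index is needed afterwards.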

Get values from a smaller DataFrame with a specified step

Supposing I have the two DataFrames shown below:
dd = pd.DataFrame([1,0, 3, 0, 5])
0
0 1
1 0
2 3
3 0
4 5
and
df = pd.DataFrame([2,4])
0
0 2
1 4
How can I broadcast the values of df into dd with step = 2 so I end up with
0
0 1
1 2
2 3
3 4
4 5
One solution is to assign to every second row with iloc:
dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])
dd.iloc[1::2] = df.values
dd
# Out:
0
0 1
1 2
2 3
3 4
4 5
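Alternatively, write through the underlying NumPy array (this only works when dd.values is a writable view of the data, which is not the case under Copy-on-Write, so prefer the iloc form above when in doubt):
dd.values[1::2] = df.values
dd now contains: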
0
0 1
1 2
2 3
3 4
4 5
Note that the step=2 condition is applied through slicing: array[1::2] means start at index 1, continue to the end, and take every second element.
Set df.index to the target positions with range, then assign into dd by label:
df.index = range(1, len(dd)+1, 2)[:len(df)]
print (df)
0
1 2
3 4
dd.loc[df.index] = df
print (dd)
0
0 1
1 2
2 3
3 4
4 5
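The positional variant can be wrapped in a small helper that checks the lengths first. This is only a sketch; fill_every and its start/step parameters are hypothetical names, not part of pandas:

```python
import pandas as pd

def fill_every(dd, df, start=1, step=2):
    """Return a copy of dd with rows start, start+step, ... replaced by df's values."""
    out = dd.copy()
    # Number of rows selected by the slice start::step
    n_slots = len(range(start, len(out), step))
    if n_slots != len(df):
        raise ValueError(f"{n_slots} target rows but {len(df)} replacement rows")
    out.iloc[start::step] = df.to_numpy()
    return out

dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])
result = fill_every(dd, df)
print(result)
```

Working on a copy leaves dd untouched, and the explicit length check gives a clearer error than a shape mismatch raised from inside the assignment.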

Adding dataframes only on selected rows

I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'crit_1': [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'crit_2': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],
                   'value': [3, 4, 3, 5, 1, 2, 4, 6, 2, 3]},
                  columns=['id', 'crit_1', 'crit_2', 'value'])
df
Out[41]:
id crit_1 crit_2 value
0 1 0 a 3
1 1 0 a 4
2 1 1 b 3
3 1 0 b 5
4 2 0 a 1
5 2 0 b 2
6 2 1 a 4
7 3 0 a 6
8 3 0 a 2
9 3 1 a 3
I pull a subset out of this frame based on crit_1:
df_subset = df[(df['crit_1']==1)].copy()  # .copy() avoids SettingWithCopyWarning on the assignment below
Then I perform a complex operation (the nature of which is unimportant for this question) on that subset, producing a new column:
df_subset['some_new_val'] = [1, 4, 2]
df_subset
Out[42]:
id crit_1 crit_2 value some_new_val
2 1 1 b 3 1
6 2 1 a 4 4
9 3 1 a 3 2
Now I want to add some_new_val back into my original dataframe, onto the column value. However, I only want to add it where there is a match on id and crit_2.
The result should look like this:
id crit_1 crit_2 value new_value
0 1 0 a 3 3
1 1 0 a 4 4
2 1 1 b 3 4
3 1 0 b 5 6
4 2 0 a 1 1
5 2 0 b 2 6
6 2 1 a 4 4
7 3 0 a 6 8
8 3 0 a 2 4
9 3 1 a 3 5
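You can use merge with a left join and then add. Note that this follows the stated rule (match on id and crit_2) strictly, so for id 2 the result is 5, 2, 8 rather than the 1, 6, 4 shown in the question: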
#filter only columns for join and for append
cols = ['id','crit_2', 'some_new_val']
df = pd.merge(df, df_subset[cols], on=['id','crit_2'], how='left')
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 NaN
1 1 0 a 4 NaN
2 1 1 b 3 1.0
3 1 0 b 5 1.0
4 2 0 a 1 4.0
5 2 0 b 2 NaN
6 2 1 a 4 4.0
7 3 0 a 6 2.0
8 3 0 a 2 2.0
9 3 1 a 3 2.0
df['some_new_val'] = df['some_new_val'].add(df['value'], fill_value=0)
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 3.0
1 1 0 a 4 4.0
2 1 1 b 3 4.0
3 1 0 b 5 6.0
4 2 0 a 1 5.0
5 2 0 b 2 2.0
6 2 1 a 4 8.0
7 3 0 a 6 8.0
8 3 0 a 2 4.0
9 3 1 a 3 5.0
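The left join only behaves this way if each (id, crit_2) pair occurs at most once in df_subset; otherwise rows of df would be duplicated. A sketch that makes this assumption explicit with the validate argument of merge and writes the sum into a separate new_value column:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "crit_1": [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    "crit_2": ["a", "a", "b", "b", "a", "b", "a", "a", "a", "a"],
    "value": [3, 4, 3, 5, 1, 2, 4, 6, 2, 3],
})
df_subset = df[df["crit_1"] == 1].copy()
df_subset["some_new_val"] = [1, 4, 2]

keys = ["id", "crit_2"]
# validate raises MergeError if a key pair is duplicated on the right side
merged = df.merge(df_subset[keys + ["some_new_val"]], on=keys, how="left",
                  validate="many_to_one")

# Rows without a match contribute 0; cast back to int after filling the NaNs
merged["new_value"] = merged["value"] + merged["some_new_val"].fillna(0).astype(int)
out = merged.drop(columns="some_new_val")
print(out)
```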

Pandas dataframe: how to group by values in a column and create new columns out of grouped values

I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the values in column x all repeat the same number of times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on x cycling through the same N values in the same order,
N = 3  # number of distinct x values, i.e. the length of one cycle
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it is less robust since it relies on the values in y appearing in the right order.
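To get the flat val1..val4 header from the question rather than a column MultiIndex, one option is to pass values='y' to pivot and rename the columns afterwards. A sketch, where the helper column name n is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({"x": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
                   "y": [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})

# Occurrence number of each row within its x group: 0, 1, 2, 3
df["n"] = df.groupby("x").cumcount()

# values="y" keeps the columns as a plain Index instead of a MultiIndex
result = df.pivot(index="x", columns="n", values="y")

# Rename 0..3 to val1..val4 and turn x back into a regular column
result.columns = [f"val{i + 1}" for i in result.columns]
result = result.reset_index()
print(result)
```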
