Pandas: Create missing combination rows with zero values - python

Let's say I have a dataframe df:
df = pd.DataFrame({'col1': [1,1,2,2,2], 'col2': ['A','B','A','B','C'], 'value': [2,4,6,8,10]})
col1 col2 value
0 1 A 2
1 1 B 4
2 2 A 6
3 2 B 8
4 2 C 10
I'm looking for a way to create any missing rows among the possible combinations of col1 and col2 with existing values, and fill the missing rows with zeros.
The desired result would be:
col1 col2 value
0 1 A 2
1 1 B 4
2 2 A 6
3 2 B 8
4 2 C 10
5 1 C 0 <- Missing the "1-C" combination, so create it w/ value = 0
I've looked into using stack and unstack to make this work, but I'm not sure that's exactly what I need.
Thanks in advance

Use pivot, then stack:
df.pivot(*df.columns).fillna(0).stack().to_frame('values').reset_index()
Out[564]:
col1 col2 values
0 1 A 2.0
1 1 B 4.0
2 1 C 0.0
3 2 A 6.0
4 2 B 8.0
5 2 C 10.0
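Note: on recent pandas versions (DataFrame.pivot's arguments became keyword-only in 2.0), the positional df.pivot(*df.columns) call raises a TypeError. The explicit-keyword spelling of the same one-liner:
df.pivot(index='col1', columns='col2', values='value').fillna(0).stack().to_frame('values').reset_index()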

Another way: use unstack with fill_value=0, then stack and reset_index:
df.set_index(['col1','col2']).unstack(fill_value=0).stack().reset_index()
Out[311]:
col1 col2 value
0 1 A 2
1 1 B 4
2 1 C 0
3 2 A 6
4 2 B 8
5 2 C 10
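A small difference between the two approaches worth knowing: pivot introduces NaN for the missing combination before fillna, which forces the column to float (hence 2.0, 4.0, ... above), while unstack(fill_value=0) never creates NaN, so the integer dtype survives. Assuming the df from the question:
df.pivot(index='col1', columns='col2', values='value').fillna(0).stack().dtype   # float64
df.set_index(['col1', 'col2']).unstack(fill_value=0).stack()['value'].dtype      # int64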

You could use reindex + MultiIndex.from_product:
index = pd.MultiIndex.from_product([df.col1.unique(), df.col2.unique()], names=['col1', 'col2'])  # names= keeps the column labels through reset_index
result = df.set_index(['col1', 'col2']).reindex(index, fill_value=0).reset_index()
print(result)
Output
col1 col2 value
0 1 A 2
1 1 B 4
2 1 C 0
3 2 A 6
4 2 B 8
5 2 C 10
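For completeness (not one of the original answers), a cross-merge sketch achieves the same result on pandas >= 1.2, where how='cross' was added:
full = df[['col1']].drop_duplicates().merge(df[['col2']].drop_duplicates(), how='cross')
result = full.merge(df, on=['col1', 'col2'], how='left').fillna({'value': 0})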

Related

How do I delete rows that are duplicated on specified columns?

I'm new to Python.
My code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [2,3,2,2,2],
                   'B': [1,5,5,1,1],
                   'C': [1,6,6,2,1],
                   'D': [1,2,3,1,1]})
df
dataframe:
A B C D
0 2 1 1 1
1 3 5 6 2
2 2 5 6 3
3 2 1 2 1
4 2 1 1 1
I want to delete rows but keep the first one when columns B and C are both the same.
For example:
for row 0 & row 4, column B and column C are the same, so delete row 4;
for row 1 & row 2, column B and column C are the same, so delete row 2.
Use drop_duplicates on the 'B' and 'C' columns (subset=['B', 'C']) and keep the first occurrence (keep='first'):
>>> df.drop_duplicates(subset=['B', 'C'], keep='first')
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
keep='first' is the default option so you don't have to set it.
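If you ever need a different survivor, the same keyword covers it; a quick sketch:
df.drop_duplicates(subset=['B', 'C'], keep='last')   # keep the last occurrence of each B/C pair
df.drop_duplicates(subset=['B', 'C'], keep=False)    # drop every row whose B/C pair is duplicated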
You can do something like:
df.groupby(['B', 'C']).head(1)
This takes the first element from each group:
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
Or:
>>> df[~df[['B', 'C']].duplicated()]
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1

Replace n last rows with NaN

I have a data frame df1:
df1 =
index col1 col2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
What I would like to do is for example to replace the last two rows in col2 with NaN, so the resulting data frame would be:
index col1 col2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 NaN
6 6 NaN
Use positional indexing with DataFrame.iloc; the column position comes from Index.get_loc:
df.iloc[-2:, df.columns.get_loc('col2')] = np.nan
Or use DataFrame.loc with label indexing via df.index:
df.loc[df.index[-2:], 'col2'] = np.nan
print (df)
col1 col2
1 1 2.0
2 2 3.0
3 3 4.0
4 4 5.0
5 5 NaN
6 6 NaN
Finally, if you need an integer column, convert to the nullable Int64 dtype, which can hold <NA>:
df['col2'] = df['col2'].astype('Int64')
print (df)
col1 col2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 <NA>
6 6 <NA>
Just try:
df.col2[-2:] = np.nan
Since this post seems to be gathering all the possible ways:
df["col2"].iloc[-2:] = np.nan
4 ways to do this. Ways 3 and 4 seem the best to me:
1)
import math
df.at[5, 'col2'] = math.nan
df.at[6, 'col2'] = math.nan
2)
df.loc[5, 'col2'] = np.nan
df.loc[6, 'col2'] = np.nan
3) from the answer above:
df.col2[-2:] = np.nan
4)
df['col2'][-2:] = np.nan
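A caution on the chained-assignment forms (ways 3 and 4, and the df.col2[-2:] answer above): they raise SettingWithCopyWarning and, under copy-on-write (opt-in in pandas 2.x, slated to become the default in 3.0), they silently do nothing. The loc spelling from the accepted answer is the reliable one:
df.loc[df.index[-2:], 'col2'] = np.nan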

How to select rows from a pandas df based on index values between two numbers

How would I select rows 2 through 4 of the following df to get the desired output shown below.
I tried to do df = df.index.between(2,4) but I got the following error: AttributeError: 'Int64Index' object has no attribute 'between'
col 1 col 2 col 3
0 1 1 2
1 5 4 2
2 2 1 5
3 1 2 2
4 3 2 4
5 4 3 2
Desired output
col 1 col 2 col 3
2 2 1 5
3 1 2 2
4 3 2 4
Try using loc for index selection with label slicing:
df.loc[2:4]
Output:
col 1 col 2 col 3
2 2 1 5
3 1 2 2
4 3 2 4
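The thing to keep in mind with the integer index above: loc slices by label and includes both endpoints, while iloc slices by position and excludes the stop:
df.loc[2:4]    # labels 2, 3, 4
df.iloc[2:5]   # positions 2, 3, 4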
The easiest way to select rows from a dataframe is the .iloc[rows, columns] indexer. Keep in mind that iloc slices by position and excludes the stop value, so rows 2 through 4 of the question's df would be df.iloc[2:5]. For example:
df1 = pd.DataFrame({"a": [1,2,3,4,5,6,7], "b": [4,5,6,7,8,9,10]})
df1.iloc[1:3]  # positional rows 1 and 2
With loc:
start, stop = 2, 4  # avoid shadowing the min/max builtins
between_range = range(start, stop + 1)
df.loc[between_range]
Use the following (remember that iloc excludes the stop position, hence 5):
df.iloc[2:5]
You want to use .iloc[rows, columns]:
df.iloc[2:5, :]
between cannot act on an Index, only on a Series. So, if you want to use a boolean mask, you first need to convert the index to a series using to_series, like this:
df
# col1 col2 col3
# 0 1 1 2
# 1 5 4 2
# 2 2 1 5
# 3 1 2 2
# 4 3 2 4
# 5 4 3 2
df[df.index.to_series().between(2,4)]
# col1 col2 col3
# 2 2 1 5
# 3 1 2 2
# 4 3 2 4
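A side benefit of the mask: loc label slicing assumes a sorted (monotonic) index, whereas the boolean mask keeps working even on a shuffled frame:
shuffled = df.sample(frac=1, random_state=0)
shuffled[shuffled.index.to_series().between(2, 4)]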

Pandas: group column values that fall below a certain number and assign the group numbers as a new column

I have a data frame like this,
df
col1 col2
A 2
B 3
C 1
D 4
E 6
F 1
G 2
H 8
I 1
J 10
Now I want to create another column col3 that groups together runs of col2 values below 5, numbering the groups from 1 up to the number of groups, so the final data frame would look like:
col1 col2 col3
A 2 1
B 3 1
C 1 1
D 4 1
E 6 2
F 1 2
G 2 2
H 8 3
I 1 3
J 10 4
I could do this by comparing the previous value with the current value, storing the result in a list and making that list col3.
But the execution time would be huge in that case, so I'm looking for a shortcut/pythonic way to do it efficiently.
Compare with Series.gt for > and then use Series.cumsum. Here the new column starts at 0, because the first value of the column is less than 5; otherwise it would start at 1:
df['col3'] = df['col2'].gt(5).cumsum()
print (df)
col1 col2 col3
0 A 2 0
1 B 3 0
2 C 1 0
3 D 4 0
4 E 6 1
5 F 1 1
6 G 2 1
7 H 8 2
8 I 1 2
9 J 10 3
So for a general solution starting at 1, use this trick: check whether the first value is below the threshold, convert the boolean to an integer (True -> 1, False -> 0) and add it to the cumulative sum:
N = 5
df['col3'] = df['col2'].gt(N).cumsum() + int(df.loc[0, 'col2'] < N)

# test with a first value above 5: col21 is col2 with 5 added to the first row
df = df.assign(col21 = df['col2'].add(pd.Series({0: 5}), fill_value=0).astype(int))
df['col31'] = df['col21'].gt(N).cumsum() + int(df.loc[0, 'col21'] < N)
print (df)
col1 col2 col21 col3 col31
0 A 2 7 1 1
1 B 3 3 1 1
2 C 1 1 1 1
3 D 4 4 1 1
4 E 6 6 2 2
5 F 1 1 2 2
6 G 2 2 2 2
7 H 8 8 3 3
8 I 1 1 3 3
9 J 10 10 4 4

How to extract mini-dataframes from an existing one based on certain condition?

Assuming a df as follows
col1 col2
1 1
1 2
1 4
1 6
1 7
1 8
1 24
1 23
1 24
1 1
1 1
1 2
1 3
1 1
1 3
1 2
2 2
2 3
2 4
2 5
2 5
2 6
2 9
2 15
2 16
2 19
2 24
2 1
2 3
2 2
2 1
2 2
2 2
2 3
2 3
I would like to do a kind of groupby on col1 and check whether the numbers 1, 2, 3 occur in col2 after a 24. If so, the rows with these values must be stored as separate dataframes, preferably as follows:
df1:
col1 col2
1 1
1 1
1 1
2 1
2 1
df2:
col1 col2
1 2
1 2
2 2
2 2
2 2
df3:
col1 col2
1 3
1 3
2 3
2 3
2 3
The dataframes df1, df2 and df3 have been created from the values that occur after 24 in col2.
Edit 1:
In the df, there is an instance where a 23 appears in col2 between two 24s. In such a case as well, the values there must be checked, and if one is 1, 2, or 3, it must be assigned to the respective dataframe.
Iterate through each group of the groupby: for _, group in df.groupby('col1').
Find the positional index of the first occurrence of 24 in each group using group.index.get_loc(group[group.col2.eq(24)].index[0]).
Subset each group from the index found in the previous step to the end: group.iloc[indexfound:].
From the subsetted data frame, find the occurrences of 1, 2 and 3 (group.col2.eq(1/2/3)) and save each to a separate data frame:
df1 = pd.DataFrame(columns=['col1', 'col2'])
df2 = df1.copy()
df3 = df1.copy()
for _, group in df.groupby('col1'):
    # position of the first 24 in this group
    start = group.index.get_loc(group[group.col2.eq(24)].index[0])
    tail = group.iloc[start:]
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    df1 = pd.concat([df1, tail[tail.col2.eq(1)]])
    df2 = pd.concat([df2, tail[tail.col2.eq(2)]])
    df3 = pd.concat([df3, tail[tail.col2.eq(3)]])
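For reference, a vectorized sketch of the same idea (not from the original answer, assuming the df from the question): a per-group cumulative maximum of the col2 == 24 flag marks every row from the first 24 onward, and a dict comprehension collects the three frames:
after24 = df.groupby('col1')['col2'].transform(lambda s: s.eq(24).cummax())
frames = {v: df[after24 & df.col2.eq(v)] for v in (1, 2, 3)}   # frames[1] is df1, etc.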
