I have a somewhat large array (~3000 rows) where the first column contains duplicate string values, each repeated a varying number of times. I want to remove these duplicates without shifting the cells in this column.
Input
row/rack shelf tilt
row1.rack1 B 5
row1.rack1 A nan
row1.rack2 C nan
row1.rack2 B nan
row1.rack2 A 17
Desired output
row/rack shelf tilt
row1.rack1 B 5
A nan
row1.rack2 C nan
B nan
A 17
Is there a good way to do this? I've been searching through Stack Overflow and other sites but haven't been able to find something like this.
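For anyone who wants to test, the input can be rebuilt like this (a sketch; the nan cells are assumed to be np.nan):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'row/rack': ['row1.rack1', 'row1.rack1', 'row1.rack2', 'row1.rack2', 'row1.rack2'],
    'shelf': ['B', 'A', 'C', 'B', 'A'],
    'tilt': [5, np.nan, np.nan, np.nan, 17],
})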
Using .duplicated and .loc:
df.loc[df['row/rack'].duplicated(keep='first'),'row/rack'] = ''
print(df)
row/rack shelf tilt
0 row1.rack1 B 5.0
1 A NaN
2 row1.rack2 C NaN
3 B NaN
4 A 17.0
Mask the duplicates with empty strings:
df["row/rack"] = df["row/rack"].mask(df["row/rack"].duplicated(), "")
>>> df
row/rack shelf tilt
0 row1.rack1 B 5.0
1 A NaN
2 row1.rack2 C NaN
3 B NaN
4 A 17.0
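If the blanking is only for display and you would rather not overwrite the data, one alternative (a sketch, assuming you can spare those columns as an index) is to move the repeated labels into a MultiIndex, which pandas sparsifies when printing:
print(df.set_index(['row/rack', 'shelf']))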
Related
I would like to fill one dataframe (df1) with information from a lookup dataframe (reference_df). I have tried this (using other Stack Overflow posts and answers) with merge, fillna and combine_first. However, each method brings some problems in my case.
df1:
class A B C
0 a 1 NaN NaN
1 b 2 NaN NaN
2 c 3 NaN NaN
3 a 1 NaN NaN
reference_df:
class A B C
0 a 1 2 3
1 b 2 4 6
Target_df:
class A B C
0 a 1 2.0 3.0
1 b 2 4.0 6.0
2 c 3 NaN NaN
3 a 1 2.0 3.0
Three things to note:
'c' is not in the reference_df so should remain empty
'a' appears twice in df1, not in order, and both rows should be filled with the values of 'a' in reference_df
the order of df1 should stay as is
The above also shows the issues I ran into with merge, fillna and combine_first: the repetition of 'a' breaks fillna, and the repeated row gets filled with other values by the other two methods.
The solution I have going now is done manually via looping over both frames, but this is very expensive and the dataset is very large.
I hope I am explaining this alright, first post on StackOverflow so I might be missing some needed information. Let me know if I should clarify.
Cheers
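For reference, the two frames can be rebuilt like this (a sketch of the data shown above):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'class': ['a', 'b', 'c', 'a'],
                    'A': [1, 2, 3, 1],
                    'B': np.nan,
                    'C': np.nan})
reference_df = pd.DataFrame({'class': ['a', 'b'],
                             'A': [1, 2],
                             'B': [2, 4],
                             'C': [3, 6]})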
target_df = df1[['class', 'A']].merge(reference_df, how='left')
print(target_df)
Output:
class A B C
0 a 1 2.0 3.0
1 b 2 4.0 6.0
2 c 3 NaN NaN
3 a 1 2.0 3.0
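This works because the merge keys ('class' and 'A') are exactly the columns df1 already has filled: how='left' keeps df1's length and row order, the repeated 'a' simply matches the same reference row twice, and classes missing from reference_df (like 'c') come back as NaN.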
Without using apply (because the dataframe is too big), how can I get the previous non-NaN value of a specific column to use in a calculation?
For example, this dataframe:
df = pd.DataFrame([['A',1,100],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])
A B C
0 A 1 100.0
1 B 2 NaN
2 C 3 NaN
3 D 4 182.0
4 E 5 NaN
I need to calculate the difference, in column 'C', between the value at row 3 and the value at row 0.
The number of NaN values between two valid values is variable, so .shift() may not be applicable here (I think).
I need something like: df['D'] = df.C - df.C[previous_not_nan] (at row 3 this would be 82).
dropna + diff
df['D'] = df['C'].dropna().diff()
A B C D
0 A 1 100.0 NaN
1 B 2 NaN NaN
2 C 3 NaN NaN
3 D 4 182.0 82.0
4 E 5 NaN NaN
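An equivalent approach without dropna, as a sketch: shift 'C' down one row and forward-fill, so every row sees the last non-NaN value before it; the subtraction then stays NaN wherever 'C' itself is NaN:
df['D'] = df['C'] - df['C'].shift().ffill()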
I have a pandas data frame as shown below. One column has values with intervening NaN cells. The values are to be shifted ahead by one so that they replace the next value that follows, with the last being lost. The intervening NaN cells have to remain. I tried using .shift(), but since I never know how many intervening NaN rows there are, it means a calculation for each shift. Is there a better approach?
IIUC, you can just group on the null mask of the column and shift within each group.
df['y'] = df.y.groupby(pd.isnull(df.y)).shift()
x y
0 A NaN
1 A NaN
2 A NaN
3 B 5.0
4 B NaN
5 B NaN
6 B NaN
7 C 10.0
8 C NaN
9 C NaN
10 C NaN
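This works because grouping on the null mask puts all non-NaN values into one group and all NaNs into the other; shifting within the non-NaN group moves each value down to the position of the next non-NaN cell (the first becomes NaN and the last is lost), while the NaN group just shifts NaNs onto NaNs.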
Another way:
s = df['y'].notnull()
df.loc[s,'y'] = df.loc[s,'y'].shift()
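Same idea expressed with a boolean mask: s selects the non-NaN positions, .shift() operates on that shorter Series, and the assignment writes the shifted values back to those same positions by index.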
It would be easier to test if you paste your text data instead of the picture.
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': list('AAABBBBCCCC'),
                   'y': [5, np.nan, np.nan, 10, np.nan, np.nan, np.nan,
                         20, np.nan, np.nan, np.nan]})
Output:
x y
0 A NaN
1 A NaN
2 A NaN
3 B 5.0
4 B NaN
5 B NaN
6 B NaN
7 C 10.0
8 C NaN
9 C NaN
10 C NaN
I'm pretty new to Pandas and programming in general, but I've always been able to find the answer to any problem through Google until now. Sorry about the not terribly descriptive question; hopefully someone can come up with something clearer.
I'm trying to group data together, perform functions on that data, update a column and then use the data from that column on the next group of data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(9),columns=['A'])
df['B'] = [1,1,1,2,2,3,3,3,3]
df['C'] = np.nan
df['D'] = np.nan
df.loc[0:2,'C'] = 500
Giving me
A B C D
0 0.825828 1 500.0 NaN
1 0.218618 1 500.0 NaN
2 0.902476 1 500.0 NaN
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
The 500 in column C is the initial condition. I want to group the data by column B and perform the following function on the first group
def function1(row):
    return row['A'] * row['C'] / 6
giving me
A B C D
0 0.825828 1 500.0 68.818971
1 0.218618 1 500.0 18.218145
2 0.902476 1 500.0 75.206313
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then want to sum the first three values in D, add them to the last value in C, and make this the C value for group 2
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 NaN
4 0.513505 2 662.243429 NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then perform function1 on group 2 and repeat until I end up with this
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 49.946896
4 0.513505 2 662.243429 56.677505
5 0.089975 3 768.867830 11.529874
6 0.282479 3 768.867830 36.198113
7 0.774286 3 768.867830 99.220591
8 0.408501 3 768.867830 52.347246
The dataframe will consist of hundreds of rows. I've been trying various groupby, apply combinations but I'm completely stumped.
Thanks
Here is a solution:
df['D'] = df['A'] * df['C'] / 6
groups = df['B'].unique()
for prev, i in zip(groups, groups[1:]):
    # new C = last C of the previous group + the sum of the previous group's D
    prev_rows = df['B'] == prev
    df.loc[df['B'] == i, 'C'] = df.loc[prev_rows, 'C'].iloc[-1] + df.loc[prev_rows, 'D'].sum()
    df.loc[df['B'] == i, 'D'] = df['A'] * df['C'] / 6
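Since D = A*C/6, each new C is the previous C plus the previous group's D sum, i.e. C_next = C_prev * (1 + sum(A_group)/6). That lets you drop the loop entirely; a vectorized sketch, assuming the initial C of 500:
factors = 1 + df.groupby('B')['A'].sum() / 6      # per-group growth factor
c_per_group = 500 * factors.cumprod().shift(fill_value=1.0)
df['C'] = df['B'].map(c_per_group)
df['D'] = df['A'] * df['C'] / 6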
You can use numpy.unique() for the selection. In your code this might look somehow like this:
import math
import numpy as np

unique, indices, counts = np.unique(df['B'], return_index=True, return_counts=True)
for i in range(len(indices)):
    for j in range(counts[i]):
        idx = indices[i] + j
        if i > 0 and math.isnan(df.loc[idx, 'C']):
            # carry the 'D' value over from the previous group
            df.loc[idx, 'C'] = df.loc[indices[i - 1], 'D']
        # then call your function on the row
        df.loc[idx, 'D'] = function1(df.loc[idx])
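Note that this still iterates row by row in Python, which is fine for hundreds of rows but slow for large frames; the closed-form groupby sketch above avoids the per-row work.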
Suppose I have a dataframe:
a b c
0 1 2 NaN
1 2 NaN 4
3 NaN 4 NaN
I want to check for NaN in only some particular columns and want the resulting dataframe as:
a b c
0 1 2 NaN
3 NaN 4 NaN
Here I want to check for NaN in only column 'a' and column 'c'.
How can this be done?
You could do that with the isnull and any methods:
In [264]: df
Out[264]:
a b c
0 1 2 NaN
1 2 NaN 4
2 NaN 4 NaN
In [265]: df[df.isnull().any(axis=1)]
Out[265]:
a b c
0 1 2 NaN
1 2 NaN 4
2 NaN 4 NaN
Note that this checks every column, so row 1 (NaN only in b) is included too.
Note: if you just want the rows without any NaN at all, you could use the dropna method.
EDIT
If you want to check only a subset of the columns, build the mask from those columns and use it to index the whole dataframe:
df_subset = df[['a', 'c']]
In [282]: df[df_subset.isnull().any(axis=1)]
Out[282]:
a b c
0 1 2 NaN
2 NaN 4 NaN
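Or in one line, selecting the columns inline:
df[df[['a', 'c']].isnull().any(axis=1)]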