In pandas, filter for duplicate values appearing in 1 of 2 different columns, for list of certain values only - python

zed = pd.DataFrame(data = { 'date': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05'], 'a': [1, 5, 7, 3, 4], 'b': [3, 4, 9, 12, 5] })
How can this dataframe be filtered to keep the earliest row (earliest == lowest date) for each of the 3 values 1, 5, 4 appearing in either column a or column b? In this example, the rows with dates '2022-03-01' and '2022-03-02' would be kept, as they are the lowest dates at which each of the 3 values appears.
We have tried zed[zed.isin({'a': [1, 5, 4], 'b': [1, 5, 4]}).any(1)].sort_values(by=['date']), but this returns an incorrect result: it keeps 3 rows instead of the 2 wanted.

Without reshaping your dataframe, you can use:
idx = max([zed[['a', 'b']].eq(i).sum(axis=1).idxmax() for i in [1, 5, 4]])
out = zed.loc[:idx]
Output:
>>> out
date a b
0 2022-03-01 1 3
1 2022-03-02 5 4
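For a variant that does not depend on the row order, here is a minimal sketch (not part of the answer above; the names long and first_dates are only illustrative) that melts the two columns into long form, keeps only the target values, and takes the earliest date per value:
import pandas as pd

zed = pd.DataFrame({'date': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05'],
                    'a': [1, 5, 7, 3, 4],
                    'b': [3, 4, 9, 12, 5]})

# reshape a/b into one long column so each (date, value) pair becomes a row
long = zed.melt(id_vars='date', value_vars=['a', 'b'], value_name='val')

# earliest date per target value (ISO date strings sort chronologically)
first_dates = long[long['val'].isin([1, 5, 4])].groupby('val')['date'].min()

out = zed[zed['date'].isin(first_dates)]
print(out)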

You can reshape with DataFrame.stack, which makes it possible to filter by the list and drop duplicates:
s = zed.set_index('date')[['a','b']].stack()
idx = s[s.isin([1, 5, 4])].drop_duplicates().index.remove_unused_levels().levels[0]
print (idx)
Index(['2022-03-01', '2022-03-02'], dtype='object', name='date')
out = zed[zed['date'].isin(idx)]
print (out)
date a b
0 2022-03-01 1 3
1 2022-03-02 5 4
Or filter the first index value matching each condition, get the unique values, and select the rows with DataFrame.loc:
L = [1, 5, 4]
idx = pd.unique([y for x in L for y in zed[zed[['a', 'b']].eq(x).any(axis=1)].index[:1]])
df = zed.loc[idx]
print (df)
date a b
0 2022-03-01 1 3
1 2022-03-02 5 4

Related

pandas groupby ID and select row with minimal value of specific columns

I want to select the whole row in which the minimal value of 3 selected columns is found, in a dataframe like this:
It is supposed to look like this afterwards:
I tried something like
dfcheckminrow = dfquery[dfquery == dfquery['A':'C'].min().groupby('ID')]
Obviously it didn't work out well.
Thanks in advance!
Bkeesey's answer looks like it almost got you to your solution. I added one more step to get the overall minimum for each group.
import pandas as pd
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3],
                   'C': [1, 2, 3, 4, 5, 6],
                   })
# set "ID" as the index
df = df.set_index('ID')
# get the minimum of each column within each "ID" group
mindf = df[['A','B']].groupby('ID').transform('min')
# get the min between columns and add it to df
df['min'] = mindf.apply(min, axis=1)
# filter df for when A or B matches the min
df2 = df.loc[(df['A'] == df['min']) | (df['B'] == df['min'])]
print(df2)
In my simplified example, I'm just finding the minimum between columns A and B. Here's the output:
      A    B  C  min
ID
1    14    1  2    1
2   100    2  3    2
3     1  100  5    1
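If only one row per ID is wanted (the row holding the overall minimum of the selected columns, with ties broken by the first occurrence), a minimal sketch using idxmin could look like this; row_min and idx are illustrative names, not from the answer above:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3],
                   'C': [1, 2, 3, 4, 5, 6]})

# row-wise minimum across the selected columns
row_min = df[['A', 'B', 'C']].min(axis=1)

# index label of the smallest row within each ID group
idx = row_min.groupby(df['ID']).idxmin()

out = df.loc[idx]
print(out)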
One method to filter the initial DataFrame based on a groupby condition could be to use transform to find the minimum for an "ID" group and then use loc to filter the initial DataFrame where any(axis=1) (checking rows) is met.
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3]})
# set "ID" as the index
df = df.set_index('ID')
Sample df:
     A    B
ID
1   30   10
1   14    1
2  100    2
2   67    5
3    1  100
3   20    3
Use groupby and transform to find the minimum value for each "ID" group.
Then use loc to filter the initial df to the rows where any(axis=1) holds:
df.loc[(df == df.groupby('ID').transform('min')).any(axis=1)]
Output:
     A    B
ID
1   14    1
2  100    2
2   67    5
3    1  100
3   20    3
In this example only the first row is removed, as neither of its values equals its column's minimum within the "ID" group.

pandas groupby only aggregating rows that are common between two consecutive fields that are grouped

I am trying to calculate a sum for each date field, however I only want to calculate the sum of IDs that are in both the current and next date, so a rolling comparison of IDs and then a groupby sum. Currently I have to loop over the dataframe which is very slow.
For example my df:
df = pd.DataFrame({
    'Date': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'ID': [1, 2, 3, 4, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
})
Ideally I want to group the dataframe by Date and only sum the IDs that are common between two dates, for example below. However this is very slow.
tmpL = df.groupby('Date')['ID'].apply(list)
tmpV = df.groupby('Date')['Value'].sum()
for i in range(1, tmpL.shape[0]):
    res = list(set(tmpL.iloc[i]) - set(tmpL.iloc[i - 1]))
    v = df.loc[df.ID.isin(res) & (df.Date == tmpL.index[i]), 'Value'].sum()
    tmpV.iloc[i] = tmpV.iloc[i] - v
tmpV
Date
1 10
2 18
3 27
4 42
Name: Value, dtype: int64
Is there a way to do this in pandas without looping over the dataframe?
Use DataFrame.pivot_table with aggfunc='sum', detect where an ID's presence changes with DataFrame.notna and DataFrame.diff, and finally pass that mask to DataFrame.mask before summing:
df1 = df.pivot_table(index='Date', columns='ID', values='Value', aggfunc='sum')
s = df1.mask(df1.notna().diff().fillna(False)).sum(axis=1)
print (s)
Date
1 10.0
2 18.0
3 27.0
4 42.0
dtype: float64
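For intuition, here is the same approach restated with comments on the sample data from the question (this is only a walk-through of the solution above, not a different method):
import pandas as pd

df = pd.DataFrame({
    'Date': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'ID': [1, 2, 3, 4, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
})

# one column per ID; an ID that is absent on a date shows up as NaN
df1 = df.pivot_table(index='Date', columns='ID', values='Value', aggfunc='sum')

# True where an ID's presence changed compared with the previous date
changed = df1.notna().diff().fillna(False)

# drop the values whose ID just appeared or disappeared, then sum per date
print(df1.mask(changed).sum(axis=1))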
First solution, which I think is slower:
You can get the IDs that newly appear on each date by converting each group to a set, then using Series.diff and Series.explode, match them back to their values with DataFrame.merge, and finally aggregate the sum and subtract it:
tmpL = (df.groupby('Date')['ID'].apply(set)
          .diff()
          .explode()
          .reset_index()
          .merge(df)
          .groupby('Date')['Value']
          .sum())
tmpV = df.groupby('Date')['Value'].sum()
out = tmpV.sub(tmpL, fill_value=0)
print (out)
Date
1 10.0
2 18.0
3 27.0
4 42.0
Try:
df = df.pivot_table(index='Date', columns='ID', values='Value')#.reset_index()
condition = df.notna() & df.notna().shift(1)
condition.iloc[0,:]=True
print(df[condition].sum(axis=1))
Output:
Date
1 10.0
2 18.0
3 27.0
4 42.0
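A small variation on the last idea, offered as a sketch rather than as part of the answer above: let shift fill the first row with True instead of setting it by hand, assuming the first date should keep all of its IDs:
import pandas as pd

df = pd.DataFrame({
    'Date': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'ID': [1, 2, 3, 4, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
})

pivot = df.pivot_table(index='Date', columns='ID', values='Value')
present = pivot.notna()

# keep an ID only if it is present on both the current and the previous date;
# fill_value=True lets the first date keep everything
keep = present & present.shift(1, fill_value=True)
print(pivot[keep].sum(axis=1))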

groupby count of values not equal to other col value pandas

I'm aiming to compute a groupby count of values, but only considering rows where Item and Item2 are different. The following achieves this, but it drops a group entirely if none of its rows differ. If rows are present but Item and Item2 are identical in all of them, I'm hoping to return 0.
import pandas as pd
df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
    'Value': [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = df[df['Item'] != df['Item2']].groupby(['Time']).size().reset_index(name='count')
Intended Output:
Time count
0 1 4
1 2 3
2 3 0
3 4 2
Edit 2:
df = pd.DataFrame({
    'Time': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '3', '4', '4', '4'],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
    'Value': [2, 6, 6, 5, 3, 3, 4, 6, 5, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = (df.assign(new=df['Item'] != df['Item2'])
         .groupby('Time')['new']
         .mean()
         .reset_index(name='avg'))
Intended Output:
Time avg
0 1 3.0
1 2 5.0
2 3 0.0
3 4 2.5
The idea is not to filter, but to count the True values per group with sum; here the Series df['Time'] is passed to groupby:
df1 = (df['Item'] != df['Item2']).groupby(df['Time']).sum().reset_index(name='count')
print (df1)
Time count
0 1 4
1 2 3
2 3 0
3 4 2
Another similar solution is to create a new helper column and aggregate it:
df1 = (df.assign(new=df['Item'] != df['Item2'])
         .groupby('Time')['new']
         .sum()
         .reset_index(name='count'))
EDIT: You can replace the non-matching values with missing values using Series.where and then fill the missing values with fillna:
df1 = (df.assign(new=df['Value'].where(df['Item'] != df['Item2']))
         .groupby('Time')['new']
         .mean()
         .fillna(0)
         .reset_index(name='avg'))
print (df1)
Time avg
0 1 3.0
1 2 5.0
2 3 0.0
3 4 2.5
An alternative is to use Series.reindex with the unique values of the original Time column:
df1 = (df[df['Item'] != df['Item2']]
         .groupby(['Time'])['Value']
         .mean()
         .reindex(df['Time'].unique(), fill_value=0)
         .reset_index(name='avg'))
Have a look at pivot tables in pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A'],
    'Value': [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5],
})
# this gives you just the rows where there is a difference
df2 = df[df['Item'] != df['Item2']]
# then count the remaining rows for each Time
pd.pivot_table(df2, index='Time', aggfunc='count')
This gives you the table:
      Item  Item2  Value
Time
1        4      4      4
2        3      3      3
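If groups with no differing rows (such as Time 3) should still appear with a count of 0, one possible tweak, shown here only as a sketch that reuses the df defined in the block above rather than as part of that answer, is to reindex on the original Time values:
# count differing rows per Time, then bring back Times that were filtered
# out entirely (here Time 3) and give them a count of 0
counts = (df[df['Item'] != df['Item2']]
          .groupby('Time')
          .size()
          .reindex(df['Time'].unique(), fill_value=0)
          .reset_index(name='count'))
print(counts)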

Pandas - add a row at the end of a for loop iteration

So I have a for loop that gets a series of values and makes some tests:
list = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(columns=['columnX', 'columnY', 'columnZ'])
for value in list:
    if value > 3:
        df['columnX'] = "A"
    else:
        df['columnX'] = "B"
        df['columnZ'] = "Another value only to be filled in this condition"
    df['columnY'] = value - 1
How can I do this and keep all the values in a single row for each loop iteration no matter what's the if outcome? Can I keep some columns empty?
I mean something like the following process:
[create empty row] -> [process] -> [fill column X] -> [process] -> [fill column Y if true] ...
Like:
[index columnX columnY columnZ]
[0 A 0 NULL ]
[1 A 1 NULL ]
[2 B 2 "..." ]
[3 B 3 "..." ]
[4 B 4 "..." ]
I am not sure I understand exactly, but I think this may be a solution:
list = [1, 2, 3, 4, 5, 6]
d = {'columnX': [], 'columnY': []}
for value in list:
    if value > 3:
        d['columnX'].append("A")
    else:
        d['columnX'].append("B")
    d['columnY'].append(value - 1)
df = pd.DataFrame(d)
For the second question, just add another condition:
list = [1, 2, 3, 4, 5, 6]
d = {'columnX': [], 'columnY': [], 'columnZ': []}
for value in list:
    if value > 3:
        d['columnX'].append("A")
    else:
        d['columnX'].append("B")
    if condition:
        d['columnZ'].append(xxx)
    else:
        d['columnZ'].append(None)
    d['columnY'].append(value - 1)  # keep the three lists the same length
df = pd.DataFrame(d)
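Another possible approach, offered only as a sketch and not taken from the answers here: build a list of per-row dicts, so a column can simply be omitted for the rows where it does not apply and pandas fills the gap with NaN (rows and row are illustrative names):
import pandas as pd

values = [1, 2, 3, 4, 5, 6]
rows = []
for value in values:
    row = {'columnY': value - 1}
    if value > 3:
        row['columnX'] = "A"
    else:
        row['columnX'] = "B"
        row['columnZ'] = "Another value only to be filled in this condition"
    rows.append(row)

# missing keys (columnZ for the "A" rows) automatically become NaN
df = pd.DataFrame(rows, columns=['columnX', 'columnY', 'columnZ'])
print(df)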
Based on the example you have given, I have changed your code a bit to achieve the result you shared:
list = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(columns=['columnX', 'columnY', 'columnZ'])
for index, value in enumerate(list):
    temp = []
    if value > 3:
        # df['columnX'] = "A"
        temp.append("A")
        temp.append(None)
    else:
        # df['columnX'] = "B"
        temp.append("B")
        temp.append("Another value")  # or you can add any conditions
    # df['columnY'] = value - 1
    temp.append(value - 1)
    df.loc[index] = temp
print(df)
This produces the result:
  columnX        columnY  columnZ
0       B  Another value      0.0
1       B  Another value      1.0
2       B  Another value      2.0
3       A           None      3.0
4       A           None      4.0
5       A           None      5.0
df.index is printed as : Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
You may just prepare/initialize your DataFrame with an index sized to the input list, then leverage the np.where routine:
In [111]: lst = [1, 2, 3, 4, 5, 6]
...: df = pd.DataFrame(columns=['columnX','columnY', 'columnZ'], index=range(len(lst)))
In [112]: int_arr = np.array(lst)
In [113]: df['columnX'] = np.where(int_arr > 3, 'A', 'B')
In [114]: df['columnZ'] = np.where(int_arr > 3, df['columnZ'], '...')
In [115]: df['columnY'] = int_arr - 1
In [116]: df
Out[116]:
  columnX columnY columnZ
0       B       0     ...
1       B       1     ...
2       B       2     ...
3       A       3     NaN
4       A       4     NaN
5       A       5     NaN

How to iterate a vectorized if/else statement over additional columns?

import pandas as pd, numpy as np
ltlist = [1, 2]
org = pd.DataFrame({'ID': [1, 3, 4, 5, 6, 7], 'ID2': [3, 4, 5, 6, 7, 2]})
ltlist_set = set(ltlist)
org['LT'] = np.where(org['ID'].isin(ltlist_set), org['ID'], 0)
I'll need to check the ID2 column and write the ID in, unless it already has an ID.
Output:
ID  ID2  LT
 1    3   1
 3    4   0
 4    5   0
 5    6   0
 6    7   0
 7    2   2
Thanks!
Option 1
You can nest numpy.where statements:
org['LT'] = np.where(org['ID'].isin(ltlist_set), 1,
                     np.where(org['ID2'].isin(ltlist_set), 2, 0))
Option 2
Alternatively, you can use pd.DataFrame.loc sequentially (the ID condition is assigned last, so it takes precedence where both columns match):
org['LT'] = 0 # default value
org.loc[org['ID2'].isin(ltlist_set), 'LT'] = 2
org.loc[org['ID'].isin(ltlist_set), 'LT'] = 1
Option 3
A third option is to use numpy.select:
conditions = [org['ID'].isin(ltlist_set), org['ID2'].isin(ltlist_set)]
values = [1, 2]
org['LT'] = np.select(conditions, values, 0) # 0 is default value
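If the goal is to write the matching value itself into LT (as the question's wording suggests) rather than the fixed labels 1 and 2, a possible numpy.select variant, sketched under the assumption that org and ltlist_set are the DataFrame and set defined in the question above, is:
conditions = [org['ID'].isin(ltlist_set), org['ID2'].isin(ltlist_set)]
choices = [org['ID'], org['ID2']]  # take the value from whichever column matched
org['LT'] = np.select(conditions, choices, 0)
For the sample data this reproduces the expected output shown in the question.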
