Dataframe groupby when specific values are encountered on a given row - python

I have a dataframe that I would like to group (or slice). The dataframe is of the form:
A  B  C
a  b  1
a  b  0
a  b  1
a  b  2
a  b  0
a  e  3
a  e  3
f  g  6
f  g  7
f  g  0
I would like to first group the dataframe on columns A and B. Then, each group is further split into smaller groups of consecutive rows whenever a certain value is encountered. For example, after grouping by columns A and B, I would like to start a new third-level group each time I encounter a 0 in column C. So the grouped dataframe looks like:
A  B  C
a  b  1
a  b  0

a  b  1
a  b  2
a  b  0

a  e  3
a  e  3

f  g  6
f  g  7
f  g  0
Grouping a dataframe by column values, like columns A and B in the example, is simple, but I don't know how to further group on the third level into consecutive rows with certain cut points. Thank you in advance if you can help.

To do so, the approach is always the same: create an extra column (or sometimes several) that encodes your specific grouping logic, then group on it:
df['cut_point'] = (df.C==0).cumsum().shift().fillna(0)
df.groupby(['A', 'B', 'cut_point']).groups
Out[141]:
{('a', 'b', 0.0): Int64Index([0, 1], dtype='int64'),
('a', 'b', 1.0): Int64Index([2, 3, 4], dtype='int64'),
('a', 'e', 2.0): Int64Index([5, 6], dtype='int64'),
('f', 'g', 2.0): Int64Index([7, 8, 9], dtype='int64')}
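As a sketch of how you might consume the result (assuming the same data and cut_point logic as above), iterating over the grouped object yields each consecutive run as its own DataFrame:
import pandas as pd

df = pd.DataFrame({'A': ['a'] * 7 + ['f'] * 3,
                   'B': ['b'] * 5 + ['e'] * 2 + ['g'] * 3,
                   'C': [1, 0, 1, 2, 0, 3, 3, 6, 7, 0]})

# the group id increases on the row after each 0, so every 0 closes its group
df['cut_point'] = (df.C == 0).cumsum().shift().fillna(0)

for key, sub in df.groupby(['A', 'B', 'cut_point']):
    print(key, sub['C'].tolist())
# ('a', 'b', 0.0) [1, 0]
# ('a', 'b', 1.0) [1, 2, 0]
# ('a', 'e', 2.0) [3, 3]
# ('f', 'g', 2.0) [6, 7, 0]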

Related

Create and populate dataframe column simulating (excel) vlookup function

I am trying to create a new column in a dataframe and populate it with a value taken from another dataframe's column, matched on a column the two dataframes have in common.
DF1        DF2
A  B       W  B
———        ———
Y  2       X  2
N  4       F  4
Y  5       T  5
I thought the following could do the trick.
df2['new_col'] = df1['A'] if df1['B'] == df2['B'] else "Not found"
So the result should be:
DF2
W  B  new_col
X  2  Y        -> because DF1['B'] == 2 and the value of A in that row is Y
F  4  N
T  5  Y
but I get the error below. I believe that is because the dataframes are different sizes?
raise ValueError("Can only compare identically-labeled Series objects")
Can you help me understand what am I doing wrong and what is the best way to achieve what I am after?
Thank you in advance.
UPDATE 1
Trying Corralien's solution I still get the following:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
This is the code I wrote:
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2.reset_index().merge(df1.reset_index(), on=['b'], how='left') \
   .drop(columns='index').rename(columns={'One': 'new_col'})
UPDATE 2
Here is the second option, but it does not seem to add the columns to df2.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2 = df2.set_index('b', append=True).join(df1.set_index('b', append=True)) \
         .reset_index('b').rename(columns={'One': 'new_col'})
print(df2)
   b  a  c new_col Three
0  2  1  3     NaN   NaN
1  5  4  6     NaN   NaN
2  8  7  9     NaN   NaN
Why is the code above not working?
Your question is not clear: why is F associated with N and T with Y? Why not F with Y and T with N?
Using merge:
>>> df2.merge(df1, on='B', how='left')
   W  B  A
0  X  2  Y
1  F  4  N  # What you want
2  F  4  Y  # Another solution
3  T  4  N  # What you want
4  T  4  Y  # Another solution
How do you decide on the right value? With row index?
Update
So you need to use the index position:
>>> df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
...    .drop(columns='index').rename(columns={'A': 'new_col'})
   W  B new_col
0  X  2       Y
1  F  4       N
2  T  4       Y
In fact you can consider the column B as an additional index of each dataframe.
Using join:
>>> df2.set_index('B', append=True).join(df1.set_index('B', append=True)) \
...    .reset_index('B').rename(columns={'A': 'new_col'})
   B  W new_col
0  2  X       Y
1  4  F       N
2  4  T       Y
Setup:
df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   columns=['a', 'b', 'c'])
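On the ValueError in UPDATE 1: wrapping mixed types in np.array coerces every df1 column to strings (object dtype), while df2's b column stays int64, so the merge keys have incompatible dtypes. A minimal sketch of one way to align them, assuming integer keys are what you want:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

print(df1['b'].dtype, df2['b'].dtype)  # object int64

# cast the key column back to int so both sides of the merge match
df1['b'] = df1['b'].astype(int)

out = df2.merge(df1[['b', 'One']], on='b', how='left') \
         .rename(columns={'One': 'new_col'})
print(out)
#    a  b  c new_col
# 0  1  2  3       x
# 1  4  5  6       y
# 2  7  8  9       z
This is also why the Setup above builds the frames from plain lists instead of np.array: the columns then keep their natural dtypes.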

column is not getting dropped

Why is column A not getting dropped in the train, valid, and test dataframes?
import pandas as pd

train = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9], 'C': ['a', 'b', 'c', 'd', 'e']})
test = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9], 'C': ['a', 'b', 'c', 'd', 'e']})
valid = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9], 'C': ['a', 'b', 'c', 'd', 'e']})

for df in [train, valid, test]:
    df = df.drop(['A'], axis=1)

print('A' in train.columns)
print('A' in test.columns)
print('A' in valid.columns)
# True
# True
# True
You can use the inplace=True parameter, because DataFrame.drop can also work in place:
for df in [train, valid, test]:
    df.drop(['A'], axis=1, inplace=True)

print('A' in train.columns)
# False
print('A' in test.columns)
# False
print('A' in valid.columns)
# False
The reason the column is not removed is that df is never assigned back: df = df.drop(...) only rebinds the loop variable to a new DataFrame, so the original DataFrames are unchanged.
Another idea is to create a list of DataFrames and assign each changed DataFrame back:
L = [train, valid, test]
for i in range(len(L)):
    L[i] = L[i].drop(['A'], axis=1)
print(L)
[ B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e]
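If you would rather keep the original variable names, rebinding the names themselves also works; a small sketch (not from the answers above):
import pandas as pd

train = pd.DataFrame({'A': [0, 1], 'B': [5, 6]})
valid = pd.DataFrame({'A': [0, 1], 'B': [5, 6]})
test = pd.DataFrame({'A': [0, 1], 'B': [5, 6]})

# drop returns new DataFrames; rebind each name explicitly
train, valid, test = (df.drop(columns='A') for df in (train, valid, test))

print('A' in train.columns)  # False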

How to get index for all the duplicates in a dataframe (pandas - python)

I have a data frame with multiple columns, and I want to find the duplicates in some of them. My columns go from A to Z. I want to know which lines have the same values in columns A, D, F, K, L, and G.
I tried:
df = df[df.duplicated(keep=False)]
df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
However, this uses all of the columns.
I also tried
print(df[df.duplicated(['A', 'D', 'F', 'K', 'L', 'P'])])
This only returns the index of the later duplicates. I want the indices of all rows that share the same values.
Your final attempt is close. Instead of grouping by all columns, just use a list of the ones you want to consider:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [3, 3, 3, 4, 4, 5],
                   'C': [6, 7, 8, 9, 10, 11]})

res = df.groupby(['A', 'B']).apply(lambda x: x.index.tolist()).reset_index()
print(res)
#    A  B          0
# 0  1  3  [0, 1, 2]
# 1  2  4     [3, 4]
# 2  2  5        [5]
A different layout, grouping the index itself:
df.index.to_series().groupby([df['A'], df['B']]).apply(list)
Out[449]:
A  B
1  3    [0, 1, 2]
2  4       [3, 4]
   5          [5]
dtype: object
You can also have .groupby return a dict whose keys are the group labels (tuples for multiple columns) and whose values are the row Index:
df.groupby(['A', 'B']).groups
#{(1, 3): Int64Index([0, 1, 2], dtype='int64'),
# (2, 4): Int64Index([3, 4], dtype='int64'),
# (2, 5): Int64Index([5], dtype='int64')}
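Applied back to the original goal, duplicated with keep=False plus a subset list gives the index of every row that shares values in the chosen columns; a minimal sketch on the same toy data:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [3, 3, 3, 4, 4, 5],
                   'C': [6, 7, 8, 9, 10, 11]})

# keep=False marks every member of a duplicated group, not only the later ones
mask = df.duplicated(subset=['A', 'B'], keep=False)
print(df.index[mask].tolist())  # [0, 1, 2, 3, 4]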

Pandas Column Names of MultiIndex DataFrame - strange behaviour

I observed some strange pandas behavior with the columns of a MultiIndex DataFrame.
Constructing a MultiIndex DataFrame:
a = [0, .25, .5, .75]
b = [1, 2, 3, 4]
c = [5, 6, 7, 8]
d = [1, 2, 3, 5]
df = pd.DataFrame(data={('a', 'a'): a, ('b', 'b'): b, ('c', 'c'): c, ('d', 'd'): d})
produces this DataFrame:
      a  b  c  d
      a  b  c  d
0  0.00  1  5  1
1  0.25  2  6  2
2  0.50  3  7  3
3  0.75  4  8  5
Creating a new variable with a subset of the original dataFrame
df1=df.copy().loc[:,[('a', 'a'), ('b', 'b')]]
produces, as expected:
      a  b
      a  b
0  0.00  1
1  0.25  2
2  0.50  3
3  0.75  4
but accessing the column names of this new dataFrame produces some unexpected output:
print df1.columns
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [u'a', u'b', u'c', u'd']],
           labels=[[0, 1], [0, 1]])
so 'c' and 'd' are still contained in the levels.
In contrast
print df1.columns.tolist()
returns like expected:
[('a', 'a'), ('b', 'b')]
Can anybody explain the reason for this behavior?
I think you need MultiIndex.remove_unused_levels, which is a new function in version 0.20.0. See the docs.
print (df1.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']],
           labels=[[0, 1], [0, 1]])

print (df1.columns.remove_unused_levels())
MultiIndex(levels=[['a', 'b'], ['a', 'b']],
           labels=[[0, 1], [0, 1]])
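Note that remove_unused_levels returns a new index rather than modifying the existing one, so assign the result back if you want df1 itself cleaned up:
df1.columns = df1.columns.remove_unused_levels()
Slicing keeps the parent levels by design; the labels (codes), not the levels, determine which columns are actually present, which is why tolist() already showed only the selected pairs.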

Selecting from multi-level groupby in pandas

Let's say I have two dataframes: df with columns ('a', 'b', 'c') and tf with columns ('a', 'b'). I do a group-combine on the two common columns in df:
grouped_sum = df.groupby(('a', 'b')).sum()
How can I "add" the column c to tf according to grouped_sum, i.e.
tf[i]['c'] = grouped_sum[tf[i]['a'], tf[i]['b']]
for all rows i of the second data frame? For a groupby with a single level it works simply by indexing the group with the corresponding column of tf.
If you groupby with as_index=False you can merge with tf:
In [11]: tf = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))

In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 4], [3, 4, 5]], columns=list('abc'))

In [13]: grouped_sum = df.groupby(['a', 'b'], as_index=False).sum()

In [14]: grouped_sum
Out[14]:
   a  b  c
0  1  2  7
1  3  4  5

In [15]: tf.merge(grouped_sum)  # this won't always be the same as grouped_sum!
Out[15]:
   a  b  c
0  1  2  7
1  3  4  5
Another option is to set a and b as the index of tf, as sketched below.
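A sketch of that index-based alternative, reusing df and tf from above: the default groupby (without as_index=False) already puts (a, b) in the index, and join can look the pairs up directly:
import pandas as pd

tf = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
df = pd.DataFrame([[1, 2, 3], [1, 2, 4], [3, 4, 5]], columns=list('abc'))

grouped_sum = df.groupby(['a', 'b']).sum()  # index is the MultiIndex (a, b)

# join matches each (a, b) pair of tf against grouped_sum's index
tf = tf.join(grouped_sum, on=['a', 'b'])
print(tf)
#    a  b  c
# 0  1  2  7
# 1  3  4  5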
