How to select rows with a certain pattern - Python

I'm stuck on a problem and can't find a solution. I have the following sample:
data = [['John', 6, 'A'], ['Paul', 6, 'D'],
        ['Juli', 9, 'D'], ['Geeta', 4, 'A'],
        ['Jay', 6, 'D'], ['Sara', 6, 'A'],
        ['Mario', 3, 'D'], ['Peter', 4, 'A'],
        ['Jin', 4, 'D'], ['Carl', 6, 'A']]
df = pd.DataFrame(data, columns=['Name', 'Number', 'Label'])
I previously sorted by number with the following line of code:
df = df.sort_values('Number').reset_index(drop=True)
and got this output:
Name Number Label
Mario 3 D
Geeta 4 A
Peter 4 A
Jin 4 D
John 6 A
Paul 6 D
Jay 6 D
Sara 6 A
Carl 6 A
Juli 9 D
I want to select pairs of rows where a row with 'A' in the last column is immediately followed by a row with 'D' in the last column, and both rows belong to the same group (I don't want the last 'A' of one group paired with the first 'D' of the next group). So the solution to the problem is:
Name Number Label
Peter 4 A
Jin 4 D
John 6 A
Paul 6 D
Can anyone help me?

You need to use:
# is the row's label A?
m1 = df['Label'].eq('A')
# is the next row's label D?
m2 = df['Label'].shift(-1).eq('D')
# create a mask combining both conditions
mask = m1 & m2
# select the matching rows and the next one (boolean OR with the shifted mask)
df[mask | mask.shift()]
output:
    Name  Number Label
2  Peter       4     A
3    Jin       4     D
4   John       6     A
5   Paul       6     D
8   Carl       6     A
9   Juli       9     D
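One detail worth knowing: mask.shift() introduces a NaN at the first position, so the combined mask is no longer a clean boolean Series. If you want to keep it strictly boolean (assuming pandas >= 0.24), shift with a fill value:
mask = m1 & m2
# fill_value=False avoids the NaN a plain shift() would put at row 0
df[mask | mask.shift(fill_value=False)]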
Update: match on the group
As your rows are sorted per group, you can add another condition:
m1 = df['Label'].eq('A')
m2 = df['Label'].shift(-1).eq('D')
# the next row must belong to the same Number group
m3 = df['Number'].eq(df['Number'].shift(-1))
mask = m1 & m2 & m3
df[mask | mask.shift()]
output:
Name Number Label
2 Peter 4 A
3 Jin 4 D
4 John 6 A
5 Paul 6 D
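An equivalent way to express the group condition, assuming the frame is sorted as above: do the shift per group, so the last row of each group compares against NaN and cross-group pairs never match.
m1 = df['Label'].eq('A')
# shifting within each Number group yields NaN at group edges
m2 = df.groupby('Number')['Label'].shift(-1).eq('D')
mask = m1 & m2
df[mask | mask.shift(fill_value=False)]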

def function1(dd: pd.DataFrame):
    # indices of 'D' rows whose previous row within the group is 'A'
    idx = dd[(dd.Label == 'D') & (dd.Label.shift() == 'A')].index
    # keep each matching 'D' row together with the 'A' row just before it
    return dd.loc[idx.union(idx - 1).sort_values()]

df.groupby('Number').apply(function1).reset_index(drop=True)
Name Number Label
0 Peter 4 A
1 Jin 4 D
2 John 6 A
3 Paul 6 D
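A usage note on the groupby/apply above: apply builds a MultiIndex of (Number, original index), which is why reset_index(drop=True) is needed; passing group_keys=False instead keeps the original index, if you prefer that:
df.groupby('Number', group_keys=False).apply(function1)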

Related

Pandas dataframe: inconsistent data in multiple rows

I have something like that:
>>> x = {'id': [1,1,2,2,2,3,4,5,5], 'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']}
>>> df = pd.DataFrame(x)
>>> df
id value
0 1 a
1 1 a
2 2 b
3 2 b
4 2 c
5 3 d
6 4 e
7 5 f
8 5 g
I want to filter inconsistent values in this table. For example, the rows with id=2 or id=5 are inconsistent, because the same id is associated with different values. I have read solutions about where or any, but none of them do something like "check whether rows with this id always have the same value".
How can I solve this problem?
You can use groupby and filter. This should give you the ids with inconsistent values.
df.groupby('id').filter(lambda x: x.value.nunique()>1)
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
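Flipping the condition gives you the consistent ids instead, if that is what you need downstream:
df.groupby('id').filter(lambda x: x.value.nunique() == 1)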
In your case we do groupby + transform with nunique:
unc_df = df[df.groupby('id').value.transform('nunique').ne(1)]
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
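Since transform returns a mask aligned with the original frame, you can also reuse it to split the data in one pass (a small sketch building on the line above):
mask = df.groupby('id')['value'].transform('nunique').eq(1)
consistent = df[mask]
inconsistent = df[~mask]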
I guess you can use drop_duplicates to drop repeated rows based on the id column:
In [599]: df.drop_duplicates('id', keep='first')
Out[599]:
id value
0 1 a
2 2 b
5 3 d
6 4 e
7 5 f
The above picks the first value for each duplicated id, so you end up with one row per id in the resulting dataframe.
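A related sketch in the same drop_duplicates spirit: drop exact (id, value) duplicates first, then keep the ids that still repeat - those are exactly the inconsistent rows.
# exact duplicates collapse to one row; ids left with >1 row carry conflicting values
dedup = df.drop_duplicates()
dedup[dedup.duplicated('id', keep=False)]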

Subtract rows from two dataframes based on index value

I have two dataframes:
df1 = pd.DataFrame({
    'Name' : ['A', 'A', 'A', 'A', 'B', 'B'],
    'Value': [10, 9, 8, 10, 99, 88],
    'Day'  : [1, 2, 3, 4, 1, 2]
})
df2 = pd.DataFrame({
    'Name' : ['C', 'C', 'C', 'C'],
    'Value': [1, 2, 3, 4],
    'Day'  : [1, 2, 3, 4]
})
I would like to subtract the values in df1 with the values in df2 based on the day and create a new dataframe called delta_values. If there are no entries for the day then no action should occur.
To explain further: B in the Name column only has values for days 1 and 2, so df2's values for days 1 and 2 should be subtracted from B's values for those days; since B has no values for days 3 and 4, no arithmetic should occur there. I am having trouble with this part.
The output I am looking for is:
  Name  Value  Day
0    A      9    1
1    A      7    2
2    A      5    3
3    A      6    4
4    B     98    1
5    B     86    2
If nothing better comes to somebody's mind, here's a correct but not very elegant solution:
result = df1.set_index(['Day', 'Name']).unstack()['Value'] \
            .subtract(df2.set_index('Day')['Value'], axis=0) \
            .stack().reset_index()
Unstacking produces one column per Name indexed by Day, so subtract(..., axis=0) aligns on Day and leaves NaN wherever a Name has no value for that Day; stack() then drops those NaNs.
Make the result look like the expected output:
result.columns = 'Day', 'Name', 'Value'
result.Value = result.Value.astype(int)
result.sort_values(['Name', 'Day'], inplace=True)
result = result[['Name', 'Value', 'Day']]
We can merge the two DataFrame's on the Day column and then subtract from there.
merged = df1.merge(df2, how='inner', on='Day', suffixes=('', '_y'))
print(merged)
Name Value Day Name_y Value_y
0 A 10 1 C 1
1 A 9 2 C 2
2 A 8 3 C 3
3 A 10 4 C 4
4 B 99 1 C 1
5 B 88 2 C 2
delta_values = df1.copy()
delta_values['Value'] = merged['Value'] - merged['Value_y']
print(delta_values)
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
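Note that the assignment above relies on merged having the same row order as df1; building delta_values from merged directly avoids that assumption (a sketch):
delta_values = merged.assign(Value=merged['Value'] - merged['Value_y'])[['Name', 'Value', 'Day']]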
You can make do with either map or merge. Here's a map solution:
delta_values = df1.copy()
delta_values['Value'] -= delta_values['Day'].map(df2.set_index('Day')['Value']).fillna(0)
The fillna(0) keeps rows whose Day has no match in df2 unchanged, which is exactly the "no action" behaviour you asked for.
Output:
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2

Only show specific groups in a df pandas

Hello, I need to focus on specific groups within a table.
Here is an example:
groups col1
A 3
A 4
A 2
A 1
B 3
B 3
B 4
C 2
D 4
D 3
and I would like to only show groups that contain 3 and 4 but no other number.
Here I should get :
groups col1
B 3
B 3
B 4
D 4
D 3
Here are 2 possible approaches - test values with Series.isin for membership, then get all groups whose values are all True via GroupBy.transform with 'all', and last filter by boolean indexing:
df1 = df[df['col1'].isin([3,4]).groupby(df['groups']).transform('all')]
print (df1)
groups col1
4 B 3
5 B 3
6 B 4
8 D 4
9 D 3
Another approach: first get all groups that contain values other than 3 or 4, then pass them to another isin and invert the mask:
df1 = df[~df['groups'].isin(df.loc[~df['col1'].isin([3,4]), 'groups'])]
print (df1)
groups col1
4 B 3
5 B 3
6 B 4
8 D 4
9 D 3
We can also use GroupBy.filter:
new_df = df.groupby('groups').filter(lambda x: x.col1.isin([3, 4]).all())
print(new_df)
groups col1
4 B 3
5 B 3
6 B 4
8 D 4
9 D 3
An alternative that removes Series.isin from the lambda function:
df['aux'] = df['col1'].isin([3, 4])
df.groupby('groups').filter(lambda x: x.aux.all()).drop('aux', axis=1)
Using df.loc[] and then searching by normal logic should work.
import pandas as pd

data = [['A', 3], ['A', 4], ['A', 2], ['A', 1],
        ['B', 3], ['B', 3], ['B', 4],
        ['C', 2], ['D', 4], ['D', 3]]
df = pd.DataFrame(data, columns=["groups", "col1"])
# groups that contain any value other than 3 or 4
bad_groups = df.loc[~df["col1"].isin([3, 4]), "groups"]
df = df.loc[~df["groups"].isin(bad_groups)]
print(df)

Unmatched left table records in a left join in pandas

I have two DataFrames, 'Students' and 'Fee'. The fee details of some of the students are missing in the 'Fee' DataFrame. I would like to return the details of all students whose fee details are missing. The three fields 'Class', 'Section' and 'RollNo' form a unique combination.
Students = pd.DataFrame({
'Class': [7, 7, 8],
'Section': ['A', 'B', 'B'],
'RollNo': [2, 3, 4],
'Student': ['Ram', 'Rahim', 'Robert']
})
Fee = pd.DataFrame({
'Class': [7, 7, 8],
'Section': ['A', 'B', 'B'],
'RollNo': [2, 2, 3],
'Fee': [10, 20, 30]
})
Students
Class RollNo Section Student
0 7 2 A Ram
1 7 3 B Rahim
2 8 4 B Robert
Fee
Class Fee RollNo Section
0 7 10 2 A
1 7 20 2 B
2 8 30 3 B
Essentially, I would like to find the unmatched records from the left table when I do a left join between 'Students' and 'Fee' DataFrames based on 3 fields mentioned above. What is the simplest way to achieve this using Pandas in Python?
Thank you very much!
If there are no NaNs in the Fee column of the Fee DataFrame, use merge and filter by boolean indexing with isna:
df = pd.merge(Students, Fee, how='left')
print (df)
Class RollNo Section Student Fee
0 7 2 A Ram 10.0
1 7 3 B Rahim NaN
2 8 4 B Robert NaN
df1 = df[df['Fee'].isna()].drop('Fee', axis=1)
# for older versions of pandas
#df1 = df[df['Fee'].isnull()].drop('Fee', axis=1)
print (df1)
Class RollNo Section Student
1 7 3 B Rahim
2 8 4 B Robert
A more general solution, which works with NaNs too: add the indicator parameter to merge and filter the rows marked left_only:
import numpy as np

Fee = pd.DataFrame({'Class': [7, 7, 8],
                    'Section': ['A', 'B', 'B'],
                    'RollNo': [2, 2, 3],
                    'Fee': [np.nan, 20, 30]})
print (Fee)
Class Fee RollNo Section
0 7 NaN 2 A
1 7 20.0 2 B
2 8 30.0 3 B
df = pd.merge(Students, Fee, how='left', indicator=True)
print (df)
Class RollNo Section Student Fee _merge
0 7 2 A Ram NaN both
1 7 3 B Rahim NaN left_only
2 8 4 B Robert NaN left_only
df1 = df[df['_merge'].eq('left_only')].drop(['Fee','_merge'], axis=1)
print (df1)
Class RollNo Section Student
1 7 3 B Rahim
2 8 4 B Robert
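If you prefer a single chain, the same indicator-based filter can be written with query (a sketch):
(pd.merge(Students, Fee, how='left', indicator=True)
   .query('_merge == "left_only"')
   .drop(['Fee', '_merge'], axis=1))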
I was having a bit of fun with this concept.
Option 1
use pandas.concat with the keys argument
ensure that the Students portion gets a value of 'stu' for the first level of the resulting MultiIndex.
use pandas.DataFrame.drop_duplicates with the argument keep=False to drop all duplication.
focus on just the Students portion by using loc.
catted = pd.concat([Students, Fee], keys=['stu', 'fee'])
dropped = catted.drop_duplicates(['Class', 'RollNo', 'Section'], keep=False)
index = dropped.loc['stu'].index
Students.loc[index]
Class RollNo Section Student
1 7 3 B Rahim
2 8 4 B Robert
Option 2
Use sets of key tuples, take the difference, and merge with a contrived dataframe.
cols = ['Class', 'RollNo', 'Section']
s = set(map(tuple, Students[cols].values))
f = set(map(tuple, Fee[cols].values))
Students.merge(pd.DataFrame(list(s - f), columns=cols))
Class RollNo Section Student
0 7 3 B Rahim
1 8 4 B Robert
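A variant of the same set idea that stays inside pandas, assuming the three key columns identify a row: compare MultiIndexes built from the keys.
cols = ['Class', 'RollNo', 'Section']
# rows of Students whose key tuple never appears in Fee
mask = ~Students.set_index(cols).index.isin(Fee.set_index(cols).index)
Students[mask]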

How do I reverse the column values and leave the column headers as they are

suppose I have a dataframe df
df = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
columns=['A', 'B', 'C', 'D', 'E'])
Which looks like this
A B C D E
0 1 2 3 4 5
1 6 7 8 9 10
How do I reverse the order of the column values but leave the column headers as A, B, C, D, E?
I want it to look like
A B C D E
0 5 4 3 2 1
1 10 9 8 7 6
I've tried sorting the column index df.sort_index(1, ascending=False) but that changes the column heads (obviously) and also, I don't know if my columns start off in a sorted way anyway.
Or you can just reverse your columns and sort them back (sortlevel was removed from pandas; sort_index(axis=1) does the same here):
df.columns = df.columns[::-1]
df.sort_index(axis=1)
# A B C D E
#0 5 4 3 2 1
#1 10 9 8 7 6
method 1
reconstruct
pd.DataFrame(df.values[:, ::-1], df.index, df.columns)
method 2
assign values
df[:] = df.values[:, ::-1]
df
both give
    A  B  C  D  E
0   5  4  3  2  1
1  10  9  8  7  6
Also, using np.fliplr, which flips the values along the horizontal direction:
import numpy as np

pd.DataFrame(np.fliplr(df.values), columns=df.columns, index=df.index)
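On recent pandas you can also reverse the columns positionally and reattach the original headers with set_axis (a sketch):
# iloc reverses both values and labels; set_axis restores the original header order
df.iloc[:, ::-1].set_axis(df.columns, axis=1)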
