Create mask to identify final two rows in groups in Pandas dataframe - python

I have a Pandas dataframe that includes a grouping variable. An example can be produced using:
df = pd.DataFrame({'grp':['a','a','b','b','b','c','d','d','d','d'],
'data':[4,5,3,6,7,8,9,8,7,3]})
...which looks like:
grp data
0 a 4
1 a 5
2 b 3
3 b 6
4 b 7
5 c 8
6 d 9
7 d 8
8 d 7
9 d 3
I can retrieve the last two rows of each group using:
dfgrp = df.groupby('grp').tail(2)
However, I would like to produce a mask that identifies the last two rows (or 1 row if only 1 exists), ideally producing an output that looks like:
0 True
1 True
2 False
3 True
4 True
5 True
6 False
7 False
8 True
9 True
I thought this would be relatively straightforward, but I haven't been able to find a solution. Suggestions would be greatly appreciated.

If your index is unique, you can do this with isin: take the index labels of the rows returned by tail(2) and test each index label of df for membership.
import pandas as pd
df = pd.DataFrame({'grp':['a','a','b','b','b','c','d','d','d','d'],
'data':[4,5,3,6,7,8,9,8,7,3]})
df['mask'] = df.index.isin(df.groupby('grp').tail(2).index)
df
grp data mask
0 a 4 True
1 a 5 True
2 b 3 False
3 b 6 True
4 b 7 True
5 c 8 True
6 d 9 False
7 d 8 False
8 d 7 True
9 d 3 True
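If the index is not guaranteed to be unique, the same mask can be built without comparing index labels. The sketch below is an alternative to the isin approach above, not part of the original answer: it numbers the rows of each group from the end with GroupBy.cumcount(ascending=False) and keeps positions 0 and 1. The names pos_from_end and mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd', 'd'],
                   'data': [4, 5, 3, 6, 7, 8, 9, 8, 7, 3]})

# Number the rows within each group, counting back from the end:
# the last row of a group gets 0, the second-to-last gets 1, and so on.
pos_from_end = df.groupby('grp').cumcount(ascending=False)

# Positions 0 and 1 are the final two rows of their group
# (a single-row group only has position 0, so it is kept as well).
mask = pos_from_end < 2
print(mask.tolist())
```

Because the comparison is positional, duplicate index labels do not affect the result.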

Related

Pandas, create column using previous new column value

I am using Python and have the following Pandas Dataframe:
idx result grouping
1 False
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 True
10 True
11 True
12 True
What I would like is the following logic:
If result is False, grouping should be the idx value.
If result is True, grouping should be the previous grouping value.
So the end result will be:
idx result grouping
1 False 1
2 True 1
3 True 1
4 False 4
5 True 4
6 True 4
7 True 4
8 False 8
9 True 8
10 True 8
11 True 8
12 True 8
I have tried all sorts to get this working from using the Pandas shift() command to using lambda, but I am just not getting it.
I know I could iterate through the dataframe and perform the calculation but there has to be a better method.
Examples of what I have tried, without success:
df['grouping'] = df['idx'] if not df['result'] else df['grouping'].shift(1)
df['grouping'] = df.apply(lambda x: x['idx'] if not x['result'] else x['grouping'].shift(1), axis=1)
Many Thanks for any assistance you can provide.
Mask the True values (replacing them with NaN), then forward fill:
df['grouping'] = df['idx'].mask(df['result']).ffill(downcast='infer')
idx result grouping
0 1 False 1
1 2 True 1
2 3 True 1
3 4 False 4
4 5 True 4
5 6 True 4
6 7 True 4
7 8 False 8
8 9 True 8
9 10 True 8
10 11 True 8
11 12 True 8
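Recent pandas releases emit a deprecation warning for the downcast argument of ffill, so the following is a minimal sketch of the same idea without it. It is a variation on the answer above, assuming the first row always has result == False (otherwise a NaN would remain and the integer cast would fail). The sample frame is rebuilt from the question's table.

```python
import pandas as pd

df = pd.DataFrame({'idx': range(1, 13),
                   'result': [False, True, True, False, True, True, True,
                              False, True, True, True, True]})

# Keep idx where result is False and blank it (NaN) where result is True,
# then carry the last kept value forward over the blanks.
grouping = df['idx'].where(~df['result']).ffill()

# Assumption: the first row has result == False, so no NaN is left
# and the float column can be cast back to integers explicitly.
df['grouping'] = grouping.astype(int)
print(df)
```

Series.where(~cond) is the mirror image of Series.mask(cond), so either spelling expresses the same step.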

Pandas: Filter a data-frame, and assign values to top n number of rows

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,2,5,6,7,1,8,9,2], 'city':[1,2,3,4,2,5,6,7,1,8,9,2]})
# The following code, creates a boolean filter,
filter = df.city==2
# Assigns True to all rows where filter is True
df.loc[filter,'selected']= True
What I need is a change to the code so that it assigns True to only a given number n of rows.
The actual data frame has more than 3 million rows. Sometimes I would want
df.loc[filter,'selected']= True applied to only 100 rows (the actual number could be more or less than 100).
I believe you need to filter first by the values defined in a list with isin, and then take the top 2 rows per city with GroupBy.head (df1 below is the DataFrame from the question):
cities= [2,3]
df = df1[df1.city.isin(cities)].groupby('city').head(2)
print (df)
col1 city
1 2 2
2 3 3
4 2 2
If you need to assign True in a new column:
cities= [2,3]
idx = df1[df1.city.isin(cities)].groupby('city').head(2).index
df1.loc[idx, 'selected'] = True
print (df1)
col1 city selected
0 1 1 NaN
1 2 2 True
2 3 3 True
3 4 4 NaN
4 2 2 True
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 1 1 NaN
9 8 8 NaN
10 9 9 NaN
11 2 2 NaN
Define a list of the elements to check and pass it to isin on the city column, creating a new column of True/False booleans:
>>> check
[2, 3]
>>> df['Citis'] = df.city.isin(check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
OR
>>> df['Citis'] = df['city'].apply(lambda x: x in check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
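If you only want to consider the first rows (say, the first 5), chain head(5). Note that the assignment aligns on the index, so the rows after the first 5 end up as NaN in the new column: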
df['Citis'] = df.city.isin(check).head(5)
OR
df['Citis'] = df['city'].apply(lambda x: x in check).head(5)
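To flag only the first n rows that satisfy the filter (rather than the first n rows of the frame), one option is to count the matches as they occur. This is a sketch under the assumption that "first" means order of appearance in the frame; the names n, matches and first_n are hypothetical and not from the answers above.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 2, 5, 6, 7, 1, 8, 9, 2],
                   'city': [1, 2, 3, 4, 2, 5, 6, 7, 1, 8, 9, 2]})

n = 2  # hypothetical cap on how many matching rows to flag

# Boolean filter, as in the question
matches = df['city'].eq(2)

# cumsum() counts the matches seen so far, so a matching row is
# among the first n matches exactly when its running count is <= n.
first_n = matches & (matches.cumsum() <= n)

df['selected'] = first_n
print(df[df['selected']])
```

Both steps are vectorised, so the approach should scale to a frame with millions of rows without a Python-level loop.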

Check if a value exists using multiple conditions within group in pandas

Following is what my dataframe looks like. Expected_Output is my desired/target column.
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
If any Value1 == 7 AND if any Value2 == 9 within a given Group, then I want to return True.
I tried to no avail:
df['Expected_Output']= df.groupby('Group').Value1.isin(7) & df.groupby('Group').Value2.isin(9)
N.B:- Either True/False or 1/0 can be output.
Group by the Group column, then use transform with a lambda function:
g = df.groupby('Group')
df['Expected'] = (g['Value1'].transform(lambda x: x.eq(7).any()))&(g['Value2'].transform(lambda x: x.eq(9).any()))
Or use groupby and apply, then merge the result back with how='left':
df.merge(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any()&x['Value2'].eq(9).any()).reset_index(),how='left').rename(columns={0:'Expected_Output'})
Or use groupby and apply, then map the per-group result onto the Group column:
df['Expected_Output'] = df['Group'].map(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any()&x['Value2'].eq(9).any()))
print(df)
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
You can create a dataframe of the expected result by group and then merge it back to the original dataframe.
expected = (
df.groupby('Group')
.apply(lambda x: (x['Value1'].eq(7).any()
& x['Value2'].eq(9).any()))
.to_frame('Expected_Output'))
>>> expected
Expected_Output
Group
1 True
2 False
>>> df.merge(expected, left_on='Group', right_index=True)
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
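The same per-group check can also be written without a lambda or a merge. The following is a hedged sketch of that variation, not one of the answers above: each comparison is grouped by the Group column and broadcast back to the rows with transform('any'). The names has_7 and has_9 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 2, 2, 2],
                   'Value1': [3, 7, 9, 3, 8, 7],
                   'Value2': [9, 6, 7, 8, 5, 6]})

# For every row: does any row in the same Group have Value1 == 7?
has_7 = df['Value1'].eq(7).groupby(df['Group']).transform('any')

# For every row: does any row in the same Group have Value2 == 9?
has_9 = df['Value2'].eq(9).groupby(df['Group']).transform('any')

# Both conditions must hold somewhere in the group (not necessarily on the same row)
df['Expected_Output'] = has_7 & has_9
print(df)
```

Using the string 'any' lets pandas dispatch to its built-in group aggregation instead of calling a Python function once per group.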

Convert Outline format in CSV to Two Columns

I have data in a CSV file in the following format (one column in a dataframe). It is essentially an outline, like in a Word document: the letters shown here are the main headers, and the numbers are the subheaders under them:
A
1
2
3
B
1
2
C
1
2
3
4
I want to convert this to the following format (two columns in a dataframe):
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4
I'm using pandas read_csv to load the data into a dataframe, and I'm trying to reformat it with for loops, but I'm having difficulty because the data repeats and gets overwritten. For example, A 3 gets overwritten by C 3 later in the loop, resulting in two instances of C 3 when only one is desired and losing A 3 altogether. What's the best way to do this?
Apologies for poor formatting, new to the site.
Use:
#if no csv header use names parameter
df = pd.read_csv(file, names=['col'])
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
df = df[df['a'] != df['col']]
print (df)
a col
1 A 1
2 A 2
3 A 3
5 B 1
6 B 2
8 C 1
9 C 2
10 C 3
11 C 4
Details:
Check isnumeric values:
print (df['col'].str.isnumeric())
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 True
9 True
10 True
11 True
Name: col, dtype: bool
Use mask to replace the values where the condition is True with NaN, then forward fill the missing values:
print (df['col'].mask(df['col'].str.isnumeric()).ffill())
0 A
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
11 C
Name: col, dtype: object
Add new column to first position by DataFrame.insert:
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
print (df)
a col
0 A A
1 A 1
2 A 2
3 A 3
4 B B
5 B 1
6 B 2
7 C C
8 C 1
9 C 2
10 C 3
11 C 4
Finally, remove the rows where both columns hold the same value (the header rows) with boolean indexing.
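The steps above can be tried end to end without a file on disk. In this sketch the CSV content is an in-memory string (an assumption standing in for the real file), dtype=str keeps every value as text so str.isnumeric applies to all rows, and the column name header is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: one value per line, no header row
raw = "A\n1\n2\n3\nB\n1\n2\nC\n1\n2\n3\n4\n"

df = pd.read_csv(io.StringIO(raw), names=['col'], dtype=str)

# True for the numbered items, False for the lettered headers
is_item = df['col'].str.isnumeric()

# Keep only the header values, then carry each header down over its items
df['header'] = df['col'].where(~is_item).ffill()

# Drop the header rows themselves and put the header column first
out = df.loc[is_item, ['header', 'col']].reset_index(drop=True)
print(out)
```

Filtering with is_item instead of comparing the two columns gives the same rows here and avoids relying on a header never being equal to one of its items.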

Compare two columns in pandas to make them match

So I have two dataframes consisting of 6 columns each containing numbers. I need to compare 1 column from each dataframe to make sure they match and fix any values in that column that don't match. Columns are already sorted and they match in terms of length. So far I can find the differences in the columns:
df1.loc[(df1['col1'] != df2['col2'])]
then I get the index # where df1 doesn't match df2. Then I'll go to that same index # in df2 to find out what value in col2 is causing a mismatch then use this to change the value to the correct one found in df2:
df1.loc[index_number, 'col1'] = new_value
Is there a way I can automatically fix the mismatches without having to manually look up what the correct value should be in df2?
If df2 is the authoritative source, you don't need to check where df1 already matches; just overwrite the column:
df1.loc[:, 'column_name'] = df2['column_name']
But if we must check first and only overwrite the mismatches:
c = 'column_name'
df1.loc[df1[c] != df2[c], c] = df2[c]
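When the two columns have different names, as in the question (col1 in df1, col2 in df2), the same pattern can be sketched as follows. The data is made up for illustration, and the sketch assumes both frames share the same index so that the comparison lines up row by row.

```python
import pandas as pd

# Hypothetical sample data; df2 is treated as the authoritative source
df1 = pd.DataFrame({'col1': [1, 6, 5]})
df2 = pd.DataFrame({'col2': [1, 3, 5]})

# Rows where df1 disagrees with df2 (assumes identical indexes)
mismatch = df1['col1'].ne(df2['col2'])

# Overwrite only the mismatching rows with the value from df2
df1.loc[mismatch, 'col1'] = df2.loc[mismatch, 'col2']
print(df1)
```

Keeping the mismatch mask around is useful if you also want to log which rows were corrected.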
I think you need to compare with eq, and then, where the values don't match, fill in the value from df2 with combine_first:
df1 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,6,5],
'E':[5,3,6],
'F':[1,4,3]})
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 6 3 4
2 3 6 9 5 6 3
df2 = pd.DataFrame({'A':[1,2,1],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 1 6 9 5 6 3
If you need to compare one column against every column of a DataFrame:
print (df1.eq(df2.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 False False False False False False
print (df1.eq(df1.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 True False False False False True
And if you need column D of df1 to match column D of df2:
df1.D = df1.loc[df1.D.eq(df2.D), 'D'].combine_first(df2.D)
print (df1)
A B C D E F
0 1 4 7 1.0 5 1
1 2 5 8 3.0 3 4
2 3 6 9 5.0 6 3
But then it is easier to simply assign column D of df2 to column D of df1:
df1.D = df2.D
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3
If the indexes are different, you can use values to convert the column to a NumPy array, which skips index alignment:
df1.D = df2.D.values
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3
