drop first and last row from within each group - python

This is a follow-up question to get first and last values in a groupby.
How do I drop first and last rows within each group?
I have this df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
                   ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
                  ['X', 'Y'])
df
I intentionally made the second row have the same index value as the first row. I won't have control over the uniqueness of the index.
      X   Y
a a   0   1
  a   2   3
  c   4   5
  d   6   7
b e   8   9
  f  10  11
  g  12  13
c h  14  15
  i  16  17
d j  18  19
I want this
      X   Y
a a   2   3
  c   4   5
b f  10  11
Because the level-0 groups 'c' and 'd' each have fewer than 3 rows, all of their rows are dropped.

I'd apply a similar technique to what I did for the other question:
def first_last(df):
    return df.ix[1:-1]

df.groupby(level=0, group_keys=False).apply(first_last)

Note: in pandas version 0.20.0 and above, ix is deprecated and iloc is encouraged instead, so df.ix[1:-1] should be replaced by df.iloc[1:-1].
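For reference, here is a minimal sketch of the updated answer using iloc (assuming pandas >= 0.20.0):

def first_last(df):
    # groups with fewer than 3 rows produce an empty slice here,
    # so they vanish from the result entirely
    return df.iloc[1:-1]

df.groupby(level=0, group_keys=False).apply(first_last)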

Comparison of two columns

How can I find the same values in the columns regardless of their position?
df = pd.DataFrame({'one': ['A', 'B', 'C', 'D', 'E', np.nan, 'H'],
                   'two': ['B', 'E', 'C', np.nan, np.nan, 'H', 'L']})
The result I want to get:
  three
0     B
1     E
2     C
3     H
The exact logic is unclear; you can try:
out = pd.DataFrame({'three': sorted(set(df['one'].dropna())
                                    & set(df['two'].dropna()))})
output:
  three
0     B
1     C
2     E
3     H
Or maybe you want to keep the items of column two?
out = (df.loc[df['two'].isin(df['one'].dropna()), 'two']
         .to_frame(name='three'))
output:
  three
0     B
1     E
2     C
5     H
Try this:
df = pd.DataFrame(list(set(df['one']).intersection(df['two'])), columns=['Three']).dropna()
print(df)
Output:
  Three
1     C
2     H
3     E
4     B
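A side note on why the trailing .dropna() is needed in that last snippet: np.nan occurs in both columns, and because it is the same object, Python's set intersection keeps it (set membership checks identity before equality), leaving one NaN row that dropna then removes. A quick check:

# True: the shared np.nan object survives the intersection
print(np.nan in (set(df['one']) & set(df['two'])))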

Aggregate group by response based on certain count (or size) Pandas

I am looking to create a sum based on certain values obtained after a groupby count (or size). I have created a mock DataFrame and the desired output below. It should be self-explanatory from the example what I am looking for. I checked quite a bit, but it seems there is no straightforward answer.
data = {'col1': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B',
                 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
        'col2': ['A', 'B', 'C', 'B', 'A', 'B', 'C', 'A', 'C', 'B', 'B',
                 'C', 'A', 'B', 'A', 'A', 'A', 'B', 'C', 'C']}
data = pd.DataFrame(data)
data.groupby(['col1', 'col2'])['col2'].count()
The output for this count is:
A  A    2
   B    2
   C    1
B  A    1
   B    3
   C    2
C  A    4
   B    2
   C    3
I would like to do a further calculation on this output and get:
A  A        2
   (B+C)    3
B  (A+C)    3
   B        3
C  (A+B)    6
   C        3
You could create dummy columns and groupby using those columns:
out = (data
       .assign(match=data['col1'] == data['col2'], count=1)
       .groupby(['col1', 'match'], as_index=False)
       .agg({'col2': lambda x: '+'.join(x.unique()), 'count': 'sum'})
       .drop(columns='match'))
Output:
  col1 col2  count
0    A  B+C      3
1    A    A      2
2    B  C+A      3
3    B    B      3
4    C  A+B      6
5    C    C      3
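If the parentheses from the desired output matter, a small follow-up step (a sketch, not part of the answer above) can wrap the combined labels:

# wrap only the joined labels, e.g. 'B+C' -> '(B+C)'
out['col2'] = np.where(out['col2'].str.contains('+', regex=False),
                       '(' + out['col2'] + ')', out['col2'])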

Pandas Dataframe - GroupBy key and keep max value on another column

I need to group a frame by key. For each group there could be:
one couple of ids, where 'max registered' is a unique value I need to keep
two couples of ids, id1-id2 and id2-id1, where I need to keep the larger of their 'max registered' values (or either one if they are equal) and keep only one of the two couples (id1-id2 and id2-id1 should be considered one couple, since we don't care about the order of the ids)
more than two couples of ids: a combination of case 1 (one couple) and case 2 (two couples), to be treated like case 1 and case 2 within the same key group
Here is the original dataframe:
df = pd.DataFrame({
    'first': ['A', 'B', 'A1', 'B1', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'],
    'second': ['B', 'A', 'B1', 'A1', 'D', 'C', 'F', 'E', 'H', 'G', 'J', 'L'],
    'key': ['AB', 'AB', 'AB', 'AB', 'CD', 'CD', 'EF', 'EF', 'GH', 'GH', 'IJ', 'KL'],
    'max registered': [10, 5, 10, 5, np.nan, 15, 10, 5, np.nan, np.nan, np.nan, 15]
})
df
   first second key  max registered
0      A      B  AB            10.0
1      B      A  AB             5.0
2     A1     B1  AB            10.0
3     B1     A1  AB             5.0
4      C      D  CD             NaN
5      D      C  CD            15.0
6      E      F  EF            10.0
7      F      E  EF             5.0
8      G      H  GH             NaN
9      H      G  GH             NaN
10     I      J  IJ             NaN
11     K      L  KL            15.0
Here is what the dataframe should look like once it has been grouped and (here comes my problem) aggregated/filtered/transformed/applied? I don't know what to do after grouping my data, or which solution I should opt for.
df = pd.DataFrame({
    'first': ['A', 'A1', 'D', 'E', 'G', 'I', 'K'],
    'second': ['B', 'B1', 'C', 'F', 'H', 'J', 'L'],
    'key': ['AB', 'AB', 'CD', 'EF', 'GH', 'IJ', 'KL'],
    'max registered': [10, 10, 15, 10, np.nan, np.nan, 15]
})
df
df
  first second key  max registered
0     A      B  AB            10.0
1    A1     B1  AB            10.0
2     D      C  CD            15.0
3     E      F  EF            10.0
4     G      H  GH             NaN
5     I      J  IJ             NaN
6     K      L  KL            15.0
I've been watching tutorials about groupby() and reading the pandas documentation for two days without finding any clue about the logic behind it or the way I should do this. My problem is (as I see it) more complicated and not really related to what's treated in those tutorials (for example this one, which I watched several times).
Create an order-insensitive group from the first and second columns. key is useless here, since you want the max for each subgroup (the max for (A, B) and the max for (A1, B1)). Then sort the values by max registered in descending order. Finally, group by these virtual groups and keep the first row of each (the max):
out = df.assign(group=df[['first', 'second']].apply(frozenset, axis=1)) \
        .sort_values('max registered', ascending=False) \
        .groupby('group').head(1).sort_index()
print(out)
   first second key  max registered     group
0      A      B  AB            10.0    (A, B)
2     A1     B1  AB            10.0  (B1, A1)
5      D      C  CD            15.0    (C, D)
6      E      F  EF            10.0    (E, F)
8      G      H  GH             NaN    (G, H)
10     I      J  IJ             NaN    (J, I)
11     K      L  KL            15.0    (K, L)
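The group column is only a helper for the deduplication; to match the desired frame exactly, a small follow-up (not in the original answer) drops it and renumbers the rows:

# drop the helper column and restore a 0..n-1 index
out = out.drop(columns='group').reset_index(drop=True)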

python: Turning in dummy variables

I would like to add dummy variables for the column TypePhase.
wm_id  TypePhase
2      ['N', 'A', 'B', 'C', 'D']
2      ['N', 'A', 'B', 'C', 'D']
3      ['N', 'W', 'A', 'B', 'C', 'D']
2      ['N', 'A', 'B', 'C', 'D']
3      ['N', 'P', 'A', 'B', 'C', 'D']
2      ['N', 'A', 'B', 'C', 'D']
I tried df.TypePhase = df.TypePhase.apply(lambda s: '_'.join(s)) but I did not get the expected result. I know that I need to apply something like
pd.get_dummies(df_new['TypePhase']).rename(columns=lambda x: 'AAAAAAAAA_' + str(x))
but I don't get it right.
Please, any suggestions?
Many thanks in advance.
carlo
I think all values are strings in column TypePhase, so it is possible to use str.get_dummies after a double str.strip.
Last, join back to the original.
The pop function extracts the column from the original, so it is not necessary to delete it afterwards.
print (type(df.loc[0, 'TypePhase']))
<class 'str'>
df1 = df.pop('TypePhase').str.strip('[]').str.get_dummies(', ')
# remove the quote characters from the new column names
df1.columns = df1.columns.str.strip("'")
df = df.join(df1)
print (df)
   wm_id  A  B  C  D  N  P  W
0      2  1  1  1  1  1  0  0
1      2  1  1  1  1  1  0  0
2      3  1  1  1  1  1  0  1
3      2  1  1  1  1  1  0  0
4      3  1  1  1  1  1  1  0
5      2  1  1  1  1  1  0  0
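If the cells ever hold real Python lists rather than their string representation, a sketch along these lines would do the same job without any string stripping (assumes pandas >= 0.25 for Series.explode):

# one row per list element, dummies, then collapse back to one row per original index
df1 = pd.get_dummies(df.pop('TypePhase').explode()).groupby(level=0).max()
df = df.join(df1)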

Python DataFrame: Why do my values change to NaN if I change the indices?

I want to change my indices. My DataFrame is as follows:
partA = pd.DataFrame({'u1': 2, 'u2': 3, 'u3': 4, 'u4': 29, 'u5': 4, 'u6': 1,
                      'u7': 323, 'u8': 9, 'u9': 7, 'u10': 5}, index=[20])
which gives a dataframe of size (1, 10) with all cells filled.
However, when I create a new dataframe from this one (necessary in my original code, which contains different data) and I change the index of this dataframe, all the cell values become NaN.
I know that I could use reset_index to change the index, but I would like to be able to do it all in one line.
What I do now is the following (resulting in NaNs):
partB = pd.DataFrame(partA, columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
You need values to convert partA to a NumPy array:
partA = pd.DataFrame({'u1': 2, 'u2': 3, 'u3': 4, 'u4': 29, 'u5': 4, 'u6': 1,
                      'u7': 323, 'u8': 9, 'u9': 7, 'u10': 5}, index=[20])
print (partA)
    u1  u10  u2  u3  u4  u5  u6   u7  u8  u9
20   2    5   3   4  29   4   1  323   9   7
partB = pd.DataFrame(partA.values, columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
print (partB)
   A  B  C  D   E  F  G    H  I  J
0  2  5  3  4  29  4  1  323  9  7
If you need the index from partA:
partB = pd.DataFrame(partA.values,
                     columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                     index=partA.index)
print (partB)
    A  B  C  D   E  F  G    H  I  J
20  2  5  3  4  29  4  1  323  9  7
You get NaN because the column names don't align; if you change the last name to u7, you get a value:
partB = pd.DataFrame(partA,
                     columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'u7'],
                     index=partA.index)
print (partB)
     A   B   C   D   E   F   G   H   I   u7
20 NaN NaN NaN NaN NaN NaN NaN NaN NaN  323
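As an aside, more recent pandas versions offer a true one-liner via set_axis, which relabels the columns positionally and keeps the index (an alternative to the values approach, not part of the original answer):

# positional relabelling: 'B' lands on whatever column is second (u10 here)
partB = partA.set_axis(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], axis=1)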
