I need to group a frame by key. For each group there could be:
one couple of ids, where 'max registered' is a unique value I need to keep
two couples of ids, id1-id2 and id2-id1, where I need to keep the one with the larger 'max registered' (or either one if their 'max registered' values are equal) and keep only one of the two couples (id1-id2 and id2-id1 should be considered the same couple, because the order of the ids in a couple does not matter)
more than two couples of ids: this is a combination of case 1 (one couple) and case 2 (two couples), and each couple needs to be treated like case 1 or case 2 within the same key group.
Here is the original dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first': ['A', 'B', 'A1', 'B1', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'],
    'second': ['B', 'A', 'B1', 'A1', 'D', 'C', 'F', 'E', 'H', 'G', 'J', 'L'],
    'key': ['AB', 'AB', 'AB', 'AB', 'CD', 'CD', 'EF', 'EF', 'GH', 'GH', 'IJ', 'KL'],
    'max registered': [10, 5, 10, 5, np.nan, 15, 10, 5, np.nan, np.nan, np.nan, 15]
})
df
df
first second key max registered
0 A B AB 10
1 B A AB 5
2 A1 B1 AB 10
3 B1 A1 AB 5
4 C D CD NaN
5 D C CD 15
6 E F EF 10
7 F E EF 5
8 G H GH NaN
9 H G GH NaN
10 I J IJ NaN
11 K L KL 15
Here is what the dataframe should look like once it has been grouped and (here comes my problem) aggregated/filtered/transformed/applied? I don't know what to do after grouping my data or which solution I should opt for.
df = pd.DataFrame({
    'first': ['A', 'A1', 'D', 'E', 'G', 'I', 'K'],
    'second': ['B', 'B1', 'C', 'F', 'H', 'J', 'L'],
    'key': ['AB', 'AB', 'CD', 'EF', 'GH', 'IJ', 'KL'],
    'max registered': [10, 10, 15, 10, np.nan, np.nan, 15]
})
df
first second key max registered
0 A B AB 10
1 A1 B1 AB 10
2 D C CD 15
3 E F EF 10
4 G H GH NaN
5 I J IJ NaN
6 K L KL 15
I've been watching tutorials about groupby() and reading the pandas documentation for two days without finding any clue about the logic behind it or how I should do this. My problem is (as I see it) more complicated and not really related to what's covered in those tutorials (for example this one, which I watched several times).
Create an order-insensitive group from the first and second columns. key is useless here since you want the max for each subgroup (the max for (A, B) and the max for (A1, B1)). Then sort values by 'max registered' in descending order. Finally, group by this virtual group and keep the first row (the max):
out = (df.assign(group=df[['first', 'second']].apply(frozenset, axis=1))
         .sort_values('max registered', ascending=False)
         .groupby('group').head(1).sort_index())
print(out)
first second key max registered group
0 A B AB 10.0 (A, B)
2 A1 B1 AB 10.0 (B1, A1)
5 D C CD 15.0 (C, D)
6 E F EF 10.0 (E, F)
8 G H GH NaN (G, H)
10 I J IJ NaN (J, I)
11 K L KL 15.0 (K, L)
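If you do not want the helper column in the final output, an equivalent order-insensitive key can also be built with a sorted tuple and dropped at the end. This is a minimal sketch, assuming the df defined above, not part of the original answer:
pair = df[['first', 'second']].apply(lambda r: tuple(sorted(r)), axis=1)
out = (df.assign(group=pair)
         .sort_values('max registered', ascending=False)  # NaN sorts last by default
         .groupby('group', sort=False).head(1)
         .sort_index()
         .drop(columns='group'))
print(out)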
I have the following DataFrame:
user category x y
0 AB A 1 1
1 EF A 1 1
2 SG A 1 0
3 MN A 1 0
4 AB B 0 0
5 EF B 0 1
6 SG B 0 1
7 MN B 0 0
8 AB C 1 1
9 EF C 1 1
10 SG C 1 1
11 MN C 1 1
I want to select users that have x=y in all categories. I was able to do that using the following code:
data = pd.DataFrame({'user': ['AB', 'EF', 'SG', 'MN', 'AB', 'EF',
'SG', 'MN', 'AB', 'EF', 'SG', 'MN'],
'category': ['A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'C', 'C', 'C', 'C'],
'x': [1,1,1,1, 0,0,0,0, 1,1,1,1],
'y': [1,1,0,0, 0,1,1,0, 1,1,1,1]})
data = data[data['x'] == data['y']][['user', 'category']]
count_users_match = data.groupby('user', as_index=False).count()
count_cat = data['category'].unique().shape[0]
print(count_users_match[count_users_match['category'] == count_cat])
Output:
user category
0 AB 3
I feel this is quite a long solution. Is there a shorter way to achieve this?
Try this:
filtered = (data.x.eq(data.y)
                .groupby(data['user']).sum()
                .loc[lambda x: x == data['category'].nunique()]
                .reset_index(name='category'))
Output:
>>> filtered
user category
0 AB 3
We could use query + groupby + size to find the number of matching categories for each user, then compare it with the total number of categories:
tmp = data.query('x==y').groupby('user').size()
out = tmp[tmp == data['category'].nunique()].reset_index(name='category')
Output:
user category
0 AB 3
This is a more compact way to do it, but I don't know if it is also more efficient.
matched = data.loc[data['x'] == data['y'], 'user'].value_counts()
total = data['user'].value_counts()
out = [{'user': user, 'frequency': matched[user]}
       for user in data['user'].unique()
       if matched[user] == total[user]]
>>> out
[{'user': 'AB', 'frequency': 3}]
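For completeness, a boolean reduction per user also works; this is a minimal sketch (not from the answers above), assuming each user has exactly one row per category:
mask = data['x'].eq(data['y']).groupby(data['user']).all()
print(mask[mask].index.tolist())  # ['AB']
Note that this only returns the matching users, without the per-user count shown in the expected output.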
I have two dataframes.
data = {
'Title': ['Ak1', 'Ak2', 'Ak3', 'Ak4', 'Ak5', 'Ak6', 'Ak7', 'Ak8'],
'Items': ['A', 'B', 'J', 'A', 'A', 'K', 'L', 'M'],
'Item2': ['K', 'B', 'O', 'A', 'A', 'K', 'J', 'F'],
'Item3': ['A', 'K', 'D', 'A', 'A', 'K', 'L', 'M'],
}
df = pd.DataFrame(data)
df
Title Items Item2 Item3
0 Ak1 A K A
1 Ak2 B B K
2 Ak3 J O D
3 Ak4 A A A
4 Ak5 A A A
5 Ak6 K K K
6 Ak7 L J L
7 Ak8 M F M
second dataframe df2,
data = {
'Remove': ['A', 'J', 'M']
}
df2 = pd.DataFrame(data)
df2
Remove
0 A
1 J
2 M
I want to remove all the values in df which are present in df2. The expected output is as follows.
Title Items Item2 Item3
0 Ak1 K
1 Ak2 B B K
2 Ak3 O D
3 Ak4
4 Ak5
5 Ak6 K K K
6 Ak7 L L
7 Ak8 F
You can use the isin function:
df3 = df.mask(df.isin(df2['Remove'].values), '')
df3
Title Items Item2 Item3
0 Ak1 K
1 Ak2 B B K
2 Ak3 O D
3 Ak4
4 Ak5
5 Ak6 K K K
6 Ak7 L L
7 Ak8 F
How about using pandas.replace? Applied to your sample data, the result is as follows:
df.replace(df2.Remove.unique(),'')
Title Items Item2 Item3
0 Ak1 K
1 Ak2 B B K
2 Ak3 O D
3 Ak4
4 Ak5
5 Ak6 K K K
6 Ak7 L L
7 Ak8 F
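Both approaches blank matching values in every column. If Title should never be touched even when its values overlap with df2, the masking can be restricted to the item columns; a minimal sketch, where the column list is an assumption based on the sample data:
item_cols = ['Items', 'Item2', 'Item3']
df[item_cols] = df[item_cols].mask(df[item_cols].isin(df2['Remove'].tolist()), '')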
I want to change values from one column in a dataframe to fake data.
Here is the original table looking sample:
df = {'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
df
Now what I want to do is to change the Name column values to fake values like this:
df = {'Name': ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
df
Notice how I changed the names to distinct combinations of letters. This is sample data, but in the real data there are a lot of names, so I start with A, B, C, D; when it reaches Z, the next new name should be AA, then AB, and so on.
Is this viable?
Here is my suggestion. The list 'fake' below has more than 23,000 items; if your df has more unique values, just increase the end of the loop (currently 5) and the fake list will grow rapidly:
import string
from itertools import combinations_with_replacement

names = df['Name'].unique()
letters = list(string.ascii_uppercase)
fake = []
for i in range(1, 5):  # increase 5 if you need more items
    fake.extend(combinations_with_replacement(letters, i))
fake = [''.join(c) for c in fake]
d = dict(zip(names, fake))
df['code'] = df.Name.map(d)
Sample of fake:
>>> print(fake[:30])
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC', 'AD']
Output:
>>>print(df)
Name Age code
0 David 10 A
1 David 10 A
2 David 10 A
3 Kevin 12 B
4 Kevin 12 B
5 Ann 15 C
6 Joan 13 D
Use factorize to make the fake name an int, which is easy to store:
df['Fake']=df.Name.factorize()[0]
df
Name Age Fake
0 David 10 0
1 David 10 0
2 David 10 0
3 Kevin 12 1
4 Kevin 12 1
5 Ann 15 2
6 Joan 13 3
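If letter codes are preferred over integers, the factorized values can be converted to Excel-style names. This is a minimal sketch; the int_to_letters helper is my own addition, not part of the original answer:
def int_to_letters(n):
    # Map 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ... (bijective base 26)
    out = ''
    n += 1
    while n:
        n, rem = divmod(n - 1, 26)
        out = chr(ord('A') + rem) + out
    return out

df['Fake'] = [int_to_letters(c) for c in pd.factorize(df['Name'])[0]]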
If you need a mixed (alphanumeric) type instead:
df.groupby('Name')['Name'].transform(lambda x : pd.util.testing.rands_array(8,1)[0])
0 jNAO9AdJ
1 jNAO9AdJ
2 jNAO9AdJ
3 es0p4Yjx
4 es0p4Yjx
5 x54NNbdF
6 hTMKxoXW
Name: Name, dtype: object
from string import ascii_lowercase

def excel_names(num_cols):
    letters = list(ascii_lowercase)
    excel_cols = []
    for i in range(0, num_cols - 1):
        n = i // 26
        m = n // 26
        i -= n * 26
        n -= m * 26
        col = letters[m-1] + letters[n-1] + letters[i] if m > 0 else letters[n-1] + letters[i] if n > 0 else letters[i]
        excel_cols.append(col)
    return excel_cols

unique_names = df['Name'].nunique() + 1
names = excel_names(unique_names)
dictionary = dict(zip(df['Name'].unique(), names))
df['new_Name'] = df['Name'].map(dictionary)
Get a new integer category for the names using cumsum, and use Python's ord/chr to turn the integer into letters starting from 'A' (note this approach assumes identical names appear in consecutive rows):
df['Name']=(~(df.Name.shift(1)==df.Name)).cumsum().add(ord('A') - 1).map(chr)
print(df)
Name Age
0 A 10
1 A 10
2 A 10
3 B 12
4 B 12
5 C 15
6 D 13
Let us think about it another way. If you need a fake symbol, we can map the names to A0, A1, A2, ..., An. This would be easier.
df = {'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'], 'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
mapping = pd.DataFrame({'name': df['Name'].unique()})
mapping['seq'] = mapping.index
mapping['symbol'] = mapping['seq'].apply(lambda x: 'A' + str(x))
df['code'] = df['Name'].apply(lambda x: mapping.loc[mapping['name'] == x, 'symbol'].values[0])
df
Name Age code
0 David 10 A0
1 David 10 A0
2 David 10 A0
3 Kevin 12 A1
4 Kevin 12 A1
5 Ann 15 A2
6 Joan 13 A3
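The per-row lookup with apply can be slow on larger frames; building the mapping once as a dictionary and using map is usually faster. A minimal sketch, assuming the df above:
codes = {name: 'A' + str(i) for i, name in enumerate(df['Name'].unique())}
df['code'] = df['Name'].map(codes)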
I want to change my indices. My DataFrame is as follows:
partA = pd.DataFrame({'u1': 2, 'u2': 3, 'u3':4, 'u4':29, 'u5':4, 'u6':1, 'u7':323, 'u8':9, 'u9':7, 'u10':5}, index = [20])
which gives a dataframe of size (1,10) with all cells filled.
However, when I create a new dataframe from this one (necessary in my original code, which contains different data) and change the column labels for this dataframe, the values of my cells all become NaN.
I know that I could use reset_index to change the index, but I would like to be able to do it all in one line.
What I do now is the following (resulting in NaNs):
partB = pd.DataFrame(partA, columns = ['A', 'B', 'C', 'D','E', 'F', 'G', 'H', 'I','J'])
You need .values to convert partA to a NumPy array:
partA = pd.DataFrame({'u1': 2, 'u2': 3, 'u3':4, 'u4':29, 'u5':4, 'u6':1,
'u7':323, 'u8':9, 'u9':7, 'u10':5}, index = [20])
print (partA)
u1 u10 u2 u3 u4 u5 u6 u7 u8 u9
20 2 5 3 4 29 4 1 323 9 7
partB = pd.DataFrame(partA.values,columns = ['A', 'B', 'C', 'D','E', 'F', 'G', 'H', 'I','J'])
print (partB)
A B C D E F G H I J
0 2 5 3 4 29 4 1 323 9 7
If need index from partA:
partB = pd.DataFrame(partA.values,
                     columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                     index=partA.index)
print (partB)
A B C D E F G H I J
20 2 5 3 4 29 4 1 323 9 7
You get NaN because the column names do not align; if you keep the last name as u7, you get a value:
partB = pd.DataFrame(partA,
                     columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'u7'],
                     index=partA.index)
print (partB)
A B C D E F G H I u7
20 NaN NaN NaN NaN NaN NaN NaN NaN NaN 323
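Another option is to relabel the columns directly instead of building a new frame; a minimal sketch, assuming a reasonably recent pandas version:
partB = partA.set_axis(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], axis=1)
print(partB)
Note that set_axis assigns the labels positionally, so the new names follow partA's existing column order.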
This is a follow-up question to "get first and last values in a groupby".
How do I drop first and last rows within each group?
I have this df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
                   ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
                  ['X', 'Y'])
df
I intentionally made the second row have the same index value as the first row. I won't have control over the uniqueness of the index.
X Y
a a 0 1
a 2 3
c 4 5
d 6 7
b e 8 9
f 10 11
g 12 13
c h 14 15
i 16 17
d j 18 19
I want this
X Y
a b 2.0 3
c 4.0 5
b f 10.0 11
Because the groups at level 0 equal to 'c' and 'd' each have fewer than 3 rows, all of their rows should be dropped.
I'd apply a similar technique to what I did for the other question:
def first_last(df):
    return df.ix[1:-1]

df.groupby(level=0, group_keys=False).apply(first_last)
Note: in pandas version 0.20.0 and above, ix is deprecated and the use of iloc is encouraged instead.
So the df.ix[1:-1] should be replaced by df.iloc[1:-1].
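With that replacement, the full snippet looks like this (a sketch; ix was removed entirely in pandas 1.0, so recent versions require iloc):
def first_last(df):
    # Drop the first and last row of each group; groups with fewer
    # than three rows end up empty and are excluded from the result.
    return df.iloc[1:-1]

out = df.groupby(level=0, group_keys=False).apply(first_last)
print(out)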