I want to change values from one column in a dataframe to fake data.
Here is the original table looking sample:
df = {'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
df
Now what I want to do is to change the Name column values to fake values like this:
df = {'Name': ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
df
Notice how I changed the names to distinct letter combinations. This is sample data, but the real data has a lot of names, so I start with A, B, C, D and, once Z is reached, the next new name should be AA, then AB, and so on.
Is this viable?
Here is my suggestion. The fake list below has more than 23,000 items; if your df has more unique values, just increase the end of the loop (currently 5) and the fake list will grow rapidly:
import string
from itertools import combinations_with_replacement

names = df['Name'].unique()
letters = list(string.ascii_uppercase)
fake = []
for size in range(1, 5):  # increase 5 if you need more items
    # all non-decreasing letter combinations of this length: 'A'..'Z', 'AA', 'AB', ...
    fake.extend(combinations_with_replacement(letters, size))
fake = [''.join(t) for t in fake]
d = dict(zip(names, fake))
df['code'] = df.Name.map(d)
Sample of fake:
>>> print(fake[:30])
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC', 'AD']
Output:
>>>print(df)
Name Age code
0 David 10 A
1 David 10 A
2 David 10 A
3 Kevin 12 B
4 Kevin 12 B
5 Ann 15 C
6 Joan 13 D
Use factorize to turn the fake name into an int, which is easy to store:
df['Fake']=df.Name.factorize()[0]
df
Name Age Fake
0 David 10 0
1 David 10 0
2 David 10 0
3 Kevin 12 1
4 Kevin 12 1
5 Ann 15 2
6 Joan 13 3
If you need a mixed (random string) type:
df.groupby('Name')['Name'].transform(lambda x : pd.util.testing.rands_array(8,1)[0])
0 jNAO9AdJ
1 jNAO9AdJ
2 jNAO9AdJ
3 es0p4Yjx
4 es0p4Yjx
5 x54NNbdF
6 hTMKxoXW
Name: Name, dtype: object
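Note that pd.util.testing was deprecated and later removed from pandas, so the line above fails on recent versions. A minimal sketch of the same idea using only the standard library (rand_code is my own helper, not a pandas API):
import random
import string

def rand_code(n=8):
    # one random alphanumeric code of length n
    return ''.join(random.choices(string.ascii_letters + string.digits, k=n))

# one code per unique name, then map it onto the column
codes = {name: rand_code() for name in df['Name'].unique()}
df['code'] = df['Name'].map(codes)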
from string import ascii_lowercase

def excel_names(num_cols):
    letters = list(ascii_lowercase)
    excel_cols = []
    for i in range(0, num_cols - 1):
        n = i // 26
        m = n // 26
        i -= n * 26
        n -= m * 26
        # one, two or three letters depending on how large i was
        col = (letters[m - 1] + letters[n - 1] + letters[i] if m > 0
               else letters[n - 1] + letters[i] if n > 0
               else letters[i])
        excel_cols.append(col)
    return excel_cols
unique_names = df['Name'].nunique() + 1
names = excel_names(unique_names)
dictionary = dict(zip(df['Name'].unique(), names))
df['new_Name'] = df['Name'].map(dictionary)
Get a new integer category for the names using cumsum, then use Python's ord and chr to turn the integers into strings starting from 'A'. Note this assumes identical names appear consecutively and that there are at most 26 unique names:
df['Name']=(~(df.Name.shift(1)==df.Name)).cumsum().add(ord('A') - 1).map(chr)
print(df)
Name Age
0 A 10
1 A 10
2 A 10
3 B 12
4 B 12
5 C 15
6 D 13
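A minimal sketch that lifts both limits at once: factorize handles names that repeat non-consecutively, and an Excel-style base-26 expansion (to_letters is my own helper, not a pandas function) handles more than 26 unique names:
def to_letters(n):
    # 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ... (bijective base 26)
    s = ''
    n += 1
    while n > 0:
        n, r = divmod(n - 1, 26)
        s = chr(ord('A') + r) + s
    return s

df['Name'] = [to_letters(i) for i in df['Name'].factorize()[0]]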
Let's think about it another way. If you just need a fake symbol, we can map the names to A0, A1, A2, ..., An. This is easier.
df = {'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'], 'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
mapping = pd.DataFrame({'name': df['Name'].unique()})
mapping['seq'] = mapping.index
mapping['symbol'] = mapping['seq'].apply(lambda x: 'A' + str(x))
# .values[0] extracts the scalar symbol (not an array) for each name
df['code'] = df['Name'].apply(lambda x: mapping.loc[mapping['name'] == x, 'symbol'].values[0])
df
Name Age code
0 David 10 A0
1 David 10 A0
2 David 10 A0
3 Kevin 12 A1
4 Kevin 12 A1
5 Ann 15 A2
6 Joan 13 A3
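For what it's worth, the same A0..An labels can be produced in one line with factorize, a sketch assuming the same df:
# factorize numbers the unique names 0..n-1 in order of appearance
df['code'] = 'A' + pd.Series(pd.factorize(df['Name'])[0], index=df.index).astype(str)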
I'm stuck on a problem because I can't find any solution for it. I have the following sample:
data = [['John', 6, 'A'], ['Paul', 6, 'D'],
['Juli', 9, 'D'], ['Geeta', 4, 'A'],
['Jay', 6, 'D'], ['Sara', 6, 'A'],
['Mario', 3, 'D'], ['Peter', 4, 'A'],
['Jin', 4, 'D'], ['Carl', 6, 'A']]
df = pd.DataFrame(data, columns=['Name', 'Number', 'Label'])
I previously sorted it by number with the following line of code:
df = df.sort_values('Number', ignore_index=True)
and got this output:
Name Number Label
Mario 3 D
Geeta 4 A
Peter 4 A
Jin 4 D
John 6 A
Paul 6 D
Jay 6 D
Sara 6 A
Carl 6 A
Juli 9 D
So I want to select pairs of rows where one row has an 'A' in the last column and the next row has a 'D', and find all such pairs within the same group (I don't want to pair the last 'A' of one group with the 'D' of the next group). The solution for this sample is:
Name Number Label
Peter 4 A
Jin 4 D
John 6 A
Paul 6 D
Can anyone help me?
You need to use:
# is the row label A?
m1 = df['Label'].eq('A')
# is the next row label D?
m2 = df['Label'].shift(-1).eq('D')
# create a mask combining both conditions
mask = m1 & m2
# select the matching rows and the next one (boolean OR)
df[mask | mask.shift(fill_value=False)]
output:
Name Number Label
2 Peter 4 A
3 Jin 4 D
4 John 6 A
5 Paul 6 D
8 Carl 6 A
9 Juli 9 D
Note the unwanted Carl/Juli pair: the last 'A' of group 6 got paired with the 'D' of the next group.
Update: match within the group. As your rows are sorted per group, you can add another condition:
m1 = df['Label'].eq('A')
m2 = df['Label'].shift(-1).eq('D')
# is the next row in the same group (same Number)?
m3 = df['Number'].eq(df['Number'].shift(-1))
mask = m1 & m2 & m3
df[mask | mask.shift(fill_value=False)]
output:
Name Number Label
2 Peter 4 A
3 Jin 4 D
4 John 6 A
5 Paul 6 D
def function1(dd: pd.DataFrame):
    # rows labelled D whose previous row within the group is labelled A
    idx = dd[(dd.Label == 'D') & (dd.Label.shift() == 'A')].index
    # keep each such D row together with the A row just before it
    return dd.loc[idx.union(idx - 1).sort_values()]

df.groupby('Number').apply(function1).reset_index(drop=True)
Name Number Label
0 Peter 4 A
1 Jin 4 D
2 John 6 A
3 Paul 6 D
I have the following DataFrame:
user category x y
0 AB A 1 1
1 EF A 1 1
2 SG A 1 0
3 MN A 1 0
4 AB B 0 0
5 EF B 0 1
6 SG B 0 1
7 MN B 0 0
8 AB C 1 1
9 EF C 1 1
10 SG C 1 1
11 MN C 1 1
I want to select users that have x=y in all categories. I was able to do that using the following code:
data = pd.DataFrame({'user': ['AB', 'EF', 'SG', 'MN', 'AB', 'EF',
'SG', 'MN', 'AB', 'EF', 'SG', 'MN'],
'category': ['A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'C', 'C', 'C', 'C'],
'x': [1,1,1,1, 0,0,0,0, 1,1,1,1],
'y': [1,1,0,0, 0,1,1,0, 1,1,1,1]})
data = data[data['x'] == data['y']][['user', 'category']]
count_users_match = data.groupby('user', as_index=False).count()
count_cat = data['category'].unique().shape[0]
print(count_users_match[count_users_match['category'] == count_cat])
Output:
user category
0 AB 3
I feel that this is quite a long solution. Is there any shorter way to achieve this?
Try this:
filtered = (data['x'].eq(data['y'])
            .groupby(data['user']).sum()
            .loc[lambda x: x == data['category'].nunique()]
            .reset_index(name='category'))
Output:
>>> filtered
user category
0 AB 3
We could use query + groupby + size to find the number of matching rows for each user, then compare it with the total number of categories:
tmp = data.query('x==y').groupby('user').size()
out = tmp[tmp == data['category'].nunique()].reset_index(name='category')
Output:
user category
0 AB 3
Here is an alternative using a list comprehension, though I don't know if it is more efficient:
matched = data.loc[data['x'] == data['y'], 'user'].value_counts()
total = data['user'].value_counts()
out = [{'user': user, 'frequency': matched[user]}
       for user in data['user'].unique()
       if matched.get(user, 0) == total[user]]
>>> out
[{'user': 'AB', 'frequency': 3}]
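Since the requirement is that x equals y on every row of a user, groupby(...).all() expresses it directly; a short sketch (the variable names are mine):
# True per user only if x == y holds on every one of their rows
all_match = data['x'].eq(data['y']).groupby(data['user']).all()
print(all_match[all_match].index.tolist())  # ['AB']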
I need to group a frame by key. For each group there could be:
one pair of ids, where 'max registered' is a unique value I need to keep;
two pairs of ids, id1-id2 and id2-id1, where I need to keep the larger of their 'max registered' values (or either one if they are equal) and keep only one of the two pairs, because id1-id2 and id2-id1 should be considered the same pair (the order of the ids in a pair does not matter);
more than two pairs of ids, which can combine case 1 (one pair) and case 2 (two pairs); these need to be treated like case 1 and case 2 within the same key group.
Here is the original dataframe :
import numpy as np

df = pd.DataFrame({
    'first': ['A', 'B', 'A1', 'B1', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'],
    'second': ['B', 'A', 'B1', 'A1', 'D', 'C', 'F', 'E', 'H', 'G', 'J', 'L'],
    'key': ['AB', 'AB', 'AB', 'AB', 'CD', 'CD', 'EF', 'EF', 'GH', 'GH', 'IJ', 'KL'],
    'max registered': [10, 5, 10, 5, np.nan, 15, 10, 5, np.nan, np.nan, np.nan, 15]
})
df
first second key max registered
0 A B AB 10.0
1 B A AB 5.0
2 A1 B1 AB 10.0
3 B1 A1 AB 5.0
4 C D CD NaN
5 D C CD 15.0
6 E F EF 10.0
7 F E EF 5.0
8 G H GH NaN
9 H G GH NaN
10 I J IJ NaN
11 K L KL 15.0
Here is what the dataframe should look like once it has been grouped and (here comes my problem) aggregated/filtered/transformed/applied? I don't know what to do after grouping my data or which solution I should opt for.
df = pd.DataFrame({
    'first': ['A', 'A1', 'D', 'E', 'G', 'I', 'K'],
    'second': ['B', 'B1', 'C', 'F', 'H', 'J', 'L'],
    'key': ['AB', 'AB', 'CD', 'EF', 'GH', 'IJ', 'KL'],
    'max registered': [10, 10, 15, 10, np.nan, np.nan, 15]
})
df
first second key max registered
0 A B AB 10.0
1 A1 B1 AB 10.0
2 D C CD 15.0
3 E F EF 10.0
4 G H GH NaN
5 I J IJ NaN
6 K L KL 15.0
I've been watching tutorials about groupby() and reading the pandas documentation for 2 days without finding any clue about the logic behind it or the way I should do this. My problem is (as I see it) more complicated and not really covered by those tutorials (for example this one, which I watched several times).
Create an order-insensitive group from the first and second columns. key is useless here since you want a max for each subgroup (one for (A, B) and one for (A1, B1)). Then sort values by 'max registered' in descending order. Finally, group by these virtual groups and keep the first row of each (the max):
out = df.assign(group=df[['first', 'second']].apply(frozenset, axis=1)) \
.sort_values('max registered', ascending=False) \
.groupby('group').head(1).sort_index()
print(out)
first second key max registered group
0 A B AB 10.0 (A, B)
2 A1 B1 AB 10.0 (B1, A1)
5 D C CD 15.0 (C, D)
6 E F EF 10.0 (E, F)
8 G H GH NaN (G, H)
10 I J IJ NaN (J, I)
11 K L KL 15.0 (K, L)
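An equivalent sketch of the same idea, using a row-wise sorted tuple instead of frozenset as the order-insensitive key (the group column name is just illustrative):
import numpy as np

# sort the two id columns within each row, so ('B', 'A') becomes ('A', 'B')
key = [tuple(row) for row in np.sort(df[['first', 'second']].to_numpy(), axis=1)]
out = (df.assign(group=key)
         .sort_values('max registered', ascending=False)
         .groupby('group').head(1)
         .sort_index())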
I have 2 dataframes:
df1 = pd.DataFrame({'A':[1,2,3,4],
'B':[5,6,7,8],
'D':[9,10,11,12]})
and
df2 = pd.DataFrame({'type':['A', 'B', 'C', 'D', 'E'],
'color':['yellow', 'green', 'red', 'pink', 'black'],
'size':['S', 'M', 'L', 'S', 'M']})
I want to map information from df2 onto the headers of df1, so each header also carries the matching color and size. How can I do this? Many thanks :)
Use rename with values aggregated by DataFrame.agg:
df1 = pd.DataFrame({'A1':[1,2,3,4],
'B':[5,6,7,8],
'D':[9,10,11,12]})
s = df2.set_index('type', drop=False).agg(','.join, axis=1)
df1 = df1.rename(columns=s)
print (df1)
A1 B,green,M D,pink,S
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
For the parenthesised format, e.g. A (yellow,S), more processing is needed (this run uses the original df1, whose first column is A):
s = df2.set_index('type').agg(','.join, axis=1).add(')').radd('(')
s = s.index +' ' + s
df1 = df1.rename(columns=s)
print (df1)
A (yellow,S) B (green,M) D (pink,S)
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
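The mapper can also be built explicitly with a dict comprehension, a sketch of the same renaming (the mapper name is mine):
# build {'A': 'A (yellow,S)', 'B': 'B (green,M)', ...} directly from df2's rows
mapper = {t: f'{t} ({c},{s})'
          for t, c, s in df2[['type', 'color', 'size']].itertuples(index=False)}
df1 = df1.rename(columns=mapper)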
This is a follow-up question to get first and last values in a groupby.
How do I drop the first and last rows within each group?
I have this df
df = pd.DataFrame(np.arange(20).reshape(10, -1),
[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
['X', 'Y'])
df
I intentionally made the second row have the same index value as the first row. I won't have control over the uniqueness of the index.
X Y
a a 0 1
a 2 3
c 4 5
d 6 7
b e 8 9
f 10 11
g 12 13
c h 14 15
i 16 17
d j 18 19
I want this
X Y
a a 2.0 3
c 4.0 5
b f 10.0 11
Because both level-0 groups 'c' and 'd' have fewer than 3 rows, all of their rows should be dropped.
I'd apply a similar technique to what I did for the other question:
def first_last(df):
    return df.ix[1:-1]

df.groupby(level=0, group_keys=False).apply(first_last)
Note: in pandas version 0.20.0 and above, ix is deprecated and the use of iloc is encouraged instead.
So the df.ix[1:-1] should be replaced by df.iloc[1:-1].
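On modern pandas you can also skip apply entirely; a minimal sketch using cumcount to keep only the interior rows of each level-0 group:
# a row is interior if it is neither first (cumcount > 0)
# nor last (reverse cumcount > 0) within its level-0 group
g = df.groupby(level=0)
keep = (g.cumcount() > 0) & (g.cumcount(ascending=False) > 0)
out = df[keep]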