How to map values from one dataframe to the headers of another dataframe - Python

I have 2 dataframes:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4],
                    'B': [5, 6, 7, 8],
                    'D': [9, 10, 11, 12]})
and
df2 = pd.DataFrame({'type': ['A', 'B', 'C', 'D', 'E'],
                    'color': ['yellow', 'green', 'red', 'pink', 'black'],
                    'size': ['S', 'M', 'L', 'S', 'M']})
I want to map the information from df2 onto the headers of df1; the result should look like below.
How can I do this? Many thanks :)

Use rename with values aggregated by DataFrame.agg:
# column 'A' renamed to 'A1' here to show that unmatched columns are left as-is
df1 = pd.DataFrame({'A1': [1, 2, 3, 4],
                    'B': [5, 6, 7, 8],
                    'D': [9, 10, 11, 12]})
# join all of df2's columns into one string per row, keyed by 'type'
s = df2.set_index('type', drop=False).agg(','.join, axis=1)
df1 = df1.rename(columns=s)
print(df1)
A1 B,green,M D,pink,S
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
To wrap the color and size in parentheses, a bit more processing is needed:
s = df2.set_index('type').agg(','.join, axis=1).add(')').radd('(')
s = s.index + ' ' + s
df1 = df1.rename(columns=s)  # df1 here is the original frame with columns A, B, D
print(df1)
A (yellow,S) B (green,M) D (pink,S)
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
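An equivalent way to build the rename mapping, if you prefer an explicit dictionary; this is a sketch of the same idea, not part of the answer above, and assumes the df1/df2 from the question:

# build {'A': 'A (yellow,S)', 'B': 'B (green,M)', ...} row by row
mapping = {row['type']: f"{row['type']} ({row['color']},{row['size']})"
           for _, row in df2.iterrows()}
df1 = df1.rename(columns=mapping)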

Related

Concatenate/combine two columns into one for Pandas dataframe when axis = 0

I have dataframe:
import numpy as np
import pandas as pd

d_test = {
    'c1': ['a', 'b', np.nan, 'c'],
    'c2': ['d', np.nan, 'e', 'f'],
    'test': [1, 2, 3, 4],
}
df_test = pd.DataFrame(d_test)
And I want to concatenate columns c1 and c2 into one and get the following resulting dataframe:
a 1
b 2
c 4
d 1
e 3
f 4
I tried to use
pd.concat([df_test.c1, df_test.c2], axis=0)
to generate such a column, but I have no idea how to keep the 'test' column during concatenation.
Use melt:
df_test.melt('test').dropna()[['value', 'test']]
result:
value test
0 a 1
1 b 2
3 c 4
4 d 1
6 e 3
7 f 4
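To see why indices 0, 1, 3, 4, 6, 7 survive, it may help to look at the intermediate melt result before dropna removes the NaN rows (a sketch using the df_test from the question):

m = df_test.melt('test')  # columns: test, variable, value
print(m)
#    test variable value
# 0     1       c1     a
# 1     2       c1     b
# 2     3       c1   NaN
# 3     4       c1     c
# 4     1       c2     d
# 5     2       c2   NaN
# 6     3       c2     e
# 7     4       c2     f
# rows 2 and 5 hold NaN and are removed by dropna()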

How to create df1 from df2 with different row and column indices?

I want to fill df1 using df2's values. I could achieve it using a nested loop, but that is very time-consuming.
Is there any smart way to do this?
P.S. The size of the df is around 8000 rows x 8000 columns.
df1 initially is like this
A B C D
A 0 0 0 0
B 0 0 0 0
C 0 0 0 0
D 0 0 0 0
df2 is like this
P Q R S T
P 1 5 7 5 3
Q 5 6 2 8 5
R 3 5 4 9 3
S 9 4 5 0 8
T 2 9 4 2 1
Now there is a correspondence list between the indices of df1 and df2:
df1 df2
A P
B Q
C R
D S
B T
df1 should be filled like this
A B C D
A 1 8 7 5
B 7 21 6 10
C 3 8 4 9
D 9 12 5 0
Here, as 'B' occurs twice in the list, the values of 'Q' and 'T' are added together.
Thank you in advance.
You could try changing the row and column names in df1 (based on the correspondence with df2), and for cases of multiple correspondence (like B) you could first name them B1, B2, etc., and then sum them together:
di = {'P': 'A', 'Q': 'B1', 'R': 'C', 'S': 'D', 'T': 'B2'}
df1 = df2.copy()
df1.columns = [di[c] for c in df2.columns]
df1.index = [di[c] for c in df2.index]
# sum B1, B2 column-wise
df1['B'] = df1.B1 + df1.B2
# sum B1, B2 row-wise (loc rather than at, since we assign a whole row)
df1.loc['B'] = df1.loc['B1'] + df1.loc['B2']
# subset with the original index and column names
df1[['A', 'B', 'C', 'D']].loc[['A', 'B', 'C', 'D']]
# output:
A B C D
A 1.0 8.0 7.0 5.0
B 7.0 21.0 6.0 10.0
C 3.0 8.0 4.0 9.0
D 9.0 12.0 5.0 0.0
You can also stack df2 into a Series, so that the columns become an inner index level (level_1) of the Series.
Then replace the indices with {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'},
use groupby with sum to add values with the same indices, and then unstack to turn the inner index level back into columns.
amap = {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'}
obj2 = df2.stack().reset_index()
for col in ['level_0', 'level_1']:
    obj2[col] = obj2[col].map(amap)
df1 = obj2.groupby(['level_0', 'level_1'])[0].sum().unstack()
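A shorter sketch of the same idea, assuming the same amap: groupby also accepts a mapping over the index labels directly, so you can group the rows, transpose, group the former columns, and transpose back:

amap = {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'}
# group rows P..T into A..D, then do the same for the columns via transpose
df1 = df2.groupby(amap).sum().T.groupby(amap).sum().T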

How to change values in a column to fake values

I want to change values from one column in a dataframe to fake data.
Here is a sample that looks like the original table:
df = {'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
df
Now what I want to do is to change the Name column values to fake values like this:
df = {'Name': ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
df
Notice how I changed the names to distinct combinations of letters. This is sample data; in the real data there are a lot of names, so I start with A, B, C, D, and when it reaches Z the next new name should be AA, then AB, and so on.
Is this viable?
Here is my suggestion. The list 'fake' below has more than 23,000 items; if your df has more unique values, just increase the end of the loop (currently 5) and the fake list will grow exponentially:
import string
from itertools import combinations_with_replacement

names = df['Name'].unique()
letters = list(string.ascii_uppercase)
fake = []
for i in range(1, 5):  # increase 5 if you need more items
    fake.extend(combinations_with_replacement(letters, i))
fake = [''.join(t) for t in fake]
d = dict(zip(names, fake))
df['code'] = df.Name.map(d)
Sample of fake:
>>> print(fake[:30])
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC', 'AD']
Output:
>>> print(df)
Name Age code
0 David 10 A
1 David 10 A
2 David 10 A
3 Kevin 12 B
4 Kevin 12 B
5 Ann 15 C
6 Joan 13 D
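If you want to overwrite the Name column rather than add a separate code column, the same mapping applies in place:

df['Name'] = df.Name.map(d)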
Use factorize and make the fake name an int, which is cheap to store:
df['Fake'] = df.Name.factorize()[0]
df
Name Age Fake
0 David 10 0
1 David 10 0
2 David 10 0
3 Kevin 12 1
4 Kevin 12 1
5 Ann 15 2
6 Joan 13 3
If you need a mixed string type instead:
df.groupby('Name')['Name'].transform(lambda x : pd.util.testing.rands_array(8,1)[0])
0 jNAO9AdJ
1 jNAO9AdJ
2 jNAO9AdJ
3 es0p4Yjx
4 es0p4Yjx
5 x54NNbdF
6 hTMKxoXW
Name: Name, dtype: object
from string import ascii_lowercase

def excel_names(num_cols):
    letters = list(ascii_lowercase)
    excel_cols = []
    for i in range(0, num_cols - 1):
        n = i // 26
        m = n // 26
        i -= n * 26
        n -= m * 26
        col = (letters[m - 1] + letters[n - 1] + letters[i] if m > 0
               else letters[n - 1] + letters[i] if n > 0
               else letters[i])
        excel_cols.append(col)
    return excel_cols

unique_names = df['Name'].nunique() + 1
names = excel_names(unique_names)
dictionary = dict(zip(df['Name'].unique(), names))
df['new_Name'] = df['Name'].map(dictionary)
Get a new integer category for the names using cumsum, and use Python's ord/chr to turn the integer into a letter starting from 'A' (note: this assumes identical names are adjacent and there are at most 26 unique names):
df['Name'] = (~(df.Name.shift(1) == df.Name)).cumsum().add(ord('A') - 1).map(chr)
print(df)
Name Age
0 A 10
1 A 10
2 A 10
3 B 12
4 B 12
5 C 15
6 D 13
Let us think about it another way: if you need a fake symbol, we can map the names to A0, A1, A2, ..., An. This would be easier.
df = {'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
# 'name_map' rather than 'map', to avoid shadowing the builtin
name_map = pd.DataFrame({'name': df['Name'].unique()})
name_map['seq'] = name_map.index
name_map['symbol'] = name_map['seq'].apply(lambda x: 'A' + str(x))
df['code'] = df['Name'].apply(lambda x: name_map.loc[name_map['name'] == x, 'symbol'].values[0])
df
Name Age code
0 David 10 A0
1 David 10 A0
2 David 10 A0
3 Kevin 12 A1
4 Kevin 12 A1
5 Ann 15 A2
6 Joan 13 A3
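If you specifically want the asker's A, B, ..., Z, AA, AB, ... scheme without precomputing a long list, here is a minimal sketch combining factorize with a base-26 conversion; the helper int_to_letters is illustrative, not from any answer above:

import pandas as pd

def int_to_letters(n):
    # 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ...
    label = ''
    n += 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord('A') + rem) + label
    return label

df = pd.DataFrame({'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'],
                   'Age': [10, 10, 10, 12, 12, 15, 13]})
codes, _ = pd.factorize(df['Name'])
df['Name'] = [int_to_letters(c) for c in codes]
# David -> A, Kevin -> B, Ann -> C, Joan -> D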

How Pandas is doing groupby for below scenario

I am facing a problem while trying to understand the groupby code snippet below. I am trying to understand how the calculation happens for df.groupby(L).sum().
This is a code snippet that I got from an online tutorial.
Thanks for any help.
Rows are grouped by the values of the list; because the length of the list is the same as the number of rows in the DataFrame, it means:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
L = [0, 1, 0, 1, 2, 0]
print(df)
key data1 data2
0 A 0 5 <-0
1 B 1 0 <-1
2 C 2 3 <-0
3 A 3 3 <-1
4 B 4 7 <-2
5 C 5 9 <-0
So:
data1 for 0 is 0 + 2 + 5 = 7
data2 for 0 is 5 + 3 + 9 = 17
data1 for 1 is 1 + 3 = 4
data2 for 1 is 0 + 3 = 3
data1 for 2 is 4
data2 for 2 is 7
Output:
print(df.groupby(L).sum())
data1 data2
0 7 17
1 4 3
2 4 7
The key column is omitted because of the automatic exclusion of 'nuisance' (non-numeric) columns.
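Equivalently, you can make the grouping explicit by attaching L as a column first; a quick sketch of the same computation:

out = df.assign(grp=L).groupby('grp')[['data1', 'data2']].sum()
print(out)  # same numbers as df.groupby(L).sum()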

Subtract rows from two dataframes based on index value

I have two dataframes:
df1 = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Value': [10, 9, 8, 10, 99, 88],
    'Day': [1, 2, 3, 4, 1, 2]
})
df2 = pd.DataFrame({
    'Name': ['C', 'C', 'C', 'C'],
    'Value': [1, 2, 3, 4],
    'Day': [1, 2, 3, 4]
})
I would like to subtract the values in df2 from the values in df1 based on the day, and create a new dataframe called delta_values. If there are no entries for the day then no action should occur.
To explain further: B in the Name column only has values for days 1 and 2. df2's values for days 1 and 2 should be subtracted from B's values for those days, but since B has no values for days 3 and 4, no arithmetic should occur there. I am having trouble with this part.
The output I am looking for is:
  Name  Value  Day
0    A      9    1
1    A      7    2
2    A      5    3
3    A      6    4
4    B     98    1
5    B     86    2
If nothing better comes to somebody's mind, here's a correct but not very elegant solution:
result = df1.set_index(['Day', 'Name']).unstack()['Value']\
    .subtract(df2.set_index('Day')['Value'], axis=0)\
    .stack().reset_index()
Make the result look like the expected output:
result.columns = 'Day', 'Name', 'Value'
result.Value = result.Value.astype(int)
result.sort_values(['Name', 'Day'], inplace=True)
result = result[['Name', 'Value', 'Day']]
We can merge the two DataFrames on the Day column and then subtract from there.
merged = df1.merge(df2, how='inner', on='Day', suffixes=('', '_y'))
print(merged)
Name Value Day Name_y Value_y
0 A 10 1 C 1
1 A 9 2 C 2
2 A 8 3 C 3
3 A 10 4 C 4
4 B 99 1 C 1
5 B 88 2 C 2
delta_values = df1.copy()
delta_values['Value'] = merged['Value'] - merged['Value_y']
print(delta_values)
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
You can make do with either map or merge. Here's a map solution:
delta_values = df1.copy()
delta_values['Value'] -= delta_values['Day'].map(df2.set_index('Day')['Value']).fillna(0)
Output:
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
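The fillna(0) is what implements the "if there are no entries for the day then no action" rule: days absent from df2 map to NaN and then subtract zero. A quick sketch to illustrate, using a hypothetical df2 that is missing day 4:

df2_partial = df2[df2['Day'] != 4]  # no entry for day 4
delta = df1.copy()
delta['Value'] -= delta['Day'].map(df2_partial.set_index('Day')['Value']).fillna(0)
# the Day == 4 row keeps its original Value; all other rows are reduced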
