Dataframe complex reformatting - python

I would like to transform this dataframe:
import pandas as pd
# Each key becomes a row label; the lists hold the row values
df = pd.DataFrame.from_dict({'a': [13,'F','RD',0,0,1,0,1],
                             'b': [45,'M','RD',1,1,0,1,0],
                             'c': [67,'F','AN',0,0,1,0,1],
                             'd': [23,'M','AN',1,0,0,1,1]},
                            orient='index', columns=['AGE', 'SEX', 'REG', 'A', 'B', 'C', 'D', 'E'])
print(df)
AGE SEX REG A B C D E
a 13 F RD 0 0 1 0 1
b 45 M RD 1 1 0 1 0
c 67 F AN 0 0 1 0 1
d 23 M AN 1 0 0 1 1
into this one:
AGE SEX REG PRODUCT PA
a 13 F RD A 0
a 13 F RD B 0
a 13 F RD C 1
a 13 F RD D 0
a 13 F RD E 1
b 45 M RD A 1
b 45 M RD B 1
b 45 M RD C 0
b 45 M RD D 1
b 45 M RD E 0
c 67 F AN A 0
c 67 F AN B 0
c 67 F AN C 1
c 67 F AN D 0
c 67 F AN E 1
d 23 M AN A 1
d 23 M AN B 0
d 23 M AN C 0
d 23 M AN D 1
d 23 M AN E 1
So basically, each product (A, B, C, D, E) is repeated for each user (a, b, c, d), with the corresponding value for each user/product pair. The original table has thousands of rows.

You can use set_index with stack, then reset_index, and finally rename the new column to PRODUCT:
print (df.set_index(['AGE','SEX','REG'])
.stack()
.reset_index(name='PA')
.rename(columns={'level_3':'PRODUCT'}))
AGE SEX REG PRODUCT PA
0 13 F RD A 0
1 13 F RD B 0
2 13 F RD C 1
3 13 F RD D 0
4 13 F RD E 1
5 45 M RD A 1
6 45 M RD B 1
7 45 M RD C 0
8 45 M RD D 1
9 45 M RD E 0
10 67 F AN A 0
11 67 F AN B 0
12 67 F AN C 1
13 67 F AN D 0
14 67 F AN E 1
15 23 M AN A 1
16 23 M AN B 0
17 23 M AN C 0
18 23 M AN D 1
19 23 M AN E 1
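To keep the original index (a, b, c, d), pass append=True to set_index and reset only the added levels:
print (df.set_index(['AGE','SEX','REG'], append=True)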
.stack()
.reset_index([1,2,3,4], name='PA')
.rename(columns={'level_4':'PRODUCT'}))
AGE SEX REG PRODUCT PA
a 13 F RD A 0
a 13 F RD B 0
a 13 F RD C 1
a 13 F RD D 0
a 13 F RD E 1
b 45 M RD A 1
b 45 M RD B 1
b 45 M RD C 0
b 45 M RD D 1
b 45 M RD E 0
c 67 F AN A 0
c 67 F AN B 0
c 67 F AN C 1
c 67 F AN D 0
c 67 F AN E 1
d 23 M AN A 1
d 23 M AN B 0
d 23 M AN C 0
d 23 M AN D 1
d 23 M AN E 1
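The same wide-to-long reshape can also be sketched with melt, which is not part of the answer above but is a common alternative. The variable name long_df is only illustrative, and the example rebuilds the question's data so that it runs on its own:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame.from_dict(
    {'a': [13, 'F', 'RD', 0, 0, 1, 0, 1],
     'b': [45, 'M', 'RD', 1, 1, 0, 1, 0],
     'c': [67, 'F', 'AN', 0, 0, 1, 0, 1],
     'd': [23, 'M', 'AN', 1, 0, 0, 1, 1]},
    orient='index',
    columns=['AGE', 'SEX', 'REG', 'A', 'B', 'C', 'D', 'E'])

# reset_index() turns the unnamed row labels into a column called 'index'
long_df = (df.reset_index()
             .melt(id_vars=['index', 'AGE', 'SEX', 'REG'],
                   var_name='PRODUCT', value_name='PA')
             .sort_values(['index', 'PRODUCT'])   # group rows by user again
             .set_index('index')
             .rename_axis(None))
print(long_df)
```

melt returns the rows grouped by product, so the sort_values call is what restores the user-by-user order shown in the question.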

Related

Mark repeated id with a-b relationship in dataframe

I'm trying to mark a chained relationship between repeated IDs in a dataframe. Take 91, which is repeated 4 times: for the first 91 row, first is set to A and other to B; for the next 91 row, first is B and other is C; then C and D, and so on. The same relationship applies to every duplicated ID.
For IDs that are not repeated, first is marked as A and other stays 0.
id  first  other
11  0      0
09  0      0
91  0      0
91  0      0
91  0      0
91  0      0
15  0      0
15  0      0
12  0      0
01  0      0
01  0      0
01  0      0
Expected output:
id  first  other
11  A      0
09  A      0
91  A      B
91  B      C
91  C      D
91  D      E
15  A      B
15  B      C
12  A      0
01  A      B
01  B      C
01  C      D
I am using df.iterrows() for this, but the code is getting very messy and will be slow as the dataset grows. Is there an easier way to do it?
You can perform a mapping using a cumcount per group as source:
from string import ascii_uppercase
# mapping dictionary
# this is an example, you can use any mapping
d = dict(enumerate(ascii_uppercase))
# {0: 'A', 1: 'B', 2: 'C'...}
g = df.groupby('id')
c = g.cumcount()
m = g['id'].transform('size').gt(1)
df['first'] = c.map(d)
df.loc[m, 'other'] = c[m].add(1).map(d)
Output:
id first other
0 11 A 0
1 9 A 0
2 91 A B
3 91 B C
4 91 C D
5 91 D E
6 15 A B
7 15 B C
8 12 A 0
9 1 A B
10 1 B C
11 1 C D
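For reference, here is an end-to-end sketch of the cumcount mapping that can be run as is. It makes two assumptions that are not in the answer: the ids are stored as strings (so that 09 and 01 keep their leading zeros), and other is built in one step with where instead of being assigned into an existing integer column, which recent pandas versions may warn about. The names letters, pos and repeated are only illustrative.

```python
import pandas as pd
from string import ascii_uppercase

# Assumption: ids kept as strings to preserve leading zeros
ids = ['11', '09', '91', '91', '91', '91', '15', '15', '12', '01', '01', '01']
df = pd.DataFrame({'id': ids})

letters = dict(enumerate(ascii_uppercase))   # {0: 'A', 1: 'B', ...}

g = df.groupby('id')
pos = g.cumcount()                            # 0, 1, 2, ... within each id
repeated = g['id'].transform('size').gt(1)    # True where the id occurs more than once

df['first'] = pos.map(letters)
# next letter for repeated ids, 0 for ids that occur only once
df['other'] = (pos + 1).map(letters).astype(object).where(repeated, 0)
print(df)
```

Note that the mapping only covers 26 letters, so an id repeated more than 26 times would get missing values in first.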
Given:
id
0 11
1 9
2 91
3 91
4 91
5 91
6 15
7 15
8 12
9 1
10 1
11 1
Doing:
# Count ids per group
df['first'] = df.groupby('id').cumcount()
# convert to letters and make other col
m = df.groupby('id').filter(lambda x: len(x)>1).index
df.loc[m, 'other'] = df['first'].add(66).apply(chr)
df['first'] = df['first'].add(65).apply(chr)
# fill in missing with 0
df['other'] = df['other'].fillna(0)
Output:
id first other
0 11 A 0
1 9 A 0
2 91 A B
3 91 B C
4 91 C D
5 91 D E
6 15 A B
7 15 B C
8 12 A 0
9 1 A B
10 1 B C
11 1 C D

How to match list in multiple columns

Here is an example of my dataframe:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 ...
0 A B C D E F G H I J K L M N O P Q R S T U V
1 B C D E F G H I J K L M N O P Q R S T U V A
2 V A B C D E F G H I J K L M N O P Q R S T U
and my list:
mylist = ['A', 'B', 'C']
I want to match the columns against the list so that only the characters in the list remain in each column.
The output I want:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 ...
0 A B C
1 B C A
2 A B C
I'm not sure how to do this, so I'm asking here.
Thank you for reading.
Use DataFrame.isin with DataFrame.where:
mylist = ['A', 'B', 'C']
df = df.where(df.isin(mylist), '')
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
0 A B C
1 B C A
2 A B C
Or, if modifying the DataFrame in place with an inverted mask is acceptable, use:
df[~df.isin(mylist)] = ''
This might also work:
df = df[df.isin(mylist)].fillna('')
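A small runnable sketch of the isin/where approach, using a made-up 3x4 frame rather than the full 22-column one from the question. It also shows why the comma in the list matters: Python concatenates adjacent string literals, so a list written without it silently contains 'BC'.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the question's frame
df = pd.DataFrame([list('ABCD'), list('BCDA'), list('DABC')])
mylist = ['A', 'B', 'C']

# Keep cells whose value is in mylist, blank out everything else
masked = df.where(df.isin(mylist), '')
print(masked)

# Adjacent string literals are joined, so a missing comma changes the list
print(['A', 'B' 'C'])
```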

How to create a multi-conditional 1D dataframe from a multi dimensional dataframe

I have a table with scores for each product that needs to be sold over 10 days, and the availability of each product (10 units in total):
A B C D
20 56 12 65
80 13 76 51
24 81 56 90
67 12 65 87
45 23 67 50
62 32 23 75
76 34 67 67
23 45 32 98
24 67 34 12
56 53 32 78
Product availability
A 3
B 2
C 3
D 2
First I had to rank the products for each day to prioritize what I need to sell. I was able to do that with:
import pandas as pd
df = pd.read_csv('test.csv')
new_df = pd.DataFrame()
num = len(list(df))
for i in range(1, num+1):
    new_df['Max'+str(i)] = df.T.apply(lambda x: x.nlargest(i).idxmin())
print(new_df)
That gives me
Max1 Max2 Max3 Max4
0 D B A C
1 A C D B
2 D B C A
3 D A C B
4 C D A B
5 D A B C
6 A C C B
7 D B C A
8 B C A D
9 D A B C
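Row 6 of this ranking lists C twice because C and D tie at 67 on that day and idxmin returns the first of the tied labels. As a sketch of one way around that (an alternative to the question's loop, not something from the thread), the ranking can be built with a stable argsort on the negated scores. The names scores, order and ranked are only illustrative.

```python
import numpy as np
import pandas as pd

# Scores table from the question
scores = pd.DataFrame({
    'A': [20, 80, 24, 67, 45, 62, 76, 23, 24, 56],
    'B': [56, 13, 81, 12, 23, 32, 34, 45, 67, 53],
    'C': [12, 76, 56, 65, 67, 23, 67, 32, 34, 32],
    'D': [65, 51, 90, 87, 50, 75, 67, 98, 12, 78],
})

# Sort each row by descending score; 'stable' keeps tied columns in their original order
order = np.argsort(-scores.to_numpy(), axis=1, kind='stable')
ranked = pd.DataFrame(scores.columns.to_numpy()[order],
                      index=scores.index,
                      columns=[f'Max{k + 1}' for k in range(scores.shape[1])])
print(ranked)
```

With this version every product appears exactly once per row, so row 6 becomes A C D B instead of A C C B.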
Now comes the hard part: how do I create a table that contains the product to be sold each day, looking at the Max1 column while also keeping track of availability? If the product is not available, choose the next maximum. The final df should look like this:
0 D
1 A
2 D
3 A
4 C
5 A
6 C
7 B
8 B
9 C
Breaking my head over this. Any help is appreciated. Thanks.
import pandas as pd
df1 = pd.read_csv('file1', sep=r'\s+', header=None, names=['product', 'available'])
print(df1)
df2 = pd.read_csv('file2', sep=r'\s+')
print(df2)
maxy = []
for i in range(len(df2)):
    # Try Max1..Max4 in order; pick the first product still available and decrement its stock
    if df1.loc[df1['product'] == df2['Max1'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max1'][i])
        df1.loc[df1['product'] == df2['Max1'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max2'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max2'][i])
        df1.loc[df1['product'] == df2['Max2'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max3'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max3'][i])
        df1.loc[df1['product'] == df2['Max3'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max4'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max4'][i])
        df1.loc[df1['product'] == df2['Max4'][i], 'available'] -= 1
    else:
        print("Check")
print(pd.DataFrame(maxy))
Output:
product available
0 A 3
1 B 2
2 C 3
3 D 2
Max1 Max2 Max3 Max4
0 D B A C
1 A C D B
2 D B C A
3 D A C B
4 C D A B
5 D A B C
6 A C C B
7 D B C A
8 B C A D
9 D A B C
0
0 D
1 A
2 D
3 A
4 C
5 A
6 C
7 B
8 B
9 C
I was able to do that for any number of products with this:
cols = list(df2)
maxy = []
for i in range(len(df2)):
    for x in cols:
        # rows of df1 that hold the candidate product for this day
        is_product = df1['product'] == df2[x][i]
        if df1.loc[is_product, 'available'].values[0] > 0:
            maxy.append(df2[x][i])
            df1.loc[is_product, 'available'] -= 1
            break
final = pd.DataFrame(maxy)
print(final)
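The same greedy selection can be sketched as a small function that keeps the remaining stock in a plain dict instead of updating a DataFrame on every step. The function name allocate and its arguments are hypothetical; the data is copied from the tables above so the example runs without the CSV files.

```python
import pandas as pd

def allocate(ranked, stock):
    """For each day, pick the highest-ranked product that is still in stock."""
    remaining = dict(stock)          # copy so the caller's dict is not modified
    picks = []
    for _, row in ranked.iterrows():
        # first product in ranking order with stock left, or None if all are sold out
        choice = next((p for p in row if remaining.get(p, 0) > 0), None)
        if choice is not None:
            remaining[choice] -= 1
        picks.append(choice)
    return pd.Series(picks, index=ranked.index)

# Ranking and availability as shown in the question
ranked = pd.DataFrame(
    [list('DBAC'), list('ACDB'), list('DBCA'), list('DACB'), list('CDAB'),
     list('DABC'), list('ACCB'), list('DBCA'), list('BCAD'), list('DABC')],
    columns=['Max1', 'Max2', 'Max3', 'Max4'])
stock = {'A': 3, 'B': 2, 'C': 3, 'D': 2}

print(allocate(ranked, stock))
```

A day for which every ranked product is sold out yields None rather than printing a message, which is a design choice of this sketch.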
Thanks

how to subtract all pandas dataframe elements with each other easier way?

Let's say I have a dataframe like this:
name time
a 10
b 30
c 11
d 13
Now I want a new dataframe like this:
name1 name2 time_diff
a a 0
a b -20
a c -1
a d -3
b a 20
b b 0
b c 19
b d 17
.....
.....
d d 0
Nested for loops or a lambda function can be used, but once the number of elements goes above 200 the loops take so long that I always have to interrupt the process. Does someone know a pandas way of doing this that is quicker and easier? The shape of my dataframe is 1600x2.
Solution with itertools:
import itertools
import pandas as pd
d=pd.DataFrame(list(itertools.product(df.name,df.name)),columns=['name1','name2'])
dic = dict(zip(df.name,df.time))
d['time_diff']=d.name1.map(dic)-d.name2.map(dic)
print(d)
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
Perform a cross join first using merge with a helper column, then compute the difference and select only the necessary columns:
df = df.assign(A=1)
df = pd.merge(df, df, on='A', suffixes=('1','2'))
df['time_diff'] = df['time1'] - df['time2']
df = df[['name1','name2','time_diff']]
print (df)
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
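Assuming pandas 1.2 or later, merge also accepts how='cross', so the helper column is not needed. This is a sketch of that variant, not part of the answer above; the name pairs is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': list('abcd'), 'time': [10, 30, 11, 13]})

# Cross join: every row paired with every row; suffixes disambiguate the columns
pairs = df.merge(df, how='cross', suffixes=('1', '2'))
pairs['time_diff'] = pairs['time1'] - pairs['time2']
pairs = pairs[['name1', 'name2', 'time_diff']]
print(pairs)
```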
Another solution with MultiIndex.from_product and reindex by first and second level:
df = df.set_index('name')
mux = pd.MultiIndex.from_product([df.index, df.index], names=['name1','name2'])
df = (df['time'].reindex(mux, level=0)
.sub(df.reindex(mux, level=1)['time'])
.rename('time_diff')
.reset_index())
Another way would be to use df.apply:
df=pd.DataFrame({'col':['a','b','c','d'],'col1':[10,30,11,13]})
index = pd.MultiIndex.from_product([df['col'], df['col']], names = ["name1", "name2"])
res=pd.DataFrame(index = index).reset_index()
res['time_diff']=df.apply(lambda x: x['col1']-df['col1'],axis=1).values.flatten()
Output:
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
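For a frame of about 1600 rows, the differences can also be computed in one step with NumPy broadcasting. This is a sketch of an alternative that none of the answers above use; the names t, diff and out are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': list('abcd'), 'time': [10, 30, 11, 13]})

t = df['time'].to_numpy()
names = df['name'].to_numpy()

# diff[i, j] = time[i] - time[j], computed for all pairs at once
diff = t[:, None] - t[None, :]

out = pd.DataFrame({
    'name1': np.repeat(names, len(df)),   # a a a a b b b b ...
    'name2': np.tile(names, len(df)),     # a b c d a b c d ...
    'time_diff': diff.ravel(),            # row-major order matches the two columns above
})
print(out)
```

The result has n squared rows, so 1600 input rows produce 2,560,000 output rows.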

Count and sort co-occurence matrix

I have a co-occurrence matrix in pandas. How do I get the co-occurrence values of all the combinations, sorted in descending order, without looping?
(I didn't write the values on the other side of the diagonal, but they are there, and hold mirrored values)
Input:
  A B C D E F
A 0 1 0 1 2 0
B   0 3 1 1 1
C     0 1 8 9
D       0 2 6
E         0 9
F           0
Output:
CF 9
EF 9
CE 8
DF 6
BC 3
AE 2
DE 2
AB 1
AD 1
BD 1
BE 1
BF 1
CD 1
AC 0
AF 0
import numpy as np
import pandas as pd
i, j = np.triu_indices(len(df), 1)
pd.Series(
df.values[i, j], df.index[i] + df.index[j]
).sort_values(ascending=False)
EF 9
CF 9
CE 8
DF 6
BC 3
DE 2
AE 2
CD 1
BF 1
BE 1
BD 1
AD 1
AB 1
AF 0
AC 0
dtype: object
Setup
from io import StringIO
txt = """\
  A B C D E F
A 0 1 0 1 2 0
B   0 3 1 1 1
C     0 1 8 9
D       0 2 6
E         0 9
F           0"""
df = pd.read_fwf(StringIO(txt), index_col=0).fillna('')
df
  A B C D E F
A 0 1 0 1 2 0
B   0 3 1 1 1
C     0 1 8 9
D       0 2 6
E         0 9
F           0
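Since the question says the real matrix is symmetric with mirrored values below the diagonal, here is a self-contained sketch of the same triu_indices idea on a full numeric matrix. Building the frame from an upper-triangle array (the name upper) is an assumption made only to keep the example short.

```python
import numpy as np
import pandas as pd

labels = list('ABCDEF')
# Upper triangle from the question; the lower triangle is filled by mirroring
upper = np.array([[0, 1, 0, 1, 2, 0],
                  [0, 0, 3, 1, 1, 1],
                  [0, 0, 0, 1, 8, 9],
                  [0, 0, 0, 0, 2, 6],
                  [0, 0, 0, 0, 0, 9],
                  [0, 0, 0, 0, 0, 0]])
df = pd.DataFrame(upper + upper.T, index=labels, columns=labels)

# Row/column positions strictly above the diagonal, i.e. each pair once
i, j = np.triu_indices(len(df), 1)
pairs = pd.Series(df.to_numpy()[i, j],
                  index=df.index[i] + df.index[j]).sort_values(ascending=False)
print(pairs)
```

Because the values are numeric here, the resulting Series has an integer dtype rather than object.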
You can iterate over row and column pairs using combinations from itertools and collect them in a list:
from itertools import combinations
explode_list = [[r + c, df.loc[r, c]] for r, c in combinations(df.columns, 2)]
Output
[['AB', 1],
['AC', 0],
...
]
