Here is an example of my dataframe:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 ...
0 A B C D E F G H I J K L M N O P Q R S T U V
1 B C D E F G H I J K L M N O P Q R S T U V A
2 V A B C D E F G H I J K L M N O P Q R S T U
and my list
mylist = ['A', 'B', 'C']
I want to match the columns against the list so that only the characters in the list are kept in the columns.
The output I want:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 ...
0 A B C
1 B C A
2 A B C
I'm not sure how to do this, so I'm asking here. Thank you for reading.
Use DataFrame.isin with DataFrame.where:
mylist = ['A', 'B', 'C']
df = df.where(df.isin(mylist), '')
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
0 A B C
1 B C A
2 A B C
Or, if an inverted mask works for you, use:
df[~df.isin(mylist)] = ''
This might also work -
df = df[df.isin(mylist)].fillna('')
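All of the above keep every original column and blank out the non-matches. If you also want to drop the columns that end up entirely empty, a minimal follow-up sketch:

df = df.where(df.isin(mylist), '')
df = df.loc[:, (df != '').any()]  # keep only columns with at least one match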
I want to use the dataframe.melt function from the pandas lib to convert data from rows into a column while keeping the first column's value. I've also tried .pivot, but it doesn't work well. Please look at the example below and please help:
ID Alphabet Unspecified: 1 Unspecified: 2
0 1 A G L
1 2 B NaN NaN
2 3 C H NaN
3 4 D I M
4 5 E J NaN
5 6 F K O
Into this:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
Try (assuming ID is unique and sorted):
df = (
    pd.melt(df, "ID")
    .sort_values("ID", kind="stable")  # stable sort keeps the per-ID column order
    .drop(columns="variable")
    .dropna()
    .reset_index(drop=True)
    .rename(columns={"value": "Alphabet"})
)
print(df)
Prints:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
Don't melt; rather, stack. This directly drops the NaNs and keeps the order per row:
out = (df
   .set_index('ID')
   .stack().droplevel(1)
   .reset_index(name='Alphabet')
)
Output:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index = 'ID',
     names_to = 'Alphabet',
     names_pattern = ['.+'],
     sort_by_appearance = True)
 .dropna()
)
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
6 3 C
7 3 H
9 4 D
10 4 I
11 4 M
12 5 E
13 5 J
15 6 F
16 6 K
17 6 O
In the code above, names_pattern accepts a list of regular expressions to match the desired columns; all the matches are collated into a single column, named Alphabet via names_to.
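Note that the index is no longer consecutive after the dropna; if that matters, you can chain reset_index (a small addition to the snippet above):

(df
 .pivot_longer(
     index = 'ID',
     names_to = 'Alphabet',
     names_pattern = ['.+'],
     sort_by_appearance = True)
 .dropna()
 .reset_index(drop=True)
)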
I have a table of scores for each product that needs to be sold over 10 days, plus the availability of each product (total number of products = 10):
A B C D
20 56 12 65
80 13 76 51
24 81 56 90
67 12 65 87
45 23 67 50
62 32 23 75
76 34 67 67
23 45 32 98
24 67 34 12
56 53 32 78
Product availability
A 3
B 2
C 3
D 2
First I had to rank the products to prioritize what to sell each day. I was able to do that with:
import pandas as pd
df = pd.read_csv('test.csv')
new_df = pd.DataFrame()
num = len(list(df))
for i in range(1, num + 1):
    new_df['Max' + str(i)] = df.T.apply(lambda x: x.nlargest(i).idxmin())
print(new_df)
That gives me
Max1 Max2 Max3 Max4
0 D B A C
1 A C D B
2 D B C A
3 D A C B
4 C D A B
5 D A B C
6 A C C B
7 D B C A
8 B C A D
9 D A B C
Now comes the hard part: how do I create a table that contains the product to be sold each day, looking at the Max1 column while also keeping track of availability? If the product is not available, choose the next maximum. The final df should look like this:
0 D
1 A
2 D
3 A
4 C
5 A
6 C
7 B
8 B
9 C
I'm breaking my head over this. Any help is appreciated. Thanks.
import pandas as pd

df1 = pd.read_csv('file1', sep=r'\s+', header=None, names=['product', 'available'])
print(df1)
df2 = pd.read_csv('file2', sep=r'\s+')
print(df2)
maxy = []
for i in range(len(df2)):
    # try Max1..Max4 in priority order and take the first product still in stock
    if df1.loc[df1['product'] == df2['Max1'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max1'][i])
        df1.loc[df1['product'] == df2['Max1'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max2'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max2'][i])
        df1.loc[df1['product'] == df2['Max2'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max3'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max3'][i])
        df1.loc[df1['product'] == df2['Max3'][i], 'available'] -= 1
    elif df1.loc[df1['product'] == df2['Max4'][i], 'available'].values[0] > 0:
        maxy.append(df2['Max4'][i])
        df1.loc[df1['product'] == df2['Max4'][i], 'available'] -= 1
    else:
        print("Check")
pd.DataFrame(maxy)
Output:
product available
0 A 3
1 B 2
2 C 3
3 D 2
Max1 Max2 Max3 Max4
0 D B A C
1 A C D B
2 D B C A
3 D A C B
4 C D A B
5 D A B C
6 A C C B
7 D B C A
8 B C A D
9 D A B C
0
0 D
1 A
2 D
3 A
4 C
5 A
6 C
7 B
8 B
9 C
I was able to generalize this to any number of products with:
cols = list(df2)
maxy = []
for i in range(len(df2)):
    for x in cols:
        if df1.loc[df1['product'] == df2[x][i], 'available'].values[0] > 0:
            maxy.append(df2[x][i])
            df1.loc[df1['product'] == df2[x][i], 'available'] -= 1
            break
final = pd.DataFrame(maxy)
print(final)
Thanks
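For reference, a minimal sketch of the same greedy pick that keeps the availability in a plain dict instead of repeatedly boolean-indexing df1 (assuming df1 and df2 are named as above):

# availability as a dict for O(1) lookups and updates
avail = dict(zip(df1['product'], df1['available']))
picks = []
for _, row in df2.iterrows():
    for prod in row:  # Max1, Max2, ... in priority order
        if avail.get(prod, 0) > 0:
            avail[prod] -= 1
            picks.append(prod)
            break
final = pd.DataFrame(picks)
print(final)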
Say I have a row of column headers, and associated values in a Pandas Dataframe:
print(df)
A B C D E F G H I J K
1 2 3 4 5 6 7 8 9 10 11
how do I go about displaying them like the following:
print(df)
A B C D E
1 2 3 4 5
F G H I J
6 7 8 9 10
K
11
A custom function:
import numpy as np

def new_repr(self):
    # split the columns into chunks of 5 and print each chunk separately
    g = self.groupby(np.arange(self.shape[1]) // 5, axis=1)
    return '\n\n'.join([d.to_string() for _, d in g])
print(new_repr(df))
A B C D E
0 1 2 3 4 5
F G H I J
0 6 7 8 9 10
K
0 11
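If you want every DataFrame to print this way by default, one option (a quick monkey-patch hack, sketched here on the assumption that print falls back to __repr__, as it does in recent pandas versions) is:

pd.DataFrame.__repr__ = new_repr  # print(df) now uses the chunked layout
print(df)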
Or limit the display width (an artificially small 20 characters here) so pandas wraps the frame itself:
pd.set_option('display.width', 20)
pd.set_option('display.expand_frame_repr', True)
df
A B C D E \
0 1 2 3 4 5
F G H I J \
0 6 7 8 9 10
K
0 11
I'm trying to create a reusable function in Python 2.7 (pandas) to form categorical bins, i.e. to group low-frequency categories as 'other'. Can someone help me turn the code below into a function? col1, col2, etc. are different categorical columns.
##Reducing categories by binning categorical variables - column1
a = df.col1.value_counts()
#get top 5 values of index
vals = a[:5].index
df['col1_new'] = df.col1.where(df.col1.isin(vals), 'other')
df = df.drop(['col1'],axis=1)
##Reducing categories by binning categorical variables - column2
a = df.col2.value_counts()
#get top 6 values of index
vals = a[:6].index
df['col2_new'] = df.col2.where(df.col2.isin(vals), 'other')
df = df.drop(['col2'],axis=1)
You can use:
df = pd.DataFrame({'A':list('abcdefabcdefabffeg'),
                   'D':[1,3,5,7,1,0,1,3,5,7,1,0,1,3,5,7,1,0]})
print (df)
A D
0 a 1
1 b 3
2 c 5
3 d 7
4 e 1
5 f 0
6 a 1
7 b 3
8 c 5
9 d 7
10 e 1
11 f 0
12 a 1
13 b 3
14 f 5
15 f 7
16 e 1
17 g 0
def replace_under_top(df, c, n):
    a = df[c].value_counts()
    # get top n values of index
    vals = a[:n].index
    # assign column back
    df[c] = df[c].where(df[c].isin(vals), 'other')
    # rename processed column
    df = df.rename(columns={c: c + '_new'})
    return df
Test:
df1 = replace_under_top(df, 'A', 3)
print (df1)
A_new D
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f 0
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f 0
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other 0
df2 = replace_under_top(df, 'D', 4)
print (df2)
A D_new
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f other
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f other
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other other
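A variant of the same idea (a sketch, not from the original answer) keeps categories by relative frequency instead of a fixed top n:

def replace_under_share(df, c, min_share):
    # keep categories covering at least min_share of the rows
    share = df[c].value_counts(normalize=True)
    vals = share[share >= min_share].index
    df[c] = df[c].where(df[c].isin(vals), 'other')
    return df.rename(columns={c: c + '_new'})

df3 = replace_under_share(df, 'A', 0.10)  # keep categories with >= 10% share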
I would like to transform this dataframe:
import pandas as pd

df = pd.DataFrame.from_dict({'a': [13, 'F', 'RD', 0, 0, 1, 0, 1],
                             'b': [45, 'M', 'RD', 1, 1, 0, 1, 0],
                             'c': [67, 'F', 'AN', 0, 0, 1, 0, 1],
                             'd': [23, 'M', 'AN', 1, 0, 0, 1, 1]},
                            orient='index',
                            columns=['AGE', 'SEX', 'REG', 'A', 'B', 'C', 'D', 'E'])
print(df)
AGE SEX REG A B C D E
a 13 F RD 0 0 1 0 1
b 45 M RD 1 1 0 1 0
c 67 F AN 0 0 1 0 1
d 23 M AN 1 0 0 1 1
To be transformed into:
AGE SEX REG PRODUCT PA
a 13 F RD A 0
a 13 F RD B 0
a 13 F RD C 1
a 13 F RD D 0
a 13 F RD E 1
b 45 M RD A 1
b 45 M RD B 1
b 45 M RD C 0
b 45 M RD D 1
b 45 M RD E 0
c 67 F AN A 0
c 67 F AN B 0
c 67 F AN C 1
c 67 F AN D 0
c 67 F AN E 1
d 23 M AN A 1
d 23 M AN B 0
d 23 M AN C 0
d 23 M AN D 1
d 23 M AN E 1
So basically, repeat each product (A, B, C, D, E) for each user (a, b, c, d) and attach the value for each user/product pair. The original table has thousands of rows.
You can use set_index with stack, then reset_index, and last rename the column to PRODUCT:
print (df.set_index(['AGE','SEX','REG'])
         .stack()
         .reset_index(name='PA')
         .rename(columns={'level_3':'PRODUCT'}))
AGE SEX REG PRODUCT PA
0 13 F RD A 0
1 13 F RD B 0
2 13 F RD C 1
3 13 F RD D 0
4 13 F RD E 1
5 45 M RD A 1
6 45 M RD B 1
7 45 M RD C 0
8 45 M RD D 1
9 45 M RD E 0
10 67 F AN A 0
11 67 F AN B 0
12 67 F AN C 1
13 67 F AN D 0
14 67 F AN E 1
15 23 M AN A 1
16 23 M AN B 0
17 23 M AN C 0
18 23 M AN D 1
19 23 M AN E 1
Or, if you need to keep the original index (a, b, c, d), append it when setting the index:
print (df.set_index(['AGE','SEX','REG'], append=True)
         .stack()
         .reset_index([1,2,3,4], name='PA')
         .rename(columns={'level_4':'PRODUCT'}))
AGE SEX REG PRODUCT PA
a 13 F RD A 0
a 13 F RD B 0
a 13 F RD C 1
a 13 F RD D 0
a 13 F RD E 1
b 45 M RD A 1
b 45 M RD B 1
b 45 M RD C 0
b 45 M RD D 1
b 45 M RD E 0
c 67 F AN A 0
c 67 F AN B 0
c 67 F AN C 1
c 67 F AN D 0
c 67 F AN E 1
d 23 M AN A 1
d 23 M AN B 0
d 23 M AN C 0
d 23 M AN D 1
d 23 M AN E 1
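An equivalent with melt (a sketch; melt orders the result by product rather than by user, so a stable sort is added to restore the per-user grouping):

out = (df.rename_axis('USER')
         .reset_index()
         .melt(id_vars=['USER', 'AGE', 'SEX', 'REG'],
               var_name='PRODUCT', value_name='PA')
         .sort_values('USER', kind='stable')  # regroup rows per user
         .set_index('USER'))
print(out)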