stack columns in pandas DataFrame CSV - python

I have a dataframe that is the result of a pivot table that has columns:
(best buy, count) 753 non-null values
(best buy, mean) 753 non-null values
(best buy, min) 753 non-null values
(best buy, max) 753 non-null values
(best buy, std) 750 non-null values
(amazon, count) 662 non-null values
(amazon, mean) 662 non-null values
(amazon, min) 662 non-null values
(amazon, max) 662 non-null values
(amazon, std) 661 non-null values
If I send this to a csv file, I end up with something that looks like this (truncated):
(best buy, count) (best buy, mean) (best buy, max)
laptop 5 10 12
tv 10 23 34
and so on and so forth.
Is there a way for me to manipulate the dataframe so that the resulting csv instead looks like the below?
best buy best buy best buy
count mean max
laptop 5 10 12
tv 10 23 34

You can pass tupleize_cols=False to DataFrame.to_csv():
In [59]: import numpy as np; from numpy.random import poisson; from pandas import DataFrame
In [60]: df = DataFrame(poisson(50, size=(10, 2)), columns=['laptop', 'tv'])
In [61]: df
Out[61]:
laptop tv
0 48 57
1 48 45
2 48 49
3 61 47
4 49 47
5 45 65
6 49 40
7 58 39
8 46 65
9 43 53
In [62]: df['store'] = np.random.choice(['best_buy', 'amazon'], len(df))
In [63]: df
Out[63]:
laptop tv store
0 48 57 best_buy
1 48 45 best_buy
2 48 49 best_buy
3 61 47 best_buy
4 49 47 amazon
5 45 65 amazon
6 49 40 amazon
7 58 39 best_buy
8 46 65 amazon
9 43 53 best_buy
In [64]: res = df.groupby('store').agg(['mean', 'std', 'min', 'max']).T
In [65]: res
Out[65]:
store amazon best_buy
laptop mean 47.250 51.000
std 2.062 6.928
min 45.000 43.000
max 49.000 61.000
tv mean 54.250 48.333
std 12.738 6.282
min 40.000 39.000
max 65.000 57.000
In [66]: u = res.unstack()
In [67]: u
Out[67]:
store amazon best_buy
mean std min max mean std min max
laptop 47.25 2.062 45 49 51.000 6.928 43 61
tv 54.25 12.738 40 65 48.333 6.282 39 57
In [68]: u.to_csv('the_csv.csv', tupleize_cols=False, sep='\t')
In [69]: cat the_csv.csv
store amazon amazon amazon amazon best_buy best_buy best_buy best_buy
mean std min max mean std min max
laptop 47.25 2.0615528128088303 45.0 49.0 51.0 6.928203230275509 43.0 61.0
tv 54.25 12.737739202856996 40.0 65.0 48.333333333333336 6.282250127674532 39.0 57.0
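Note: tupleize_cols was deprecated and has been removed in modern pandas (1.0+), where to_csv writes MultiIndex columns as stacked header rows by default. So on recent versions the plain call below should give the same layout (a sketch, assuming the u frame from above):
u.to_csv('the_csv.csv', sep='\t')  # pandas >= 1.0: stacked header rows by default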

Related

Efficient way to calculate every column in pandas?

I have the following DataFrame:
1-A-873 2-A-129 3-A-123
12/12/20 45 32 41
13/12/20 94 56 87
14/12/20 12 42 84
15/12/20 73 24 25
Each column represents a piece of equipment. Each piece of equipment has a size that is declared in the code:
size_1A = 5
size_2A = 3
size_3A = 7
Every column needs to be divided by its equipment size, i.e. (value / size).
This is what I am using:
df["1A-NewValue"] = df["1-A-873"] / 1A
df["2A-NewValue"] = df["2-A-129"] / 2A
df["3A-NewValue"] = df["3-A-123"] / 3A
End result:
1-A-873 2-A-129 3-A-123 1A-NewValue 2A-NewValue 3A-NewValue
12/12/20 45 32 41 9 10.67 5.86
13/12/20 94 56 87 18.8 18.67 12.43
14/12/20 12 42 84 2.4 14 12
15/12/20 73 24 25 14.6 8 3.57
This works perfectly and does what I want, adding three extra columns at the end of the DataFrame.
However, this will become tedious later on if my total number of equipment pieces increases to 250 instead of 3; I would need 250 lines for the equipment sizes and 250 lines for the formulas.
Naturally the first thing that comes to mind is a for loop, but is there a more pandas-like way of doing this efficiently?
Thanks!
You can create a dictionary, rename the column names by splitting on - and joining the first two parts so they match the dictionary keys, and then divide:
d = {'1A': 5, '2A':3, '3A':7}
f = lambda x: ''.join(x.split('-')[:2])
df = df.join(df.rename(columns=f).div(d).add_suffix(' NewValue'))
print (df)
1-A-873 2-A-129 3-A-123 1A NewValue 2A NewValue 3A NewValue
12/12/20 45 32 41 9.0 10.666667 5.857143
13/12/20 94 56 87 18.8 18.666667 12.428571
14/12/20 12 42 84 2.4 14.000000 12.000000
15/12/20 73 24 25 14.6 8.000000 3.571429
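For completeness, a minimal self-contained sketch of the same idea, with the data taken from the question:
import pandas as pd

df = pd.DataFrame(
    {'1-A-873': [45, 94, 12, 73],
     '2-A-129': [32, 56, 42, 24],
     '3-A-123': [41, 87, 84, 25]},
    index=['12/12/20', '13/12/20', '14/12/20', '15/12/20'])

d = {'1A': 5, '2A': 3, '3A': 7}  # equipment sizes, keyed by short name

# '1-A-873' -> '1A', so the renamed columns line up with the dictionary keys
f = lambda x: ''.join(x.split('-')[:2])

# div() aligns the dict on column labels, dividing each column by its size;
# with 250 pieces of equipment only the dictionary grows, not the code
df = df.join(df.rename(columns=f).div(d).add_suffix(' NewValue'))
print(df)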

Pandas: Calculate the percentage of multiple columns, saving the result in new columns - best way

My aim is to get the percentage that each of several columns represents of another column (the divisor). The resulting columns should be kept in the same dataframe.
A B Divisor
2000 8 31 166
2001 39 64 108
2002 68 8 142
2003 28 2 130
2004 55 61 150
result:
A B Divisor perc_A perc_B
2000 8 31 166 4.8 18.7
2001 39 64 108 36.1 59.3
2002 68 8 142 47.9 5.6
2003 28 2 130 21.5 1.5
2004 55 61 150 36.7 40.7
My solution:
def percentage(divisor, columns, heading, dframe):
    for col in columns:
        heading_new = str(heading + col)
        dframe[heading_new] = (dframe.loc[:, col] / dframe.loc[:, divisor]) * 100
    return dframe

df_new = percentage("Divisor", df.columns.values[:2], "perc_", df)
The solution above works, but is there a more efficient way to get the result?
(I know there are already similar questions, but I couldn't find one where the results are saved in the same dataframe without losing the original columns.)
Thanks
Use DataFrame.join to add the new columns: select the first 2 columns with DataFrame.iloc, divide them by the divisor column with DataFrame.div, multiply by 100, and prefix the names with DataFrame.add_prefix:
df = df.join(df.iloc[:, :2].div(df['Divisor'], axis=0).mul(100).add_prefix('perc_'))
print (df)
A B Divisor perc_A perc_B
2000 8 31 166 4.819277 18.674699
2001 39 64 108 36.111111 59.259259
2002 68 8 142 47.887324 5.633803
2003 28 2 130 21.538462 1.538462
2004 55 61 150 36.666667 40.666667
Your function should be changed:
def percentage(divisor, columns, heading, dframe):
    return dframe.join(dframe[columns].div(dframe[divisor], axis=0).mul(100).add_prefix(heading))

df_new = percentage("Divisor", df.columns.values[:2], "perc_", df)
You can reshape the divisor:
df[['perc_A', 'perc_B']] = df[['A', 'B']] / df['Divisor'].values[:,None] * 100
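The [:, None] reshapes the divisor into a column vector so that NumPy broadcasting divides row by row; the same alignment can be expressed in pandas alone with axis=0:
df[['perc_A', 'perc_B']] = df[['A', 'B']].div(df['Divisor'], axis=0) * 100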

Preserving multiindex column structure after performing a groupby summation

I have a DataFrame with a three-level multiindex on the columns. At the lowest level, I want to add a subtotal column.
So in the example here, I would expect a new column zone: day, person: dave, find: 'subtotal' with value = 49+27+63 = 139, and similarly for all the other combinations of zone and person.
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['day', 'night'], ['dave', 'matt', 'mike'], ['gems', 'rocks', 'paper']])
rows = pd.date_range(start='20191201', periods=5, freq="d")
data = np.random.randint(0, high=100,size=(len(rows), len(cols)))
xf = pd.DataFrame(data, index=rows, columns=cols)
xf.columns.names = ['zone', 'person', 'find']
I can generate the correct subtotal data with xf.groupby(level=[0,1], axis="columns").sum(), but then I lose the find level of the columns; only the zone and person levels remain. I need that third level, named subtotal, so that I can join the result back with the original xf dataframe. But I cannot figure out a nice pythonic way to add a third level back into the multiindex.
You can use sum first and then rebuild the columns with MultiIndex.from_product, adding the new level:
df = xf.sum(level=[0,1], axis="columns")
df.columns = pd.MultiIndex.from_product(df.columns.levels + [['subtotal']])
print (df)
day night
dave matt mike dave matt mike
subtotal subtotal subtotal subtotal subtotal subtotal
2019-12-01 85 99 163 210 93 252
2019-12-02 38 113 101 211 110 135
2019-12-03 145 75 122 181 165 176
2019-12-04 220 184 173 179 134 192
2019-12-05 126 77 29 184 178 199
Then join everything together with concat and sort with DataFrame.sort_index:
df = pd.concat([xf, df], axis=1).sort_index(axis=1)
print (df)
zone day \
person dave matt mike
find gems paper rocks subtotal gems paper rocks subtotal gems paper
2019-12-01 33 96 24 153 34 89 90 213 15 51
2019-12-02 74 48 61 183 94 83 2 179 75 4
2019-12-03 88 85 51 224 65 3 52 120 95 80
2019-12-04 43 28 60 131 43 14 77 134 88 54
2019-12-05 41 72 44 157 63 77 37 177 8 66
zone ... night \
person ... dave matt mike
find ... rocks subtotal gems paper rocks subtotal gems paper rocks
2019-12-01 ... 24 102 19 49 4 72 43 57 92
2019-12-02 ... 90 206 96 55 92 243 75 58 68
2019-12-03 ... 29 182 11 90 85 186 9 20 46
2019-12-04 ... 30 84 25 55 89 169 98 41 85
2019-12-05 ... 73 167 52 90 49 191 51 80 37
zone
person
find subtotal
2019-12-01 192
2019-12-02 201
2019-12-03 75
2019-12-04 224
2019-12-05 168
[5 rows x 24 columns]
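Note for newer pandas: sum(level=...) was deprecated and later removed (pandas 2.0), and grouping with axis="columns" is deprecated as well. A hedged equivalent, assuming the same xf, is to transpose, group on the column levels, and transpose back:
# group on the zone/person column levels, summing away the find level
sub = xf.T.groupby(level=['zone', 'person']).sum().T
sub.columns = pd.MultiIndex.from_product(sub.columns.levels + [['subtotal']])
df = pd.concat([xf, sub], axis=1).sort_index(axis=1)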

Pandas filtering on max range

I'm working on a text mining problem and using Pandas for text processing. From the following example I need to pick only those rows which have the max span (end - start) within the same category (cat).
Given this dataframe:
name start end cat
0 coumadin 0 8 DRUG
1 albuterol 18 27 DRUG
2 albuterol sulfate 18 35 DRUG
3 sulfate 28 35 DRUG
4 2.5 36 39 STRENGTH
5 2.5 mg 36 42 STRENGTH
6 2.5 mg /3 ml 36 48 STRENGTH
7 0.083 50 55 STRENGTH
8 0.083 % 50 57 STRENGTH
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
10 solution 59 67 FORM
11 solution for nebulization 59 84 FORM
12 nebulization 72 84 ROUTE
13 one (1) 90 97 FREQUENCY
14 neb 98 101 ROUTE
15 neb inhalation 98 112 ROUTE
16 inhalation 102 112 ROUTE
17 q4h 113 116 FREQUENCY
18 every 118 123 FREQUENCY
19 every 4 hours 118 131 FREQUENCY
20 q4h (every 4 hours) 113 132 FREQUENCY
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
22 as needed 133 142 FREQUENCY
23 dyspnea 147 154 REASON
I need to get the following:
name start end cat
0 coumadin 0 8 DRUG
2 albuterol sulfate 18 35 DRUG
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
11 solution for nebulization 59 84 FORM
13 one (1) 90 97 FREQUENCY
15 neb inhalation 98 112 ROUTE
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
23 dyspnea 147 154 REASON
What I tried is to group by the category and then compute the max difference (end - start). However, I got stuck on how to find the max span for the same entity within the category. I guess it should not be very tricky.
COMMENT
Thank you all for the suggestions, but I need ALL possible entities within each category. For example, in DRUG there are two relevant drugs, coumadin and albuterol sulfate, and some fragments of them (albuterol and sulfate). I need to remove only albuterol and sulfate while keeping coumadin and albuterol sulfate. The same logic applies to the other categories.
For example, rows 4-8 are all bits of a complete row 9, thus I need to keep only row 9. Rows 1 and 3 are parts of the row 2, thus I need to keep row 2 (in addition to row 0). Etc.
Obviously, all constituents ('bits') are within the max range, but the problem is to find the max (or unifying) range of the same entity and its constituents.
COMMENT 2
A possible solution could be to find all overlapping intervals within the same category cat and pick the largest. I'm trying to implement this, but no luck so far.
Possible Solution
I sorted by the start column ascending and the end column descending:
df.sort_values(by=[1,2], ascending=[True, False])
0 1 2 3
0 coumadin 0 8 DRUG
2 albuterol sulfate 18 35 DRUG
1 albuterol 18 27 DRUG
3 sulfate 28 35 DRUG
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
6 2.5 mg /3 ml 36 48 STRENGTH
5 2.5 mg 36 42 STRENGTH
4 2.5 36 39 STRENGTH
8 0.083 % 50 57 STRENGTH
7 0.083 50 55 STRENGTH
11 solution for nebulization 59 84 FORM
10 solution 59 67 FORM
12 nebulization 72 84 ROUTE
13 one (1) 90 97 FREQUENCY
15 neb inhalation 98 112 ROUTE
14 neb 98 101 ROUTE
16 inhalation 102 112 ROUTE
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
20 q4h (every 4 hours) 113 132 FREQUENCY
17 q4h 113 116 FREQUENCY
19 every 4 hours 118 131 FREQUENCY
18 every 118 123 FREQUENCY
22 as needed 133 142 FREQUENCY
23 dyspnea 147 154 REASON
This puts the relevant row first in each group; however, I still need to filter out the irrelevant rows...
I have tried this on a sample of your df:
Create a sample df:
import pandas as pd
Name = ['coumadin', 'albuterol', 'albuterol sulfate', 'sulfate']
Cat = ['D', 'D', 'D', 'D']
Start = [0, 18, 18, 28]
End = [8, 27, 33, 35]
ID = [1, 2, 3, 4]
df = pd.DataFrame(data=list(zip(ID, Name, Start, End, Cat)),
                  columns=['ID', 'Name', 'Start', 'End', 'Cat'])
Make a function which will help identify similar names:
def matcher(x):
    res = df.loc[df['Name'].str.contains(x, regex=False, case=False), 'ID']
    return ','.join(res.astype(str))
Apply this function to the values of the Name column:
df['Matches'] = df['Name'].apply(matcher)  # Matches holds the IDs of all rows whose Name contains this row's Name
ID Name Start End Cat Matches
0 1 coumadin 0 8 D 1
1 2 albuterol 18 27 D 2,3
2 3 albuterol sulfate 18 33 D 3
3 4 sulfate 28 35 D 3,4
Count the number of IDs in Matches:
df['Count'] = df.Matches.apply(lambda x: len(x.split(',')))
Keep the rows where "Count" is 1, as these are the names that are not contained in any other name:
df = df[df.Count == 1]
ID Name Start End Cat Matches Count
0 1 coumadin 0 8 D 1 1
2 3 albuterol sulfate 18 33 D 3 1
You can then remove unnecessary columns :)
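The interval idea from COMMENT 2 can also be implemented directly: sort by start, sweep through the rows, merge runs of overlapping spans, and keep only the widest span in each run. A hedged sketch, assuming the question's df with columns name/start/end/cat (keep_longest is a hypothetical helper, not from the answers above); on the sample data overlapping runs never cross categories, so a plain sweep reproduces the desired output, and a per-category version would simply wrap the same loop in df.groupby('cat'):
def keep_longest(df):
    # Sweep start-sorted rows; keep the widest span in each run
    # of mutually overlapping intervals.
    keep, best, cur_end = [], None, None
    for row in df.sort_values(['start', 'end']).itertuples():
        if best is None or row.start >= cur_end:
            # This row starts a new run: flush the previous winner.
            if best is not None:
                keep.append(best.Index)
            best, cur_end = row, row.end
        else:
            # This row overlaps the current run: extend the run and
            # promote the row if its span is wider.
            cur_end = max(cur_end, row.end)
            if row.end - row.start > best.end - best.start:
                best = row
    if best is not None:
        keep.append(best.Index)
    return df.loc[sorted(keep)]

result = keep_longest(df)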

Get maximum values relative to the current index in pandas python

Say I have a DataFrame where the data is ordered with respect to time. I have a weights column and I want to find the maximum weight relative to the current index. For example, the max value found for the 10th row would be taken from elements 11 to the end.
I ended up writing this function, but performance is a major concern.
import pandas as pd

df = pd.DataFrame({"time": [100, 200, 300, 400, 500, 600, 700, 800],
                   "weights": [120, 160, 190, 110, 34, 55, 66, 33]})
totalRows = df['time'].count()

def findMaximumValRelativeToCurrentRow(row):
    index = row.name
    if index != totalRows:
        tempDf = df[index:totalRows]
        val = tempDf['weights'].max()
        df.at[index, 'max'] = val  # set_value in the original; removed in pandas 1.0
    else:
        df.at[index, 'max'] = row['weights']

df.apply(findMaximumValRelativeToCurrentRow, axis=1)
print(df)
Is there any better way to do the operation than this?
You can use cummax with iloc for reverse order:
print (df['weights'].iloc[::-1])
7 33
6 66
5 55
4 34
3 110
2 190
1 160
0 120
Name: weights, dtype: int64
df['max1'] = df['weights'].iloc[::-1].cummax()
print (df)
time weights max max1
0 100 120 190.0 190
1 200 160 190.0 190
2 300 190 190.0 190
3 400 110 110.0 110
4 500 34 66.0 66
5 600 55 66.0 66
6 700 66 66.0 66
7 800 33 33.0 33
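Note that cummax here includes the current row, which matches the question's own code and output. If the maximum should instead cover only the rows strictly after the current one, shifting the reversed cumulative max by one position gives that (a sketch; max_after is a hypothetical column name, and the assignment aligns on the index, so no second reversal is needed):
df['max_after'] = df['weights'].iloc[::-1].cummax().shift(1)  # NaN for the last row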
