Formatting DataFrames - python

What's the agreed upon pythonic way to format columns in a DataFrame, while maintaining the original data?
For example, I have a large DataFrame which contains floats. For display purposes only, I would like to format some columns as percents, some as dollars, and some others as numbers rounded to the hundredths place. The remainder would be unchanged. The original data would be preserved and only the display would be affected. The solution would start with Raw df and return the Formatted df below.
Raw df:
index percent dollars rounded2 float
0 0.524 0.787 1.202 0.133
1 0.166 0.291 0.208 0.483
2 0.815 0.319 0.205 1.350
3 0.421 0.634 1.380 1.352
4 1.144 0.790 0.279 0.636
5 0.215 0.258 0.895 0.949
6 0.796 0.834 0.809 1.194
7 0.920 0.176 0.589 1.036
8 1.012 0.790 1.224 1.279
9 1.231 1.175 1.232 0.496
10 0.494 1.319 0.912 0.088
11 0.400 0.291 0.491 1.041
Formatted df:
index percent dollars rounded2 float
0 52.4% $0.79 1.20 0.133
1 16.6% $0.29 0.21 0.483
2 81.5% $0.32 0.20 1.350
3 42.1% $0.63 1.38 1.352
4 114.4% $0.79 0.28 0.636
5 21.5% $0.26 0.90 0.949
6 79.6% $0.83 0.81 1.194
7 92.0% $0.18 0.59 1.036
8 101.2% $0.79 1.22 1.279
9 123.1% $1.17 1.23 0.496
10 49.4% $1.32 0.91 0.088
11 40.0% $0.29 0.49 1.041
This seems to be pretty routine, but the available solutions for similar tasks are neither simple nor user friendly. I'd appreciate anyone who can provide a parsimonious method.
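One common approach (a sketch, not necessarily the canonical answer) is to leave df untouched and build a display-only view with the pandas Styler, passing a dict of per-column format strings; df below is assumed to be the raw frame above:
# df keeps the raw floats; the Styler only changes how the table is rendered (e.g. in a notebook)
formatted = df.style.format({
    'percent':  '{:.1%}',   # 0.524 -> 52.4%
    'dollars':  '${:.2f}',  # 0.787 -> $0.79
    'rounded2': '{:.2f}',   # 1.202 -> 1.20
})
formatted  # display this object; df itself is unchanged
For plain-text output, the same format strings can instead be mapped onto a copy, e.g. df_display = df.copy() and df_display['percent'] = df['percent'].map('{:.1%}'.format).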

Related

python - a simple switch statement IS too much to ask for... trying to pull conditional values - newbie needs direction

I have a situation where I am looking to set unit and extended pricing on a quantity of components on a BOM. There are 1x, 10x, and 100+ prices from a vendor, as in the example below; the kit quantity drives the unit pricing. I have the spreadsheet in a df but am having an awful time pulling the correct value (1x, 10x, 100x) into the unit price field.
1X 10X 100X Qty Kit_Qty unit ext
0.1 0.062 0.0276 1 7
0.11 0.08 0.0376 1 7
0.1 0.062 0.0276 15 105
0.16 0.117 0.065 15 105
0.1 0.035 0.0158 3 21
0.1 0.055 0.0243 3 21
The above example has 2 items with kit quantity 7, so the 1x values should be pulled into the unit price field. The next two have 105, so the 100x price should be selected, and the last two have 21, so the 10x price. I've generated boolean maps, etc., but I can't seem to map the values to the outputs with the correct conditionals. Any pointers would be appreciated.
You can do this by using boolean expressions as integers. True is 1, False is 0.
import pandas as pd
data = [
    [0.1, 0.062, 0.0276, 1, 7],
    [0.11, 0.08, 0.0376, 1, 7],
    [0.1, 0.062, 0.0276, 15, 105],
    [0.16, 0.117, 0.065, 15, 105],
    [0.1, 0.035, 0.0158, 3, 21],
    [0.1, 0.055, 0.0243, 3, 21],
]
cols = "1X 10X 100X Qty Kit_Qty".split()
df = pd.DataFrame(data, columns=cols)
print(df)
# Each comparison is a boolean Series (True == 1, False == 0), so exactly one
# of the three terms is non-zero per row and the sum picks the matching tier.
df['unit'] = (
    df['1X'] * (df['Kit_Qty'] < 10)
    + df['10X'] * (df['Kit_Qty'] >= 10) * (df['Kit_Qty'] < 100)
    + df['100X'] * (df['Kit_Qty'] >= 100)
)
print(df)
df['ext'] = df['Qty'] * df['unit']
print(df)
Output:
1X 10X 100X Qty Kit_Qty
0 0.10 0.062 0.0276 1 7
1 0.11 0.080 0.0376 1 7
2 0.10 0.062 0.0276 15 105
3 0.16 0.117 0.0650 15 105
4 0.10 0.035 0.0158 3 21
5 0.10 0.055 0.0243 3 21
1X 10X 100X Qty Kit_Qty unit
0 0.10 0.062 0.0276 1 7 0.1000
1 0.11 0.080 0.0376 1 7 0.1100
2 0.10 0.062 0.0276 15 105 0.0276
3 0.16 0.117 0.0650 15 105 0.0650
4 0.10 0.035 0.0158 3 21 0.0350
5 0.10 0.055 0.0243 3 21 0.0550
1X 10X 100X Qty Kit_Qty unit ext
0 0.10 0.062 0.0276 1 7 0.1000 0.100
1 0.11 0.080 0.0376 1 7 0.1100 0.110
2 0.10 0.062 0.0276 15 105 0.0276 0.414
3 0.16 0.117 0.0650 15 105 0.0650 0.975
4 0.10 0.035 0.0158 3 21 0.0350 0.105
5 0.10 0.055 0.0243 3 21 0.0550 0.165
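An equivalent spelling that some find more readable uses numpy.select to pick the price tier explicitly; this is a sketch against the same df built above:
import numpy as np
conditions = [df['Kit_Qty'] < 10,
              (df['Kit_Qty'] >= 10) & (df['Kit_Qty'] < 100),
              df['Kit_Qty'] >= 100]
choices = [df['1X'], df['10X'], df['100X']]
# np.select takes, row by row, the first choice whose condition is True
df['unit'] = np.select(conditions, choices)
df['ext'] = df['Qty'] * df['unit']
This produces the same unit and ext columns as the boolean-arithmetic version.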

Supervised learning for time series data

I have the following time series data. I want to use a classification model. For the independent variable, I want to pass an array of the previous 5 values of feature1/feature2, each given some weight. For example, on 06-03-2015 for ID 1: [a1 a2 a3 a4 a5] [0.053 0.036 0.044 0.087 0.02]
ID feature1 Date feature2
1 0.053 02-03-2015 0.0115
1 0.05 08-03-2015 0.0117
1 0.099 09-03-2015 0.00355
1 0.006 10-03-2015 0.0088
1 0.007 11-03-2015 0.0968
1 0.0045 12-03-2015 0.08325
1 0.068 13-03-2015 0.0055
1 0.097 14-03-2015 0.0668
1 0.082 18-03-2015 0.0635
2 0.053 21-03-2015 0.0115
2 0.05 26-03-2015 0.0117
2 0.099 27-03-2015 0.00355
2 0.006 28-03-2015 0.0088
2 0.007 29-03-2015 0.0968
2 0.068 31-03-2015 0.0055
2 0.097 01-04-2015 0.0668
2 0.017 02-04-2015 0.0145
2 0.049 06-04-2015 0.0556
How would I assign weights to the values on a rolling basis with window = 5? The weights can be between 0 and 1, so I can multiply them by the values, and the result would go in as one of the independent variables. How can I use an LSTM model for this kind of data?
This article on Machine Learning Mastery walks you through how to do that.
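The link aside, a minimal pandas sketch of the weighted rolling feature itself might look like the following; the weight vector w and the new column names are illustrative placeholders, and df is assumed to hold the table above:
import numpy as np
import pandas as pd
w = np.array([0.1, 0.15, 0.2, 0.25, 0.3])   # illustrative weights in [0, 1]
def weighted_window(s):
    # weighted mean over each 5-row window; the window includes the current
    # row, so add .shift(1) first if strictly prior values are wanted
    return s.rolling(window=5).apply(lambda x: np.average(x, weights=w), raw=True)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.sort_values(['ID', 'Date'])
df['feature1_w5'] = df.groupby('ID')['feature1'].transform(weighted_window)
df['feature2_w5'] = df.groupby('ID')['feature2'].transform(weighted_window)
The resulting columns can then be fed to a classifier, or reshaped into fixed-length sequences for an LSTM.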

Pandas compute average for two consecutive rows and save result in two cells

I have the following data:
INPUT
ID A
1 0.040
2 0.086
3 0.127
4 0.173
5 0.141
6 0.047
7 0.068
8 0.038
I want to create column B, where each pair of rows in B shares the average of the corresponding rows in A, as follows:
OUTPUT
ID A B
1 0.040 0.063
2 0.086 0.063
3 0.127 0.150
4 0.173 0.150
5 0.141 0.094
6 0.047 0.094
7 0.068 0.053
8 0.038 0.053
I tried this code:
df["B"] = (df['A'] + df['A'].shift(-1)) / 2
I got the average, but I can't make it repeat across each pair of rows.
You can do it this way (with numpy imported as np):
In [10]: df['B'] = df.groupby(np.arange(len(df)) // 2)['A'].transform('mean')
In [11]: df
Out[11]:
ID A B
0 1 0.040 0.063
1 2 0.086 0.063
2 3 0.127 0.150
3 4 0.173 0.150
4 5 0.141 0.094
5 6 0.047 0.094
6 7 0.068 0.053
7 8 0.038 0.053
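For reference, a self-contained version of the same idea:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': range(1, 9),
                   'A': [0.040, 0.086, 0.127, 0.173, 0.141, 0.047, 0.068, 0.038]})
# Integer-dividing the row position by 2 puts rows (0,1), (2,3), ... in the same
# group; transform('mean') broadcasts each group's mean back onto both rows.
df['B'] = df.groupby(np.arange(len(df)) // 2)['A'].transform('mean')
print(df.round(3))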

Pandas reshaping functions

To add to the many excellent examples of this, I'm trying to reshape my data into the format I want.
I currently have data indexed by customer, purchase category and date, with observations for each intra-day time period across the columns:
I want to aggregate across purchase categories, and reshape so that my data is indexed by date and time, while customers appear across the columns.
What's the simplest way to achieve this?
In text form, the original data looks like this:
Customer Purchase Category date 00:30 01:00 01:30
1 A 01/07/2012 1.25 1.25 1.25
1 B 01/07/2012 0.855 0.786 0.604
1 C 01/07/2012 0 0 0
1 A 02/07/2012 1.25 1.25 1.125
1 B 02/07/2012 0.309 0.082 0.059
1 C 02/07/2012 0 0 0
2 A 01/07/2012 0 0 0
2 B 01/07/2012 0.167 0.108 0.119
2 C 01/07/2012 0 0 0
2 A 02/07/2012 0 0 0
2 B 02/07/2012 0.11 0.109 0.123
I think you need groupby with sum aggregation, then reshape with stack and unstack. Finally, pop the level_1 column, append it to date, and convert with to_datetime:
print (df)
Customer Purchase Category date 00:30 01:00 01:30
0 1 A 01/07/2012 1.250 1.250 1.250
1 1 B 01/07/2012 0.855 0.786 0.604
2 1 C 01/07/2012 0.000 0.000 0.000
3 1 A 02/07/2012 1.250 1.250 1.125
4 1 B 02/07/2012 0.309 0.082 0.059
5 1 C 02/07/2012 0.000 0.000 0.000
6 2 A 01/07/2012 0.000 0.000 0.000
7 2 B 01/07/2012 0.167 0.108 0.119
8 2 C 01/07/2012 0.000 0.000 0.000
9 2 A 02/07/2012 0.000 0.000 0.000
10 2 B 02/07/2012 0.110 0.109 0.123
df1 = (df.groupby(['Customer','date'])
         .sum(numeric_only=True)   # numeric_only keeps the Purchase Category strings out of the sum
         .stack().unstack(0).reset_index())
df1.date = pd.to_datetime(df1.date + df1.pop('level_1'), format='%d/%m/%Y%H:%M')
print (df1)
Customer date 1 2
0 2012-07-01 00:30:00 2.105 0.167
1 2012-07-01 01:00:00 2.036 0.108
2 2012-07-01 01:30:00 1.854 0.119
3 2012-07-02 00:30:00 1.559 0.110
4 2012-07-02 01:00:00 1.332 0.109
5 2012-07-02 01:30:00 1.184 0.123
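An alternative sketch that avoids the level_1 bookkeeping melts the time columns into rows first and then pivots; the column names are assumed to match the frame above:
long = df.melt(id_vars=['Customer', 'Purchase Category', 'date'],
               var_name='time', value_name='value')
long['datetime'] = pd.to_datetime(long['date'] + ' ' + long['time'],
                                  format='%d/%m/%Y %H:%M')
df2 = long.pivot_table(index='datetime', columns='Customer',
                       values='value', aggfunc='sum').reset_index()
pivot_table sums across Purchase Category automatically, because that column appears in neither the index nor the columns.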

How to sort a boxplot by the median values in pandas

I've got a dataframe outcome2 that I generate a grouped boxplot with in the following manner:
In [11]: outcome2.boxplot(column='Hospital 30-Day Death (Mortality) Rates from Heart Attack',by='State')
plt.ylabel('30 Day Death Rate')
plt.title('30 Day Death Rate by State')
Out [11] renders the grouped boxplot, with the states ordered alphabetically along the x-axis.
What I'd like to do is sort the plot by the median for each state, instead of alphabetically. Not sure how to go about doing so.
To sort by the median, just compute the median, then sort it and use the resulting Index to slice the DataFrame:
In [45]: df.iloc[:10, :5]
Out[45]:
AK AL AR AZ CA
0 0.047 0.199 0.969 -0.205 1.053
1 0.206 0.132 -0.712 0.111 -0.254
2 0.638 0.233 -0.907 1.284 1.193
3 1.234 0.046 0.624 0.485 -0.048
4 -1.362 -0.559 1.108 -0.501 0.111
5 1.276 -0.954 0.653 -0.175 -0.287
6 0.524 -1.785 -0.887 1.354 -0.431
7 0.111 0.762 -0.514 0.808 0.728
8 1.301 0.619 0.957 1.542 -0.087
9 -0.892 2.327 1.363 -1.537 0.142
In [46]: med = df.median()
In [47]: med = med.sort_values()   # Series.sort() is gone from modern pandas; sort_values() returns a sorted copy
In [48]: newdf = df[med.index]
In [49]: newdf.iloc[:10, :5]
Out[49]:
PA CT LA RI MO
0 -0.667 0.774 -0.999 -0.938 0.155
1 0.822 0.390 -0.014 -2.228 0.570
2 -1.037 0.838 -0.673 2.038 0.809
3 0.620 2.845 -0.523 -0.151 -0.955
4 -0.918 1.043 0.613 0.698 -0.446
5 -0.767 0.869 -0.496 -0.925 -0.374
6 -0.495 0.437 1.245 -1.046 0.894
7 -1.283 0.358 0.016 0.137 0.511
8 -0.018 -0.047 -0.639 -0.385 0.080
9 -1.705 0.986 0.605 0.295 0.302
In [50]: med.head()
Out[50]:
PA -0.117
CT -0.077
LA -0.072
RI -0.069
MO -0.053
dtype: float64
The resulting figure then shows the boxplots ordered by their medians.
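For the grouped-by-State boxplot in the question, a rough adaptation of the same idea (assuming the long-format outcome2 frame and the column name used above) computes the per-state medians, sorts them, and plots the groups in that order:
import matplotlib.pyplot as plt
col = 'Hospital 30-Day Death (Mortality) Rates from Heart Attack'
# order the states by the median of the mortality-rate column
order = outcome2.groupby('State')[col].median().sort_values().index
data = [outcome2.loc[outcome2['State'] == s, col].dropna() for s in order]
plt.boxplot(data)
plt.xticks(range(1, len(order) + 1), order, rotation=90)
plt.ylabel('30 Day Death Rate')
plt.title('30 Day Death Rate by State')
plt.show()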
