Pandas reshaping functions - python

To add to the many excellent examples of this, I'm trying to reshape my data into the format I want.
I currently have data indexed by customer, purchase category and date, with observations for each intra-day time period across the columns:
I want to aggregate by purchase category, and reshape so that my data is indexed by date and time, while customers appear across the columns.
What's the simplest way to achieve this?
In text form, the original data looks like this:
<table><tbody><tr><th>Customer</th><th>Purchase Category</th><th>date</th><th>00:30</th><th>01:00</th><th>01:30</th></tr><tr><td>1</td><td>A</td><td>01/07/2012</td><td>1.25</td><td>1.25</td><td>1.25</td></tr><tr><td>1</td><td>B</td><td>01/07/2012</td><td>0.855</td><td>0.786</td><td>0.604</td></tr><tr><td>1</td><td>C</td><td>01/07/2012</td><td>0</td><td>0</td><td>0</td></tr><tr><td>1</td><td>A</td><td>02/07/2012</td><td>1.25</td><td>1.25</td><td>1.125</td></tr><tr><td>1</td><td>B</td><td>02/07/2012</td><td>0.309</td><td>0.082</td><td>0.059</td></tr><tr><td>1</td><td>C</td><td>02/07/2012</td><td>0</td><td>0</td><td>0</td></tr><tr><td>2</td><td>A</td><td>01/07/2012</td><td>0</td><td>0</td><td>0</td></tr><tr><td>2</td><td>B</td><td>01/07/2012</td><td>0.167</td><td>0.108</td><td>0.119</td></tr><tr><td>2</td><td>C</td><td>01/07/2012</td><td>0</td><td>0</td><td>0</td></tr><tr><td>2</td><td>A</td><td>02/07/2012</td><td>0</td><td>0</td><td>0</td></tr><tr><td>2</td><td>B</td><td>02/07/2012</td><td>0.11</td><td>0.109</td><td>0.123</td></tr></tbody></table>

I think you need groupby with sum to aggregate over the purchase categories, then reshape with stack and unstack. Finally, pop the level_1 column (which holds the time labels), append it to date, and convert the result with to_datetime:
print (df)
Customer Purchase Category date 00:30 01:00 01:30
0 1 A 01/07/2012 1.250 1.250 1.250
1 1 B 01/07/2012 0.855 0.786 0.604
2 1 C 01/07/2012 0.000 0.000 0.000
3 1 A 02/07/2012 1.250 1.250 1.125
4 1 B 02/07/2012 0.309 0.082 0.059
5 1 C 02/07/2012 0.000 0.000 0.000
6 2 A 01/07/2012 0.000 0.000 0.000
7 2 B 01/07/2012 0.167 0.108 0.119
8 2 C 01/07/2012 0.000 0.000 0.000
9 2 A 02/07/2012 0.000 0.000 0.000
10 2 B 02/07/2012 0.110 0.109 0.123
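# Drop the non-numeric category column so that sum() only aggregates the time columns
df1 = df.drop(columns='Purchase Category').groupby(['Customer','date']).sum().stack().unstack(0).reset_index()
# Join each date with its time label (popped from 'level_1') and parse the result as a timestamp
df1.date = pd.to_datetime(df1.date + df1.pop('level_1'), format='%d/%m/%Y%H:%M')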
print (df1)
Customer date 1 2
0 2012-07-01 00:30:00 2.105 0.167
1 2012-07-01 01:00:00 2.036 0.108
2 2012-07-01 01:30:00 1.854 0.119
3 2012-07-02 00:30:00 1.559 0.110
4 2012-07-02 01:00:00 1.332 0.109
5 2012-07-02 01:30:00 1.184 0.123
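
The same reshape can also be written with melt and pivot_table, which some readers find easier to follow. The sketch below rebuilds the sample data from the question so it runs on its own; the names time_cols, long, and result are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Customer": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Purchase Category": list("ABCABCABCAB"),
    "date": ["01/07/2012"] * 3 + ["02/07/2012"] * 3
            + ["01/07/2012"] * 3 + ["02/07/2012"] * 2,
    "00:30": [1.25, 0.855, 0, 1.25, 0.309, 0, 0, 0.167, 0, 0, 0.11],
    "01:00": [1.25, 0.786, 0, 1.25, 0.082, 0, 0, 0.108, 0, 0, 0.109],
    "01:30": [1.25, 0.604, 0, 1.125, 0.059, 0, 0, 0.119, 0, 0, 0.123],
})

time_cols = ["00:30", "01:00", "01:30"]

# Wide -> long: one row per (customer, date, time period)
long = df.melt(id_vars=["Customer", "date"], value_vars=time_cols,
               var_name="time", value_name="value")

# Build a proper timestamp from the date and the time label
long["timestamp"] = pd.to_datetime(long["date"] + " " + long["time"],
                                   format="%d/%m/%Y %H:%M")

# Sum over purchase categories, with customers across the columns
result = long.pivot_table(index="timestamp", columns="Customer",
                          values="value", aggfunc="sum")
print(result)
```

Because the purchase category is never passed to pivot_table, it is aggregated away by the sum.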

Related

Replicating rows based on index multiplies the rows instead of replicating

I have a DataFrame in which I would like to replicate a few rows:
X Y diff No
Index
1d 0.000 0.017 0.000e+00 0
2D 0.083 0.017 3.000e-03 1
3D 0.250 0.017 7.200e-03 2
6D 0.500 0.019 2.400e-03 3
1DD 1.000 0.020 2.400e-03 4
2DD 2.000 0.023 1.300e-03 5
3DD 3.000 0.024 1.000e-03 6
5DD 5.000 0.026 6.500e-04 7
7DD 7.000 0.027 2.667e-04 8
10DD 10.000 0.028 1.200e-04 9
20DD 20.000 0.029 1.200e-04 10
30DD 30.000 0.031 0.000e+00 11
I want to replicate 30DD 30 times, 20DD 20 times, and 10DD 10 times, keeping the same index name.
I tried this, but instead of replicating the row it multiplies its values:
for i in range(4):
    test1 = df.append(df.ix['30DD']*30)
X Y diff No
Index
1d 0.000 0.017 0.000e+00 0
2D 0.083 0.017 3.000e-03 1
3D 0.250 0.017 7.200e-03 2
6D 0.500 0.019 2.400e-03 3
1DD 1.000 0.020 2.400e-03 4
2DD 2.000 0.023 1.300e-03 5
3DD 3.000 0.024 1.000e-03 6
5DD 5.000 0.026 6.500e-04 7
7DD 7.000 0.027 2.667e-04 8
10DD 10.000 0.028 1.200e-04 9
20DD 20.000 0.029 1.200e-04 10
30DD 30.000 0.031 0.000e+00 11
30DD 900 0.918 0 330
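df.ix['30DD']*30 multiplies the values in the row by 30 rather than repeating the row. Instead, build a list of index labels to add, subtracting 1 from each count because the original DataFrame already contains one copy of each row:
vals = ['30DD'] * 29 + ['20DD'] * 19 + ['10DD'] * 9
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df.loc[vals]])
Finally, if you want the rows sorted by the number in each index label:
df = df.iloc[df.index.str.extract(r'(\d+)').astype(int).squeeze().argsort()]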
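Using numpy.repeat, you can create an array of index labels for the rows you wish to append. Then pass it to pd.DataFrame.loc and concatenate the result with your original DataFrame.
vals = ['30DD', '20DD', '10DD']
counts = [30, 20, 10]
# Appends 30, 20 and 10 extra copies; subtract 1 from each count if the totals should include the original row
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
df = pd.concat([df, df.loc[np.repeat(vals, counts)]])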
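
As a variation on the numpy.repeat idea, the rows can be repeated in place so that they stay in their original order and no sort is needed afterwards. This is only a sketch on a shortened version of the question's frame; the totals dict and the names repeats and expanded are made up for illustration.

```python
import numpy as np
import pandas as pd

# Shortened version of the frame from the question
df = pd.DataFrame(
    {"X": [5.0, 10.0, 20.0, 30.0], "Y": [0.026, 0.028, 0.029, 0.031]},
    index=pd.Index(["5DD", "10DD", "20DD", "30DD"], name="Index"),
)

# Desired total number of copies per label; labels not listed are kept once
totals = {"10DD": 10, "20DD": 20, "30DD": 30}
repeats = [totals.get(label, 1) for label in df.index]

# Repeat row positions, then select them; the index labels are preserved
expanded = df.iloc[np.repeat(np.arange(len(df)), repeats)]
print(expanded.index.value_counts())
```

Selecting by position with iloc keeps this working even if the index already contains duplicate labels.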

Supervised learning for time series data

I have the following time series data and want to use a classification model. As an independent variable, I want to pass an array of the previous 5 values of feature1/feature2, each multiplied by some weight. For example, on 06-03-2015 for ID 1: [a1 a2 a3 a4 a5] = [0.053 0.036 0.044 0.087 0.02]
ID feature1 Date feature2
1 0.053 02-03-2015 0.0115
1 0.05 08-03-2015 0.0117
1 0.099 09-03-2015 0.00355
1 0.006 10-03-2015 0.0088
1 0.007 11-03-2015 0.0968
1 0.0045 12-03-2015 0.08325
1 0.068 13-03-2015 0.0055
1 0.097 14-03-2015 0.0668
1 0.082 18-03-2015 0.0635
2 0.053 21-03-2015 0.0115
2 0.05 26-03-2015 0.0117
2 0.099 27-03-2015 0.00355
2 0.006 28-03-2015 0.0088
2 0.007 29-03-2015 0.0968
2 0.068 31-03-2015 0.0055
2 0.097 01-04-2015 0.0668
2 0.017 02-04-2015 0.0145
2 0.049 06-04-2015 0.0556
How would I assign weights to the values on a rolling basis with window = 5? The weights can be between 0 and 1, so that I can multiply them with the values and use the result as one of the independent variables. Also, how can I use an LSTM model for this kind of data?
This article on Machine Learning Mastery walks you through how to do that.
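
For the rolling-weights part of the question, one possible sketch uses groupby with a shifted rolling window. The weights array below is purely hypothetical (the question does not give one), and the column name weighted_prev5 is made up for illustration. Only the feature1 values for ID 1 are reproduced.

```python
import numpy as np
import pandas as pd

# feature1 values for ID 1, copied from the question
df = pd.DataFrame({
    "ID": [1] * 9,
    "feature1": [0.053, 0.05, 0.099, 0.006, 0.007, 0.0045, 0.068, 0.097, 0.082],
})

# Hypothetical weights in [0, 1], oldest observation first
weights = np.array([0.1, 0.15, 0.2, 0.25, 0.3])

def weighted_previous(series):
    # shift(1) excludes the current row, so each window holds the 5 previous values
    return series.shift(1).rolling(len(weights)).apply(
        lambda window: np.dot(window, weights), raw=True
    )

# Compute per ID so that windows never span two different IDs
df["weighted_prev5"] = df.groupby("ID")["feature1"].transform(weighted_previous)
print(df)
```

The first five rows of each ID are NaN because fewer than five earlier observations exist for them.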

Formatting DataFrames

What's the agreed upon pythonic way to format columns in a DataFrame, while maintaining the original data?
For example, I have a large DataFrame which contains floats. For display purposes only, I would like to format some columns as percents, some as dollars, and some others as numbers rounded to the hundredths place. The remainder would be unchanged. The original data would be preserved and only the display would be affected. The solution would start with Raw df and return the Formatted df below.
Raw df:
index percent dollars rounded2 float
0 0.524 0.787 1.202 0.133
1 0.166 0.291 0.208 0.483
2 0.815 0.319 0.205 1.350
3 0.421 0.634 1.380 1.352
4 1.144 0.790 0.279 0.636
5 0.215 0.258 0.895 0.949
6 0.796 0.834 0.809 1.194
7 0.920 0.176 0.589 1.036
8 1.012 0.790 1.224 1.279
9 1.231 1.175 1.232 0.496
10 0.494 1.319 0.912 0.088
11 0.400 0.291 0.491 1.041
Formatted df:
index percent dollars rounded2 float
0 52.4% $0.79 1.20 0.133
1 16.6% $0.29 0.21 0.483
2 81.5% $0.32 0.20 1.350
3 42.1% $0.63 1.38 1.352
4 114.4% $0.79 0.28 0.636
5 21.5% $0.26 0.90 0.949
6 79.6% $0.83 0.81 1.194
7 92.0% $0.18 0.59 1.036
8 101.2% $0.79 1.22 1.279
9 123.1% $1.17 1.23 0.496
10 49.4% $1.32 0.91 0.088
11 40.0% $0.29 0.49 1.041
This seems to be pretty routine, but the available solutions for similar tasks are neither simple nor user friendly. I'd appreciate anyone who can provide a parsimonious method.
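
One way to approach this, sketched here under assumptions, is to keep the raw frame untouched and build a display-only copy from a dict that maps column names to format strings. The helper name format_for_display and the formats dict are hypothetical; only the first three rows of the raw data are reproduced.

```python
import pandas as pd

# First rows of the raw data from the question
raw = pd.DataFrame({
    "percent": [0.524, 0.166, 0.815],
    "dollars": [0.787, 0.291, 0.319],
    "rounded2": [1.202, 0.208, 0.205],
    "float": [0.133, 0.483, 1.350],
})

# Column name -> Python format string; unlisted columns are left unchanged
formats = {"percent": "{:.1%}", "dollars": "${:.2f}", "rounded2": "{:.2f}"}

def format_for_display(df, formats):
    """Return a copy of df whose listed columns are rendered as strings."""
    out = df.copy()
    for col, fmt in formats.items():
        out[col] = df[col].map(fmt.format)
    return out

formatted = format_for_display(raw, formats)
print(formatted)
```

The formatted columns hold strings, so any further computation should use raw. For rendered output such as notebooks, DataFrame.style.format accepts a similar column-to-format dict and may be worth a look.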

Pandas compute average for two consecutive rows and save result in two cells

I have the following data:
INPUT
ID A
1 0.040
2 0.086
3 0.127
4 0.173
5 0.141
6 0.047
7 0.068
8 0.038
I want to create a column B in which each pair of consecutive rows holds the average of the two corresponding values in A, as follows:
OUTPUT
ID A B
1 0.040 0.063
2 0.086 0.063
3 0.127 0.150
4 0.173 0.150
5 0.141 0.094
6 0.047 0.094
7 0.068 0.053
8 0.038 0.053
I tried this code
df["B"]= (df['A'] + df['A'].shift(-1))/2
I got the average, but I can't make it spread across both rows of each pair.
You can do it this way:
In [10]: df['B'] = df.groupby(np.arange(len(df)) // 2)['A'].transform('mean')
In [11]: df
Out[11]:
ID A B
0 1 0.040 0.063
1 2 0.086 0.063
2 3 0.127 0.150
3 4 0.173 0.150
4 5 0.141 0.094
5 6 0.047 0.094
6 7 0.068 0.053
7 8 0.038 0.053
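
To see why this works, it can help to look at the grouping key on its own. The sketch below rebuilds the sample data; the variable name pair_id is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": range(1, 9),
    "A": [0.040, 0.086, 0.127, 0.173, 0.141, 0.047, 0.068, 0.038],
})

# Integer division of the row position by 2 gives one label per pair of rows
pair_id = np.arange(len(df)) // 2
print(pair_id)  # [0 0 1 1 2 2 3 3]

# transform('mean') broadcasts each pair's mean back to both of its rows
df["B"] = df.groupby(pair_id)["A"].transform("mean")
print(df)
```

Changing the divisor to n would average over blocks of n consecutive rows instead.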

How to sort a boxplot by the median values in pandas

I've got a dataframe outcome2 that I generate a grouped boxplot with in the following manner:
In [11]: outcome2.boxplot(column='Hospital 30-Day Death (Mortality) Rates from Heart Attack',by='State')
plt.ylabel('30 Day Death Rate')
plt.title('30 Day Death Rate by State')
Out [11]:
What I'd like to do is sort the plot by the median for each state, instead of alphabetically. Not sure how to go about doing so.
To sort by the median, just compute the median, then sort it and use the resulting Index to slice the DataFrame:
In [45]: df.iloc[:10, :5]
Out[45]:
AK AL AR AZ CA
0 0.047 0.199 0.969 -0.205 1.053
1 0.206 0.132 -0.712 0.111 -0.254
2 0.638 0.233 -0.907 1.284 1.193
3 1.234 0.046 0.624 0.485 -0.048
4 -1.362 -0.559 1.108 -0.501 0.111
5 1.276 -0.954 0.653 -0.175 -0.287
6 0.524 -1.785 -0.887 1.354 -0.431
7 0.111 0.762 -0.514 0.808 0.728
8 1.301 0.619 0.957 1.542 -0.087
9 -0.892 2.327 1.363 -1.537 0.142
In [46]: med = df.median()
In [47]: med = med.sort_values()  # Series.sort() was removed from pandas; sort_values() returns a sorted copy
In [48]: newdf = df[med.index]
In [49]: newdf.iloc[:10, :5]
Out[49]:
PA CT LA RI MO
0 -0.667 0.774 -0.999 -0.938 0.155
1 0.822 0.390 -0.014 -2.228 0.570
2 -1.037 0.838 -0.673 2.038 0.809
3 0.620 2.845 -0.523 -0.151 -0.955
4 -0.918 1.043 0.613 0.698 -0.446
5 -0.767 0.869 -0.496 -0.925 -0.374
6 -0.495 0.437 1.245 -1.046 0.894
7 -1.283 0.358 0.016 0.137 0.511
8 -0.018 -0.047 -0.639 -0.385 0.080
9 -1.705 0.986 0.605 0.295 0.302
In [50]: med.head()
Out[50]:
PA -0.117
CT -0.077
LA -0.072
RI -0.069
MO -0.053
dtype: float64
[Figure: the resulting boxplot, with columns in median-sorted order]
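
For long-format data like the question's, where the plot is grouped with by='State', the same idea could be sketched as follows. The data values and the short column name rate are made up for illustration; the question's frame uses a much longer column name.

```python
import pandas as pd

# Made-up long-format data: one row per hospital, with its state and rate
outcome = pd.DataFrame({
    "State": ["AK", "AK", "AL", "AL", "AR", "AR"],
    "rate": [15.0, 17.0, 12.0, 13.0, 18.0, 20.0],
})

# Median per state, sorted ascending; the index gives the desired plot order
order = outcome.groupby("State")["rate"].median().sort_values().index

# One column per state, arranged in median order (cells for other states are NaN)
wide = outcome.pivot(columns="State", values="rate")[order]
print(list(wide.columns))

# With matplotlib installed, wide.boxplot() would draw the boxes in this order
```

Reordering the columns before plotting sidesteps the alphabetical ordering that the by= argument applies.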
