Find Average of Every Three Columns in Pandas dataframe - python

I am new to Python and Pandas. I have a pandas DataFrame with monthly columns ranging from 2000 (2000-01) to 2016 (2016-06).
I want to find the average of every three months and assign it to a new quarterly column (e.g. 2000q1). I know I can do the following:
df['2000q1'] = df[['2000-01', '2000-02', '2000-03']].mean(axis=1)
df['2000q2'] = df[['2000-04', '2000-05', '2000-06']].mean(axis=1)
.
.
.
df['2016q2'] = df[['2016-04', '2016-05', '2016-06']].mean(axis=1)
But this is very tedious. I would appreciate it if someone could help me find a better way.

You can use groupby on the columns. Here np.arange(len(df.columns)) // 3 maps columns 0-2 to group 0, columns 3-5 to group 1, and so on, so every three consecutive columns are averaged together:
df.groupby(np.arange(len(df.columns)) // 3, axis=1).mean()
Or, since the column labels can be converted to datetime, you can use resample:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Here's a demo:
cols = pd.date_range('2000-01', '2000-06', freq='MS')
cols = cols.strftime('%Y-%m')
cols
Out:
array(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'],
dtype='<U7')
df = pd.DataFrame(np.random.randn(10, 6), columns=cols)
df
Out:
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
0 -1.263798 0.251526 0.851196 0.159452 1.412013 1.079086
1 -0.909071 0.685913 1.394790 -0.883605 0.034114 -1.073113
2 0.516109 0.452751 -0.397291 -0.050478 -0.364368 -0.002477
3 1.459609 -1.696641 0.457822 1.057702 -0.066313 -0.910785
4 -0.482623 1.388621 0.971078 -0.038535 0.033167 0.025781
5 -0.016654 1.404805 0.100335 -0.082941 -0.418608 0.588749
6 0.684735 -2.007105 0.552615 1.969356 -0.614634 0.021459
7 0.382475 0.965739 -1.826609 -0.086537 -0.073538 -0.534753
8 1.548773 -0.157250 0.494819 -1.631516 0.627794 -0.398741
9 0.199049 0.145919 0.711701 0.305382 -0.118315 -2.397075
First alternative:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Out:
0 1
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
Second alternative:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Out:
2000-03-31 2000-06-30
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
You can assign the result to a variable:
res = df.resample('Q', axis=1).mean()
Change column names as you like:
res = res.rename(columns=lambda col: '{}q{}'.format(col.year, col.quarter))
res
Out:
2000q1 2000q2
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
And attach it to your current DataFrame with:
pd.concat([df, res], axis=1)
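Putting it all together, a consolidated sketch of the whole recipe (assuming every column label is a 'YYYY-MM' string; it keeps a copy of the original labels so the monthly columns stay readable):
orig_cols = df.columns
df.columns = pd.to_datetime(df.columns)          # parse the 'YYYY-MM' labels
res = df.resample('Q', axis=1).mean()            # quarterly means across columns
res.columns = ['{}q{}'.format(c.year, c.quarter) for c in res.columns]
df.columns = orig_cols                           # restore the monthly labels
out = pd.concat([df, res], axis=1)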

Related

decode pandas columns (categorical data to metric data) by template (pivot table) with constraints

In df, A and B are label-encoded categories, each belonging to a certain subset (typ).
These categories should now be decoded again into metric data taken from a template.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':   [0,1,2,3,0,1,2,3,0,2,2,2,3,3,2,3,1,1],
                   'B':   [2,3,1,1,1,3,2,2,0,2,2,2,3,3,3,3,2,1],
                   'typ': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]})
A and B should be decoded to metric (float) data from the templates pivot_A and pivot_B respectively. In the templates, the headers are the values to replace, the indices are the conditions to match, and the values are the new values:
pivot_A = pd.DataFrame(np.array([np.random.rand(9), np.random.rand(9),
                                 np.random.rand(9), np.random.rand(9)]).T,
                       columns=np.unique(df.A),
                       index=np.unique(df.typ))
pivot_B = pd.DataFrame(np.array([np.random.rand(9), np.random.rand(9),
                                 np.random.rand(9), np.random.rand(9)]).T,
                       columns=np.unique(df.B),
                       index=np.unique(df.typ))
pivot_B looks like:
In [5]: pivot_B
Out[5]:
0 1 2 3
type
1 0.326687 0.851405 0.830255 0.721817
2 0.496182 0.769574 0.083379 0.491332
3 0.442760 0.786503 0.593361 0.470658
4 0.100724 0.455841 0.485407 0.211383
5 0.989424 0.852057 0.530137 0.385900
6 0.413897 0.915375 0.708038 0.846020
7 0.548033 0.670561 0.900648 0.742418
8 0.077552 0.310529 0.156794 0.076186
9 0.463480 0.377749 0.876133 0.518022
pivot_A looks like:
In [6]: pivot_A
Out[6]:
0 1 2 3
type
1 0.012808 0.128041 0.001279 0.320740
2 0.615976 0.736491 0.879216 0.842910
3 0.298637 0.828012 0.962703 0.736827
4 0.700053 0.115463 0.670091 0.638931
5 0.416262 0.633604 0.504292 0.983946
6 0.956872 0.129720 0.611625 0.682046
7 0.414579 0.062104 0.118168 0.265530
8 0.162742 0.952069 0.112400 0.837696
9 0.123151 0.061040 0.326437 0.380834
Explained usage of the pivots (pseudocode):
if df.typ == pivot_A.index and df.A == X:
    df.A = pivot_A.loc[typ][X]
Decoding could be done by:
for categorie in [i for i in df.columns if i != 'typ']:
    for col in np.unique(df[categorie]):
        for type_ in np.unique(df.typ):
            df.loc[(df['typ'] == type_) & (df[categorie] == col), categorie] = \
                locals()['pivot_{}'.format(categorie)].loc[type_, col]
and results in:
In [7]: df
Out[7]:
A B typ
0 0.012808 0.830255 1
1 0.736491 0.491332 2
2 0.962703 0.786503 3
3 0.638931 0.455841 4
4 0.416262 0.852057 5
5 0.129720 0.846020 6
6 0.118168 0.900648 7
7 0.837696 0.156794 8
8 0.123151 0.463480 9
9 0.001279 0.830255 1
10 0.879216 0.083379 2
11 0.962703 0.593361 3
12 0.638931 0.211383 4
13 0.983946 0.385900 5
14 0.611625 0.846020 6
15 0.265530 0.742418 7
16 0.952069 0.156794 8
17 0.061040 0.377749 9
BUT this looping seems NOT to be the best way of doing it, right?!
How can I improve the code? pd.replace or dictionaries seem reasonable... but I cannot figure out how to handle them with the extra typ condition.
Melting the 3x nested looping down to a single loop reduces the run time a lot:
old_values = list(pivot_A.columns)   # from template
new_values_df = pd.DataFrame()       # to save the decoded values without overwriting the old values
for typ_ in pivot_A.index:           # to match the condition (the correct typ) in every loop separately
    new_values = list(pivot_A.loc[typ_])
    new_values_df = pd.concat([(df[df['typ'] == typ_]['A']
                                .replace(old_values, new_values)).to_frame('A'),
                               new_values_df])

Can't find aggregation result column in Python Pandas

s = pd.Series(["08-10-2017", "08-10-2017", "08-10-2017", "09-10-2017", "09-10-2017", "09-10-2017", "10-10-2017", "10-10-2017", "10-10-2017", "11-10-2017", "11-10-2017", "11-10-2017", "12-10-2017", "12-10-2017", "12-10-2017", "13-10-2017", "13-10-2017", "13-10-2017", "14-10-2017", "14-10-2017"])
p = pd.DataFrame(data=s)
p.columns = ['date']
p.groupby('date').agg('count').reset_index().columns
Where is the 'count' column?
I think you are looking for value_counts:
p.date.value_counts()
Out[1095]:
09-10-2017 3
13-10-2017 3
10-10-2017 3
12-10-2017 3
08-10-2017 3
11-10-2017 3
14-10-2017 2
Name: date, dtype: int64
And if you want to do it with groupby:
p.groupby('date').size()
And if you do want to use count:
p.groupby('date').agg({'date':'count'})
Out[1101]:
date
date
08-10-2017 3
09-10-2017 3
10-10-2017 3
11-10-2017 3
12-10-2017 3
13-10-2017 3
14-10-2017 2
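As for why the original attempt shows no 'count' column: agg('count') counts the remaining non-grouping columns, and date is the only column in p, so after grouping by it there is nothing left to aggregate (and the result columns would be named after the original columns, not 'count', anyway). A small sketch that materializes the count under an explicit name:
counts = p.groupby('date').size().reset_index(name='count')
counts.columns  # Index(['date', 'count'], dtype='object')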

Python Pandas fillna doesn't work in for loop?

Given a set up such as below:
import pandas as pd
import numpy as np
#Create random number dataframes
df1 = pd.DataFrame(np.random.rand(10,4))
df2 = pd.DataFrame(np.random.rand(10,4))
df3 = pd.DataFrame(np.random.rand(10,4))
#Create list of dataframes
data_frame_list = [df1, df2, df3]
#Introduce some NaN values
df1.iloc[4,3] = np.NaN
df2.iloc[1:4,2] = np.NaN
#Create loop to ffill any NaN values
for df in data_frame_list:
    df = df.fillna(method='ffill')
This still leaves df2 (for example) as:
0 1 2 3
0 0.946601 0.492957 0.688421 0.582571
1 0.365173 0.507617 NaN 0.997909
2 0.185005 0.496989 NaN 0.962120
3 0.278633 0.515227 NaN 0.868952
4 0.346495 0.779571 0.376018 0.750900
5 0.384307 0.594381 0.741655 0.510144
6 0.499180 0.885632 0.134130 0.196010
7 0.245445 0.771402 0.371148 0.222618
8 0.564510 0.487644 0.121945 0.095932
9 0.401214 0.282698 0.018120 0.689916
Although the individual line of code:
df2 = df2.fillna(method='ffill')
does work. I thought the issue might be due to the way I was naming variables, so I tried globals(), but this didn't seem to work either.
Wondering if it is possible to ffill an entire DataFrame in a for loop, or am I going wrong somewhere in my approach?
No, unfortunately it does not. You are not calling fillna in place, so it returns a copy, which you then rebind to the loop variable df. Rebinding that variable does not change the contents of the list.
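To see why, here is a minimal sketch of the same rebinding pitfall with plain integers:
items = [1, 2, 3]
for x in items:
    x = x + 10   # rebinds the name x only; the list element is untouched
print(items)     # [1, 2, 3]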
If you want to do that, iterate over the index or use a list comprehension.
data_frame_list = [df.ffill() for df in data_frame_list]
Or,
for i in range(len(data_frame_list)):
    data_frame_list[i].ffill(inplace=True)
You can change the DataFrames inside the list directly: the list holds references to df1 - df3, so calling ffill with inplace=True changes those objects too:
data_frame_list = [df1, df2, df3]

for df in data_frame_list:
    df.ffill(inplace=True)

print(data_frame_list)
[ 0 1 2 3
0 0.506726 0.057531 0.627580 0.132553
1 0.131085 0.788544 0.506686 0.412826
2 0.578009 0.488174 0.335964 0.140816
3 0.891442 0.086312 0.847512 0.529616
4 0.550261 0.848461 0.158998 0.529616
5 0.817808 0.977898 0.933133 0.310414
6 0.481331 0.382784 0.874249 0.363505
7 0.384864 0.035155 0.634643 0.009076
8 0.197091 0.880822 0.002330 0.109501
9 0.623105 0.999237 0.567151 0.487938, 0 1 2 3
0 0.104856 0.525416 0.284066 0.658453
1 0.989523 0.644251 0.284066 0.141395
2 0.488099 0.167418 0.284066 0.097982
3 0.930415 0.486878 0.284066 0.192273
4 0.210032 0.244598 0.175200 0.367130
5 0.981763 0.285865 0.979590 0.924292
6 0.631067 0.119238 0.855842 0.782623
7 0.815908 0.575624 0.037598 0.532883
8 0.346577 0.329280 0.606794 0.825932
9 0.273021 0.503340 0.828568 0.429792, 0 1 2 3
0 0.491665 0.752531 0.780970 0.524148
1 0.635208 0.283928 0.821345 0.874243
2 0.454211 0.622611 0.267682 0.726456
3 0.379144 0.345580 0.694614 0.585782
4 0.844209 0.662073 0.590640 0.612480
5 0.258679 0.413567 0.797383 0.431819
6 0.034473 0.581294 0.282111 0.856725
7 0.352072 0.801542 0.862749 0.000285
8 0.793939 0.297286 0.441013 0.294635
9 0.841181 0.804839 0.311352 0.171094]
Or you can concat:
df = pd.concat([df1, df2, df3], keys=['df1', 'df2', 'df3'])
[x for _, x in df.groupby(level=0).ffill().groupby(level=0)]

HDFStore output in dataframe not series

I would like to have the two tables that I read in stored in DataFrames.
I'm reading a h5 file into my code with:
with pd.HDFStore(directory_path) as store:
    self.df = store['/raw_ti4404']
    self.hr_df = store['/metric_heartrate']
self.df is read in as a DataFrame, but self.hr_df is read in as a Series.
I am calling them both in the same manner and I don't understand why one is a DataFrame and the other a Series. It might be something to do with how the data is stored.
Any help on how to load metric_heartrate as a DataFrame would be appreciated.
Most probably metric_heartrate was stored as a Series.
Demo:
Generate sample DF:
In [123]: df = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
In [124]: df
Out[124]:
a b c
0 0.404338 0.010642 0.686192
1 0.108319 0.962482 0.772487
2 0.564785 0.456916 0.496818
3 0.122507 0.653329 0.647296
4 0.348033 0.925427 0.937080
5 0.750008 0.301208 0.779692
6 0.833262 0.448925 0.553434
7 0.055830 0.267205 0.851582
8 0.189788 0.087814 0.902296
9 0.045610 0.738983 0.831780
In [125]: store = pd.HDFStore('d:/temp/test.h5')
Let's store a column as Series:
In [126]: store.append('ser', df['a'], format='t')
Let's store a DataFrame, containing only one column - a:
In [127]: store.append('df', df[['a']], format='t')
Reading data from HDFStore:
In [128]: store.select('ser')
Out[128]:
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
Name: a, dtype: float64
In [129]: store.select('df')
Out[129]:
a
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
Fix: read the Series and convert it to a DataFrame:
In [130]: store.select('ser').to_frame('a')
Out[130]:
a
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
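Applied to the original code, the read could look like this (a sketch: the column name 'heartrate' is an assumption, use whatever label fits your data):
with pd.HDFStore(directory_path) as store:
    self.df = store['/raw_ti4404']
    # convert the stored Series into a one-column DataFrame on the way in
    self.hr_df = store['/metric_heartrate'].to_frame('heartrate')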

Selecting columns from a pandas dataframe based on row conditions

I have a pandas dataframe
In [1]: df = pd.DataFrame(np.random.randn(10, 4))
Is there a way I can select only the columns whose last-row value is > 0?
The desired output would be a new DataFrame containing all rows of the columns where the last row is > 0.
In [201]: df = pd.DataFrame(np.random.randn(10, 4))
In [202]: df
Out[202]:
0 1 2 3
0 -1.380064 0.391358 -0.043390 -1.970113
1 -0.612594 -0.890354 -0.349894 -0.848067
2 1.178626 1.798316 0.691760 0.736255
3 -0.909491 0.429237 0.766065 -0.605075
4 -1.214366 1.907580 -0.583695 0.192488
5 -0.283786 -1.315771 0.046579 -0.777228
6 1.195634 -0.259040 -0.432147 1.196420
7 -2.346814 1.251494 0.261687 0.400886
8 0.845000 0.536683 -2.628224 -0.238449
9 0.246398 -0.548448 -0.295481 0.076117
In [203]: df.iloc[:, (df.iloc[-1] > 0).values]
Out[203]:
0 3
0 -1.380064 -1.970113
1 -0.612594 -0.848067
2 1.178626 0.736255
3 -0.909491 -0.605075
4 -1.214366 0.192488
5 -0.283786 -0.777228
6 1.195634 1.196420
7 -2.346814 0.400886
8 0.845000 -0.238449
9 0.246398 0.076117
Basically, this solution uses very basic pandas indexing, in particular the iloc() method.
You can use the boolean series generated from the condition to index the columns of interest:
In [30]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[30]:
0 1 2 3
0 -0.667736 -0.744761 0.401677 -1.286372
1 1.098134 -1.327454 1.409357 -0.180265
2 -0.105780 0.446195 -0.562578 -0.746083
3 1.366714 -0.685103 0.982354 1.928026
4 0.091040 -0.689676 0.425042 0.723466
5 0.798305 -1.454922 -0.017695 0.515961
6 -0.786693 1.496968 -0.112125 -1.303714
7 -0.211216 -1.321854 -0.892023 -0.583492
8 1.293255 0.936271 1.873870 0.790086
9 -0.699665 -0.953611 0.139986 -0.200499
In [32]:
df[df.columns[df.iloc[-1] > 0]]
Out[32]:
2
0 0.401677
1 1.409357
2 -0.562578
3 0.982354
4 0.425042
5 -0.017695
6 -0.112125
7 -0.892023
8 1.873870
9 0.139986
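The same selection can also be written with .loc; a boolean Series aligns against the column axis, so no detour through .values or df.columns is needed (equivalent to the two approaches above):
df.loc[:, df.iloc[-1] > 0]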
Check out pandasql: https://pypi.python.org/pypi/pandasql
This blog post is a great tutorial for using SQL for Pandas DataFrames: http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html
This should get you started:
from pandasql import *
import pandas

def pysqldf(q):
    return sqldf(q, globals())

q = """
SELECT *
FROM df
WHERE value > 0
ORDER BY 1;
"""

df = pysqldf(q)
