I have a pandas DataFrame:
In [1]: df = DataFrame(np.random.randn(10, 4))
Is there a way I can select only the columns whose last-row value is > 0?
The desired output would be a new DataFrame containing all rows of the columns where the last-row value is > 0.
In [201]: df = pd.DataFrame(np.random.randn(10, 4))
In [202]: df
Out[202]:
0 1 2 3
0 -1.380064 0.391358 -0.043390 -1.970113
1 -0.612594 -0.890354 -0.349894 -0.848067
2 1.178626 1.798316 0.691760 0.736255
3 -0.909491 0.429237 0.766065 -0.605075
4 -1.214366 1.907580 -0.583695 0.192488
5 -0.283786 -1.315771 0.046579 -0.777228
6 1.195634 -0.259040 -0.432147 1.196420
7 -2.346814 1.251494 0.261687 0.400886
8 0.845000 0.536683 -2.628224 -0.238449
9 0.246398 -0.548448 -0.295481 0.076117
In [203]: df.iloc[:, (df.iloc[-1] > 0).values]
Out[203]:
0 3
0 -1.380064 -1.970113
1 -0.612594 -0.848067
2 1.178626 0.736255
3 -0.909491 -0.605075
4 -1.214366 0.192488
5 -0.283786 -0.777228
6 1.195634 1.196420
7 -2.346814 0.400886
8 0.845000 -0.238449
9 0.246398 0.076117
Basically this solution uses very basic pandas indexing, in particular the iloc indexer.
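Note that the .values call matters: iloc expects positional indexers and, in most pandas versions, rejects a label-aligned boolean Series, so converting the mask to a plain NumPy boolean array keeps it happy.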
You can use the boolean series generated from the condition to index the columns of interest:
In [30]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[30]:
0 1 2 3
0 -0.667736 -0.744761 0.401677 -1.286372
1 1.098134 -1.327454 1.409357 -0.180265
2 -0.105780 0.446195 -0.562578 -0.746083
3 1.366714 -0.685103 0.982354 1.928026
4 0.091040 -0.689676 0.425042 0.723466
5 0.798305 -1.454922 -0.017695 0.515961
6 -0.786693 1.496968 -0.112125 -1.303714
7 -0.211216 -1.321854 -0.892023 -0.583492
8 1.293255 0.936271 1.873870 0.790086
9 -0.699665 -0.953611 0.139986 -0.200499
In [32]:
df[df.columns[df.iloc[-1]>0]]
Out[32]:
2
0 0.401677
1 1.409357
2 -0.562578
3 0.982354
4 0.425042
5 -0.017695
6 -0.112125
7 -0.892023
8 1.873870
9 0.139986
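As a side note, the same mask can be passed to loc directly, since loc does accept a boolean Series for column selection. A minimal equivalent sketch:
df.loc[:, df.iloc[-1] > 0]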
Check out pandasql: https://pypi.python.org/pypi/pandasql
This blog post is a great tutorial for using SQL for Pandas DataFrames: http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html
This should get you started:
from pandasql import sqldf
import pandas

def pysqldf(q):
    return sqldf(q, globals())

q = """
SELECT *
FROM df
WHERE value > 0
ORDER BY 1;
"""

df = pysqldf(q)
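Note that the query above assumes df has a column literally named value; the integer column names from the question (0-3) are awkward to reference in SQL. A minimal runnable sketch with SQL-friendly column names (a hypothetical setup, assuming pandasql is installed):

import numpy as np
import pandas as pd
from pandasql import sqldf

df = pd.DataFrame(np.random.randn(10, 4), columns=list('abcd'))
# Row-wise filter: keep rows where column a is positive
result = sqldf("SELECT * FROM df WHERE a > 0 ORDER BY a;", globals())

Keep in mind that SQL filters rows this way, whereas the original question asks to select columns.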
In df, A and B are label-encoded categories, all belonging to a certain subset (typ).
These categories should now be decoded again into metric data taken from templates:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0,1,2,3,0,1,2,3,0,2,2,2,3,3,2,3,1,1],
                   'B': [2,3,1,1,1,3,2,2,0,2,2,2,3,3,3,3,2,1],
                   'typ': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]})
A and B should be decoded to metric (float) data from the templates pivot_A and pivot_B respectively. In the templates, the headers are the values to replace, the indices are the conditions to match, and the values are the new values:
pivot_A = pd.DataFrame(np.random.rand(9, 4),
                       columns=np.unique(df.A),
                       index=np.unique(df.typ))
pivot_B = pd.DataFrame(np.random.rand(9, 4),
                       columns=np.unique(df.B),
                       index=np.unique(df.typ))
pivot_B looks like:
In [5]: pivot_B
Out[5]:
0 1 2 3
type
1 0.326687 0.851405 0.830255 0.721817
2 0.496182 0.769574 0.083379 0.491332
3 0.442760 0.786503 0.593361 0.470658
4 0.100724 0.455841 0.485407 0.211383
5 0.989424 0.852057 0.530137 0.385900
6 0.413897 0.915375 0.708038 0.846020
7 0.548033 0.670561 0.900648 0.742418
8 0.077552 0.310529 0.156794 0.076186
9 0.463480 0.377749 0.876133 0.518022
pivot_A looks like:
In [6]: pivot_A
Out[6]:
0 1 2 3
type
1 0.012808 0.128041 0.001279 0.320740
2 0.615976 0.736491 0.879216 0.842910
3 0.298637 0.828012 0.962703 0.736827
4 0.700053 0.115463 0.670091 0.638931
5 0.416262 0.633604 0.504292 0.983946
6 0.956872 0.129720 0.611625 0.682046
7 0.414579 0.062104 0.118168 0.265530
8 0.162742 0.952069 0.112400 0.837696
9 0.123151 0.061040 0.326437 0.380834
Explained usage of the pivots (pseudocode):
if df.typ == pivot.index and df.A == X:
    df.A = pivot_A.loc[typ][X]
Decoding could be done by:
for categorie in [i for i in df.columns if i != 'typ']:
    for col in np.unique(df[categorie]):
        for type_ in np.unique(df.typ):
            df.loc[(df['typ'] == type_) & (df[categorie] == col), categorie] = \
                locals()['pivot_{}'.format(categorie)].loc[type_, col]
and results in:
In [7]: df
Out[7]:
A B typ
0 0.012808 0.830255 1
1 0.736491 0.491332 2
2 0.962703 0.786503 3
3 0.638931 0.455841 4
4 0.416262 0.852057 5
5 0.129720 0.846020 6
6 0.118168 0.900648 7
7 0.837696 0.156794 8
8 0.123151 0.463480 9
9 0.001279 0.830255 1
10 0.879216 0.083379 2
11 0.962703 0.593361 3
12 0.638931 0.211383 4
13 0.983946 0.385900 5
14 0.611625 0.846020 6
15 0.265530 0.742418 7
16 0.952069 0.156794 8
17 0.061040 0.377749 9
BUT this looping seems NOT to be the best way of doing it, right?!
How can I improve the code? pd.replace or dictionaries seem reasonable... but I cannot figure out how to handle them with the extra typ condition.
Melting the 3x looping process down to one loop reduces the run time a lot:

old_values = list(pivot_A.columns)   # from template
new_values_df = pd.DataFrame()       # to save the decoded values without overwriting the old values
for typ_ in pivot_A.index:           # to match the condition (correct typ in every loop separately)
    new_values = list(pivot_A.loc[typ_])
    new_values_df = pd.concat([(df[df['typ'] == typ_]['A']
                                .replace(old_values, new_values)).to_frame('A'),
                               new_values_df])
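A fully vectorized alternative is also possible with plain NumPy indexing. This is a sketch under two assumptions taken from the setup above: typ runs 1..9 and matches the pivot rows positionally, and the codes 0..3 match the pivot columns positionally:

decoded = df.copy()
# Row lookup: typ 1..9 maps to pivot rows 0..8; column lookup: the code itself
decoded['A'] = pivot_A.to_numpy()[df['typ'].to_numpy() - 1, df['A'].to_numpy()]
decoded['B'] = pivot_B.to_numpy()[df['typ'].to_numpy() - 1, df['B'].to_numpy()]

This replaces all three loops with two indexed reads, which should scale much better than looping row groups.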
Given a set up such as below:
import pandas as pd
import numpy as np
#Create random number dataframes
df1 = pd.DataFrame(np.random.rand(10,4))
df2 = pd.DataFrame(np.random.rand(10,4))
df3 = pd.DataFrame(np.random.rand(10,4))
#Create list of dataframes
data_frame_list = [df1, df2, df3]
#Introduce some NaN values
df1.iloc[4,3] = np.NaN
df2.iloc[1:4,2] = np.NaN
#Create loop to ffill any NaN values
for df in data_frame_list:
    df = df.fillna(method='ffill')
This still leaves df2 (for example) as:
0 1 2 3
0 0.946601 0.492957 0.688421 0.582571
1 0.365173 0.507617 NaN 0.997909
2 0.185005 0.496989 NaN 0.962120
3 0.278633 0.515227 NaN 0.868952
4 0.346495 0.779571 0.376018 0.750900
5 0.384307 0.594381 0.741655 0.510144
6 0.499180 0.885632 0.134130 0.196010
7 0.245445 0.771402 0.371148 0.222618
8 0.564510 0.487644 0.121945 0.095932
9 0.401214 0.282698 0.018120 0.689916
However, the individual line of code:
df2 = df2.fillna(method='ffill')
does work on its own. I thought the issue might be due to the way I was naming variables, so I introduced globals()[df], but this didn't seem to work either.
Wondering if it is possible to do a ffill of an entire DataFrame in a for loop, or am I going wrong somewhere in my approach?
No, unfortunately it does not. You are calling fillna not in-place, which generates a copy that is then reassigned to the loop variable df. Reassigning this variable does not change the contents of the list.
If you want to do that, iterate over the indices or use a list comprehension.
data_frame_list = [df.ffill() for df in data_frame_list]
Or,
for i in range(len(data_frame_list)):
    data_frame_list[i].ffill(inplace=True)
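Either way the list ends up holding filled frames: the comprehension rebinds each list slot to a new object (leaving df1 - df3 untouched), while the in-place loop mutates the existing objects, so df1 - df3 change as well.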
You can also modify the DataFrames inside the list directly, so that df1 - df3 themselves are changed, with ffill and the parameter inplace=True:
data_frame_list = [df1, df2, df3]
for df in data_frame_list:
    df.ffill(inplace=True)
print (data_frame_list)
[ 0 1 2 3
0 0.506726 0.057531 0.627580 0.132553
1 0.131085 0.788544 0.506686 0.412826
2 0.578009 0.488174 0.335964 0.140816
3 0.891442 0.086312 0.847512 0.529616
4 0.550261 0.848461 0.158998 0.529616
5 0.817808 0.977898 0.933133 0.310414
6 0.481331 0.382784 0.874249 0.363505
7 0.384864 0.035155 0.634643 0.009076
8 0.197091 0.880822 0.002330 0.109501
9 0.623105 0.999237 0.567151 0.487938, 0 1 2 3
0 0.104856 0.525416 0.284066 0.658453
1 0.989523 0.644251 0.284066 0.141395
2 0.488099 0.167418 0.284066 0.097982
3 0.930415 0.486878 0.284066 0.192273
4 0.210032 0.244598 0.175200 0.367130
5 0.981763 0.285865 0.979590 0.924292
6 0.631067 0.119238 0.855842 0.782623
7 0.815908 0.575624 0.037598 0.532883
8 0.346577 0.329280 0.606794 0.825932
9 0.273021 0.503340 0.828568 0.429792, 0 1 2 3
0 0.491665 0.752531 0.780970 0.524148
1 0.635208 0.283928 0.821345 0.874243
2 0.454211 0.622611 0.267682 0.726456
3 0.379144 0.345580 0.694614 0.585782
4 0.844209 0.662073 0.590640 0.612480
5 0.258679 0.413567 0.797383 0.431819
6 0.034473 0.581294 0.282111 0.856725
7 0.352072 0.801542 0.862749 0.000285
8 0.793939 0.297286 0.441013 0.294635
9 0.841181 0.804839 0.311352 0.171094]
Or you can use concat:
df = pd.concat([df1, df2, df3], keys=['df1', 'df2', 'df3'])
[x for _,x in df.groupby(level=0).ffill().groupby(level=0)]
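To unpack the one-liner: concat stacks the three frames under the given keys, groupby(level=0).ffill() forward-fills within each original frame only (so values never bleed from one frame into the next), and the second groupby splits the result back into a list of frames.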
I have a pandas DataFrame which I have composed with concat. One row consists of 96 values; I would like to split the DataFrame at value 72.
So that the first 72 values of a row are stored in DataFrame1, and the next 24 values of a row in DataFrame2.
I create my DF as follows:
temps = DataFrame(myData)
datasX = concat([temps.shift(i) for i in range(72, -24, -1)], axis=1)
The question is: how can I split them? :)
Use iloc:
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]
(iloc docs)
Use np.split(..., axis=1):
Demo:
In [255]: df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
In [256]: df
Out[256]:
a b c d e f
0 0.823638 0.767999 0.460358 0.034578 0.592420 0.776803
1 0.344320 0.754412 0.274944 0.545039 0.031752 0.784564
2 0.238826 0.610893 0.861127 0.189441 0.294646 0.557034
3 0.478562 0.571750 0.116209 0.534039 0.869545 0.855520
4 0.130601 0.678583 0.157052 0.899672 0.093976 0.268974
In [257]: dfs = np.split(df, [4], axis=1)
In [258]: dfs[0]
Out[258]:
a b c d
0 0.823638 0.767999 0.460358 0.034578
1 0.344320 0.754412 0.274944 0.545039
2 0.238826 0.610893 0.861127 0.189441
3 0.478562 0.571750 0.116209 0.534039
4 0.130601 0.678583 0.157052 0.899672
In [259]: dfs[1]
Out[259]:
e f
0 0.592420 0.776803
1 0.031752 0.784564
2 0.294646 0.557034
3 0.869545 0.855520
4 0.093976 0.268974
np.split() is pretty flexible - let's split the original DF into 3 DFs at the columns with indexes [2, 3]:
In [260]: dfs = np.split(df, [2,3], axis=1)
In [261]: dfs[0]
Out[261]:
a b
0 0.823638 0.767999
1 0.344320 0.754412
2 0.238826 0.610893
3 0.478562 0.571750
4 0.130601 0.678583
In [262]: dfs[1]
Out[262]:
c
0 0.460358
1 0.274944
2 0.861127
3 0.116209
4 0.157052
In [263]: dfs[2]
Out[263]:
d e f
0 0.034578 0.592420 0.776803
1 0.545039 0.031752 0.784564
2 0.189441 0.294646 0.557034
3 0.534039 0.869545 0.855520
4 0.899672 0.093976 0.268974
I generally use np.array_split because of its simpler syntax, and because it scales better to more than 2 partitions.
import numpy as np
partitions = 2
dfs = np.array_split(df, partitions)
np.split(df, [100, 200, 300], axis=0) wants explicit index numbers, which may or may not be desirable.
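For instance, a quick sketch splitting the 6-column demo frame from above into 3 equal column blocks (note axis=1 selects columns; array_split, unlike split, also tolerates partition counts that do not divide the axis length evenly):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
dfs = np.array_split(df, 3, axis=1)  # three blocks of 2 columns each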
I would like to have the two tables that I read in stored in DataFrames.
I'm reading an h5 file into my code with:
with pd.HDFStore(directory_path) as store:
    self.df = store['/raw_ti4404']
    self.hr_df = store['/metric_heartrate']
self.df is being stored as a DataFrame, but self.hr_df is being stored as a Series.
I am calling them both in the same manner and I don't understand why one is a DataFrame and the other a Series. It might be something to do with how the data is stored.
Any help on how to store metric_heartrate as a DataFrame would be appreciated.
Most probably metric_heartrate was stored as a Series.
Demo:
Generate sample DF:
In [123]: df = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
In [124]: df
Out[124]:
a b c
0 0.404338 0.010642 0.686192
1 0.108319 0.962482 0.772487
2 0.564785 0.456916 0.496818
3 0.122507 0.653329 0.647296
4 0.348033 0.925427 0.937080
5 0.750008 0.301208 0.779692
6 0.833262 0.448925 0.553434
7 0.055830 0.267205 0.851582
8 0.189788 0.087814 0.902296
9 0.045610 0.738983 0.831780
In [125]: store = pd.HDFStore('d:/temp/test.h5')
Let's store a column as Series:
In [126]: store.append('ser', df['a'], format='t')
Let's store a DataFrame, containing only one column - a:
In [127]: store.append('df', df[['a']], format='t')
Reading data from HDFStore:
In [128]: store.select('ser')
Out[128]:
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
Name: a, dtype: float64
In [129]: store.select('df')
Out[129]:
a
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
Fix - read Series and convert it to DF:
In [130]: store.select('ser').to_frame('a')
Out[130]:
a
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
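If you'd rather not depend on how each key was written, a defensive sketch that coerces whatever comes back (assuming the same store layout as in the question):

with pd.HDFStore(directory_path) as store:
    obj = store['/metric_heartrate']
    # Promote a Series to a one-column DataFrame; pass a DataFrame through
    hr_df = obj.to_frame() if isinstance(obj, pd.Series) else obj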
I am new to Python and pandas. I have a pandas DataFrame with monthly columns ranging from 2000 (2000-01) to 2016 (2016-06).
I want to find the average of every three months and assign it to a new quarterly column (2000q1). I know I can do the following:
df['2000q1'] = df[['2000-01', '2000-02', '2000-03']].mean(axis=1)
df['2000q2'] = df[['2000-04', '2000-05', '2000-06']].mean(axis=1)
.
.
.
df['2016q2'] = df[['2016-04', '2016-05', '2016-06']].mean(axis=1)
But this is very tedious. I would appreciate it if someone could help me find a better way.
You can use groupby on columns:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
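The integer division builds the grouping key: np.arange(len(df.columns)) // 3 maps columns 0-2 to group 0, columns 3-5 to group 1, and so on, so every three consecutive columns are averaged together.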
Or, the columns can be converted to datetime, and you can use resample:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Here's a demo:
cols = pd.date_range('2000-01', '2000-06', freq='MS')
cols = cols.strftime('%Y-%m')
cols
Out:
array(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'],
dtype='<U7')
df = pd.DataFrame(np.random.randn(10, 6), columns=cols)
df
Out:
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
0 -1.263798 0.251526 0.851196 0.159452 1.412013 1.079086
1 -0.909071 0.685913 1.394790 -0.883605 0.034114 -1.073113
2 0.516109 0.452751 -0.397291 -0.050478 -0.364368 -0.002477
3 1.459609 -1.696641 0.457822 1.057702 -0.066313 -0.910785
4 -0.482623 1.388621 0.971078 -0.038535 0.033167 0.025781
5 -0.016654 1.404805 0.100335 -0.082941 -0.418608 0.588749
6 0.684735 -2.007105 0.552615 1.969356 -0.614634 0.021459
7 0.382475 0.965739 -1.826609 -0.086537 -0.073538 -0.534753
8 1.548773 -0.157250 0.494819 -1.631516 0.627794 -0.398741
9 0.199049 0.145919 0.711701 0.305382 -0.118315 -2.397075
First alternative:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Out:
0 1
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
Second alternative:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Out:
2000-03-31 2000-06-30
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
You can assign this to a DataFrame:
res = df.resample('Q', axis=1).mean()
Change column names as you like:
res = res.rename(columns=lambda col: '{}q{}'.format(col.year, col.quarter))
res
Out:
2000q1 2000q2
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
And attach this to your current DataFrame by:
pd.concat([df, res], axis=1)
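Note that concat returns a new DataFrame, so assign the result back (df = pd.concat([df, res], axis=1)) if you want to keep the monthly and quarterly columns together.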