In df, A and B are label-encoded categories, each belonging to a certain subset (typ). These categories should now be decoded again into metric data taken from a template.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':   [0, 1, 2, 3, 0, 1, 2, 3, 0, 2, 2, 2, 3, 3, 2, 3, 1, 1],
                   'B':   [2, 3, 1, 1, 1, 3, 2, 2, 0, 2, 2, 2, 3, 3, 3, 3, 2, 1],
                   'typ': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
A and B should be decoded to metric (float) data from the templates pivot_A and pivot_B, respectively. In the templates, the headers are the values to replace, the indices are the conditions to match, and the cell values are the new values:
pivot_A = pd.DataFrame(np.array([np.random.rand(9), np.random.rand(9),
                                 np.random.rand(9), np.random.rand(9)]).T,
                       columns=np.unique(df.A),
                       index=np.unique(df.typ))
pivot_B = pd.DataFrame(np.array([np.random.rand(9), np.random.rand(9),
                                 np.random.rand(9), np.random.rand(9)]).T,
                       columns=np.unique(df.B),
                       index=np.unique(df.typ))
pivot_B looks like:
In [5]: pivot_B
Out[5]:
0 1 2 3
type
1 0.326687 0.851405 0.830255 0.721817
2 0.496182 0.769574 0.083379 0.491332
3 0.442760 0.786503 0.593361 0.470658
4 0.100724 0.455841 0.485407 0.211383
5 0.989424 0.852057 0.530137 0.385900
6 0.413897 0.915375 0.708038 0.846020
7 0.548033 0.670561 0.900648 0.742418
8 0.077552 0.310529 0.156794 0.076186
9 0.463480 0.377749 0.876133 0.518022
pivot_A looks like:
In [6]: pivot_A
Out[6]:
0 1 2 3
type
1 0.012808 0.128041 0.001279 0.320740
2 0.615976 0.736491 0.879216 0.842910
3 0.298637 0.828012 0.962703 0.736827
4 0.700053 0.115463 0.670091 0.638931
5 0.416262 0.633604 0.504292 0.983946
6 0.956872 0.129720 0.611625 0.682046
7 0.414579 0.062104 0.118168 0.265530
8 0.162742 0.952069 0.112400 0.837696
9 0.123151 0.061040 0.326437 0.380834
Explained usage of the pivots, as pseudocode:
if df.typ == pivot.index and df.A == X:
    df.A = pivot_A.loc[typ][X]
Decoding could be done by:
for category in [c for c in df.columns if c != 'typ']:
    for col in np.unique(df[category]):
        for type_ in np.unique(df.typ):
            df.loc[(df['typ'] == type_) & (df[category] == col), category] = \
                locals()['pivot_{}'.format(category)].loc[type_, col]
and results in:
In [7]: df
Out[7]:
A B typ
0 0.012808 0.830255 1
1 0.736491 0.491332 2
2 0.962703 0.786503 3
3 0.638931 0.455841 4
4 0.416262 0.852057 5
5 0.129720 0.846020 6
6 0.118168 0.900648 7
7 0.837696 0.156794 8
8 0.123151 0.463480 9
9 0.001279 0.830255 1
10 0.879216 0.083379 2
11 0.962703 0.593361 3
12 0.638931 0.211383 4
13 0.983946 0.385900 5
14 0.611625 0.846020 6
15 0.265530 0.742418 7
16 0.952069 0.156794 8
17 0.061040 0.377749 9
BUT this triple loop does not seem like the best way to do it, right?!
How can I improve the code? pd.replace or dictionaries seem reasonable, but I cannot figure out how to handle them with the extra typ condition.
Melting the triple-loop process down to a single loop already reduces the runtime a lot:
old_values = list(pivot_A.columns)  # from the template
new_values_df = pd.DataFrame()      # to save the decoded values without overwriting the old values
for typ_ in pivot_A.index:          # to match the condition (correct typ in every loop separately)
    new_values = list(pivot_A.loc[typ_])
    new_values_df = pd.concat([(df[df['typ'] == typ_]['A']
                                .replace(old_values, new_values)).to_frame('A'),
                               new_values_df])
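A fully vectorized alternative (a sketch, not from the original post, assuming every typ and every code actually occurs in the pivot's index/columns): Index.get_indexer translates the labels into positions, and NumPy fancy indexing then performs the pairwise lookup with no Python-level loop at all:
decoded = df.copy()
for cat, pivot in (('A', pivot_A), ('B', pivot_B)):
    rows = pivot.index.get_indexer(decoded['typ'])  # row position of each typ
    cols = pivot.columns.get_indexer(decoded[cat])  # column position of each code
    decoded[cat] = pivot.values[rows, cols]         # pairwise (row, col) lookup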
I have a pandas DataFrame which I composed with concat. One row consists of 96 values; I would like to split the DataFrame at column 72, so that the first 72 values of a row are stored in DataFrame1 and the next 24 values in DataFrame2.
I create my DF as follows:
from pandas import DataFrame, concat

temps = DataFrame(myData)
# concatenate temps.shift(72), temps.shift(71), ..., temps.shift(1), temps,
# temps.shift(-1), ..., temps.shift(-23) column-wise: 96 columns in total
datasX = concat([temps.shift(i) for i in range(72, -24, -1)], axis=1)
The question is: how can I split them? :)
Use iloc:
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]
(see the iloc docs)
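A quick sanity check of the resulting shapes (a sketch, assuming datasX has the 96 columns built above):
assert df1.shape[1] == 72 and df2.shape[1] == 24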
Or use np.split(..., axis=1):
Demo:
In [255]: df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
In [256]: df
Out[256]:
a b c d e f
0 0.823638 0.767999 0.460358 0.034578 0.592420 0.776803
1 0.344320 0.754412 0.274944 0.545039 0.031752 0.784564
2 0.238826 0.610893 0.861127 0.189441 0.294646 0.557034
3 0.478562 0.571750 0.116209 0.534039 0.869545 0.855520
4 0.130601 0.678583 0.157052 0.899672 0.093976 0.268974
In [257]: dfs = np.split(df, [4], axis=1)
In [258]: dfs[0]
Out[258]:
a b c d
0 0.823638 0.767999 0.460358 0.034578
1 0.344320 0.754412 0.274944 0.545039
2 0.238826 0.610893 0.861127 0.189441
3 0.478562 0.571750 0.116209 0.534039
4 0.130601 0.678583 0.157052 0.899672
In [259]: dfs[1]
Out[259]:
e f
0 0.592420 0.776803
1 0.031752 0.784564
2 0.294646 0.557034
3 0.869545 0.855520
4 0.093976 0.268974
np.split() is pretty flexible; let's split the original DF into 3 DFs at the columns with indexes [2, 3]:
In [260]: dfs = np.split(df, [2,3], axis=1)
In [261]: dfs[0]
Out[261]:
a b
0 0.823638 0.767999
1 0.344320 0.754412
2 0.238826 0.610893
3 0.478562 0.571750
4 0.130601 0.678583
In [262]: dfs[1]
Out[262]:
c
0 0.460358
1 0.274944
2 0.861127
3 0.116209
4 0.157052
In [263]: dfs[2]
Out[263]:
d e f
0 0.034578 0.592420 0.776803
1 0.545039 0.031752 0.784564
2 0.189441 0.294646 0.557034
3 0.534039 0.869545 0.855520
4 0.899672 0.093976 0.268974
I generally use np.array_split because of its simpler syntax, and because it scales better to more than 2 partitions.
import numpy as np
partitions = 2
dfs = np.array_split(df, partitions)  # splits along axis=0 (rows) by default
np.split(df, [100, 200, 300], axis=0) wants explicit index numbers, which may or may not be desirable.
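Applied to the 96-column frame from the question above, the two spellings look like this (a sketch, assuming datasX exists):
halves = np.array_split(datasX, 2, axis=1)  # two equal blocks of 48 columns
df1, df2 = np.split(datasX, [72], axis=1)   # explicit cut point: 72 and 24 columns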
I am new to Python and pandas. I have a pandas DataFrame with monthly columns ranging from 2000 (2000-01) to 2016 (2016-06).
I want to find the average of every three months and assign it to a new quarterly column (e.g. 2000q1). I know I can do the following:
df['2000q1'] = df[['2000-01', '2000-02', '2000-03']].mean(axis=1)
df['2000q2'] = df[['2000-04', '2000-05', '2000-06']].mean(axis=1)
...
df['2016q2'] = df[['2016-04', '2016-05', '2016-06']].mean(axis=1)
But this is very tedious. I would appreciate it if someone could help me find a better way.
You can use groupby on the columns; integer-dividing the column positions by 3 assigns each consecutive block of three columns the same group key:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Or, since the columns can be converted to datetime, you can use resample:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Here's a demo:
cols = pd.date_range('2000-01', '2000-06', freq='MS')
cols = cols.strftime('%Y-%m')
cols
Out:
array(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'],
dtype='<U7')
df = pd.DataFrame(np.random.randn(10, 6), columns=cols)
df
Out:
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
0 -1.263798 0.251526 0.851196 0.159452 1.412013 1.079086
1 -0.909071 0.685913 1.394790 -0.883605 0.034114 -1.073113
2 0.516109 0.452751 -0.397291 -0.050478 -0.364368 -0.002477
3 1.459609 -1.696641 0.457822 1.057702 -0.066313 -0.910785
4 -0.482623 1.388621 0.971078 -0.038535 0.033167 0.025781
5 -0.016654 1.404805 0.100335 -0.082941 -0.418608 0.588749
6 0.684735 -2.007105 0.552615 1.969356 -0.614634 0.021459
7 0.382475 0.965739 -1.826609 -0.086537 -0.073538 -0.534753
8 1.548773 -0.157250 0.494819 -1.631516 0.627794 -0.398741
9 0.199049 0.145919 0.711701 0.305382 -0.118315 -2.397075
First alternative:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Out:
0 1
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
Second alternative:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Out:
2000-03-31 2000-06-30
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
You can assign the result to a variable:
res = df.resample('Q', axis=1).mean()
Change column names as you like:
res = res.rename(columns=lambda col: '{}q{}'.format(col.year, col.quarter))
res
Out:
2000q1 2000q2
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
And attach this to your current DataFrame by:
pd.concat([df, res], axis=1)
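As an aside (not part of the original answer), a PeriodIndex yields quarter labels directly, if the uppercase 2000Q1-style format is acceptable:
res = df.resample('Q', axis=1).mean()
res.columns = res.columns.to_period('Q')  # columns display as 2000Q1, 2000Q2, ...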
I have a pandas DataFrame:
In [1]: df = DataFrame(np.random.randn(10, 4))
Is there a way I can select only the columns whose last-row value is > 0?
The desired output would be a new DataFrame containing all rows of the columns where the last row is > 0.
In [201]: df = pd.DataFrame(np.random.randn(10, 4))
In [202]: df
Out[202]:
0 1 2 3
0 -1.380064 0.391358 -0.043390 -1.970113
1 -0.612594 -0.890354 -0.349894 -0.848067
2 1.178626 1.798316 0.691760 0.736255
3 -0.909491 0.429237 0.766065 -0.605075
4 -1.214366 1.907580 -0.583695 0.192488
5 -0.283786 -1.315771 0.046579 -0.777228
6 1.195634 -0.259040 -0.432147 1.196420
7 -2.346814 1.251494 0.261687 0.400886
8 0.845000 0.536683 -2.628224 -0.238449
9 0.246398 -0.548448 -0.295481 0.076117
In [203]: df.iloc[:, (df.iloc[-1] > 0).values]
Out[203]:
0 3
0 -1.380064 -1.970113
1 -0.612594 -0.848067
2 1.178626 0.736255
3 -0.909491 -0.605075
4 -1.214366 0.192488
5 -0.283786 -0.777228
6 1.195634 1.196420
7 -2.346814 0.400886
8 0.845000 -0.238449
9 0.246398 0.076117
Basically this solution uses very basic pandas indexing, in particular the iloc method.
You can use the boolean series generated from the condition to index the columns of interest:
In [30]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[30]:
0 1 2 3
0 -0.667736 -0.744761 0.401677 -1.286372
1 1.098134 -1.327454 1.409357 -0.180265
2 -0.105780 0.446195 -0.562578 -0.746083
3 1.366714 -0.685103 0.982354 1.928026
4 0.091040 -0.689676 0.425042 0.723466
5 0.798305 -1.454922 -0.017695 0.515961
6 -0.786693 1.496968 -0.112125 -1.303714
7 -0.211216 -1.321854 -0.892023 -0.583492
8 1.293255 0.936271 1.873870 0.790086
9 -0.699665 -0.953611 0.139986 -0.200499
In [32]:
df[df.columns[df.iloc[-1]>0]]
Out[32]:
2
0 0.401677
1 1.409357
2 -0.562578
3 0.982354
4 0.425042
5 -0.017695
6 -0.112125
7 -0.892023
8 1.873870
9 0.139986
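For completeness, an equivalent loc spelling (a sketch; loc accepts a boolean mask on the column axis directly):
df.loc[:, df.iloc[-1] > 0]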
Check out pandasql: https://pypi.python.org/pypi/pandasql
This blog post is a great tutorial for using SQL for Pandas DataFrames: http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html
This should get you started:
from pandasql import sqldf
import pandas

def pysqldf(q):
    return sqldf(q, globals())

q = """
SELECT *
FROM df
WHERE value > 0
ORDER BY 1;
"""

df = pysqldf(q)
I am using a pandas DataFrame to store data from a series of experiments so that I can easily make cuts across various parameter values for the next stage of analysis. I have a few questions about how to do this most effectively.
Currently I create my DataFrame from a dictionary of lists. There are typically a few thousand rows in the DataFrame. One of the columns is a device_id, which indicates which of the 20 devices the experimental data pertains to. Other columns include info about the experimental setup, like temperature and power, and measurement results, like resonant_frequency and bandwidth.
So far, I've been using this DataFrame rather "naively": I use it sort of like a numpy record array, so I don't think I'm fully taking advantage of its power. The following are some examples of what I'm trying to achieve.
First, I want to create a new column which is the maximum resonant_frequency measured for a given device over all experiments; call it max_freq. I do this like so:
df['max_freq'] = np.zeros(df.shape[0])  # create the new column
for index in np.unique(df.device_index):
    group = df[df.device_index == index]
    group_max = group.resonant_frequency.max()
    df.loc[df.device_index == index, 'max_freq'] = group_max
Second, one of my columns contains 1-D numpy arrays of a noise measurement. I want to compute a statistic on each array and put it into a new column. Currently I do this as:
noise_est = []
for vals, freq in zip(df.noise, df.resonant_freq):
    noise_est.append(vals.std() / (1e6 * freq))
df['noise_est'] = noise_est
Third, related to the previous one: is it possible to iterate through the rows of a DataFrame where the resulting object has attribute access to the columns? I.e. something like:
for row in df:
    row.noise_est = row.noise.std() / (1e6 * row.resonant_freq)
I know that this instead iterates through columns. I also know there is an iterrows method, but this provides a Series, which doesn't give the attribute access I'm after.
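(As an aside, itertuples yields namedtuples, which do give read-only attribute access as long as the column names are valid identifiers; writing back still has to go through the frame. A sketch:)
noise_est = [row.noise.std() / (1e6 * row.resonant_freq)
             for row in df.itertuples()]
df['noise_est'] = noise_est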
I think this should get me started for now, thanks for your time!
Edited to add df.info() and df.head() as requested:
df.info() # df.head() looks the same, but 5 non-null values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9620 entries, 0 to 9619
Data columns (total 83 columns):
A_mag 9620 non-null values
A_mag_err 9620 non-null values
A_phase 9620 non-null values
A_phase_err 9620 non-null values
....
total_dac_atten 9600 non-null values
round_temp 9620 non-null values
dtypes: bool(1), complex128(4), float64(39), int64(12), object(27)
I trimmed this down because it's 83 columns, and I don't think this adds much to the example code snippets I shared, but have posted this bit in case it's helpful.
Create the data. Note that storing a numpy array INSIDE a frame is generally not a good idea, as it's pretty inefficient.
In [84]: df = pd.DataFrame(dict(A = np.random.randn(20), B = np.random.randint(0,3,size=20), C = [ np.random.randn(5) for i in range(20) ]))
In [85]: df
Out[85]:
A B C
0 -0.493730 1 [-0.8790126045, -1.87366673214, 0.76227570837,...
1 -0.105616 2 [0.612075134682, -1.64452324091, 0.89799758012...
2 1.487656 1 [-0.379505426885, 1.17611806172, 0.88321152932...
3 0.351694 2 [0.132071242514, -1.54701609348, 1.29813626801...
4 -0.330538 2 [0.395383858214, 0.874419943107, 1.21124463921...
5 0.360041 0 [0.439133138619, -1.98615530266, 0.55971723554...
6 -0.505198 2 [-0.770830608002, 0.243255072359, -1.099514797...
7 0.631488 1 [0.676233200011, 0.622926691271, -0.1110029751...
8 1.292087 1 [1.77633938532, -0.141683361957, 0.46972952154...
9 0.641987 0 [1.24802709304, 0.477527098462, -0.08751885691...
10 0.732596 2 [0.475771915314, 1.24219702097, -0.54304296895...
11 0.987054 1 [-0.879620967644, 0.657193159735, -0.093519342...
12 -1.409455 1 [1.04404325784, -0.310849157425, 0.60610368623...
13 1.063830 1 [-0.760467872808, 1.33659372288, -0.9343171844...
14 0.533835 1 [0.985463451645, 1.76471927635, -0.59160181340...
15 0.062441 1 [-0.340170594584, 1.53196133354, 0.42397775978...
16 1.458491 2 [-1.79810090668, -1.82865815817, 1.08140831482...
17 -0.886119 2 [0.281341969073, -1.3516126536, 0.775326038501...
18 0.662076 1 [1.03992509625, 1.17661862104, -0.562683934951...
19 1.216878 2 [0.0746149754367, 0.156470450639, -0.477269150...
In [86]: df.dtypes
Out[86]:
A float64
B int64
C object
dtype: object
Apply an operation to each value of a series (questions 2 and 3):
In [88]: df['C_std'] = df['C'].apply(np.std)
Get the max of each group and broadcast it back to the rows (question 1):
In [91]: df['A_max_by_group'] = df.groupby('B')['A'].transform(lambda x: x.max())
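As a side note, passing the reduction by name is equivalent and usually faster than the lambda:
df['A_max_by_group'] = df.groupby('B')['A'].transform('max')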
In [92]: df
Out[92]:
A B C A_max_by_group C_std
0 -0.493730 1 [-0.8790126045, -1.87366673214, 0.76227570837,... 1.487656 1.058323
1 -0.105616 2 [0.612075134682, -1.64452324091, 0.89799758012... 1.458491 0.987980
2 1.487656 1 [-0.379505426885, 1.17611806172, 0.88321152932... 1.487656 1.264522
3 0.351694 2 [0.132071242514, -1.54701609348, 1.29813626801... 1.458491 1.150026
4 -0.330538 2 [0.395383858214, 0.874419943107, 1.21124463921... 1.458491 1.045408
5 0.360041 0 [0.439133138619, -1.98615530266, 0.55971723554... 0.641987 1.355853
6 -0.505198 2 [-0.770830608002, 0.243255072359, -1.099514797... 1.458491 0.443872
7 0.631488 1 [0.676233200011, 0.622926691271, -0.1110029751... 1.487656 0.432342
8 1.292087 1 [1.77633938532, -0.141683361957, 0.46972952154... 1.487656 1.021847
9 0.641987 0 [1.24802709304, 0.477527098462, -0.08751885691... 0.641987 0.676835
10 0.732596 2 [0.475771915314, 1.24219702097, -0.54304296895... 1.458491 0.857441
11 0.987054 1 [-0.879620967644, 0.657193159735, -0.093519342... 1.487656 0.628655
12 -1.409455 1 [1.04404325784, -0.310849157425, 0.60610368623... 1.487656 0.835633
13 1.063830 1 [-0.760467872808, 1.33659372288, -0.9343171844... 1.487656 0.936746
14 0.533835 1 [0.985463451645, 1.76471927635, -0.59160181340... 1.487656 0.991327
15 0.062441 1 [-0.340170594584, 1.53196133354, 0.42397775978... 1.487656 0.700299
16 1.458491 2 [-1.79810090668, -1.82865815817, 1.08140831482... 1.458491 1.649771
17 -0.886119 2 [0.281341969073, -1.3516126536, 0.775326038501... 1.458491 0.910355
18 0.662076 1 [1.03992509625, 1.17661862104, -0.562683934951... 1.487656 0.666237
19 1.216878 2 [0.0746149754367, 0.156470450639, -0.477269150... 1.458491 0.275065