I am writing an application that makes use of pandas (version 0.10.1) to store the underlying data model as a (3-level) MultiIndex'ed DataFrame. The model is a line spectrum, and the top level of the index is the atomic transition.
A simple dataframe could look like this:
Pos Sigma Ampl Line center Identifier
H-alpha-6697.6 30-30 Comp2 -3.600 0.774000 33.058000 6699.5 b
Comp3 3.538 2.153000 28.054000 6699.5 c
Contin NaN NaN 0.000000 NaN NaN
Comp4 1.384 0.921000 37.504000 6699.5 d
Comp1 -2.124 1.977000 69.166000 6699.5 a
31-31 Comp2 -3.292 0.884603 49.813423 6699.5 b
Comp3 3.600 2.299000 19.999000 6699.5 c
Contin NaN NaN 0.000000 NaN NaN
Comp4 1.692 1.009000 22.222000 6699.5 d
Comp1 -1.262 2.534000 68.002000 6699.5 a
At some point, I need to be able to create a different transition, e.g. H-beta, using H-alpha as a template. I would ideally do this by something like df.ix['H-beta-wavelength'] = df.ix['H-alpha-6697.6'], but this is not possible to do. So instead, I tried following this example: Prepend a level to a pandas MultiIndex
However, the example above requires the .names of the MultiIndex levels to be set in order to reorder them. The names attribute is set when initializing the dataframe, but while building it I rely quite extensively on the set_value() method, and doing this destroys the names attribute, or rather sets the names to [None, None, None].
Example:
In [68]: df
Out[68]:
Pos Sigma Ampl Line center Identifier
Transition Rows Component
Center: 6699.5 26-26 Comp2 -3.846 0.657 15.2740 6699.5 b
Comp3 2.924 1.449 31.3930 6699.5 c
Contin NaN NaN 0.0000 NaN NaN
Comp4 8.030 1.009 7.0831 6699.5 d
Comp1 -1.816 2.153 50.2750 6699.5 a
In [69]: df.set_value(('Center: 5044.3', '26-26', 'Comp1'), 'Sigma', 2.457)
Out[69]:
Pos Sigma Ampl Line center Identifier
Center: 6699.5 26-26 Comp2 -3.846 0.657 15.2740 6699.5 b
Comp3 2.924 1.449 31.3930 6699.5 c
Contin NaN NaN 0.0000 NaN NaN
Comp4 8.030 1.009 7.0831 6699.5 d
Comp1 -1.816 2.153 50.2750 6699.5 a
Center: 5044.3 26-26 Comp1 NaN 2.457 NaN NaN NaN
Of course, this makes it quite hard to use the names for reordering the levels of the MultiIndex. Is there a way to avoid this, short of brute-force resetting the names every time I've run set_value()?
EDIT: simpler, reproducible example.
Here is an IPython session recreating the index.names problem with a somewhat simpler example. It also shows that this is possibly a bug that goes beyond index.names, as it seems to change index.lexsort_depth from 3 to 0. Missing numbers in the prompt are just unnecessary views of the dataframe.
I believe that one must choose secondary and/or tertiary indices that already exist like I have done below in order to reproduce it.
In [4]: idx = pd.MultiIndex.from_arrays(
[['Hans']*4 + ['Grethe']*4, ['1', '1', '2', '2']*2, ['a', 'b']*4],
names=['Name', 'Number', 'Letter'])
In [5]: df = pd.DataFrame(
random.random((8, 3)),
columns=['one', 'two','three'],
index=idx)
In [6]: df
Out[6]:
one two three
Name Number Letter
Hans 1 a 0.803566 0.434574 0.805976
b 0.655322 0.208469 0.989559
2 a 0.893952 0.380358 0.173764
b 0.822446 0.673894 0.676573
Grethe 1 a 0.202641 0.387263 0.405296
b 0.646733 0.086953 0.882114
2 a 0.358458 0.147107 0.769586
b 0.183782 0.477863 0.601098
# To rule out another possible source of problems:
In [9]: df.unstack().drop(('Grethe', '1')).stack()
Out[9]:
one two three
Name Number Letter
Grethe 2 a 0.358458 0.147107 0.769586
b 0.183782 0.477863 0.601098
Hans 1 a 0.803566 0.434574 0.805976
b 0.655322 0.208469 0.989559
2 a 0.893952 0.380358 0.173764
b 0.822446 0.673894 0.676573
In [10]: df.set_value(('Frans', '2', 'b'), 'one', 23.)
Out[10]:
one two three
Hans 1 a 0.803566 0.434574 0.805976
b 0.655322 0.208469 0.989559
2 a 0.893952 0.380358 0.173764
b 0.822446 0.673894 0.676573
Grethe 1 a 0.202641 0.387263 0.405296
b 0.646733 0.086953 0.882114
2 a 0.358458 0.147107 0.769586
b 0.183782 0.477863 0.601098
Frans 2 b 23.000000 NaN NaN
In [11]: df = df.sortlevel(level='Name')
In [13]: df.index.lexsort_depth
Out[13]: 3
In [14]: df.set_value(('Frans', '2', 'b'), 'one', 23.).index.lexsort_depth
Out[14]: 0
Your index needs to be sorted! See docs here: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-need-for-sortedness and these recipes may help http://pandas.pydata.org/pandas-docs/dev/cookbook.html
This is 0.10.1 as well
Here's a sorted frame:
In [26]: index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
names=['first', 'second'])
In [27]: df = pd.DataFrame(np.random.rand(len(index)), index=index,columns=['A'])
In [7]: df.index.lexsort_depth
Out[7]: 2
In [28]: df.set_value(('a',1),'A',1)
Out[28]:
A
first second
a 1 1.000000
2 0.136456
b 1 0.712612
2 0.818473
And if I sort by the 2nd level (so it's unsorted):
In [29]: df2 = df.sortlevel(level='second')
# this is not sorted! (well it is, just not lexsorted)
In [10]: df2.index.lexsort_depth
Out[10]: 0
In [30]: df2.set_value(('b','1'),'A',2)
Out[30]:
A
a 1 1.000000
b 1 0.712612
a 2 0.136456
b 2 0.818473
1 2.000000
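For readers on a current pandas release, here is a minimal sketch of the sortedness point made above. It assumes a recent pandas version, where sortlevel() no longer exists and sort_index(level=...) takes its place; the small frame and the names df2 and df3 are only illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["first", "second"]
)
df = pd.DataFrame({"A": np.arange(4.0)}, index=index)

# Sorting on the inner level leaves the index as a whole not lexsorted
df2 = df.sort_index(level="second")
print(df2.index.is_monotonic_increasing)

# A plain sort_index() sorts on all levels again and keeps the level names
df3 = df2.sort_index()
print(df3.index.is_monotonic_increasing)
print(list(df3.index.names))
```

Checking is_monotonic_increasing before relying on slicing or partial indexing is a cheap way to notice that an enlargement step has left the index unsorted.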
So according to Andy Hayden, this is a names bug in pandas.
Hopefully a fix will come soon.
Until then, I believe the best way to do this is to do the following:
tmp = df.ix['ExistingTransition'].copy()
tmp['Transition'] = 'NewTransition'
tmp = tmp.set_index('Transition', append=True)
tmp.index = tmp.index.reorder_levels([2, 0, 1])
# ...Do whatever else needs to be done to this before applying as template...
df = df.append(tmp)
...That, or making sure that the names attribute is recreated after each call to set_value(), and then just going by the example linked in the question.
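The same template idea can be sketched for a current pandas release, where df.ix, set_value() and DataFrame.append() have all been removed. This is only one possible approach, not the fix referred to above: it uses xs() to pull out the template block and pd.concat with a dict to prepend the new top-level key. The frame reuses the Hans/Grethe example from the question; the names template, new_block and result are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["Hans"] * 4 + ["Grethe"] * 4, ["1", "1", "2", "2"] * 2, ["a", "b"] * 4],
    names=["Name", "Number", "Letter"],
)
df = pd.DataFrame(
    np.random.random((8, 3)), columns=["one", "two", "three"], index=idx
)

# Sub-frame under one top-level key, indexed by the two remaining levels
template = df.xs("Hans", level="Name").copy()

# ...modify the template here as needed...

# Passing a dict to pd.concat prepends its key as a new outer level;
# names= labels that new level
new_block = pd.concat({"Frans": template}, names=["Name"])

# Combine with the original and sort so the index is lexsorted again
result = pd.concat([df, new_block]).sort_index()
print(result.index.names)
```

Because both pieces carry the same level names, the combined index keeps them, so no manual resetting of names is needed afterwards.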
I need to slice columns on a DataFrameGroupBy object.
For example, given a DataFrame with columns A-Z, if I want columns A-C I can use .loc[:, 'A':'C'], but with a DataFrameGroupBy I can't use slicing, so I have to write [['A', 'B', 'C']].
Take a look here:
from numpy import around
from numpy.random import uniform
from pandas import DataFrame
from string import ascii_lowercase
data = around(a=uniform(low=1.0, high=50.0, size=(6, len(ascii_lowercase) + 1)), decimals=3)
df = DataFrame(data=data, columns=['group'] + list(ascii_lowercase), dtype='float64')
rows, columns = df.shape
df.loc[:rows // 2, 'group'] = 1.0
df.loc[rows // 2:, 'group'] = 2.0
print(df)
abc = df.groupby(by='group')[['a', 'b', 'c']].shift(periods=1)
print(abc)
Output of df is:
group a b c ... w x y z
0 1.0 22.380 36.873 10.073 ... 26.052 38.625 48.122 33.841
1 1.0 16.702 32.160 35.018 ... 12.990 17.878 19.297 16.330
2 1.0 9.957 25.202 7.106 ... 46.500 12.932 37.401 43.134
3 2.0 42.395 40.616 24.611 ... 30.436 33.521 42.136 2.690
4 2.0 2.069 29.891 2.217 ... 20.734 12.365 9.302 47.019
5 2.0 4.208 23.955 33.966 ... 45.439 16.488 32.892 9.345
Output of abc is:
a b c
0 NaN NaN NaN
1 22.380 36.873 10.073
2 16.702 32.160 35.018
3 NaN NaN NaN
4 42.395 40.616 24.611
5 2.069 29.891 2.217
How can I avoid writing [['a', 'b', 'c']]? I have 105 columns that I would need to list there, so I want to use slicing like .loc[:, 'a':'c'].
Thank you all :)
You can group by the Series df['group'], so it is possible to select the columns before the groupby and pass only the filtered columns:
abc = df.loc[:, 'a':'c'].groupby(by=df['group']).shift(periods=1)
print(abc)
a b c
0 NaN NaN NaN
1 37.999 21.197 39.527
2 35.560 27.214 23.211
3 NaN NaN NaN
4 49.053 11.319 37.279
5 27.881 38.529 46.550
Another idea is to use:
cols = df.loc[:, 'a':'c'].columns
abc = df.groupby(by='group')[cols].shift(periods=1)
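As a quick sanity check, the sketch below compares the sliced variants with the explicit column list on a small deterministic frame. The data and the names by_list, by_slice and by_cols are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "b": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        "c": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "d": [7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
    }
)

# Explicit list of columns
by_list = df.groupby(by="group")[["a", "b", "c"]].shift(periods=1)

# Slice first, then group by the 'group' Series
by_slice = df.loc[:, "a":"c"].groupby(by=df["group"]).shift(periods=1)

# Build the column list from a label slice, then select on the groupby
cols = df.loc[:, "a":"c"].columns
by_cols = df.groupby(by="group")[cols].shift(periods=1)

print(by_slice.equals(by_list), by_cols.equals(by_list))
```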
I have a dataframe df with NaN values and I want to dynamically replace them with the average values of previous and next non-missing values.
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.
Therefore, the result dataframe should look like:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Is this a good way to deal with the missing values? I can't simply replace them by the average values of each column because my data is time-series and tends to increase over time. (The initial value may be $0 and final value might be $100000, so the average is $50000 which can be much bigger/smaller than the NaN values).
It helps to understand the logic behind your average: it is a geometric progression.
s=df.isnull().cumsum()
t1=df[(s==1).shift(-1).fillna(False)].stack().reset_index(level=0,drop=True)
t2=df.lookup(s.idxmax()+1,s.idxmax().index)
df.fillna(t1/(2**s)+t2*(1-0.5**s)*2/2)
Out[212]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621107 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Explanation:
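Here x is the last valid value before the gap and y is the next valid value after it.
1st NaN x/2+y/2=1st
2nd NaN 1st/2+y/2=2nd
3rd NaN 2nd/2+y/2=3rd
Then the n-th NaN is x/(2**n)+(y/2)*(1-(1/2)**n)/(1-1/2) = x/(2**n)+y*(1-(1/2)**n); this is the key.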
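The closed form can be checked numerically against the step-by-step rule. The sketch below uses the values surrounding the gap in column A of the question; the helper names fill_iteratively and fill_closed_form are hypothetical and not part of pandas.

```python
def fill_iteratively(x, y, n):
    """Apply value = (previous + y) / 2 n times, starting from x."""
    value = x
    for _ in range(n):
        value = (value + y) / 2
    return value


def fill_closed_form(x, y, n):
    """Geometric-progression form of the same rule for the n-th NaN."""
    return x / 2**n + y * (1 - 0.5**n)


x, y = -0.120211, -0.788073  # A[2] and A[5] from the question
for n in (1, 2):
    print(n, fill_iteratively(x, y, n), fill_closed_form(x, y, n))
```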
I had a similar problem.
The following code worked for me.
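def fill_nan_with_mean_from_prev_and_next(df):
    # Positions of the rows that contain at least one NaN
    nan_rows = pd.isnull(df).any(axis=1).to_numpy().nonzero()[0]
    null_df = df.isnull()
    for row in nan_rows:
        for col in range(df.shape[1]):
            if null_df.iloc[row, col]:
                # Mean of the rows directly above and below; stays NaN
                # if the row below is also missing
                df.iloc[row, col] = (df.iloc[row - 1, col] + df.iloc[row + 1, col]) / 2
    return df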
Maybe it helps someone too.
As Ben.T mentioned above, if you have another group of NaN in the same column, you can consider this lazy solution :)
for column in df:
    for ind, row in df[[column]].iterrows():
        if not np.isnan(row[column]):
            previous = row[column]
        else:
            # Scan forward to the next non-NaN value in this column
            indx = ind + 1
            while np.isnan(df.loc[indx, column]):
                indx += 1
            nxt = df.loc[indx, column]
            # Fill the gap and reuse the filled value for the next NaN
            previous = df.loc[ind, column] = (previous + nxt) / 2
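For reference, here is a self-contained sketch of the rule from the question (each NaN becomes the mean of the already filled value above it and the next non-missing value below it) that handles several NaN groups per column. The function name fill_recursive_mean is hypothetical, and leaving leading or trailing NaNs untouched is an assumption, since the question does not say how to treat them.

```python
import numpy as np
import pandas as pd


def fill_recursive_mean(s: pd.Series) -> pd.Series:
    """Fill interior NaNs with (value above + next valid value below) / 2."""
    nxt = s.bfill().to_numpy(dtype=float)  # next valid value at or after each row
    vals = s.to_numpy(dtype=float).copy()
    for i in range(1, len(vals)):
        # Only fill when both neighbours exist; edge NaNs are left as they are
        if np.isnan(vals[i]) and not np.isnan(vals[i - 1]) and not np.isnan(nxt[i]):
            vals[i] = (vals[i - 1] + nxt[i]) / 2
    return pd.Series(vals, index=s.index, name=s.name)


a = pd.Series(
    [-0.166919, -0.297953, -0.120211, np.nan, np.nan,
     -0.788073, -0.916080, -0.887858, 1.948430, 0.019698],
    name="A",
)
print(fill_recursive_mean(a))
```

For a whole frame, df.apply(fill_recursive_mean) runs the same function column by column.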
I have a pandas DataFrame and I want to calculate, on a rolling basis, the average of all the values: across all the columns, for all the observations in the rolling window.
I have a solution with loops, but it feels very inefficient. Note that I can have NaNs in my data, so calculating the sum and dividing by the shape of the window would not be safe (as I want a nanmean).
Any better approach?
Setup
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=['A', 'B'])
df[df>5] = np.nan # EDIT: add nans
My Attempt
n_roll = 2
df_stacked = df.values
roll_avg = {}
for idx in range(n_roll, len(df_stacked)+1):
    roll_avg[idx-1] = np.nanmean(df_stacked[idx - n_roll:idx, :].flatten())
roll_avg = pd.Series(roll_avg)
roll_avg.index = df.index[n_roll-1:]
roll_avg = roll_avg.reindex(df.index)
Desired Result
roll_avg
Out[33]:
0 NaN
1 5.000000
2 1.666667
3 0.333333
4 1.000000
5 3.000000
6 3.250000
7 3.250000
8 3.333333
9 4.000000
Thanks!
Here's one NumPy solution with sliding windows off view_as_windows -
from skimage.util.shape import view_as_windows
# Setup o/p array
out = np.full(len(df),np.nan)
# Get sliding windows of length n_roll along axis=0
w = view_as_windows(df.values,(n_roll,1))[...,0]
# Assign nan-ignored mean values computed along last 2 axes into o/p
out[n_roll-1:] = np.nanmean(w, (1,2))
Memory efficiency with views -
In [62]: np.shares_memory(df,w)
Out[62]: True
To get the same result when NaNs are present, you can use column_stack on df.shift(i).values for i in range(n_roll), take nanmean along axis=1, and then replace the first n_roll-1 values with NaN afterwards:
roll_avg = pd.Series(np.nanmean(np.column_stack([df.shift(i).values for i in range(n_roll)]),1))
roll_avg[:n_roll-1] = np.nan
and with the second input with nan, you get as expected
0 NaN
1 5.000000
2 1.666667
3 0.333333
4 1.000000
5 3.000000
6 3.250000
7 3.250000
8 3.333333
9 4.000000
dtype: float64
Using the answer referenced in the comment, one can do:
wsize = n_roll
cols = df.shape[1]
out = df.stack(dropna=False).rolling(window=wsize * cols, min_periods=1).mean().reset_index(-1, drop=True).sort_index()
out = out.groupby(out.index).last()
out.iloc[:n_roll-1] = np.nan
In my case it was important to specify dropna=False in stack, otherwise the length of the rolling window would not be correct.
But I am looking forward to other approaches as this does not feel very elegant/efficient.
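Another option, offered here only as a sketch, is to avoid stacking altogether: a NaN-ignoring mean over the window is the rolling sum of the per-row sums divided by the rolling sum of the per-row non-NaN counts. The setup repeats the one from the question, with an explicit cast to float before inserting NaNs; row_sum and row_cnt are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(
    np.random.randint(0, 10, size=(10, 2)), columns=["A", "B"]
).astype(float)
df[df > 5] = np.nan
n_roll = 2

row_sum = df.sum(axis=1)    # per-row sum, NaNs skipped
row_cnt = df.count(axis=1)  # per-row number of non-NaN values

# Window total divided by window count; the first n_roll-1 rows are NaN
roll_avg = row_sum.rolling(n_roll).sum() / row_cnt.rolling(n_roll).sum()
print(roll_avg)
```

A window that contains no valid values at all gives 0/0, which is NaN, so it behaves like nanmean in that case too.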
Given a pandas.DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
I would like to know the best way to replace all values in the first row with a 0 (or some other specific value) and work with the new dataframe. I would like to do this in a general way, where there may be more or fewer columns than in this example.
Despite the simplicity of the question, I was not able to find a solution. Most examples posted by others had to do with fillna() and related methods.
You can use iloc to do that pretty cleanly like:
Code:
df.iloc[0] = 0
Test Code:
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(df)
df.iloc[0] = 0
print(df)
Results:
a b c d e
0 0.715524 -0.914676 0.241008 -1.353033 0.170578
1 -0.300348 1.118491 -0.520407 0.185877 -0.950839
2 1.942239 0.980477 0.110457 -0.558483 0.903775
3 0.400923 1.347769 -0.120445 0.036253 0.683571
4 -0.761881 -0.642469 2.030019 2.274070 -0.067672
5 0.566003 0.263949 -0.567247 0.689599 0.870442
6 1.904812 -0.689312 1.400950 1.942681 -1.268679
7 -0.253381 0.464208 1.362960 0.129433 0.527576
8 -1.404035 0.174586 1.006268 0.007333 1.172559
9 0.330404 0.735610 1.277451 -0.104888 0.528356
a b c d e
0 0.000000 0.000000 0.000000 0.000000 0.000000
1 -0.300348 1.118491 -0.520407 0.185877 -0.950839
2 1.942239 0.980477 0.110457 -0.558483 0.903775
3 0.400923 1.347769 -0.120445 0.036253 0.683571
4 -0.761881 -0.642469 2.030019 2.274070 -0.067672
5 0.566003 0.263949 -0.567247 0.689599 0.870442
6 1.904812 -0.689312 1.400950 1.942681 -1.268679
7 -0.253381 0.464208 1.362960 0.129433 0.527576
8 -1.404035 0.174586 1.006268 0.007333 1.172559
9 0.330404 0.735610 1.277451 -0.104888 0.528356
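If the original frame should stay untouched, one option is to assign on a copy. This is a small sketch along those lines; the names new_df and fill_value are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5), columns=["a", "b", "c", "d", "e"])

fill_value = 0
new_df = df.copy()           # leave the original frame as it is
new_df.iloc[0] = fill_value  # works for any number of columns

print(new_df.head(3))
```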
Assume I have an M (rows) by N (columns) DataFrame
df = pandas.DataFrame([...])
and a vector of length N
windows = [1,2,..., N]
I would like to apply a moving average function to each column in df, but would like the moving average to have different length for each column (e.g. column1 has MA length 1, column 2 has MA length 2, etc) - these lengths are contained in windows
Are there built-in functions to do this quickly? I'm aware of df.apply(lambda a: f(a), axis=0, args=...), but it's unclear how to apply different args for each column.
Here's one way to do it:
In [15]: dfrm
Out[15]:
A B C
0 0.948898 0.587032 0.131551
1 0.385582 0.275673 0.107135
2 0.849599 0.696882 0.313717
3 0.993080 0.510060 0.287691
4 0.994823 0.441560 0.632076
5 0.711145 0.760301 0.813272
6 0.932131 0.531901 0.393798
7 0.965915 0.812821 0.287819
8 0.782890 0.478565 0.960353
9 0.908078 0.850664 0.912878
In [16]: windows
Out[16]: [1, 2, 3]
In [17]: pandas.DataFrame(
{c: dfrm[c].rolling(windows[i]).mean() for i, c in enumerate(dfrm.columns)}
)
Out[17]:
A B C
0 0.948898 NaN NaN
1 0.385582 0.431352 NaN
2 0.849599 0.486277 0.184134
3 0.993080 0.603471 0.236181
4 0.994823 0.475810 0.411161
5 0.711145 0.600931 0.577680
6 0.932131 0.646101 0.613049
7 0.965915 0.672361 0.498296
8 0.782890 0.645693 0.547323
9 0.908078 0.664614 0.720350
As @Manish Saraswat mentioned in the comments, there is more than one spelling of this: older pandas versions used the function form pandas.rolling_mean(dfrm[c], windows[i]), which has since been removed in favor of dfrm[c].rolling(windows[i]).mean(). Further, you can use sequences as the items in windows if you want, and they would express a custom window shape (size and weights), or any of the other options with different rolling aggregations and keywords.
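The same per-column windows can also be written with DataFrame.apply, which answers the question's point about passing a different argument for each column. This sketch looks the window up by column name; the small frame and the dict window_by_col are made up for illustration.

```python
import pandas as pd

dfrm = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [2.0, 4.0, 6.0, 8.0],
        "C": [3.0, 6.0, 9.0, 12.0],
    }
)
windows = [1, 2, 3]

# Map each column label to its own window length
window_by_col = dict(zip(dfrm.columns, windows))

# apply() passes each column as a Series whose .name is the column label
result = dfrm.apply(lambda col: col.rolling(window_by_col[col.name]).mean())
print(result)
```

Keying the windows by column name rather than position keeps the mapping correct even if the columns are later reordered.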