A Lexicographical Bug in Pandas?

Please take this question lightly, as it is asked out of curiosity:
As I was trying to see how slicing works with a MultiIndex, I came across the following situation:
# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)
Returns:
a 1 5
2 0
c 1 8
2 6
b 1 6
2 3
dtype: int32
NOTE that the index labels are not in sorted order, i.e. the order is a, c, b, which should result in the expected error while slicing.
# When we do slicing
data.loc["a":"c"]
Errors like:
UnsortedIndexError
----> 1 data.loc["a":"c"]
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
That's expected. But now, after doing the following steps:
# Making a DataFrame
data = data.unstack()
# Reindexing - to unsort the indices like before
data = data.reindex(["a", "c", "b"])
# Which looks like
1 2
a 5 0
c 8 6
b 6 3
# Then again making series
data = data.stack()
# Reindex Again!
data = data.reindex(["a", "c", "b"], level=0)
# Which looks like before
a 1 5
2 0
c 1 8
2 6
b 1 6
2 3
dtype: int32
The Problem
So, now the process is: Series → Unstack → DataFrame → Stack → Series
Now, if I do the same slicing as before (with the indices still unsorted), we don't get any error!
# The same slicing
data.loc["a":"c"]
Results without an error:
a 1 5
2 0
c 1 8
2 6
dtype: int32
Even though data.index.is_monotonic → False, why can we still slice?
So the question is: WHY?
I hope the situation is clear: the same series that previously raised an error no longer raises one after the unstack and stack operations.
So is that a bug, or a new concept that I am missing here?
Thanks!
Aayush ∞ Shah
UPDATE:
I have used data.reindex() to unsort the index once more. Please have a look again.

The difference between your 2 dataframes is the following:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.randint(10, size=6), index=index)
data2 = data.unstack().reindex(["a", "c", "b"]).stack()
>>> data.index.codes
FrozenList([[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]])
>>> data2.index.codes
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
Even though your two indexes have the same outward appearance (values), the internal codes are different.
Check this method of MultiIndex (the docstring below appears to be from the internal MultiIndex._sort_levels_monotonic, which stack() relies on):
Create a new MultiIndex from the current to monotonically sorted
items IN the levels. This does not actually make the entire MultiIndex
monotonic, JUST the levels.
The resulting MultiIndex will have the same outward
appearance, meaning the same .values and ordering. It will also
be .equals() to the original.
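To see the difference directly, check the lexsort depth of the codes rather than the monotonicity of the values. A minimal sketch (is_lexsorted() has been deprecated in recent pandas versions, so treat the exact API as version-dependent):
>>> data.index.is_lexsorted()   # codes [[0, 0, 2, 2, 1, 1], ...] are unsorted
False
>>> data2.index.is_lexsorted()  # codes [[0, 0, 1, 1, 2, 2], ...] are sorted
True
Slicing with .loc consults this lexsort depth of the codes (as the error message itself says), which is why data2 can be sliced even though data2.index.is_monotonic is False.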
Old answer
# Making a DataFrame
data = data.unstack()
# Which looks like # <- WRONG
1 2 # 1 2
a 5 0 # a 8 0
c 8 6 # b 4 1
b 6 3 # c 7 6
# Then again making series
data = data.stack()
# Which looks like before # <- WRONG
a 1 5 # a 1 2
2 0 # 2 1
c 1 8 # b 1 0
2 6 # 2 1
b 1 6 # c 1 3
2 3 # 2 9
dtype: int32
If you want to use slicing, you have to check if the index is monotonic:
# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)
>>> data.index.is_monotonic
False
>>> data.unstack().stack().index.is_monotonic
True
>>> data.sort_index().index.is_monotonic
True
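In practice, the safe pattern is therefore to sort the index before label slicing; a minimal sketch:
# Sorting guarantees a lexsorted index, so label slicing always works
data_sorted = data.sort_index()
data_sorted.loc["a":"b"]   # rows for 'a' and 'b'; 'c' now sorts last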

Related

pandas apply with different arg for each column/row

Assume I have an M (rows) by N (columns) DataFrame
df = pandas.DataFrame([...])
and a vector of length N
windows = [1,2,..., N]
I would like to apply a moving average function to each column in df, but would like the moving average to have different length for each column (e.g. column1 has MA length 1, column 2 has MA length 2, etc) - these lengths are contained in windows
Are there built-in functions to do this quickly? I'm aware of df.apply(lambda a: f(a), axis=0, args=...), but it's unclear how to apply different args for each column.
Here's one way to do it:
In [15]: dfrm
Out[15]:
A B C
0 0.948898 0.587032 0.131551
1 0.385582 0.275673 0.107135
2 0.849599 0.696882 0.313717
3 0.993080 0.510060 0.287691
4 0.994823 0.441560 0.632076
5 0.711145 0.760301 0.813272
6 0.932131 0.531901 0.393798
7 0.965915 0.812821 0.287819
8 0.782890 0.478565 0.960353
9 0.908078 0.850664 0.912878
In [16]: windows
Out[16]: [1, 2, 3]
In [17]: pandas.DataFrame(
{c: dfrm[c].rolling(windows[i]).mean() for i, c in enumerate(dfrm.columns)}
)
Out[17]:
A B C
0 0.948898 NaN NaN
1 0.385582 0.431352 NaN
2 0.849599 0.486277 0.184134
3 0.993080 0.603471 0.236181
4 0.994823 0.475810 0.411161
5 0.711145 0.600931 0.577680
6 0.932131 0.646101 0.613049
7 0.965915 0.672361 0.498296
8 0.782890 0.645693 0.547323
9 0.908078 0.664614 0.720350
As @Manish Saraswat mentioned in the comments, you can also express the same thing with the older API as pd.rolling_mean(dfrm[c], windows[i]); note that this top-level function was deprecated and later removed in favor of .rolling(). Further, with that older API you could pass sequences as the items in windows to express a custom window shape (size and weights); the closest modern equivalent is the win_type keyword of .rolling(), along with the other rolling aggregations and keywords.
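A minimal alternative sketch that avoids the dict comprehension by looking the window up per column inside apply (window_map and result are names introduced here for illustration):
# Map each column name to its window length, then compute per column
window_map = dict(zip(dfrm.columns, windows))
result = dfrm.apply(lambda col: col.rolling(window_map[col.name]).mean())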

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. Second, I have a few lists containing index values sampled according to certain criteria. Now I want to create columns with labels based on whether or not the index of a certain row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This is working, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
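If you ever need more than two label buckets, numpy.select scales better than nesting np.where calls; a minimal equivalent sketch using the sample data above:
conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
df['2_name'] = np.select(conditions, ['A', 'B'], default='not_assigned')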

Evaluating formula provided by GUI

I'm trying to evaluate a simple formula provided via a GUI.
Currently I store the data in a dict with letters as keys (happy to change that, but I thought it might bring the solution one step closer).
Eventually I want to type in a simple formula such as "A - J*2"
import pandas as pd
data_dict = {}
data_dict['A'] = pd.Series([1, 2, 3])
data_dict['C'] = pd.Series([0, 1, 2])
data_dict['E'] = pd.Series([0.5, 1.5, 2.5])
data_dict['J'] = pd.Series([4, 5, 6])
e.g. "A - J*2" ==>
data_dict['A'] - data_dict['J'] * 2
The letters will change dynamically.
Use DataFrame.eval, but first you need to create a DataFrame from the dict of Series:
df = pd.DataFrame(data_dict)
print (df)
A C E J
0 1 0 0.5 4
1 2 1 1.5 5
2 3 2 2.5 6
print (df.eval("A - J*2"))
0 -7
1 -8
2 -9
dtype: int64
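If you would rather skip the intermediate DataFrame, the top-level pd.eval accepts a local_dict of variables; a minimal sketch, assuming all keys are valid Python identifiers:
# Evaluate the expression directly against the dict of Series
result = pd.eval("A - J*2", local_dict=data_dict)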

Collapsing identical adjacent rows in a Pandas Series

Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas
example_series = pandas.Series([1, 1, 1, 2, 2, 3])
def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen
collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
You could write a function that does the following:
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x-x.shift(1)
y[0] = 1
result = x[y!=0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64
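As one more alternative, the standard library's itertools.groupby collapses adjacent duplicates directly; a minimal sketch (note this resets the index, unlike the mask-based solutions above):
from itertools import groupby

s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
collapsed = pd.Series([key for key, _ in groupby(s)])  # 1, 2, 3, 1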

How to create a lagged data structure using pandas dataframe

Example
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print(s)
1 5
2 4
3 3
4 2
5 1
Is there an efficient way to create a series, e.g. one containing in each row the lagged values (in this example up to lag 2):
3 [3, 4, 5]
4 [2, 3, 4]
5 [1, 2, 3]
This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])
How can this be done in an efficient way for dataframes with a lot of timeseries which are very long?
Thanks
Edited after seeing the answers
OK, in the end I implemented this function:
def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if type(s) is pd.DataFrame:
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)
    elif type(s) is pd.Series:
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res
It produces the desired outputs and manages the naming of columns in the resulting DataFrame.
For a Series as input:
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res=buildLaggedFeatures(s,lag=2,dropna=False)
lag_0 lag_1 lag_2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
and for a DataFrame as input:
s2 = pd.DataFrame({'a': [5,4,3,2,1], 'b': [50,40,30,20,10]}, index=[1,2,3,4,5])
res2=buildLaggedFeatures(s2,lag=2,dropna=True)
a a_lag1 a_lag2 b b_lag1 b_lag2
3 3 4 5 30 40 50
4 2 3 4 20 30 40
5 1 2 3 10 20 30
As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.
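For reference, a more compact sketch of the DataFrame branch of that function, using pd.concat and add_suffix (res2_compact is an illustrative name; the column order differs slightly from the output above):
res2_compact = pd.concat(
    [s2] + [s2.shift(l).add_suffix('_lag%d' % l) for l in range(1, 3)],
    axis=1).dropna()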
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
0 1 2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
0 1 2
3 3 4 5
4 2 3 4
5 1 2 3
Doing work on this will be more efficient than on lists...
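For very long series, a view-based sketch with NumPy's sliding_window_view (requires NumPy >= 1.20) avoids building one shifted Series per lag; windows and res are names introduced here for illustration:
import numpy as np

lag = 2
# each reversed window row holds [value, lag 1, lag 2] for one time step
windows = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), lag + 1)[:, ::-1]
res = pd.DataFrame(windows, index=s.index[lag:],
                   columns=['lag_%d' % i for i in range(lag + 1)])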
Very simple solution using pandas DataFrame:
number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)
# if you want numpy arrays with no null values:
df.dropna().values
for Python 3.x (change xrange to range)
number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)
print(df)
vals lag_1 lag_2 lag_3
0 5 NaN NaN NaN
1 4 5.0 NaN NaN
2 3 4.0 5.0 NaN
3 2 3.0 4.0 5.0
4 1 2.0 3.0 4.0
For a dataframe df with the lag to be applied on 'col name', you can use the shift function.
df['lag1']=df['col name'].shift(1)
df['lag2']=df['col name'].shift(2)
I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.
Here's an example of the result:
# Setup
indx = pd.Index([1, 2, 3, 4, 5], name='time')
s=pd.Series(
[5, 4, 3, 2, 1],
index=indx,
name='population')
shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])
Result: a MultiIndex DataFrame with two column levels: the original one ("population") and a new one ("lag"):
Solution: Like in the accepted solution, we use DataFrame.shift and then pandas.concat.
def shift_timeseries_by_lags(df, lags, lag_label='lag'):
    return pd.concat([
        shift_timeseries_and_create_multiindex_column(df, lag,
                                                      lag_label=lag_label)
        for lag in lags], axis=1)

def shift_timeseries_and_create_multiindex_column(
        dataframe, lag, lag_label='lag'):
    return (dataframe.shift(lag)
            .pipe(append_level_to_columns_of_dataframe,
                  lag, lag_label))
I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.
def append_level_to_columns_of_dataframe(
        dataframe, new_level, name_of_new_level, inplace=False):
    """Given a (possibly MultiIndex) DataFrame, append labels to the column
    labels and assign this new level a name.

    Parameters
    ----------
    dataframe : a pandas DataFrame with an Index or MultiIndex columns
    new_level : scalar, or arraylike of length equal to the number of columns
        in `dataframe`
        The labels to put on the columns. If scalar, it is broadcast into a
        list of length equal to the number of columns in `dataframe`.
    name_of_new_level : str
        The label to give the new level.
    inplace : bool, optional, default: False
        Whether to modify `dataframe` in place or to return a copy
        that is modified.

    Returns
    -------
    dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
        The original `dataframe` with new columns that have the given `level`
        appended to each column label.
    """
    old_columns = dataframe.columns
    if not hasattr(new_level, '__len__') or isinstance(new_level, str):
        new_level = [new_level] * dataframe.shape[1]
    if isinstance(dataframe.columns, pd.MultiIndex):
        new_columns = pd.MultiIndex.from_arrays(
            old_columns.levels + [new_level],
            names=(old_columns.names + [name_of_new_level]))
    elif isinstance(dataframe.columns, pd.Index):
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns] + [new_level],
            names=([old_columns.name] + [name_of_new_level]))
    if inplace:
        dataframe.columns = new_columns
        return dataframe
    else:
        copy_dataframe = dataframe.copy()
        copy_dataframe.columns = new_columns
        return copy_dataframe
Update: I learned from this solution another way to put a new level in a column, which makes it unnecessary to use append_level_to_columns_of_dataframe:
def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    return pd.concat({
        '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
            df.shift(lag)
        for lag in lags},
        axis=1)
Here's the result of shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]):
Here is a cool one-liner for lagged features with _lagN suffixes in the column names, using pd.concat:
lagged = pd.concat([s.shift(lag).rename('{}_lag{}'.format(s.name, lag+1)) for lag in range(3)], axis=1).dropna()
You can do following:
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res = pd.DataFrame(index = s.index)
for l in range(3):
    res[l] = s.shift(l)
print(res.loc[3:, :].to_numpy())  # .ix and .as_matrix() were removed in modern pandas
It produces:
array([[ 3., 4., 5.],
[ 2., 3., 4.],
[ 1., 2., 3.]])
which I hope is very close to what you actually want.
For multiple (many of them) lags, this could be more compact:
df=pd.DataFrame({'year': range(2000, 2010), 'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})
df.join(pd.DataFrame({'gdp_' + str(lag): df['gdp'].shift(lag) for lag in range(1,4)}))
Assuming you are focusing on a single column in your data frame, saved into s, this short snippet will generate lagged instances of the column (three in this example, including lag 0).
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5], name='test')
shiftdf=pd.DataFrame()
for i in range(3):
    shiftdf = pd.concat([shiftdf, s.shift(i).rename(s.name + '_' + str(i))], axis=1)
shiftdf
>>
test_0 test_1 test_2
1 5 NaN NaN
2 4 5.0 NaN
3 3 4.0 5.0
4 2 3.0 4.0
5 1 2.0 3.0
Based on the proposal by @charlie-brummitt, here is a revision that keeps a set of columns fixed (unlagged):
def shift_timeseries_by_lags(df, fix_columns, lag_numbers, lag_label='lag'):
    df_fix = df[fix_columns]
    df_lag = df.drop(columns=fix_columns)
    df_lagged = pd.concat({f'{lag_label}_{lag}': df_lag.shift(lag)
                           for lag in lag_numbers},
                          axis=1)
    df_lagged.columns = ['__'.join(reversed(x)) for x in df_lagged.columns.to_flat_index()]
    return pd.concat([df_fix, df_lagged], axis=1)
Here is an example of usage:
df = shift_timeseries_by_lags(df_province_cases, fix_columns=['country', 'state'], lag_numbers=[1,2,3])
I personally prefer the lag name as a suffix, but that can be changed by removing reversed().
