Evaluating formula provided by GUI - python

I'm trying to evaluate a simple formula provided via a GUI.
Currently I store the data in a dict with letters as keys (happy to change that, but I thought it would bring the solution one step closer).
Eventually I want to type in a simple formula such as "A - J*2":
import pandas as pd
data_dict = {}
data_dict['A'] = pd.Series([1, 2, 3])
data_dict['C'] = pd.Series([0, 1, 2])
data_dict['E'] = pd.Series([0.5, 1.5, 2.5])
data_dict['J'] = pd.Series([4, 5, 6])
e.g. "A - J*2" ==>
data_dict['A'] - data_dict['J'] * 2
The letters will change dynamically.

Use DataFrame.eval, but first create a DataFrame from the dict of Series:
df = pd.DataFrame(data_dict)
print (df)
   A  C    E  J
0  1  0  0.5  4
1  2  1  1.5  5
2  3  2  2.5  6
print (df.eval("A - J*2"))
0 -7
1 -8
2 -9
dtype: int64
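Since the letters change dynamically, the two steps above can be wrapped in a small helper. This is only a sketch: the name evaluate_formula is hypothetical (it is not a pandas function), and it assumes all Series in the dict share the same index.

```python
import pandas as pd


def evaluate_formula(formula, data):
    """Evaluate a formula string against a dict of equally indexed Series.

    `evaluate_formula` is a hypothetical helper name, not a pandas API.
    """
    # Build the DataFrame on each call so newly added letters are picked up.
    df = pd.DataFrame(data)
    return df.eval(formula)


data_dict = {
    "A": pd.Series([1, 2, 3]),
    "C": pd.Series([0, 1, 2]),
    "E": pd.Series([0.5, 1.5, 2.5]),
    "J": pd.Series([4, 5, 6]),
}

result = evaluate_formula("A - J*2", data_dict)
print(result)
```

Rebuilding the DataFrame on every call keeps the helper simple; if the dict is large and rarely changes, building the DataFrame once and reusing it would likely be cheaper.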


A Lexicographical Bug in Pandas?

Please take this question lightly as asked from curiosity:
As I was trying to see how the slicing in MultiIndex works, I came across the following situation ↓
# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)
a 1 5
2 0
c 1 8
2 6
b 1 6
2 3
dtype: int32
Note that the index is not sorted: the order a, c, b is what produces the expected error when slicing.
# When we do slicing
data.loc["a":"c"]
Errors like:
----> 1 data.loc["a":"c"]
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
That's expected. But now, after doing the following steps:
# Making a DataFrame
data = data.unstack()
# Reindexing - to unsort the indices like before
data = data.reindex(["a", "c", "b"])
# Which looks like
1 2
a 5 0
c 8 6
b 6 3
# Then again making series
data = data.stack()
# Reindex Again!
data = data.reindex(["a", "c", "b"], level=0)
# Which looks like before
a 1 5
2 0
c 1 8
2 6
b 1 6
2 3
dtype: int32
The Problem
So, now the process is: Series → Unstack → DataFrame → Stack → Series
Now, if I do the slicing like before (still on with the indices unsorted) we don't get any error!
# The same slicing
data.loc["a":"c"]
Results without an error:
a 1 5
2 0
c 1 8
2 6
dtype: int32
Even though data.index.is_monotonic → False, why can we still slice?
So the question is: why?
I hope the situation is clear: the same series that raised the error before no longer raises it after the unstack and stack operations.
So is that a bug, or a new concept that I am missing here?
Aayush ∞ Shah
I have used data.reindex() to unsort the index once more. Please have a look at it again.
The difference between your 2 dataframes is the following:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.randint(10, size=6), index=index)
data2 = data.unstack().reindex(["a", "c", "b"]).stack()
>>> data.index.codes
FrozenList([[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]])
>>> data2.index.codes
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
Even though your two indexes look the same (same values), their internal representations (codes) are different.
Check this method of MultiIndex:
Create a new MultiIndex from the current to monotonically sorted
items IN the levels. This does not actually make the entire MultiIndex
monotonic, JUST the levels.
The resulting MultiIndex will have the same outward
appearance, meaning the same .values and ordering. It will also
be .equals() to the original.
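To see the difference without going through unstack and stack, the two indexes can be built directly from levels and codes. This is a sketch that reuses the codes shown above; the variable names are made up for illustration, and because the exact slicing behaviour may vary between pandas versions, the slice is wrapped in try/except rather than assumed.

```python
import pandas as pd

# Same outward values (a, a, c, c, b, b), different internal representation.
idx_sorted_levels = pd.MultiIndex(
    levels=[["a", "b", "c"], [1, 2]],
    codes=[[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]],
)
idx_unsorted_levels = pd.MultiIndex(
    levels=[["a", "c", "b"], [1, 2]],
    codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
)

values = [5, 0, 8, 6, 6, 3]
for name, idx in [("sorted levels", idx_sorted_levels),
                  ("unsorted levels", idx_unsorted_levels)]:
    s = pd.Series(values, index=idx)
    print(name, "-> first-level codes:", idx.codes[0].tolist())
    try:
        print(s.loc["a":"c"])
    except Exception as exc:  # e.g. UnsortedIndexError
        print("slicing failed:", type(exc).__name__)

# Both indexes compare equal by value even though their codes differ.
same_values = idx_sorted_levels.equals(idx_unsorted_levels)
print("equal by value:", same_values)
```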
Old answer
# Making a DataFrame
data = data.unstack()
# Which looks like # <- WRONG
1 2 # 1 2
a 5 0 # a 8 0
c 8 6 # b 4 1
b 6 3 # c 7 6
# Then again making series
data = data.stack()
# Which looks like before # <- WRONG
a 1 5 # a 1 2
2 0 # 2 1
c 1 8 # b 1 0
2 6 # 2 1
b 1 6 # c 1 3
2 3 # 2 9
dtype: int32
If you want to use slicing, you have to check if the index is monotonic:
# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)
>>> data.index.is_monotonic
>>> data.unstack().stack().index.is_monotonic
>>> data.sort_index().index.is_monotonic
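In pandas 2.0 and later the is_monotonic attribute is no longer available; is_monotonic_increasing is its replacement. A minimal sketch of the same check, with fixed values in place of the random ones so the result is reproducible:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series([5, 0, 8, 6, 6, 3], index=index)

# The index is in the order a, c, b, so it is not monotonic.
print(data.index.is_monotonic_increasing)

# Sorting the index first makes label slicing well defined.
sorted_data = data.sort_index()
print(sorted_data.index.is_monotonic_increasing)
print(sorted_data.loc["a":"b"])
```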

is it possible to use numpy to calculate on recursive data [duplicate]

I have a time-series A holding several values. I need to obtain a series B that is defined algebraically as follows:
B[t] = a * A[t] + b * B[t-1]
where we can assume B[0] = 0, and a and b are real numbers.
Is there any way to do this type of recursive computation in Pandas? Or do I have no choice but to loop in Python as suggested in this answer?
As an example of input:
> A = pd.Series(np.random.randn(10,))
0 -0.310354
1 -0.739515
2 -0.065390
3 0.214966
4 -0.605490
5 1.293448
6 -3.068725
7 -0.208818
8 0.930881
9 1.669210
As I noted in a comment, you can use scipy.signal.lfilter. In this case (assuming A is a one-dimensional numpy array), all you need is:
B = lfilter([a], [1.0, -b], A)
Here's a complete script:
import numpy as np
from scipy.signal import lfilter
A = np.random.randn(10)
a = 2.0
b = 3.0
# Compute the recursion using lfilter.
# [a] and [1, -b] are the coefficients of the numerator and
# denominator, resp., of the filter's transfer function.
B = lfilter([a], [1, -b], A)
print(B)
# Compare to a simple loop.
B2 = np.empty(len(A))
for k in range(0, len(B2)):
    if k == 0:
        # No previous value yet, so the recursive term is zero.
        B2[k] = a*A[k]
    else:
        B2[k] = a*A[k] + b*B2[k-1]
print(B2)
print("max difference:", np.max(np.abs(B2 - B)))
The output of the script is:
[ -2.17126121e+00 -4.51909273e+00 -1.29913212e+01 -4.19865530e+01
-1.27116859e+02 -3.78047705e+02 -1.13899647e+03 -3.41784725e+03
-1.02510099e+04 -3.07547631e+04]
[ -2.17126121e+00 -4.51909273e+00 -1.29913212e+01 -4.19865530e+01
-1.27116859e+02 -3.78047705e+02 -1.13899647e+03 -3.41784725e+03
-1.02510099e+04 -3.07547631e+04]
max difference: 0.0
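The choice of coefficients follows from rearranging the recursion into the difference-equation form that lfilter works with. A short derivation, assuming zero initial state (that is, B[-1] = 0, which is what the loop above uses for k == 0):

```latex
% lfilter(num, den, x) with den[0] = 1 computes
%   y[t] = num[0] x[t] - den[1] y[t-1]
\begin{align}
B[t] &= a\,A[t] + b\,B[t-1] \\
B[t] - b\,B[t-1] &= a\,A[t] \\
H(z) &= \frac{a}{1 - b\,z^{-1}}
\end{align}
```

Reading off the numerator and denominator of H(z) gives the coefficient lists [a] and [1, -b].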
Another example, in IPython, using a pandas DataFrame instead of a numpy array:
If you have
In [12]: df = pd.DataFrame([1, 7, 9, 5], columns=['A'])
In [13]: df
   A
0  1
1  7
2  9
3  5
and you want to create a new column, B, such that B[k] = A[k] + 2*B[k-1] (with B[k] == 0 for k < 0), you can write
In [14]: df['B'] = lfilter([1], [1, -2], df['A'].astype(float))
In [15]: df
   A   B
0  1   1
1  7   9
2  9  27
3  5  59
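If the input is a pandas Series with a meaningful index (for example timestamps), the lfilter call can be wrapped so that the result keeps that index. This is a sketch only; recursive_filter is a hypothetical helper name, and it assumes SciPy is installed and that the recursion starts from B[-1] = 0.

```python
import pandas as pd
from scipy.signal import lfilter


def recursive_filter(series, a, b):
    """Return B with B[t] = a*A[t] + b*B[t-1], taking B[-1] = 0.

    `recursive_filter` is a hypothetical helper, not part of pandas or SciPy.
    """
    values = lfilter([a], [1.0, -b], series.to_numpy(dtype=float))
    # Reattach the original index so the result aligns with the input.
    return pd.Series(values, index=series.index, name="B")


A = pd.Series([1, 7, 9, 5], index=pd.date_range("2020-01-01", periods=4))
B = recursive_filter(A, a=1.0, b=2.0)
print(B)
```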

Pandas replacing one value with another for specified columns

I need to apply a function to a subset of columns in a dataframe. Consider the following toy example:
pdf = pd.DataFrame({'a' : [1, 2, 3], 'b' : [2, 3, 4], 'c' : [5, 6, 7]})
arb_cols = ['a', 'b']
What I want to do is this:
[pdf[c] = pdf[c].apply(lambda x : 99 if x == 2 else x) for c in arb_cols]
But this is bad syntax. Is it possible to accomplish such a task without a for loop?
With mask
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Or with assign
a b c
0 1 99 5
1 99 3 6
2 3 4 7
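The code for the mask and assign variants is not shown above. One possible way to write them, as a sketch that reproduces the printed output (this is not necessarily what the answer originally used):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [5, 6, 7]})
arb_cols = ["a", "b"]

# mask: replace values where the condition is True, only in the chosen columns.
masked = pdf.copy()
masked[arb_cols] = masked[arb_cols].mask(masked[arb_cols] == 2, 99)

# assign: build the replaced columns and return a new DataFrame.
assigned = pdf.assign(**{c: pdf[c].mask(pdf[c] == 2, 99) for c in arb_cols})

print(masked)
print(assigned)
```

Both variants leave the original pdf untouched, which is the main practical difference from the in-place loop.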
Do not use pd.Series.apply when you can use vectorised functions.
For example, the below should be efficient for larger dataframes even though there is an outer loop:
for col in arb_cols:
pdf.loc[pdf[col] == 2, col] = 99
Another option is to use pd.DataFrame.replace:
pdf[arb_cols] = pdf[arb_cols].replace(2, 99)
Yet another option is to use numpy.where:
import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
For this case it would probably be better to use applymap if you need to apply a custom function (note that pandas 2.1 deprecated applymap in favor of DataFrame.map):
pdf[arb_cols] = pdf[arb_cols].applymap(lambda x : 99 if x == 2 else x)

Split a pandas dataframe by a list of values from another data frame

I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...
I have a data frame of high-frequency data. Call this data frame A. I also have a separate list of far lower frequency demarcation points, call this B. I would like to append a column to A that would display 1 if A's timestamp column is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.
As said, it's probably incredibly trivial, and I'm just not realising it at this late an hour.
Here is a quick and dirty approach using a list comprehension.
>>> df = pd.DataFrame({'A': np.arange(1, 3, 0.2)})
>>> A = df.A.values.tolist()
A: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
>>> B = np.arange(0, 3, 1).tolist()
B: [0, 1, 2]
>>> BA = [k for k in range(0, len(B)-1) for a in A if (B[k]<=a) & (B[k+1]>a) or (a>max(B))]
BA: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Use searchsorted:
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
For each value in A['timestamp'], an index value is returned. That index indicates where amongst the sorted values in B['timestamp'] that value from A would be inserted into B in order to maintain sorted order.
For example,
import numpy as np
import pandas as pd
N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
# timestamp
# 0 1.739869
# 1 2.467790
# 2 2.863659
# 3 3.295505
# 4 5.106419
# 5 6.872791
# 6 7.080834
# 7 9.909320
# 8 11.027117
# 9 12.383085
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
timestamp group
0 0.896705 0
1 1.626945 0
2 2.410220 1
3 3.151872 3
4 3.613962 4
5 4.256528 4
6 4.481392 4
7 5.189938 5
8 5.937064 5
9 6.562172 5
Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).
You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use
B['timestamp'].searchsorted(A['timestamp'], side='left')
if you want searchsorted to return i when a value in A['timestamp'] is exactly equal to B['timestamp'][i]. Use
B['timestamp'].searchsorted(A['timestamp'], side='right')
if you want searchsorted to return i+1 in that situation. If you don't specify side, then side='left' is used by default.
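The effect of side on ties is easiest to see with a tiny example. A sketch with made-up cutoffs and values (the names cutoffs and values are placeholders standing in for B['timestamp'] and A['timestamp']):

```python
import pandas as pd

cutoffs = pd.Series([1.0, 2.0, 3.0])   # plays the role of B['timestamp']
values = pd.Series([0.5, 2.0, 3.5])    # plays the role of A['timestamp']

left = cutoffs.searchsorted(values, side="left")
right = cutoffs.searchsorted(values, side="right")

print(left)   # the tie at 2.0 gets the index of the matching cutoff
print(right)  # the tie at 2.0 gets the index after the matching cutoff
```

Only the middle value, which equals a cutoff exactly, is assigned differently by the two settings.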

Collapsing identical adjacent rows in a Pandas Series

Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas
example_series = pandas.Series([1, 1, 1, 2, 2, 3])
def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            # First element of a new run: remember it and store it.
            last = element
            seen.append(element)
    return seen
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
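The same loop can also be expressed with itertools.groupby from the standard library, which groups consecutive equal elements. A minimal sketch; collapse_runs is a hypothetical name chosen for illustration:

```python
from itertools import groupby

import pandas as pd


def collapse_runs(series):
    """Keep the first element of each run of equal adjacent values."""
    # groupby without a key groups consecutive equal elements together.
    return [key for key, _group in groupby(series.tolist())]


example = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
collapsed = collapse_runs(example)
print(collapsed)
```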
You could write a function that does the following:
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x-x.shift(1)
y[0] = 1
result = x[y!=0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
    0
0   1
2   2
6   3
10  1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
0 1
2 2
6 3
10 4
12 5
dtype: int64
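The expression that produced this output is not shown. A sketch of one way to build such a mask with shift, which gives the same result on this series:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5])

# True wherever a row differs from the row before it; the first row is
# always kept because shift() puts NaN in front of it.
mask = s != s.shift()
collapsed = s[mask]
print(collapsed)
```

Unlike the diff-based approach, comparing against shift also works for non-numeric data such as strings.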

