Vectorized way of finding the index of a previously occurring element - python

Let's say I have this Pandas series:
num = pd.Series([1,2,3,4,5,6,5,6,4,2,1,3])
What I want to do is, for a given number, say 5, return the index where it previously occurred. So for the 5 at index 6, I should get 4, since 5 appears at indices 4 and 6. Now I want to do this for every element of the series, which can easily be done using a for loop:
for idx, x in enumerate(num):
    idx_prev = num[num == x].idxmax()
    if idx_prev < idx:
        return idx_prev
However, this process consumes too much time for longer series lengths due to the looping. Is there a way to implement the same thing but in a vectorized form? The output should be something like this:
[NaN,NaN,NaN,NaN,NaN,NaN,4,5,3,1,0,2]

You can use groupby to shift the index:
num.index.to_series().groupby(num).shift()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
7 5.0
8 3.0
9 1.0
10 0.0
11 2.0
dtype: float64
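One caveat worth noting (checked on a toy series of my own, not from the question): when a value occurs more than twice, shift() returns the immediately preceding occurrence, not the first one:
import pandas as pd

s = pd.Series([7, 8, 7, 7])  # 7 occurs at indices 0, 2 and 3
print(s.index.to_series().groupby(s).shift().tolist())
# [nan, nan, 0.0, 2.0] -- index 3 maps to 2 (the previous occurrence), not 0 (the first)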

It's possible to keep working in numpy.
The equivalent of [num[num == x].idxmax() for idx, x in enumerate(num)] in numpy is:
_, first_idx, inv = np.unique(num.values, return_index=True, return_inverse=True)
out = first_idx[inv]
which assigns
array([0, 1, 2, 3, 4, 5, 4, 5, 3, 1, 0, 2], dtype=int64)
to out, the index of the first occurrence of each element. (Using return_inverse alone gives positions in the sorted array of unique values, which coincides with the first-occurrence indices only for this particular input, so return_index is needed in general.) Now you can set the invalid values of out, those with no earlier occurrence, to NaN like this:
out_series = pd.Series(out, dtype=float)  # float, so it can hold NaN
out_series[out >= np.arange(len(out))] = np.nan
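Put together as a runnable sketch, this reproduces the output requested in the question:
import numpy as np
import pandas as pd

num = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 4, 2, 1, 3])

# first-occurrence index of every element
_, first_idx, inv = np.unique(num.values, return_index=True, return_inverse=True)
out = first_idx[inv]

# blank out elements that have no earlier occurrence
out_series = pd.Series(out, dtype=float)
out_series[out >= np.arange(len(out))] = np.nan

print(out_series.tolist())
# [nan, nan, nan, nan, nan, nan, 4.0, 5.0, 3.0, 1.0, 0.0, 2.0]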

Related

Creating a pandas column of values with a calculation, but change the calculation every x times to a different one

I'm currently creating a new column in my pandas dataframe, calculated by subtracting a fixed value from the value in another column. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
subtraction_value = 3
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9, 1, 2, 1, -2, 0, -1, 2, 7, 6]
However, what if I wanted to subtract a different value at position [0], use the original subtraction value on positions [1:3], apply the second value again at position [4], and repeat this pattern down the column? I realize I could use a for loop to achieve this, but for performance reasons I'd like to do it another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
You can use positional indexing:
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where:
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
Output:
   test  new_column
0    12           6
1     4           1
2     5           2
3     4           1
4     1          -5
5     3           0
6     2          -1
7     5           2
8    10           4
9     9           6
Alternatively, compute the whole column first, then overwrite every fourth row (written with .loc to avoid chained assignment):
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
data.loc[data.index[::4], 'new_column'] = data.test[::4] - subtraction_value_2
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
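If the pattern ever becomes more involved than "every fourth row", a variant of the same idea (my own generalization, not part of the answers above) is to tile an explicit pattern of subtraction values across the frame:
import numpy as np
import pandas as pd

data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})

# one subtraction value per position of the repeating pattern
pattern = np.array([6, 3, 3, 3])      # subtraction_value_2 first, then subtraction_value
subs = np.resize(pattern, len(data))  # np.resize repeats the pattern to the target length
data['new_column'] = data['test'] - subs
print(data['new_column'].tolist())
# [6, 1, 2, 1, -5, 0, -1, 2, 4, 6]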

boolean selection with loc in python

s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
The output of s is:
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
s.loc[[False,True]]
it gives this output:
48   NaN
The docs say .loc accesses a group of rows and columns by label(s), yet here I passed a list of False and True that is not even equal in length to the axis being sliced.
My question is: when we give a boolean list to .loc, does it slice the DataFrame/Series by position instead of by label?
I certainly get an error when I am running this:
import pandas as pd
import numpy as np
s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
s.loc[[False,True]]
the error (as expected) is:
IndexError: Item wrong length 2 instead of 10.
Maybe your problem is specific to a certain version of pandas, perhaps an old one? I used pandas version 0.25.3.
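For reference, a boolean mask passed to .loc has to cover the whole axis. A minimal sketch of the working form:
import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
mask = s.index < 10  # one boolean per row, so len(mask) == len(s)
print(s.loc[mask])   # selects the rows labelled 1 through 5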

Manipulate pandas.DataFrame with multiple criterias

For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
                   'DayofWeek': [1, 1, 3, 2, 4, 2],
                   'Hour_Bucket': [1, 5, 7, 4, 3, 12],
                   'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to apply functions to 'Values' where "Value_Bucket" equals 5, for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), with each cell holding the result of a function (say the average). I can use groupby for one criterion; can someone explain how to group by two criteria and tabulate the result?
Use query to subset, then groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek      1    3
Hour_Bucket
1            1.0  NaN
5            1.5  NaN
7            NaN  2.0
If you want zeros instead of NaN:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)

DayofWeek      1    3
Hour_Bucket
1            1.0  0.0
5            1.5  0.0
7            0.0  2.0
Pivot tables seem more natural to me than groupby paired with unstack, though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='mean',
               fill_value=0)
Output:
DayofWeek      1  3
Hour_Bucket
1            1.0  0
5            1.5  0
7            0.0  2
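Since the question ultimately wants a full 24-row by 7-column table, one extra step (assuming hours run 0-23 and weekdays 1-7; adjust the ranges if your buckets differ) is to reindex so unobserved combinations show up as zeros:
# hypothetical final step: force the complete 24 x 7 grid
table = (df.query('Value_Bucket == 5')
           .groupby(['Hour_Bucket', 'DayofWeek']).Values.mean()
           .unstack(fill_value=0)
           .reindex(index=range(24), columns=range(1, 8), fill_value=0))
print(table.shape)  # (24, 7)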

apply function on groups of k elements of a pandas Series

I have a pandas Series:
0 1
1 5
2 20
3 -1
Let's say I want to apply mean() to every two elements, so I get something like this:
0 3.0
1 9.5
Is there an elegant way to do this?
You can group by the index floor-divided by k=2:
k = 2
print(s.index // k)
Int64Index([0, 0, 1, 1], dtype='int64')

print(s.groupby(s.index // k).mean())
0    3.0
1    9.5
dtype: float64
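Note that s.index // k relies on the default RangeIndex; with an arbitrary index you can group by position instead (a small variation, not from the original answer):
import numpy as np

print(s.groupby(np.arange(len(s)) // k).mean())  # same result, index-independent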
You can do this:
(s.iloc[::2].values + s.iloc[1::2])/2
If you want, you can also reset the index afterwards, so you have 0, 1 as the index, using:
((s.iloc[::2].values + s.iloc[1::2])/2).reset_index(drop=True)
If you are using this on large series, many times over, you'll want a fast approach. This solution uses all numpy functions and will be fast.
Use reshape and construct a new pd.Series.
Consider the pd.Series s:
s = pd.Series([1, 5, 20, -1])
generalized function
def mean_k(s, k):
    # pad with NaN up to a multiple of k, then average each row of the k-wide reshape
    pad = (k - s.shape[0] % k) % k
    nan = np.repeat(np.nan, pad)
    val = np.concatenate([s.values, nan])
    return pd.Series(np.nanmean(val.reshape(-1, k), axis=1))
demonstration
mean_k(s, 2)
0 3.0
1 9.5
dtype: float64
mean_k(s, 3)
0 8.666667
1 -1.000000
dtype: float64

Update DataFrame using a slice of a MultiIndex (many rows) with Pandas

I have a MultiIndex with dimensions I × M and, for a single i ∈ I, would like to update all M rows at the same time.
Here is my data frame:
>>> result.head(n=10)
Out[9]:
          FINLWT21
i INCAGG
0 1            NaN
  7            NaN
  9            NaN
  5            NaN
  3            NaN
1 1            NaN
  7            NaN
  9            NaN
  5            NaN
  3            NaN
Here is what I would like to fill in:
sample.groupby(field).sum()
            FINLWT21
INCAGG
1        8800809.719
3        9951002.611
5        9747721.721
7        7683066.990
9       11091861.692
I thought the right command would be result.loc[i] = sample.groupby(field).sum(). However, here is the contents of result afterwards:
>>> result.loc[i]
Out[11]:
       FINLWT21
INCAGG
1           NaN
7           NaN
9           NaN
5           NaN
3           NaN
How can I update all the "inner index" at the same time?
You want to use pd.IndexSlice. It returns an object that can be used in slicing with loc.
Solution
result.sort_index(inplace=True)
slc = pd.IndexSlice[i, :]
result.loc[slc, :] = sample.groupby(field).sum()
Explanation
result.sort_index(inplace=True) -> pd.IndexSlice requires the index to be sorted.
slc = pd.IndexSlice[i, :] -> syntax for a generic slicer that selects the ith group of the 1st level of a pd.MultiIndex with 2 levels.
result.loc[slc, :] = ... -> use the slice.
Demonstration
import pandas as pd
import numpy as np
result = pd.DataFrame([], columns=['FINLWT21'],
                      index=pd.MultiIndex.from_product([[0, 1], [1, 7, 9, 5, 3]]))
result.sort_index(inplace=True)
slc = pd.IndexSlice[0, :]
result.loc[slc, :] = [1, 2, 3, 4, 5]
print(result)
    FINLWT21
0 1        1
  3        2
  5        3
  7        4
  9        5
1 1      NaN
  3      NaN
  5      NaN
  7      NaN
  9      NaN
This is a function I have which might be what you are looking for:
from warnings import warn

import numpy as np
import pandas as pd

def _assign_multi_index(dest, k, v, inplace=True, bool_nan=False):
    """
    Assigns v to dest[k] in place, doing a "sensible" multi-index alignment and
    raising a ValueError if no alignment is achieved.
    I'm not sure if there's a better way to do this, or a reason not to do it
    the way it's currently written.
    """
    if not inplace:
        raise NotImplementedError()
    if k in dest:
        warn("key '{}' already exists, continue with caution!".format(k))
    v_names = v.index.names
    dest_names = dest.index.names
    if all(n in dest_names for n in v_names):
        if len(v_names) < len(dest_names):
            # if need be, drop some index levels temporarily in dest
            dropped_names = [n for n in dest_names if n not in v_names]
            dest.reset_index(dropped_names, inplace=True)
        v.index = v.index.reorder_levels([n for n in dest_names if n in v_names])  # just to be safe
    else:
        raise ValueError("index levels do not match dest.")
    dest[k] = v
    # restore the original index levels if need be
    if dest.index.names != dest_names:
        dest.reset_index(inplace=True)
        dest.set_index(dest_names, inplace=True)
    # pass bool_nan=np.nan to leave inserted NaNs untouched
    if bool_nan is not np.nan and v.dtype.name == 'bool' and dest[k].dtype.name != 'bool':
        # NaNs were inserted during alignment: replace them and cast back to bool
        dest_k = dest[k].copy()
        dest_k[pd.isnull(dest_k)] = bool_nan
        dest[k] = dest_k.astype(bool)
It turns out that probably the best way is to add an index to the right-hand data set. The following works as expected:
data = sample.groupby(field).sum()
data['index'] = i
result.loc[i] = data.reset_index().set_index(['index', field])
