Create column using a sequence of numbers

Create column using a sequence of numbers - python

I have a dataframe as such
Anger Sad Happy Disgust Neutral Scared
0.06754 0.6766 0.4343 0.7732 0.5563 0.76433
0.54434 0.9865 0.6654 0.3334 0.4322 0.54453
...
0.5633 0.67655 0.5444 0.3278 0.9834 0.88569
I would like to create a new column that marks the first 5 rows as 1, the next 5 rows as 2, the next 5 rows 3 and then the next 3 rows as 4, and repeat the same pattern till the end of the dataset. How can I achieve this?
I tried looking into arange but failed in the implementation
An example output would be the new column Tperiod
Tperiod
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
4
4
4
1
1
1
1
1

One way to do it would be as such
pattern = [1] * 5 + [2] * 5 + [3] * 5 + [4] * 3
no_patterns = len(df)//len(pattern)
remaining = len(df) - (len(pattern) * no_patterns)
new_values = pattern * no_patterns + pattern[:remaining]
df['new_column'] = new_values

You can do:
gr=[5,5,5,3]
mp={n: next(x+1 for x in range(len(gr)) if sum(gr[:x+1])>n) for n in range(sum(gr))}
df['Tperiod']=range(len(df))
df['Tperiod']=(df['Tperiod']%sum(gr)).map(mp)
Where gr indicates your groups size (that's your input), and mp is just to utilize it properly with pandas.

Related

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500)
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
for i in range(len(df):
if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25

numpy_ext can be used for expanding calculations
pandas-rolling-apply-using-multiple-columns for reference
I have also included a simpler calc to demonstrate behaviour in simpler way
df = pd.DataFrame(index=range(5000))
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
import numpy_ext as npe
# for i in range(len(df):
# if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
# SO example - function of previous values in A and B
def f(A,B):
r = np.sum(A[:-1]/3) - np.sum(B[:-1] + 25) if len(A)>1 else A[0]
return r
# much simpler example, sum of previous values
def g(A):
return np.sum(A[:-1])
df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
sample output
A
B
AB_combo
A_running
0
1
5
1
0
1
2
5
-29.6667
1
2
2
5
-59
3
3
2
5
-88.3333
5
4
2
5
-117.667
7
5
2
5
-147
9
6
2
5
-176.333
11
7
2
5
-205.667
13
8
2
5
-235
15
9
2
5
-264.333
17

Index and save last N points from a list that meets conditions from dataframe Python

I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
test = (i-numPoints):i;
ind_noaaHigh_end(test) = 1;
n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
co_noaaHigh = [co_noaaHigh mean(co_c(test))];
h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
t_c_High = []; # time
for i in range(len(valveW_c)):
# NOAA HIGH
if (valveW_c[i] == 1):
t_c_High.append(t_c[i])
n2o_noaaHigh.append(n2o_c[i])
co2_noaaHigh.append(co2_c[i])
co_noaaHigh.append(co_c[i])
h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!

I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use groupby method to get the average for the last n points of each cycle:
n = 3 #we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle 0
1 246.9768
2 246.6579
3 246.7269

Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
results[k] = v
print(f'[group {k}]')
print(v)
Shift(), as it suggests, shifts the column of the valve cycle allows to detect changes in number sequences. Then, cumsum() helps to give a unique number to each of the group with the same number sequence. Then we can do a groupby() on this column (which was not possible before because groups were either of ones or twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then to get the mean for each cycle; you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
if len(v) < n:
mean_n_last.append(np.nan)
else:
mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3 !

If your dataframe is sorted by time you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted you could easily sort it like this.
df.sort_values(by=['time'])

Drop rows if value in column changes

Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12

My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12

One of possible solutions is to:
Make use of the fact that the index contains consecutive integers.
Find index values where class changes.
For each such index generate a sequence of indices from n-2
to n+1 and concatenate them.
Retrieve rows with indices not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in
[ range(i-2, i+2) for i in ind ] for item in sublist])]

my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
idx2drop += range(i - 2, i + 2)
print(df.drop(idx_drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12

Rolling sum on a dynamic window

I am new to python and the last time I coded was in the mid-80's so I appreciate your patient help.
It seems .rolling(window) requires the window to be a fixed integer. I need a rolling window where the window or lookback period is dynamic and given by another column.
In the table below, I seek the Lookbacksum which is the rolling sum of Data as specified by the Lookback column.
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
eg:
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3

You can create a custom function for use with df.apply, eg:
def lookback_window(row, values, lookback, method='sum', *args, **kwargs):
loc = values.index.get_loc(row.name)
lb = lookback.loc[row.name]
return getattr(values.iloc[loc - lb: loc + 1], method)(*args, **kwargs)
Then use it as:
df['new_col'] = df.apply(lookback_window, values=df['Data'], lookback=df['Lookback'], axis=1)
There may be some corner cases but as long as your indices align and are unique - it should fulfil what you're trying to do.

here is one with a list comprehension which stores the index and value of the column df['Lookback'] and the gets the slice by reversing the values and slicing according to the column value:
df['LookbackSum'] = [sum(df.loc[:e,'Data'][::-1].to_numpy()[:i+1])
for e,i in enumerate(df['Lookback'])]
print(df)
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3

An exercise in pain, if you want to try an almost fully vectorized approach. Sidenote: I don't think it's worth it here. At all.
Inspired by Divakar's answer here
Given:
import numpy as np
import pandas as pd
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
Using the function from Divakar's answer, but slightly modified
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r, fill_value=np.nan):
# Concatenate with sliced to cover all rolls
p = np.full((a.shape[0],a.shape[1]-1),fill_value)
a_ext = np.concatenate((p,a,p),axis=1)
# Get sliding windows; use advanced-indexing to select appropriate ones
n = a.shape[1]
return viewW(a_ext,(1,n))[np.arange(len(r)), -r + (n-1),0]
Now, we just need to prepare a 2d array for the data and independently shift the rows according to our desired lookback values.
arr = df['Data'].to_numpy().reshape(1, -1).repeat(len(df), axis=0)
shifter = np.arange(len(df) - 1, -1, -1) #+ d['Lookback'] - 1
temp = strided_indexing_roll(arr, shifter, fill_value=0)
out = strided_indexing_roll(temp, (len(df) - 1 - df['Lookback'])*-1, 0).sum(-1)
Output:
array([ 1, 2, 3, 4, 5, 8, 10, 7, 8, 3], dtype=int64)
We can then just assign it back to the dataframe as needed and check.
df['out'] = out
#output:
Data Lookback LookbackSum out
0 1 0 1 1
1 1 1 2 2
2 1 2 3 3
3 2 2 4 4
4 3 1 5 5
5 2 3 8 8
6 3 3 10 10
7 2 2 7 7
8 1 3 8 8
9 2 1 3 3

Create a pandas Series

I want to create a panda series that contains the first ‘n’ natural numbers and their respective squares. The first ‘n’ numbers should appear in the index position by using manual indexing
Can someone please share a code with me

Use numpy.arange with ** for squares:
n = 5
s = pd.Series(np.arange(n) ** 2)
print (s)
0 0
1 1
2 4
3 9
4 16
dtype: int32
If want omit 0:
n = 5
arr = np.arange(1, n + 1)
s = pd.Series(arr ** 2, index=arr)
print (s)
1 1
2 4
3 9
4 16
5 25
dtype: int32

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Create column using a sequence of numbers - python

One way to do it would be as such pattern = [1] * 5 + [2] * 5 + [3] * 5 + [4] * 3 no_patterns = len(df)//len(pattern) remaining = len(df) - (len(pattern) * no_patterns) new_values = pattern * no_patterns + pattern[:remaining] df['new_column'] = new_values

You can do: gr=[5,5,5,3] mp={n: next(x+1 for x in range(len(gr)) if sum(gr[:x+1])>n) for n in range(sum(gr))} df['Tperiod']=range(len(df)) df['Tperiod']=(df['Tperiod']%sum(gr)).map(mp) Where gr indicates your groups size (that's your input), and mp is just to utilize it properly with pandas.

Related

Is there a way to reference a previous value in Pandas column efficiently?

Index and save last N points from a list that meets conditions from dataframe Python

Drop rows if value in column changes

Rolling sum on a dynamic window

Create a pandas Series

Categories

Resources