I have one series and one DataFrame, all integers.
s = [10,
10,
10]
m = [[0,0,0,0,3,4,5],
[0,0,0,0,1,1,1],
[10,0,0,0,0,5,5]]
I want to return a matrix in which each number is replaced by the cumulative difference: the Series value for that row minus the running total of the row up to that column.
Output:
n = [[10,10,10,10,7,3,-2],
[10,10,10,10,9,8,7],
[0,0,0,0,0,-5,-10]]
Calculate the cumsum of data frame by row first and then subtract from the Series:
import pandas as pd
s = pd.Series(s)
df = pd.DataFrame(m)
-df.cumsum(1).sub(s, axis=0)
# 0 1 2 3 4 5 6
#0 10 10 10 10 7 3 -2
#1 10 10 10 10 9 8 7
#2 0 0 0 0 0 -5 -10
You can directly compute a cumulative difference using np.subtract.accumulate:
>>> import numpy as np
# make a copy
>>> n = np.array(m)
# replace first column
>>> n[:, 0] = s - n[:, 0]
# subtract in-place
>>> np.subtract.accumulate(n, axis=1, out=n)
array([[ 10, 10, 10, 10, 7, 3, -2],
[ 10, 10, 10, 10, 9, 8, 7],
[ 0, 0, 0, 0, 0, -5, -10]])
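As a cross-check, both answers compute the same thing: the row's starting value minus the running total of that row. A minimal sketch with plain NumPy broadcasting, assuming s and m are the inputs from the question:

```python
import numpy as np

s = np.array([10, 10, 10])
m = np.array([[0, 0, 0, 0, 3, 4, 5],
              [0, 0, 0, 0, 1, 1, 1],
              [10, 0, 0, 0, 0, 5, 5]])

# Turn s into a column so each row starts from its own value,
# then subtract the running total of that row.
n = s[:, None] - np.cumsum(m, axis=1)
print(n)
```

This leaves m untouched, at the cost of one temporary array for the cumulative sum.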
I have a data frame
df = pd.DataFrame([["X",62,5],["Y",16,3],["Z",27,4]],columns=["id","total","days"])
id total days
X 62 5
Y 16 3
Z 27 4
Divide the total column by the days column and create a new column plan holding a list in which the number of elements equals the divisor and each element equals the quotient. If there is a remainder, increase that many values by 1, starting from the end of the list.
Expected Output:
df_out = pd.DataFrame([["X",62,5,[12,12,12,13,13]],["Y",16,3,[5, 5, 6]],["Z",27,4,[6, 7, 7, 7]]],columns=["id","total","days","plan"])
id total days plan
X 62 5 [12, 12, 12, 13, 13]
Y 16 3 [5, 5, 6]
Z 27 4 [6, 7, 7, 7]
How to do it in pandas?
You can use a custom function:
def split(t, d):
    # get floor division and remainder
    x, r = divmod(t, d)
    # assign quotient or quotient + 1
    # depending on the remainder
    return [x]*(d-r)+[x+1]*r
df['plan'] = [split(t, d) for t, d in zip(df['total'], df['days'])]
Output:
id total days plan
0 X 62 5 [12, 12, 12, 13, 13]
1 Y 16 3 [5, 5, 6]
2 Z 27 4 [6, 7, 7, 7]
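A quick sanity check on split, as a sketch: every plan should have exactly days entries and sum back to total. The (20, 4) pair is an extra case that is not in the question; it is included to show that a zero remainder needs no special handling.

```python
def split(t, d):
    # quotient and remainder of t / d
    x, r = divmod(t, d)
    # (d - r) entries of x, then r entries of x + 1
    return [x] * (d - r) + [x + 1] * r

for total, days in [(62, 5), (16, 3), (27, 4), (20, 4)]:
    plan = split(total, days)
    assert len(plan) == days    # one entry per day
    assert sum(plan) == total   # nothing lost or added
    print(total, days, plan)
```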
Mozway already provided a better solution. Still, here is another approach that uses a custom function together with a lambda.
def create_plan(plan, days, remainder):
    return [plan]*days if remainder == 0 else [plan]*(days-remainder)+[plan+1]*remainder
df = pd.DataFrame([["X",62,5],["Y",16,3],["Z",27,4]],columns=["id","total","days"])
# Create plan column
df["plan"] = df["total"] // df["days"]
# Create column for remainder
df["remainder"] = df["total"] % df["days"]
# Apply function to create final plan
df["plan"] = df.apply(lambda x: create_plan(x["plan"], x["days"], x["remainder"]), axis=1)
# Drop remainder column
df.drop("remainder", axis=1, inplace=True)
print(df)
Output:
id total days plan
0 X 62 5 [12, 12, 12, 13, 13]
1 Y 16 3 [5, 5, 6]
2 Z 27 4 [6, 7, 7, 7]
I have a function f(a) that takes one entry from a testarray and returns an array with 5 values:
f(testarray[0])
#Output: array([[0, 1, 5, 3, 2]])
Since f(testarray[0]) is the result of an experiment, I want to run this function f for each entry of the testarray and store each result in a new NumPy array. I always thought this would be quite simple: take an empty NumPy array with the length of the testarray and save the results the following way:
N = 1000 #Number of entries of the testarray
test_result = np.zeros([N, 5], dtype=int)
for i in testarray:
    test_result[i] = f(i)
When I run this, I don't receive any error message but get nonsense results (half of test_result is empty while the rest is filled with implausible values). Since f() works perfectly for a single entry of the testarray, I suppose that something is wrong with how I save the results in test_result. What am I missing here?
(I know that I could save the results as list and then append an empty list, but this method is too slow for the large number of times I want to run the function).
Since you don't seem to understand indexing, stick with this approach
alist = [f(i) for i in testarray]
arr = np.array(alist)
I could show how to use row indices and testarray values together, but that requires more explanation.
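For reference, a minimal sketch of pairing row indices with testarray values using enumerate. The f below is a hypothetical stand-in for the experiment, assumed only to return 5 values per input:

```python
import numpy as np

def f(x):
    # Hypothetical stand-in for the real experiment: returns 5 values.
    return np.array([x * k for k in range(1, 6)])

testarray = np.array([5, 6, 7, 3, 1])
test_result = np.zeros((len(testarray), 5), dtype=int)

# row is the position in testarray (where to store the result),
# value is the entry itself (what to pass to f).
for row, value in enumerate(testarray):
    test_result[row] = f(value)

print(test_result)
```

The row position decides where a result is stored, so the values in testarray never get used as indices.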
Your problem can be reproduced with the following small example:
testarray = np.array([5, 6, 7, 3, 1])
def f(x):
    return np.array([x * i for i in np.arange(1, 6)])
f(testarray[0])
# [ 5 10 15 20 25]
test_result = np.zeros([len(testarray), 5], dtype=int) # len(testarray) or testarray.shape[0]
So, as hpaulj mentioned in the comments, you must be careful with indexing: loop over the positions and look up the value at each position:
for i in range(len(testarray)):
    test_result[i] = f(testarray[i])
# [[ 5 10 15 20 25]
# [ 6 12 18 24 30]
# [ 7 14 21 28 35]
# [ 3 6 9 12 15]
# [ 1 2 3 4 5]]
The other case is where testarray is itself an index array: a shuffled set of the integers 0 to N-1, so that every row of the zero array test_result gets filled. A reproducible example for this case:
testarray = np.array([4, 3, 0, 1, 2])
def f(x):
    return np.array([x * i for i in np.arange(1, 6)])
f(testarray[0])
# [ 4 8 12 16 20]
test_result = np.zeros([len(testarray), 5], dtype=int)
Your loop then gives the following result:
for i in testarray:
    test_result[i] = f(i)
# [[ 0 0 0 0 0]
# [ 1 2 3 4 5]
# [ 2 4 6 8 10]
# [ 3 6 9 12 15]
# [ 4 8 12 16 20]]
As this loop shows, if the index array does not contain every integer from 0 to N-1, some rows of the zero array are never written and stay zero:
testarray = np.array([4, 2, 4, 1, 2])
test_result = np.zeros([len(testarray), 5], dtype=int) # start again from zeros
for i in testarray:
    test_result[i] = f(i)
# [[ 0 0 0 0 0] # <--
# [ 1 2 3 4 5]
# [ 2 4 6 8 10]
# [ 0 0 0 0 0] # <--
# [ 4 8 12 16 20]]
I have a NumPy 2D array and I need to transform it so that the first row remains the same, the second row moves one position to the right (it can wrap around or just be zero-padded at the front), the third row shifts two positions to the right, etc.
I can do this with a for loop, but that is not very efficient. I am guessing there should be a filtering matrix that, multiplied by the original one, will have the same effect, or maybe a NumPy trick that will help me do this? Thanks!
I have looked into numpy.roll() but I don't think it can work on each row separately.
import numpy as np
p = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
'''
p = [ 1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16]
desired output:
p'= [ 1 2 3 4
0 5 6 7
0 0 9 10
0 0 0 13]
'''
We can extract sliding windows from a zero-padded version of the input, which is memory efficient and therefore fast. To get those windows, we can use scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided.
Hence, the solution would be -
from skimage.util.shape import view_as_windows
def slide_by_one(p):
    m,n = p.shape
    z = np.zeros((m,m-1),dtype=p.dtype)
    a = np.concatenate((z,p),axis=1)
    w = view_as_windows(a,(1,p.shape[1]))[...,0,:]
    r = np.arange(m)
    return w[r,r[::-1]]
Sample run -
In [60]: p # generic sample of size mxn
Out[60]:
array([[ 1, 5, 9, 13, 17],
[ 2, 6, 10, 14, 18],
[ 3, 7, 11, 15, 19],
[ 4, 8, 12, 16, 20]])
In [61]: slide_by_one(p)
Out[61]:
array([[ 1, 5, 9, 13, 17],
[ 0, 2, 6, 10, 14],
[ 0, 0, 3, 7, 11],
[ 0, 0, 0, 4, 8]])
We can leverage the regular ramp-like pattern for a more efficient approach with a more direct use of np.lib.stride_tricks.as_strided, like so -
def slide_by_one_v2(p):
    m,n = p.shape
    z = np.zeros((m,m-1),dtype=p.dtype)
    a = np.concatenate((z,p),axis=1)
    s0,s1 = a.strides
    return np.lib.stride_tricks.as_strided(a[:,m-1:],shape=(m,n),strides=(s0-s1,s1))
Another one with some masking -
def slide_by_one_v3(p):
    m,n = p.shape
    z = np.zeros((len(p),1),dtype=p.dtype)
    a = np.concatenate((p,z),axis=1)
    return np.triu(a[:,::-1],1)[:,::-1].flat[:-m].reshape(m,-1)
Here is a simple method based on zero-padding and reshaping. It is fast because it avoids advanced indexing and other overheads.
def pp(p):
    m,n = p.shape
    aux = np.zeros((m,n+m-1),p.dtype)
    np.copyto(aux[:,:n],p)
    return aux.ravel()[:-m].reshape(m,n+m-2)[:,:n].copy()
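To convince yourself that the reshaping trick does what the question asks, it can be compared against a plain loop. This is only a sketch: shift_rows_loop is a hypothetical reference helper, not part of any answer above, and pp is repeated so the snippet runs on its own.

```python
import numpy as np

def shift_rows_loop(p):
    # Reference version: row i moves i positions to the right, zero-filled.
    out = np.zeros_like(p)
    n = p.shape[1]
    for i, row in enumerate(p):
        if i < n:  # rows shifted by n or more stay all zero
            out[i, i:] = row[:n - i]
    return out

def pp(p):
    # Zero-pad each row, drop the last m elements of the flattened array,
    # and reshape so every row starts one element earlier than the last.
    m, n = p.shape
    aux = np.zeros((m, n + m - 1), p.dtype)
    np.copyto(aux[:, :n], p)
    return aux.ravel()[:-m].reshape(m, n + m - 2)[:, :n].copy()

p = np.arange(1, 17).reshape(4, 4)
expected = shift_rows_loop(p)
assert np.array_equal(pp(p), expected)
print(expected)
```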
So I created this post regarding my problem 2 days ago and got an answer thankfully.
I have a data made of 20 rows and 2500 columns. Each column is a unique product and rows are time series, results of measurements. Therefore each product is measured 20 times and there are 2500 products.
This time I want to know for how many consecutive rows my measurement result can stay above a specific threshold.
AKA: I want to count the number of consecutive values that is above a value, let's say 5.
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
The values above 5 are 6, 8, 7 (positions 2 to 4) and 6, 10 (positions 8 to 9). According to what I defined above, I should get NumofConsFeature = 3 as the result (taking the max if more than one run meets the condition).
I thought of filtering using .gt, then getting the indexes and using a loop afterwards in order to detect the consecutive index numbers but couldn't make it work.
In a second phase, I'd like to know the index of the first value of the longest run. For the above example, that would be 3 (counting from 1; index 2 when counting from 0).
But I have no idea how to do this one.
Thanks in advance.
Here's another answer using only Pandas functions:
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns = ['foo'])
a['is_large'] = (a.foo > 5)
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[a.is_large == False, 'count'] = 0
which gives
foo is_large crossing count
0 1 False 1 0
1 2 False 1 0
2 6 True 2 3
3 8 True 2 2
4 7 True 2 1
5 3 False 3 0
6 2 False 3 0
7 3 False 3 0
8 6 True 4 2
9 10 True 4 1
10 2 False 5 0
11 1 False 5 0
12 0 False 5 0
13 2 False 5 0
From there on you can easily find the maximum and its index.
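One way to take that last step, as a sketch: because count runs downwards within each run, its maximum sits on the first row of the longest run, so max() gives the length and idxmax() gives the starting index. The names longest and start are just illustrative.

```python
import pandas as pd

A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns=['foo'])
a['is_large'] = a.foo > 5
# a new group id every time is_large flips
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()
# countdown within each run: first row of a run holds the run length
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[~a.is_large, 'count'] = 0

longest = a['count'].max()    # length of the longest run above 5
start = a['count'].idxmax()   # index label where that run begins
print(longest, start)
```

If two runs tie for the longest, idxmax() reports the first one.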
There is a simple way to do that.
Let's say your list is: A = [1, 2, 6, 8, 7, 6, 8, 3, 2, 3, 6, 10, 6, 7, 8, 2, 1, 0, 2]
And you want to count the runs of 5 consecutive values that are all at least 6. Here the answer is 2: there are two such runs (6, 8, 7, 6, 8 and 6, 10, 6, 7, 8). In Python with pandas, compare each value and the four values after it against the threshold:
df = pd.DataFrame({"wanted_row": A})
condition = (df.wanted_row >= 6) & \
            (df.wanted_row.shift(-1) >= 6) & \
            (df.wanted_row.shift(-2) >= 6) & \
            (df.wanted_row.shift(-3) >= 6) & \
            (df.wanted_row.shift(-4) >= 6)
# number of positions where a window of 5 qualifying values starts
# (a run longer than 5 would be counted once per window it contains)
consecutive_count = condition.sum()
Here's one with maxisland_start_len_mask -
# https://stackoverflow.com/a/52718782/ #Divakar
def maxisland_start_len_mask(a, fillna_index = -1, fillna_len = 0):
    # a is a boolean array
    pad = np.zeros(a.shape[1],dtype=bool)
    mask = np.vstack((pad, a, pad))
    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0)//2
    bins = np.repeat(np.arange(a.shape[1]),n_islands_percol)
    scale = island_lens.max()+1
    scaled_idx = np.argsort(scale*bins + island_lens)
    grp_shift_idx = np.r_[0,n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:]-1]]
    max_island_percol_start = max_island_starts%(a.shape[0]+1)
    valid = n_islands_percol!=0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)
    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid,max_island_percol_start,fillna_index)
    return out_index, out_len

def maxisland_start_len(a, trigger_val, comp_func=np.greater):
    # a is 2D array as the data
    mask = comp_func(a,trigger_val)
    return maxisland_start_len_mask(mask, fillna_index = -1, fillna_len = 0)
Sample run -
In [169]: a
Out[169]:
array([[ 1, 0, 3],
[ 2, 7, 3],
[ 6, 8, 4],
[ 8, 6, 8],
[ 7, 1, 6],
[ 3, 7, 8],
[ 2, 5, 8],
[ 3, 3, 0],
[ 6, 5, 0],
[10, 3, 8],
[ 2, 3, 3],
[ 1, 7, 0],
[ 0, 0, 4],
[ 2, 3, 2]])
# Per column results
In [170]: row_index, length = maxisland_start_len(a, 5)
In [172]: row_index
Out[172]: array([2, 1, 3])
In [173]: length
Out[173]: array([3, 3, 4])
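For a single 1D array the same "island" idea can be written much more compactly. The following is a minimal sketch, not the function above: longest_run_above is a hypothetical helper name, and the (-1, 0) return for "no run found" is an assumption chosen to mirror the fill values used above.

```python
import numpy as np

def longest_run_above(values, threshold):
    """Return (start, length) of the longest run of values > threshold."""
    mask = np.asarray(values) > threshold
    # Pad with False on both sides so every run has a start and an end.
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the mask flips; they alternate start, end, start, end.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[::2], edges[1::2]
    if starts.size == 0:
        return -1, 0  # assumed convention when nothing exceeds the threshold
    lengths = ends - starts
    best = lengths.argmax()  # first longest run on ties
    return int(starts[best]), int(lengths[best])

A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
print(longest_run_above(A, 5))
```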
You can apply diff() on your Series, and then just count the number of consecutive entries where the difference is 1 and the actual value is above your cutoff. The largest count is the maximum number of consecutive values.
First compute diff():
df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
df['b'] = df.a.diff()
df
a b
0 1 NaN
1 2 1.0
2 6 4.0
3 7 1.0
4 8 1.0
5 3 -5.0
6 2 -1.0
7 3 1.0
8 6 3.0
9 10 4.0
10 2 -8.0
11 1 -1.0
12 0 -1.0
13 2 2.0
Now count consecutive sequences:
above = 5
n_consec = 1
max_n_consec = 1
for a, b in df.values[1:]:
    if a > above and b == 1:
        n_consec += 1
    else: # check for new max, then start again from 1
        max_n_consec = max(n_consec, max_n_consec)
        n_consec = 1
# also check a run that ends at the last row
max_n_consec = max(n_consec, max_n_consec)
max_n_consec
3
Here's how I did it using numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
consecutive_steps = 2
marginal_price = 5
assertions = [(df.loc[:, "a"].shift(-i) < marginal_price) for i in range(consecutive_steps)]
condition = np.all(assertions, axis=0)
consecutive_count = df.loc[condition, :].count()
print(consecutive_count)
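which yields 6, the number of positions at which consecutive_steps consecutive values are all below marginal_price.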
I want to generate "category intervals" from categories.
for example, suppose I have the following :
>>> df['start'].describe()
count 259431.000000
mean 10.435858
std 5.504730
min 0.000000
25% 6.000000
50% 11.000000
75% 15.000000
max 20.000000
Name: start, dtype: float64
and unique value of my column are:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20], dtype=int8)
but I want to use the following list of intervals:
>>> intervals
[[0, 2.2222222222222223],
[2.2222222222222223, 4.4444444444444446],
[4.4444444444444446, 6.666666666666667],
[6.666666666666667, 8.8888888888888893],
[8.8888888888888893, 11.111111111111111],
[11.111111111111111, 13.333333333333332],
[13.333333333333332, 15.555555555555554],
[15.555555555555554, 17.777777777777775],
[17.777777777777775, 20]]
to change my column 'start' into values x where x represents the index of the interval that contains df['start'] (so x in my case will vary from 0 to 8)
is there a more or less simple way to do it using pandas/numpy?
In advance, thanks a lot for the help.
Regards.
You can use np.digitize:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(start=np.random.randint(0, 21, 10000)))  # integers 0..20 inclusive
# the left-hand edges of each "interval"
intervals = np.linspace(0, 20, 9, endpoint=False)
print(intervals)
# [ 0. 2.22222222 4.44444444 6.66666667 8.88888889
# 11.11111111 13.33333333 15.55555556 17.77777778]
df['start_idx'] = np.digitize(df['start'], intervals) - 1
print(df.head())
# start start_idx
# 0 8 3
# 1 16 7
# 2 0 0
# 3 7 3
# 4 0 0
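To see the mapping for every possible value rather than a random sample, here is a small deterministic sketch. It also uses the fact that, for increasing bins, np.digitize matches np.searchsorted with side='right'. The names edges, values and idx are just illustrative.

```python
import numpy as np

# left-hand edges of the 9 intervals
edges = np.linspace(0, 20, 9, endpoint=False)
# the 21 distinct values that occur in the 'start' column
values = np.arange(21)

# count how many left edges are <= each value, minus 1 for a 0-based index
idx = np.searchsorted(edges, values, side='right') - 1

for v, i in zip(values, idx):
    print(v, i)
```

Since only left edges are used, the value 20 falls into the last interval (index 8) instead of getting a bin of its own.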