I'm new to python and have a simple question for which I haven't found an answer yet.
Let's say I have a time series c(t):
t_ c_
1 40
2 41
3 4
4 5
5 7
6 20
7 20
8 8
9 90
10 99
11 10
12 5
13 8
14 8
15 19
I now want to evaluate this series with respect to how long the value c has been continuously in certain ranges and how often these time periods occur.
The result would therefore include three columns: c (binned), duration (binned), frequency. Translated to the simple example the result could look as follows:
c_ Dt_ Freq_
0-50 8 1
50-100 2 1
0-50 5 1
Can you give me some advice?
Thanks in advance,
Ulrike
//EDIT:
Thank you for the replies! My example data were somewhat flawed, so part of my question didn't come across. Here is a new data series:
series=
t c
1 1
2 1
3 10
4 10
5 10
6 1
7 1
8 50
9 50
10 50
12 1
13 1
14 1
If I apply the code proposed by Christoph below:
bins = pd.cut(series['c'], [-1, 5, 100])
same_as_prev = (bins != bins.shift())
run_ids = same_as_prev.cumsum()
result = bins.groupby(run_ids).aggregate(["first", "count"])
I receive a result like this:
first count
(-1, 5] 2
(5, 100] 3
(-1, 5] 2
(5, 100] 3
(-1, 5] 3
but what I'm more interested in is something looking like this:
c length freq
(-1, 5] 2 2
(-1, 5] 3 1
(5, 100] 3 2
How do I achieve this? And how could I plot it in a KDE plot?
Best,
Ulrike
Nicely asked question with an example :)
This is one way to do it, most likely incomplete, but it should help you a bit.
Since your data is spaced in time by a fixed increment, I do not implement the time series and instead use the index as time. Thus, I convert c to an array and use np.where() to find the indices of the values that fall in each bin.
import numpy as np
c = np.array([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])
bin1 = np.where((0 <= c) & (c <= 50))[0]
bin2 = np.where((50 < c) & (c <= 100))[0]
For bin1, the output is array([ 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14], dtype=int64), which corresponds to the indices where the values of c fall in that bin.
The next step is to find the consecutive indices. According to this SO post:
from itertools import groupby
from operator import itemgetter
data = bin1
for k, g in groupby(enumerate(data), lambda ix: ix[0] - ix[1]):
    print(list(map(itemgetter(1), g)))
# Output is:
#[0, 1, 2, 3, 4, 5, 6, 7]
#[10, 11, 12, 13, 14]
Final step: place the new sub-bins in the right order and track which bin each sub-bin corresponds to. Thus, the complete code would look like:
import numpy as np
from itertools import groupby
from operator import itemgetter
c = np.array([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])
bin1 = np.where((0 <= c) & (c <= 50))[0]
bin2 = np.where((50 < c) & (c <= 100))[0]
# 1 and 2 for the range names.
bins = [(bin1, 1), (bin2, 2)]
subbins = list()
for b in bins:
    data = b[0]
    name = b[1]  # 1 or 2
    for k, g in groupby(enumerate(data), lambda ix: ix[0] - ix[1]):
        subbins.append((list(map(itemgetter(1), g)), name))
subbins = sorted(subbins, key=lambda x: x[0][0])
Output: [([0, 1, 2, 3, 4, 5, 6, 7], 1), ([8, 9], 2), ([10, 11, 12, 13, 14], 1)]
Then, you just have to do the stats you want :)
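For instance, a minimal sketch that counts how often each (bin, run length) combination from subbins occurs, using only the standard library:
from collections import Counter

# count how often each (bin name, run length) pair occurs in subbins
stats = Counter((name, len(idx)) for idx, name in subbins)
for (name, length), freq in sorted(stats.items()):
    print(f"bin {name}: duration {length} occurs {freq}x")
# bin 1: duration 5 occurs 1x
# bin 1: duration 8 occurs 1x
# bin 2: duration 2 occurs 1x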
import pandas as pd
def bin_run_lengths(series, bins):
    binned = pd.cut(pd.Series(series), bins)
    return binned.groupby(
        (1 - (binned == binned.shift())).cumsum()
    ).aggregate(["first", "count"])
(I'm not sure where your frequency column comes in - in the problem as you describe it, it seems like it would always be set to 1.)
Binning
Binning a series is easy with pandas.cut():
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html
import pandas as pd
pd.cut(pd.Series(range(100)), bins=[-1,0,10,20,50,100])
The bin edges here produce left-exclusive, right-inclusive intervals (pd.cut's default); the bins argument can also be given in other forms (see the short sketch after the sample output below).
0 (-1.0, 0.0]
1 (0.0, 10.0]
2 (0.0, 10.0]
3 (0.0, 10.0]
4 (0.0, 10.0]
5 (0.0, 10.0]
6 (0.0, 10.0]
...
19 (10.0, 20.0]
20 (10.0, 20.0]
21 (20.0, 50.0]
22 (20.0, 50.0]
23 (20.0, 50.0]
...
29 (20.0, 50.0]
...
99 (50.0, 100.0]
Length: 100, dtype: category
Categories (5, interval[int64]): [(-1, 0] < (0, 10] < (10, 20] < (20, 50] < (50, 100]]
This converts it from a series of values to a series of intervals.
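As a side note (not part of the original answer), the bins argument can, for example, also be an integer number of equal-width bins, or custom labels can be supplied:
# a quick sketch of other accepted forms of the bins/labels arguments
pd.cut(pd.Series(range(100)), bins=4)  # an integer: 4 equal-width bins
pd.cut(pd.Series(range(100)),
       bins=[-1, 0, 10, 20, 50, 100],
       labels=["zero", "low", "mid", "high", "top"])  # custom labels instead of intervals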
Count consecutive values
This doesn't have a native idiom in pandas, but it is fairly easy with a few common functions. The top-voted StackOverflow answer here puts it very well: Counting consecutive positive value in Python array
same_as_prev = (series != series.shift())
This yields a Boolean series that is True wherever the value differs from the one before (despite the variable name, it marks the start of each new run).
run_ids = same_as_prev.cumsum()
This makes an integer series that increments by 1 each time a new run starts, and thus assigns each position in the series a "run ID".
result = series.groupby(run_ids).aggregate(["first", "count"])
This yields a dataframe that shows the value in each run and the length of that run:
first count
1 (-1, 0] 1
2 (0, 10] 10
3 (10, 20] 10
4 (20, 50] 30
5 (50, 100] 49
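To get from this run-length table to the (c, length, freq) layout in the question's edit, one more grouping step is enough; a rough sketch, assuming seaborn is available for the KDE plot:
import seaborn as sns  # assumed here for the KDE plot

# result has columns "first" (the bin) and "count" (the run length)
freq = (result
        .rename(columns={"first": "c", "count": "length"})
        .groupby(["c", "length"], observed=True)  # count identical (bin, length) pairs
        .size()
        .rename("freq")
        .reset_index())

# one KDE of run lengths per bin, weighted by how often each length occurred
sns.kdeplot(data=freq, x="length", hue="c", weights="freq")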
Related
We start with an interval axis that is divided into bins of length 5: [0, 5), [5, 10), ...
There is a timestamp column that has some timestamps >= 0. By using pd.cut() the interval bin that corresponds to the timestamp is determined (e.g. "timestamp" = 3.0 -> "time_bin" = [0, 5)).
If there is a time bin that has no corresponding timestamp, it does not show up in the interval column. Thus, there can be interval gaps in the "time_bin" column, e.g., [5, 10), [15, 20) (i.e., the interval [10, 15) is missing; note that the timestamp column is sorted).
The goal is to obtain a column "connected_interval" that indicates whether the current row's interval is connected to the previous row's interval (connected meaning no gaps, i.e., [0, 5), [5, 10), [10, 15) would all be assigned the same integer ID), and a column "conn_interv_len" that gives, for each largest possible connected interval, its total length. The run [0, 5), [5, 10), [10, 15) would have length 15.
The initial dataframe has columns "group_id", "timestamp", "time_bin". Columns "connected_interval" & "conn_interv_len" should be computed.
Note: any solution to obtaining the length of populated connected intervals is welcome.
df = pd.DataFrame({"group_id":['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],\
"timestamp": [0.0, 3.0, 9.0, 24.2, 30.2, 0.0, 136.51, 222.0, 237.0, 252.0],\
"time_bin": [pd.Interval(0, 5, closed='left'), pd.Interval(0, 5, closed='left'), pd.Interval(5, 10, closed='left'), pd.Interval(20, 25, closed='left'), pd.Interval(30, 35, closed='left'), pd.Interval(0, 5, closed='left'), pd.Interval(135, 140, closed='left'), pd.Interval(220, 225, closed='left'), pd.Interval(235, 240, closed='left'), pd.Interval(250, 255, closed='left')],\
"connected_interval":[0, 0, 0, 1, 2, 0, 1, 2, 3, 4],\
"conn_interv_len":[10, 10, 10, 5, 5, 5, 5, 5, 5, 5],\
})
input with expected output columns:
group_id timestamp time_bin connected_interval conn_interv_len
0 A 0.00 [0, 5) 0 10
1 A 3.00 [0, 5) 0 10
2 A 9.00 [5, 10) 0 10
3 A 24.20 [20, 25) 1 5
4 A 30.20 [30, 35) 2 5
5 B 0.00 [0, 5) 0 5
6 B 136.51 [135, 140) 1 5
7 B 222.00 [220, 225) 2 5
8 B 237.00 [235, 240) 3 5
9 B 252.00 [250, 255) 4 5
IIUC, you can extract the left/right bounds, sort the intervals, drop duplicates, create groups based on whether successive left/right bounds match, then merge the output back onto the original:
df2 = (df[['group_id', 'time_bin']]
       # extract bounds and sort intervals
       .assign(left=df['time_bin'].array.left,
               right=df['time_bin'].array.right)
       .sort_values(by=['group_id', 'left', 'right'])
       # ensure no duplicates
       .drop_duplicates(['group_id', 'time_bin'])
       # compute connected intervals and connected length
       .assign(connected_interval=lambda d:
                   d.groupby('group_id', group_keys=False)
                    .apply(lambda g: g['left'].ne(g['right'].shift())
                                      .cumsum().sub(1)),
               conn_interv_len=lambda d:
                   (g := d.groupby(['group_id', 'connected_interval']))['right'].transform('max')
                   - g['left'].transform('min')
               )
       .drop(columns=['left', 'right'])
       )
# merge to restore missing dropped duplicated rows
out = df.merge(df2)
output:
group_id timestamp time_bin connected_interval conn_interv_len
0 A 0.00 [0, 5) 0 10
1 A 3.00 [0, 5) 0 10
2 A 9.00 [5, 10) 0 10
3 A 24.20 [20, 25) 1 5
4 A 30.20 [30, 35) 2 5
5 B 0.00 [0, 5) 0 5
6 B 136.51 [135, 140) 1 5
7 B 222.00 [220, 225) 2 5
8 B 237.00 [235, 240) 3 5
9 B 252.00 [250, 255) 4 5
I am new to Python and its libraries. I searched all the forums but could not find a proper solution. This is the first time I'm posting a question here, sorry if I did something wrong.
So, I have two DataFrames like below containing X Y Z coordinates (UTM) and other features.
In [2]: a = {
...: 'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
...: 'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
...: 'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19],
...: }
...:
In [3]: b = {
...: 'X': [1, 8, 20, 7, 32],
...: 'Y': [6, 4, 17, 45, 32],
...: 'Z': [52, 12, 6, 8, 31],
...: }
In [4]: df1 = pd.DataFrame(data=a)
In [5]: df2 = pd.DataFrame(data=b)
In [6]: print(df1)
X Y Z
0 1 3 12
1 2 4 4
2 5 8 9
3 7 15 16
4 10 20 13
5 5 12 1
6 2 23 8
7 3 22 17
8 24 14 11
9 21 7 19
In [7]: print(df2)
X Y Z
0 1 6 52
1 8 4 12
2 20 17 6
3 7 45 8
4 32 32 31
I need to find the closest point (by distance) in df1 to each point of df2 and create a new DataFrame.
So I wrote the code below, which does find the closest point to df2.iloc[0].
In [8]: x = (
...: np.sqrt(
...: ((df1['X'].sub(df2["X"].iloc[0]))**2)
...: .add(((df1['Y'].sub(df2["Y"].iloc[0]))**2))
...: .add(((df1['Z'].sub(df2["Z"].iloc[0]))**2))
...: )
...: ).idxmin()
In [9]: x1 = df1.iloc[[x]]
In[10]: print(x1)
X Y Z
3 7 15 16
So, I guess I need a loop to iterate through df2 and apply the above code to each row. As a result I need a new DataFrame containing the closest point in df1 to each point of df2. But I couldn't make it work. Please advise.
This is actually a great example of a case where numpy's broadcasting rules have distinct advantages over pandas.
Manually aligning df1's coordinates as column vectors (by referencing df1[[col]].to_numpy()) and df2's coordinates as row vectors (df2[col].to_numpy()), we can get the distance from every element in each dataframe to each element in the other very quickly with automatic broadcasting:
In [26]: dists = np.sqrt(
...: (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
...: + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
...: + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
...: )
In [27]: dists
Out[27]:
array([[40.11234224, 7.07106781, 24.35159132, 42.61455151, 46.50806382],
[48.05205511, 10. , 22.29349681, 41.49698784, 49.12229636],
[43.23193264, 5.83095189, 17.74823935, 37.06750599, 42.29657197],
[37.58989226, 11.74734012, 16.52271164, 31.04834939, 33.74907406],
[42.40283009, 16.15549442, 12.56980509, 25.67099531, 30.85449724],
[51.50728104, 13.92838828, 16.58312395, 33.7934905 , 45.04442252],
[47.18050445, 20.32240143, 19.07878403, 22.56102835, 38.85871846],
[38.53569774, 19.33907961, 20.85665361, 25.01999201, 33.7194306 ],
[47.68647607, 18.89444363, 7.07106781, 35.48239 , 28.0713377 ],
[38.60051813, 15.06651917, 16.43167673, 41.96427052, 29.83286778]])
Argmin will now give you the correct vector of positional indices:
In [28]: dists.argmin(axis=0)
Out[28]: array([3, 2, 8, 6, 8])
Or, to select the appropriate values from df1:
In [29]: df1.iloc[dists.argmin(axis=0)]
Out[29]:
X Y Z
3 7 15 16
2 5 8 9
8 24 14 11
6 2 23 8
8 24 14 11
Edit
An answer popped up just after mine, then was deleted, which made reference to scipy.spatial.distance_matrix, computing dists with:
distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
Not sure why that answer was deleted, but this seems like a really nice, clean approach to getting the array I produced manually above!
Performance Note
Note that if you are only after the closest point, there is no need to take the square root: it is a costly operation compared to addition, subtraction, and powers, and ranking by dist**2 gives the same ordering.
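For example, a sketch of the same computation as above without the square root; argmin over the squared distances selects the same rows:
# same broadcasting as above, but on squared distances (no sqrt needed)
sq_dists = (
    (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
    + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
    + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
)
closest = df1.iloc[sq_dists.argmin(axis=0)]  # identical rows to the sqrt version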
First, you define a function that returns the closest point using numpy.where. Then you use the apply function to run through df2.
import pandas as pd
import numpy as np
a = {
'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19]
}
b = {
'X': [1, 8, 20, 7, 32],
'Y': [6, 4, 17, 45, 32],
'Z': [52, 12, 6, 8, 31]
}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
dist = lambda dx,dy,dz: np.sqrt(dx**2+dy**2+dz**2)
def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = np.where(darr == np.amin(darr))[0][0]
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]
df2['closest'] = df2.apply(closest, axis=1)
print(df2)
Output:
X Y Z closest
0 1 6 52 (7, 15, 16)
1 8 4 12 (5, 8, 9)
2 20 17 6 (24, 14, 11)
3 7 45 8 (2, 23, 8)
4 32 32 31 (24, 14, 11)
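If separate coordinate columns are preferred over a tuple column, the result can be expanded afterwards; a small sketch (the closest_X/Y/Z column names are just illustrative):
# split the tuple column into three separate (illustrative) columns
df2[['closest_X', 'closest_Y', 'closest_Z']] = pd.DataFrame(
    df2['closest'].tolist(), index=df2.index)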
I have one series and one DataFrame, all integers.
s = [10,
10,
10]
m = [[0,0,0,0,3,4,5],
[0,0,0,0,1,1,1],
[10,0,0,0,0,5,5]]
I want to return a matrix in which each value is replaced by the cumulative difference, i.e. the series value minus the running sum of that row up to and including that column.
Output:
n = [[10,10,10,10,7,3,-2],
[10,10,10,10,9,8,7],
[0,0,0,0,0,-5,-10]]
Calculate the cumulative sum of the DataFrame along each row first and then subtract it from the Series:
import pandas as pd
s = pd.Series(s)
df = pd.DataFrame(m)
-df.cumsum(1).sub(s, axis=0)
# 0 1 2 3 4 5 6
#0 10 10 10 10 7 3 -2
#1 10 10 10 10 9 8 7
#2 0 0 0 0 0 -5 -10
You can directly compute a cumulative difference using np.subtract.accumulate:
# make a copy
>>> n = np.array(m)
# replace first column
>>> n[:, 0] = s - n[:, 0]
# subtract in-place
>>> np.subtract.accumulate(n, axis=1, out=n)
array([[ 10, 10, 10, 10, 7, 3, -2],
[ 10, 10, 10, 10, 9, 8, 7],
[ 0, 0, 0, 0, 0, -5, -10]])
A very simple example just for understanding.
I have the following pandas dataframe:
import pandas as pd
df = pd.DataFrame({'A':pd.Series([1, 2, 13, 14, 25, 26, 37, 38])})
df
A
0 1
1 2
2 13
3 14
4 25
5 26
6 37
7 38
Set n = 3
First example
How can I get a new dataframe df1 (in an efficient way) like the following:
D1 D2 D3 T
0 1 2 13 14
1 2 13 14 25
2 13 14 25 26
3 14 25 26 37
4 25 26 37 38
Hint: think of the first n columns as the data (Dx) and the last column as the target (T). In the 1st example the target (e.g. 25) depends on the preceding n elements (2, 13, 14).
Second example
What if the target is some elements ahead (e.g. +3)?
D1 D2 D3 T
0 1 2 13 26
1 2 13 14 37
2 13 14 25 38
Thank you for your help,
Gilberto
P.S. If you think the title can be improved, please suggest how to modify it.
Update
Thanks to @Divakar and this post, the rolling function can be defined as:
import numpy as np
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = np.arange(1000000000)
b = rolling(a, 4)
In less than 1 second!
Let's see how we can solve it with NumPy tools. Imagine you have the column data as a NumPy array, let's call it a. For such sliding-window operations, NumPy has a very efficient tool in strides, which create views into the input array without actually making copies.
Let's directly use the methods with the sample data and start with case #1 -
In [29]: a # Input data
Out[29]: array([ 1, 2, 13, 14, 25, 26, 37, 38])
In [30]: m = a.strides[0] # Get strides
In [31]: n = 3 # parameter
In [32]: nrows = a.size - n # Get number of rows in o/p
In [33]: a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,n+1),strides=(m,m))
In [34]: a2D
Out[34]:
array([[ 1, 2, 13, 14],
[ 2, 13, 14, 25],
[13, 14, 25, 26],
[14, 25, 26, 37],
[25, 26, 37, 38]])
In [35]: np.may_share_memory(a,a2D)
Out[35]: True # a2D is a view into a
Case #2 would be similar with an additional parameter for the Target column -
In [36]: n2 = 3 # Additional param
In [37]: nrows = a.size - n - n2 + 1
In [38]: part1 = np.lib.stride_tricks.as_strided(a,shape=(nrows,n),strides=(m,m))
In [39]: part1 # These are D1, D2, D3, etc.
Out[39]:
array([[ 1, 2, 13],
[ 2, 13, 14],
[13, 14, 25]])
In [43]: part2 = a[n+n2-1:] # This is target col
In [44]: part2
Out[44]: array([26, 37, 38])
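To assemble both parts into the frame from the question (a small sketch using the question's D1..D3/T column names; case #1's a2D can be wrapped into a DataFrame the same way):
import pandas as pd

# stack the strided windows and the target column into one frame
out = pd.DataFrame(part1, columns=['D%d' % (i + 1) for i in range(n)])
out['T'] = part2
#    D1  D2  D3   T
# 0   1   2  13  26
# 1   2  13  14  37
# 2  13  14  25  38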
I found another method: view_as_windows
import numpy as np
from skimage.util.shape import view_as_windows
window_shape = (4, )
aa = np.arange(1000000000) # 1 billion!
bb = view_as_windows(aa, window_shape)
bb
array([[ 0, 1, 2, 3],
[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
...,
[999999994, 999999995, 999999996, 999999997],
[999999995, 999999996, 999999997, 999999998],
[999999996, 999999997, 999999998, 999999999]])
Around 1 second.
What do you think?
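As a further note, assuming NumPy >= 1.20 is available, numpy ships sliding_window_view, which produces the same windows without writing the strides by hand:
import numpy as np

a = np.array([1, 2, 13, 14, 25, 26, 37, 38])
windows = np.lib.stride_tricks.sliding_window_view(a, 4)  # also a view, no copy
# array([[ 1,  2, 13, 14],
#        [ 2, 13, 14, 25],
#        [13, 14, 25, 26],
#        [14, 25, 26, 37],
#        [25, 26, 37, 38]])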
I want to generate "category intervals" from categories.
For example, suppose I have the following:
>>> df['start'].describe()
count 259431.000000
mean 10.435858
std 5.504730
min 0.000000
25% 6.000000
50% 11.000000
75% 15.000000
max 20.000000
Name: start, dtype: float64
and the unique values of my column are:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20], dtype=int8)
but I want to use the following list of intervals:
>>> intervals
[[0, 2.2222222222222223],
[2.2222222222222223, 4.4444444444444446],
[4.4444444444444446, 6.666666666666667],
[6.666666666666667, 8.8888888888888893],
[8.8888888888888893, 11.111111111111111],
[11.111111111111111, 13.333333333333332],
[13.333333333333332, 15.555555555555554],
[15.555555555555554, 17.777777777777775],
[17.777777777777775, 20]]
I want to change my column 'start' into values x, where x is the index of the interval that contains df['start'] (so x in my case will vary from 0 to 8).
Is there a more or less simple way to do it using pandas/numpy?
In advance, thanks a lot for the help.
Regards.
You can use np.digitize:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(start=np.random.randint(0, 21, 10000)))  # random_integers is deprecated; randint's upper bound is exclusive
# the left-hand edges of each "interval"
intervals = np.linspace(0, 20, 9, endpoint=False)
print(intervals)
# [ 0. 2.22222222 4.44444444 6.66666667 8.88888889
# 11.11111111 13.33333333 15.55555556 17.77777778]
df['start_idx'] = np.digitize(df['start'], intervals) - 1
print(df.head())
# start start_idx
# 0 8 3
# 1 16 7
# 2 0 0
# 3 7 3
# 4 0 0
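A pandas-only alternative sketch, assuming the same nine equal-width intervals: pd.cut with labels=False returns the interval index directly:
# same 0..8 codes via pd.cut; include_lowest makes the value 0 fall into bin 0
df['start_idx_cut'] = pd.cut(df['start'],
                             bins=np.linspace(0, 20, 10),
                             labels=False,
                             include_lowest=True)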