generate "category-intervals" from categories - python

I want to generate "category intervals" from categories.
For example, suppose I have the following:
>>> df['start'].describe()
count 259431.000000
mean 10.435858
std 5.504730
min 0.000000
25% 6.000000
50% 11.000000
75% 15.000000
max 20.000000
Name: start, dtype: float64
and the unique values of my column are:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20], dtype=int8)
but I want to use the following list of intervals:
>>> intervals
[[0, 2.2222222222222223],
[2.2222222222222223, 4.4444444444444446],
[4.4444444444444446, 6.666666666666667],
[6.666666666666667, 8.8888888888888893],
[8.8888888888888893, 11.111111111111111],
[11.111111111111111, 13.333333333333332],
[13.333333333333332, 15.555555555555554],
[15.555555555555554, 17.777777777777775],
[17.777777777777775, 20]]
I want to change my column 'start' into values x, where x is the index of the interval that contains df['start'] (so x in my case will vary from 0 to 8).
Is there a more or less simple way to do it using pandas/numpy?
Thanks in advance for the help.
Regards.

You can use np.digitize:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(start=np.random.randint(0, 21, 10000)))  # randint's upper bound is exclusive
# the left-hand edges of each "interval"
intervals = np.linspace(0, 20, 9, endpoint=False)
print(intervals)
# [ 0. 2.22222222 4.44444444 6.66666667 8.88888889
# 11.11111111 13.33333333 15.55555556 17.77777778]
df['start_idx'] = np.digitize(df['start'], intervals) - 1
print(df.head())
# start start_idx
# 0 8 3
# 1 16 7
# 2 0 0
# 3 7 3
# 4 0 0
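A roughly equivalent pandas-only sketch (an addition, not part of the original answer) uses pd.cut with labels=False; since every interior edge is a non-integer, it agrees with the np.digitize result on this integer data:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(start=np.random.randint(0, 21, 10000)))
# ten edges spanning [0, 20] produce the nine intervals from the question
edges = np.linspace(0, 20, 10)
# labels=False returns the integer index of the containing interval;
# include_lowest=True keeps start == 0 inside the first bin
df['start_idx'] = pd.cut(df['start'], bins=edges, labels=False, include_lowest=True)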

How to concatenate columns and pivot keeping columns information in pandas

I have an input df:
input_ = pd.DataFrame.from_records(
    [
        [1, 10, 11, 31],
        [2, 20, 12, 21],
        [3, 30, 13, 11],
    ],
    columns=['X_val', 'Y_val1', 'Y_val2', 'Y_val3'],
)
and want to concatenate every Y value while still keeping track of which column each value came from, for plotting and analysis.
I have multiple files with a variable number of Y columns. I ended up concatenating them column by column and extending with the multiplied value, but I was wondering if there is a better solution, because mine is terribly tedious.
expected_output_ = pd.DataFrame.from_records(
    [
        [1, 10, 'Y_val1'],
        [1, 11, 'Y_val2'],
        [1, 31, 'Y_val3'],
        [2, 20, 'Y_val1'],
        [2, 12, 'Y_val2'],
        [2, 21, 'Y_val3'],
        [3, 30, 'Y_val1'],
        [3, 13, 'Y_val2'],
        [3, 11, 'Y_val3'],
    ],
    columns=['X_val', 'Y_val', 'Y_type'],
)
You can use pandas.DataFrame.melt:
input_.melt(
    id_vars=['X_val'],
    value_vars=['Y_val1', 'Y_val2', 'Y_val3'],
    var_name='Y_type',
    value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Alternatively, as suggested by @Vishnudev, you can also use the following variation, especially for a large number of similarly named Y_val* columns:
input_.melt(
    id_vars=['X_val'],
    value_vars=input_.filter(regex='Y_val').columns,
    var_name='Y_type',
    value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Output:
X_val Y_type Y_val
0 1 Y_val1 10
1 1 Y_val2 11
2 1 Y_val3 31
3 2 Y_val1 20
4 2 Y_val2 12
5 2 Y_val3 21
6 3 Y_val1 30
7 3 Y_val2 13
8 3 Y_val3 11
Optionally, you can rearrange the column sequence if you like.
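For example, assuming the melted result above is stored in a variable named out (a hypothetical name), reordering is plain column selection:
out = input_.melt(
    id_vars=['X_val'],
    value_vars=input_.filter(regex='Y_val').columns,
    var_name='Y_type',
    value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
# match the expected output's column order
out = out[['X_val', 'Y_val', 'Y_type']]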

Find nearest index in one dataframe to another

I am new to python and its libraries. Searched all the forums but could not find a proper solution. This is the first time posting a question here. Sorry if I did something wrong.
So, I have two DataFrames like the ones below, containing X, Y, Z coordinates (UTM) and other features.
In [2]: a = {
...: 'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
...: 'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
...: 'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19],
...: }
...:
In [3]: b = {
...: 'X': [1, 8, 20, 7, 32],
...: 'Y': [6, 4, 17, 45, 32],
...: 'Z': [52, 12, 6, 8, 31],
...: }
In [4]: df1 = pd.DataFrame(data=a)
In [5]: df2 = pd.DataFrame(data=b)
In [6]: print(df1)
X Y Z
0 1 3 12
1 2 4 4
2 5 8 9
3 7 15 16
4 10 20 13
5 5 12 1
6 2 23 8
7 3 22 17
8 24 14 11
9 21 7 19
In [7]: print(df2)
X Y Z
0 1 6 52
1 8 4 12
2 20 17 6
3 7 45 8
4 32 32 31
I need to find the closest point (by distance) in df1 to each point of df2, and create a new DataFrame from the results.
So I wrote the code below, and it actually finds the closest point to df2.iloc[0].
In [8]: x = (
...: np.sqrt(
...: ((df1['X'].sub(df2["X"].iloc[0]))**2)
...: .add(((df1['Y'].sub(df2["Y"].iloc[0]))**2))
...: .add(((df1['Z'].sub(df2["Z"].iloc[0]))**2))
...: )
...: ).idxmin()
In [9]: x1 = df1.iloc[[x]]
In[10]: print(x1)
X Y Z
3 7 15 16
So, I guess I need a loop to iterate through df2 and apply the above code to each row. As a result I need a new updated df1 containing all the closest points to each point of df2. But I couldn't make it work. Please advise.
This is actually a great example of a case where numpy's broadcasting rules have distinct advantages over pandas.
Manually aligning df1's coordinates as column vectors (by referencing df1[[col]].to_numpy()) and df2's coordinates as row vectors (df2[col].to_numpy()), we can get the distance from every element in each dataframe to each element in the other very quickly with automatic broadcasting:
In [26]: dists = np.sqrt(
...: (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
...: + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
...: + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
...: )
In [27]: dists
Out[27]:
array([[40.11234224, 7.07106781, 24.35159132, 42.61455151, 46.50806382],
[48.05205511, 10. , 22.29349681, 41.49698784, 49.12229636],
[43.23193264, 5.83095189, 17.74823935, 37.06750599, 42.29657197],
[37.58989226, 11.74734012, 16.52271164, 31.04834939, 33.74907406],
[42.40283009, 16.15549442, 12.56980509, 25.67099531, 30.85449724],
[51.50728104, 13.92838828, 16.58312395, 33.7934905 , 45.04442252],
[47.18050445, 20.32240143, 19.07878403, 22.56102835, 38.85871846],
[38.53569774, 19.33907961, 20.85665361, 25.01999201, 33.7194306 ],
[47.68647607, 18.89444363, 7.07106781, 35.48239 , 28.0713377 ],
[38.60051813, 15.06651917, 16.43167673, 41.96427052, 29.83286778]])
Argmin will now give you the correct vector of positional indices:
In [28]: dists.argmin(axis=0)
Out[28]: array([3, 2, 8, 6, 8])
Or, to select the appropriate values from df1:
In [29]: df1.iloc[dists.argmin(axis=0)]
Out[29]:
X Y Z
3 7 15 16
2 5 8 9
8 24 14 11
6 2 23 8
8 24 14 11
Edit
An answer popped up just after mine, then was deleted, which made reference to scipy.spatial.distance_matrix, computing dists with:
distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
Not sure why that answer was deleted, but this seems like a really nice, clean approach to getting the array I produced manually above!
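For much larger frames, a KD-tree is another common route (an addition here, not from the deleted answer); scipy.spatial.cKDTree returns nearest distances and positional indices in a single query:
from scipy.spatial import cKDTree
# build the tree on df1's coordinates, then query every point in df2
tree = cKDTree(df1[list('XYZ')].to_numpy())
dist, idx = tree.query(df2[list('XYZ')].to_numpy())
nearest = df1.iloc[idx]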
Performance Note
Note that if you are just trying to get the closest value, there's no need to take the square root, as this is a costly operation compared to addition, subtraction, and powers, and sorting on dist**2 is still valid.
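A minimal sketch of that shortcut, reusing the broadcasting code from above:
# squared distances preserve the ordering, so argmin is unchanged
sq_dists = (
    (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
    + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
    + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
)
nearest = df1.iloc[sq_dists.argmin(axis=0)]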
First, you define a function that returns the closest point using numpy.where. Then you use the apply function to run through df2.
import pandas as pd
import numpy as np
a = {
'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19]
}
b = {
'X': [1, 8, 20, 7, 32],
'Y': [6, 4, 17, 45, 32],
'Z': [52, 12, 6, 8, 31]
}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
dist = lambda dx, dy, dz: np.sqrt(dx**2 + dy**2 + dz**2)

def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = np.where(darr == np.amin(darr))[0][0]
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]

df2['closest'] = df2.apply(closest, axis=1)
print(df2)
Output:
X Y Z closest
0 1 6 52 (7, 15, 16)
1 8 4 12 (5, 8, 9)
2 20 17 6 (24, 14, 11)
3 7 45 8 (2, 23, 8)
4 32 32 31 (24, 14, 11)
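A small variation on this answer (a sketch, not the original code) keeps the distance too, using the Series method idxmin instead of np.where:
def closest_with_dist(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = darr.idxmin()  # label equals position here (default RangeIndex)
    return pd.Series({'closest_idx': idx, 'closest_dist': darr[idx]})

df2 = df2.join(df2.apply(closest_with_dist, axis=1))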

Evaluate frequency, duration and values of a timeseries

I'm new to python and have a simple question for which I haven't found an answer yet.
Let's say I have a time series c(t):
t_ c_
1 40
2 41
3 4
4 5
5 7
6 20
7 20
8 8
9 90
10 99
11 10
12 5
13 8
14 8
15 19
I now want to evaluate this series with respect to how long the value c has been continuously in certain ranges and how often these time periods occur.
The result would therefore include three columns: c (binned), duration (binned), frequency. Translated to the simple example the result could look as follows:
c_ Dt_ Freq_
0-50 8 1
50-100 2 1
0-50 5 1
Can you give me some advice?
Thanks in advance,
Ulrike
EDIT:
Thank you for the replies! My example data was somewhat flawed, so part of my question didn't come across. Here is a new data series:
series=
t c
1 1
2 1
3 10
4 10
5 10
6 1
7 1
8 50
9 50
10 50
12 1
13 1
14 1
If I apply the code proposed by Christoph below:
bins = pd.cut(series['c'], [-1, 5, 100])
same_as_prev = (bins != bins.shift())
run_ids = same_as_prev.cumsum()
result = bins.groupby(run_ids).aggregate(["first", "count"])
I receive a result like this:
first count
(-1, 5] 2
(5, 100] 3
(-1, 5] 2
(5, 100] 3
(-1, 5] 3
but what I'm more interested in is something looking like this:
c length freq
(-1, 5] 2 2
(-1, 5] 3 1
(5, 100] 3 2
How do I achieve this? And how could I plot it in a KDE plot?
Best,
Ulrike
Nicely asked question with an example :)
This is one way to do it, most likely incomplete, but it should help you a bit.
Since your data is spaced in time by a fixed increment, I do not implement the time series and instead use the index as time. Thus, I convert c to an array and use np.where() to find the values in each bin.
import numpy as np
c = np.array([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])
bin1 = np.where((0 <= c) & (c <= 50))[0]
bin2 = np.where((50 < c) & (c <= 100))[0]
For bin1, the output is array([ 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14], dtype=int64), which corresponds to the indices where the values of c fall in the bin.
Next step is to find the consecutive indices. According to this SO post:
from itertools import groupby
from operator import itemgetter
data = bin1
for k, g in groupby(enumerate(data), lambda ix: ix[0] - ix[1]):
    print(list(map(itemgetter(1), g)))
# Output is:
#[0, 1, 2, 3, 4, 5, 6, 7]
#[10, 11, 12, 13, 14]
Final step: place the new sub-bins in the right order and track which bin corresponds to which sub-bin. Thus, the complete code would look like:
import numpy as np
from itertools import groupby
from operator import itemgetter
c = np.array([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])
bin1 = np.where((0 <= c) & (c <= 50))[0]
bin2 = np.where((50 < c) & (c <= 100))[0]
# 1 and 2 for the range names.
bins = [(bin1, 1), (bin2, 2)]
subbins = list()
for b in bins:
    data = b[0]
    name = b[1]  # 1 or 2
    for k, g in groupby(enumerate(data), lambda ix: ix[0] - ix[1]):
        subbins.append((list(map(itemgetter(1), g)), name))

subbins = sorted(subbins, key=lambda x: x[0][0])
Output: [([0, 1, 2, 3, 4, 5, 6, 7], 1), ([8, 9], 2), ([10, 11, 12, 13, 14], 1)]
Then, you just have to do the stats you want :)
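For instance, a short sketch (an addition) that tallies how often each (bin, run length) pair occurs, which gives the kind of frequency table asked for in the edit:
from collections import Counter

# count occurrences of each (bin name, run length) pair
freq = Counter((name, len(run)) for run, name in subbins)
for (name, length), count in sorted(freq.items()):
    print(name, length, count)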
import pandas as pd
def bin_run_lengths(series, bins):
    binned = pd.cut(pd.Series(series), bins)
    return binned.groupby(
        (1 - (binned == binned.shift())).cumsum()
    ).aggregate(["first", "count"])
(I'm not sure where your frequency column comes in - in the problem as you describe it, it seems like it would always be set to 1.)
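If the frequency means "how many runs share the same bin and length" (an assumption based on the edit), one follow-up sketch is to count duplicate (first, count) rows:
runs = bin_run_lengths(series['c'], [-1, 5, 100])
# count how many runs share the same (bin, length) pair
freq = runs.groupby(['first', 'count'], observed=True).size().reset_index(name='freq')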
Binning
Binning a series is easy with pandas.cut():
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html
import pandas as pd
pd.cut(pd.Series(range(100)), bins=[-1,0,10,20,50,100])
The bins here are given as (right-inclusive, left-exclusive) boundaries; the argument can be given in different forms.
0 (-1.0, 0.0]
1 (0.0, 10.0]
2 (0.0, 10.0]
3 (0.0, 10.0]
4 (0.0, 10.0]
5 (0.0, 10.0]
6 (0.0, 10.0]
...
19 (10.0, 20.0]
20 (10.0, 20.0]
21 (20.0, 50.0]
22 (20.0, 50.0]
23 (20.0, 50.0]
...
29 (20.0, 50.0]
...
99 (50.0, 100.0]
Length: 100, dtype: category
Categories (5, interval[int64]): [(-1, 0] < (0, 10] < (10, 20] < (20, 50] < (50, 100]]
This converts it from a series of values to a series of intervals.
Count consecutive values
This doesn't have a native idiom in pandas, but it is fairly easy with a few common functions. The top-voted StackOverflow answer here puts it very well: Counting consecutive positive value in Python array
same_as_prev = (series != series.shift())
This yields a Boolean series that determines if the value is different from the one before.
run_ids = same_as_prev.cumsum()
This makes an int series that increments from 0 each time the value changes to a new run, and thus assigns each position in the series to a "run ID"
result = series.groupby(run_ids).aggregate(["first", "count"])
This yields a dataframe that shows the value in each run and the length of that run:
first count
0 (-1, 0] 1
1 (0, 10] 10
2 (10, 20] 10
3 (20, 50] 30
4 (50, 100] 49

Cumulative subtraction from first row

I have one series and one DataFrame, all integers.
s = [10,
     10,
     10]

m = [[0, 0, 0, 0, 3, 4, 5],
     [0, 0, 0, 0, 1, 1, 1],
     [10, 0, 0, 0, 0, 5, 5]]
I want to return a matrix containing the cumulative differences, in place of the existing numbers.
Output:
n = [[10, 10, 10, 10, 7, 3, -2],
     [10, 10, 10, 10, 9, 8, 7],
     [0, 0, 0, 0, 0, -5, -10]]
Calculate the cumsum of the data frame by row first and then subtract it from the Series:
import pandas as pd
s = pd.Series(s)
df = pd.DataFrame(m)
-df.cumsum(1).sub(s, axis=0)
# 0 1 2 3 4 5 6
#0 10 10 10 10 7 3 -2
#1 10 10 10 10 9 8 7
#2 0 0 0 0 0 -5 -10
You can directly compute a cumulative difference using np.subtract.accumulate:
# make a copy
>>> n = np.array(m)
# replace first column
>>> n[:, 0] = s - n[:, 0]
# subtract in-place
>>> np.subtract.accumulate(n, axis=1, out=n)
array([[ 10, 10, 10, 10, 7, 3, -2],
[ 10, 10, 10, 10, 9, 8, 7],
[ 0, 0, 0, 0, 0, -5, -10]])

pandas dataframe look ahead value

What's the best way to do this with a pandas dataframe? I want to loop through a dataframe and compute the difference between the current value and the next value that is different from the current value.
For example:
[13, 13, 13, 14, 13, 12]
will create a new column with this
[-1, -1, -1, 1, 1]
How about using diff to calculate the difference and then backfilling zeros with the next non-zero value:
import pandas as pd
import numpy as np
df = pd.DataFrame({"S": [13, 13, 13, 14, 13, 12]})
df.S.diff(-1).replace(0, np.nan).bfill() # replace zero with nan and apply back fill.
# 0 -1
# 1 -1
# 2 -1
# 3 1
# 4 1
# 5 NaN
# Name: S, dtype: float64
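Note the trailing NaN: the last row has no later differing value. If the result should match the five-element list from the question, one option (a sketch) is to drop it:
result = df.S.diff(-1).replace(0, np.nan).bfill().dropna()
# 0   -1.0
# 1   -1.0
# 2   -1.0
# 3    1.0
# 4    1.0
# Name: S, dtype: float64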
