My question is quite straightforward. I want to bin polar coordinates, which means the domain in which I want to bin is limited by 0 and 360, where 0 = 360. This is where my problems start, due to the circular behaviour of the data: I want to bin every 1 degree, with edges running from 0.5 degrees up to 359.5 degrees (unfortunately, due to the nature of the project, binning from (0, 1] up to (359, 360] is not an option). So I have to make sure that there is a bin that goes from (359.5, 0.5], which is obviously not what happens by default.
I made up this script to better illustrate what I am looking for:
import numpy as np
import pandas as pd

bins_direction = np.linspace(0.5, 360.5, 360, endpoint=False)  # edges 0.5, 1.5, ..., 359.5
points = np.random.rand(10000) * 360
df = pd.DataFrame({'Points': points})
df['Bins'] = pd.cut(x=df['Points'], bins=bins_direction)
You will see that if the data is between 359.5 and 0.5 degrees, the bin will be NaN. I want to find a solution in which that bin would be (359.5, 0.5].
So, my result (depending on which seed you set, of course), will look something like this:
Points Bins
0 17.102993 (16.5, 17.5]
1 97.665600 (97.5, 98.5]
2 46.697548 (46.5, 47.5]
3 9.832000 (9.5, 10.5]
4 21.260980 (20.5, 21.5]
5 47.433179 (46.5, 47.5]
6 359.813283 NaN
7 355.654251 (355.5, 356.5]
8 0.237401 NaN
And I would like it to be:
Points Bins
0 17.102993 (16.5, 17.5]
1 97.665600 (97.5, 98.5]
2 46.697548 (46.5, 47.5]
3 9.832000 (9.5, 10.5]
4 21.260980 (20.5, 21.5]
5 47.433179 (46.5, 47.5]
6 359.813283 (359.5, 0.5]
7 355.654251 (355.5, 356.5]
8 0.237401 (359.5, 0.5]
Since you cannot have a pandas Interval of the form (359.5, 0.5], you can only have the bins as strings:
import numpy as np
import pandas as pd

bins = [0] + list(np.linspace(0.5, 359.5, 360)) + [360]
df = pd.DataFrame({'Points': [0, 1, 350, 356, 357, 359]})
(pd.cut(df['Points'], bins=bins, include_lowest=True)
 .astype(str)
 .replace({'(-0.001, 0.5]': '(359.5, 0.5]', '(359.5, 360.0]': '(359.5, 0.5]'})
)
Output:
0      (359.5, 0.5]
1        (0.5, 1.5]
2    (349.5, 350.5]
3    (355.5, 356.5]
4    (356.5, 357.5]
5    (358.5, 359.5]
Name: Points, dtype: object
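Alternatively (a sketch of my own, not from the answer above): you can keep a real interval dtype by shifting the points that belong to the wrap-around bin up by 360 degrees, so that (359.5, 0.5] becomes the ordinary interval (359.5, 360.5]:
import numpy as np
import pandas as pd

points = pd.Series([17.102993, 359.813283, 0.237401, 355.654251])

# Values at or below 0.5 belong to the wrap-around bin; shifting them
# up by 360 maps (359.5, 0.5] onto the ordinary interval (359.5, 360.5].
shifted = points.where(points > 0.5, points + 360)

edges = np.arange(0.5, 361.5, 1.0)   # 0.5, 1.5, ..., 360.5
binned = pd.cut(shifted, bins=edges)
# the label (359.5, 360.5] now stands for the circular bin (359.5, 0.5]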
I'm working with a numpy array called array_test with shape (5, 359, 2), as checked with array_test.shape. The array holds the mean and uncertainty of the observations in 5 repetitions of an experiment.
The goal is to estimate the mean value of each observation across the 5 repetitions, and to estimate the total uncertainty per observation, also as a mean across the 5 repetitions.
I would need to create a pandas DataFrame from it, I believe with a MultiIndex in which the first level would have 5 values from the first dimension (named simply '1', '2', etc.), and a second level with 'mean' and 'uncertainty'.
Suggestions are more than welcome!
IIUC, you might want to aggregate in numpy, then construct a DataFrame and stack:
import numpy as np
import pandas as pd

a = np.random.random((5, 359, 2))
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0] + 1),
                   columns=['mean', 'uncertainty']).stack()
Output (a Series):
1 mean 0.499102
uncertainty 0.511757
2 mean 0.480295
uncertainty 0.473132
3 mean 0.500507
uncertainty 0.519352
4 mean 0.505443
uncertainty 0.493672
5 mean 0.514302
uncertainty 0.519299
dtype: float64
For a DataFrame:
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0] + 1),
                   columns=['mean', 'uncertainty']).stack().to_frame('value')
Output:
value
1 mean 0.499102
uncertainty 0.511757
2 mean 0.480295
uncertainty 0.473132
3 mean 0.500507
uncertainty 0.519352
4 mean 0.505443
uncertainty 0.493672
5 mean 0.514302
uncertainty 0.519299
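If you also want the unaggregated values in that layout before averaging them, here is a sketch building on the array a above, using pd.MultiIndex.from_product (the level names 'experiment' and 'observation' are my own labels):
idx = pd.MultiIndex.from_product(
    [range(1, a.shape[0] + 1), range(a.shape[1])],
    names=['experiment', 'observation'])

# one row per (experiment, observation) pair
full = pd.DataFrame(a.reshape(-1, 2), index=idx,
                    columns=['mean', 'uncertainty'])

# averaging over the inner level reproduces the aggregated values above
full.groupby(level='experiment').mean()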
I would approach it by using a normal DataFrame, but adding columns for the observation and experiment number.
import numpy as np
import pandas as pd
a = np.random.rand(5, 10, 2)
# Get the shape
n_experiments, n_observations, n_values = a.shape
# Reshape array into a 2-dimensional array
# (stacking experiments on top of each other)
a = a.reshape(-1, n_values)
# Create DataFrame and add experiment and observation number
df = pd.DataFrame(a, columns=["mean", "uncertainty"])
# This returns an array, like [0, 0, 0, 0, 0, 1, 1, 1, ..., 4, 4]
experiment = np.repeat(range(n_experiments), n_observations)
df["experiment"] = experiment
# This returns an array like [0, 1, 2, 3, 4, 0, 1, 2, ..., 3, 4]
observation = np.tile(range(n_observations), n_experiments)
df["observation"] = observation
The DataFrame now looks like this:
print(df.head(15))
mean uncertainty experiment observation
0 0.741436 0.775086 0 0
1 0.401934 0.277716 0 1
2 0.148269 0.406040 0 2
3 0.852485 0.702986 0 3
4 0.240930 0.644746 0 4
5 0.309648 0.914761 0 5
6 0.479186 0.495845 0 6
7 0.154647 0.422658 0 7
8 0.381012 0.756473 0 8
9 0.939797 0.764821 0 9
10 0.994342 0.019140 1 0
11 0.300225 0.992146 1 1
12 0.265698 0.823469 1 2
13 0.791907 0.555051 1 3
14 0.503281 0.249237 1 4
Now you can analyze the DataFrame (with groupby and mean):
# Only the mean
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).mean())
mean uncertainty
observation
0 0.699324 0.506369
1 0.382288 0.456324
2 0.333396 0.324469
3 0.690545 0.564583
4 0.365198 0.555231
5 0.453545 0.596149
6 0.526988 0.395162
7 0.565689 0.569904
8 0.425595 0.415944
9 0.731776 0.375612
Or with more advanced aggregate functions, which are probably useful for your use case:
# Use aggregate function to calculate not only mean, but min and max as well
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).aggregate(['mean', 'min', 'max']))
mean uncertainty
mean min max mean min max
observation
0 0.699324 0.297030 0.994342 0.506369 0.019140 0.974842
1 0.382288 0.063046 0.810411 0.456324 0.108774 0.992146
2 0.333396 0.148269 0.698921 0.324469 0.009539 0.823469
3 0.690545 0.175471 0.895190 0.564583 0.260557 0.721265
4 0.365198 0.015501 0.726352 0.555231 0.249237 0.929258
5 0.453545 0.111355 0.807582 0.596149 0.101421 0.914761
6 0.526988 0.323945 0.786167 0.395162 0.007105 0.691998
7 0.565689 0.154647 0.813336 0.569904 0.302157 0.964782
8 0.425595 0.116968 0.567544 0.415944 0.014439 0.756473
9 0.731776 0.411324 0.939797 0.375612 0.085988 0.764821
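If you want to end up with the MultiIndex layout from the question (experiment as the outer level, 'mean' and 'uncertainty' as the inner one), here is a sketch building on the df above:
# average over observations per experiment, then stack the two
# statistics into an inner index level
result = (df.set_index(['experiment', 'observation'])
            .groupby(level='experiment').mean()
            .stack())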
import pandas as pd

df1 = pd.DataFrame({"DEPTH": [0.5, 1, 1.5, 2, 2.5],
                    "POROSITY": [10, 22, 15, 30, 20],
                    "WELL": "well 1"})

df2 = pd.DataFrame({"Well": "well 1",
                    "Marker": ["Fm 1", "Fm 2"],
                    "Depth": [0.7, 1.7]})
Hello everyone. I have two dataframes, and I would like to create a new column in df1, for example df1["FORMATIONS"], filled with the values of df2["Marker"] based on the depth limits from df2["Depth"] and df1["DEPTH"].
So, for example, if df2["Depth"] = 1.7, then all samples in df1 with df1["DEPTH"] > 1.7 should be labelled as "Fm 2" in this new column df1["FORMATIONS"].
And the final dataframe df1 should look like this:
DEPTH POROSITY WELL FORMATIONS
0.5 10 well 1 NaN
1 22 well 1 Fm 1
1.5 15 well 1 Fm 1
2 30 well 1 Fm 2
2.5 20 well 1 Fm 2
Could anyone help me?
What you're doing here is transforming continuous data into categorical data. There are many ways to do this with pandas, but one of the better known ways is using pandas.cut.
When specifying the bins argument, you need to add float('inf') to the end of the list, to represent that the last bin extends to infinity.
df1["FORMATIONS"] = pd.cut(df1.DEPTH, list(df2.Depth) + [float('inf')], labels=df2.Marker)
df1 will now be:
   DEPTH  POROSITY    WELL FORMATIONS
0    0.5        10  well 1        NaN
1    1.0        22  well 1       Fm 1
2    1.5        15  well 1       Fm 1
3    2.0        30  well 1       Fm 2
4    2.5        20  well 1       Fm 2
Use pandas.merge_asof:
NB. the columns used for the merge need to be sorted first
pd.merge_asof(df1,
              df2[['Marker', 'Depth']].rename(columns={'Marker': 'Formations'}),
              left_on='DEPTH', right_on='Depth')
Output:
DEPTH POROSITY WELL Formations Depth
0 0.5 10 well 1 NaN NaN
1 1.0 22 well 1 Fm 1 0.7
2 1.5 15 well 1 Fm 1 0.7
3 2.0 30 well 1 Fm 2 1.7
4 2.5 20 well 1 Fm 2 1.7
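If there were several wells, merge_asof can also restrict matches to the same well via its by= parameter. A sketch, assuming the column names are first renamed to match (merge_asof requires both frames to be sorted on the merge keys):
# align the column names across the two frames (my own renaming)
df2r = df2.rename(columns={'Well': 'WELL',
                           'Marker': 'FORMATIONS',
                           'Depth': 'DEPTH_TOP'})

out = pd.merge_asof(df1.sort_values('DEPTH'),
                    df2r.sort_values('DEPTH_TOP'),
                    left_on='DEPTH', right_on='DEPTH_TOP',
                    by='WELL')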
I have a pd.Series of floats and I would like to bin it into n bins, where the bin size for each bin is set so that max/min is a preset value (e.g. 1.20).
The requirement means that the size of the bins is not constant. For example:
data = pd.Series(np.arange(1, 11.0))
print(data)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
dtype: float64
I would like the bin sizes to be:
1.00 <= bin 1 < 1.20
1.20 <= bin 2 < 1.20 x 1.20 = 1.44
1.44 <= bin 3 < 1.44 x 1.20 = 1.73
...
etc
Thanks
Here's one with pd.cut, where the bins can be computed by taking np.cumprod of an array filled with 1.2:
import numpy as np
import pandas as pd

data = pd.Series(list(range(11)))

n = 20  # set accordingly
bins = np.r_[0, np.cumprod(np.full(n, 1.2))]
# array([ 0.   ,  1.2  ,  1.44 ,  1.728 ...
pd.cut(data, bins)
0 NaN
1 (0.0, 1.2]
2 (1.728, 2.074]
3 (2.986, 3.583]
4 (3.583, 4.3]
5 (4.3, 5.16]
6 (5.16, 6.192]
7 (6.192, 7.43]
8 (7.43, 8.916]
9 (8.916, 10.699]
10 (8.916, 10.699]
dtype: category
Where bins in this case go up to:
np.r_[0,np.cumprod(np.full(20, 1.2))]
array([ 0. , 1.2 , 1.44 , 1.728 , 2.0736 ,
2.48832 , 2.985984 , 3.5831808 , 4.29981696, 5.15978035,
6.19173642, 7.43008371, 8.91610045, 10.69932054, 12.83918465,
15.40702157, 18.48842589, 22.18611107, 26.62333328, 31.94799994,
38.33759992])
So you'll have to set n according to the range of values of the actual data.
This is, I believe, the best way to do it, because you are considering the max and min values from your array. Therefore you won't need to worry about which values you are using, only about the multiplier or step_size for your bins (of course, you'd need to add a column name or some additional information if you are working with a DataFrame):
import numpy as np
import pandas as pd

data = pd.Series(np.arange(1, 11.0))

bins = []
i = min(data)
while i < max(data):
    bins.append(i)
    i = i * 1.2
bins.append(i)             # add the upper edge of the last bin

bins = sorted(set(bins))   # drop duplicates and sort

df = pd.cut(data, bins, include_lowest=True)
print(df)
Output:
0 (0.999, 1.2]
1 (1.728, 2.074]
2 (2.986, 3.583]
3 (3.583, 4.3]
4 (4.3, 5.16]
5 (5.16, 6.192]
6 (6.192, 7.43]
7 (7.43, 8.916]
8 (8.916, 10.699]
9 (8.916, 10.699]
Bins output:
Categories (13, interval[float64]): [(0.999, 1.2] < (1.2, 1.44] < (1.44, 1.728] < (1.728, 2.074] < ... <
(5.16, 6.192] < (6.192, 7.43] < (7.43, 8.916] <
(8.916, 10.699]]
Thanks everyone for all the suggestions. None does quite what I was after (probably because my original question wasn't clear enough), but they really helped me figure out what to do, so I have decided to post my own answer (I hope this is what I am supposed to do, as I am relatively new at being an active member of Stack Overflow...).
I liked @yatu's vectorised suggestion best because it will scale better with large data sets, but I am after a means to not only automatically calculate the bins, but also to figure out the minimum number of bins needed to cover the data set.
This is my proposed algorithm:
The bin size is defined so that bin_max_i/bin_min_i is constant:
bin_max_i / bin_min_i = bin_ratio
Figure out the number of bins for the required bin size (bin_ratio):
data_ratio = data_max / data_min
n_bins = math.ceil( math.log(data_ratio) / math.log(bin_ratio) )
Set the lower boundary for the smallest bin so that the smallest data point fits in it:
bin_min_0 = data_min
Create n non-overlapping bins meeting the conditions:
bin_min_(i+1) = bin_max_i
bin_max_(i+1) = bin_min_(i+1) * bin_ratio
Stop creating further bins once the whole dataset can be split among the bins already created. In other words, stop once:
bin_max_last > data_max
Here is a code snippet:
import math
import numpy as np
import pandas as pd

bin_ratio = 1.20
data = pd.Series(np.arange(2, 12))

data_ratio = max(data) / min(data)
n_bins = math.ceil(math.log(data_ratio) / math.log(bin_ratio))
n_bins = n_bins + 1                # bin ranges are defined as [min, max)

bin_min_0 = min(data)              # lower limit of the 1st bin (data_min)
bins = np.full(n_bins, bin_ratio)  # initialise the ratios for the bin limits
bins[0] = bin_min_0                # initialise the lower limit of the 1st bin
bins = np.cumprod(bins)            # generate the bins
print(bins)
[ 2. 2.4 2.88 3.456 4.1472 4.97664
5.971968 7.1663616 8.59963392 10.3195607 12.38347284]
I am now set to build a histogram of the data:
data.hist(bins=bins)
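As a quick sanity check (my own addition, not part of the original algorithm): pd.cut with these bins should assign every point, as long as include_lowest=True is passed, because bins[0] equals the minimum of the data and an open left edge would otherwise exclude it:
binned = pd.cut(data, bins, include_lowest=True)
assert binned.notna().all()   # every point falls inside a bin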
I have some geographical data (global) as arrays:
latitude: lats = np.array([34.5, 34.2, 67.8, -24, ...])
wind speed: u = np.array([2.2, 2.5, 6, -3, -0.5, ...])
I would like to make a statement about how the wind speed depends on latitude. Therefore I would like to bin the data into latitude bins of 1 degree:
latbins = np.linspace(lats.min(), lats.max(), 180)
How can I calculate which wind speeds fall into which bin? I read about pandas.groupby. Is that an option?
The numpy function np.digitize does this task.
Here is one example that categorises each value into a bin:
import numpy as np
import math
# Generate random lats
lats = np.arange(0,10) - 0.5
print("{:20s}: {}".format("Lats", lats))
# Lats : [-0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5]
# Generate bins spaced by 1 from the min to the max value of lats
bins = np.arange(math.floor(lats.min()), math.ceil(lats.max()) +1, 1)
print("{:20s}: {}".format("Bins", bins))
# Bins : [-1 0 1 2 3 4 5 6 7 8 9]
lats_bins = np.digitize(lats, bins)
print("{:20s}: {}".format("Lats in bins", lats_bins))
# Lats in bins : [ 1 2 3 4 5 6 7 8 9 10]
As suggested by @High Performance Mark in the comments, since you want to split into 1-degree bins, you can use the floor of each latitude (note: this method produces negative bin indices if there are negative values):
lats_bins_floor = np.floor(lats)
# lats_bins_floor = lats_bins_floor + abs(min(lats_bins_floor))
print("{:20s}: {}".format("Lats in bins (floor)", lats_bins_floor))
# Lats in bins (floor): [-1. 0. 1. 2. 3. 4. 5. 6. 7. 8.]
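To answer the groupby part of the question (a sketch of my own): once each latitude is mapped to a bin, pandas.groupby gives you wind-speed statistics per bin. The column names here are my own:
import numpy as np
import pandas as pd

lats = np.array([34.5, 34.2, 67.8, -24.0])
u = np.array([2.2, 2.5, 6.0, -3.0])

df = pd.DataFrame({'lat': lats, 'u': u})
df['lat_bin'] = np.floor(df['lat'])    # 1-degree bins keyed by their lower edge

# mean wind speed per latitude bin
print(df.groupby('lat_bin')['u'].mean())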
Is there a way to customize the window of the rolling_mean function?
data
1
2
3
4
5
6
7
8
Let's say the window is set to 2, that is, calculate the average of the 2 datapoints before and after the observation, including the observation itself. Take the 3rd observation: in this case we will have (1+2+3+4+5)/5 = 3. And so on and so forth.
Compute the usual rolling mean with a forward (or backward) window and then use the shift method to re-center it as you wish.
data_mean = pd.rolling_mean(data, window=5).shift(-2)
If you want to average over 2 datapoints before and after the observation (for a total of 5 datapoints) then make the window=5.
For example,
import pandas as pd
data = pd.Series(range(1, 9))
data_mean = pd.rolling_mean(data, window=5).shift(-2)
print(data_mean)
yields
0 NaN
1 NaN
2 3
3 4
4 5
5 6
6 NaN
7 NaN
dtype: float64
As kadee points out, if you wish to center the rolling mean, then use
pd.rolling_mean(data, window=5, center=True)
For more recent versions of pandas (see the 0.23.4 documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html), rolling_mean no longer exists. Instead, you will use:
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
For your example, it will be:
df.rolling(5, center=True).mean()
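Applied to the example above, the modern API reproduces the same centred result:
import pandas as pd

data = pd.Series(range(1, 9))
print(data.rolling(window=5, center=True).mean())
# 0    NaN
# 1    NaN
# 2    3.0
# 3    4.0
# 4    5.0
# 5    6.0
# 6    NaN
# 7    NaN
# dtype: float64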