I have some geographical data (global) as arrays:
latitude: lats = np.array([34.5, 34.2, 67.8, -24, ...])
wind speed: u = np.array([2.2, 2.5, 6, -3, -0.5, ...])
I would like to get a statement of how the wind speed depends on latitude. Therefore I would like to bin the data into latitude bins of 1 degree.
latbins = np.linspace(lats.min(), lats.max(), 180)
How can I calculate which wind speeds fall into which bin? I read about pandas.groupby. Is that an option?
The numpy function np.digitize does this task.
Here is an example that assigns each value to a bin:
import numpy as np
import math
# Generate example lats
lats = np.arange(0,10) - 0.5
print("{:20s}: {}".format("Lats", lats))
# Lats : [-0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5]
# Generate bins spaced by 1 from the min to max values of lats
bins = np.arange(math.floor(lats.min()), math.ceil(lats.max()) +1, 1)
print("{:20s}: {}".format("Bins", bins))
# Bins : [-1 0 1 2 3 4 5 6 7 8 9]
lats_bins = np.digitize(lats, bins)
print("{:20s}: {}".format("Lats in bins", lats_bins))
# Lats in bins : [ 1 2 3 4 5 6 7 8 9 10]
As suggested by @High Performance Mark in the comments, since you want to split into bins of 1 degree, you can simply take the floor of each latitude (note: this method introduces negative bin indices if there are negative values):
lats_bins_floor = np.floor(lats)
# lats_bins_floor = lats_bins_floor + abs(min(lats_bins_floor))
print("{:20s}: {}".format("Lats in bins (floor)", lats_bins_floor))
# Lats in bins (floor): [-1. 0. 1. 2. 3. 4. 5. 6. 7. 8.]
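To tie this back to the original question, here is a minimal sketch (the lats and u values below are made up, standing in for the real arrays) that uses np.digitize to compute the mean wind speed per 1-degree latitude bin:
import numpy as np

# hypothetical example data standing in for the real lats/u arrays
lats = np.array([34.5, 34.2, 67.8, -24.0, 12.1])
u = np.array([2.2, 2.5, 6.0, -3.0, -0.5])

# 1-degree bin edges covering the globe
bin_edges = np.arange(-90, 91, 1)

# index of the bin each latitude falls into
bin_idx = np.digitize(lats, bin_edges)

# mean wind speed per occupied bin
for i in np.unique(bin_idx):
    mask = bin_idx == i
    print(f"bin [{bin_edges[i-1]}, {bin_edges[i]}): mean u = {u[mask].mean():.2f}")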
I couldn't make a better title. Let me explain:
Numpy has the percentile() function, which calculates the Nth percentile of any array:
import numpy as np
arr = np.arange(0, 10)
print(arr)
print(np.percentile(arr, 80))
>>> [0 1 2 3 4 5 6 7 8 9]
>>> 7.2
Which is great - 7.2 marks the 80th percentile on that array.
How can I obtain the same percentile type of calculation, but find out the Nth percentile of both extremities of an array (the positive and negative numbers)?
For example, my array may be:
[-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9]
If I place them on a number line, it would go from -10 to 10. I'd like to get the Nth percentile that marks the extremities of that number line. For the 90th percentile, the output could look like -8.1 and 7.5, for example, since 90% of the values in the array fall within that range and the remaining 10% are lower than -8.1 or greater than 7.5.
I made these numbers up of course, just for illustrating what I'm trying to calculate.
Is there any NumPy method for obtaining such boundaries?
Please let me know if I can explain or clarify further, I know this is a complicated question to ask and I'm trying my best to make it clear. Thanks a lot!
Are you looking for something like
import numpy as np
def extremities(array, pct):
    # assert 50 <= pct <= 100
    return np.percentile(array, [100 - pct, pct])
arr = np.arange(-10, 10)
print(extremities(arr, 90)) # [-8.1, 7.1]
My question is quite straightforward. I want to bin polar coordinates, which means that the domain I want to bin is limited to 0 and 360, where 0 = 360. Here my problems start, due to this circular behaviour of the data: since I want to bin every 1 degree starting from 0.5 degrees up to 355.5 degrees (unfortunately, due to the nature of the project, instead of binning from (0,1] up to (359,360]), I have to make sure that there is a bin that goes from (355.5, 0.5], which is obviously not what happens by default.
I made up this script to better illustrate what I am looking for:
import numpy as np
import pandas as pd

bins_direction = np.linspace(0.5, 360.5, 360, endpoint=False)
points = np.random.rand(10000) * 360
df = pd.DataFrame({'Points': points})
df['Bins'] = pd.cut(x=df['Points'], bins=bins_direction)
You will see that if a point falls between 355.5 and 0.5 degrees, its bin will be NaN. I want to find a solution in which that bin would be (355.5, 0.5].
So, my result (depending on which seed you set, of course), will look something like this:
Points Bins
0 17.102993 (16.5, 17.5]
1 97.665600 (97.5, 98.5]
2 46.697548 (46.5, 47.5]
3 9.832000 (9.5, 10.5]
4 21.260980 (20.5, 21.5]
5 47.433179 (46.5, 47.5]
6 359.813283 nan
7 355.654251 (355.5, 356.5]
8 0.23740105 nan
And I would like it to be:
Points Bins
0 17.102993 (16.5, 17.5]
1 97.665600 (97.5, 98.5]
2 46.697548 (46.5, 47.5]
3 9.832000 (9.5, 10.5]
4 21.260980 (20.5, 21.5]
5 47.433179 (46.5, 47.5]
6 359.813283 (359.5, 0.5]
7 355.654251 (355.5, 356.5]
8 0.23740105 (359.5, 0.5]
Since you cannot have a pandas Interval of the form (355.5, 0.5], you can only represent such wrap-around bins as strings:
import numpy as np
import pandas as pd

bins = [0] + list(np.linspace(0.5, 355.5, 356)) + [360]
df = pd.DataFrame({'Points': [0, 1, 350, 356, 357, 359]})
(pd.cut(df['Points'], bins=bins, include_lowest=True)
   .astype(str)
   .replace({'(-0.001, 0.5]': '(355.5,0.5]', '(355.5, 360.0]': '(355.5,0.5]'})
)
Output:
0 (355.5,0.5]
1 (0.5, 1.5]
2 (349.5, 350.5]
3 (355.5,0.5]
4 (355.5,0.5]
5 (355.5,0.5]
Name: Points, dtype: object
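As a usage sketch tying this back to the question's random-points setup (the seed below is hypothetical, only for reproducibility), the same edges and replacement strings can be attached to the DataFrame as a column:
import numpy as np
import pandas as pd

np.random.seed(1)  # hypothetical seed, only for reproducibility
df = pd.DataFrame({'Points': np.random.rand(10000) * 360})

bins = [0] + list(np.linspace(0.5, 355.5, 356)) + [360]
df['Bins'] = (pd.cut(df['Points'], bins=bins, include_lowest=True)
                .astype(str)
                .replace({'(-0.001, 0.5]': '(355.5,0.5]', '(355.5, 360.0]': '(355.5,0.5]'}))

# every point now gets a label; the wrap-around bin catches both ends of the circle
print(df[df['Bins'] == '(355.5,0.5]'].head())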
I have a pd.Series of floats. How can I bin it into n bins where the bin edges are set so that max/min for each bin equals a preset ratio (e.g. 1.20)?
This requirement means that the size of the bins is not constant. For example:
data = pd.Series(np.arange(1, 11.0))
print(data)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
dtype: float64
I would like the bin sizes to be:
1.00 <= bin 1 < 1.20
1.20 <= bin 2 < 1.20 x 1.20 = 1.44
1.44 <= bin 3 < 1.44 x 1.20 = 1.73
...
etc
Thanks
Here's one with pd.cut, where the bins can be computed by taking the np.cumprod of an array filled with 1.2:
import numpy as np
import pandas as pd

data = pd.Series(list(range(11)))
n = 20  # set accordingly
bins = np.r_[0, np.cumprod(np.full(n, 1.2))]
# array([ 0. , 1.2 , 1.44 , 1.728 ...
pd.cut(data, bins)
0 NaN
1 (0.0, 1.2]
2 (1.728, 2.074]
3 (2.986, 3.583]
4 (3.583, 4.3]
5 (4.3, 5.16]
6 (5.16, 6.192]
7 (6.192, 7.43]
8 (7.43, 8.916]
9 (8.916, 10.699]
10 (8.916, 10.699]
dtype: category
Where bins in this case goes up to:
np.r_[0,np.cumprod(np.full(20, 1.2))]
array([ 0. , 1.2 , 1.44 , 1.728 , 2.0736 ,
2.48832 , 2.985984 , 3.5831808 , 4.29981696, 5.15978035,
6.19173642, 7.43008371, 8.91610045, 10.69932054, 12.83918465,
15.40702157, 18.48842589, 22.18611107, 26.62333328, 31.94799994,
38.33759992])
So you'll have to set n according to the range of values of the actual data.
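If it helps, here is one way (my own assumption, not part of the answer above) to pick n so that the last cumulative edge just reaches the top of the data:
import numpy as np
import pandas as pd

data = pd.Series(list(range(11)))
# choose n so that 1.2**n >= data.max(); for this data n = 13 and the last edge is ~10.70
n = int(np.ceil(np.log(data.max()) / np.log(1.2)))
bins = np.r_[0, np.cumprod(np.full(n, 1.2))]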
This is, I believe, the best way to do it, because it takes the max and min values from your array: you won't need to worry about which values you are using, only about the multiplier or step_size for your bins (of course, you'd need to add a column name or some additional information if you will be working with a DataFrame):
import numpy as np
import pandas as pd

data = pd.Series(np.arange(1, 11.0))
bins = []
i = min(data)
while i < max(data):
    bins.append(i)
    i = i * 1.2
bins.append(i)
bins = list(set(bins))
bins.sort()
df = pd.cut(data, bins, include_lowest=True)
print(df)
Output:
0 (0.999, 1.2]
1 (1.728, 2.074]
2 (2.986, 3.583]
3 (3.583, 4.3]
4 (4.3, 5.16]
5 (5.16, 6.192]
6 (6.192, 7.43]
7 (7.43, 8.916]
8 (8.916, 10.699]
9 (8.916, 10.699]
Bins output:
Categories (13, interval[float64]): [(0.999, 1.2] < (1.2, 1.44] < (1.44, 1.728] < (1.728, 2.074] < ... <
(5.16, 6.192] < (6.192, 7.43] < (7.43, 8.916] <
(8.916, 10.699]]
Thanks everyone for all the suggestions. None does quite what I was after (probably because my original question wasn't clear enough), but they really helped me figure out what to do, so I have decided to post my own answer (I hope this is what I am supposed to do, as I am relatively new to being an active member of Stack Overflow...).
I liked @yatu's vectorised suggestion best because it will scale better with large data sets, but I was after a way to not only compute the bins automatically but also figure out the minimum number of bins needed to cover the data set.
This is my proposed algorithm:
The bin size is defined so that bin_max_i/bin_min_i is constant:
bin_max_i / bin_min_i = bin_ratio
Figure out the number of bins for the required bin size (bin_ratio):
data_ratio = data_max / data_min
n_bins = math.ceil( math.log(data_ratio) / math.log(bin_ratio) )
Set the lower boundary for the smallest bin so that the smallest data point fits in it:
bin_min_0 = data_min
Create n non-overlapping bins meeting the conditions:
bin_min_i+1 = bin_max_i
bin_max_i+1 = bin_min_i+1 * bin_ratio
Stop creating further bins once the whole data set can be split between the bins already created. In other words, stop once:
bin_max_last > data_max
Here is a code snippet:
import math
import numpy as np
import pandas as pd

bin_ratio = 1.20
data = pd.Series(np.arange(2, 12))

data_ratio = max(data) / min(data)
n_bins = math.ceil(math.log(data_ratio) / math.log(bin_ratio))
n_bins = n_bins + 1                 # bin ranges are defined as [min, max)

bin_min_0 = min(data)               # lower limit of the 1st bin (data_min)
bins = np.full(n_bins, bin_ratio)   # initialise the ratios for the bin limits
bins[0] = bin_min_0                 # initialise the lower limit for the 1st bin
bins = np.cumprod(bins)             # generate bins
print(bins)
[ 2. 2.4 2.88 3.456 4.1472 4.97664
5.971968 7.1663616 8.59963392 10.3195607 12.38347284]
I am now set to build a histogram of the data:
data.hist(bins=bins)
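As a quick sanity check (my own sketch, not part of the original answer), the same edges can be written in closed form and passed to pd.cut with right=False to match the [min, max) bin convention:
import numpy as np
import pandas as pd

data = pd.Series(np.arange(2, 12))
bins = 2 * 1.2 ** np.arange(11)            # same edges as generated above: 2, 2.4, 2.88, ...
binned = pd.cut(data, bins, right=False)   # right=False matches the [min, max) convention
print(binned.value_counts().sort_index())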
I am converting a data frame to a square matrix. The data frame has an index and only one column with floats. What I need to do is to calculate all pairs of indices, and for each pair take the mean of the two associated column values. So, the usual pivot function is only part of the solution.
Currently, the function has an estimated complexity of O(n^2), which is not good as I have to work with larger inputs with data frames with several hundred rows at a time. Is there another faster approach I could take?
Example input (with integers here for simplicity):
df = pd.DataFrame([3, 4, 5])
Update: transformation logic
For an input data frame in the example:
0
0 3
1 4
2 5
I do the following (not claiming it is the best way though):
get all pairs of indices: (0,1), (1,2), (0,2)
for each pair, compute the mean of their values: (0,1):3.5, (1,2):4.5, (0,2):4.0
build a square symmetric matrix using indices in each pair as column and row identifiers, and zero on the diagonal (as shown in the desired output).
The code is in the turn_table_into_square_matrix().
Desired output:
0 1 2
0 0.0 3.5 4.0
1 3.5 0.0 4.5
2 4.0 4.5 0.0
Current implementation:
import pandas as pd
from itertools import combinations
import time
import string
import random
def turn_table_into_square_matrix(original_dataframe):
    # get all pairs of indices
    index_pairs = list(combinations(list(original_dataframe.index), 2))
    rows_for_final_dataframe = []
    # collect new data frame row by row - the time consuming part
    for pair in index_pairs:
        subset_original_dataframe = original_dataframe[original_dataframe.index.isin(list(pair))]
        rows_for_final_dataframe.append([pair[0], pair[1], subset_original_dataframe[0].mean()])
        rows_for_final_dataframe.append([pair[1], pair[0], subset_original_dataframe[0].mean()])
    final_dataframe = pd.DataFrame(rows_for_final_dataframe)
    final_dataframe.columns = ["from", "to", "weight"]
    final_dataframe_pivot = final_dataframe.pivot(index="from", columns="to", values="weight")
    final_dataframe_pivot = final_dataframe_pivot.fillna(0)
    return final_dataframe_pivot
Code to time the performance:
for size in range(50, 600, 100):
    index = range(size)
    values = random.sample(range(0, 1000), size)
    example = pd.DataFrame(values, index)
    print("dataframe size", example.shape)
    start_time = time.time()
    turn_table_into_square_matrix(example)
    print("conversion time:", time.time() - start_time)
The timing results:
dataframe size (50, 1)
conversion time: 0.5455281734466553
dataframe size (150, 1)
conversion time: 5.001590013504028
dataframe size (250, 1)
conversion time: 14.562285900115967
dataframe size (350, 1)
conversion time: 31.168692111968994
dataframe size (450, 1)
conversion time: 49.07127499580383
dataframe size (550, 1)
conversion time: 78.73740792274475
Thus, a data frame with 50 rows takes only half a second to convert, whereas one with 550 rows (11 times longer) takes 79 seconds (over 11^2 times longer). Is there a faster solution to this problem?
I don't think it is possible to do better than O(n^2) for that computation. As @piiipmatz suggested, you should try doing everything with numpy and then put the result in a pd.DataFrame. Your problem sounds like a good use case for numpy.add.at.
Here is a quick example
import numpy as np
import itertools
# your original array
x = np.array([1, 4, 8, 99, 77, 23, 4, 45])
n = len(x)
# all pairs of indices in x
a, b = zip(*list(itertools.product(range(n), range(n))))
a, b = np.array(a), np.array(b)
# resulting matrix
result = np.zeros(shape=(n, n))
np.add.at(result, [a, b], (x[a] + x[b]) / 2.0)
print(result)
# [[ 1. 2.5 4.5 50. 39. 12. 2.5 23. ]
# [ 2.5 4. 6. 51.5 40.5 13.5 4. 24.5]
# [ 4.5 6. 8. 53.5 42.5 15.5 6. 26.5]
# [ 50. 51.5 53.5 99. 88. 61. 51.5 72. ]
# [ 39. 40.5 42.5 88. 77. 50. 40.5 61. ]
# [ 12. 13.5 15.5 61. 50. 23. 13.5 34. ]
# [ 2.5 4. 6. 51.5 40.5 13.5 4. 24.5]
# [ 23. 24.5 26.5 72. 61. 34. 24.5 45. ]]
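If the explicit list of index pairs is not needed at all, a fully vectorised broadcasting variant (my own sketch, with the zero diagonal the question asked for) could look like this:
import numpy as np
import pandas as pd

x = np.array([3.0, 4.0, 5.0])              # the question's example column
result = (x[:, None] + x[None, :]) / 2.0   # pairwise means via broadcasting
np.fill_diagonal(result, 0.0)              # the question wants zeros on the diagonal
print(pd.DataFrame(result))
#      0    1    2
# 0  0.0  3.5  4.0
# 1  3.5  0.0  4.5
# 2  4.0  4.5  0.0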
I think you have a lot of overhead from pandas (e.g. original_dataframe[original_dataframe.index.isin(list(pair))] seems too expensive for what it actually does). I haven't tested it, but I assume you can save a considerable amount of execution time by just working with numpy arrays. If needed, you can still feed the result into a pandas.DataFrame at the end.
Something like (just to sketch what I mean):
original_array = original_dataframe.as_matrix().ravel()
n = len(original_array)
final_matrix = np.zeros((n, n))
for pair in pairs:
    final_matrix[pair[0], pair[1]] = 0.5 * (original_array[pair[0]] + original_array[pair[1]])
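A minimal runnable version of that sketch (my own fleshing-out: it assumes the pairs come from itertools.combinations as in the question, fills the symmetric entry as well, and uses DataFrame.to_numpy() instead of the deprecated as_matrix()):
import numpy as np
import pandas as pd
from itertools import combinations

original_dataframe = pd.DataFrame([3, 4, 5])
original_array = original_dataframe.to_numpy().ravel()
n = len(original_array)

final_matrix = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    mean_value = 0.5 * (original_array[i] + original_array[j])
    final_matrix[i, j] = mean_value
    final_matrix[j, i] = mean_value   # keep the matrix symmetric; the diagonal stays 0

print(pd.DataFrame(final_matrix))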
How about this:
df.pivot(index='i', columns='j', values='empty')
For this you need to cheat the standard pivot a bit by adding the index as a new column twice (pivot does not allow the same argument twice) and by adding an empty column for the values:
df['i']=df.index
df['j']=df.index
df['empty']=None
And that's it.
I have a matrix of ternary values (2 observations, 11 variables) for which I calculate the eigenvectors using np.linalg.eig() from NumPy. The matrix is (0 values are not used in this example):
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
1 1 1 1 1 1 1 1 1 -1 -1
1 1 1 1 1 1 1 1 1 -1 -1
Result of the eigenvector from largest eigenvalue:
[ 0.33333333 0. 0.33333333 0. 0.33333333 0.33333333
0.33333333 0.33333333 0.33333333 0.33333333 0.33333333]
I am not sure about the order of these coefficients. Do they follow the order of the variables in the matrix (i.e. is the first 0.33333333 the weight coefficient of v1, 0.0 the weight coefficient of v2, etc.)?
Last part of my code is:
# Matrix with rounded values
Mtx = np.matrix.round(Mtx,3)
# Cross product of Mtx
Mtx_CrossProduct = (Mtx.T).dot(Mtx)
# Calculation of eigenvectors
eigen_Value, eigen_Vector = np.linalg.eig(Mtx_CrossProduct)
eigen_Vector = np.absolute(eigen_Vector)
# Listing (eigenvalue, eigenvector) and sorting of eigenvalues to get PC1
eig_pairs = [(np.absolute(eigen_Value[i]), eigen_Vector[i,:]) for i in range(len(eigen_Value))]
eig_pairs.sort(key=lambda tup: tup[0],reverse=True)
# Getting largest eigenvector
eig_Vector_Main = np.zeros((11,))
for i in range(len(eig_pairs)):
    eig_Vector_Main[i] = eig_pairs[i][1][0]
The components of each eigenvector follow the same order as the columns (variables v1..v11) of your original matrix, i.e. they follow the order as you say.
I haven't figured out exactly what you're doing with your lambda and 'standard' Python list, but you can probably do the same thing more elegantly and quickly by sticking to numpy, i.e.
eigen_Value, eigen_Vector = np.linalg.eig(Mtx_CrossProduct)
eigen_Vector = np.absolute(eigen_Vector)
ix = np.argsort(eigen_Value)[::-1]     # eigenvalue indices, largest first
eig_Vector_Main = eigen_Vector[:, ix]  # eig returns eigenvectors as columns, so reorder the columns
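As a small follow-up sketch (my addition, reusing the variables from the snippet above), the principal eigenvector itself is then just the first reordered column:
principal = eigen_Vector[:, ix[0]]   # column belonging to the largest eigenvalue
print(principal)                     # its 11 components follow the variable order v1..v11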