Healpy map2alm and alm2map inconsistency? - python

I'm just starting to work with Healpy and have noticed that if I use a map to get alm's and then use those alm's to generate a new map, I do not get the map I started with. Here's what I'm looking at:
import numpy as np
import healpy as hp
nside = 2 # healpix nside parameter
m = np.arange(hp.nside2npix(nside)) # create a map to test
alm = hp.map2alm(m) # compute alm's
new_map = hp.alm2map(alm, nside) # create new map from computed alm's
# Let's look at two maps
print(m)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47] # as expected
print(new_map)
[-23.30522233 -22.54434515 -21.50906755 -20.09203749 -19.48841773
-18.66392484 -16.99593867 -16.789984 -15.14587061 -14.57960049
-13.4403252 -13.35992138 -10.51368725 -10.49793946 -10.1262039
-8.6340571 -7.41789272 -6.87712224 -5.75765487 -3.75121764
-4.35825512 -1.6221964 -1.03902923 -0.41478954 0.52480646
2.34629955 2.1511705 2.40325268 5.39576497 5.38390848
5.78324832 7.24779083 8.4915595 9.0047257 10.15179735
12.1306303 12.62672772 13.4512206 15.11920678 15.32516145
16.96927483 17.53554496 18.67482024 18.75522407 20.42078855
21.18166574 22.21694334 23.6339734 ] # not what I was expecting
As you can see, new_map doesn't match the input map, m. I imagine there's some subtlety to these functions that I'm missing. Any idea?

I get a different result:
print(new_map)
[ 0.15859344, 0.91947062, 1.95474822, 3.37177828,
4.01808325, 4.84257613, 6.51056231, 6.71651698,
8.36063036, 8.92690049, 10.06617577, 10.1465796 ,
12.98620654, 13.00668621, 13.3736899 , 14.87056857,
16.08200108, 16.62750343, 17.74223892, 19.75340803,
19.13441288, 21.8704716 , 22.45363877, 23.07787846,
24.01747446, 25.83896755, 25.6438385 , 25.89592068,
28.89565876, 28.88853415, 29.28314212, 30.7524165 ,
31.9914533 , 32.50935137, 33.65169114, 35.63525597,
36.13322869, 36.95772158, 38.62570775, 38.83166242,
40.47577581, 41.04204594, 42.18132122, 42.26172504,
43.88460433, 44.64548151, 45.68075911, 47.09778917]
Older versions of healpy automatically removed a constant offset from the map before the transform, so it is best to update healpy to the latest version.
The residual difference arises because the pixelization itself introduces an error, and this error is larger at low nside.

Related

Sort rows of curve shaped data in python

I have a dataset that consists of 5 rows that are formed like a curve. I want to separate the inner row from the other or if possible each row and store them in a separate array. Is there any way to do this, like somehow flatten the curved data and sorting it afterwards based on the x and y values?
I would like to assign each row from left to right numbers from 0 to the max of the row. Right now the labels for each dot are not useful for me and I can't change the labels.
Here are the first 50 data points of my data set:
x y
0 -6.4165 0.3716
1 -4.0227 2.63
2 -7.206 3.0652
3 -3.2584 -0.0392
4 -0.7565 2.1039
5 -0.0498 -0.5159
6 2.363 1.5329
7 -10.7253 3.4654
8 -8.0621 5.9083
9 -4.6328 5.3028
10 -1.4237 4.8455
11 1.8047 4.2297
12 4.8147 3.6074
13 -5.3504 8.1889
14 -1.7743 7.6165
15 1.1783 6.9698
16 4.3471 6.2411
17 7.4067 5.5988
18 -2.6037 10.4623
19 0.8613 9.7628
20 3.8054 9.0202
21 7.023 8.1962
22 9.9776 7.5563
23 0.1733 12.6547
24 3.7137 11.9097
25 6.4672 10.9363
26 9.6489 10.1246
27 12.5674 9.3369
28 3.2124 14.7492
29 6.4983 13.7562
30 9.2606 12.7241
31 12.4003 11.878
32 15.3578 11.0027
33 6.3128 16.7014
34 9.7676 15.6557
35 12.2103 14.4967
36 15.3182 13.5166
37 18.2495 12.5836
38 9.3947 18.5506
39 12.496 17.2993
40 15.3987 16.2716
41 18.2212 15.1871
42 21.1241 14.0893
43 12.3548 20.2538
44 15.3682 18.9439
45 18.357 17.8862
46 21.0834 16.6258
47 23.9992 15.4145
48 15.3776 21.9402
49 18.3568 20.5803
50 21.1733 19.3041
It seems that your curves follow a pattern, so you could select the curve of interest using slicing. I had to offset the selection slightly to get the five curves because the first 8 points are not in the same order as the rest of the data, so those initial 8 data points are discarded. They can be added back in afterwards if required.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({ 'x': [-6.4165, -4.0227, -7.206, -3.2584, -0.7565, -0.0498, 2.363, -10.7253, -8.0621, -4.6328, -1.4237, 1.8047, 4.8147, -5.3504, -1.7743, 1.1783, 4.3471, 7.4067, -2.6037, 0.8613, 3.8054, 7.023, 9.9776, 0.1733, 3.7137, 6.4672, 9.6489, 12.5674, 3.2124, 6.4983, 9.2606, 12.4003, 15.3578, 6.3128, 9.7676, 12.2103, 15.3182, 18.2495, 9.3947, 12.496, 15.3987, 18.2212, 21.1241, 12.3548, 15.3682, 18.357, 21.0834, 23.9992, 15.3776, 18.3568, 21.1733],
'y': [0.3716, 2.63, 3.0652, -0.0392, 2.1039, -0.5159, 1.5329, 3.4654, 5.9083, 5.3028, 4.8455, 4.2297, 3.6074, 8.1889, 7.6165, 6.9698, 6.2411, 5.5988, 10.4623, 9.7628, 9.0202, 8.1962, 7.5563, 12.6547, 11.9097, 10.9363, 10.1246, 9.3369, 14.7492, 13.7562, 12.7241, 11.878, 11.0027, 16.7014, 15.6557, 14.4967, 13.5166, 12.5836, 18.5506, 17.2993, 16.2716, 15.1871, 14.0893, 20.2538, 18.9439, 17.8862, 16.6258, 15.4145, 21.9402, 20.5803, 19.3041]})
# Generate the 5 dataframes
df_list = [df.iloc[i+8::5, :] for i in range(5)]
# Generate the plot
fig = plt.figure()
for frame in df_list:
    plt.scatter(frame['x'], frame['y'])
plt.show()
# Print the data of the innermost curve
print(df_list[4])
OUTPUT:
The 5th dataframe df_list[4] contains the data of the innermost plot.
x y
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
You can then add the missing data like this:
# Retrieve the two missing points of the inner curve
inner_curve = pd.concat([df_list[4], df[5:7]]).sort_index(ascending=True)
print(inner_curve)
# Plot the inner curve only
fig2 = plt.figure()
plt.scatter(inner_curve['x'], inner_curve['y'], color = '#9467BD')
plt.show()
OUTPUT: inner curve
x y
5 -0.0498 -0.5159
6 2.3630 1.5329
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
Complete Inner Curve
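The stride-based selection above can be wrapped in a small helper that also renumbers each curve from left to right, which is what the question asked for. This is only a sketch: `split_curves` and its `offset` and `n_curves` parameters are hypothetical names, and it assumes the points arrive in repeating groups of `n_curves`, as rows 8 to 22 of the sample data do.

```python
import pandas as pd

# Rows 8-22 of the sample data: three consecutive groups of five points.
df = pd.DataFrame({
    'x': [-8.0621, -4.6328, -1.4237, 1.8047, 4.8147,
          -5.3504, -1.7743, 1.1783, 4.3471, 7.4067,
          -2.6037, 0.8613, 3.8054, 7.023, 9.9776],
    'y': [5.9083, 5.3028, 4.8455, 4.2297, 3.6074,
          8.1889, 7.6165, 6.9698, 6.2411, 5.5988,
          10.4623, 9.7628, 9.0202, 8.1962, 7.5563],
})

def split_curves(frame, offset=0, n_curves=5):
    """Hypothetical helper: take every n_curves-th row starting at offset + i,
    then sort each curve by x and relabel its points 0, 1, 2, ..."""
    curves = []
    for i in range(n_curves):
        curve = frame.iloc[offset + i::n_curves]
        curves.append(curve.sort_values('x').reset_index(drop=True))
    return curves

curves = split_curves(df)
inner = curves[4]  # same position within each group as df_list[4] above
print(inner)
```

With the full data set, `offset=8` would reproduce the selection used in the answer. The points skipped by the offset still have to be added back by hand.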

Given a SciPy discrete random variable distribution, how do I round a number to the closest value in that distribution? [duplicate]

What I ultimately want to do is round the expected value of a discrete random variable distribution to a valid value in that distribution. For example, if I draw evenly from the numbers [1, 5, 6], the expected value is 4, but I want to return the closest valid number (i.e., 5).
import numpy as np
from scipy.stats import rv_discrete
xk = (1, 5, 6)
pk = np.ones(len(xk))/len(xk)
custom = rv_discrete(name='custom', values=(xk, pk))
print(custom.expect())
# 4.0
def round_discrete(discrete_rv_dist, val):
    # do something here
    return answer
print(round_discrete(custom, custom.expect()))
# 5.0
I don't know a priori what distribution will be used (i.e., the values might not be integers, or the distribution might be unbounded), so I'm really struggling to think of an algorithm that is sufficiently generic. Edit: I just learned that rv_discrete doesn't work on non-integer xk values.
As to why I want to do this: I'm putting together a Monte Carlo simulation and want a "nominal" value for each distribution. I think that the EV is more physically appropriate than the mode or median. Some values in the downstream simulation have to be one of several discrete choices, so passing a value that is not within that set is not acceptable.
If there's already a nice way to do this in Python that would be great, otherwise I can interpret math into code.
Here is R code that I think will do what you want, using Poisson data to illustrate:
set.seed(322)
x = rpois(100, 7) # 100 obs from POIS(7)
a = mean(x); a
[1] 7.16 # so 7 is the value we want
d = min(abs(x-a)); d # min distance btw a and actual Pois val
[1] 0.16
u = unique(x); u # unique Pois values observed
[1] 7 5 4 10 2 9 8 6 11 3 13 14 12 15
v = u[abs(u-a)==d]; v # unique val closest to a
[1] 7
Hope you can translate it to Python.
Another run:
set.seed(323)
x = rpois(100, 20)
a = mean(x); a
[1] 20.32
d = min(abs(x-a)); d
[1] 0.32
u = unique(x)
v = u[abs(u-a)==d]; v
[1] 20
x
[1] 17 16 20 23 23 20 19 23 21 19 21 20 22 25 13 15 19 19 14 27 19 30 17 19 23
[26] 16 23 26 33 16 11 23 14 21 24 12 18 20 20 19 26 12 22 24 20 22 17 23 11 19
[51] 19 26 17 17 11 17 23 21 26 13 18 28 22 14 17 25 28 24 16 15 25 26 22 15 23
[76] 27 19 21 17 23 21 24 23 22 23 18 25 14 24 25 19 19 21 22 16 28 18 11 25 23
u
[1] 17 16 20 23 19 21 22 25 13 15 14 27 30 26 33 11 24 12 18 28
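The R steps above can be sketched in Python with NumPy. The function name `closest_observed_value` and the sample array are illustrative assumptions. The sketch works on observed samples, as the R code does, rather than on a SciPy distribution object.

```python
import numpy as np

def closest_observed_value(samples, target):
    """Return the distinct sample value nearest to target.
    On a tie, the smaller value is returned."""
    unique_values = np.unique(samples)          # counterpart of unique(x) in R, but sorted
    distances = np.abs(unique_values - target)  # counterpart of abs(u - a)
    return unique_values[np.argmin(distances)]  # argmin keeps the first minimum

x = np.array([1, 5, 6, 5, 1, 6])  # illustrative draws from the values in the question
a = x.mean()
v = closest_observed_value(x, a)
print(a, v)
```

Unlike the R version, which compares distances with `==`, this picks the minimum directly and so always returns a single value.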
Figured it out, and tested it working. If I plug my value X into the cdf, then I can plug that probability P = cdf(X) into the ppf. The values at ppf(P +- epsilon) will give me the closest values in the set to X.
Or more geometrically, for a discrete pmf, the point (X,P) will lie on a horizontal portion of the corresponding cdf. When you invert the cdf, (P,X) is now on a vertical section of the ppf. Taking P +- eps will give you the 2 nearest flat portions of the ppf connected to that vertical jump, which correspond to the valid values X1, X2. You can then do a simple difference to figure out which is closer to your target value.
import numpy as np
eps = np.finfo(float).eps
ev = custom.expect()
p = custom.cdf(ev)
ev_candidates = custom.ppf([p - eps, p, p + eps])
ev_candidates_distance = abs(ev_candidates - ev)
ev_closest = ev_candidates[np.argmin(ev_candidates_distance)]
print(ev_closest)
# 5.0
Terms:
pmf - probability mass function
cdf - cumulative distribution function (cumulative sum of the pmf)
ppf - percent point function (inverse of the cdf)
eps - machine epsilon (the gap between 1.0 and the next representable float)
Would the function ceil from the math library help? For example:
from math import ceil
print(float(ceil(3.333333333333333)))

How can I multiply a numpy array with pandas series?

I have a NumPy array of shape (50,):
array([1.01255569e+00, 1.04166667e+00, 1.07158165e+00, 1.10229277e+00,
1.13430127e+00, 1.16387337e+00, 1.20365912e+00, 1.24007937e+00,
1.27877238e+00, 1.31856540e+00, 1.35281385e+00, 1.40291807e+00,
1.45180023e+00, 1.49700599e+00, 1.55183116e+00, 1.60051216e+00,
1.66002656e+00, 1.73370319e+00, 1.80115274e+00, 1.87687688e+00,
1.95312500e+00, 2.04750205e+00, 2.14961307e+00, 2.23613596e+00,
2.34082397e+00, 2.48015873e+00, 2.61780105e+00, 2.75027503e+00,
2.91715286e+00, 3.07881773e+00, 3.31564987e+00, 3.57142857e+00,
3.81679389e+00, 4.17362270e+00, 4.51263538e+00, 4.95049505e+00,
5.59284116e+00, 6.17283951e+00, 7.02247191e+00, 8.03858521e+00,
9.72762646e+00, 1.17370892e+01, 1.47928994e+01, 2.10084034e+01,
3.12500000e+01, 4.90196078e+01, 9.25925926e+01, 2.08333333e+02,
5.00000000e+02, 1.25000000e+03])
And I have a pandas dataframe of length 50 as well.
x
0 9.999740e-01
1 9.981870e-01
2 9.804506e-01
3 9.187764e-01
4 8.031568e-01
5 6.544660e-01
6 5.032716e-01
7 3.707446e-01
8 2.650768e-01
9 1.857835e-01
10 1.285488e-01
11 8.824506e-02
12 6.030141e-02
13 4.111080e-02
14 2.800453e-02
15 1.907999e-02
16 1.301045e-02
17 8.882996e-03
18 6.074386e-03
19 4.161024e-03
20 2.855636e-03
21 1.963543e-03
22 1.352791e-03
23 9.338596e-04
24 6.459459e-04
25 4.476854e-04
26 3.108912e-04
27 2.163201e-04
28 1.508106e-04
29 1.053430e-04
30 7.372442e-05
31 5.169401e-05
32 3.631486e-05
33 2.555852e-05
34 1.802129e-05
35 1.272995e-05
36 9.008454e-06
37 6.386289e-06
38 4.535381e-06
39 3.226546e-06
40 2.299394e-06
41 1.641469e-06
42 1.173785e-06
43 8.407618e-07
44 6.032249e-07
45 4.335110e-07
46 3.120531e-07
47 2.249870e-07
48 1.624726e-07
49 1.175140e-07
I want to multiply each element of the NumPy array by the corresponding value in the pandas column.
Example:
1.01255569e+00*9.999740e-01
1.04166667e+00*9.981870e-01
Desired output
numpy array of same size.
You can just use the .values property of the 'x' series in your Pandas dataframe:
df['x'].values * arr
where df is your dataframe and arr is your array.
The above expression returns the result as a NumPy array. If you want a pandas Series instead, you can omit .values:
df['x'] * arr
Or use np.multiply, where n is the array and p is the dataframe:
print(np.multiply(n,p['x'].values))
Or pd.Series.multiply:
print(np.array(p['x'].multiply(n)))
Or pd.Series.mul:
print(np.array(p['x'].mul(n)))
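As a minimal sketch of the options above, the snippet below uses only the first three values from the question. The names `arr` and `df` follow the first answer. It shows that multiplying through a NumPy array gives an ndarray, while multiplying the column directly gives a Series that keeps the dataframe's index.

```python
import numpy as np
import pandas as pd

# First three values of the array and of column 'x' from the question.
arr = np.array([1.01255569, 1.04166667, 1.07158165])
df = pd.DataFrame({'x': [9.999740e-01, 9.981870e-01, 9.804506e-01]})

as_array = df['x'].to_numpy() * arr  # plain NumPy array
as_series = df['x'] * arr            # pandas Series with df's index

print(type(as_array).__name__, as_array)
print(type(as_series).__name__)
```

Both forms multiply element by element in positional order, so the array and the column must have the same length.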

Calculating average/standard deviations of rows containing certain string in pandas dataframe

I have a large pandas dataframe read as table. I would like to calculate the means and standard deviations of the two different groups, CRPS and Age, so I can plot them in a bar plot with std deviations as the error bars.
I can get the mean calculated by just the Age column. I figured it's a for loop that I have to construct, but I don't know how to construct further than table["Age"].mean(), which just gives me the average of all data points' age values. This is where I need some guidance. I want to look in the group column, tell it to calculate the average and standard deviation for the ages of that group. So, an average and standard deviation value for the ages of the CRPS group, for example.
The first rows are shown below just to illustrate what the dataframe looks like. I have also imported numpy as np.
Group Age
0 CRPS 50
1 CRPS 59
2 CRPS 22
3 CRPS 48
4 CRPS 53
5 CRPS 48
6 CRPS 29
7 CRPS 44
8 CRPS 28
9 CRPS 42
10 CRPS 35
11 CONTROLS 54
12 CONTROLS 43
13 CRPS 50
14 CRPS 62
15 CONTROLS 64
16 CONTROLS 39
17 CRPS 40
18 CRPS 59
19 CRPS 46
20 CONTROLS 56
21 CRPS 21
22 CRPS 45
23 CONTROLS 41
24 CRPS 46
25 CONTROLS 35
I don't think you need a for-loop.
Instead, you might try something like:
table.loc[table['Group'] == 'CRPS', 'Age'].mean()
I haven't tested with your table, but I think that will work.
The idea is to first create a boolean array, which is true for the rows where the group field is 'CRPS', then to select all of those rows using loc (iloc does not accept a boolean Series as a mask), and finally to take the mean. You could iterate over all of the groups in the following way:
mean_age = dict()
for group in set(table['Group']):
    mean_age[group] = table.loc[table['Group'] == group, 'Age'].mean()
Maybe this is where you intended to use a for loop.
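The per-group loop can also be expressed with pandas' `groupby`, which computes the mean and standard deviation for every group in one call. This is a sketch that uses a handful of rows copied from the table in the question. The variable name `stats` is illustrative.

```python
import pandas as pd

# Rows 0, 1, 2, 11, 12, 13 and 15 of the table shown in the question.
table = pd.DataFrame({
    'Group': ['CRPS', 'CRPS', 'CRPS', 'CONTROLS', 'CONTROLS', 'CRPS', 'CONTROLS'],
    'Age': [50, 59, 22, 54, 43, 50, 64],
})

# One row per group, with the mean and the sample standard deviation of Age.
stats = table.groupby('Group')['Age'].agg(['mean', 'std'])
print(stats)
```

Note that pandas computes the sample standard deviation (ddof=1) by default. The `stats['mean']` and `stats['std']` columns could then serve as bar heights and error bars in a plot.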

Error in using knn for multidimensional data

I am a beginner in machine learning, and I am trying to classify multi-dimensional data into two classes. Each data point is 40x6 float values. To begin with, I have read my csv file. In this file, the shot number identifies the data point.
https://docs.google.com/spreadsheets/d/1tW1xJqnNZa1PhVDAE-ieSVbcdqhT8XfYGy8ErUEY_X4/edit?usp=sharing
Here is the code in python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot

from sklearn.neighbors import KNeighborsClassifier

# Read csv data into pandas data frame
data_frame = pd.read_csv('data.csv')

extract_columns = ['LinearAccX', 'LinearAccY', 'LinearAccZ', 'Roll', 'pitch', 'compass']

# Number of sample in one shot
samples_per_shot = 40

# Calculate number of shots in dataframe (integer division so range() accepts it)
count_of_shots = len(data_frame.index) // samples_per_shot

# Initialize Empty data frame
training_index = range(count_of_shots)
training_data_list = []

# flag for backward compatibility
make_old_data_compatible_with_new = 0

if make_old_data_compatible_with_new:
    # Convert 40 shot data to 25 shot data
    # New logic takes 25 samples/shot
    # old logic takes 40 samples/shot
    start_shot_sample_index = 9
    end_shot_sample_index = 34
else:
    # Start index from 1 and continue till lets say 40
    start_shot_sample_index = 1
    end_shot_sample_index = samples_per_shot

# Extract each shot into pandas series
for shot in range(count_of_shots):
    # Extract current shot
    current_shot_data = data_frame[data_frame['shot_no']==(shot+1)]

    # Select only the following column
    selected_columns_from_shot = current_shot_data[extract_columns]

    # Select columns from selected rows
    # Find start and end row indexes
    current_shot_data_start_index = shot * samples_per_shot + start_shot_sample_index
    current_shot_data_end_index = shot * samples_per_shot + end_shot_sample_index
    # .loc replaces the .ix indexer, which recent pandas versions no longer provide
    selected_rows_from_shot = selected_columns_from_shot.loc[current_shot_data_start_index:current_shot_data_end_index]

    # Append to list of lists
    # Convert selected shot into multi-dimensional array
    training_data_list.append([selected_columns_from_shot[extract_columns[index]].values.tolist() for index in range(len(extract_columns))])

# Append each sliced shot into training data
training_data = pd.DataFrame(training_data_list, columns=extract_columns)
training_features = [1 for i in range(count_of_shots)]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(training_data, training_features)
After running the above code, I am getting an error
ValueError: setting an array element with a sequence.
for the line
knn.fit(training_data, training_features)
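No answer is recorded for this question. A likely cause, offered here as an assumption rather than a confirmed diagnosis, is that each cell of training_data holds a 40-element list, which cannot be converted into the 2-D numeric array that the classifier expects. One possible workaround is to flatten each shot into a single row of 40 x 6 = 240 numbers. The sketch below uses random placeholder data instead of the asker's file and assumes the rows of each shot are contiguous. The names `raw`, `shots`, and `features` are illustrative.

```python
import numpy as np

samples_per_shot = 40
n_columns = 6
count_of_shots = 5  # placeholder; the real value comes from the csv file

# Placeholder for data_frame[extract_columns].to_numpy(): one row per sample.
rng = np.random.default_rng(0)
raw = rng.normal(size=(count_of_shots * samples_per_shot, n_columns))

# Group consecutive rows into shots, then flatten each shot into one feature row.
shots = raw.reshape(count_of_shots, samples_per_shot, n_columns)
features = shots.reshape(count_of_shots, -1)

print(features.shape)
```

An array shaped like `features` is the kind of input that could be passed to knn.fit together with one label per shot. For a two-class problem, those labels would need to contain both classes rather than all ones.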
