Pandas multi column input to multi column output - python

I am trying to find a good way to run a (nonlinear, injective, multivariable) transformation on columns in a pandas dataframe. The transform is a black box with multiple variables in and multiple variables out.
As an easy illustration, let's just consider converting r, theta coordinates to x, y coordinates. Run this for setup/context:
# set up example (all this is given in my case)
import numpy as np
import pandas as pd

def blackbox_transform(rtheta):
    x = rtheta[0]*np.cos(rtheta[1])
    y = rtheta[0]*np.sin(rtheta[1])
    return (x, y)

n = 50
r = np.ones(n)
theta = np.linspace(0, np.pi / 2, n)
r_theta = np.concatenate((r[:, None], theta[:, None]), axis=1)
df = pd.DataFrame(data=r_theta, columns=['r', 'theta'])
For the solution, this is the best I can come up with, but the apply and unpacking seems clunky (hoping a pandas wizard has a better approach):
# solution
xy = df[['r', 'theta']].apply(blackbox_transform, axis=1)
df = pd.concat((df, pd.DataFrame(data=[*xy], columns=['x', 'y'], index=xy.index)), axis=1)
I get that using pandas may look a little silly here, but there's a lot of other information I have in the dataframe and I need to transform some numerics columns while keeping all the indices and other info straight.

Here is a slightly more readable approach:
out = df[['r', 'theta']].apply(blackbox_transform, axis=1).apply(pd.Series)
df = df.assign(x=out[0], y=out[1])
Incidentally, there is no need to wrap the function in a lambda when you are just forwarding the same argument.
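If the extra apply(pd.Series) step bothers you, here is a minimal sketch of another option using apply's result_type='expand' (available since pandas 0.23), assuming the df and blackbox_transform defined in the question:
xy = df[['r', 'theta']].apply(blackbox_transform, axis=1, result_type='expand')
xy.columns = ['x', 'y']
df = pd.concat([df, xy], axis=1)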


Skewed graph after rotating points

I am trying to rotate a series of points around the origin given an angle in radians, but the results always give a skewed version of the graph.
Here is the method I am using; it is applied to a pandas data frame containing x and y values:
df['x'] = df['x']*math.cos(math.radians(45))-df['y']*math.sin(math.radians(45))
df['y'] = df['x']*math.sin(math.radians(45))+df['y']*math.cos(math.radians(45))
I don't understand why it is creating a skewed graph.
df['x'] = df['x']*math.cos(math.radians(45))-df['y']*math.sin(math.radians(45))
df['x'] is overwritten with the new value.
So the second line uses the new df['x'], but that's wrong: it should use the old value instead.
df['y'] = df['x']*math.sin(math.radians(45))+df['y']*math.cos(math.radians(45))
Using a temporary should fix this:
dfx = df['x']*math.cos(math.radians(45))-df['y']*math.sin(math.radians(45))
df['y'] = df['x']*math.sin(math.radians(45))+df['y']*math.cos(math.radians(45))
df['x'] = dfx
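Alternatively, a minimal sketch that avoids the temporary altogether: DataFrame.assign receives both expressions already evaluated against the original columns, so neither update sees the other's result (this assumes df has numeric x and y columns, as in the question):
import math
c, s = math.cos(math.radians(45)), math.sin(math.radians(45))
df = df.assign(
    x=df['x']*c - df['y']*s,  # both right-hand sides use the original x and y,
    y=df['x']*s + df['y']*c,  # because they are computed before assign runs
)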

Finding the argmax() of a column based on constraints in other columns of a Numpy array

My question is quite straightforward, and there is probably a really simple way to solve it that I couldn't find. First I concatenate some arrays, and then I want to find the combination of the first and second columns (data_x1, data_x2) that gives me the maximum value of y. However, there is one constraint: I want to limit all the x values to between -20 and 20; if a value is more than 20 or less than -20, I want to ignore it.
I am also using this process inside a function, so I am really looking for a way that works for any number of 'x' columns. Summarizing: I want to find the optimal y for the constrained data_x1 and data_x2, that is, the optimal value in data_y corresponding to values of data_x1 and data_x2 that are bounded by the aforementioned condition (< 20 and > -20). In the dataset I am providing, the row that contains the maximum data_y falls outside the conditions that I am imposing. For example, when I try:
y_max = data_y.max()
ID = data_y.argmax()
x1_max = data_x1[ID]
x2_max = data_x2[ID]
x2_max ends up beyond the limit that I want to impose.
Here is the dataset:
data_x1 = np.array([ 7.50581267e-01, 4.85312598e+00, -1.37035821e+00, -1.27199171e-03,
-1.61347902e+00, -2.47705419e+00, 1.54149227e-01, 2.96462913e+00,
6.39336584e+00, 2.22526551e+00, -3.13825557e+00, -4.53521105e+00,
3.66632759e+00, 6.95980810e-01, -2.08555389e+00, -3.42268057e+00,
-2.67733126e+00, 3.44611056e+00, -3.21242281e-01, -4.45557410e+00,
2.36357280e+00, 6.76143624e-01, -1.12756068e+00, 1.56898158e+00,
-2.73721604e+00, 2.63754963e+00, -4.52874687e+00, -2.96449234e+00,
-4.38481329e+00, -1.50384134e+00, -2.52651726e+00, -1.34210192e+00,
-2.39860669e-01, 1.40859346e+00, 1.85432054e-01, 5.01414945e-01,
4.55880766e+00, -1.05363585e+00, -4.62917198e+00, 2.59998127e+00,
5.25344447e+00, 3.07701918e-01, 2.26443850e+00, -2.22101423e+00,
3.02861897e-01, 1.65691179e+00, 8.81562566e-01, -1.87325712e+00,
4.63772521e+00, 2.64284088e-01, 2.53643045e+00, 9.63172795e-01,
2.36685850e+00, 2.54559573e+00, -9.02629613e-01, 2.24687227e+00,
6.22720302e+00, 5.74281188e+00, 2.03796010e+00, 4.80760151e+00])
data_x2 = np.array([-30.09938636, -28.83362992, -22.57425202, -23.14358566,
-33.59852454, -27.51674098, -30.7885103 , -25.90249062,
-22.08337401, -29.07237476, -23.04023689, -30.30583811,
-21.00309374, -29.99686696, -28.90991919, -26.62903318,
-31.72168863, -22.87107873, -30.729956 , -25.6780506 ,
-31.38729541, -27.19055645, -27.55148381, -28.68462801,
-26.05224771, -30.87040206, -22.95430799, -26.91256322,
-35.8942374 , -21.50322056, -26.16176442, -22.85920962,
-28.05071496, -34.30775127, -28.7790589 , -31.19811517,
-27.63535267, -28.96808588, -26.89286845, -32.81312953,
-27.35855807, -28.89865079, -25.61937868, -32.59681293,
-28.79511822, -22.54470727, -31.06309398, -25.30574423,
-23.52838694, -27.55017459, -24.55437336, -24.39558638,
-22.81063876, -28.62340189, -27.85680254, -25.10753673,
-29.75683744, -27.37575317, -29.61561727, -34.50702866])
data_y = np.array([2511661.54014723, 2506471.03096404, 2496512.87703406,
2500666.09145807, 2492786.42701569, 2513191.79101637,
2509515.1829362 , 2509970.89367091, 2481463.90896938,
2512505.17266542, 2496999.56860772, 2503950.65803291,
2481665.31885133, 2511985.61283778, 2512968.70827174,
2510599.791468 , 2502795.50006905, 2495342.7106848 ,
2509708.93248061, 2505715.61726413, 2504986.68522465,
2514933.54167635, 2514835.36052355, 2513916.01349115,
2510784.07070835, 2506718.40944214, 2493199.57962053,
2511925.51820147, 2466117.27254433, 2488828.88557003,
2511417.16267116, 2498364.67720219, 2515221.17931068,
2487471.40157182, 2514636.01655828, 2507757.43933369,
2508292.40113149, 2514000.76143246, 2507722.80700035,
2496671.63747914, 2505965.77313117, 2514453.85665244,
2510375.19913626, 2498705.33749204, 2514595.64115671,
2496054.0775116 , 2508144.96504256, 2509901.46588431,
2496183.49020786, 2515239.10310988, 2506016.58240813,
2507055.51518852, 2496891.65309883, 2512606.04865712,
2515010.58385846, 2508707.73815183, 2499240.78218084,
2504177.72406016, 2511686.21461949, 2477825.15797829])
I hope that I managed to be succinct and precise despite the length of the explanation. I would really appreciate your help on this one!
Your data_x2 contains no values between -20 and 20.
If you can use pandas for this, you can do (the example is for -30 < x < 30; note that newer pandas versions spell the exclusive option inclusive='neither' instead of inclusive=False)
import pandas as pd
df = pd.DataFrame({'x1': data_x1, 'x2': data_x2, 'y': data_y})
df = df[df['x1'].between(-30, 30, inclusive=False) & df['x2'].between(-30, 30, inclusive=False)]
df.sort_values(by='y', ascending=False).iloc[0]
Output:
x1 2.642841e-01
x2 -2.755017e+01
y 2.515239e+06
Name: 49, dtype: float64
Here's a function for calculating this. (Again using pandas)
def func(x1, x2, y, lower_bound, upper_bound):
    df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
    df = df[df['x1'].between(lower_bound, upper_bound, inclusive=False) & df['x2'].between(lower_bound, upper_bound, inclusive=False)]
    df.sort_values(by='y', ascending=False, inplace=True)
    if len(df):
        return df['x1'].iloc[0], df['x2'].iloc[0]
func(data_x1, data_x2, data_y, -20, 20)
Output:
None
func(data_x1, data_x2, data_y, -30, 30)
Output:
(0.264284088, -27.55017459)
EDIT:
Using pandas DataFrame is nice because it treats your data as a matrix where you can slice based on values in multiple columns. The numpy solution below works, but requires replacing values that are outside of your range with np.nan in order to keep your indexes the same.
Here's a pure numpy solution with help from Removing nan in array at position from another numpy array
data_x1 = np.where(np.logical_and(data_x1 > -30, data_x1 < 30), data_x1, np.nan)
data_x2 = np.where(np.logical_and(data_x2 > -30, data_x2 < 30), data_x2, np.nan)
mask = ~np.isnan(data_x1) & ~np.isnan(data_x2)
data_y = np.where(mask, data_y, np.nan)
idx = np.nanargmax(data_y)
data_x1[idx], data_x2[idx]
Output:
(0.264284088, -27.55017459)
That said, I would agree with Evgeny and use pandas DataFrames, as they are easier to follow IMO.
"So, firstly, I concatenate some arrays"
Three vectors?
"and then I want to find the combination of the first and second column (data_x1, data_x2) that returns me the maximum value of y."
Just one row?
"However, there is one constraint, I want to limit all the x between -20 and 20, if it is more than 20 or less than -20, I want to ignore this value."
See question above.
What prevents you from filtering the dataframe by condition on x1 and x2 and finding the y max position afterwards?
I'd suggest wrapping the numpy vectors in a dataframe to make it easier to work on them together.
Argmax on dataframe is described here
Find row where values for column is maximal in a pandas DataFrame
You may need to eliminate the rows with unsatisfying x values before finding the maximum y. If you need several of the top y values, sort by y.
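To make that concrete, here is a minimal numpy-only sketch of the filter-then-argmax idea; constrained_argmax is a hypothetical helper name, and it assumes the data_x1, data_x2 and data_y arrays from the question:
import numpy as np

def constrained_argmax(y, lower, upper, *xs):
    # keep only rows where every x column lies strictly inside (lower, upper)
    mask = np.all([(x > lower) & (x < upper) for x in xs], axis=0)
    if not mask.any():
        return None  # no row satisfies the constraints
    idx = np.flatnonzero(mask)[y[mask].argmax()]
    return idx, tuple(x[idx] for x in xs), y[idx]

# constrained_argmax(data_y, -20, 20, data_x1, data_x2) returns None here;
# constrained_argmax(data_y, -30, 30, data_x1, data_x2)
# -> (49, (0.264284088, -27.55017459), 2515239.10310988)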

How to compute the correlations of long format dataframe with pandas?

I have a dataframe with 3 columns.
UserId | ItemId | Rating
(where Rating is the rating a User gave to an Item; it's an np.float16, and the two Ids are np.int32)
How do you best compute correlations between items using python pandas?
My take is to first pivot the table to wide format and then call DataFrame.corr()
df = df.pivot(index='UserId', columns='ItemId', values='Rating')
df.corr()
It's working on small datasets, but not on big ones.
That first step creates a big matrix dataset that is mostly full of missing values. It's quite RAM intensive, and I can't run it with bigger dataframes.
Isn't there a simpler way to compute the correlations directly on the long dataset, without pivoting?
(I looked into DataFrame.groupby, but that seems to only split the dataframe, which is not what I'm looking for.)
EDIT: oversimplified data and working pivot code
import pandas as pd
import numpy as np
d = {'UserId': [1,2,3, 1,2,3, 1,2,3],
'ItemId': [1,1,1, 2,2,2, 3,3,3],
'Rating': [1.1,4.5,7.1, 5.5,3.1,5.5, 1.1,np.nan,2.2]}
df = pd.DataFrame(data=d)
df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
print(df.info())
pivot = df.pivot(index='UserId', columns='ItemId', values='Rating')
print('')
print(pivot)
corr = pivot.corr()
print('')
print(corr)
EDIT2: Large random data generator
def randDf(size=100):
    ## MAKE RANDOM DATAFRAME, df =======================
    import numpy as np
    import pandas as pd
    import random
    import math
    dict_for_df = {}
    for i in ('UserId', 'ItemId', 'Rating'):
        dict_for_df[i] = {}
        for j in range(size):
            if i == 'Rating':
                val = round(random.random()*5, 1)
            else:
                val = round(random.random() * math.sqrt(size/2))
            dict_for_df[i][j] = val  # store in a dict
    # print(dict_for_df)
    df = pd.DataFrame(dict_for_df)  # after the loop convert the dict to a dataframe
    # print(df.head())
    df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
    # df = df.astype(dtype={'UserId': np.int64, 'ItemId': np.int64, 'Rating': np.float64})
    ## remove doubles -----
    df.drop_duplicates(subset=['UserId', 'ItemId'], keep='first', inplace=True)
    ## show -----
    print(df.info())
    print(df.head())
    return df
# =======================
df = randDf()
I had another go, and have something that gets exactly the same correlation numbers as your method without using pivot, but is much slower. I can't say whether it uses less or more memory:
from scipy.stats import pearsonr
import itertools
import pandas as pd
import numpy as np

d = []
itemids = list(set(df['ItemId']))
pairsofitems = list(itertools.combinations(itemids, 2))
for itempair in pairsofitems:
    a = df[df['ItemId'] == itempair[0]][['Rating', 'UserId']]
    b = df[df['ItemId'] == itempair[1]][['Rating', 'UserId']]
    # align ratings by UserId; size the arrays so every UserId is a valid index
    # (assumes UserIds are small non-negative integers, as in the examples)
    size = df.UserId.max() + 1
    z = np.full(size, np.nan)
    z[a.UserId.values] = a.Rating.values
    w = np.full(size, np.nan)
    w[b.UserId.values] = b.Rating.values
    # keep only users who rated both items
    bad = ~np.logical_or(np.isnan(w), np.isnan(z))
    z = np.compress(bad, z)
    w = np.compress(bad, w)
    d.append({'firstitem': itempair[0],
              'seconditem': itempair[1],
              'correlation': pearsonr(z, w)[0]})
df_out = pd.DataFrame(d, columns=['firstitem', 'seconditem', 'correlation'])
This was helpful for working out how to handle the nans before taking the correlation.
The slicing in the two lines at the top of the for loop takes time. I think, though, that it may have potential if the bottlenecks could be fixed.
Yes, there is some repetition in there with the z and w variables; that could be put in a function.
Some explanation of what it does:
- find all combinations of pairs within your items
- organise an "x" and "y" set of points for UserId / Rating, dropping any point pair where one of the two is missing (nan). Think of a scatter plot, with the correlation being how well a straight line fits through it.
- run a Pearson correlation on this x-y pair
- put each ItemId pair and its correlation into a dataframe
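If the nan-array bookkeeping is the part that feels clunky, here is a sketch of the same pairwise loop that aligns the two rating vectors with a merge on UserId instead; pairwise_item_corr is a hypothetical helper, not an existing pandas function, and it assumes the long-format df from the question:
import itertools
import pandas as pd
from scipy.stats import pearsonr

def pairwise_item_corr(df):
    rows = []
    for a, b in itertools.combinations(df['ItemId'].unique(), 2):
        merged = pd.merge(
            df.loc[df['ItemId'] == a, ['UserId', 'Rating']],
            df.loc[df['ItemId'] == b, ['UserId', 'Rating']],
            on='UserId', suffixes=('_a', '_b'),
        ).dropna()
        if len(merged) >= 2:  # pearsonr needs at least two paired points
            rows.append({'firstitem': a, 'seconditem': b,
                         'correlation': pearsonr(merged['Rating_a'], merged['Rating_b'])[0]})
    return pd.DataFrame(rows, columns=['firstitem', 'seconditem', 'correlation'])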

Interpolating from multiple dataframes

I have 2 timeseries dataframes, where xp holds, say, the x coordinates of the data in fp.
I want to interpolate the values from the xp/fp combinations per date for a fixed set of x values, so the resulting output is a timeseries dataframe with the same datetime index as xp and fp and a number of columns equal to the number of elements in x.
I have tried to use numpy.interp() but end up with ValueError: object too deep for desired array.
import pandas as pd
import numpy as np
fp = pd.DataFrame(
    data=np.random.randint(0, 100, size=(10, 4)),
    index=pd.date_range("20180101", periods=10),
    columns=list('ABCD'),
)
xp = pd.DataFrame(
    data=np.column_stack([
        list(range(1, 11)),
        list(range(70, 80)),
        list(range(150, 160)),
        list(range(220, 230))
    ]),
    index=pd.date_range("20180101", periods=10),
    columns=list('ABCD'),
)
x = [60, 120, 180]
x_interp = np.interp(x,xp,fp)
It seems like np.interp can't take dataframes as input? But it sounds like this is the fastest way for me to do it for a large dataset (>3000 xp and fp rows).
Would appreciate any pointers.
UPDATE
I found a way of doing what I wanted, as below:
x_interp = pd.DataFrame.from_records(
    fp.index.to_series().apply(lambda z: np.interp(x, xp.loc[z], fp.loc[z])).values,
    index=fp.index,
)
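For what it's worth, an equivalent but perhaps more readable sketch (assuming the x, xp and fp defined above) builds the rows with a plain comprehension and labels the columns with the x values:
x_interp = pd.DataFrame(
    [np.interp(x, xp.loc[d], fp.loc[d]) for d in fp.index],
    index=fp.index,
    columns=x,
)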

How to create a grid from LiDAR points (X,Y,Z) with GDAL python?

I'm really new to python programming, and I was just wondering if you can create a regular grid of 0.5 by 0.5 m resolution using LiDAR points.
My data are in LAS format (read with from liblas import file as lasfile) and they have the following format: X, Y, Z, where X and Y are coordinates.
The points are randomly positioned, some pixels are empty (NaN value), and in some pixels there is more than one point. Where there is more than one point, I wish to obtain a mean value. In the end I need to save the data in TIF or ASCII format.
I am studying the osgeo module and GDAL, but honestly I don't know if the osgeo module is the best solution.
I would really be glad for help with some code that I can study and implement.
Thanks in advance for the help, I really need it.
I don't know the best way to get a grid with these parameters.
It's a bit late but maybe this answer will be useful for others, if not for you...
I have done this with Numpy and Pandas, and it's pretty fast. I was using TLS data and could do this with several million data points without any trouble on a decent 2009-vintage laptop. The key is 'binning' by rounding the data, and then using Pandas' GroupBy methods to do the aggregating and calculate the means.
If you need to round to a power of 10 you can use np.round, otherwise you can round to an arbitrary value by making a function to do so, which I have done by modifying this SO answer.
import numpy as np
import pandas as pd
# make rounding function:
def round_to_val(a, round_val):
    return np.round(np.array(a, dtype=float) / round_val) * round_val
# load data (placeholder filename; expects an array of shape (n, 3) with x, y, z columns)
data = np.load('your_data.npy')
n_d = data.shape[0]
# round the data
d_round = np.empty( [n_d, 5] )
d_round[:,0] = data[:,0]
d_round[:,1] = data[:,1]
d_round[:,2] = data[:,2]
del data # free up some RAM
d_round[:,3] = round_to_val( d_round[:,0], 0.5)
d_round[:,4] = round_to_val( d_round[:,1], 0.5)
# sorting data
ind = np.lexsort( (d_round[:,4], d_round[:,3]) )
d_sort = d_round[ind]
# making dataframes and grouping stuff
df_cols = ['x', 'y', 'z', 'x_round', 'y_round']
df = pd.DataFrame( d_sort)
df.columns = df_cols
df_round = df[['x_round', 'y_round', 'z']]
group_xy = df_round.groupby(['x_round', 'y_round'])
# calculating the mean, write to csv, which saves the file with:
# [x_round, y_round, z_mean] columns. You can exit Python and then start up
# later to clear memory if that's an issue.
group_mean = group_xy.mean()
group_mean.to_csv('your_binned_data.csv')
# Restarting...
import numpy as np
from scipy.interpolate import griddata
binned_data = np.loadtxt('your_binned_data.csv', skiprows=1, delimiter=',')
x_bins = binned_data[:,0]
y_bins = binned_data[:,1]
z_vals = binned_data[:,2]
pts = np.array( [x_bins, y_bins])
pts = pts.T
# make grid (with borders rounded to 0.5...)
xmin, xmax = 637000, 640000.5
ymin, ymax = 6067000, 6070000.5
grid_x, grid_y = np.mgrid[xmin:xmax:0.5, ymin:ymax:0.5]
# interpolate onto grid
data_grid = griddata(pts, z_vals, (grid_x, grid_y), method='cubic')
# save to ascii
np.savetxt('data_grid.txt', data_grid)
When I've done this, I have saved the output as a .npy and converted to a tiff with the Image library, and then georeferenced in ArcMap. There is probably a way to do that with osgeo but I haven't used it.
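Since the question specifically asks about GDAL, here is a hedged sketch of writing the gridded array to a GeoTIFF with osgeo; the filename, nodata value and EPSG code are assumptions, and the array may need flipping or transposing so that rows run north to south:
import numpy as np
from osgeo import gdal, osr

grid = np.load('data_grid.npy')               # placeholder: your gridded means, shape (nrows, ncols)
grid = np.where(np.isnan(grid), -9999, grid)  # griddata leaves nan in empty cells
xmin, ymax, cell = 637000.0, 6070000.5, 0.5   # assumed top-left corner and resolution
nrows, ncols = grid.shape

driver = gdal.GetDriverByName('GTiff')
ds = driver.Create('data_grid.tif', ncols, nrows, 1, gdal.GDT_Float32)
ds.SetGeoTransform((xmin, cell, 0, ymax, 0, -cell))  # top-left x, dx, 0, top-left y, 0, -dy

srs = osr.SpatialReference()
srs.ImportFromEPSG(25832)                     # hypothetical CRS; use whatever your LAS data is in
ds.SetProjection(srs.ExportToWkt())

band = ds.GetRasterBand(1)
band.WriteArray(grid)
band.SetNoDataValue(-9999)
ds.FlushCache()
ds = None                                     # closing the dataset writes the file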
Hope this helps someone at least...
You can use the histogram function in Numpy to do binning, for instance:
import numpy as np
points = np.random.random(1000)   # sample positions
values = np.random.random(1000)   # values to average per bin (stand-in for your z data)
# create bin edges from 0 to 1
bins = np.linspace(0, 1, 10)
means = (np.histogram(points, bins, weights=values)[0] /
         np.histogram(points, bins)[0])
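Extending the same idea to 2D for the 0.5 m grid, a hedged sketch using np.histogram2d (with synthetic x, y, z standing in for your LAS coordinates):
import numpy as np

# synthetic LiDAR-like points; replace with your real x, y, z arrays
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 10_000)
y = rng.uniform(0, 100, 10_000)
z = rng.uniform(0, 10, 10_000)

cell = 0.5
x_edges = np.arange(x.min(), x.max() + cell, cell)
y_edges = np.arange(y.min(), y.max() + cell, cell)

# sum of z per cell divided by point count per cell gives the mean z per cell
sums, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges], weights=z)
counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
with np.errstate(invalid='ignore'):
    mean_z = sums / counts  # cells with no points come out as nan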
Try LAStools, particularly lasgrid or las2dem.
