I have two time series DataFrames, where xp holds, say, the x coordinates of the data in fp.
I want to interpolate the values from the xp/fp combinations for each date at a fixed set of x values, so the resulting output is a time series DataFrame with the same datetime index as xp and fp and a number of columns equal to the number of elements in x.
I have tried to use numpy.interp() but end up with ValueError: object too deep for desired array.
import pandas as pd
import numpy as np

fp = pd.DataFrame(
    data=np.random.randint(0, 100, size=(10, 4)),
    index=pd.date_range("20180101", periods=10),
    columns=list('ABCD'),
)
xp = pd.DataFrame(
    data=np.column_stack([
        list(range(1, 11)),
        list(range(70, 80)),
        list(range(150, 160)),
        list(range(220, 230)),
    ]),
    index=pd.date_range("20180101", periods=10),
    columns=list('ABCD'),
)
x = [60, 120, 180]
x_interp = np.interp(x, xp, fp)
It seems like np.interp can't take DataFrames as input? But it sounds like this is the fastest way for me to do it for a large dataset (of >3000 xp and fp rows).
Would appreciate any pointers.
UPDATE
I found a way of doing what I wanted, as below:
x_interp = pd.DataFrame.from_records(fp.index.to_series().apply(lambda z: np.interp(x, xp.loc[z], fp.loc[z])).values, index=fp.index)
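For reference, a list-comprehension version of the same idea that some may find more readable (a sketch, assuming xp and fp share the same index and columns as in the setup above):
x = [60, 120, 180]
x_interp = pd.DataFrame(
    [np.interp(x, xp.loc[d], fp.loc[d]) for d in fp.index],
    index=fp.index,
    columns=x,  # one output column per element of x
)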
I am trying to find a good way to run a (nonlinear, injective, multivariable) transformation on columns in a pandas dataframe. The transform is a black box with multiple variables in and multiple variables out.
As an easy illustration, let's just consider converting r, theta coordinates to x, y coordinates. Run this for setup/context:
# set up example (all this is given in my case)
import numpy as np
import pandas as pd

def blackbox_transform(rtheta):
    x = rtheta[0] * np.cos(rtheta[1])
    y = rtheta[0] * np.sin(rtheta[1])
    return (x, y)

n = 50
r = np.ones(n)
theta = np.linspace(0, np.pi / 2, n)
r_theta = np.concatenate((r[:, None], theta[:, None]), axis=1)
df = pd.DataFrame(data=r_theta, columns=['r', 'theta'])
For the solution, this is the best I can come up with, but the apply and unpacking seem clunky (hoping a pandas wizard has a better approach):
# solution
xy = df[['r', 'theta']].apply(blackbox_transform, axis=1)
df = pd.concat((df, pd.DataFrame(data=[*xy], columns=['x', 'y'], index=xy.index)), axis=1)
I get that using pandas may look a little silly here, but there's a lot of other information in the dataframe and I need to transform some numeric columns while keeping all the indices and other info straight.
Here is a slightly more readable approach:
out = df[['r', 'theta']].apply(blackbox_transform, axis=1).apply(pd.Series)
df = df.assign(x=out[0], y=out[1])
By the way, the lambda is dispensable when you are just forwarding the same argument.
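If the transform happens to be vectorized (as the r/theta example above is, since NumPy ufuncs broadcast over arrays), here is a sketch of a way to skip apply entirely and pass whole columns at once; a genuinely scalar-only black box would still need the row-wise apply:
# pass the full r and theta columns through the transform in one call
x_col, y_col = blackbox_transform((df['r'].to_numpy(), df['theta'].to_numpy()))
df = df.assign(x=x_col, y=y_col)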
I am trying to select only the part of the data that falls within a specific time range, which is different for every pixel.
For indexing, I have two np.datetime64[ns] xr.DataArrays with shape (lat: 152, lon: 131), named time_range_min and time_range_max.
One holds the start dates and the other one the end dates.
I tried this to select the data:
dataset = data.sel(time=slice(time_range_min, time_range_max))
but it yields
cannot use non-scalar arrays in a slice for xarray indexing:
<xarray.DataArray 'NDVI' (lat: 152, lon: 131)>
Does the fact that I cannot use non-scalar arrays mean that this is in general not possible, or can I transform my arrays somehow?
If "time" is a list of date strings ordered from past to present (e.g. ["10-20-2021", "10-21-2021", ...]):
import numpy as np

listOfMinMaxTimeRanges = [time_range_min, time_range_max]
specifiedRangeOfTimeIndexedList = []
for rangeIndex in range(np.shape(listOfMinMaxTimeRanges)[1]):
    # positions of this range's start and end dates in the "time" list
    startIndex = time.index(listOfMinMaxTimeRanges[0][rangeIndex])
    endIndex = time.index(listOfMinMaxTimeRanges[1][rangeIndex])
    specifiedRangeOfTimeIndexed = [
        t for t in range(len(time)) if startIndex <= t <= endIndex
    ]
    specifiedRangeOfTimeIndexedList.extend(specifiedRangeOfTimeIndexed)
Depending on how your dataset is structured:
dataset = data.sel(time = specifiedRangeOfTimeIndexedList)
or
dataset = data.sel(time = time[specifiedRangeOfTimeIndexedList])
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[:, time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList], :, :]
or
dataset = dataset[specifiedRangeOfTimeIndexedList]
...
I found a way to group every cell with stacking in xarray:
time_range_min and time_range_max now each mark a single date
stack = dataset.value.stack(gridcell=['lat', 'lon'])
for unique_value, grouped_array in stack.groupby('gridcell'):
    grouped_array.sel(time=slice(time_range_min, time_range_max))
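Alternatively, a loop-free sketch (assuming, as in the question, that data has dims (time, lat, lon) and that NaNs outside each pixel's range are acceptable): compare the time coordinate against the per-pixel start and end dates, which broadcasts to a 3-D boolean mask, and apply it with where:
# (time,) compared against (lat, lon) broadcasts to a (time, lat, lon) mask
mask = (data.time >= time_range_min) & (data.time <= time_range_max)
masked = data.where(mask)  # values outside each pixel's own time range become NaN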
I have a dataframe with 3 columns.
UserId | ItemId | Rating
(where Rating is the rating a User gave to an Item; it's an np.float16, and the two Ids are np.int32)
How do you best compute correlations between items using python pandas?
My take is to first pivot the table (to wide format) and then apply DataFrame.corr():
df = df.pivot(index='UserId', columns='ItemId', values='Rating')
df.corr()
It's working on small datasets, but not on big ones.
That first step creates a big matrix dataset mostly full of missing values. It's quite RAM-intensive and I can't run it with bigger dataframes.
Isn't there a simpler way to compute the correlations directly on the long dataset, without pivoting?
(I looked into DataFrame.groupby, but that seems to only split the dataframe, which is not what I'm looking for.)
EDIT: oversimplified data and working pivot code
import pandas as pd
import numpy as np
d = {'UserId': [1, 2, 3, 1, 2, 3, 1, 2, 3],
     'ItemId': [1, 1, 1, 2, 2, 2, 3, 3, 3],
     'Rating': [1.1, 4.5, 7.1, 5.5, 3.1, 5.5, 1.1, np.nan, 2.2]}
df = pd.DataFrame(data=d)
df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
print(df.info())
pivot = df.pivot(index='UserId', columns='ItemId', values='Rating')
print('')
print(pivot)
corr = pivot.corr()
print('')
print(corr)
EDIT2: Large random data generator
def randDf(size=100):
    ## MAKE RANDOM DATAFRAME, df =======================
    import numpy as np
    import pandas as pd
    import random
    import math
    dict_for_df = {}
    for i in ('UserId', 'ItemId', 'Rating'):
        dict_for_df[i] = {}
        for j in range(size):
            if i == 'Rating': val = round(random.random() * 5, 1)
            else: val = round(random.random() * math.sqrt(size / 2))
            dict_for_df[i][j] = val  # store in a dict
    # print(dict_for_df)
    df = pd.DataFrame(dict_for_df)  # after the loop convert the dict to a dataframe
    # print(df.head())
    df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
    # df = df.astype(dtype={'UserId': np.int64, 'ItemId': np.int64, 'Rating': np.float64})
    ## remove doubles -----
    df.drop_duplicates(subset=['UserId', 'ItemId'], keep='first', inplace=True)
    ## show -----
    print(df.info())
    print(df.head())
    return df
# =======================

df = randDf()
I had another go and have something that gets exactly the same correlation numbers as your method without using pivot, but it is much slower. I can't say whether it uses less or more memory:
from scipy.stats import pearsonr
import itertools
import pandas as pd
import numpy as np

d = []
itemids = list(set(df['ItemId']))
pairsofitems = list(itertools.combinations(itemids, 2))

for itempair in pairsofitems:
    a = df[df['ItemId'] == itempair[0]][['Rating', 'UserId']]
    b = df[df['ItemId'] == itempair[1]][['Rating', 'UserId']]

    # align ratings by UserId (sized by the largest id so ids can be used as
    # positions), with NaN wherever a user did not rate the item
    z = np.full(df.UserId.max() + 1, np.nan)
    z[a.UserId.values] = a.Rating.values
    w = np.full(df.UserId.max() + 1, np.nan)
    w[b.UserId.values] = b.Rating.values

    # keep only users who rated both items
    good = ~np.logical_or(np.isnan(w), np.isnan(z))
    z = np.compress(good, z)
    w = np.compress(good, w)

    d.append({'firstitem': itempair[0],
              'seconditem': itempair[1],
              'correlation': pearsonr(z, w)[0]})

df_out = pd.DataFrame(d, columns=['firstitem', 'seconditem', 'correlation'])
This was helpful for working out how to handle the NaNs before taking the correlation.
The slicing in the first two lines inside the for loop takes time. I think, though, that it may have potential if the bottlenecks could be fixed.
Yes, there is some repetition in there with the z and w variables; that could be put into a function.
Some explanation of what it does:
find all combinations of pairs within your items
organise an "x" and "y" set of points for UserId / Rating, where any point pair in which one of the two is missing (NaN) is dropped. I think of a scatter plot, with the correlation being how well a straight line fits through it.
run the Pearson correlation on this x-y pair
put the ItemIds of each pair and the correlation into a dataframe
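For what it's worth, a sketch of another pivot-free route (using only the UserId/ItemId/Rating columns from the question; how much memory it needs depends on how many items each user rates): self-merge the long table on UserId and compute the correlation per item pair with groupby:
pairs = df.merge(df, on='UserId', suffixes=('_1', '_2'))
pairs = pairs[pairs['ItemId_1'] < pairs['ItemId_2']]  # keep each item pair once
corr_long = (pairs.groupby(['ItemId_1', 'ItemId_2'])
                  .apply(lambda g: g['Rating_1'].corr(g['Rating_2'])))
print(corr_long)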
I am very new to coding in Python and I am working with a .CSV file that gives me a 32x32 matrix in a 1024-column row with a time stamp. I reshaped the data to give me 32x32 arrays and looped through each row, appending the matrices to a numpy array.
i = 0
while i < len(df_array):
    if i == 0:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = spec
    else:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = np.concatenate((spectrum_matrix, spec), axis=0)
    i = i + 1
print("job done")
What I would like to do is to take the time stamp from the original data file and attach it to each of the matrices, thus allowing me to resample the data over a 5-minute average. I also would like to plot the bins to get a plot similar to this Drop size distribution.
As a reference, I am reading in the data .CSV with pandas, and here is an example of a portion of the raw data: 01.06.2017;18:22:20;0.122;0.00;51;7.401;10375;18745;57;27;0.00;23.6;0.110;0;
<SPECTRUM>;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
The ;'s after <SPECTRUM> are the 32x32 matrix.
Thanks in advance for any help!
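(Side note: the reshaping loop above can also be collapsed into one call. A minimal sketch, assuming df_array is a 2-D NumPy array laid out as the loop assumes, with the spectrum values in columns 1 to 1024:)
import numpy as np
spectrum_matrix = np.asarray(df_array)[:, 1:1025].astype(float).reshape(-1, 32, 32)  # (n_rows, 32, 32)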
Python and its associated packages can do many things without loops.
From my understanding of your data you have a (8640 x 32 x 32) Data Structure (time x size x velocity).
Pandas works very well with 2D data structures, however for higher dimensional data I would recommend you get familiar with xarray. With this package along with pandas you can create and manipulate your data without having to resort to loops.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import seaborn as sns
%matplotlib inline
#create random data
data = (np.random.binomial(n=5, p=0.2, size=(8640, 32, 32)) * 1000).astype(int)
#create labels for data
sizes = np.linspace(1, 5, 32)
velocities = np.linspace(1, 1000, num=32)
#make time range of 24 hours with 10 sec intervals
ind = pd.date_range(start='2014-01-01', periods=8640, freq='10s')
#convert data to xarray 3D data structure
df = xr.DataArray(data, coords=[ind, sizes, velocities],
                  dims=['time', 'size', 'speed'])
#make a 5 min average of the data (modern xarray resample syntax)
min_average = df.resample(time='300s').mean()
#plot a sample of the data and the 5 min average
my1d = min_average.isel(size=5, speed=10)
my1d.plot(label='5 min avg')
plt.gca()
df.isel(size=5, speed=10).plot(alpha=0.3, c='r', label='raw_data')
plt.legend()
As for making a distribution plot like the one you linked, things become a bit trickier, but it is possible:
#transform your data to only have mean speed for each time and size
#and convert to pandas dataframe
mean_speed = min_average.mean(dim=['speed'])
#for some reason xarray makes you name the new column when you convert
#to a pandas dataframe. I then get rid of the extra empty variable with
#a list comprehension
df = mean_speed.to_dataframe('').unstack().T
df.index = np.array([np.array(i)[1].astype(float) for i in df.index])
#make a contour plot of your new data
plt.contourf(df.columns, df.index, df.values, cmap='PuBu_r')
plt.title('mean speed')
plt.ylabel('size')
plt.xlabel('time')
plt.colorbar()
I have a pandas dataframe and I want to bin the data by the values of a single column, e.g. 0-0.1, 0.1-0.2, etc., starting at 0 and ending at 1, with intervals of 0.1, taking the mean of each column within each bin.
I'm attempting to accomplish this using the .groupby functionality of pandas. See my code below:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({"a": np.random.random(100),
                      "b": np.random.random(100),
                      "id": np.arange(100)})
bins = np.linspace(0, 1, 0.1)
groups = my_df.groupby(np.digitize(my_df.a, bins))
binned_data = groups.mean()
print(binned_data)
The print line then gives a single row with index "1", even though the data of column "a" should have a range of values for the bins specified.
I think it's a problem with the creation of "bins", but I can't work out what.
I want 10 rows binned from 0 to 1 in 0.1 intervals. How can I accomplish this?
Many thanks.
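(For reference, a minimal sketch of a likely fix. np.linspace takes the number of samples as its third argument, not the step size, so np.linspace(0, 1, 0.1) does not produce the intended bin edges; eleven edges, used with pd.cut, give ten 0.1-wide bins:)
import numpy as np
import pandas as pd

my_df = pd.DataFrame({"a": np.random.random(100),
                      "b": np.random.random(100),
                      "id": np.arange(100)})

bins = np.linspace(0, 1, 11)  # 11 edges -> 10 bins of width 0.1
binned_data = my_df.groupby(pd.cut(my_df["a"], bins)).mean()
print(binned_data)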