Consider the following list:
import numpy as np
import pandas as pd
l = [1, 4, 6, np.nan, 20, np.nan, 24]
I know I can replace the NaN values using simple linear interpolation with pandas' interpolate, as follows:
pd.Series([1,4,6,np.NaN,20,np.NaN,24]).interpolate()
Out[38]:
0 1.0
1 4.0
2 6.0
3 13.0
4 20.0
5 22.0
6 24.0
dtype: float64
My question is: how can I get the same result using only list comprehensions and standard NumPy functions, but no built-in interpolation function (pd.Series.interpolate() or numpy.interp())? That is, by applying the formula for linear interpolation between two points directly.
l = [1, 4, 6, np.nan, 20, np.nan, 24]
# average the two immediate neighbours; this only handles isolated NaNs
# away from the edges
res = [l[i] if not np.isnan(l[i]) else (l[i-1] + l[i+1]) / 2 for i in range(len(l))]
print(res)
Not sure whether this really fits the question, since it is not just a list comprehension, but here is a solution that also works for gaps of more than one consecutive NaN:
import numpy as np

l = [1, 4, 6, np.nan, 20, np.nan, 24, 30, 31, np.nan, np.nan, 70, 75]

# True -> entry is NaN
nans = np.isnan(l)
# 1 -> from number to NaN, -1 -> from NaN to number
diffs = np.diff(list(map(int, nans)))
# get "gap of NaNs" begin and end indices
gap_starts = np.where(diffs == 1)[0]
gap_ends = np.where(diffs == -1)[0]

for begin, end in zip(gap_starts, gap_ends):
    # number of NaNs in the gap
    nans_n = end - begin
    # signed difference between the gap extrema
    # (abs would break descending sequences)
    nan_diff = l[end + 1] - l[begin]
    # step to add at each NaN (rounded, so only exact
    # when the gap divides evenly)
    step = round(nan_diff / (nans_n + 1))
    # interpolate section from begin to end
    filling = [l[begin] + step * n for n in range(1, nans_n + 1)]
    # fix l with interpolated values
    l[begin + 1:end + 1] = filling

print(l)
This produces:
[1, 4, 6, 13, 20, 22, 24, 30, 31, 44, 57, 70, 75]
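For completeness, the same multi-gap fill can be written as a single list comprehension that applies the two-point formula y = y0 + (y1 - y0) * (x - x0) / (x1 - x0) directly. This is a sketch, not from the original answers; it assumes the first and last entries are not NaN:

import numpy as np

l = [1, 4, 6, np.nan, 20, np.nan, 24, 30, 31, np.nan, np.nan, 70, 75]
xs = np.arange(len(l))
known = ~np.isnan(l)
kx, ky = xs[known], np.asarray(l)[known]  # positions and values of known points

# for each NaN, take the nearest known point on each side and interpolate
res = [l[i] if known[i]
       else ky[kx < i][-1] + (ky[kx > i][0] - ky[kx < i][-1])
            * (i - kx[kx < i][-1]) / (kx[kx > i][0] - kx[kx < i][-1])
       for i in xs]
print(res)  # [1, 4, 6, 13.0, 20, 22.0, 24, 30, 31, 44.0, 57.0, 70, 75]

Unlike the loop above, this keeps exact fractional values instead of rounding the step.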
I have a long list of H-points with known coordinates, and a list of TP-points. I'd like to know whether each H-point falls within a certain radius (e.g. r = 5) of any(!) TP-point.
dfPoints = pd.DataFrame({'H-points': ['a', 'b', 'c', 'd', 'e'],
                         'Xh': [10, 35, 52, 78, 9],
                         'Yh': [15, 5, 11, 20, 10]})

dfTrafaPostaje = pd.DataFrame({'TP-points': ['a', 'b', 'c'],
                               'Xt': [15, 25, 35],
                               'Yt': [15, 25, 35],
                               'M': [5, 2, 3]})
def inside_circle(x, y, a, b, r):
    return (x - a)*(x - a) + (y - b)*(y - b) < r*r
I've made a start, but it would be much easier to check this against only one TP-point. With, say, 1500 of them and 30,000 H-points, I need a more general solution.
Can anyone help?
Another option is to use distance_matrix from scipy.spatial:

from scipy.spatial import distance_matrix

dist_mat = distance_matrix(dfPoints[['Xh', 'Yh']], dfTrafaPostaje[['Xt', 'Yt']])
dfPoints[np.min(dist_mat, axis=1) < 5]

Took about 2 s for 1500 dfPoints and 30000 dfTrafaPostaje.
Update: to get the label of the TP-point with the highest score among those within range:

dist_mat = distance_matrix(dfPoints[['Xh', 'Yh']], dfTrafaPostaje[['Xt', 'Yt']])

# get the M scores of those within range,
# masked with np.nan for those outside it
M_mat = pd.DataFrame(np.where(dist_mat <= 5, dfTrafaPostaje['M'].values[None, :], np.nan),
                     index=dfPoints['H-points'],
                     columns=dfTrafaPostaje['TP-points'])

# for each H-point, the TP-point with the largest M value (NaN if none in range)
dfPoints['TP'] = np.where(M_mat.notnull().any(axis=1), M_mat.idxmax(axis=1), np.nan)
For the included sample data:
H-points Xh Yh TP
0 a 10 15 a
1 b 35 5 NaN
2 c 52 11 NaN
3 d 78 20 NaN
4 e 9 10 NaN
You could use cdist from scipy to compute the pairwise distances, then create a mask that is True where the distance is within the radius, and finally filter:
import pandas as pd
from scipy.spatial.distance import cdist
dfPoints = pd.DataFrame({'H-points': ['a', 'b', 'c', 'd', 'e'],
                         'Xh': [10, 35, 52, 78, 9],
                         'Yh': [15, 5, 11, 20, 10]})

dfTrafaPostaje = pd.DataFrame({'TP-points': ['a', 'b', 'c'],
                               'Xt': [15, 25, 35],
                               'Yt': [15, 25, 35]})
radius = 5
distances = cdist(dfPoints[['Xh', 'Yh']].values, dfTrafaPostaje[['Xt', 'Yt']].values, 'sqeuclidean')
mask = (distances <= radius*radius).sum(axis=1) > 0 # create mask
print(dfPoints[mask])
Output
H-points Xh Yh
0 a 10 15
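For reference, the same mask can also be built with plain NumPy broadcasting, without scipy. This is a sketch assuming the same dfPoints and dfTrafaPostaje as above:

import numpy as np

# pairwise squared distances via broadcasting: shape (n_H, n_TP)
dx = dfPoints['Xh'].values[:, None] - dfTrafaPostaje['Xt'].values[None, :]
dy = dfPoints['Yh'].values[:, None] - dfTrafaPostaje['Yt'].values[None, :]
mask = ((dx**2 + dy**2) <= radius**2).any(axis=1)
print(dfPoints[mask])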
I have some numerical time-series of varying lengths stored in a wide pandas dataframe. Each row corresponds to one series and each column to a measurement time point. Because of their varying lengths, these series can have missing-value (NA) tails on the left (first time points), on the right (last time points), or both. There is always a continuous stretch without NA of some minimum length in each row.
I need to get a random subset of fixed length from each of these rows, without including any NA. Ideally, I wish to keep the original dataframe intact and to report the subsets in a new one.
I managed to obtain this output with a very inefficient for loop that goes through the rows one by one, determines a start position for the crop such that no NAs are included in the output, and copies the cropped result. This works, but it is extremely slow on large datasets. Here is the code:
import pandas as pd
import numpy as np
from copy import copy

def crop_random(df_in, output_length, ignore_na_tails=True):
    # Initialize new dataframe
    colnames = ['X_' + str(i) for i in range(output_length)]
    df_crop = pd.DataFrame(index=df_in.index, columns=colnames)
    # Go through all rows
    for irow in range(df_in.shape[0]):
        series = copy(df_in.iloc[irow, :])
        series = np.array(series).astype('float')
        length = len(series)
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            # Range where the subset might start
            lo = pos_non_na[0][0]
            hi = pos_non_na[0][-1]
            left = np.random.randint(lo, hi - output_length + 2)
        else:
            # +1 so the last valid start position is included
            left = np.random.randint(0, length - output_length + 1)
        series = series[left:left + output_length]
        df_crop.iloc[irow, :] = series
    return df_crop
And a toy example:
df = pd.DataFrame.from_dict({'t0': [np.NaN, 1, np.NaN],
                             't1': [np.NaN, 2, np.NaN],
                             't2': [np.NaN, 3, np.NaN],
                             't3': [1, 4, 1],
                             't4': [2, 5, 2],
                             't5': [3, 6, 3],
                             't6': [4, 7, np.NaN],
                             't7': [5, 8, np.NaN],
                             't8': [6, 9, np.NaN]})
# t0 t1 t2 t3 t4 t5 t6 t7 t8
# 0 NaN NaN NaN 1 2 3 4 5 6
# 1 1 2 3 4 5 6 7 8 9
# 2 NaN NaN NaN 1 2 3 NaN NaN NaN
crop_random(df, 3)
# One possible output:
# X_0 X_1 X_2
# 0 2 3 4
# 1 7 8 9
# 2 1 2 3
How could I achieve the same results in a way better suited to large dataframes?
Edit: Moved my improved solution to the answer section.
I managed to speed up things quite drastically with:
def crop_random(dataset, output_length, ignore_na_tails=True):
    # Get a random range to crop for each row
    def get_range_crop(series, output_length, ignore_na_tails):
        series = np.array(series).astype('float')
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            start = pos_non_na[0][0]
            end = pos_non_na[0][-1]
            # +1 to include the last valid start in randint; +1 for the slice span
            left = np.random.randint(start, end - output_length + 2)
        else:
            length = len(series)
            # +1 so the last valid start position is included
            left = np.random.randint(0, length - output_length + 1)
        right = left + output_length
        return left, right

    # Crop each row to its random range; reset_index so that concat
    # lines the rows up without creating new columns
    range_subset = dataset.apply(get_range_crop, args=(output_length, ignore_na_tails), axis=1)
    new_rows = [dataset.iloc[irow, range_subset[irow][0]:range_subset[irow][1]]
                for irow in range(dataset.shape[0])]
    for row in new_rows:
        row.reset_index(drop=True, inplace=True)

    # Concatenate all rows
    dataset_cropped = pd.concat(new_rows, axis=1).T
    return dataset_cropped
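If the frame is purely numeric, one can go further and drop apply entirely, computing all the crop ranges with NumPy in one go. This is a sketch, not the author's code; it assumes every row has at least output_length consecutive non-NaN values, as the question guarantees, and relies on np.random.randint accepting array bounds (available since NumPy 1.11):

import numpy as np
import pandas as pd

def crop_random_np(df_in, output_length):
    values = df_in.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    valid = ~np.isnan(values)
    lo = valid.argmax(axis=1)                        # first non-NaN per row
    hi = n_cols - 1 - valid[:, ::-1].argmax(axis=1)  # last non-NaN per row
    # one random start per row; randint's upper bound is exclusive
    left = np.random.randint(lo, hi - output_length + 2)
    cols = left[:, None] + np.arange(output_length)[None, :]
    out = values[np.arange(n_rows)[:, None], cols]
    return pd.DataFrame(out, index=df_in.index,
                        columns=['X_' + str(i) for i in range(output_length)])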
I have a CSV file where each cell value is a two-element list (pair).
| 0 | 1 | 2 |
----------------------------------------
0 |[87, 1.03] | [30, 4.05] | NaN |
1 |[34, 2.01] | NaN | NaN |
2 |[83, 0.2] | [18, 3.4] | NaN |
How do I access the elements of these pairs separately? The first element of each pair acts as an index into another CSV table.
I have done something like this, but it keeps failing in one way or another:
links = pd.read_csv('buslinks.csv', header=None)

a_list = []
for i in range(0, 100):
    l = []
    a_list.append(l)

for j in range(0, 100):
    a = busStops.iloc[j]
    df = pd.DataFrame(columns=['id', 'Distance'])
    l = links.iloc[j]
    for i in l:
        if pd.isnull(i):
            continue
        else:
            x = int(i[0])
            d = busStops.iloc[x - 1]
            id = d['id']
            dist = distance(d['xCoordinate'], a['xCoordinate'], d['yCoordinate'], a['yCoordinate'])
            df.loc[i] = [id, dist]
    a_list[j] = (df.sort('Distance', ascending=True)).tolist()
This approach worked when each cell contained only one element. In that case, np.isnan() was used instead of pd.isnull()
The CSV file being read was created as follows:
a_list = []
for i in range(0, 100):
    l = []
    a_list.append(l)

for i in range(0, 100):
    while len(a_list[i]) < 3:
        x = random.randint(1, 100)
        if x - 1 == i:
            continue
        a = busStops.iloc[i]
        b = busStops.iloc[x - 1]
        dist = distance(a['xCoordinate'], b['xCoordinate'], a['yCoordinate'], b['yCoordinate'])
        if dist > 3:
            continue
        if x in a_list[i]:
            continue
        a_list[i].append([b['id'], dist])
        a_list[x - 1].append([a['id'], dist])
    for j in range(0, 3):
        y = random.randint(0, 1)
        while y == 0:
            x = random.randint(1, 100)
            if x - 1 == i:
                continue
            a = busStops.iloc[i]
            b = busStops.iloc[x - 1]
            dist = distance(a['xCoordinate'], b['xCoordinate'], a['yCoordinate'], b['yCoordinate'])
            if dist > 3:
                continue
            if x in a_list[i]:
                continue
            a_list[i].append([b['id'], dist])
            a_list[x - 1].append([a['id'], dist])
            y = 1

dfLinks = pd.DataFrame(a_list)
dfLinks
dfLinks.to_csv('buslinks.csv', index=False, header=False)
busStops is yet another CSV file, containing id, xCoordinate, yCoordinate, Population and Priority as columns.
First of all, beware that storing lists in DataFrames dooms you to Python-speed loops. To take advantage of fast Pandas/NumPy routines, you need to use native NumPy dtypes such as np.float64 (whereas, in contrast, lists require the "object" dtype).
That being said, here is a small example I wrote just to show how to access such nested elements, so you can adapt it to your code:
import pandas as pd

table = pd.DataFrame(columns=['col1', 'col2', 'col3'])
table.loc[0] = [1, 2, 3]
table.loc[1] = [1, [2, 3], 4]

table.loc[1].iloc[1]     # returns [2, 3]
table.loc[1].iloc[1][0]  # returns 2
You shouldn't be putting lists in pd.Series objects. It's inefficient and you lose all vectorised functionality. If, however, you are determined that this must be your starting point, you can unravel the lists into multiple columns in a couple of steps.
Setup
df = pd.DataFrame({0: [[87, 1.03], [34, 2.01], [83, 0.2]],
                   1: [[30, 4.05], np.nan, [18, 3.4]],
                   2: [np.nan, np.nan, np.nan]})
Step 1: ensure lists have same size
# messy way to ensure all values have length 2
df[1] = np.where(df[1].isnull(), pd.Series([[np.nan, np.nan]]*len(df[1])), df[1])
print(df)
0 1 2
0 [87, 1.03] [30, 4.05] NaN
1 [34, 2.01] [nan, nan] NaN
2 [83, 0.2] [18, 3.4] NaN
Step 2: concatenate dataframes of split series
# create list of dataframes
L = [pd.DataFrame(df[col].values.tolist()) for col in df]
# concatenate dataframes in list
df_new = pd.concat(L, axis=1, ignore_index=True)
print(df_new)
0 1 2 3 4
0 87 1.03 30.0 4.05 NaN
1 34 2.01 NaN NaN NaN
2 83 0.20 18.0 3.40 NaN
You can then access values as you would normally, e.g. df_new[2].
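One caveat worth noting for the original setup: after a round trip through to_csv/read_csv, such cells come back as strings like '[87, 1.03]', not lists, so they must be parsed before any of the above. A sketch using ast.literal_eval, assuming the file format shown in the question:

import ast
import pandas as pd

links = pd.read_csv('buslinks.csv', header=None)
# turn string cells such as "[87, 1.03]" back into Python lists, leaving NaN alone
links = links.applymap(lambda c: ast.literal_eval(c) if isinstance(c, str) else c)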
I have a set of objects and their positions over time. I would like to get the average distance between objects for each time point. An example dataframe is as follows:
time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
df
x y car
time
0 216 13 1
0 218 12 2
0 217 12 3
1 280 110 1
1 290 109 3
2 130 3 4
2 132 56 5
The end result I would like to have is:
df2
average distance
between cars
time
0 1.55
1 10.05
2 53.04
Any idea on how to proceed? I've been trying to apply the scipy.spatial.distance functions to the dataframe, but I'm not sure how to apply them to df.groupby('time') and then get the mean of all those distances.
Any help appreciated!
You could pass an array of the points to scipy.spatial.distance.pdist and it will calculate all pairwise distances between Xi and Xj for i < j. Then take the mean.
import numpy as np
from scipy import spatial
df.groupby('time').apply(lambda x: spatial.distance.pdist(np.array(list(zip(x.x, x.y)))).mean())
Outputs:
time
0 1.550094
1 10.049876
2 53.037722
dtype: float64
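As a small simplification, pdist accepts the two-column array directly, so the zip is not needed (same result, assuming the df from the question):

df.groupby('time').apply(lambda g: spatial.distance.pdist(g[['x', 'y']].values).mean())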
For me, using apply or a plain for loop does not make much of a difference:
l1 = []
l2 = []
for y, x in df.groupby('time'):
    v = np.triu(spatial.distance.cdist(x[['x', 'y']].values, x[['x', 'y']].values), k=0)
    v = np.ma.masked_equal(v, 0)
    l2.append(np.mean(v))
    l1.append(y)

pd.DataFrame({'ave': l2}, index=l1)
Out[250]:
ave
0 1.550094
1 10.049876
2 53.037722
Building this up from first principles: for each point at index n, it is necessary to compute the distance to all the points with index > n.
If the distance between two points is given by the formula:
np.sqrt((x0 - x1)**2 + (y0 - y1)**2)
then for an array of points in a dataframe, we can get all the distances and then calculate their mean:
distances = []
for i in range(len(df) - 1):
    distances += np.sqrt((df.x[i+1:] - df.x[i])**2 + (df.y[i+1:] - df.y[i])**2).tolist()

np.mean(distances)
Expressing the same logic using pd.concat and a couple of helper functions:
def diff_sq(x, i):
    return (x.iloc[i+1:] - x.iloc[i])**2

def dist_df(x, y, i):
    d_sq = diff_sq(x, i) + diff_sq(y, i)
    return np.sqrt(d_sq)

def avg_dist(df):
    return pd.concat([dist_df(df.x, df.y, i) for i in range(len(df)-1)]).mean()
Then it is possible to use the avg_dist function with groupby:
df.groupby('time').apply(avg_dist)
# outputs:
time
0 1.550094
1 10.049876
2 53.037722
dtype: float64
You could also use the itertools package to define your own function, as follows:
import itertools
import numpy as np

def combinations(series):
    l = list()
    for item in itertools.combinations(series, 2):
        l.append((item[0] - item[1])**2)
    return l

df2 = df.groupby('time').agg(combinations)
df2['avg_distance'] = [np.mean(np.sqrt(pd.Series(df2.iloc[k, 0]) +
                                       pd.Series(df2.iloc[k, 1]))) for k in range(len(df2))]
df2.avg_distance.to_frame()
Then, the output is:
avg_distance
time
0 1.550094
1 10.049876
2 53.037722
I have an array A, say:
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
And I wish to create a new array B by replacing each element of A with the median of its four nearest neighbors, without taking into account the value at the given position... for example:
B[2] = np.median([A[0], A[1], A[3], A[4]]) (=3)
The thing is that I need to perform this on a gigantic A and I want to optimize the running time, so I want to avoid for loops or similar. And... I don't care about the result at the edges.
I already tried scipy.ndimage.filters.median_filter, but it is not producing the desired output:
import scipy.ndimage
B = scipy.ndimage.filters.median_filter(A,footprint=[1,1,0,1,1],mode='wrap')
which produces B = [7, 4, 4, 5, 6, 7, 6, 6], which is clearly not the correct answer (with an even-sized footprint, median_filter returns the upper of the two middle values instead of averaging them).
Any idea is welcome.
One way could be to use np.roll to shift the numbers in your array, such as:
A_1 = np.roll(A,1)
# output: array([8, 1, 2, 3, 4, 5, 6, 7])
And then the same thing with shifts of 2, -1 and -2:
A_2 = np.roll(A,2)
A_m1 = np.roll(A,-1)
A_m2 = np.roll(A,-2)
Now you just need to sum your 4 arrays, as for each index you have the 4 neighbors in one of them:
B = (A_1 + A_2 + A_m1 + A_m2)/4.
And as you said you don't care about the edges, I think it works for you!
EDIT: I guess I was so focused on the rolling idea that I mixed up mean and median; the median can be calculated with B = np.median([A_1, A_2, A_m1, A_m2], axis=0)
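Putting the corrected median version together (a sketch; because np.roll wraps around, the first and last two entries are meaningless, which the question explicitly allows):

import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8])
A_1, A_2 = np.roll(A, 1), np.roll(A, 2)
A_m1, A_m2 = np.roll(A, -1), np.roll(A, -2)
B = np.median([A_1, A_2, A_m1, A_m2], axis=0)
print(B[2])  # 3.0, the median of A[0], A[1], A[3], A[4]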
I'd make a rolling, central window of length 5 in pandas and apply the median function to the values of the window, with the middle one masked away:
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
mask = np.array(np.ones(5), bool)
mask[5//2] = False
import pandas as pd
df = pd.DataFrame(A)
r5 = df.rolling(5, center=True)
result = r5.apply(lambda x: np.median(x[mask]))
result
0
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
5 6.0
6 NaN
7 NaN
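A pure-NumPy alternative to the rolling approach is a strided sliding window. This is a sketch; np.lib.stride_tricks.sliding_window_view requires NumPy >= 1.20, and the two edge values on each side are left untouched, as the question allows:

import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8])
w = np.lib.stride_tricks.sliding_window_view(A, 5)  # shape (len(A) - 4, 5)
neighbors = w[:, [0, 1, 3, 4]]                      # drop the centre column
B = A.astype(float)
B[2:-2] = np.median(neighbors, axis=1)              # edges keep the original values
print(B)  # [1. 2. 3. 4. 5. 6. 7. 8.]

The window view is a zero-copy operation, so this stays memory-friendly even for very large A.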