I have a dataframe with sets of longitude/latitude values, where each set of coordinates behaves linearly. The coordinates in each set share a common index value, so I've been trying to figure out a way to use groupby and apply interpolation in order to obtain more data points. Here is my data (simplified):
Longitude Latitude
t
0 40 70
0 41 71
0 42 72
0 43 73
1 120 10
1 121 12
1 122 14
1 123 16
... ... ...
For this instance, each set has 4 coordinates, and I want to interpolate each set to have a specific number of coordinates, say 8. This is what I've tried so far. The function works by returning a dataframe with interpolated numbers, but I'm not sure how to use it in conjunction with groupby. How should I word the command, or is there a better method?
from scipy.interpolate import interp1d
import numpy as np
import pandas as pd

def interpolate(lon, lat):
    # convert coordinates into arrays
    x = np.asarray(lon)
    y = np.asarray(lat)
    # build a linear interpolation function from the coordinates
    f = interp1d(x, y, kind='linear')
    # 8 evenly spaced values from the minimum to the maximum longitude
    xnew = np.linspace(min(lon), max(lon), num=8, endpoint=True)
    # apply the interpolation function to obtain the interpolated latitudes
    df = pd.DataFrame({'lon': xnew, 'lat': f(xnew)})
    return df

df.groupby(level=0).apply(interpolate(df['Longitude'], df['Latitude']))
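(For context, I suspect the call would have to pass a function that receives each group as a DataFrame rather than two pre-sliced columns; interp_group below is only an illustrative name for such a helper, reusing the imports above, not something I have working:)

def interp_group(group, num=8):
    # group is the sub-DataFrame for one index value
    f = interp1d(group['Longitude'], group['Latitude'], kind='linear')
    xnew = np.linspace(group['Longitude'].min(), group['Longitude'].max(), num=num)
    return pd.DataFrame({'Longitude': xnew, 'Latitude': f(xnew)})

result = df.groupby(level=0).apply(interp_group)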
I have a time series with several products. I want to remove outliers using the Tukey fence method. The idea is to create a column with a flag indicating outlier or not, using groupby. It should look like this (the flag column is added by the groupby):
date prod units flag
1 a 100 0
2 a 90 0
3 a 80 0
4 a 15 1
1 b 200 0
2 b 180 0
3 b 190 0
4 b 30000 1
I was able to do it by separating the prods with a for loop and then doing the corresponding joins, but I'd like to do it more cleanly.
I would compute the quantiles first, then derive the IQR from them. Compute the fence bounds, call merge() to map these limits onto the original dataframe, and call eval() to check whether the units fall within their respective Tukey fence bounds.
# compute quantiles
quantiles = df.groupby('prod')['units'].quantile([0.25, 0.75]).unstack()
# compute interquartile range for each prod
iqr = quantiles.diff(axis=1).bfill(axis=1)
# compute fence bounds
fence_bounds = quantiles + iqr * [-1.5, 1.5]
# check if units are outside their respective tukey ranges
df['flag'] = df.merge(fence_bounds, left_on='prod', right_index=True).eval('not (`0.25` < units < `0.75`)').astype(int)
df
The intermediate fence bounds are:
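(Recomputing them from the sample data above, the fence_bounds frame should come out to approximately:)

         0.25       0.75
prod
a      20.625    135.625
b  -11006.250  18843.750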
I want to use scipy or pandas to interpolate on a table like this one:
df = pd.DataFrame({'x':[1,1,1,2,2,2],'y':[1,2,3,1,2,3],'z':[10,20,30,40,50,60] })
df =
x y z
0 1 1 10
1 1 2 20
2 1 3 30
3 2 1 40
4 2 2 50
5 2 3 60
I want to be able to interpolate for a x value of 1.5 and a y value of 2.5 and obtain a 40.
The process would be:
Starting from the first interpolation parameter (x), find the values that surround the target value. In this case the target is 1.5 and the surrounding values are 1 and 2.
Interpolate in y for a target of 2.5 considering x=1. In this case between rows 1 and 2, obtaining a 25
Interpolate in y for a target of 2.5 considering x=2. In this case between rows 4 and 5, obtaining a 55
Interpolate the values from the previous steps to the target x value. In this case I have 25 for x=1 and 55 for x=2. The interpolated value for 1.5 is 40.
The order in which interpolation is to be performed is fixed and the data will be correctly sorted.
I've found this question but I'm wondering if there is a standard solution already available in those libraries.
You can use scipy.interpolate.interp2d:
import scipy.interpolate
f = scipy.interpolate.interp2d(df.x, df.y, df.z)
f([1.5], [2.5])
[40.]
The first line creates an interpolation function z = f(x, y) using three arrays for x, y, and z. The second line uses this function to interpolate for z given values for x and y. The default is linear interpolation.
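Note that interp2d has been deprecated in recent SciPy releases (and removed in SciPy 1.14), so on newer versions you may need an alternative. A minimal sketch with RegularGridInterpolator, assuming the data forms a regular grid as in this example:

import numpy as np
from scipy.interpolate import RegularGridInterpolator

# reshape z onto the (x, y) grid: rows follow x = [1, 2], columns follow y = [1, 2, 3]
x = np.array([1, 2])
y = np.array([1, 2, 3])
z = df.z.to_numpy().reshape(len(x), len(y))

f = RegularGridInterpolator((x, y), z)  # linear interpolation by default
f([[1.5, 2.5]])
# array([40.])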
Define your interpolate function:
def interpolate(x, y, df):
    cond = df.x.between(int(x), int(x) + 1) & df.y.between(int(y), int(y) + 1)
    return df.loc[cond].z.mean()
interpolate(1.5,2.5,df)
40.0
I have a dataset with some rows containing singular answers and others having multiple answers. Like so:
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
With the singular answers I managed to create a heatmap using df.corr(), but I can't figure out the best approach for the rows with multiple answers.
I could split them and add additional columns for each answer like:
year length Animation
0 1971 121 1
1 1971 121 2
2 1971 121 3
3 1939 71 1
4 1939 71 3 ...
and then do the exact same df.corr(), or add additional Animation_01, Animation_02 ... columns, but there must be a smarter way to work around this issue?
EDIT: Actual data snippet
You should compute a frequency table between two categorical variables using pd.crosstab() and perform subsequent analyses based on this table. df.corr() is NOT mathematically meaningful when one of x and y is categorical, regardless of whether it is encoded as a number or not.
N.B.1 If x is categorical but y is numerical, there are two options to describe the linkage between them:
Group y into quantiles (bins) and treat it as categorical
Perform a linear regression of y against one-hot encoded dummy variables of x
Option 2 is more precise in general, but the statistics are beyond the scope of this question. This post will focus on the case of two categorical variables.
N.B.2 For sparse matrix output please see this post.
Sample Solution
Data & Preprocessing
import pandas as pd
import io
import matplotlib.pyplot as plt
from seaborn import heatmap
df = pd.read_csv(io.StringIO("""
year  length  Animation
0     1971    121     1,2,3
1     1939    71      1,3
2     1941    7       0,2
3     1996    70      1,2,0
4     1975    71      3,2,0
"""), sep=r"\s{2,}", engine="python")
# convert string to list
df["Animation"] = df["Animation"].str.split(',')
# expand list column into new rows
df = df.explode("Animation")
# (optional) convert the exploded strings back to integers
df["Animation"] = df["Animation"].astype(int)
Frequency Table
Note: grouping of length is ignored for simplicity
ct = pd.crosstab(df["Animation"], df["length"])
print(ct)
# Out[65]:
# length 7 70 71 121
# Animation
# 0 1 1 1 0
# 1 0 1 1 1
# 2 1 1 1 1
# 3 0 0 2 1
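If you did want to bin length rather than treat every raw value as its own category (the note above skipped this for simplicity), a small sketch using pd.cut before the crosstab; the bin edges here are arbitrary:

length_bins = pd.cut(df["length"], bins=[0, 60, 100, 130])  # illustrative edges
ct_binned = pd.crosstab(df["Animation"], length_bins)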
Visualization
ax = heatmap(ct, cmap="viridis",
             yticklabels=df["Animation"].drop_duplicates().sort_values(),
             xticklabels=df["length"].drop_duplicates().sort_values(),
             )
ax.set_title("Title", fontsize=20)
plt.show()
Example Analysis
Based on the frequency table, you can ask questions about the distribution of y given a certain (subset of) x value(s), or vice versa. This should better describe the linkage between two categorical variables, as the categorical variables have no order.
For example,
Q: What length does Animation=3 produce?
A: 66.7% chance to give 71
33.3% chance to give 121
otherwise unobserved
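Such conditional distributions can be read directly off a row-normalised version of the table, e.g.:

# distribution of length conditional on each Animation value
cond = ct.div(ct.sum(axis=1), axis=0)
print(cond.loc[3])
# length
# 7      0.000000
# 70     0.000000
# 71     0.666667
# 121    0.333333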
You want to break Animation (or Preferred_positions in your data snippet) up into a series of one-hot columns, one for every unique string in the original column. Each column will hold zeros and ones, with a one for the rows where that string appears in the original column.
First, you need to get all the unique substrings in Preferred_positions (see this answer for how to deal with a column of lists).
# .sum() concatenates the per-row lists into one flat list; pd.unique() then deduplicates it
positions = pd.unique(df.Preferred_positions.str.split(',').sum())
Then you can create the positions columns in a loop based on whether the given position is in Preferred_positions for each row.
for position in positions:
    df[position] = df.Preferred_positions.apply(
        lambda x: 1 if position in x else 0
    )
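As an aside, pandas can build the same one-hot columns in a single step with Series.str.get_dummies, which splits on the separator and encodes in one go:

dummies = df.Preferred_positions.str.get_dummies(sep=',')
df = df.join(dummies)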
I have a dataset with 2 columns, and I would like to show how one feature varies with the binary output value.
data
id Column1 output
1 15 0
2 80 1
3 120 1
4 20 0
... ... ...
I would like to draw a plot with Python where the x-axis contains the values of Column1 and the y-axis contains the percentage of positive output values.
I already know that my plot should have the form of an exponential decay: when Column1 has smaller values I get more positive outputs than when it has larger values.
An exponential-looking plot may just need two lists of points; try this:
import matplotlib.pyplot as plt
# x-axis points list
xL = [5, 10, 15, 20, 25]
# y-axis points list (must have the same length as xL)
yL = [100, 50, 25, 12, 10]
plt.plot(xL, yL)
plt.axis([0, 35, 0, 200])
plt.show()
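To actually get the percentage of positive outputs per range of Column1, rather than hand-typed lists, one possible sketch is to bin Column1 and take the mean of the binary output within each bin (the bin edges below are only illustrative, and data is assumed to be the dataframe from the question):

import pandas as pd
import matplotlib.pyplot as plt

# bin Column1 and compute the share of output == 1 in each bin, as a percentage
bins = [0, 25, 50, 100, 150]  # illustrative bin edges
rate = data.groupby(pd.cut(data['Column1'], bins))['output'].mean() * 100

rate.plot(kind='bar')
plt.xlabel('Column1 range')
plt.ylabel('% positive output')
plt.show()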
I have this series:
data:
0 17
1 25
2 10
3 60
4 0
5 20
6 300
7 50
8 10
9 80
10 100
11 65
12 125
13 50
14 100
15 150
Name: 1, dtype: int64
I wanted to plot a histogram with variable bin sizes, so I made this:
filter_values = [0,25,50,60,75,100,150,200,250,300,350]
out = pd.cut(data, bins=filter_values)
counts = pd.value_counts(out)
print(counts)
My problem is that when I use counts.plot(kind="hist") I don't get the right labels on the x axis. I only get them by using a bar graph instead, counts.plot(kind="bar"), but then I can't get the right order.
I tried to use xticks=counts.index.values[0] but it raises an error, and xticks=filter_values gives an odd figure shape, as the numbers go far beyond what the plot understands the bins to be.
I also tried counts.hist(), data.hist(), and counts.plot.hist() without success.
I don't know how to plot the categorical data in counts correctly (its index is a pandas CategoricalIndex), so I don't know which approach to take: whether there is a way to plot variable bins directly with data.hist(), data.plot(kind="hist") or data.plot.hist(), or whether building counts is right, and in that case how to represent it correctly (with good labels on the x axis and in the right order, not the descending order of the bar graph).
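For reference, two common ways this gets handled, as a rough sketch assuming data and filter_values as defined above: either let matplotlib bin the raw values with the explicit edges, or keep the pd.cut/value_counts approach and restore the bin order before plotting.

import matplotlib.pyplot as plt

# option 1: histogram drawn directly with the explicit (variable-width) bin edges
plt.hist(data, bins=filter_values, edgecolor='black')
plt.show()

# option 2: keep the binned counts but sort by the interval index, then use a bar plot
counts.sort_index().plot(kind='bar')
plt.show()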