A project I'm working on requires merging two dataframes together along some line with a delta. Basically, I need to take a dataframe with a non-linear 2D line and find the data points within the other that fall along that line, plus or minus a delta.
Dataframe 1 (Line that we want to find points along)
import pandas as pd
df1 = pd.read_csv('path/to/df1/data.csv')
df1
x y
0 0.23 0.54
1 0.27 0.95
2 0.78 1.59
...
97 0.12 2.66
98 1.74 0.43
99 0.93 4.23
Dataframe 2 (Dataframe we want to filter, leaving points within some delta)
df2 = pd.read_csv('path/to/df2/data.csv')
df2
x y
0 0.21 0.51
1 0.27 0.35
2 3.45 1.19
...
971 0.94 2.60
982 1.01 1.33
993 0.43 2.43
Finding the coarse line
DELTA = 0.03
coarse_line = find_coarse_line(df1, df2, DELTA)
coarse_line
x y
0 0.21 0.51
1 0.09 2.68
2 0.23 0.49
...
345 1.71 0.45
346 0.96 0.40
347 0.81 1.62
I've tried df.loc[(df['x'] >= BOTLEFT_X) & (df['y'] >= BOTLEFT_Y) & (df['x'] <= TOPRIGHT_X) & (df['y'] <= TOPRIGHT_Y)], among many, many other Pandas approaches, but have yet to find anything that works, much less anything efficient (with datasets of more than 2 million points).
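For what it's worth, a brute-force sketch of what find_coarse_line might look like, assuming "within a delta" means Euclidean distance to the nearest df1 point. The full pairwise-distance matrix is fine for small frames; for millions of points you would swap it for a spatial index such as scipy.spatial.cKDTree:

```python
import numpy as np
import pandas as pd

def find_coarse_line(df1, df2, delta):
    """Keep the rows of df2 whose (x, y) lies within `delta` of some df1 point."""
    a = df1[["x", "y"]].to_numpy()
    b = df2[["x", "y"]].to_numpy()
    # pairwise distances: rows index df2 points, columns index df1 points
    dist = np.linalg.norm(b[:, None, :] - a[None, :, :], axis=2)
    return df2[dist.min(axis=1) <= delta]
```

This is O(len(df1) * len(df2)) in both time and memory, so treat it as a reference implementation to validate faster approaches against.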
I've taken an approach using merge(), where x and y are placed into bins derived from the good curve df1:

- generated a uniform line, y = x^2
- randomised it by a small amount to generate df1
- randomised it by a large amount to generate df2, also generating three times as many co-ordinates
- took df1 as the reference for good ranges of x and y co-ordinates, split into bins using pd.cut(); a bin count of 1/3 of the total number of co-ordinates is working well
- standardised these back into arrays for reuse in pd.cut() when merging

You can see from the scatter plots that it does a reasonable job of finding and keeping the points in df2 that sit close to the curve.
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, sharey=True, sharex=False, figsize=[20, 5])
linex = [i for i in range(100)]
liney = [i**2 for i in linex]
df1 = pd.DataFrame({"x": [l * random.uniform(0.95, 1.05) for l in linex],
                    "y": [l * random.uniform(0.95, 1.05) for l in liney]})
df1.plot("x", "y", kind="scatter", ax=ax[0])
df2 = pd.DataFrame({"x": [l * random.uniform(0.5, 1.5) for l in linex * 3],
                    "y": [l * random.uniform(0.5, 1.5) for l in liney * 3]})
df2.plot("x", "y", kind="scatter", ax=ax[1])

# use bins on the x and y axes - both need to be within range to match
bincount = len(df1) // 3
xc = pd.cut(df1["x"], bincount).unique()
yc = pd.cut(df1["y"], bincount).unique()
xc = np.sort([intv.left for intv in xc] + [xc[-1].right])
yc = np.sort([intv.left for intv in yc] + [yc[-1].right])

dfm = (df2.assign(
           xb=pd.cut(df2["x"], xc, duplicates="drop"),
           yb=pd.cut(df2["y"], yc, duplicates="drop"),
       )
       .query("~(xb.isna() | yb.isna())")  # exclude rows where df2 falls outside the range of df1
       .merge(df1.assign(
                  xb=pd.cut(df1["x"], xc, duplicates="drop"),
                  yb=pd.cut(df1["y"], yc, duplicates="drop"),
              ),
              on=["xb", "yb"],
              how="inner",
              suffixes=("_l", "_r"))
      )
dfm.plot("x_l", "y_l", kind="scatter", ax=ax[2])
print(f"graph 2 pairs:{len(df2)} graph 3 pairs:{len(dfm)}")
I have, as an example, the following DataFrame df, and I want to plot the price on the x-axis and share_1 and share_2 on the y-axis as stacked bars. I want to avoid pandas.plot and instead use plt.bar, extracting the x and y values from the DataFrame.
Price size share_1 share_2
10 1 0.05 0.95
10 2 0.07 0.93
10 3 0.1 0.95
20 4 0.15 0.75
20 5 0.2 0.8
20 6 0.35 0.65
30 7 0.5 0.5
30 8 0.53 0.47
30 9 0.6 0.4
This is the way I proceed:
x= df['Price']
y1= df['share_1']
y2= df['share_2']
plt.bar(x,y1,label='share_1')
plt.bar(x,y2,label='share_2')
I still have the problem that matplotlib drops the duplicate values on the x-axis (or perhaps automatically averages the duplicated values), so I get 3 values on the x-axis and not 6 as I aim to have. I don't know what the reason is.
My questions are:

1. Is it possible to extract the x and y values as I did, or should I convert them to some other form, such as a string or a list?
2. How can I prevent the duplicate values from being dropped on the x-axis? I want exactly as many x values as there are rows in the DataFrame.
Try:
import matplotlib.pyplot as plt

x = range(len(df))  # one bar position per row, so duplicate prices each keep their own bar
y1 = df["share_1"]
y2 = df["share_2"]

fig, ax = plt.subplots()
ax.bar(x, y1, label="share_1")
ax.bar(x, y2, label="share_2", bottom=y1)
ax.set_xticks(x)
ax.set_xticklabels(df["Price"])
ax.legend()
plt.show()
As an aside, consider using pandas.plot as follows:
fig,ax = plt.subplots()
df.plot.bar(x="Price", y=["share_1","share_2"], stacked=True, ax=ax)
I have the following DataFrame (this table is just an example; the real data has more types and sizes):
df = pd.DataFrame({
'type':['A','A','B','B','C','C','D','D'],
'size':['a','b','c','d','e','f','g','h'],
'Nx':[4.3,2.4,2.5,4.4,3.5,1.8,4.5,2.8],
'min':[0.5,2.5,0.7,3.2,0.51,2,0.3,3],
'max':[1.5,3.4,1.7,4.3,1.51,3,1.2,4]})
print(df)
ax=df.plot.bar(x='type',y='max',stacked=True,bottom=df['min'])
ax.plt(x='type',y='Nx')
This is the result:
type size Nx min max
0 A a 4.3 0.50 1.50
1 A b 2.4 2.50 3.40
2 B c 2.5 0.70 1.70
3 B d 4.4 3.20 4.30
4 C e 3.5 0.51 1.51
5 C f 1.8 2.00 3.00
6 D g 4.5 0.30 1.20
7 D h 2.8 3.00 4.00
How can I plot this data with just one column position for each type A, B, C, ..., and then plot (type, Nx) as a scatter on top?
You can add a new column called height, equal to max - min, since the plt.bar method takes a height parameter, then set a ['type','size'] MultiIndex on the DataFrame. Then loop through the levels of this MultiIndexed DataFrame and plot a bar with a different color for each unique type and size combination.
This also requires defining your own color palette. I chose a discrete palette from plt.cm and mapped integer values to each color. As you loop through each unique type and size, a counter on the innermost loop ensures that each bar within the same type gets a different color.
NOTE: this does make the assumption that there aren't multiple rows with the same type and size.
To show this is generalizable, I added another bar of type 'D' and size 'i' and it appears as a distinct bar in the plot.
import pandas as pd
import matplotlib.pyplot as plt

## added a third size to type D
df = pd.DataFrame({
    'type': ['A','A','B','B','C','C','D','D','D'],
    'size': ['a','b','c','d','e','f','g','h','i'],
    'Nx':   [4.3,2.4,2.5,4.4,3.5,1.8,4.5,2.8,5.6],
    'min':  [0.5,2.5,0.7,3.2,0.51,2,0.3,3,4.8],
    'max':  [1.5,3.4,1.7,4.3,1.51,3,1.2,4,5.3]})

## create a height column for convenience
df['height'] = df['max'] - df['min']
df_grouped = df.set_index(['type','size'])

## create a list of as many colors as there are categories
cmap = plt.cm.get_cmap('Accent', 10)

## loop through the levels of the grouped DataFrame
for each_type, df_type in df_grouped.groupby(level=0):
    color_idx = 0
    for each_size, df_type_size in df_type.groupby(level=1):
        color_idx += 1
        plt.bar(x=[each_type]*len(df_type_size), height=df_type_size['height'],
                bottom=df_type_size['min'], width=0.4,
                edgecolor='grey', color=cmap(color_idx))
        plt.scatter(x=[each_type]*len(df_type_size), y=df_type_size['Nx'],
                    color=cmap(color_idx))

plt.ylim([0, 7])
plt.show()
Consider a pandas DataFrame which looks like the one below
A B C
0 0.63 1.12 1.73
1 2.20 -2.16 -0.13
2 0.97 -0.68 1.09
3 -0.78 -1.22 0.96
4 -0.06 -0.02 2.18
I would like to use the function .rolling() to perform the following calculation for t = 0,1,2:
Select the rows from t to t+2
Take the 9 values contained in those 3 rows, from all the columns. Call this set S
Compute the 75th percentile of S (or other summary statistics about S)
For instance, for t = 1 we have
S = { 2.2 , -2.16, -0.13, 0.97, -0.68, 1.09, -0.78, -1.22, 0.96 } and the 75th percentile is 0.97.
I couldn't find a way to make it work with .rolling(), since it apparently takes each column separately. I'm now relying on a for loop, but it is really slow.
Do you have any suggestion for a more efficient approach?
One solution is to stack the data, multiply the window size by the number of columns, and slice the result by the number of columns. Also, since you want a forward-looking window, reverse the order of the stacked DataFrame:
wsize = 3
cols = len(df.columns)
(df.stack(dropna=False)[::-1]        # flatten row by row; reverse for a forward-looking window
   .rolling(window=wsize * cols)
   .quantile(0.75)
   [cols - 1::cols]                  # keep one value per original row
   .reset_index(-1, drop=True)
   .sort_index())
Output:
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
dtype: float64
In the case of many columns and a small window:
import pandas as pd
import numpy as np

wsize = 3
df2 = pd.concat([df.shift(-x) for x in range(wsize)], axis=1)
s_quant = df2.quantile(0.75, axis=1)
# Only necessary if you need to enforce sufficient data.
s_quant[df2.isnull().any(axis=1)] = np.nan
Output (s_quant):
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
Name: 0.75, dtype: float64
You can use NumPy's ravel. You may still need a for loop, though:
for i in range(0, 3):
    print(df.iloc[i:i+3].values.ravel())
If your t steps in 3s, you can use NumPy's reshape function to create an n x 9 array instead.
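A sketch of that reshape idea, assuming non-overlapping windows of 3 rows on a frame whose length is a multiple of 3 (here a 9 x 3 frame of consecutive numbers, so the result is easy to check by eye):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(27.0).reshape(9, 3), columns=list("ABC"))

# Non-overlapping blocks of 3 rows x 3 columns become one row of 9 values each
blocks = df.values.reshape(-1, 9)
q75 = np.quantile(blocks, 0.75, axis=1)  # one summary statistic per block
```

Since each block holds 9 consecutive values, the 75th percentiles come out as 6.0, 15.0, and 24.0, which matches taking np.quantile over rows t..t+2 directly.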
I would like to use pandas dataframes to create a two-dimensional table. The table should associate two values alpha and epsilon with a third value. alpha and epsilon come from a variable range, like:
alphaRange = numpy.arange(0.01, 0.26, 0.01)
epsilonRange = numpy.arange(0.01, 0.11, 0.01)
(The goal is to find out which combination of alpha and epsilon leads to the highest values, or more generally, find a correlation between parameters and values.)
What is the best way to construct such a dataframe and later fill it with values?
It might be easier to use NumPy to compute the values first, and then load the result into a DataFrame:
import numpy as np
import pandas as pd
alphaRange = np.arange(0.01, 0.26, 0.01)
epsilonRange = np.arange(0.01, 0.11, 0.01)
X, Y = np.meshgrid(alphaRange, epsilonRange)
vals = X+Y
print(vals.shape)
df = pd.DataFrame(vals, index=epsilonRange, columns=alphaRange)
print(df)
Edit: PaulH is right -- floats do not make good column or index labels, since they could be hard to reference properly. (Checking floats for equality brings up float-representation issues.) So it would be better to make alpha and epsilon DataFrame columns:
# transpose so alpha varies slowest, matching the MultiIndex order below
df = pd.DataFrame({'vals': vals.T.ravel()},
                  index=pd.MultiIndex.from_product([alphaRange, epsilonRange],
                                                   names=['alpha', 'epsilon']))
df.reset_index(inplace=True)
print(df.head())
yields
alpha epsilon vals
0 0.01 0.01 0.02
1 0.01 0.02 0.03
2 0.01 0.03 0.04
3 0.01 0.04 0.05
4 0.01 0.05 0.06
[5 rows x 3 columns]
pd.MultiIndex.from_product was added in pandas 0.13.1. For earlier versions of pandas, you could use:
def from_product(iterables, sortorder=None, names=None):
    from pandas.tools.util import cartesian_product
    product = cartesian_product(iterables)
    return pd.MultiIndex.from_arrays(product, sortorder=sortorder,
                                     names=names)

df = pd.DataFrame({'vals': vals.T.ravel()},
                  index=from_product([alphaRange, epsilonRange],
                                     names=['alpha', 'epsilon']))
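The stated goal was to find which combination of alpha and epsilon gives the highest value; once the frame is in long format with alpha, epsilon, and vals columns, idxmax does that in one step. A small sketch, using alpha + epsilon as a placeholder for the real objective:

```python
import numpy as np
import pandas as pd

alphaRange = np.arange(0.01, 0.26, 0.01)
epsilonRange = np.arange(0.01, 0.11, 0.01)

# vals here is just alpha + epsilon, standing in for whatever is being measured
df = pd.DataFrame([(a, e, a + e) for a in alphaRange for e in epsilonRange],
                  columns=["alpha", "epsilon", "vals"])

best = df.loc[df["vals"].idxmax()]  # row with the highest vals
```

For an overall correlation between parameters and values, df.corr() on the same long-format frame is the natural next step.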
I have the following dataframe in Python (the actual dataframe is much bigger, just presenting a small sample):
A B C D E F
0 0.43 0.52 0.96 1.17 1.17 2.85
1 0.43 0.52 1.17 2.72 2.75 2.94
2 0.43 0.53 1.48 2.85 2.83
3 0.47 0.59 1.58 3.14
4 0.49 0.80
I convert the dataframe to numpy using df.values and then pass that to boxplot.
When I try to make a boxplot out of this pandas dataframe, the number of values picked from each column is restricted to the least number of values in a column (in this case, column F). Is there any way I can boxplot all values from each column?
NOTE: I use df.dropna to drop the rows in each column with missing values. However, this is resizing the dataframe to the lowest common denominator of column length, and messing up the plotting.
import prettyplotlib as ppl
import numpy as np
import pandas
import matplotlib as mpl
from matplotlib import pyplot
df = pandas.read_csv(csv_data, index_col=False)
df = df.dropna()
labels = ['A', 'B', 'C', 'D', 'E', 'F']
fig, ax = pyplot.subplots()
ppl.boxplot(ax, df.values, xticklabels=labels)
pyplot.show()
The right way to do it, rather than reinventing the wheel, is to use pandas' .boxplot(), where the NaNs are handled correctly:
In [31]:
print df
A B C D E F
0 0.43 0.52 0.96 1.17 1.17 2.85
1 0.43 0.52 1.17 2.72 2.75 2.94
2 0.43 0.53 1.48 2.85 2.83 NaN
3 0.47 0.59 1.58 NaN 3.14 NaN
4 0.49 0.80 NaN NaN NaN NaN
[5 rows x 6 columns]
In [32]:
_=plt.boxplot(df.values)
_=plt.xticks(range(1,7),labels)
plt.savefig('1.png') #keeping the nan's and plot by plt
In [33]:
_=df.boxplot()
plt.savefig('2.png') #keeping the nan's and plot by pandas
In [34]:
_=plt.boxplot(df.dropna().values)
_=plt.xticks(range(1,7),labels)
plt.savefig('3.png') #dropping the nan's and plot by plt