Pandas - Outer Join on Column with Repeating Values - python

This is my first question on Stack Overflow, please let me know how I can help you help me if my question is unclear.
Goal: Use Python and Pandas to Outer join (or merge) Data Sets containing different experimental trials where the "x" axis of each trial is extremely similar but has some deviations. Most importantly, the "x" axis increases, hits a maximum and then decreases, often overlapping with previously existing "x" points.
Problem: When I go to join/merge the datasets on "x", the "x" column is sorted, messing up the order of the collected data and making it impossible to plot it correctly.
Here is a small example of what I am trying to do:
The site wouldn't let me add pictures because I am new. Here is the code to generate these example data sets.
Data Sets :
Import:
import numpy as np
import pandas as pd
import random as rand
Code :
T1 = {'x':np.array([1,1.5,2,2.5,3,3.5,4,5,2,1]),'y':np.array([10000,8500,7400,6450,5670,5100,4600,4500,8400,9000]),'z':np.array(rand.sample(range(0,10000),10))}
T2 = {'x':np.array([1,2,3,4,5,6,7,2,1.5,1]),'y':np.array([10500,7700,5500,4560,4300,3900,3800,5400,8400,8800]),'z':np.array(rand.sample(range(0,10000),10))}
Trial1 = pd.DataFrame(T1)
Trial2 = pd.DataFrame(T2)
Attempt to Merge/Join:
WomboCombo = Trial1.join(Trial2, how='outer', lsuffix='1', rsuffix='2', on='x')
WomboCombo2 = pd.merge(left=Trial1, right=Trial2, how='outer', left_on='x', right_on='x')
Attempt to split into two parts, an increasing and a decreasing part (I manually found the row number where "x" starts decreasing):
Trial1Inc = Trial1[0:8]
Trial2Inc = Trial2[0:7]
Result - Merge works well, join messes with the "x" column, not sure why:
Trial1Inc.merge(Trial2Inc, on='x', how='outer', suffixes=('1', '2'))
Incrementing section Merge Result
Trial1Inc.join(Trial2Inc, on='x', how='outer', lsuffix='1', rsuffix='2')
Incrementing section Join Result
Hopefully my example is clear: the "x" column in Trial 1 increases until 5, then decreases back towards 0. In Trial 2, I altered the test a bit because I noticed that I needed data at a slightly higher "x" value; Trial 2 increases until 7 and then quickly decreases back towards 0.
My end goal is to plot the average of all y values (where there is overlap between the trials) against the corresponding x values.
If there is overlap I can add error bars. Pandas is almost perfect for what I am trying to do because an Outer join adds null values where there is no overlap and is capable of horizontally concatenating the two trials when there is overlap.
All that's left now is to figure out how to join on the "x" column while maintaining its order of increasing and then decreasing values. The reason the order matters is that the "y" value the first time "x" reaches a given value is greater than the "y" value when "x" comes back down through it (e.g., in Trial 1, when x=1 at the start, y=10000, but later in the trial when we come back to x=1, y=9000); this trend is important. When pandas sorts the column before merging, instead of a clean curve showing "y" decreasing as "x" increases and then the reverse, there are vertical downward jumps at any point where the data was joined.
I would really appreciate any help with either:
A) a perfect solution that lets me join on "x" when "x" contains duplicates
B) an efficient way to split the data sets into increasing "x" and decreasing "x" so that I can merge the increasing and decreasing sections of each trial separately and then vertically concat them.
Hopefully I did an okay job explaining the problem I would like to solve. Please let me know if I can clarify anything,
Thanks for the help!

I think @xyzjayne's idea of splitting the dataframe is a great one.
Splitting Trial1 and Trial2:
# index of max x value in Trial2
t2_max_index = Trial2.index[Trial2['x'] == Trial2['x'].max()].tolist()
# split Trial2 by max value
trial2_high = Trial2.loc[:t2_max_index[0]].set_index('x')
trial2_low = Trial2.loc[t2_max_index[0]+1:].set_index('x')
# index of max x value in Trial1
t1_max_index = Trial1.index[Trial1['x'] == Trial1['x'].max()].tolist()
# split Trial1 by max value
trial1_high = Trial1.loc[:t1_max_index[0]].set_index('x')
trial1_low = Trial1.loc[t1_max_index[0]+1:].set_index('x')
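As a side note, the split point can also be found a bit more directly with idxmax; a small sketch of that, assuming the maximum "x" value occurs only once per trial (true for the example data):
# equivalent split using idxmax, assuming a unique maximum "x" per trial
t1_peak = Trial1['x'].idxmax()
t2_peak = Trial2['x'].idxmax()
trial1_high = Trial1.loc[:t1_peak].set_index('x')
trial1_low = Trial1.loc[t1_peak + 1:].set_index('x')
trial2_high = Trial2.loc[:t2_peak].set_index('x')
trial2_low = Trial2.loc[t2_peak + 1:].set_index('x')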
Once we split the dataframes, we join the two high parts together and the two low parts together:
WomboCombo_high = trial1_high.join(trial2_high, how='outer', lsuffix='1', rsuffix='2', on='x').reset_index()
WomboCombo_low = trial1_low.join(trial2_low, how='outer', lsuffix='1', rsuffix='2', on='x').reset_index()
We now combine them together into one dataframe, WomboCombo:
WomboCombo = WomboCombo_high.append(WomboCombo_low)
OUTPUT:
x y1 z1 y2 z2
0 1.0 10000.0 3425.0 10500.0 3061.0
1 1.5 8500.0 5059.0 NaN NaN
2 2.0 7400.0 2739.0 7700.0 7090.0
3 2.5 6450.0 9912.0 NaN NaN
4 3.0 5670.0 2099.0 5500.0 1140.0
5 3.5 5100.0 9637.0 NaN NaN
6 4.0 4600.0 7581.0 4560.0 9584.0
7 5.0 4500.0 8616.0 4300.0 3940.0
8 6.0 NaN NaN 3900.0 5896.0
9 7.0 NaN NaN 3800.0 6211.0
0 2.0 8400.0 3181.0 5400.0 9529.0
2 1.5 NaN NaN 8400.0 3260.0
1 1.0 9000.0 4280.0 8800.0 8303.0
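Since the stated end goal is to plot the average of the y values against x, one small follow-up sketch (assuming the WomboCombo frame above; y_mean is a new, hypothetical column name) could be:
import matplotlib.pyplot as plt
# row-wise mean of the two trials; where only one trial has data, the NaN is skipped
WomboCombo['y_mean'] = WomboCombo[['y1', 'y2']].mean(axis=1)
# plot in the stored increasing-then-decreasing order, without sorting by x
plt.plot(WomboCombo['x'].to_numpy(), WomboCombo['y_mean'].to_numpy())
plt.xlabel('x')
plt.ylabel('mean y')
plt.show()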

One possible solution is to give your trial rows specific IDs and then merge on the IDs. That should keep the x values from being sorted.
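A minimal sketch of that idea, assuming the Trial1/Trial2 frames from the question (row_id is a hypothetical column name): merge on a positional row ID instead of "x", so the collection order is preserved and the x values stay unsorted:
Trial1['row_id'] = range(len(Trial1))
Trial2['row_id'] = range(len(Trial2))
# outer merge on the row ID; x1/x2 remain separate columns since the trials' x values differ slightly
combo = pd.merge(Trial1, Trial2, on='row_id', how='outer', suffixes=('1', '2'))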

Here's what I was trying out, but it doesn't address varying numbers of data points. I like gym-hh's answer, though it's not clear to me whether you wanted two columns of y,z pairs. So you could combine his idea and this code to get what you need.
Trial1['index1'] = Trial1.index
Trial2['index1'] = Trial2.index
WomboCombo = Trial1.append(Trial2)
WomboCombo.sort_values(by=['index1'],inplace=True)
WomboCombo
Output:
x y z index1
0 1.0 10000 7148 0
0 1.0 10500 2745 0
1 1.5 8500 248 1
1 2.0 7700 9505 1
2 2.0 7400 6380 2
2 3.0 5500 3401 2
3 2.5 6450 6183 3
3 4.0 4560 5281 3
4 3.0 5670 99 4
4 5.0 4300 8864 4
5 3.5 5100 5132 5
5 6.0 3900 7570 5
6 4.0 4600 9951 6
6 7.0 3800 7447 6
7 2.0 5400 3713 7
7 5.0 4500 3863 7
8 1.5 8400 8776 8
8 2.0 8400 1592 8
9 1.0 9000 2167 9
9 1.0 8800 782 9

Related

Find index of first bigger row of current value in pandas dataframe

I have a big dataset of values as follows:
The "bigger" column should hold the index of the first row, starting from the current one, whose "bsl" is bigger than the current row's "mb". I need to do it without a loop, as it has to finish in under a second; with a loop it takes over a minute.
For example, for the first row (index 74729) "bigger" is 74731. I know it can be done with LINQ in C#, but I'm fairly new to Python.
Here is another example, as text:
index bsl mb bigger
74729 47091.89 47160.00 74731.0
74730 47159.00 47201.00 74735.0
74731 47196.50 47201.50 74735.0
74732 47186.50 47198.02 74735.0
74733 47191.50 47191.50 74735.0
74734 47162.50 47254.00 74736.0
74735 47252.50 47411.50 74736.0
74736 47414.50 47421.00 74747.0
74737 47368.50 47403.00 74742.0
74738 47305.00 47310.00 74742.0
74739 47292.00 47320.00 74742.0
74740 47302.00 47374.00 74742.0
74741 47291.47 47442.50 74899.0
74742 47403.50 47416.50 74746.0
74743 47354.34 47362.50 74746.0
I'm not sure how many rows you have, but if the number is reasonable, you can perform a pairwise comparison:
# get data as arrays
a = df['bsl'].to_numpy()
b = df['mb'].to_numpy()
idx = df.index.to_numpy()
# pairwise comparison: mask the lower triangle so each row is only
# compared against itself and later rows
mask = np.triu(a > b[:, None])
out = mask.argmax(1)
# map positions back to the original index labels (as float, so NaN can be stored)
res = idx[out].astype(float)
# mask rows with no bigger value: argmax fell back to a position before the row itself
res[out < np.arange(len(out))] = np.nan
df['bigger'] = res
Output:
bsl mb bigger
0 1 2 2.0
1 2 4 6.0
2 3 3 5.0
3 2 1 3.0
4 3 5 NaN
5 4 2 5.0
6 5 1 6.0
7 1 0 7.0
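The small frame behind this output isn't shown above; for completeness, a reproduction built from the printed bsl/mb values (running the snippet above on it yields the 'bigger' column shown):
import numpy as np
import pandas as pd
df = pd.DataFrame({'bsl': [1, 2, 3, 2, 3, 4, 5, 1],
                   'mb':  [2, 4, 3, 1, 5, 2, 1, 0]})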

how to get a continuous rolling mean in pandas?

Looking to get a continuous rolling mean of a dataframe.
df looks like this
index price
0 4
1 6
2 10
3 12
I'm looking to get a continuous rolling mean of the price;
the goal is to have it look like this, a moving mean of all the prices so far:
index price mean
0 4 4
1 6 5
2 10 6.67
3 12 8
thank you in advance!
you can use expanding:
df['mean'] = df.price.expanding().mean()
df
index price mean
0 4 4.000000
1 6 5.000000
2 10 6.666667
3 12 8.000000
Welcome to SO: Hopefully people will soon remember you from prior SO posts, such as this one.
From your example, it seems that @Allen has given you code that produces the answer in your table. That said, this isn't exactly the same as a "rolling" mean. The expanding() function Allen uses takes the value of the first row divided by n (which is 1), then the sum of the first two rows divided by n (which is now 2), and so on, so that the last row is (4+6+10+12)/4 = 8.
This last number could be the answer if the window you want for the rolling mean is 4, since that would indicate that you want a mean of 4 observations. However, if you keep moving forward with a window size 4, and start including rows 5, 6, 7... then the answer from expanding() might differ from what you want. In effect, expanding() is recording the mean of the entire series (price in this case) as though it were receiving a new piece of data at each row. "Rolling", on the other hand, gives you a result from an aggregation of some window size.
Here's another option for doing rolling calculations: the rolling() method of a pandas DataFrame.
In your case, you would do:
df['rolling_mean'] = df.price.rolling(4).mean()
df
index price rolling_mean
0 4 nan
1 6 nan
2 10 nan
3 12 8.000000
Those nans are a result of the windowing: until there are enough rows to calculate the mean, the result is nan. You could set a smaller window:
df['rolling_mean'] = df.price.rolling(2).mean()
df
index price rolling_mean
0 4 nan
1 6 5.000000
2 10 8.000000
3 12 11.00000
This shows the reduction in the nan entries as well as the rolling function: it's only averaging within the size-two window you provided. That results in a different df['rolling_mean'] value than when using df.price.expanding().
Note: you can get rid of the nan by using .rolling(2, min_periods = 1), which tells the function the minimum number of defined values within a window that have to be present to calculate a result.
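A quick sketch of that variant, using the same df:
df['rolling_mean'] = df.price.rolling(2, min_periods=1).mean()
df
index price rolling_mean
0 4 4.0
1 6 5.0
2 10 8.0
3 12 11.0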

How many data points are plotted on my matplotlib graph?

So I want to count the number of data points plotted on my graph to keep a running total of graphed data. The problem is that in my data table, the NaN values in one column don't necessarily line up with the rows where another column may or may not have a NaN value. For example:
# I use num1 as my y-coordinate and num1-num2 for my x-coordinate.
num1 num2 num3
1 NaN 25
NaN 7 45
3 8 63
NaN NaN 23
5 10 42
NaN 4 44
#So in this case, there should be only 2 data points on the graph between num1 and num2. For num1 and num3, there should be 3. There should be 4 data points between num2 and num3.
I believe Matplotlib doesn't graph rows where a column contains a NaN value, since it's null (please correct me if I'm wrong; I can only tell because no dots appear at the 0 coordinate of the x and y axes). In the beginning, I thought I could get away with using .count() and taking the smaller of the two columns as my tracker, but realistically that won't work, as shown in my example above, because the plotted count can be even less than that, since one column may have a NaN value where the other has an actual value. Some examples of code I did:
# both x and y are columns within the DataFrame and are used to "count" how many data points are # being graphed.
def findAmountOfDataPoints(colA, colB):
    if colA.count() < colB.count():
        print(colA.count())  # Since it's the smaller value, print the number of values in colA.
    else:
        print(colB.count())  # Since it's the smaller value, print the number of values in colB.
Also, I thought about using .value_counts(), but I'm not sure if that's the exact function I'm looking for to accomplish what I want. Any suggestions?
Edit 1: Changed Data Frame names to make example clearer hopefully.
If I understood correctly your problem, assuming that your table is a pandas dataframe df, the following code should work:
sum((~np.isnan(df['num1']) & (~np.isnan(df['num2']))))
How it works:
np.isnan returns True if a cell is NaN. ~np.isnan is the inverse, hence it returns True when the cell is not NaN.
The code checks where both the "num1" column AND the "num2" column contain a non-NaN value; in other words, it returns True for the rows where both values exist.
Finally, those good rows are counted with sum, which takes into account only True values.
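If you need that count for every pair of columns (2 for num1/num2, 3 for num1/num3, 4 for num2/num3 in the example above), a minimal extension of the same idea, assuming df is the frame from the question, would be:
from itertools import combinations
for col_a, col_b in combinations(['num1', 'num2', 'num3'], 2):
    n = int((df[col_a].notna() & df[col_b].notna()).sum())
    print(col_a, col_b, n)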
The way I understood it is that the number of combinations of points that are not NaN is needed. Using a function I found, I came up with this:
import pandas as pd
import numpy as np
def choose(n, k):
    """
    A fast way to calculate binomial coefficients by Andrew Dalke (contrib).
    https://stackoverflow.com/questions/3025162/statistics-combinations-in-python
    """
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in range(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    else:
        return 0

data = {'num1': [1, np.nan, 3, np.nan, 5, np.nan],
        'num2': [np.nan, 7, 8, np.nan, 10, 4],
        'num3': [25, 45, 63, 23, 42, 44]
        }
df = pd.DataFrame(data)
df['notnulls'] = df.notnull().sum(axis=1)
df['plotted'] = df.apply(lambda row: choose(int(row.notnulls), 2), axis=1)
print(df)
print("Total data points: ", df['plotted'].sum())
With this result:
num1 num2 num3 notnulls plotted
0 1.0 NaN 25 2 1
1 NaN 7.0 45 2 1
2 3.0 8.0 63 3 3
3 NaN NaN 23 1 0
4 5.0 10.0 42 3 3
5 NaN 4.0 44 2 1
Total data points: 9
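As a side note, since only pairs of columns are plotted, choose(n, 2) is just n*(n-1)//2, so on Python 3.8+ the same per-row count could also be computed with math.comb (a small simplification of the approach above, reusing the 'notnulls' column):
from math import comb
df['plotted'] = df['notnulls'].apply(lambda n: comb(int(n), 2))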

How to find the closest element in another column for each element in a column?

The situation is as follows.
I have two pandas dataframes:
df1, which contains a column "p1" with 1895 rows of random numbers ranging from 2.805 to 3.035 (here are the first 20 rows):
p1
0 2.910
1 2.885
2 2.875
3 2.855
4 2.910
5 2.870
6 2.850
7 2.875
8 2.865
9 2.875
10 2.890
11 2.910
12 2.965
13 2.955
14 2.935
15 2.905
16 2.900
17 2.905
18 2.970
19 2.940
df2, which contains two columns, "p2" and "h"
p2 h
0 2.7 256.88
1 2.8 253.52
2 2.9 250.18
3 3.0 246.86
4 3.1 243.55
The aim is to first loop through all rows in df1 and find the closest element in p2 for each row. e.g. for p1[0] = 2.910, the closest element is p2[2] = 2.9.
Then, if these two values are the same, the output for that row is the corresponding value of h;
otherwise, the output is the average of the previous and subsequent values of h.
Going back to our example, the output for p1[0] should therefore be (h[1]+h[3])/2
I hope this all makes sense, this is my first question on here :).
Thanks!
This is a use case for merge_asof. Note that allow_exact_matches defaults to True, so, for example, the nearest value to 2.9 is 2.9 itself in this case.
df1 = df1.sort_values('p1')
s1 = pd.merge_asof(df1, df2, left_on='p1', right_on='p2', direction='backward')
s2 = pd.merge_asof(df1, df2, left_on='p1', right_on='p2', direction='forward')
df1['Value'] = ((s1.h + s2.h) / 2).to_numpy()
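Because merge_asof needs df1 sorted by 'p1', the result above comes back in sorted order; if the original row order matters, a small follow-up (assuming df1 kept its default integer index) restores it:
df1 = df1.sort_index()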
Another solution with numpy:
import numpy as np
# Generate some test data
x1 = np.random.randint(0,100,10)
x2 = np.vstack([np.random.randint(0,100,10),np.random.normal(0,1,10)]).T
# Repeat the two vectors
X1 = np.tile(x1,(len(x2),1))
X2 = np.tile(x2[:,0],(len(x1),1))
distance = np.abs(X1 - X2.T)
closest_idx = np.argmin(distance,axis=0)
print(x2[closest_idx,1])
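To map this back onto the question's frames, a rough sketch (assuming df1/df2 as posted, that 'value' is a new column name, and that the closest p2 never falls on the very first or last row of df2, so both neighbours exist) might look like:
p1 = df1['p1'].to_numpy()
p2 = df2['p2'].to_numpy()
h = df2['h'].to_numpy()
# index of the closest p2 for each p1, via a pairwise distance matrix
closest = np.abs(p1[:, None] - p2[None, :]).argmin(axis=1)
# exact match -> take h directly; otherwise average the neighbouring h values
exact = np.isclose(p1, p2[closest])
df1['value'] = np.where(exact, h[closest], (h[closest - 1] + h[closest + 1]) / 2)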

In pandas, is there a way to compute a subsection of an expanding window without calculating the entire array and "tail"-ing the result?

I want to compute the expanding window of just the last few elements in a group...
df = pd.DataFrame({'B': [np.nan, np.nan, 1, 1, 2, 2, 1,1], 'A': [1, 2, 1, 2, 1, 2,1,2]})
df.groupby("A")["B"].expanding().quantile(0.5)
this gives:
1 0 NaN
2 1.0
4 1.5
6 1.0
2 1 NaN
3 1.0
5 1.5
7 1.0
I only really want the last two rows for each group though. The result should be:
1 4 1.5
6 1.0
2 5 1.5
7 1.0
I can easily calculate it all and then just get the sections I want, but this is very slow if my dataframe is thousands of elements long and I don't want to roll across the whole window... just the last two "rolls".
EDIT: I have amended the title; a lot of people are correctly answering part of the question but ignoring what is, IMO, the important part (I should have been clearer).
The issue here is the time it takes. I could just "tail" the answer to get the last two; but then it involves calculating the first two "expanding windows" and then throwing away those results. If my dataframe was instead 1000s of rows long and I just needed the answer for the last few entries, much of this calculation would be wasting time. This is the main problem I have.
As I stated:
"I can easily calculate it all and then just get the sections I want" => through using tail.
Sorry for the confusion.
Also, potentially using tail doesn't involve calculating the lot, but it still seems like it does from the timings I have done... maybe this is not correct; it is an assumption I have made.
EDIT2: The other option I have tried was using min_periods in rolling to force it not to calculate the initial sections of the group, but this has pitfalls: it doesn't work if the array includes NaNs, and it fails if the groups are not the same length.
EDIT3:
As a simpler illustration of the reasoning: it's a limitation of the expanding/rolling window, I think. Say we had an array [1,2,3,4,5]; the expanding windows are [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and if we run max over them we get 1, 2, 3, 4, 5 (the max of each window). But if I just want the max of the last two expanding windows, I only need max([1,2,3,4]) = 4 and max([1,2,3,4,5]) = 5. Intuitively I don't need to calculate the max of the first three expanding windows to get the last two. But pandas' implementation might be that it calculates max([1,2,3,4]) as max(max([1,2,3]), max([4])) = 4, in which case the calculation of the entire window is necessary... this might be the same for the quantile example. There might be an alternate way to do this without using expanding... not sure... this is what I can't work out.
Maybe try using tail: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.core.groupby.GroupBy.tail.html
df.groupby('A')['B'].rolling(4, min_periods=1).quantile(0.5).reset_index(level=0).groupby('A').tail(2)
Out[410]:
A B
4 1 1.5
6 1 1.0
5 2 1.5
7 2 1.0
rolling and expanding are similar here: rolling(4, min_periods=1) behaves like expanding() because each group has only four rows
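For reference, the same two rows per group can also be produced with expanding() itself and then tailing each group, though this still computes the full expanding window first, which is exactly the cost the question wants to avoid:
df.groupby('A')['B'].expanding().quantile(0.5).groupby(level='A').tail(2)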
How about this (edited 06/12/2018):
def last_two_quantile(row, q):
    return pd.Series([row.iloc[:-1].quantile(q), row.quantile(q)])
df.groupby('A')['B'].apply(last_two_quantile, 0.5)
Out[126]:
A
1 0 1.5
1 1.0
2 0 1.5
1 1.0
Name: B, dtype: float64
If this (or something like it) doesn't do what you desire I think you should provide a real example of your use case.
Is this what you want?
df[-4:].groupby("A")["B"].expanding().quantile(0.5)
A
1 4 2.0
6 1.5
2 5 2.0
7 1.5
Name: B, dtype: float64
Hope this can help you.
Solution1:
newdf = df.groupby("A")["B"].expanding().quantile(0.5).reset_index()
for i in range(newdf["A"].max()+1):
    print(newdf[newdf["A"]==i][-2:], '\n')
Solution2:
newdf2 = df.groupby("A")["B"].expanding().quantile(0.5)
for i in range(newdf2.index.get_level_values("A").max()+1):
    print(newdf[newdf["A"]==i][-2:], '\n')
Solution3:
for i in range(df.groupby("A")["B"].expanding().quantile(0.5).index.get_level_values("A").max()+1):
    print(newdf[newdf["A"]==i][-2:], '\n')
output:
Empty DataFrame
Columns: [A, level_1, B]
Index: []
A level_1 B
2 1 4 1.5
3 1 6 1.0
A level_1 B
6 2 5 1.5
7 2 7 1.0
new solution:
newdf = pd.DataFrame(columns={"A", "B"})
for i in range(len(df["A"].unique())):
    newdf = newdf.append(pd.DataFrame(df[df["A"]==i+1][:-2].sum()).T)
newdf["A"] = newdf["A"]/2
for i in range(len(df["A"].unique())):
    newdf = newdf.append(df[df["A"]==df["A"].unique()[i]][-2:])
#newdf = newdf.reset_index(drop=True)
newdf["A"] = newdf["A"].astype(int)
for i in range(newdf["A"].max()+1):
    print(newdf[newdf["A"]==i].groupby("A")["B"].expanding().quantile(0.5)[-2:])
output:
Series([], Name: B, dtype: float64)
A
1 4 1.5
6 1.0
Name: B, dtype: float64
A
2 5 1.5
7 1.0
Name: B, dtype: float64
