Splice two different dataframes based on similar value - python

I have two dataframes with different dimensions. Lets say it´s displacement measurements but the readings are slightly different values and one has more data. Looks like this:
df1
Index
displacement
1
0
2
2
3
4
4
2
5
0
df2
Index
displacement
other data
1
0
5
2
0.4
6
3
0.9
7
4
1.3
8
5
1.8
9
6
2.4
10
I want to add the "other data" to the first dataframe (df1), by looking for similar displacement value in df2 and asociating displacement value. In this case, the output i want must be similar to this:
df1
Index
displacement
other data (from df2)
1
0
5
2
2
9
And to keep adding the "other data" from df2. I dont know if pd.merge will work and im thinking maybe with a loop till displacement is higher than what im looking from and add the data from the previous row, but df2 has 10 times more rows than df1 and if the displacement measurement is the same as the one from a previous row it may not work. Any help in a cleaner/easier way to do it will be greatly appreciated.

I used the merge_asof function to find the nearest value base on two DataFrames' displacement columns, and then filtered the resulting DataFrame by a threshold.
df1['displacement'] =df1['displacement'].astype(float)
df1 = df1.drop_duplicates('displacement', keep='last')
df_out = pd.merge_asof(
df1.sort_values("displacement"),
df2.sort_values("displacement").assign(df2_displacement=lambda d: d["displacement"]),
on="displacement",
direction="nearest",
)
threshold = .5
dfout1 = df_out[abs(df_out['displacement'] -df_out['df2_displacement'] )< threshold ]

Related

Column by column pairplotting of 2 dataframes

I want to be able to plot two dataframes against each other pairing each column successively (but not all columns against all columns). The dataframes are identical in size and column headers but differ in the values. So the dataframes are of the form:
df_X =
A B C
0 1 1 1
1 2 2 2
...
df_Y =
A B C
0 3 3 3
1 4 4 4
...
At the moment I can do this manually on subplots using by starting with a merged dataframe with two header columns:
df_merge =
col A B C
X Y X Y X Y
0 1 3 1 3 1 3
1 2 4 2 4 2 4
...
_, ax = plt.subplots(3, 1)
for i in range(3):
ax[i].scatter(df_merge[col[i]][X], df_merge[col[i]][Y])
This works, but I am wondering if there is a better way of acheving this. Particularly when trying to then calculate the numerical correlation value between the pairs, which would again involve another loop and several more lines of code.
You can get correlation with something like:
df_merge[[col[i]][X],col[i]][Y]]).corr()
You can generally assume that most statistical functions can be applied in a single line to dataframe content either with built-in Pandas functions (https://pandas.pydata.org/docs/user_guide/computation.html), or scipy/numpy functions which you can apply.
To title each plot with the correlation, for example, you can do
thisAX.set_title("Corr: {}".format(df_merge[[col[i]][X],col[i]][Y]]).corr())
(I flattened your column names to make display a bit simpler, and I reversed one of the number pairs to show negative correlation)
Note: when feeding two Pandas columns (Series) into .corr(), you'll get a dataframe returned - to get the X:Y correlation, you can pick out a single value with .corr()["{}_X".format(col[i])]["{}_Y".format(col[i])])) (those are just the column and index names of the correlation)
Here's a lightly styled version of the same plot (again, using the flattened version of your column names)

Calculate quantile for each observation in a dataframe

I am new to Python and I have the following dataframe structure:
data = {'name': ["a","b","c","d","e","f","g","h"], 'value1': [1,2,3,4,5,6,7,8],'value2': [1,2,3,4,5,6,7,8]}
data = pd.DataFrame.from_dict(data)
data = data.transpose()
What I want to calculate is a new dataframe, where for each row, each column has a value corresponding to the quantile in the data.
In other words, I am trying to understand how to apply the function pd.quantile to return a dataframe with each entry being equal to the quantile value of the column in the row.
I tried the following, but I don't think it works:
x.quantile(q = 0.9,axis =0)
or:
x.apply(quantile,axis=0)
Many thanks in advance.
This is because you transpose your data and as per pandas documentation here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
When the DataFrame has mixed dtypes, we get a transposed DataFrame
with the object dtype
Your dataframe after loading looks like below, which means it has 'mixed dtypes' (one column is object / category and the other two are integers).
name value1 value2
0 a 1 1
1 b 2 2
2 c 3 3
3 d 4 4
4 e 5 5
5 f 6 6
6 g 7 7
7 h 8 8
In this case you transpose your data and it is being converted to object dtype, which means that quantile function does not understand it as numbers.
Try removing transposing step and use axis argument to decide for which direction you want to calculate quantiles.
By the way, you can do transposition with:
df = df.T

selecting rows with min and max values of a defined column in pandas

I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe starting from the previous one which have only two row. The selection of these two rows is based on the minimum and maximum value in the column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with a sort of index.min e index.max and then select only the two rows and append then in a new dataframe? Do you have same other suggestions?
Thanks for any kind of help,
Best
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A==df.A.min())|(df.A==df.A.max())]
df
A B C D
1 1 4 7 3
3 40 7 9 3
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin','idxmax']) ]
Output:
A B C D
1 1 4 7 3
3 40 7 9 3
Note that this only gives one row for min and one for max. If you want all values, you should use #CHRD's solution.

How to keep only the top n% rows of each group of a pandas dataframe?

I have seen a variant of this question asked that keeps the top n rows of each group in a pandas dataframe and the solutions use n as an absolute number rather than a percentage here Pandas get topmost n records within each group. However, in my dataframe, each group has different numbers of rows in it and I want to keep the top n% rows of each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of row for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe only has 3 0 indices and 2 1 indices, in each case half the number in the original dataframe.
Here is another option which builds on some of the answers in the post you mentioned
First of all here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long then we would try to take 2.4 rows. So we will need to either round up or down.
My preferred option is to round up. This is because, for eaxample, if we were to take 50% of the rows, but had one group which only had one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish
def round_func(x, up=True):
'''Function to round up or round down a float'''
if up:
return int(x+1)
else:
return int(x)
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. Everything follows and I have commented it so that hopefully you can follow.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30 # top fraction to keep. Currently set to 80%
df_top = df.groupby('id').apply( # group by the ids
lambda x: x.reset_index()['value'].nlargest( # in each group take the top rows by column 'value'
round_func(x.count().max()*p))) # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1

Taking the average of values from a range of rows in a pandas dataframe

I have a pandas dataframe with a column called 'coverage'. For a series of specific index values, I'd like to get the mean 'coverage' value for the 100 prior rows. For example, for index position 1001, I want the mean 'coverage' for rows 901-1000. My index values of interest are in a separate list.
I'm stumped on how to tell pandas to look at a series of rows relative to a given index. I don't think I can use GroupBy, since there will be some groups of rows that overlap (for example, suppose my list of index values of interest includes 1001 and 1050).
If anyone can point me in the right direction, I'd be very grateful!
pandas.rolling_mean seems like a good candidate for your problem
For instance:
In [9]: pandas.rolling_mean(pandas.Series(range(10)), window=2)
Out[9]:
0 NaN
1 0.5
2 1.5
3 2.5
4 3.5
5 4.5
6 5.5
7 6.5
8 7.5
9 8.5
dtype: float64

Categories

Resources