How to merge and resample two messy datasets with pandas - python

I have two drilling datasets with depth ranges and variables that I'd like to resample and merge together.
Dataset 1 has ranges of depth, for example 2m to 3m, with variables for each range. I have taken these ranges and exploded them out to individual 1m intervals using pandas df.explode.
Dataset 1:
Depth_From Depth_To Variable_1
0 1 x
2 3 x
4 5 x
Becomes this:
Depth_Expl Variable_1
0 x
1 x
2 x
3 x
...
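For reference, a sketch of how that explode step might look (the question doesn't show the exact code; this assumes integer depth bounds with endpoints inclusive):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Depth_From': [0, 2, 4], 'Depth_To': [1, 3, 5], 'Variable_1': ['x', 'x', 'x']})
# Build the list of 1m depths covered by each range, then explode to one row per depth
df1['Depth_Expl'] = [np.arange(f, t + 1) for f, t in zip(df1['Depth_From'], df1['Depth_To'])]
df1 = df1.explode('Depth_Expl')[['Depth_Expl', 'Variable_1']]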
The second dataset has similar ranges, but they are not in depth order like the first dataset, and the depth ranges also overlap in some cases.
I'd like to reorder these depths from lowest to highest and explode them like the previous dataset. Where ranges overlap, I'd like to take the mean of the overlapping values so there is one result per variable for each single 1m depth interval. I'm not sure how to go about this.
Dataset 2:
Depth_From Depth_To Variable_2
3 6 x
0 2 x
2 3 x
7 8 x
Overall I am trying to reshape and merge the two datasets to look like this:
Depth_Expl Variable_1 Variable_2
0 x x
1 x x
2 x x
3 x x
Where each dataset is resampled on a 1m basis with one value for each variable. Any pointers would be appreciated.

Based on your expected output, I guess you want to:
Collapse the Depth_From and Depth_To columns into a single column called Depth_Expl
Combine the two dataframes based on the Depth_Expl column
If so, you can use pd.melt() instead of pd.explode and use pd.merge() to combine the tables.
Try this:
# Collapse the Depth_From and Depth_To columns into a single Depth_Expl column
df1 = pd.melt(df1, id_vars='Variable_1', var_name='col_names', value_name='Depth_Expl').drop(columns=['col_names'])
df2 = pd.melt(df2, id_vars='Variable_2', var_name='col_names', value_name='Depth_Expl').drop(columns=['col_names'])
# Combine the two dataframes
df_merge = pd.merge(df1, df2, on='Depth_Expl', how='outer').sort_values('Depth_Expl')
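If you also need the full 1m resampling with overlap averaging described in the question, one possible sketch (instead of melting df2; this assumes integer depth bounds and a numeric Variable_2 so the mean is defined) is to expand each range, explode, and average per depth before merging:
import numpy as np
import pandas as pd

# Expand each [Depth_From, Depth_To] range into individual 1m depths (endpoints inclusive)
df2['Depth_Expl'] = [np.arange(f, t + 1) for f, t in zip(df2['Depth_From'], df2['Depth_To'])]
df2 = df2.explode('Depth_Expl').drop(columns=['Depth_From', 'Depth_To'])

# Average overlapping ranges so each depth appears exactly once
df2 = df2.groupby('Depth_Expl', as_index=False)['Variable_2'].mean()

# Merge with the exploded first dataset
df_merge = pd.merge(df1, df2, on='Depth_Expl', how='outer').sort_values('Depth_Expl')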

Related

Column by column pairplotting of 2 dataframes

I want to be able to plot two dataframes against each other, pairing each column successively (but not all columns against all columns). The dataframes are identical in size and column headers but differ in their values. So the dataframes are of the form:
df_X =
A B C
0 1 1 1
1 2 2 2
...
df_Y =
A B C
0 3 3 3
1 4 4 4
...
At the moment I can do this manually on subplots by starting with a merged dataframe with a two-level column header:
df_merge =
col  A     B     C
     X  Y  X  Y  X  Y
0    1  3  1  3  1  3
1    2  4  2  4  2  4
...
col = ['A', 'B', 'C']
_, ax = plt.subplots(3, 1)
for i in range(3):
    ax[i].scatter(df_merge[col[i]]['X'], df_merge[col[i]]['Y'])
This works, but I am wondering if there is a better way of achieving this, particularly when then calculating the numerical correlation value between the pairs, which would again involve another loop and several more lines of code.
You can get the correlation with something like:
df_merge[["{}_X".format(col[i]), "{}_Y".format(col[i])]].corr()
You can generally assume that most statistical functions can be applied to dataframe content in a single line, either with built-in Pandas functions (https://pandas.pydata.org/docs/user_guide/computation.html) or with scipy/numpy functions.
To title each plot with the correlation, for example, you can do:
ax[i].set_title("Corr: {}".format(df_merge[["{}_X".format(col[i]), "{}_Y".format(col[i])]].corr()))
(I flattened your column names to make display a bit simpler, and I reversed one of the number pairs to show negative correlation)
Note: when feeding two Pandas columns (Series) into .corr(), you get a dataframe back. To pick out the single X:Y correlation value, use .corr()["{}_X".format(col[i])]["{}_Y".format(col[i])] (those are just the column and index names of the correlation matrix).
Here's a lightly styled version of the same plot (again, using the flattened version of your column names)
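A minimal sketch of that loop, assuming the column names have been flattened to the form A_X, A_Y, etc. (the exact styling from the original answer is not shown here):
import matplotlib.pyplot as plt

col = ['A', 'B', 'C']
_, ax = plt.subplots(len(col), 1)
for i in range(len(col)):
    x_name, y_name = "{}_X".format(col[i]), "{}_Y".format(col[i])
    ax[i].scatter(df_merge[x_name], df_merge[y_name])
    # .corr() returns a matrix; pick out the single X:Y value for the title
    ax[i].set_title("Corr: {:.2f}".format(df_merge[[x_name, y_name]].corr()[x_name][y_name]))
plt.tight_layout()
plt.show()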

Copying (assembling) a column from smaller data frames into a bigger data frame with pandas

I have a data frame with measurements for several groups of participants, and I am doing some calculations for each group. I want to add a column to a big data frame (all participants) from secondary data frames (each a partial list of participants).
When I merge a couple of times (merging a new data frame into the existing one), it creates duplicate columns instead of a single column.
As the sizes of the dataframes are different, I cannot compare them directly.
I tried
# df1 - main, bigger dataframe; df2 - smaller dataframe containing a group of df1
for i in range(len(df1)):
    # checking indices to place the data with the correct participant:
    if df1.index[i] not in df2['index']:
        pass
    else:
        df1['rate'][i] = list(df2['rate'][df2['index'] == i])
It does not work properly though. Can you please help with the correct way of assembling the column?
Update: where the index of the initial dataframe matches the "index" column of the calculated dataframe, copy the rate value from the calculation into the main df.
main dataframe df1
index  rate
1      0
2      0
3      0
4      0
5      0
6      0
dataframe with calculated values
index  rate
1      value
4      value
6      value
output df
index  rate
1      value
2      0
3      0
4      value
5      0
6      value
Try this, using .join() to merge the dataframes on their indices and .combine_first() to combine the two columns:
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
EDIT:
This assumes both dataframes use a matching index. If that is not the case for df2, run this first:
df2 = df2.set_index('index')
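Put together, a runnable sketch on the question's sample data (the numeric rates here are hypothetical stand-ins for the question's "value" placeholders):
import pandas as pd

# Main dataframe: one row per participant, rate initialised to 0
df1 = pd.DataFrame({'rate': [0, 0, 0, 0, 0, 0]}, index=[1, 2, 3, 4, 5, 6])

# Calculated values for a subset of participants, with an explicit 'index' column
df2 = pd.DataFrame({'index': [1, 4, 6], 'rate': [0.1, 0.4, 0.6]})
df2 = df2.set_index('index')

# Align on the index; take df2's rate where present, otherwise keep df1's
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
df = df.drop(columns=["rate_df1", "rate_df2"])
print(df)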

Appending two dataframes with multiindex rows?

I have two dataframes:
The first one looks like this:
                variable
entry subentry
0     1         X
      2         Y
      3         Z
and the second one looks like:
                variable
entry subentry
0     1         A
      2         B
I would like to merge the two dataframes such that I get:
                variable
entry subentry
0     1         X
      2         Y
      3         Z
1     1         A
      2         B
Simply using df1.append(df2, ignore_index=True) gives
  variable
0        X
1        Y
2        Z
3        A
4        B
In other words, it collapses the multiindex into a single index. Is there a way around this?
Edit: Here is a code snippet that will reproduce the problem:
import numpy as np
import pandas as pd

arrays = [
    np.array([0, 0, 0]),
    np.array([0, 1, 2]),
]
arrays_2 = [
    np.array([0, 0]),
    np.array([0, 1]),
]
df1 = pd.DataFrame(np.random.randn(3, 1), index=arrays)
df2 = pd.DataFrame(np.random.randn(2, 1), index=arrays_2)
df = df1.append(df2, ignore_index=True)
print(df)
Edit: In practice, I am looking to combine N dataframes, each with a different number of "entry" rows. So I am looking for an approach that does not rely on me knowing the exact shape of the dataframes I am combining.
One way to try:
pd.concat([df1, df2], keys=[0,1]).droplevel(1)
Output:
            0
0 0 -0.439749
  1 -0.478744
  2  0.719870
1 0 -1.055648
  1 -2.007242
Use pd.concat to concatenate the dataframes and, since entry is the same in both, use the keys parameter to create a new outer level with the naming you want. Finally, go back and drop the old index level (where the value was the same).
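Since the edit asks about N dataframes, the same idea generalizes; a sketch assuming a list dfs of dataframes that all share a constant outer "entry" level:
import pandas as pd

dfs = [df1, df2]  # any number of dataframes

# Re-number the outer level 0..N-1 with keys, then drop the old constant level
df = pd.concat(dfs, keys=range(len(dfs))).droplevel(1)
print(df)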

Merging columns based on the percentage column

I have a dataframe that has numerical and categorical values. Essentially, I am trying to merge rows based on a specific criterion: accumulating down the rows, once the percentage column reaches 100%, merge those rows into one. The numerical columns are averaged and the categorical values are listed.
I am here for ideas on how to tackle the problem in the most efficient way possible, preferably in Python.
Here is what the dataframe looks like:
x   y   z    a   %
3   8   lem  or  0.5
7   9   lem  or  0.5
5   10  lem  or  0.3
5   9   or   or  0.7
10  8   or   or  1
This is what the final dataframe would look like:
x   y    z         a       %
5   8.5  lem, lem  or, or  1
5   9.5  lem, or   or, or  1
10  8    or        or      1
IIUC, let's try:
s = df['%'].cumsum()
grp = s.where(s.mod(1).eq(0)).bfill()
df.groupby(grp, as_index=False).agg({'x': 'mean',
                                     'y': 'mean',
                                     'z': ", ".join,
                                     'a': ", ".join,
                                     '%': 'sum'})
Output:
    x    y         z       a    %
0   5  8.5  lem, lem  or, or  1.0
1   5  9.5   lem, or  or, or  1.0
2  10  8.0        or      or  1.0
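For intuition, here is how the grouping key is built on the sample data (a walk-through of the answer's two lines, not part of the original):
s = df['%'].cumsum()                   # 0.5, 1.0, 1.3, 2.0, 3.0
s.mod(1).eq(0)                         # False, True, False, True, True
s.where(s.mod(1).eq(0))                # NaN, 1.0, NaN, 2.0, 3.0
grp = s.where(s.mod(1).eq(0)).bfill()  # 1.0, 1.0, 2.0, 2.0, 3.0
Rows sharing the same grp value are then merged by the groupby/agg above.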

Overwrite slice of multi-index dataframe with series

I have a multi-index dataframe and want to set a slice of one of its columns equal to a series, aligned according to the matching indices of the column slice and the series. The column slice's innermost index and the series' index are identical, except for their ordering (sorting); see the example below.
I can do this by first sorting the series' index according to the column's index and then assigning series.values (see below), but this feels like a workaround, and I was wondering if it's possible to directly assign the series to the column slice.
example:
import pandas as pd

multi_index = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])
df = pd.DataFrame(0, multi_index, ['p', 'q'])
s1 = pd.Series([1, 2], ['y', 'x'])
df.loc['a', 'p'] = s1[df.loc['a', 'p'].index].values
The code above gives the desired output, but I was wondering if the last line could be done more simply, e.g.:
df.loc['a', 'p'] = s1
but this sets the column slice to NaNs.
Desired output:
     p  q
a x  2  0
  y  1  0
b x  0  0
  y  0  0
Obtained output from df.loc['a', 'p'] = s1:
       p  q
a x  NaN  0
  y  NaN  0
b x  0.0  0
  y  0.0  0
It seems like a simple issue to me but I haven't been able to find the answer anywhere.
Have you tried something like this?
df.loc['a']['p'] = s1
The resulting df is:
     p  q
a x  2  0
  y  1  0
b x  0  0
  y  0  0
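Note that df.loc['a']['p'] = s1 is chained indexing, which pandas may flag with a SettingWithCopyWarning and which can fail to write through to the original dataframe depending on the version. A single-.loc sketch that keeps the alignment explicit (essentially the question's own workaround):
# Reorder the series to the slice's inner index, then assign the raw values
df.loc['a', 'p'] = s1.reindex(df.loc['a', 'p'].index).values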
