Column by column pairplotting of 2 dataframes - python

I want to be able to plot two dataframes against each other pairing each column successively (but not all columns against all columns). The dataframes are identical in size and column headers but differ in the values. So the dataframes are of the form:
df_X =
A B C
0 1 1 1
1 2 2 2
...
df_Y =
A B C
0 3 3 3
1 4 4 4
...
At the moment I can do this manually on subplots, starting from a merged dataframe with two header levels:
df_merge =
col A B C
X Y X Y X Y
0 1 3 1 3 1 3
1 2 4 2 4 2 4
...
_, ax = plt.subplots(3, 1)
for i in range(3):
    ax[i].scatter(df_merge[col[i]]['X'], df_merge[col[i]]['Y'])
This works, but I am wondering if there is a better way of achieving this, particularly when it comes to calculating the numerical correlation between the pairs, which would involve another loop and several more lines of code.

You can get the correlation with something like this (using flattened column names such as A_X and A_Y):
df_merge[["{}_X".format(col[i]), "{}_Y".format(col[i])]].corr()
You can generally assume that most statistical functions can be applied to dataframe content in a single line, either with built-in Pandas functions (https://pandas.pydata.org/docs/user_guide/computation.html) or with scipy/numpy functions.
To title each plot with the correlation, for example, you can do:
thisAX.set_title("Corr: {}".format(df_merge[["{}_X".format(col[i]), "{}_Y".format(col[i])]].corr()))
(I flattened your column names to make display a bit simpler, and I reversed one of the number pairs to show negative correlation)
Note: calling .corr() on a two-column selection returns a dataframe (the 2x2 correlation matrix). To get the X:Y correlation, pick out a single value with .corr()["{}_X".format(col[i])]["{}_Y".format(col[i])] (those are just the column and index names of the correlation matrix).
Here's a lightly styled version of the same plot (again, using the flattened version of your column names)
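As a possible shortcut for the correlation part, pandas has DataFrame.corrwith, which correlates same-named columns of two dataframes without a merge or a loop. A minimal sketch, assuming the original unmerged df_X and df_Y and using made-up values (column C is reversed to give a negative correlation):

```python
import pandas as pd

# Made-up sample data; column C of df_Y is reversed to show a negative correlation
df_X = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [1, 2, 3, 4]})
df_Y = pd.DataFrame({"A": [3, 4, 5, 7], "B": [3, 4, 6, 5], "C": [7, 5, 4, 3]})

# Pearson correlation between matching columns (A with A, B with B, C with C)
corrs = df_X.corrwith(df_Y)
print(corrs)
```

Inside the plotting loop, corrs[col[i]] could then go straight into the subplot title, for example with "Corr: {:.2f}".format(corrs[col[i]]).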

Related

Calculate quantile for each observation in a dataframe

I am new to Python and I have the following dataframe structure:
data = {'name': ["a","b","c","d","e","f","g","h"], 'value1': [1,2,3,4,5,6,7,8],'value2': [1,2,3,4,5,6,7,8]}
data = pd.DataFrame.from_dict(data)
data = data.transpose()
What I want to calculate is a new dataframe, where for each row, each column has a value corresponding to the quantile in the data.
In other words, I am trying to understand how to apply the function pd.quantile to return a dataframe with each entry being equal to the quantile value of the column in the row.
I tried the following, but I don't think it works:
x.quantile(q = 0.9,axis =0)
or:
x.apply(quantile,axis=0)
Many thanks in advance.
This is because you transpose your data. As per the pandas documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html):
"When the DataFrame has mixed dtypes, we get a transposed DataFrame with the object dtype."
Your dataframe after loading looks like below, which means it has 'mixed dtypes' (one column is object / category and the other two are integers).
name value1 value2
0 a 1 1
1 b 2 2
2 c 3 3
3 d 4 4
4 e 5 5
5 f 6 6
6 g 7 7
7 h 8 8
When you transpose this data, everything is converted to object dtype, so the quantile function no longer treats the values as numbers.
Try removing the transposing step and use the axis argument to choose the direction in which the quantiles are calculated.
By the way, you can do transposition with:
df = df.T
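A minimal sketch of the suggestion above, using the data from the question without the transpose step. Moving name into the index and the rank(pct=True) call are assumptions about what was intended: the first keeps only numeric columns, the second gives each observation's percentile rank within its column.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "value1": [1, 2, 3, 4, 5, 6, 7, 8],
    "value2": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Keep the labels in the index so that only numeric columns remain
numeric = data.set_index("name")

# 0.9 quantile of each column (axis=0 computes down the rows)
q90 = numeric.quantile(q=0.9, axis=0)

# Assumed intent: percentile rank of every observation within its column
pct_rank = numeric.rank(pct=True)

print(q90)
print(pct_rank)
```

If a single quantile per column is all that is needed, q90 is enough; pct_rank is only relevant if every entry should be replaced by its position in the column's distribution.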

How to drop dataframe columns using both, a list and not from a list?

I am trying to drop pandas columns in the following way. I have a list of columns to drop, and this list will be used many times in my notebook. I also have columns that are only referenced once.
drop_cols=['var1','var2']
df = df.drop(columns={'var0',drop_cols})
So basically, I want to drop all columns from the list drop_cols plus a hard-coded "var0" column in one go. This gives an error. How do I resolve it?
df = df.drop(columns=drop_cols+['var0'])
From what I gather, you have a set of columns you wish to drop from several different dataframes, plus one extra column to drop from a particular dataframe. The command you used is close, but {'var0', drop_cols} tries to build a set that contains a list, which raises TypeError: unhashable type: 'list'. It does not concatenate the names. This is how I would approach the problem.
Given a Dataframe of the form:
V0 V1 V2 V3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
define a function to merge the column names:
def mergeNames(spc_col, multi_cols):
    # Start with the single column, then append the shared list
    rslt = [spc_col]
    rslt.extend(multi_cols)
    return rslt
Then with
drop_cols = ['V1', 'V2']
df.drop(columns=mergeNames('V0', drop_cols))
yields:
V3
0 4
1 8
2 12
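The same result can be sketched without a helper function by unpacking the shared list into a new list. The dataframe below just reproduces the example above; result is a name chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "V0": [1, 5, 9],
    "V1": [2, 6, 10],
    "V2": [3, 7, 11],
    "V3": [4, 8, 12],
})
drop_cols = ["V1", "V2"]

# Build one flat list: the one-off column plus the reusable list
result = df.drop(columns=["V0", *drop_cols])
print(result)
```

Because drop returns a new dataframe, drop_cols and df are left unchanged and the list can be reused elsewhere in the notebook.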

How to merge and resample two messy datasets with pandas

I have two drilling datasets with depth ranges and variables that I’d like to resample and merge together.
Dataset 1 has ranges of depth, for example 2m to 3m, with variables for each range. I have taken these ranges and exploded them out to individual intervals using pandas df.explode
Dataset 1:
Depth_From Depth_To Variable_1
0 1 x
2 3 x
4 5 x
Becomes this:
Depth_Expl Variable_1
0 x
1 x
2 x
3 x
...
The second data set has similar ranges but they are not in depth order like the first dataset, and the depth ranges also overlap in some cases.
I'd like to reorganize these depths from lowest to highest and explode them similarly to the previous dataset. Where overlapping ranges give more than one value for the same depth, I'd like to take the mean so that there is one result per 1m depth interval. I'm not sure how to go about this.
Dataset 2:
Depth_From Depth_To Variable_2
3 6 x
0 2 x
2 3 x
7 8 x
Overall I am trying to reshape and merge the two datasets to look like this:
Depth_Expl Variable_1 Variable_2
0 x x
1 x x
2 x x
3 x x
Where each of the datasets are resampled on 1m basis with 1 answer for each variable. Any pointers would be appreciated.
According to your expected output, I guess you want to:
Collapse the Depth_From and Depth_To columns into a single column called Depth_Expl
Combine the two dataframes based on the Depth_Expl column
If so, you can use pd.melt() instead of df.explode and use pd.merge() to combine the tables.
Try this:
# Collapse Depth_From and Depth_To columns
df1 = pd.melt(df1, id_vars = 'Variable_1', var_name = 'col_names', value_name='Depth_Expl').drop(columns=['col_names'])
df2 = pd.melt(df2, id_vars = 'Variable_2', var_name = 'col_names', value_name='Depth_Expl').drop(columns=['col_names'])
# Combine two dataframes
df_merge = pd.merge(df1, df2, on='Depth_Expl', how='outer').sort_values('Depth_Expl')
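Note that melt keeps only the endpoints of each range. If every 1m depth is needed and overlapping ranges should be averaged, as the question describes, one possible sketch is to explode each range and then group by depth. The numeric values, the helper name explode_depths, and the assumption that depths are integers with both endpoints included are all illustrative, not taken from the original data.

```python
import pandas as pd

# Made-up numeric values stand in for the "x" placeholders
df1 = pd.DataFrame({"Depth_From": [0, 2, 4], "Depth_To": [1, 3, 5],
                    "Variable_1": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"Depth_From": [3, 0, 2, 7], "Depth_To": [6, 2, 3, 8],
                    "Variable_2": [10.0, 20.0, 30.0, 40.0]})

def explode_depths(df, value_col):
    """Hypothetical helper: one row per 1 m depth, overlapping values averaged."""
    out = df.copy()
    # Assumes integer depths and that both endpoints belong to the range
    out["Depth_Expl"] = pd.Series(
        [list(range(lo, hi + 1)) for lo, hi in zip(out["Depth_From"], out["Depth_To"])],
        index=out.index,
    )
    out = out.explode("Depth_Expl", ignore_index=True)
    out["Depth_Expl"] = out["Depth_Expl"].astype(int)
    # groupby sorts by depth and averages values where ranges overlap
    return out.groupby("Depth_Expl", as_index=False)[value_col].mean()

merged = pd.merge(explode_depths(df1, "Variable_1"),
                  explode_depths(df2, "Variable_2"),
                  on="Depth_Expl", how="outer").sort_values("Depth_Expl")
print(merged)
```

With how="outer", depths covered by only one dataset are kept and the other variable is left as NaN; how="inner" would keep only the depths present in both.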

How to visualize multi-index-ed data in orange?

I am using the pandas library in Python to generate multi-indexed data, i.e. the columns are multi-indexed. The index levels are category and source. I save this data as a .csv file. In the file, the first row holds the category values and the second row holds the corresponding source values; the data follows. I use this file for visualization in Orange3, but it takes only the first row as the column name. How do I make it use the combination of the two as the column name?
I am just trying to visualize the whole thing as a histogram, if possible.
Since there are effectively 2 (category and source) + 1 (the row label) variables, a 3D visualization would be best; alternatively, a 2D visualisation with 1 (category and source combined) + 1 (the row label) variables.
category 1 1 1 1 1 2 2
source a b c d e f g
label
l1 1 2 3 4 5 6 7
l2 4 5 6 7 8 9 10
According to the documentation, Orange does not support reading multi-indexed data.
In order to visualize the data, you will need to convert it to a normal tabular format (one column per feature) before exporting the data to csv.
One way to do it is the DataFrame's unstack method:
df.unstack().to_csv("file.csv")
This will produce the file in the following format:
category source label
1 a l1 1
1 a l2 4
1 b l1 2
...
This way, you can use category and source as separate variables in Orange.
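To join category and source, you need to flatten the hierarchical index before exporting to csv (map(str, ...) is needed when a level holds non-string values, such as the numeric category here):
df.columns = [' '.join(map(str, col)).strip() for col in df.columns.values]
df.to_csv("file.csv")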
This will produce the data in the following format:
label 1 a 1 b ...
l1 1 2
l2 4 5
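Both conversions can be tried on the sample table from the question. This is only a sketch: the frame is rebuilt by hand, and the names long_df, wide and the "value" column are chosen for illustration.

```python
import pandas as pd

# Rebuild the sample table: two column levels (category, source), rows labelled l1/l2
cols = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 1, 2, 2], list("abcdefg")],
    names=["category", "source"],
)
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7], [4, 5, 6, 7, 8, 9, 10]],
    index=pd.Index(["l1", "l2"], name="label"),
    columns=cols,
)

# Option 1: long format, one row per (category, source, label) combination
long_df = df.unstack().rename("value").reset_index()

# Option 2: keep the wide layout but join the two levels into one header
wide = df.copy()
wide.columns = [" ".join(map(str, col)).strip() for col in wide.columns.values]

print(long_df.head())
print(wide)
```

Either long_df.to_csv("file.csv", index=False) or wide.to_csv("file.csv") then writes a file with a single header row.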

Dividing two columns of an unstacked dataframe

I have two columns in a pandas dataframe.
Column 1 is ed and contains strings (e.g. 'a','a','b','c','c','a')
ed column = ['a','a','b','c','c','a']
Column 2 is job and also contains strings (e.g. 'aa','bb','aa','aa','bb','cc')
job column = ['aa','bb','aa','aa','bb','cc'] #these are example values from column 2 of my pandas data frame
I then generate a two column frequency table like this:
my_counts= pdata.groupby(['ed','job']).size().unstack().fillna(0)
Now how do I then divide the frequencies in one column by the frequencies in another column of that frequency table? I want to take that ratio and use it to argsort() so that I can sort by the calculated ratio but I don't know how to reference each column of the resulting table.
I initialized the data as follows:
ed_col = ['a','a','b','c','c','a']
job_col = ['aa','bb','aa','aa','bb','cc']
pdata = pd.DataFrame({'ed':ed_col, 'job':job_col})
my_counts= pdata.groupby(['ed','job']).size().unstack().fillna(0)
Now my_counts looks like this:
job aa bb cc
ed
a 1 1 1
b 1 0 0
c 1 1 0
To access a column, you could use my_counts.aa or my_counts['aa'].
To access a row, you could use my_counts.loc['a'].
So the frequencies of aa divided by bb are my_counts['aa'] / my_counts['bb']
and now, if you want to get it sorted, you can do:
my_counts.iloc[(my_counts['aa'] / my_counts['bb']).argsort()]
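As an alternative sketch, the ratio can be stored as its own column and the table sorted with sort_values. Note that rows where bb is 0 produce an infinite ratio, which sorts last in ascending order. The names ratio and sorted_counts are chosen for illustration.

```python
import pandas as pd

pdata = pd.DataFrame({
    "ed": ["a", "a", "b", "c", "c", "a"],
    "job": ["aa", "bb", "aa", "aa", "bb", "cc"],
})
my_counts = pdata.groupby(["ed", "job"]).size().unstack().fillna(0)

# Element-wise ratio of the two frequency columns; 1 / 0 becomes inf
ratio = my_counts["aa"] / my_counts["bb"]

# Keep the ratio alongside the counts and sort by it
sorted_counts = my_counts.assign(ratio=ratio).sort_values("ratio")
print(sorted_counts)
```

Keeping the ratio as a column makes it easy to inspect the sort key, and it can be dropped again with sorted_counts.drop(columns="ratio") if it is not wanted.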
