I have a bunch of data that was generated by a nested loop. The innermost loop just keeps increasing each pass, but the outer-loop only increments periodically. I want to plot the data vs. both the counters. The innermost counter works fine as the primary X-axis, but I cannot figure out how to get the second counter on my plot, making sure it shows up aligned correctly with the primary X-axis. I have tried using the secondary_xaxis function, defining transforms that just do an inverse lookup on my dataframe. Unfortunately I get an error that the transform functions are not the same length (ValueError: 'Lenghts must match to compare')
ctr A
ctr B
val1
val2
1
1
-3.22E-03
-0.001010008
1
2
-3.21E-03
-0.002629743
1
3
-3.21E-03
-0.002210752
2
4
-3.21E-03
-0.002210752
2
5
-5.86E-03
-0.004594075
3
6
-0.003212838
-0.002210758
3
7
-0.003645778
-0.002577823
3
8
0.000129821
0.000223856
3
9
-6.06E-06
2.77E-05
4
10
6.05E-07
2.23E-05
def getCtrSub( x ):
return df.loc[ 'ctrA' == x, 'ctrB' ].iloc[0]
def getCtrTop( x ):
return df.loc[ 'ctrB' == x, 'ctrA' ].iloc[0]
figInd = plt.figure(); axInd=figInd.gca();
axInd.plot( ctrB, valA, label=ind )
axInd2 = axInd.secondary_xaxis('bottom', functions=(getCtrTop, getCtrSub) )
plt.show()
Unfortunately I can't make a picture exactly of what I want, but text-wise, the x-Axis would look something like this:
1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10
1 --------- 2 ------3 ---------------4
Any help would be greatly appreciated.
I'm trying to calculate the centre of mass of 20 objects, where each object has it's own different mass.
These objects are represented in a dataframe cm_x, and their associated masses in a list. Below I show an example of just 3 of those 20 objects, for the sake of saving space. Each object has an x, y, z coordinate, but I'll just show the x and then I can apply the same technique to the rest. Below is the head of the dataframe.
bar_head_x bar_hip_centre_x bar_left_ankle_x
0 -203.3502 -195.4573 -293.262
1 -203.4280 -195.4720 -293.251
2 -203.4954 -195.4675 -293.248
3 -203.5022 -195.9193 -293.219
4 -203.5014 -195.9092 -293.328
m_head = 0.081
m_hipc = 0.139
m_lank = 0.0465
m = [m_head,m_hipc,m_lank]
I saw in another similar question, someone has suggested this method, however this doesn't incorporate the masses, and that is where I'm having an issue:
def series_sum(pd_series):
return np.sum(np.dot(pd_series.values, np.asarray(range(1, len(pd_series)+1)))/np.sum(pd_series))
cm_x.apply(series_sum, axis=1)
Basically I want for each row, to have an associated centre of mass, using the formula for centre of mass which is sum(x_i * m_i) / sum(m_i).
The desired result would be a new column in the dataframe like so:
cm_x
0 -214.92
1 ...
2 ...
3 ...
4 ...
Any help?
If I understand correctly, you can compute the desired column like this:
>>> df.mul(m).sum(axis=1)/sum(m)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
Use DataFrame.dot and divide by sum of list m:
s = df.dot(m).div(sum(m))
print (s)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
dtype: float64
If need DataFrame add Series.to_frame:
df1 = df.dot(m).div(sum(m)).to_frame('cm_x')
print (df1)
cm_x
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
I have dataset with 2 columns and I would like to show the variation of one feature according to the binary output value
data
id Column1 output
1 15 0
2 80 1
3 120 1
4 20 0
... ... ...
I would like to drop a plot with python where x-axis contains values of Column1 and y-axis contains the percent of getting positive values.
I know already that the form of my plot have the form of exponontial function where when column1 has smaller numbers I will get more positive output then when it have long values
exponential plot maybe need two list like this
try this
import matplotlib.pyplot as plt
# x-axis points list
xL = [5,10,15,20,25,30]
# y-axis points list
yL = [100,50,25,12,10]
plt.plot(xL, yL)
plt.axis([0, 35, 0, 200])
plt.show()
I've a scientist dataframe
radius date spin atom
0 12,50 YYYY/MM 0 he
1 11,23 YYYY/MM 2 c
2 45,2 YYYY/MM 1 z
3 11,1 YYYY/MM 1 p
I want select for each row, all rows where the difference between the radius is under, for exemple 5
I've define a function to calc (simple,it's an example):
def diff_radius (a,b)
return a-b
Is-it possible for each rows to find some rows which check the condition in calling an external function?
I try some way, not working:
for i in range(df.shape[0]):
....
df_in_radius=df.apply(lambda x : diff_radius(df[i]['radius'],x['radius']))
Can you help me?
I am assuming that the datatype of the radius column is a tuple. You can keep the diff_radius method like
def diff_radius(x):
a, b = x
return a-b
Then, you can use loc method in pandas to select the rows which matches the condition of radius differece less than 5.
df.loc[df.radius.apply(diff_radius) < 5]
Edit #1
If the datatype of the radius column is a string, then split them and typecast. The logic will go in the diff_radius method. In case of string
def diff_radius(x):
x_split = x.split(',')
a,b = int(x_split[0]), int(x_split[-1])
return a-b
I misspoke.
My dataframe is :
radius of my atom date spin atom
0 12.50 YYYY/MM 0 he
1 11.23 YYYY/MM 2 c
2 45.2 YYYY/MM 1 z
3 11.1 YYYY/MM 1 p
I do a loop , to apply on one row a special calcul of each row whose respond condition.
Example:
def diff_radius(current_row,x):
current_row['radius']-x['radius']
return a-b
df=pd.read_csv(csvfile,delimiter=";",names=('radius','date','spin','atom'))
# for each row of original dataframe
for i in range(df.shape[0]):
# first build a new and tmp dataframe with row
# which have a radius less 5 than df.iloc[i]['radius] (level of loop)
df_tmp=df[diff_radius(df.iloc[i]['radius],df['radius']) <5]
....
# start of special calc, with the df_tmp which contains all of rows
# less 5 than the current row **(i)**
I thank you sincerely for your answers
I intend to plot multiple columns in a pandas dataframe, all grouped by another column using groupby inside seaborn.boxplot. There is a nice answer here, for a similar problem in matplotlib matplotlib: Group boxplots but given the fact that seaborn.boxplot comes with groupby option I thought it could be much easier to do this in seaborn.
Here we go with a reproducible example that fails:
import seaborn as sns
import pandas as pd
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
[10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
columns=['a1', 'a2', 'a3', 'a4', 'b'])
# display(df)
a1 a2 a3 a4 b
0 2 4 5 6 1
1 4 5 6 7 2
2 5 4 5 5 1
3 10 4 7 8 2
4 9 3 4 6 2
5 3 3 4 4 1
#Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)
What I get is something that completely ignores groupby option:
Whereas if I do this with one column it works thanks to another SO question Seaborn groupby pandas Series :
sns.boxplot(df.a1, groupby=df.b)
So I would like to get all my columns in one plot (all columns come in a similar scale).
EDIT:
The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.
As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box..
However, you can accomplish what I think you're hoping for with the factorplot function, using kind="box". But, you'll first have to "melt" the sample dataframe into what is called long-form or "tidy" format where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
Then it's very simple to plot:
sns.factorplot("a", hue="b", y="c", data=df_long, kind="box")
You can use directly boxplot (I imagine when the question was asked, that was not possible, but with seaborn version > 0.6 it is).
As explained by #mwaskom, you have to "melt" the sample dataframe into its "long-form" where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
# display(df_long.head())
b a c
0 1 a1 2
1 2 a1 4
2 1 a1 5
3 2 a1 10
4 2 a1 9
Then you just plot it:
sns.boxplot(x="a", hue="b", y="c", data=df_long)
Seaborn's groupby function takes Series not DataFrames, that's why it's not working.
As a work around, you can do this :
fig, ax = plt.subplots(1,2, sharey=True)
for i, grp in enumerate(df.filter(regex="a").groupby(by=df.b)):
sns.boxplot(grp[1], ax=ax[i])
it gives :
Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]
a1 a2 a3 a4
0 2 4 5 6
1 4 5 6 7
2 5 4 5 5
3 10 4 7 8
4 9 3 4 6
5 3 3 4 4
Hope this helps
It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.
Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.
g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')
It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.
output_graph
Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):
combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)
if you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
cluster_data_df.set_axis(column_names, axis='columns', inplace=True)
graph_data: DataFrame = pd.melt(
frame=cluster_data_df,
id_vars=['cluster'],
# value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
# value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6']
var_name='psychometric_test',
value_name='standard deviations from the mean'
)
The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):
index
cluster
psychometric_tst
standard deviations from the mean
0
0.0
outcome_var_1
-1.276182
1
0.0
outcome_var_1
-1.118813
2
0.0
outcome_var_1
-1.276182
9754
0.0
outcome_var_6
0.892548
9755
0.0
outcome_var_6
1.420480
If you want to use indices with melt:
graph_data: DataFrame = pd.melt(
frame=cluster_data_df,
id_vars=cluster_data_df.columns[-1],
# value_vars=cluster_data_df.columns[:-1],
var_name='psychometric_test',
value_name='standard deviations from the mean'
)
And here's the graphing code:
(Done with column headings - just note that y-axis=value_name, x-axis = var_name, hue = id_vars):
# plot graph grouped by cluster
sns.set_theme(style="ticks")
fig = plt.figure(figsize=(10, 10))
fig.set(font_scale=1.2)
fig.set_style("white")
# create boxplot
fig.ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
data=graph_data)
# set box alpha:
for patch in fig.ax.artists:
r, g, b, a = patch.get_facecolor()
patch.set_facecolor((r, g, b, .2))
# create scatterplot
fig.ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
dodge=True, alpha=.25, zorder=1)
# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes # custom method
legend_labels: List[str] = []
while i < cluster_n:
label: str = f"cluster {i+1}, n = {cluster_info[i]}"
legend_labels.append(label)
i += 1
if -1 in cluster_info.keys():
cluster_n += 1
label: str = f"Unclustered, n = {cluster_info[-1]}"
legend_labels.insert(0, label)
## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = fig.ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()
asds
Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.
List item