Errorbar plot for Likert scale confidence values - python

I have the following dataset, for 36 fragments in total (36 rows × 3 columns):
Fragment lower upper
0 1 1 5
1 2 2 5
2 3 3 5
3 4 2 5
4 5 1 5
5 6 1 5
I've calculated these lower and upper bounds from this dataset (966 rows × 2 columns):
Fragment Confidence Value
0 33 4
1 26 4
2 23 3
3 16 2
4 36 3
which contains multiple instances of each fragment, each with an associated confidence value.
The confidence values come from a Likert scale, i.e. 1-5. I want to create an error bar plot (example image not shown):
on the y-axis, each fragment 1-36, and on the x-axis the range/std/mean (?) of the confidence values for each fragment.
I've tried the following, but it's not exactly what I want; I think using the lower and upper bounds isn't the best idea, and maybe I need the std/range instead...
import pandas as pd
import matplotlib.pyplot as plt

# confpd is the second dataset from above
meanconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].mean()
minconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].min()
maxconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].max()

data_dict = {}
data_dict['Fragment'] = [str(i) for i in range(1, 37)]
data_dict['lower'] = minconfs['Confidence Value']
data_dict['upper'] = maxconfs['Confidence Value']
dataset = pd.DataFrame(data_dict)  # dataset is the first dataset I show above

for lower, upper, y in zip(dataset['lower'], dataset['upper'], range(len(dataset))):
    plt.plot((lower, upper), (y, y), 'o-', color='orange')
plt.yticks(range(len(dataset)), list(dataset['Fragment']))
The result of this code (plot not shown) is not what I want.
Any help is greatly appreciated!!
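One possible approach (a minimal sketch, assuming confpd as in the question): compute the mean and standard deviation of the confidence values per fragment and pass them to plt.errorbar as horizontal error bars.
import matplotlib.pyplot as plt

# mean and standard deviation of the confidence values per fragment
stats = confpd.groupby('Fragment')['Confidence Value'].agg(['mean', 'std'])

plt.errorbar(stats['mean'], stats.index, xerr=stats['std'],
             fmt='o', color='orange', ecolor='gray', capsize=3)
plt.yticks(stats.index)
plt.xlabel('Confidence Value (mean ± std)')
plt.ylabel('Fragment')
plt.show()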

Related

Calculating distance from centroid to another point in Python

I have a data frame like below
STORE_ID LATITUDE LONGITUDE GROUP
1 18.2738 28.38833 2
2 18.3849 28.29374 1
3 18.3948 28.29303 1
4 18.1949 28.28248 1
5 18.2947 28.47392 1
6 18.7493 28.29475 2
7 18.4729 28.38392 3
8 18.1927 28.29485 2
9 18.2948 28.29384 1
10 18.1038 28.29489 3
11 18.7482 28.29374 1
12 18.9283 28.28484 2
And a second data frame like below
Tele_Booth_ID LATITUDE LONGITUDE
1 18.5638 28.19374
2 18.2947 28.03727
3 18.3849 28.26395
4 18.9482 28.91847
The first data frame has longitudes and latitudes of stores in a certain area. The stores are grouped into clusters, represented by the GROUP field.
The second dataframe has longitudes and latitudes for telephone booths in that same area.
Using both these data frames I want to find the optimal locations to place more telephone booths.
If a store group has no telephone booths in the cluster or near the cluster, I would want to put a booth there. If a store group has a booth within the cluster, I would not want another booth there.
Using python how can I calculate the center point for each store group and then calculate the distance of each store group to the nearest booth?
While I am unsure how accurate a centroid computed as the simple mean of the points will be, you can use the ability to group on store groups to create a new dataframe containing the mean Lat and Lon of each group, as follows:
Given a df1 as shown:
Store Lat Lon Group
0 1 18.2738 28.38833 2
1 3 18.3948 28.29303 1
2 4 18.1949 28.28248 1
3 5 18.2947 28.47392 1
4 6 18.7493 28.29475 2
5 7 18.4729 28.38392 3
6 8 18.1927 28.29485 2
7 9 18.2948 28.29384 1
8 10 18.1038 28.29489 3
9 11 18.7482 28.29374 1
10 12 18.9283 28.28484 2
Create a DataFrame of the mean Lat and Lon per group as follows:
dfc = df1.groupby('Group')[['Lat', 'Lon']].mean()
This will yield the dfc dataframe shown below:
Lat Lon
Group
1 18.385480 28.327402
2 18.536025 28.315693
3 18.288350 28.339405
You can now use these mean Lat/Lon points to compute the distance to each telephone booth as follows:
The booth dataframe is df2 shown below:
Booth Lat Lon
0 1 18.5638 28.19374
1 2 18.2947 28.03727
2 3 18.3849 28.26395
3 4 18.9482 28.91847
# compute the Euclidean distance between two coordinate pairs
def dist(loc1: tuple[float, float], loc2: tuple[float, float]) -> float:
    dx = loc1[0] - loc2[0]
    dy = loc1[1] - loc2[1]
    return (dx**2 + dy**2)**0.5
Using the above, compute the distance from each group centroid to each booth as follows:
for i in range(df2.shape[0]):
    dfc[f'B-{i+1:02}'] = dfc.apply(
        lambda row: dist((row.Lat, row.Lon), tuple(df2.iloc[i].to_list()[1:])),
        axis=1)
This yields the following:
Lat Lon B-01 B-02 B-03 B-04
Group
1 18.385480 28.327402 0.222853 0.304003 0.063455 0.816098
2 18.536025 28.315693 0.125075 0.368452 0.159737 0.730225
3 18.288350 28.339405 0.311594 0.302202 0.122537 0.877906
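To answer the 'nearest booth' part of the question, a short follow-up sketch (column names as created above; the two new column names are my own):
# nearest booth per group: the column holding the row minimum, and its value
booth_cols = [c for c in dfc.columns if c.startswith('B-')]
dfc['nearest_booth'] = dfc[booth_cols].idxmin(axis=1)  # e.g. 'B-03' for group 1
dfc['nearest_dist'] = dfc[booth_cols].min(axis=1)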

Using df.apply on a function with multiple inputs to generate multiple outputs

I have a dataframe that looks like this
initial year0 year1
0 0 12
1 1 13
2 2 14
3 3 15
Note that the number of year columns year0, year1, ... (year_count) is variable from run to run but constant throughout this code.
I first wanted to apply a function to each of the 'year' columns to generate 'mod' columns like so
def mod(year, scalar):
    return year * scalar

s = 5
year_count = 2
# generate the new columns
df[[f"mod{y}" for y in range(year_count)]] = df[[f"year{y}" for y in range(year_count)]].apply(mod, scalar=s)
initial year0 year1 mod0 mod1
0 0 12 0 60
1 1 13 5 65
2 2 14 10 70
3 3 15 15 75
All good so far. The problem is that I now want to apply another function to both the year column and its corresponding mod column to generate another set of val columns, so something like
def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale
Then I apply this to each of the columns (year0, mod0), (year1, mod1) etc to generate the next tranche of columns.
With scale = 10 I should end up with
initial year0 year1 mod0 mod1 val0 val1
0 0 12 0 60 0 720
1 1 13 5 65 60 780
2 2 14 10 70 120 840
3 3 15 15 75 180 900
This is where I'm stuck - I don't know how to put two existing df columns together in a function with the same structure as in the first example, and if I do something like
df[['val0', 'val1']] = df['col1', 'col2'].apply(lambda x: sum_and_scale('mod0', 'mod1', scale=10))
I don't know how to generalise this to arbitrary inputs and outputs while also applying the constant scale parameter. (I know the last piece of code won't work, but it's the other avenue to a solution I've seen.)
The reason I'm asking is because I believe the loop that I currently have working is creating performance issues with the number of columns and the length of each column.
Thanks
IMHO, it's better with a simple for loop:
for i in range(2):
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)
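If even that loop becomes a bottleneck, a hedged, fully vectorized sketch (assuming year_count and the column layout above; assigning a 2-D array to several new columns needs a reasonably recent pandas):
years = df[[f'year{i}' for i in range(year_count)]].to_numpy()
mods = df[[f'mod{i}' for i in range(year_count)]].to_numpy()
# sum_and_scale applied to all year/mod column pairs in one NumPy operation
df[[f'val{i}' for i in range(year_count)]] = (years + mods) * 10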

Finding row with closest numerical proximity within Pandas DataFrame

I have a Pandas DataFrame with the following hypothetical data:
ID Time X-coord Y-coord
0 1 5 68 5
1 2 8 72 78
2 3 1 15 23
3 4 4 81 59
4 5 9 78 99
5 6 12 55 12
6 7 5 85 14
7 8 7 58 17
8 9 13 91 47
9 10 10 29 87
For each row (or ID), I want to find the ID with the closest proximity in time and space (X & Y) within this dataframe. Bonus: Time should have priority over XY.
Ideally, in the end I would like to have a new column called "Closest_ID" containing the most proximal ID within the dataframe.
I'm having trouble coming up with a function for this.
I would really appreciate any help or hint that points me in the right direction!
Thanks a lot!
Let's denote df as our dataframe. Then you can do something like:
import numpy as np
from sklearn.metrics import pairwise_distances

space_vals = df[['X-coord', 'Y-coord']]
time_vals = df[['Time']]  # pairwise_distances expects a 2-D input
space_distance = pairwise_distances(space_vals)
time_distance = pairwise_distances(time_vals)
np.fill_diagonal(space_distance, 1e9)  # arbitrary large number, so a row never matches itself
np.fill_diagonal(time_distance, 1e9)   # again
closest_space_id = np.argmin(space_distance, axis=1)  # positional indices into df
closest_time_id = np.argmin(time_distance, axis=1)
Then, you can store the last 2 results in 2 columns, or somehow decide which one is closer.
Note: this code hasn't been checked, and it might have a few bugs...
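For the bonus requirement (time has priority over X/Y), one hedged option is a single combined distance matrix with a heavier weight on the time term; the weight and the mapping back to IDs below are my own choices, not a standard recipe:
w = 10.0  # arbitrary weight that lets time differences dominate
combined = w * time_distance + space_distance
# argmin gives positional indices; map them back to the ID column
df['Closest_ID'] = df['ID'].to_numpy()[np.argmin(combined, axis=1)]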

Software recommendation for circos plot with discrete axis

I would like to make a circos-like plot to visualize SNPs only (with multiple tracks for SNP attributes). It could be done with Python or R, or I am happy to consider other languages.
So far, I have taken a look at the circlize R package.
However, I get the error "Range of the sector ('C') cannot be 0" when initializing the circos plot.
I believe that this error arises from the fact that I have discrete data (SNPs) instead of having data for all positions. Or maybe this is because I have some data points that are repeated.
I have simplified my data below and show the code that I have tried so far:
Sample Gene Pos read_depth Freq
1 A 20394 43 99
1 B 56902 24 99
2 A 20394 50 99
2 B 56902 73 99
3 A 20394 67 50
3 B 56902 20 99
3 C 2100394 21 50
install.packages("circlize")
library(circlize)
data <- read.table("test_circos.txt", sep='\t', header=TRUE)
circos.par("track.height" = 0.1)
circos.initialize(factors = data$Gene, x = data$Pos)
I would like to know whether it is possible to get a circos-like plot where each of my data points (7 in my example) is plotted as an individual point, with nothing plotted in between, i.e. a discrete axis.
If it is of interest to anyone, I decided to do as follows:
Number the data points within each category (= 'Gene'), giving a new column 'Number':
Sample Gene Pos depth Freq Number
1 A 20394 43 99 1
1 B 56902 24 99 1
2 A 20394 50 99 2
2 B 56902 73 99 2
3 A 20394 67 50 3
3 B 56902 20 99 3
3 C 2100394 21 50 1
Design circos config file as follows (header not included in real config file):
chr - ID LABEL START END COLOUR
chr - A A 0 3 chr1
chr - B B 0 3 chr2
chr - C C 0 1 chr3
This means that each gene has a length equal to the number of SNPs identified in it, and each bp of a gene represents one line (= one SNP) in my SNP file.
I can then use circos as normal.
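Both steps can be scripted; a minimal pandas sketch (file names are hypothetical, column names as in the tables above):
import pandas as pd

data = pd.read_table('test_circos.txt')

# step 1: number the data points within each gene
data['Number'] = data.groupby('Gene').cumcount() + 1

# step 2: one karyotype line per gene, END = number of SNPs in that gene
sizes = data.groupby('Gene')['Number'].max()
with open('karyotype.txt', 'w') as fh:
    for i, (gene, n) in enumerate(sizes.items(), start=1):
        fh.write(f'chr - {gene} {gene} 0 {n} chr{i}\n')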
In the end, I chose circos because it seemed the best documented and therefore the easiest to learn, and it also appeared more flexible.

Multiplying DF row by coefficients

I want to store the coefficients of a statsmodels.api model for future use (so I don't have to refit the model every time). When I get a new dataframe that I want to make predictions on, I want to multiply each row of the dataframe by the coefficients (i.e. model.params) and then sum the results of each row × coefficients to get the prediction for that row. However, it does not seem to be working when I try:
preds = []
for row in df.iterrows():
    preds.append((model.params * row).sum())
Edit: example
df:
Height Weight Color
6 5 3
6 2 4
9 1 9
10 3 3
coefficients:
Height: -1.6403
Weight: 2.0435
Color: 300.4532
I would consider doing something like:
df.dot(model.params)
This computes the dot product on each of the rows of the DataFrame.
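Since the question also asks about storing the coefficients for future use, a minimal sketch (the file name is arbitrary; to_pickle/read_pickle are standard pandas methods):
import pandas as pd

# one-off: persist the fitted coefficients (model.params is a pandas Series)
model.params.to_pickle('coeffs.pkl')

# later, on a new dataframe with the same columns
params = pd.read_pickle('coeffs.pkl')
preds = df.dot(params)
# note: if the model was fitted with an intercept, df needs a matching
# 'const' column first (e.g. via statsmodels' sm.add_constant)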
It seems like you need:
coeff_dict = {
    'Height': -1.6403,
    'Weight': 2.0435,
    'Color': 300.4532,
}
df.assign(prediction=df.assign(**coeff_dict).mul(df).sum(axis=1))
Output:
Height Weight Color prediction
0 6 5 3 901.7353
1 6 2 4 1196.0580
2 9 1 9 2691.3596
3 10 3 3 891.0871
