Area under the curve for triplicate data in Python

I conducted an experiment in triplicate, so each wavelength has three corresponding values. I need the total area under the curve (a single value) for each wavelength. How would I go about doing this in Python using pandas?
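The question does not show the data layout, so the following is only a sketch under assumptions: a long-format table with hypothetical columns `wavelength`, `replicate`, `time` and `signal`, where the three replicates are averaged at each time point and the mean curve is integrated over `time` with the trapezoidal rule. The column names, the helper `trapezoid_area` and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

def trapezoid_area(x, y):
    """Trapezoidal-rule area under y(x); x must be sorted in increasing order."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Toy data (made up): 2 wavelengths, 3 time points, 3 replicates each.
df = pd.DataFrame({
    "wavelength": [400] * 9 + [500] * 9,
    "replicate": [1, 2, 3] * 6,
    "time": [0, 0, 0, 1, 1, 1, 2, 2, 2] * 2,
    "signal": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0, 2.0, 2.0, 2.0,
               0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0],
})

# Average the triplicates at each (wavelength, time); groupby sorts the keys.
mean_curve = df.groupby(["wavelength", "time"])["signal"].mean().reset_index()

# One area per wavelength, integrated over time.
auc = pd.Series({
    wl: trapezoid_area(group["time"], group["signal"])
    for wl, group in mean_curve.groupby("wavelength")
})
print(auc)
```

If one area per replicate is wanted instead, the same helper can be applied after grouping by both `wavelength` and `replicate`, and the three areas can then be averaged.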

Related

What is the best outlier detector for data that contains typos made by the user during data entry?

I am trying to analyze World Bank data and build a model on it. Some columns in my dataset contain mistyped numbers. The problem is that one digit of a large number was left out, so the entered value is 10 times smaller than the true one.
The image shows a scatter plot of the population of Afghanistan from 2000 to 2015.
The real population of Afghanistan in 2015 was 33 million (source: Wikipedia).
The person who entered the data sometimes dropped a digit, for example entering 3604343 instead of 36043432. The entered number is then 10 times smaller than the correct one, which creates enormous distortions in the dataset.
I tried to write my own algorithm to detect the low outliers using a z-score.
# Outlier detector based on deviation from the mean, in units of std.
import numpy as np
import matplotlib.pyplot as plt

# Take the natural logarithm of the data
data_Afganistan['log_pop'] = np.log(data_Afganistan['Population'])
# Create a new column for z-scores (absolute deviation divided by std)
data_Afganistan['z_score'] = (
    np.abs(data_Afganistan['log_pop'] - data_Afganistan['log_pop'].mean())
    / data_Afganistan['log_pop'].std()
)
# Identify outliers
outliers = data_Afganistan[data_Afganistan['z_score'] > 1.2]
# Plot the data: outlier years in red, all others in blue
colors = ['red' if year in outliers['Year'].values else 'blue'
          for year in data_Afganistan['Year']]
plt.scatter(data_Afganistan['Year'].astype(int),
            data_Afganistan['Population'], color=colors)
But the results aren't very good.
Here is the result:
The algorithm detected the outliers incorrectly (red dots).
I need to detect all the dots that are at the bottom.
**How can I detect the bottom dots?**
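One possible direction, offered as a sketch and not as the accepted solution: because a dropped digit shifts a value by roughly one unit on a log10 scale, the values can be compared with the series median, which stays reliable as long as fewer than half of the entries are mistyped. The function name `flag_dropped_digit`, the `threshold` of 0.5 and the toy numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def flag_dropped_digit(values, threshold=0.5):
    """Return a boolean mask of values far below the median on a log10 scale.

    A value with one digit missing sits about 1.0 below its neighbours in
    log10 units, while real year-to-year growth changes it far less.
    """
    log_vals = np.log10(pd.Series(values, dtype=float))
    deviation = log_vals - log_vals.median()
    # Only flag the low side: a dropped digit can only make a value smaller.
    return deviation < -threshold

# Toy series (made up): two entries have lost a digit.
population = pd.Series([20_000_000, 2_100_000, 22_000_000, 23_000_000,
                        2_400_000, 25_000_000, 26_000_000])
mask = flag_dropped_digit(population)
print(population[mask])
```

Unlike the symmetric z-score, this test is one-sided and uses the median, so the mistyped values do not pull the reference level down toward themselves as strongly as they pull the mean.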

How can I find the line of best fit with a 2d-Histogram in Python?

I have a 2D histogram where the value of each bin is the number of points in the bin divided by the total number of points in that bin's row (so an occurrence percentage by row). If I want to create a line of best fit that goes through the denser center areas of the histogram, how could I do that?
The data I have is one numpy array stored like this:
percentages = [[0.00209644 0.00069881 0.00279525 0.00069881 0.00139762
0.00209644 0.00349406 0.00419287 0.00628931 0.01607268 0.01467505
0.02166317 0.02445842 0.03214535, i, i, i, and so on]
[0.02581665 0.02212856 0.02107482...]]
It is a 50 x 20 array, so each bin has a value. Using these values, I made the histogram with
plt.pcolormesh(xEdges, yEdges, percentages)
So my question is, how would I create a line of best fit when this is all the information I have?
"the denser center areas of the histogram"
I assume density would be the z-value, i.e. percentages. One option:
define an upper and lower bound for the values of your line, maybe .079 < percentages <= .081
find all the points within those boundaries
add a line using those points.
if the line is too thick or not continuous, adjust the boundaries and repeat.
Another option:
determine the value that delineates inside or outside, maybe .08 percent
use numpy's isclose function, with an appropriate tolerance, to find the points close to that value
draw a line using those points
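The thresholding idea can be sketched roughly as follows. This assumes `percentages` has one row per y bin and one column per x bin (the layout `plt.pcolormesh(xEdges, yEdges, percentages)` expects) and fits a straight line through the centers of the selected bins with `np.polyfit`. The function name, the bounds and the toy histogram are hypothetical.

```python
import numpy as np

def fit_line_through_bins(x_edges, y_edges, percentages, lower, upper):
    """Fit y = slope * x + intercept through bins with lower < value <= upper."""
    x_edges = np.asarray(x_edges, dtype=float)
    y_edges = np.asarray(y_edges, dtype=float)
    # Bin centers are the midpoints of consecutive edges.
    x_centers = 0.5 * (x_edges[:-1] + x_edges[1:])
    y_centers = 0.5 * (y_edges[:-1] + y_edges[1:])
    # Rows index y bins and columns index x bins.
    rows, cols = np.nonzero((percentages > lower) & (percentages <= upper))
    slope, intercept = np.polyfit(x_centers[cols], y_centers[rows], deg=1)
    return slope, intercept

# Toy histogram (made up): the dense bins lie on the diagonal.
edges = np.arange(6)
toy = np.full((5, 5), 0.01)
np.fill_diagonal(toy, 0.08)
slope, intercept = fit_line_through_bins(edges, edges, toy, 0.079, 0.081)
print(slope, intercept)
```

If the selected bins are too sparse or form a thick band, the bounds can be adjusted and the fit repeated, as described in the steps above.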

How to keep the same scale/levels across different kdeplot (seaborn 2D plot)?

I plot seaborn kdeplots using subsets of a dataset. The main dataset contains detections of people, given as coordinates on a map. There can be many detections in a single frame. The data covers 24 hours of detections.
Data format: [time/frame_number, x_coordinate, y_coordinate]
Problem
When I draw two different kdeplots using two subsets of the data (say 1-2pm and 10-11pm), both plots are drawn fine.
By exploring the data I found out that 1-2pm is rush hour, with many detections, and 10-11pm is closing time, with far fewer detections. But kdeplot represents both subsets on the same scale (red high-density areas). This behavior is understandable, since the KDE defines the scale based on the local max and min values.
Requirement
I want to plot hour-wise KDE plots, but I want the scale to be constant over the whole day. That is, if 1-2pm is rush hour, the red high-density areas should appear there, but at 10-11pm, when traffic is light, the color should not be red but one of the lower-density colors (green, blue), since there are far fewer detections than during rush hour.
In short: define the scale/levels from the 24-hour min and max values and use them consistently in the hour-wise plots.
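As far as I know, the `levels` argument of seaborn's `kdeplot` is expressed as proportions of each plot's own density, so passing the same `levels` to two plots does not by itself give a shared absolute scale. One workaround, sketched here with `scipy.stats.gaussian_kde` in place of seaborn, is to evaluate each hour's density on a common grid, scale it by that hour's share of all detections, and contour every hour with the same absolute levels. The helper name, the grid and the synthetic data are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def scaled_density(points, grid_x, grid_y, total_count):
    """KDE of `points` (shape (n, 2)) on a fixed grid, scaled by n / total_count."""
    xx, yy = np.meshgrid(grid_x, grid_y)
    kde = gaussian_kde(points.T)  # gaussian_kde expects shape (dims, n)
    density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
    # Weight by the subset's share of all detections so hours are comparable.
    return density * len(points) / total_count

rng = np.random.default_rng(0)
rush = rng.normal(0.0, 1.0, size=(2000, 2))    # synthetic stand-in for 1-2pm
closing = rng.normal(0.0, 1.0, size=(200, 2))  # synthetic stand-in for 10-11pm
total = len(rush) + len(closing)

grid = np.linspace(-4.0, 4.0, 60)
d_rush = scaled_density(rush, grid, grid, total)
d_closing = scaled_density(closing, grid, grid, total)

# Shared absolute contour levels, taken from the maximum over all subsets.
levels = np.linspace(0.0, max(d_rush.max(), d_closing.max()), 11)
```

Each hour can then be drawn with matplotlib, for example `ax.contourf(grid, grid, d_closing, levels=levels, cmap="jet")`, so that the quiet hour only reaches the lower-density colors.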

What is the easiest way to interpolate x, y, value timeseries to another group of points?

Suppose I have 7 time series, each representing a changing value measured at a point with coordinates x, y. I also have a group of 20 points distributed spatially within the area covered by these 7 points.
I want to get 20 time series, where every value is interpolated from the 7 initial points at the corresponding moment. The time step is one day.
I know that kriging is the best interpolation method in my case. I also know that kriging interpolation of several points over a regular grid is easy to perform with the scikit-learn or pykrige packages. But I want time series (a loop over time? wouldn't that be too slow?) and I want irregular points as the target positions of the interpolated values, not a regular grid.
So, what is the best approach here?
I've seen this thread, but it does not loop over time.
The diagram shows the points with measured time series (x, y, value) as o and the interpolation target points as x.
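To illustrate that a loop over time is not necessarily required, here is a sketch that swaps kriging for SciPy's `RBFInterpolator` (radial basis functions, a different technique from kriging). It accepts scattered target points and a values array with one column per day, so all days are interpolated in one call. The coordinates and values below are synthetic.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
stations = rng.uniform(0.0, 10.0, size=(7, 2))  # (x, y) of the 7 measured points
targets = rng.uniform(2.0, 8.0, size=(20, 2))   # (x, y) of the 20 target points
days = np.arange(365)

# Synthetic observations, shape (7, 365): one row per station, one column per day.
observed = 2.0 * stations[:, [0]] + 3.0 * stations[:, [1]] + days[None, :]

# The same spatial weights are reused for every day, so no time loop is needed.
interpolator = RBFInterpolator(stations, observed)
interpolated = interpolator(targets)  # shape (20, 365)
print(interpolated.shape)
```

With pykrige, my understanding is that `OrdinaryKriging.execute("points", x_targets, y_targets)` also evaluates at irregular points, but it works on one set of values at a time, so it would be called once per day.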

Is there a Python package that can trace a curve with a Gaussian lineshape over several x and y values?

My apologies for my ignorance in advance; I've only been learning Python for about two months. Every example question that I've seen on Stack Overflow seems to discuss a single distribution over a series of data, but not one distribution per data point with band broadening.
I have some (essentially) infinitely-thin bars at value x with height y that I need to run a line over so that it looks like the following photo:
The bars are obtained from the table of data on the far right. The curve is what I'm trying to make.
I am doing some TD-DFT work to calculate a theoretical UV/visible spectrum. It will output absorbance strengths (y-values, i.e., heights) for specific wavelengths of light (x-values). Theoretically, these are typically plotted as infinitely-thin bars, though we experimentally obtain a curve instead. The theoretical data can be made to appear like an experimental spectrum by running a curve over it that hugs y=0 and has a Gaussian lineshape around every absorbance bar.
I'm not sure if there's a feature that will do this for me, or if I need to do something like make a loop summing Gaussian curves for every individual absorbance, and then plot the resulting formula.
Thanks for reading!
It looks like my answer was using Seaborn to do a kernel density estimation. Because a KDE isn't weighted and only considers the density of x-values, I had to create a small loop to create a new list consisting of the x-entries each multiplied out by their respective intensities:
list5 = []  # will hold one sub-list of repeated x-values per bar
for j in range(len(list1)):  # list1 contains x-values
    list5.append([list1[j]] * int(list3[j]))  # see below for list3
# now to drop the brackets from within the list:
list4 = []
for k in range(len(list5)):  # list5 was just made, containing intensity-proportional x-values
    for l in list5[k]:
        list4.append(l)  # now just a list, rather than a list of lists
(had to make another list earlier of the intensities multiplied by 1000000 to make them all integers):
list3 = [i * 1000000 for i in list2] #list3 now contains integer intensities
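The other option mentioned in the question, summing one Gaussian per absorbance bar, can be sketched directly with NumPy. This is a minimal illustration, not the poster's method: the function name, the `fwhm` broadening parameter and the two example bars are assumptions, and each Gaussian is scaled so that its peak height equals the bar height.

```python
import numpy as np

def gaussian_broadened_spectrum(grid, positions, heights, fwhm):
    """Sum of Gaussians centered at `positions` with peak values `heights`."""
    grid = np.asarray(grid, dtype=float)
    positions = np.asarray(positions, dtype=float)
    heights = np.asarray(heights, dtype=float)
    # Convert full width at half maximum to the Gaussian standard deviation.
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    # Broadcasting: one row per grid point, one column per bar.
    diffs = grid[:, None] - positions[None, :]
    return (heights[None, :] * np.exp(-0.5 * (diffs / sigma) ** 2)).sum(axis=1)

# Hypothetical bars: wavelengths in nm and their calculated strengths.
wavelengths = np.linspace(200.0, 600.0, 401)
spectrum = gaussian_broadened_spectrum(
    wavelengths, positions=[300.0, 450.0], heights=[0.8, 0.3], fwhm=20.0
)
```

The result can be drawn with `plt.plot(wavelengths, spectrum)`, and the curve stays close to y=0 away from the bars because each Gaussian decays quickly.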
