Python: superimposing and smoothing graphics

I have 100 different graphics.
I need to superimpose all of them and then smooth the result.
I tried this:
from PIL import Image

first = Image.open("test1.png")
second = Image.open("test2.png")
first.paste(second, (0, 0), second)  # paste using the second image as its own alpha mask
first.show()
But how can I do this for 100 graphics? And how can I smooth the result?
The first 10 steps in the dataframe look like this:
graph   active nodes
0       1024
1       598
2       349
3       706
4       541
5       623
6       576
7       614
8       578
9       613
10      595

Do you have it just as an image, or do you also have the data that makes the graph?
If you have the data, the easiest way to smooth it is to use convolution.
import numpy as np

n = 100  # length of the moving-average window
smoothed_data = np.convolve(data, [1 / n] * n, 'same')
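If the data behind each graph is available, a minimal sketch of the whole pipeline could look like this. The per-graph CSV filenames and the "active nodes" column name are assumptions based on the dataframe shown above, and all series are assumed to have the same length:
import numpy as np
import pandas as pd

# load the "active nodes" series of each of the 100 graphs (filenames are hypothetical)
curves = [pd.read_csv("graph_%d.csv" % i)["active nodes"].to_numpy() for i in range(1, 101)]

# superimpose by averaging the curves point by point (assumes equal lengths)
combined = np.mean(curves, axis=0)

# smooth the combined curve with a moving average
n = 10
smoothed = np.convolve(combined, np.ones(n) / n, mode='same')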


Filtering a labeled image by particle area

I have a labeled image of detected particles and a dataframe with the corresponding area of each labeled particle. What I want to do is filter out every particle on the image with an area smaller than a specified value.
I got it working with the example below, but I know there must be a smarter and, above all, faster way, for example skipping the loop by comparing the image with the array directly.
Thanks for your help!
Example:
import numpy as np

labels = df["label"][df.area > 5000].to_numpy()  # labels with area above the threshold
mask = np.zeros(labeled_image.shape)
for label in labels:
    mask[labeled_image == label] = 1
Dataframe:
label centroid-0 centroid-1 area
0 1 15 3681 191
1 2 13 1345 390
2 3 43 3746 885
3 4 32 3616 817
4 5 20 4250 137
... ... ... ...
3827 3828 4149 1620 130
3828 3829 4151 852 62
3829 3830 4155 330 236
3830 3831 4157 530 377
3831 3832 4159 3975 81
You can use isin to check equality to several labels. The resulting boolean array can be directly used as the mask after casting to the required type (e.g. int):
labels = df.loc[df.area.gt(5000), 'label']
mask = np.isin(labeled_image, labels).astype(int)
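As a rough follow-up sketch (assuming labeled_image is an integer label array, e.g. as produced by skimage.measure.label), the boolean mask can also be used to blank out the small particles in the labeled image itself:
import numpy as np

labels = df.loc[df.area.gt(5000), 'label']
mask = np.isin(labeled_image, labels).astype(int)

# optional: remove the small particles from the labeled image directly
filtered_image = np.where(mask.astype(bool), labeled_image, 0)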

How to check which points are inside a circle?

I have a dataframe df that contains the distances between all the points (IDs) in my system. So the df looks like the following:
df
radius ID1 ID2 x1 y1 x2 y2
0 0.454244 100 103 103.668919 1.335309 103.671812 1.332424
1 1.016734 100 123 103.668919 1.335309 103.677598 1.332424
2 0.643200 103 123 103.671812 1.332424 103.677598 1.332424
3 1.605608 100 124 103.668919 1.335309 103.677598 1.346851
4 1.728349 103 124 103.671812 1.332424 103.677598 1.346851
I want to compute the circle between each pair of points and then check which points are inside that circle. For each point I have the coordinates in a separate dataframe, coordinates.
coordinates
ID x y
0 100 103.668919 1.335309
1 103 103.671812 1.332424
2 124 103.677598 1.346851
3 125 103.677598 1.349737
4 134 103.680491 1.341080
5 135 103.680491 1.343966
6 136 103.680491 1.346851
7 137 103.680491 1.349737
8 138 103.680491 1.352622
9 146 103.683384 1.341080
Here is the code:
from matplotlib.patches import Circle

for i in df.index:
    x = df.x1[i]
    y = df.y1[i]
    circ = Circle((x, y), radius=df.radius[i])
    ## it works until here: from now on I need to understand what to do,
    ## in particular how to find which points are inside the circle
    points = circ.contains_point([coordinates.x, coordinates.y])
which returns the error
ValueError: setting an array element with a sequence.
When I have issues like this, I always do a small sanity test:
from matplotlib.patches import Circle
circ = Circle((0, 0), radius = 1)
print(circ.contains_point([0.5,0.5]))
print(circ.contains_point([2,2]))
I get (as expected)
True
False
So coordinates.x and coordinates.y are probably arrays, which explains the message.
contains_point works on a tuple or list of 2 scalars.
To generate your list of points inside the circle, you could use a list comprehension:
points = [(x, y) for x, y in zip(coordinates.x, coordinates.y) if circ.contains_point((x, y))]
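If you prefer to avoid the patch object altogether, a vectorized sketch (assuming coordinates has the x and y columns shown above, and using the centre and radius from the current row i of the loop) is to compare distances directly with NumPy:
import numpy as np

cx, cy, r = df.x1[i], df.y1[i], df.radius[i]
dist = np.hypot(coordinates.x - cx, coordinates.y - cy)  # distance of every point to the circle centre
inside = coordinates[dist <= r]  # rows of `coordinates` that fall inside the circle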

Finding the minimum distance for each row using pandas

I am trying to match the DNA spirals of different bacteria with their ancestors, and I have around 1 million observations. I want to identify the closest ancestor for each bacterium, i.e. I want to compare it with the same or older generations (equal or smaller generation numbers). My data frame looks like this (for simplicity, let's assume the DNA vector consists of one number):
bacteria_id generation DNA_vector
213 230 23
254 230 18
256 229 39
289 229 16
310 228 24
324 228 45
I tried to create a matrix and choose the smallest value from it for each bacterium, but since the matrix will have a lot of rows and columns, I get a memory error before it is created.
Let's assume it is not bacteria but cars, and I compare each car with its own generation (e.g. cars launched in 2010) and with the older ones. Let's also change DNA_vector to the number of features, and assume a car is more similar to another car if the difference between their numbers of features is smaller.
So I want to create two additional columns. The first one will give the minimum difference (e.g. for the first row it will be 1, and the most similar car will be model 310).
Expected output is:
bacteria_id generation DNA_vector most_similar_bacteria distance
213 230 23 310 1 (i.e. 24 -23)
254 230 18 289 2
256 229 39 324 6
289 229 16 228 8
310 228 24 324 19
324 228 45 NA NA
Do you have any recommendations?
If you're running into memory errors because of a large dataset, you could try using dask. It is a 'parallel' computing library, very similar to pandas, that lets you process larger-than-memory datasets by using your hard drive instead of RAM.
https://dask.pydata.org/en/latest/
It may not be exactly what you're looking for, but I have had good luck using it with large datasets like the one you describe.
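As a rough sketch of getting started with dask (the filename bacteria.csv is hypothetical, and the nearest-ancestor logic itself still has to be written on top):
import dask.dataframe as dd

# dask reads and processes the file in chunks instead of loading it all into RAM
ddf = dd.read_csv("bacteria.csv")

# most pandas-style operations work the same way, but stay lazy until .compute() is called
per_generation_min = ddf.groupby("generation").DNA_vector.min().compute()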

Scipy Stats ttest_1samp Hypothesis Testing For Comparing Previous Performance To Sample

My Problem I'm Trying To Solve
I have 11 months' worth of performance data:
Month Branded Non-Branded Shopping Grand Total
0 2/1/2015 1330 334 161 1825
1 3/1/2015 1344 293 197 1834
2 4/1/2015 899 181 190 1270
3 5/1/2015 939 208 154 1301
4 6/1/2015 1119 238 179 1536
5 7/1/2015 859 238 170 1267
6 8/1/2015 996 340 183 1519
7 9/1/2015 1138 381 172 1691
8 10/1/2015 1093 395 176 1664
9 11/1/2015 1491 426 199 2116
10 12/1/2015 1539 530 156 2225
Let's say it's February 1, 2016, and I'm asking: "Are the results in January statistically different from the past 11 months?"
Month Branded Non-Branded Shopping Grand Total
11 1/1/2016 1064 408 106 1578
I came across a blog...
I came across iaingallagher's blog. I will reproduce it here (in case the blog goes down).
1-sample t-test
The 1-sample t-test is used when we want to compare a sample mean to a
population mean (which we already know). The average British man is
175.3 cm tall. A survey recorded the heights of 10 UK men and we want to know whether the mean of the sample is different from the
population mean.
# 1-sample t-test
from scipy import stats

one_sample_data = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]
one_sample = stats.ttest_1samp(one_sample_data, 175.3)
print("The t-statistic is %.3f and the p-value is %.3f." % one_sample)
Result:
The t-statistic is 2.296 and the p-value is 0.047.
Finally, to my question...
In iaingallagher's example, he knows the population mean and is comparing a sample (one_sample_data). In MY example, I want to see if 1/1/2016 is statistically different from the previous 11 months. So in my case, the previous 11 months is an array (instead of a single population mean value) and my sample is one data point (instead of an array)... so it's kind of backwards.
QUESTION
If I were focused on the Shopping column data:
Will scipy.stats.ttest_1samp([161,197,190,154,179,170,183,172,176,199,156], 106) produce a valid result, even though my sample (the first parameter) is a list of previous results and I'm comparing it to a popmean that is not the population mean but a single observation?
If this is not the correct stats function, do you have any recommendation on what to use for this hypothesis-testing situation?
If you are only interested in the "Shopping" column, try creating a .xlsx or .csv file containing only the data from the "Shopping" column.
That way you can import the data and use pandas to perform the same t-test for each column individually.
import pandas as pd
from scipy import stats

data = pd.read_excel("datafile.xlsx")
one_sample_data = data["Shopping"]
# 175.3 is the popmean carried over from the blog example; replace it with the value you want to test against
one_sample = stats.ttest_1samp(one_sample_data, 175.3)
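For the specific comparison in the question, a minimal sketch would pass the previous 11 Shopping values as the sample and the January value (106) as popmean; whether that framing is statistically sound is a separate question:
from scipy import stats

shopping = [161, 197, 190, 154, 179, 170, 183, 172, 176, 199, 156]  # Feb-Dec 2015 Shopping column
t_stat, p_value = stats.ttest_1samp(shopping, 106)  # 106 = January 2016 Shopping value
print("t = %.3f, p = %.3f" % (t_stat, p_value))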

Plotting a histogram in Pandas with very heavy-tailed data

I am often working with data that has a very 'long tail'. I want to plot histograms to summarize the distribution, but when I try using pandas I wind up with a bar graph that has one giant visible bar and everything else invisible.
Here is an example of the series I am working with. Since it's very long, I used value_counts() so it will fit on this page.
In [10]: data.value_counts().sort_index()
Out[10]:
0 8012
25 3710
100 10794
200 11718
300 2489
500 7631
600 34
700 115
1000 3099
1200 1766
1600 63
2000 1538
2200 41
2500 208
2700 2138
5000 515
5500 201
8800 10
10000 10
10900 465
13000 9
16200 74
20000 518
21500 65
27000 64
53000 82
56000 1
106000 35
530000 3
I'm guessing that the answer involves binning the less common results into larger groups somehow (53000, 56000, 106000, and 530000 into one group of >50000, etc.), and also changing the y axis to show the percentage of occurrences rather than the absolute number. However, I don't understand how I would go about doing that automatically.
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

mydict = {0: 8012, 25: 3710, 100: 10794, 200: 11718, 300: 2489, 500: 7631, 600: 34, 700: 115, 1000: 3099, 1200: 1766, 1600: 63, 2000: 1538, 2200: 41, 2500: 208, 2700: 2138, 5000: 515, 5500: 201, 8800: 10, 10000: 10, 10900: 465, 13000: 9, 16200: 74, 20000: 518, 21500: 65, 27000: 64, 53000: 82, 56000: 1, 106000: 35, 530000: 3}
mylist = []
for key in mydict:
    for e in range(mydict[key]):
        mylist.insert(0, key)

df = pd.DataFrame(mylist, columns=['value'])
df2 = df[df.value <= 5000]
Plot as a bar:
fig = df.value.value_counts().sort_index().plot(kind="bar")
plt.savefig("figure.png")
As a histogram (limited to values 5000 & under which is >97% of your data):
I like using linspace to control buckets.
df2 = df[df.value <= 5000]
df2.hist(bins=np.linspace(0,5000,101))
plt.savefig('hist1')
EDIT: Changed np.linspace(0,5000,100) to np.linspace(0,5000,101) & updated histogram.
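To address the other half of the question (grouping the rare large values and showing percentages instead of counts), one rough sketch is to use logarithmic bins together with histogram weights; the +1 shift is only there to keep the zero values on the log axis:
import numpy as np
from matplotlib import pyplot as plt

log_bins = np.logspace(0, np.log10(df.value.max() + 1), 30)  # 30 log-spaced buckets from 1 to max+1
weights = np.ones(len(df)) * 100.0 / len(df)  # each row contributes its percentage share

plt.figure()
plt.hist(df.value + 1, bins=log_bins, weights=weights)
plt.xscale("log")
plt.ylabel("% of observations")
plt.savefig("hist_log")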
