I'm working with a Geiger counter that can be hooked up to a computer; it records its output to a .txt file, NC.txt, containing the time since starting and the 'value' of the radiation it recorded in comma-separated columns. Here is my code so far:
import pylab
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
x1 = []
y1 = []
#Define a dictionary: counts
f = open("NC.txt", "r")
for line in f:
    line = line.strip()
    parts = line.split(",") #the columns are separated by commas and spaces
    time = float(parts[1]) #time is recorded in the second column of NC.txt
    value = float(parts[2]) #and the value it records is in the third
    x1.append(time)
    y1.append(value)
f.close()
xv = np.array(x1)
yv = np.array(y1)
#Statistics
m = np.mean(yv)
d = np.std(yv)
#Strip out background radiation
trueval = yv - m
#Basic plot of counts
num_bins = 10000
plt.hist(trueval,num_bins)
plt.xlabel('Value')
plt.ylabel('Count')
plt.show()
So far this code just creates a simple histogram of the radiation counts, centred at zero so that the background radiation is ignored.
What I want to do now is perform a chi-squared test to see how well the data fits, say, Poisson statistics (and then go on to compare it with other distributions later). I'm not really sure how to do that. I have access to scipy and numpy, so I feel like this should be a simple task, but just learning python as I go here, so I'm not a terrific programmer.
Does anyone know of a straightforward way to do this?
Edit for clarity: I'm not asking so much about if there is a chi-squared function or not. I'm more interested in how to compare it with other statistical distributions.
Thanks in advance.
You can use the SciPy library; see its documentation and examples.
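For example, a goodness-of-fit test against a Poisson distribution could be set up with scipy.stats roughly as below. This is only a sketch: it reuses yv from your code but bins the raw (non-negative integer) counts rather than the background-subtracted trueval, since a Poisson variable cannot be negative, and the names counts, observed and expected are mine.
import numpy as np
from scipy import stats
counts = yv.astype(int)                 # raw Geiger counts per interval (integers >= 0)
lam = counts.mean()                     # Poisson rate estimated from the data
# observed frequency of each distinct count value
values, observed = np.unique(counts, return_counts=True)
# expected frequencies under Poisson(lam), rescaled so both histograms have the same total
expected = stats.poisson.pmf(values, lam) * counts.size
expected *= observed.sum() / expected.sum()
# ddof=1 because one parameter (lam) was estimated from the data
chi2_stat, p_value = stats.chisquare(observed, expected, ddof=1)
print(chi2_stat, p_value)
A small p-value suggests a poor fit; to compare against another distribution, swap stats.poisson.pmf for that distribution's pmf/pdf evaluated over the same bins. For a rigorous test you would also want to merge bins with very small expected counts.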
I'm trying to plot 20 million data points, but it's taking an extremely long time (over an hour) using matplotlib. Is there something in my code that is making this unusually slow?
import csv
import matplotlib.pyplot as plt
import numpy as np
import Tkinter
from Tkinter import *
import tkSimpleDialog
from tkFileDialog import askopenfilename
plt.clf()
root = Tk()
root.withdraw()
listofparts = askopenfilename() # asks user to select file
root.destroy()
my_list1 = []
my_list2 = []
k = 0
csv_file = open(listofparts, 'rb')
for line in open(listofparts, 'rb'):
    current_part1 = line.split(',')[0]
    current_part2 = line.split(',')[1]
    k = k + 1
    if k >= 2: # skips the first line
        my_list1.append(current_part1)
        my_list2.append(current_part2)
csv_file.close()
plt.plot(my_list1 * 10, 'r')
plt.plot(my_list2 * 10, 'g')
plt.show()
plt.close()
There is no reason whatsoever to have a line plot of 20000000 points in matplotlib.
Let's consider printing first:
The maximum figure size in matplotlib is 50 inches. Even a high-end plotter with 3600 dpi could resolve at most 50*3600 = 180000 points.
For screen applications it's even fewer: even a high-end 4K screen is limited to roughly 4000 pixels across. Even exploiting aliasing effects, at most ~3 points per pixel remain distinguishable to the human eye, so a maximum of about 12000 points makes sense.
Therefore the question to ask is rather: how do I subsample my 20000000 data points down to a set of points that still produces the same image on paper or screen?
The solution to this strongly depends on the nature of the data. If it is sufficiently smooth, you can just take every nth list entry.
sample = data[::n]
If there are high-frequency components which need to be resolved, this requires more sophisticated techniques, which will again depend on what the data looks like.
One such technique might be the one shown in How can I subsample an array according to its density? (Remove frequent values, keep rare ones).
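As an illustration of one such technique (not the one from the linked answer), a min/max decimation keeps the extremes of each chunk so that short spikes survive the subsampling; the function name and default bin count below are arbitrary choices.
import numpy as np
def minmax_decimate(y, n_bins=10000):
    """Reduce y to 2*n_bins points by keeping the min and max of each chunk."""
    y = np.asarray(y, dtype=float)
    m = (len(y) // n_bins) * n_bins           # trim so the array reshapes evenly
    chunks = y[:m].reshape(n_bins, -1)
    lo, hi = chunks.min(axis=1), chunks.max(axis=1)
    # interleave min and max so the plotted line still sweeps through both extremes
    return np.column_stack((lo, hi)).ravel()
Plotting minmax_decimate(data) instead of data preserves the visual envelope of the signal while cutting 20 million points down to 20000.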
The following approach might give you a small improvement. It avoids doing the split twice per row (by using Python's csv library) and also removes the if statement, by skipping over the two header lines before the loop:
import matplotlib.pyplot as plt
import csv
l1, l2 = [], []
with open('input.csv', 'rb') as f_input:
    csv_input = csv.reader(f_input)
    # Skip two header lines
    next(csv_input)
    next(csv_input)
    for cols in csv_input:
        l1.append(cols[0])
        l2.append(cols[1])
plt.plot(l1, 'r')
plt.plot(l2, 'g')
plt.show()
I would say, though, that the main slowdown will still be the plot itself.
I would recommend switching to pyqtgraph. I switched to it because of speed issues while I was trying to make matplotlib plot real time data. Worked like a charm. Here's my real time plotting example.
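For what it's worth, a minimal pyqtgraph version of this kind of plot might look like the sketch below; the random data is just a stand-in for the CSV column, and with PyQt6/PySide6 bindings the last line would be app.exec() instead.
import numpy as np
import pyqtgraph as pg
y = np.random.normal(size=2000000).cumsum()   # stand-in for one CSV column
app = pg.mkQApp()                             # create (or reuse) the Qt application
pg.plot(y, pen='g', title='quick pyqtgraph test')
app.exec_()                                   # blocks until the plot window is closed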
Here is a link to the file with the information, 'sunspots.txt'. Apart from the external modules matplotlib.pyplot and seaborn, how could one compute the running average without importing external modules like numpy and future? (If it helps, I can do linspace and loadtxt without numpy.)
If it helps, my code thus far is posted below:
## open/read file
f2 = open("/Users/location/sublocation/sunspots.txt", 'r')
## extract data
lines = f2.readlines()
## close file
f2.close()
t = [] ## time
n = [] ## number
## col 1 == col[0] -- number identifying which month
## col 2 == col[1] -- number of sunspots observed
for col in lines: ## 'col' can be replaced by 'line' iff change below is made
    new_data = col.split() ## 'col' can be replaced by 'line' iff change above is made
    t.append(float(new_data[0]))
    n.append(float(new_data[1]))
## extract data ++ close file
## check ##
# print(t)
# print(n)
## check ##
## import
import matplotlib.pyplot as plt
import seaborn as sns
## plot
sns.set_style('ticks')
plt.figure(figsize=(12,6))
plt.plot(t, n, label='Number of sunspots observed monthly')
plt.xlabel('Time')
plt.ylabel('Number of Sunspots Observed')
plt.legend(loc='best')
plt.tight_layout()
plt.savefig("/Users/location/sublocation/filename.png", dpi=600)
The question is from the weblink from this university (p.11 of the PDF, p.98 of the book, Exercise 3-1).
Before marking this as a duplicate:
A similar question was posted here. The difference is that all the posted answers require importing external modules like numpy and future, whereas I am trying to do this without external imports (with the exceptions above).
Noisy data that needs to be smoothed
y = [1.0016, 0.95646, 1.03544, 1.04559, 1.0232,
1.06406, 1.05127, 0.93961, 1.02775, 0.96807,
1.00221, 1.07808, 1.03371, 1.05547, 1.04498,
1.03607, 1.01333, 0.943, 0.97663, 1.02639]
Try a running average with a window size of n
n = 3
Each window can be represented by a slice
window = y[i:i+n]
Need something to store the averages in
averages = []
Iterate over n-length slices of the data; get the average of each slice; save the average in another list.
from __future__ import division # For Python 2
for i in range(len(y) - n):
    window = y[i:i+n]
    avg = sum(window) / n
    print(window, avg)
    averages.append(avg)
When you plot the averages you'll notice there are fewer averages than there are samples in the data.
Maybe you could import an internal/built-in module and make use of this SO answer: https://stackoverflow.com/a/14884062/2823755
There are lots of hits if you search for "running average algorithm python".
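For completeness, here is a sketch that stays within the standard library (collections is built in, so it fits the question's 'no external modules' constraint); it reuses the y list above, and the function name is mine.
from collections import deque
def running_average(data, n):
    """Yield the mean of each full n-sample window, using only built-ins."""
    window = deque(maxlen=n)               # old samples fall off the left automatically
    for x in data:
        window.append(x)
        if len(window) == n:
            yield sum(window) / float(n)   # float() keeps Python 2 division honest
smoothed = list(running_average(y, 3))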
Context: I am looking at the periodic change in velocity of an object, where the period is 1.846834 days, and I am expecting a sinusoidal fit to my set of data.
Suppose I have a set of data which look like this:
#days vel error
5725.782701 0.195802 0.036312
5729.755560 -0.006370 0.041495
5730.765352 -0.071253 0.030760
5745.710214 0.092082 0.036094
5745.932853 0.238030 0.040097
5749.705307 0.196649 0.037140
5741.682112 0.186664 0.028075
5742.681765 -0.262104 0.038049
6186.729146 -0.243796 0.031687
6187.742803 -0.009394 0.054541
6190.717317 -0.001821 0.033684
6192.716356 0.117557 0.037807
6196.704736 0.093935 0.032336
6203.683879 0.076051 0.033085
6204.679898 -0.301463 0.033483
6208.659585 -0.409340 0.036002
6209.669701 0.180807 0.041666
There are only one or two data points observed in each cycle, so what I want to do is phase-fold my data, plot it, and fit it using chi-squared minimization. This is what I have so far:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize as op
import csv
with open('testdata.dat') as fin:
    with open('testdata_out.dat', 'w') as fout: #write output into a new .dat file
        o = csv.writer(fout)
        for line in fin:
            o.writerow(line.split())
#create a 2d array and give names to columns
data = np.genfromtxt('testdata_out.dat',delimiter=',',dtype=('f4,f4,f4'))
data.dtype.names = ('bjd','rv','err')
# parameters
x = data['bjd']
y = data['rv']
e = data['err']
P = 1.846834 #orbital period
T = 3763.85112 #time from ephemeris
q = (data['bjd']-T)%P
#print(q)
def niceplot():
    plt.xlabel('BJD')
    plt.ylabel('RV (km/s)')
    plt.tight_layout()

def model(q,A,V):
    return A*np.cos(np.multiply(np.divide((q),P),np.multiply(2,np.pi))-262) + V

def residuals((A,V),q,y,e): #for least squares
    return (y-model(q,A,V)) / e

def chisq((A,V),q,y,e):
    return ((residuals((A,V),q,y,e))**2).sum()
result_min = op.minimize(chisq, (0.3,0), args=(q,y,e))
print(result_min)
A,V = result_min.x
xf = np.arange(q.min(), q.max(), 0.1)
yf = model(xf,A,V)
print(xf)
print(yf)
plt.errorbar(q, y, e, fmt='ok')
plt.plot(xf,yf)
niceplot()
plt.show()
In my plot, the shape of the sine curve seems to fit my data, but it does not go through all the data points.
My question is: how can I perform sigma clipping so that I get a better fit to my data? I know curve_fit in SciPy may do the job, but I would like to know whether it is possible to perform sigma clipping while using minimize.
Many thanks in advance!
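Sigma clipping, as usually understood, is an iterative loop around whatever fitter is used: fit, mask the points whose weighted residuals exceed some cutoff, and refit until the mask stops changing. Below is a rough sketch of that loop around scipy.optimize.minimize, reusing the model and chisq functions from the question; the helper name, 3-sigma cutoff and iteration cap are arbitrary choices, not a tested solution.
import numpy as np
from scipy import optimize as op
def fit_with_sigma_clipping(q, y, e, p0=(0.3, 0.0), nsigma=3.0, max_iter=10):
    mask = np.ones(len(y), dtype=bool)        # start with every point included
    for _ in range(max_iter):
        res = op.minimize(chisq, p0, args=(q[mask], y[mask], e[mask]))
        A, V = res.x
        r = (y - model(q, A, V)) / e          # weighted residuals of all points
        new_mask = np.abs(r) < nsigma
        if np.array_equal(new_mask, mask):    # mask stopped changing: converged
            break
        mask = new_mask
    return (A, V), mask
The returned mask marks which points survived the clipping, so the rejected points can still be over-plotted for inspection.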
I need help with reading multiple netCDF files; despite the few examples on here, none of them works properly for me.
I am using Python(x,y) version 2.7.5 and these other packages: netCDF4 1.0.7-4, matplotlib 1.3.1-4, numpy 1.8, pandas 0.12, basemap 1.0.2...
There are a few things I'm used to doing with GrADS that I need to start doing in Python.
I have several years of 2-meter temperature data (4-hourly data, one file per year, from ECMWF). Each file contains 2-meter temperature data with Xsize = 480, Ysize = 241, Zsize (level) = 1 and Tsize (time) = 1460, or 1464 for leap years.
My file names look like this: t2m.1981.nc, t2m.1982.nc, t2m.1983.nc ...etc.
Based on this page (Loop through netcdf files and run calculations - Python or R), here is where I am now:
from pylab import *
import netCDF4 as nc
from netCDF4 import *
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
f = nc.MFDataset('d:/data/ecmwf/t2m.????.nc') # as '????' being the years
t2mtr = f.variables['t2m']
ntimes, ny, nx = shape(t2mtr)
temp2m = zeros((ny,nx),dtype=float64)
print ntimes
for i in xrange(ntimes):
    temp2m += t2mtr[i,:,:] #I'm not sure how to slice this, just wanted to get the 00Z values.
    # is it possible to assign to a new array,...
    #... (for eg.) the average values of 00z for January only from 1981-2000?
#creating a NetCDF file
nco = nc.Dataset('d:/data/ecmwf/t2m.00zJan.nc','w',clobber=True)
nco.createDimension('x',nx)
nco.createDimension('y',ny)
temp2m_v = nco.createVariable('t2m', 'i4', ( 'y', 'x'))
temp2m_v.units='Kelvin'
temp2m_v.long_name='2 meter Temperature'
temp2m_v.grid_mapping = 'Lambert_Conformal' # can it be something else or ..
#... eliminated?).This is straight from the solution on that webpage.
lono = nco.createVariable('longitude','f8')
lato = nco.createVariable('latitude','f8')
xo = nco.createVariable('x','f4',('x')) #not sure if this is important
yo = nco.createVariable('y','f4',('y')) #not sure if this is important
lco = nco.createVariable('Lambert_Conformal','i4') #not sure
#copy all the variable attributes from original file
for var in ['longitude','latitude']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var],att,getattr(f.variables[var],att))
# copy variable data for lon,lat,x and y
lono=f.variables['longitude'][:]
lato=f.variables['latitude'][:]
#xo[:]=f.variables['x']
#yo[:]=f.variables['y']
# write the temp at 2 m data
temp2m_v[:,:]=temp2m
# copy Global attributes from original file
for att in f.ncattrs():
    setattr(nco,att,getattr(f,att))
nco.Conventions='CF-1.6' #not sure what is this.
nco.close()
#attempt to plot the 00zJan mean
file=nc.Dataset('d:/data/ecmwf/t2m.00zJan.nc','r')
t2mtr=file.variables['t2m'][:]
lon=file.variables['longitude'][:]
lat=file.variables['latitude'][:]
clevs=np.arange(0,500.,10.)
map = Basemap(projection='cyl',llcrnrlat=0.,urcrnrlat=10.,llcrnrlon=97.,urcrnrlon=110.,resolution='i')
x,y=map(*np.meshgrid(lon,lat))
cs = map.contourf(x,y,t2mtr,clevs,extend='both')
map.drawcoastlines()
map.drawcountries()
plt.plot(cs)
plt.show()
My first question is about temp2m += t2mtr[i,:,:]. I am not sure how to slice the data so that I get only the 00Z values (let's say for January only) from all the files.
Second, while running the test, an error came up at cs = map.contourf(x,y,t2mtr,clevs,extend='both') saying "shape does not match that of z: found (1,1) instead of (241,480)". I suspect something is wrong with the output data, due to how the values were written, but I can't figure out what or where.
Thanks for your time. I hope this is not confusing.
So t2mtr is a 3d array
ntimes, ny, nx = shape(t2mtr)
This sums all values across the 1st axis:
for i in xrange(ntimes):
    temp2m += t2mtr[i,:,:]
A better way to do this is:
temp2m = np.sum(t2mtr, axis=0)
temp2m = t2mtr.sum(axis=0) # alt
If you want the average, use np.mean instead of np.sum.
To average across a subset of the times, jan_times, use an expression like:
jan_avg = np.mean(t2mtr[jan_times,:,:], axis=0)
This is simplest if you want just a simple range, e.g. the first 31 times. For simplicity I'm assuming the data is daily and the years are a constant length; you can adjust things for the 4-hourly frequency and leap years.
t2mtr[0:31,:,:]
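If the files really do hold four records per day starting at 00Z (which the 1460/1464 time steps per year suggest), the 00Z fields would simply be every fourth time step; this assumes the record order is 00, 06, 12, 18Z.
t2m_00z = t2mtr[::4, :, :]   # 00Z only, under that record-order assumption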
A simple way of getting the January data for several years is to construct an index like:
yr_starts = np.arange(0,3)*365 # can adjust for leap years
jan_times = (yr_starts[:,None]+ np.arange(31)).flatten()
# array([ 0, 1, 2, ... 29, 30, 365, ..., 756, 757, 758, 759, 760])
Another option would be to reshape t2mtr (this doesn't work well for leap years).
t2mtr.reshape(nyrs, 365, ny, nx)[:,0:31,:,:].mean(axis=1)
You could test the time sampling with something like:
np.arange(5*365).reshape(5,365)[:,0:31].mean(axis=1)
Doesn't the data set have a time variable? You might be able to extract the desired time indices from that. I worked with ECMWF data a number of years ago, but don't remember a lot of the details.
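If a time variable with CF-style units is present (typical for ECMWF files), a sketch of that idea with netCDF4.num2date might look like this; the variable name 'time' and its units attribute are assumptions about your files.
import numpy as np
import netCDF4 as nc
f = nc.MFDataset('d:/data/ecmwf/t2m.????.nc')
time = f.variables['time']
dates = nc.num2date(time[:], time.units)             # convert offsets to datetime objects
keep = np.array([d.month == 1 and d.hour == 0 for d in dates])   # January, 00Z only
jan00z_mean = f.variables['t2m'][keep, :, :].mean(axis=0)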
As for your contourf error, I would check the shape of the 3 main arguments: x,y,t2mtr. They should match. I haven't worked with Basemap.
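A quick way to narrow that down is to print the shapes right before the call; for this grid all three should come out as (241, 480). This is only a debugging hint, not a diagnosis.
print(x.shape, y.shape, t2mtr.shape)   # contourf wants three matching 2-D arrays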
I am trying to find the period of a sine curve, and I can find the right period for sin(t).
However, for sin(k*t) the frequency shifts, and I do not know how it shifts.
I can adjust the value of interd below to get the right signal only if I know the dataset is sin(0.6*t).
Why do I get the right result only for sin(t)? Can anyone detect the right signal based on my code, or with just a small change?
The figure below is the power spectral density of sin(0.6*t).
The dataset is like:
1,sin(1*0.6)
2,sin(2*0.6)
3,sin(3*0.6)
.........
2000,sin(2000*0.6)
And my code:
import numpy as np
import pylab as pl

timepoints = np.loadtxt('dataset', usecols=(0,), unpack=True, delimiter=",")
intensity = np.loadtxt('dataset', usecols=(1,), unpack=True, delimiter=",")
binshu = 300
lastime = 2000
interd = 2000.0/300
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity),d=interd)
freqnum = np.fft.fftfreq(len(intensity),d=interd).argsort()
pl.xlabel("frequency(Hz)")
pl.plot(freq[freqnum]*6.28, np.sqrt(sp.real**2+sp.imag**2)[freqnum])
I think you're making it too complicated. If you consider timepoints to be in seconds then interd is 1 (difference between values in timepoints). This works fine for me:
import numpy as np
import matplotlib.pyplot as pl
# you can do this in one line, that's what 'unpack' is for:
timepoints, intensity = np.loadtxt('dataset', usecols=(0,1), unpack=True, delimiter=",")
interd = timepoints[1] - timepoints[0] # if this is 1, it can be ignored
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity), d=interd)
pl.plot(np.fft.fftshift(freq), np.fft.fftshift(np.abs(sp)))
pl.xlabel("frequency(Hz)")
pl.show()
You'll also note that I didn't sort the frequencies, that's what fftshift is for.
Also, don't do np.sqrt(sp.imag**2 + sp.real**2), that's what np.abs is for :)
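To read the k you're after off the spectrum, take the frequency of the largest non-DC peak and convert it back to an angular frequency; with the dataset from the question (2000 samples, spacing 1) this comes out close to 0.6.
mag = np.abs(sp)
peak = np.argmax(mag[1:]) + 1            # skip the DC bin at index 0
k_est = 2 * np.pi * abs(freq[peak])      # angular frequency of the strongest component
print(k_est)                             # ~0.6 for the sin(0.6*t) dataset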
If you're not sampling densely enough (the signal frequency is higher than half your sample rate, i.e., 0.5*2*pi/interd < k), then there's no way for fft to know how much data you're missing, so it assumes you're not missing any. You can't expect it to know a priori. This is the data you're giving it: