Remove spike noise from data in Python

I'm transitioning all of my data analysis from MATLAB to Python and I've finally hit a block where I've been unable to quickly find a turnkey solution. I have time series data from many instruments, including an ADV (acoustic Doppler velocimeter), that require despiking. Previously I've used this function in MATLAB, which works quite well:
http://www.mathworks.com/matlabcentral/fileexchange/15361-despiking
Is anybody aware of a similar function available in Python?

I'd use a median filter; there are plenty of options depending on your data. For example:
import scipy.ndimage as im
x = im.median_filter(x, size=m)  # m = odd window length in samples, e.g. 5
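For despiking specifically (rather than smoothing everything), a common pattern is to compare the signal against its median-filtered version and replace only the outliers. This is not necessarily the same algorithm as the MATLAB submission above, just a simple robust-threshold alternative; the window size and threshold below are assumptions to tune:
import numpy as np
import scipy.ndimage as im

def despike(x, window=5, threshold=3.0):
    # Smooth reference signal: the local median is insensitive to short spikes
    baseline = im.median_filter(x, size=window)
    residual = x - baseline
    # Robust spread estimate via the median absolute deviation (MAD)
    mad = np.median(np.abs(residual)) + 1e-12
    # Flag samples deviating by more than `threshold` robust standard deviations
    spikes = np.abs(residual) > threshold * 1.4826 * mad
    cleaned = x.copy()
    cleaned[spikes] = baseline[spikes]  # replace spikes with the local median
    return cleaned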

Related

tsfresh efficient feature set extraction is stuck at 0%

I am using the tsfresh library to extract features from my time-series dataset.
My problem is that I can only use the setting for the minimal feature set (MinimalFCParameters) because even the efficient one (EfficientFCParameters) is always stuck at 0% and never calculates any features.
The data is pretty large (over 40 million rows, 100k windows), but this is the smallest data set I am going to use. I am using a compute cluster, so computing resources should not be the issue. I also tried the n_jobs parameter of the extract_features method (n_jobs=32). Finally, as suggested by the tsfresh website, I used dask instead of pandas for the input data frame, but with no success either.
My questions are: Is there anything else I can try? Or are there any other libraries I could use?
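For reference, a minimal version of the setup described above might look like the following sketch (the data frame and its column names are placeholders for the real schema):
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

# Placeholder long-format input: one row per measurement, one id per window
df = pd.DataFrame({"id": [1, 1, 2, 2],
                   "time": [0, 1, 0, 1],
                   "value": [1.0, 2.0, 3.0, 4.0]})

features = extract_features(df,
                            column_id="id",      # window identifier column
                            column_sort="time",  # ordering column
                            default_fc_parameters=EfficientFCParameters(),
                            n_jobs=32)           # parallel workers, as tried above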

The syntax for computing the Stockwell transform using the MNE package?

I'm trying to compute the Stockwell transform of an EEG signal, and can't seem to find a command for doing so anywhere in scipy. In a post from 2015 (Stockwell transform in Python), @Jariani suggested using the MNE EEG analysis package, with which I am familiar. However, the usage documentation at https://mne.tools/0.11/auto_examples/time_frequency/plot_stockwell.html requires constructing "events" and "epochs" first.
All I want is the transform for one channel without defining these quantities first if possible. Here's my code thus far:
import mne as mn
from mne.time_frequency import tfr_stockwell
# Read in raw EEG file and pick off data and times:
raw=mn.io.read_raw_edf('/Users/fishbacp/Desktop/myfile.edf', preload=True)
data,times=raw[:,:]
# Construct the data matrix of dimension number_of_times-by-number_of_channels:
X=data.T
# Try to compute the Stockwell transform for the first channel, for frequencies from 6 to 30 Hz:
power = tfr_stockwell(X[:,0], fmin=6., fmax=30., width=.3, return_itc=False)
The error message states, among other things, "TypeError: inst must be Epochs or Evoked," which leads me to believe I can't pass in a simple time series for one channel.
Any help using the syntax correctly or suggesting some other means for computing the Stockwell Transform would be appreciated. I've tried the package at https://github.com/claudiodsf/stockwell but can't seem to follow the installation process.
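One workaround sketch, since tfr_stockwell insists on an Epochs-like input: wrap the single channel in an mne.EpochsArray with one synthetic epoch spanning the whole recording, without defining real events. The file path is the questioner's; everything else is an assumption to adapt:
import numpy as np
import mne
from mne.time_frequency import tfr_stockwell

raw = mne.io.read_raw_edf('/Users/fishbacp/Desktop/myfile.edf', preload=True)
data = raw.get_data(picks=[0])  # first channel, shape (1, n_times)
info = mne.create_info(ch_names=[raw.ch_names[0]],
                       sfreq=raw.info['sfreq'], ch_types='eeg')
# EpochsArray expects shape (n_epochs, n_channels, n_times): one fake epoch
epochs = mne.EpochsArray(data[np.newaxis, :, :], info)
power = tfr_stockwell(epochs, fmin=6., fmax=30., width=0.3, return_itc=False)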

Analyse audio files with Python

I currently have a photodiode connected to my PC and do the capturing with Audacity.
I want to improve this by using an old RPi 1 as a dedicated test station. As a result, the shutter speed should appear on the console. I would prefer a Python solution for capturing the signal and analysing it.
Can anyone give me some suggestions? I played around with oct2py, but I don't really understand how to calculate the time between the two peaks of the signal.
I have no expertise in sound analysis with Python; this is what I found doing some internet research, as I am interested in this topic myself.
pyAudioAnalysis
You can use pyAudioAnalysis, developed by Theodoros Giannakopoulos.
For your purpose, the function mtFileClassification() from audioSegmentation.py can be a good start. This function
splits an audio signal into successive mid-term segments and extracts mid-term feature statistics from each of these segments, using mtFeatureExtraction() from audioFeatureExtraction.py
classifies each segment using a pre-trained supervised model
merges successive fixed-size segments that share the same class label into larger segments
visualizes statistics regarding the results of the segmentation-classification process.
For instance
from pyAudioAnalysis import audioSegmentation as aS
[flagsInd, classesAll, acc, CM] = aS.mtFileClassification("data/scottish.wav","data/svmSM", "svm", True, 'data/scottish.segments')
Note that the last argument of this function is a .segments file. This is used as ground truth (if available) in order to estimate the overall performance of the classification-segmentation method. If this file does not exist, the performance measure is not calculated. These files are simple comma-separated files of the format: start time, end time, label. For example:
0.01,9.90,speech
9.90,10.70,silence
10.70,23.50,speech
23.50,184.30,music
184.30,185.10,silence
185.10,200.75,speech
...
If I have understood your question correctly, this is at least part of what you want to generate, isn't it? Though I rather think you have to provide it yourself here.
Most of this information is quoted directly from the project's wiki, which I suggest you read. Don't hesitate to reach out, as I am genuinely interested in this topic.
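That said, for the narrower task of measuring the time between the two peaks, a dedicated audio-analysis library may be more than you need. A minimal sketch with scipy, assuming the capture has been saved as a WAV file (the filename and the peak-detection thresholds are assumptions to tune):
import numpy as np
from scipy.io import wavfile
from scipy.signal import find_peaks

rate, samples = wavfile.read('capture.wav')  # hypothetical photodiode recording
if samples.ndim > 1:
    samples = samples[:, 0]  # keep one channel if the file is stereo
envelope = np.abs(samples.astype(float))
# Find prominent peaks; height and minimum distance must be tuned to the signal
peaks, _ = find_peaks(envelope, height=0.5 * envelope.max(), distance=rate // 1000)
if len(peaks) >= 2:
    dt = (peaks[1] - peaks[0]) / rate  # seconds between the first two peaks
    print(f"Time between peaks: {dt:.6f} s (~1/{1 / dt:.0f} s shutter speed)")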
Other available libraries for audio analysis:

Plot specifying column by name, upper case issue

I'm learning how to plot things (CSV files) in Python, using import matplotlib.pyplot as plt.
Column1;Column2;Column3;
1;4;6;
2;2;6;
3;3;8;
4;1;1;
5;4;2;
I can plot the one above with plt.plotfile('test0.csv', (0, 1), delimiter=';'), which gives me the figure below.
Do you see the axis labels, column1 and column2? They are in lower case in the figure, but in the data file they begin with upper case.
Also, I tried plt.plotfile('test0.csv', ('Column1', 'Column2'), delimiter=';'), which does not work.
So it seems matplotlib.pyplot works only with lowercase names :(
Adding this issue to that other one, I guess it's time to try something else.
As I am pretty new to plotting in Python, I would like to ask: Where should I go from here, to get a little more than what matplotlib.pyplot provides?
Should I go to pandas?
You are mixing up two things here.
Matplotlib is designed for plotting data. It is not designed for managing data.
Pandas is designed for data analysis. Even if you were using pandas, you would still need to plot the data. How? Well, probably using matplotlib!
Independently of what you're doing, think of it as a three step process:
Data acquisition / read-in
Data processing
Data representation / plotting
plt.plotfile() is a convenience function, which you can use if you don't need step 2 at all. But it surely has its limitations.
Methods to read in data (not a complete list, of course) include pure Python open(), the csv module's reader or similar, numpy/scipy, pandas, etc.
Depending on what you want to do with your data, you can then choose a suitable input method: numpy for large numerical data sets, pandas for data sets that include qualitative data or rely heavily on cross-correlations, etc.
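To make that concrete for the CSV above, a minimal sketch that separates the read-in (pandas) from the plotting (matplotlib) and keeps the column names exactly as they appear in the file:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('test0.csv', sep=';')
df = df.dropna(axis=1, how='all')  # drop the empty column created by the trailing ';'
plt.plot(df['Column1'], df['Column2'])  # column names keep their original case
plt.xlabel('Column1')
plt.ylabel('Column2')
plt.show()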

Geospatial Analytics in Python

I have been doing some investigation to find a package to install and use for geospatial analytics.
The closest I got was https://github.com/harsha2010/magellan - this, however, only has a Scala interface and no documentation on how to use it with Python.
I was hoping someone knows of a package I can use.
What I am trying to do is analyse Uber's data, map it to the actual postcodes/suburbs, and run it through SGD to predict the number of trips to a particular suburb.
There is already lots of info here - http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/#comment-606532 - and I am looking for ways to do it in Python.
In Python I'd take a look at GeoPandas. It provides a data structure called GeoDataFrame: it's a list of features, each one having a geometry and some optional attributes. You can join two GeoDataFrames together based on geometry intersection, and you can aggregate the numbers of rows (say, trips) within a single geometry (say, postcode).
1. I'm not familiar with Uber's data, but I'd try to find a way to get it into a GeoPandas GeoDataFrame.
2. Likewise, postcodes can be downloaded from places like the U.S. Census, OpenStreetMap[1], etc., and coerced into a GeoDataFrame.
3. Join #1 to #2 based on geometry intersection. You want a new GeoDataFrame with one row per Uber trip, but with the postcode attached to each. Another StackOverflow post discusses how to do this, and it's currently harder than it ought to be.
4. Aggregate this by postcode and count the trips in each. The code will look like joined_dataframe.groupby('postcode').count().
My fear with the above process is that, if you have hundreds of thousands of very complex trip geometries, it could take forever on one machine. The link you posted uses Spark, and you may end up wanting to parallelize this after all. You can write Python against a Spark cluster(!), but I'm not the person to help you with that component.
Finally, for the prediction component (e.g. SGD), check out scikit-learn: it's a pretty fully featured machine learning package, with a dead simple API.
[1]: There is a separate package called geopandas_osm that grabs OSM data and returns a GeoDataFrame: https://michelleful.github.io/code-blog/2015/04/27/osm-data/
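A sketch of the join-and-aggregate steps (3 and 4) with GeoPandas; the file names and the 'postcode' column are placeholders, and older GeoPandas versions spell the predicate keyword op instead of predicate:
import geopandas as gpd

trips = gpd.read_file('uber_trips.geojson')  # placeholder: one point per trip
postcodes = gpd.read_file('postcodes.shp')   # placeholder: one polygon per postcode
trips = trips.to_crs(postcodes.crs)          # both layers must share a CRS

# Step 3: attach a postcode to every trip via geometric intersection
joined = gpd.sjoin(trips, postcodes, how='inner', predicate='within')

# Step 4: count the trips in each postcode
counts = joined.groupby('postcode').size()
print(counts.head())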
I realize this is an old question, but to build on Jeff G's answer:
If you arrive at this page looking for help putting together a suite of geospatial analytics tools in Python, I would highly recommend this tutorial.
https://geohackweek.github.io/vector
It really picks up steam in the 3rd section.
It shows how to integrate
GeoPandas
PostGIS
Folium
rasterstats
Add in scikit-learn, NumPy, and SciPy, and you can really accomplish a lot. You can grab information from this nDarray tutorial as well.
