Find big differences in numpy array - python

I have a csv file that contains data from two LED measurements. There are some mistakes in the file that give huge sparks in the graph, and I want to locate the places where this happens.
I have this code that makes two arrays that I plot.
x625 = np.array(df['LED Group 625'].dropna(axis=0, how='all'))
x940 = np.array(df['LED Group 940'].dropna(axis=0, how='all'))

I will provide an answer with some artificial data since you have not posted any data yet.
So after you convert the pandas columns into a numpy array, you can do something like this:
import numpy as np
# some random data: 100 rows and 1 column
x625 = np.random.rand(100,1)
# Assume that the maximum value in `x625` is a spark.
spark = x625.max()
# Find where this spark is in `x625`
np.where(x625==spark)
#(array([64]), array([0]))
The above means that a value equal to spark is located at row 64 of column 0.
Similarly, you can use np.where(x625 > any_number_here)
If instead of the location you need to create a boolean mask use this:
boolean_mask = (x625==spark)
# verify
np.where(boolean_mask)
# (array([64]), array([0]))
EDIT 1
You can use numpy.diff() to get the element-wise differences between consecutive elements of the array.
diffs = np.diff(x625.ravel())
Index 0 of diffs holds the result of element1 - element0, and so on.
If the value in diffs is big at a specific index position, then a spark occurred at that position.
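For example, here is a minimal sketch with made-up data and an assumed threshold (tune the threshold for your measurements) that combines np.diff() with np.where() to locate the sparks:
import numpy as np

x625 = np.random.rand(100, 1)            # stand-in for your data
diffs = np.diff(x625.ravel())            # diffs[i] = x625[i+1] - x625[i]

threshold = 0.9                          # assumed value, adjust for your data
spark_positions = np.where(np.abs(diffs) > threshold)[0]
# each index i in spark_positions means the jump happened between
# element i and element i+1 of the original array
print(spark_positions)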

Related

Can you extract indexes of data over a threshold from numpy array or pandas dataframe

I am using the following to compare several strings to each other. It's the fastest method I've been able to devise, but it results in a very large 2D array which I can look at and see what I want. Ideally, I would like to set a threshold and pull the index(es) for each value over that number. To make matters more complicated, I don't want the index comparing the string to itself, and it's possible the string might be duplicated elsewhere, so I would want to know if that's the case, so I can't just ignore 1's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
texts = sql.get_corpus()
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
similarity = cosine_similarity(vectors)
sql.get_corpus() returns a list of strings, currently about 1600 strings.
Is what I want possible? I've tried comparing each of the 1.4M combinations to each other using Levenshtein, which works, but it takes 2.5 hours vs half above. I've also tried vectors with spaCy, which takes days.
I'm not entirely sure I read your post correctly, but I believe this should get you started:
import numpy as np
# randomly distributed data we want to filter
data = np.random.rand(5, 5)
# get index of all values above a threshold
threshold = 0.5
above_threshold = data > threshold
# I am assuming your matrix has all string comparisons to
# itself on the diagonal
not_ident = np.identity(5) == 0.
# [edit: to prevent duplicate comparisons, use this instead of not_ident]
#upper_only = np.triu(np.ones((5,5)) - np.identity(5))
# 2D array, True when criteria met
result = above_threshold * not_ident
print(result)
# original shape, but 0 in place of all values not matching above criteria
values_orig_shape = data * result
print(values_orig_shape)
# all values that meet criteria, as a 1D array
values = data[result]
print(values)
# indices of all values that meet criteria (in same order as values array)
indices = [index for index,value in np.ndenumerate(result) if value]
print(indices)
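If you would rather have the index pairs as a NumPy array instead of a Python list, np.argwhere should give the same positions directly:
# same positions as `indices`, but as an (n, 2) array of (row, column) pairs
indices_arr = np.argwhere(result)
print(indices_arr)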

Get data from Pandas multiIndex

I am using pandas and uproot to read data from a .root file, and I get a table like the following one:
[table screenshot omitted]
So, from my .root file I have got some branches of a tree.
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['nCleanedJets', 'JetPt', 'JetMass', 'JetPhi']
df = ttree.pandas.df(branches, entrystop=40306)
Essentially, I have to retrieve the "JetPhi" data for each entry that has at least 2 subentries (or equivalently, entries for which "nCleanedJets" is greater than or equal to 2), calculate the difference in "JetPhi" between the first two subentries, and then make a histogram of these differences.
I have tried to look this up on the internet and tried different possibilities, but I have not found any useful solution.
If someone could give me any hint, advice and/or suggestion, I would be very grateful.
I used to code in C++, so I am new to Python and do not yet master this language.
You can do this in Pandas with
df[df["nCleanedJets"] >= 2]
because you have a column with the number of entries. The df["nCleanedJets"] >= 2 expression returns a Series of booleans (True if a row passes, False if a row doesn't pass) and passing a Series or NumPy array as a slice in square brackets masks by that array (returning rows for which the boolean array is True).
You could also do this in Awkward Array before converting to Pandas, which would be easier if you didn't have a "nCleanedJets" column.
array = ttree.arrays(branches, entrystop=40306)
selected = array[array.counts >= 2]
awkward.topandas(selected, flatten=True)
Masking in Awkward Array follows the same principle, but with data structures instead of flat Series or NumPy arrays (each element of array is a list of records with "nCleanedJets", "JetPt", "JetPhi", "JetMass" fields, and counts is the length of each list).
awkward.topandas with flatten=True is equivalent to what uproot does when outputtype=pandas.DataFrame and flatten=True (defaults for ttree.pandas.df).
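To go from there to the quantity asked about (the difference in "JetPhi" between the first two subentries of each entry), here is a rough sketch on the flattened Pandas DataFrame df from the question; it assumes the MultiIndex levels are named "entry" and "subentry", which is what uproot's pandas output uses:
import matplotlib.pyplot as plt

# pivot the subentries into columns: one row per entry, one column per jet index
phi = df["JetPhi"].unstack(level="subentry")

# difference between the first two jets; entries with fewer than 2 jets become NaN
dphi = (phi[0] - phi[1]).dropna()

dphi.hist(bins=50)
plt.xlabel("JetPhi[0] - JetPhi[1]")
plt.show()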

Interpolating from a pandas DataFrame or Series to a new DatetimeIndex

Let's say I have an hourly series in pandas; it is fine to assume the source is regular, but it has gaps. If I want to interpolate it to 15 min, the pandas API provides resample('15min').interpolate('cubic'). It interpolates to the new times and provides some control over the limits of interpolation. The spline helps to refine the series as well as to fill small gaps. To be concrete:
tndx = pd.date_range(start="2019-01-01",end="2019-01-10",freq="H")
tnum = np.arange(0.,len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan # too wide a gap
signal[160:168:2] = np.nan # these can be interpolated
df = pd.DataFrame({"signal":signal},index=tndx)
df1= df.resample('15min').interpolate('cubic',limit=9)
Now let's say I have an irregular datetime index. In the example below, the first time is a regular time point, the second is in the big gap and the last is in the interspersed brief gaps.
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00','2019-01-04 10:17','2019-01-07 16:00'])
How do I interpolate from the original (hourly) series to this irregular series of times?
Is the only option to build a series that includes the original data and the destination data? How would I do this? What is the most economical way to achieve the goals of interpolating to an independent irregular index and imposing a gap limit?
In the case of irregular timestamps, first set the datetime as the index, and then you can use the interpolate method with the index, e.g. df1 = df.resample('15min').interpolate('index').
You can find more information here https://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html
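As a concrete sketch of that idea (using the question's example data, without the gap limit, which the next answer addresses): reindex the source series onto the union of the original and destination timestamps, interpolate with method='index' so the actual time spacing is respected, and then select the destination rows:
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01", end="2019-01-10", freq="H")
signal = np.cos(np.arange(0., len(tndx)) * 2. * np.pi / 24.)
df = pd.DataFrame({"signal": signal}, index=tndx)

tndx2 = pd.DatetimeIndex(['2019-01-04 00:00', '2019-01-04 10:17', '2019-01-07 16:00'])

# insert the destination timestamps as NaN rows, interpolate using the index
# (so the spacing in time is respected), then keep only the destination rows
combined = df.reindex(df.index.union(tndx2))
result = combined.interpolate(method='index').loc[tndx2]
print(result)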
This is an example solution within the pandas interpolate API, which doesn't seem to have a way of using the abscissa and values from the source series to interpolate to new times provided by the destination index as a separate data structure. This method solves that by tacking the destination onto the source. It makes use of the limit argument of df.interpolate and can use any interpolation algorithm from that API, but it isn't perfect because the limit is expressed as a number of values, and if there are a lot of destination points in a patch of NaNs those get counted as well.
tndx = pd.date_range(start="2019-01-01",end="2019-01-10",freq="H")
tnum = np.arange(0.,len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan
signal[160:168:2] = np.nan
df = pd.DataFrame({"signal":signal},index=tndx)
# Express the destination times as a dataframe and append to the source
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00','2019-01-04 10:17','2019-01-07 16:00'])
df2 = pd.DataFrame( {"signal": [np.nan,np.nan,np.nan]} , index = tndx2)
big_df = df.append(df2,sort=True)
# At this point there are duplicates with NaN values at the bottom of the DataFrame
# representing the destination points. If these are surrounded by lots of NaNs in the source frame
# and we want the limit argument to work in the call to interpolate, the frame has to be sorted and duplicates removed.
big_df = big_df.loc[~big_df.index.duplicated(keep='first')].sort_index(axis=0,level=0)
# Extract at destination locations
interpolated = big_df.interpolate(method='cubic',limit=3).loc[tndx2]

How do I get one column of an array into a new Array whilst applying a fancy indexing filter on it?

So basically I have an array that consists of 14 columns and 426 rows. Every column represents one property of a dog and every row represents one dog. Now I want to know the average heart frequency of the ill dogs: the 14th column indicates whether a dog is ill or not [0 = healthy, 1 = ill], and the 8th column is the heart frequency. My problem is that I don't know how to get the 8th column out of the whole array and apply the boolean filter to it.
I am pretty new to Python. As I mentioned above, I think I know what I have to do [use a fancy indexing filter], but I don't know how to do it. I tried doing it while still working in the original array, but that didn't work out, so I thought I needed to get the information into another array and use the boolean filter on that one.
EDIT: Ok, so here is the code that I got right now:
import numpy as np

def average_heart_rate_for_pathologic_group(D):
    a = np.array(D[:, 13])  # gets the information whether the dogs are sick or not
    b = np.array(D[:, 7])   # gets the heart frequency
    R = (a >= 0)            # gets all the values that are from sick dogs
    amhr = np.mean(R)       # calculates the average heart frequency
    return amhr
I think boolean indexing is the way forward.
The shortcuts for this work like:
# Your data (as a NumPy array):
data = [[0,1,2,3,4,5,6,7,8...],[..]...]
# This indexing chooses the rows whose 14th column (the illness flag) equals 1
# and then takes their 8th-column (heart frequency) values. Any analysis can be
# done after this on the new variable.
heart_frequency_ill = data[data[:,13] == 1, 7]
Probably you'll have to actually copy the data from the original array into a new one with the selected data.
Could you please share a sample with, let's say, 3 or 4 rows of your data? I will give it a try though.
Let me build data with 4 columns here (but you could use 14 as in your problem)
data = [['c1a','c2a','c3a','c4a'], ['c1b','c2b','c3b','c4b']]
You could use numpy.array to get its nth column.
See how one can get the column at index 2 (the third column):
import numpy as np
a = np.array(data)
a[:,2]
If you want to get the 8th column of all the dogs that are healthy, you can do it as follows:
# we use 7 for the column index because indexing starts at 0
# we use np.argwhere to get the row indices where the condition is true,
# then fancy indexing to pull out the corresponding heart-frequency values
A[np.argwhere(A[:, 13] == 0)[:, 0], 7]
If you also want to compute the mean:
A[np.argwhere(A[:, 13] == 0)[:, 0], 7].mean()
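For the specific question (the average heart frequency of the ill dogs), a plain boolean mask is also enough. A minimal sketch, assuming the same column layout as in the question (index 13 = illness flag, index 7 = heart frequency):
import numpy as np

def average_heart_rate_for_pathologic_group(D):
    ill = D[:, 13] == 1          # 14th column: 1 = ill, 0 = healthy
    return D[ill, 7].mean()      # 8th column: heart frequency of the ill dogs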

Delete series value from row of a pandas data frame based on another data frame value

My question is a little bit different from the question posted here, so I thought to open a new thread. I have a pandas data frame with 5 attributes. One of these attributes is created using a pandas Series. Here is the sample code for creating the data frame:
import pandas as pd
import numpy as np
mydf1=pd.DataFrame(columns=['group','id','name','mail','gender'])
data = np.array([2540948, 2540955, 2540956,2540956,7138932])
x=pd.Series(data)
mydf1.loc[0]=[1,x,'abc','abc#xyz.com','male']
I have another data frame; the code for creating it is given below.
mydf2=pd.DataFrame(columns=['group','id'])
data1 = np.array([2540948, 2540955, 2540956])
y=pd.Series(data1)
mydf2.loc[0]=[1,y]
These are sample data; the actual data will have a large number of rows and the series will be long too. I want to match mydf1 with mydf2 and, where they match (sometimes there will be no matching element in mydf2), delete from mydf1 the id values that are also present in mydf2. For example, after the run, the id for group 1 should be 2540956, 7138932. I also tried the code mentioned in the above link, but for the first line
counts = mydf1.groupby('id').cumcount()
I got error message as
TypeError: 'Series' objects are mutable, thus they cannot be hashed
in Python 3.x. Can you please suggest how to solve this?
This should work. We use Counter to find the difference between the 2 lists of ids. (P.S. This problem does not require the difference to be in order.)
Setup
import pandas as pd
import numpy as np
from collections import Counter
mydf1=pd.DataFrame(columns=['group','id','name','mail','gender'])
x = [2540948, 2540955, 2540956,2540956,7138932]
y = [2540948, 2540955, 2540956,2540956,7138932]
mydf1.loc[0]=[1,x,'abc','abc#xyz.com','male']
mydf1.loc[1]=[2,y,'def','def#xyz.com','female']
mydf2=pd.DataFrame(columns=['group','id'])
x2 = np.array([2540948, 2540955, 2540956])
y2 = np.array([2540955, 2540956])
mydf2.loc[0]=[1,x2]
mydf2.loc[1]=[2,y2]
Code
mydf3 = mydf1[["group", "id"]]
mydf3 = mydf3.merge(mydf2, how="inner", on="group")
new_id_finder = lambda x: list((Counter(x.id_x) - Counter(x.id_y)).elements())
mydf3["new_id"] = mydf3.apply(new_id_finder, 1)
mydf3["new_id"]
group new_id
0 1 [2540956, 7138932]
1 2 [2540948, 2540956, 7138932]
One Counter object can subtract another to get the difference in the occurrences of elements. Then, you can use the elements() method to retrieve all the values that are left.
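A tiny standalone illustration of that Counter behaviour, using the group-1 id lists from the question:
from collections import Counter

left = Counter([2540948, 2540955, 2540956, 2540956, 7138932])   # ids in mydf1, group 1
right = Counter([2540948, 2540955, 2540956])                    # ids in mydf2, group 1
print(list((left - right).elements()))
# [2540956, 7138932]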
