Get data from a Pandas MultiIndex - python

I am using pandas and uproot to read data from a .root file, which gives me a DataFrame with an entry/subentry MultiIndex. From my .root file I have read some branches of a tree:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['nCleanedJets', 'JetPt', 'JetMass', 'JetPhi']
df = ttree.pandas.df(branches, entrystop=40306)
Essentially, I have to retrieve the "JetPhi" data for each entry that has at least two subentries (equivalently, entries for which "nCleanedJets" is greater than or equal to 2), compute the difference in "JetPhi" between the first two subentries, and then make a histogram of those differences.
I have searched the internet and tried different approaches, but I have not found a working solution.
If someone could give me a hint, advice, and/or suggestion, I would be very grateful.
I used to code in C++, so I am new to Python and do not yet master the language.

You can do this in Pandas with
df[df["nCleanedJets"] >= 2]
because you have a column with the number of entries. The df["nCleanedJets"] >= 2 expression returns a Series of booleans (True if a row passes, False if a row doesn't pass) and passing a Series or NumPy array as a slice in square brackets masks by that array (returning rows for which the boolean array is True).
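For the full task in the question (select the entries, take the first two subentries' "JetPhi", and histogram the difference), a minimal sketch using a toy DataFrame that mimics the (entry, subentry) MultiIndex layout of ttree.pandas.df — all values below are made up:

```python
import pandas as pd

# toy DataFrame with the (entry, subentry) MultiIndex layout that
# ttree.pandas.df produces; the jet values here are invented
idx = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (2, 0), (2, 1)],
    names=["entry", "subentry"],
)
df = pd.DataFrame(
    {"nCleanedJets": [3, 3, 3, 1, 2, 2],
     "JetPhi": [0.5, -1.2, 2.0, 0.1, 1.5, -0.5]},
    index=idx,
)

# keep only entries with at least two jets
selected = df[df["nCleanedJets"] >= 2]

# JetPhi of the first and second subentry of each surviving entry
phi0 = selected.xs(0, level="subentry")["JetPhi"]
phi1 = selected.xs(1, level="subentry")["JetPhi"]
dphi = phi0 - phi1  # a Series indexed by entry
```

dphi is then ready to be passed to matplotlib's plt.hist(dphi) to get the histogram of the differences.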
You could also do this in Awkward Array before converting to Pandas, which would be easier if you didn't have a "nCleanedJets" column.
array = ttree.arrays(branches, entrystop=40306)
selected = array[array.counts >= 2]
awkward.topandas(selected, flatten=True)
Masking in Awkward Array follows the same principle, but with data structures instead of flat Series or NumPy arrays (each element of array is a list of records with "nCleanedJets", "JetPt", "JetPhi", "JetMass" fields, and counts is the length of each list).
awkward.topandas with flatten=True is equivalent to what uproot does when outputtype=pandas.DataFrame and flatten=True (defaults for ttree.pandas.df).

Related

Multiply pd DataFrame column with 7-digit scalar

I am trying to modify a pandas dataframe column this way:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE["Var"]["Jan"] = 2678400*SLICE["Var"]["Jan"]
However, this does not work. The resulting column SLICE["Var"]["Jan"] is still the same as before the multiplication.
If I multiply with 2 orders of magnitude less, the multiplication works. Also a subsequent multiplication with 100 to receive the same value that was intended in the first place, works.
SLICE["Var"]["Jan"] = 26784*SLICE["Var"]["Jan"]
SLICE["Var"]["Jan"] = 100*SLICE["Var"]["Jan"]
It seems like the scalar is too large for the multiplication. Is this a Python thing or a pandas thing? How can I make sure that the multiplication with the 7-digit number works directly?
I am using Python 3.8; the numbers in the dataframe are float32, in a range between 5.0e-5 and -5.0e-5, with some values having an absolute value smaller than 1e-11.
EDIT: It might have to do with the 2-level column indexing. When I delete the first level, the calculation works:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE=SLICE.droplevel(0, axis=1)
SLICE["Jan"] = 2678400*SLICE["Jan"]
Your first method might give a SettingWithCopyWarning, which basically means the changes may not be made to the actual dataframe. You can use .loc instead:
SLICE.loc[:,('Var', 'Jan')] = SLICE.loc[:,('Var', 'Jan')]*2678400
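A minimal sketch of the difference, using a toy frame with a 2-level column index; the column names mirror the question and the values are made up:

```python
import numpy as np
import pandas as pd

# toy frame with a 2-level column index like SLICE; values are invented
cols = pd.MultiIndex.from_product([["Var"], ["Jan", "Feb"]])
SLICE = pd.DataFrame(
    np.array([[1.0e-5, 2.0e-5], [3.0e-5, 4.0e-5]], dtype="float32"),
    columns=cols,
)

# chained indexing (SLICE["Var"]["Jan"] = ...) may write to a temporary
# copy; .loc with the full column tuple writes into the frame itself
SLICE.loc[:, ("Var", "Jan")] = SLICE.loc[:, ("Var", "Jan")] * 2678400
```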

Find big differences in numpy array

I have a csv file that contains data from two LED measurements. There are some mistakes in the file that produce huge sparks in the graph. I want to locate the places where this happens.
I have this code that makes two arrays that I plot.
x625 = np.array(df['LED Group 625'].dropna(axis=0, how='all'))
x940 = np.array(df['LED Group 940'].dropna(axis=0, how='all'))
I will provide an answer with some artificial data since you have not posted any data yet.
So after you convert the pandas columns into a numpy array, you can do something like this:
import numpy as np
# some random data. 100 lines and 1 column
x625 = np.random.rand(100,1)
# Assume that the maximum value in `x625` is a spark.
spark = x625.max()
# Find where this spark value occurs in `x625`
np.where(x625==spark)
#(array([64]), array([0]))
The above means that a value equal to spark is located on the 64th line of the 0th column.
Similarly, you can use np.where(x625 > any_number_here)
If instead of the location you need to create a boolean mask use this:
boolean_mask = (x625==spark)
# verify
np.where(boolean_mask)
# (array([64]), array([0]))
EDIT 1
You can use numpy.diff() to get the element-wise differences of consecutive elements in the array.
diffs = np.diff(x625.ravel())
Index 0 of diffs will hold the result of element1 - element0, and so on.
If the values in diffs are large at a specific index position, then a spark occurred at that position.
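A small sketch of that diff-based approach, with artificial data and an assumed threshold of 5 (both made up for illustration):

```python
import numpy as np

# artificial signal with two injected sparks (all values invented)
x = np.array([1.0, 1.1, 1.05, 9.0, 1.1, 1.0, -8.0, 1.02])

diffs = np.diff(x)                      # diffs[i] = x[i+1] - x[i]
jumps = np.where(np.abs(diffs) > 5)[0]  # positions of big jumps
# each spark shows up twice: a big jump into it and a big jump out of it
spike_samples = jumps + 1               # sample after each big jump
```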

sort an array by row in python

I understood that sorting a numpy array arr by column (for only a particular column, for example, its 2nd column) can be done with:
arr[arr[:,1].argsort()]
How I understood this code sample works: argsort sorts the values of the 2nd column of arr, and gives the corresponding indices as an array. This array is given to arr as row numbers. Am I correct in my interpretation?
Now I wonder what if I want to sort the array arr with respect to the 2nd row instead of the 2nd column? Is the simplest way to transpose the array before sorting it and transpose it back after sorting, or is there a way to do it like previously (by giving an array with the number of the columns we wish to display)?
Instead of doing (n,n)array[(n,)array] (where n is the size of the 2d array), I tried something like (n,n)array[(n,1)array] to indicate the column numbers, but it does not work.
EXAMPLE of what I want:
arr = [[11,25],[33,4]] => base array
arr_col2=[[33,4],[11,25]] => array I got with argsort()
arr_row2=[[25,11],[4,33]] => array I tried to get in a simple way with argsort() but did not succeed
I assume that arr is a numpy array? I haven't seen the syntax arr[:,1] in any other context in python. It would be worth mentioning this in your question!
Assuming this is the case, then you should be using
arr.sort(axis=0)
to sort by column and
arr.sort(axis=1)
to sort by row. (Both sort in-place, i.e. change the value of arr. If you don't want this you can copy arr into another variable first, and apply sort to that.)
If you want to sort just a single row (in this case, the second one) then
arr[1,:].sort()
works.
Edit: I now understand what problem you are trying to solve. You would like to reorder the columns in the matrix so that the nth row goes in increasing order. You can do this simply by
arr[:,arr[1,:].argsort()]
(where here we're sorting by the 2nd row).
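Using the example array from the question, both orderings can be checked directly:

```python
import numpy as np

arr = np.array([[11, 25], [33, 4]])

# reorder the rows so the 2nd column is ascending
arr_col2 = arr[arr[:, 1].argsort()]
# -> [[33, 4], [11, 25]]

# reorder the columns so the 2nd row is ascending
arr_row2 = arr[:, arr[1, :].argsort()]
# -> [[25, 11], [4, 33]]
```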

Numpy arrays with compound keys; find subset in both

I have two 2D numpy arrays shaped:
(19133L, 12L)
(248L, 6L)
In each case, the first 3 fields form an identifier.
I want to reduce the larger matrix so that it only contains rows with identifiers that also exist in the second matrix. So the shape should be (248L, 12L). How can I do this?
I would then like to sort it so that the rows are ordered by the first, second, and third values, so that (3 3 4) comes before (3 3 5), etc. Is there a multi-field sort function?
Edit:
I have tried pandas:
df1 = DataFrame(arr1.astype(str))
df2 = DataFrame(arr2.astype(str))
df1.set_index([0,1,2])
df2.set_index([0,1,2])
out = merge(df1,df2,how="inner")
print(out.shape)
But this results in (0,13) shape
Use pandas.
pandas' DataFrame.set_index() accepts multiple keys, so set the index to the first three columns (use drop=False if you also want to keep the identifier columns as data; note that set_index returns a new frame unless you pass inplace=True or assign the result back).
Then, merge(...how='inner') to intersect your dataframes.
In general, numpy runs out of steam very quickly for arbitrary dataframe manipulations; your default should be to try pandas, which is also much more performant for this kind of keyed operation.
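A small sketch of that recipe with made-up stand-ins for the two arrays, merging on the first three columns explicitly rather than via the index. (A likely reason the attempt in the question returned no rows is that merge without on= joins on every shared column name, not just the first three, and that set_index without inplace=True or reassignment has no effect.)

```python
import numpy as np
import pandas as pd

# small stand-ins for the two arrays; the first 3 columns are the identifier
arr1 = np.array([[3, 3, 5, 10, 11, 12],
                 [3, 3, 4, 13, 14, 15],
                 [9, 9, 9, 16, 17, 18],
                 [1, 2, 3, 19, 20, 21]])
arr2 = np.array([[3, 3, 4, 0, 0, 0],
                 [1, 2, 3, 0, 0, 0]])

df1 = pd.DataFrame(arr1)
df2 = pd.DataFrame(arr2[:, :3])  # keep only the identifier columns

# inner merge on the first three columns keeps only rows of df1 whose
# identifier also appears in df2; then sort by the identifier fields
out = df1.merge(df2, on=[0, 1, 2], how="inner").sort_values([0, 1, 2])
```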

Change dtype for particular values in numpy array?

I have a numpy array x with dimensions (20, 4), in which only the first row and first column contain real string values (letters), while the rest of the values are numerals stored as strings. I want to change these numeral values to float or integer type.
I have tried some steps:
a. I made copies of first row and column of the array as separate variables:
x_row = x[0]
x_col = x[:,0]
Then I deleted them from the original array x (using the numpy.delete() method) and converted the type of the remaining values with a for loop that iterates over each value. However, when I stack the copied row and column back using numpy.vstack() and numpy.hstack(), everything converts back to string type, and I am not sure why this is happening.
b. Same procedure as point a, except I used numpy.insert() method for inserting rows and columns, but is doing the same thing - converting everything back to string type.
So, is there a way by which I don't have to go through this deleting-and-stacking mechanism (which isn't working anyway), and can change all the values (except the first row and column) of the array to int or float type?
All items in a numpy array have to have the same dtype. That is a fundamental fact about numpy. You could possibly use a numpy recarray, or you could use dtype=object which basically lets all values be anything.
I'd recommend you take a look at pandas, which provides a tabular data structure that allows different columns to have different types. It sounds like what you have is a table with row and column labels, and that's what pandas deals with nicely.
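A minimal sketch of the pandas approach, with a made-up small array in place of x (the shape and labels are invented):

```python
import numpy as np
import pandas as pd

# toy stand-in for x: first row and column hold labels, the rest are
# numerals stored as strings
x = np.array([["",   "c1",  "c2"],
              ["r1", "1",   "2.5"],
              ["r2", "4",   "6.5"]])

# first row -> column labels, first column -> row labels,
# remaining block converted to float in one step
df = pd.DataFrame(x[1:, 1:], index=x[1:, 0], columns=x[0, 1:]).astype(float)
```

This keeps the labels as the DataFrame's index and columns, so the data block itself can have a single numeric dtype.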
