Numpy arrays with compound keys; find subset in both - python

I have two 2D numpy arrays shaped:
(19133L, 12L)
(248L, 6L)
In each case, the first 3 fields form an identifier.
I want to reduce the larger matrix so that it only contains rows with identifiers that also exist in the second matrix. So the shape should be (248L, 12L). How can I do this?
I would then like to sort it so that the rows are ordered by the first value, then the second, then the third, so that (3 3 4) comes before (3 3 5), etc. Is there a multi-field sort function?
Edit:
I have tried pandas:
from pandas import DataFrame, merge
df1 = DataFrame(arr1.astype(str))
df2 = DataFrame(arr2.astype(str))
df1.set_index([0,1,2])
df2.set_index([0,1,2])
out = merge(df1,df2,how="inner")
print(out.shape)
But this results in a shape of (0, 13).

Use pandas.
DataFrame.set_index() accepts multiple keys, so set the index to the first three columns (use inplace=True to avoid copying your dataframe, and drop=False if you still need those columns as data).
Then merge(..., how='inner') to intersect your dataframes, and sort_index() gives you the multi-field ordering you describe.
In general, numpy runs out of steam very quickly for arbitrary table manipulations; your default should be to reach for pandas. It is usually also much faster for this kind of work.
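A minimal sketch of that approach, assuming arr1 and arr2 are the two arrays from the question and that their first three columns line up as the identifier (the column labels 0, 1, 2 come from the default DataFrame conversion):
import pandas as pd

# assumed shapes: arr1 is (19133, 12), arr2 is (248, 6); the first three columns form the key
df1 = pd.DataFrame(arr1)
df2 = pd.DataFrame(arr2)
keys = [0, 1, 2]

# inner merge on the key columns keeps only the rows of df1 whose identifier also occurs in df2
out = pd.merge(df1, df2[keys], on=keys, how='inner')

# multi-field sort: index by the three key fields (drop=False keeps them as data) and sort lexicographically
out = out.set_index(keys, drop=False).sort_index()
print(out.shape)   # expected (248, 12) if every identifier in arr2 matches exactly one row of arr1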

Related

how to extract overlapping sub-arrays with a window size and flatten them

I am trying to get better at using numpy functions and methods to run my programs in python faster
I want to do the following:
I create an array 'a' as:
a=np.random.randint(-10,11,10000).reshape(-1,10)
a.shape: (1000,10)
I create another array which takes only the first two columns in array a
b=a[:,0:2]
b.shape: (1000,2)
Now I want to create an array c which has 990 rows, each containing a flattened slice of 10 consecutive rows of array 'b'.
So the first row of array 'c' will have 20 columns, which is the flattened slice of rows 0 to 9 of array 'b'. The next row of array 'c' will have 20 columns made of the flattened rows 1 to 10 of array 'b', etc.
I can do this with a for loop, but I want to know if there is a much faster way to do it using numpy functions and methods like strides or something else.
Thanks for your time and your help.
This loops over shifts rather than rows (loop of size 10):
N = 10
c = np.hstack([b[i:i-N] for i in range(N)])
Explanation: b[i:i-N] is b's rows from i to m-(N-i) (excluding m-(N-i) itself), where m is the number of rows in b. np.hstack then stacks those selected sub-arrays horizontally (it stacks b[0:m-10], b[1:m-9], ..., b[9:m-1]), which gives exactly the windows the question describes.
c.shape: (990, 20)
Also I think you may be looking for a shape of (991, 20) if you want to include all windows.
You can also use strides, but if you want to do further operations on the result I would advise against it, since the memory layout of strided views is tricky. Here is a strides-based solution if you insist:
from skimage.util.shape import view_as_windows
c = view_as_windows(b, (10,2)).reshape(-1, 20)
c.shape: (991, 20)
If you don't want the last row, simply drop it with c[:-1].
A similar solution can be written with numpy's as_strided function (numpy.lib.stride_tricks.as_strided), which works on the same principle.
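If your NumPy is recent enough (1.20 or later), numpy.lib.stride_tricks.sliding_window_view gives you the same windowed view without skimage; a minimal sketch:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.random.randint(-10, 11, 10000).reshape(-1, 10)
b = a[:, 0:2]

# every (10, 2) window of b, flattened into rows of 20 values
c = sliding_window_view(b, (10, 2)).reshape(-1, 20)
print(c.shape)   # (991, 20); take c[:-1] if you want the 990-row version from the question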
UPDATE: if you want to find unique values and their frequencies in each row of c you can do:
unique_values = []
unique_counts = []
for row in c:
    unique, unique_c = np.unique(row, return_counts=True)
    unique_values.append(unique)
    unique_counts.append(unique_c)
Note that numpy arrays have to be rectangular, meaning the number of elements along each dimension must be the same. Since different rows of c can have different numbers of unique values, you cannot put the unique values of every row into one regular numpy array (an alternative would be a structured numpy array). A practical solution is therefore a list of arrays: unique_values is a list of arrays holding the unique values of each row of c, and unique_counts holds their frequencies in the same order.

Get data from Pandas multiIndex

I am using pandas and uproot to read data from a .root file, and I get a multi-indexed table (screenshot of the table not reproduced here).
So, from my .root file I have got some branches of a tree.
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['nCleanedJets', 'JetPt', 'JetMass', 'JetPhi']
df = ttree.pandas.df(branches, entrystop=40306)
Essentially, I have to retrieve the "JetPhi" data for each entry that has at least two subentries (equivalently, entries for which "nCleanedJets" is greater than or equal to 2), calculate the difference of "JetPhi" between the first two subentries, and then make a histogram of those differences.
I have tried to look up in the internet and tried different possibilities but I have not found any useful solution.
If someone could give me any hint, advice and/or suggestion, I would be very grateful.
I used to code in C++, so I am new to Python and do not yet master this language.
You can do this in Pandas with
df[df["nCleanedJets"] >= 2]
because you have a column with the number of entries. The df["nCleanedJets"] >= 2 expression returns a Series of booleans (True if a row passes, False if a row doesn't pass) and passing a Series or NumPy array as a slice in square brackets masks by that array (returning rows for which the boolean array is True).
You could also do this in Awkward Array before converting to Pandas, which would be easier if you didn't have a "nCleanedJets" column.
array = ttree.arrays(branches, entrystop=40306)
selected = array[array.counts >= 2]
awkward.topandas(selected, flatten=True)
Masking in Awkward Array follows the same principle, but with data structures instead of flat Series or NumPy arrays (each element of array is a list of records with "nCleanedJets", "JetPt", "JetPhi", "JetMass" fields, and counts is the length of each list).
awkward.topandas with flatten=True is equivalent to what uproot does when outputtype=pandas.DataFrame and flatten=True (defaults for ttree.pandas.df).
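For the actual delta-phi histogram, here is a minimal sketch of the pandas route; it assumes the flattened DataFrame df has the ("entry", "subentry") MultiIndex that uproot normally produces (adjust the level name if yours differs):
import matplotlib.pyplot as plt

# keep only entries with at least two jets
sel = df[df["nCleanedJets"] >= 2]

# "JetPhi" of the first and second subentry of each surviving entry
first = sel.xs(0, level="subentry")["JetPhi"]
second = sel.xs(1, level="subentry")["JetPhi"]

# difference between the first two jets, aligned on the entry index, then histogrammed
dphi = first - second
plt.hist(dphi, bins=50)
plt.xlabel("JetPhi[0] - JetPhi[1]")
plt.show()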

Numpy np.array() constructor behaving "inconsistently"

I have two Pandas dataframes, say df1 and df2 (shape (10, 15)), and I want to turn them into Numpy arrays, and then construct a Numpy array containing both of them (shape (2, 10, 15)). I'm currently doing this as follows:
data1 = df1.to_numpy()
data2 = df2.to_numpy()
data = np.array([data1, data2])
Now I'm trying to do this for many pairs of dataframes, and the code I'm using will break when I call data.any() for some of the pairs, giving the truth value error saying to use any() or all() (which I'm already doing). I started printing data when I saw this happening, and I noticed that the np.array() constructor will produce something that looks like [[[...]]] or [array([[...]])].
The first one works fine, but the second doesn't. The difference isn't random with respect to the dataframes; it breaks for certain ones, even though all of these dataframes are preprocessed and processed the same way and I've manually checked that the failing ones don't have any anomalies.
Since I can't provide much explicit code/data (the code is pretty bulky, and the arrays are 300 entries each), my main question is why the array constructor gives either the [[[...]]] or the [array([[...]])] form, and why the second one breaks when I call data.any().
The issue is that after processing the data, some of the dataframes were missing rows (i.e. they had shape (x, 15) with x < 10). When that happened, constructing the data array gave a shape of (2,) instead of (2, 10, 15); as long as both df1 and df2 had the same number of rows, it worked fine.
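A minimal sketch of that behavior with toy arrays (on recent NumPy versions the ragged case needs an explicit dtype=object, otherwise it raises an error):
import numpy as np

a = np.zeros((10, 15))
b = np.zeros((10, 15))
c = np.zeros((8, 15))   # stand-in for a dataframe that lost rows during processing

print(np.array([a, b]).shape)        # (2, 10, 15) -- shapes match, so they stack cleanly

# shapes differ, so the result is a 1-D object array that merely holds the two arrays
ragged = np.array([a, c], dtype=object)
print(ragged.shape)                  # (2,)
# ragged.any()  # raises the "truth value of an array ... is ambiguous" error from the question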

How to maintain or recover Dataframe indexing after running Pairwise Distance function?

I'm using sklearn's pairwise distance function, which saved my life when computing a huge matrix, but the problem I'm having is that I lose my indices.
Specifically, I initially have a huge dataframe of 17000 x 300, which I break down into 4 different dataframes based on some class condition.
The 4 separate dataframes keep the original indices, but after I run the pairwise distance function on one of them, it gives me back a 2D array with the correct values, while the indices have been reset to run from 0 up.
How do I keep or recover the original indices?
distance1 = pair.pairwise_distances(df1, metric='euclidean')
You can create a DataFrame with matching indices by passing the index parameter to the DataFrame constructor:
pd.DataFrame(distance1, index=df1.index)
Furthermore, if you would like to concatenate it horizontally to your existing DataFrame, you can use
pd.concat((df1, pd.DataFrame(distance1, index=df1.index)), axis=1)
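A self-contained sketch of the round trip with toy data (pairwise_distances here is the same sklearn function called as pair.pairwise_distances in the question; labelling the columns with the same index is an optional extra, since the distance matrix is square over the same rows):
import pandas as pd
from sklearn.metrics import pairwise_distances

# toy dataframe with non-default indices, standing in for one of the class-based subsets
df1 = pd.DataFrame([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]], index=[10, 42, 57])

# pairwise_distances returns a plain ndarray, so the labels are lost at this point
distance1 = pairwise_distances(df1, metric='euclidean')

# wrap it back into a DataFrame, reusing the original index for both axes
dist_df = pd.DataFrame(distance1, index=df1.index, columns=df1.index)
print(dist_df)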

Change dtype for particular values in numpy array?

I have a numpy array x with dimensions (20, 4), in which only the first row and column contain real string values (alphabetic labels), while the rest of the values are numerals stored as strings. I want to change these numeral values to float or integer type.
I have tried some steps:
a. I made copies of first row and column of the array as separate variables:
x_row = x[0]
x_col = x[:,0]
Then I deleted them from the original array x (using the numpy.delete() method) and converted the type of the remaining values with a for loop that iterates over each value. However, when I stack the copied row and column back using numpy.vstack() and numpy.hstack(), everything converts back to string type, and I am not sure why this is happening.
b. Same procedure as in point a, except that I used the numpy.insert() method to put the rows and columns back, but it does the same thing: everything is converted back to string type.
So, is there a way in which I don't have to go through this deleting and stacking mechanism (which isn't working anyway) and can change all the values of the array (except the first row and column) to int or float type?
All items in a numpy array have to have the same dtype. That is a fundamental fact about numpy. You could possibly use a numpy recarray, or you could use dtype=object which basically lets all values be anything.
I'd recommend you take a look at pandas, which provides a tabular data structure that allows different columns to have different types. It sounds like what you have is a table with row and column labels, and that's what pandas deals with nicely.
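A minimal sketch of that idea with a toy array: the first row becomes the column labels, the first column becomes the row index, and everything that remains is cast to float:
import numpy as np
import pandas as pd

# toy array in the question's layout: labels in the first row and column, numerals stored as strings elsewhere
x = np.array([["", "A", "B", "C"],
              ["r1", "1", "2.5", "3"],
              ["r2", "4", "5", "6.5"]])

# slice off the label row/column, use them as column names and index, and convert the body to float
df = pd.DataFrame(x[1:, 1:], index=x[1:, 0], columns=x[0, 1:]).astype(float)
print(df.dtypes)   # every column is now float64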
