I currently have a pretty large 3D numpy array (atlasarray, 14M elements of type int64) from which I want to create a duplicate array where every element is a float looked up from a separate dataframe (organfile).
I'm very much a beginner, so I'm sure there must be a better (quicker) way to do this. Currently it takes around 90 s, which isn't ages but can surely be reduced. Most of the code below comes from hours of Googling, so it certainly isn't optimised.
import numpy as np
import pandas as pd
from tqdm import tqdm

organfile = pd.read_excel('/media/sf_VMachine_Shared_Path/ValidationData/ICRP110/AF/AF_OrgansSimp.xlsx')

densityarray = atlasarray.astype(float)

# iterate over every element and overwrite it with the looked-up density
for idx, x in tqdm(np.ndenumerate(densityarray), total=densityarray.size):
    densityarray[idx] = organfile.loc[x, 'Density']
All of the elements in the original numpy array are integers which correspond to an OrganID. I used pandas to read the key from an Excel file into a 4-column dataframe, from which I want to extract the 4th column ('Density', a float). OrganIDs go up to 142. The first few rows look like this:
| OrganID | OrganName | TissueType | Density |
|---------|-----------|------------|---------|
| 0       | Air       | 53         | 0.001   |
| 1       | Adrenal   | 43         | 1.030   |
Any recommendations on ways I can speed this up would be gratefully received.
Put the density from the dataframe into a numpy array:
density = np.array(organfile['Density'])
Then run:
density[atlasarray]
Don't use Python-level loops; they are slow. The following example with 14M elements takes less than a second to run:
import numpy as np

density = np.random.random(143)                            # one density per OrganID (0-142)
atlasarray = np.random.randint(0, 143, (1000, 1000, 14))   # high is exclusive, so IDs run 0-142
densityarray = density[atlasarray]
Shape of densityarray:
print(densityarray.shape)
(1000, 1000, 14)
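Applied to the question's data, a minimal sketch could look like the following. It assumes the OrganID column is what indexes the densities (the reindex guards against gaps in the IDs), and the Excel path is shortened for the example:

import numpy as np
import pandas as pd

organfile = pd.read_excel('AF_OrgansSimp.xlsx')  # path shortened for the example

# lookup table where position i holds the density of OrganID i
density = (organfile.set_index('OrganID')['Density']
                    .reindex(range(int(organfile['OrganID'].max()) + 1))
                    .to_numpy())

# atlasarray is the question's existing integer array
densityarray = density[atlasarray]  # fancy indexing replaces the element-wise loop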
Related
I have a large dataset (6M rows). For a given column (timestamp) I want to take the first 10 characters of each element and construct a new column. So far I have been doing it with the apply method, but it takes a long time.
df_value_dl['time_sec'] = df_value_dl.apply(lambda x: str(x['timestamp'])[0:10], axis=1)
While looking for faster methods I came across numpy arrays.
What would be the correct syntax to do this using np arrays? Thanks.
Just in case you haven't found a solution yet: This
df_value_dl['time_sec'] = df_value_dl['timestamp'].astype('string').str[:10]
should be faster than apply.
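For illustration, a small self-contained comparison (the sample timestamps here are made up):

import pandas as pd

# hypothetical sample data standing in for the 6M-row column
df_value_dl = pd.DataFrame(
    {'timestamp': pd.to_datetime(['2021-01-01 12:34:56', '2021-01-02 01:02:03'])}
)

# slow: a Python-level function call per row
slow = df_value_dl.apply(lambda x: str(x['timestamp'])[0:10], axis=1)

# fast: vectorised string slicing
fast = df_value_dl['timestamp'].astype('string').str[:10]

print((slow == fast).all())  # True; same values without the row-wise apply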
I am using pandas and uproot to read data from a .root file, and I get a table like the following one (screenshot of the table omitted).
So, from my .root file I have got some branches of a tree.
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['nCleanedJets', 'JetPt', 'JetMass', 'JetPhi']
df = ttree.pandas.df(branches, entrystop=40306)
Essentially, I have to retrieve the "JetPhi" data for each entry that has at least 2 subentries (equivalently, entries for which "nCleanedJets" is greater than or equal to 2), calculate the difference in "JetPhi" between the first two subentries, and then make a histogram of those differences.
I have searched the internet and tried different possibilities, but I have not found any useful solution.
If someone could give me any hint, advice and/or suggestion, I would be very grateful.
I used to code in C++, so I am new to Python and do not yet master this language.
You can do this in Pandas with
df[df["nCleanedJets"] >= 2]
because you have a column with the number of entries. The df["nCleanedJets"] >= 2 expression returns a Series of booleans (True if a row passes, False if a row doesn't pass) and passing a Series or NumPy array as a slice in square brackets masks by that array (returning rows for which the boolean array is True).
You could also do this in Awkward Array before converting to Pandas, which would be easier if you didn't have a "nCleanedJets" column.
array = ttree.arrays(branches, entrystop=40306)
selected = array[array.counts >= 2]
awkward.topandas(selected, flatten=True)
Masking in Awkward Array follows the same principle, but with data structures instead of flat Series or NumPy arrays (each element of array is a list of records with "nCleanedJets", "JetPt", "JetPhi", "JetMass" fields, and counts is the length of each list).
awkward.topandas with flatten=True is equivalent to what uproot does when outputtype=pandas.DataFrame and flatten=True (defaults for ttree.pandas.df).
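Once the selection is made, a minimal sketch of the delta-phi histogram could look like the following. This is not from the original answer; it assumes the DataFrame keeps uproot's usual two-level (entry, subentry) index, and the bin count is arbitrary:

import matplotlib.pyplot as plt

# keep only entries with at least two jets
selected = df[df["nCleanedJets"] >= 2]

# cross-sections at subentry 0 and 1 pick out the first two jets of each entry;
# both results are indexed by entry, so the subtraction aligns automatically
first = selected["JetPhi"].xs(0, level=1)
second = selected["JetPhi"].xs(1, level=1)
dphi = (first - second).to_numpy()

plt.hist(dphi, bins=50)
plt.xlabel("JetPhi difference between the first two jets")
plt.show()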
I have two Pandas dataframes, say df1 and df2 (shape (10, 15)), and I want to turn them into Numpy arrays, and then construct a Numpy array containing both of them (shape (2, 10, 15)). I'm currently doing this as follows:
data1 = df1.to_numpy()
data2 = df2.to_numpy()
data = np.array([data1, data2])
Now I'm trying to do this for many pairs of dataframes, and the code I'm using will break when I call data.any() for some of the pairs, giving the truth value error saying to use any() or all() (which I'm already doing). I started printing data when I saw this happening, and I noticed that the np.array() constructor will produce something that looks like [[[...]]] or [array([[...]])].
The first form works fine, but the second doesn't. The difference isn't random with respect to the dataframes: it breaks for certain ones, even though all of the dataframes are preprocessed and processed the same way, and I've manually checked that the failing ones don't have any anomalies.
Since I can't provide much explicit code/data (the code is pretty bulky, and the arrays are 300 entries each), my main question is why the array constructor gives either the [[[...]]] or the [array([[...]])] form, and why the second one fails when I call data.any().
The issue is that after processing the data, some of the dataframes were missing rows (i.e. they had shape (x, 15) with x < 10). When that happened, the construction of the data array could not produce a regular (2, 10, 15) array and instead gave an object array of shape (2,); as long as both df1 and df2 had the same number of rows it worked fine.
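A minimal sketch of a guard against this (np.stack is a stand-in choice here, not what the original code used; unlike np.array it always raises when the shapes differ, so the bad pair is surfaced immediately):

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((10, 15)))
df2 = pd.DataFrame(np.zeros((9, 15)))   # one row missing, as in the failing pairs

data1 = df1.to_numpy()
data2 = df2.to_numpy()

# np.stack refuses to combine mismatched shapes instead of silently
# falling back to a shape-(2,) object array
try:
    data = np.stack([data1, data2])
except ValueError as err:
    print(f"shape mismatch: {data1.shape} vs {data2.shape} ({err})")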
I have a csv file that contains data from two LED measurements. There are some mistakes in the file that give huge sparks (spikes) in the graph. I want to locate the places where this happens.
I have this code that makes two arrays that I plot.
x625 = np.array(df['LED Group 625'].dropna(axis=0, how='all'))
x940 = np.array(df['LED Group 940'].dropna(axis=0, how='all'))
I will provide an answer with some artificial data since you have not posted any data yet.
So after you convert the pandas columns into a numpy array, you can do something like this:
import numpy as np
# some random data. 100 lines and 1 column
x625 = np.random.rand(100,1)
# Assume that the maximum value in `x625` is a spark.
spark = x625.max()
# Find where these spark are in the `x625`
np.where(x625==spark)
#(array([64]), array([0]))
The above means that a value equal to spark is located at row index 64 of column 0.
Similarly, you can use np.where(x625 > any_number_here)
If instead of the location you need to create a boolean mask use this:
boolean_mask = (x625==spark)
# verify
np.where(boolean_mask)
# (array([64]), array([0]))
EDIT 1
You can use numpy.diff() to get the element-wise differences between consecutive elements of the array (variable).
diffs = np.diff(x625.ravel())
Index 0 of diffs holds the result of element1 - element0, and so on.
If the value in diffs is large at a specific index position, then a spark occurred at that position.
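Putting it together, a minimal sketch of threshold-based spark detection on the differences (the threshold value is made up and would need tuning to the real data):

import numpy as np

diffs = np.diff(x625.ravel())

threshold = 0.5                                   # hypothetical value; tune to the data
spark_positions = np.where(np.abs(diffs) > threshold)[0]

# each index i marks a jump between sample i and sample i + 1
print(spark_positions)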
I have a long array/list of numbers (from a netcdf file), and I want to remove a specific value which appears multiple times in the array. This is what I have:
lon = np.array(ncfile.variables['LONGITUDE'][:])
lon[lon>1000]=float('nan');
lat = np.array(ncfile.variables['LATITUDE'][:])
lat[lat>1000]=float('nan');
What I want is to have no values of lon/lat over 1000 (hence the 'nan'); however, I also want all the 'nan's deleted from the array, as they mess up my graph.
My question: how do I delete all the 'nan' terms from my array? I know a similar question was asked, but it did not really answer my question.
If you're using numpy for your arrays, you can do
x = x[~numpy.isnan(x)]
Note: NetCDF variables are cast to numpy arrays, so there is no need to include the np.array call during the read-in.
>>> lat = ncfile.variables['LATITUDE'][:]
>>> type(lat)
<class 'numpy.ndarray'>
If you want to simply retain the portion of the lat/lon arrays that are less than 1000, you can use numpy where:
lat_new = lat[np.where(lat < 1000.)[0]]
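One caveat worth adding (not part of the original answer): if lat and lon are plotted against each other, filtering them independently can leave them with different lengths. A minimal sketch of a combined mask that keeps the two arrays aligned:

import numpy as np

lat = ncfile.variables['LATITUDE'][:]
lon = ncfile.variables['LONGITUDE'][:]

# keep only positions where both coordinates are valid
valid = (lat < 1000.) & (lon < 1000.)
lat_new = lat[valid]
lon_new = lon[valid]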