I have a 2-dimensional array in NumPy and need to apply a mathematical formula only to the values of the array that match certain criteria. This could be done with a for loop and if conditions, but I think using NumPy's where() would be faster.
My code so far is this, but it doesn't work:
cond2 = np.where((SPN >= -alpha) & (SPN <= 0))
SPN[cond2] = -1*math.cos((SPN[cond2]*math.pi)/(2*alpha))
The values in the original array need to be replaced with the corresponding values after applying the formula.
Any ideas on how to make this work? I'm working with big arrays, so I need an efficient way of doing it.
Thanks
Try this (math.cos only accepts a single scalar, while np.cos works elementwise on whole arrays, so a plain boolean mask is all the assignment needs):
cond2 = (SPN >= -alpha) & (SPN <= 0)
SPN[cond2] = -np.cos(SPN[cond2]*np.pi/(2*alpha))
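For reference, a minimal runnable sketch of the fix (alpha and the SPN values are made-up samples):

import numpy as np

alpha = 2.0                                  # assumed sample value
SPN = np.array([[-3.0, -1.0], [0.5, -0.5]])  # assumed sample 2-D array

cond2 = (SPN >= -alpha) & (SPN <= 0)         # boolean mask instead of np.where
SPN[cond2] = -np.cos(SPN[cond2] * np.pi / (2 * alpha))
print(SPN)  # only the two in-range values change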
I have a CSV dataset with texts that I need to search through. I couldn't find an easy way to search for a string in a DataFrame and get the row and column indexes. For example, let's say the dataset is like:
df = pd.DataFrame({"China": ['Xi','Lee','Hung'], "India": ['Roy','Rani','Jay'], "England": ['Tom','Sam','Jack']})
Now let's say I want to find the string 'rani' and know its location. Is there a simple function to do that? Or do I have to loop through everything to find it?
One vectorized (and therefore relatively scalable) solution to this is to leverage numpy.where:
import numpy as np
np.where(df == 'Rani')
This returns two arrays, corresponding to row and column indices:
(array([1]), array([1]))
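If you want row and column labels rather than positions, a small sketch (building on the df above) feeds those arrays back through the axes:

rows, cols = np.where(df == 'Rani')
print(df.index[rows].tolist())    # [1] -> row label
print(df.columns[cols].tolist())  # ['India'] -> column label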
You can continue to take advantage of vectorized operations, but also write a more complicated filtering function, like so:
np.where(df.applymap(lambda x: "ani" in x))
In other words, "apply to each cell the function that returns True if 'ani' is in the cell", and then conduct the same np.where filtering step.
You can use any function:
def _should_include_cell(cell_contents):
    return cell_contents.lower() == "rani" or "Xi" in cell_contents

np.where(df.applymap(_should_include_cell))
Some final notes:
applymap is slower than simple equality checking
if you need this to scale WAY up, consider using dask instead of pandas
Not sure how this will scale, but it works:
df[df.eq('Rani')].dropna(axis=1, how='all').dropna()
India
1 Rani
Quick Pandas question:
I'm cleaning up the values in individual columns of a DataFrame by using an apply on a Series:
# For all values in col 'Rate' over 1, divide by 100
df['rate'][df['rate']>1] = df['rate'][df['rate']>1].apply(lambda x: x/100)
This is fine when the selection criteria is simple, such as df['rate']>1. This however gets very long when you start adding multiple selection criteria:
df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')] = df['rate'][(df['rate']>1) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')].apply(lambda x: x/100)
What's the most concise way to:
1. Split a column off (as a Series) from a DataFrame
2. Apply a function to the items of the Series
3. Update the DataFrame with the modified series
I've tried using df.update(), but that didn't seem to work. I've also tried using the Series as a selector, e.g. isin(Series), but I wasn't able to get that to work either.
Thank you!
When there are multiple conditions, you can keep things simple using eval:
mask = df.eval("rate > 1 & rate_type == 'fixed' & something <= 'nothing'")
df.loc[mask, 'rate'] = df.loc[mask, 'rate'].apply(function)
Read more about evaluating expressions dynamically in the pandas documentation for DataFrame.eval. Of course, this particular function can be vectorized as
df.loc[mask, 'rate'] /= 100
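For completeness, a minimal runnable sketch of the whole pattern (the column values are made up):

import pandas as pd

df = pd.DataFrame({'rate': [150.0, 0.5, 250.0],
                   'rate_type': ['fixed', 'fixed', 'floating'],
                   'something': ['nothing', 'nothing', 'nothing']})

mask = df.eval("rate > 1 & rate_type == 'fixed' & something <= 'nothing'")
df.loc[mask, 'rate'] /= 100  # only the first row matches
print(df)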
It will also work with df.update:
con=(df['rate']>1) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')
df.update(df.loc[con,['rate']].apply(lambda x: x/100))
I want to avoid apply() and instead vectorize my data processing.
I have a function that buckets data based on few "if" and "else" conditions. How do I pass data to this function?
def my_function(id):
    if 0 <= id <= 30000:
        cal_score = 5
    else:
        cal_score = 0
    return cal_score
apply() works; it loops through every row.
But apply() is slow on a huge set of data (my scenario):
df['final_score'] = df.apply(lambda x : my_function(x['id']), axis = 1)
Passing a NumPy array does not work:
df['final_score'] = my_function(df['id'].values)
ERROR: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
It doesn't like the entire array being passed in: the "if" test in my function errors out when given more than one element.
I want to update my final_score column based on ID values but by passing an entire array.
How do I design or address this?
Use Series.between to create your condition, multiply the resultant mask by 5.
df['final_score'] = df['id'].between(0, 30000, inclusive=True) * 5
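A quick sketch of what that produces (the id values are made up; recent pandas versions spell the argument inclusive='both'):

import pandas as pd

df = pd.DataFrame({'id': [-5, 0, 15000, 30000, 30001]})
df['final_score'] = df['id'].between(0, 30000, inclusive='both') * 5
print(df['final_score'].tolist())  # [0, 5, 5, 5, 0]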
It's easy:
Convert Series to numpy array via '.values'
n_a = df['final_score'].values
Vectorize your function
vfunc = np.vectorize(my_function)
Calculate the result array using vectorized function:
res_array = vfunc(n_a)
df['final_score'] = res_array
Check https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.vectorize.html for more details
Vectorized calculations over a pd.Series converted to a NumPy array can be 10x faster than using internal pandas calculations.
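One caveat: the NumPy docs note that np.vectorize is implemented essentially as a for loop, so it is mainly a convenience. For a simple bucket like this one, a sketch with np.where stays truly vectorized:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [-5, 0, 15000, 30000, 30001]})  # assumed sample data
ids = df['id'].to_numpy()
df['final_score'] = np.where((ids >= 0) & (ids <= 30000), 5, 0)
print(df['final_score'].tolist())  # [0, 5, 5, 5, 0]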
A question about masking 2-D np.array data.
For example:
a 2-D np.array value with shape 20 x 20.
An index t = [(1,2),(3,4),(5,7),(12,13)]
How can I mask the 2-D array values at the (y, x) locations given in the index?
Usually, replacing with np.nan is based on a specific value, like y[y==7] = np.nan.
In my example, I want to replace the values at specific locations with np.nan.
For now, I can do it by:
Creating a new array value_mask in the shape of 20 x 20
Looping over the values and testing each location with (i,j) == t[k]
If True, value_mask[i,j] = value[i,j]; otherwise, value_mask[i,j] = np.nan
My method is too bulky, especially for huge data (3 levels of loops).
Is there a more efficient method to achieve this? Any advice would be appreciated.
You are nearly there.
You can pass arrays of indices to arrays. You probably know this with 1D-arrays.
With a 2D array you need to pass the array a tuple of sequences (one sequence per axis, all of the same length, with one entry for each element you want to choose). You have a list of tuples, so you just have to "transpose" it:
t1 = tuple(zip(*t))
gives you the right shape of index array (the tuple() call is needed in Python 3, where zip returns an iterator), which you can now use as an index for any assignment, for example: value[t1] = np.nan
(There are lots of nice explanations of this trick (with zip and *) in Python tutorials, if you don't know it yet.)
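Putting it together, a minimal runnable sketch (the array contents are made up):

import numpy as np

value = np.arange(400, dtype=float).reshape(20, 20)  # sample 20 x 20 array
t = [(1, 2), (3, 4), (5, 7), (12, 13)]

rows, cols = zip(*t)        # rows = (1, 3, 5, 12), cols = (2, 4, 7, 13)
value[rows, cols] = np.nan  # one vectorized assignment, no loops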
You can use np.logical_and
arr = np.zeros((20,20))
You can select by location; this is just an example selection:
arr[4:8,4:8] = 1
You can create a mask the same shape as arr
mask = np.ones((20,20)).astype(bool)
Then you can use np.logical_and:
mask = np.logical_and(mask, arr == 1)
And finally, you can replace the selected 1s with np.nan:
arr[mask] = np.nan
I have a numpy array which I need to filter and perform a sum on. Similar to my previous question, although this one needs to be filtered by two conditions.
Need to return the sum of column 7 where column 0 == ptype AND column 8 <= radius.
np.sum(data[data[:,0] == ptype and data[data[:,8] <= radius],7])
I get the following error:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any ideas?
Python's and looks at the boolean value on either side of the condition. Because of design decisions in NumPy, arrays with more than one element don't have a single boolean value (it raises ValueError, as you've seen). The solution is to use the np.logical_and function:
mask = np.logical_and(data[:, 0] == ptype, data[:, 8] <= radius)
np.sum(data[mask, 7])
Note that & will work as well in this case, since you have arrays of booleans. However, I don't like to use it in general, because typically (in NumPy as well) & means bitwise and rather than logical and.
You need to use & instead of and with NumPy arrays:
mask = (data[:,0] == ptype) & (data[:,8] <= radius)
data[mask,7].sum()
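As a sanity check, a small runnable sketch (the 9-column data array, ptype, and radius are made-up sample values):

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100, 9))
data[:, 0] = rng.integers(0, 3, size=100)  # pretend column 0 is a type code

ptype, radius = 1, 0.5
mask = (data[:, 0] == ptype) & (data[:, 8] <= radius)
print(data[mask, 7].sum())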