Pandas: aggregating by numerical column name raises a shape error on assignment - python

To smooth out variation in the measurements, I want to aggregate columns over a specific range.
For example, I want to sum the columns whose names fall within ±0.1 of each integer and assign the result to the integer-named column. However, the assignment fails with a shape error.
I suspect the cause is the conversion of the column type, but what should I do about it?
Thank you.
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.arange(0, 10000, 1).reshape(100, 100))
df.columns = np.arange(0, 10, 0.1)
print(df.head())
df.columns = df.columns.astype(float)
temp = df.columns.values
for n in np.arange(1, 9, 1):
    l = n - 0.1
    m = n + 0.1
    # columns whose labels fall within +/- 0.1 of the integer n
    calc_n = temp[np.where((temp >= l) & (temp <= m))]
    calc = np.sum(df[df.columns.intersection(calc_n)], axis=1)
    n_position = temp[np.where(temp == n)]
    df[n_position] = calc.values  # <-- raises the error below

ValueError: shape mismatch: value array of shape (100,) could not be broadcast to indexing result of shape (1,100)

The ValueError occurs because n_position is an array, so df[n_position] selects a DataFrame rather than a single column.
It is usually not a good idea to use floats as index labels, and you should be careful when comparing floats: the line calc_n = temp[np.where((temp >= l) & (temp <= m))] won't always give accurate results.
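For instance, the labels produced by np.arange with a 0.1 step are not exact decimals, so an exact comparison can quietly come up empty:
>>> import numpy as np
>>> temp = np.arange(0, 10, 0.1)
>>> (temp == 0.3).any()
False
>>> temp[3]
0.30000000000000004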
As a starting point, try:
margin = 0.101  # set your own margin
for n in np.arange(1, 9, 1):
    calc_n = np.where(np.abs(temp - n) < margin)
    df[n] = df.iloc[:, calc_n[0]].sum(axis=1)
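The margin of 0.101 (rather than exactly 0.1) is deliberate: it leaves headroom for the tiny representation error in the float column labels, and any value comfortably between 0.1 and 0.2 works, since the next columns beyond the intended range sit 0.2 away from each integer.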

Related

How can this function be vectorized?

I have a NumPy array with the following properties:
shape: (9986080, 2)
dtype: np.float32
I have a method that loops over the rows of the array, performs an operation, and writes the result to a new array:
def foo(arr):
    new_arr = np.empty(len(arr), dtype=np.uint64)
    for i in range(len(arr)):
        x, y = arr[i]
        e, n = '', ''
        if x < 0:
            e = '1'
        else:
            w = '2'
        if y > 0:
            n = '3'
        else:
            s = '4'
        new_arr[i] = int(f'{abs(x)}{e}{abs(y)}{n}'.replace('.', ''))
I agree with Iguananaut's comment that this data structure seems a bit odd. My biggest problem with it is that it is really tricky to vectorize building strings out of numbers and then converting those strings back to integers. Still, this will certainly help speed up the function:
def foo(arr):
    x_values = arr[:, 0]
    y_values = arr[:, 1]
    ones = np.ones(arr.shape[0], dtype=np.uint64)
    # vectorized equivalents of the if/else branches
    e = np.char.array(np.where(x_values < 0, ones, ones * 2))
    n = np.char.array(np.where(y_values > 0, ones * 3, ones * 4))
    x_values = np.char.array(np.absolute(x_values))
    y_values = np.char.array(np.absolute(y_values))
    # drop the decimal points, as in the original string-building code
    x_values = np.char.replace(x_values, '.', '')
    y_values = np.char.replace(y_values, '.', '')
    new_arr = np.char.add(np.char.add(x_values, e), np.char.add(y_values, n))
    return new_arr.astype(np.uint64)
Here, the x and y values of the input array are first split up. Then a vectorized computation determines where e and n should be 1 or 2 and 3 or 4. After the string merging, the last line converts the result back to unsigned integers. Converting to strings and back is still undesirably slow for super large arrays, but vectorizing all of these computations should speed the function up hugely.
Edit:
I was mistaken before: Numpy does have a nice way of handling string concatenation, the np.char.add() method. This requires converting x_values and y_values to Numpy character arrays using np.char.array(). For some reason np.char.add() only takes two arrays as inputs, so x_values must first be concatenated with e, and y_values with n, before those two results are concatenated with each other. Still, this vectorizes the computations and should be pretty fast. The code is still a bit clunky because of the rather odd operation you are after, but I think this will help you speed up the function greatly.
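A minimal REPL sketch of that pairwise chaining, with toy values:
>>> import numpy as np
>>> np.char.add(np.char.add(['1', '2'], ['a', 'b']), ['x', 'y']).tolist()
['1ax', '2by']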
You may use np.apply_along_axis. When you feed this function another function that takes a row (or column) as an argument, it applies that function along the given axis, which does what you want here.
For your case, you may rewrite the function as below:
def foo(row):
    x, y = row
    e, n = '', ''
    if x < 0:
        e = '1'
    else:
        w = '2'
    if y > 0:
        n = '3'
    else:
        s = '4'
    return int(f'{abs(x)}{e}{abs(y)}{n}'.replace('.', ''))

# Where you want to use it:
new_arr = np.apply_along_axis(foo, 1, arr)
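Note that np.apply_along_axis still calls foo once per row at the Python level, so this is mainly a readability improvement; for arrays of this size the vectorized np.char approach above should be substantially faster.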

Is there a way to get the index of a percentage of a cumulative sum from a sorted list?

Given a sorted list of real numbers, e.g.
x = range(20)
The task is to find the first index at which the cumulative sum of the list reaches X% of its total, e.g.
def compute_cumpercent(lint, percent):
    break_point = sum(lint) * percent
    mass = 0
    for i, c in enumerate(lint):
        if mass > break_point:
            return i
        mass += c
To find the index of the number in the input list which is less than and closest to 25% of the cumulative sum,
>>> compute_cumpercent(x, 0.25)
11
Firstly, is there a mathematical name for such a function?
Other than doing it with the simple loop as above, is there a way to do the same with numpy, bisect, or otherwise?
Assume that the input list is always sorted.
Something like this maybe?
import numpy as np
x = range(20)
percent = 0.25
cumsum = np.cumsum(x)
break_point = cumsum[-1] * percent
np.argmax(cumsum >= break_point) + 1 # 11
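(The + 1 compensates for the loop version's bookkeeping: the loop checks mass, the cumulative sum excluding the current element, before adding it, so it returns one position past the index where the cumulative sum first reaches the break point.)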
import numpy as np

x = np.arange(20)
Percent = 25
CumSumArray = np.cumsum(x)
ValueToFind = CumSumArray[-1] * Percent / 100
Idx = np.argmax(CumSumArray > ValueToFind) - 1  # index of the last cumulative sum not exceeding the target
Following this hint, one can use searchsorted to find the index of the element that is closest (from below) to a percentile/quantile value.
See example below:
import numpy as np

def find_index_left(xs, v):
    return np.searchsorted(xs, v, side='left') - 1

def find_index_quantile(xs, q):
    v = np.quantile(xs, q)
    return find_index_left(xs, v)

xs = [5, 10, 11, 15, 20]
assert np.quantile(xs, 0.9) == 18.0
assert find_index_left(xs, 18) == 3  # zero-based index of the fourth element
assert find_index_quantile(xs, 0.9) == 3
Note xs has to be sorted.
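For reference, np.searchsorted returns the insertion index that would keep the array sorted, which is why find_index_left subtracts 1 to land on the element to the left of that position:
>>> np.searchsorted([5, 10, 11, 15, 20], 18, side='left')
4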

Vectorizing outer and inner loop when these contain calculations and deletes

I've been checking out how to vectorize an outer and an inner for loop. These have some calculations and also a delete inside them, which seems to make the vectorization much less straightforward.
How would this best be vectorized?
import numpy as np

flattenedArray = np.ndarray.tolist(someNumpyArray)
# flattenedArray is a python list of lists.
c = flattenedArray[:]
for a in range(len(flattenedArray)):
    for b in range(a + 1, len(flattenedArray)):
        if a == b:
            continue
        i0 = flattenedArray[a][0]
        j0 = flattenedArray[a][1]
        z0 = flattenedArray[a][2]
        i1 = flattenedArray[b][0]
        j1 = flattenedArray[b][1]
        z1 = flattenedArray[b][2]
        if np.square(z0 - z1) <= np.square(i0 - i1) + np.square(j0 - j1):
            if np.square(i0 - i1) + np.square(j0 - j1) <= np.square(z0 + z1):
                c.remove(flattenedArray[b])
MSeifert is, of course, as so often, right. So the following full vectorisation is only to show "how it's done":
import numpy as np

N = 4
data = np.random.random((N, 3))

# vectorised code
j, i = np.tril_indices(N, -1)  # chose tril over triu to have contiguous columns; useful later
sqsum = np.square(data[i, 0] - data[j, 0]) + np.square(data[i, 1] - data[j, 1])
cond = np.square(data[i, 2] + data[j, 2]) >= sqsum
cond &= np.square(data[i, 2] - data[j, 2]) <= sqsum
# because equal 'b's are grouped together we can use reduceat:
cond = np.r_[False, np.logical_or.reduceat(
    cond, np.add.accumulate(np.arange(N - 1)))]
left = data[~cond, :]

# original code (modified to make it run)
flattenedArray = np.ndarray.tolist(data)
# flattenedArray is a python list of lists.
c = flattenedArray[:]
for a in range(len(flattenedArray)):
    for b in range(a + 1, len(flattenedArray)):
        if a == b:
            continue
        i0 = flattenedArray[a][0]
        j0 = flattenedArray[a][1]
        z0 = flattenedArray[a][2]
        i1 = flattenedArray[b][0]
        j1 = flattenedArray[b][1]
        z1 = flattenedArray[b][2]
        if np.square(z0 - z1) <= np.square(i0 - i1) + np.square(j0 - j1):
            if np.square(i0 - i1) + np.square(j0 - j1) <= np.square(z0 + z1):
                try:
                    c.remove(flattenedArray[b])
                except ValueError:
                    pass

# check they are the same
print(np.alltrue(c == left))
Vectorizing the inner loop isn't much of a problem if you work with a mask:
import numpy as np

# I'm using a random array
flattenedArray = np.random.randint(0, 100, (10, 3))

mask = np.zeros(flattenedArray.shape[0], bool)
for idx, row in enumerate(flattenedArray):
    # Calculate the broadcasted elementwise addition/subtraction of this row
    # with all following rows
    added_squared = np.square(row[None, :] + flattenedArray[idx + 1:])
    subtracted_squared = np.square(row[None, :] - flattenedArray[idx + 1:])
    # Check the conditions
    col1_col2_added = subtracted_squared[:, 0] + subtracted_squared[:, 1]
    cond1 = subtracted_squared[:, 2] <= col1_col2_added
    cond2 = col1_col2_added <= added_squared[:, 2]
    # Update the mask
    mask[idx + 1:] |= cond1 & cond2

# Apply the mask
flattenedArray[mask]
If you also want to vectorize the outer loop, you have to do it by broadcasting; that, however, uses O(n**2) memory instead of O(n). Given that the critical inner loop is already vectorized, there won't be a lot of speedup from vectorizing the outer one.
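For completeness, here is a minimal sketch of what that fully broadcast version could look like (same condition as above; mask marks the rows the original loop would remove):
import numpy as np

flattenedArray = np.random.randint(0, 100, (10, 3))
# all pairwise sums/differences via broadcasting: shape (n, n, 3), hence O(n**2) memory
added_squared = np.square(flattenedArray[:, None, :] + flattenedArray[None, :, :])
subtracted_squared = np.square(flattenedArray[:, None, :] - flattenedArray[None, :, :])
col1_col2_added = subtracted_squared[..., 0] + subtracted_squared[..., 1]
cond = (subtracted_squared[..., 2] <= col1_col2_added) & (col1_col2_added <= added_squared[..., 2])
# only pairs with b > a count, i.e. the upper triangle without the diagonal
cond &= np.triu(np.ones(cond.shape, dtype=bool), k=1)
mask = cond.any(axis=0)  # row b is marked if any earlier row a matches it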

Pulling data from a numpy array

I have a numpy ndarray that I made using numpy.loadtxt. I want to pull entire rows from it based on a condition on the third column, something like: if array[2][i] meets my condition, then get array[0][i] and array[1][i] as well. I'm new to python and to all of numpy's features, so I'm looking for the best way to do this. Ideally, I'd like to pull 2 rows at a time, but I won't always have an even number of rows, so I imagine that is a problem.
import numpy as np
'''
Created on Jan 27, 2013
#author:
'''

class Volume:
    f = '/Users/Documents/workspace/findMinMax/crapc.txt'
    m = np.loadtxt(f, unpack=True, usecols=(1, 2, 3), ndmin=2)
    maxZ = max(m[2])
    minZ = min(m[2])
    print("Maximum Z value: " + str(maxZ))
    print("Minimum Z value: " + str(minZ))
    zIncrement = .5
    steps = maxZ / zIncrement
    currentStep = .5
    b = []
    for i in m[2]:  # here is my problem
        while currentStep < steps:
            if m[2][i] < currentStep and m[2][i] > currentStep - zIncrement:
                b.append(m[2][i])
            if len(b) < 2:
                currentStep + zIncrement
    print(b)
Here is some code that I wrote in Java that gives the general idea of what I want:
while (e < a.length - 1) {
    for (int i = 0; i < a.length - 1; i++) {
        if (a[i][2] < stepSize && a[i][2] > stepSize - 2) {
            x.add(a[i][0]);
            y.add(a[i][1]);
            z.add(a[i][2]);
        }
        if (x.size() < 1) {
            stepSize += 1;
        }
    }
}
First of all, you probably don't want to put your code in that class definition...
import numpy as np

def main():
    m = np.random.random((3, 4))
    mask = (m[2] > 0.5) & (m[2] < 0.8)  # put your conditions here;
                                        # instead of 0.5 and 0.8 you can
                                        # use an array if you like
    m[:, mask]

if __name__ == '__main__':
    main()
mask is a boolean array, m[:, mask] is the array you want
m[2] is the third row of m. If you type m[2] + 2 you get a new array with the old values + 2. m[2] > 0.5 creates an array with boolean values. It is best to try this stuff out with ipython (www.ipython.org)
In the expression m[:, mask] the : means "take all rows", mask describes which columns should be included.
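A small REPL illustration of that selection, with toy values:
>>> import numpy as np
>>> m = np.array([[1.0, 2.0, 3.0, 4.0],
...               [5.0, 6.0, 7.0, 8.0],
...               [0.1, 0.6, 0.9, 0.7]])
>>> mask = (m[2] > 0.5) & (m[2] < 0.8)
>>> m[:, mask]
array([[2. , 4. ],
       [6. , 8. ],
       [0.6, 0.7]])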
Update
Next try :-)
for i in range(0, len(m), 2):
    two_rows = m[i:i+2]
If you can write your condition as a simple function
def condition(value):
    # return True or False depending on value
    ...
then you could select your subarrays like this:
cond = condition(a[2])
subarray0 = a[0, cond]
subarray1 = a[1, cond]

matplotlib hist while ignoring a particular no data value

I've got a 2D numpy array with 1.0e6 as the no data value. I'd like to generate a histogram of the data and while I've succeeded this can't be the best way to do it.
from matplotlib import pyplot
import sys

eps = sys.float_info.epsilon
no_data = 1.0e6
e_data = elevation.reshape(elevation.size)
e_data_clean = []
for i in range(len(e_data)):
    val = e_data[i]
    # floating point check: keep val only if it is approximately not equal to no_data
    if val > no_data + eps or val < no_data - eps:
        e_data_clean.append(val)
pyplot.hist(e_data_clean, bins=100)
It seems like there should be a cleaner (and much faster) one-liner for this. Is there?
You can use a boolean array to select the required indices:
selected_values = (e_data > (no_data + eps)) | (e_data < (no_data - eps))
pyplot.hist(e_data[selected_values])
(e_data > (no_data + eps)) will create an array of np.bool with the same shape as e_data, set to True at a given index if and only if the value at that index is greater than (no_data + eps). | is the element-wise or operator, so a value is selected when it falls on either side of the [no_data - eps, no_data + eps] band, i.e. when it is approximately not equal to no_data.
Alternatively, if no_data is just a convention, I would set those values to numpy.nan instead, and use e_data[numpy.isfinite(e_data)].
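A minimal sketch of that numpy.nan alternative (assuming elevation and no_data are the array and sentinel from the question; exact equality is fine here if the sentinel is stored verbatim):
import numpy as np
from matplotlib import pyplot

e_data = elevation.astype(float).ravel()
e_data[e_data == no_data] = np.nan  # mark the sentinel as missing
pyplot.hist(e_data[np.isfinite(e_data)], bins=100)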
