Split disorganized arrays with numpy - python

I am using the below code to read arrays from csv files.
x,y = np.loadtxt(filename, delimiter=';', unpack=True, skiprows=1, usecols=(1,2))
Here x is an array like [5,5,5,0,1,1,2,3,3,4,5,5,5]
and y is [111.0,111.1,111.2,111.3,111.4,111.5...]
I want to split both arrays into groups according to the values in x.
So my expected output would be something like this:
[1,1,1,1,1..][111.4,111.5,111.6...]
[2,2,2,2,..][111.10,111.11,111.12...]
[5,5,5,5,5,...][111.0,111.1,111.2...111.20,111.21,111.22]
...
So that I can pick an x value and get back its respective y values.
I've tried using np.split, e.g. np.split(x,[21,1,2,3...]), but it doesn't seem to work for me.

My solution is probably not the most efficient one performance-wise, but you can use it as a starting point for further investigation.
import numpy as np

# some dummy data
x = np.array([5, 5, 5, 0, 1, 1, 2, 3, 3, 4, 5, 5, 5])
y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

def split_by_ids(data: np.ndarray, ids: np.ndarray):
    splits = []  # result storage
    # get the unique ids and their counts from the ids array
    elems, counts = np.unique(ids, return_counts=True)
    # go through each id and its count
    for index, count in zip(elems, counts):
        # create an array of the repeated id and grab the corresponding values from data
        splits.append((np.repeat(index, count), data[ids == index]))
    return splits

split_result = split_by_ids(y, x)
for ids, values in split_result:
    print(f'Ids: {ids}, Values: {values}')
The above code results in:
Ids: [0], Values: [3]
Ids: [1 1], Values: [4 5]
Ids: [2], Values: [6]
Ids: [3 3], Values: [7 8]
Ids: [4], Values: [9]
Ids: [5 5 5 5 5 5], Values: [ 0 1 2 10 11 12]
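As a possible direction for the "starting point for further investigation" above, here is a hedged sketch of a more vectorized variant: sort once by the id array, then split at the positions where the sorted ids change (same dummy x and y as above).

import numpy as np

x = np.array([5, 5, 5, 0, 1, 1, 2, 3, 3, 4, 5, 5, 5])
y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

# sort so that equal ids end up next to each other (stable sort keeps original order within a group)
order = np.argsort(x, kind='stable')
x_sorted, y_sorted = x[order], y[order]
# positions where the id changes are the group boundaries
boundaries = np.flatnonzero(np.diff(x_sorted)) + 1
splits = list(zip(np.split(x_sorted, boundaries), np.split(y_sorted, boundaries)))
for ids, values in splits:
    print(f'Ids: {ids}, Values: {values}')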

Related

Numpy python - calculating sum of columns from irregular dimension

I have a multi-dimensional array of scores, for which I need to get the sum of each column at the third level in Python. I am using NumPy to achieve this.
import numpy as np
Data is something like:
score_list = [
[[1,1,3], [1,2,5]],
[[2,7,5], [4,1,3]]
]
This should return:
[[3 8 8] [5 3 8]]
This works correctly using:
np_array = np.array(score_list)
sum_array = np_array.sum(axis=0)
print(sum_array)
However, if I have irregular shape like this:
score_list = [
[[1,1], [1,2,5]],
[[2,7], [4,1,3]]
]
I expect it to return:
[[3 8] [5 3 8]]
However, it comes up with a warning and the return value is:
[list([1, 1, 2, 7]) list([1, 2, 5, 4, 1, 3])]
How can I get expected result?
NumPy will try to cast it into an ndarray, which will fail. Instead, consider passing each column of sublists individually using zip (zip(*score_list) pairs the sublists position-wise across rows).
score_list = [
[[1,1], [1,2,5]],
[[2,7], [4,1,3]]
]
import numpy as np
res = [np.sum(x,axis=0) for x in zip(*score_list)]
print(res)
[array([3, 8]), array([5, 3, 8])]
Here is one solution; keep in mind that it doesn't use NumPy and will be very inefficient for larger matrices (but for smaller matrices it runs just fine).
# Create matrix
score_list = [
    [[1, 1, 3], [1, 2, 5]],
    [[2, 7, 5], [4, 1, 3]]
]
# Get each row
for i in range(1, len(score_list)):
    # Get each list within the row
    for j in range(len(score_list[i])):
        # Get each value in each list
        for k in range(len(score_list[i][j])):
            # Add current value to the same index
            # on the first row
            score_list[0][j][k] += score_list[i][j][k]
print(score_list[0])
There is bound to be a better solution, but this is a temporary fix for you :)
Edit: made it more efficient.
A possible solution:
a = np.vstack([np.array(score_list[x], dtype='object')
               for x in range(len(score_list))])
[np.add(*[x for x in a[:, i]]) for i in range(a.shape[1])]
Another possible solution:
a = sum(score_list, [])
b = [a[x] for x in range(0,len(a),2)]
c = [a[x] for x in range(1,len(a),2)]
[np.add(x[0], x[1]) for x in [b, c]]
Output:
[array([3, 8]), array([5, 3, 8])]

Substitute row in Numpy if a condition is met - Variation

I am still figuring out Numpy syntax! I have something that works but there must be a more concise way to perform this task. In the example below, I replace selected rows of an array with new entries, where the condition is just on one element.
import numpy as np
big_array = np.random.randint(10, size=(5, 2))  # multi-dimensional array
print(big_array)
bad_values = np.less_equal(big_array[:, 0], 4)  # condition on the value in one dimension
bad_rows = np.nonzero(bad_values)[0]  # indexes to change, i.e. rows
print(f'these are the rows to replace {bad_rows}')
new_rows = np.random.randint(10, size=(bad_rows.size, 2)) + 10  # smaller multi-dim array
np.put(big_array[:, 0], bad_rows, new_rows[:, 0])  # should be a single line to combine this
np.put(big_array[:, 1], bad_rows, new_rows[:, 1])  # with this?
print(big_array)
Sample output that I want might look like:
[[2 4]
[5 9]
[6 6]
[6 7]
[0 6]]
these are the rows to replace [0 4]
[[16 17]
[ 5 9]
[ 6 6]
[ 6 7]
[18 17]]
I don't know how to format put for arguments with different dimensions. This seems like it should be a one-liner. (If I try where, I get broadcasting length issues.) What am I missing?
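One direction that might collapse the two put calls (a sketch based on boolean-mask row assignment, not a verified answer to the question above): the boolean condition itself can index rows and assign the replacements in a single statement.

import numpy as np

big_array = np.random.randint(10, size=(5, 2))
print(big_array)
bad_values = big_array[:, 0] <= 4  # boolean mask over rows
new_rows = np.random.randint(10, size=(bad_values.sum(), 2)) + 10
big_array[bad_values] = new_rows   # replace whole rows in one step
print(big_array)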

Perform numpy product over non-zero elements of a row

I have a 2d array r. What I want to do is to take the product of each row (excluding the zero elements in that row). For example if I have:
r = [[1 2 0 3 4],
[0 2 5 0 1],
[1 2 3 4 0]]
Then what I want is to have another 2d array result such that:
result = [[24],
[10],
[24]]
How can I achieve this using numpy.prod?
I think I figured it out:
np.prod(r, axis = 1, where = r > 0, keepdims = True)
Output:
array([[24],
[10],
[24]])
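A small follow-up note (mine, not from the question): where=r > 0 also skips negative entries, so if r may contain negative values, where=r != 0 excludes only the actual zeros.

import numpy as np

r = np.array([[1, 2, 0, 3, 4],
              [0, 2, 5, 0, 1],
              [1, 2, 3, 4, 0]])

# product over each row, skipping only exact zeros
result = np.prod(r, axis=1, where=(r != 0), keepdims=True)
print(result)  # [[24] [10] [24]]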

How to delete rows of numpy array by multiple row indices?

I have two lists of indices (idx[0] and idx[1]), and I need to delete the corresponding rows from the numpy array y_test.
y_test
12 11 10
1 2 2
3 2 3
4 1 2
13 1 10
idx[0] = [0,2]
idx[1] = [1,3]
I tried to delete the rows as follows (using ~), but it didn't work:
result = y_test[(~idx[0]+~idx[1]+~idx[2])]
Expected result:
result =
13 1 10
Instead of removing elements, just make a new array with the desired ones. This keeps any future indexing from getting jumbled up and preserves the old array.
import numpy as np
y_test = np.asarray([[12, 11, 10], [1, 2, 2], [3, 2, 3], [4, 1, 2], [13, 1, 10]])
idx = [[0, 2], [1, 3]]
# flatten the list of lists
idx_flat = [i for j in idx for i in j]
# assign values that are NOT in your idx list to a new array
result = [row for num, row in enumerate(y_test) if num not in idx_flat]
# cast this however you want; right now 'result' is a list of np.arrays
print(result)
[array([13, 1, 10])]
For an understanding of the flatten step using list comprehensions, check this out.
You can use numpy.delete which deletes the subarrays along the axis.
np.delete(y_test, idx, axis=0)
Make sure that idx.dtype is an integer type and use numpy.astype if not.
Your approach did not work because idx is not a boolean index array but holds the indices. So ~, which is bitwise negation, will produce ~[0, 2] = [-1, -3] (assuming both are numpy arrays).
I would definitely recommend reading up on the difference between index arrays and boolean index arrays. For boolean index arrays I would suggest using numpy.logical_not and numpy.logical_or.
+ concatenates Python lists but is elementwise addition for numpy arrays.
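For illustration, a small sketch of the index-array vs. boolean-index-array point (variable names here are just for this example):

import numpy as np

y_test = np.asarray([[12, 11, 10], [1, 2, 2], [3, 2, 3], [4, 1, 2], [13, 1, 10]])
idx = [np.array([0, 2]), np.array([1, 3])]

# build a boolean index array: True for the rows listed in idx
drop = np.zeros(len(y_test), dtype=bool)
drop[np.concatenate(idx)] = True
# negate it with numpy.logical_not (or ~) to select the rows to keep
print(y_test[np.logical_not(drop)])  # [[13  1 10]]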
Since you are using NumPy, I'd suggest masking in this way.
Setup:
import numpy as np
y_test = np.array([[12,11,10],
[1,2,2],
[3,2,3],
[4,1,2],
[13,1,10]])
idx = np.array([[0,2], [1,3]])
Generate the mask:
Generate a mask of ones, then set to zero the elements at the indices in idx:
mask = np.ones(len(y_test), dtype = int).reshape(5,1)
mask[idx.flatten()] = 0
Finally apply the mask:
y_test[~np.all(y_test * mask == 0, axis=1)]
#=> [[13 1 10]]
y_test has not been modified.

Row-wise difference in two list in pandas

I am using pandas to incrementally find new elements, i.e. for every row, I check whether the values in the list have been seen before. If they have, we ignore them. If not, we select them.
I was able to do this using iterrows(), but I have >1M rows, so I believe a vectorized apply might be better.
Here's sample data and code. Once you run this code, you will get the expected output:
from numpy import nan as NA
import collections.abc
import pandas as pd

df = pd.DataFrame({'ID': ['A','B','C','A','B','A','A','A','D','E','E','E'],
                   'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})

# wrap all elements by group in a list
Changed_df = df.groupby('ID')['Value'].apply(list).reset_index()
Changed_df = Changed_df.rename(columns={'Value': 'Elements'})
Changed_df = Changed_df.reset_index(drop=True)

def flatten(l):
    for el in l:
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

Changed_df["Elements_s"] = Changed_df['Elements'].shift()

# attempt 1: for loop
Changed_df["Diff"] = NA
Changed_df["count"] = 0
Elements_so_far = []

# replace NA with an empty list in columns that will go through list operations
for col in ["Elements", "Elements_s", "Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])

for idx, row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = flatten(Elements_so_far)
    Elements_so_far = list(set(Elements_so_far))  # keep unique elements
    Changed_df.loc[idx, "count"] = len(diff)
Commentary about the code:
I am not a fan of this code because it's clunky and inefficient.
I am saying inefficient because I created Elements_s, which holds shifted values. Another reason for inefficiency is the for loop through the rows.
Elements_so_far keeps track of all the elements we have discovered so far. If a new element shows up in a row, we record it in the Diff column.
We also keep track of the number of new elements discovered in the count column.
I'd appreciate if an expert could help me with a vectorized version of the code.
I did try a vectorized version, but I couldn't get far.
#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)
I was inspired by "How to compare two columns both with list of strings and create a new column with unique items?" to do the above, but I couldn't make it work. The linked SO thread does a row-wise difference between columns.
I am using Python 3.6.7 by Anaconda. Pandas version is 0.23.4
You could sort, then use numpy to get the unique indices, and then construct your groupings, e.g.:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)
Out[]:
ID
A [1, 2, 3, 4, 5]
D [7]
E [9]
Name: Value, dtype: object
Or to get close to your current output:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])
pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)
Out[]:
Elements Diff Count
ID
A [1, 4, 5, 2, 3] [1, 2, 3, 4, 5] 5
B [2, 3] [] 0
C [3] [] 0
D [7] [7] 1
E [2, 3, 9] [9] 1
One alternative using drop_duplicates and groupby:
# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')
# Sort values , drop duplicates by Value column then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)
# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)
# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else [])
Elements Diff Count
ID
A [1, 4, 5, 2, 3] [1, 2, 3, 4, 5] 5
B [2, 3] [] 0
C [3] [] 0
D [7] [7] 1
E [2, 3, 9] [9] 1
