I am relatively new to Python, and a piece of existing legacy code creates objects in the format below. Unfortunately I cannot change it. The code creates many objects that look like the following:
[[{'a': 2,'b': 3}],[{'a': 1,'c': 3}],[{'c': 2,'d': 4}]]
I am trying to transform this object into a matrix or numpy array. In this specific example it would have three rows (1, 2, 3) and four columns (a, b, c, d), with the dictionary values inserted in the cells. (I have included how this matrix would look as a toy example below. However, I am not looking to recreate the table from scratch; I am looking for code that translates the object above into matrix format.)
I am struggling to find a fast and easy way to do this. Any tips or advice much appreciated.
   a  b  c  d
1  2  3  0  0
2  1  0  3  0
3  0  0  2  4
I suspect you are focusing on the fast and easy, when you need to address the how first. This isn't the normal input format for np.array or pandas, so let's focus on that.
It's a list of lists, suggesting a 2d array, but each sublist contains a single dictionary rather than a list of values.
In [633]: dd=[[{'a': 2,'b': 3}],[{'a': 1,'c': 3}],[{'c': 2,'d': 4}]]
In [634]: dd[0]
Out[634]: [{'b': 3, 'a': 2}]
So let's define a function that converts a dictionary into a list of numbers. We can address the question of where a,b,c,d labels come from, and whether you need to collect them from dd or not, later.
In [635]: dd[0][0]
Out[635]: {'b': 3, 'a': 2}
In [636]: def mk_row(adict):
     ...:     return [adict.get(k, 0) for k in ['a', 'b', 'c', 'd']]
     ...:
In [637]: mk_row(dd[0][0])
Out[637]: [2, 3, 0, 0]
So now we just need to apply the function to each sublist
In [638]: [mk_row(d[0]) for d in dd]
Out[638]: [[2, 3, 0, 0], [1, 0, 3, 0], [0, 0, 2, 4]]
This is the kind of list that @Colin fed to pandas. It can also be given to np.array:
In [639]: np.array([mk_row(d[0]) for d in dd])
Out[639]:
array([[2, 3, 0, 0],
       [1, 0, 3, 0],
       [0, 0, 2, 4]])
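Regarding the deferred question of where the a, b, c, d labels come from: if they are not known in advance, a minimal sketch collects them from dd itself:

```python
import numpy as np

dd = [[{'a': 2, 'b': 3}], [{'a': 1, 'c': 3}], [{'c': 2, 'd': 4}]]

# Collect the union of all keys across the sublists, in sorted order
cols = sorted({k for sub in dd for k in sub[0]})   # ['a', 'b', 'c', 'd']

# Build one row per sublist, defaulting missing keys to 0
arr = np.array([[sub[0].get(k, 0) for k in cols] for sub in dd])
# array([[2, 3, 0, 0],
#        [1, 0, 3, 0],
#        [0, 0, 2, 4]])
```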
Simply use:
import pandas as pd
df = pd.DataFrame([[2, 3, 0, 0], [1, 0, 3, 0], [0, 0, 2, 4]],
                  index=['1', '2', '3'], columns=['a', 'b', 'c', 'd'])
arr = df.values
You can then reference it like a normal numpy array:
print(arr[0,:])
I am currently trying some beginner matrix-handling exercises, but I am unsure how to sort an nxn matrix's columns by the values in its first row.
It should be a method that can work on a matrix of any size, as it will not be the same size every time.
Does anyone have any good suggestions?
The implementation here can be very simple depending on how the data, ie. the matrix, is represented. If it is given as a list of column-lists, it just needs a sort. For the given example:
>>> m = [[2, 3, 7], [-1, -2, 5.2], [0, 1, 4], [2, 4, 5]]
>>> y = sorted(m, key=lambda x: x[0])
>>> y
[[-1, -2, 5.2], [0, 1, 4], [2, 3, 7], [2, 4, 5]]
Other representations might need a more complex approach. For example, if the matrix is given as a list of rows:
>>> m = [[2, -1, 0, 2], [3, -2, 1, 4], [7, 5.2, 4, 5]]
>>> order = sorted(range(len(m[0])), key=lambda x: m[0][x])
>>> order
[1, 2, 0, 3]
>>> y = [[row[x] for x in order] for row in m]
>>> y
[[-1, 0, 2, 2], [-2, 1, 3, 4], [5.2, 4, 7, 5]]
The idea here is that first we get the order the elements will be in, based on the first row. We can do that by sorting range(len(m[0])) (here [0, 1, 2, 3]) with the sorting key (the value used for sorting) being the i-th value of the first row.
The result is [1, 2, 0, 3], which says: column index 1 comes first, then index 2, then 0, and finally 3.
Now we want to create a new matrix where every row follows that order which we can do with a list comprehension over the original matrix, where for each row, we create a new list that has the elements of the row according to the order we determined before.
Note that this approach creates new lists for the whole matrix - if you're dealing with large matrices, you probably want to use the appropriate primitives from numpy and swap the elements around in place.
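As a sketch of that numpy route (assuming the matrix is stored as a 2-D array of rows), the same column reordering is a single fancy-indexing step:

```python
import numpy as np

m = np.array([[2, -1, 0, 2],
              [3, -2, 1, 4],
              [7, 5.2, 4, 5]])

# argsort of the first row gives the column order; a stable sort keeps
# tied columns in their original relative order
order = m[0].argsort(kind='stable')   # [1, 2, 0, 3]
y = m[:, order]                       # apply the order to every row at once
# [[-1.   0.   2.   2. ]
#  [-2.   1.   3.   4. ]
#  [ 5.2  4.   7.   5. ]]
```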
If matrix is your input, you can do:
result = list(zip(*sorted(zip(*matrix))))
So, working from the inside out, this expression does:
zip: iterates the transpose of the matrix (rows become columns and vice versa)
sorted: sorts the transposed matrix. No need to provide a custom key; the sorting is by the first element of each row (which is a column in the original matrix), with ties broken by the second element, and so on.
zip: iterates the transpose of the transposed matrix, i.e. transposes it back to its original shape
list: turns the iterable into a list (a matrix)
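Applied to the row-wise matrix from the earlier example (note that the rows come back as tuples, not lists):

```python
matrix = [[2, -1, 0, 2], [3, -2, 1, 4], [7, 5.2, 4, 5]]

# Transpose, sort the columns lexicographically, transpose back
result = list(zip(*sorted(zip(*matrix))))
# [(-1, 0, 2, 2), (-2, 1, 3, 4), (5.2, 4, 7, 5)]
```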
I have a NumPy array with each row representing some (x, y, z) coordinate like so:
a = array([[0, 0, 1],
           [1, 1, 2],
           [4, 5, 1],
           [4, 5, 2]])
I also have another NumPy array with unique values of the z-coordinates of that array like so:
b = array([1, 2])
How can I apply a function, let's call it "f", to each of the groups of rows in a which correspond to the values in b? For example, the first value of b is 1 so I would get all rows of a which have a 1 in the z-coordinate. Then, I apply a function to all those values.
In the end, the output would be an array the same shape as b.
I'm trying to vectorize this to make it as fast as possible. Thanks!
Example of the expected output (assuming f counts the rows in each group):
c = array([2, 2])
because there are 2 rows in a with a z value of 1, and 2 rows with a z value of 2.
A trivial solution would be to iterate over array b like so:
for val in b:
    apply function to a based on val
    append the result to an array c
My attempt:
I tried doing something like this, but it just returns an empty array.
func(a[a[:, 2]==b])
The problem is that the groups of rows with the same z can have different sizes, so you cannot stack them into one 3D numpy array and simply apply a function along the stacked axis. One solution is to use a for-loop; another is to use np.split:
a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])
a_sorted = a[a[:,2].argsort()]
inds = np.unique(a_sorted[:,2], return_index=True)[1]
a_split = np.split(a_sorted, inds)[1:]
# [array([[0, 0, 1],
#         [4, 5, 1],
#         [4, 3, 1]]),
#  array([[1, 1, 2],
#         [4, 5, 2]])]
f = np.sum # example of a function
result = list(map(f, a_split))
# [19, 15]
But IMHO the best solution is to use pandas and groupby, as suggested by FBruzzesi. You can then convert the result to a numpy array.
EDIT: For completeness, here are the other two solutions
List comprehension:
b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]
Pandas:
df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()
This is the performance I measured for a = np.random.randint(0, 100, (n, 3)): approximately up to n = 10^5 the "split solution" is the fastest, but after that the pandas solution performs better.
If you are allowed to use pandas:
import pandas as pd
df = pd.DataFrame(a, columns=['x', 'y', 'z'])
df.groupby('z').agg(f)
Here f can be any custom function working on grouped data.
Numeric example:
a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2]])
df = pd.DataFrame(a, columns=['x', 'y', 'z'])
df.groupby('z').size()
z
1 2
2 2
dtype: int64
Note that .size() is the way to count the number of rows per group.
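The .agg above works column by column; to apply a whole-group function f like the other answers do, one minimal sketch (here assuming f = np.sum over all values of each group, including the z column) is to iterate the groups directly:

```python
import numpy as np
import pandas as pd

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2]])
df = pd.DataFrame(a, columns=['x', 'y', 'z'])

# One f(group) result per unique z, in sorted z order
result = np.array([np.sum(g.values) for _, g in df.groupby('z')])
# array([11, 15])
```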
To keep it in pure numpy, maybe this can suit your case:
tmp = np.array([a[a[:,2]==i] for i in b])
tmp
array([[[0, 0, 1],
        [4, 5, 1]],

       [[1, 1, 2],
        [4, 5, 2]]])
which is an array containing each group of rows. Note that this only stacks into a regular 3D array because both groups here happen to have the same number of rows; with unequal group sizes numpy cannot build such an array (recent numpy versions require an explicit dtype=object for ragged input).
c = np.array([])
for x in np.nditer(b):
    c = np.append(c, np.where(a[:, 2] == x)[0].shape[0])
Output:
[2. 2.]
I wish to extract values from a MultiIndex DataFrame; this df has two index levels, a_idx and b_idx. The values to be extracted are, for example, (1, 1):
[in] df.loc[(1, 1), :]
[out]
value    0
Name: (1, 1), dtype: int64
which is as intended. But then if I want to obtain the two values at (1, 2) and (2, 3):
[in] df.loc[([1, 2], [2, 3]), :]
[out]
             value
a_idx b_idx
1     2          1
      3          6
2     2          3
      3          9
Which is not what I wanted; I needed those specific pairs, not all 4 combinations.
Furthermore, I wish to select elements from this dataframe with two arrays, select_a and select_b, that have the same length as each other, but not as the dataframe. So for
select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]
My idea was that I should do this using:
df.loc[(select_a, select_b), :]
and then receive a list of all items with a_idx == select_a[i] and b_idx == select_b[i] for all i in range(len(select_a)).
I have tried xs and slice indexing, but these did not return the desired results. My main reason for going with index-based lookups is computational speed, as the real dataset is 4.3 million lines, and the dataset that has to be created will be even larger.
If this is not the best way to achieve this result, please point me in the right direction. Any sources are also welcome; what I found in the pandas documentation was not geared towards this kind of indexing (or at least I have not been able to find it).
The dataframe is created using the following code:
numbers = pd.DataFrame(np.random.randint(0,10,10), columns=["value"])
numbers["a"] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers["b"] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
print("before adding the index to the dataframe")
print(numbers)
index_cols = pd.MultiIndex.from_arrays(
    [numbers["a"].values, numbers["b"].values],
    names=["a_idx", "b_idx"])
df = pd.DataFrame(numbers.values,
                  index=index_cols,
                  columns=numbers.columns.values)
df = df.sort_index()
df.drop(columns=["a","b"],inplace=True)
print("after adding the indexes to the dataframe")
print(df)
You were almost there. To get those specific index pairs, you need syntax like this:
df.loc[[(1, 2), (2, 3)], :]
You can also do this using select_a and select_b. Just make sure that you pass the pairs to df.loc as a list of tuples.
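A sketch with the select_a / select_b arrays from the question (the random values are replaced with fixed ones here for reproducibility): zip the two arrays into a list of (a_idx, b_idx) tuples first.

```python
import pandas as pd

# Rebuild the example frame from the question with deterministic values
numbers = pd.DataFrame({'value': range(10)})
numbers['a'] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers['b'] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
df = numbers.set_index(['a', 'b'])[['value']].sort_index()
df.index.names = ['a_idx', 'b_idx']

select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]

# Pair the arrays element-wise; df.loc with a list of tuples picks exact pairs
pairs = list(zip(select_a, select_b))   # [(1, 1), (1, 3), (2, 2), (2, 3), (3, 1)]
result = df.loc[pairs, :]               # one row per pair, in the given order
```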
This question already has answers here: Finding indices of matches of one array in another array (4 answers). Closed 3 years ago.
I have a numpy array A which contains unique IDs that can be in any order - e.g. A = [1, 3, 2]. I have a second numpy array B, which is a record of when the ID is used - e.g. B = [3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1]. Array B is always much longer than array A.
I need to find the indexed location of the ID in A for each time the ID is used in B. So in the example above my returned result would be: result = [1, 1, 0, 1, 2, 0, 2, 1, 0, 0, 2, 1, 1, 0].
I've already written a simple solution that gets the correct result using a for loop to append the result to a new list and using numpy.where, but I can't figure out the correct syntax to vectorize this.
import numpy as np
A = np.array([1, 3, 2])
B = np.array([3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1])
IdIndxs = []
for ID in B:
IdIndxs.append(np.where(A == ID)[0][0])
IdIndxs = np.array(IdIndxs)
Can someone come up with a simple vector-based solution that runs quickly? The for loop becomes very slow on a typical problem, where A is 10K-100K elements and B is some multiple, usually 5-10x, larger than A.
I'm sure the solution is simple, but I just can't see it today.
You can use this:
import numpy as np
# test data
A = np.array([1, 3, 2])
B = np.array([3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1])
# get indexes
sorted_keys = np.argsort(A)
indexes = sorted_keys[np.searchsorted(A, B, sorter=sorted_keys)]
Output:
[1 1 0 1 2 0 2 1 0 0 2 1 1 0]
The numpy-indexed library (disclaimer: I am its author) was designed to provide this type of vectorized operation where numpy for some reason does not. Frankly, given how commonly this vectorized list.index equivalent is useful, it definitely ought to be in numpy; but numpy is a slow-moving project that takes backwards compatibility very seriously, and I don't think we will see it until numpy 2.0. Until then, this is pip- and conda-installable with the same ease.
import numpy_indexed as npi
idx = npi.indices(A, B)
Reworking your logic, but using a list comprehension and numpy.fromiter, which should boost performance:
IdIndxs = np.fromiter([np.where(A == i)[0][0] for i in B], B.dtype)
About performance: I've done a quick test comparing fromiter with your solution, and I do not see such a boost. Even with a B array of millions of elements, they are of the same order.
I have the following matrix
B = [[1,2], [3,4]]
and would like to store the matrix as lines with the format i j b_ij, where i and j are the matrix indices and b_ij is the value at that position.
That is, the matrix above would look like:
0 0 1
0 1 2
1 0 3
1 1 4
Is there any way to do this with a library in Python? Also, is this a common format for storing a matrix? I know it is easy enough to iterate over a matrix to store it in this fashion, but that seems rather inefficient.
It's not a library, but you can just use a list comprehension:
>>> B = [[1, 2], [3, 4]]
>>> matrix = [[i, j, B[i][j]] for i in range(len(B)) for j in range(len(B[i]))]
>>> print(matrix)
[[0, 0, 1], [0, 1, 2], [1, 0, 3], [1, 1, 4]]
You could also expand the comprehension into explicit for loops, or nest comprehensions inside each other if your matrix B is nested more deeply than this.
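If the matrix is already a numpy array, the same triplet layout can be built without a Python-level loop. A sketch using np.indices (this layout is essentially the dense version of the COO sparse format that scipy.sparse.coo_matrix stores):

```python
import numpy as np

B = np.array([[1, 2], [3, 4]])

# Row and column index grids, flattened alongside the values
rows, cols = np.indices(B.shape)
triplets = np.column_stack([rows.ravel(), cols.ravel(), B.ravel()])
# [[0 0 1]
#  [0 1 2]
#  [1 0 3]
#  [1 1 4]]
```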