I have a pandas DataFrame of shape (X, Y) that looks like this:
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
and a SciPy sparse matrix (CSC) of shape (X, Z) that looks something like this:
[[0, 1, 0],
[0, 0, 1],
[1, 0, 0]]
How can I add the content from the matrix to the data frame in a new named column such that the data frame will end up like this:
[[1, 2, 3, [0, 1, 0]],
[4, 5, 6, [0, 0, 1]],
[7, 8, 9, [1, 0, 0]]]
Notice the data frame now has shape (X, Y+1) and rows from the matrix are elements in the data frame.
import numpy as np
import pandas as pd
import scipy.sparse as sparse
df = pd.DataFrame(np.arange(1,10).reshape(3,3))
arr = sparse.coo_matrix(([1,1,1], ([0,1,2], [1,2,0])), shape=(3,3))
df['newcol'] = arr.toarray().tolist()
print(df)
yields
0 1 2 newcol
0 1 2 3 [0, 1, 0]
1 4 5 6 [0, 0, 1]
2 7 8 9 [1, 0, 0]
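If memory matters, note that toarray().tolist() densifies the whole matrix. Newer pandas versions (0.25+) can keep the data sparse via the sparse accessor; a minimal sketch of an alternative layout that adds one DataFrame column per matrix column instead of a single column of lists (the s0/s1/s2 names are arbitrary):
import numpy as np
import pandas as pd
import scipy.sparse as sparse
df = pd.DataFrame(np.arange(1, 10).reshape(3, 3))
csc = sparse.csc_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 0])), shape=(3, 3))
# Build sparse-backed columns from the CSC matrix and join them to the frame.
sparse_cols = pd.DataFrame.sparse.from_spmatrix(csc, columns=['s0', 's1', 's2'])
out = pd.concat([df, sparse_cols], axis=1)
print(out)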
Consider using a higher-dimensional data structure (a Panel) rather than storing an array in your column; here csc is the sparse matrix densified into a DataFrame, e.g. pd.DataFrame(arr.toarray()). Note that Panel was deprecated in pandas 0.20 and removed in 1.0, so this only applies to older versions:
In [11]: p = pd.Panel({'df': df, 'csc': csc})
In [12]: p.df
Out[12]:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
In [13]: p.csc
Out[13]:
0 1 2
0 0 1 0
1 0 0 1
2 1 0 0
You can then look at cross-sections, etc.:
In [14]: p.xs(0)
Out[14]:
csc df
0 0 1
1 1 2
2 0 3
See the docs for more on Panels.
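On pandas versions without Panel, a rough equivalent is a column MultiIndex built with pd.concat; a sketch, assuming the sparse matrix has been densified into a DataFrame named csc:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1, 10).reshape(3, 3))
csc = pd.DataFrame([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
# Put both frames side by side under a top-level column key, Panel-style.
p = pd.concat({'df': df, 'csc': csc}, axis=1)
print(p['df'])              # the original frame
print(p['csc'])             # the matrix as a frame
print(p.loc[0].unstack(0))  # row 0 across both frames, like p.xs(0) on the Panel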
df = pd.DataFrame(np.arange(1,10).reshape(3,3))
df['newcol'] = pd.Series(your_2d_numpy_array.tolist())
You can add a numpy array to a DataFrame and retrieve it again like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'b':range(10)}) # target dataframe
a = np.random.normal(size=(10,2)) # numpy array
df['a']=a.tolist() # save array
np.array(df['a'].tolist()) # retrieve array
This builds on the previous answer, which confused me because of the sparse part; this version works well for a plain (non-sparse) numpy array.
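A quick round-trip check of that pattern (it relies on every cell holding a list of the same length; ragged lists would come back as a 1-D object array instead):
import numpy as np
import pandas as pd
df = pd.DataFrame({'b': range(10)})
a = np.random.normal(size=(10, 2))
df['a'] = a.tolist()
# Rebuild the array from the column and confirm nothing was lost.
restored = np.array(df['a'].tolist())
assert restored.shape == (10, 2)
assert np.allclose(restored, a)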
Here is another example:
import numpy as np
import pandas as pd
""" This just creates a list of touples, and each element of the touple is an array"""
a = [ (np.random.randint(1,10,10), np.array([0,1,2,3,4,5,6,7,8,9])) for i in
range(0,10) ]
""" Panda DataFrame will allocate each of the arrays , contained as a touple
element , as column"""
df = pd.DataFrame(data =a,columns=['random_num','sequential_num'])
The secret in general is to arrange the data in the form a = [ (array_11, array_12, ..., array_1n), ..., (array_m1, array_m2, ..., array_mn) ] and the pandas DataFrame will order the data in n columns of arrays. Of course, lists of arrays could be used instead of tuples; in that case the form would be (a short sketch of this variant follows the output below):
a = [ [array_11, array_12,...,array_1n],...,[array_m1,array_m2,...,array_mn] ]
This is the output if you print(df) from the code above:
random_num sequential_num
0 [7, 9, 2, 2, 5, 3, 5, 3, 1, 4] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [8, 7, 9, 8, 1, 2, 2, 6, 6, 3] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 [3, 4, 1, 2, 2, 1, 4, 2, 6, 1] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3 [3, 1, 1, 1, 6, 2, 8, 6, 7, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4 [4, 2, 8, 5, 4, 1, 2, 2, 3, 3] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
5 [3, 2, 7, 4, 1, 5, 1, 4, 6, 3] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
6 [5, 7, 3, 9, 7, 8, 4, 1, 3, 1] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
7 [7, 4, 7, 6, 2, 6, 3, 2, 5, 6] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
8 [3, 1, 6, 3, 2, 1, 5, 2, 2, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
9 [7, 2, 3, 9, 5, 5, 8, 6, 9, 8] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
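A minimal sketch of the list-of-lists form mentioned above (the column names are arbitrary):
import numpy as np
import pandas as pd
# Each inner list is one row; each of its elements becomes an array-valued cell.
a = [[np.random.randint(1, 10, 10), np.arange(10)] for _ in range(3)]
df = pd.DataFrame(a, columns=['random_num', 'sequential_num'])
print(df)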
Another variation of the example above:
b = [ (i,"text",[14, 5,], np.array([0,1,2,3,4,5,6,7,8,9])) for i in
range(0,10) ]
df = pd.DataFrame(data=b,columns=['Number','Text','2Elemnt_array','10Element_array'])
Output of df:
Number Text 2Element_array 10Element_array
0 0 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 1 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 2 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3 3 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4 4 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
5 5 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
6 6 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
7 7 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
8 8 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
9 9 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
If you want to add other columns of arrays, then:
df['3Element_array'] = [[1, 2, 3] for _ in range(10)]
The final output of df will be:
Number Text 2Element_array 10Element_array 3Element_array
0 0 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
1 1 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
2 2 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
3 3 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
4 4 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
5 5 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
6 6 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
7 7 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
8 8 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
9 9 text [14, 5] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [1, 2, 3]
Related
Say I have this array:
array = np.array([[1,2,3],[4,5,6],[7,8,9]])
Returns:
123
456
789
How should I go about getting it to return something like this?
111222333
111222333
111222333
444555666
444555666
444555666
777888999
777888999
777888999
You'd have to use np.repeat twice here.
np.repeat(np.repeat(array, 3, axis=1), 3, axis=0)
# [[1 1 1 2 2 2 3 3 3]
# [1 1 1 2 2 2 3 3 3]
# [1 1 1 2 2 2 3 3 3]
# [4 4 4 5 5 5 6 6 6]
# [4 4 4 5 5 5 6 6 6]
# [4 4 4 5 5 5 6 6 6]
# [7 7 7 8 8 8 9 9 9]
# [7 7 7 8 8 8 9 9 9]
# [7 7 7 8 8 8 9 9 9]]
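An equivalent one-liner, if you prefer it, is np.kron with a block of ones; same result, though the nested repeat above is typically at least as fast:
import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Each element is expanded into a 3x3 block of itself.
np.kron(array, np.ones((3, 3), dtype=array.dtype))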
For fun (the nested repeat will be more efficient), you could use einsum on the input array and an array of ones that has extra dimensions; this builds a multidimensional array whose axes are in the right order to reshape down to the expected 2D shape:
np.einsum('ij,ikjl->ikjl', array, np.ones((3,3,3,3), dtype=int)).reshape(9,9)
The generic method being:
i, j = array.shape
k = 3  # row repeat factor
l = 3  # column repeat factor
np.einsum('ij,ikjl->ikjl', array, np.ones((i, k, j, l), dtype=array.dtype)).reshape(i*k, j*l)
Output:
array([[1, 1, 1, 2, 2, 2, 3, 3, 3],
[1, 1, 1, 2, 2, 2, 3, 3, 3],
[1, 1, 1, 2, 2, 2, 3, 3, 3],
[4, 4, 4, 5, 5, 5, 6, 6, 6],
[4, 4, 4, 5, 5, 5, 6, 6, 6],
[4, 4, 4, 5, 5, 5, 6, 6, 6],
[7, 7, 7, 8, 8, 8, 9, 9, 9],
[7, 7, 7, 8, 8, 8, 9, 9, 9],
[7, 7, 7, 8, 8, 8, 9, 9, 9]])
What is nice about this method, however, is that it is quite easy to change the axis order to obtain other patterns, or to work with higher dimensions.
Examples with other patterns:
>>> np.einsum('ij,iklj->iklj', array, np.ones((3,3,3,3), dtype=int)).reshape(9,9)
array([[1, 2, 3, 1, 2, 3, 1, 2, 3],
[1, 2, 3, 1, 2, 3, 1, 2, 3],
[1, 2, 3, 1, 2, 3, 1, 2, 3],
[4, 5, 6, 4, 5, 6, 4, 5, 6],
[4, 5, 6, 4, 5, 6, 4, 5, 6],
[4, 5, 6, 4, 5, 6, 4, 5, 6],
[7, 8, 9, 7, 8, 9, 7, 8, 9],
[7, 8, 9, 7, 8, 9, 7, 8, 9],
[7, 8, 9, 7, 8, 9, 7, 8, 9]])
>>> np.einsum('ij,kjil->kjil', array, np.ones((3,3,3,3), dtype=int)).reshape(9,9)
array([[1, 1, 1, 4, 4, 4, 7, 7, 7],
[2, 2, 2, 5, 5, 5, 8, 8, 8],
[3, 3, 3, 6, 6, 6, 9, 9, 9],
[1, 1, 1, 4, 4, 4, 7, 7, 7],
[2, 2, 2, 5, 5, 5, 8, 8, 8],
[3, 3, 3, 6, 6, 6, 9, 9, 9],
[1, 1, 1, 4, 4, 4, 7, 7, 7],
[2, 2, 2, 5, 5, 5, 8, 8, 8],
[3, 3, 3, 6, 6, 6, 9, 9, 9]])
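The same 4-D trick can be written with plain broadcasting instead of einsum, which some may find easier to read (a sketch with the original 3x3 array and k = l = 3):
import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
k, l = 3, 3
# Insert the two repeat axes, broadcast against ones, then collapse back to 2D.
expanded = array[:, None, :, None] * np.ones((1, k, 1, l), dtype=array.dtype)
result = expanded.reshape(array.shape[0] * k, array.shape[1] * l)
print(result)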
I want to create a NumPy array by duplicating another array by a few rows. I did it as shown below. Is there a NumPyier way of doing this?
>>> a = np.arange(0,10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = tuple( a for _ in range(3) )
>>> b
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
>>> c = np.vstack( b )
>>> c
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
I found a way to do it. Sharing it here.
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a[None,:]
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> np.repeat( a[None,:], 3, axis=0 )
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
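np.tile and np.broadcast_to are two more options worth knowing here (broadcast_to returns a read-only view, so copy it if you need to modify the result):
import numpy as np
a = np.arange(10)
print(np.tile(a, (3, 1)))                  # three stacked copies, shape (3, 10)
print(np.broadcast_to(a, (3, 10)).copy())  # same result via a broadcast view, then copied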
I have a dataframe (df) that looks like this:
a b
loc.1 [1, 2, 3, 4, 7, 5, 6]
loc.2 [3, 4, 3, 7, 7, 8, 6]
loc.3 [1, 4, 3, 1, 7, 8, 6]
...
I want to find the maximum of the array in column b and append this to the original data frame. My thought was something like this:
for line in df:
    split = map(float, b.split(','))
    count_max = max(split)
    print count
Ideal output should be:
a b max_val
loc.1 [1, 2, 3, 4, 7, 5, 6] 7
loc.2 [3, 4, 3, 7, 7, 8, 6] 8
loc.3 [1, 4, 3, 1, 7, 8, 6] 8
...
But this does not work, as I cannot use b.split as it is not defined...
If working with lists without NaNs, it is best to use max in a list comprehension or with map:
a['max'] = [max(x) for x in a['b']]
a['max'] = list(map(max, a['b']))
Pure pandas solution:
a['max'] = pd.DataFrame(a['b'].values.tolist()).max(axis=1)
Sample:
array = {'loc.1': np.array([ 1,2,3,4,7,5,6]),
'loc.2': np.array([ 3,4,3,7,7,8,6]),
'loc.3': np.array([ 1,4,3,1,7,8,6])}
L = [(k, v) for k, v in array.items()]
a = pd.DataFrame(L, columns=['a','b']).set_index('a')
a['max'] = [max(x) for x in a['b']]
print (a)
b max
a
loc.1 [1, 2, 3, 4, 7, 5, 6] 7
loc.2 [3, 4, 3, 7, 7, 8, 6] 8
loc.3 [1, 4, 3, 1, 7, 8, 6] 8
EDIT:
You can also get max in list comprehension:
L = [(k, v, max(v)) for k, v in array.items()]
a = pd.DataFrame(L, columns=['a','b', 'max']).set_index('a')
print (a)
b max
a
loc.1 [1, 2, 3, 4, 7, 5, 6] 7
loc.2 [3, 4, 3, 7, 7, 8, 6] 8
loc.3 [1, 4, 3, 1, 7, 8, 6] 8
Try this:
df["max_val"] = df["b"].apply(lambda x:max(x))
You can use numpy arrays for a vectorised calculation:
df = pd.DataFrame({'a': ['loc.1', 'loc.2', 'loc.3'],
'b': [[1, 2, 3, 4, 7, 5, 6],
[3, 4, 3, 7, 7, 8, 6],
[1, 4, 3, 1, 7, 8, 6]]})
df['maxval'] = np.array(df['b'].values.tolist()).max(axis=1)
print(df)
# a b maxval
# 0 loc.1 [1, 2, 3, 4, 7, 5, 6] 7
# 1 loc.2 [3, 4, 3, 7, 7, 8, 6] 8
# 2 loc.3 [1, 4, 3, 1, 7, 8, 6] 8
I have a very large 2D numpy array of m x n elements. For each row, I need to remove exactly one element. So for example from a 4x6 matrix I might need to delete [0, 1], [1, 4], [2, 3], and [3, 3] - I have this set of coordinates stored in a list. In the end, the matrix will ultimately shrink in width by 1.
Is there a standard way to do this using a mask? Ideally, I need this to be as performant as possible.
Here is a method that uses ravel_multi_index() to compute flat one-dimensional indices, deletes those elements with delete(), and reshapes back to a two-dimensional array:
import numpy as np
n = 12
a = np.repeat(np.arange(10)[None, :], n, axis=0)
index = np.random.randint(0, 10, n)
ravel_index = np.ravel_multi_index((np.arange(n), index), a.shape)
np.delete(a, ravel_index).reshape(n, -1)
The index:
array([4, 6, 9, 0, 3, 5, 3, 8, 9, 8, 4, 4])
The result:
array([[0, 1, 2, 3, 4, 5, 6, 7, 9],
[1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 9],
[1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 9],
[0, 1, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8],
[0, 1, 2, 4, 5, 6, 7, 8, 9]])
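Since the question asks about masks specifically, here is a boolean-mask variant of the same idea (same setup as above); it avoids np.delete by keeping every element except the flagged one in each row:
import numpy as np
n = 12
a = np.repeat(np.arange(10)[None, :], n, axis=0)
index = np.random.randint(0, 10, n)
# Mark exactly one element per row as False, select the rest, and reshape back.
mask = np.ones(a.shape, dtype=bool)
mask[np.arange(n), index] = False
result = a[mask].reshape(n, -1)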
I have a pandas df with two columns having either lists or NaN values. There are no rows having NaN in both columns. I want to create a third column that merges the values of the other two columns in the following way:-
if row df.a is NaN -> df.c = df.b
if row df.b is NaN -> df.c = df.a
else df.c = df.a + df.b
Input:
df
a b
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] NaN
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] NaN
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] NaN
3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] NaN
4 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] NaN
5 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] NaN
6 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] NaN
7 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] NaN
8 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
9 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
10 NaN [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
11 NaN [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Output:
df.c
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
5 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
6 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
7 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
8 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
9 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
10 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
11 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
I tried to use this nested condition with apply:
df['c'] = df.apply(lambda x: x.a if x.b is float else (x.b if x.a is float else (x['a'] + x['b'])), axis = 1)
but it gives me this error:
TypeError: ('can only concatenate list (not "float") to list', u'occurred at index 0')
I am using
if x is float
(and it is actually working) because it is the only way I found to tell a list apart from a NaN value.
When you use pd.DataFrame.stack, null values are dropped by default. We can then group by the first level of the index and concatenate the lists together with sum:
df.stack().groupby(level=0).sum()
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
5 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
6 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
7 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
8 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
9 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
10 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
11 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
dtype: object
We can then add it to a copy of the dataframe with assign
df.assign(c=df.stack().groupby(level=0).sum())
Or add it to a new column in place
df['c'] = df.stack().groupby(level=0).sum()
You can convert the NaNs to empty lists and then apply np.sum:
In [718]: df['c'] = df[['a', 'b']].applymap(lambda x: [] if x != x else x).apply(np.sum, axis=1); df['c']
Out[718]:
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
5 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
6 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
7 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, ...
8 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, ...
9 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
10 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Name: c, dtype: object
This works for any number of columns that have list/NaN contents.
You can use fillna to replace the NaNs with empty lists first:
df = pd.DataFrame({'a': [[0, 1, 2], np.nan, [0, 1, 2]],
'b':[np.nan,[0, 1, 2],[ 5, 6, 7, 8, 9]]})
print (df)
s = pd.Series([[]], index=df.index)
df['c'] = df['a'].fillna(s) + df['b'].fillna(s)
print (df)
a b c
0 [0, 1, 2] NaN [0, 1, 2]
1 NaN [0, 1, 2] [0, 1, 2]
2 [0, 1, 2] [5, 6, 7, 8, 9] [0, 1, 2, 5, 6, 7, 8, 9]
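If fillna with a Series of empty lists is ever rejected by your pandas version, a more defensive sketch replaces the non-list cells explicitly before adding:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [[0, 1, 2], np.nan, [0, 1, 2]],
                   'b': [np.nan, [0, 1, 2], [5, 6, 7, 8, 9]]})
# Anything that is not a list (i.e. the NaNs) becomes an empty list,
# then element-wise + concatenates the two list columns.
to_list = lambda x: x if isinstance(x, list) else []
df['c'] = df['a'].apply(to_list) + df['b'].apply(to_list)
print(df)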