I'm trying to find a nice way to take a 2d numpy array and attach column and row names as a structured array. For example:
import numpy as np
column_names = ['a', 'b', 'c']
row_names = ['1', '2', '3']
matrix = np.reshape((1, 2, 3, 4, 5, 6, 7, 8, 9), (3, 3))
# TODO: insert magic here
matrix['3']['a'] # 7
I've been able to set the columns like this:
matrix.dtype = [(n, matrix.dtype) for n in column_names]
This lets me do matrix[2]['a'] but now I want to rename the rows so I can do matrix['3']['a'].
As far as I know it's not possible to "name" the rows with pure structured NumPy arrays.
But if you have pandas it's possible to provide an "index" (which essentially acts like a "row name"):
>>> import pandas as pd
>>> import numpy as np
>>> column_names = ['a', 'b', 'c']
>>> row_names = ['1', '2', '3']
>>> matrix = np.reshape((1, 2, 3, 4, 5, 6, 7, 8, 9), (3, 3))
>>> df = pd.DataFrame(matrix, columns=column_names, index=row_names)
>>> df
   a  b  c
1  1  2  3
2  4  5  6
3  7  8  9
>>> df['a']['3'] # first "column" then "row"
7
>>> df.loc['3', 'a'] # another way to index "row" and "column"
7
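If pandas is not an option, one pure-NumPy workaround is to keep the names in plain dictionaries that map each label to its position. This is only a minimal sketch: `row_pos`, `col_pos`, and `lookup` are hypothetical names introduced here for illustration, not part of NumPy.

```python
import numpy as np

column_names = ['a', 'b', 'c']
row_names = ['1', '2', '3']
matrix = np.reshape((1, 2, 3, 4, 5, 6, 7, 8, 9), (3, 3))

# Hypothetical helpers: map each label to its integer position.
row_pos = {name: i for i, name in enumerate(row_names)}
col_pos = {name: j for j, name in enumerate(column_names)}

def lookup(row, col):
    """Return the element at the named row and column."""
    return matrix[row_pos[row], col_pos[col]]

value = lookup('3', 'a')
print(value)
```

The array itself stays a plain 2d array, so slicing and arithmetic keep working as usual.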
I want to access an element inside a pandas DataFrame. My df looks like this:
index  A        B
0      3, 2, 1  5, 6, 7
1      3, 2, 1  5, 6, 7
2      3, 2, 1  5, 6, 7
I want to print the second value from A for every index, but I don't know how to select them.
Output should be
(2,2,2)
Assuming "3, 2, 1" is a list, you can do this with:
df.A.apply(lambda x: x[1])
If it is a string, you can do this with:
df.A.apply(lambda x: x.split(", ")[1])
If the entries in A are a non-string iterable (such as a list or tuple), you can index into each entry with the pandas .str accessor:
df['A'].str[1]
Full example:
>>> import pandas as pd
>>> a = (3, 2, 1)
>>> df = pd.DataFrame([[a], [a], [a]], columns=['A'])
>>> df
           A
0  (3, 2, 1)
1  (3, 2, 1)
2  (3, 2, 1)
>>> df['A'].str[1]
0    2
1    2
2    2
Name: A, dtype: int64
If the entries are strings, you can use pandas string methods to split them into a list and apply the same approach above:
>>> import pandas as pd
>>> a = '3,2,1'
>>> df = pd.DataFrame([[a], [a], [a]], columns=['A'])
>>> df
       A
0  3,2,1
1  3,2,1
2  3,2,1
>>> df['A'].str.split(',').str[1]
0    2
1    2
2    2
Name: A, dtype: object
If column A contains string values:
import pandas as pd
data = {
"A" :["3, 2, 1","3, 2, 1", "3, 2, 1"],
"B" : ["5, 6, 7", "5, 6, 7", "5, 6, 7"]
}
df = pd.DataFrame(data)
output = df["A"].apply(lambda x: (x.split(",")[1]).strip()).to_list()
print(output)
Result:
['2', '2', '2']
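The string-based answers return the values as strings, while the question asks for (2,2,2). A minimal sketch of one way to get integers, assuming every entry in A is a comma-separated string of whole numbers (the names `second` and `result` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["3, 2, 1", "3, 2, 1", "3, 2, 1"],
    "B": ["5, 6, 7", "5, 6, 7", "5, 6, 7"],
})

# Split on commas, take the second piece, trim spaces, convert to int.
second = df["A"].str.split(",").str[1].str.strip().astype(int)

# Convert to a plain tuple of Python ints.
result = tuple(second.tolist())
print(result)
```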
I'm new on stackoverflow and have switched from R to python. I'm trying to do something probably not too difficult, and while I can do this by butchering, I am wondering what the most pythonic way to do it is. I am trying to divide certain values (E where F=a) in a column by values further down in the column (E where F=b) using column D as a lookup:
import pandas as pd
df = pd.DataFrame({'D':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1], 'E':[10,20,30,40,50,100, 250, 250, 360, 567, 400],'F':['a', 'a', 'a', 'a', 'a', 'b','b', 'b', 'b', 'b', 'c']})
print(df)
out = pd.DataFrame({'D': [1, 2, 3, 4, 5], 'a/b': [0.1, 0.08, 0.12 , 0.1111, 0.0881]})
print(out)
Can anyone help write this nicely?
I'm not entirely sure what you mean by "using D column as lookup" since there is no need for such lookup in the example you provided.
However the quick and dirty way to achieve the output you did provide is
output = pd.DataFrame({'a/b': df[df['F'] == 'a']['E'].values / df[df['F'] == 'b']['E'].values})
output['D'] = df['D']
which gives the following output:
        a/b  D
0  0.100000  1
1  0.080000  2
2  0.120000  3
3  0.111111  4
4  0.088183  5
Lookup with .loc in pandas dataframe as df.loc[rows, columns] where the conditions for rows and columns are True
import numpy as np
# get the unique keys from column D. A set alone does not guarantee order, so sort it to keep the order deterministic.
idx = sorted(set(df['D']))
# A is an array of values with 'F'=a
A = np.array([df.loc[(df['F']=='a') & (df['D']==i),'E'].values[0] for i in idx])
# B is an array of values with 'F'=b
B = np.array([df.loc[(df['F']=='b') & (df['D']==i),'E'].values[0] for i in idx])
# Now divide to build the new dataframe of ratios
out = pd.DataFrame(np.vstack([A/B,idx]).T, columns = ['a/b','D'])
Instead of using numpy.vstack, you can use:
out = pd.DataFrame([A/B, idx]).T
out.columns = ['a/b','D']
with the same result. I tried to do it in a single line (for no reason whatsoever)
Got it:
df = df.set_index('D')
out = df.loc[(df['F'] == 'a'), 'E'] / df.loc[(df['F'] == 'b'), 'E']
out = out.reset_index()
Thanks for your thoughts - I got inspired.
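Another way to express the same lookup is to reshape the frame so that each value of F becomes its own column, keyed by D. This is a sketch rather than the approach used above; it assumes each (D, F) pair occurs at most once, which holds for the sample data, and `wide` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'D': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1],
    'E': [10, 20, 30, 40, 50, 100, 250, 250, 360, 567, 400],
    'F': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c'],
})

# One row per D, one column per value of F (a, b, c).
wide = df.pivot(index='D', columns='F', values='E')

# Rows are already aligned on D, so the division is element-wise.
out = (wide['a'] / wide['b']).rename('a/b').reset_index()
print(out)
```

Because the division is aligned on D, the row order of the original frame does not matter.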
I would like to assign constant numpy array value to pandas dataframe column.
Here is what I tried:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new'] = np.array([]) # did not work
my_df['new'] = np.array([])*len(df) # did not work
Here is what worked:
my_df['new'] = my_df['new'].apply(lambda x: np.array([]))
I am curious why it works with simple scalar, but does not work with numpy array. Is there simpler way to assign numpy array value?
Your "new" column will contain arrays, so it must be an object-dtype column.
The simplest way to initialize it is:
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new']=None
You can then fill it as you want. For example:
for index, (a, b, _) in my_df.iterrows():
    # .at sets a single cell, so the whole array is stored as one object
    my_df.at[index, 'new'] = np.arange(a, b)
#
# col_1 col_2 new
# 0 1 4 [1, 2, 3]
# 1 2 5 [2, 3, 4]
# 2 3 6 [3, 4, 5]
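As for why the direct assignment fails: pandas treats an array on the right-hand side as one value per row, so an empty array has the wrong length for a three-row frame. Passing a list with one array per row avoids this. A minimal sketch, assuming a list comprehension is acceptable in place of the loop:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [4, 5, 6]})

# Build one array per row; the list has the same length as the frame,
# so each array ends up in its own cell of an object column.
my_df['new'] = [np.arange(a, b)
                for a, b in zip(my_df['col_1'], my_df['col_2'])]

print(my_df)
```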
this is the sample dataframe to be fit
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=3, radius=0.4)
neigh.fit(df)
neighbor_index = neigh.kneighbors([[1.3,4.5,2.5]],return_distance=False)
print(neighbor_index)
Output:
Here are the indices of my 3 nearest neighbors:
array([[0, 1, 3]], dtype=int64)
I want the actual index labels from the DataFrame, like array([['a', 'b', 'd']]). How can I get them?
This is easy to achieve. You just need some pandas indexing magic.
Do this:
from sklearn.neighbors import NearestNeighbors
import pandas as pd
#load the data, using the first CSV column as the row index
df = pd.read_csv('data.csv', index_col=0)
print(df)
#build the model and fit it
neigh = NearestNeighbors(n_neighbors=3, radius=0.4)
neigh.fit(df)
#get the index
neighbor_index = neigh.kneighbors([[1.3,4.5,2.5]],return_distance=False)
print(neighbor_index)
#get the row index (the row names) of the dataframe
#neighbor_index is 2D (one row per query point), so take the first row
names = list(df.index[neighbor_index[0]])
print(names)
Results:
   0  1  2
a  1  2  3
b  3  4  5
c  5  2  3
d  4  3  5
[[0 1 3]]
['a', 'b', 'd']
See the pandas documentation here about using numeric indices with a pandas DataFrame.
Below is an example recreating the dataframe in your question. The .iloc function will return rows in a dataframe based on their numeric index. You can retrieve the rows by their numeric index to get the index as it appears in the dataframe.
df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 3, 2], [4, 3, 5]], index=['a', 'b', 'c', 'd'])
df.iloc[[0, 1, 3]].index
which returns an Index containing 'a', 'b' and 'd'.
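Putting the pieces together, here is a self-contained sketch that needs no CSV file. The frame is rebuilt inline from the values shown in the results above, and `positions` and `labels` are illustrative names.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 2, 3], [4, 3, 5]],
                  index=['a', 'b', 'c', 'd'])

# Fit on the raw values; the row labels are kept in df.index.
neigh = NearestNeighbors(n_neighbors=3)
neigh.fit(df.to_numpy())

# kneighbors returns one row of positions per query point.
positions = neigh.kneighbors([[1.3, 4.5, 2.5]], return_distance=False)

# Translate the positions of the first query into row labels.
labels = df.index[positions[0]].tolist()
print(labels)
```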
I would appreciate any help please :)
I'm trying to create a record array from 1d array of strings
and 2d array of numbers (so I can use np.savetxt and dump it into a file).
Unfortunately the docs aren't informative: np.core.records.fromarrays
>>> import numpy as np
>>> x = ['a', 'b', 'c']
>>> y = np.arange(9).reshape((3,3))
>>> print x
['a', 'b', 'c']
>>> print y
[[0 1 2]
[3 4 5]
[6 7 8]]
>>> records = np.core.records.fromarrays([x,y])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/numpy/core/records.py", line 560, in fromarrays
raise ValueError, "array-shape mismatch in array %d" % k
ValueError: array-shape mismatch in array 1
And the output I need is:
[['a', 0, 1, 2]
['b', 3, 4, 5]
['c', 6, 7, 8]]
If all you wish to do is dump x and y to a CSV file, then it is not necessary to use a recarray. If, however, you have some other reason for wanting a recarray, here is how you could create it:
import numpy as np
import numpy.lib.recfunctions as recfunctions
x = np.array(['a', 'b', 'c'], dtype=[('x', '|S1')])
y = np.arange(9).reshape((3,3))
y = y.view([('', y.dtype)]*3)
z = recfunctions.merge_arrays([x, y], flatten=True)
# [('a', 0, 1, 2) ('b', 3, 4, 5) ('c', 6, 7, 8)]
np.savetxt('/tmp/out', z, fmt='%s')
writes
a 0 1 2
b 3 4 5
c 6 7 8
to /tmp/out.
Alternatively, to use np.core.records.fromarrays you would need to list each column of y separately, so the input passed to fromarrays is, as the doc says, a "flat list of arrays".
x = ['a', 'b', 'c']
y = np.arange(9).reshape((3,3))
z = np.core.records.fromarrays([x] + [y[:,i] for i in range(y.shape[1])])
Each item in the list passed to fromarrays will become one column of the resultant recarray. You can see this by inspecting the source code:
_array = recarray(shape, descr)
# populate the record array (makes a copy)
for i in range(len(arrayList)):
    _array[_names[i]] = arrayList[i]
return _array
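The column-per-array idea can be sketched as follows. This version uses `np.rec.fromarrays`, the public alias for the same function, and writes to an in-memory buffer instead of a file. The field names `label`, `c0`, `c1`, `c2` are made up for illustration.

```python
import io
import numpy as np

x = ['a', 'b', 'c']
y = np.arange(9).reshape((3, 3))

# One list entry per output column: x first, then each column of y.
columns = [x] + [y[:, i] for i in range(y.shape[1])]
z = np.rec.fromarrays(columns, names='label,c0,c1,c2')

# Write the records as space-separated text.
buf = io.StringIO()
np.savetxt(buf, z, fmt='%s')
print(buf.getvalue())
```

Naming the fields is optional; without `names`, NumPy assigns default field names.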
By the way, you might want to use pandas here for the extra convenience (no mucking around with dtypes, flattening, or iterating over columns required):
import numpy as np
import pandas as pd
x = ['a', 'b', 'c']
y = np.arange(9).reshape((3,3))
df = pd.DataFrame(y)
df['x'] = x
print(df)
#    0  1  2  x
# 0  0  1  2  a
# 1  3  4  5  b
# 2  6  7  8  c
df.to_csv('/tmp/out')
# ,0,1,2,x
# 0,0,1,2,a
# 1,3,4,5,b
# 2,6,7,8,c
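To match the layout requested in the question (label first, no header, no index column), the same DataFrame can be written with a few extra options. A minimal sketch, writing to an in-memory buffer rather than /tmp/out so it runs anywhere:

```python
import io
import numpy as np
import pandas as pd

x = ['a', 'b', 'c']
y = np.arange(9).reshape((3, 3))

df = pd.DataFrame(y)
# Put the label column first instead of appending it at the end.
df.insert(0, 'x', x)

buf = io.StringIO()
# Space-separated, without the header row or the index column.
df.to_csv(buf, sep=' ', header=False, index=False)
text = buf.getvalue()
print(text)
```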