I have a Python list:
my_list = [1, 'V']
I have a pd.DataFrame:
A B C
0 f v b
1 f i n
2 f i m
I need to create a new column in my DataFrame in which every row holds my_list:
A B C D
0 f v b [1, 'V']
1 f i n [1, 'V']
2 f i m [1, 'V']
As far as I understand, Python lists can be stored as cell values, because df.groupby with apply(list) produces them:
df = df.groupby(['A', 'B'], group_keys=True)['C'].apply(list).reset_index(name='H')
A B H
0 f i [n, m]
1 f v [b]
Is it possible without converting the type of my_list? What is the easiest way to do that?
I tried:
df['D'] = my_list
df['D'] = pd.Series(my_list)
but neither gave the result I expected.
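np.repeat does not work directly here: np.repeat(my_list, n) flattens the list and returns len(my_list) * n elements (and turns 1 into the string '1'), so assigning the result to a column raises a length-mismatch ValueError. Instead, repeat the list object itself once per row; the number of rows is available from the shape of the DataFrame.
import pandas as pd
my_list = [1, 'V']
df = pd.DataFrame({'col1': ['f', 'f', 'f'], 'col2': ['v', 'i', 'i'], 'col3': ['b', 'n', 'm']})
df['new_col'] = [my_list] * df.shape[0]
This puts my_list in every row of new_col. Note that all rows refer to the same list object.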
You can do it by creating a new array from my_list, joining it to the existing data with hstack, and then forming a new DataFrame. Note that this yields two extra string columns ('1' and 'V') rather than a single column of lists.
import numpy as np
import pandas as pd
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.array([1, 'V']).repeat(3).reshape(2,3).transpose()
df = pd.DataFrame(np.hstack((a1,a2)))
Edit: Another variant that has been tested and gives the same result:
import pandas as pd
import numpy as np
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.squeeze(np.dstack((np.array(1).repeat(3), np.array('V').repeat(3))))
df = pd.DataFrame(np.hstack((a1,a2)))
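If the goal is the single column of lists shown in the question, rather than separate string columns, a minimal sketch is to assign one list per row. This assumes it is acceptable for each row to get its own copy of my_list; the column names A, B, C and D are taken from the question.

```python
import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'A': ['f', 'f', 'f'],
                   'B': ['v', 'i', 'i'],
                   'C': ['b', 'n', 'm']})

# One independent copy of my_list per row. The column gets dtype object
# and the elements keep their original types (1 stays an int).
df['D'] = [list(my_list) for _ in range(len(df))]

print(df)
```

Copying the list per row means that changing the list in one cell later does not silently change the others.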
How can I split a 2D array by a grouping variable and return a list of arrays? (The order is also important.)
To show the expected outcome, the equivalent in R can be done as
> (A = matrix(c("a", "b", "a", "c", "b", "d"), nr=3, byrow=TRUE)) # input
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
[3,] "b" "d"
> (split.data.frame(A, A[,1])) # output
$a
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
$b
[,1] [,2]
[1,] "b" "d"
EDIT: To clarify: I'd like to split the array/matrix, A into a list of multiple arrays based on the unique values in the first column. That is, split A into one array where the first column has an a, and another array where the first column has a b.
I have tried the approach from "Python equivalent of R "split"-function", but this gives three arrays:
import numpy as np
import itertools
A = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
b = A[:,0]
def split(x, f):
    return list(itertools.compress(x, f)), list(itertools.compress(x, (not i for i in f)))
split(A, b)
([array(['a', 'b'], dtype='<U1'),
array(['a', 'c'], dtype='<U1'),
array(['b', 'd'], dtype='<U1')],
[])
I also tried numpy.split, using np.split(A, b), but it needs integers. I thought I might be able to use "How to convert strings into integers in Python?" to convert the letters to integers, but even if I pass integers, it doesn't split as expected:
c = np.transpose(np.array([1,1,2]))
np.split(A, c) # returns 4 arrays
Can this be done? thanks
EDIT: please note that this is a small example, and the number of groups may be greater than two and they may not be ordered.
You can use pandas:
import pandas as pd
import numpy as np
a = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
listofdfs = {}
for n,g in pd.DataFrame(a).groupby(0):
    listofdfs[n] = g
listofdfs['a'].values
Output:
array([['a', 'b'],
['a', 'c']], dtype=object)
And,
listofdfs['b'].values
Output:
array([['b', 'd']], dtype=object)
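Since the question says the order matters, here is a sketch of the same pandas idea that keeps the groups in order of first appearance. It relies on groupby's sort=False option; the helper name split_by_first_column is made up for illustration.

```python
import numpy as np
import pandas as pd

def split_by_first_column(arr):
    """Return {key: sub-array}, with keys in order of first appearance."""
    frame = pd.DataFrame(arr)
    # sort=False keeps the groups in the order their keys first occur;
    # rows inside each group keep their original order.
    return {key: group.to_numpy() for key, group in frame.groupby(0, sort=False)}

a = np.array([["b", "d"], ["a", "b"], ["a", "c"]])
parts = split_by_first_column(a)
print(list(parts))
print(parts["a"])
```

With the default sort=True the groups would instead come back sorted by key.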
Or, you could use itertools groupby (it only groups consecutive rows, so the array must already be ordered by the first column):
import numpy as np
from itertools import groupby
l = [np.stack(list(g)) for k, g in groupby(a, lambda x: x[0])]
l[0]
Output:
array([['a', 'b'],
['a', 'c']], dtype='<U1')
And,
l[1]
Output:
array([['b', 'd']], dtype='<U1')
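For groups that are not stored contiguously, a NumPy-only sketch is to sort the rows by key with a stable sort and then cut the sorted array wherever the key changes. The function name split_rows_by_key is hypothetical, and the sketch assumes a non-empty 2D array whose first column holds the grouping key.

```python
import numpy as np

def split_rows_by_key(arr):
    """Split a non-empty 2D array into sub-arrays by the value in column 0."""
    keys = arr[:, 0]
    # A stable sort brings equal keys together while keeping the original
    # row order inside each group.
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    # Positions where the key changes mark the start of a new group.
    boundaries = np.flatnonzero(sorted_keys[1:] != sorted_keys[:-1]) + 1
    return np.split(arr[order], boundaries)

A = np.array([["b", "d"], ["a", "b"], ["c", "e"], ["a", "c"]])
for part in split_rows_by_key(A):
    print(part)
```

The groups come back ordered by sorted key, not by first appearance.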
If I understand your question, you can do simple slicing, as in:
a = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
x,y=a[:2,:],a[2,:]
x
array([['a', 'b'],
['a', 'c']], dtype='<U1')
y
array(['b', 'd'], dtype='<U1')
I am trying to do the following:
Given a DataFrame of distances, I want to identify the k nearest neighbours of each element.
Example:
A B C D
A 0 1 3 2
B 5 0 2 2
C 3 2 0 1
D 2 3 4 0
If k=2, it should return:
A: B D
B: C D
C: D B
D: A B
Distances are not necessarily symmetric.
I think there must be an efficient way to do this with pandas DataFrames, but I cannot find anything.
Homemade code is also very welcome! :)
Thank you!
The way I see it, you simply find the k + 1 smallest distances in each row and drop the 0 (the element itself), which leaves the k nearest neighbours. Keep in mind that the code will not work if any off-diagonal distance is zero: only the diagonal entries are allowed to be 0.
import pandas as pd
import numpy as np
X = pd.DataFrame([[0, 1, 3, 2],[5, 0, 2, 2],[3, 2, 0, 1],[2, 3, 4, 0]])
X.columns = ['A', 'B', 'C', 'D']
X.index = ['A', 'B', 'C', 'D']
X = X.T
for i in X.index:
    Y = X.nsmallest(3, i)  # 3 = k + 1 for k = 2
    Y = Y.T
    Y = Y[Y.index.str.startswith(i)]
    Y = Y.loc[:, Y.any()]  # drop the zero-distance column (the element itself)
    for j in Y.index:
        print(i + ": ", list(Y.columns))
This prints out:
A: ['B', 'D']
B: ['C', 'D']
C: ['D', 'B']
D: ['A', 'B']
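An alternative sketch uses NumPy's argsort on each row and masks the diagonal, so off-diagonal zero distances are handled as well. It assumes a square distance matrix whose rows and columns are in the same order; the function name k_nearest is made up for illustration.

```python
import numpy as np
import pandas as pd

def k_nearest(dist, k):
    """Return {row label: labels of the k nearest columns}, nearest first."""
    values = dist.to_numpy(dtype=float, copy=True)
    # Exclude each element from its own neighbour list.
    np.fill_diagonal(values, np.inf)
    # Stable sort: ties keep the original column order.
    nearest = np.argsort(values, axis=1, kind="stable")[:, :k]
    return {label: list(dist.columns[idx]) for label, idx in zip(dist.index, nearest)}

labels = ['A', 'B', 'C', 'D']
X = pd.DataFrame([[0, 1, 3, 2], [5, 0, 2, 2], [3, 2, 0, 1], [2, 3, 4, 0]],
                 index=labels, columns=labels)
for label, neighbours in k_nearest(X, 2).items():
    print(label + ":", neighbours)
```

Sorting every row costs more than a partial selection would, but for small matrices it keeps the code short.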
I have 3 lists of unequal length.
I want to pad the shorter lists with "X" so that each is as long as the longest list.
A = [10,20,30,40,50]
B = ["A", "B", "C"]
C = ["X1", "X2"]
After appending "X", it should look like the following:
A = [10,20,30,40,50]
B = ["A", "B", "C", "X","X"]
C = ["P1", "P2", "X", "X", "X"]
I have used the code below to achieve it:
for i, a in enumerate(A):
    if i < len(B):
        pass
    else:
        B.append('X')
How can I do it efficiently in Python?
Use the extend method
B.extend(['X'] * (len(A)-len(B)))
Calculate the maximum length and, for each list, append the difference.
In Python, lists support the binary operator + to concatenate lists, as well as * to repeat a list.
A = [10,20,30,40,50]
B = ["A", "B", "C"]
C = ["X1", "X2"]
max_length = max(len(A), len(B), len(C))
A += ['X'] * (max_length - len(A))
B += ['X'] * (max_length - len(B))
C += ['X'] * (max_length - len(C))
Then collect them in a container list, for less repeated code and better extensibility.
A = [10,20,30,40,50]
B = ["A", "B", "C"]
C = ["X1", "X2"]
arrays = [A, B, C]
max_length = 0
for array in arrays:
    max_length = max(max_length, len(array))
for array in arrays:
    array += ['X'] * (max_length - len(array))
Result:
print(A) # [10, 20, 30, 40, 50]
print(B) # ['A', 'B', 'C', 'X', 'X']
print(C) # ['X1', 'X2', 'X', 'X', 'X']
The Python itertools module has a lot of nifty functions that are good for cases like this. For example (Python 3; in Python 2 the functions were named izip_longest and izip):
>>> from itertools import zip_longest
>>> A = [10, 20, 30, 40, 50]
>>> B = ["A", "B", "C"]
>>> C = ["X1", "X2"]
>>> A, B, C = (list(x) for x in zip(*zip_longest(A, B, C, fillvalue='X')))
>>> A
[10, 20, 30, 40, 50]
>>> B
['A', 'B', 'C', 'X', 'X']
>>> C
['X1', 'X2', 'X', 'X', 'X']
Write a function that does this for you:
A = [10, 20, 30, 40, 50]
B = ["A", "B", "C"]
C = ["X1", "X2"]
def extend_with_extra_elements(*some_lists):
    max_some_lists_length = max(map(len, some_lists))
    for some_list in some_lists:
        extra_elements_count = max_some_lists_length - len(some_list)
        extra_elements = ['X'] * extra_elements_count
        yield some_list + extra_elements
A, B, C = extend_with_extra_elements(A, B, C)
This is efficient enough.
Use max() to get the maximum length and then append a padding list to B and C.
If you want to replace X with P, you can use a list comprehension [i.replace('X','P') for i in C] to get ['P1','P2']:
>>> m=max(len(A),len(B),len(C))
>>> B+['X']*(m-len(B))
['A', 'B', 'C', 'X', 'X']
>>> [i.replace('X','P') for i in C]+['X']*(m-len(C))
['P1', 'P2', 'X', 'X', 'X']
I have a list of numpy arrays that contains a list of name-value pairs which are both strings. Every name and value can be found multiple times in the list, and I would like to convert it to a binary matrix.
The columns represent the values while the rows represent a key/name, and when a field is set to 1 it represents that particular name value pair.
E.g
I have
A : aa
A : bb
A : cc
B : bb
C : aa
and I want to convert it to
aa bb cc
A 1 1 1
B 0 1 0
C 1 0 0
I have some code that does this but I was wondering if there is an easier/out of the box way of doing this with numpy or some other library.
This is my code so far:
resources = set(result[:,1])
resourcesDict = {}
i = 0
for r in resources:
    resourcesDict[r] = i
    i = i + 1
clients = set(result[:,0])
clientsDict = {}
i = 0
for c in clients:
    clientsDict[c] = i
    i = i + 1
arr = np.zeros((len(clientsDict),len(resourcesDict)), dtype = 'bool')
for line in result[:,0:2]:
    arr[clientsDict[line[0]],resourcesDict[line[1]]] = True
and result contains the following:
array([["a","aa"],["a","bb"],..]
I feel that using pandas.DataFrame.pivot is the best way:
>>> df = pd.DataFrame({'foo': ['one','one','one','two','two','two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6]})
>>> df
foo bar baz
0 one A 1
1 one B 2
2 one C 3
3 two A 4
4 two B 5
5 two C 6
Or you can load your pair list using
>>> df = pd.read_csv('ratings.csv')
Then
>>> df.pivot(index='foo', columns='bar', values='baz')
A B C
one 1 2 3
two 4 5 6
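For the name/value pairs in the question there is no third column to pivot on, so a sketch using pandas.crosstab may be closer to what is asked. crosstab counts how often each pair occurs, and clipping at 1 turns the counts into a 0/1 matrix. The column names name and value are assumptions chosen for illustration.

```python
import pandas as pd

pairs = [("A", "aa"), ("A", "bb"), ("A", "cc"), ("B", "bb"), ("C", "aa")]
df = pd.DataFrame(pairs, columns=["name", "value"])

# crosstab counts occurrences of each (name, value) pair;
# clip(upper=1) caps the counts so repeated pairs still give 1.
binary = pd.crosstab(df["name"], df["value"]).clip(upper=1)
print(binary)
```

The row and column labels of the result are sorted, which matches the layout in the question.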
You probably have something like
m_dict = {'A': ['aa', 'bb', 'cc'], 'B': ['bb'], 'C': ['aa']}
I would go like this:
from collections import defaultdict
res = {}
for k, v in m_dict.items():
    res[k] = defaultdict(int)
    for col in v:
        res[k][col] = 1
Edit:
Given your format, it would probably be more along the lines of:
from collections import defaultdict
m_array = [['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']]
res = defaultdict(lambda: defaultdict(int))
for k, v in m_array:
    res[k][v] = 1
which both give:
>>> res['A']['aa']
1
>>> res['B']['aa']
0
This is a job for np.unique. It is not clear what format your data is in, but you need to get two 1-D arrays, one with the keys, another with the values, e.g.:
import numpy as np
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
out = np.zeros((len(rows), len(cols)), dtype=int)  # np.int was removed from NumPy
out[row_idx, col_idx] += 1
>>> out
array([[1, 1, 1],
[0, 1, 0],
[1, 0, 0]])
>>> rows
array(['A', 'B', 'C'], dtype='<U2')
>>> cols
array(['aa', 'bb', 'cc'], dtype='<U2')
If you have no repeated key-value pairs, this code will work just fine. If there are repetitions, the indexed += above counts each pair only once, so I would suggest abusing scipy's sparse module, whose COO format sums duplicate entries when converted to a dense array:
import numpy as np
import scipy.sparse as sps
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa'], ['A', 'bb']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
out = sps.coo_matrix((np.ones_like(row_idx), (row_idx, col_idx))).toarray()
>>> out
array([[1, 2, 1],
[0, 1, 0],
[1, 0, 0]])
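If you would rather not depend on SciPy, a sketch with np.add.at gives the same counts, because it applies the addition once per index pair instead of once per distinct cell. The variable names counts and binary are made up for illustration.

```python
import numpy as np

kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa'], ['A', 'bb']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)

counts = np.zeros((len(rows), len(cols)), dtype=int)
# Unbuffered add: every (row, col) occurrence increments its cell, so
# duplicates accumulate (a plain `counts[row_idx, col_idx] += 1` would not).
np.add.at(counts, (row_idx, col_idx), 1)

# For a strictly binary matrix, threshold the counts.
binary = (counts > 0).astype(int)
print(counts)
print(binary)
```

The thresholded matrix is the 0/1 layout asked for in the question, even when pairs repeat.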
d = {'A': ['aa', 'bb', 'cc'], 'C': ['aa'], 'B': ['bb']}
rows = 'ABC'
cols = ('aa', 'bb', 'cc')
# Python 3 print calls; end=' ' keeps each row on one line
print(' ', ' '.join(cols))
for row in rows:
    print(row, ' ', end=' ')
    for col in cols:
        print(' 1' if col in d.get(row, []) else ' 0', end=' ')
    print()
>>> aa bb cc
>>> A 1 1 1
>>> B 0 1 0
>>> C 1 0 0