I have this dataframe df1 of 8 rows:
ID
A
B
C
D
E
F
G
H
And I have this array arr of size 4, [-1, 0, 1, 2], and an m = 2. I want to assign the values of this array to df1, repeating each value m times, so that I eventually have:
ID N
A -1
B -1
C 0
D 0
E 1
F 1
G 2
H 2
How to do that in Python?
df1 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']})
arr = [-1, 0, 1, 2]
m = 2
If arr should be repeated again and again:
df1['N']=(arr*m)[:len(df1)]
Result:
  ID  N
0  A -1
1  B  0
2  C  1
3  D  2
4  E -1
If each element should be repeated:
df1['N']=[arr[i] for i in range(len(arr)) for j in range(m)][:len(df1)]
Result:
  ID  N
0  A -1
1  B -1
2  C  0
3  D  0
4  E  1
Without numpy:
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = sum([[x]*m for x in arr], [])
With numpy:
import numpy as np
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = np.repeat(arr, m)
Output:
ID N
0 A -1
1 B -1
2 C 0
3 D 0
4 E 1
5 F 1
6 G 2
7 H 2
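As a further option (an editor's sketch, not part of the original answers), pandas itself can do the element-wise repetition without a list comprehension, via Series.repeat:

import pandas as pd

df1 = pd.DataFrame({'ID': list('ABCDEFGH')})
arr = [-1, 0, 1, 2]
m = 2

# repeat each element m times, then truncate to the frame's length
df1['N'] = pd.Series(arr).repeat(m).to_numpy()[:len(df1)]
print(df1)

This produces the same ID/N frame as np.repeat; the .to_numpy() call discards the repeated index so the assignment aligns by position.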
I have the following dataframe:
NAME  SIGNAL
a     0
b     0
c     0
d     0
e     1
f     1
g     1
h     0
i     0
j     0
k     0
l     0
m     0
n     1
o     1
p     1
q     1
r     0
s     0
t     0
I need to write a function that will allow me to extract another dataframe, or just modify the existing frame, based on a condition: keep all columns (in my case NAME) of every row where SIGNAL is 1, but also include the 2 rows above and the 2 rows below each such row.
In my example, the function should return me the following table:
NAME  SIGNAL
c     0
d     0
e     1
f     1
g     1
h     0
i     0
j     0
l     0
m     0
n     1
o     1
p     1
q     1
r     0
s     0
Thanks!
UPDATE:
This is the code I have so far:
# Import pandas library
import pandas as pd

# initialize list of lists
data = [['a', 0], ['b', 0], ['c', 1], ['d', 1], ['e', 0], ['f', 0], ['g', 0], ['h', 1], ['i', 0], ['j', 0], ['k', 0]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['NAME', 'SIGNAL'])

# print dataframe.
print(df)
print("----------------")

for index, row in df.iterrows():
    # print(row['Name'], row['Age'])
    if (df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index-1]['SIGNAL'] == 0):  # check when the signal changes from 0 to 1
        print(df.iloc[index]['NAME'])  # first line with signal 1 after it was 0
        # print the above 2 lines
        print(df.iloc[index-1]['NAME'])
        print(df.iloc[index-2]['NAME'])
My dataframe is like:
NAME SIGNAL
0 a 0
1 b 0
2 c 1
3 d 1
4 e 0
5 f 0
6 g 0
7 h 1
8 i 0
9 j 0
10 k 0
My code is returning:
c
b
a
h
g
f
The problem here is that I cannot return the values of "d" and "e" + "f", or "i" and "j", because I get "IndexError: single positional indexer is out-of-bounds" if I try the condition:

(df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index+1]['SIGNAL'] == 0)
Also, the bounds will vary: sometimes I will work with 2 extra rows from the top and bottom, sometimes with more.
I am looking for a solution using dataframes functions and not iteration.
thanks!
This will return the desired data frame:
df[(df.shift(periods=-2, axis="rows").SIGNAL == 1)
   | (df.shift(periods=-1, axis="rows").SIGNAL == 1)
   | (df.SIGNAL == 1)
   | (df.shift(periods=1, axis="rows").SIGNAL == 1)
   | (df.shift(periods=2, axis="rows").SIGNAL == 1)]
Output:
NAME  SIGNAL
c     0
d     0
e     1
f     1
g     1
h     0
i     0
l     0
m     0
n     1
o     1
p     1
q     1
r     0
s     0
Add .NAME to the end to get your series of names
2 c
3 d
4 e
5 f
6 g
7 h
8 i
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
Name: NAME, dtype: object
Update: for arbitrarily large span
m = (df.shift(periods=-400, axis="rows").SIGNAL == 1)
for i in list(range(-399, 401)):
    m = m | (df.shift(periods=i, axis="rows").SIGNAL == 1)
print(df[m])
Disclaimer:
This method may be inefficient for large spans
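For large spans, a centered rolling maximum gives the same mask in a single vectorized call. This sketch is an editor's addition (the rolling approach is not from the answer above); n is the number of context rows to keep on each side, shown here on the 11-row frame from the question's update:

import pandas as pd

df = pd.DataFrame({
    'NAME': list('abcdefghijk'),
    'SIGNAL': [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})
n = 2  # rows to include above and below

# a centered rolling max over a (2n+1)-wide window is 1 exactly for rows
# within n positions of a SIGNAL == 1 row
mask = df['SIGNAL'].rolling(2 * n + 1, center=True, min_periods=1).max().astype(bool)
print(df[mask])

Changing n adjusts the span without building 2n+1 shifted copies of the frame.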
I am given a problem where I have to loop back and forth a 5x5 list within a list. So I created a list within a list where all the elements are 0 for convenience:
lst = [[0 for x in range(6)] for y in range(6)]
print(lst)
which will give me:
[[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]
I have to be able to start from any coordinate in this nested list, for example lst[3][2], then keep checking each coordinate lst[3][3], lst[3][4], lst[4][0], ... and so forth until I reach the maximum, which is lst[4][4]. After that I have to loop back through lst[4][3], lst[4][2], lst[4][1], ... until it reaches lst[0][0], after which I loop back up again. It's like an infinite loop where I start in a certain spot, then loop endlessly back and forth until I tell it to stop.
I can do a nested loop, but it stops at lst[4][4]:

for x in range(len(lst)):
    for y in range(len(lst)):
        lst[x][y] = do something
I can tweak the ranges to start at a specific coordinate but I can't create an infinite loop that will keep looping back and forth. I tried adding a while loop too:
while True:
    for x in range(len(lst)):
        for y in range(len(lst)):
            lst[x][y] = do something
but after it loops completely, it starts over from lst[0][0], not back from lst[4][3]. Not to mention that every pass restarts from the starting point I decided on, instead of continuing from where it left off.
It's easier to deal with the location if you treat the location in the array as a linear location and compute x,y:
import random

MAX_X, MAX_Y = 3, 3
L = [[0] * MAX_Y for _ in range(MAX_X)]

# Initial array values
cnt = 1
for x in range(MAX_X):
    for y in range(MAX_Y):
        L[x][y] = cnt
        cnt += 1
print(L)

# Pick a linear starting location
cur = random.randrange(MAX_X * MAX_Y)
d = 1  # direction to advance

while True:
    x, y = divmod(cur, MAX_Y)  # compute x,y from linear location
    print(f'{x},{y} = {L[x][y]}')
    cur += d  # advance in direction
    # if went off either end, reverse direction
    if cur == MAX_X * MAX_Y or cur == -1:
        d = -d
        cur += 2 * d  # stepped one off the wrong way, so go back two.
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
2,0 = 7
2,1 = 8
2,2 = 9
2,1 = 8
2,0 = 7
1,2 = 6
1,1 = 5
1,0 = 4
0,2 = 3
0,1 = 2
0,0 = 1
0,1 = 2
0,2 = 3
1,0 = 4
1,1 = 5
1,2 = 6
2,0 = 7
2,1 = 8
2,2 = 9
2,1 = 8
2,0 = 7
...
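The bounce in cur can also be computed in closed form as a triangle wave of a step counter, avoiding the mutable direction flag entirely. This is an editor's sketch of the same idea (the pingpong name is my own; it assumes at least two cells):

# closed-form ping-pong index over n cells: 0, 1, ..., n-1, n-2, ..., 1, 0, 1, ...
def pingpong(t, n):
    # n must be >= 2, otherwise the period is zero
    period = 2 * n - 2
    return (n - 1) - abs((n - 1) - (t % period))

n = 9  # e.g. a 3x3 grid flattened
seq = [pingpong(t, n) for t in range(20)]
print(seq)

Feeding pingpong(t, MAX_X * MAX_Y) into the same divmod(cur, MAX_Y) step recovers the x,y pair for any step count t, with the starting offset added to t.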
Another option if the 2D matrix isn't a hard requirement and the array is read-only is to unroll the loop. Create a single array with the values from the 2D array in increasing order, then decreasing order. If you want to start at a particular location in the array, compute the starting location and slice the previous values to the end of the array:
import random
import itertools

MAX_X, MAX_Y = 3, 3
L = [[0] * MAX_Y for _ in range(MAX_X)]

# Initial array values (same as before)
cnt = 1
for x in range(MAX_X):
    for y in range(MAX_Y):
        L[x][y] = cnt
        cnt += 1
print(L)

linear = sum(L, [])  # Make a 1D array joining all the rows.

# Pick a starting X/Y location
start_x = random.randrange(MAX_X)
start_y = random.randrange(MAX_Y)
# compute its linear location
cur = start_x * MAX_Y + start_y

# slice elements before starting location to end of the list
# and include the reversed array values as well.
unrolled = linear[cur:] + linear[-2:0:-1] + linear[:cur]
print(unrolled)

# cycle through the list endlessly
for n in itertools.cycle(unrolled):
    print(n)
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
[5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3, 2, 1, 2, 3, 4]
5
6
7
8
9
8
7
6
5
4
3
2
1
2
3
4
5
6
...
This seems to work:
import time

def showgrid(lst):
    for i in lst:
        for j in i:
            print(str(j).center(3), end=' ')
        print()
    print()

lst = [[0 for x in range(3)] for y in range(3)]  # array all zeros
d = 1        # direction, start forward
x, y = 1, 1  # start mid grid
ctr = 1      # data for array position

while True:
    lst[y][x] = ctr                      # set array data
    x += d                               # next cell
    if x == len(lst[0]) or x == -1:      # reached end of row
        y += d                           # next row
        x = len(lst[0]) - 1 - x + d      # move to other end of row
        if y == len(lst) or y == -1:     # end of array
            d = -d                       # switch direction
            y += d                       # start back
            x = len(lst[0]) - 1 - x + d  # revert to other end of row
    ctr += 1
    showgrid(lst)
    time.sleep(.1)
Output
0 0 0
0 1 0
0 0 0
0 0 0
0 1 2
0 0 0
0 0 0
0 1 2
3 0 0
0 0 0
0 1 2
3 4 0
0 0 0
0 1 2
3 4 5
0 0 0
0 1 2
3 6 5
0 0 0
0 1 2
7 6 5
0 0 0
0 1 8
7 6 5
0 0 0
0 9 8
7 6 5
0 0 0
10 9 8
7 6 5
Unlike the other approaches, I decided to first build the full range of all (x, y) positions, including the forward pass and the reverse pass, using the get_range_positions function. A second function, get_coordinates, then yields the needed values given a starting position for x and y. At the end, I loop through the list using the yielded x and y values.
def get_range_positions(len_first_list, len_second_list):
    result = []
    # loop
    for a in range(len_first_list):
        for b in range(len_second_list):
            result.append((a, b))
    # reverse loop, but we don't want
    # the last and first value again
    f = 0
    l = len(result) - 2
    for i in result[l:f:-1]:
        result.append(i)
    return result

def get_coordinates(start_x, start_y, range_positions):
    # finding out where start_x and start_y is in range_positions
    start = range_positions.index((start_x, start_y))
    # Return X and Y from (start_x, start_y) until the end
    for i in range_positions[start:]:
        yield i[0], i[1]
    # Return X and Y for the entire range_positions, forever.
    while True:
        for i in range_positions:
            yield i[0], i[1]

lst = [[x for x in range(6)] for y in range(6)]
range_positions = get_range_positions(6, 6)
print(range_positions)

for x, y in get_coordinates(start_x=0, start_y=0, range_positions=range_positions):
    lst[x][y] = do something
You could produce a flat 1D sequence of indices for your 2D list:
indices1d = [(i, j) for i in range(NROWS) for j in range(NCOLUMNS)]
To be able to go back, attach reversed indices and use itertools.cycle() to repeat it forever:
import itertools
indices = itertools.cycle(indices1d + indices1d[::-1][1:-1])
[1:-1] above is to cut/avoid duplicating the turn-around points.
To start at a given index, you could use itertools.islice:
start_position = i_start * NCOLUMNS + j_start
for i, j in itertools.islice(indices, start_position, None):
    print(lst[i][j])
Putting it all together:
import itertools

def back_n_forth(seq, i_start, j_start):
    indices1d = [(i, j) for i in range(NROWS) for j in range(NCOLUMNS)]
    indices = itertools.cycle(indices1d + indices1d[::-1][1:-1])
    start_position = i_start * NCOLUMNS + j_start
    for i, j in itertools.islice(indices, start_position, None):
        yield seq[i][j]

# some random 2d list
NROWS = 3
NCOLUMNS = 2
lst = [
    [chr(ord('a') + i) + str(j) for j in range(NCOLUMNS)]
    for i in range(NROWS)
]
for row in lst:
    print(*row)
print()

# look at the first few items in the infinite sequence
for x in itertools.islice(back_n_forth(lst, 1, 0), 20):
    print(x, end=' ')
Output
a0 a1
b0 b1
c0 c1
b0 b1 c0 c1 c0 b1 b0 a1 a0 a1 b0 b1 c0 c1 c0 b1 b0 a1 a0 a1
I have a list of values like
mylist = ["001k","002k"..."400k"]
and a pandas df like
id code
1 500k
2 001k
...
100 400k
I would like to binarize the values of code based on mylist. So row 1 receives 0 everywhere, because "500k" is not in mylist, while row 2 receives 1 in the "001k" column and 0 elsewhere. The final df would look like:
id 001k 002k ... 400k
1 0 0 0
2 1 0 0
...
100 0 0 1
You can do batch comparisons using numpy, giving you booleans:
>>> import numpy as np
>>> x = np.array(["001k", "002k", "400k"])
>>> y = np.array(["500k", "001k", "400k"])
>>> x[None, :] == y[:, None]
array([[False, False, False],
[ True, False, False],
[False, False, True]], dtype=bool)
From there, it's simple to transform it to integers:
>>> (x[None, :] == y[:, None]).astype(int)
array([[0, 0, 0],
[1, 0, 0],
[0, 0, 1]])
You can then apply that to your data by taking df["code"].values and np.array(mylist), which are numpy arrays, e.g.
mylist = ["001k","002k","300k","400k"]
x = np.array(mylist)
df = pd.DataFrame({'code':['500k','600k','001k','002k','001k','400k']})
y = df["code"].values
ndf = pd.DataFrame((x[None, :] == y[:, None]).astype(int),columns=mylist)
Output:
001k 002k 300k 400k
0 0 0 0 0
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 1 0 0 0
5 0 0 0 1
Or
df["code"] = df["code"].apply(lambda x: x in mylist)
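As an editor's side note, pandas also has a vectorized membership test, Series.isin, which does the same thing as that apply without a Python-level lambda:

import pandas as pd

mylist = ["001k", "002k", "300k", "400k"]
df = pd.DataFrame({'code': ['500k', '600k', '001k', '002k', '001k', '400k']})

# boolean Series: True where code appears in mylist
print(df['code'].isin(mylist))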
Based on your edits, you're looking for dummies:
pd.get_dummies(df["code"])
output
id 001k 002k ... 400k
1 0 0 0
2 1 0 0
...
100 0 0 1
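One caveat worth adding here (an editor's note, assuming the final columns should match mylist exactly): get_dummies only creates columns for values that actually occur, and it also creates a column for out-of-list values such as "500k". Reindexing against mylist fixes both:

import pandas as pd

mylist = ["001k", "002k", "300k", "400k"]
df = pd.DataFrame({'code': ['500k', '001k', '400k']})

# restrict the dummy columns to exactly mylist;
# values outside the list become all-zero rows
dummies = pd.get_dummies(df['code'], dtype=int).reindex(columns=mylist, fill_value=0)
print(dummies)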
I have a data that looks something like this:
numpy array:
[[a, abc],
[b, def],
[c, ghi],
[d, abc],
[a, ghi],
[e, fg],
[f, f76],
[b, f76]]
It's like a user-item matrix.
I want to construct a sparse matrix with shape: number_of_items, num_of_users which gives 1 if the user has rated/bought an item or 0 if he hasn't. So, for the above example, shape should be (5,6). This is just an example, there are thousands of users and thousands of items.
Currently I'm doing this using two for loops. Is there any faster/pythonic way of achieving the same?
desired output:
1,0,0,1,0,0
0,1,0,0,0,0
1,0,1,0,0,0
0,0,0,0,1,0
0,1,0,0,0,1
where rows: abc,def,ghi,fg,f76
and columns: a,b,c,d,e,f
The easiest way is to assign integer labels to the users and items and use these as coordinates into the sparse matrix, for example:

import numpy as np
from scipy import sparse

users, I = np.unique(user_item[:, 0], return_inverse=True)
items, J = np.unique(user_item[:, 1], return_inverse=True)
points = np.ones(len(user_item), int)
# note the nesting: coo_matrix takes a single (data, (row, col)) tuple;
# items index the rows and users the columns, matching the desired shape
mat = sparse.coo_matrix((points, (J, I)))
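A runnable version of this idea on the question's sample pairs might look as follows (the sample array and the explicit shape argument are an editor's additions; note that np.unique sorts the labels, so the row order is alphabetical by item):

import numpy as np
from scipy import sparse

user_item = np.array([
    ['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
    ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76'],
])

users, I = np.unique(user_item[:, 0], return_inverse=True)
items, J = np.unique(user_item[:, 1], return_inverse=True)
points = np.ones(len(user_item), int)

# rows are items, columns are users
mat = sparse.coo_matrix((points, (J, I)), shape=(len(items), len(users)))
print(items)
print(mat.toarray())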
pandas.get_dummies provides an easy way to convert categorical columns to indicator (dummy) columns:
import pandas as pd

# construct the data
x = pd.DataFrame([['a', 'abc'], ['b', 'def'], ['c', 'ghi'],
                  ['d', 'abc'], ['a', 'ghi'], ['e', 'fg'],
                  ['f', 'f76'], ['b', 'f76']],
                 columns=['user', 'item'])
print(x)
#   user item
# 0    a  abc
# 1    b  def
# 2    c  ghi
# 3    d  abc
# 4    a  ghi
# 5    e   fg
# 6    f  f76
# 7    b  f76
for col, col_data in x.items():
    if str(col) == 'item':
        col_data = pd.get_dummies(col_data, prefix=col)
        x = x.join(col_data)
print(x)
#   user item  item_abc  item_def  item_f76  item_fg  item_ghi
# 0    a  abc         1         0         0        0         0
# 1    b  def         0         1         0        0         0
# 2    c  ghi         0         0         0        0         1
# 3    d  abc         1         0         0        0         0
# 4    a  ghi         0         0         0        0         1
# 5    e   fg         0         0         0        1         0
# 6    f  f76         0         0         1        0         0
# 7    b  f76         0         0         1        0         0
Here's what I could come up with:
You need to be careful since np.unique will sort the items before returning them, so the output format is slightly different to the one you gave in the question.
Moreover, you need to convert the array to a list of tuples, because itertools.product yields tuples, and a tuple such as ('a', 'abc') compares equal to the tuple ('a', 'abc') but not to the list ['a', 'abc'], so the membership test would always fail against array rows.
import itertools as it
import numpy as np

A = np.array([
    ['a', 'abc'],
    ['b', 'def'],
    ['c', 'ghi'],
    ['d', 'abc'],
    ['a', 'ghi'],
    ['e', 'fg'],
    ['f', 'f76'],
    ['b', 'f76']])

customers = np.unique(A[:, 0])
items = np.unique(A[:, 1])
A = [tuple(a) for a in A]
combinations = it.product(customers, items)
# product iterates customers in the outer loop, so reshape to
# (customers, items) and transpose to get items as rows
C = np.array([b in A for b in combinations], dtype=int)
C.reshape((customers.size, items.size)).T
>> array(
[[1, 0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 1],
 [0, 0, 0, 0, 1, 0],
 [1, 0, 1, 0, 0, 0]])
Here is my approach using pandas, let me know if it performs better:

import numpy as np
import pandas as pd

# create dataframe from your numpy array
x = pd.DataFrame(x, columns=['User', 'Item'])

# get rows and cols for your sparse dataframe
cols = pd.unique(x['User']); ncols = cols.shape[0]
rows = pd.unique(x['Item']); nrows = rows.shape[0]

# initialize your sparse dataframe
# (this is not sparse, but you can check pandas support for sparse datatypes)
spdf = pd.DataFrame(np.zeros((nrows, ncols)), columns=cols, index=rows)

# define apply function
def hasUser(xx):
    spdf.loc[xx.name, xx] = 1

# groupby and apply to create desired output dataframe
g = x.groupby(by='Item', sort=False)
g['User'].apply(lambda xx: hasUser(xx))
Here are the sample dataframes for the above code:
spdf
Out[71]:
a b c d e f
abc 1 0 0 1 0 0
def 0 1 0 0 0 0
ghi 1 0 1 0 0 0
fg 0 0 0 0 1 0
f76 0 1 0 0 0 1
x
Out[72]:
User Item
0 a abc
1 b def
2 c ghi
3 d abc
4 a ghi
5 e fg
6 f f76
7 b f76
Also, in case you want to make groupby apply function execution
parallel , this question might be of help:
Parallelize apply after pandas groupby
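One last editor's note, not from any of the answers above: pandas.crosstab builds the same item-by-user table in a single call, and clip guards against counts above 1 if a user bought the same item twice:

import pandas as pd

x = pd.DataFrame([['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
                  ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76']],
                 columns=['User', 'Item'])

# rows: items, columns: users; values are co-occurrence counts, clipped to 0/1
table = pd.crosstab(x['Item'], x['User']).clip(upper=1)
print(table)

Like np.unique, crosstab sorts its labels, so the row order is alphabetical by item rather than first-seen order.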