Binarize pandas column based on list - python

I have a list of values like
mylist = ["001k","002k"..."400k"]
and a pandas df like
id   code
1    500k
2    001k
...
100  400k
I would like to binarize the values of code based on mylist.
Hence, row 1 receives 0 everywhere because "500k" is not in mylist.
Conversely, row 2 receives 1 in the "001k" column and 0 elsewhere.
The final df would look like:
id   001k  002k  ...  400k
1     0     0          0
2     1     0          0
...
100   0     0          1

You can do batch comparisons using numpy, giving you booleans:
>>> import numpy as np
>>> x = np.array(["001k", "002k", "400k"])
>>> y = np.array(["500k", "001k", "400k"])
>>> x[None, :] == y[:, None]
array([[False, False, False],
       [ True, False, False],
       [False, False,  True]], dtype=bool)
From there, it's simple to transform it to integers:
>>> (x[None, :] == y[:, None]).astype(int)
array([[0, 0, 0],
       [1, 0, 0],
       [0, 0, 1]])
You can apply this directly to your data by taking df["code"].values and np.array(mylist), which are both numpy arrays, e.g.
mylist = ["001k","002k","300k","400k"]
x = np.array(mylist)
df = pd.DataFrame({'code':['500k','600k','001k','002k','001k','400k']})
y = df["code"].values
ndf = pd.DataFrame((x[None, :] == y[:, None]).astype(int), columns=mylist)
Output:
   001k  002k  300k  400k
0     0     0     0     0
1     0     0     0     0
2     1     0     0     0
3     0     1     0     0
4     1     0     0     0
5     0     0     0     1

Or, if a single boolean membership column is enough (rather than the full indicator matrix):
df["code"] = df["code"].apply(lambda x: x in mylist)

Based on your edits, you're looking for dummies:
pd.get_dummies(df["code"])
Output:
id   001k  002k  ...  400k
1     0     0          0
2     1     0          0
...
100   0     0          1
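Note that get_dummies alone creates a column for every distinct value present in code (including "500k") and omits values that never occur. To force the columns to match mylist exactly, one option is to reindex the result:
pd.get_dummies(df["code"]).reindex(columns=mylist, fill_value=0)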

Related

Find number of consecutively increasing/decreasing values in a pandas column (and fill another col with it) in an optimized way

I am trying to create a new column for a dataframe. The column I use for it is a price column. Basically what I am trying to achieve is getting the number of times that the price has increased/decreased consecutively. I need this to be rather quick because the dataframes can be quite big.
For example, the result should look like:
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()  # NaN  1  1 -1 -1
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)  # running count of increases, zeroed elsewhere
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)  # same for decreases
Output:
   col  increase  decrease
0    1         0         0
1    2         1         0
2    3         2         0
3    2         0         1
4    1         0         2
resetting the count
As I realize your example is ambiguous, here is an additional method in case you want to reset the counts for each increasing/decreasing group, using groupby.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
   col  inc  dec  inc2  dec2
0    1    0    0     0     0
1    2    1    0     1     0
2    3    2    0     2     0
3    2    0    1     0     1
4    1    0    2     0     2
5    2    3    0     1     0
6    3    4    0     2     0
7    1    0    3     0     1
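For reference, here is why the cumcount trick resets; this is a minimal sketch reusing si from above, with intermediate values for the sample data:
si = df['inc'].eq(0)                 # True at every row where the count is zero
si.cumsum()                          # 1 1 1 2 3 3 3 4 -> a new group id starts at each reset
si.groupby(si.cumsum()).cumcount()   # position within each group: 0 1 2 0 0 1 2 0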
data = {
    'input': [1,2,3,2,1]
}
df = pd.DataFrame(data)
inc = df['input'] > df['input'].shift()
dec = df['input'] < df['input'].shift()
df['a'] = (inc.cumsum() - inc.astype(int).cumsum().where(~inc).ffill().fillna(0)).astype(int)
df['b'] = (dec.cumsum() - dec.astype(int).cumsum().where(~dec).ffill().fillna(0)).astype(int)
print(df)
Output:
   input  a  b
0      1  0  0
1      2  1  0
2      3  2  0
3      2  0  1
4      1  0  2
Coding this manually using numpy might look like this:
import numpy as np

input = np.array([1, 2, 3, 2, 1])
increase = np.zeros(len(input), dtype=int)  # zeros() defaults to float, so force an integer dtype
decrease = np.zeros(len(input), dtype=int)
for i in range(1, len(input)):
    if input[i] > input[i-1]:
        increase[i] = increase[i-1] + 1
    elif input[i] < input[i-1]:
        decrease[i] = decrease[i-1] + 1
    # on ties both counters stay at their initialized 0
increase  # array([0, 1, 2, 0, 0])
decrease  # array([0, 0, 0, 1, 2])
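If the plain loop is too slow on large frames, here is a sketch of the same logic compiled with numba (this assumes numba is installed; run_lengths is just an illustrative name):
import numpy as np
from numba import njit

@njit
def run_lengths(a):
    # consecutive increase/decrease run lengths, reset on any break
    inc = np.zeros(len(a), dtype=np.int64)
    dec = np.zeros(len(a), dtype=np.int64)
    for i in range(1, len(a)):
        if a[i] > a[i-1]:
            inc[i] = inc[i-1] + 1
        elif a[i] < a[i-1]:
            dec[i] = dec[i-1] + 1
    return inc, dec

inc, dec = run_lengths(np.array([1, 2, 3, 2, 1]))  # -> [0 1 2 0 0], [0 0 0 1 2]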

Pandas: create column based on first "off" occurrence in column B after signal in column A

I have column A with a signal-on flag (== 1) and column B with a signal-off flag (== 1); the rest of the values are zero.
data = {'A': [1, 0, 0, 0, 0, 1, 0],
'B': [1, 0, 1, 1, 0, 0, 1]}
df = pd.DataFrame.from_dict(data)
I need to create a column C where:
C = 1 where A == 1 (regardless of B)
C stays 1 until B == 1, then C = 0
Here is what the result should be:
df['C'] = [1, 1, 0, 0, 0, 1, 0]
I used
df.loc[df['A'] == 1, 'C'] = 1
to set the rows where A == 1 to 1, but I cannot find a way to get the first non-zero value in column B after the on signal in A, and to fill the rest with zeros until the next 1 in A.
You can use mask together with groupby.transform('idxmax'). The mask sets B to 0 where A equals 1, since C will be 1 there no matter what B is.
df['C'] = (df.index < df.B.mask(df.A.eq(1), 0).groupby(df.A.cumsum()).transform('idxmax')).astype(int)
df
   A  B  C
0  1  1  1
1  0  0  1
2  0  1  0
3  0  1  0
4  0  0  0
5  1  0  1
6  0  1  0
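Unpacked step by step (the intermediate names here are just illustrative):
off = df.B.mask(df.A.eq(1), 0)                     # zero out B where A == 1
grp = df.A.cumsum()                                # group id starting at each A == 1
first_off = off.groupby(grp).transform('idxmax')   # index of the first off signal in each group
df['C'] = (df.index < first_off).astype(int)       # 1 before that index, 0 from it onward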
Update
s=df.B.mask(df.A.eq(1),0)
s=(s==1)&(s.shift(-1)==0)
df['C']=(df.index<s.groupby(df.A.cumsum()).transform('idxmax')).astype(int)
df.loc[df.A==1,'C']=1
Hello and welcome to stackoverflow.
This is a case you usually wouldn't use pandas for, as the value of C depends on previous rows; pandas is more about applying "split-apply-combine" to independent measurements.
If it is not runtime-critical I would probably write a plain old loop for this:
In [4]: C = []
   ...: signal = 0
   ...: for _, row in df.iterrows():
   ...:     if ((signal == 1) and (row.B == 1)):
   ...:         signal = 0
   ...:     elif (row.A == 1):
   ...:         signal = 1
   ...:     C.append(signal)
   ...:
In [5]: C
Out[5]: [1, 1, 0, 0, 0, 1, 0]
In [6]: df['C'] = C
In [7]: df
Out[7]:
   A  B  C
0  1  1  1
1  0  0  1
2  0  1  0
3  0  1  0
4  0  0  0
5  1  0  1
6  0  1  0
This won't have a good performance, but imho it is worth it to cleanly express the intent of your code if it is still "fast enough".
A solution based on iterrows (as proposed in one of the other answers) may be too slow.
Define the following function, which computes the output signal for a group of input rows (each group starting at a row where A == 1):
def signal(grp):
    return pd.Series(np.equal(np.where(grp.A == 1, 0, grp.B).cumsum(), 0).astype(int),
                     index=grp.index)
Then group df and apply this function:
df['C'] = df.groupby(df.A.cumsum()).apply(signal)\
.reset_index(level=0, drop=True)
Edit
A yet faster solution, without grouping, is:
sig = df.A.replace(0, np.nan)                                  # 1 at switch-on rows, NaN elsewhere
sig.update(df.A.lt(df.B).astype(int).replace(0, np.nan) - 1)   # 0 at switch-off rows (A == 0, B == 1)
df['C'] = sig.ffill().fillna(0, downcast='infer')              # carry the last on/off state forward
For a sample of 7000 rows (your data repeated 1000 times) the execution
time of this solution is 14 times shorter than the solution by YOBEN_S.

Pandas dataframe if else condition based on previous rows not working

I have a pandas dataframe as below:
df = pd.DataFrame({'X':[1,1,1, 0, 0]})
df
X
0 1
1 1
2 1
3 0
4 0
Now I want to modify X based on the below condition:
If X == 0: previous row's (already updated) value + 1
So, my final output should look like below:
X
0 1
1 1
2 1
3 2
4 3
This can be achieved by iterating over the rows, keeping track of the current and previous row with iloc, and it works as expected:
for i in range(0, len(df)):
    current_row = df.iloc[i]
    if i > 0:
        previous_row = df.iloc[i-1]
    else:
        previous_row = current_row
    if current_row['X'] == 0:
        current_row['X'] = previous_row['X'] + 1
I want a more efficient way of doing this, and I tried the code below, but the output is not what I expected (the value of X in the fifth row should be 3, because shift() sees the original 0 in the fourth row instead of the updated 2):
conditions = [df["X"] == 0]
values = [df["X"].shift() + 1]
df['X'] = np.select(conditions, values, default=df["X"])
>>> df
   X
0  1
1  1
2  1
3  2
4  1
You could try the following:
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': [1, 1, 1, 0, 0]})
# values previous to a zero
pe_zero = df.X.shift(-1).eq(0) * df.X  # [0 0 1 0 0]
# 1 for each zero value, since one is added to the previous value
eq_zero = df.X.eq(0)
# find consecutive groups of 0
groups = pe_zero + eq_zero
consecutive = (groups.gt(0) != groups.gt(0).shift()).cumsum()
# find cumulative sum by groups
cumulative = groups.groupby(consecutive).cumsum()
# choose from cumulative when equals to zero else from original
result = np.where(eq_zero, cumulative, df.X)
print(result)
Output
[1 1 1 2 3]
UPDATE
For df = pd.DataFrame({'X': [1, 1, 1, 0, 0, 1, 1, 0, 0]})
returns:
[1 1 1 2 3 1 1 2 3]
You could try this:
arr = df.X.values  # extract the column as a numpy array for faster iteration
for i, val in enumerate(arr[1:], start=1):
    if val == 0:
        arr[i] = arr[i-1] + 1
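Whether df.X.values returns a writable view or a copy depends on the dtype and pandas version (with copy-on-write enabled it is not a view), so it is safer to assign the result back explicitly:
df['X'] = arr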

Construct sparse matrix using categorical data

I have data that looks something like this (a numpy array):
[[a, abc],
[b, def],
[c, ghi],
[d, abc],
[a, ghi],
[e, fg],
[f, f76],
[b, f76]]
It's like a user-item matrix.
I want to construct a sparse matrix with shape (number_of_items, number_of_users) that holds 1 if the user has rated/bought an item and 0 if they haven't. So, for the above example, the shape should be (5, 6). This is just an example; there are thousands of users and thousands of items.
Currently I'm doing this using two for loops. Is there a faster/more pythonic way of achieving the same?
desired output:
1,0,0,1,0,0
0,1,0,0,0,0
1,0,1,0,0,0
0,0,0,0,1,0
0,1,0,0,0,1
where rows: abc,def,ghi,fg,f76
and columns: a,b,c,d,e,f
The easiest way is to assign integer labels to the users and items and use these as coordinates into the sparse matrix, for example:
import numpy as np
from scipy import sparse

users, I = np.unique(user_item[:, 0], return_inverse=True)
items, J = np.unique(user_item[:, 1], return_inverse=True)
points = np.ones(len(user_item), int)
# coo_matrix takes a single (data, (row, col)) tuple;
# (J, I) puts items on the rows and users on the columns, as requested
mat = sparse.coo_matrix((points, (J, I)))
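For the sample data (user_item below is the array from the question), the result can be sanity-checked by densifying it:
user_item = np.array([['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
                      ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76']])
# ... run the snippet above, then:
print(mat.toarray())  # 5x6 item-by-user indicator matrix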
pandas.get_dummies provides an easy way to convert categorical columns into indicator columns:
import pandas as pd
#construct the data
x = pd.DataFrame([['a', 'abc'], ['b', 'def'], ['c', 'ghi'],
                  ['d', 'abc'], ['a', 'ghi'], ['e', 'fg'],
                  ['f', 'f76'], ['b', 'f76']],
                 columns=['user', 'item'])
print(x)
#   user item
# 0    a  abc
# 1    b  def
# 2    c  ghi
# 3    d  abc
# 4    a  ghi
# 5    e   fg
# 6    f  f76
# 7    b  f76
for col, col_data in x.items():  # iteritems() was removed in pandas 2.0
    if str(col) == 'item':
        col_data = pd.get_dummies(col_data, prefix=col)
        x = x.join(col_data)
print(x)
#   user item  item_abc  item_def  item_f76  item_fg  item_ghi
# 0    a  abc         1         0         0        0         0
# 1    b  def         0         1         0        0         0
# 2    c  ghi         0         0         0        0         1
# 3    d  abc         1         0         0        0         0
# 4    a  ghi         0         0         0        0         1
# 5    e   fg         0         0         0        1         0
# 6    f  f76         0         0         1        0         0
# 7    b  f76         0         0         1        0         0
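The loop can also be collapsed into a single join, which avoids mutating x while iterating over it:
x = x.join(pd.get_dummies(x['item'], prefix='item'))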
Here's what I could come up with:
You need to be careful, since np.unique sorts the items before returning them, so the output format is slightly different from the one given in the question.
Moreover, you need to convert the array rows to tuples, because itertools.product yields tuples and a tuple never compares equal to a list: ('a', 'abc') in [('a', 'abc'), ('b', 'def')] returns True, but ('a', 'abc') in [['a', 'abc'], ['b', 'def']] does not.
import itertools as it
import numpy as np

A = np.array([
    ['a', 'abc'],
    ['b', 'def'],
    ['c', 'ghi'],
    ['d', 'abc'],
    ['a', 'ghi'],
    ['e', 'fg'],
    ['f', 'f76'],
    ['b', 'f76']])
customers = np.unique(A[:, 0])
items = np.unique(A[:, 1])
A = [tuple(a) for a in A]
combinations = it.product(customers, items)
C = np.array([b in A for b in combinations], dtype=int)
C = C.reshape((items.size, customers.size))
>>> C
array([[1, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0]])
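Note that it.product(customers, items) iterates customer-major, so the reshape above interleaves the rows rather than producing one row per item. To get the requested items-by-users layout, a corrected sketch is to reshape customer-major and transpose:
C = C.reshape((customers.size, items.size)).T  # rows: sorted items, columns: sorted customers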
Here is my approach using pandas; let me know if it performs better:
#create dataframe from your numpy array
x = pd.DataFrame(x, columns=['User', 'Item'])
#get rows and cols for your sparse dataframe
cols = pd.unique(x['User']); ncols = cols.shape[0]
rows = pd.unique(x['Item']); nrows = rows.shape[0]
#initialize your dataframe
#(this is not sparse, but you can check pandas support for sparse datatypes)
spdf = pd.DataFrame(np.zeros((nrows, ncols)), columns=cols, index=rows)
#define apply function (.loc replaces the long-deprecated .ix)
def hasUser(xx):
    spdf.loc[xx.name, xx] = 1
#groupby and apply to create desired output dataframe
g = x.groupby(by='Item', sort=False)
g['User'].apply(lambda xx: hasUser(xx))
Here are the sample dataframes for the above code:
spdf
Out[71]:
     a  b  c  d  e  f
abc  1  0  0  1  0  0
def  0  1  0  0  0  0
ghi  1  0  1  0  0  0
fg   0  0  0  0  1  0
f76  0  1  0  0  0  1
x
Out[72]:
  User Item
0    a  abc
1    b  def
2    c  ghi
3    d  abc
4    a  ghi
5    e   fg
6    f  f76
7    b  f76
Also, in case you want to parallelize the groupby-apply execution, this question might be of help:
Parallelize apply after pandas groupby

Outputting large matrix in Python from a dictionary

I have a python dictionary formatted in the following way:
data[author1][author2] = 1
This dictionary contains an entry for every possible author pair (all pairs of 8500 authors), and I need to output a matrix that looks like this for all author pairs:
"auth1" "auth2" "auth3" "auth4" ...
"auth1" 0 1 0 3
"auth2" 1 0 2 0
"auth3" 0 2 0 1
"auth4" 3 0 1 0
...
I have tried the following method:
x = numpy.array([[data[author1][author2] for author2 in sorted(data[author1])] for author1 in sorted(data)])
print x
outf.write(x)
However, printing this leaves me with this:
[[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
...,
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]]
and the output file is just a blank text file. I am trying to format the output in a way to read into Gephi (https://gephi.org/users/supported-graph-formats/csv-format/)
You almost got it right; your list comprehension is inverted. This will give you the expected result:
d = dict(auth1=dict(auth1=0, auth2=1, auth3=0, auth4=3),
         auth2=dict(auth1=1, auth2=0, auth3=2, auth4=0),
         auth3=dict(auth1=0, auth2=2, auth3=0, auth4=1),
         auth4=dict(auth1=3, auth2=0, auth3=1, auth4=0))
# inner and outer dicts share the same author keys
np.array([[d[i][j] for i in sorted(d.keys())] for j in sorted(d.keys())])
#array([[0, 1, 0, 3],
#       [1, 0, 2, 0],
#       [0, 2, 0, 1],
#       [3, 0, 1, 0]])
You could use pandas. Using @Saullo Castro's input:
import pandas as pd
df = pd.DataFrame.from_dict(d)
Result:
>>> df
       auth1  auth2  auth3  auth4
auth1      0      1      0      3
auth2      1      0      2      0
auth3      0      2      0      1
auth4      3      0      1      0
And if you want to save it you can just do df.to_csv(file_name). By default this writes the author labels as the first row and column, which matches the labeled matrix CSV that Gephi can import.
