Creating and manipulating dataframes with only binary values - python

Given the following dataframe:
df = pd.DataFrame({'s1':[1,2,3,4], 's2':[4,3,2,1], 's3':[7,4,3,1], 's4':[9,4,3,1]})
I want to do the following:
Map a predicate >2 over ['s1', 's2'], map a predicate >4 over ['s3', 's4'] if true set field to 1 else 0.
Remove all rows where s1 and s2 and s3 and s4 = 0.
Group by permutations, for example how many rows are [0,1,1,0] etc
Query for different counts for example how many rows have s3=1 or s2=1?
The problem I'm having doing this on a larger dataset is that I have to split the dataset up into series, iterate over each series, and then put them back into a dataframe. I want to do all the transformations and queries using only one pass over the data.
Update:
I have been trying something like this.
binary = pd.DataFrame({'s1':[1,0,1,0], 's2':[0,0,1,0], 's3':[1,0,1,1]})
binary.loc[(binary!=0).any(axis=1)]
binary.groupby(['s1', 's2','s3']).count() # it works for 2 values but not 3.

Items 1 and 2
To map the predicate, use the gt function. Then use any to select rows that have at least one True value (i.e. exclude rows that are all False).
You can use astype(int) when applying the predicate, but it doesn't seem necessary until after you filter for rows that are all False.
# Apply predicate.
df[['s1', 's2']] = df[['s1', 's2']].gt(2)
df[['s3', 's4']] = df[['s3', 's4']].gt(4)
# Remove rows that are all False and convert to 0/1.
df = df.loc[df.any(axis=1), :].astype(int)
The resulting binary DataFrame df:
   s1  s2  s3  s4
0   0   1   1   1
1   0   1   0   0
2   1   0   0   0
3   1   0   0   0
Item 3
To get a count of all row combinations at once, use apply to get a Series containing a tuple of each row, and use value_counts:
# Counts of permutations.
perms = df.apply(tuple, axis=1).value_counts()
The resulting output:
(1, 0, 0, 0) 2
(0, 1, 0, 0) 1
(0, 1, 1, 1) 1
Item 4
Sum over a Boolean array corresponding to your condition:
# Count of rows where s3=1 or s2=1.
row_count = ((df['s3'] == 1) | (df['s2'] == 1)).sum()
This yields 2 as expected.
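For reference, the four items can be chained without splitting the frame into separate series. This is a minimal sketch, assuming a per-column threshold Series (the names thresholds, binary, pattern_counts and s2_or_s3 are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'s1': [1, 2, 3, 4], 's2': [4, 3, 2, 1],
                   's3': [7, 4, 3, 1], 's4': [9, 4, 3, 1]})

# Hypothetical per-column thresholds; gt aligns the Series index with the column names.
thresholds = pd.Series({'s1': 2, 's2': 2, 's3': 4, 's4': 4})
binary = df.gt(thresholds).astype(int)

# Item 2: drop rows that are all zero.
binary = binary.loc[binary.any(axis=1)]

# Item 3: count each distinct row pattern.
pattern_counts = binary.groupby(list(binary.columns)).size()

# Item 4: number of rows where s2 == 1 or s3 == 1.
s2_or_s3 = int((binary['s2'].eq(1) | binary['s3'].eq(1)).sum())
```

Using groupby over all columns keeps the pattern as a MultiIndex, so a single pattern can be looked up with pattern_counts.loc[(1, 0, 0, 0)].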

Related

creating rows for several one hot encoded columns (all combinations) to be scored by model

I start off with my wants in this simplified example:
data = {'dg1_1':[1, 0],
'dg1_2':[0, 1],
'dg2_1':[0, 1],
'dg2_2':[1, 0],
'cont1':[13.0, 13.0]}
wants = pd.DataFrame(data)
I do not really have this and this is meant to be generated. I have 2 one hot encoded groups dg1 and dg2. This is obviously simplified and dg1 and dg2 can contain different number of columns. From some observations (a sample) I can get them also like this:
dg1_indeces = observations.columns[wants.columns.str.startswith("dg1")]
dg2_indeces = observations.columns[wants.columns.str.startswith("dg2")]
Given one observation (ab)using my wants to explain:
one_observation = wants.head(1)
I want to create all possible combinations given one_observation so that, for each one hot encoded group, only one column is turned on at a time. So I can do:
haves = pd.concat([one_observation]*(len(dg1_indeces) * len(dg2_indeces)), ignore_index=True)
haves.loc[:, dg1_indeces] = 0
haves.loc[:, dg2_indeces] = 0
print(haves)
This gives me all rows with the hot encoded groups all zero - I now want to get to my wants (see at the top) in the most efficient way. I guess avoiding loops to then score the data using an existing model. Hope this makes sense?
PS:
This is my naïve way of possibly achieving this:
row = 0
for dg1 in dg1_indeces:
    for dg2 in dg2_indeces:
        haves.loc[row, dg1] = 1
        haves.loc[row, dg2] = 1
        row += 1
You can build from bottom with pd.MultiIndex.from_product or merge with cross
s1 = df.columns[df.columns.str.startswith('dg1')]
s2 = df.columns[df.columns.str.startswith('dg2')]
#if s1 and s2 is dataframe idx = s1.merge(s2,how='cross')
idx = pd.MultiIndex.from_product([s1,s2]).map('|'.join)
pd.Series(idx).str.get_dummies('|')
Out[115]:
   dg1_1  dg1_2  dg2_1  dg2_2
0      1      0      1      0
1      1      0      0      1
2      0      1      1      0
3      0      1      0      1
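To get from these indicator rows to something a model can score, the remaining columns of an observation still have to be attached. A possible sketch, assuming pandas 1.2 or later for how='cross' and using the wants frame from the question (combos and scored_input are illustrative names):

```python
import pandas as pd

wants = pd.DataFrame({'dg1_1': [1, 0], 'dg1_2': [0, 1],
                      'dg2_1': [0, 1], 'dg2_2': [1, 0],
                      'cont1': [13.0, 13.0]})

s1 = wants.columns[wants.columns.str.startswith('dg1')]
s2 = wants.columns[wants.columns.str.startswith('dg2')]

# One label per combination, e.g. 'dg1_1|dg2_1', expanded into indicator columns.
idx = pd.MultiIndex.from_product([s1, s2]).map('|'.join)
combos = pd.Series(idx).str.get_dummies('|')

# Keep only the non-encoded columns of one observation and pair them
# with every combination.
one_observation = wants.head(1).drop(columns=list(s1) + list(s2))
scored_input = one_observation.merge(combos, how='cross')
```

Each row of scored_input then has exactly one hot column per group plus the original cont1 value.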
Let's add a third attribute to the dg2 group and change the cont1 value of the second row to make things less confusing:
data = {'dg1_1':[1, 0],
'dg1_2':[0, 1],
'dg2_1':[0, 1],
'dg2_2':[1, 0],
'dg2_3':[0, 0],
'cont1':[13.0, 14.0]}
wants = pd.DataFrame(data)
So now you have 2 groups, one with 2 attributes and one with 3 attributes. Only one attribute can be "hot" per group. If we lay out a 2 x 3 matrix and fill each cell with 2 ** (i,j):
0 1 2
0 (1, 1) (1, 2) (1, 4)
1 (2, 1) (2, 2) (2, 4)
Then convert the matrix to binary:
0 1 2
0 (01, 001) (01, 010) (01, 100)
1 (10, 001) (10, 010) (10, 100)
It essentially satisfies our requirement that only one attribute per group is "hot". If you unravel (i.e. flatten) it:
dg1 dg2
01 001
01 010
01 100
10 001
10 010
10 100
It becomes the list of permutations that you can cross join against every observation.
# Get the columns we are interested in
cols = wants.columns[wants.columns.str.startswith("dg")].to_series()
# shape is an (n1, n2, n3, ...) tuple where n_i is the number of attributes per group
shape = cols.str.split("_", expand=True).groupby(0).size().to_numpy()
rows = []
# Make the matrix
for i in range(shape.prod()):
    string = ''
    for dim, index in enumerate(np.unravel_index(i, shape)):
        # One-hot bit string for this group, e.g. index 1 of 3 -> '010'
        string += bin(2 ** index)[2:].zfill(shape[dim])
    # Build a real list of digits (a bare map object is lazy in Python 3)
    rows.append([int(ch) for ch in string])
permutations = pd.DataFrame(rows, columns=cols)
# Result
wants[["cont1"]].merge(permutations, how="cross")
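The same table of permutations can also be built without bit strings. The sketch below is an alternative technique, not the approach described above: it takes one row of an identity matrix per group and combines them with itertools.product. The group_sizes dictionary and the generated column names are assumptions made for illustration.

```python
import itertools

import numpy as np
import pandas as pd

# Hypothetical group sizes: dg1 has 2 attributes, dg2 has 3.
group_sizes = {'dg1': 2, 'dg2': 3}
columns = [f'{group}_{k + 1}'
           for group, n in group_sizes.items()
           for k in range(n)]

# Each row of an identity matrix is one valid one-hot state for a group;
# the Cartesian product picks one state per group.
identity_blocks = [np.eye(n, dtype=int) for n in group_sizes.values()]
rows = [np.concatenate(parts) for parts in itertools.product(*identity_blocks)]

permutations = pd.DataFrame(rows, columns=columns)
```

The resulting frame can be cross joined against the observations in the same way as above.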

Get rows before and after from an index in pandas dataframe

I want to get a specific number of rows before and after a specific index. However, when the requested range extends past the ends of the dataframe, nothing is returned. In that case I would like the selection to wrap around and continue from the other end, as I show below:
df = pd.DataFrame({'column': range(1, 6)})
column
0 1
1 2
2 3
3 4
4 5
index = 2
df.iloc[index]
3
# Now I want to get three values before and after that index.
# Something like this:
def get_before_after_rows(index):
    rows_before = df[(index-1): (index-1)-2]
    rows_after = df[(index+1): (index+1)-2]
    return rows_before, rows_after
rows_before, rows_after = get_before_after_rows(index)
rows_before
column
0 1
1 2
4 5
rows_after
column
0 1
3 4
4 5
You are mixing iloc and loc which is very dangerous. It works in your example because the index is sequentially numbered starting from zero so these two functions behave identically.
Anyhow, what you want is basically taking rows with wrap-around:
def get_around(df: pd.DataFrame, index: int, n: int) -> (pd.DataFrame, pd.DataFrame):
    """Return n rows before and n rows after the specified positional index"""
    # Take positions modulo len(df) so both directions wrap around
    idx = (index - np.arange(1, n+1)) % len(df)
    before = df.iloc[idx].sort_index()
    idx = (index + np.arange(1, n+1)) % len(df)
    after = df.iloc[idx].sort_index()
    return before, after
# Get 3 rows before and 3 rows after the *positional index* 2
before, after = get_around(df, 2, 3)
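The wrap-around itself is plain modular arithmetic on positions. A small sketch on the five-row frame from the question (wrap_positions is a hypothetical helper name) shows which positions are selected:

```python
import numpy as np
import pandas as pd


def wrap_positions(center, n, length):
    """Return the n positions before and after `center`, wrapping modulo `length`."""
    offsets = np.arange(1, n + 1)
    before = (center - offsets) % length
    after = (center + offsets) % length
    return before, after


df = pd.DataFrame({'column': range(1, 6)})
before_pos, after_pos = wrap_positions(2, 3, len(df))

# Sort the positions so the rows come back in their original order.
rows_before = df.iloc[np.sort(before_pos)]
rows_after = df.iloc[np.sort(after_pos)]
```

For position 2 and n = 3 this selects positions 0, 1, 4 before and 0, 3, 4 after, which matches the output requested in the question.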

fill in entire dataframe cell by cell based on index AND column names?

I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
columns = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
    for seq2 in df.columns:
        df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
             ae  azde  afgle  arlde  afghijklbcmde
afghijklde    8     7      5      6              3
afghijklmde   9     8      6      7              2
ade           1     1      3      2             10
afghilmde     7     6      4      5              4
amde          2     1      3      2              9
What is a better way to do this, perhaps using applymap() ?. Everything I've tried with applymap() or apply or df.iterrows() has returned errors of the kind AttributeError: "'float' object has no attribute 'index'" . Thanks.
Turns out there's an even better way to do this. onepan's dictionary comprehension answer above is good but returns the df index and columns in random order. Using a nested .apply() accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is to not get hung up on naming the df's rows and columns first and filling in the values second. Instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
You could use comprehensions, which sped it up ~4.5x on my PC:
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f:{s:edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
# ae azde afgle arlde afghijklbcmde
# ade 1 2 2 2 2
# afghijklde 1 3 4 4 9
# afghijklmde 1 3 4 4 10
# afghilmde 1 3 4 4 8
# amde 1 3 3 3 3
# this matches to edit_distance('ae', 'afghijklde') == 8, e.g.
note I used this code for edit_distance (first response in your link):
def edit_distance(s1, s2):
    # Keep s1 as the shorter string so each row of the table stays small
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
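If the row and column order matters, one option is a nested list comprehension passed to the DataFrame constructor with explicit index and columns. This is a sketch that reuses the edit_distance function above; the variable names rows, cols and result are illustrative:

```python
import pandas as pd


def edit_distance(s1, s2):
    # Same dynamic-programming routine as above, repeated so the sketch runs on its own.
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]


rows = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
cols = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']

# One inner list per row label; index and columns keep the given order.
result = pd.DataFrame([[edit_distance(r, c) for c in cols] for r in rows],
                      index=rows, columns=cols)
```

Because the labels are passed explicitly, the frame comes out in the order of rows and cols rather than in dictionary or sorted order.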

Comparing rows of two pandas dataframes?

This is a continuation of my question. Fastest way to compare rows of two pandas dataframes?
I have two dataframes A and B:
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
For a condensed example:
A B C D E
0 0 0 0 1 0
1 1 1 1 1 0
2 1 0 0 1 1
3 0 1 1 1 0
B is 1024 rows x 10 columns, and is a full iteration from 0 to 1023 in binary form.
Example:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I am trying to find which rows in A, at a particular 10 columns of A, correspond with each row of B.
Each row of A[My_Columns_List] is guaranteed to be somewhere in B, but not every row of B will match up with a row in A[My_Columns_List]
For example, I want to show that for columns [B,D,E] of A,
rows [1,3] of A match up with row [6] of B,
row [0] of A matches up with row [2] of B,
row [2] of A matches up with row [3] of B.
I have tried using:
pd.merge(B.reset_index(), A.reset_index(),
         left_on = B.columns.tolist(),
         right_on = A.columns[My_Columns_List].tolist(),
         suffixes = ('_B','_A'))
This works, but I was hoping that this method would be faster:
S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
But when I do this, out_row_idx returns an array containing all the indices of A, which doesn't tell me anything.
I think this method will be faster, but I don't know why it returns an array from 0 to 999.
Any input would be appreciated!
Also, credit goes to @jezrael and @Divakar for these methods.
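A likely reason out_row_idx covers every row: np.in1d only tests membership, and since each row of A[My_Columns_List] is guaranteed to appear somewhere in B, every element of the mask is True. To learn which row of B each row of A matches, the integer codes can be used as lookup keys instead. The sketch below uses the condensed example from the question and assumes B lists the numbers with the most significant bit in column 0, as in the example table (b_row_for_code, matches and a_rows_per_b_row are illustrative names):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[0, 0, 0, 1, 0],
                  [1, 1, 1, 1, 0],
                  [1, 0, 0, 1, 1],
                  [0, 1, 1, 1, 0]], columns=list('ABCDE'))
# B enumerates 0..7 in binary, most significant bit first.
B = pd.DataFrame([[(i >> k) & 1 for k in (2, 1, 0)] for i in range(8)])

my_columns = ['B', 'D', 'E']
weights = 2 ** np.arange(len(my_columns))[::-1]   # [4, 2, 1]

A_ID = A[my_columns].to_numpy().dot(weights)
B_ID = B.to_numpy().dot(weights)

# Map each code to the row label of B that carries it, then look up A's codes.
b_row_for_code = pd.Series(B.index, index=B_ID)
matches = pd.Series(b_row_for_code.loc[A_ID].to_numpy(), index=A.index, name='B_row')

# Rows of A grouped by the row of B they match.
a_rows_per_b_row = {int(b): list(rows)
                    for b, rows in matches.groupby(matches).groups.items()}
```

Here rows 1 and 3 of A map to row 6 of B, row 0 maps to row 2, and row 2 maps to row 3, which is the pairing described in the question.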
I'll stick by my initial answer but maybe explain better.
You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.
Setup
You said we have a 1000 x 500 array of ones and zeros. Let's build that.
A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1] // 10), range(10)])
A = A_init
In addition, I gave A a MultiIndex to easily group by columns of 10.
Solution
This is very similar to @Divakar's answer with one minor difference that I'll point out.
For one group of 10 ones and zeros, we can treat it as a bit array of length 10. We can then calculate what its integer value is by taking the dot product with an array of powers of 2.
twos = 2 ** np.arange(10)
I can execute this for every group of 10 ones and zeros in one go like this
AtB = A.stack(0).dot(twos).unstack()
I stack to get a row of 50 groups of 10 into columns in order to do the dot product more elegantly. I then brought it back with the unstack.
I now have a 1000 x 50 dataframe of numbers that range from 0-1023.
Assume B is a dataframe with each row one of 1024 unique combinations of ones and zeros. B should be sorted like B = B.sort_values(by=B.columns.tolist()).reset_index(drop=True).
This is the part I think I failed at explaining last time. Look at
AtB.loc[:2, :2]
That value in the (0, 0) position, 951 means that the first group of 10 ones and zeros in the first row of A matches the row in B with the index 951. That's what you want!!! Funny thing is, I never looked at B. You know why, B is irrelevant!!! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference with my answer, I'm ignoring B. Ignoring this useless step should save time.
These are all functions that take two dataframes A and B and return a dataframe of indices where A matches B. Spoiler alert, I'll ignore B completely.
def FindAinB(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    # Integer division: range() needs an int in Python 3
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    twos = 2 ** np.arange(10)
    return A.stack(0).dot(twos).unstack()
def FindAinB2(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    # use clever bit shifting instead of dot product with powers
    # questionable improvement
    return (A.stack(0) << np.arange(10)).sum(1).unstack()
I'm channelling my inner @Divakar (read, this is stuff I've learned from Divakar)
def FindAinB3(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    # One row per 10-column group
    a = A.values.reshape(-1, 10)
    a = np.einsum('ij->i', a << np.arange(10))
    return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)
Minimalist One Liner
f = lambda A: pd.DataFrame(np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1), A.index)
Use it like
f(A)
Timing
FindAinB3 is an order of magnitude faster
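As a quick sanity check that the dot-product and bit-shift encodings agree, here is a sketch on a small random frame. The 6 x 20 size and the seed are assumptions chosen for readability, not the sizes from the question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
A = pd.DataFrame(rng.integers(0, 2, size=(6, 20)))

weights = 2 ** np.arange(10)

# Reshape so every row holds one group of 10 consecutive columns.
blocks = A.to_numpy().reshape(-1, 10)

# Encoding 1: dot product with powers of two.
codes_dot = blocks.dot(weights).reshape(len(A), -1)

# Encoding 2: shift each bit into place, then sum.
codes_shift = (blocks << np.arange(10)).sum(axis=1).reshape(len(A), -1)
```

Both arrays have one column per 10-column group and hold values between 0 and 1023. Note that with these weights column 0 of each group is the least significant bit.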

Fastest way to compare rows of two pandas dataframes?

So I have two pandas dataframes, A and B.
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
B is 1024 rows x 10 columns, and is a full iteration of 0's and 1's, hence having 1024 rows.
I am trying to find which rows in A, at a particular 10 columns of A, correspond with a given row in B. I need the whole row to match up, rather than element by element.
For example, I would want
A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == (1,0,1,0,1,0,0,1,0,0)).all(axis=1)]
to return something showing that rows (3,5,8,11,15) in A match the (1,0,1,0,1,0,0,1,0,0) row of B at those particular columns (1,2,3,4,5,6,7,8,9,10).
And I want to do this over every row in B.
The best way I could figure out to do this was:
import numpy as np
for i in B:
    B_array = np.array(i)
    Matching_Rows = A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == B_array).all(axis=1)]
    Matching_Rows_Index = Matching_Rows.index
This isn't terrible for one instance, but I use it in a while loop that runs around 20,000 times; therefore, it slows it down quite a bit.
I have been messing around with DataFrame.apply to no avail. Could map work better?
I was just hoping someone saw something obviously more efficient as I am fairly new to python.
Thanks and best regards!
We can abuse the fact that both dataframes have binary values 0 or 1 by collapsing the relevant columns from A and all columns from B into 1D arrays each, when considering each row as a sequence of binary numbers that could be converted to decimal number equivalents. This should reduce the problem set considerably, which would help with performance. Now, after getting those 1D arrays, we can use np.in1d to look for matches from B in A and finally np.where on it to get the matching indices.
Thus, we would have an implementation like so -
# Setup 1D arrays corresponding to selected cols from A and entire B
S = 2**np.arange(10)
A_ID = np.dot(A[range(1,11)],S)
B_ID = np.dot(B,S)
# Look for matches that exist from B_ID in A_ID, whose indices
# would be desired row indices that have matched from B
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
Sample run -
In [157]: # Setup dataframes A and B with rows 0, 4 in A having matches from B
...: A_arr = np.random.randint(0,2,(10,14))
...: B_arr = np.random.randint(0,2,(7,10))
...:
...: B_arr[2] = A_arr[4,1:11]
...: B_arr[4] = A_arr[4,1:11]
...: B_arr[5] = A_arr[0,1:11]
...:
...: A = pd.DataFrame(A_arr)
...: B = pd.DataFrame(B_arr)
...:
In [158]: S = 2**np.arange(10)
...: A_ID = np.dot(A[range(1,11)],S)
...: B_ID = np.dot(B,S)
...: out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
...:
In [159]: out_row_idx
Out[159]: array([0, 4])
You can use merge with reset_index. The output lists the index values of B whose rows equal the selected columns of A:
A = pd.DataFrame({'A':[1,0,1,1],
'B':[0,0,1,1],
'C':[1,0,1,1],
'D':[1,1,1,0],
'E':[1,1,0,1]})
print (A)
A B C D E
0 1 0 1 1 1
1 0 0 0 1 1
2 1 1 1 1 0
3 1 1 1 0 1
B = pd.DataFrame({'0':[1,0,1],
'1':[1,0,1],
'2':[1,0,0]})
print (B)
0 1 2
0 1 1 1
1 0 0 0
2 1 1 0
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A')))
index_B 0 1 2 index_A A B C D E
0 0 1 1 1 2 1 1 1 1 0
1 0 1 1 1 3 1 1 1 0 1
2 1 0 0 0 1 0 0 0 1 1
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A'))[['index_B','index_A']])
index_B index_A
0 0 2
1 0 3
2 1 1
You can do it in pandas by using loc or ix and telling it to find the rows where the ten columns are all equal. Like this:
A.loc[(A[1]==B[1]) & (A[2]==B[2]) & (A[3]==B[3]) & (A[4]==B[4]) & (A[5]==B[5]) & (A[6]==B[6]) & (A[7]==B[7]) & (A[8]==B[8]) & (A[9]==B[9]) & (A[10]==B[10])]
This is quite ugly in my opinion but it will work and gets rid of the loop so it should be significantly faster. I wouldn't be surprised if someone could come up with a more elegant way of coding the same operation.
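One more compact way to express a whole-row comparison against a single row of B is to compare the selected columns with that row as an array and reduce with all(axis=1). This is a sketch on small made-up frames (the data, cols and pattern are assumptions for illustration):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 0, 1, 1],
                  [0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [0, 0, 0, 1]])
B = pd.DataFrame([[0, 1], [1, 1], [0, 0]])

cols = [1, 2]                      # columns of A to compare
pattern = B.iloc[0].to_numpy()     # first row of B: [0, 1]

# Broadcasting compares every selected row of A with the pattern;
# all(axis=1) keeps rows where every column matches.
mask = (A[cols].to_numpy() == pattern).all(axis=1)
matching_rows = A.index[mask]
```

This handles one row of B at a time, so it still needs a loop or one of the encoding approaches to cover all of B.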
In this special case, your rows of 10 zeros and ones can be interpreted as 10 digit binaries. If B is in order, then it can be interpreted as a range from 0 to 1023. In this case, all we need to do is take A's rows in 10 column chunks and calculate what its binary equivalent is.
I'll start by defining a range of powers of two so I can do matrix multiplication with it.
twos = pd.Series(np.power(2, np.arange(10)))
Next, I'll relabel A's columns into a MultiIndex and stack to get my chunks of 10.
A = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A.columns = pd.MultiIndex.from_tuples(list(zip((A.columns // 10).tolist(), (A.columns % 10).tolist())))
A_ = A.stack(0)
A_.head()
Finally, I'll multiply A_ with twos to get integer representation of each row and unstack.
A_.dot(twos).unstack()
This is now a 1000 x 50 DataFrame where each cell represents which of B's rows we matched for that particular 10 column chunk for that particular row of A. There isn't even a need for B.
