Find the minimal rows with maximum 1s column - python

I have a numpy 2D array with all zeros and 1s and I want those rows that has atleast one 1 for each column. For example:
PROBLEM STATEMENT: Find minimal rows that gives maximum 1s across all columns.
INPUT1:
A B C D E
t1 0 0 0 1 1
t2 0 1 1 0 1
t3 0 1 1 0 1
t4 1 0 1 0 1
t5 1 0 1 0 1
t6 1 1 1 1 0
Here, there are multiple answers like (t6, t1), (t6, t2), (t6, t3), (t6, t4), (t6, t5).
INPUT2:
A B C D E
t1 0 0 0 1 1
t2 0 1 1 0 1
t3 0 1 1 0 1
t4 1 0 1 0 1
t5 1 0 1 0 1
t6 1 1 1 1 1
Answer: t6
I don't want to use brute force method as my original matrix is very big. Is there a smart way to do this?

Naive solution, worst-case O(2^n)
This iterates over all possible choices of rows, starting with as few rows as possible, making average cases usually low-polynomial time.
from itertools import combinations
import numpy as np
def minimum_rows(arr):
out_list = []
rows = arr.shape[0]
for x in range(1, rows):
for combo in combinations(range(rows),x):
if np.logical_or.reduce(arr[[combo]]).all():
out_list.append(combo)
if out_list:
return out_list
I wrote this entirely on my phone without much testing, so it may or may not work. It employs no tricks, but is fairly fast. Note that it will be slower when the ratio columns/rows is larger or the the probability of a given element being True is smaller, as that makes it less likely for fewer rows to meet the conditions required, causing x to increase, which in turn will increase the number of combinations iterated though.

Related

How to merge strings that have certain number of substrings in common to produce some groups in a data frame in Python

I asked a question like this. But that is a simple one. Which has been resolved. how to merge strings that have substrings in common to produce some groups in a data frame in Python.
But here, I have an advanced version of the similar question:
I have a sample data:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
What I want to do is merge some strings if they have sub strings in common. So, in this example, the strings 'b,c','a','a,c,d,e' should be merged together because they can be linked to each other. 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope I can have something like:
group
'b,c', 0
'a', 0
'a,c,d,e', 0
'f,g,h,i', 1
'j,k,l', 2
'k,l,m' 2
So, I can have three groups and there is no common sub strings between any two groups.
Now, I am trying to build up a similarity data frame, in which 1 means two strings have sub strings in common. Here is my code:
commonWords=1
for i in np.arange(a.shape[0]):
a.loc[:,a.loc[i,'ACTIVITY']]=0
for i in a.loc[:,'ACTIVITY']:
il=i.split(',')
for j in a.loc[:,'ACTIVITY']:
jl=j.split(',')
c=[x in il for x in jl]
c1=[x for x in c if x==True]
a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0
a
The result is:
ACTIVITY b,c a a,c,d,e f,g,h,i j,k,l k,l,m
0 b,c 1 0 1 0 0 0
1 a 0 1 1 0 0 0
2 a,c,d,e 1 1 1 0 0 0
3 f,g,h,i 0 0 0 1 0 0
4 j,k,l 0 0 0 0 1 1
5 k,l,m 0 0 0 0 1 1
In this code, commonWords means how many sub strings I hope that two strings have in common. For example, if commonWords=2, then two strings will be merged together only if there are two, or more than two sub strings in them. When commonWords=2, the group should be:
group
'b,c', 0
'a', 1
'a,c,d,e', 2
'f,g,h,i', 3
'j,k,l', 4
'k,l,m' 4
Use:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
from itertools import combinations, chain
from collections import Counter
#split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords=2
#create edges (can only connect two nodes)
L2_nested = [list(combinations(l,commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
#convert values to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
#create new columns for matched sets
for val in f1:
j = ','.join(val)
a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print (a)
#forward filling values of new columns and use factorize for groups
new = pd.factorize(a[['ACTIVITY']].assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group = new)
print (a)
ACTIVITY group
0 b,c 0
1 a 1
2 a,c,d,e 2
3 f,g,h,i 3
4 j,k,l 4
5 k,l,m 4

pythonic way of making dummy column from sum of two values

I have a dataframe with one column called label which has the values [0,1,2,3,4,5,6,8,9].
I would like to make dummy columns out of this, but I would like some labels to be joined together, so for example I want dummy_012 to be 1 if the observation has either label 0, 1 or 2.
If i use the command df2 = pd.get_dummies(df, columns=['label']), it would create 9 columns, 1 for each label.
I know I can use df2['dummy_012']=df2['dummy_0']+df2['dummy_1']+df2['dummy_2'] after that to turn it into one joint column, but I want to know if there's a more pythonic way of doing it (or some function where i can just change the parameters to the joins).
Maybe this approach can give a idea:
groups = ['012', '345', '6789']
for gp in groups:
df.loc[df['Label'].isin([int(x) for x in gp]), 'Label_Group'] = f'dummies_{gp}'
Output:
Label Label_Group
0 0 dummies_012
1 1 dummies_012
2 2 dummies_012
3 3 dummies_345
4 4 dummies_345
5 5 dummies_345
6 6 dummies_6789
7 8 dummies_6789
8 9 dummies_6789
And then apply dummy:
df_dummies = pd.get_dummies(df['Label_Group'])
dummies_012 dummies_345 dummies_6789
0 1 0 0
1 1 0 0
2 1 0 0
3 0 1 0
4 0 1 0
5 0 1 0
6 0 0 1
7 0 0 1
8 0 0 1
I don't know that this is pythonic because a more elegant solution might exist, but I does allow you to change parameters and it's vectorized. I've read that get_dummies() can be a bit slow with large amounts of data and vectorizing pandas is good practice in general. So I vectorized this function and had it do its calculations with numpy arrays. It should give you a boost in performance as the dataset increases in size compared to similar functions.
This function will take your dataframe and a list of numbers as strings and will return your dataframe with the column you wanted.
def get_dummy(df,column_nos):
new_col_name = 'dummy_'+''.join([i for i in column_nos])
vector_sum = sum([df[i].values for i in column_nos])
df[new_col_name] = [1 if i>0 else 0 for i in vector_sum]
return df
In case you'd rather the input to be integers rather than strings, you can tweak the above function to look like below.
def get_dummy(df,column_nos):
column_names = ['dummy_'+str(i) for i in column_nos]
new_col_name = 'dummy_'+''.join([str(i) for i in sorted(column_nos)])
vector_sum = sum([df[i].values for i in column_names])
df[new_col_name] = [1 if i>0 else 0 for i in vector_sum]
return df

Python DataFrame Accumulator Based on Flag

I have a logic-driven flag column and I need to create a column that increments by 1 when the flag is true and decrements by 1 when the flag is false down to a floor of zero.
I've tried a few different methods and I can't get the Accumulator 'shift' to reference the new value created by the process. I know the method below wouldn't stop at zero anyway, but I was just trying to work through the concept before and this is the most to-the-point example to explain the goal. Do I need a for loop to iterate line-by-line?
df = pd.DataFrame(data=np.random.randint(2,size=10), columns=['flag'])
df['accum'] = 0
df['accum'] = np.where(df['flag'] == 1, df['accum'].shift(1) + 1, df['accum'].shift(1) - 1)
df['dOutput'] = [1,0,1,2,1,2,3,2,1,0] #desired output
df
Output
As far as I know, there's no numpy or pandas vectorized operation to do this, so, you should iterate line-by-line:
def cumsum_with_floor(series):
acc = 0
output = []
accum_list = []
for val in series:
val = 1 if val else -1
acc += val
accum_list.append(val)
acc = acc if acc > 0 else 0
output.append(acc)
return pd.Series(output, index=series.index), pd.Series(accum_list, index=series.index)
series = pd.Series([1,0,1,1,0,0,0,1])
dOutput, accum = cumsum_with_floor(series)
dOutput
Out:
0 1
1 0
2 1
3 2
4 1
5 0
6 0
7 1
dtype: int64
accum # shifted by one step forward compared with you example
Out:
0 1
1 -1
2 1
3 1
4 -1
5 -1
6 -1
7 1
dtype: int64
But may be there's somebody who knows suitable combination of pd.clip and pd.cumsum or other vectorized operations.

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination1 to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
1 See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby"
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df)))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].as_matrix()
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
a = numpy.asarray(array)
a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
return a

Calculating number of permutations of a matrix with elements being adjacent integers only

I'm trying to write a Python code in order to determine the number of possible permutations of a matrix where neighbouring elements can only be adjacent integer numbers. I also wish to know how many times each total set of numbers appears (by that I mean, the same numbers of each integer in n matrices, but not in the same matrix permutation)
Forgive me if I'm not being clear, or if my terminology isn't ideal! Consider a 5 x 5 zero matrix. This is an acceptable permutaton, as all of the elements are adjacent to an identical number.
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
25 x 0, 0 x 1, 0 x 2
The elements within the matrix can be changed to 1 or 2. Changing any of the elements to 1 would also be an acceptable permutation, as the 1 would be surrounded by an adjacent integer, 0. For example, changing the central [2,2] element of the matrix:
0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0
24 x 0, 1 x 1, 0 x 2
However, changing the [2,2] element in the centre to a 2 would mean that all of the elements surrounding it would have to switch to 1, as 2 is not adjacent to 0.
0 0 0 0 0
0 1 1 1 0
0 1 2 1 0
0 1 1 1 0
0 0 0 0 0
16 x 0, 8 x 1, 1 x 2
I want to know how many permutations are possible from that zeroed 5x5 matrix by changing the elements to 1 and 2, whilst keeping neighbouring elements as adjacent integers. In other words, any permutations where 0 and 2 are adjacent are not allowed.
I also wish to know how many matrices contain a certain number of each integer. For example, both of the below matrices would be 24 x 0, 1 x 1, 0 x 2. Over every permutation, I'd like to know how many correspond to this frequency of integers.
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Again, sorry if I'm not being clear or my nomenclature is poor! Thanks for your time - I'd really appreciate some help with this, and any words or guidance would be kindly received.
Thanks,
Sam
First, what you're calling a permutation isn't.
Secondly your problem is that a naive brute force solution would look at 3^25 = 847,288,609,443 possible combinations. (Somewhat less, but probably still in the hundreds of billions.)
The right way to solve this is called dynamic programming. What you need to do for your basic problem is calculate, for i from 0 to 4, for each of the different possible rows you could have there how many possible matrices you could have had that end in that row.
Add up all of the possible answers in the last row, and you'll have your answer.
For the more detailed count, you need to divide it by row, by cumulative counts you could be at for each value. But otherwise it is the same.
The straightforward version should require tens of thousands of operation. The detailed version might require millions. But this will be massively better than the hundreds of billions that the naive recursive version takes.
Just search for some more simple rules:
1s can be distributed arbitrarily in the array, since the matrix so far only consists of 0s. 2s can aswell be distributed arbitrarily, since only neighbouring elements must be either 1 or 2.
Thus there are f(x) = n! / x! possibilities to distributed 1s and 2s over the matrix.
So the total number of possible permutations is 2 * sum(x = 1 , n * n){f(x)}.
Calculating the number of possible permutations with a fixed number of 1s can easily be solved by simple calculating f(x).
The number of matrices with a fixed number of 2s and 1s is a bit more tricky. Here you can only rely on the fact that all mirrored versions of the matrix yield the same number of 1s and 2s and are valid. Apart from using that fact you can only brute-force search for correct solutions.

Categories

Resources