I am trying to get all combinations of elements from disjoint sets, where each combination contains exactly one element from each set. Take the matrix below, where rows are elements and columns are the sets:
    s1  s2  s3
1 | 1 | 0 | 0 -> element 1 in set 1
2 | 1 | 0 | 0 -> element 2 in set 1
3 | 0 | 1 | 0 -> element 3 in set 2
4 | 0 | 1 | 0 -> element 4 in set 2
5 | 0 | 0 | 1 -> element 5 in set 3
6 | 0 | 0 | 1 -> element 6 in set 3
We can get the possible combinations as:
1 2 3 4 5 6 -> element
___________
1 0 1 0 1 0 -> combination 1
1 0 1 0 0 1 -> combination 2
1 0 0 1 1 0 -> combination 3
1 0 0 1 0 1 -> combination 4
0 1 1 0 1 0 -> combination 5
0 1 1 0 0 1 -> combination 6
0 1 0 1 1 0 -> combination 7
0 1 0 1 0 1 -> combination 8
Where each row represents a combination of elements that exists in the sets 1 to 3.
I have tried a brute-force approach (Combinations with restrictions in Python), where we generate all combinations and then remove the invalid ones from the list. I was wondering if there are quicker ways to do this?
comb_count, active_count = get_combination_count(self.sets)
temp_comb_count = comb_count
set_combinations = np.zeros((int(comb_count), len(self.sets)))
for index_m, current_set in enumerate(self.sets):
    if self.get_num_elements(current_set) == 0:
        continue
    indices = self.get_active_elements(current_set)
    # Block length for this set's elements, and how often the pattern repeats:
    amount = temp_comb_count / active_count[index_m]
    repeats = comb_count / temp_comb_count
    index_i = 0
    for index_k in range(int(repeats)):
        for index_j in indices:
            counter = 0
            while counter < amount:
                set_combinations[index_i, index_j] = 1
                counter += 1
                index_i += 1
    temp_comb_count /= active_count[index_m]
The code above is my attempt at a quicker computation of the set of all possible combinations. Timing comparisons (plots not reproduced here): "2 Sets and 3 Elements - 1 element not in any set" and "2 Sets and 4 Elements".
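For what it's worth, a minimal sketch of a direct construction (assuming each set is given as a list of its element indices, e.g. sets = [[0, 1], [2, 3], [4, 5]] for the matrix above; this is not the author's API): itertools.product over the sets yields exactly the valid combinations, so nothing is generated and then discarded.

import itertools
import numpy as np

sets = [[0, 1], [2, 3], [4, 5]]   # hypothetical input: element indices per set
n_elements = 6

combos = list(itertools.product(*sets))            # [(0, 2, 4), (0, 2, 5), ...]
result = np.zeros((len(combos), n_elements), dtype=int)
for row, combo in enumerate(combos):
    result[row, list(combo)] = 1                   # one chosen element per set
print(result)

This reproduces the 8-row matrix shown above, in the same order.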
I have a data frame with a column containing only 0s and 1s. I need to create a flag column that is 1 wherever the first column has more than a certain number of consecutive ones.
In the example below, x >= 4: if there are 4 or more consecutive ones, the flag should be 1 for all of those consecutive rows.
col1 Flag
0 1 0
1 0 0
2 1 1
3 1 1
4 1 1
5 1 1
6 0 0
7 1 0
8 1 0
9 0 0
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 0 0
One change: let's say there is a new column, Group. We need to group by that column and then find the flag:
Group col1 Flag
0 A 1 0
1 B 0 0
2 B 1 1
3 B 1 1
4 B 1 1
5 B 1 1
6 C 0 0
7 C 1 0
8 C 1 0
9 C 0 0
10 D 1 0
11 D 1 0
12 D 1 0
13 E 1 0
14 E 1 0
15 E 0 0
As you can see, there are consecutive ones from rows 10 to 14, but they belong to different groups. The elements in a group can be in any order.
Not that hard: use cumsum to create the group key, then do the transform with count:
(df.groupby(df.col1.ne(1).cumsum())['col1'].transform('count').ge(5) & df.col1.eq(1)).astype(int)
Out[83]:
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 1
13 1
14 1
15 0
Name: col1, dtype: int32
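Why ge(5) rather than ge(4)? Each cumsum group also contains the non-1 row that opens it, so a run of 4 ones shows up as a group of size 5. A self-contained look at the key, using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})
key = df.col1.ne(1).cumsum()
print(pd.concat([df.col1, key.rename('key')], axis=1))
# Rows 2-5 (four ones) share key 1 with row 1, the zero that opened the
# group, so that group's size is 5 -- hence .ge(5) for 4+ consecutive ones.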
You can achieve this in a couple of steps:
1) rolling(4).sum() to attain consecutive summations of your column.
2) Use where to get the 1s from "col1" where their summation window (from the previous step) is >= 4. Turn the rest of the values into np.NaN.
3) bfill(limit=3) to backwards-fill the leftover 1s in your column by a maximum of 3 places.
4) fillna(0) to fill what's left over with 0.
df["my_flag"] = (df["col1"]
.where(
df["col1"].rolling(4).sum() >= 4
) # Selects the 1's whose consecutive sum >= 4. All other values become NaN
.bfill(limit=3) # Moving backwards from our leftover values,
# take the existing value and fill in a maximum of 3 NaNs
.fillna(0) # Fill in the rest of the NaNs with 0
.astype(int)) # Cast to integer data type, since we were working with floats temporarily
print(df)
col1 Flag my_flag
0 1 0 0
1 0 0 0
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 0 0
8 1 0 0
9 0 0 0
10 1 1 1
11 1 1 1
12 1 1 1
13 1 1 1
14 1 1 1
15 0 0 0
Edit:
For a grouped approach, you just need to use groupby().rolling to create your mask for use in where(). Everything after that is the same. I separated the rolling step to keep it as readable as possible:
grouped_counts_ge_4 = (df.groupby("Group")["col1"]
                         .rolling(4)
                         .sum()
                         .ge(4)
                         .reset_index(level=0, drop=True))

df["my_flag"] = (df["col1"]
                 .where(grouped_counts_ge_4)
                 .bfill(limit=3)  # Moving backwards from our leftover values, take the existing value and fill in a maximum of 3 NaNs
                 .fillna(0)       # Fill in the rest of the NaNs with 0
                 .astype(int))    # Cast to integer data type, since we were working with floats temporarily
print(df)
Group col1 Flag my_flag
0 A 1 0 0
1 B 0 0 0
2 B 1 1 1
3 B 1 1 1
4 B 1 1 1
5 B 1 1 1
6 C 0 0 0
7 C 1 0 0
8 C 1 0 0
9 C 0 0 0
10 D 1 0 0
11 D 1 0 0
12 D 1 0 0
13 E 1 0 0
14 E 1 0 0
15 E 0 0 0
Try this:
df['Flag'] = np.where(
    df['col1'].groupby((df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum())
              .transform('size').ge(4),
    1, 0)
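To see how this groups the runs, inspect the key it builds: every zero and every start of a run of ones opens a new group, so transform('size') equals the run length for the ones. A self-contained sketch with the sample data:

import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})
key = (df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum()
print(pd.concat([df['col1'], key.rename('key')], axis=1))
# Each run of ones shares a single key value, while every zero gets its own
# group, so transform('size').ge(4) is True only inside runs of 4+ ones.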
Suppose I have a dataframe like this:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total day_90
-------------------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156 47
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231 35
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78 16
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5 3
where the last column [day_90] holds the index of the column ([0] - [90]) at which the row's cumulative count reaches 90% of [total]. To clarify, take the first row as an example: at the 47th column, ID A reaches 90% of the 156 events it will accumulate over 90 days.
What I need is: for each row, count the length of the first sequence of 0s that is bigger than 7 (or any arbitrary number predefined). So, for example: for the first row, I want to know how long is the first sequence of zeros after column 47, but only if the sequence exceeds 7 zeros in a row. If there are 6 zeros and then a non-zero, then I don't want to count it.
Finally, I want to store this result in a new column after [day_90]. So if ID A has a sequence of 10 zeros right after column 47, I want to add a new column [0_sequence] that holds the value of 10 for that ID.
I really have no idea where to start. Any help is appreciated =)
Your problem is basically a variant of the island-and-gap problem: a non-zero creates a new "island", while a 0 extends the current island. You want to find the first island of at least a certain size. Before I answer your question, let me walk you through a simplified version of the problem.
Let's say you have a Series:
>>> a = pd.Series([0,0,0,13,0,0,4,12,0,0])
0 0
1 0
2 0
3 13
4 0
5 0
6 4
7 12
8 0
9 0
And you want to find the length of the first sequence of 0s that is at least 3 elements long. Let's first assign the elements to "islands":
# Every time the number is non-zero, a new "island" is created
>>> b = (a != 0).cumsum()
0 0 <-- island 0
1 0
2 0
3 1 <-- island 1
4 1
5 1
6 2 <-- island 2
7 3 <-- island 3
8 3
9 3
For each island, we are only interested in elements that are equal to 0:
>>> c = b[a == 0]
0 0
1 0
2 0
4 1
5 1
8 3
9 3
Now let's determine the size of each island:
>>> d = c.groupby(c).count()
0 3 <-- island 0 is of size 3
1 2 <-- island 1 is of size 2
3 2 <-- island 3 is of size 2
dtype: int64
And filter for islands whose size >= 3:
>>> e = d[d >= 3]
0 3
The answer is the first element of e (island 0, size 3) if e is not empty. Otherwise, there's no island meeting our criteria.
First Attempt
And applying it to your problem:
def count_sequence_length(row, n):
    """Return the length of the first sequence of 0s
    after the column given by `day_90` whose length is >= n.
    """
    if row['day_90'] + n > 90:
        return 0
    # The columns after `day_90`
    idx = np.arange(row['day_90'] + 1, 91)
    a = row[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    e = d[d >= n]
    return 0 if len(e) == 0 else e.iloc[0]

df['0_sequence'] = df.apply(count_sequence_length, n=7, axis=1)
Second Attempt
The above version is nice, but slow because it calculates the sizes of all islands. Since you only care about the size of the first island meeting the criteria, a simple for loop works much faster:
def count_sequence_length_2(row, n):
    if row['day_90'] + n > 90:
        return 0
    size = 0
    for i in range(row['day_90'] + 1, 91):
        if row[i] == 0:
            # increase the size of the current island
            size += 1
        elif size >= n:
            # found the island we want. Search no more
            break
        else:
            # create a new island
            size = 0
    return size if size >= n else 0

df['0_sequence'] = df.apply(count_sequence_length_2, n=7, axis=1)
This achieves a speedup of 10-20x when I benchmark it.
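If you want to check that kind of claim yourself, a minimal timing harness (using hypothetical synthetic data, not the author's, together with the two functions defined above) might look like:

import time
import numpy as np
import pandas as pd

# Hypothetical synthetic frame: 1000 rows, integer day columns 0..90 plus 'day_90'.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(1000, 91)), columns=range(91))
df['day_90'] = rng.integers(0, 60, size=1000)

for func in (count_sequence_length, count_sequence_length_2):
    start = time.perf_counter()
    df.apply(func, n=7, axis=1)
    print(func.__name__, round(time.perf_counter() - start, 3), "seconds")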
Here is my solution, see the comments in the code:
import io
import numpy as np
import pandas as pd

# Test data:
text = """ ID 0 1 2 3 4 5 6 7 8 day_90
0 A 2 21 0 18 3 0 0 0 2 4
1 B 0 20 12 2 0 8 14 23 0 5
2 C 0 38 19 3 1 3 3 7 1 1
3 D 3 0 0 1 0 0 0 0 0 0"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python")
# ------------------------
# Convert the day column names into integers:
cols = list(range(9))
df.columns = ["ID"] + cols + ["day_90"]
# ----------
istart, istop = df.columns.get_loc(0), df.columns.get_loc(8) + 1
# The required length of the 1st zero sequence:
lseq = 2

# The function for aggregating: this is the main calculation; 'r' is a row of 'df':
def zz(r):
    s = r.iloc[r.day_90 + istart:istop]  # get the day columns, starting as fixed in 'day_90'
    # --- Manipulate 's' to make it possible to use 'groupby' for extracting the sequences:
    crit = s.eq(0)
    s = pd.Series(np.where(crit, np.nan, np.arange(len(s))), index=s.index)
    if np.isnan(s.iloc[0]):
        s.iloc[0] = 1
    s = s.ffill()
    s[~crit] = np.nan
    # ---
    # get the sequences and their sizes:
    ssiz = s.groupby(s).size()
    return ssiz.iloc[0] if len(ssiz) and ssiz.iloc[0] > lseq else np.nan

df["zseq"] = df.agg(zz, axis=1)
ID 0 1 2 3 4 5 6 7 8 day_90 zseq
0 A 2 21 0 18 3 0 0 0 2 4 3.0
1 B 0 20 12 2 0 8 14 23 0 5 NaN
2 C 0 38 19 3 1 3 3 7 1 1 NaN
3 D 3 0 0 1 0 0 0 0 0 0 NaN
Let's say we have this sample data.
| mem_id | main_title | sub_title |
-----------------------------------
| 1 | 1 | 1 |
| 10 | 3 | 2 |
| 3 | 3 | 2 |
| 45 | 1 | 2 |
| 162 | 2 | 2 |
...
1) Summary of the data
mem_id: unique IDs of 200 people.
main_title: 3 unique labels (1, 2, 3).
sub_title: 6 unique labels (1, 2, 3, 4, 5, 6); each main_title can have any of these sub_titles.
Repetition is possible: one mem_id can have multiple cases of (main: 1, sub: 1).
2) Question
I'd like to reproduce the result of R's table function in Python. The R result looks like this: every possible combination of main_title and sub_title is present, together with the count for each case by mem_id.
count.data <- table(data$mem_id, data$main_title, data$sub_title)
count.table <- as.data.frame(count.data)
===============================================
mem_id main_title sub_title value
1 1 1 1 0
2 2 1 1 0
3 3 1 1 0
4 4 1 1 0
5 5 1 1 0
6 6 1 1 0
7 7 1 1 0
.
.
.
I've tried to get this result in Python, and the result below is what I got so far.
cross_table1 = pd.melt(data, id_vars=['main_title ', 'sub_title'], value_vars='mem_id', value_name='mem_id')
==================================================
main_title sub_title variable mem_id
1 1 1 mem_id 10
2 1 1 mem_id 10
3 3 1 mem_id 10
4 4 2 mem_id 10
5 1 4 mem_id 132
6 4 1 mem_id 65
7 4 3 mem_id 88
.
.
.
cross_table2 = cross_table1.pivot_table(index=['main_title ', 'sub_title', 'mem_id'], values='variable', aggfunc='count')
cross_table2.reset_index().sort_values('value')
==============================================
main_title sub_title mem_id value
1 1 1 1 4
2 1 1 2 3
3 3 1 3 1
4 4 2 3 10
5 1 4 3 2
6 1 1 4 5
7 3 2 5 2
.
.
.
I realize this only shows the rows with a positive value (count of cases).
What I need is to include every possible combination of main_title and sub_title, so e.g. the 1 & 1 (main & sub) case has to have 200 rows, with a possibly zero value in the count column.
I would be grateful for any help or advice!!
Thanks :)
In pandas you can do this with groupby + reindex:
s = df.groupby(df.columns.tolist()).size()                     # counts of the observed combinations
idx = pd.MultiIndex.from_product(list(map(set, df.values.T)))  # all possible combinations
s = s.reindex(idx, fill_value=0)                               # missing combinations get count 0
s
Out[15]:
162 1 1 0
2 0
2 1 0
2 1
3 1 0
2 0
1 1 1 1
2 0
2 1 0
2 0
3 1 0
2 0
10 1 1 0
2 0
2 1 0
2 0
3 1 0
2 1
3 1 1 0
2 0
2 1 0
2 0
3 1 0
2 1
45 1 1 0
2 1
2 1 0
2 0
3 1 0
2 0
dtype: int64
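If you also want the flat data frame that R's as.data.frame(count.data) produces, a small sketch (the level names here are my assumption, chosen to match the R output columns):

out = (s.rename_axis(['mem_id', 'main_title', 'sub_title'])
        .reset_index(name='value'))
print(out.head())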
I have a data frame that represents fail-data for a series of parts, showing which of 3 tests (A, B, C) pass (0) or fail (1).
A B C
1 0 1 1
2 0 0 0
3 1 0 0
4 0 0 1
5 0 0 0
6 0 1 0
7 1 1 0
8 1 1 1
I'd like to add a final column to the dataframe showing the First Fail (FF) of each part, or a default (P) if no fails.
A B C | FF
1 0 1 1 | B
2 0 0 0 | P
3 1 0 0 | A
4 0 0 1 | C
5 0 0 0 | P
6 0 1 0 | B
7 1 1 0 | A
8 1 1 1 | A
Is there an easy way to do this in pandas? Does it require iterating over each row?
maybe:
>>> df['FF'] = df.dot(df.columns).str.slice(0, 1).replace('', 'P')
>>> df
A B C FF
1 0 1 1 B
2 0 0 0 P
3 1 0 0 A
4 0 0 1 C
5 0 0 0 P
6 0 1 0 B
7 1 1 0 A
8 1 1 1 A
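Why the dot trick works: with 0/1 integers, a column name multiplied by 0 gives '' and by 1 gives the name itself, so the row-wise dot product concatenates the names of the failing columns. A quick illustration:

import pandas as pd

demo = pd.DataFrame([[0, 1, 1], [0, 0, 0]], columns=list('ABC'))
print(demo.dot(demo.columns))
# 0    BC    <- .str.slice(0, 1) keeps 'B', the first fail
# 1          <- empty string, which .replace('', 'P') turns into 'P'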
alternatively:
>>> df['FF'] = np.where(df.any(axis=1), df.idxmax(axis=1), 'P')
>>> df
A B C FF
1 0 1 1 B
2 0 0 0 P
3 1 0 0 A
4 0 0 1 C
5 0 0 0 P
6 0 1 0 B
7 1 1 0 A
8 1 1 1 A
I am trying to display a 2D sudoku board in Python like this:
0 0 3 |0 2 0 |6 0 0
9 0 0 |3 0 5 |0 0 1
0 0 1 |8 0 6 |4 0 0
------+------+------
0 0 8 |1 0 2 |9 0 0
7 0 0 |0 0 0 |0 0 8
0 0 6 |7 0 8 |2 0 0
------+------+------
0 0 2 |6 0 9 |5 0 0
8 0 0 |2 0 3 |0 0 9
0 0 5 |0 1 0 |3 0 0
I managed to display the board without the separation lines using this code:
rows = 'ABCDEFGHI'
cols = '123456789'

def display(values):
    for r in rows:
        for c in cols:
            print values[r+c],
        print
values is a dictionary {'A1':'0', 'A2':'0', 'A3':'3', 'A4':'0', 'A5':'2', ...etc}. I get this output:
0 0 3 0 2 0 6 0 0
9 0 0 3 0 5 0 0 1
0 0 1 8 0 6 4 0 0
0 0 8 1 0 2 9 0 0
7 0 0 0 0 0 0 0 8
0 0 6 7 0 8 2 0 0
0 0 2 6 0 9 5 0 0
8 0 0 2 0 3 0 0 9
0 0 5 0 1 0 3 0 0
Any help?
The following may work. But I think that a function that returns a string as its result may be more useful (for writing the result to a text file, for example, without too much monkey-patching).
rows = 'ABCDEFGHI'
cols = '123456789'

def display(values):
    for i, r in enumerate(rows):
        if i in [3, 6]:
            print '------+-------+------'
        for j, c in enumerate(cols):
            if j in [3, 6]:
                print '|',
            print values[r + c],
        print
Result:
9 6 0 | 5 0 7 | 9 5 2
1 9 3 | 9 3 4 | 5 4 2
4 9 7 | 2 3 0 | 1 3 1
------+-------+------
3 0 1 | 6 7 3 | 9 8 3
2 4 5 | 7 8 7 | 8 0 8
0 1 4 | 9 3 9 | 3 9 6
------+-------+------
6 1 2 | 8 7 6 | 5 0 1
4 3 9 | 3 0 8 | 5 6 6
4 1 7 | 5 9 9 | 3 1 7
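For example, a minimal sketch of that string-returning variant (using the same rows and cols as above; render and its '?' fallback for missing cells are my additions, and print with a single argument works in both Python 2 and 3):

def render(values):
    lines = []
    for i, r in enumerate(rows):
        if i in (3, 6):
            lines.append('------+-------+------')
        cells = []
        for j, c in enumerate(cols):
            if j in (3, 6):
                cells.append('|')
            cells.append(values.get(r + c, '?'))  # '?' if a cell is missing
        lines.append(' '.join(cells))
    return '\n'.join(lines)

print(render(values))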
Here's an approach that's a bit messy. If you add some other identifier to your row and column strings, you actually get the in between columns in a recognizable form:
# Use "x" as an identifier for a row or column where there should be a separator
rows = 'ABCxDEFxGHI'
cols = '123x456x789'
values = {'A1': '0', 'A2': '0', 'A3': '3'}
def display(values):
for r in rows:
for c in cols:
if r == "x":
if c == "x":
# Both row and column are the separator, show a plus
print "+",
else:
# Only the row is the separator, show a dash
print "-"
elif c == "x":
# Only the column is the separator, show a pipe
print "|",
else:
# Not a separator, print the given cell (or ? if not found)
print values.get(r+c, "?"),
# Make sure there's a newline so we start a new row
print ""
display(values)
Another possibility is to be more clever and insert the separator cells into the dictionary directly (i.e. "xx": "+", "Ax": "|"), but that's more work. You can, however, use the get() method of the dictionary to automatically fill in one set of those (i.e. default to returning a pipe or hyphen).
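A sketch of that dictionary idea (the sep mapping and display2 below are my hypothetical additions, reusing the 'x'-marked rows and cols from this answer):

sep = {'xx': '+'}
for rr in 'ABCDEFGHI':
    sep[rr + 'x'] = '|'    # column separator cells
for cc in '123456789':
    sep['x' + cc] = '-'    # row separator cells

def display2(values):
    for r in rows:         # rows = 'ABCxDEFxGHI'
        print(' '.join(sep.get(r + c) or values.get(r + c, '?')
                       for c in cols))   # cols = '123x456x789'

display2(values)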
sudoku="
0 0 3 0 2 0 6 0 0
9 0 0 3 0 5 0 0 1
0 0 1 8 0 6 4 0 0
0 0 8 1 0 2 9 0 0
7 0 0 0 0 0 0 0 8
0 0 6 7 0 8 2 0 0
0 0 2 6 0 9 5 0 0
8 0 0 2 0 3 0 0 9
0 0 5 0 1 0 3 0 0"
wherever spaces are present in the string , you can replace it with a '-'
import re
re.sub(r'\s+', '-', sudoku)
Not what you are looking for? Let me know.
This should do the trick:
rows = 'ABCDEFGHI'
cols = '123456789'

def display(values):
    for r in rows:
        if r == "D" or r == "G":
            # Separator line before the 4th and 7th rows
            print '------+-------+------'
        for c in cols:
            if int(c) % 3 == 0 and c != '9':
                print values[r+c] + " |",
            else:
                print values[r+c],
        print
Solution for Python 3:
values = {'A9': '1', 'D8': '9', 'A1': '7', ...}
sortedKeys = sorted(values)
for i in range(0, 9):
    if i != 0 and i % 3 == 0:
        print("- - - + - - - + - - -")
    for j in range(0, 9):
        if j != 0 and j % 3 == 0:
            print("|", end=' ')
        key = sortedKeys[i*9 + j]
        print(values[key], end=' ')
    print()