Count consecutive row values but reset count with every 0 in row - python

Within a dataframe, I need to count consecutive row values in column A and write the totals to a new column, column B.
Starting with column A, the script should count consecutive runs of 1s; when a 0 appears, it prints the total count in column B, then resets the count and continues through the remaining data.
Desired outcome:
A  B
0  0
1  0
1  0
1  0
1  0
0  4
0  0
1  0
1  0
0  2
I've tried using .shift() along with various if-statements but have been unsuccessful.

This could be a way to do it. A more elegant solution probably exists.
df['B'] = df['A'].groupby(df['A'].ne(df['A'].shift()).cumsum()).cumsum().shift(fill_value=0) * (df['A'].diff() == -1)
The part df['A'].groupby(df['A'].ne(df['A'].shift()).cumsum()) groups the data by consecutive occurrences of values.
Then we take the cumsum, which computes the running total within each group. We shift the result by one row because you want the count to appear after the group ends. Finally we mask out every row that is not the first row after a run of 1s (i.e. where df['A'].diff() == -1).
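Broken into intermediate steps on the sample data from the question, the one-liner looks like this:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]})
group_id = df['A'].ne(df['A'].shift()).cumsum()  # new id whenever the value changes
run_count = df['A'].groupby(group_id).cumsum()   # running count of 1s within each run
df['B'] = run_count.shift(fill_value=0) * (df['A'].diff() == -1)
```

This reproduces the desired B column (4 at the row after the run of four 1s, 2 at the row after the run of two).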

Here is one way to do it. However, I get the feeling that there might be better ways. But you can try this for now:
The routine function is used to increment the counter variable until it encounters a value of 0 in the A column, at which point it grabs the total count and resets the counter variable.
I use a for-loop to iterate through the A column and append the returned B values to a list.
This list is then inserted into the dataframe.
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]})

def routine(row, c):
    # increment the counter while we see 1s; on a 0, emit the run total and reset
    val = 0
    if row:
        c += 1
    else:
        val = c
        c = 0
    return val, c

B_vals = []
counter = 0
for item in df['A'].values:
    b, counter = routine(item, counter)
    B_vals.append(b)

df['B'] = B_vals
print(df)
OUTPUT:
A B
0 0 0
1 1 0
2 1 0
3 1 0
4 1 0
5 0 4
6 0 0
7 1 0
8 1 0
9 0 2

Related

Python pandas create new column with string code based on boolean rows

I have a dataframe with multiple columns containing booleans/ints (1/0). I need a new result column with strings built from the following: how many runs of consecutive Trues there are, whether the chain is interrupted or not, and from which column to which column the Trues run.
For example this is the following dataframe:
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8 column_9 column_10
0 0 1 0 1 1 1 1 0 0 1
1 0 1 1 0 1 1 1 0 0 1
2 1 1 0 0 0 1 1 0 0 1
3 1 1 1 0 0 0 0 1 1 1
4 1 1 1 0 0 1 0 0 1 1
5 1 1 1 0 0 0 1 1 0 1
6 0 1 1 1 1 1 1 0 1 0
Where the following row for example: 1: [0 1 1 0 1 1 1 0 0 1]
Would result in the code string i2/2-3/c2-c3_c5-c7/6 in column_result, which is built from four segments I can parse somewhere in my code later.
Segment 1:
Where 'i' stands for interrupted; if not interrupted it would be 'c' for consecutive.
The 2 stands for how many runs of 2 or more consecutive Trues were found.
Segment 2:
The lengths of the consecutive groups; in this case the first run has length 2 and the second has length 3.
Segment 3:
The number/id of the column where the first True of each run was found and the column number where its last True was found.
Segment 4:
Just the total count of Trues in the row.
Another example would be the following row: 6: [0 1 1 1 1 1 1 0 1 0]
Would result in code string in the column_result: c1/6/c2-c7/7
The code below is the starter code I used to create the above dataframe with random ints for bools:
import numpy as np
import pandas as pd

def create_custom_result(df: pd.DataFrame) -> pd.Series:
    return df

def create_dataframe() -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:  # create random bool/int values
        df[f'column_{i}'] = np.random.randint(2, size=50)
    df["column_result"] = ''  # add result column
    return df

if __name__ == "__main__":
    df = create_dataframe()
    custom_results = create_custom_result(df=df)
Would someone have any idea how to tackle this? To be honest, I have no idea where to start. I found the following that probably comes closest: count sets of consecutive true values in a column; however, it works down a column, not horizontally across rows. Maybe someone can tell me if I should try np.array stuff, or maybe pandas has some function that can help? I found some groupby functions that work horizontally, but I wouldn't know how to convert that into the string code for the result column. Or should I loop through the dataframe by rows and then build the column_result code in segments?
Thanks in advance!
I tried some things already, looping through the dataframe row by row, but had no idea how to build a new column with the code strings.
I also found this article: pandas groupby .. but wouldn't know how to create new string data per group. Also, almost everything I find groups by a single column rather than across the rows of all columns.
Maybe this code works?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(12, 8)))
df.columns = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"]

def func(df: pd.DataFrame) -> pd.DataFrame:
    result_list = []
    copy = df.copy()
    cumsum = copy.cumsum(axis=1)
    for r, s in cumsum.iterrows():
        count = 0
        last = -1
        interrupted = 0
        consecutive = 0
        consecutives = []
        ranges = []
        for x in s.values:
            count += 1
            if x != 0:
                if x != last:
                    consecutive += 1
                    last = x
                    if consecutive == 2:
                        ranges.append(count - 1)
                elif x == last:
                    if consecutive > 1:
                        interrupted += 1
                        ranges.append(count - 1)
                        consecutives.append(str(consecutive))
                    consecutive = 0
        else:
            if consecutive > 1:
                consecutives.append(str(consecutive))
                ranges.append(count)
        result = f'{interrupted}i/{len(consecutives)}c/{"-".join(consecutives)}/{"_".join([f"c{ranges[i]}-c{ranges[i+1]}" for i in range(0, len(ranges), 2)])}/{last}'
        result_list.append(result.split("/"))
    copy["results"] = pd.Series(["/".join(i) for i in result_list])
    copy[["interrupts_count", "consecutives_count", "consecutives lengths", "consecutives columns ranges", "total"]] = pd.DataFrame(np.array(result_list))
    return copy

result_df = func(df)
Maybe go with a simple class per column that receives a series from the original DataFrame (i.e. sliced vertically) and a new value. From the vertically sliced array, calculate all starting values as fields (start of the consecutive true values, length of the consecutive true values, last value, ...). Finally, using those starting values and each new value, update the fields and prepare the string output.
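The run detection itself can also be sketched per row with itertools.groupby; row_code is a hypothetical helper whose segment rules are inferred from the examples above (runs shorter than 2 are ignored, column numbers are 1-based):

```python
from itertools import groupby

def row_code(values):
    # hypothetical helper: build the result string for one row of 0/1 values
    runs = []  # (start_col, end_col) of each run of 1s with length >= 2, 1-based
    pos = 1
    for key, grp in groupby(values):
        length = len(list(grp))
        if key == 1 and length >= 2:
            runs.append((pos, pos + length - 1))
        pos += length
    prefix = 'i' if len(runs) > 1 else 'c'  # interrupted vs consecutive
    lengths = '-'.join(str(e - s + 1) for s, e in runs)
    cols = '_'.join(f'c{s}-c{e}' for s, e in runs)
    return f'{prefix}{len(runs)}/{lengths}/{cols}/{sum(values)}'
```

For the two example rows this reproduces the expected codes: row 1 gives 'i2/2-3/c2-c3_c5-c7/6' and row 6 gives 'c1/6/c2-c7/7'. Applying it to a dataframe would then be df['column_result'] = df.apply(lambda r: row_code(list(r)), axis=1) over the boolean columns.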

Finding a swing pattern in a pandas dataframe

The title is not the best, but I am not sure how to describe my problem in one line.
The problem I have is that I want to calculate whether some values, say A, B, C, D, occur in order.
A   B   C   D   Total
1   1   0   0   0
0   0   1   0   0
1!  0   0   1   0
0   1!  0   0   0
0   0   1!  0   0
0   0   0   1!  1
In the table above, values of A, B, C, D are calculated individually. They are 1 if they pass a certain threshold and 0 otherwise.
My question is that I want to identify when A = 1 occurs before B = 1, which occurs before C = 1, which occurs before D = 1. In this case, from row 3 onwards (marked with an exclamation point), each value is consecutively 1. However, for A in row 1, I ignore A = 1 because B = 0 in the next row.
I tried implementing a for loop but that takes way too long. I am sure there is a more efficient method. My data is stored in a pandas dataframe.
Try shift and np.prod:
# we shift `A` by 3 rows, `B` by 2, `C` by 1, `D` by 0,
# then take the product of the shifted values
df['Total'] = np.prod([df[col].shift(3 - i, fill_value=0)
                       for i, col in enumerate(['A', 'B', 'C', 'D'])],
                      axis=0)
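Applied to the sample data from the question, only the last row, where the A-B-C-D sequence completes, gets a 1:

```python
import numpy as np
import pandas as pd

# the table from the question, one column per signal
df = pd.DataFrame({'A': [1, 0, 1, 0, 0, 0],
                   'B': [1, 0, 0, 1, 0, 0],
                   'C': [0, 1, 0, 0, 1, 0],
                   'D': [0, 0, 1, 0, 0, 1]})
# Total is 1 only where A was 1 three rows ago, B two rows ago, C one row ago, D now
df['Total'] = np.prod([df[col].shift(3 - i, fill_value=0)
                       for i, col in enumerate(['A', 'B', 'C', 'D'])],
                      axis=0)
```

The lone A = 1 in row 1 is correctly ignored because the B that follows it is 0.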

How to generate a sequence of numbers when encountered a value in python pandas dataframe

sample and expected data
Block one is the current data and block 2 is the expected data; that is, when I encounter a 1, each following row should be incremented by one, and the same should happen for the next country b.
First replace all values after the first 1 with 1, so GroupBy.cumsum can be used:
df = pd.DataFrame({'c':['a']*3 + ['b']*3+ ['c']*3, 'v':[1,0,0,0,1,0,0,0,1]})
s = df.groupby('c')['v'].cumsum()
df['new'] = s.where(s.eq(0), 1).groupby(df['c']).cumsum()
print (df)
c v new
0 a 1 1
1 a 0 2
2 a 0 3
3 b 0 0
4 b 1 1
5 b 0 2
6 c 0 0
7 c 0 0
8 c 1 1
Another solution is to replace all non-1 values with missing values and forward fill the 1s per group; the leading missing values are then replaced with 0, so the cumulative sum also works perfectly:
s = df['v'].where(df['v'].eq(1)).groupby(df['c']).ffill().fillna(0).astype(int)
df['new'] = s.groupby(df['c']).cumsum()
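Both variants produce the same counter on the sample frame; a quick side-by-side check:

```python
import pandas as pd

df = pd.DataFrame({'c': ['a'] * 3 + ['b'] * 3 + ['c'] * 3,
                   'v': [1, 0, 0, 0, 1, 0, 0, 0, 1]})

# variant 1: clip the per-group cumulative sum to 0/1, then cumsum again
s1 = df.groupby('c')['v'].cumsum()
new1 = s1.where(s1.eq(0), 1).groupby(df['c']).cumsum()

# variant 2: keep only the 1s, forward fill per group, fill leading NaNs with 0
s2 = df['v'].where(df['v'].eq(1)).groupby(df['c']).ffill().fillna(0).astype(int)
new2 = s2.groupby(df['c']).cumsum()
```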

Finding the count of letters in each column

I need to find the count of letters in each column as follows:
String: ATCG
TGCA
AAGC
GCAT
string is a series.
I need to write a program to get the following:
0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1
I have written the following code, but I am getting a row at index 0 and a column at the end (column index 450, i.e. the 451st column) with NaN values. I should be getting neither that row nor column 451; I need to have only 450 columns.
f = zip(*string)
counts = [{letter: column.count(letter) for letter in column} for column in f]
counts = pd.DataFrame(counts).transpose()
print(counts)
counts = counts.drop(counts.columns[[450]], axis=1)
Can anyone please help me understand the issue?
Here is one way you can implement your logic. If required, you can turn your series into a list via lst = s.tolist().
lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']
arr = [[i.count(x) for i in zip(*lst)] for x in ('ATCG')]
res = pd.DataFrame(arr, index=list('ATCG'))
Result
0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1
Explanation
In the list comprehension, deal with columns first by iterating the first, second, third and fourth elements of each string sequentially.
Deal with rows second by iterating through 'ATCG' sequentially.
This produces a list of lists which can be fed directly into pd.DataFrame.
With Series.value_counts():
>>> s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])
>>> s.str.join('|').str.split('|', expand=True)\
... .apply(lambda row: row.value_counts(), axis=0)\
... .fillna(0.)\
... .astype(int)
0 1 2 3
A 2 1 1 1
C 0 1 2 1
G 1 1 1 1
T 1 1 0 1
I'm not sure how logically you want to order the index, but you could call .reindex() or .sort_index() on this result.
The first line, s.str.join('|').str.split('|', expand=True) gets you an "expanded" version
0 1 2 3
0 A T C G
1 T G C A
2 A A G C
3 G C A T
which should be faster than calling pd.Series(list(x)) ... on each row.
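A third option is a collections.Counter per column; the strip is an assumption about where a stray 451st column could come from (e.g. trailing newlines left on the strings when they were read from a file):

```python
from collections import Counter
import pandas as pd

s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])
# strip first as a guard against stray trailing characters such as '\n'
columns = zip(*s.str.strip())
counts = (pd.DataFrame([Counter(col) for col in columns])
            .T.fillna(0).astype(int)
            .reindex(list('ATCG')))  # fix the row order
```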

Python: Grouping columns and counting

I have a file with 13 columns and I am looking to perform some grouping tasks. The input looks like so:
A B C D E F G H I J K L M
0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 1 1
Excluding column A, the grouping is to be done as follows, producing five new columns; the columns J, K, L, M will be merged into one as a special case.
A,B > new column; D,E > new column
B C Result
1 0 1
0 1 1
1 1 1
0 0 0
If either of the two columns has a "1" in it, or both do, I want to count it as 1. Right now I have written this little snippet but I am not sure how to proceed.
from collections import Counter

with open("datagroup.txt") as inFile:
    print Counter([" ".join(line.split()[::2]) for line in inFile])
* Edit *
A B&C D&E F&G H&I J,K,L,M
1 1 0 0 1 1
1 1 0 0 0 1
0 1 0 0 1 0
1 0 0 0 0 1
0 1 0 1 1 1
1 0 0 0 0 1
Basically what I want to do is exclude the first column and then compare every two columns after that, up to column J: if either column has a "1" present, I want to report that as "1"; even if both columns have "1" I would still report "1". For the last four columns, namely J, K, L, M, if I see a "1" in any of the four, it should be reported as "1".
First, you're obviously going to have to iterate over the rows in some way to do something for each row.
Second, I have no idea what you're trying to do with the [::2], since that will just give you all the even columns, or what the Counter is for in the first place, or why specifically you're trying to count strings made up of a bunch of concatenated columns.
But I think what you want is this:
with open("datagroup.txt") as inFile:
    for row in inFile:
        columns = row.split()
        outcolumns = []
        outcolumns.append(columns[0])  # A
        # pair up B..I two at a time, then treat J,K,L,M as a single group
        for group in list(zip(columns[1:-4:2], columns[2:-4:2])) + [tuple(columns[-4:])]:
            outcolumns.append('1' if '1' in group else '0')
        print(' '.join(outcolumns))
You can make this a lot more concise with a bit of itertools and comprehensions, but I wanted to keep this verbose and simple so you'd understand it.
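For testing without a file, the per-line logic can be packed into a small helper (collapse is a made-up name for this sketch):

```python
def collapse(line):
    # collapse one whitespace-separated row of 13 columns into 6:
    # A unchanged, B..I paired two at a time, J..M merged into one group
    cols = line.split()
    groups = list(zip(cols[1:-4:2], cols[2:-4:2])) + [tuple(cols[-4:])]
    return ' '.join([cols[0]] + ['1' if '1' in g else '0' for g in groups])
```

For example, collapse("0 0 0 0 0 0 0 0 0 0 0 0 1") gives "0 0 0 0 0 1", matching the first row of the edited example table.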

Categories

Resources