The title is not the best but I am not sure how to describe my problem in a line.
The problem I have is that I want to calculate whether some values, say A, B, C, D, occur in order.
A   B   C   D   Total
1   1   0   0   0
0   0   1   0   0
1!  0   0   1   0
0   1!  0   0   0
0   0   1!  0   0
0   0   0   1!  1
In the table above, the values of A, B, C, D are calculated individually: each is 1 if it passes a certain threshold and 0 otherwise.
I want to identify when A = 1 occurs, followed by B = 1 in the next row, then C = 1, then D = 1. In this case, from row 3 onwards (marked with exclamation points), each value is 1 in consecutive rows, so Total is 1 on the row where D = 1. However, I ignore A = 1 in row 1 because B = 0 in the next row.
I tried implementing a for loop but that takes way too long. I am sure there is a more efficient method. My data is stored in a pandas dataframe.
Try shift and np.prod:

import numpy as np

# shift `A` by 3 rows, `B` by 2, `C` by 1, `D` by 0,
# then take the product of the shifted values
df['Total'] = np.prod([df[col].shift(3 - i, fill_value=0)
                       for i, col in enumerate(['A', 'B', 'C', 'D'])],
                      axis=0)
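A minimal, self-contained check on the sample data above (a sketch; it assumes the frame is named df as in the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, 0, 0, 0],
                   'B': [1, 0, 0, 1, 0, 0],
                   'C': [0, 1, 0, 0, 1, 0],
                   'D': [0, 0, 1, 0, 0, 1]})

# after shifting, a valid A -> B -> C -> D run lines up on D's row
df['Total'] = np.prod([df[col].shift(3 - i, fill_value=0)
                       for i, col in enumerate(['A', 'B', 'C', 'D'])],
                      axis=0)
print(df['Total'].tolist())  # [0, 0, 0, 0, 0, 1]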
Within a dataframe, I need to count and sum consecutive row values in column A into a new column, column B.
Starting with column A, the script should count the consecutive runs of 1s; when a 0 appears, it writes the total count into column B, resets the count, and continues through the remaining data.
Desired outcome:
A | B
0 0
1 0
1 0
1 0
1 0
0 4
0 0
1 0
1 0
0 2
I've tried using .shift() along with various if-statements but have been unsuccessful.
This could be one way to do it; there probably exists a more elegant solution:

df['B'] = (df['A'].groupby(df['A'].ne(df['A'].shift()).cumsum())
           .cumsum().shift(fill_value=0) * (df['A'].diff() == -1))

The df['A'].groupby(df['A'].ne(df['A'].shift()).cumsum()) part groups the data by consecutive runs of equal values.
Then we take the cumsum, which gives the cumulative sum within each run. We shift the result by one row because you want the count on the row after the run, and finally we mask out every row that is not the first row following a run of 1s.
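Broken into steps, the same computation looks like this (a small sketch on the sample data from the question):

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]})

# label each run of consecutive equal values
group_id = df['A'].ne(df['A'].shift()).cumsum()
# running count within each run (counts 1s, stays 0 in runs of 0s)
run_count = df['A'].groupby(group_id).cumsum()
# move the count onto the row after the run, keeping it only where
# A just dropped from 1 to 0
df['B'] = run_count.shift(fill_value=0) * (df['A'].diff() == -1)
print(df['B'].tolist())  # [0, 0, 0, 0, 0, 4, 0, 0, 0, 2]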
Here is one way to do it, although I get the feeling that there might be better ways. You can try this for now:
The routine function is used to increment the counter variable until it encounters a 0 in the A column, at which point it returns the total count and resets the counter.
I use a for loop to iterate through the A column and append the returned B values to a list.
This list is then inserted into the dataframe.
df = pd.DataFrame({"A":[0,1,1,1,1,0,0,1,1,0]})
def routine(row, c):
val = 0
if row:
c += 1
else:
val = c
c = 0
return(val, c)
B_vals = []
counter = 0
for item in df['A'].values:
b, counter = routine(item, counter)
B_vals.append(b)
df['B'] = B_vals
print(df)
OUTPUT:
A B
0 0 0
1 1 0
2 1 0
3 1 0
4 1 0
5 0 4
6 0 0
7 1 0
8 1 0
9 0 2
(The original post showed the sample and the expected data as an image.)
The first block is the current data and the second block is the expected data. That is, once I encounter a 1, each following row should be incremented by one, and the same should happen for the next country, b.
First replace all values after the first 1 in each group with 1, so it is possible to use GroupBy.cumsum:
import pandas as pd

df = pd.DataFrame({'c': ['a']*3 + ['b']*3 + ['c']*3,
                   'v': [1, 0, 0, 0, 1, 0, 0, 0, 1]})
s = df.groupby('c')['v'].cumsum()
df['new'] = s.where(s.eq(0), 1).groupby(df['c']).cumsum()
print(df)
c v new
0 a 1 1
1 a 0 2
2 a 0 3
3 b 0 0
4 b 1 1
5 b 0 2
6 c 0 0
7 c 0 0
8 c 1 1
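To see the intermediate step: s is the per-group cumulative sum, and the where call clamps every nonzero value to 1 (a quick check on the sample data; with several 1s in a group, s would grow past 1, which is why the clamp is needed):

s = df.groupby('c')['v'].cumsum()
print(s.tolist())                    # [1, 1, 1, 0, 1, 1, 0, 0, 1]
print(s.where(s.eq(0), 1).tolist())  # [1, 1, 1, 0, 1, 1, 0, 0, 1]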
Another solution is to replace all non-1 values with missing values and forward fill the 1s per group; the leading missing values are then replaced with 0, so the cumulative sum also works perfectly:
s = df['v'].where(df['v'].eq(1)).groupby(df['c']).ffill().fillna(0).astype(int)
df['new'] = s.groupby(df['c']).cumsum()
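For the sample frame above, the intermediate series s (the forward-filled 0/1 flag per group) comes out as follows, and its per-group cumulative sum reproduces the new column shown earlier:

s = df['v'].where(df['v'].eq(1)).groupby(df['c']).ffill().fillna(0).astype(int)
print(s.tolist())  # [1, 1, 1, 0, 1, 1, 0, 0, 1]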
I need to find the count of letters in each column, as follows. string is a Series containing:

ATCG
TGCA
AAGC
GCAT
I need to write a program to get the following:
0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1
I have written the following code, but I am getting an extra row at index 0 and an extra column at the end (column index 450, i.e. column number 451) filled with NaN values. I should not be getting either of them; I need to have only 450 columns.
f = zip(*string)
counts = [{letter: column.count(letter) for letter in column} for column in f]
counts = pd.DataFrame(counts).transpose()
print(counts)
counts = counts.drop(counts.columns[[450]], axis=1)
Can anyone please help me understand the issue?
Here is one way you can implement your logic. If required, you can turn your series into a list via lst = s.tolist().
lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']
arr = [[i.count(x) for i in zip(*lst)] for x in ('ATCG')]
res = pd.DataFrame(arr, index=list('ATCG'))
Result
0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1
Explanation
The inner loop deals with the columns: zip(*lst) yields the first, second, third and fourth characters of each string as tuples.
The outer loop deals with the rows by iterating through 'ATCG' sequentially.
This produces a list of lists which can be fed directly into pd.DataFrame.
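To make the transposition concrete, here is what zip(*lst) yields on the sample data:

lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']
print(list(zip(*lst)))
# [('A', 'T', 'A', 'G'), ('T', 'G', 'A', 'C'),
#  ('C', 'C', 'G', 'A'), ('G', 'A', 'C', 'T')]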
With Series.value_counts():
>>> s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])
>>> s.str.join('|').str.split('|', expand=True)\
...  .apply(lambda col: col.value_counts(), axis=0)\
... .fillna(0.)\
... .astype(int)
0 1 2 3
A 2 1 1 1
C 0 1 2 1
G 1 1 1 1
T 1 1 0 1
I'm not sure how logically you want to order the index, but you could call .reindex() or .sort_index() on this result.
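For instance, a one-line sketch (result being a hypothetical name for the frame produced above):

result = result.reindex(list('ATCG'))  # restore the ATCG row order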
The first line, s.str.join('|').str.split('|', expand=True) gets you an "expanded" version
0 1 2 3
0 A T C G
1 T G C A
2 A A G C
3 G C A T
which should be faster than calling pd.Series(list(x)) ... on each row.
My system
Windows 7, 64 bit
Python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value of each row and append that as a new column. I would also like to add another column to the existing dataframe containing the name of the column where that maximum value is found.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the "add another column containing the name of the column" part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask to mask the columns and join them:
In [183]:
df['maxcol'] = df.loc[:, :'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1)
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
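df.ix has since been removed from pandas, so here is a self-contained sketch of the same idea that runs on current versions, using .eq and a boolean mask over the value columns:

import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 1, 3],
                   'b': [0, 0, 1, 0, 1],
                   'c': [0, 0, 0, 0, 0]})
df['maxval'] = df[['a', 'b', 'c']].max(axis=1)

# mark cells equal to the row maximum, then join matching column names
mask = df[['a', 'b', 'c']].eq(df['maxval'], axis=0)
df['maxcol'] = mask.apply(lambda row: ','.join(row.index[row]), axis=1)
print(df['maxcol'].tolist())  # ['a', 'a,b,c', 'b', 'a', 'a']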
I have a file with 13 columns and I am looking to perform some grouping tasks. The input looks like so:
A B C D E F G H I J K L M
0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 1 1
Excluding column A, the grouping is to be done as follows, producing five new columns; the columns J, K, L, M will be merged into one as a special case:
B,C > new column; D,E > new column; and so on
B C Result
1 0 1
0 1 1
1 1 1
0 0 0
If either of the two columns (or both) has a 1 in it, I want to count it as 1. Right now I have written this little snippet, but I am not sure how to proceed.
from collections import Counter

with open("datagroup.txt") as inFile:
    print(Counter([" ".join(line.split()[::2]) for line in inFile]))
* Edit *
A B&C D&E F&G H&I J,K,L,M
1 1 0 0 1 1
1 1 0 0 0 1
0 1 0 0 1 0
1 0 0 0 0 1
0 1 0 1 1 1
1 0 0 0 0 1
Basically, I want to exclude the first column and then compare every two columns after that, up to column J: if either column has a 1 present, I want to report that as 1, and even if both columns have a 1 I would still report that as 1. For the last four columns, namely J, K, L, M, if I see a 1 in any of the four, it should be reported as 1.
First, you're obviously going to have to iterate over the rows in some way to do something for each row.
Second, I have no idea what you're trying to do with the [::2], since that will just give you all the even columns, or what the Counter is for in the first place, or why you're specifically trying to count strings made up of a bunch of concatenated columns.
But I think what you want is this:
with open("datagroup.txt") as inFile:
for row in inFile:
columns = row.split()
outcolumns = []
outcolumns.append(columns[0]) # A
for group in zip(columns[1:-4:2], columns[2:-4:2])+columns[-4:]:
outcolumns.append('1' if '1' in group else '0')
print(' '.join(outcolumns))
You can make this a lot more concise with a bit of itertools and comprehensions, but I wanted to keep this verbose and simple so you'd understand it.
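For what it's worth, a slightly more compact sketch of the same pairing logic (same assumptions about the whitespace-separated layout; a header line would need to be skipped first):

with open("datagroup.txt") as inFile:
    for row in inFile:
        cols = row.split()
        pairs = zip(cols[1:-4:2], cols[2:-4:2])  # (B,C), (D,E), (F,G), (H,I)
        out = ['1' if '1' in p else '0' for p in pairs]
        out.append('1' if '1' in cols[-4:] else '0')  # J,K,L,M as one group
        print(' '.join([cols[0]] + out))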