Python compare and count row values

I'd like to compare two columns row by row and count when a specific value in each row is not correct. For instance:
group      landing_page
control    new_page
control    old_page
treatment  new_page
treatment  old_page
control    old_page
I'd like to count the number of times treatment is not paired with new_page, or control is not paired with old_page. Counting the opposite would also work, i.e. the number of times treatment is paired with new_page.

Use pandas groupby to find the counts of group/landing_page pairs.
Use groupby again to find the group counts.
To find the count of other landing pages within each group, subtract each landing page count from the group count.
import pandas as pd

df = pd.DataFrame({'group': ['control', 'control', 'treatment',
                             'treatment', 'control'],
                   'landing_page': ['new_page', 'old_page', 'new_page',
                                    'old_page', 'old_page']})
# find counts per pairing
df_out = df.groupby(['group', 'landing_page'])['landing_page'].count().to_frame() \
           .rename(columns={'landing_page': 'count'}).reset_index()
# find totals for each group
df_out['grp_total'] = df_out.groupby('group')['count'].transform('sum')
# count of rows in the group with a different landing page
df_out['inverse_count'] = df_out['grp_total'] - df_out['count']
print(df_out)
       group landing_page  count  grp_total  inverse_count
0    control     new_page      1          3              2
1    control     old_page      2          3              1
2  treatment     new_page      1          2              1
3  treatment     old_page      1          2              1
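If only the two specific mismatch counts from the question are needed, they can be pulled out of df_out afterwards (a small sketch building on the frame above):
# keep only pairings the question treats as "not correct", then sum their counts
wrong = df_out[((df_out['group'] == 'treatment') & (df_out['landing_page'] != 'new_page')) |
               ((df_out['group'] == 'control') & (df_out['landing_page'] != 'old_page'))]
print(wrong['count'].sum())  # 2 for the sample data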

This sounds like a job for the zip() function.
First, set up your inputs and counters:
group = ["control", "control", "treatment", "treatment", "control"]
landingPage = ["new_page", "old_page", "new_page", "old_page", "old_page"]
treatmentNotNew = 0
controlNotOld = 0
Then zip the two inputs you are comparing into an iterator of tuples:
zipped = zip(group, landingPage)
Now you can iterate over the tuple values a (group) and b (landing page) while counting each time that treatment != new_page and control != old_page:
for a, b in zipped:
    if((a == "treatment") and (not b == "new_page")):
        treatmentNotNew += 1
    if((a == "control") and (not b == "old_page")):
        controlNotOld += 1
Finally, print your result!
print("treatmentNotNew = " + str(treatmentNotNew))
print("controlNotOld = " + str(controlNotOld))
>> treatmentNotNew = 1
>> controlNotOld = 1
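The same counts can also be produced in a single pass with generator expressions instead of an explicit loop (a small sketch reusing the group and landingPage lists above):
# same results as the loop above
treatmentNotNew = sum(1 for g, p in zip(group, landingPage)
                      if g == "treatment" and p != "new_page")
controlNotOld = sum(1 for g, p in zip(group, landingPage)
                    if g == "control" and p != "old_page")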

I would create a new column with map that maps your desired output given the input. Then you can easily test if the new mapping column equals the landing_page column.
df = pd.DataFrame({
    'group': ['control', 'control', 'treatment', 'treatment', 'control'],
    'landing_page': ['old_page', 'old_page', 'new_page', 'old_page', 'new_page']
})
df['mapping'] = df.group.map({'control': 'old_page', 'treatment': 'new_page'})
(df['landing_page'] != df['mapping']).sum()
# 2
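If a per-group breakdown of the mismatches is also wanted, the boolean comparison can be grouped before summing (a sketch building on the mapping column above):
mismatch = df['landing_page'] != df['mapping']
print(mismatch.groupby(df['group']).sum())
# control      1
# treatment    1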

Related

how to change the iterrows method to apply

I have this code, which runs over around 60k rows. It takes around 4 hours to complete the whole process. That is not feasible, and I want to use apply instead of iterrows because of the time constraint.
Here is the code,
all_merged_k = pd.DataFrame(columns=all_merged_f.columns)
for index, row in all_merged_f.iterrows():
    if (row['route_count'] == 0):
        all_merged_k = all_merged_k.append(row)
    else:
        for i in range(row['route_count']):
            row1 = row.copy()
            row['Route Number'] = i
            row['Route_Broken'] = row1['routes'][i]
            all_merged_k = all_merged_k.append(row)
Basically, what the code does is: if the route count is 0, append the row as-is; otherwise append one row per route, with all values the same except the routes column (which contains a nested list), so the nested routes are broken out into multiple rows. The pieces go into the new columns Route_Broken and Route Number.
Sample of data:
routes                  route_count
[[CHN-IND]]             1
[[CHN-IND],[IND-KOR]]   2
O/P data:
routes                  route_count  Broken_Route  Route Number
[[CHN-IND]]             1            [CHN-IND]     1
[[CHN-IND],[IND-KOR]]   2            [CHN-IND]     1
[[CHN-IND],[IND-KOR]]   2            [IND-KOR]     2
Is it possible with apply? 4 hours is far too long and can't go into production. I really need help with this.
So the code below doesn't work:
df.join(df['routes'].explode().rename('Broken_Route')) \
  .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})
or
(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)
It doesn't work when the index matches; as we can see in the last row, the Route Number should be 1.
Do you expect something like this:
>>> df.join(df['routes'].explode().rename('Broken_Route')) \
      .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})
                   routes  route_count Broken_Route  Route Number
0             [[CHN-IND]]            1    [CHN-IND]             1
1  [[CHN-IND], [IND-KOR]]            2    [CHN-IND]             1
1  [[CHN-IND], [IND-KOR]]            2    [IND-KOR]             2
2                                    0                          1
Setup:
data = {'routes': [[['CHN-IND']], [['CHN-IND'], ['IND-KOR']], ''],
        'route_count': [1, 2, 0]}
df = pd.DataFrame(data)
Update 1: added a record with route_count=0 and routes=''.
You can assign the routes and counts and explode:
(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)
NB: multi-column explode requires pandas ≥ 1.3.0; on older versions use an alternative method.
output:
                   routes  route_count Broken_Route count
0             [[CHN-IND]]            1    [CHN-IND]     0
1  [[CHN-IND], [IND-KOR]]            2    [CHN-IND]     0
1  [[CHN-IND], [IND-KOR]]            2    [IND-KOR]     1
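The count helper above is 0-based, while the question's Route Number starts at 1; a small follow-up step converts it (a sketch, assuming the exploded result above was assigned to a variable named out, which is not shown in the answer):
# rename the 0-based helper column and shift it to the 1-based Route Number
out['Route Number'] = out.pop('count') + 1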

How to re-number strings after sorting a dataframe

Description:
I have a GUI that allows the user to add variables that are displayed in a dataframe. As the variables are added, they are automatically numbered, e.g. 'FIELD_0', 'FIELD_1', etc., and each variable has a value associated with it. The data is row-based rather than column-based: the 'FIELD' ids are in column 0, progressing downwards, and the corresponding value is in column 1 of the same row. As shown below:
0 1
0 FIELD_0 HH_5_MILES
1 FIELD_1 POP_5_MILES
The user is able to reorder these values and move them up/down a row. However, it's important that the number ordering remains sequential. So, if the user positions 'FIELD_1' above 'FIELD_0' then it gets re-numbered appropriately. Example:
0 1
0 FIELD_0 POP_5_MILES
1 FIELD_1 HH_5_MILES
Currently, I'm using the below code to perform this adjustment - this same re-numbering occurs with other variable names within the same dataframe.
import pandas

df = pandas.DataFrame({0: ['FIELD_1', 'FIELD_0']})
variable_list = ['FIELD', 'OPERATOR', 'RESULT']
for var in variable_list:
    field_list = ['%s_%s' % (var, _) for _, field_name in enumerate(df[0].isin([var]))]
    field_count = 0
    for _, field_name in enumerate(df.loc[:, 0]):
        if var in field_name:
            df.loc[_, 0] = field_list[field_count]
            field_count += 1
This gets me the result I want, but it seems a bit inelegant. If there is a better way, I'd love to know what it is.
It appears you're looking to overwrite the Field values so that they always appear in order starting with 0.
We can filter to only the rows whose value str.contains the word FIELD, then assign those rows the values from a list comprehension similar to your field_list.
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
# Select Where Values are Field
m = df[0].str.contains('FIELD')
# Overwrite field with new values by iterating over the total matches
df.loc[m, 0] = [f'FIELD_{n}' for n in range(m.sum())]
print(df)
df:
0
0 FIELD_0
1 OTHER_1
2 FIELD_1
3 OTHER_0
For multiple variables:
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
variable_list = ['FIELD', 'OTHER']
for v in variable_list:
    # Select Where Values are Field
    m = df[0].str.contains(v)
    # Overwrite field with new values by iterating over the total matches
    df.loc[m, 0] = [f'{v}_{n}' for n in range(m.sum())]
df:
0
0 FIELD_0
1 OTHER_0
2 FIELD_1
3 OTHER_1
You can use sort_values as below:
def f(x):
    l = x.split('_')[1]
    return int(l)

df.sort_values(0, key=lambda col: [f(k) for k in col]).reset_index(drop=True)
0
0 FIELD_0
1 FIELD_1
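If the suffix is always an integer, the same key can also be written with the vectorized string accessor instead of a helper function (a sketch on the same frame):
df.sort_values(0, key=lambda col: col.str.split('_').str[1].astype(int)).reset_index(drop=True)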

How can you increase the speed of an algorithm that computes a usage streak?

I have the following problem: I have data (table called 'answers') of a quiz application including the answered questions per user with the respective answering date (one answer per line), e.g.:
UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 lines. Is there any structural 'mistake' in my code that makes it so slow or do processes like this take that amount of time?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following change:
I first filtered the data to students who answered at least one question and set the streak to 0 for the rest (the streak for 0 answers is 0 in every case):
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore I moved the filtered list out of the loop, so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]

for userid, term in zip(appData.userid, appData.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
My attempt borrows some key ideas from the connected-components problem, a fairly early problem when looking at graphs.
First, I create a random DataFrame with some user ids and some dates.
import datetime
import random
import pandas
import numpy

# generate basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.Series(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
                               name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df) / 4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df

def sample_data_frame_v2():  # pandas version < 1.2
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.DataFrame(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()), columns=['date'])
    users = pandas.DataFrame(users, columns=['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1)  # drop the helper key column
    removals = numpy.random.randint(0, len(df), int(len(df) / 4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
Put your DataFrame in sorted order, by user and then by date, so that the next row is the next answer day.
Create two new columns containing the user id and the date of the previous row (via shift).
If the previous row has the same user and its date plus one day equals the current date, set the result column to False (numerically 0); otherwise a new streak starts, so set it to True (numerically 1).
Cumulatively sum the results, which groups your streaks.
Finally, count how many entries exist per group and find the max for each user.
For 10k users with 364 days' worth of answers, my running time is about 1 second.
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop = True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = (df.groupby(by=['user', 'group']).count()['result'].max(level='user'))
summary.sort_values(ascending = False) #print user with highest streak
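To fold this back into the question's setup, the same shift/cumsum idea can be grouped by user and term (a sketch, assuming answers has the userid, term and time columns used in the question's code):
import pandas as pd

# one row per active day, sorted so consecutive days sit next to each other
daily = (answers.assign(date=answers['time'].dt.normalize())
                .drop_duplicates(['userid', 'term', 'date'])
                .sort_values(['userid', 'term', 'date']))
# True where a new streak starts (gap larger than one day, or a new user/term group)
new_streak = daily.groupby(['userid', 'term'])['date'].diff().ne(pd.Timedelta(days=1))
daily['streak_id'] = new_streak.cumsum()
# streak lengths per user/term, then the longest one
highest = (daily.groupby(['userid', 'term', 'streak_id']).size()
                .groupby(['userid', 'term']).max()
                .rename('HighestStreak').reset_index())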

Pandas Parse DataFrame Field and Maintain ID Field

I have a made-up pandas series that I split on a delimiter:
s2 = pd.Series(['2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'])
split = s2.str.split('*')
The general logic to parse this string:
Asterisks are the delimiter
Numbers immediately before asterisks identify the length of the following block
Three indicators
C indicates field names will follow
N indicates new field values will follow
O indicates old field values will follow
Numbers immediately after indicators (tough because they are next to numbers before asterisks) identify how many field names or values will follow
The parsing logic and code works on a single pandas series. Therefore, it is less important to understand that than it is to understand applying the logic/code to a dataframe.
I calculate the number of fields in the string (in this case, the 3 in the second block which is C316):
number_of_fields = int(split[0][1][1:int(split[0][0])])
I apply a lot of list splitting to extract the results I need into three separate lists (field names, new values, and old values):
i = 2
string_length = int(split[0][1][int(split[0][0]):])
field_names_list = []
while i < number_of_fields + 2:
    field_name = split[0][i][0:string_length]
    field_names_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 3 + number_of_fields
string_length = int(split[0][2 + number_of_fields][string_length:])
new_values_list = []
while i < 3 + number_of_fields*2:
    field_name = split[0][i][0:string_length]
    new_values_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 4 + number_of_fields*2
string_length = int(split[0][3 + number_of_fields*2][string_length:])
old_values_list = []
while i <= 3 + number_of_fields*3:
    old_value = split[0][i][0:string_length]
    old_values_list.append(old_value)
    if i == 3 + number_of_fields*3:
        string_length = 0
    else:
        string_length = int(split[0][i][string_length:])
    i += 1
I combine the lists into a df with three columns:
df = pd.DataFrame(
    {'field_name': field_names_list,
     'new_value': new_values_list,
     'old_value': old_values_list
     })
field_name new_value old_value
0 first_field_name field value
1 second_field_name Y
2 third_field_name hello
How would I apply this same process to a df with multiple strings? The df would look like this:
row_id string
0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
I'm unsure how to maintain the row_id with the eventual columns. The end result should look like this:
row_id field_name new_value old_value
0 24 first_field_name field value
1 24 second_field_name Y
2 24 third_field_name hello
3 25 first_field_name field value
4 25 second_field_name Y
5 25 third_field_name hello
I know I can concatenate multiple dataframes, but that would come after maintaining the row_id. How do I keep the row_id with the corresponding values after a series of list slicing operations?
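One way to keep the row_id is to wrap the single-string slicing logic above in a helper, tag each per-row result with its row_id, and concatenate the pieces. Below is a minimal sketch; parse_string is a hypothetical name for that helper, and the sample frame reuses the two rows shown above.
import pandas as pd

def parse_string(s):
    # hypothetical helper: the slicing logic shown above, applied to one raw string
    parts = s.split('*')
    number_of_fields = int(parts[1][1:int(parts[0])])

    # field names
    string_length = int(parts[1][int(parts[0]):])
    field_names = []
    for i in range(2, number_of_fields + 2):
        field_names.append(parts[i][:string_length])
        string_length = int(parts[i][string_length:])

    # new values
    string_length = int(parts[2 + number_of_fields][string_length:])
    new_values = []
    for i in range(3 + number_of_fields, 3 + number_of_fields * 2):
        new_values.append(parts[i][:string_length])
        string_length = int(parts[i][string_length:])

    # old values
    string_length = int(parts[3 + number_of_fields * 2][string_length:])
    old_values = []
    for i in range(4 + number_of_fields * 2, 4 + number_of_fields * 3):
        old_values.append(parts[i][:string_length])
        if i < 3 + number_of_fields * 3:
            string_length = int(parts[i][string_length:])

    return pd.DataFrame({'field_name': field_names,
                         'new_value': new_values,
                         'old_value': old_values})

s = '2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'
df = pd.DataFrame({'row_id': [24, 25], 'string': [s, s]})

# parse each string, tag the result with its row_id, then stack the pieces
out = pd.concat([parse_string(s).assign(row_id=rid)
                 for rid, s in zip(df['row_id'], df['string'])],
                ignore_index=True)
print(out[['row_id', 'field_name', 'new_value', 'old_value']])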

find gene name from liste to dataframe

I need to know whether certain genes appear in my results. To do so, I have one list with my genes' names and a dataframe with similar names:
For example
liste = ["gene1","gene2","gene3","gene4","gene5"]
and a dataframe:
name1 name2
gene1_0035 gene1_0042
gene56_0042 gene56_0035
gene4_0042 gene4_0035
gene2_0035 gene2_0042
gene57_0042 gene57_0035
then I did:
df=pd.read_csv("dataframe_not_max.txt",sep='\t')
df=df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
#print(df)
print(list(df.columns.values))
name1=df.ix[:,1]
name2=df.ix[:,2]
liste=[]
for record in SeqIO.parse(data, "fasta"):
liste.append(record.id)
print(liste)
print(len(liste))
count=0
for a, b in zip(name1, name2):
if a in liste:
count+=1
if b in liste:
count+=1
print(count)
What I want to know is how many times I find a gene from my list in my dataframe, but the IDs are not exactly the same: the list entries do not have the _number suffix after the gene name, so the if a in liste check does not recognize the ID.
Is it possible to say something like:
if a without_number in liste:
In the above example it would be count = 3, because only gene1, gene2 and gene4 are present in both the list and the dataframe.
Here is a more complicated example to see if your script indeed works for my data:
Let's say I have a dataframe such:
cluster_name qseqid sseqid pident_x
15 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0035
16 cluster_016607 EOG090X00GO_0035_0035 EOG090X00GO_0042_0042
18 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0035
19 cluster_016607 EOG090X00GO_0035_0042 EOG090X00GO_0042_0042
29 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0042
30 cluster_015707 EOG090X00LI_0035_0035 EOG090X00LI_0042_0035
34 cluster_015707 EOG090X00LI_0042_0035 g1726.t1_0035_0042
37 cluster_015707 EOG090X00LI_0042_0042 g1726.t1_0035_0042
and a list : ["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
Here I get 6 but I should get 2, because I have only 2 of these sequences in my data (EOG090X00LI and EOG090X00GO).
In fact, I want to count a sequence as present only once, even if the pair is, for example, EOG090X00LI vs seq123454.
I do not know if that is clear?
I used for the example:
df=pd.read_csv("test_busco_augus.csv",sep=',')
#df=df.drop(columns=(['Unnamed: 0', 'Unnamed: 0.1']))
print(df)
print(list(df.columns.values))
name1=df.ix[:,3]
name2=df.ix[:,4]
liste=["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]
print(liste)
#get boolean mask for each column
m1 = name1.str.contains('|'.join(liste))
m2 = name2.str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
Using isin
df.apply(lambda x : x.str.split('_').str[0],1).isin(l).sum(1).eq(2).sum()
Out[923]: 3
Adding value_counts
df.apply(lambda x : x.str.split('_').str[0],1).isin(l).sum(1).value_counts()
Out[925]:
2 3
0 2
dtype: int64
Adjusted for updated OP
find where sum is equal to 1
df.stack().str.split('_').str[0].isin(liste).sum(level=0).eq(1).sum()
2
Old Answer
stack and str accessor
You can use split on '_' to scrape the first portion then use isin to determine membership. I also use stack and all with the parameter level=0 to see if membership is True for all columns
df.stack().str.split('_').str[0].isin(liste).all(level=0).sum()
3
applymap
df.applymap(lambda x: x.split('_')[0] in liste).all(1).sum()
3
sum/all with generators
sum(all(x.split('_')[0] in liste for x in r) for r in df.values)
3
Two many map
sum(map(lambda r: all(map(lambda x: x.split('_')[0] in liste, r)), df.values))
3
I think you need:
#add _ to end of values
liste = [record.id + '_' for record in SeqIO.parse(data, "fasta")]
#liste = ["gene1_","gene2_","gene3_","gene4_","gene5_"]
#get boolean mask for each column
m1 = df['name1'].str.contains('|'.join(liste))
m2 = df['name2'].str.contains('|'.join(liste))
#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
3
EDIT:
liste=["EOG090X00LI","EOG090X00GO","EOG090X00BA"]
#extract the value before _ in each column, remove duplicates and compare with liste
a = name1.str.split('_').str[0].drop_duplicates().isin(liste)
b = name2.str.split('_').str[0].drop_duplicates().isin(liste)
#compare a with b for equality and sum Trues
c = a.eq(b).sum()
print (c)
2
You could convert your dataframe to a series (combining all columns) using stack(), then search for your gene names in liste followed by an underscore _ using Series.str.match():
s = df.stack()
sum([s.str.match(i+'_').any() for i in liste])
Which returns 3
Details:
df.stack() returns the following Series:
0 name1 gene1_0035
name2 gene1_0042
1 name1 gene56_0042
name2 gene56_0035
2 name1 gene4_0042
name2 gene4_0035
3 name1 gene2_0035
name2 gene2_0042
4 name1 gene57_0042
name2 gene57_0035
Since all your genes are followed by an underscore in that series, you just need to see if gene_name followed by _ is in that Series. s.str.match(i+'_').any() returns True if that is the case. Then, you get the sum of True values, and that is your count.
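Another way to express the same idea on the first example's data (a sketch): extract the prefix before the underscore once, then check each list entry against that set.
# count list entries whose prefix appears anywhere in the dataframe
prefixes = set(df.stack().str.split('_').str[0])
liste = ["gene1", "gene2", "gene3", "gene4", "gene5"]
print(sum(g in prefixes for g in liste))  # 3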
