Similar time submissions for tests - python

I have a dataframe covering all of my online students, shaped like this:
df=pd.DataFrame(columns=["Class","Student","Challenge","Time"])
I create a temporary df for each of the challenges like this:
for ch in challenges:
    df = main_df.loc[main_df['Challenge'] == ch]
I want to group the records that are within 5 minutes of each other. The idea is to build a df that shows a group of students working together and answering the questions at roughly the same time. I have other ways of catching cheaters, but I need to be able to show that two or more students have been answering the questions at roughly the same time.
I was thinking of using resampling or a grouper method, but I'm a bit confused about how to implement it. Could someone point me in the right direction for solving this?
Sample df:
df = pd.DataFrame(
    data={
        "Class": ["A"] * 4 + ["B"] * 2,
        "Student": ['Scooby', 'Daphne', 'Shaggy', 'Fred', 'Velma', 'Scrappy'],
        "Challenge": ["Challenge3"] * 6,
        "Time": ['07/10/2022 08:22:44', '07/10/2022 08:27:22', '07/10/2022 08:27:44',
                 '07/10/2022 08:29:55', '07/10/2022 08:33:14', '07/10/2022 08:48:44'],
    }
)
EDIT:
For the output I was thinking of having another df with an extra column called 'Grouping' holding an incremental number for each group that was discovered.
The final df would start out empty and then be concatenated with each grouping df:
new_df = pd.DataFrame(columns=['Class','Student','Challenge','Time','Grouping'])
new_df = pd.concat([new_df, df])
Class Student Challenge Time Grouping
1 A Daphne Challenge3 07/10/2022 08:27:22 1
2 A Shaggy Challenge3 07/10/2022 08:27:44 1
3 A Fred Challenge3 07/10/2022 08:29:55 1
The purpose of this is so that I may have multiple samples to verify if the same two or more students are sharing answers.
EDIT 2:
I was thinking I could do a lambda-based operation. What if I created another column called "Threshold_Delta" that calculates the time difference from one answer to the next? Then I need to figure out how to group those deltas up.
df['Threshold_Delta'] = df['Time'] - df['Time'].shift()
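A minimal sketch of that grouping step (assuming df is sorted by 'Time', 'Time' is already datetime, and group numbers start at 1; note that this chains rows by consecutive gaps, which is looser than the pairwise 5-minute window enforced in the answer below):
df = df.sort_values('Time')
df['Threshold_Delta'] = df['Time'].diff()
# start a new group whenever the gap to the previous submission exceeds 5 minutes
df['Grouping'] = (df['Threshold_Delta'] > pd.Timedelta(minutes=5)).cumsum() + 1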

This solution implements a rolling time list. The logic is this: keep adding entries as long as all the entries in the list are within the time window. When you detect that some of the entries are not in the time window relative to the most recent (to-be-added) entry, that indicates you have a unique (possible) cheater group. At that point write out that group as a list, remove the out-of-window entries, and add the newest entry. At the end there may be a set of entries left on the list; dump_last() allows that set to be pulled. Once you have these lists, there is a whole range of cheater-detecting analytics you could perform.
Make sure the Time column is datetime:
df['Time'] = pd.to_datetime(df['Time'])
The RollingTimeList class definition
class RollingTimeList:
    def __init__(self):
        self.cur = pd.DataFrame(columns=['Student', 'Time'])
        self.window = pd.Timedelta(minutes=5)

    def __add(self, row):
        idx = self.cur.index.max()
        # idx == idx is False when the frame is empty (idx is NaN), so start at 0
        new_idx = idx + 1 if idx == idx else 0
        self.cur.loc[new_idx] = row[['Student', 'Time']]

    def handle_row(self, row):
        rc = None
        if len(self.cur) > 0:
            window_mask = (row['Time'] - self.cur['Time']).abs() <= self.window
            if not window_mask.all():
                # something fell out of the window: emit the current group (if it
                # has more than one member) and drop the out-of-window entries
                if len(self.cur) > 1:
                    rc = self.cur['Student'].to_list()
                self.cur = self.cur.loc[window_mask]
        self.__add(row)
        return rc

    def dump_last(self):
        rc = None
        if len(self.cur) > 1:
            rc = self.cur['Student'].to_list()
        self.cur = self.cur[0:0]
        return rc
Instantiate and apply the class
rolling_list = RollingTimeList()
s = df.apply(rolling_list.handle_row, axis=1)
idx = s.index.max()
s.loc[idx+1 if idx==idx else 0] = rolling_list.dump_last()
print(s.dropna())
Result
3 [Scooby, Daphne, Shaggy]
4 [Daphne, Shaggy, Fred]
5 [Fred, Velma]
Merge that back into the original data frame. Note that handle_row emits a group on the row that arrives after the group is broken, so shift(-1) lines each list up with the last member of that group.
df['window_groups'] = s
df['window_groups'] = df['window_groups'].shift(-1).fillna('')
print(df)
Result
Class Student Challenge Time window_groups
0 A Scooby Challenge3 2022-07-10 08:22:44
1 A Daphne Challenge3 2022-07-10 08:27:22
2 A Shaggy Challenge3 2022-07-10 08:27:44 [Scooby, Daphne, Shaggy]
3 A Fred Challenge3 2022-07-10 08:29:55 [Daphne, Shaggy, Fred]
4 B Velma Challenge3 2022-07-10 08:33:14 [Fred, Velma]
5 B Scrappy Challenge3 2022-07-10 08:48:44
Visualization idea: cross reference table
dfd = pd.get_dummies(s.dropna().apply(pd.Series).stack()).groupby(level=0).sum()
xrf = dfd.T.dot(dfd)
print(xrf)
Result
Daphne Fred Scooby Shaggy Velma
Daphne 2 1 1 2 0
Fred 1 2 0 1 1
Scooby 1 0 1 1 0
Shaggy 2 1 1 2 0
Velma 0 1 0 0 1
Or as Series of unique student combinations and counts
combos = xrf.stack()
unique_combo_mask = combos.index.get_level_values(0)<combos.index.get_level_values(1)
print(combos[unique_combo_mask])
Result
Daphne Fred 1
Scooby 1
Shaggy 2
Velma 0
Fred Scooby 0
Shaggy 1
Velma 1
Scooby Shaggy 1
Velma 0
Shaggy Velma 0
Also from here, you could decide that within some kind of time constraint (or other constraint) a pair's count really should be no more than one, because the lists overlap. In this example, Daphne and Shaggy probably should not be counted twice: they didn't really come within 5 minutes of each other on two separate occasions. In that case, anything greater than one could be capped at one before being accumulated into a larger pool.
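A small sketch of that capping step (hypothetical: clip this challenge's pair counts at 1 before adding them into an overall tally kept across challenges):
capped = combos[unique_combo_mask].clip(upper=1)
# overall = overall.add(capped, fill_value=0)   # accumulate across challenges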


how to change the iterrows method to apply

I have this code, in which I have around 60k rows. It takes around 4 hours to complete the whole process. This code is not feasible and I want to use apply instead of iterrows because of the time constraints.
Here is the code,
all_merged_k = pd.DataFrame(columns=all_merged_f.columns)
for index, row in all_merged_f.iterrows():
    if row['route_count'] == 0:
        all_merged_k = all_merged_k.append(row)
    else:
        for i in range(row['route_count']):
            row1 = row.copy()
            row['Route Number'] = i
            row['Route_Broken'] = row1['routes'][i]
            all_merged_k = all_merged_k.append(row)
Basically, what the code does is this: if the route count is 0, append the row as is; otherwise append route_count copies of the row, all with the same values except for the routes column (which contains a nested list), breaking it into multiple rows and putting the pieces into new columns called Route_Broken and Route Number.
Sample of data:
routes route_count
[[CHN-IND]] 1
[[CHN-IND],[IND-KOR]] 2
O/P data:
routes route_count Broken_Route Route Number
[[CHN-IND]] 1 [CHN-IND] 1
[[CHN-IND],[IND-KOR]] 2 [CHN-IND] 1
[[CHN-IND],[IND-KOR]] 2 [IND-KOR] 2
Can this be done using apply? 4 hours is far too long and can't be put into production. I really need help with this, please.
So the code below doesn't work:
df.join(df['routes'].explode().rename('Broken_Route')) \
.assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})
or
(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)
It doesn't work when the index matches; we can see in the last row that the Route Number should be 1.
Are you expecting something like this:
>>> df.join(df['routes'].explode().rename('Broken_Route')) \
.assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})
routes route_count Broken_Route Route Number
0 [[CHN-IND]] 1 [CHN-IND] 1
1 [[CHN-IND], [IND-KOR]] 2 [CHN-IND] 1
1 [[CHN-IND], [IND-KOR]] 2 [IND-KOR] 2
2 0 1
Setup:
data = {'routes': [[['CHN-IND']], [['CHN-IND'], ['IND-KOR']], ''],
        'route_count': [1, 2, 0]}
df = pd.DataFrame(data)
Update 1: added a record with route_count=0 and routes=''.
You can assign the routes and counts and explode:
(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)
NB. multi-column explode requires pandas ≥1.3.0, if older use this method
output:
routes route_count Broken_Route count
0 [[CHN-IND]] 1 [CHN-IND] 0
1 [[CHN-IND], [IND-KOR]] 2 [CHN-IND] 0
1 [[CHN-IND], [IND-KOR]] 2 [IND-KOR] 1
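A hedged follow-up building on the first suggestion above, in case you want the counter named 'Route Number', starting at 1 and left empty on the rows where route_count is 0 (the name out is just illustrative):
out = df.join(df['routes'].explode().rename('Broken_Route'))
out['Route Number'] = (out.groupby(level=0).cumcount().add(1)
                          .where(out['route_count'].gt(0).to_numpy()))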

Pandas very slow query

I have the following code, which reads a CSV file and then analyzes it. One patient has more than one illness, and I need to find how many times each illness is seen across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 minutes. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass",
        "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis",
        "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
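A tiny illustration of that difference (the division by 12 in the original code presumably compensates for the frame having 12 columns):
matches = raw_data[raw_data['Finding Labels'].str.contains(ctr)]
n_rows = len(matches)      # number of matching rows
n_cells = matches.size     # rows * columns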
Finally, another source of error in your current code was that you were not keeping the count for every patient id: you needed to increment illnesses.at[index, "Count_of_Patientes_Having"], not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
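For completeness: if the labels really are '|'-separated as in the sample ("IllnessA|IllnessB"), a hedged, fully vectorized sketch that avoids the loops entirely would be:
dummies = raw_data['Finding Labels'].str.get_dummies(sep='|')            # one 0/1 column per illness
image_counts = dummies.sum()                                             # images showing each illness
patient_counts = dummies.groupby(raw_data['Patient ID']).max().sum()     # distinct patients per illness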
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which involves repeated in-memory copying. Usually this is reminiscent of general-purpose Python rather than pandas programming, which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) with the list of illnesses, lining up Finding Labels with each illness in the same row, then keeping the rows where the longer string contains the illness. Then run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
            )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                             np.core.defchararray.add(
                                 np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                 np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                         })
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2

How to create grouped piechart

I have this pandas dataframe:
df =
GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0
I need to create a pie chart (using Python or R). The size of each pie should correspond to the proportional count (i.e. the percent) of rows with a particular GROUP. Moreover, each pie should be divided into 2 sub-parts corresponding to the percent of rows with MARK==1 and MARK==0 within the given GROUP.
I was googling for this type of pie chart and found this one. But that example seems overcomplicated for my case. Another good example is done in JavaScript, which doesn't work for me because of the language.
Can somebody tell me what this type of pie chart is called, and where I can find some example code in Python or R?
Here is a solution in R that uses base R only. Not sure how you want to arrange your pies, but I used par(mfrow=...).
df <- read.table(text=" GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0", header=TRUE)

plot_pie <- function(x, multiplier=1, label){
  pie(table(x), radius=multiplier * length(x), main=label)
}

par(mfrow=c(1,3), mar=c(0,0,2,0))
invisible(lapply(split(df, df$GROUP), function(x){
  plot_pie(x$MARK, label=unique(x$GROUP), multiplier=0.2)
}))
This is the result:
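For the Python side, a rough matplotlib sketch of the same idea (one pie per GROUP, radius scaled by the group's share of rows; a sketch only, not tested against your real data):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"GROUP": ["ABC"] * 3 + ["DEF"] * 4 + ["XXX"],
                   "MARK": [1, 0, 1, 1, 1, 1, 1, 0]})

sizes = df.groupby("GROUP").size()
fig, axes = plt.subplots(1, len(sizes), figsize=(9, 3))
for ax, (name, sub) in zip(axes, df.groupby("GROUP")):
    marks = sub["MARK"].value_counts()
    ax.pie(marks, labels=marks.index.astype(str), radius=sizes[name] / sizes.max())
    ax.set_title(name)
plt.show()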

Python 2.7, comparing 3 columns of a CSV

What is the easiest/simplest way to iterate through a large CSV file in Python 2.7, comparing 3 columns?
I am a total beginner and have only completed a few online courses, I have managed to use CSV reader to do some basic stats on the CSV file, but nothing comparing groups within each other.
The data is roughly set up as follows:
Group sub-group processed
1 a y
1 a y
1 a y
1 b
1 b
1 b
1 c y
1 c y
1 c
2 d y
2 d y
2 d y
2 e y
2 e
2 e
2 f y
2 f y
2 f y
3 g
3 g
3 g
3 h y
3 h
3 h
Everything belongs to a group, but within each group are sub-groups of 3 rows (replicates). As we work through samples we will be adding to the processed column, but we don't always do the full complement, so sometimes only 1 or 2 out of the potential 3 will be processed.
I'm trying to work towards a statistic showing the % completeness of each group, with a sub-group counting as "complete" if it has at least 1 row processed (it doesn't have to have all 3).
I've managed to get halfway there, by using the following:
for row in reader:
    group, subgroup, processed = row
    all_groups[group] = all_groups.get(group, 0) + 1
    if not processed == "":
        processed_groups[group] = processed_groups.get(group, 0) + 1

result = {}
for group in (processed_groups.viewkeys() | all_groups.keys()):
    if group in processed_groups: result.setdefault(group, []).append(processed_groups[group])
    if group in processed_groups: result.setdefault(group, []).append(all_groups[group])

for group, v1 in result.items():
    todo = float(v1[0])
    done = float(v1[1])
    progress = round((100 / done * todo), 2)
    print group, "--", progress, "%"
The problem with the above code is it doesn't take into account the fact that some sub-groups may not be totally processed. As a result, the statistic will never read as 100% unless the processed column is always complete.
What I get:
Group 1 -- 55.56%
Group 2 -- 77.78%
Group 3 -- 16.67%
What I want:
Group 1 -- 66.67%
Group 2 -- 100%
Group 3 -- 50%
How would you make it so that it just looks to see whether the first row of each sub-group is complete, uses just that, and then continues on to the next sub-group?
One way to do this is with a couple of defaultdict of sets. The first keeps track of all of the subgroups seen, the second keeps track of those subgroups that have been processed. Using a set simplifies the code somewhat, as does using a defaultdict when compared to using a standard dictionary (although it's still possible).
import csv
from collections import defaultdict

subgroups = defaultdict(set)
processed_subgroups = defaultdict(set)

with open('data.csv') as csvfile:
    for group, subgroup, processed in csv.reader(csvfile):
        subgroups[group].add(subgroup)
        if processed == 'y':
            processed_subgroups[group].add(subgroup)

for group in sorted(processed_subgroups):
    print("Group {} -- {:.2f}%".format(group, (len(processed_subgroups[group]) / float(len(subgroups[group])) * 100)))
Output
Group 1 -- 66.67%
Group 2 -- 100.00%
Group 3 -- 50.00%
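For comparison, a hedged pandas sketch of the same statistic (column names and the read_csv call are assumptions to adapt to your file; a sub-group counts as complete if any of its rows is marked 'y'):
import pandas as pd

df = pd.read_csv('data.csv', names=['Group', 'Subgroup', 'Processed'])  # adjust header/names to your file
pct = (df['Processed'].eq('y')
         .groupby([df['Group'], df['Subgroup']]).any()   # complete per sub-group
         .groupby(level='Group').mean()                  # fraction of complete sub-groups
         .mul(100).round(2))
print(pct)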

Merging DataFrames on multiple conditions - not specifically on equal values

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have already tried.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22), as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]

    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) & (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
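One further hedged idea, since merging on chromosome alone can create a very large intermediate frame with 6.8 million SNPs: do an interval lookup per chromosome with pd.IntervalIndex. This sketch assumes gene intervals within a chromosome do not overlap (overlapping genes make get_indexer raise, in which case the merge-and-filter approach above, or get_indexer_non_unique, is needed):
import pandas as pd

pieces = []
for chrom, genes in gene_df.groupby('chromosome'):
    snps = snp_df[snp_df['chromosome'] == chrom]
    intervals = pd.IntervalIndex.from_arrays(genes['chr_start'], genes['chr_stop'], closed='both')
    hits = intervals.get_indexer(snps['BP'])            # -1 where a SNP falls in no gene
    matched = snps.loc[hits >= 0, ['SNP']].copy()
    matched['feature_id'] = genes['feature_id'].to_numpy()[hits[hits >= 0]]
    pieces.append(matched)

all_genic_snps = pd.concat(pieces, ignore_index=True)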
