I have this pandas dataframe:
df =
GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0
I need to create a pie chart (using Python or R). The size of each pie should correspond to the proportional count (i.e. the percent) of rows with a particular GROUP. Moreover, each pie should be divided into 2 sub-parts corresponding to the percent of rows with MARK==1 and MARK==0 within the given GROUP.
I searched for this type of pie chart and found this one, but that example seems overcomplicated for my case. Another good example is done in JavaScript, which doesn't work for me because of the language.
Can somebody tell me the name of this type of pie chart and where I can find some code examples in Python or R?
Here is a solution in R that uses base R only. Not sure how you want to arrange your pies, but I used par(mfrow=...).
df <- read.table(text=" GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0", header=TRUE)
plot_pie <- function(x, multiplier=1, label){
  pie(table(x), radius=multiplier * length(x), main=label)
}
par(mfrow=c(1,3), mar=c(0,0,2,0))
invisible(lapply(split(df, df$GROUP), function(x){
  plot_pie(x$MARK, label=unique(x$GROUP),
           multiplier=0.2)
}))
The result is one pie per GROUP, each split by MARK.
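If you take the Python route instead, the numbers each pie needs can be computed with pandas before any plotting. This is only a sketch: the names group_share and mark_share are mine, not from the question, and the plotting step itself is left out.

```python
import pandas as pd

df = pd.DataFrame({
    "GROUP": ["ABC"] * 3 + ["DEF"] * 4 + ["XXX"],
    "MARK": [1, 0, 1, 1, 1, 1, 1, 0],
})

# Share of all rows that belongs to each GROUP (drives the size of each pie).
group_share = df["GROUP"].value_counts(normalize=True).sort_index()

# Share of MARK == 0 and MARK == 1 inside each GROUP (the two slices of a pie).
mark_share = pd.crosstab(df["GROUP"], df["MARK"], normalize="index")

print(group_share)
print(mark_share)
```

Each row of mark_share sums to 1, so it can be passed to a pie-drawing call one GROUP at a time.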
I have a dataframe for each of my online students shaped like this:
df=pd.DataFrame(columns=["Class","Student","Challenge","Time"])
I create a temporary df for each of the challenges like this:
for ch in challenges:
    df = main_df.loc[main_df['Challenge'] == ch]
I want to group the records that are within 5 minutes of each other. The idea is to create a df that shows a group of students working together and answering the questions at roughly the same time. I have other ways of catching cheaters, but I need to be able to show that two or more students have been answering the questions at roughly the same time.
I was thinking of using resampling or a grouper method, but I'm a bit confused about how to implement it. Could someone point me in the right direction?
Sample df:
df = pd.DataFrame(
data={
"Class": ["A"] * 4 + ["B"] * 2,
"Student": ['Scooby','Daphne','Shaggy','Fred','Velma','Scrappy'],
"Challenge": ["Challenge3"] *6,
"Time": ['07/10/2022 08:22:44','07/10/2022 08:27:22','07/10/2022 08:27:44','07/10/2022 08:29:55','07/10/2022 08:33:14','07/10/2022 08:48:44'],
}
)
EDIT:
For the output I was thinking of having another df with an extra column called 'Grouping' and have an incremental number for the groups that were discovered.
The final df would first be empty and then be concatenated with the grouping df.
new_df = pd.DataFrame(columns=['Class','Student','Challenge','Time','Grouping'])
new_df = pd.concat([new_df,df])
Class Student Challenge Time Grouping
1 A Daphne Challenge3 07/10/2022 08:27:22 1
2 A Shaggy Challenge3 07/10/2022 08:27:44 1
3 A Fred Challenge3 07/10/2022 08:29:55 1
The purpose of this is so that I may have multiple samples to verify if the same two or more students are sharing answers.
EDIT 2:
I was thinking that I could do a lambda-based operation. What if I created another column called "Challenge_Delta" that calculates the difference in time from one answer to the next? Then I need to figure out how to group those minutes up.
df['Challenge_Delta'] = (df['Time']-df['Time'].shift())
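One way the delta idea could be turned into a Grouping column is sketched below. Note the assumption: this is chained grouping, where a row joins the current group if it is within 5 minutes of the previous row, which is looser than requiring every member to be within 5 minutes of every other member. The threshold name is mine.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Scooby", "Daphne", "Shaggy", "Fred", "Velma", "Scrappy"],
    "Time": ["07/10/2022 08:22:44", "07/10/2022 08:27:22", "07/10/2022 08:27:44",
             "07/10/2022 08:29:55", "07/10/2022 08:33:14", "07/10/2022 08:48:44"],
})
df["Time"] = pd.to_datetime(df["Time"], format="%m/%d/%Y %H:%M:%S")
df = df.sort_values("Time").reset_index(drop=True)

# Time since the previous answer; the first row has no predecessor (NaT).
df["Challenge_Delta"] = df["Time"].diff()

# A gap longer than the threshold starts a new group; cumsum numbers the groups.
threshold = pd.Timedelta(minutes=5)
df["Grouping"] = (df["Challenge_Delta"] > threshold).cumsum() + 1

print(df)
```

With the sample data only the last row (Scrappy) is more than 5 minutes after its predecessor, so it ends up alone in its own group.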
This solution implements a rolling time list. The logic is this: keep adding entries as long as all the entries in the list are within the time window. When some of the entries fall outside the time window relative to the newest (to-be-added) entry, that indicates that you have a unique (possible) cheater group. Write out that group as a list, remove the out-of-window entries, and add the newest entry. At the end there may be a set of entries left on the list; dump_last() allows that set to be pulled. Once you have these lists, there are many cheater-detecting analytics that you could perform.
Make sure the Time column is datetime
df['Time'] = df['Time'].astype('datetime64[ns]')
The Rolling Time List class definition
class RollingTimeList:
    def __init__(self):
        self.cur = pd.DataFrame(columns=['Student','Time'])
        self.window = pd.Timedelta('5min')  # 5-minute window
    def __add(self, row):
        idx = self.cur.index.max()
        new_idx = idx+1 if idx==idx else 0  # idx is NaN when the list is empty
        self.cur.loc[new_idx] = row[['Student','Time']]
    def handle_row(self, row):
        rc = None
        if len(self.cur) > 0:
            window_mask = (row['Time'] - self.cur['Time']).abs() <= self.window
            if not window_mask.all():
                # Some entries fell out of the window: emit the group, then trim it
                if len(self.cur) > 1:
                    rc = self.cur['Student'].to_list()
                self.cur = self.cur.loc[window_mask]
        self.__add(row)
        return rc
    def dump_last(self):
        rc = None
        if len(self.cur) > 1:
            rc = self.cur['Student'].to_list()
        self.cur = self.cur[0:0]
        return rc
Instantiate and apply the class
rolling_list = RollingTimeList()
s = df.apply(rolling_list.handle_row, axis=1)
idx = s.index.max()
s.loc[idx+1 if idx==idx else 0] = rolling_list.dump_last()
print(s.dropna())
Result
3 [Scooby, Daphne, Shaggy]
4 [Daphne, Shaggy, Fred]
5 [Fred, Velma]
Merge that back into the original data frame
df['window_groups'] = s
df['window_groups'] = df['window_groups'].shift(-1).fillna('')
print(df)
Result
Class Student Challenge Time window_groups
0 A Scooby Challenge3 2022-07-10 08:22:44
1 A Daphne Challenge3 2022-07-10 08:27:22
2 A Shaggy Challenge3 2022-07-10 08:27:44 [Scooby, Daphne, Shaggy]
3 A Fred Challenge3 2022-07-10 08:29:55 [Daphne, Shaggy, Fred]
4 B Velma Challenge3 2022-07-10 08:33:14 [Fred, Velma]
5 B Scrappy Challenge3 2022-07-10 08:48:44
Visualization idea: cross reference table
dfd = pd.get_dummies(s.dropna().apply(pd.Series).stack()).groupby(level=0).sum()
xrf = dfd.T.dot(dfd)
print(xrf)
Result
Daphne Fred Scooby Shaggy Velma
Daphne 2 1 1 2 0
Fred 1 2 0 1 1
Scooby 1 0 1 1 0
Shaggy 2 1 1 2 0
Velma 0 1 0 0 1
Or as Series of unique student combinations and counts
combos = xrf.stack()
unique_combo_mask = combos.index.get_level_values(0)<combos.index.get_level_values(1)
print(combos[unique_combo_mask])
Result
Daphne Fred 1
Scooby 1
Shaggy 2
Velma 0
Fred Scooby 0
Shaggy 1
Velma 1
Scooby Shaggy 1
Velma 0
Shaggy Velma 0
From here, you could also decide that, within some time constraint (or other constraint), the count should really be no more than one, because the lists overlap. In this example, Daphne and Shaggy probably should not be counted twice: they didn't really come within 5 minutes of each other on two separate occasions. In that case anything greater than one could be set to one before being accumulated into a larger pool.
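The capping idea could look roughly like this. It is a sketch using only the standard library rather than the cross-reference table above; capped_pair_counts and pool are hypothetical names, and the input is the three lists from the result shown earlier.

```python
from collections import Counter
from itertools import combinations

def capped_pair_counts(window_groups):
    """Count each student pair at most once for a single challenge."""
    pairs = set()
    for group in window_groups:
        # Sorting gives every pair one canonical (a, b) order.
        pairs.update(combinations(sorted(set(group)), 2))
    return Counter(pairs)

challenge3_groups = [
    ["Scooby", "Daphne", "Shaggy"],
    ["Daphne", "Shaggy", "Fred"],
    ["Fred", "Velma"],
]

# Accumulate the capped counts into a pool spanning many challenges.
pool = Counter()
pool.update(capped_pair_counts(challenge3_groups))
print(pool)
```

Repeating the pool.update(...) call for every challenge means a pair's total only grows when the two students show up together in separate challenges.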
I have a pandas dataframe and I want to summarize/reorganize it to produce a figure. I think what I'm looking for involves groupby.
Here's what my dataframe df looks like:
Channel Flag
1 pass
2 pass
3 pass
1 pass
2 pass
3 pass
1 fail
2 fail
3 fail
And this is what I want my dataframe to look like:
Channel pass fail
1 2 1
2 2 1
3 2 1
Running the following code gives something "close", but not in the format I would like:
In [12]: df.groupby(['Channel', 'Flag']).size()
Out[12]:
Channel Flag
1 fail 1
pass 2
2 fail 1
pass 2
3 fail 1
pass 2
Maybe this output is actually fine to make my plot. It's just that I already have the code to plot the data with the previous format. I'm adding the code in case it would be relevant:
df_all = pd.DataFrame()
df_all['All'] = df['Pass'] + df['Fail']
df_pass = df[['Pass']] # The double square brackets keep the column name
df_fail = df[['Fail']]
maxval = max(df_pass.index) # maximum channel value
layout = FastqPlots.make_layout(maxval=maxval)
value_cts = pd.Series(df_pass['Pass'])
for entry in value_cts.keys():
    layout.template[np.where(layout.structure == entry)] = value_cts[entry]
sns.heatmap(data=pd.DataFrame(layout.template, index=layout.yticks, columns=layout.xticks),
            xticklabels="auto", yticklabels="auto",
            square=True,
            cbar_kws={"orientation": "horizontal"},
            cmap='Blues',
            linewidths=0.20)
ax.set_title("Pass reads output per channel")
plt.tight_layout() # Get rid of extra margins around the plot
fig.savefig(out + "/channel_output_all.png")
Any help/advice would be much appreciated.
Thanks!
df.groupby(['Channel', 'Flag'], as_index=False).size().pivot(index='Channel', columns='Flag', values='size')
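Two other ways to get the same wide table, shown as a sketch on the sample data from the question (the variable names counts and same are mine): unstacking the groupby result the question already produced, or calling pd.crosstab directly.

```python
import pandas as pd

df = pd.DataFrame({
    "Channel": [1, 2, 3] * 3,
    "Flag": ["pass"] * 6 + ["fail"] * 3,
})

# Move the inner index level (Flag) into columns; missing combinations become 0.
counts = df.groupby(["Channel", "Flag"]).size().unstack(fill_value=0)
counts = counts[["pass", "fail"]]  # column order as in the desired output

# crosstab builds the same Channel x Flag count table in one call.
same = pd.crosstab(df["Channel"], df["Flag"])[["pass", "fail"]]

print(counts)
```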
I am trying to calculate the difference between records by group and also include row number by group. This could be done using lag and row number functions in HIVE using windowing functions. Trying to recreate this using PIG and python UDFs.
In the following example, I need the row number to restart from 1 for each name and increment for a new month (new record). Also, I need the difference in balance from prior month for each name.
input data
name month balance
A 1 10
A 2 5
A 3 15
B 2 20
B 3 10
B 4 45
B 5 50
output data
name month balance row_number balance_diff
A 1 10 1 0
A 2 5 2 -5
A 3 15 3 10
B 2 20 1 0
B 3 10 2 -10
B 4 45 3 35
B 5 50 4 5
How can I do this using PIG and python UDF? Below is what I tried.
PIG
output = foreach (group input by (name)) {
    sorted = order input BY month asc;
    row_details = myudf.row_metrics(sorted.(month, balance));
    generate flatten (sorted), flatten (row_details);
};
Python UDF
def row_num(mth):
    return [x+1 for x,y in enumerate(mth)]
def diff(bal, n=1):
    # Pad the front with n missing values so the first n rows get 0.0
    return [x-y if (x is not None and y is not None) else 0.0
            for x,y in zip(bal, [None]*n + list(bal))]
@outputSchema('udfbag:bag{udftuple:tuple(row_number: int, balance_diff: int)}')
def row_metrics(mthbal):
    mth, bal = zip(*mthbal)
    row_number = row_num(mth)
    balance_diff = diff(bal)
    return zip(row_number, balance_diff)
My python functions work. However, I am having trouble combining the two bags (sorted and row_detail) once I bring the results into PIG. Any help is much appreciated.
I have also seen the enumerate function in PIG doing what I want with the row number. As part of learning PIG, however, I am looking for a solution using python UDFs.
Try this.
Python UDF:
def row_num(mth):
    return [x+1 for x,y in enumerate(mth)]
def diff(bal, n=1):
    return [0]+[x-y for x,y in zip(bal[n:],bal[:-n])]
@outputSchema('udfbag:bag{udftuple:tuple(name: chararray, mth: int, row_number: int, balance_diff: int)}')
def row_metrics(mthbal):
    name, mth, bal = zip(*mthbal)
    row_number = row_num(mth)
    balance_diff = diff(bal)
    return zip(name,mth,row_number, balance_diff)
Pig Script:
register 'myudf.py' using jython as myudf;
inpdat = load 'input.dat' using PigStorage(',') as (name:chararray, month:int, balance:int);
outdat = foreach (group inpdat by name) {
    sorted = order inpdat BY month asc;
    row_details = myudf.row_metrics(sorted);
    generate flatten (row_details);
};
dump outdat;
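The helper functions from this answer can be checked outside Pig with plain Python. This sketch is written for Python 3 (hence the list(...) around zip, which Jython 2 does not need), leaves out the outputSchema decorator because that only exists inside Pig, and uses group B from the question as input.

```python
def row_num(mth):
    # 1-based position of each record within its group
    return [i + 1 for i, _ in enumerate(mth)]

def diff(bal, n=1):
    # The first record has no prior month, so its difference is 0
    return [0] + [x - y for x, y in zip(bal[n:], bal[:-n])]

def row_metrics(mthbal):
    name, mth, bal = zip(*mthbal)
    return list(zip(name, mth, row_num(mth), diff(bal)))

# One group as Pig would hand it over: (name, month, balance), sorted by month
group_b = [("B", 2, 20), ("B", 3, 10), ("B", 4, 45), ("B", 5, 50)]
rows = row_metrics(group_b)
for row in rows:
    print(row)
```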
Using the stitch function from piggybank worked in my case. Would be interested to learn any other ways to do this.
REGISTER /mypath/piggybank.jar;
define Stitch org.apache.pig.piggybank.evaluation.Stitch;
input = load 'input.dat' using PigStorage(',') as (name:chararray, month:int, balance:int);
output = FOREACH (group input by name) {
    sorted = ORDER input by month asc;
    udf_fields = myudf.row_metrics(sorted.(month, balance));
    generate flatten(Stitch(sorted,udf_fields)) as (name, month, balance, row_number, balance_diff);
};
What is the easiest/simplest way to iterate through a large CSV file in Python 2.7, comparing 3 columns?
I am a total beginner and have only completed a few online courses. I have managed to use the CSV reader to do some basic stats on the CSV file, but nothing that compares groups with each other.
The data is roughly set up as follows:
Group sub-group processed
1 a y
1 a y
1 a y
1 b
1 b
1 b
1 c y
1 c y
1 c
2 d y
2 d y
2 d y
2 e y
2 e
2 e
2 f y
2 f y
2 f y
3 g
3 g
3 g
3 h y
3 h
3 h
Everything belongs to a group, but within each group are sub-groups of 3 rows (replicates). As we work through samples, we will be adding to the processed column, but we don't always do the full complement, so sometimes only 1 or 2 out of the potential 3 will be processed.
I'm trying to work towards a statistic showing % completeness of each group, with a sub group being "complete" if it has at least 1 row processed (doesn't have to have all 3).
I've managed to get halfway there, by using the following:
for row in reader:
    all_groups[group] = all_groups.get(group,0)+1
    if not processed == "":
        processed_groups[group] = processed_groups.get(group,0)+1
result = {}
for group in (processed_groups.viewkeys() | all_groups.keys()):
    if group in processed_groups: result.setdefault(group, []).append(processed_groups[group])
    if group in processed_groups: result.setdefault(group, []).append(all_groups[group])
for group,v1 in result.items():
    todo = float(v1[0])
    done = float(v1[1])
    progress = round((100 / done * todo),2)
    print group,"--", progress,"%"
The problem with the above code is it doesn't take into account the fact that some sub-groups may not be totally processed. As a result, the statistic will never read as 100% unless the processed column is always complete.
What I get:
Group 1 -- 55.56%
Group 2 -- 77.78%
Group 3 -- 16.67%
What I want:
Group 1 -- 66.67%
Group 2 -- 100%
Group 3 -- 50%
How would you make it so that it just checks whether the first row of each sub-group is complete, and uses just that, before continuing on to the next sub-group?
One way to do this is with a couple of defaultdict of sets. The first keeps track of all of the subgroups seen, the second keeps track of those subgroups that have been processed. Using a set simplifies the code somewhat, as does using a defaultdict when compared to using a standard dictionary (although it's still possible).
import csv
from collections import defaultdict
subgroups = defaultdict(set)
processed_subgroups = defaultdict(set)
with open('data.csv') as csvfile:
    for group, subgroup, processed in csv.reader(csvfile):
        subgroups[group].add(subgroup)
        if processed == 'y':
            processed_subgroups[group].add(subgroup)
for group in sorted(processed_subgroups):
    print("Group {} -- {:.2f}%".format(group, (len(processed_subgroups[group]) / float(len(subgroups[group])) * 100)))
Output
Group 1 -- 66.67%
Group 2 -- 100.00%
Group 3 -- 50.00%
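For a self-contained check of the same set-based idea, here is a Python 3 sketch that reads from an in-memory string instead of data.csv. The comma-separated layout, the header names, and the shortened sample are assumptions on my part; unlike the loop above, it also reports groups with nothing processed, as 0%.

```python
import csv
import io
from collections import defaultdict

# Shortened sample: same sub-groups as the question, fewer replicate rows.
sample = """Group,sub-group,processed
1,a,y
1,a,y
1,b,
1,c,y
2,d,y
2,e,y
2,f,y
3,g,
3,h,y
3,h,
"""

def completeness(csvfile):
    subgroups = defaultdict(set)   # every sub-group seen, per group
    processed = defaultdict(set)   # sub-groups with at least one 'y' row
    for row in csv.DictReader(csvfile):
        subgroups[row["Group"]].add(row["sub-group"])
        if row["processed"] == "y":
            processed[row["Group"]].add(row["sub-group"])
    return {g: 100.0 * len(processed[g]) / len(subgroups[g])
            for g in sorted(subgroups)}

result = completeness(io.StringIO(sample))
for group, pct in result.items():
    print("Group {} -- {:.2f}%".format(group, pct))
```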
My problem is pretty general and can probably be solved in many ways. But what is a smart way considering time and memory?
I have time series data of user interactions of the following form:
cookie_id interaction
--------- -----------
1234 did_something
1234 viewed_banner*
1234 did_something
1234 did_something
1234 viewed_and_clicked_banner*
... ...
I want to use it to train models predicting whether a user will click on a banner whenever a banner is displayed (i.e. the interactions marked with *). To do this I need to aggregate all previous interactions whenever a point of interest (either viewed_banner or viewed_and_clicked_banner) shows up in the feed:
cookie_id interaction
--------- -----------
1234 did_something
1234 viewed_banner <- point of interest
cookie_id interaction
--------- -----------
1234 did_something
1234 viewed_banner
1234 did_something
1234 did_something
1234 viewed_and_clicked_banner <- point of interest
This is the core of the problem: Splitting the data up into overlapping groups! After doing this each group can then be aggregated into for instance:
cookie_id did_something viewed_banner viewed_and_cli... clicked?
--------- ------------- ------------- ----------------- --------
1234 1 0 0 no
1234 3 1 0 yes
Here the numbers in did_something and viewed_banner are the counts of these interactions (not including the point of interest), but other kinds of aggregation could be performed as well. The clicked? attribute just describes which of the two kinds of "point of interest" was the last interaction in the interaction feed.
I have tried to look at the Pandas apply and groupby methods, but cannot come up with something that generates the desired overlapping groups.
The alternative is to use some for-loops, but I would rather not do that if there is a simple and efficient way to solve the problem.
Here is what I tried. I think it needs more data to verify the code:
data = """cookie_id interaction
1234 did_something
1234 viewed_banner*
1234 did_something
1234 did_something
1234 viewed_and_clicked_banner*
"""
import pandas as pd
import io
# data is a str, so wrap it in StringIO; the columns are whitespace-separated
df = pd.read_csv(io.StringIO(data), sep=r"\s+")
flag = df.interaction.str.endswith("*")
group_flag = flag.astype(float).mask(~flag).ffill(limit=1).fillna(0).cumsum()
df["interaction"] = df.interaction.str.rstrip("*")
interest_df = df[flag]
def f(s):
    return s.value_counts()
df2 = df.groupby(group_flag).interaction.apply(f).unstack().fillna(0).cumsum()
result = df2[::2].reset_index(drop=True)
result["clicked"] = interest_df.interaction.str.contains("clicked").reset_index(drop=True)
print(result)
output:
did_something viewed_and_clicked_banner viewed_banner clicked
0 1 0 0 False
1 3 0 1 True
The basic idea is to split the dataframe into groups:
odd groups are runs of consecutive rows without *
even groups are single rows with *
It assumes that the first row in the dataframe is without *.
Then do value_counts for every group and combine the results into a dataframe. Taking cumsum() of the counts and dropping the even rows gives the right counts.
I don't know how the clicked column is calculated. Can you explain this in detail?
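A different way to get the same counts, without building explicit groups, is to take a running count of every interaction per cookie and read it off at the points of interest. This is a sketch of that alternative, not the approach above; it assumes the rows are already in time order, and points, prior and features are names I made up. The clicked column here is simply whether the point of interest is viewed_and_clicked_banner.

```python
import pandas as pd

df = pd.DataFrame({
    "cookie_id": [1234] * 5,
    "interaction": ["did_something", "viewed_banner", "did_something",
                    "did_something", "viewed_and_clicked_banner"],
})
points = ["viewed_banner", "viewed_and_clicked_banner"]

# One indicator column per interaction type.
dummies = pd.get_dummies(df["interaction"]).astype(int)

# Running count per cookie, minus the current row, i.e. only *previous* interactions.
prior = dummies.groupby(df["cookie_id"]).cumsum() - dummies

# Keep only the rows that are points of interest.
is_poi = df["interaction"].isin(points)
features = prior[is_poi].copy()
features.insert(0, "cookie_id", df.loc[is_poi, "cookie_id"])
features["clicked"] = df.loc[is_poi, "interaction"].eq("viewed_and_clicked_banner")
features = features.reset_index(drop=True)

print(features)
```

Because the running count is grouped by cookie_id, the same code should carry over to a feed with many cookies, as long as each cookie's rows are in chronological order.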