Grouping samples for snakemake workflow - python

I have a data table formatted as follows:
| Read1 | Read2 | Group | SampleID |
| ---------- | ---------- | ----- | -------- |
| file.fq.gz | file.fq.gz | 1 | 1.1 |
| file.fq.gz | file.fq.gz | 2 | 2.1 |
| file.fq.gz | file.fq.gz | 3 | 3.1 |
| file.fq.gz | file.fq.gz | 2 | 2.2 |
| file.fq.gz | file.fq.gz | 1 | 1.2 |
| file.fq.gz | file.fq.gz | 2 | 2.3 |
The Read columns contain the paths to these files, and the group number is the only relevant feature. I am looking for a way to pass the reads from all rows of a given group (1, 2, or 3) to snakemake, in order to run a process on all of those files together. I know a for loop could work, such as:
for x in [1, 2, 3]:
    subset = df[df['Group'] == x]
    analyze_subset_etc
However, is there a more efficient way to do this that makes better use of snakemake's resource handling and computational efficiency?
Further clarification:
The main steps of the workflow need to be performed for each row of the dataframe, so those steps would look like:
def r1(sample):
    return df.loc[sample, 'Read1']
def r2(sample):
    return df.loc[sample, 'Read2']
rule trim_reads:
    input:
        read1 = r1,
        read2 = r2
    etc
Based on this framework it is difficult to pass all the samples by group, as they are not unique. Thus, I'm looking for a different way to couple these parameters.

It's hard to tell without more detail about what you need to do. Maybe something along these lines?
rule all:
    input:
        expand('{group}.txt', group=[1, 2, 3]),
rule one:
    output:
        '{group}.txt',
    run:
        # wildcard values are strings; convert to match the integer Group column
        subset = df[df['Group'] == int(wildcards.group)]
        analyze_subset_etc(subset)
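If the grouped step should receive the read files as declared inputs rather than slicing the table inside `run`, an input function is one option. This is only a sketch: it assumes the sample sheet is loaded into a pandas DataFrame called `df` as in the question, the file paths are made up, and `reads_for_group` is a hypothetical helper name.

```python
from types import SimpleNamespace

import pandas as pd

# Toy sample sheet with the same columns as the question (paths are made up).
df = pd.DataFrame(
    {
        "Read1": ["a_1.fq.gz", "b_1.fq.gz", "c_1.fq.gz", "d_1.fq.gz"],
        "Read2": ["a_2.fq.gz", "b_2.fq.gz", "c_2.fq.gz", "d_2.fq.gz"],
        "Group": [1, 2, 1, 2],
        "SampleID": ["1.1", "2.1", "1.2", "2.2"],
    }
)


def reads_for_group(wildcards):
    """Return every Read1/Read2 path whose row belongs to wildcards.group."""
    # Wildcard values arrive as strings, so convert before comparing
    # against the integer Group column.
    subset = df[df["Group"] == int(wildcards.group)]
    return subset["Read1"].tolist() + subset["Read2"].tolist()


# Stand-in for the wildcards object Snakemake would pass to the function.
group_1_reads = reads_for_group(SimpleNamespace(group="1"))
print(group_1_reads)
```

In the Snakefile the function would be referenced as `input: reads_for_group` in the per-group rule, so each group's job depends on exactly its own files and the groups can be scheduled in parallel.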

Related

PySpark search inside very large dataframe

I have a very large dataframe in pyspark. It has over 10 million rows and over 30 columns.
What is the most efficient way to search the entire dataframe for a given list of values and remove every row that contains any of them?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe(df) is :
+---------+--------------+---------------+---------+------------+
| id | first_name | last_name | Salary | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986 | Rollie | Lewin | 1097192 | 42254172 | -Remove Row
| 289743 | Karil | Sudron | 2785190 | 3703538 |
| 3864 | Massimiliano | Dallicott | 1194553 | 23292573 |
| 49074 | Gerry | Grinnov | 1506584 | 62291161 |
| 5087654 | Nat | Leatherborrow | 1781870 | 55183252 |
| 689 | Thaine | Tipple | 2150105 | 40583249 |
| 7907 | Myrlene | Croley | 2883250 | 70380540 |
| 887 | Nada | Redier | 2676139 | 10727550 | -Remove Row
| 96533 | Sonny | Bosden | 1050067 | 13110714 |
| 1098754 | Dennie | McGahy | 1804487 | 927935 | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe I could use collect() or toLocalIterator(), iterate over the rows, and remove them based on the list values.
Since it is a very large dataframe, what is the best way to solve this?
I have come up with this solution for now, but is there a better way?
from pyspark.sql.functions import col
column_names = df.schema.names
for name in column_names:
    df = df.filter(~col(name).isin(list))
Filtering the dataframe with filter and isin is the right approach. isin works well when the list is small (a few thousand values, not millions). Also make sure the dataframe has at least 3 * (number of CPUs on the executors) partitions; without enough partitions, parallelism will suffer.
I am more comfortable with Scala, so take the code below as a sketch of the concept. Build a single Column object by combining the isin conditions for all columns to be filtered on, then pass that final Column to dataframe.filter once:
from functools import reduce
from pyspark.sql.functions import col
column_names = df.schema.names
# OR together one isin() condition per column
colFinal = reduce(lambda a, b: a | b, [col(name).isin(list) for name in column_names])
df = df.filter(~colFinal)  # keep only the rows where no column matched

How to join two tables in PySpark with two conditions in an optimal way

I have the following two tables in PySpark:
Table A - dfA
| ip_4 | ip |
|---------------|--------------|
| 10.10.10.25 | 168430105 |
| 10.11.25.60 | 168499516 |
And table B - dfB
| net_cidr | net_ip_first_4 | net_ip_last_4 | net_ip_first | net_ip_last |
|---------------|----------------|----------------|--------------|-------------|
| 10.10.10.0/24 | 10.10.10.0 | 10.10.10.255 | 168430080 | 168430335 |
| 10.10.11.0/24 | 10.10.11.0 | 10.10.11.255 | 168430336 | 168430591 |
| 10.11.0.0/16 | 10.11.0.0 | 10.11.255.255 | 168493056 | 168558591 |
I have joined both tables in PySpark using the following command:
dfJoined = dfB.alias('b').join(F.broadcast(dfA).alias('a'),
    (F.col('a.ip') >= F.col('b.net_ip_first')) &
    (F.col('a.ip') <= F.col('b.net_ip_last')),
    how='right').select('a.*', 'b.*')
So I obtain:
| ip_4 | net_cidr | net_ip_first_4 | net_ip_last_4 | ...
|---------------|---------------|----------------|---------------| ...
| 10.10.10.25 | 10.10.10.0/24 | 10.10.10.0 | 10.10.10.255 | ...
| 10.11.25.60 | 10.11.0.0/16 | 10.11.0.0 | 10.11.255.255 | ...
The size of the tables makes this approach suboptimal because of the two join conditions. I had thought of sorting table B so that only one join condition is needed.
Is there any way to limit the join and take only the first record that matches the join condition? Or some way to make the join in an optimal way?
Table A (number of records) << Table B (number of records)
Thank you!
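The "sort table B so only one condition is needed" idea from the question can be illustrated outside Spark. The sketch below is plain Python, not a Spark join, and it assumes the ranges in table B do not overlap (true for the three rows shown). `find_network` is a hypothetical helper name.

```python
from bisect import bisect_right

# (first, last, cidr) ranges from table B, sorted by their first address.
# Assumption: the ranges do not overlap.
networks = sorted(
    [
        (168430080, 168430335, "10.10.10.0/24"),
        (168430336, 168430591, "10.10.11.0/24"),
        (168493056, 168558591, "10.11.0.0/16"),
    ]
)
starts = [first for first, _, _ in networks]


def find_network(ip):
    """Return the CIDR whose range contains ip, or None if there is none."""
    # Index of the last range that starts at or before ip.
    i = bisect_right(starts, ip) - 1
    if i >= 0 and ip <= networks[i][1]:
        return networks[i][2]
    return None


matches = {ip: find_network(ip) for ip in (168430105, 168499516, 42)}
print(matches)
```

Because the ranges are sorted and disjoint, a single binary search on the start address finds the only candidate, and the upper bound becomes a cheap check on one row instead of a second join condition.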

Performant alternative to constructing a dataframe by applying repeated pivots

I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results from this one. As a simple example, the new dataframe should contain one row per sample with the averages over that sample's runs:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id', values=['x', 'y', 'z'], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed in the initial df_samples dataframe
    pivots.append(pivot)
# create new dataframe
pd.concat(pivots)
So my first question is: if I want to create a new dataframe that consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of calling it iteratively? If there is, is it more performant?
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also gains columns, i.e. it might look like
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
Aside: I am looking for a good resource on working with large pandas dataframes and performantly constructing new ones or performing queries. I am almost always able to get the result I want using pandas, but my implementations are often inefficient and akin to how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory on dataframes and tables that I do not know. A recommendation for a resource would be welcome. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without one.
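For the simple averaging case, one grouped aggregation over the whole frame may be enough, with no per-sample loop. The sketch below rebuilds the example table from the question; the extra column `x_range` is a made-up feature standing in for the question's f and g, only to show how a single pivot can feed per-sample features.

```python
import pandas as pd

df_samples = pd.DataFrame(
    {
        "sample_id": [0, 0, 1, 1],
        "run_id": [1, 2, 1, 2],
        "x": [1, 4, 1, 7],
        "y": [2, 5, 2, 8],
        "z": [3, 6, 3, 9],
    }
)

# One grouped aggregation replaces the per-sample loop.
averages = (
    df_samples.groupby("sample_id")[["x", "y", "z"]]
    .mean()
    .add_prefix("avg_")
)

# A single pivot over the whole frame gives one column per (value, run_id)
# pair, indexed by sample_id, which feature functions can then consume.
wide = df_samples.pivot_table(
    index="sample_id", columns="run_id", values=["x", "y", "z"], aggfunc="mean"
)

# Hypothetical feature: difference in x between run 2 and run 1.
averages["x_range"] = wide[("x", 2)] - wide[("x", 1)]
print(averages)
```

Both `averages` and `wide` share the `sample_id` index, so new feature columns align automatically when assigned.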

Group a huge csv file in python

I have a huge .csv file (above 100 GB) in the form:
| Column1 | Column2 | Column3 | Column4 | Column5 |
|---------|---------|---------|---------|---------------------|
| A | B | 35 | X | 2017-12-19 11:28:34 |
| A | C | 22 | Z | 2017-12-19 11:27:24 |
| A | B | 678 | Y | 2017-12-19 11:38:36 |
| C | A | 93 | X | 2017-12-19 11:44:42 |
I want to summarize it:
by the unique values in Column1 and Column2
with sum(Column3),
max(Column5)
the value of Column4, where Column5 was at its maximum.
Therefore the above extract should become:
| Column1 | Column2 | sum(Column3) | Column4 | max(Column5) |
|---------|---------|--------------|---------|---------------------|
| A | B | 713 | Y | 2017-12-19 11:38:36 |
| A | C | 22 | Z | 2017-12-19 11:27:24 |
| C | A | 93 | X | 2017-12-19 11:44:42 |
With these additional considerations:
The .csv is not sorted
I have python under windows
The solution should be on a standalone PC (Cloud instances are not acceptable)
I have tried Dask and the .compute() step (should it ever complete) will take about a week. Anything faster than this would be a good solution.
I am open to all kinds of solutions - splitting the file into chunks, multiprocessing... whatever would work
Edit 1:
I had not used multiprocessing in dask. Adding it improves the speed significantly (as suggested by one of the comments), but 32 GB of RAM is not enough for this approach to complete.
Edit 2:
Dask 0.16.0 is not a possible solution, as it is absolutely broken. After 5 hours of writing partitions to disk, it has written 8 out of 300 partitions and after reporting to have written 7, now it reports having written 4, instead of 8 (without throwing an error).
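One approach that avoids loading the file at all is a single streaming pass with the standard library, keeping only one small record per (Column1, Column2) pair. This is a sketch under several assumptions: the file is comma-delimited with no header row, Column3 holds integers, the timestamps all use the fixed format shown, and the number of distinct pairs fits in memory. `summarize` is a hypothetical helper name.

```python
import csv
import io


def summarize(rows):
    """Aggregate an iterable of 5-column CSV rows in one pass.

    Memory grows with the number of distinct (Column1, Column2) pairs,
    not with the size of the file.
    """
    groups = {}
    for c1, c2, c3, c4, c5 in rows:
        key = (c1, c2)
        entry = groups.get(key)
        if entry is None:
            # [running sum of Column3, Column4 at latest time, latest Column5]
            groups[key] = [int(c3), c4, c5]
        else:
            entry[0] += int(c3)
            # Fixed-format timestamps compare correctly as plain strings.
            if c5 > entry[2]:
                entry[1], entry[2] = c4, c5
    return groups


# Tiny in-memory stand-in for the real file, using the rows from the question.
sample = io.StringIO(
    "A,B,35,X,2017-12-19 11:28:34\n"
    "A,C,22,Z,2017-12-19 11:27:24\n"
    "A,B,678,Y,2017-12-19 11:38:36\n"
    "C,A,93,X,2017-12-19 11:44:42\n"
)
result = summarize(csv.reader(sample))
for key, value in result.items():
    print(key, value)
```

For the real file, the same function would be fed `csv.reader(f)` from `open(path, newline="")`. The work is then bounded by disk read speed, and the unsorted input does not matter because each row only touches its own group's record.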

Maximizing a combination of a series of values

This is a complicated one, but I suspect there's some principle I can apply to make it simple - I just don't know what it is.
I need to parcel out presentation slots to a class full of students for the semester. There are multiple possible dates, and multiple presentation types. I conducted a survey where students could rank their interest in the different topics. What I'd like to do is get the best (or at least a good) distribution of presentation slots to students.
So, what I have:
List of 12 dates
List of 18 students
CSV file where each student (row) has a rating 1-5 for each date
What I'd like to get:
Each student should have one of presentation type A (intro), one of presentation type B (figures) and 3 of presentation type C (aims)
Each date should have at least 1 of each type of presentation
Each date should have no more than 2 of type A or type B
Try to give students presentations that they rated highly (4 or 5)
I should note that I realize this looks like a homework problem, but it's real life :-). I was thinking that I might make a Student class for each student that contains the dates for each presentation type, but I wasn't sure what the best way to populate it would be. Actually, I'm not even sure where to start.
TL;DR: I think you're giving your students too much choice :D
But I had a shot at this problem anyway. Pretty fun exercise actually, although some of the constraints were a little vague. Most of all, I had to guess what the actual students' preference distribution would look like. I went with uniformly distributed, independent variables, although that's probably not very realistic. Still I think it should work just as well on real data as it does on my randomly generated data.
I considered brute forcing it, but a rough analysis gave me an estimate of over 10^65 possible configurations. That's kind of a lot. And since we don't have a trillion trillion years to consider all of them, we'll need a heuristic approach.
Because of the size of the problem, I tried to avoid doing any backtracking. But this meant that you could get stuck; there might not be a solution where everyone only gets dates they gave 4's and 5's.
I ended up implementing a double-edged Iterative Deepening-like search, where both the best case we're still holding out hope for (i.e., assign students to a date they gave a 5) and the worst case scenario we're willing to accept (some student might have to live with a 3) are gradually lowered until a solution is found. If we get stuck, reset, lower expectations, and try again. Tasks A and B are assigned first, and C is done only after A and B are complete, because the constraints on C are far less stringent.
I also used a weighting factor to model the trade-off between maximizing student happiness and satisfying the types-of-presentations-per-day limits.
Currently it seems to find a solution for pretty much every random generated set of preferences. I included an evaluation metric; the ratio between the sum of the preference values of all assigned student/date combos, and the sum of all student ideal/top 3 preference values. For example, if student X had two fives, one four and the rest threes on his list, and is assigned to one of his fives and two threes, he gets 5+3+3=11 but could ideally have gotten 5+5+4=14; he is 11/14 = 78.6% satisfied.
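The per-student part of that metric can be written in a few lines. This is a minimal sketch of the ratio as described above, not the code from the full script; the function name and argument layout are made up for illustration.

```python
def satisfaction(preferences, assigned):
    """Ratio of the ratings a student received to the best they could have had.

    preferences: the student's ratings for every date.
    assigned:    the ratings of the dates the student was actually given.
    """
    # The ideal is the sum of the student's top-k ratings, k = number of slots.
    ideal = sum(sorted(preferences, reverse=True)[:len(assigned)])
    return sum(assigned) / ideal


# Student X from the example: two 5s, one 4, and 3s for the remaining dates.
prefs_x = [5, 5, 4] + [3] * 9
score_x = satisfaction(prefs_x, [5, 3, 3])
print(f"{score_x:.1%}")  # 78.6%
```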
After some testing, it seems that my implementation tends to produce an average student satisfaction of around 95%, a lot better than I expected :) But again, that is with fake data. Real preferences are probably more clumped, and harder to satisfy.
Below is the core of the algorithm. The full script is ~250 lines and a bit too long for here I think. Check it out at GitHub.
...
# Assign a date for a given task to each student,
# preferring a date that they like and is still free.
def fill(task, lowest_acceptable, spread_weight=0.1, tasks_to_spread="ABC"):
    random_order = list(range(nStudents))  # randomize student order, so everyone
    random.shuffle(random_order)           # has an equal chance to get their first pick
    for i in random_order:
        student = students[i]
        if student.dates[task]:  # student is already assigned for this task?
            continue
        # get available dates ordered by preference and how fully booked they are
        preferred = get_favorite_day(student, lowest_acceptable,
                                     spread_weight, tasks_to_spread)
        for date_nr in preferred:
            date = dates[date_nr]
            if date.is_available(task, student.count, lowest_acceptable == 1):
                date.set_student(task, student.count)
                student.dates[task] = date
                break
# attempt to "fill()" the schedule while gradually lowering expectations
start_at = 5
while start_at > 1:
    lowest_acceptable = start_at
    while lowest_acceptable > 0:
        fill("A", lowest_acceptable, spread_weight, "AAB")
        fill("B", lowest_acceptable, spread_weight, "ABB")
        if lowest_acceptable == 1:
            fill("C", lowest_acceptable, spread_weight_C, "C")
        lowest_acceptable -= 1
And here is an example result as printed by the script:
Date
================================================================================
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
================================================================================
1 | | A | B | | C | | | | | | | |
2 | | | | | A | | | | | B | C | |
3 | | | | | B | | | C | | A | | |
4 | | | | A | | C | | | | | | B |
5 | | | C | | | | A | B | | | | |
6 | | C | | | | | | | A | B | | |
7 | | | C | | | | | B | | | | A |
8 | | | A | | C | | B | | | | | |
9 | C | | | | | | | | A | | | B |
10 | A | B | | | | | | | C | | | |
11 | B | | | A | | C | | | | | | |
12 | | | | | | A | C | | | | B | |
13 | A | | | B | | | | | | | | C |
14 | | | | | B | | | | C | | A | |
15 | | | A | C | | B | | | | | | |
16 | | | | | | A | | | | C | B | |
17 | | A | | C | | | B | | | | | |
18 | | | | | | | C | A | B | | | |
================================================================================
Total student satisfaction: 250/261 = 95.00%
