Group a huge CSV file in Python

I have a huge .csv file (above 100 GB) in the form:
| Column1 | Column2 | Column3 | Column4 | Column5 |
|---------|---------|---------|---------|---------------------|
| A | B | 35 | X | 2017-12-19 11:28:34 |
| A | C | 22 | Z | 2017-12-19 11:27:24 |
| A | B | 678 | Y | 2017-12-19 11:38:36 |
| C | A | 93 | X | 2017-12-19 11:44:42 |
And want to summarize it by the unique values in Column1 and Column2 with:
- sum(Column3)
- max(Column5)
- the value of Column4 where Column5 was at its maximum

Therefore the above extract should become:
| Column1 | Column2 | sum(Column3) | Column4 | max(Column5) |
|---------|---------|--------------|---------|---------------------|
| A | B | 713 | Y | 2017-12-19 11:38:36 |
| A | C | 22 | Z | 2017-12-19 11:27:24 |
| C | A | 93 | X | 2017-12-19 11:44:42 |
With these additional considerations:
- The .csv is not sorted
- I have Python under Windows
- The solution should run on a standalone PC (cloud instances are not acceptable)
- I have tried Dask, and the .compute() step (should it ever complete) would take about a week; anything faster would be a good solution
- I am open to all kinds of solutions: splitting the file into chunks, multiprocessing, whatever would work
Edit 1:
I had not used multiprocessing in Dask. Adding it improves the speed significantly (as suggested by one of the comments), but 32 GB of RAM is not enough for this approach to complete.
Edit 2:
Dask 0.16.0 is not a possible solution, as it is absolutely broken. After 5 hours of writing partitions to disk, it had written 8 out of 300 partitions, and after reporting 7 written it now reports 4 instead of 8 (without throwing an error).
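For reference, here is a minimal chunked-pandas sketch of this aggregation (not from the thread): stream the file in chunks, reduce each chunk to per-group partials, then aggregate the partials the same way. It assumes the set of distinct (Column1, Column2) pairs fits in memory; `summarize` is an illustrative name.

```python
import pandas as pd

def summarize(csv_path, chunksize=1_000_000):
    """Per (Column1, Column2) group: sum(Column3), plus the Column4 value
    of the row where Column5 is at its maximum."""
    partials = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize,
                             parse_dates=["Column5"]):
        # partial sums per group keep memory bounded
        sums = chunk.groupby(["Column1", "Column2"], as_index=False)["Column3"].sum()
        # the row holding max(Column5) per group carries its Column4 along
        idx = chunk.groupby(["Column1", "Column2"])["Column5"].idxmax()
        maxes = chunk.loc[idx, ["Column1", "Column2", "Column4", "Column5"]]
        partials.append(sums.merge(maxes, on=["Column1", "Column2"]))
    # sum and max are associative, so re-reducing the partials gives the answer
    combined = pd.concat(partials, ignore_index=True)
    sums = combined.groupby(["Column1", "Column2"], as_index=False)["Column3"].sum()
    idx = combined.groupby(["Column1", "Column2"])["Column5"].idxmax()
    maxes = combined.loc[idx, ["Column1", "Column2", "Column4", "Column5"]]
    return sums.merge(maxes, on=["Column1", "Column2"])
```

This keeps only one chunk plus the partial results in RAM at a time; the chunk size can be tuned to the available memory.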

Related

Grouping samples for snakemake workflow

I have a data table which is formatted as follows:
| Read1 | Read2 | Group | SampleID |
| ---------- | ---------- | ----- | -------- |
| file.fq.gz | file.fq.gz | 1 | 1.1 |
| file.fq.gz | file.fq.gz | 2 | 2.1 |
| file.fq.gz | file.fq.gz | 3 | 3.1 |
| file.fq.gz | file.fq.gz | 2 | 2.2 |
| file.fq.gz | file.fq.gz | 1 | 1.2 |
| file.fq.gz | file.fq.gz | 2 | 2.3 |
Where the Read columns contain directory information for these files, and the group number is the only relevant feature. I am looking for a way to pass the reads belonging to the rows of a specific group (1, 2, and 3 respectively) to Snakemake, in order to perform a process involving all of these files. I know a for loop could work, such as:
for x in [1, 2, 3]:
    subset = df[df['Group'] == x]
    analyze_subset_etc
However, is there a more efficient way to do this that better utilizes the resources and computational efficiency of Snakemake?
Further clarification:
The main steps of the workflow need to be performed for each row of the dataframe, so those steps would look like:
def r1(sample):
    return df.loc[sample, 'Read1']

def r2(sample):
    return df.loc[sample, 'Read2']

rule trim_reads:
    input:
        read1 = r1,
        read2 = r2
    etc
Based on this framework it is difficult to pass all the samples by group, as they are not unique. Thus, I'm looking for a different way to couple these parameters.
It's hard to tell without more detail about what you need to do. Maybe something along these lines?
rule all:
    input:
        expand('{group}.txt', group=[1, 2, 3]),

rule one:
    output:
        '{group}.txt',
    run:
        # wildcards are strings, so cast before comparing to the integer Group column
        subset = df[df['Group'] == int(wildcards.group)]
        analyze_subset_etc(subset)

Drop duplicates based on first level column in MultiIndex DataFrame

I have a MultiIndex Pandas DataFrame like so:
+---+------------------+----------+-----------------------------+-----------------------------+
|   | VECTOR           | SEGMENTS | OVERALL                     | INDIVIDUAL                  |
|   |                  |          | TIP X | TIP Y  | CURVATURE  | TIP X  | TIP Y   | CURVATURE |
| 0 | (TOP, TOP)       | 2        | 3.24  | 1.309  | 44         | 1.62   | 0.6545  | 22        |
| 1 | (TOP, BOTTOM)    | 2        | 3.495 | 0.679  | 22         | 1.7475 | 0.3395  | 11        |
| 2 | (BOTTOM, TOP)    | 2        | 3.495 | -0.679 | -22        | 1.7475 | -0.3395 | -11       |
| 3 | (BOTTOM, BOTTOM) | 2        | 3.24  | -1.309 | -44        | 1.62   | -0.6545 | -22       |
+---+------------------+----------+-------+--------+------------+--------+---------+-----------+
How can I drop duplicates based on all columns contained under 'OVERALL' or 'INDIVIDUAL'? So if I choose 'INDIVIDUAL' to drop duplicates from, the values of TIP X, TIP Y, and CURVATURE under INDIVIDUAL must all match for a row to count as a duplicate.
And further, as you can see from the table, rows 1 and 2 are duplicates that are simply mirrored about the x-axis. These must also be dropped.
Also, can I center the OVERALL and INDIVIDUAL headings?
EDIT: frame.drop_duplicates(subset=['INDIVIDUAL'], inplace=True) produces KeyError: Index(['INDIVIDUAL'], dtype='object')
You can pass pandas .drop_duplicates a subset of tuples for multi-indexed columns:
df.drop_duplicates(subset=[
    ('INDIVIDUAL', 'TIP X'),
    ('INDIVIDUAL', 'TIP Y'),
    ('INDIVIDUAL', 'CURVATURE')
])
Or, if your row indices are unique, you could use the following approach that saves some typing:
df.loc[df['INDIVIDUAL'].drop_duplicates().index]
Update:
As you suggested in the comments, if you want to do operations on the dataframe you can do that in-line:
df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
Or for non-pandas functions, you can use .transform:
df.loc[df['INDIVIDUAL'].transform(np.abs).drop_duplicates().index]
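To show the mirrored-duplicate trick end to end, here is a small self-contained sketch reconstructing (hypothetically) the frame from the question; the VECTOR and SEGMENTS columns are omitted for brevity:

```python
import pandas as pd

# Two-level column index: OVERALL and INDIVIDUAL over the three measurements
cols = pd.MultiIndex.from_product(
    [["OVERALL", "INDIVIDUAL"], ["TIP X", "TIP Y", "CURVATURE"]]
)
df = pd.DataFrame(
    [
        [3.24,  1.309,  44, 1.62,   0.6545,  22],
        [3.495, 0.679,  22, 1.7475, 0.3395,  11],
        [3.495, -0.679, -22, 1.7475, -0.3395, -11],
        [3.24,  -1.309, -44, 1.62,   -0.6545, -22],
    ],
    columns=cols,
)

# Taking abs() first makes sign-mirrored rows compare equal,
# so drop_duplicates removes them; keep the surviving row labels.
deduped = df.loc[df["INDIVIDUAL"].abs().drop_duplicates().index]
```

Only rows 0 and 1 survive, since rows 2 and 3 match them up to sign.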

How to efficiently extract unique rows from massive CSV using Python or R

I have a massive CSV (1.4 GB, over 1 million rows) of stock market data that I will process using R.
The table looks roughly like this. For each ticker, there are thousands of rows of data.
+--------+------+-------+------+------+
| Ticker | Open | Close | High | Low |
+--------+------+-------+------+------+
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| B | 32 | 23 | 43 | 344 |
+--------+------+-------+------+------+
To make processing and testing easier, I'm breaking this colossus into smaller files using the script mentioned in this question: How do I slice a single CSV file into several smaller ones grouped by a field?
The script would output files such as data_a.csv, data_b.csv, etc.
But, I would also like to create index.csv which simply lists all the unique stock ticker names.
E.g.
+---------+
| Ticker |
+---------+
| A |
| B |
| C |
| D |
| ... |
+---------+
Can anybody recommend an efficient way of doing this in R or Python, when handling a huge filesize?
You could loop through each file, grabbing the index of each and creating a set union of all indices.
import glob

import pandas as pd

tickers = set()
for csvfile in glob.glob('*.csv'):
    data = pd.read_csv(csvfile, index_col=0, header=None)  # or header=0, however your data is set up
    tickers.update(data.index.tolist())

pd.Series(sorted(tickers)).to_csv('index.csv', index=False)
You can retrieve the index from the file names:
(index <- data.frame(Ticker = toupper(gsub("^.*_(.*)\\.csv",
                                           "\\1",
                                           list.files()))))
##   Ticker
## 1      A
## 2      B
write.csv(index, "index.csv")
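Alternatively, if the split step can be skipped, a chunked read of the original CSV collects the unique tickers directly without ever loading the whole file (a sketch, not from the original answers; it assumes the header names the column Ticker):

```python
import pandas as pd

def unique_tickers(csv_path, chunksize=100_000):
    """Collect unique values of the Ticker column from a large CSV
    while holding only one chunk in memory at a time."""
    tickers = set()
    # usecols avoids parsing the price columns at all
    for chunk in pd.read_csv(csv_path, usecols=["Ticker"], chunksize=chunksize):
        tickers.update(chunk["Ticker"].unique())
    return sorted(tickers)
```

The result can then be written with `pd.Series(unique_tickers('data.csv')).to_csv('index.csv', index=False)`.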

Maximizing a combination of a series of values

This is a complicated one, but I suspect there's some principle I can apply to make it simple - I just don't know what it is.
I need to parcel out presentation slots to a class full of students for the semester. There are multiple possible dates, and multiple presentation types. I conducted a survey where students could rank their interest in the different topics. What I'd like to do is get the best (or at least a good) distribution of presentation slots to students.
So, what I have:
- List of 12 dates
- List of 18 students
- CSV file where each student (row) has a rating 1-5 for each date

What I'd like to get:
- Each student should have one of presentation type A (intro), one of presentation type B (figures), and 3 of presentation type C (aims)
- Each date should have at least 1 of each type of presentation
- Each date should have no more than 2 of type A or type B
- Try to give students presentations that they rated highly (4 or 5)
I should note that I realize this looks like a homework problem, but it's real life :-). I was thinking that I might make a Student class for each student that contains the dates for each presentation type, but I wasn't sure what the best way to populate it would be. Actually, I'm not even sure where to start.
TL;DR: I think you're giving your students too much choice :D
But I had a shot at this problem anyway. Pretty fun exercise actually, although some of the constraints were a little vague. Most of all, I had to guess what the actual students' preference distribution would look like. I went with uniformly distributed, independent variables, although that's probably not very realistic. Still I think it should work just as well on real data as it does on my randomly generated data.
I considered brute forcing it, but a rough analysis gave me an estimate of over 10^65 possible configurations. That's kind of a lot. And since we don't have a trillion trillion years to consider all of them, we'll need a heuristic approach.
Because of the size of the problem, I tried to avoid doing any backtracking. But this meant that you could get stuck; there might not be a solution where everyone only gets dates they gave 4's and 5's.
I ended up implementing a double-edged Iterative Deepening-like search, where both the best case we're still holding out hope for (i.e., assign students to a date they gave a 5) and the worst case scenario we're willing to accept (some student might have to live with a 3) are gradually lowered until a solution is found. If we get stuck, reset, lower expectations, and try again. Tasks A and B are assigned first, and C is done only after A and B are complete, because the constraints on C are far less stringent.
I also used a weighting factor to model the trade-off between maximizing student happiness and satisfying the types-of-presentations-per-day limits.
Currently it seems to find a solution for pretty much every randomly generated set of preferences. I included an evaluation metric: the ratio between the sum of the preference values of all assigned student/date combos, and the sum of each student's top-3 preference values. For example, if student X had two fives, one four and the rest threes on his list, and is assigned to one of his fives and two threes, he gets 5+3+3=11 but could ideally have gotten 5+5+4=14; he is 11/14 = 78.6% satisfied.
After some testing, it seems that my implementation tends to produce an average student satisfaction of around 95%, a lot better than I expected :) But again, that is with fake data. Real preferences are probably more clumped, and harder to satisfy.
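The evaluation metric described above can be sketched as a small helper (the function name and signature are invented for illustration):

```python
def satisfaction(preferences, assigned):
    """Ratio of a student's assigned preference values to their ideal
    (sum of their top-3 ratings, since each student fills 3 slots here)."""
    ideal = sum(sorted(preferences, reverse=True)[:3])
    return sum(assigned) / ideal

# The example from the text: ratings of two fives, one four, rest threes;
# assigned one five and two threes -> 11/14, about 78.6% satisfied.
score = satisfaction([5, 5, 4, 3, 3, 3], [5, 3, 3])
```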
Below is the core of the algorithm. The full script is ~250 lines and a bit too long to include here, I think. Check it out at Github.
...
# Assign a date for a given task to each student,
# preferring a date that they like and is still free.
def fill(task, lowest_acceptable, spread_weight=0.1, tasks_to_spread="ABC"):
    random_order = list(range(nStudents))  # randomize student order, so everyone
    random.shuffle(random_order)           # has an equal chance to get their first pick
    for i in random_order:
        student = students[i]
        if student.dates[task]:  # student is already assigned for this task?
            continue
        # get available dates ordered by preference and how fully booked they are
        preferred = get_favorite_day(student, lowest_acceptable,
                                     spread_weight, tasks_to_spread)
        for date_nr in preferred:
            date = dates[date_nr]
            if date.is_available(task, student.count, lowest_acceptable == 1):
                date.set_student(task, student.count)
                student.dates[task] = date
                break

# attempt to "fill()" the schedule while gradually lowering expectations
start_at = 5
while start_at > 1:
    lowest_acceptable = start_at
    while lowest_acceptable > 0:
        fill("A", lowest_acceptable, spread_weight, "AAB")
        fill("B", lowest_acceptable, spread_weight, "ABB")
        if lowest_acceptable == 1:
            fill("C", lowest_acceptable, spread_weight_C, "C")
        lowest_acceptable -= 1
And here is an example result as printed by the script:
Date
================================================================================
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
================================================================================
1 | | A | B | | C | | | | | | | |
2 | | | | | A | | | | | B | C | |
3 | | | | | B | | | C | | A | | |
4 | | | | A | | C | | | | | | B |
5 | | | C | | | | A | B | | | | |
6 | | C | | | | | | | A | B | | |
7 | | | C | | | | | B | | | | A |
8 | | | A | | C | | B | | | | | |
9 | C | | | | | | | | A | | | B |
10 | A | B | | | | | | | C | | | |
11 | B | | | A | | C | | | | | | |
12 | | | | | | A | C | | | | B | |
13 | A | | | B | | | | | | | | C |
14 | | | | | B | | | | C | | A | |
15 | | | A | C | | B | | | | | | |
16 | | | | | | A | | | | C | B | |
17 | | A | | C | | | B | | | | | |
18 | | | | | | | C | A | B | | | |
================================================================================
Total student satisfaction: 250/261 = 95.00%

In mysql, is it possible to add a column based on values in one column?

I have a mysql table data which has following columns
+-------+-----------+----------+
|a | b | c |
+-------+-----------+----------+
| John | 225630096 | 447 |
| John | 225630118 | 491 |
| John | 225630206 | 667 |
| John | 225630480 | 1215 |
| John | 225630677 | 1609 |
| John | 225631010 | 2275 |
| Ryan | 154247076 | 6235 |
| Ryan | 154247079 | 6241 |
| Ryan | 154247083 | 6249 |
| Ryan | 154247084 | 6251 |
+-------+-----------+----------+
I want to add a column d based on the values in a and c (see expected table below). Values in a are the name of the subject, b is one of its attributes, and c another. So, if the values of c are within 15 units of each other for a given subject, assign them the same cluster number (for example, every value in c for Ryan is within 15 units of the previous one, so they are all assigned 1); if not, assign them different values, as for John, where each row gets a different value for d.
+-------+-----------+----------+---+
|a | b | c |d |
+-------+-----------+----------+---+
| John | 225630096 | 447 | 1 |
| John | 225630118 | 491 | 2 |
| John | 225630206 | 667 | 3 |
| John | 225630480 | 1215 | 4 |
| John | 225630677 | 1609 | 5 |
| John | 225631010 | 2275 | 6 |
| Ryan | 154247076 | 6235 | 1 |
| Ryan | 154247079 | 6241 | 1 |
| Ryan | 154247083 | 6249 | 1 |
| Ryan | 154247084 | 6251 | 1 |
+-------+-----------+----------+---+
I am not sure if this can be done in MySQL, but if not I would welcome any Python-based answers as well; in that case, working on this table in csv format.
Thanks.
You could use a query with variables:
SELECT a, b, c,
  CASE WHEN @last_a != a THEN @d := 1
       WHEN (@last_a = a) AND (c > @last_c + 15) THEN @d := @d + 1
       ELSE @d
  END d,
  @last_a := a,
  @last_c := c
FROM
  tablename, (SELECT @d := 1, @last_a := NULL, @last_c := NULL) _n
ORDER BY a, c
Please see fiddle here.
Explanation
I'm using a join between tablename and the subquery (SELECT ...) _n just to initialize some variables (@d is initialized to 1, @last_a to null, @last_c to null).
Then, for every row, I check whether the last encountered a (the one on the previous row) differs from the current a: in that case, set @d to 1 (and return it).
If the last encountered a is the same as the current row's and c is greater than the last encountered c + 15, then increment @d and return its value.
Otherwise, just return @d without incrementing it. This happens when a has not changed and c is not greater than the previous c + 15; it also happens at the first row (because @last_a and @last_c were initialized to null).
To make it work, we need to order by a and c.
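Since the question also welcomes Python answers, here is a hypothetical pandas sketch of the same gap-based clustering (the function name is made up; like the SQL, it compares each row to the previous one within a subject after sorting by a and c):

```python
import pandas as pd

def add_cluster(df):
    """Add column d: within each subject a, rows sorted by c start a new
    cluster whenever c jumps by more than 15 from the previous row."""
    df = df.sort_values(["a", "c"]).copy()
    # True where a cluster starts: first row of each subject, or a gap > 15
    new_cluster = (df["c"].diff() > 15) | (df["a"] != df["a"].shift())
    # cumulative count of cluster starts within each subject gives d
    df["d"] = new_cluster.astype(int).groupby(df["a"]).cumsum()
    return df
```

On the example data this assigns John the values 1 through 6 and Ryan all 1s, matching the expected table.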
