Extracting and aggregating data out of filenames in Python or pandas

I have these four lists, which are the filenames of images and the filenames are in the format:
(disease)-(randomized patient ID)-(image number by this patient)
A single patient can have multiple images per disease.
See these slices below:
print(train_cnv_list[0:3])
print(train_dme_list[0:3])
print(train_drusen_list[0:3])
print(train_normal_list[0:3])
>>>
['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
I'd like to figure out:
How many images are there per patient / per list?
Is there any overlap in the "randomized patient ID" across the four lists? If so, can I aggregate that into some kind of report (patient, disease, number of images) using something like groupby?
patient - disease1 - total number of images
- disease2 - total number of images
- disease3 - total number of images
where the total number of images is max(image number by this patient)
I did see that this yields a patient id:
train_cnv_list[0][4:11]
>>> 9911627
Thanks, in advance, for any guidance.

You can do it easily with Pandas:
import pandas as pd
cnv_list=['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
dme_list=['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
dru_list=['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
nor_list=['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
data =[]
data.extend(cnv_list)
data.extend(dme_list)
data.extend(dru_list)
data.extend(nor_list)
df = pd.DataFrame(data, columns=["files"])
df["files"]=df["files"].str.replace ('.jpeg','')
df=df["files"].str.split('-', expand=True).rename(columns={0:"disease",1:"PatientID",2:"pictureName"})
res = df.groupby(['PatientID','disease']).apply(lambda x: x['pictureName'].count())
print(res)
Result:
PatientID disease
8773471 DME 1
8797076 DME 1
8889850 DME 1
8986660 DRUSEN 1
9025088 DRUSEN 1
9100857 DRUSEN 1
9490249 NORMAL 1
9504376 NORMAL 1
9509694 NORMAL 1
9911627 CNV 2
9935363 CNV 1
and you can do even more now that you have a DataFrame...
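For example, a sketch building on the df and res above (my addition, not part of the original answer) that answers the overlap question directly and pivots the report into one row per patient:
# patient IDs that appear under more than one disease
diseases_per_patient = df.groupby('PatientID')['disease'].nunique()
print(diseases_per_patient[diseases_per_patient > 1])
# one row per patient, one column per disease, image counts as values
print(res.unstack(fill_value=0))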

Here are a few functions that might get you on the right track, but as @rick-supports-monica mentioned, this is a great use case for pandas. You'll have an easier time manipulating data.
def contains_duplicate_ids(img_list):
    patient_ids = []
    for image in img_list:
        patient_id = image.split('.')[0].split('-')[1]
        patient_ids.append(patient_id)
    if len(set(patient_ids)) == len(patient_ids):
        return False
    return True

def get_duplicates(img_list):
    patient_ids = []
    duplicates = []
    for image in img_list:
        patient_id = image.split('.')[0].split('-')[1]
        if patient_id in patient_ids:
            duplicates.append(patient_id)
        patient_ids.append(patient_id)
    return duplicates

def count_images(img_list):
    return len(set(img_list))
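Used on the CNV list from the question, these would give (a quick usage sketch, not part of the original answer):
print(contains_duplicate_ids(train_cnv_list))  # True: patient 9911627 appears twice
print(get_duplicates(train_cnv_list))          # ['9911627']
print(count_images(train_cnv_list))            # 3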
From get_duplicates you can use the patient IDs returned to look up whatever you want from there. I'm not sure I completely understand the structure of the lists. It looks like {disease}-{patient_id}-{some_other_int}.jpeg. I'm not sure how to add additional lookups to the functionality without understanding the input a bit more.
I mentioned pandas, but didn't mention how to use it, here's one way you could get your existing data into a dataframe:
import pandas as pd
# Sample data
train_cnv_list = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911628-94.jpeg', 'CNM-9911629-94.jpeg']
train_dme_list = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
train_drusen_list = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
train_normal_list = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
# Convert list to dataframe
def dataframe_from_list(img_list):
    df = pd.DataFrame(img_list, columns=['filename'])
    df['disease'] = [filename.split('.')[0].split('-')[0] for filename in img_list]
    df['patient_id'] = [filename.split('.')[0].split('-')[1] for filename in img_list]
    df['some_other_int'] = [filename.split('.')[0].split('-')[2] for filename in img_list]
    return df
# Generate a dataframe for each list
cnv_df = dataframe_from_list(train_cnv_list)
dme_df = dataframe_from_list(train_dme_list)
drusen_df = dataframe_from_list(train_drusen_list)
normal_df = dataframe_from_list(train_normal_list)
# or combine them into one long dataframe
df = pd.concat([cnv_df, dme_df, drusen_df, normal_df], ignore_index=True)
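With the combined dataframe, a short sketch (my addition) counts images per patient per disease and flags patient IDs that show up under more than one disease:
summary = df.groupby(['patient_id', 'disease']).size().reset_index(name='image_count')
print(summary)
# patient IDs that occur under more than one disease, if any
print(summary[summary.duplicated('patient_id', keep=False)])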

Start by creating a well-defined data structure, then use a Counter to answer your first question.
from typing import NamedTuple
from collections import Counter,defaultdict
class FileInfo(NamedTuple):
    disease: str
    patient_id: str
    image_id: str
l1 = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
l2 = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
l3 = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
l4 = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
lists = [l1,l2,l3,l4]
data_lists = []
for l in lists:
    data_lists.append([FileInfo(*f[:-5].split('-')) for f in l])

counters = []
for l in data_lists:
    counters.append(Counter(fi.patient_id for fi in l))

print(counters)
print('-----------')

cross_lists_data = dict()
for l in data_lists:
    for file_info in l:
        if file_info.patient_id not in cross_lists_data:
            cross_lists_data[file_info.patient_id] = defaultdict(int)
        cross_lists_data[file_info.patient_id][file_info.disease] += 1

print(cross_lists_data)
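From there, a small sketch (not in the original answer) flattens cross_lists_data into the (patient, disease, number of images) report the question describes:
for patient_id, disease_counts in sorted(cross_lists_data.items()):
    for disease, n_images in sorted(disease_counts.items()):
        print(patient_id, disease, n_images)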

Start by concatenating your data
import pandas as pd
import numpy as np
train_cnv_list = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
train_dme_list = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
train_drusen_list = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
train_normal_list = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
train_data = np.array([
    train_cnv_list,
    train_dme_list,
    train_drusen_list,
    train_normal_list
])
Create a Series with the flattened array
>>> train = pd.Series(train_data.flat)
>>> train
0 CNV-9911627-77.jpeg
1 CNV-9935363-45.jpeg
2 CNV-9911627-94.jpeg
3 DME-8889850-2.jpeg
4 DME-8773471-3.jpeg
5 DME-8797076-11.jpeg
6 DRUSEN-8986660-50.jpeg
7 DRUSEN-9100857-3.jpeg
8 DRUSEN-9025088-5.jpeg
9 NORMAL-9490249-31.jpeg
10 NORMAL-9509694-5.jpeg
11 NORMAL-9504376-3.jpeg
dtype: object
Use Series.str.extract together with regex to extract the information from the filenames and separate it into different columns
>>> pat = r'(?P<Disease>\w+)-(?P<Patient_ID>\d+)-(?P<IMG_ID>\d+)\.jpeg'
>>> train = train.str.extract(pat)
>>> train
Disease Patient_ID IMG_ID
0 CNV 9911627 77
1 CNV 9935363 45
2 CNV 9911627 94
3 DME 8889850 2
4 DME 8773471 3
5 DME 8797076 11
6 DRUSEN 8986660 50
7 DRUSEN 9100857 3
8 DRUSEN 9025088 5
9 NORMAL 9490249 31
10 NORMAL 9509694 5
11 NORMAL 9504376 3
Finally, aggregate the data and compute the total number of images per group based on the maximum IMG_ID number.
>>> report = train.groupby(["Patient_ID","Disease"])['IMG_ID'].agg(Total_IMGs="max")
>>> report
Total_IMGs
Patient_ID Disease
8773471 DME 3
8797076 DME 11
8889850 DME 2
8986660 DRUSEN 50
9025088 DRUSEN 5
9100857 DRUSEN 3
9490249 NORMAL 31
9504376 NORMAL 3
9509694 NORMAL 5
9911627 CNV 94
9935363 CNV 45
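One caveat (an editor's note, not part of the answer above): Series.str.extract returns strings, so the max above compares IMG_ID lexicographically. It happens to work for this sample, but casting to int first is safer:
>>> train['IMG_ID'] = train['IMG_ID'].astype(int)
>>> report = train.groupby(["Patient_ID","Disease"])['IMG_ID'].agg(Total_IMGs="max")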

Related

suggestion on how to solve an infinite loop problem (python-pandas)

I have a data frame with 384 rows (and an additional dummy one at the beginning).
Each row has 4 variables I entered manually, 3 fields calculated from those 4 variables,
and 3 fields that compare each calculated variable to the row before. Each calculated field can have one of two values (basically True/False).
Final goal: I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs exactly 6 times (2^6*6=384).
Each iteration builds a frequency table (pivot) and, if any group differs from 6, it breaks and randomizes the order again.
The problem is that there are 384!-12*6! possible combinations, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random
# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if df["left"] > df["right"] and df["left_size"] > df["right_size"] or df["left"] < df["right"] and df["left_size"] < df["right_size"]:
        return "Cong"
    else:
        return "InC"
# open file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right-DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)
again = 1
# main loop to try and find optimal order
while again == 1:
    # make a copy of the DF to not have to load it each iteration
    df = DF.copy()
    again = 0
    df["rand"] = [[random.randint(low=1, high=100000)] for i in range(df.shape[0])]
    # as 3 of the fields are calculated based on the previous row, the first one is a dummy and when sorted needs to stay first
    df.rand.loc[0] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', "Side_n1"])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break

Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
the data frame looks like this:
left right size_left size_right distance cong CR distance_n1 cong_n1 side_n1
1 6 22 44 5 T F dummy dummy dummy
5 4 44 22 1 T T F T F
2 3 44 22 1 F F T F F
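As an aside (an editor's sketch based on the code above, not part of the original post), the per-group check inside the while loop can be written without the explicit for/break, which also makes the 64-combinations requirement explicit:
sizes = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', 'Side_n1']).size()
# a valid ordering has all 64 combinations present, each exactly 6 times
again = 0 if len(sizes) == 64 and (sizes == 6).all() else 1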

Delete the first item of every tuple in a list in Python

I have a list of tuples in this format:
[("25.00", u"A"), ("44.00", u"X"),("17.00", u"E"),("34.00", u"Y")]
I want to count the number of times each letter appears.
I already created a sorted list with all the letters and now I want to count them.
First of all, I have a problem with the u before the second item of each tuple: I don't know how to delete it, and I guess it's something about encoding.
Here is my code
# coding=utf-8
from collections import Counter
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)
groupes = []
students = []
group_of_each_letter = []
number_of_students_per_group = []
final_list = []
def print_a_list(list):
    for items in list:
        print(items)
for i in df.index:
    groupes.append(df['GROUPE'][i])
    students.append(df[u'ÉTUDIANT'][i])
groupes = groupes[1:]
students = students[1:]
group_of_each_letter = list(set(groupes))
group_of_each_letter = sorted(group_of_each_letter)
z = zip(students, groupes)
z = list(set(z))
final_list = list(zip(*z))
for j in group_of_each_letter:
    number_of_students_per_group.append(final_list.count(j))
print_a_list(number_of_students_per_group)
group_of_each_letter is a list of the group letters without duplicates.
The problem is that I get the right number of values from the for loop at the end, but the list is filled with zeros.
The screenshot below is a sample of the Excel file. The column "ETUDIANT" means "student number", but I can't edit the file, I have to deal with it. GROUPE obviously means GROUP. The goal is to count the number of students per group. I think I'm on the right track even if there are easier ways to do that.
Thanks in advance for your help, even though I know my question is a bit ambiguous.
Building off of kerwei's answer:
Use groupby() and then nunique()
This will give you the number of unique Student IDs in each Group.
import pandas as pd
df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)
# Drop the empty row, which is actually the subheader
df.drop(0, axis=0, inplace=True)
# Now we get a count of unique students by group
student_group = df.groupby('GROUPE')[u'ÉTUDIANT'].nunique()
I think a groupby.count() should be sufficient. It'll count the number of occurrences of your GROUPE letter in the dataframe.
import pandas as pd
df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)
# Drop the empty row, which is actually the subheader
df.drop(0, axis=0, inplace=True)
# Now we get a count of students by group
sub_student_group = df.groupby(['GROUPE','ETUDIANT']).count().reset_index()
>>>sub_student_group
GROUPE ETUDIANT
0 B 29
1 L 88
2 N 65
3 O 27
4 O 29
5 O 34
6 O 35
7 O 54
8 O 65
9 O 88
10 O 99
11 O 114
12 O 122
13 O 143
14 O 147
15 U 122
student_group = sub_student_group.groupby('GROUPE').count()
>>>student_group
ETUDIANT
GROUPE
B 1
L 1
N 1
O 12
U 1
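As a side note on the original list-of-tuples example (my addition): the u prefix is just Python 2's unicode-string marker, not part of the data, so there is nothing to delete; collections.Counter can count the letters directly:
from collections import Counter

pairs = [("25.00", u"A"), ("44.00", u"X"), ("17.00", u"E"), ("34.00", u"Y")]
letter_counts = Counter(letter for _, letter in pairs)
print(letter_counts)  # each of A, X, E, Y counted once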

groupby and sum two columns and set as one column in pandas

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to group by the columns ['Home', 'Away'] and rename them as Team. Then I'd like to sum HomePoint and AwayPoint into a column named Points.
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I tried a different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
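One caveat worth flagging (an editor's note, not from the original answer): team D appears twice in the Away column, so the away Series has a duplicated index label, and the label-by-label alignment in add() can produce repeated rows for D. Aggregating the points per team first avoids that, for example:
home = data.groupby('Home')['HomePoint'].sum()
away = data.groupby('Away')['AwayPoint'].sum()
points = home.add(away, fill_value=0).astype(int).rename_axis('Team').rename('Points')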
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate, pairwise, columns of your input dataframe. Then use groupby.sum.
# calculate number of pairs
n = len(data.columns) // 2
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1, inplace=False)
          for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3

Python text processing: NLTK and pandas

I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.
My plan was to load the data as a pandas data frame, with each response representing a document. Unfortunately, I ran into an issue:
import pandas as pd
import nltk
pd.options.display.max_colwidth = 10000
txt_data = pd.read_csv("data_file.csv",sep="|")
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
txt_lines.append(line)
txt = str(txt_lines)
len(txt)
Out[14]: 1668813
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086
Note that in both cases, the text was processed so that anything other than spaces, letters and ,.?! was removed (for simplicity).
As you can see, a pandas field converted into a string returns fewer matches, and the length of the string is also shorter.
Is there any way to improve the above code?
Also, str(x) creates 1 big string out of the comments while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce a nltk.Text object that will retain document indices? In other words I'm looking for a way to create a Term Document Matrix, R's equivalent of TermDocumentMatrix() from tm package.
Many thanks.
The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row like so:
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]
df = pd.DataFrame(random_word_list, columns=['text'])
df.head()
text
0 Aaru Aaronic abandonable abandonedly abaction ...
1 abampere abampere abacus aback abalone abactor...
2 abaisance abalienate abandonedly abaff abacina...
3 Ababdeh abalone abac abaiser abandonable abact...
4 abandonable abandon aba abaiser abaft Abama ab...
len(df)
50
txt = df.text.apply(word_tokenize)
txt.head()
0 [Aaru, Aaronic, abandonable, abandonedly, abac...
1 [abampere, abampere, abacus, aback, abalone, a...
2 [abaisance, abalienate, abandonedly, abaff, ab...
3 [Ababdeh, abalone, abac, abaiser, abandonable,...
4 [abandonable, abandon, aba, abaiser, abaft, Ab...
txt.apply(len)
0 1000
1 1000
2 1000
3 1000
4 1000
....
44 1000
45 1000
46 1000
47 1000
48 1000
49 1000
Name: text, dtype: int64
As a result, you get the .count() for each row entry:
txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()
0 27
1 24
2 17
3 25
4 32
You can then sum the result using:
txt.sum()
1239
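To build the term-document matrix the question asks about (R's TermDocumentMatrix equivalent), one common route outside NLTK is scikit-learn's CountVectorizer; this is a sketch (my addition) and assumes scikit-learn is installed:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(df.text)  # sparse matrix: one row per document, one column per term
# requires scikit-learn >= 1.0 for get_feature_names_out
dtm_df = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out(), index=df.index)
Because dtm_df shares df's index, it can be joined column-wise with the other attributes for correlation analysis.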

Improve python code in terms of speed

I have a very big file (1.5 billion lines) in the following form:
1 67108547 67109226 gene1$transcript1 0 + 1 0
1 67108547 67109226 gene1$transcript1 0 + 2 1
1 67108547 67109226 gene1$transcript1 0 + 3 3
1 67108547 67109226 gene1$transcript1 0 + 4 4
.
.
.
1 33547109 33557650 gene2$transcript1 0 + 239 2
1 33547109 33557650 gene2$transcript1 0 + 240 0
.
.
.
1 69109226 69109999 gene1$transcript1 0 + 351 1
1 69109226 69109999 gene1$transcript1 0 + 352 0
What I want to do is to reorganize/sort this file based on the identifier in column 4. The file consists of blocks. If you concatenate columns 4, 1, 2 and 3 you create the unique identifier for each block. This is the key for the dictionary all_exons and the value is a numpy array containing all the values of column 8. Then I have a second dictionary unique_identifiers that has as key the attributes from column 4 and as value a list of the corresponding block identifiers. As output I write a file in the following form:
>gene1
0
1
3
4
1
0
>gene2
2
0
I already wrote some code (see below) that does this, but my implementation is very slow. It takes around 18 hours to run.
import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np
def parse_blocks(bedtools_file):
    unique_identifiers = {}  # Dictionary with key: gene, value: list of exons
    all_exons = {}  # Dictionary containing all exons
    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]
            identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
            if identifier in all_exons:
                item = float(sp_line[7])
                all_exons[identifier] = np.append(all_exons[identifier], item)
            else:
                all_exons[identifier] = np.array([sp_line[7]], float)
            if current_id in unique_identifiers:
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] = set([identifier])
    return unique_identifiers, all_exons

identifiers, introns = parse_blocks(options.bed)
w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">" + str(gene) + "\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base) + "\n")
w.close()
How can I improve the above code so that it runs faster?
You also import pandas, therefore, I provide a pandas solution which requires basically only two lines of code.
However, I do not know how it performs on large data sets and whether that is faster than your approach (but I am pretty sure it is).
In the example below, the data you provide is stored in table.txt. I then use groupby to get all the values in your 8th column, store them in a list for the respective identifier in your column 4 (note that my indices start at 0) and convert this data structure into a dictionary which can then be printed easily.
import pandas as pd
df=pd.read_csv("table.txt", header=None, sep = r"\s+") # replace the separator by e.g. '/t'
op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))
So in this case op looks like this:
{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}
Now you could print the output like this and pipe it into a file:
for k,v in op.iteritems():
    print k.split('$')[0]
    for val in v:
        print val
This gives you the desired output:
gene1
0
1
3
4
1
0
gene2
2
0
Maybe you can give it a try and let me know how it compares to your solution!?
Edit2:
In the comments you mentioned that you would like to print the genes in the correct order. You can do this as follows:
# add some fake genes to op from above
op['gene0$stuff'] = [7,9]
op['gene4$stuff'] = [5,9]
# print using 'sorted'
for k,v in sorted(op.iteritems()):
    print k.split('$')[0]
    for val in v:
        print val
which gives you:
gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9
EDIT1:
I am not sure whether duplicates are intended but you could easily get rid of them by doing the following:
op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))
Now op2 would look like this:
{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}
You print the output as before:
for k,v in op2.iteritems():
    print k.split('$')[0]
    for val in v:
        print val
which gives you
gene1
0
1
3
4
gene2
0
2
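One practical note on the pandas route (an editor's addition, not part of the answer above): with 1.5 billion lines a single read_csv may not fit in memory, but pandas can process the file in chunks. A sketch, with the file name, chunk size and output name assumed; it keeps values in file order within each gene rather than sorting blocks:
import pandas as pd
from collections import defaultdict

values_by_gene = defaultdict(list)
# column 3 holds gene$transcript, column 7 the value of interest
for chunk in pd.read_csv("table.txt", header=None, sep=r"\s+", chunksize=1_000_000):
    for name, vals in chunk.groupby(3)[7]:
        values_by_gene[name.split('$')[0]].extend(vals.tolist())

with open("output.txt", "w") as out:
    for gene in sorted(values_by_gene):
        out.write(">" + gene + "\n")
        for v in values_by_gene[gene]:
            out.write(str(v) + "\n")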
I'll try to simplify your question; my solution is like this:
First, scan over the big file. For every different current_id, open a temporary file and append the value of column 8 to that file.
After the full scan, concatenate all chunks into a result file.
Here's the code:
# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess
class ChunkBoss(object):
    """Boss for file chunks"""
    def __init__(self):
        self.opened_files = {}

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            self.opened_files[current_id] = open(tempfile.mktemp(), 'wb')
            self.opened_files[current_id].write('>%s\n' % current_id)
        self.opened_files[current_id].write('%s\n' % value)

    def cat_result(self, filename):
        """Catenate chunks to one big file
        """
        # Sort the chunks
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)
        # Flush chunks
        [chunk.flush() for chunk in self.opened_files.values()]
        # By calling cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat', ] + chunk_file_list, stdout=fp, stderr=fp)

    def clean_up(self):
        [os.unlink(chunk.name) for chunk in self.opened_files.values()]

def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]
            # Write value to temp chunk
            boss.write_chunk(current_id, value)
    boss.cat_result('result.txt')
    boss.clean_up()

if __name__ == '__main__':
    main()
I tested the performance of my script, with bigfile.data containing about 150k lines. It took about 0.5s to finish on my laptop. Maybe you can give it a try.
