I need to print my data as a structure using Python

Data: [(Taru, 1234ABCD, 4536, EF32), (Aarul, 10045660, 4562, ABDE), (Vinay, 1254EFDC, 2587, AC42)] in list form
Output should be like this (tabular form):
Taru 1234ABCD
4536
EF32
Aarul 10045660
4562
ABDE
Vinay 1254EFDC
2587
AC42
Please give your inputs to resolve this query. Thanks!

You can use this small script:
l = [['Taru', '12345678ABCDEF', 453678], ['Aarul', '10045660ABDECABF', 45621278]]
print("HEADER1 HEADER2 HEADER3")
for ele1, ele2, ele3 in l:
    print("{:<14}{:<11}{:13}".format(ele1, ele2, ele3))
Result:
HEADER1 HEADER2 HEADER3
Taru 12345678ABCDEF 453678
Aarul 10045660ABDECABF 45621278
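If you don't want to hard-code the widths 14/11/13, a small variant sketch (same data as above, widths derived from the content) could look like this:
# A sketch that computes the column widths from the data instead of
# hard-coding them; the list `l` is the same as in the answer above.
l = [['Taru', '12345678ABCDEF', 453678], ['Aarul', '10045660ABDECABF', 45621278]]
rows = [['HEADER1', 'HEADER2', 'HEADER3']] + [[str(c) for c in row] for row in l]
widths = [max(len(row[i]) for row in rows) for i in range(3)]
for row in rows:
    print('  '.join(cell.ljust(w) for cell, w in zip(row, widths)))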

I think your main question is how to split the list you got? This seems to be the pattern to do so.
EDIT: as per the comment it was mainly about formatting; this is one possible solution:
entries = [["Taru", "1234ABCD", "4536", "EF32"], ["Aarul", "10045660", "4562", "ABDE"], ["Vinay", "1254EFDC", "2587", "AC42"]]
csv = 'Name,information\n'
# this has split your array into the parts you want
for entry in entries:
    left = entry[0]
    for word in entry[1:]:
        print("{:<10}{:<10}".format(left, word))
        csv += str(left) + ',' + str(word) + '\n'
        left = ''
    print()
with open('output.csv', 'w') as file:
    file.write(csv)
OUTPUT:
Taru 1234ABCD
4536
EF32
Aarul 10045660
4562
ABDE
Vinay 1254EFDC
2587
AC42
output.csv:
Name,information
Taru,1234ABCD
,4536
,EF32
Aarul,10045660
,4562
,ABDE
Vinay,1254EFDC
,2587
,AC42
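As a side note, the same output.csv can be produced with the standard csv module instead of concatenating strings by hand; a minimal sketch, assuming the same entries list as above:
# A sketch using the csv module; produces the same output.csv layout.
import csv

entries = [["Taru", "1234ABCD", "4536", "EF32"],
           ["Aarul", "10045660", "4562", "ABDE"],
           ["Vinay", "1254EFDC", "2587", "AC42"]]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'information'])
    for entry in entries:
        writer.writerow([entry[0], entry[1]])  # name next to its first value
        for word in entry[2:]:
            writer.writerow(['', word])        # remaining values on their own rows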

Related

How to browse a csv in Python by intervals of letters at the beginning of lines

I have a csv that contains a lot of data. When I launch a web-scraping run, I receive a:
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
In order to limit the amount of data to be processed for web scraping, I would like to split the following script into several scripts, each browsing an interval of the csv file:
# Get the data from the csv containing the pmid list by author:
with open("D:/Nancy/Pèse-Savants/Excercice Covid-19/Exercice 3/pmid_par_auteur.csv", 'r', encoding='utf-8') as f:
    # Separate the author's list from the pmid's list into 2 columns:
    with open("pmid_par_auteur_uniformise.csv", "w", encoding='utf-8') as fu:
        csv_f = csv.reader(f, delimiter=';')
        for ligne in csv_f:
            fu.write(ligne[0] + '\n')
auteur_pmid_doi = []
# Clean up data encoded in 'utf-8'
with open("pmid_par_auteur_uniformise.csv", encoding='utf-8') as fu:
    csv_fu = csv.reader(fu)
    for ligne in csv_fu:
        ligne[1] = ligne[1].replace("'", " ")
        ligne[1] = ligne[1].replace("[", " ")
        ligne[1] = ligne[1].replace("]", " ")
        ligne[1] = ligne[1].split(" , ")
        # Get the DOI for each pmid for each author that wrote on Covid-19
        pmid_doi = []
        for pmid in ligne[1]:
            try:
                handle = Entrez.esummary(db="pubmed", id=pmid)
                record = Entrez.read(handle)
                record = record[0]['DOI']
            except IndexError:
                print('Missing DOI')
            except KeyError:
                print('Missing DOI')
            else:
                pmid_doi.append([pmid, record])
            # Handles are a finite resource; I close each one to avoid exhausting the handle supply with a large dataset.
            handle.close()
        auteur_pmid_doi.append([ligne[0], pmid_doi])
        # Delete temporary variables to free some space in the RAM:
        del (ligne[1])
        del (handle)
        del (record)
        del (pmid_doi)
auteur_pmid_doi
Each script would run through a data interval like this:
From the first line starting with the letter A to the last line starting with the letter E.
From the first line starting with the letter F to the last line starting with the letter J.
From the first line starting with the letter K to the last line starting with the letter O.
And so on, up to "Z".
How do you browse the lines of a csv through these kinds of intervals?
I've added the link to my csv; thank you in advance for your help.
pmid_par_auteur_uniformise.csv
Assuming the file is sorted, the following will:
group rows by their starting letter,
collect rows from five groups at a time,
call process() with the rows.
The sample process() will:
convert the pmids string to a list of strings,
count the number of rows,
print the first and last row of the group.
Code:
import csv
import itertools
import ast

def process(rows):
    rows = [[name, ast.literal_eval(pmids)] for name, pmids in rows]
    print(f' -> {len(rows)} row(s)')
    print(f'  first row: {rows[0]}')
    print(f'  last row: {rows[-1]}')

with open('pmid_par_auteur_uniformise.csv', encoding='utf-8-sig', newline='') as f:
    r = csv.reader(f)
    rows = []
    for i, (key, group) in enumerate(itertools.groupby(r, key=lambda x: x[0][0])):
        rows.extend(list(group))
        print(key, end='')
        if i % 5 == 4:
            process(rows)
            rows = []
    if rows:
        process(rows)
Output:
ABCDE -> 3124 row(s)
first row: ['A Aljabali Alaa A', ['32397911']]
last row: ['Eğrilmez Sait', ['32366061', '32328919', '32099313', '31486610', '31245968', '31045617', '30202612', '29697353', '29109898', '28761568', '28405481', '28169512', '27849319', '27758983', '27447356', '27028890', '26887566', '26801653', '26654385', '26558208', '26401170', '26208680', '26159181', '25955827', '25247377', '25111119', '24621171', '24367973', '24322806', '23970986', '23684361', '23205897', '23061415', '21598429', '21453232', '21414053', '21353409', '21332970', '20490802', '20385953', '20164799', '19958116', '26649479', '19237782', '18404071', '18209647', '18039348', '17641704', '15696768', '15305561', '15177606', '15177597', '15050248', '12908533', '12780405', '12752051', '12662987']]
FGHIJ -> 2087 row(s)
first row: ['Fabbricatore Davide', ['32383763', '31898206', '31564087', '29754460', '29460403', '29252600', '28775965']]
last row: ['Jüni Peter', ['32450456', '32396180', '32385067', '32385063', '32333878', '32294317', '32293511', '32241376', '32215640', '32199780', '32139280', '32139222', '32006758', '32006156', '31920002', '31857278', '31857277', '31854112', '31851302', '31845894', '31841136', '31707794', '31696762', '31693078', '31672177', '31648781', '31589276', '31570258', '31537275', '31525083', '31497854', '31488373', '31488372', '31476244', '31462531', '31434508', '31410968', '31397487', '31379378', '31368907', '31329852', '31269364', '31217143', '31204678', '31197439', '31164366', '31132298', '31084961', '31056295', '30975683', '30888959', '30852547', '30846254', '30833323', '30789921', '30703644', '30689825', '30667361', '30601734', '30596995', '30592349', '30566213', '30560696', '30424891', '30356345', '30354650', '30354532', '30347031', '30291678', '30215374', '30182362', '30170848', '30166073', '30165632', '30165437', '30165435', '30153988', '30146969', '30107514', '30044478', '29992264', '29916872', '29912740', '29885826', '29850808', '29794879', '29786535', '29785878', '29742109', '29628287', '29606865', '29487111', '29478826', '29467161', '29277234', '29251754', '29228059', '29205157', '29162610', '29155984', '29130845', '29113968', '29097450', '29045581', '29038228', '29020259', '28967416', '28948934', '28886622', '28850362', '28827257', '28796809', '28790165', '28781251', '28742627', '28732814', '28699595', '28671552', '28611089', '28601820', '28566364', '28536005', '28528767', '28472484', '28430920', '28425755', '28330794', '28329389', '28253938', '28213601', '28213600', '28185702', '28079554', '28067197', '28029055', '28027351', '28017369', '28003290', '27998831', '27923461', '27884241', '27753599', '27733354', '27677503', '27665852', '27578808', '27497359', '27479866', '27478115', '27437661', '27389906', '27372195', '27318845', '27296200', '27289296', '27252878', '27179724', '27125947', '27078262', '27033859', '31997951', '26997557', '26979080', '26916479', '26896474', '26823484', '26762519', '26741741', '26700531', '26690319', '26655339', '26649651', '26606735', '26585615', '26490760', '26453687', '26428025', '26408014', '26376691', '26373562', '26352574', '26334160', '26324049', '26210282', '26208006', '26205445', '26196758', '26142466', '26071600', '26043895', '26040806', '26010634', '26007299', '25979551', '25934823', '25910501', '25875821', '25858975', '25794671', '25794517', '25791214', '25634905', '25623431', '25572026', '25551539', '25546177', '25529190', '25524605', '25495124', '25494429', '25489846', '25433627', '25423953', '25416325', '25330508', '25229835', '25208215', '25189359', '25187201', '25184244', '25182248', '25176289', '25173601', '25173535', '25173339', '25169183', '25163691', '25112661', '25042419', '25042271', '25011716', '24958760', '24958153', '24919052', '24882698', '24847017', '24755380', '24738641', '24711124', '24694729', '24682843', '24676282', '24631113', '24602961', '24552862', '24531331', '24429160', '24332419', '24206920', '24132187', '24064474', '24064377', '24039795', '23993323', '23968698', '23946263', '23909727', '23822782', '23793972', '23759706', '23747228', '23723742', '23702009', '23514285', '23487519', '23386662', '23370065', '23339812', '23277909', '23169986', '23152242', '23045205', '23008508', '22995882', '22945832', '22924638', '22922416', '22910755', '22868835', '22846347', '22759453', '22739992', '22726632', '22711083', '22645184', '22625186', '22607867', '22580250', '22456025', '22447805', '22440496', '22362513', '22361598', '22319063', 
'22302840', '22301368', '22285579', '22238228', '22093210', '22078420', '22075451', '22056618', '22027687', '22008217', '21959221', '21931648', '21930465', '21904996', '21878462', '21851904', '21768536', '21700254', '21646500', '21641358', '21632168', '21596229', '21396782', '21385807', '21362706', '21356042', '26063638', '21330239', '21296599', '21224324', '21205944', '21161860', '21042932', '20884434', '20870808', '20853471', '20847017', '20807617', '20639294', '20633818', '26061467', '20562074', '20506333', '20464751', '20461793', '20298923', '20152241', '20152233', '20142179', '20091539', '19950329', '19930626', '19889649', '19821404', '19821403', '19821302', '19821296', '19819375', '19778775', '19736281', '19736154', '19679616', '19620501', '19370423', '19284063', '19204314', '19074491', '19036745', '18804739', '18765162', '18757996', '18534034', '18512273', '18502079', '18414453', '18316340', '18272504', '18050181', '17968921', '17903638', '17869634', '17868802', '17726091', '17707588', '17696267', '17606174', '17606172', '17602184', '17438317', '17321312', '16979535', '16824829', '16717169', '16704569', '16255025', '16125589', '16105989', '15947376', '15911545', '15897534', '15649954', '15641050', '15582059', '15513969', '15485938', '15122753', '15087341', '14960422', '12912727', '12814907', '12654410', '12574052', '12456259', '12435252', '12111917', '12039807', '12038917', '11914306']]
KLMNO -> 3147 row(s)
first row: ['Kaakinen Markus', ['32320402', '32306232', '32301734', '31895044', '31815340', '31761930', '30465150', '30176935', '29892700', '29623505', '29048938', '27441787', '27094352', '26620915', '26563678']]
last row: ['Ozog David M', ['32452977', '32446829', '32442698', '32291807', '32246972', '32224709', '31820478', '31797796', '31743247', '31592926', '31567612', '31449081', '31335426', '31335419', '31268498', '30528311', '30528310', '30322300', '30235390', '29381548', '28975212', '28658462', '28522039', '28522038', '27984327', '27336945', '27051812', '26845540', '26547045', '26504503', '26458039', '25946625', '25738444', '24891062', '24664987', '24196328', '23157724', '23069917', '22269028', '22093099', '22055283', '21931055', '21864935', '21518098', '21457398', '20855672', '20384757', '20227579', '17254036', '16918570']]
PQRST -> 2954 row(s)
first row: ['Paakkari Leena', ['32438595', '32302535', '31654998', '31410386', '31297559', '30753409', '30578457', '30412226', '29510702', '28673131', '27655781', '26630180', '24609436']]
last row: ['Tánczos Krisztián', ['32453702', '29953455', '29264668', '27597981', '27288610', '26543848', '25608924', '24818123', '24457113', '24012232']]
UVWXY -> 1509 row(s)
first row: ['Ubaldi Filippo M', ['32404170', '32354663', '32038484', '31824427', '31756248', '31403619', '31174207', '30848112', '30739329', '28895679', '28362681', '27827818', '27310263', '26921622', '26728489', '26531067', '25985139', '24884585', '21665542', '16580372']]
last row: ['Yılmaz Aydın', ['32299200', '32029697', '31799089', '31615316', '31130132', '30683026', '30582673', '29151304', '29135403', '29052061', '28940588', '28653494', '28621292', '28393719', '27704696', '27481085', '27442525', '27412127', '26885104', '26556895', '26523900', '26482979', '26422878', '26374581', '26281327', '26257957', '26107228', '25619495', '25492957', '25492815', '24976998', '24814084', '24506753', '23949189', '23431310', '23377781', '23030749', '22609980', '22320975', '22087531', '22019748', '21851414', '21334917', '21038140', '20981186', '20954282', '20689268', '20517728', '20332658', '19803278', '17845896', '16650973', '16304288', '16217989', '15143428', '14971870']]
ZdvÁÇ -> 741 row(s)
first row: ['Zabetakis Ioannis', ['32438620', '32340775', '32270798', '32224958', '31816871', '31540159', '31137500', '30721934', '30669323', '30619728', '30381909', '30319088', '29882848', '29757226', '29494487', '29135918', '28714908', '28119955', '27109548', '24973582', '24735421', '24128590', '24084786', '23957417', '23480708', '23433838', '25212344', '22087726', '26047447', '16390205']]
last row: ['Çinier Göksel', ['32462219', '32406873', '32338313', '32250347', '32222434', '32147660', '32147654', '32035356', '31846583', '31764010', '31707766', '31670716', '31582673', '31542896', '31483310', '31339201', '31204510', '31139269', '31038781', '30961362', '30928819', '30815636', '30808220', '30694809', '30230925', '30174880', '30149941', '30075884', '30024391', '30022507', '29894304', '29848928', '29523425', '29487682', '29451310', '29339686', '29191504', '28971172', '28898454', '28864320', '28838153', '28595215', '28595209', '28592959', '28424447', '28401800', '28169085', '27641906', '27608320', '27581673', '27414730', '27341666', '26946973', '26778640', '26295613']]
ÖØÜĆČ -> 12 row(s)
first row: ['Özdemir Vural', ['32319847', '32316827', '32223589', '32105560', '32027574', '31990612', '31855503', '31794335', '31794294', '31199695', '31094658', '31066623', '30789303', '30707659', '30362880', '30281399', '30260734', '30036157', '30004300', '29791250', '29432059', '29431577', '29293405', '29083982', '29064337', '28655746', '28622116', '28253085', '27726641', '27631187', '27310474', '27211534', '26793622', '26785082', '26684591', '26645377', '26484977', '26430925', '26345196', '26270647', '26161545', '26066837', '26061584', '25970399', '25748435', '25656538', '25353263', '25000304', '24955641', '24795761', '24795460', '24766116', '24730382', '24649998', '24521341', '24456465', '24456464', '26120345', '27447251', '24048056', '27442201', '23765483', '23574338', '23531886', '23301640', '23258262', '23249198', '23249197', '23249193', '23249192', '23194449', '22987569', '22545073', '22523528', '22401659', '22279516', '22279515', '22220951', '22198458', '21848419', '21848418', '21490881', '21476845', '21400375', '21399751', '20977184', '20808949', '20547595', '20526970', '20455752', '19882466', '19470561', '19290811', '19290807', '19214141', '19207026', '19040373', '19040370', '18708948', '18570104', '18480690', '18266561', '18075468', '17963681', '17924827', '17716237', '17559347', '17481382', '17481379', '17474081', '17439540', '17429316', '17286540', '17224711', '17184207', '17093467', '16900136', '16544144', '16433578', '15738749', '14965233', '14582457', '12054061', '11864722']]
last row: ['Čivljak Rok', ['32426118', '32118371', '31661701', '30801727', '29164144', '27625226', '26638539', '26538030', '26494527', '26012149', '25801665', '25634680', '25274934', '24392752', '24192278', '23941014', '21294322', '23120878', '18773823', '15936084', '15330131']]
İŚŞŠ -> 10 row(s)
first row: ['İrkeç Murat', ['32366061', '32287143', '31844978', '31232743', '30605936', '30471351', '29944505', '29554816', '29196953', '29135602', '28598963', '28553957', '28390092', '27874294', '27790127', '27775457', '27467041', '27513901', '27452505', '27055218', '26257227', '26187885', '26035419', '25811727', '25686056', '25642816', '25603441', '25370397', '25264994', '25203662', '25119963', '25069074', '25069002', '24967185', '24803156', '24790882', '24774884', '24767236', '24646901', '24627252', '24401152', '24216677', '24145558', '23806084', '23730902', '23635854', '23564611', '23561604', '23378724', '23377585', '23323578', '23084388', '22885886', '22799438', '22415150', '21575121', '21174000', '20813741', '20595897', '20544681', '20299976', '19878104', '19825835', '19681791', '19618994', '19491966', '19396777', '19158563', '19085375', '18700919', '18675411', '18580271', '18524193', '18471650', '18414106', '18352875', '18216575', '18083592', '17760635', '17522853', '17519661', '17413955', '17300569', '17198020', '17102682', '17068451', '16849641', '16280977', '16075220', '15968156', '15068429', '14746595', '14746583', '14566648', '14533030', '12791215', '12789598', '12789579', '12738551', '12695716', '12648019', '12427231', '12072719', '12065849', '12027104', '11821216', '11820668']]
last row: ['Šín Robin', ['32434337', '32304370', '31974532', '31971245']]
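For reference, a tiny standalone sketch (made-up rows, not the real csv) of what the itertools.groupby call above does: it buckets consecutive rows by their first letter, which is why the file must be sorted first.
import itertools

rows = [['Alice', "['1']"], ['Anna', "['2']"], ['Bob', "['3']"]]
# groupby only groups *adjacent* items, so unsorted input would split groups.
for key, group in itertools.groupby(rows, key=lambda x: x[0][0]):
    print(key, list(group))
# A [['Alice', "['1']"], ['Anna', "['2']"]]
# B [['Bob', "['3']"]]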

Parse a structured machine text file (config file) into a structured table format

The main goal is to get from a more or less readable config file to a table format that can be read by everyone without deeper understanding of the machine and its configuration standards.
I've got a config file:
******A MANO:111111 ,20190726,001,0914,06621242746
DXS*HAWA776A0A*VA*V0/6*1
ST*001*0001
ID1*HAW250755*VMI1-9900****250755*6*0
CB1*021545*DeBright*7.030.16*3.02*250755
PA1*0*100
PA1*1*60
PA2*2769*166140*210*12600*0*0*0*0
******E MANO:111111 ,20190726,001,0914,06621242746
******A MANO:222222 ,20190726,001,0914,06621242746
DXS*HAWA776A0A*VA*V0/6*1
ST*001*0001
ID1*HAW250755*VMI1-9900****250755*6*0
CB1*021545*DeBright*7.030.16*3.02*250755
PA1*0*100
PA1*1*60
PA2*2769*166140*210*12600*0*0*0*0
******E MANO:222222 ,20190726,001,0914,06621242746
There are several objects in the file, always starting with 'A MANO:' and ending with 'E MANO:' followed by the object number.
All the lines underneath are the attributes of the object (settings of the machine). Not all objects have the same number of settings: there could be 55 lines for one object and 199 for another.
What I tried so far:
from pyparsing import *
'''
grammar:
object_nr ::= Word(nums, exact=6)
num ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
'''
path_input = r'\\...\...'
with open(path_input) as input_file:
    line = input_file.readline()
    cnt = 1
    object_nr_parser = Word(nums, exact=6)
    for match, start, stop in object_nr_parser.scanString(input_file):
        print(match, start, stop)
which gives me the printout:
['201907'] 116 122
['019211'] 172 178
i.e. the numbers it finds and their start and end positions in the string. But these numbers are neither what I'm looking for nor correct, presumably because the parser matches any 6-digit run (e.g. '201907' inside the date 20190726), and I can't even find the second number in the config file.
Is pyparsing the right way to solve this, or is there a more convenient way? Where did I make a mistake?
In the end it would be great to have an object for every machine whose attributes are all the lines between the A MANO: and the E MANO:
expected result would be something like this:
{"object": "111111",
"line1":"DXS*HAWA776A0A*VA*V0/6*1",
"line2":"ST*001*0001",
"line3":"ID1*HAW250755*VMI1-9900****250755*6*0",
"line4":"CB1*021545*DeBright*7.030.16*3.02*250755",
"line5":"PA1*0*100",
"line6":"PA1*1*60",
"line7":"PA2*2769*166140*210*12600*0*0*0*0"},
{"object": "222222",
"line1":"DXS*HAWA776A0A*VA*V0/6*1",
"line2":"ST*001*0001",
"line3":"ID1*HAW250755*VMI1-9900****250755*6*0",
"line4":"CB1*021545*DeBright*7.030.16*3.02*250755",
"line5":"PA1*0*100",
"line6":"PA1*1*60",
"line7":"PA2*2769*166140*210*12600*0*0*0*0",
"line8":"PA2*2769*166140*210*12600*0*0*0*0",
"line9":"PA2*2769*166140*210*12600*0*0*0*0",
"line10":"PA2*2769*166140*210*12600*0*0*0*0"}
Not sure if that is the best solution for the purpose, but it's the one that came to mind at this point.
One of the dirtiest ways to get it done would be to use a regex and replace the MANO markers with line breaks and all the line breaks with ';'. I don't think that is a solution one should use.
You can parse it line by line:
import re

with open('file.txt', 'r') as f:
    lines = f.readlines()
lines = [x.strip() for x in lines]
result = []
name = ''
i = 1
for line in lines:
    if 'A MANO' in line:
        name = re.findall(r'A MANO:(\d+)', line)[0]
        result.append({'object': name})
        i = 1
    elif 'E MANO' not in line:
        result[-1][f'line{i}'] = line
        i += 1
Output:
[{
'object': '111111',
'line1': 'DXS*HAWA776A0A*VA*V0/6*1',
'line2': 'ST*001*0001',
'line3': 'ID1*HAW250755*VMI1-9900****250755*6*0',
'line4': 'CB1*021545*DeBright*7.030.16*3.02*250755',
'line5': 'PA1*0*100',
'line6': 'PA1*1*60',
'line7': 'PA2*2769*166140*210*12600*0*0*0*0'
}, {
'object': '222222',
'line1': 'DXS*HAWA776A0A*VA*V0/6*1',
'line2': 'ST*001*0001',
'line3': 'ID1*HAW250755*VMI1-9900****250755*6*0',
'line4': 'CB1*021545*DeBright*7.030.16*3.02*250755',
'line5': 'PA1*0*100',
'line6': 'PA1*1*60',
'line7': 'PA2*2769*166140*210*12600*0*0*0*0'
}
]
But I suggest using a more compact output format:
import re

with open('file.txt', 'r') as f:
    lines = f.readlines()
lines = [x.strip() for x in lines]
result = {}
name = ''
for line in lines:
    if 'A MANO' in line:
        name = re.findall(r'A MANO:(\d+)', line)[0]
        result[name] = []
    elif 'E MANO' not in line:
        result[name].append(line)
Output:
{
'111111': ['DXS*HAWA776A0A*VA*V0/6*1', 'ST*001*0001', 'ID1*HAW250755*VMI1-9900****250755*6*0', 'CB1*021545*DeBright*7.030.16*3.02*250755', 'PA1*0*100', 'PA1*1*60', 'PA2*2769*166140*210*12600*0*0*0*0'],
'222222': ['DXS*HAWA776A0A*VA*V0/6*1', 'ST*001*0001', 'ID1*HAW250755*VMI1-9900****250755*6*0', 'CB1*021545*DeBright*7.030.16*3.02*250755', 'PA1*0*100', 'PA1*1*60', 'PA2*2769*166140*210*12600*0*0*0*0']
}
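Since the goal is a table everyone can read, a possible follow-up sketch (assuming pandas is available; this is not part of the answer above) turns the compact dict into a DataFrame, padding objects that have fewer settings:
import pandas as pd

# `result` as built by the compact version above; shortened here for the sketch.
result = {
    '111111': ['DXS*HAWA776A0A*VA*V0/6*1', 'ST*001*0001', 'PA1*0*100'],
    '222222': ['DXS*HAWA776A0A*VA*V0/6*1', 'ST*001*0001'],
}
# from_dict(orient='index') makes one row per object and pads shorter
# rows with NaN, so objects with 55 and 199 settings can share one table.
df = pd.DataFrame.from_dict(result, orient='index')
df.columns = ['line%d' % (i + 1) for i in df.columns]
print(df)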

Join whole words by their tag in Python

Let's say I have these sentences:
His/O name/O is/O Petter/Name Jack/Name and/O his/O brother/O name/O is/O
Jonas/Name Van/Name Dame/Name
How can I get a result like this:
Petter Jack, Jonas Van Dame.
So far I've tried this, but it still only joins 2 words:
import re
pattern = re.compile(r"\w+/Name")
sent = sentence.split()
for i, w in enumerate(sent):
    if pattern.match(sent[i]) is not None:
        if pattern.match(sent[i + 1]) is not None:
            pass
            # ....
            # join sent[i] and sent[i+1] element
            # ....
Try something like this:
pattern = re.compile(r"((\w+/Name\s*)+)")
names = pattern.findall(your_string)
for name in names:
    print(''.join(name[0].split('/Name')))
I'm thinking about a two-phase solution:
r = re.compile(r'\w+/Name(?:\ \w+/Name)*')
result = r.findall(s)
# -> ['Petter/Name Jack/Name', 'Jonas/Name Van/Name Dame/Name']
for r in result:
    print(r.replace('/Name', ''))
# -> Petter Jack
# -> Jonas Van Dame
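Putting the second approach together into a self-contained sketch (the sentence is hard-coded here) that produces exactly the asked-for output:
import re

sentence = ("His/O name/O is/O Petter/Name Jack/Name and/O his/O brother/O "
            "name/O is/O Jonas/Name Van/Name Dame/Name")

# Each match is a run of consecutive word/Name tokens; strip the tags afterwards.
names = [m.group().replace('/Name', '')
         for m in re.finditer(r'\w+/Name(?: \w+/Name)*', sentence)]
print(', '.join(names) + '.')  # -> Petter Jack, Jonas Van Dame.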

Biopython translate() error

I have a file that looks as so:
Type Variant_class ACC_NUM dbsnp genomic_coordinates_hg18 genomic_coordinates_hg19 HGVS_cdna HGVS_protein gene disease sequence_context_hg18 sequence_context_hg19 codon_change codon_number intron_number site location location_reference_point author journal vol page year pmid entrezid sift_score sift_prediction mutpred_score
1 DM CM920001 rs1800433 null chr12:9232351:- NM_000014.4 NP_000005.2:p.C972Y A2M Chronicobstructivepulmonarydisease null CACAAAATCTTCTCCAGATGCCCTATGGCT[G/A]TGGAGAGCAGAATATGGTCCTCTTTGCTCC TGT TAT 972 null null 2 null Poller HUMGENET 88 313 1992 1370808 2 0 DAMAGING 0.594315245478036
1 DM CM004784 rs74315453 null chr22:43089410:- NM_017436.4 NP_059132.1:p.M183K A4GALT Pksynthasedeficiency(pphenotype) null TGCTCTCCGACGCCTCCAGGATCGCACTCA[T/A]GTGGAAGTTCGGCGGCATCTACCTGGACAC ATG AAG 183 null null 2 null Steffensen JBC 275 16723 2000 10747952 53947 0 DAMAGING 0.787878787878788
I want to translate the information from columns 13 and 14 to their corresponding amino acids. Here is the script that I've generated:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
import csv

InFile = open("disease_mut_splitfinal.txt", 'rU')
InFile.readline()
OriginalSeq_list = []
MutSeq_list = []
with open("disease_mut_splitfinal.txt") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        OriginalSeq = row['codon_change']
        MutSeq = row['codon_number']
        region = row["genomic_coordinates_hg19"]
        gene = row["gene"]
        OriginalSeq_list.append(OriginalSeq)
        MutSeq_list.append(MutSeq)
OutputFileName = "Translated.txt"
OutputFile = open(OutputFileName, 'w')
OutputFile.write('' + region + '\t' + gene + '\n')
for i in range(0, len(OriginalSeq_list)):
    OrigSeq = OriginalSeq_list[i]
    MutSEQ = MutSeq_list[i]
    print OrigSeq
    translated_original = OrigSeq.translate()
    translated_mut = MutSEQ.translate()
    OutputFile.write("\n" + OriginalSeq_list[i] + "\t" + str(translated_original) + "\t" + MutSeq_list[i] + "\t" + str(translated_mut) + "\n")
However, I keep getting this error:
TypeError: translate expected at least 1 arguments, got 0
I'm kind of at a loss for what I'm doing wrong. Any suggestions?
https://www.dropbox.com/s/cd8chtacj3glb8d/disease_mut_splitfinal.txt?dl=0
(File should still be downloadable even if you don't have a dropbox)
You are using the string method "translate" instead of the Biopython Seq object method translate(), which is what I assume you want to do. You need to convert the string into a Seq object and then translate that. Try:
from Bio import Seq
OrigSeq = Seq.Seq(OriginalSeq_list[i])
translated_original = OrigSeq.translate()
Alternatively
from Bio.Seq import Seq
OrigSeq = Seq(OriginalSeq_list[i])
translated_original = OrigSeq.translate()
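For example, a quick sanity check (assuming Biopython is installed) on the codons from the first sample row:
from Bio.Seq import Seq

# Codons from the first sample row above: TGT -> C and TAT -> Y,
# consistent with its annotation NP_000005.2:p.C972Y.
print(Seq("TGT").translate())  # C
print(Seq("TAT").translate())  # Y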

How to merge only the unique lines from file_a to file_b?

This question has been asked here in one form or another, but not quite the thing I'm looking for. So, this is the situation: I already have one file, named file_a, and I'm creating another file, file_b. file_a is always bigger than file_b in size. There will be a number of duplicate lines in file_b (hence, in file_a as well), but both files will have some unique lines. What I want to do is copy/merge only the unique lines from file_a to file_b and then sort the line order, so that file_b becomes the most up-to-date one with all the unique entries. Neither of the original files should be more than 10 MB in size. What's the most efficient (and fastest) way to do that?
I was thinking of something like this, which does the merging all right:
#!/usr/bin/env python
import os, time, sys

# Convert Date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file, c_file]
m_file = "merged.file"
for x in range(len(n_file)):
    P = open(n_file[x], "r")
    output = P.readlines()
    P.close()
    # Sort the output, order by 2nd last field
    #sp_lines = [ line.split('\t') for line in output ]
    #sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )
    F = open(m_file, 'w')
    #for line in sp_lines:
    for line in output:
        if "group_" in line:
            F.write(line)
    F.close()
But it:
doesn't keep only the unique lines
isn't sorted (by the next-to-last field)
introduces a 3rd file, i.e. m_file
Just a side note (long story short): I can't use sorted() here as I'm using v2.3, unfortunately. The input files look like this:
On 23/03/11 00:40:03
JobID Group.User Ctime Wtime Status QDate CDate
===================================================================================
430792 group_atlas.pltatl16 0 32 4 02/03/11 21:52:38 02/03/11 22:02:15
430793 group_atlas.atlas084 30 472 4 02/03/11 21:57:43 02/03/11 22:09:35
430794 group_atlas.atlas084 12 181 4 02/03/11 22:02:37 02/03/11 22:05:42
430796 group_atlas.atlas084 8 185 4 02/03/11 22:02:38 02/03/11 22:05:46
I tried to use cmp() to sort by the 2nd-to-last field but, I think, it doesn't work simply because of the first 3 header lines of the input files.
Can anyone please help? Cheers!!!
Update 1:
For future reference, as suggested by Jakob, here is the complete script. It worked just fine.
#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"
print time.strftime('%H:%M:%S', time.localtime())
# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )
F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
It took about 2m:47s to finish for 145244 lines.
[testac1#serv07 ~]$ ./uniq-merge.py
17:19:21
No. of lines: 145244
17:22:08
thanks!!
Update 2:
Hi eyquem, this is the error message I get when I run your script(s).
From the first script:
[testac1#serv07 ~]$ ./uniq-merge_2.py
File "./uniq-merge_2.py", line 44
fm.writelines( '\n'.join(v)+'\n' for k,v in output )
^
SyntaxError: invalid syntax
From the second script:
[testac1#serv07 ~]$ ./uniq-merge_3.py
File "./uniq-merge_3.py", line 24
output = sett(line.rstrip() for line in fa)
^
SyntaxError: invalid syntax
Cheers!!
Update 3:
The previous one wasn't sorting the list at all. Thanks to eyquem for pointing that out. Well, it does now. This is a further modification of Jakob's version: I converted the set app(path1, path2) to a list myList and then applied sort( lambda ... ) to myList to sort the merged file by the next-to-last field. This is the final script.
#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())
# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"
# Convert the set into a list
myList = list(app(o_file, c_file))
# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )
F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)
# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
There is no speed boost at all; it took 2m:50s to process the same 145244 lines. If anyone sees any scope for improvement, please let me know. Thanks to Jakob and eyquem for their time. Cheers!!
Update 4:
Just for future reference, this is a modified version of eyquem's, which works much better and faster than the previous ones.
#!/usr/bin/env python
import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file):
    # RegEx for the Date/time field
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')
    def kl(lines, pat=pat):
        # match only the next-to-last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime(pat.search(line).group(), '%d/%m/%y %H:%M:%S'))
    output = sett()
    head = []
    # Separate the header & remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1): break
            else: head.append(line1)  # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()
    fa = open(o_file, 'r')
    rmHead(fa)
    fb = open(c_file, 'r')
    rmHead(fb)
    # Sorting date-wise
    output = [ (kl(line), line.rstrip()) for line in output if line.rstrip() ]
    output.sort()
    fm = open(m_file, 'w')
    # Write to the file & add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head[0] + head[1])))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

c_f = "03_a"
o_f = "03_b"
sorting_merge(o_f, c_f, 'outfile.txt')
This version is much faster - 6.99 sec for 145244 lines, compared to 2m:47s - than the previous one using lambda a, b: cmp(). Thanks to eyquem for all his support. Cheers!!
EDIT 2
My previous codes have problems with output = sett(line.rstrip() for line in fa) and output.sort(key=kl).
Moreover, they have some complications.
So I examined the choice of reading the files directly with a set() function, taken from Jakob Bowyer's code.
Congratulations Jakob! (and Michal Chruszcz, by the way): set() is unbeatable, it's faster than reading one line at a time.
So I abandoned my idea of reading the files line after line.
But I kept my idea of avoiding a sort with the help of the cmp() function because, as described in the doc:
s.sort([cmpfunc=None])
The sort() method takes an optional argument specifying a comparison function of two arguments (list items) (...) Note that this slows the sorting process down considerably.
http://docs.python.org/release/2.3/lib/typesseq-mutable.html
Then I managed to obtain a list of tuples (t, line) in which t is
time.mktime(time.strptime(<1st date-and-hour in line>, '%d/%m/%y %H:%M:%S'))
via the instruction
output = [ (kl(line), line.rstrip()) for line in output ]
I tested 2 codes. The following one, in which the 1st date-and-hour in a line is extracted with a regex:
def kl(line, pat=pat):
    return time.mktime(time.strptime(pat.search(line).group(), '%d/%m/%y %H:%M:%S'))

output = [ (kl(line), line.rstrip()) for line in output if line.rstrip() ]
output.sort()
And a second code in which kl() is:
def kl(line, pat=pat):
    return time.mktime(time.strptime(line.split('\t')[-2], '%d/%m/%y %H:%M:%S'))
The results are:
Times of execution:
0.03598 seconds for the first code with regex
0.03580 seconds for the second code with split('\t')
that is to say, practically the same.
This algorithm is faster than code using a cmp() function:
a code in which the set of lines output isn't transformed into a list of tuples by
output = [ (kl(line), line.rstrip()) for line in output ]
but is only transformed into a list of the lines (without duplicates, then) and sorted with a function mycmp() (see the doc):
def mycmp(a, b):
    return cmp(time.mktime(time.strptime(a.split('\t')[-2], '%d/%m/%y %H:%M:%S')),
               time.mktime(time.strptime(b.split('\t')[-2], '%d/%m/%y %H:%M:%S')))

output = [ line.rstrip() for line in output ]  # not list(output), to avoid the problem of the newline on the last line of each file
output.sort(mycmp)
for line in output:
    fm.write(line + '\n')
has an execution time of
0.11574 seconds
The code:
#!/usr/bin/env python
import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file, c_file, m_file):
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)')
    def kl(line, pat=pat):
        return time.mktime(time.strptime(pat.search(line).group(), '%d/%m/%y %H:%M:%S'))
    output = sett()
    head = []
    fa = open(o_file)
    fa.readline()  # first line is skipped
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1: head.append(line1)  # line1 is here a line of the header
        else: break  # the loop ends on the first line1 that is not part of the heading
    output = sett(fa)
    fa.close()
    fb = open(c_file)
    while True:
        line1 = fb.readline()
        if pat.search(line1): break
    output = output.union(sett(fb))
    fb.close()
    output = [ (kl(line), line.rstrip()) for line in output ]
    output.sort()
    fm = open(m_file, 'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head)))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

te = time.clock()
sorting_merge('ytre.txt', 'tataye.txt', 'merged.file.txt')
print time.clock() - te
This time, I hope it will run correctly; the only thing left to do is to wait for the execution times on real files much bigger than the ones I tested the codes on.
EDIT 3
pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '(?=[ \t]+'
                 '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '|'
                 '[ \t]+aborted/deleted)')
EDIT 4
#!/usr/bin/env python
import os, time, sys, re
from sets import Set

def sorting_merge(o_file, c_file, m_file):
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+'
                     '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '|'
                     '[ \t]+aborted/deleted)')
    def kl(line, pat=pat):
        return time.mktime(time.strptime(pat.search(line).group(), '%d/%m/%y %H:%M:%S'))
    head = []
    output = Set()
    fa = open(o_file)
    fa.readline()  # first line is skipped
    for line1 in fa:
        if pat.search(line1): break  # first line after the heading
        else: head.append(line1)  # line of the header
    for line in fa:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fa.close()
    fb = open(c_file)
    for line1 in fb:
        if pat.search(line1): break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()
    if '' in output: output.remove('')
    output = [ (kl(line), line) for line in output ]
    output.sort()
    fm = open(m_file, 'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head)))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

te = time.clock()
sorting_merge('A.txt', 'B.txt', 'C.txt')
print time.clock() - te
Maybe something along these lines?
from sets import Set as set

def yield_lines(fileobj):
    # I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)
EDIT: Forgot about with :$
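Hypothetical usage of app() above (the file names are placeholders; Python 2 style, matching the rest of this thread):
# Union of the two files' non-header lines, duplicates removed by the set.
merged = app('file_a', 'file_b')
print 'unique lines across both files:', len(merged)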
I wrote this new code, with the ease of using a set. It is faster than my previous code and, it seems, than your code:
#!/usr/bin/env python
import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file, c_file, m_file):
    # Convert Date/time to epoch
    def toEpoch(dt):
        dt_ptrn = '%d/%m/%y %H:%M:%S'
        return int(time.mktime(time.strptime(dt, dt_ptrn)))
    pat = re.compile('([0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)'
                     '[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')
    fa = open(o_file)
    head = []
    fa.readline()
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1:
            head.append(('', line1.rstrip()))
        else:
            break
    output = sett((toEpoch(pat.search(line).group(1)), line.rstrip())
                  for line in fa)
    output.add((toEpoch(mat1.group(1)), line1.rstrip()))
    fa.close()
    fb = open(c_file)
    while True:
        line1 = fb.readline()
        mat1 = pat.search(line1)
        if mat1: break
    for line in fb:
        output.add((toEpoch(pat.search(line).group(1)), line.rstrip()))
    output.add((toEpoch(mat1.group(1)), line1.rstrip()))
    fb.close()
    output = list(output)
    output.sort()
    output[0:0] = head
    output[0:0] = [('', time.strftime('On %d/%m/%y %H:%M:%S'))]
    fm = open(m_file, 'w')
    fm.writelines( line + '\n' for t, line in output )
    fm.close()

te = time.clock()
sorting_merge('ytr.txt', 'tatay.txt', 'merged.file.txt')
print time.clock() - te
Note that this code puts a heading in the merged file.
EDIT
Aaaaaah... I got it... :-))
Execution time divided by 3!
#!/usr/bin/env python
import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file, c_file, m_file):
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)')
    def kl(line, pat=pat):
        return time.mktime(time.strptime(pat.search(line).group(), '%d/%m/%y %H:%M:%S'))
    fa = open(o_file)
    head = []
    fa.readline()
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1:
            head.append(line1.rstrip())
        else:
            break
    output = sett(line.rstrip() for line in fa)
    output.add(line1.rstrip())
    fa.close()
    fb = open(c_file)
    while True:
        line1 = fb.readline()
        mat1 = pat.search(line1)
        if mat1: break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()
    output = list(output)
    output.sort(key=kl)
    output[0:0] = [time.strftime('On %d/%m/%y %H:%M:%S')] + head
    fm = open(m_file, 'w')
    fm.writelines( line + '\n' for line in output )
    fm.close()

te = time.clock()
sorting_merge('ytre.txt', 'tataye.txt', 'merged.file.txt')
print time.clock() - te
Last codes, I hope.
Because I found a killer code.
First, I created two files "xxA.txt" and "yyB.txt" of 30,000 lines each, with lines like
430559 group_atlas.atlas084 12 181 4 04/03/10 01:38:02 02/03/11 22:05:42
430502 group_atlas.atlas084 12 181 4 23/01/10 21:45:05 02/03/11 22:05:42
430544 group_atlas.atlas084 12 181 4 17/06/11 12:58:10 02/03/11 22:05:42
430566 group_atlas.atlas084 12 181 4 25/03/10 23:55:22 02/03/11 22:05:42
with the following code:
"create AB.py"
from random import choice

n = tuple( str(x) for x in xrange(500, 600) )
days = ('01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16',
        '17','18','19','20','21','22','23','24','25','26','27','28')
# not '29','30','31' to avoid problems with strptime() on the last days of February
months = days[0:12]
hours = days[0:23]
ms = ['00','01','02','03','04','05','06','07','09'] + [str(x) for x in xrange(10, 60)]
repeat = 30000

with open('xxA.txt', 'w') as f:
    # 430794 group_atlas.atlas084 12 181 4 02/03/11 22:02:37 02/03/11 22:05:42
    ch = ('On 23/03/11 00:40:03\n'
          'JobID Group.User Ctime Wtime Status QDate CDate\n'
          '===================================================================================\n')
    f.write(ch)
    for i in xrange(repeat):
        line = '430%s group_atlas.atlas084 12 181 4 \t%s/%s/%s %s:%s:%s\t02/03/11 22:05:42\n' %\
               (choice(n),
                choice(days), choice(months), choice(('10', '11')),
                choice(hours), choice(ms), choice(ms))
        f.write(line)

with open('yyB.txt', 'w') as f:
    # 430794 group_atlas.atlas084 12 181 4 02/03/11 22:02:37 02/03/11 22:05:42
    ch = ('On 25/03/11 13:45:24\n'
          'JobID Group.User Ctime Wtime Status QDate CDate\n'
          '===================================================================================\n')
    f.write(ch)
    for i in xrange(repeat):
        line = '430%s group_atlas.atlas084 12 181 4 \t%s/%s/%s %s:%s:%s\t02/03/11 22:05:42\n' %\
               (choice(n),
                choice(days), choice(months), choice(('10', '11')),
                choice(hours), choice(ms), choice(ms))
        f.write(line)

with open('xxA.txt') as g:
    print 'readlines of xxA.txt :', len(g.readlines())
    g.seek(0, 0)
    print 'set of xxA.txt :', len(set(g))

with open('yyB.txt') as g:
    print 'readlines of yyB.txt :', len(g.readlines())
    g.seek(0, 0)
    print 'set of yyB.txt :', len(set(g))
Then I ran these 3 programs:
"merging regex.py"
#!/usr/bin/env python
from time import clock, mktime, strptime, strftime
from sets import Set
import re

infunc = []
def sorting_merge(o_file, c_file, m_file):
    infunc.append(clock())  # infunc[0]
    pat = re.compile('([0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)')
    output = Set()
    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line)  # line of the header
            if line.strip('= \r\n') == '': break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head
    infunc.append(clock())  # infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock())  # infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock())  # infunc[3]
    if '' in output: output.remove('')
    infunc.append(clock())  # infunc[4]
    output = [ (mktime(strptime(pat.search(line).group(), '%d/%m/%y %H:%M:%S')), line)
               for line in output ]
    infunc.append(clock())  # infunc[5]
    output.sort()
    infunc.append(clock())  # infunc[6]
    fm = open(m_file, 'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head)))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock())  # infunc[7]

c_f = "xxA.txt"
o_f = "yyB.txt"
t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergedr.txt')
t2 = clock()
print 'merging regex'
print 'total time of execution :', t2 - t1
print ' launching :', infunc[1] - t1
print ' preparation :', infunc[1] - infunc[0]
print ' reading of 1st file :', infunc[2] - infunc[1]
print ' reading of 2nd file :', infunc[3] - infunc[2]
print ' output.remove(\'\') :', infunc[4] - infunc[3]
print 'creation of list output :', infunc[5] - infunc[4]
print ' sorting of output :', infunc[6] - infunc[5]
print 'writing of merging file :', infunc[7] - infunc[6]
print 'closing of the function :', t2 - infunc[7]
"merging split.py"
#!/usr/bin/env python
from time import clock, mktime, strptime, strftime
from sets import Set

infunc = []
def sorting_merge(o_file, c_file, m_file):
    infunc.append(clock())  # infunc[0]
    output = Set()
    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line)  # line of the header
            if line.strip('= \r\n') == '': break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head
    infunc.append(clock())  # infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock())  # infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock())  # infunc[3]
    if '' in output: output.remove('')
    infunc.append(clock())  # infunc[4]
    output = [ (mktime(strptime(line.split('\t')[-2], '%d/%m/%y %H:%M:%S')), line)
               for line in output ]
    infunc.append(clock())  # infunc[5]
    output.sort()
    infunc.append(clock())  # infunc[6]
    fm = open(m_file, 'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head)))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock())  # infunc[7]

c_f = "xxA.txt"
o_f = "yyB.txt"
t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergeds.txt')
t2 = clock()
print 'merging split'
print 'total time of execution :', t2 - t1
print ' launching :', infunc[1] - t1
print ' preparation :', infunc[1] - infunc[0]
print ' reading of 1st file :', infunc[2] - infunc[1]
print ' reading of 2nd file :', infunc[3] - infunc[2]
print ' output.remove(\'\') :', infunc[4] - infunc[3]
print 'creation of list output :', infunc[5] - infunc[4]
print ' sorting of output :', infunc[6] - infunc[5]
print 'writing of merging file :', infunc[7] - infunc[6]
print 'closing of the function :', t2 - infunc[7]
"merging killer"
#!/usr/bin/env python
from time import clock, strftime
from sets import Set
import re

infunc = []
def sorting_merge(o_file, c_file, m_file):
    infunc.append(clock())  # infunc[0]
    patk = re.compile('([0123]\d)/([01]\d)/(\d{2}) ([012]\d:[0-6]\d:[0-6]\d)')
    output = Set()
    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line)  # line of the header
            if line.strip('= \r\n') == '': break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head
    infunc.append(clock())  # infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock())  # infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock())  # infunc[3]
    if '' in output: output.remove('')
    infunc.append(clock())  # infunc[4]
    output = [ (patk.search(line).group(3, 2, 1, 4), line) for line in output ]
    infunc.append(clock())  # infunc[5]
    output.sort()
    infunc.append(clock())  # infunc[6]
    fm = open(m_file, 'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head)))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock())  # infunc[7]

c_f = "xxA.txt"
o_f = "yyB.txt"
t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergedk.txt')
t2 = clock()
print 'merging killer'
print 'total time of execution :', t2 - t1
print ' launching :', infunc[1] - t1
print ' preparation :', infunc[1] - infunc[0]
print ' reading of 1st file :', infunc[2] - infunc[1]
print ' reading of 2nd file :', infunc[3] - infunc[2]
print ' output.remove(\'\') :', infunc[4] - infunc[3]
print 'creation of list output :', infunc[5] - infunc[4]
print ' sorting of output :', infunc[6] - infunc[5]
print 'writing of merging file :', infunc[7] - infunc[6]
print 'closing of the function :', t2 - infunc[7]
Results:
merging regex
total time of execution : 14.2816595405
launching : 0.00169211450059
preparation : 0.00168093989599
reading of 1st file : 0.163582242995
reading of 2nd file : 0.141301478261
output.remove('') : 2.37460347614e-05
creation of output : 13.4460212122
sorting of output : 0.216363532237
writing of merging file : 0.232923737514
closing of the function : 0.0797514767938
merging split
total time of execution : 13.7824474898
launching : 4.10666718815e-05
preparation : 2.70984161395e-05
reading of 1st file : 0.154349784679
reading of 2nd file : 0.136050810927
output.remove('') : 2.06730184981e-05
creation of output : 12.9691854691
sorting of output : 0.218704332534
writing of merging file : 0.225259076223
closing of the function : 0.0788362766776
merging killer
total time of execution : 2.14315311024
launching : 0.00206199391263
preparation : 0.00205026057781
reading of 1st file : 0.158711791582
reading of 2nd file : 0.138976601775
output.remove('') : 2.37460347614e-05
creation of output : 0.621466415424
sorting of output : 0.823161602941
writing of merging file : 0.227701565422
closing of the function : 0.171049393149
In the killer program, sorting output takes 4 times longer, but the time to create output as a list is divided by 21! Globally, then, the execution time is reduced by at least 85%.
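In other words, the killer version wins through the classic decorate-sort-undecorate pattern: compute a cheap, naturally ordered key once per line, sort plain tuples, and never call strptime() inside a comparison. A minimal standalone sketch of that idea (two sample lines, not the author's exact code):
import re

pat = re.compile(r'([0123]\d)/([01]\d)/(\d{2}) ([012]\d:[0-6]\d:[0-6]\d)')

lines = [
    '430793 group_atlas.atlas084 30 472 4 \t02/03/11 21:57:43\t02/03/11 22:09:35',
    '430792 group_atlas.pltatl16 0 32 4 \t02/03/11 21:52:38\t02/03/11 22:02:15',
]
# Decorate: (yy, mm, dd, hh:mm:ss) string tuples already sort chronologically.
decorated = [(pat.search(line).group(3, 2, 1, 4), line) for line in lines]
decorated.sort()  # plain tuple comparison, no date parsing per comparison
merged = [line for key, line in decorated]  # undecorate: keep only the lines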
