Rename a file line by line - python

As input, I have the following lines in my file:
...
VOAUT0000001712_19774.JPG FRYW-1712-19774.jpg eab516afc1aaa10ad23edb5c15ae4ea4 1712 01
VOAUT0000001712_19775.JPG FRYW-1712-19775.jpg 2715ceba8fd5c69b4ca6952e942a1a8a 1712 02
VOAUT0000001712_19776.JPG FRYW-1712-19776.jpg b1a0c4ec6160da3511e23c617517ff6f 1712 03
VOAUT0000001713_19795.JPG FRYW-1713-19795.jpg 56cd173c6e9436b19d39de214669cc53 1713 01
VOAUT0000001713_19796.JPG FRYW-1713-19796.jpg 271aa1b9ef2ac39c502a270c82b31fa3 1713 02
VOAUT0000001713_19797.JPG FRYW-1713-19797.jpg 667732a85660bebec168bc46b884d9b7 1713 03
VOAUT0000001714_19763.JPG FRYW-1714-19763.jpg d37770d6cde5639ce5db4e6a436498a8 1714 01
VOAUT0000001714_19764.JPG FRYW-1714-19764.jpg ce891ca4d4ea59c3a312a468bb0d4408 1714 02
VOAUT0000001714_19765.JPG FRYW-1714-19765.jpg bd7fed521fe3997bf5c879d9d5ce942d 1714 03
...
I would like to modify my file line by line in order to have this:
17124615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19774.jpg eab516afc1aaa10ad23edb5c15ae4ea4
17124615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19775.jpg 2715ceba8fd5c69b4ca6952e942a1a8a
17124615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19776.jpg b1a0c4ec6160da3511e23c617517ff6f
17134615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19795.jpg 56cd173c6e9436b19d39de214669cc53
17134615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19796.jpg 271aa1b9ef2ac39c502a270c82b31fa3
17134615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19797.jpg 667732a85660bebec168bc46b884d9b7
17144615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19763.jpg d37770d6cde5639ce5db4e6a436498a8
17144615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19764.jpg ce891ca4d4ea59c3a312a468bb0d4408
17144615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19765.jpg bd7fed521fe3997bf5c879d9d5ce942d
Here is the beginning of my code:
def renameLineByLine():
    with open('/opt/data/photos.txt') as f:
        for line in f:
            newname, file, path, checksum = line.split()
            if ..?? :
                try:
                    rename(...???)
                except OSError:
                    logger.error('Got a problem')
but I do not see how to rewrite each line in the new format?

You need to read the file in correctly: skip empty lines and split into the correct values. There is no path where you expect one, and you do not account for the two numbers at the end, which are crucial for the conversion.
Here I simply write all newly formatted lines into a new file, photos_new.txt. I hope this gets you started.
Note, however, that your method name renameLineByLine as well as your try/except seem to hint that you also want to move/rename/do some work on the pictures themselves. If that is the case, this answer will not be sufficient and you should elaborate a little more.
def renameLineByLine():
    path = '/opt/AutoPrivilege/client/photos/'
    with open('/opt/data/photos.txt', 'r') as fin, \
         open('/opt/data/photos_new.txt', 'w') as fout:
        for line in fin:
            if len(line) != 1:
                newname, file, checksum, no1, no2 = line.split()
                fout.write(" ".join([
                    "{}4615_{}_hd.jpg".format(no1, no2),
                    path + file,
                    checksum]) + '\n')
            else:
                fout.write('\n')
Input:
VOAUT0000001712_19774.JPG FRYW-1712-19774.jpg eab516afc1aaa10ad23edb5c15ae4ea4 1712 01
VOAUT0000001712_19775.JPG FRYW-1712-19775.jpg 2715ceba8fd5c69b4ca6952e942a1a8a 1712 02
VOAUT0000001712_19776.JPG FRYW-1712-19776.jpg b1a0c4ec6160da3511e23c617517ff6f 1712 03
VOAUT0000001713_19795.JPG FRYW-1713-19795.jpg 56cd173c6e9436b19d39de214669cc53 1713 01
VOAUT0000001713_19796.JPG FRYW-1713-19796.jpg 271aa1b9ef2ac39c502a270c82b31fa3 1713 02
VOAUT0000001713_19797.JPG FRYW-1713-19797.jpg 667732a85660bebec168bc46b884d9b7 1713 03
VOAUT0000001714_19763.JPG FRYW-1714-19763.jpg d37770d6cde5639ce5db4e6a436498a8 1714 01
VOAUT0000001714_19764.JPG FRYW-1714-19764.jpg ce891ca4d4ea59c3a312a468bb0d4408 1714 02
VOAUT0000001714_19765.JPG FRYW-1714-19765.jpg bd7fed521fe3997bf5c879d9d5ce942d 1714 03
Output:
17124615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19774.jpg eab516afc1aaa10ad23edb5c15ae4ea4
17124615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19775.jpg 2715ceba8fd5c69b4ca6952e942a1a8a
17124615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19776.jpg b1a0c4ec6160da3511e23c617517ff6f
17134615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19795.jpg 56cd173c6e9436b19d39de214669cc53
17134615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19796.jpg 271aa1b9ef2ac39c502a270c82b31fa3
17134615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19797.jpg 667732a85660bebec168bc46b884d9b7
17144615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19763.jpg d37770d6cde5639ce5db4e6a436498a8
17144615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19764.jpg ce891ca4d4ea59c3a312a468bb0d4408
17144615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19765.jpg bd7fed521fe3997bf5c879d9d5ce942d

All the information you need to generate the lines is already provided. Thanks to @SebastianHöffner for pointing out the obvious.
out = open('output.txt', 'w')
for line in open('data.txt'):
    if len(line) != 1:
        a, b, c, d, e = line.split()
        l = d + '4615_' + e + '_hd.jpg /opt/AutoPrivilege/client/photos/' + b + ' ' + c
        out.write(l + '\n')
    else:
        out.write('\n')
out.close()

Something like this:
with open('1.txt', 'r') as inF:
    with open('12.txt', 'w') as outF:
        for line in inF:
            if line not in ('\n', '\r\n'):
                t = []
                s = line.split()
                t.append(s[3] + '4615_' + s[4] + '_hd.' + s[0].split('.')[1].lower())
                t.append('/opt/AutoPrivilege/client/photos/' + s[1])
                t.append(s[2] + '\n')
                outF.write(' '.join(t))
            else:
                outF.write(line)

Related

find a phrase/string and read the lines which correspond to it

Here is part of a file:
### zones_list.txt file ########
------------------------
VSAN:1 FCID:0x6f01e0
------------------------
port-wwn (vendor) :20:32:00:02:ac:02:74:24
node-wwn :2f:f7:00:02:ac:02:74:24
class :3
node-ip-addr :0.0.0.0
ipa :ff ff ff ff ff ff ff ff
fc4-types:fc4_features :scsi-fcp:target
symbolic-port-name :4UW0002645 - 0:3:2 - LPE32004-32G
symbolic-node-name :HPE_3PAR A650 - 4UW0002645 - fw:4300
port-type :N
port-ip-addr :0.0.0.0
fabric-port-wwn :20:03:00:de:fb:ce:e9:40
hard-addr :0x000000
permanent-port-wwn (vendor) :20:32:00:02:ac:02:74:24
connected interface :fc1/3
switch name (IP address) :c3-sn6610c-02 (15.112.42.197)
------------------------
VSAN:1 FCID:0x6f0200
------------------------
port-wwn (vendor) :20:33:00:02:ac:07:e9:d5
node-wwn :2f:f7:00:02:ac:07:e9:d5
class :3
node-ip-addr :0.0.0.0
ipa :ff ff ff ff ff ff ff ff
fc4-types:fc4_features :scsi-fcp:target
symbolic-port-name :4UW0002955 - 0:3:3 - LPE32004-32G
symbolic-node-name :HPE_3PAR C630 - 4UW0002955 - fw:4210
port-type :N
port-ip-addr :0.0.0.0
fabric-port-wwn :20:0f:00:de:fb:ce:e9:40
hard-addr :0x000000
permanent-port-wwn (vendor) :20:33:00:02:ac:07:e9:d5
connected interface :fc1/15
switch name (IP address) :c3-sn6610c-02 (15.112.42.197)
------------------------
VSAN:1 FCID:0x8d0000
------------------------
port-wwn (vendor) :10:00:00:10:9b:8c:26:64 (Emulex)
node-wwn :20:00:00:10:9b:8c:26:64
class :3
node-ip-addr :0.0.0.0
ipa :ff ff ff ff ff ff ff ff
fc4-types:fc4_features :
symbolic-port-name :
symbolic-node-name :
port-type :N
port-ip-addr :0.0.0.0
fabric-port-wwn :20:07:00:3a:9c:53:9e:b0
hard-addr :0x000000
permanent-port-wwn (vendor) :00:00:00:00:00:00:00:00
connected interface :fc1/7
switch name (IP address) :c3-cs9148-44 (15.112.48.20)
------------------------
The file has hundreds of entries in the above fashion. I want to find "port-wwn (vendor) :20:32:00:02:ac:02:74:24" and read out the corresponding "connected interface" and "switch name".
So in my code I ask the user to enter the WWN ("xx:xx:...:xx"), search for that entry, find the index of the matching line, and print the lines at index + 13 and index + 14.
The code below works and gives me the line index corresponding to the WWN "xx:xx:...:xx":
with open("zones_list.txt", 'r') as f:
    #lines = f.readlines()
    for (i, line) in enumerate(f):
        if wwn in line:
            print("index is : " + str(i))
            #j = i + 13
            #k = i + 14
            #print(lines[j])
            #print(lines[k])
            break
But when I try to print the 13th and 14th lines after the index corresponding to the phrase/string I want, it does not happen. Any help?
with open("zones_list.txt", 'r') as f:
    lines = f.readlines()
    for (i, line) in enumerate(f):
        if wwn in line:
            print("index is : " + str(i))
            j = i + 13
            k = i + 14
            print(lines[j])
            print(lines[k])
            break
But this code is not working. Is there another way to write it?
Thank you!
This should work. After f.readlines() the file object is already at end-of-file, so iterating over f again yields nothing; iterate over the lines list instead:
wwn = input()
with open("zones_list.txt", 'r') as f:
    lines = f.readlines()
    for (i, line) in enumerate(lines):
        if wwn in line:
            print("index is : " + str(i))
            j = i + 13
            k = i + 14
            print(lines[j])
            print(lines[k])
            break
Input:
20:32:00:02:ac:02:74:24
Output:
index is : 3
connected interface :fc1/3
switch name (IP address) :c3-sn6610c-02 (15.112.42.197)
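If the fixed offsets make you nervous, a bounds check avoids an IndexError when the match sits near the end of the file. A small sketch of the same lookup as a reusable helper (the function name and the in-memory list are illustrative, not from the question):

```python
def lines_after(lines, needle, offsets=(13, 14)):
    """Return the lines at the given offsets after the first line containing needle."""
    for i, line in enumerate(lines):
        if needle in line:
            # Keep only offsets that stay inside the file.
            return [lines[i + off] for off in offsets if i + off < len(lines)]
    return []

sample = ["header\n", "port-wwn :20:32\n", "iface fc1/3\n", "switch c3\n"]
print(lines_after(sample, "20:32", offsets=(1, 2)))  # ['iface fc1/3\n', 'switch c3\n']
```

With the question's file layout, offsets (13, 14) pick out the "connected interface" and "switch name" lines.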

Python split() not working as expected for first line in file

I have a large text file of data mined opinions and each is classified as positive, negative, neutral, or mixed. Every line begins with "+ ", "- ", "= ", or "* " which correspond to these classifiers. Additionally, lines that begin with "!! " represent a comment to ignore.
Below is a simple Python script that is just supposed to count each of the classifiers and ignore the comment lines:
classes = [0, 0, 0, 0]  # "+", "-", "=", "*"
f = open("All_Classified.txt")
for i, line in enumerate(f):
    line = line.strip()
    classifier = line.split(" ")[0]
    if classifier == "+": classes[0] += 1
    elif classifier == "-": classes[1] += 1
    elif classifier == "=": classes[2] += 1
    elif classifier == "*": classes[3] += 1
    elif classifier == "!!": continue
    else: print "Line "+str(i+1)+": \""+line+"\""
f.close()
print classes
Here is a sample of the first 5 lines of "All_Classified.txt":
!! GROUP 1, 1001 - 1512
= 1001//CD TITLETITLE//NNP How//WRB many//JJ conditioners/conditioner/NNS do//VBP you//PRP have//VBP ?//.
= 1002//CD I//PRP have//VBP two//CD different//JJ kinds/kind/NNS ,//, Garnier//NNP Fructis//NNP Triple//NNP Nutrition//NNP conditioner//NN ,//, and//CC Suave//NNP coconut//NN .//.
= 1003//CD But//CC I//PRP think//VBP I//PRP have//VBP about//IN 8//CD bottles/bottle/NNS of//IN the//DT Suave//NNP coconut//NN My//PRP$ mom//NN gave/give/VBD me//PRP a//DT bunch//NN for//IN Christmas//NNP because//IN she//PRP was/be/VBD getting/get/VBG tired/tire/VBN of//IN me//PRP saying/say/VBG I//PRP was/be/VBD out//IN
= 1004//CD TITLETITLE//NNP Need//VB a//DT gel//NN that//IN works/work/NNS ,//, 8//CD mo//NN ,//, post//NN ,//, ready//JJ to//TO relax//VB edges/edge/NNS ,//, HELP//NNP ,//,
For whatever reason, the else branch is triggered on the first iteration, as if the "!!" is not recognized, and I am not sure why. This is the output I get:
Line 1: "!! GROUP 1, 1001 - 1512"
[2986, 1034, 16278, 535]
Additionally, if I delete the first line from "All_Classified.txt", it then fails to recognize the "=" of what becomes the new first line. I am not sure what has to be done for the first line to be recognized as expected.
Edit (again): As Peter asked, here is the output if I replace else: print "Line "+str(i+1)+": \""+line+"\"" with else: print "Classifier "+classifier+ " Line "+str(i+1)+": \""+line+"\"":
Classifier !! Line 1: "!! GROUP 1, 1001 - 1512"
[2986, 1034, 16278, 535]
Edit: First section using xxd All_Classified.txt:
0000000: efbb bf21 2120 4752 4f55 5020 312c 2031 ...!! GROUP 1, 1
0000010: 3030 3120 2d20 3135 3132 0d0a 3d20 3130 001 - 1512..= 10
0000020: 3031 2f2f 4344 2054 4954 4c45 5449 544c 01//CD TITLETITL
0000030: 452f 2f4e 4e50 2048 6f77 2f2f 5752 4220 E//NNP How//WRB
I suspect your input file isn't what it seems. For example, classifier could contain some control characters that are not shown when you print it (but which affect the comparison):
>>> classifier = '!\0!'
>>> print classifier
!!
>>> classifier == '!!'
False
Edit: there you have it:
0000000: efbb bf21 2120
^^^^ ^^
It's the UTF-8 BOM, which becomes part of classifier.
Try opening the file using codecs.open() with "utf-8-sig" as the encoding (see, for example, https://stackoverflow.com/a/13156715/367273).
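The same fix carries over to Python 3, where open() takes an encoding argument directly (a modern-Python sketch; the question's code is Python 2, where codecs.open with "utf-8-sig" is the equivalent, and the file name here is made up):

```python
# Write a file that starts with a UTF-8 BOM, as in the question's xxd dump.
with open("classified_demo.txt", "w", encoding="utf-8-sig") as f:
    f.write("!! GROUP 1, 1001 - 1512\n")

# Reading with plain utf-8 keeps the BOM glued to the first token...
with open("classified_demo.txt", encoding="utf-8") as f:
    token = f.readline().split(" ")[0]
print(token == "!!")  # False: token is '\ufeff!!'

# ...while utf-8-sig strips it transparently.
with open("classified_demo.txt", encoding="utf-8-sig") as f:
    first = f.readline().split(" ")[0]
print(first == "!!")  # True
```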

compare two text files and save the matched output

I have two text files and I want to compare them and save the matched columns to a new text file.
file1:
114.74721
114.85107
2.96667
306.61756
file2:
115.06603 0.00294 5.90000
114.74721 0.00674 5.40000
114.85107 0.00453 6.20000
111.17744 0.00421 5.50000
192.77787 0.03080 3.20000
189.70226 0.01120 5.00000
0.46762 0.00883 3.70000
2.21539 0.01290 3.50000
2.96667 0.01000 3.60000
5.43310 0.00393 5.50000
0.28537 0.00497 5.10000
308.82348 0.00183 6.60000
306.61756 0.00359 5.20000
And I want the output to be:
114.74721 0.00674 5.40000
114.85107 0.00453 6.20000
2.96667 0.01000 3.60000
306.61756 0.00359 5.20000
I used a script, but something is wrong because the output file has more rows than file1, when they should be the same. Could you help me?
file1 = open("file1.txt", "r")
file2 = open("file2.txt", "r")
file3 = open("output.txt", "w")
for line1 in file1.readlines():
    file2.seek(0)
    for line2 in file2.readlines():
        if line1.strip() in line2:
            file3.write(line2)
Edit
From file1.txt
114.74721
114.85107
2.96667
306.61756
152.70581
150.04497
91.41869
91.41869
91.73398
92.35076
117.68963
117.69291
115.97827
168.14476
169.94404
73.00571
156.02833
156.02833
From file3.txt
114.74721 0.00674 5.40000
114.85107 0.00453 6.20000
2.96667 0.01000 3.60000
306.61756 0.00359 5.20000
152.70581 0.02780 2.70000
150.04497 0.00211 6.00000
91.41869 0.00500 3.70000
91.73398 0.00393 4.30000
92.35076 0.00176 5.80000
117.68963 0.15500 2.20000
117.69291 0.15100 2.50000
115.97827 0.00722 7.80000
168.14476 0.00383 5.50000
169.94404 0.00539 4.80000
73.00571 0.00876 3.80000
156.02833 0.00284 6.30000
156.64645 0.01290 3.50000
156.65070 0.02110 4.40000
As you can see, lines 7 and 8 both have the value 91.41869 in file1.txt, but file3.txt only mentions it once, for line 7 and not line 8. The same happens on lines 17 and 18.
FILE1 = "file1.txt"
FILE2 = "file2.txt"
OUTPUT = "file3.txt"

with open(FILE1) as inf:
    match = set(line.strip() for line in inf)

with open(FILE2) as inf, open(OUTPUT, "w") as outf:
    for line in inf:
        if line.split(' ', 1)[0] in match:
            outf.write(line)
or, if they HAVE to be in the same order,
with open(FILE1) as inf:
    items = [line.strip() for line in inf]
    match = {val: i for i, val in enumerate(items)}
outp = ['\n'] * len(items)
with open(FILE2) as inf, open(OUTPUT, "w") as outf:
    for line in inf:
        val = line.split(' ', 1)[0]
        try:
            outp[match[val]] = line
        except KeyError:
            pass
    outf.write(''.join(outp))
Note that the first version will write out as many matches as it finds - if two lines in FILE2 start with "114.74721" you will get both of them - while the second will only keep the last match found.
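If you need both the order (and duplicates) of FILE1 and every match from FILE2, grouping FILE2's lines by their first column first is a simple compromise. A sketch on in-memory lists, with the file handling elided (the function name is illustrative):

```python
def ordered_matches(keys, data_lines):
    """For each key (in order, duplicates kept), emit every data line whose first column equals it."""
    by_key = {}
    for line in data_lines:
        # Group FILE2's lines by their first space-separated column.
        by_key.setdefault(line.split(' ', 1)[0], []).append(line)
    out = []
    for key in keys:
        out.extend(by_key.get(key, []))
    return out

print(ordered_matches(
    ["91.41869", "91.41869"],
    ["91.41869 0.00500 3.70000"],
))  # the duplicate key yields the matching line twice
```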

Python: compare column in two files

I'm trying to solve this text-processing task using Python, but I'm not able to compare the columns.
What I have tried :
#!/usr/bin/env python
import sys

def Main():
    print "This is your input Files %s,%s" % (file1, file2)
    f1 = open(file1, 'r')
    f2 = open(file2, 'r')
    for line in f1:
        column1_f1 = line.split()[:1]
        #print column1_f1
        for check in f2:
            column2_f2 = check.split()[:1]
            print column1_f1, column2_f2
            if column1_f1 == column2_f2:
                print "Match", line
            else:
                print line, check
    f1.close()
    f2.close()

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print >> sys.stderr, "This Script need exact 2 argument, aborting"
        exit(1)
    else:
        ThisScript, file1, file2 = sys.argv
        Main()
I'm new to Python; please help me learn and understand this.
I would resolve it in Python 3 in a similar way to what user46911 did with awk: read the second file and save its keys in a dictionary, then check, for each line of the first file, whether its key exists:
import sys

codes = {}
with open(sys.argv[2], 'r') as f2:
    for line in f2:
        fields = line.split()
        codes[fields[0]] = fields[1]

with open(sys.argv[1], 'r') as f1:
    for line in f1:
        fields = line.split(None, 1)
        if fields[0] in codes:
            print('{0:4s}{1:s}'.format(codes[fields[0]], line[4:]), end='')
        else:
            print(line, end='')
Run it like:
python3 script.py file1 file2
That yields:
060090 AKRABERG FYR DN 6138 -666 101
EKVG 060100 VAGA FLOGHAVN DN 6205 -728 88
060110 TORSHAVN DN 6201 -675 55
060120 KIRKJA DN 6231 -631 55
060130 KLAKSVIK HELIPORT DN 6221 -656 75
060160 HORNS REV A DN 5550 786 21
060170 HORNS REV B DN 5558 761 10
060190 SILSTRUP DN 5691 863 0
060210 HANSTHOLM DN 5711 858 0
EKGF 060220 TYRA OEST DN 5571 480 43
EKTS 060240 THISTED LUFTHAVN DN 5706 870 8
060290 GROENLANDSHAVNEN DN 5703 1005 0
EKYT 060300 FLYVESTATION AALBORG DN 5708 985 13
060310 TYLSTRUP DN 5718 995 0
060320 STENHOEJ DN 5736 1033 56
060330 HIRTSHALS DN 5758 995 0
EKSN 060340 SINDAL FLYVEPLADS DN 5750 1021 28

How to merge only the unique lines from file_a to file_b?

This question has been asked here in one form or another but not quite the thing I'm looking for. So, this is the situation I shall be having: I already have one file, named file_a and I'm creating another file - file_b. file_a is always bigger than file_b in size. There will be a number of duplicate lines in file_b (hence, in file_a as well) but both the files will have some unique lines. What I want to do is: to copy/merge only the unique lines from file_a to file_b and then sort the line order, so that the file_b becomes the most up-to-date one with all the unique entries. Either of the original files shouldn't be more than 10MB in size. What's the most efficient (and fastest) way I can do that?
I was thinking of something like this, which does the merging all right:
#!/usr/bin/env python
import os, time, sys

# Convert Date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file, c_file]
m_file = "merged.file"

for x in range(len(n_file)):
    P = open(n_file[x], "r")
    output = P.readlines()
    P.close()

    # Sort the output, order by 2nd last field
    #sp_lines = [ line.split('\t') for line in output ]
    #sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

    F = open(m_file, 'w')
    #for line in sp_lines:
    for line in output:
        if "group_" in line:
            F.write(line)
    F.close()
But, it's:
not with only the unique lines
not sorted (by next to last field)
and introduces the 3rd file i.e. m_file
Just a side note (long story short): I can't use sorted() here as I'm using v2.3, unfortunately. The input files look like this:
On 23/03/11 00:40:03
JobID Group.User Ctime Wtime Status QDate CDate
===================================================================================
430792 group_atlas.pltatl16 0 32 4 02/03/11 21:52:38 02/03/11 22:02:15
430793 group_atlas.atlas084 30 472 4 02/03/11 21:57:43 02/03/11 22:09:35
430794 group_atlas.atlas084 12 181 4 02/03/11 22:02:37 02/03/11 22:05:42
430796 group_atlas.atlas084 8 185 4 02/03/11 22:02:38 02/03/11 22:05:46
I tried to use cmp() to sort by the 2nd last field but, I think, it doesn't work just because of the first 3 lines of the input files.
Can anyone please help? Cheers!!!
Update 1:
For future reference, as suggested by Jakob, here is the complete script. It worked just fine.
#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    #I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

print time.strftime('%H:%M:%S', time.localtime())

# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
It took about 2m:47s to finish for 145244 lines.
[testac1#serv07 ~]$ ./uniq-merge.py
17:19:21
No. of lines: 145244
17:22:08
thanks!!
Update 2:
Hi eyquem, this is the Error message I get when I run your script(s).
From the first script:
[testac1#serv07 ~]$ ./uniq-merge_2.py
File "./uniq-merge_2.py", line 44
fm.writelines( '\n'.join(v)+'\n' for k,v in output )
^
SyntaxError: invalid syntax
From the second script:
[testac1#serv07 ~]$ ./uniq-merge_3.py
File "./uniq-merge_3.py", line 24
output = sett(line.rstrip() for line in fa)
^
SyntaxError: invalid syntax
Cheers!!
Update 3:
The previous one wasn't sorting the list at all; thanks to eyquem for pointing that out. Well, it does now. This is a further modification of Jakob's version: I converted the set app(path1, path2) to a list, myList, and then applied sort(lambda ...) to myList to sort the merged file by the next-to-last field. This is the final script.
#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())

# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

# Convert set into a list
myList = list(app(o_file, c_file))

# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)

# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
There is no speed boost at all; it took 2m:50s to process the same 145244 lines. If anyone sees any scope for improvement, please let me know. Thanks to Jakob and eyquem for their time. Cheers!!
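For readers on a modern Python (the constraint here is being stuck on 2.3), the whole merge-and-sort collapses to a set union plus sorted() with a key. A minimal sketch on in-memory, tab-separated lines, with the header handling elided (merge_unique and the sample lines are illustrative):

```python
from datetime import datetime

def merge_unique(lines_a, lines_b):
    """Union of the data lines from both inputs, sorted by the next-to-last tab field."""
    def by_date(line):
        # Fields are tab-separated; the next-to-last one is 'dd/mm/yy HH:MM:SS'.
        return datetime.strptime(line.split('\t')[-2], '%d/%m/%y %H:%M:%S')
    return sorted(set(lines_a) | set(lines_b), key=by_date)

a = ["430792\tgroup_x\t02/03/11 21:52:38\t02/03/11 22:02:15"]
b = ["430793\tgroup_y\t02/03/11 21:45:05\t02/03/11 22:09:35",
     "430792\tgroup_x\t02/03/11 21:52:38\t02/03/11 22:02:15"]
print(merge_unique(a, b))  # the duplicate collapses; the 21:45:05 line sorts first
```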
Update 4:
Just for future reference, this is a modified version of eyquem's, which works much better and faster than the previous ones.
#!/usr/bin/env python
import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file):
    # RegEx for the Date/time field
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    def kl(lines, pat=pat):
        # match only the next to last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime((pat.search(line).group()), '%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    # Separate the header & remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1): break
            else: head.append(line1)  # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()

    fa = open(o_file, 'r')
    rmHead(fa)
    fb = open(c_file, 'r')
    rmHead(fb)

    # Sorting date-wise
    output = [ (kl(line), line.rstrip()) for line in output if line.rstrip() ]
    output.sort()

    fm = open(m_file, 'w')
    # Write to the file & add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head[0] + head[1])))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

c_f = "03_a"
o_f = "03_b"
sorting_merge(o_f, c_f, 'outfile.txt')
This version is much faster: 6.99 seconds for 145244 lines, compared to the 2m:47s of the previous one using lambda a, b: cmp(). Thanks to eyquem for all his support. Cheers!!
EDIT 2
My previous codes have problems with output = sett(line.rstrip() for line in fa) and output.sort(key=kl)
Moreover, they have some complications.
So I examined the option of reading the files directly with the set() function, as Jakob Bowyer does in his code.
Congratulations Jakob! (and Michal Chruszcz, by the way): set() is unbeatable, it's faster than reading one line at a time.
So I abandoned my idea of reading the files line after line.
.
But I kept my idea of avoiding a sort based on the cmp() function because, as described in the doc:
s.sort([cmpfunc=None])
The sort() method takes an optional
argument specifying a comparison
function of two arguments (list items)
(...) Note that this slows the sorting
process down considerably
http://docs.python.org/release/2.3/lib/typesseq-mutable.html
Then, I managed to obtain a list of tuples (t,line) in which the t is
time.mktime(time.strptime(( 1st date-and-hour in line ,'%d/%m/%y %H:%M:%S'))
by the instruction
output = [ (kl(line),line.rstrip()) for line in output]
.
I tested 2 codes. The following one in which 1st date-and-hour in line is computed thanks to a regex:
def kl(line, pat=pat):
    return time.mktime(time.strptime((pat.search(line).group()), '%d/%m/%y %H:%M:%S'))

output = [ (kl(line), line.rstrip()) for line in output if line.rstrip()]
output.sort()
And a second code in which kl() is:
def kl(line, pat=pat):
    return time.mktime(time.strptime(line.split('\t')[-2], '%d/%m/%y %H:%M:%S'))
.
The results are
Times of execution:
0.03598 seconds for the first code with regex
0.03580 seconds for the second code with split('\t')
that is to say the same
This algorithm is faster than a code using the cmp() function:
a code in which the set of lines output isn't transformed into a list of tuples by
output = [ (kl(line), line.rstrip()) for line in output]
but is only transformed into a list of the lines (without duplicates, then) and sorted with a function mycmp() (see the doc):
def mycmp(a, b):
    return cmp(time.mktime(time.strptime(a.split('\t')[-2], '%d/%m/%y %H:%M:%S')),
               time.mktime(time.strptime(b.split('\t')[-2], '%d/%m/%y %H:%M:%S')))

output = [ line.rstrip() for line in output]  # not list(output), to avoid the problem of the newline of the last line of each file
output.sort(mycmp)
for line in output:
    fm.write(line + '\n')
has an execution time of
0.11574 seconds
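The gap between the two timings can be reproduced in miniature. On a modern Python the cmp argument to sort() is gone, so functools.cmp_to_key stands in for it in this sketch (not eyquem's exact code); the point is the same: the decorate-sort-undecorate form parses each line once, while the cmp form re-parses on every comparison:

```python
import functools

lines = ["x\tb\t02/03/11 22:02:37\tz", "y\ta\t02/03/11 21:52:38\tz"]

def field(line):
    # next-to-last tab-separated field, as in the codes above
    return line.split('\t')[-2]

# cmp-style: field() runs on both arguments of every pairwise comparison.
by_cmp = sorted(lines, key=functools.cmp_to_key(
    lambda a, b: (field(a) > field(b)) - (field(a) < field(b))))

# decorate-sort-undecorate: field() runs exactly once per line.
decorated = sorted((field(line), line) for line in lines)
by_key = [line for _, line in decorated]

print(by_cmp == by_key)  # True: same order, far fewer parses on big files
```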
.
The code:
#!/usr/bin/env python
import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file, c_file, m_file):
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)')

    def kl(line, pat=pat):
        return time.mktime(time.strptime((pat.search(line).group()), '%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    fa = open(o_file)
    fa.readline()  # first line is skipped
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1: head.append(line1)  # line1 is here a line of the header
        else: break  # the loop ends on the first line1 not being a line of the heading
    output = sett(fa)
    fa.close()

    fb = open(c_file)
    while True:
        line1 = fb.readline()
        if pat.search(line1): break
    output = output.union(sett(fb))
    fb.close()

    output = [ (kl(line), line.rstrip()) for line in output]
    output.sort()

    fm = open(m_file, 'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head)))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

te = time.clock()
sorting_merge('ytre.txt', 'tataye.txt', 'merged.file.txt')
print time.clock() - te
This time, I hope it will run correctly; the only thing left to do is wait for the execution times on real files, much bigger than the ones on which I tested the codes.
.
EDIT 3
pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '(?=[ \t]+'
                 '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '|'
                 '[ \t]+aborted/deleted)')
.
EDIT 4
#!/usr/bin/env python
import os, time, sys, re
from sets import Set

def sorting_merge(o_file, c_file, m_file):
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+'
                     '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '|'
                     '[ \t]+aborted/deleted)')

    def kl(line, pat=pat):
        return time.mktime(time.strptime((pat.search(line).group()), '%d/%m/%y %H:%M:%S'))

    head = []
    output = Set()

    fa = open(o_file)
    fa.readline()  # first line is skipped
    for line1 in fa:
        if pat.search(line1): break  # first line after the heading
        else: head.append(line1)  # line of the header
    for line in fa:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fa.close()

    fb = open(c_file)
    for line1 in fb:
        if pat.search(line1): break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()

    if '' in output: output.remove('')
    output = [ (kl(line), line) for line in output]
    output.sort()

    fm = open(m_file, 'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head)))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

te = time.clock()
sorting_merge('A.txt', 'B.txt', 'C.txt')
print time.clock() - te
Maybe something along these lines?
from sets import Set as set

def yield_lines(fileobj):
    #I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)
EDIT: Forgot about with :$
I wrote this new code, with the ease of using a set. It is faster than my previous code and, it seems, than your code:
#!/usr/bin/env python
import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file, c_file, m_file):
    # Convert Date/time to epoch
    def toEpoch(dt):
        dt_ptrn = '%d/%m/%y %H:%M:%S'
        return int(time.mktime(time.strptime(dt, dt_ptrn)))

    pat = re.compile('([0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)'
                     '[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    fa = open(o_file)
    head = []
    fa.readline()
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1:
            head.append(('', line1.rstrip()))
        else:
            break
    output = sett((toEpoch(pat.search(line).group(1)), line.rstrip())
                  for line in fa)
    output.add((toEpoch(mat1.group(1)), line1.rstrip()))
    fa.close()

    fb = open(c_file)
    while True:
        line1 = fb.readline()
        mat1 = pat.search(line1)
        if mat1: break
    for line in fb:
        output.add((toEpoch(pat.search(line).group(1)), line.rstrip()))
    output.add((toEpoch(mat1.group(1)), line1.rstrip()))
    fb.close()

    output = list(output)
    output.sort()
    output[0:0] = head
    output[0:0] = [('', time.strftime('On %d/%m/%y %H:%M:%S'))]

    fm = open(m_file, 'w')
    fm.writelines( line+'\n' for t, line in output)
    fm.close()

te = time.clock()
sorting_merge('ytr.txt', 'tatay.txt', 'merged.file.txt')
print time.clock() - te
Note that this code puts a heading in the merged file.
.
EDIT
Aaaaaah... I got it... :-))
Execution's time divided by 3 !
#!/usr/bin/env python
import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file, c_file, m_file):
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)')

    def kl(line, pat=pat):
        return time.mktime(time.strptime((pat.search(line).group()), '%d/%m/%y %H:%M:%S'))

    fa = open(o_file)
    head = []
    fa.readline()
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1:
            head.append(line1.rstrip())
        else:
            break
    output = sett(line.rstrip() for line in fa)
    output.add(line1.rstrip())
    fa.close()

    fb = open(c_file)
    while True:
        line1 = fb.readline()
        mat1 = pat.search(line1)
        if mat1: break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()

    output = list(output)
    output.sort(key=kl)
    output[0:0] = [time.strftime('On %d/%m/%y %H:%M:%S')] + head

    fm = open(m_file, 'w')
    fm.writelines( line+'\n' for line in output)
    fm.close()

te = time.clock()
sorting_merge('ytre.txt', 'tataye.txt', 'merged.file.txt')
print time.clock() - te
Last codes, I hope.
Because I found a killer code.
First, I created two files "xxA.txt" and "yyB.txt", each having 30000 lines such as
430559 group_atlas.atlas084 12 181 4 04/03/10 01:38:02 02/03/11 22:05:42
430502 group_atlas.atlas084 12 181 4 23/01/10 21:45:05 02/03/11 22:05:42
430544 group_atlas.atlas084 12 181 4 17/06/11 12:58:10 02/03/11 22:05:42
430566 group_atlas.atlas084 12 181 4 25/03/10 23:55:22 02/03/11 22:05:42
with the following code:
create AB.py
from random import choice

n = tuple( str(x) for x in xrange(500, 600))
days = ('01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16',
        '17','18','19','20','21','22','23','24','25','26','27','28')
# not '29','30','31' to avoid problems with strptime() on last days of february
months = days[0:12]
hours = days[0:23]
ms = ['00','01','02','03','04','05','06','07','09'] + [str(x) for x in xrange(10, 60)]

repeat = 30000

with open('xxA.txt', 'w') as f:
    # 430794 group_atlas.atlas084 12 181 4 02/03/11 22:02:37 02/03/11 22:05:42
    ch = ('On 23/03/11 00:40:03\n'
          'JobID Group.User Ctime Wtime Status QDate CDate\n'
          '===================================================================================\n')
    f.write(ch)
    for i in xrange(repeat):
        line = '430%s group_atlas.atlas084 12 181 4 \t%s/%s/%s %s:%s:%s\t02/03/11 22:05:42\n' %\
               (choice(n),
                choice(days), choice(months), choice(('10','11')),
                choice(hours), choice(ms), choice(ms))
        f.write(line)

with open('yyB.txt', 'w') as f:
    # 430794 group_atlas.atlas084 12 181 4 02/03/11 22:02:37 02/03/11 22:05:42
    ch = ('On 25/03/11 13:45:24\n'
          'JobID Group.User Ctime Wtime Status QDate CDate\n'
          '===================================================================================\n')
    f.write(ch)
    for i in xrange(repeat):
        line = '430%s group_atlas.atlas084 12 181 4 \t%s/%s/%s %s:%s:%s\t02/03/11 22:05:42\n' %\
               (choice(n),
                choice(days), choice(months), choice(('10','11')),
                choice(hours), choice(ms), choice(ms))
        f.write(line)

with open('xxA.txt') as g:
    print 'readlines of xxA.txt :', len(g.readlines())
    g.seek(0, 0)
    print 'set of xxA.txt :', len(set(g))
with open('yyB.txt') as g:
    print 'readlines of yyB.txt :', len(g.readlines())
    g.seek(0, 0)
    print 'set of yyB.txt :', len(set(g))
Then I ran these 3 programs:
"merging regex.py"
#!/usr/bin/env python
from time import clock,mktime,strptime,strftime
from sets import Set
import re

infunc = []

def sorting_merge(o_file, c_file, m_file):
    infunc.append(clock()) #infunc[0]
    pat = re.compile('([0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)')
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='': break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output: output.remove('')
    infunc.append(clock()) #infunc[4]
    output = [ (mktime(strptime(pat.search(line).group(),'%d/%m/%y %H:%M:%S')),line)
               for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]
    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]

c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergedr.txt')
t2 = clock()

print 'merging regex'
print 'total time of execution :',t2-t1
print ' launching :',infunc[1] - t1
print ' preparation :',infunc[1] - infunc[0]
print ' reading of 1st file :',infunc[2] - infunc[1]
print ' reading of 2nd file :',infunc[3] - infunc[2]
print ' output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print ' sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]
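The rmHead() helper, identical in the three programs, skips the 'On ...' line, keeps the header lines up to and including the '====' separator, and pours the stripped data lines into the set. A Python-3 style sketch of the same logic, using a context manager (names unchanged):

```python
def rm_head(filename, a_set):
    # Skip the 'On dd/mm/yy HH:MM:SS' line, collect the header lines up to
    # (and including) the '====' separator, then add the remaining data
    # lines, stripped, to a_set.  Returns the header lines.
    with open(filename) as f:
        next(f)
        head = []
        for line in f:
            head.append(line)
            if line.strip('= \r\n') == '':
                break
        for line in f:
            a_set.add(line.rstrip())
    return head
```

Because the same file iterator is consumed by both loops, the second loop resumes exactly where the header ended, with no seeking or re-reading.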
"merging split.py"
#!/usr/bin/env python
from time import clock,mktime,strptime,strftime
from sets import Set

infunc = []

def sorting_merge(o_file, c_file, m_file):
    infunc.append(clock()) #infunc[0]
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='': break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output: output.remove('')
    infunc.append(clock()) #infunc[4]
    output = [ (mktime(strptime(line.split('\t')[-2],'%d/%m/%y %H:%M:%S')),line)
               for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]
    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]

c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergeds.txt')
t2 = clock()

print 'merging split'
print 'total time of execution :',t2-t1
print ' launching :',infunc[1] - t1
print ' preparation :',infunc[1] - infunc[0]
print ' reading of 1st file :',infunc[2] - infunc[1]
print ' reading of 2nd file :',infunc[3] - infunc[2]
print ' output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print ' sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]
"merging killer"
#!/usr/bin/env python
from time import clock,strftime
from sets import Set
import re

infunc = []

def sorting_merge(o_file, c_file, m_file):
    infunc.append(clock()) #infunc[0]
    patk = re.compile('([0123]\d)/([01]\d)/(\d{2}) ([012]\d:[0-6]\d:[0-6]\d)')
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='': break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output: output.remove('')
    infunc.append(clock()) #infunc[4]
    output = [ (patk.search(line).group(3,2,1,4),line) for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]
    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]

c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergedk.txt')
t2 = clock()

print 'merging killer'
print 'total time of execution :',t2-t1
print ' launching :',infunc[1] - t1
print ' preparation :',infunc[1] - infunc[0]
print ' reading of 1st file :',infunc[2] - infunc[1]
print ' reading of 2nd file :',infunc[3] - infunc[2]
print ' output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print ' sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]
results
merging regex
total time of execution : 14.2816595405
launching : 0.00169211450059
preparation : 0.00168093989599
reading of 1st file : 0.163582242995
reading of 2nd file : 0.141301478261
output.remove('') : 2.37460347614e-05
creation of output : 13.4460212122
sorting of output : 0.216363532237
writing of merging file : 0.232923737514
closing of the function : 0.0797514767938
merging split
total time of execution : 13.7824474898
launching : 4.10666718815e-05
preparation : 2.70984161395e-05
reading of 1st file : 0.154349784679
reading of 2nd file : 0.136050810927
output.remove('') : 2.06730184981e-05
creation of output : 12.9691854691
sorting of output : 0.218704332534
writing of merging file : 0.225259076223
closing of the function : 0.0788362766776
merging killer
total time of execution : 2.14315311024
launching : 0.00206199391263
preparation : 0.00205026057781
reading of 1st file : 0.158711791582
reading of 2nd file : 0.138976601775
output.remove('') : 2.37460347614e-05
creation of output : 0.621466415424
sorting of output : 0.823161602941
writing of merging file : 0.227701565422
closing of the function : 0.171049393149
In the killer program, sorting the output takes 4 times longer, but the creation of the output list is divided by 21!
Globally, the execution time is reduced by at least 85%.
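The trick behind the killer version: instead of parsing every date with strptime(), it reorders the captured groups as (year, month, day, time). Since every field is zero-padded, plain string comparison of these tuples is equivalent to chronological order. A minimal Python-3 style sketch of the idea:

```python
import re

patk = re.compile(r'([0123]\d)/([01]\d)/(\d{2}) ([012]\d:[0-6]\d:[0-6]\d)')

lines = ['b 04/03/10 01:38:02', 'a 23/01/10 21:45:05', 'c 17/06/11 12:58:10']
# group(3, 2, 1, 4) -> ('yy', 'mm', 'dd', 'HH:MM:SS'); lexicographic
# comparison of these tuples gives chronological order, no strptime() needed.
lines.sort(key=lambda line: patk.search(line).group(3, 2, 1, 4))
print(lines)  # ['a 23/01/10 21:45:05', 'b 04/03/10 01:38:02', 'c 17/06/11 12:58:10']
```

This is why the per-line key creation is so much cheaper: a regex search plus tuple construction replaces a full strptime()/mktime() round trip.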
