Python: comparing two files partially

I have the following two input files:
Input 1:
okay sentence
two runway
three runway
right runway
one pathway
four pathway
zero pathway
Input 2:
okay sentence
two runway
three runway
right runway
zero pathway
one pathway
four pathway
I have used the following code:
def diff(a, b):
    y = []
    for x in a:
        if x not in b:
            y.append(x)
        else:
            b.remove(x)
    return y
with open('output_ref.txt', 'r') as file1:
    with open('output_ref1.txt', 'r') as file2:
        same = diff(list(file1), list(file2))
        print same
        print "\n"
        if '\n' in same:
            same.remove('\n')
        with open('some_output_file.txt', 'w') as FO:
            for line in same:
                FO.write(line)
The expected output is:
one pathway
zero pathway
But I am getting empty output with this. The problem is that I don't know how to store the file contents in lists partially, then compare them, and finally write the result back out. Can someone help me with this?

It seems that if you just want the text lines that differ between the two files, sets provide a good way. Something like this:
content1 = set(open("file1", "r"))
content2 = set(open("file2", "r"))
diff_items = content1.difference(content2)
UPDATE: Or is the question about a difference in the same sense as the diff utility, i.e. where the order matters? It looks that way from the examples.
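If the order does matter, the standard-library difflib module gives a diff-style comparison. This is just a minimal sketch, assuming the filenames from the question:
import difflib
import sys

# Filenames are taken from the question; adjust as needed.
with open('output_ref.txt') as file1, open('output_ref1.txt') as file2:
    lines1 = file1.readlines()
    lines2 = file2.readlines()

# unified_diff preserves line order like the diff utility: lines prefixed
# with '-' appear only in the first file, '+' only in the second.
sys.stdout.writelines(difflib.unified_diff(lines1, lines2,
                                           fromfile='output_ref.txt',
                                           tofile='output_ref1.txt'))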

Use sets
with open('output_ref.txt', 'r') as file1:
    with open('output_ref1.txt', 'r') as file2:
        f1 = [x.strip() for x in file1]  # get all lines and strip whitespace
        f2 = [x.strip() for x in file2]
        five_f1 = f1[0:5]  # first five lines
        two_f1 = f1[5:]    # rest of the lines
        five_f2 = f2[0:5]
        two_f2 = f2[5:]
        s1 = set(five_f1)  # make sets to compare
        s2 = set(two_f1)
        s1 = s1.difference(five_f2)  # in a but not in b
        s2 = s2.difference(two_f2)
        same = s1.union(s2)

with open('some_output_file.txt', 'w') as FO:
    for line in same:
        FO.write(line + "\n")  # add a newline to write each entry on its own line
Without sets, using your own diff method:
with open('output_ref.txt', 'r') as file1:
    with open('output_ref1.txt', 'r') as file2:
        f1 = [x.strip() for x in file1]
        f2 = [x.strip() for x in file2]
        five_f1 = f1[0:5]
        two_f1 = f1[5:]
        five_f2 = f2[0:5]
        two_f2 = f2[5:]
        same = diff(five_f1, five_f2) + diff(two_f1, two_f2)

print same
['one pathway', 'zero pathway']

Related

Combine two wordlists into one file in Python

I have two wordlists, as per the examples below:
wordlist1.txt
aa
bb
cc
wordlist2.txt
11
22
33
I want to take each line from wordlist2.txt, put it after the corresponding line from wordlist1.txt, and combine them into wordlist3.txt like this:
aa
11
bb
22
cc
33
.
.
Can you please help me with how to do it? Thanks!
Try to always include what you have tried.
However, this is a great place to start.
def read_file_to_list(filename):
    with open(filename) as file:
        lines = file.readlines()
        lines = [line.rstrip() for line in lines]
    return lines

wordlist1 = read_file_to_list("wordlist1.txt")
wordlist2 = read_file_to_list("wordlist2.txt")

with open("wordlist3.txt", 'w', encoding='utf-8') as f:
    for x, y in zip(wordlist1, wordlist2):
        f.write(x + "\n")
        f.write(y + "\n")
Check the following question for more ideas and understanding: How to read a file line-by-line into a list?
Cheers
Open wordlist1.txt and wordlist2.txt for reading and wordlist3.txt for writing. Then it's as simple as:
with open('wordlist3.txt', 'w') as w3, open('wordlist1.txt') as w1, open('wordlist2.txt') as w2:
    for l1, l2 in zip(map(str.rstrip, w1), map(str.rstrip, w2)):
        print(f'{l1}\n{l2}', file=w3)
Instead of using .splitlines(), you can also iterate over the files directly. Here's the code:
wordlist1 = open("wordlist1.txt", "r")
wordlist2 = open("wordlist2.txt", "r")
wordlist3 = open("wordlist3.txt", "w")

for txt1, txt2 in zip(wordlist1, wordlist2):
    if not txt1.endswith("\n"):
        txt1 += "\n"
    wordlist3.write(txt1)
    wordlist3.write(txt2)

wordlist1.close()
wordlist2.close()
wordlist3.close()
In the first block, we open the files. For the first two we use "r", which stands for read, as we don't want to change those files; we could omit it, as "r" is the default argument of the open function. For the third one we use "w", which stands for write. If the file doesn't exist yet, it will be created.
Next, we use the zip function in the for loop. It creates an iterator of tuples built from the iterables passed as arguments. In this loop, each tuple contains one line of wordlist1.txt and one line of wordlist2.txt, and the tuples are unpacked directly into the variables txt1 and txt2.
Next we use an if statement to check whether the line of wordlist1.txt ends with a newline. This might not be the case for the last line, so it needs to be checked. We don't check the second file's lines, because it does not matter if its last line has no newline: it will also be the end of the resulting file.
Next, we write the text to wordlist3.txt. Each write appends at the current position of the file, but any text that was in the file before it was opened in "w" mode is lost.
Finally, we close the files. This is very important to do, as otherwise buffered output might not be flushed to disk and other applications may not be able to use the files in the meantime.
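As an aside, the closing can also be left to a with statement, which closes all three files automatically even if an error occurs. This is just a sketch of that alternative using the same logic as above:
with open("wordlist1.txt") as wordlist1, \
        open("wordlist2.txt") as wordlist2, \
        open("wordlist3.txt", "w") as wordlist3:
    for txt1, txt2 in zip(wordlist1, wordlist2):
        if not txt1.endswith("\n"):
            txt1 += "\n"
        wordlist3.write(txt1)
        wordlist3.write(txt2)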
Try this:
with open('wordlist1.txt', 'r') as f1:
    f1_list = f1.read().splitlines()
with open('wordlist2.txt', 'r') as f2:
    f2_list = f2.read().splitlines()

# interleave the two lists element by element
f3_list = [x for t in zip(f1_list, f2_list) for x in t]

with open('wordlist3.txt', 'w') as f3:
    f3.write("\n".join(f3_list))
with open('wordlist1.txt') as w1, \
        open('wordlist2.txt') as w2, \
        open('wordlist3.txt', 'w') as w3:
    for wordlist1, wordlist2 in zip(w1.readlines(), w2.readlines()):
        if wordlist1[-1] != '\n':
            wordlist1 += '\n'
        if wordlist2[-1] != '\n':
            wordlist2 += '\n'
        w3.write(wordlist1)
        w3.write(wordlist2)
Here you go :)
with open('wordlist1.txt', 'r') as f:
    file1 = f.readlines()
with open('wordlist2.txt', 'r') as f:
    file2 = f.readlines()

with open('wordlist3.txt', 'w') as f:
    for x in range(len(file1)):
        if not file1[x].endswith('\n'):
            file1[x] += '\n'
        f.write(file1[x])
        if not file2[x].endswith('\n'):
            file2[x] += '\n'
        f.write(file2[x])
Open wordlist 1 and 2 and make line pairings, separate the two lines of each pair with a newline character, then join all the pairs together, again separated by newlines.
# paths
wordlist1 = #
wordlist2 = #
wordlist3 = #

with open(wordlist1, 'r') as fd1, open(wordlist2, 'r') as fd2:
    out = '\n'.join(f'{l1}\n{l2}' for l1, l2 in zip(fd1.read().split(), fd2.read().split()))

with open(wordlist3, 'w') as fd:
    fd.write(out)

Show different lines in files

I have found a script which shows the lines in NEW.txt that do not exist in OLD.txt. It works, but the problem is that the script messes up the line order in the output. This is the script:
with open(r'C:\Users\AMB\NEW.txt') as f, open(r'C:\Users\AMB\OLD.txt') as f2:
    lines1 = set(map(str.rstrip, f))
    s = str(lines1.difference(map(str.rstrip, f2)))
    s = s.replace(',', '\n').replace("'", '').replace("{", '').replace("}", '')
    print(s)
So let's suppose that this is the OLD.txt content:
aaaaaaaaaaaa
cccccccccccc
eeeeeeeeeeee
And this is the NEW.txt content:
aaaaaaaaaaaa
bbbbbbbbbbbb
cccccccccccc
dddddddddddd
eeeeeeeeeeee
hhhhhhhhhhhh
I would like to get this output:
bbbbbbbbbbbb
dddddddddddd
hhhhhhhhhhhh
But I am getting a random line order, for example:
dddddddddddd
bbbbbbbbbbbb
hhhhhhhhhhhh
(the output is random, and not always the same)
Is there a way to keep the order of the lines as they appear in NEW.txt? Thanks in advance.
You can remove the set and do all the operations using lists.
See this link for more help.
Solution:
s = ""
with open(r'C:\Users\AMB\NEW.txt') as f, open(r'C:\Users\AMB\OLD.txt') as f2:
lines1 = list(map(str.rstrip, f)) #list of words in f
lines2 = list(map(str.rstrip, f2)) #list of words in f2
#finds the difference between both lists
diff = [i for i in lines1 + lines2 if i not in lines1 or i not in lines2]
for words in diff:
s = s + words + '\n' #Appending all words to form a single string
s = s.rstrip() #remove last line whitespace
print(s)
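Note that the comprehension above collects the lines that appear in only one of the two files. If you specifically want the lines that are in NEW.txt but not in OLD.txt, kept in their original order, a sketch of that narrower version would be:
with open(r'C:\Users\AMB\NEW.txt') as f, open(r'C:\Users\AMB\OLD.txt') as f2:
    new_lines = list(map(str.rstrip, f))
    old_lines = set(map(str.rstrip, f2))  # a set only for fast membership tests

# Iterating over new_lines preserves the order of NEW.txt.
print('\n'.join(line for line in new_lines if line not in old_lines))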

How to Delete First Few Rows of .txt File in Python?

I have a .txt file which looks like:
# Explanatory text
# Explanatory text
# ID_1 ID_2
10310 34426
104510 4582343
1032410 5424233
12410 957422
In the file, the two IDs on the same row are separated with tabs and the tab character is encoded as '\t'
I'm trying to do some analysis using the numbers in the dataset so want to delete the first three rows. How can this be done in Python? I.e. I'd like to produce a new dataset that looks like:
10310 34426
104510 4582343
1032410 5424233
12410 957422
I've tried the following code but it didn't work:
f = open(filename,'r')
lines = f.readlines()[3:]
f.close()
It doesn't work because I get this format (a list, with \t and \n present), not the one I indicated I want above:
['10310\t34426\n', '104510\t4582343\n', '1032410\t5424233\n', ...]
You can try something like this:
with open(filename, 'r') as fh:
    for curline in fh:
        # check if the current line
        # starts with "#"
        if curline.startswith("#"):
            ...
            ...
        else:
            ...
            ...
You can use Python's pandas to do these kinds of tasks easily:
import pandas as pd
pd.read_csv(filename, header=None, skiprows=[0, 1, 2], sep='\t')
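If you also want to save the cleaned data to a new file, pandas can write it back out. This is a sketch under the assumption that a tab-separated output file named cleaned.txt is acceptable and that the column names are purely illustrative:
import pandas as pd

# skiprows drops the three comment lines; the names are illustrative.
df = pd.read_csv(filename, header=None, skiprows=[0, 1, 2], sep='\t',
                 names=['ID_1', 'ID_2'])
df.to_csv('cleaned.txt', sep='\t', index=False, header=False)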
Ok, here is the solution:
with open('file.txt') as f:
    lines = f.readlines()

lines = lines[3:]
Remove comments
This function removes all comment lines:
def remove_comments(lines):
    return [line for line in lines if not line.startswith("#")]
Remove the first n lines
def remove_n_lines_from_top(lines, n):
    if n <= len(lines):
        return lines[n:]
    else:
        return lines
Here is the complete source:
with open('file.txt') as f:
    lines = f.readlines()

def remove_comments(lines):
    return [line for line in lines if not line.startswith("#")]

def remove_n_lines_from_top(lines, n):
    return lines[n if n <= len(lines) else 0:]

lines = remove_n_lines_from_top(lines, 3)

f = open("new_file.txt", "w+")  # save to new_file.txt
f.writelines(lines)
f.close()

Python linking 2 strings

I am working on a Python program where the goal is to create a tool that takes the first word from each line of one file and puts it beside the corresponding line from a different file.
This is the code snippet:
lines = open("x.txt", "r").readlines()
lines2 = open("c.txt", "r").readlines()
for line in lines:
r = line.split()
line1 = str(r[0])
for line2 in lines2:
l2 = line2
rn = open("b.txt", "r").read()
os = open("b.txt", "w").write(rn + line1+ "\t" + l2)
but it doesn't work correctly.
My question is: how do I make this tool take the first word from each line of one file and put it beside the corresponding line from another file, for all lines in the file?
For example:
File: 1.txt :
hello there
hi there
File: 2.txt :
michal smith
takawa sama
I want the result to be :
Output:
hello michal smith
hi takawa sama
By using the zip function, you can loop through both simultaneously. Then you can pull the first word from your greeting and add it to the name to write to the file.
greetings = open("x.txt", "r").readlines()
names = open("c.txt", "r").readlines()

with open("b.txt", "w") as output_file:
    for greeting, name in zip(greetings, names):
        greeting = greeting.split(" ")[0]
        # strip the newline already present on the name before adding our own
        output = "{0} {1}\n".format(greeting, name.rstrip())
        output_file.write(output)
Yes, as Tigerhawk indicated, you want to use the zip function, which combines elements from different iterables at the same index into tuples (the ith tuple holds the elements at index i from each iterable).
Example code:
lines = open("x.txt", "r").readlines()
lines2 = open("c.txt", "r").readlines()
newlines = ["{} {}".format(x.split()[0] , y) for x, y in zip(lines,lines2)]
with open("b.txt", "w") as opfile:
opfile.write(newlines)
from itertools import *

with open('x.txt', 'r') as lines:
    with open('c.txt', 'r') as lines2:
        with open('b.txt', 'w') as result:
            words = imap(lambda x: str(x.split()[0]), lines)
            results = izip(words, lines2)
            for word, line in results:
                result_line = '{0} {1}'.format(word, line)
                result.write(result_line)
This code will work without loading files into memory.
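Note that imap and izip are Python 2 names. In Python 3, map and zip are already lazy, so roughly the same memory-friendly approach could be sketched as:
with open('x.txt') as lines, open('c.txt') as lines2, open('b.txt', 'w') as result:
    words = (line.split()[0] for line in lines)  # a lazy generator, one line at a time
    for word, line in zip(words, lines2):
        result.write('{0} {1}'.format(word, line))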

Parsing a file in Python

I'm trying to parse two pipe/comma separated files, and if a particular field matches between the two files, create a new entry in a third file.
Code as follows:
#! /usr/bin/python
fo = open("c-1.txt", "r")
for line in fo:
    #print line
    fields = line.split('|')
    src = fields[0]
    f1 = open("Airport.txt", 'r')
    f2 = open("b.txt", "a")
    #with open('c.csv', 'r') as f1:
    #    line1 = f1.read()
    for line1 in f1:
        reader = line1.split(',')
        hi = False
        target = reader[0]
        if target == src and fields[1] == 'ZHT':
            print target
            hi = True
            f2.write(fields[0])
            f2.write("|")
            f2.write(fields[1])
            f2.write("|")
            f2.write(fields[2])
            f2.write("|")
            f2.write(fields[3])
            f2.write("|")
            f2.write(fields[4])
            f2.write("|")
            f2.write(fields[5])
            f2.write("|")
            f2.write(reader[2])
    if hi == False:
        f2.write(line)
    f2.close()
    f1.close()
fo.close()
The matching field gets printed two times in the new file. What could be the reason?
The problem seems to be that you reset hi to False in each iteration of the loop. Let's say the second line matches, but the third does not. You set hi to True for the second line, but then to False again for the third, and then write the original line.
Try it like this:
hi = False
for line1 in f1:
    reader = line1.split(',')
    target = reader[0]
    if target == src and fields[1] == 'ZHT':
        hi = True
        f2.write(stuff)
if hi == False:
    f2.write(line)
Or, assuming that only one line will ever match, you could use for/else:
for line1 in f1:
    reader = line1.split(',')
    target = reader[0]
    if target == src and fields[1] == 'ZHT':
        f2.write(stuff)
        break
else:
    f2.write(line)
Also note that you could probably replace that series of f2.write statements with this single one, joining the parts with |:
f2.write('|'.join(fields[0:6] + [reader[2]]))
As mentioned already, you reset the flag within the loop, so you are liable to write multiple lines.
If there is definitely only one row that will match, it might be worth breaking out of the loop once that row has been found.
And finally, check your data to make sure there aren't identical matching rows.
Other than that, I have a couple of other suggestions to clean up your code and make it easier to debug:
1) Use the csv library.
2) If the files can be held in memory, it would be better to hold them in memory instead of constantly opening and closing them.
3) Use with to handle the files (I note you have already tried this in your comments).
Something like the following should work.
#! /usr/bin/python
import csv

data_0 = {}
data_1 = {}

with open("c-1.txt", "r") as fo, open("Airport.txt", "r") as f1:
    fo_reader = csv.reader(fo, delimiter="|")
    f1_reader = csv.reader(f1)  # the default delimiter is ','
    for line in fo_reader:
        if line[1] == 'ZHT':
            try:  # Add to a list here in case keys are duplicated.
                data_0[line[0]].append(line)
            except KeyError:
                data_0[line[0]] = [line]
    for line in f1_reader:
        data_1[line[0]] = line[2]  # We only need the third column of this row to append to the data.

with open("b.txt", "a") as f2:
    writer = csv.writer(f2, delimiter="|")  # I would be tempted not to make this a pipe, but it is probably too late if you already have a pre-made file.
    for key in data_0:
        if key in data_1.keys():
            for row in data_0[key]:
                writer.writerow(row[:6] + [data_1[key]])  # take the first six columns and append the value from the other file.
        else:
            for row in data_0[key]:
                writer.writerow(row)
That should avoid the extra rows, since there is no True/False flag to rely on.
