Find differences between two files without comparing line by line (Python)

I'm trying to check the differences between two output files which contain a mixture of IP Addresses and Subnets. These are stripped from a file and are stored on output1.txt and output2.txt. I'm struggling when doing a comparison. These files don't always have the same number of lines so comparing line by line doesn't seem an option. For example, both files could have IP address 192.168.1.1 but in output1.txt it could be on line 1 and in output2.txt it could be on line 60. How do I compare properly identifying which strings are not in both files?
Code below
import difflib

with open('input1.txt', 'r') as f:
    with open('output1.txt', 'w') as g:
        for line in f:
            ipaddress = line.split(None, 1)[0]
            g.write(ipaddress + "\n")

with open('input2.txt', 'r') as f:
    with open('output2.txt', 'w') as g:
        for line in f:
            ipaddress = line.split(None, 1)[0]
            g.write(ipaddress + "\n")

with open('output1.txt', 'r') as output1, open('output2.txt', 'r') as output2:
    output1_text = output1.read()
    output2_text = output2.read()
    d = difflib.Differ()
    diff = d.compare(output1_text, output2_text)
    print(''.join(diff))
I will eventually want the differences written to a file, but for now just printing out the result is fine.
Appreciate the help.
Thanks.

You probably want a set comparison:
with open('output1.txt') as fh1, open('output2.txt') as fh2:
    # collect lines into sets
    set1, set2 = set(fh1), set(fh2)

diff = set1.symmetric_difference(set2)
print(''.join(diff))
Where symmetric_difference will:
Return a new set with elements in either the set or other but not both.
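If you also need to know which file each leftover entry came from, the two one-way set differences split the symmetric difference apart. A minimal sketch with in-memory sets standing in for the two output files:

```python
# Stand-ins for the lines of output1.txt and output2.txt
lines1 = {"192.168.1.1", "10.0.0.0/8", "172.16.0.1"}
lines2 = {"192.168.1.1", "10.0.0.0/8", "192.168.50.4"}

only_in_1 = lines1 - lines2  # entries missing from file 2
only_in_2 = lines2 - lines1  # entries missing from file 1

print(sorted(only_in_1))  # ['172.16.0.1']
print(sorted(only_in_2))  # ['192.168.50.4']
```

The union of the two one-way differences is exactly the symmetric difference, so this gives the same entries, just labelled by origin.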

Related

Combine two wordlists into one file (Python)

I have two wordlists, as per the examples below:
wordlist1.txt
aa
bb
cc
wordlist2.txt
11
22
33
I want to take every line from wordlist2.txt and put it after each line in wordlist1.txt and combine them in wordlist3.txt like this:
aa
11
bb
22
cc
33
.
.
Can you please help me with how to do it? Thanks!
Always try to include what you have tried.
However, this is a great place to start.
def read_file_to_list(filename):
    with open(filename) as file:
        lines = file.readlines()
    lines = [line.rstrip() for line in lines]
    return lines

wordlist1 = read_file_to_list("wordlist1.txt")
wordlist2 = read_file_to_list("wordlist2.txt")

with open("wordlist3.txt", 'w', encoding='utf-8') as f:
    for x, y in zip(wordlist1, wordlist2):
        f.write(x + "\n")
        f.write(y + "\n")
Check the following question for more ideas and understanding: How to read a file line-by-line into a list?
Cheers
Open wordlist1.txt and wordlist2.txt for reading and wordlist3.txt for writing. Then it's as simple as:
with open('wordlist3.txt', 'w') as w3, open('wordlist1.txt') as w1, open('wordlist2.txt') as w2:
    for l1, l2 in zip(map(str.rstrip, w1), map(str.rstrip, w2)):
        print(f'{l1}\n{l2}', file=w3)
Instead of using .splitlines(), you can also iterate over the files directly. Here's the code:
wordlist1 = open("wordlist1.txt", "r")
wordlist2 = open("wordlist2.txt", "r")
wordlist3 = open("wordlist3.txt", "w")

for txt1, txt2 in zip(wordlist1, wordlist2):
    if not txt1.endswith("\n"):
        txt1 += "\n"
    wordlist3.write(txt1)
    wordlist3.write(txt2)

wordlist1.close()
wordlist2.close()
wordlist3.close()
In the first block, we open the files. For the first two, we use "r", which stands for read, as we don't want to change those files; we could omit it, since "r" is the default mode of the open function. For the last one, we use "w", which stands for write; if the file doesn't exist yet, it will be created.
Next, we use the zip function in the for loop. It creates an iterator containing tuples from all iterables provided as arguments. In this loop, it will contain tuples containing each one line of wordlist1.txt and one of wordlist2.txt. These tuples are directly unpacked into the variables txt1 and txt2.
Next we use an if statement to check whether the line of wordlist1.txt ends with a newline. This might not be the case with the last line, so this needs to be checked. We don't check it with the second line, as it is no problem that the last line has no newline because it will also be at the end of the resulting file.
Next, we write the text to wordlist3.txt. Each write appends to the end of the open file; note, however, that any text the file contained before it was opened with "w" is lost.
Finally, we close the files. This is important: otherwise buffered writes might not be flushed to disk, and other applications may not be able to use the file in the meantime.
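One caveat of zip in all these answers: it stops at the end of the shorter file and silently drops the tail of the longer one. If the wordlists might differ in length, itertools.zip_longest pads the shorter side instead. A sketch with hypothetical in-memory lists:

```python
from itertools import zip_longest

words1 = ["aa", "bb", "cc"]
words2 = ["11", "22"]  # hypothetically one entry shorter

interleaved = []
for w1, w2 in zip_longest(words1, words2, fillvalue=None):
    # skip the padding so the longer list's tail is still kept
    if w1 is not None:
        interleaved.append(w1)
    if w2 is not None:
        interleaved.append(w2)

print(interleaved)  # ['aa', '11', 'bb', '22', 'cc']
```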
Try this:
with open('wordlist1.txt', 'r') as f1:
    f1_list = f1.read().splitlines()
with open('wordlist2.txt', 'r') as f2:
    f2_list = f2.read().splitlines()

# zip the line lists (not the closed file handles)
f3_list = [x for t in zip(f1_list, f2_list) for x in t]

with open('wordlist3.txt', 'w') as f3:
    f3.write("\n".join(f3_list))
with open('wordlist1.txt') as w1, \
        open('wordlist2.txt') as w2, \
        open('wordlist3.txt', 'w') as w3:
    for wordlist1, wordlist2 in zip(w1.readlines(), w2.readlines()):
        if wordlist1[-1] != '\n':
            wordlist1 += '\n'
        if wordlist2[-1] != '\n':
            wordlist2 += '\n'
        w3.write(wordlist1)
        w3.write(wordlist2)
Here you go :)
with open('wordlist1.txt', 'r') as f:
    file1 = f.readlines()
with open('wordlist2.txt', 'r') as f:
    file2 = f.readlines()

with open('wordlist3.txt', 'w') as f:
    for x in range(len(file1)):
        if not file1[x].endswith('\n'):
            file1[x] += '\n'
        f.write(file1[x])
        if not file2[x].endswith('\n'):
            file2[x] += '\n'
        f.write(file2[x])
Open wordlist 1 and 2 for reading and build line pairs, separating the two lines of each pair with a newline character; then join all the pairs together, again separated by newlines.
# paths (the filenames from the question)
wordlist1 = "wordlist1.txt"
wordlist2 = "wordlist2.txt"
wordlist3 = "wordlist3.txt"

with open(wordlist1, 'r') as fd1, open(wordlist2, 'r') as fd2:
    out = '\n'.join(f'{l1}\n{l2}' for l1, l2 in zip(fd1.read().split(), fd2.read().split()))

with open(wordlist3, 'w') as fd:
    fd.write(out)

Generate output file Python

As can be seen in the code, I created two output files: one for the output after splitting,
and a second for the actual output after removing duplicate lines.
How can I make only one output file? Sorry if I sound too stupid, I'm a beginner.
import sys

txt = sys.argv[1]
lines_seen = set()  # holds lines already seen
outfile = open("out.txt", "w")
actualout = open("output.txt", "w")

for line in open(txt, "r"):
    line = line.split("?", 1)[0]
    outfile.write(line + "\n")
outfile.close()

for line in open("out.txt", "r"):
    if line not in lines_seen:  # not a duplicate
        actualout.write(line)
        lines_seen.add(line)
actualout.close()
You can add the lines from the input file directly into the set. Since sets cannot have duplicates, you don't even need to check for those. Try this:
import sys

txt = sys.argv[1]
lines_seen = set()  # holds lines already seen
actualout = open("output.txt", "w")

for line in open(txt, "r"):
    line = line.split("?", 1)[0]
    lines_seen.add(line + "\n")

for line in lines_seen:
    actualout.write(line)
actualout.close()
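One caveat of the set approach above: iterating a set loses the original line order. If the order of first appearance matters, dict.fromkeys (insertion-ordered since Python 3.7) removes duplicates while keeping it. A sketch with hypothetical sample lines in place of the input file:

```python
# Hypothetical input lines of the form "value?query"
lines = ["a?x\n", "b?y\n", "a?z\n", "c?w\n"]

# keep the part before '?', drop duplicates, preserve first-seen order
cleaned = [line.split("?", 1)[0] for line in lines]
deduped = list(dict.fromkeys(cleaned))

print(deduped)  # ['a', 'b', 'c']
```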
In the first step you iterate through every line in the file, split the line on your delimiter, and store the result in a list. After that you iterate through the list and write it into your output file.
import sys

txt = sys.argv[1]
lines_seen = set()  # holds lines already seen
actualout = open("output.txt", "w")

# split each line on the delimiter and keep the first part
data = [line.rstrip("\n").split("?", 1)[0] for line in open(txt, "r")]

for line in data:
    if line not in lines_seen:  # not a duplicate
        actualout.write(line + "\n")
        lines_seen.add(line)
actualout.close()

Find coincidence and add column

I want to achieve this specific task, I have 2 files, the first one with emails and credentials:
xavier.desprez#william.com:Xavier
xavier.locqueneux#william.com:vocojydu
xaviere.chevry#pepe.com:voluzigy
Xavier.Therin#william.com:Pussycat5
xiomara.rivera#william.com:xrhj1971
xiomara.rivera#william-honduras.william.com:xrhj1971
and the second one, with emails and location:
xavier.desprez#william.com:BOSNIA
xaviere.chevry#pepe.com:ROMANIA
I want that, whenever the email from the first file is found on the second file, the row is substituted by EMAIL:CREDENTIAL:LOCATION , and when it is not found, it ends up being: EMAIL:CREDENTIAL:BLANK
so the final file must be like this:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.locqueneux#william.com:vocojydu:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
I have made several attempts in Python, but I am not really close to the solution.
Regards!
EDIT:
This is what I tried:
import os
import sys
with open("test.txt", "r") as a_file:
    for line_a in a_file:
        stripped_email_a = line_a.strip().split(':')[0]
        with open("location.txt", "r") as b_file:
            for line_b in b_file:
                stripped_email_b = line_b.strip().split(':')[0]
                location = line_b.strip().split(':')[1]
                if stripped_email_a == stripped_email_b:
                    a = line_a + ":" + location
                    print(a.replace("\n", ""))
                else:
                    b = line_a + ":BLANK"
                    print(b.replace("\n", ""))
This is the result I get:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.desprez#william.com:Xavier:BLANK
xaviere.chevry#pepe.com:voluzigy:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
xavier.locqueneux#william.com:vocojydu:BLANK
xavier.locqueneux#william.com:vocojydu:BLANK
Xavier.Therin#william.com:Pussycat5:BLANK
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
I am very close but I get duplicates ;)
Regards
The duplication comes from reading the two files in a nested way: for each line of test.txt you re-open location.txt and process it in full, so every line of test.txt produces one printed line per line of location.txt.
Instead, get all the necessary data from the location.txt, say, into a dictionary, and then use it while reading the test.txt:
email_loc_dict = {}
with open("location.txt", "r") as b_file:
    for line_b in b_file:
        splits = line_b.strip().split(':')
        email_loc_dict[splits[0]] = splits[1]

with open("test.txt", "r") as a_file:
    for line_a in a_file:
        line_a = line_a.strip()
        stripped_email_a = line_a.split(':')[0]
        if stripped_email_a in email_loc_dict:
            a = line_a + ":" + email_loc_dict[stripped_email_a]
            print(a)
        else:
            b = line_a + ":BLANK"
            print(b)
Output:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.locqueneux#william.com:vocojydu:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
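The if/else lookup can be condensed with dict.get, which takes a default value for missing keys. A sketch using small in-memory samples of the two files:

```python
# Stand-ins for location.txt (as a dict) and test.txt (as a list of lines)
locations = {
    "xavier.desprez#william.com": "BOSNIA",
    "xaviere.chevry#pepe.com": "ROMANIA",
}
creds = [
    "xavier.desprez#william.com:Xavier",
    "xavier.locqueneux#william.com:vocojydu",
]

# look up the email (part before the first ':'), defaulting to BLANK
merged = [
    f"{line}:{locations.get(line.split(':', 1)[0], 'BLANK')}"
    for line in creds
]

print(merged[0])  # xavier.desprez#william.com:Xavier:BOSNIA
print(merged[1])  # xavier.locqueneux#william.com:vocojydu:BLANK
```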

Split a large text file to small ones based on location

Suppose I have a big file, file.txt, with around 300,000 lines of data. I want to split it based on a certain key location. See file.txt below:
Line 1: U0001;POUNDS;**CAN**;1234
Line 2: U0001;POUNDS;**USA**;1234
Line 3: U0001;POUNDS;**CAN**;1234
Line 100000: U0001;POUNDS;**CAN**;1234
The locations are limited to 10-15 different nations, and I need to write all records for a particular country to one particular file. How can I do this task in Python?
Thanks for help
This will run with very low memory overhead as it writes each line as it reads it.
Algorithm:
open input file
read a line from input file
get country from line
if new country then open file for country
write the line to country's file
loop if more lines
close files
Code:
with open('file.txt', 'r') as infile:
    try:
        outfiles = {}
        for line in infile:
            country = line.split(';')[2].strip('*')
            if country not in outfiles:
                outfiles[country] = open(country + '.txt', 'w')
            outfiles[country].write(line)
    finally:
        for outfile in outfiles.values():
            outfile.close()
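The try/finally cleanup can also be delegated to contextlib.ExitStack, which closes every file it has registered even if an exception interrupts the loop. A self-contained sketch using a hypothetical in-memory record list and a temporary directory in place of the real file.txt:

```python
import contextlib
import os
import tempfile

# Hypothetical sample records standing in for the lines of file.txt
records = [
    "U0001;POUNDS;**CAN**;1234\n",
    "U0001;POUNDS;**USA**;1234\n",
    "U0001;POUNDS;**CAN**;1234\n",
]

outdir = tempfile.mkdtemp()
with contextlib.ExitStack() as stack:
    outfiles = {}
    for line in records:
        country = line.split(";")[2].strip("*")
        if country not in outfiles:
            path = os.path.join(outdir, country + ".txt")
            # register the file so ExitStack closes it on exit
            outfiles[country] = stack.enter_context(open(path, "w"))
        outfiles[country].write(line)
# all per-country files are closed here, even after an error mid-loop
```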
from itertools import groupby
from operator import itemgetter

with open("file.txt") as f:
    content = f.readlines()

# you may also want to remove whitespace characters like `\n` at the end of each line
text = [x.strip() for x in content]
x = [i.split(";") for i in text]
x.sort(key=lambda x: x[2])

y = groupby(x, itemgetter(2))
res = [(i[0], [j for j in i[1]]) for i in y]

for country in res:
    with open(country[0] + ".txt", "w") as writeFile:
        writeFile.writelines("%s\n" % ';'.join(l) for l in country[1])
This will group the rows by your item.
Hope it helps!
Looks like what you have is a csv file. csv stands for comma-separated values, but any file that uses a different delimiter (in this case a semicolon ;) can be treated like a csv file.
We'll use the python module csv to read the file in, and then write a file for each country
import csv
from collections import defaultdict

d = defaultdict(list)

with open('file.txt', 'r', newline='') as f:
    r = csv.reader(f, delimiter=';')
    for line in r:
        d[line[2]].append(line)

for country in d:
    with open('{}.txt'.format(country), 'w', newline='') as outfile:
        w = csv.writer(outfile, delimiter=';')
        for line in d[country]:
            w.writerow(line)
# the formatting-function for the filename used for saving
outputFileName = "{}.txt".format
# alternative:
##import time
##outputFileName = lambda loc: "{}_{}.txt".format(loc, time.asctime())

# make a dictionary indexed by location; the contained item is the new
# content of the file for that location
sortedByLocation = {}

f = open("file.txt", "r")
# iterate over each line and look at the column for the location
for l in f.readlines():
    line = l.split(';')
    # the third field (indices begin with 0) is the location abbreviation.
    # lower-case it, because on some filesystems "CAN.txt" and "can.txt" are
    # the same file, while Python would treat the keys as distinct
    location = line[2].lower().strip()
    # get the previous lines for the location and store the line back
    tmp = sortedByLocation.get(location, "")
    sortedByLocation[location] = tmp + l.strip() + '\n'
f.close()

# save a file for each location
for location, text in sortedByLocation.items():
    with open(outputFileName(location), 'w') as f:
        f.write(text)

How do I concatenate multiple CSV files row-wise using python?

I have a dataset of about 10 CSV files. I want to combine those files row-wise into a single CSV file.
What I tried:
import csv

fout = open("claaassA.csv", "a")
# first file:
writer = csv.writer(fout)
for line in open("a01.ihr.60.ann.csv"):
    print line
    writer.writerow(line)
# now the rest:
for num in range(2, 10):
    print num
    f = open("a0" + str(num) + ".ihr.60.ann.csv")
    #f.next() # skip the header
    for line in f:
        print line
        writer.writerow(line)
    #f.close() # not really needed
fout.close()
We definitely need more details in the question (ideally examples of the inputs and the expected output).
Given the little information provided, I will assume that you know that all files are valid CSV and that they all have the same number of lines (rows). I'll also assume that memory is not a concern (i.e. they are "small" files that fit together in memory). Furthermore, I assume that line endings are newlines (\n).
If all these assumptions are valid, then you can do something like this:
input_files = ['file1.csv', 'file2.csv', 'file3.csv']
output_file = 'output.csv'

output = None
for infile in input_files:
    with open(infile, 'r') as fh:
        if output:
            for i, l in enumerate(fh.readlines()):
                output[i] = "{},{}".format(output[i].rstrip('\n'), l)
        else:
            output = fh.readlines()

with open(output_file, 'w') as fh:
    for line in output:
        fh.write(line)
There are probably more efficient ways, but this is a quick and dirty way to achieve what I think you are asking for.
The previous answer implicitly assumes we need to do this in python. If bash is an option then you could use the paste command. For example:
paste -d, file1.csv file2.csv file3.csv > output.csv
I don't fully understand why you use the csv library. Actually, it's enough to fill the output file with the lines from the given files (if they have the same column names and order).
input_path_list = [
    "a01.ihr.60.ann.csv",
    "a02.ihr.60.ann.csv",
    "a03.ihr.60.ann.csv",
    "a04.ihr.60.ann.csv",
    "a05.ihr.60.ann.csv",
    "a06.ihr.60.ann.csv",
    "a07.ihr.60.ann.csv",
    "a08.ihr.60.ann.csv",
    "a09.ihr.60.ann.csv",
]
output_path = "claaassA.csv"

with open(output_path, "w") as fout:
    header_written = False
    for input_path in input_path_list:
        with open(input_path) as fin:
            header = next(fin)
            # it adds the header at the beginning and skips other headers
            if not header_written:
                fout.write(header)
                header_written = True
            # it adds all rows
            for line in fin:
                fout.write(line)
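The header-skipping pattern above can also be written with the standard-library fileinput module, which chains several files into one line stream and can report whether the current line is the first of its file. A self-contained sketch using two tiny generated CSV files (the names are hypothetical stand-ins for the a0N.ihr.60.ann.csv inputs):

```python
import fileinput
import os
import tempfile

# build two small CSV files that share a header
tmpdir = tempfile.mkdtemp()
paths = []
for name, row in [("a01.csv", "1,2"), ("a02.csv", "3,4")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write("colA,colB\n" + row + "\n")
    paths.append(path)

out_path = os.path.join(tmpdir, "combined.csv")
with open(out_path, "w") as out, fileinput.input(paths) as fin:
    for line in fin:
        # keep the header from the first file only
        if fin.isfirstline() and fin.filename() != paths[0]:
            continue
        out.write(line)

with open(out_path) as f:
    print(f.read())
# colA,colB
# 1,2
# 3,4
```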
