Compare lines in two files efficiently in Python

I am trying to compare the lines of two files and capture the lines that match each other. For example,
file1.txt contains
my
sure
file2.txt contains
my : 2
mine : 5
sure : 1
and I am trying to output
my : 2
sure : 1
I have the following code so far
inFile = "file1.txt"
dicts = "file2.txt"
with open(inFile) as f:
    content = f.readlines()
content = [x.strip() for x in content]
with open(dicts) as fd:
    inDict = fd.readlines()
inDict = [x.strip() for x in inDict]
ordered_dict = {}
for line in inDict:
    key = line.split(":")[0].strip()
    value = int(line.split(":")[1].strip())
    ordered_dict[key] = value
for (key, val) in ordered_dict.items():
    for entry in content:
        if entry == key:
            print(key, val)
        else:
            continue
However, this is very inefficient because it loops two times and iterates a lot. Therefore, this is not ideal when it comes to large files. How can I make this workable for large files?

You don't need nested loops. One loop to read in file2 and translate to a dict, and another loop to read file1 and look up the results.
inFile = "file1.txt"
dicts = "file2.txt"
ordered_dict = {}
with open(dicts) as fd:
    for line in fd:
        # strip() removes the trailing newline before splitting
        a, b = line.strip().split(' : ')
        ordered_dict[a] = b
with open(inFile) as f:
    for line in f:
        line = line.strip()
        if line in ordered_dict:
            print(line, ":", ordered_dict[line])
The first loop can be replaced by a generator expression passed to dict().
with open(dicts) as fd:
    ordered_dict = dict(line.strip().split(' : ') for line in fd)
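The two steps above (build a dict once, then look each word up) can be sketched as a small reusable function. The name match_lines and the in-memory sample lists are assumptions for illustration; open file objects could be passed instead, since both arguments only need to be iterables of lines.

```python
def match_lines(words, counts):
    """Yield 'word : count' for each word that has an entry in counts.

    words  -- iterable of lines, one word per line (like file1.txt)
    counts -- iterable of 'word : count' lines (like file2.txt)
    """
    # Build the lookup table once; partition tolerates extra spaces around ':'.
    lookup = {}
    for line in counts:
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            lookup[key.strip()] = value.strip()

    # A dict membership test is O(1) on average, so the total work grows with
    # len(words) + len(counts) instead of their product.
    for line in words:
        word = line.strip()
        if word in lookup:
            yield f"{word} : {lookup[word]}"


file1 = ["my\n", "sure\n"]
file2 = ["my : 2\n", "mine : 5\n", "sure : 1\n"]
result = list(match_lines(file1, file2))
print("\n".join(result))
```

Because the function is a generator, the words file is never held in memory as a whole; only the dictionary built from the counts file is.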

Here is a solution with one for loop:
inFile = "file1.txt"
dicts = "file2.txt"
with open(inFile) as f:
    # a set of stripped words gives fast membership tests
    content_set = set(map(str.strip, f))
ordered_dict = {}
with open(dicts) as fd:
    for dline in fd:
        key, val = dline.strip().split(" : ")
        if key in content_set:
            ordered_dict[key] = val

Related

How to compare individual words in two text files in Python

I am attempting to compare 2 files, A and B. The purpose is to find all the words A has but that are not in B. For example,
File A
my: 2
hello: 5
me: 1
File B
my
name
is
output
hello
me
The code I have so far is
inFile = "fila.txt"
lexicon = "fileb.xml"
with open(inFile) as f:
    content = f.readlines()
content = [x.strip() for x in content]
with open(lexicon) as File:
    lexicon_file = File.readlines()
lexicon_file = [x.strip() for x in lexicon_file]
ordered_dict = {}
for line in content:
    key = line.split(":")[0].strip()
    value = int(line.split(":")[1].strip())
    ordered_dict[key] = value
for entry in lexicon_file:
    for (key, val) in ordered_dict.items():
        if entry == key:
            continue
        else:
            print(key)
However, this takes too long because of the nested loops, and it also prints duplicate words. How do I make this efficient?
Convert both collections into sets (the words from file A, i.e. the keys of ordered_dict, and the lines of the lexicon) and just do a subtraction:
content_wo_lexicon = list(set(ordered_dict) - set(lexicon_file))
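One thing to keep in mind is that a set difference does not preserve the order of the words. A minimal sketch of an order-preserving variant follows; the function name words_missing_from and the sample lists are hypothetical, and the "word: count" line format is taken from the question.

```python
def words_missing_from(lexicon_lines, count_lines):
    """Return the words of count_lines ('word: count') that the lexicon lacks."""
    lexicon = {line.strip() for line in lexicon_lines}
    missing = []
    seen = set()  # guards against printing the same word twice
    for line in count_lines:
        word = line.split(":")[0].strip()
        if word and word not in lexicon and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


file_a = ["my: 2\n", "hello: 5\n", "me: 1\n"]
file_b = ["my\n", "name\n", "is\n"]
missing = words_missing_from(file_b, file_a)
print(missing)
```

Each word is still checked against a set, so the cost stays linear in the size of the two files.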

How to make a dictionary entry from each line in a text file?

So, I have this text file which contains this info:
student_num1 student_name1 student_grade1
student_num2 student_name2 student_grade2
student_num3 student_name3 student_grade3
What I want to do is take each line of this text file as a dictionary entry with this format:
students = { student_num1: [student_name1, student_grade1], student_num2: [student_name2, student_grade2], student_num3: [student_name3, student_grade3] }
Basically, the first string of the line should be the key and the two strings next to it should be the value. But I don't know how to make Python separate the strings in each line and assign them as the key and value for the dictionary.
EDIT:
So, I've tried some code: (I saw all your solutions, and I think they'll all definitely work, but I also want to learn to create my own solution, so I would really appreciate it if you could check mine!)
for line in fh:
    line = line.split(";")
    student_num = line[0]
    student_name = line[1]
    student_grade = line[2]
    count =+ 1
    direc[student_num] = [student_name,student_grade]
    student_num = "student_num" + str(count)
    student_grade = "student_grade" + str(count)
    student_name = "student_name" + str(count)
print(direc)
The problem is i get an error of list index out of range on line 10 or this part "student_name = line[1]"
EDIT: THANK YOU EVERYONE! Every single one of your suggested solutions works! I've also fixed my own solution. This is the fixed one (as suggested by @norok2):
for line in fh:
    line = line.split(" ")
    student_num = line[0]
    student_name = line[1]
    student_grade = line[2]
    count =+ 1
    direc[student_num] = [student_name,student_grade]
    student_num = "student_num" + str(count)
    student_grade = "student_grade" + str(count)
    student_name = "student_name" + str(count)
As a dict comprehension:
with open("data.txt", "r") as f:
    students = {k:v for k, *v in map(str.split, f)}
Explanation:
The file object f is already an iterator (that yields each line), we want to split the lines, so we can use map(str.split, f) or (line.split() for line in f).
After that we know, that the first item is the key of the dictionary, and the remaining items are the values. We can use unpacking for that. An unpacking example:
>>> a, *b = [1,2,3]
>>> a
1
>>> b
[2, 3]
Then we use a comprehension to build the dict with the values we are capturing in the unpacking.
A dict comprehension is an expression that builds a dictionary, for example:
>>> {x:x+1 for x in range(5)}
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5}
Example,
File data.txt:
student_num1 student_name1 student_grade1
student_num2 student_name2 student_grade2
student_num3 student_name3 student_grade3
Reading it
>>> with open("data.txt", "r") as f:
...     students = {k:v for k, *v in map(str.split, f)}
...
>>> students
{'student_num1': ['student_name1', 'student_grade1'], 'student_num2': ['student_name2', 'student_grade2'], 'student_num3': ['student_name3', 'student_grade3']}
My approach opens the file in read mode and reads its lines. Each line is stripped of the trailing newline and surrounding whitespace and split on spaces to create a list. Unpacking then stores the first value as the key and the remaining two values as a list, which is added to the dictionary.
temp.txt
student_num1 student_name1 student_grade1
student_num2 student_name2 student_grade2
student_num3 student_name3 student_grade3
main.py
d = dict()
with open("temp.txt", "r") as f:
    for line in f.readlines():
        key, *values = line.strip().split(" ")
        d[key] = values
print(d)
Output
{'student_num1': ['student_name1', 'student_grade1'], 'student_num2': ['student_name2', 'student_grade2'], 'student_num3': ['student_name3', 'student_grade3']}
with open('data.txt') as f:
    lines = f.readlines()
d = {}
for line in lines:
    tokens = line.split()
    d[tokens[0]] = tokens[1:]
print(d)
I hope this is understandable. To split each line into its tokens, we use the split function.
The reason why your solution is giving you that error is that it seems your lines do not contain the character ;, yet you try to split by that character with line = line.split(";").
You should replace that with:
line = line.split(" ") to split by the space character
or
line = line.split() to split by any blank character
However, for a more elegant solution, see here.
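The difference between the three split calls can be checked on a single line. This is a small sketch; the sample line (with a double space and a tab) is made up to make the differences visible.

```python
# Hypothetical input line: two spaces after the first field, a tab before the last.
line = "student_num1  student_name1\tstudent_grade1\n"

by_space = line.split(" ")       # splits on every single space character
by_whitespace = line.split()     # splits on any run of whitespace, drops the newline
by_semicolon = line.split(";")   # no ';' present, so the whole line comes back

print(by_space)
print(by_whitespace)
print(by_semicolon)
```

With split(";") the result has only one element, which is why line[1] raises "list index out of range". With split(" ") the last element keeps its trailing newline, and repeated spaces produce empty strings, so split() is usually the safer choice for whitespace-separated fields.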
Have you tried something as simple as this:
d = {}
with open('students.txt') as f:
    for line in f:
        key, *rest = line.split()
        d[key] = rest
print(d)
# {'student_num1': ['student_name1', 'student_grade1'], 'student_num2': ['student_name2', 'student_grade2'], 'student_num3': ['student_name3', 'student_grade3']}
file.txt:
student_num1 student_name1 student_grade1
student_num2 student_name2 student_grade2
student_num3 student_name3 student_grade3
Main.py:
def main():
    students = {}
    # The with-statement closes the file even if an error occurs.
    with open('file.txt', 'r') as file:
        for line in file:
            fields = line.split(" ")
            fields[2] = fields[2].replace("\n", "")
            students[fields[1]] = [fields[0], fields[2]]
    print(students)
main()
Output:
{'student_name1': ['student_num1', 'student_grade1'], 'student_name2': ['student_num2', 'student_grade2'], 'student_name3': ['student_num3', 'student_grade3']}

inverting the order of lines in a list

I'm having some difficulty with writing a program in Python. I would like the program to read lines between a set of characters, reverse the order of the lines and then write them into a new file. The input is:
AN10 G17 G21 G90
N20 '2014_12_08_Banding_Test_4
N30 M3 S1B
N40G00X0.000Y0.000Z17.000
N50 G00X0.001Y0.001Z17.000
N60 G01Z0.000F3900.0
N70 G01X0.251
N80 G01X149.999
N90 G01Y0.251
N100 G01X149.749
N110 G01X149.499Z-8.169
N120 G01X148.249Z-8.173
N130 G01X146.999Z-8.183
N140 G01X145.499Z-8.201
...
N3140 G01Y0.501
So far my code is:
with open('Source.nc') as infile, open('Output.nc', 'w') as outfile:
    copy = False
    strings_A = ("G01Y", ".251")
    strings_B = ("G01Y", ".501")
    content = infile.readlines()
    for lines in content:
        lines.splitlines(1)
        if all(x in lines for x in strings_A):
            copy = True
        elif all(x in lines for x in strings_B):
            copy = False
        elif copy:
            outfile.writelines(reversed(lines))
I think I am failing to understand something about the difference between lines and a multiline string. I would really appreciate some help here!
Thanks in advance, Arthur
A string has multiple lines if it contains newline characters \n.
You can think of a file as either one long string that contains newline characters:
s = infile.read()
Or you can treat it like a list of lines:
lines = infile.readlines()
If you have a multiline string you can split it into a list of lines:
lines = s.splitlines(False)
# which is basically a special form of:
lines = s.split('\n')
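The two calls are close but not identical when the text ends with a newline, which file contents usually do. A short sketch on a made-up two-line string:

```python
# Hypothetical file content that ends with a newline.
s = "N10 G17\nN20 G21\n"

with_split = s.split("\n")       # leaves an empty string after the final newline
with_splitlines = s.splitlines() # no trailing empty string
keep_ends = s.splitlines(True)   # keeps the '\n' on each line, like readlines()

print(with_split)
print(with_splitlines)
print(keep_ends)
```

So splitlines() is the closer match to "a list of lines", and splitlines(True) gives the same strings that iterating over the file would.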
If you want to process a file line by line all of the following methods are equivalent (in effect if not in efficiency) :
with open(filename, 'r') as f:
    s = f.read()
lines = s.splitlines()
for line in lines:
    # do something
    pass
with open(filename, 'r') as f:
    lines = f.readlines()
for line in lines:
    # do something
    pass
# this last option is the most pythonic one,
# it uses the fact that a file object can be iterated over line by line
with open(filename, 'r') as f:
    for line in f:
        # do something
        pass
EDIT Now the solution of your problem:
with open('Source.nc') as infile, open('Output.nc', 'w') as outfile:
    copy = False
    strings_A = ("G01Y", ".251")
    strings_B = ("G01Y", ".501")
    target_lines = []
    for line in infile:
        if copy and all(x in line for x in strings_B):
            outfile.writelines(reversed(target_lines))
            break
        if copy:
            target_lines.append(line)
        if all(x in line for x in strings_A):
            copy = True
This will copy all lines between a line that matches all(x in line for x in strings_A) and a line that matches all(x in line for x in strings_B) into the outfile in reversed order. The identifying lines are NOT included in the output (I hope that was the intent).
The order of the if clauses is deliberate to achieve that.
Also be aware that the identification tests (all(x in line for x in strings_A)) you use, work as a substring search not a word match, again I don't know if that was your intent.
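To illustrate the substring caveat, here is a small sketch comparing the substring test with a whole-word test. The helper names matches_substrings and matches_word, and the second sample line, are hypothetical; whether a whole-word match is what the G-code file needs is an assumption.

```python
strings_A = ("G01Y", ".251")


def matches_substrings(line, parts):
    # True if every part occurs anywhere in the line (the test used above).
    return all(part in line for part in parts)


def matches_word(line, word):
    # True only if one whitespace-separated token equals word exactly.
    return word in line.split()


intended = "N90 G01Y0.251\n"
accidental = "N200 G01Y10.251\n"  # made-up line that also contains both substrings

print(matches_substrings(intended, strings_A), matches_substrings(accidental, strings_A))
print(matches_word(intended, "G01Y0.251"), matches_word(accidental, "G01Y0.251"))
```

The substring test accepts both lines, while the whole-word test accepts only the intended marker line.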
EDIT2 In response to comment:
with open('Source.nc') as infile, open('Output.nc', 'w') as outfile:
    strings_A = ("G01Y", ".251")
    strings_B = ("G01Y", ".501")
    do_reverse = False
    lines_to_reverse = []
    for line in infile:
        if all(x in line for x in strings_B):
            do_reverse = False
            outfile.writelines(reversed(lines_to_reverse))
            lines_to_reverse = []  # do not write this block again
            outfile.write(line)
            continue
        if do_reverse:
            lines_to_reverse.append(line)
            continue
        else:
            outfile.write(line)
        if all(x in line for x in strings_A):
            do_reverse = True
            lines_to_reverse = []
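The same idea can be tried without any files by working on a list of lines. This is only a sketch: the function name reverse_between, the predicate arguments, and the START/END sample markers are assumptions, and it additionally keeps an unterminated block in its original order rather than dropping it.

```python
def reverse_between(lines, is_start, is_end):
    """Copy lines, reversing each run that lies between a start and an end marker.

    The marker lines themselves stay in place.
    """
    out = []
    block = None  # None means "not currently inside a block"
    for line in lines:
        if block is not None and is_end(line):
            out.extend(reversed(block))
            block = None
            out.append(line)
        elif block is not None:
            block.append(line)
        else:
            out.append(line)
            if is_start(line):
                block = []
    if block is not None:
        out.extend(block)  # no end marker found: keep the original order
    return out


sample = ["a", "START", "1", "2", "3", "END", "b"]
result = reverse_between(sample, lambda l: l == "START", lambda l: l == "END")
print(result)
```

For the file in the question, the predicates could be the all(x in line for x in strings_A) and all(x in line for x in strings_B) tests, and the returned list could be passed to outfile.writelines.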

Python: Too many values to unpack (dictionary)

I'm trying to add key-value pairs to a dictionary by pairing up consecutive lines from a text file (first line as key, second line as value, and so on). Why does this not work?
newdata = {}
os.chdir("//GOLLUM//tbg2//tbg2//forritGB")
f = open(filename)
for line1, line2 in f.readlines():
    newdata[line1] = line2
edit: The error I get is
ValueError: too many values to unpack
You are reading all lines, and assigning the first line (a sequence) to two variables. This only works if the first line consists of 2 characters. Use the file as an iterator instead:
newdata = {}
os.chdir("//GOLLUM//tbg2//tbg2//forritGB")
with open(filename) as f:
    for line1 in f:
        newdata[line1.strip()] = next(f, '').strip()
Here next() reads the next line from the file.
The alternative would be to use a pair-wise recipe:
from itertools import zip_longest  # named izip_longest in Python 2
def pairwise(iterable):
    # the padding value must be passed by keyword
    return zip_longest(*([iter(iterable)] * 2), fillvalue='')
newdata = {}
os.chdir("//GOLLUM//tbg2//tbg2//forritGB")
with open(filename) as f:
    for line1, line2 in pairwise(f):
        newdata[line1.strip()] = line2.strip()
Note the str.strip() calls, to remove any extra whitespace (including the newline at the end of each line).
newdata = {}
os.chdir("//GOLLUM//tbg2//tbg2//forritGB")
with open(filename) as f:
    for line1, line2 in zip(*[iter(f)]*2):
        newdata[line1] = line2
or
os.chdir("//GOLLUM//tbg2//tbg2//forritGB")
with open(filename) as f:
    newdata = dict(zip(*[iter(f)]*2))
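The zip(*[iter(f)]*2) idiom works because the same iterator object is passed to zip twice, so each pair consumes two consecutive lines. A minimal sketch on an in-memory list (the sample lines are made up, with an odd count on purpose):

```python
from itertools import zip_longest

# Hypothetical file content with an unpaired last line.
lines = ["key1\n", "value1\n", "key2\n", "value2\n", "key3\n"]

it = iter(lines)
pairs = list(zip(it, it))  # same effect as zip(*[iter(lines)] * 2)

# zip_longest pads the incomplete last pair instead of dropping it.
padded = list(zip_longest(*[iter(lines)] * 2, fillvalue=""))
newdata = {key.strip(): value.strip() for key, value in padded}

print(pairs)
print(newdata)
```

Note that plain zip silently drops the unpaired last line, and that without strip() both keys and values keep their trailing newline.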

how to create a dictionary from a file?

I'm trying to write a Python code that will allow me to take in text, and read it line by line. In each line, the words just go into the dictionary as a key and the numbers should be the assigned values, as a list.
the file 'topics.txt' will be composed of hundreds of lines that have the same format as this:
1~cocoa
2~
3~
4~
5~grain~wheat~corn~barley~oat~sorghum
6~veg-oil~linseed~lin-oil~soy-oil~sun-oil~soybean~oilseed~corn~sunseed~grain~sorghum~wheat
7~
8~
9~earn
10~acq
and so on..
i need to create dictionaries for each word
for ex:
Ideally, the name "grain" would be a key in the dictionary, and the values would be dict[grain]: [5,6,..].
similarly,
"cocoa" would be another key and values would be
dict[cocoa]:[1,..]
Not much,but so far..
with open("topics.txt", "r") as fi: # Data read from a text file is a string
    d = {}
    for i in fi.readlines():
        temp = i.split()
        #i am lost here
        num = temp[0]
        d[name] = [map(int, num)]
Use collections.defaultdict (http://docs.python.org/3/library/collections.html#collections.defaultdict):
import collections
with open('topics.txt') as f:
    d = collections.defaultdict(list)
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            d[key].append(value)
value, *keys = ... is Extended Iterable Unpacking which is only available in Python 3.x.
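Since the question asks for numbers as values, here is a sketch of the same defaultdict approach that converts the line number to int and works on any iterable of lines. The function name index_topics and the shortened sample data are assumptions for illustration.

```python
from collections import defaultdict


def index_topics(lines):
    """Map each topic name to the list of line numbers it appears on."""
    topics = defaultdict(list)
    for line in lines:
        number, *names = line.strip().split("~")
        for name in names:
            if name:  # lines like "2~" produce an empty name, which is skipped
                topics[name].append(int(number))
    return dict(topics)


sample = ["1~cocoa\n", "2~\n", "5~grain~wheat~corn\n", "6~veg-oil~corn~grain\n"]
index = index_topics(sample)
print(index["grain"])
```

An open file can be passed in place of the sample list, for example index_topics(open_file) inside a with block.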
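with open("topics.txt", "r") as file: # Data read from a text file is a string
    topics = {}  # avoid naming it dict, which would shadow the built-in
    for fullLine in file:
        splitLine = fullLine.strip().split("~")
        num = splitLine[0]
        for name in splitLine[1:]:
            if not name:
                continue  # skip empty fields such as the one in "2~"
            if name in topics:
                topics[name] = topics[name] + (num,)
            else:
                topics[name] = (num,)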
