Compare two files and find matching words in Python


I have two files. The first one contains terms and their frequencies:
table 2
apple 4
pencil 89
The second file is a dictionary:
abroad
apple
bread
...
I want to check whether the first file contains any words from the second file. For example, both the first file and the second file contain "apple".
I am new to Python.
I tried something, but it does not work. Could you help me? Thank you.
for line in dictionary:
    words = line.split()
    print(words[0])
for line2 in test:
    words2 = line2.split()
    print(words2[0])

Something like this:
with open("file1") as f1, open("file2") as f2:
    # here file1 is the dictionary file and file2 is the word/frequency file
    words = set(line.strip() for line in f1)  # create a set of words from the dictionary file
    # why sets? sets provide O(1) average lookup, so the overall complexity is O(N)
    # now loop over each line of the other file (word, freq file)
    for line in f2:
        word, freq = line.split()  # fetch word, freq
        if word in words:  # if the word is found in the words set, print it
            print(word)
Output:
apple

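This may help you:
# file1.txt holds "word frequency" lines, so keep only the word (the first field)
file1 = set(line.split()[0] for line in open('file1.txt') if line.strip())
file2 = set(line.strip() for line in open('file2.txt'))
for word in file1 & file2:
    if word:
        print(word)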

Here's what you should do:
First, you need to put all the dictionary words in some place where you can easily look them up. If you don't do that, you'd have to read the whole dictionary file every time you want to check one single word in the other file.
Second, you need to check if each word in the file is in the words you extracted from the dictionary file.
For the first part, you need to use either a list or a set. The difference between the two is that a list keeps the order in which you put the items into it. A set is unordered, so it doesn't matter which word you read first from the dictionary file. Also, a set is faster when you look up an item, because that's what it is designed for.
To see if an item is in a set, you can write item in my_set, which evaluates to either True or False.
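The two steps described above can be sketched as follows. This is only a minimal illustration: the two lists are hypothetical in-memory stand-ins for the files in the question, and the variable names are made up for the example.

```python
# Hypothetical in-memory stand-ins for the two files in the question.
frequency_lines = ["table 2", "apple 4", "pencil 89"]
dictionary_lines = ["abroad", "apple", "bread"]

# Step 1: load every dictionary word into a set once.
dictionary_words = {line.strip() for line in dictionary_lines if line.strip()}

# Step 2: check the first field of each frequency line against the set.
matches = []
for line in frequency_lines:
    fields = line.split()
    if fields and fields[0] in dictionary_words:
        matches.append(fields[0])

print(matches)  # ['apple']
```

With real files, the two lists would be replaced by the open file objects, since iterating over a file also yields one line at a time.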

I have your first (two-column) list in try.txt and the single-column list in try_match.txt:
f = open('try.txt', 'r')
f_match = open('try_match.txt', 'r')
dictionary = []
for line in f:
    a, b = line.split()
    dictionary.append(a)
for line in f_match:
    if line.split()[0] in dictionary:
        print(line.split()[0])

Related

Reading a text file and replacing it to value in dictionary

I have a dictionary made in Python. I also have a text file where each line is a different word. I want to check each line of the text file against the keys of the dictionary, and if the line matches a key I want to write that key's value to an output file. Is there an easy way to do this? Is this even possible?
For example, I am reading my file in like this:
test = open("~/Documents/testfile.txt").read()
After tokenising it, I want to look up each word token in a dictionary. My dictionary is set up like this:
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}
If I come across the letter 'a' in my file, I want it to output ["ah0", "ey1"].

You can try:
for line in all_lines:
    for val in dic:
        if line.count(val) > 0:
            print(dic[val])
This looks through all lines in the file, and if a line contains a key from dic, it prints the items associated with that key in the dictionary. (You will have to do something like all_lines = test.splitlines() to get all the lines in a list, since test is already a string.) dic[val] gives the list assigned to the key, for example ["ah0", "ey1"], so you do not just have to print it; you can use it in other places as well.
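Note that line.count(val) > 0 is a substring test, so a line such as "banana" also matches the key "a". If an exact match between a line and a key is wanted, a direct dictionary lookup is enough. The following is a minimal sketch under that assumption; the text value is a hypothetical stand-in for the file contents.

```python
# Dictionary copied from the question.
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}

# Hypothetical file contents: one word per line.
text = "a\nbanana\na's\n"

# Look up each stripped line as an exact key instead of searching for substrings.
found = []
for line in text.splitlines():
    word = line.strip()
    if word in dic:
        found.append((word, dic[word]))

print(found)  # [('a', ['ah0', 'ey1']), ("a's", ['ey1 z'])]
```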

You can give this a try:
# dictionary whose keys are matched against the words in the text file
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}
# read the words from the text file, one per line
with open('sampletext.txt', 'r') as open_file:
    lines = open_file.readlines()
# look up each word; if it is in the dictionary, write its list to the output file
# (the output file is opened once, so earlier matches are not overwritten)
with open('outputfile.txt', 'w') as write_to_file:
    for line in lines:
        word = line.strip()  # drop the trailing newline before the lookup
        if word in dic:
            write_to_file.write(str(dic[word]) + "\n")
Note: every line read from the text file ends with a newline "\n", so it has to be stripped before it can match a key.

Replace words of a long document in Python

I have a dictionary dict with some words (2000) and I have a huge text, like the Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
    for line in original:
        new_line = line
        for word in line.split():
            if (dict.get(word.lower()) is not None):
                new_line = new_line.replace(word,word+"_1")
        mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer one that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
if a line looks like this: Albert is a friend of Albert, and my dictionary contains Albert, then after the for loop the line will look like this: Albert_1_1 is a friend of Albert_1. How can I replace only the exact word that I want, to avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                mod.write(word+"_1 ")
            else:
                mod.write(word+" ")
        mod.write("\n")
Now everything should work.

A few things:
You could remove the declaration of new_line. Then, replace the new_line = new_line.replace(...) line with line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: for the loop. This is mainly a readability change, because the line.split() in the for statement is evaluated only once per line anyway.
You could (manually?) split your large .txt file into multiple smaller files and run one instance of your program on each file, and then combine the outputs into one file. Note: you would have to remember to change the filename for each file you're reading from and writing to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                line = line.replace(word, word + "_1")
        mod.write(line)
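For the repeated-suffix problem mentioned in the question's edit, one possible alternative is to let a regular expression visit each word exactly once. The sketch below is not the method from the answer above: it swaps in re.sub with a callback, the vocabulary contents and the helper name tag_known_words are hypothetical, and it assumes that a "word" is a run of \w characters (so "don't" would be treated as two words, unlike with split()).

```python
import re

# Hypothetical stand-in for the 2000-word dictionary from the question
# (keys are lower-case, as implied by the word.lower() lookup).
vocabulary = {"albert": 1, "friend": 1}

def tag_known_words(line, vocab):
    """Append "_1" to every whole word of `line` whose lower-case form is in `vocab`."""
    def tag(match):
        word = match.group(0)
        return word + "_1" if word.lower() in vocab else word
    # \w+ matches each run of word characters once, so no word is tagged twice.
    return re.sub(r"\w+", tag, line)

print(tag_known_words("Albert is a friend of Albert\n", vocabulary))
```

Because the substitution never rescans text it has already replaced, a word that occurs several times in a line receives the suffix once per occurrence, and the original spacing and punctuation of the line are preserved.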

Text to dictionary doesn't work

I have the following text file in the same folder as my Python Code.
78459581
Black Ballpoint Pen
12345670
Football
49585922
Perfume
83799715
Shampoo
I have written this Python code.
file = open("ProductDatabaseEdit.txt", "r")
d = {}
for line in file:
    x = line.split("\n")
    a=x[0]
    b=x[1]
    d[a]=b
print(d)
This is the result I receive.
b=x[1] # IndexError: list index out of range
My dictionary should appear as follows:
{"78459581" : "Black Ballpoint Pen",
 "12345670" : "Football",
 "49585922" : "Perfume",
 "83799715" : "Shampoo"}
What am I doing wrong?

A line is terminated by a linebreak, thus line.split("\n") will never give you more than one line.
You could cheat and do:
for first_line in file:
    second_line = next(file)

You can simplify your solution by passing a generator expression to dict(); this is probably the most Pythonic solution I can think of:
>>> with open("in.txt") as f:
...     my_dict = dict((line.strip(), next(f).strip()) for line in f)
...
>>> my_dict
{'12345670': 'Football', '49585922': 'Perfume', '78459581': 'Black Ballpoint Pen', '83799715': 'Shampoo'}
Here in.txt contains the data as described in the problem. It is necessary to strip() each line; otherwise you would be left with a trailing \n character in your keys and values.

You need to strip the \n, not split:
file = open("products.txt", "r")
d = {}
for line in file:
    a = line.strip()
    # next(file) works in Python 2.6+ and 3.x; file.next() exists only in Python 2
    b = next(file).strip()
    d[a] = b
print(d)
{'12345670': 'Football', '49585922': 'Perfume', '78459581': 'Black Ballpoint Pen', '83799715': 'Shampoo'}

What's going on
When you open a file you get an iterator, which will give you one line at a time when you use it in a for loop.
Your code iterates over the file and splits every line on \n, but a single line never contains a second line of text: a line that still ends in a newline splits into the text plus an empty string, and a last line with no trailing newline splits into a list with only one item. When the code then tries to access the second item of that one-item list, it doesn't exist. That's why you get the IndexError: list index out of range.
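A quick way to see this is to call split("\n") on a few lines by hand. The list below is a hypothetical stand-in for what the file iterator yields; it assumes that the last line of the file has no trailing newline.

```python
# Hypothetical lines as a file iterator would yield them: the trailing
# newline is still attached, except (here) on the very last line.
lines = ["78459581\n", "Black Ballpoint Pen\n", "Shampoo"]

for line in lines:
    parts = line.split("\n")
    print(repr(line), "->", parts)

# A line ending in "\n" splits into the text plus an empty string;
# the last line has no newline, so the list has one item and parts[1] fails.
```

Either way, the second element is never the next line of the file, which is why the product name has to be fetched separately.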
How to fix it
What you need is this:
file = open('products.txt','r')
d = {}
for line in file:
    d[line.strip()] = next(file).strip()
In every iteration you add a new key to the dictionary (by assigning a value to a key that didn't exist yet) and assign the next line as the value. The next() function just tells the file iterator "please move on to the next line". So, to drive the point home: in the first iteration you set the first line as a key and assign the second line as the value; in the second iteration, you set the third line as a key and assign the fourth line as the value; and so on.
The reason you need to use the .strip() method every time is that every line read from the file ends with a newline (and your example file also had a space at the end of every line), so that method removes them.
Or...
You can also get the same result using a dictionary comprehension:
file = open('products.txt','r')
d = {line.strip():next(file).strip() for line in file}
Basically, it is a shorter version of the same code above. It's shorter, but less readable: not necessarily something you want (a matter of taste).

In my solution I tried not to use any loops. Therefore, I first load the txt data with pandas:
import pandas as pd
file = pd.read_csv("test.txt", header = None)
Then I separate the keys and values for the dict like this:
keys, values = file[0::2].values, file[1::2].values
Then, we can directly zip these two as lists and create a dict:
result = dict(zip(list(keys.flatten()), list(values.flatten())))
To create this solution I used the answers to the questions "How to remove every other element of an array in python? (The inverse of np.repeat()?)" and "Map two lists into a dictionary in Python".

You can loop over a list two items at a time:
file = open("ProductDatabaseEdit.txt", "r")
data = file.readlines()
d = {}
# step through the list two lines at a time: key line, then value line
for i in range(0, len(data), 2):
    d[data[i].strip()] = data[i+1].strip()

Try this code (where the data is in /tmp/tmp5.txt):
#!/usr/bin/env python3
d = dict()
iskey = True
with open("/tmp/tmp5.txt") as infile:
    for line in infile:
        if iskey:
            _key = line.strip()
        else:
            _value = line.strip()
            d[_key] = _value
        iskey = not iskey
print(d)
Which gives you:
{'12345670': 'Football', '49585922': 'Perfume', '78459581': 'Black Ballpoint Pen', '83799715': 'Shampoo'}

Reading text file into dic file results in incomplete dic file

The following file, 2016_01_22_Reps.txt, is a list of expanded contractions that I want to put into a Python dictionary:
“can't":"cannot","could've":"could have","could've":"could have","didn't":"did not","doesn't":"does not", “don't":"do not"," hadn't":"had not", "hasn't":"has not","haven't":"have not","I'll":"I will","I'm":"I am","I've":"I have","isn't":"is not","I'll":"I
Note that the contents are a single line, not multiple lines.
My code is as follows:
reps = open('2016_01_22_Reps.txt', 'r')
Reps1dic={}
for line in reps:
    x=line.split(",")
    a=x[0]
    b=x[1]
    c=len(b)-1
    b=b[0:c]
    Reps1dic[a]=b
print (Reps1dic)
The output of Reps1dic stops after the first two pairs of contractions. The contents are as follows:
{‘2016_01_22Reps = {“can\’t”:”cannot”‘ : ‘”could\’ve”:”could have’}
Instructions and explanation of why the complete file contents are not written to the dic file will be most appreciated.

The problem is that your values are all on one line, so your for line in reps loop runs for only one iteration. Do something like this:
with open('2016_01_22_Reps.txt', 'r') as reps:
    Reps1dic={}
    contents = reps.read()
    pairs = contents.split(',')
    for pair in pairs:
        parts = pair.split(':')
        a = parts[0].replace('"', '').strip()
        b = parts[1].replace('"', '').strip()
        Reps1dic[a] = b
print(Reps1dic)
Here you split the line into pairs and then iterate over that list instead of over the lines in the file. I also used the with statement to open your file, which is much better practice.
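The sample line in the question also contains curly quotes (“), which replace('"', '') does not remove. The following is a minimal sketch of the same split-on-comma, split-on-colon idea that strips both kinds of quote and skips fragments without a colon. The contents string is a shortened, hypothetical stand-in for the file, and the sketch assumes that no key or value contains a comma or a colon.

```python
# Hypothetical one-line input in the same shape as the question's file,
# including the curly opening quote that appears in the sample.
contents = '“can\'t":"cannot","could\'ve":"could have", "don\'t":"do not"'

QUOTES = '"“”'  # straight and curly double quotes to strip

reps = {}
for pair in contents.split(","):
    key, sep, value = pair.partition(":")
    if not sep:
        continue  # skip fragments that have no key:value separator
    reps[key.strip().strip(QUOTES)] = value.strip().strip(QUOTES)

print(reps)
```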

Store words of file in dictionary

I want to store the words of a text file in a dictionary.
My code is
word=0
char=0
i=0
a=0
d={}
with open("m.txt","r") as f:
    for line in f:
        w=line.split()
        d[i]=w[a]
        i=i+1
        a=a+1
        word=word+len(w)
        char=char+len(line)
print(word,char)
print(d)
My text file is
jdfjdnv dj g gjv,kjvbm
The problem is that the dictionary stores only the first word of the text file. How can I store the rest of the words? Please help.

How many lines does your text file have? If it has only one line, your loop executes only once: it splits the whole line into separate words and then saves just one word in the dict. If you want to save all the words from a one-line text file, you need to add another loop, like this:
for word in line.split():
    d[i] = word
    i += 1

You only store the first word because you only have one line in the file, and your only for loop is over the lines.
Generally, if you are going to key the dictionary by index, you can just use the list you are already making:
w = []
char = 0
with open("m.txt", "r") as f:
    for line in f:
        char += len(line)
        w.extend(line.split())
word = len(w)  # number of words, as with word = word + len(w) in the question
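If a dictionary keyed by position is still wanted, it can be built from that list in one step. This is only a sketch: io.StringIO is used here as a hypothetical in-memory stand-in for m.txt, and the names words and char_count are made up for the example.

```python
import io

# Hypothetical in-memory stand-in for m.txt from the question.
f = io.StringIO("jdfjdnv dj g gjv,kjvbm\n")

words = []
char_count = 0
for line in f:
    char_count += len(line)
    words.extend(line.split())

# If a dict keyed by position is really needed, build it from the list.
d = dict(enumerate(words))

print(len(words), char_count)
print(d)
```

Since split() only separates on whitespace, "gjv,kjvbm" stays a single word here, just as it does in the question's code.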
