How do I find all unique words (no duplicates)? - python

I would like to find all the unique words that are in both files. I am able to list all the words from each file, but it gives me duplicates. I also would like to sort them in alphabetical order. How do I go about doing this?
#!/usr/bin/python3
# First file
file = raw_input("Please enter the name of the first file: ")
store = open(file)
new = store.read()
# Second file
file2 = raw_input("Please enter the name of the second file: ")
store2 = open(file2)
new2 = store2.read()
for line in new.split():
    if line in new2:
        print line

Here is a snippet which might help you:
new = 'this is a bunch of words'
new2 = 'this is another bunch of words'
unique_words = set(new.split())
unique_words.update(new2.split())
sorted_unique_words = sorted(unique_words)  # sorted() accepts any iterable; list() is redundant
print('\n'.join(sorted_unique_words))
Update:
If you're only interested in words that are common to both files, do this instead:
unique_words = set(new.split())
unique_words2 = set(new2.split())
common_words = set.intersection(unique_words, unique_words2)
print('\n'.join(sorted(common_words)))
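If you want a single end-to-end version, here is a minimal sketch combining the file handling from the question with the intersection above (assuming Python 3, so input() instead of raw_input()):
# Minimal sketch, assuming Python 3 and that both files exist.
file1 = input("Please enter the name of the first file: ")
file2 = input("Please enter the name of the second file: ")

with open(file1) as f1, open(file2) as f2:
    words1 = set(f1.read().split())
    words2 = set(f2.read().split())

for word in sorted(words1 & words2):  # & is set intersection
    print(word)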

Related

How to sort a text file alphabetically with Python?

I have a text file that contains English words translated into Arabic, which I want to arrange alphabetically.
Text file:
entrenched = ترسخ
hypotenuse =وتر
conquered = التغلب
tempted = يغري
intentional = متعمد
PS: some entries contain more than one word, like this:
indistinguishable = لا يمكن تمييزه
spot the subtle = بقعة الخفية
I want to check just the first characters of every line and do the sort. I tried, but it didn't work:
def sorting(filename):
    infile = open(filename, encoding="utf8")
    words = []
    for line in infile:
        temp = line.split()
        for i in temp:
            words.append(i)
    infile.close()
    words.sort()
    outfile = open("result.txt", "w", encoding="utf8")
    for i in words:
        outfile.writelines(i)
        outfile.writelines("\n")
    outfile.close()

sorting("words.txt")
Had to figure out your code indenting :-(
Your code was splitting each line and adding all the words to words - I thought you wanted to sort by the first word of each line, so my code below sorts the lines using the key= parameter of sorted(), with a function getfirstword that works out the first word of the line. Often you'd use a lambda function for this, but it's much easier to debug a non-lambda function.
I put the strings into a variable and split the variable, just so it's a minimal reproducible example in one file.
datatext = """entrenched = ترسخ
hypotenuse =وتر
conquered = التغلب
tempted = يغري
intentional = متعمد
"""
def getfirstword(s):
'''
given a string, split off the first word
'''
words = s.split(maxsplit=1)
if len(words)>0:
return words[0]
return ""
# takes a complete text - splits it into lines
def sorting(text):
lines = text.splitlines()
sortedlines = sorted(lines,key=getfirstword)
for line in sortedlines:
print( line )
# could use:
# datatext = open( "words.csv").read()
sorting(datatext)
Result:
conquered = التغلب
entrenched = ترسخ
hypotenuse =وتر
intentional = متعمد
tempted = يغري
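For reference, the lambda equivalent of getfirstword (more compact, but harder to step through in a debugger) would be something like:
# same key as getfirstword, as a lambda; falls back to "" for blank lines
sortedlines = sorted(lines, key=lambda s: (s.split(maxsplit=1) or [""])[0])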

How do I count unique names?

I am trying to count up the unique names that start with "From:" in a file. However, I keep getting a long list of numbers. What is my code actually reading and how do I fix this?
count = 0
name = []
fname = input("What is your file name? Enter it here: ")
try:
    fname = open(fname)
    name = set(f.readlines())
except:
    print("That file does not exist.")
for name in fname:
    if name.startswith("From:"):
        count = len(name)
        print(count)
We can make use of set to hold all required names and find its length to get the count:
file_name = input("What is your file name? Enter it here: ")
s = set()
with open(file_name) as f:
    for name in f:
        if name.startswith('From:'):
            s.add(name)
print(len(s))
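Equivalently, the loop can be collapsed into a single set comprehension; this is just a compact variant of the same idea:
with open(file_name) as f:
    count = len({name for name in f if name.startswith('From:')})
print(count)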
Try this:
words = []
count = 0
with open("unique.txt", "r") as f:
    # Get a list of lines in the file and convert it into a set
    words = set(f.readlines())

FromWords = []
for word in words:
    if word.startswith("From:"):
        FromWords.append(word)

print(len(FromWords))
First, we filter out all duplicate lines, and then look for the ones which start with From:; this may speed things up if you're dealing with a large amount of data.
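As a sketch, the same dedupe-then-filter idea can be written with comprehensions (stripping trailing newlines is an extra assumption here, so the same name with and without a final newline isn't counted twice):
with open("unique.txt") as f:
    unique_lines = {line.strip() for line in f}  # dedupe first
from_words = [line for line in unique_lines if line.startswith("From:")]
print(len(from_words))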
Let me know if you need any help in this regard.

Issue with reading a text file to a dictionary in python

Hey everyone, I just have an issue with a text file and putting it into a dictionary.
My code first starts off by gathering data from a website and writing it to a text file. From there I reopen the file and transfer the data from the text into a dictionary. In the while loop, I am getting the error:
key,value = line.split()
ValueError: too many values to unpack (expected 2)
I'm not sure why; maybe I'm using the wrong method to move the text file data into countryName.
Then, once that runs, I want to be able to ask the user to input a country name and have it print the per capita income of that country, as shown in the print line.
def main():
    import requests
    webFile = "https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2004.txt"
    data = requests.get(webFile)  # connects to the file and gets a response object
    with open("capital.txt", 'wb') as f:
        f.write(data.content)  # write the data out to a file - wb used since the content from the response object is returned as a binary object.
    f.close()

    infile = open('capital.txt', 'r')
    line = infile.readline()
    countryName = {}
    while line != "":
        key, value = line.split()
        countryName[key] = value
        line = infile.readline()
    infile.close()

    userInput = input("Enter a country name: ")
    for i in countryName:
        while(userInput != 'stop'):
            print("The per capita income in", countryName[key], "is", countryName[value])
            userInput = input("Enter a country name: ")

main()
Each line also has a number at the beginning, and some country names have spaces in them, causing split() to return longer lists. If you use a regex to add semicolons as delimiters, and trim leading and trailing whitespace, the splitting works properly. This code would go inside the first while loop (with import re at the top of the file):
line = re.sub(r"(\$)", r";\1", line) # add semicolon before $ sign
line = re.sub(r'^([0-9]+)',r'\1;', line) # add semicolon after first group of numbers
num, key, value = re.split(r';', line) # split with semicolons as delimiters
countryName[key.strip()] = value.strip() # assign key and values after stripping whitespace
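To illustrate what those substitutions do, here is a rough walk-through on a hypothetical sample line (the real layout of rawdata_2004.txt may differ slightly):
import re

line = "1  Liechtenstein  $139,100"        # hypothetical sample line
line = re.sub(r"(\$)", r";\1", line)       # -> "1  Liechtenstein  ;$139,100"
line = re.sub(r'^([0-9]+)', r'\1;', line)  # -> "1;  Liechtenstein  ;$139,100"
num, key, value = re.split(r';', line)
print(key.strip(), value.strip())          # -> Liechtenstein $139,100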
split() returns a list, not a dictionary.
a = 'a b c'
parts = a.split()  # -> ['a', 'b', 'c']
Are you trying to do something like:
import requests

webFile = "https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2004.txt"
data = requests.get(webFile).text  # connects to the file and gets a response object
print(data)

while(1):
    name = input('Enter a country name: ')
    for a in data.splitlines():
        if name.lower() in a.lower():
            print(a.split()[-1])

How to save data from text file to python dictionary and select designation data

Example of data in txt file:
apple
orange
banana
lemon
pears
Code that filters the words with 5 letters, without using a dictionary:
def numberofletters(n):
    file = open("words.txt", "r")
    lines = file.readlines()
    file.close()
    for line in lines:
        if len(line) == 6:
            print(line)
    return

print("===================================================================")
print("This program can be used to identify and print out all words with 5 letters from words.txt")
n = input("Please press Enter to start filtering words")
print("===================================================================")
numberofletters(n)
My question is: how do I create a dictionary whose keys are integers and whose values are the English words with that many letters, and then use the dictionary to identify and print out all the 5-letter words? Imagine this with a huge list of words.
Sounds like a job for a defaultdict.
>>> from collections import defaultdict
>>> length2words = defaultdict(set)
>>>
>>> with open('file.txt') as f:
...     for word in f:  # one word per line
...         word = word.strip()
...         length2words[len(word)].add(word)
...
>>> length2words[5]
set(['lemon', 'apple', 'pears'])
If you care about duplicates and insertion order, use a defaultdict(list) and append instead of add.
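That variant would look something like this sketch:
from collections import defaultdict

length2words = defaultdict(list)
with open('file.txt') as f:
    for word in f:
        word = word.strip()
        length2words[len(word)].append(word)  # keeps duplicates and file order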
You can make your for loop like this:
for line in lines:
    line_len = len(line)
    if line_len not in dicword.keys():
        dicword.update({line_len: [line]})
    else:
        dicword[line_len].append(line)
Then you can get it by just doing dicword[5]
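If a length might be missing from the dictionary, dicword.get(5, []) avoids a KeyError; for example:
for word in dicword.get(5, []):  # empty list when there are no 5-letter words
    print(word)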
If I understood correctly, you need to filter your document and write the result into a file. For that you can write a CSV file with DictWriter (https://docs.python.org/2/library/csv.html).
DictWriter: Create an object which operates like a regular writer but maps dictionaries onto output rows.
BTW, this way you will also be able to store and structure your document:
import csv

def numberofletters(n):
    file = open("words.txt", "r")
    lines = file.readlines()
    file.close()
    dicword = {}
    # DictWriter needs an open file object, not a filename
    with open("result.csv", "w") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for line in lines:
            if len(line) == 6:
                writer.writerow({'param_label': line, [...]})
    return
I hope that helps you.

In Python: Search for two words in multiple texts using raw_input

I want to open multiple text files in a directory (path given via raw_input), give them a name (Document 1, Document 2, ...), search via raw_input for one or two words combined with "OR", and then put the search word (without the characters ".,/", only the words in lower case) and the names of the files containing the word/words into a list or a new text file:
I've tried to put the files into a dictionary, but I don't really know if that is a stupid idea?
I don't know how to let the user search for one or two words (via raw_input) in all the files at the same time. Can you help me or give me a hint?
I want it to print something like:
SearchWord "found in " Document 1, python.txt
SearchWord "found in " Document 3, foobar.txt
import re, os

path = raw_input("insert path to directory :")
ex_library = os.listdir(path)

search_words = open("sword.txt", "w")  # File or maybe list to put in the results
thelist = []
for texts in ex_library:
    file = os.path.join(path, texts)
    text = open(file, "r")
    textname = os.path.basename(texts)
    print textname
    for names in textname.split():
        thelist.append(names)
    text.close()

print thelist
print "These texts are now open"
print "###########################"

count = 0
for y in ex_library:
    count = count + 1
print count
print "texts total"

d = {}
for x in range(count):
    d["Document {0}".format(x)] = None  # I would like the values to be the names of the texts (text1.txt)
