text file to dictionary as keys - python

How can I convert a file's words to dictionary keys directly, without using a list? Here is my code:
def abc(f):
    with open(f, "r") as f:
        dict = {}
        lst = []
        for line in f:
            lst += line.strip().split()
        for item in lst:
            dict[item] = ""
        return dict

print(abc("file.txt"))
Example of input "file.txt":
abc def
ghi jkl mno
pqr
Output:
{"abc":"", "def":"", "ghi":"", "jkl":"", "mno":"", "pqr":""}
The output of split() is a list, so currently I read the data from the file, store it in a list, and then use the list items as the dictionary's keys. My question is: how can I skip the list and put the words directly into the dictionary after reading them from the file?

If you really want a dictionary, build it in one go using a dict comprehension over the split words:
def abc(f):
    return {word: "" for line in f for word in line.split()}
But you probably want a set instead, since your dict has no values:
def abc(f):
    return {word for line in f for word in line.split()}
Or perhaps you want to count the words; in that case:
import collections

def abc(f):
    return collections.Counter(word for line in f for word in line.split())
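Note that these variants iterate over an already-open file object, while the question's abc takes a filename; a minimal usage sketch under that assumption:
with open("file.txt") as fh:   # open the file, then hand the file object to abc
    print(abc(fh))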
Note that split doesn't split on punctuation, so if the text contains any, you'll get duplicate words unless you replace
for word in line.split()
by
for word in re.split(r"\W", line) if word
(using the re module, which has the slight drawback of generating empty fields at the start/end of a line, easily fixed by the trailing if word filter).
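Putting that together, a sketch of the set variant with the re-based split (the pattern and filter are exactly as described above):
import re

def abc(f):
    # re.split(r"\W", line) may yield empty strings at the start/end
    # of a line; "if word" filters them out
    return {word for line in f for word in re.split(r"\W", line) if word}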

I don't know what you're trying to do nor what the expected output really is, but the following is functionally equivalent to your code snippet (and a bit cleaner, too):
def abc(f):
    dic = {}
    with open(f, "r") as f:
        for line in f:
            # split() already removes leading and
            # trailing whitespace, so you don't
            # need strip() here
            for item in line.split():
                dic[item] = ""
    return dic
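As an aside not in the original answers: if a dict with empty-string values is really the goal, dict.fromkeys builds one directly from any iterable of words (a sketch, following the question's filename-argument convention):
def abc(f):
    with open(f, "r") as fh:
        # dict.fromkeys assigns the same value ("") to every key
        return dict.fromkeys((word for line in fh for word in line.split()), "")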


File Names Chain in python

I CANNOT USE ANY IMPORTED LIBRARY. I have this task where I have some directories containing some files; every file contains, besides some words, the name of the next file to be opened in its first line. Once every word of every file contained in a directory has been read, the words have to be processed in a way that returns a single string; that string contains in its first position the most frequent first letter of all the words seen, in its second position the most frequent second letter, and so on. I have managed to do this with a directory containing 3 files, but it isn't using any chain-like mechanism, rather a passing of local variables. Some of my college colleagues suggested I use slicing of lists, but I can't figure out how. I CANNOT USE ANY IMPORTED LIBRARY.
This is what I got:
'''
The objective of the homework assignment is to design and implement a function
that reads some strings contained in a series of files and generates a new
string from all the strings read.
The strings to be read are contained in several files, linked together to
form a closed chain. The first string in each file is the name of another
file that belongs to the chain: starting from any file and following the
chain, you always return to the starting file.
Example: the first line of file "A.txt" is "B.txt," the first line of file
"B.txt" is "C.txt," and the first line of "C.txt" is "A.txt," forming the
chain "A.txt"-"B.txt"-"C.txt".
In addition to the string with the name of the next file, each file also
contains other strings separated by spaces, tabs, or carriage return
characters. The function must read all the strings in the files in the chain
and construct the string obtained by concatenating the characters with the
highest frequency in each position. That is, in the string to be constructed,
at position p, there will be the character with the highest frequency at
position p of each string read from the files. In the case where there are
multiple characters with the same frequency, consider the alphabetical order.
The generated string has a length equal to the maximum length of the strings
read from the files.
Therefore, you must write a function that takes as input a string "filename"
representing the name of a file and returns a string.
The function must construct the string according to the directions outlined
above and return the constructed string.
Example: if the contents of the three files A.txt, B.txt, and C.txt in the
directory test01 are as follows
test01/A.txt    test01/B.txt    test01/C.txt
--------------------------------------------
test01/B.txt    test01/C.txt    test01/A.txt
house           home            kite
garden          park            hello
kitchen         affair          portrait
balloon         angel
surfing
the function most_frequent_chars ("test01/A.txt") will return "hareennt".
'''
def file_names_list(filename):
    intermezzo = []
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()
    for line in lines:
        intermezzo.extend(line.split())
    del intermezzo[1:]
    lista_file.append(intermezzo[0])
    intermezzo.pop(0)
    return lista_file

def words_list(filename):
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()[1:]
    for line in lines:
        lista_file.extend(line.split())
    return lista_file

def stuff_list(filename):
    file_list = file_names_list(filename)
    the_rest = words_list(filename)
    second_file_name = file_names_list(file_list[0])
    the_lists = words_list(file_list[0]) and \
        words_list(second_file_name[0])
    the_rest += the_lists[0:]
    return the_rest

def most_frequent_chars(filename):
    huge_words_list = stuff_list(filename)
    maxOccurs = ""
    list_of_chars = []
    for i in range(len(max(huge_words_list, key=len))):
        for item in huge_words_list:
            try:
                list_of_chars.append(item[i])
            except IndexError:
                pass
        maxOccurs += max(sorted(set(list_of_chars)), key=list_of_chars.count)
        list_of_chars.clear()
    return maxOccurs

print(most_frequent_chars("test01/A.txt"))
This assignment is relatively easy, if the code has a good structure. Here is a full implementation:
def read_file(fname):
    with open(fname, 'r') as f:
        return list(filter(None, [y.rstrip(' \n').lstrip(' ') for x in f for y in x.split()]))

def read_chain(fname):
    seen = set()
    new = fname
    result = []
    while new not in seen:
        A = read_file(new)
        seen.add(new)
        new, words = A[0], A[1:]
        result.extend(words)
    return result

def most_frequent_chars(fname):
    all_words = read_chain(fname)
    result = []
    for i in range(max(map(len, all_words))):
        chars = [word[i] for word in all_words if i < len(word)]
        result.append(max(sorted(set(chars)), key=chars.count))
    return ''.join(result)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
In the code above, we define 3 functions:
read_file: a simple function that reads the contents of a file and returns a list of strings. x.split() takes care of any spaces or tabs used to separate words, and the final list(filter(None, arr)) removes empty strings from the result.
read_chain: a simple routine that iterates through the chain of files and returns all the words contained in them.
most_frequent_chars: for each position, collects the characters of every word long enough to reach it and appends the most frequent one, breaking ties alphabetically via sorted.
PS. This line of code you had is very interesting:
maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)
I edited my code to include it.
Space complexity optimization
The space complexity of the previous code can be improved by orders of magnitude, if the files are scanned without storing all the words:
def scan_file(fname, database):
    with open(fname, 'r') as f:
        next_file = None
        for x in f:
            for y in x.split():
                if next_file is None:
                    next_file = y
                else:
                    for i, c in enumerate(y):
                        while len(database) <= i:
                            database.append({})
                        if c in database[i]:
                            database[i][c] += 1
                        else:
                            database[i][c] = 1
    return next_file

def most_frequent_chars(fname):
    database = []
    seen = set()
    new = fname
    while new not in seen:
        seen.add(new)
        new = scan_file(new, database)
    return ''.join(max(sorted(d.keys()), key=d.get) for d in database)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
Now we scan the files, tracking the frequency of the characters in database without storing intermediate word lists.
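To make the layout of database concrete, a hypothetical two-word illustration (not part of the original answer):
# one dict of character counts per string position
database = []
for word in ["hi", "ha"]:
    for i, c in enumerate(word):
        while len(database) <= i:
            database.append({})
        database[i][c] = database[i].get(c, 0) + 1
print(database)   # [{'h': 2}, {'i': 1, 'a': 1}]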
Ok, here's my solution:
def parsi_file(filename):
    visited_files = set()
    words_list = []

    # Getting words from all files
    while filename not in visited_files:
        visited_files.add(filename)
        with open(filename) as f:
            filename = f.readline().strip()
            words_list += [line.strip() for line in f.readlines()]

    # Creating dictionaries of letter:count for each index
    letters_dicts = []
    for word in words_list:
        for i in range(len(word)):
            if i > len(letters_dicts) - 1:
                letters_dicts.append({})
            letter = word[i]
            if letters_dicts[i].get(letter):
                letters_dicts[i][letter] += 1
            else:
                letters_dicts[i][letter] = 1

    # Sorting dicts and getting the "best" letter
    code = ""
    for dic in letters_dicts:
        sorted_letters = sorted(dic, key=lambda letter: (-dic[letter], letter))
        code += sorted_letters[0]
    return code
We first get the words_list from all files.
Then, for each index, we create a dictionary of the letters in all words at that index, with their count.
Now we sort the dictionary keys by descending count (-count) then by alphabetical order.
Finally we take the first letter (the one with the max count) and append it to the "code" word.
Edit: in terms of efficiency, parsing through all words for each index will get worse as the number of words grows, so it would be better to tweak the code to simultaneously create the dictionaries for each index and parse through the list of words only once. Done.

Index error iterating over list in python

So I have this file that contains 2 words each line. It looks like this.
[/lang:F </lang:foreign>
[lipmack] [lipsmack]
[Fang:foreign] <lang:foreign>
The first word on each line is the incorrectly formatted one and the second is the correctly formatted one. I am trying to put them in a dictionary. Below is my code.
textFile = open("hinditxt1.txt", "r+")
textFile = textFile.readlines()
flist = []
for word in textFile:
    flist.append(word.split())

fdict = dict()
for num in range(len(flist)):
    fdict[flist[num][0]] = flist[num][1]
First I split each line, then I try to put the pairs in a dictionary. But for some reason I get "IndexError: list index out of range" when trying to put them in a dictionary. What can I do to fix it? Thanks!
It is better in Python to iterate over the items of a list rather than over a range of indices. My guess is that the IndexError comes from a line in the input file that is blank or does not contain any spaces.
with open("input.txt", 'r') as f:
    flist = [line.split() for line in f]

fdict = {}
for k, v in flist:
    fdict[k] = v

print(fdict)
The code above avoids needing to access elements of the list using an index by simply iterating over the items of the list itself. We can further simplify this by using a dict comprehension:
with open("input.txt", 'r') as f:
    flist = [line.split() for line in f]

fdict = {k: v for k, v in flist}
print(fdict)
With dictionaries it is typical to use the .update() method to add new key-value pairs. It would look more like:
for num in range(len(flist)):
    fdict.update({flist[num][0]: flist[num][1]})
A full example without file reading would look like:
in_words = ["[/lang:F </lang:foreign>",
            "[lipmack] [lipsmack]",
            "[Fang:foreign] <lang:foreign>"]

flist = []
for word in in_words:
    flist.append(word.split())

fdict = dict()
for num in range(len(flist)):
    fdict.update({flist[num][0]: flist[num][1]})

print(fdict)
Yielding:
{'[lipmack]': '[lipsmack]', '[Fang:foreign]': '<lang:foreign>', '[/lang:F': '</lang:foreign>'}
Although your output may vary, since dictionaries do not maintain order (at least in CPython before 3.7; modern dicts preserve insertion order).
As @Alex points out, the IndexError is likely caused by improperly formatted data (i.e. a line with only 1 or 0 items on it). I suspect the most likely cause would be a \n at the end of your file that leaves the last line(s) blank.
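A defensive sketch that skips such malformed lines (the two-tokens-per-line assumption is mine):
fdict = {}
with open("input.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:      # ignore blank or malformed lines
            key, value = parts
            fdict[key] = value
print(fdict)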

removing duplicates from a list of strings

I am trying to read a file, make a list of words and then make a new list of words removing the duplicates.
I am not able to append the words to the new list; it says: 'NoneType' object has no attribute 'append'.
Here is the bit of code:
fh = open("gdgf.txt")
lst = list()
file = fh.read()
for line in fh:
    line = line.rstrip()

file = file.split()
for word in file:
    if word in lst:
        continue
    lst = lst.append(word)
print lst
Python's append returns None, so a set will help here to remove duplicates:
In [102]: mylist = ["aa","bb","cc","aa"]
In [103]: list(set(mylist))
Out[103]: ['aa', 'cc', 'bb']
Hope this helps
In your case, after
file = fh.read()
fh is exhausted (the read pointer sits at the end of the file), so iterating over it yields nothing. You have to do the operations with the variable file.
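A quick way to see this (assuming gdgf.txt exists):
fh = open("gdgf.txt")
file = fh.read()      # reads everything; the file pointer is now at the end
print(list(fh))       # [] -- iterating the exhausted handle yields nothing
fh.seek(0)            # rewind if you really need to read it again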
You are replacing your list with the return value of the append function, which is not a list. Simply do this instead:
lst.append(word)
append modifies the list it was called on, and returns None. I.e., you should replace the line:
lst=lst.append(word)
with simply
lst.append(word)
fh = open("gdgf.txt")
file = fh.read()
for line in fh:
    line = line.rstrip()

lst = []
file = file.split()
for word in file:
    lst.append(word)
print(set(lst))
append appends an item in place, which means it does not return any value. You should get rid of the lst = when appending word:
if word in lst:
    continue
lst.append(word)
list.append() is an in-place append; it returns None (as it does not return anything), so you do not need to assign its return value back to the list. Just change the line lst = lst.append(word) to -
lst.append(word)
Another issue: you are first calling .read() on the file and then iterating over its lines; you do not need to do that. Just remove the iteration part.
Also, an easy way to remove duplicates, if you are not interested in the order of the elements, is to use set.
Example -
>>> lst = [1,2,3,4,1,1,2,3]
>>> set(lst)
{1, 2, 3, 4}
So, in your case you can initialize lst as lst = set(), and then use lst.add(element); you would not even need to check whether the element already exists. At the end, if you really want the result as a list, do list(lst) to convert it. (Though when doing this, you may want to rename the variable to something that makes it clear it's a set, not a list.)
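A minimal sketch of that set-based version (variable names are mine):
unique_words = set()
with open("gdgf.txt") as fh:
    for line in fh:
        for word in line.split():
            unique_words.add(word)    # no membership check needed
word_list = list(unique_words)        # convert back to a list if you need one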
append() does not return anything, so don't assign it. lst.append() is enough.
Modified Code:
fh = open("gdgf.txt")
lst = []
file = fh.read()
for line in fh:
    line = line.rstrip()

file = file.split()
for word in file:
    if word in lst:
        continue
    lst.append(word)
print lst
I suggest you use set(), because it is made for unordered collections of unique elements.
fh = open("gdgf.txt")
file = fh.read()
for line in fh:
    line = line.rstrip()

file = file.split()
# build the set from the split words, not from an empty list
lst = list(set(file))
print lst
You can simplify your code by reading and adding the words directly to a set. Sets do not allow duplicates, so you'll be left with just the unique words:
words = set()
with open('gdgf.txt') as f:
    for line in f:
        for word in line.split():
            words.add(word.strip())
print(words)
The problem with the logic above is that words ending in punctuation will be counted as separate words:
>>> s = "Hello? Hello should only be twice in the list"
>>> set(s.split())
set(['be', 'twice', 'list', 'should', 'Hello?', 'only', 'in', 'the', 'Hello'])
You can see you have Hello? and Hello.
You can enhance the code above by using a regular expression to extract words, which will take care of the punctuation:
>>> set(re.findall(r"(\w[\w']*\w|\w)", s))
set(['be', 'list', 'should', 'twice', 'only', 'in', 'the', 'Hello'])
Now your code is:
import re

with open('gdgf.txt') as f:
    words = set(re.findall(r"(\w[\w']*\w|\w)", f.read(), re.M))
print(words)
Even with the above you'll have duplicates, as "Word" and "word" will be counted as different words. You can enhance it further if you want to store a single version of each word.
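For example, a sketch that lowercases before adding, assuming case-insensitive matching is acceptable:
import re

with open('gdgf.txt') as f:
    # lowercasing collapses "Word" and "word" into a single entry
    words = set(w.lower() for w in re.findall(r"(\w[\w']*\w|\w)", f.read(), re.M))
print(words)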
I think the solution to this problem can be more succinct:
import string

with open("gdgf.txt") as fh:
    word_set = set()
    for line in fh:
        line = line.split()
        for word in line:
            # For each character in string.punctuation, iterate and remove
            # it from the word by replacing it with '', an empty string
            for char in string.punctuation:
                word = word.replace(char, '')
            # Add the word to the set
            word_set.add(word)

word_list = list(word_set)
# Sort the list to be fastidious.
word_list.sort()
print(word_list)
One thing about counting words with "split" is that you are splitting on whitespace, so this will make "words" out of things like "Hello!" and "Really?" The words will include punctuation, which is probably not what you want.
Your variable names could be a bit more descriptive, and your indentation seems a bit off, but I think that may be a matter of cutting/pasting into the posting. I have tried to name the variables I used after the logical structure each one interacts with (file, line, word, char, and so on).
To see the contents of string.punctuation, you can launch IPython, import string, then simply enter string.punctuation to see what it contains.
It is also unclear whether you need a list, or just a data structure that contains a unique collection of words. A set, or a list built carefully to avoid duplicates, should do the trick. Following the question, I used a set to store elements uniquely, converted that set to a list trivially, and then sorted it alphabetically.
Hope this helps!

How do I see if a value matches another value in a text file in Python?

Here's what I have so far.
from itertools import permutations

original = str(input('What word would you like to unscramble?: '))
for bob in permutations(original):
    print(''.join(bob))

inputFile = open('dic.txt', 'r')
compare = inputFile.read()
inputFile.close()
Basically, what I'm trying to do is create a word unscrambler by having Python find all possible rearrangements of a string and then only print the rearrangements that are actual words, which can be found out by running each rearrangement through a dictionary file (in this case dic.txt) to see if there is a match. I am running Python 3.3, if that matters. What do I need to add in order to compare the rearrangements with the dictionary file?
You could store the permutations in a list, put the dictionary entries in another list, and select those that appear in both lists…
For example this way:
from itertools import permutations

original = str(input('What word would you like to unscramble?: '))
perms = []
for bob in permutations(original):
    perms.append(''.join(bob))

inputFile = open('dic.txt', 'r')
dict_entries = inputFile.read().split('\n')
inputFile.close()

for word in [perm for perm in perms if perm in dict_entries]:
    print(word)
(Assuming the dictionary contains one word per line…)
Read the dictionary file into a list line by line, iterate through each of the rearrangements and check if it's in the dictionary like so:
if word in dict_list:
    ...
Although this puts a little more up-front effort into processing the input file, once you've built the word_dict it's much more efficient to look up the sorted form of the word rather than build and check for all permutations:
def get_word_dict(filename):
    words = {}
    with open(filename) as word_dict:
        for line in word_dict:
            word = line.strip()
            # join the sorted letters into a string so the key is hashable
            key = ''.join(sorted(word))
            if key not in words:
                words[key] = []
            words[key].append(word)
    return words

word_dict = get_word_dict('dic.txt')

original = input("What word would you like to unscramble: ")
key = ''.join(sorted(original))
if key in word_dict:
    for word in word_dict[key]:
        print(word)
else:
    print("Not in the dictionary.")
This will be particularly beneficial if you want to look up more than one word - you can process the file once, then repeatedly refer to the word_dict.
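For instance, a sketch that builds word_dict once and then answers repeated queries:
word_dict = get_word_dict('dic.txt')   # one-time processing of the file
while True:
    original = input("Word to unscramble (blank to quit): ")
    if not original:
        break
    key = ''.join(sorted(original))
    for word in word_dict.get(key, ["Not in the dictionary."]):
        print(word)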

compare two files and find matching words in python

I have two files. The first one includes terms and their frequencies:
table 2
apple 4
pencil 89
The second file is a dictionary:
abroad
apple
bread
...
I want to check whether the first file contains any words from the second file. For example, both the first file and the second file contain "apple".
I am new to Python.
I tried something but it does not work. Could you help me? Thank you.
for line in dictionary:
    words = line.split()
    print words[0]

for line2 in test:
    words2 = line2.split()
    print words2[0]
Something like this:
with open("file1") as f1, open("file2") as f2:
    # create a set of words from the dictionary file
    # why sets? sets provide an O(1) lookup, so the overall complexity is O(N)
    words = set(line.strip() for line in f1)
    # now loop over each line of the other file (the word, freq file)
    for line in f2:
        word, freq = line.split()   # fetch word, freq
        if word in words:           # if the word is found in the words set, print it
            print word
output:
apple
This may help you:
file1 = set(line.strip() for line in open('file1.txt'))
file2 = set(line.strip() for line in open('file2.txt'))

for line in file1 & file2:
    if line:
        print line
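Note that the lines of the first file also carry a frequency, so intersecting whole lines will not actually match "apple 4" against "apple"; a closer sketch keeps only the first token of each line:
file1 = set(line.split()[0] for line in open('file1.txt') if line.strip())
file2 = set(line.strip() for line in open('file2.txt'))

for word in file1 & file2:
    print word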
Here's what you should do:
First, you need to put all the dictionary words in some place where you can easily look them up. If you don't do that, you'd have to read the whole dictionary file every time you want to check one single word in the other file.
Second, you need to check if each word in the file is in the words you extracted from the dictionary file.
For the first part, you need to use either a list or a set. The difference between these two is that list keeps the order you put the items in it. A set is unordered, so it doesn't matter which word you read first from the dictionary file. Also, a set is faster when you look up an item, because that's what it is for.
To see if an item is in a set, you can do item in my_set, which evaluates to either True or False.
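For instance:
>>> my_set = {"abroad", "apple", "bread"}
>>> "apple" in my_set
True
>>> "pencil" in my_set
False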
I have your first two-column list in try.txt and the single-word list in try_match.txt:
f = open('try.txt', 'r')
f_match = open('try_match.txt', 'r')

dictionary = []
for line in f:
    a, b = line.split()
    dictionary.append(a)

for line in f_match:
    if line.split()[0] in dictionary:
        print line.split()[0]
