I want to find all the "phrases" in a list and remove them from the list, so that I have only words (without spaces) left. I'm making a hangman-type game and want the computer to choose a random word. I'm new to Python and coding, so I'm happy to hear other suggestions for my code as well.
import random

fhand = open('common_words.txt')
words = []
for line in fhand:
    line = line.strip()
    words.append(line)

for word in words:
    if ' ' in word:
        words.remove(word)
print(words)
Sets are more efficient than lists for membership tests, and when built lazily like this you can gain a significant performance boost.
# Load all words ({} would create a dict, so use set())
words = set()
with open('common_words.txt') as file:
    for line in file:
        line = line.strip()
        if " " not in line:
            words.add(line)
# Can be converted to a one-liner using the magic of Python
words = set(filter(lambda x: " " not in x, map(str.strip, open('common_words.txt'))))
# Get a random word (random.choice needs a sequence, so convert the set first)
import random
print(random.choice(list(words)))
Use str.split(). It separates by both spaces and newlines by default.
>>> 'some words\nsome more'.split()
['some', 'words', 'some', 'more']
>>> 'this is a sentence.'.split()
['this', 'is', 'a', 'sentence.']
>>> 'dfsonf 43 SDFd fe#2'.split()
['dfsonf', '43', 'SDFd', 'fe#2']
Read the file normally and make a list this way:
with open('filename.txt', 'r') as file:
    words = file.read().split()
That should be good.
with open('common_words.txt', 'r') as f:
    words = [word for word in filter(lambda x: len(x) > 0 and ' ' not in x, map(str.strip, f))]
with is used because file objects are context managers. The strange list-like syntax is a list comprehension, so it builds a list from the expression inside the brackets. map is a function which takes in an iterable, applying a provided function to each item in the iterable, placing each transformed result into a new list*. filter is a function which takes in an iterable, testing each item against the provided predicate, placing each item which evaluates to True into a new list*. lambda is used to define a function (with a specific signature) in-line.
*: The actual return types are lazy iterators, not lists, so they can still be used with for loops.
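A quick illustration of that laziness, using made-up data:

```python
# map and filter return lazy iterator objects, not lists
m = map(str.strip, ["  a  ", " b "])
print(list(m))  # ['a', 'b']
print(list(m))  # []  -- the iterator is exhausted after one pass

f = filter(lambda x: ' ' not in x, ['word', 'two words'])
print(list(f))  # ['word']
```

This is why the one-liners above wrap the result in set() or list() before using it more than once.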
I am not sure if I understand you correctly, but I guess the split() method is what you are looking for, e.g.:
with open('common_words.txt') as f:
    words_nested = [line.split() for line in f]
words = [word for sublist in words_nested for word in sublist]  # flatten the nested list
As mentioned, the .split() method could be a solution.
Also, the NLTK module might be useful for future language processing tasks.
Hope this helps!
I'm working on a text file, but since it has spaces at the beginning too, when I try to delete the \n characters using strip() in a list comprehension, I get a list with empty elements (" ") and I don't know how to delete them.
I have a text and my code is:
with open(filename) as f:
    testo = f.readlines()
[e.strip() for e in testo]
but I get a list like this:
[' ', ' ', 'word1', 'word2', 'word3', ' ']
I wanted to know if I can work it out with the strip method, otherwise with another method.
You can use a generator to read all the lines and strip() the unwanted newlines.
From the generator you only use those elements that are "Truthy" - empty strings are considered False.
Advantage: you create only one list and get rid of empty strings:
Write file:
filename = "t.txt"
with open(filename, "w") as f:
    f.write("""
c
oo
l
te
xt
""")
Process file:
with open(filename) as f:
    testo = [x for x in (line.strip() for line in f) if x]
    # f.readlines() not needed: f is an iterable in its own right
print(testo)  # ['c', 'oo', 'l', 'te', 'xt']
You could do it similarly:
testo = [line.strip() for line in f if line.strip()]
but that would execute strip() twice and would be slightly less efficient.
Output:
['c', 'oo', 'l', 'te', 'xt']
Docs:
strip()
truth value testing
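For reference, a tiny sketch of truth value testing on strings (made-up values):

```python
# empty strings are falsy, non-empty strings are truthy
print(bool(''))      # False
print(bool('word'))  # True

# which is why `if x` drops the stripped-empty lines:
print([x for x in ['', 'c', '', 'oo'] if x])  # ['c', 'oo']
```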
A suggested alternative from Eli Korvigo is:
testo = list(filter(bool, map(str.strip, f)))
which is essentially the same: it replaces the explicit list comprehension over a generator expression with a map of str.strip over f (resulting in a generator) and applies a filter to that to feed it into a list.
See built-in functions for the docs on filter, map and bool.
I like mine better though ;o)
You are getting those empty strings because a few of the lines are just empty line breaks. Here's the code for weeding out these empty strings.
with open(filename) as f:
    testo = [e.strip() for e in f.readlines()]
final_list = list(filter(lambda x: x != '', testo))
print(final_list)
Without lambda and using map:
with open(filename) as f:
    final_list = list(filter(bool, map(str.strip, f)))
print(final_list)
Another solution is:
with open(filename) as f:
    testo = [x for x in f.read().splitlines() if x]
print(testo)
The source for the second solution is:
https://stackoverflow.com/a/15233379/2988776
For performance upgrades refer to @Patrick's answer.
From the data you showed us, it looks like there is a line with just a space in it. With that in mind, you have to decide whether this is something you want or not.
In case you want it, then your code should look something like this:
with open(filename) as f:
    testo = f.readlines()
list(filter(None, (l.rstrip('\n') for l in testo)))
In case you don't want lines with just whitespace characters, you can do something like:
with open(filename) as f:
    testo = f.readlines()
[e.rstrip('\n') for e in testo if e.strip()]
In this case, we avoid stripping " a word with leading and trailing spaces " down to "a word with leading and trailing spaces", since in some cases that might change the semantics of the line :)
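To make the difference concrete, a small sketch with a made-up line:

```python
line = '  padded words  \n'
# rstrip('\n') removes only the newline, preserving the padding
print(repr(line.rstrip('\n')))  # '  padded words  '
# strip() removes all surrounding whitespace
print(repr(line.strip()))       # 'padded words'
```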
I am relatively new to python and have a question:
I am trying to write a script that will read a .txt file and check if words are in a list that I've provided and then return a count as to how many words were in that list.
So far,
import string

# this is just an example of a list
list = ['hi', 'how', 'are', 'you']

filename = "hi.txt"
infile = open(filename, "r")
lines = infile.readlines()
for line in lines:
    words = line.split()
    for word in words:
        word = word.strip(string.punctuation)
I've tried to split the file into lines and then the lines into words without punctuation.
I am not sure where to go after this. I would like ultimately for the output to be something like this:
"your file has x words that are in the list".
Thank you!
You can split your file into words using the following (in Python 3, reduce must be imported from functools):
from functools import reduce
words = reduce(lambda x, y: x + y, [line.split() for line in f])
Then count the number of occurrences of each word in your word list by looping over it and using the count method:
from functools import reduce

w_list = ['hi', 'how', 'are', 'you']
with open("hi.txt", "r") as f:
    words = reduce(lambda x, y: x + y, [line.split() for line in f])
for w in w_list:
    print("your file has {} {}".format(words.count(w), w))
# words to search for;
# (stored as a set so `word in search_for` is O(1))
search_for = set(["hi", "how", "are", "you"])
# get search text
# (no need to split into lines)
with open("hi.txt") as inf:
    text = inf.read().lower()
# create translation table
# - converts non-word chars to spaces (this maintains appropriate word-breaks)
# - keeps apostrophe (for words like "don't" or "couldn't")
import string

# build the table from string.punctuation so both arguments
# of str.maketrans have equal lengths
punct = string.punctuation.replace("'", "")
trans = str.maketrans(punct, " " * len(punct))
# apply translation table and split into words
words = text.translate(trans).split()
# count desired words
word_count = sum(word in search_for for word in words)
# show result
print("your file has {} words that are in the list".format(word_count))
Read the file content using a with statement and open().
Remove punctuation from the file content using the string module.
Split the file content with the split() method and iterate over every word in a for loop.
Check whether each word is present in the input list and increment the count accordingly.
input file: hi.txt
hi, how are you?
hi, how are you?
code:
import string

input_list = ['hi', 'how', 'are', 'you']
filename = "hi.txt"
count = 0
with open(filename) as fp:
    data = fp.read()
    data = data.translate(str.maketrans("", "", string.punctuation))
    for word in data.split():
        if word in input_list:
            count += 1
print("Total number of word present in file from the list are %d" % count)
Output:
vivek@vivek:~/Desktop/stackoverflow$ python 18.py
Total number of word present in file from the list are 8
vivek@vivek:~/Desktop/stackoverflow$
Do not use variable names which are already defined by the Python interpreter, e.g. list in your code.
>>> list
<type 'list'>
>>> a = list([1,2,3])
>>> a
[1, 2, 3]
>>> list = ["hi", "how"]
>>> b = list([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable
>>>
Use the len() function. For example, for a list:
myList = ["c","b","a"]
len(myList)
it will return 3, meaning there are three items in your list.
I am learning python from an introductory Python textbook and I am stuck on the following problem:
You will implement function index() that takes as input the name of a text file and a list of words. For every word in the list, your function will find the lines in the text file where the word occurs and print the corresponding line numbers.
Ex:
>>> index('raven.txt', ['raven', 'mortal', 'dying', 'ghost', 'ghastly', 'evil', 'demon'])
ghost 9
dying 9
demon 122
evil 99, 106
ghastly 82
mortal 30
raven 44, 53, 55, 64, 78, 97, 104, 111, 118, 120
Here is my attempt at the problem:
def index(filename, lst):
    infile = open(filename, 'r')
    lines = infile.readlines()
    lst = []
    dic = {}
    for line in lines:
        words = line.split()
        lst.append(words)
    for i in range(len(lst)):
        for j in range(len(lst[i])):
            if lst[i][j] in lst:
                dic[lst[i][j]] = i
    return dic
When I run the function, I get back an empty dictionary. I do not understand why I am getting an empty dictionary. So what is wrong with my function? Thanks.
You are overwriting the value of lst. You use it as both a parameter to a function (in which case it is a list of strings) and as the list of words in the file (in which case it's a list of list of strings). When you do:
if lst[i][j] in lst
The comparison always returns False because lst[i][j] is a str, but lst contains only lists of strings, not strings themselves. This means that the assignment to the dic is never executed and you get an empty dict as result.
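A minimal demonstration of why the membership test fails, with made-up data:

```python
nested = [['a', 'b'], ['c']]   # a list of lists of strings
print('a' in nested)           # False: 'a' is compared against whole sublists
print(['a', 'b'] in nested)    # True: the elements are the sublists themselves
```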
To avoid this you should use a different name for the list in which you store the words, for example:
In [4]: !echo 'a b c\nd e f' > test.txt
In [5]: def index(filename, lst):
   ...:     infile = open(filename, 'r')
   ...:     lines = infile.readlines()
   ...:     words = []
   ...:     dic = {}
   ...:     for line in lines:
   ...:         line_words = line.split()
   ...:         words.append(line_words)
   ...:     for i in range(len(words)):
   ...:         for j in range(len(words[i])):
   ...:             if words[i][j] in lst:
   ...:                 dic[words[i][j]] = i
   ...:     return dic
   ...:
In [6]: index('test.txt', ['a', 'b', 'c'])
Out[6]: {'a': 0, 'c': 0, 'b': 0}
There are also a lot of things you can change.
When you want to iterate a list you don't have to explicitly use indexes. If you need the index you can use enumerate:
for i, line_words in enumerate(words):
    for word in line_words:
        if word in lst:
            dic[word] = i
You can also iterate directly on a file (refer to Reading and Writing Files section of the python tutorial for a bit more information):
# use the with statement to make sure that the file gets closed
with open('test.txt') as infile:
    for i, line in enumerate(infile):
        print('Line {}: {}'.format(i, line))
In fact I don't see why you would first build that words list of lists at all. Just iterate over the file directly while building the dictionary:
def index(filename, lst):
    with open(filename, 'r') as infile:
        dic = {}
        for i, line in enumerate(infile):
            for word in line.split():
                if word in lst:
                    dic[word] = i
        return dic
Your dic values should be lists, since more than one line can contain the same word. As it stands your dic would only store the last line where a word is found:
from collections import defaultdict
def index(filename, words):
    # turn the list into a frozenset to make the `in` check below O(1)
    words = frozenset(words)
    with open(filename) as infile:
        dic = defaultdict(list)
        for i, line in enumerate(infile):
            for word in line.split():
                if word in words:
                    dic[word].append(i)
        return dic
If you don't want to use the collections.defaultdict you can replace dic = defaultdict(list) with dic = {} and then change the:
dic[word].append(i)
With:
if word not in dic:
    dic[word] = [i]
else:
    dic[word].append(i)
Or, alternatively, you can use dict.setdefault:
dic.setdefault(word, []).append(i)
although this last way is a bit slower than the original code.
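A small sketch of how setdefault behaves, with hypothetical values:

```python
dic = {}
# the first call inserts the default [], later calls return the existing list
dic.setdefault('raven', []).append(44)
dic.setdefault('raven', []).append(53)
print(dic)  # {'raven': [44, 53]}
```

Note that the `[]` default is constructed on every call even when it is discarded, which is part of why this is a bit slower than the explicit check.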
Note that all these solutions share the property that if a word isn't found in the file it will not appear in the result at all. However, you may want it in the result with an empty list as its value. In such a case it's simpler to initialize the dict with empty lists before starting the loop, as in:
dic = {word: [] for word in words}
for i, line in enumerate(infile):
    for word in line.split():
        if word in words:
            dic[word].append(i)
Refer to the documentation about List Comprehensions and Dictionaries to understand the first line.
You can also iterate over words instead of the line, like this:
dic = {word: [] for word in words}
for i, line in enumerate(infile):
    for word in words:
        if word in line.split():
            dic[word].append(i)
Note however that this is going to be slower because:
line.split() returns a list, so word in line.split() has to scan the whole list.
You are recomputing line.split() on every iteration of the inner loop.
You can try to solve these two problems doing:
dic = {word: [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words:
        if word in line_words:
            dic[word].append(i)
Note that here we iterate once over line.split() to build the set and once over words. Depending on the sizes of the two collections this may be slower or faster than the original version (iterating over line.split()).
However at this point it's probably faster to intersect the sets:
words = frozenset(words)  # & requires sets on both sides
dic = {word: [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words & line_words:  # & stands for set intersection
        dic[word].append(i)
Try this,
def index(filename, lst):
    dic = {w: [] for w in lst}
    for n, line in enumerate(open(filename, 'r')):
        for word in lst:
            if word in line.split(' '):
                dic[word].append(n + 1)
    return dic
There are some features of the language introduced here that you should be aware of because they will make life a lot easier in the long run.
The first is a dictionary comprehension. It basically initializes a dictionary using the words in lst as keys and an empty list [] as the value for each key.
Next the enumerate command. This allows us to iterate over the items in a sequence but also gives us the index of those items. In this case, because we passed a file object to enumerate it will loop over the lines. For each iteration, n will be the 0-based index of the line and line will be the line itself. Next we iterate over the words in lst.
Notice that we don't need any indices here. Python encourages looping over the objects in a sequence rather than looping over indices and then accessing the objects by index (for example, it discourages for i in range(len(lst)): do something with lst[i]).
Finally, the in operator is a very straightforward way to test membership for many types of objects and the syntax is very intuitive. In this case, we are asking is the current word from lst in the current line.
Note that we use line.split(' ') to get a list of the words in the line. If we don't do this, 'the' in 'there was a ghost' would return True as the is a substring of one of the words.
On the other hand, 'the' in ['there', 'was', 'a', 'ghost'] would return False. If the conditional returns True, we append the line number to the list associated with the key in our dictionary.
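The substring-versus-whole-word difference in a tiny sketch:

```python
line = 'there was a ghost'
print('the' in line)          # True: substring match inside 'there'
print('the' in line.split())  # False: whole-word match against the list
```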
That might be a lot to chew on, but these concepts make problems like this more straight forward.
First, your function parameter holding the words is named lst, and the list where you put all the words from the file is also named lst, so you are not saving the words passed to your function: on line 4 you're redeclaring the list.
Second, you are iterating over each line in the file (the first for) and getting the words in that line. After that, lst has all the words in the entire file. So in the for i ... you are iterating over all the words read from the file; there's no need for the third for j, where you are iterating over each character in every word.
In summary, that if is asking "is this single character in the list of words?", which it never is, so the dict will never be filled up.
for i in range(len(lst)):
    if words[i] in lst:
        dic[words[i]] = dic[words[i]] + i  # to count repetitions
You need to rethink the problem, even my answer will fail because the word in the dict will not exist giving an error, but you get the point. Good luck!
Say I have a text file formatted like this:
100 20 the birds are flying
and I wanted to read the int(s) into their own list and the strings into their own list... how would I go about this in Python? I tried
data.append(map(int, line.split()))
that didn't work...any help?
Essentially, I'm reading the file line by line, and splitting them. I first check to see if I can turn them into an integer, and if I fail, treat them as strings.
def separate(filename):
    all_integers = []
    all_strings = []
    with open(filename) as myfile:
        for line in myfile:
            # split() rather than split(' ') so newlines don't stick to words
            for item in line.split():
                try:
                    # Try converting the item to an integer
                    value = int(item, 10)
                    all_integers.append(value)
                except ValueError:
                    # if it fails, it's a string
                    all_strings.append(item)
    return all_integers, all_strings
Then, given the file ('mytext.txt')
100 20 the birds are flying
200 3 banana
hello 4
...doing the following on the command line returns...
>>> myints, mystrings = separate('mytext.txt')
>>> myints
[100, 20, 200, 3, 4]
>>> mystrings
['the', 'birds', 'are', 'flying', 'banana', 'hello']
If I understand your question correctly:
import re

def splitList(items):
    ints = []
    words = []
    for item in items:
        if re.match(r'^\d+$', item):
            ints.append(int(item))
        else:
            words.append(item)
    return ints, words

intList, wordList = splitList(line.split())
Will give you two lists: [100, 20] and ['the', 'birds', 'are', 'flying']
Here's a simple solution. Note it might not be as efficient as others for very large files, because it iterates over the words twice for each line.
words = line.split()
intList = [int(x) for x in words if x.isdigit()]
strList = [x for x in words if not x.isdigit()]
pop removes the element from the list and returns it:
words = line.split()
first = int(words.pop(0))
second = int(words.pop(0))
This is of course assuming your format is always int int word word word ....
And then join the rest of the string:
words = ' '.join(words)
And in Python 3 you can even do this:
first, second, *words = line.split()
Which is pretty neat. Although you would still have to convert first and second to int's.
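For example, the conversion could look like this (with a made-up line):

```python
line = "100 20 the birds are flying"
# star unpacking: first two fields, then everything else as a list
first, second, *words = line.split()
first, second = int(first), int(second)
print(first, second)  # 100 20
print(words)          # ['the', 'birds', 'are', 'flying']
```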
I'm running the following code on a list of strings to return a list of its words:
words = [re.split('\\s+', line) for line in lines]
However, I end up getting something like:
[['import', 're', ''], ['', ''], ['def', 'word_count(filename):', ''], ...]
As opposed to the desired:
['import', 're', '', '', '', 'def', 'word_count(filename):', '', ...]
How can I unpack the lists re.split('\\s+', line) produces in the above list comprehension? Naïvely, I tried using * but that doesn't work.
(I'm looking for a simple and Pythonic way of doing; I was tempted to write a function but I'm sure the language accommodates for this issue.)
>>> import re
>>> from itertools import chain
>>> lines = ["hello world", "second line", "third line"]
>>> words = chain(*[re.split(r'\s+', line) for line in lines])
This will give you an iterator that can be used for looping through all words:
>>> for word in words:
... print(word)
...
hello
world
second
line
third
line
Creating a list instead of an iterator is just a matter of wrapping the iterator in a list call:
>>> words = list(chain(*[re.split(r'\s+', line) for line in lines]))
The reason why you get a list of lists is that re.split() returns a list, which is then "appended" to the list comprehension output.
It's unclear why you are using that (or probably just a bad example) but if you can get the full content (all lines) as a string you can just do
words = re.split(r'\s+', lines)
if lines is the product of:
open('filename').readlines()
use
open('filename').read()
instead.
You can always do this:
words = []
for line in lines:
    words.extend(re.split('\\s+', line))
It's not nearly as elegant as a one-liner list comprehension, but it gets the job done.
Just stumbled across this old question, and I think I have a better solution. Normally if you want to nest a list comprehension ("append" each list), you think backwards (un-for-loop-like). This is not what you want:
>>> import re
>>> lines = ["hello world", "second line", "third line"]
>>> [[word for word in re.split(r'\s+', line)] for line in lines]
[['hello', 'world'], ['second', 'line'], ['third', 'line']]
However if you want to "extend" instead of "append" the lists you're generating, just leave out the extra set of square brackets and reverse your for-loops (putting them back in the "right" order).
>>> [word for line in lines for word in re.split(r'\s+', line)]
['hello', 'world', 'second', 'line', 'third', 'line']
This seems like a more Pythonic solution to me since it is based in list-processing logic rather than some random-ass built-in function. Every programmer should know how to do this (especially ones trying to learn Lisp!)