find common elements in the strings python

I'm trying to find the elements common to the strings read from a file. This is what I wrote:
file = open ("words.txt", 'r')
while 1:
    line = file.readlines()
    if len(line) == 0:
        break
    print line
file.close
def com_Letters(*strings):
    return set.intersection(*map(set,strings))
The output is: ['out\n', 'dog\n', 'pingo\n', 'coconut']
I called com_Letters(line), but the result is empty.

There are two problems, but neither one is with com_Letters.
First, this code guarantees that line will always be an empty list:
while 1:
    line = file.readlines()
    if len(line) == 0:
        break
    print line
The first time through the loop, you call readlines(), which will
Read until EOF using readline() and return a list containing the lines thus read.
If the file is empty, that's an empty list, so you'll break.
Otherwise, you'll print out the list, and go back into the loop. At which point readlines() is going to have nothing left to read, since you already read until EOF, so it's guaranteed to be an empty list. Which means you'll break.
Either way, line ends up empty.
It's not clear what you're trying to do with that loop. There's never any good reason to call readlines() repeatedly on the same file. But, even if there were, you'd probably want to accumulate all of the results, rather than just keeping the last (guaranteed-empty) result. Something like this:
line = []  # start with an empty list to accumulate into
while 1:
    new_line = file.readlines()
    if len(new_line) == 0:
        break
    print new_line
    line += new_line
Anyway, if you fix that problem (e.g., by scrapping the whole loop and just using line = file.readlines()), you're calling com_Letters with a single list of strings. That's not particularly useful; it's just a very convoluted way of calling set. If it's not clear why:
Since there's only one argument (a list of strings), *strings ends up as a one-element tuple of that argument.
map(set, strings) on a single-element tuple just calls set on that element and returns a single-element list.
*map(set, strings) explodes that into one argument, the set.
set.intersection(s) is the same thing as s.intersection(), which just returns s itself.
All of this would be easier to see if you broke up some of those complex expressions and printed the intermediate values. Then you'd know exactly where it first goes wrong, instead of just knowing it's somewhere in a long chain of events.
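Here is a minimal sketch of that kind of step-by-step inspection, written for Python 3. The names packed, sets, single, words and common are made up for illustration, and the list literal stands in for what readlines() returned in the question.

```python
def com_letters(*strings):
    return set.intersection(*map(set, strings))

lines = ['out\n', 'dog\n', 'pingo\n', 'coconut']

# Intermediate values when the whole list is passed as ONE argument:
packed = (lines,)                  # what *strings receives
sets = list(map(set, packed))      # a one-element list: [set(lines)]
single = set.intersection(*sets)   # just set(lines) again

# Unpacking the list passes each word as its own argument instead.
words = [line.strip() for line in lines]
common = com_letters(*words)

print(single == set(lines))  # True
print(common)                # {'o'}
```

Stripping the lines first keeps the trailing newline characters out of the comparison.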
A few side notes:
You forgot the () on the file.close, which means you're not actually closing the file. One of the many reasons that with is better is that it means you can't make that mistake.
Use plural names for collections. line sounds like a variable that should have a single line in it, not a variable that should have all of your lines.
The readlines function with no sizehint argument is basically useless. If you're just going to iterate over the lines, you can do that to the file itself. If you really need the lines in a list instead of reading them lazily, list(file) makes your intention clearer, and doesn't mislead you into thinking it might be useful to do repeatedly.
The Pythonic way to check for an empty collection is just if not line:, rather than if len(line) == 0:.
while True is clearer than while 1.
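Put together, those side notes might look like the following Python 3 sketch. The temporary file is only a stand-in for words.txt so the example runs on its own; the names path and lines are assumptions, not part of the original code.

```python
import os
import tempfile

# Throwaway stand-in for words.txt so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("out\ndog\npingo\ncoconut")
    path = tmp.name

with open(path) as f:                     # closed automatically, no close() to forget
    lines = [line.strip() for line in f]  # iterate over the file itself

os.remove(path)

if not lines:                             # Pythonic emptiness check
    print("no words found")
else:
    print(lines)
```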

I suggest modifying the function as follows:
def com_Letters(strings):
    return set.intersection(*map(set,strings))
I think the original function treats the argument strings as a tuple containing a single list of strings (only one argument, the list, is passed in this case), and therefore it does not find the intersection.

Related

How to avoid Nonetype when combining lists in Python

I am very new to Python and am looking for assistance to where I am going wrong with an assignment. I have attempted different ways to approach the problem but keep getting stuck at the same point(s):
Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
Problem 2: When I try and combine the lists I keep receiving "None" for my result or Nonetype errors [which I think means I have added the None's together(?)].
The assignment is:
#8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order. You can download the sample data at http://www.py4e.com/code3/romeo.txt
My current code which is giving me a Nonetype error is:
poem = input("enter file:")
play = open(poem)
lst= list()
for line in play:
    line=line.rstrip()
    word=line.split()
    if not word in lst:
        lst= lst.append(word)
print(lst.sort())
If someone could just talk me through where I am going wrong that will be greatly appreciated!

Your problem was lst= lst.append(word): list.append() changes the list in place and returns None, so after that assignment lst is None.
with open(poem) as f:
    lines = f.read().split('\n')  # you can also use readlines()
lst = []
for line in lines:
    words = line.split()
    for word in words:
        if word:
            lst.append(word)
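For the assignment as stated (skip words already in the list, then sort), one possible sketch follows. The function name unique_sorted_words and the two sample lines are made up for illustration; with the real file you would pass the open file object instead of the sample list.

```python
def unique_sorted_words(lines):
    """Collect each distinct word once, then sort alphabetically."""
    words = []
    for line in lines:
        for word in line.split():   # loop over the words, not the whole list
            if word not in words:
                words.append(word)  # mutates in place; do not reassign the result
    words.sort()                    # also in place, and also returns None
    return words

sample = ["the cat sat\n", "on the mat\n"]
result = unique_sorted_words(sample)
print(result)  # ['cat', 'mat', 'on', 'sat', 'the']
```

Note that print(lst.sort()) would print None for the same reason the append assignment failed: sort first, then print the list.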

Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
You are doing play = open(poem) and then for line in play:, which is the way to process a file line by line. If you want to process the whole content at once, do:
play = open(poem)
content = play.read()
words = content.split()
Always remember to close the file after you have used it, i.e. do
play.close()
unless you use a context manager (i.e. with open(poem) as f:).

Just to help you get into Python a little more:
You can:
1. Read the whole file at once (if you have enough RAM, it is better to load it all; if not, read as large a chunk as is reasonable, then the next one, and so on)
2. Split the data you read into words and
3. Use set() or dict() to remove duplicates
Along the way, don't forget to pay attention to upper and lower case
if you need the same words regardless of case, not just distinct strings.
This will work in Py2 and Py3 as long as you do something about input() function in Py2 or use quotes when entering the path, so:
path = input("Filename: ")
f = open(path)
c = f.read()
f.close()
words = set(x.lower() for x in c.split()) # This makes a set of lower-case words, with no repetitions
# This is equivalent to:
#words = set()
#for x in c.split():
#    words.add(x.lower())
# set() is an unordered datatype that ignores duplicate items,
# and it mimics mathematical sets and their properties (unions, intersections, ...).
# It is also fast as hell.
# Checking a list() each time for the existence of the word is slow as hell.
#----
# OK, you need a sorted list, so:
words = sorted(words)
# Or step-by-step:
#words = list(words)
#words.sort()
# Now words is your list
As for your errors, do not worry, they are common at the beginning in almost any object-oriented language.
Others explained them well in their comments. But so that this answer is not lacking:
Always pay attention to which functions or methods operate on the datatype in place (list.sort(), list.append(), list.insert(), set.add()...) and which ones return a new version of the datatype (sorted(), str.lower()...).
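A short sketch of that difference (the variable names are made up for illustration):

```python
fruits = ["pear", "apple"]

appended = fruits.append("fig")   # in place: returns None
sorted_ret = fruits.sort()        # in place: returns None
print(appended, sorted_ret)       # None None
print(fruits)                     # ['apple', 'fig', 'pear']

letters = ["b", "a"]
ordered = sorted(letters)         # returns a NEW list, letters is untouched
lowered = "ABC".lower()           # returns a NEW string
print(letters, ordered, lowered)  # ['b', 'a'] ['a', 'b'] abc
```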
If you run into a similar situation again, use help() in the interactive shell to see exactly what the function you used does.
>>> help(list.append)
>>> help(list.sort)
>>> help(str.lower)
>>> # Or any short documentation you need
Python, especially Python 3.x, is strict about operations between different types, but some combinations have their own meaning and work while doing something you might not expect.
E.g. you can do:
print(40*"x")
It will print out 40 'x' characters, because it will create a string of 40 characters.
But:
print([1, 2, 3]+None)
will, logically, not work, which is what is happening somewhere in the rest of your code.
In some languages like javascript (terrible stuff) this will work perfectly well:
v = "abc "+123+" def";
Inserting the 123 seamlessly into the string. Which is useful, but a programming nightmare and nonsense from another viewing angle.
Also, the assumption from Py2 that you can mix unicode and byte strings, with an automatic cast being performed, no longer holds in Py3.
I.e. this is a TypeError:
print(b"abc"+"def")
because b"abc" is bytes and "def" (or u"def") is str in Py3 (what was unicode in Py2).
Enjoy Python, it is the best!

Replace words in list that later will be used in variable

I have a file which currently stores a string eeb39d3e-dd4f-11e8-acf7-a6389e8e7978
which I am trying to pass into as a variable to my subprocess command.
My current code looks like this
with open(logfilnavn, 'r') as t:
    test = t.readlines()
print(test)
But this prints ['eeb39d3e-dd4f-11e8-acf7-a6389e8e7978\n'] and I don't want the part with ['\n'] to be passed into my command, so I'm trying to remove it by using replace.
with open(logfilnavn, 'r') as t:
    test = t.readlines()
removestrings = test.replace('[', '').replace('[', '').replace('\\', '').replace("'", '').replace('n', '')
print(removestrings)
I get an exception saying this:
'list' object has no attribute 'replace'
So how can I replace these with nothing and store the result as a string for my subprocess command?

readlines() returns a list. Try print(test[0].strip())

You can read the whole file and split lines using str.splitlines:
test = t.read().splitlines()

Your test variable is a list, because readlines() returns a list of all lines read.
Since you said the file only contains this one line, you probably wish to perform the replace on only the first line that you read:
removestrings = test[0].replace('[', '').replace('[', '').replace('\\', '').replace("'", '').replace('n', '')

Where you went wrong...
file.readlines() in Python returns a list of the lines in the file (what other languages call an array). Here you are treating the list as a string. You must first select the string inside it, then apply the string-only method.
In this case, however, that would still not work, because you are trying to remove characters that only appear in the way the Python interpreter displays the list so that a person can read it.
Further information...
In code, the list is not a string; the brackets and quotes are just a readable representation, since we can't easily read the stack, heap and memory addresses directly. The example below works for any number of lines (but it only prints the first element, so you will need to change that).
This may be useful...
You could perhaps make the variables globally available, so that other parts of the program can read them before they go out of scope.
More background, which you can skip...
Scope means the region of the program in which the interpreter (what runs the program) treats a variable as usable, so that it can remove the variable from memory afterwards. In much larger programs, scope also means you only need to worry about the locality of variables: i, for example, is used in a lot of for loops, and without scope there would need to be a different name for each variable in the whole project. An easy way to understand this might be to think of scopes as the branches of a tree: the tips of different branches don't touch, and neither do their variables.
Solution?
E.g.:
with open(logfilnavn, 'r') as file:
    lines = file.readlines()  # creates a list
# a list comprehension that goes through each item and takes off the last character: \n - the newline character
# (this assumes every line, including the last, ends with a newline)
# this will work with any number of lines
strippedLines = [line[:-1] for line in lines]
# or
strippedLines = [line.replace('\n', '') for line in lines]
# you can now print the string stored within the list
print(strippedLines[0])  # this prints the first element in the list
I hope this helped!

You get the error because readlines() returns a list object. Since you mentioned in the comment that there is just one line in the file, it's better to use readline() instead:
line = ""  # defined up front so it is clearly available after the with block
with open(logfilnavn, 'r') as t:
    line = t.readline().rstrip('\n')  # drop the trailing newline
print(line)
# output:
eeb39d3e-dd4f-11e8-acf7-a6389e8e7978

readlines will return a list of lines, and you can't use replace with a list.
If you really want to use readlines, you should know that it doesn't remove the newline character from the end, you'll have to do it yourself.
lines = [line.rstrip('\n') for line in t.readlines()]
But still, after removing the newline character yourself from the end of each line, you'll have a list of lines. And since, from the question, it looks like you only have one line, you can just access the first line with lines[0].
Or you can just leave out readlines, and just use read, it'll read all of the contents from the file. And then just do rstrip.
contents = t.read().rstrip('\n')
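The three approaches from the answers can be compared side by side in a small Python 3 sketch. The temporary file is only a stand-in for the log file so the example runs on its own, and the variable names are made up for illustration.

```python
import os
import tempfile

# Throwaway file standing in for the real log file.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("eeb39d3e-dd4f-11e8-acf7-a6389e8e7978\n")
    logfilnavn = tmp.name

with open(logfilnavn) as t:
    from_readlines = t.readlines()[0].rstrip("\n")  # list of lines -> first line
with open(logfilnavn) as t:
    from_read = t.read().rstrip("\n")               # whole file as one string
with open(logfilnavn) as t:
    from_readline = t.readline().strip()            # just the first line

os.remove(logfilnavn)
print(from_read)  # eeb39d3e-dd4f-11e8-acf7-a6389e8e7978
```

Each result is a plain str without brackets, quotes or a newline, so it can be used directly as one element of the argument list given to subprocess.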

using readline() in a function to read through a log file will not iterate

In the code below readline() will not increment. I've tried using a value, no value and variable in readline(). When not using a value I don't close the file so that it will iterate but that and the other attempts have not worked.
What happens is just the first byte is displayed over and over again.
If I don't use a function and just place the code in the while loop (without 'line' variable in readline()) it works as expected. It will go through the log file and print out the different hex numbers.
i=0
x=1
def mFinder(line):
    rgps=open('c:/code/gps.log', 'r')
    varr=rgps.readline(line)
    varr=varr[12:14].rstrip()
    rgps.close()
    return varr
while x<900:
    val=mFinder(i)
    i+=1
    x+=1
    print val
    print 'this should change'

It appears you have misunderstood what file.readline() does. Passing in an argument does not tell the method to read a specific numbered line.
The documentation tells you what happens instead:
file.readline([size])
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line). If the size argument is present and non-negative, it is a maximum byte count (including the trailing newline) and an incomplete line may be returned.
Note the last sentence: you are passing in a maximum byte count, and rgps.readline(1) reads a single byte, not the first line.
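The effect of the size argument is easy to see on an in-memory file. This is a Python 3 sketch using io.StringIO as a stand-in for the log file; in Python 3 text mode the limit counts characters rather than bytes.

```python
import io

f = io.StringIO("first line\nsecond line\n")

a = f.readline(1)   # at most 1 character of the current line
b = f.readline(3)   # the next 3 characters of the SAME line
c = f.readline()    # the rest of the first line
d = f.readline()    # only now do we reach the second line

print(repr(a), repr(b), repr(c), repr(d))
# 'f' 'irs' 't line\n' 'second line\n'
```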
You need to keep a reference to the file object around until you are done with it, and repeatedly call readline() on it to get successive lines. You can pass the file object to a function call:
def finder(fileobj):
    line = fileobj.readline()
    return line[12:14].rstrip()
with open('c:/code/gps.log') as rgps:
    x = 0
    while x < 900:
        section = finder(rgps)
        print section
        # do stuff
        x += 1
You can also loop over files directly, because they are iterators:
for line in openfileobject:
or use the next() function to get the next line, as long as you don't mix .readline() calls and iteration (including next()). If you combine this with a generator function, you can leave the file object entirely to a separate function that will read lines and produce sections until you are done:
def read_sections():
    with open('c:/code/gps.log') as rgps:
        for line in rgps:
            yield line[12:14].rstrip()
for section in read_sections():
    # do something with `section`.
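If you still want to stop after a fixed number of lines (900 in the question), itertools.islice can cap the generator. This is a Python 3 sketch; the in-memory sample data and the variant of read_sections that takes a file object are assumptions made so the example is self-contained.

```python
import io
from itertools import islice

def read_sections(fileobj):
    """Yield characters 12-13 of every line of an already-open file."""
    for line in fileobj:
        yield line[12:14].rstrip()

# Made-up stand-in for gps.log.
log = io.StringIO("0123456789abXY rest\n0123456789abZW rest\n0123456789abQQ\n")

first_two = list(islice(read_sections(log), 2))  # take at most 2 sections
print(first_two)  # ['XY', 'ZW']
```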

How does python read lines from file

Consider the following simple python code:
f=open('raw1', 'r')
i=1
for line in f:
    line1=line.split()
    for word in line1:
        print word,
    print '\n'
In the first for loop, i.e. "for line in f:", how does Python know that I want to read a line and not a word or a character?
The second loop is clearer, as line1 is a list. So the second loop will iterate over the list elements.

Python has a notion of what are called "iterables". They're things that know how to let you traverse the data they hold. Some common iterables are lists, sets, dicts: pretty much every data structure. Files are no exception to this.
Things become iterable by defining a method (__iter__) that returns an object with a next method (spelled __next__ in Python 3). This next method is meant to be called repeatedly and to return the next piece of data each time. The for foo in bar loop is actually just calling the next method repeatedly behind the scenes.
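Roughly, a for loop over a file can be spelled out by hand like this. It is a Python 3 sketch, with io.StringIO standing in for a real file; the names it and collected are made up for illustration.

```python
import io

f = io.StringIO("one two\nthree\n")

it = iter(f)               # ask the iterable for its iterator
collected = []
while True:
    try:
        line = next(it)    # what `for line in f` does behind the scenes
    except StopIteration:  # raised when the data runs out
        break
    collected.append(line)

print(collected)  # ['one two\n', 'three\n']
print(it is f)    # True: a file object is its own iterator
```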
For files, the next method returns lines, that's it. It doesn't "know" that you want lines, it's just always going to return lines. The reason for this is that ~50% of cases involving file traversal are by line, and if you want words,
for word in (word for line in f for word in line.split(' ')):
    ...
works just fine.

In Python, the for..in syntax is used over iterables (objects that can be iterated upon). For a file object, the iterator is the file itself.
The documentation of the file object's next() method explains this (excerpt pasted below):
A file object is its own iterator, for example iter(f) returns f
(unless f is closed). When a file is used as an iterator, typically in
a for loop (for example, for line in f: print line), the next() method
is called repeatedly. This method returns the next input line, or
raises StopIteration when EOF is hit when the file is open for reading
(behavior is undefined when the file is open for writing). In order to
make a for loop the most efficient way of looping over the lines of a
file (a very common operation), the next() method uses a hidden
read-ahead buffer. As a consequence of using a read-ahead buffer,
combining next() with other file methods (like readline()) does not
work right. However, using seek() to reposition the file to an
absolute position will flush the read-ahead buffer. New in version
2.3.

Use readlines() with indices or parse lines on the fly?

I'm making a simple test function that asserts that the output from an interpreter I'm developing is correct, by reading from a file the expression to evaluate and the expected result, much like python's doctest. This is for scheme, so an example of an input file would be
> 42
42
> (+ 1 2 3)
6
My first attempt for a function that can parse such a file looks like the following, and it seems to work as expected:
def run_test(filename):
    interp = Interpreter()
    response_next = False
    num_tests = 0
    with open(filename) as f:
        for line in f:
            if response_next:
                assert response == line.rstrip('\n')
                response_next = False
            elif line.startswith('> '):
                num_tests += 1
                response = interp.eval(line[2:])
                response = str(response) if response else ''
                response_next = True
    print "{:20} Ran {} tests successfully".format(os.path.basename(filename),
                                                    num_tests)
I wanted to improve it slightly by removing the response_next flag, as I am not a fan of such flags, and instead read in the next line within the elif block with next(f). I had a small unrelated question regarding that which I asked about in IRC at freenode. I got the help I wanted but I was also given the suggestion to use f.readlines() instead, and then use indexing on the resulting list. (I was also told that I could use groupby() in itertools for the pairwise lines, but I'll investigate that approach later.)
Now to the question, I was very curious why that approach would be better, but my Internet connection was a flaky one on a train and I was unable to ask, so I'll ask it here instead. Why would it be better to read everything with readlines() instead of parsing every line as they are read on the fly?
I'm really wondering as my feeling is the opposite, I think it seems cleaner to parse the lines one at a time so that everything is finished in one go. I usually avoid using indices in arrays in Python and prefer to work with iterators and generators. Maybe it is impossible to answer and guess what the person was thinking in case it was a subjective opinion, but if there is some general recommendation I'd be happy to hear about it.

It's certainly more Pythonic to process input iteratively rather than reading the whole input at once; for example, this will work if the input is a console.
An argument in favour of reading a whole array and indexing is that using next(f) could be unclear when combined with a for loop; the options there would be either to replace the for loop with a while True or to fully document that you are calling next on f within the loop:
try:
    while True:
        test = next(f)
        response = next(f)
except StopIteration:
    pass
As Jonas suggests you could accomplish this (if you're sure that the input will always consist of lines test/response/test/response etc.) by zipping the input with itself:
for test, response in zip(f, f): # Python 3
for test, response in itertools.izip(f, f): # Python 2
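To see why zipping the file with itself pairs up the lines, here is a Python 3 sketch on an in-memory file (io.StringIO stands in for the test file, and pairs is a made-up name). Both arguments to zip are the same iterator, so each step consumes two consecutive lines.

```python
import io

f = io.StringIO("> 42\n42\n> (+ 1 2 3)\n6\n")

pairs = [(test.rstrip("\n"), response.rstrip("\n"))
         for test, response in zip(f, f)]

print(pairs)  # [('> 42', '42'), ('> (+ 1 2 3)', '6')]
```

With an odd number of lines, the final unpaired line is silently dropped, so this only suits input that strictly alternates test and response.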

Reading everything into an array gives you the equivalent of random access: You use an array index to move down the array, and at any time you can check what's next and back up if necessary.
If you can carry out your task without backing up, you don't need the random access and it would be cleaner to do without it. In your examples, it seems that your syntax is always a single-line (?) expression followed by the expected response. So, I'd have written a top-level loop that iterates once per expression-value pair, reading lines as necessary.
If you want to support multi-line expressions and results, you can write separate functions to read each one: one that reads a complete expression, and one that reads a result (up to the next blank line). The important thing is that they should be able to consume as much input as they need, and leave the input pointer in a reasonable state for the next input.

from itertools import ifilter, imap
def run_test(filename):
    interp = Interpreter()
    num_tests, num_passed, last_result = 0, 0, None
    with open(filename) as f:
        # iterate over non-blank lines
        for line in ifilter(None, imap(str.strip, f)):
            if line.startswith('> '):
                last_result = interp.eval(line[2:])
            else:
                num_tests += 1
                try:
                    # compare against the result of the preceding test
                    assert line == repr(last_result)
                except AssertionError, e:
                    print e.message
                else:
                    num_passed += 1
    print("Ran {} tests, {} passed".format(num_tests, num_passed))
... this simply assumes that any result-line refers to the preceding test.
I would avoid .readlines() unless you get some specific benefit from having the whole file available at once.
I also changed the comparison to look at the representation of the result, so it can distinguish between output types, i.e.
'6' + '2'
> '62'
60 + 2
> 62
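The same idea can be sketched for Python 3, where filter and map are already lazy. This is not the answerer's code: run_tests takes any iterable of lines plus an evaluation function, and fake_eval is a hypothetical stand-in for interp.eval so the example runs without a Scheme interpreter.

```python
def run_tests(lines, evaluate):
    """Compare each result line with the repr of the preceding '> ' expression."""
    num_tests = num_passed = 0
    last_result = None
    # iterate over non-blank, stripped lines
    for line in filter(None, map(str.strip, lines)):
        if line.startswith('> '):
            last_result = evaluate(line[2:])
        else:
            num_tests += 1
            if line == repr(last_result):
                num_passed += 1
            else:
                print("expected {}, got {!r}".format(line, last_result))
    return num_tests, num_passed

def fake_eval(expr):
    # Hypothetical stand-in for a real interpreter: a lookup table.
    return {"42": 42, "(+ 1 2 3)": 6}[expr]

sample = ["> 42\n", "42\n", "\n", "> (+ 1 2 3)\n", "6\n"]
result = run_tests(sample, fake_eval)
print("Ran {} tests, {} passed".format(*result))  # Ran 2 tests, 2 passed
```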
