My code below extracts a portion of each line from a file and displays the results as separate lists.
I want to collect all of these filtered values into one list of lists. I tried to build it in my code, but when I try to print it out, I get an empty list.
import re
hand = open('mbox.txt')
for line in hand:
    my_list = list()
    line = line.rstrip()
    # Extracting out the data from file
    x = re.findall('^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])', line)
    # checking the length and checking if the data is not present in the list
    if len(x) != 0 and x not in my_list:
        my_list.append(x[0])
        print my_list
Filtered list is:
['15:46:24']
['15:03:18']
['14:50:18']
['11:37:30']
['11:35:08']
['11:12:37']
and so on.
A couple of things to note. If you are repeatedly doing regex matching, I suggest you compile the pattern first and then do the matching. Also, you don't need to check length of a container manually to get its bool value - just do if container:. Use builtin filter to remove empty items. Or you can use a set that avoids duplicates automatically. I am also not sure why you are stripping the space characters before doing the regex match. Is that necessary?
import re

pattern = re.compile(r"^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")
data = []
with open("mbox.txt") as f:
    for line in f:
        match = list(filter(None, pattern.findall(line)))
        if match:
            data.append(match)
print(data)
This is all you need to get that list of lists. Compiling the pattern once and using filter keeps the code compact.
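The set mentioned above can also handle the de-duplication directly; here is a minimal sketch (the sample lines are made-up stand-ins for mbox.txt, so it runs on its own):

```python
import re

# made-up sample lines standing in for the contents of mbox.txt
lines = [
    "From stephen@uct.ac.za Sat Jan  5 09:14:16 2008",
    "From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008",
    "From stephen@uct.ac.za Sat Jan  5 09:14:16 2008",  # duplicate
]

pattern = re.compile(r"^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")
times = set()
for line in lines:
    for t in pattern.findall(line):
        times.add(t)  # a set silently ignores duplicates

print(sorted(times))  # sets are unordered, so sort for display
```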
Just move my_list = list() out of the for loop, so the list isn't re-created (and emptied) on every iteration.
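Concretely, the corrected loop looks like this (with made-up sample lines standing in for mbox.txt so the sketch is self-contained; note the membership test checks x[0], since that is what gets appended):

```python
import re

my_list = list()  # created ONCE, before the loop
# made-up sample lines standing in for mbox.txt
hand = [
    "From stephen@uct.ac.za Sat Jan  5 09:14:16 2008",
    "From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008",
]
for line in hand:
    line = line.rstrip()
    x = re.findall('^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])', line)
    if len(x) != 0 and x[0] not in my_list:
        my_list.append(x[0])

print(my_list)  # every extracted time, in one list
```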
Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like that, splitting between each ><.
But in Python, I don't know of a way to do that. I can only split on that string, which removes it from the output. I want to keep the delimiters and split between the two angle brackets.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.
# initial input
a = "<head><title>Example Title</title></head>"
# split the string; this removes the >< pairs themselves
b = a.split('><')
# strip the leftover < from the first element and > from the last,
# so every element can be re-wrapped uniformly below
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with re-wrapped elements
for i in range(len(b)):
    a_list.append('<{}>'.format(b[i]))
This outputs the given list in Python 2.7.2, but it should work in Python 3 as well.
You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']
The shortest approach uses the re.findall() function, shown here on an extended example:
import re
# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output:
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted not to re-add the characters after the split.
Here, I got rid of one extra argument by combining the two characters into a single string. Anyway:
def split_between(string, chars):
    if len(chars) != 2:
        raise IndexError("Argument chars must contain two characters.")
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list
Credit goes to @cforeman and @Ajax1234.
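A quick usage check (the function is repeated here so the sketch runs standalone):

```python
def split_between(string, chars):
    if len(chars) != 2:
        raise IndexError("Argument chars must contain two characters.")
    # re-wrap each piece, then trim the doubled characters at both ends
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list

print(split_between("<head><title>Example Title</title></head>", "><"))
```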
Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
I'm having some troubles with removing items from a list. I'm looking for a more elegant solution. Preferably a solution in one for-loop or filter.
The objective of the piece of code: remove all empty entries and all entries starting with a '#' from the config handle.
At the moment i'm using:
# Read the config file and put every line in a separate entry in a list
configHandle = [item.rstrip('\n') for item in open('config.conf')]
# Strip comment items from the configHandle
for item in configHandle:
    if item.startswith('#'):
        configHandle.remove(item)
# remove all empty items in handle
configHandle = filter(lambda a: a != '', configHandle)
print configHandle
This works but I think it is a bit of a nasty solution.
When I try:
# Read the config file and put every line in a separate entry in a list
configHandle = [item.rstrip('\n') for item in open('config.conf')]
# Strip comment items and empty items from the configHandle
for item in configHandle:
    if item.startswith('#'):
        configHandle.remove(item)
    elif len(item) == 0:
        configHandle.remove(item)
This, however, fails. I cannot figure out why.
Can someone push me in the right direction?
Because you're changing the list while iterating over it. You can use a list comprehension to get rid of this problem:
configHandle = [i for i in configHandle if i and not i.startswith('#')]
Also, for opening a file it's better to use a with statement, which closes the file at the end of the block automatically1:
with open('config.conf') as infile:
    configHandle = infile.read().splitlines()
configHandle = [line for line in configHandle if line and not line.startswith('#')]
1. Because there is no guarantee that the garbage collector will ever reclaim external resources, you need to close them explicitly, which can be done by calling the close() method of a file object or, as a more pythonic way, by using a with statement.
Don't remove items while you're iterating; it's a common pitfall.
You shouldn't modify a list while you're iterating over it.
Instead you should use things like filter or list comprehensions.
configHandle = filter(lambda a: (a != '') and not a.startswith('#'), configHandle)
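One caveat: in Python 3, filter returns a lazy iterator rather than a list, so wrap it in list() if you need a list back. A small sketch with made-up config lines:

```python
# made-up config lines for illustration
configHandle = ['# a comment', '', 'host=localhost', '', 'port=8080']

# in Python 3, filter() is lazy, so materialize it with list()
filtered = list(filter(lambda a: a != '' and not a.startswith('#'), configHandle))
print(filtered)  # ['host=localhost', 'port=8080']
```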
Your filter expression is good; just include the additional condition you're looking for:
configHandle = filter(lambda a: a != '' and not a.startswith('#'), configHandle)
There are other options if you don't want to use filter, but, as has been stated in other answers, it is a very bad idea to attempt to modify a list while you are iterating through it. The answers to this stackoverflow question provide alternatives to using filter for removing items from a list based on a condition.
I have a running python script that reads in a file of phone numbers. Some of these phone numbers are invalid.
import re
def IsValidNumber(number, pattern):
    isMatch = re.search(pattern, number)
    if isMatch is not None:
        return number

numbers = [line.strip() for line in open('..\\phoneNumbers.txt', 'r')]
Then I use another list comprehension to filter out the bad numbers:
phonePattern = '^\d{10}$'
validPhoneNumbers = [IsValidNumber(x, phonePattern) for x in phoneNumbers
                     if IsValidNumber(x, phonePattern) is not None]
for x in validPhoneNumbers:
    print x
Due to formatting, the second list comprehension spans two lines.
The problem is that although the IsValidNumber should only return the number if the match is valid, it also returns 'None' on invalid matches. So I had to modify the second list comprehension to include:
if IsValidNumber(x, phonePattern) is not None
While this works, the problem is that for each iteration in the list, the function is executed twice. Is there a cleaner approach to doing this?
Your IsValidNumber function should return True/False (as its name suggests). That way your list comprehension becomes:
valid = [num for num in phoneNumbers if isValidNumber(num, pattern)]
While you're at it, modify numbers to be a generator expression instead of a list comprehension (since you're interested in efficiency):
numbers = (line.strip() for line in open("..\\phoneNumbers.txt"))
Try this:
validPhoneNumbers = [x for x in phoneNumbers if IsValidNumber(x, phonePattern)]
Since IsValidNumber returns the same number that's passed in, without modification, you don't actually need that return value. You just need to know that a number is returned at all (meaning the number is valid).
You may be able to combine the whole thing as well, with:
validPhoneNumbers = [x.strip() for x in open('..\\phoneNumbers.txt', 'r') if IsValidNumber(x.strip(), phonePattern)]
I would change your validity check method to simply return whether the number matches, not the number itself.
def is_valid_number(number):
    return re.search(r'^\d{10}$', number) is not None
Then you can filter out the invalid numbers in the first list comprehension:
numbers = [line.strip() for line in open('..\\phoneNumbers.txt', 'r')
           if is_valid_number(line.strip())]
There are many options to work with here, including filter(None, map(isValidNumber, lines)). Most efficient is probably to let the regular expression do all the work:
import re
numpat = re.compile(r'^\s*(\d{10})\s*$', re.MULTILINE)
filecontents = open('phonenumbers.txt', 'r').read()
validPhoneNumbers = numpat.findall(filecontents)
This way there is no need for a Python loop, and you get precisely the validated numbers.
I am having trouble splitting an '&' in a list of URL's. I know it is because I cannot split a list directly but I cannot figure out how to get around this error. I am open for any suggestions.
def nestForLoop():
    lines = open("URL_leftof_qm.txt", 'r').readlines()
    for l in lines:
        toke1 = l.split("?")
        toke2 = toke1.split("&")
        for t in toke2:
            with open("ampersand_right_split.txt".format(), 'a') as f:
                f.write
    lines.close()
nestForLoop()
NO. STOP.
qs = urlparse.urlparse(url).query
qsl = urlparse.parse_qsl(qs)
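That snippet is Python 2; in Python 3 the same functions live in urllib.parse. A minimal sketch with a made-up URL:

```python
from urllib.parse import urlparse, parse_qsl

url = "http://example.com/search?q=python&page=2"  # hypothetical URL
qs = urlparse(url).query
qsl = parse_qsl(qs)
print(qsl)  # [('q', 'python'), ('page', '2')]
```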
As Ignacio points out, you should not be doing this in the first place. But I'll explain where you're going wrong, and how to fix it:
toke1 is a list of two strings: the main URL before the ?, and the query string after it. You don't want to split that list, or everything in that list; you just want to split the query string. So:
mainurl, query = l.split("?")
queryvars = query.split("&")
What if you did want to split everything in the first list? There are two different things that could mean, which are of course done differently. But both require a loop (explicit, or inside a list comprehension) over the first list. Either this:
tokens = [toke2.split("&") for toke2 in l.split("?")]
or
tokens = [token for toke2 in l.split("?")
          for token in toke2.split("&")]
Try them both out to see the different outputs, and hopefully you'll understand what they're doing.
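For example, with a made-up line, the two versions give:

```python
l = "http://example.com/search?q=python&page=2"  # hypothetical URL

# nested: one sub-list per '?' segment
nested = [toke2.split("&") for toke2 in l.split("?")]
print(nested)  # [['http://example.com/search'], ['q=python', 'page=2']]

# flattened: every token in a single list
flat = [token for toke2 in l.split("?") for token in toke2.split("&")]
print(flat)  # ['http://example.com/search', 'q=python', 'page=2']
```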
I am running a server with cherrypy and python script. Currently, there is a web page containing data of a list of people, which i need to get. The format of the web page is as follow:
www.url1.com, firstName_1, lastName_1
www.url2.com, firstName_2, lastName_2
www.url3.com, firstName_3, lastName_3
I wish to display the list of names on my own webpage, with each name hyperlinked to their corresponding website.
I have read the webpage into a list with the following method:
@cherrypy.expose
def receiveData(self):
    """ Get a list, one per line, of currently known online addresses,
    separated by commas.
    """
    method = "whoonline"
    fptr = urllib2.urlopen("%s/%s" % (masterServer, method))
    data = fptr.readlines()
    fptr.close()
    return data
But I don't know how to break the list into a list of lists at the commas. The result should give each smaller list three elements: URL, first name, and last name. So I was wondering if anyone could help.
Thank you in advance!
You can iterate over fptr, no need to call readlines()
data = [line.split(', ') for line in fptr]
You need the split(',') method on each string:
data = [ line.split(',') for line in fptr.readlines() ]
lists = []
for line in data:
    lists.append([x.strip() for x in line.split(',')])
If your data is a big 'ole string (potentially with leading or trailing spaces), do it this way:
lines=""" www.url1.com, firstName_1, lastName_1
www.url2.com, firstName_2 , lastName_2
www.url3.com, firstName_3, lastName_3 """
data=[]
for line in lines.split('\n'):
    t=[e.strip() for e in line.split(',')]
    data.append(t)
print data
Out:
[['www.url1.com', 'firstName_1', 'lastName_1'], ['www.url2.com', 'firstName_2',
'lastName_2'], ['www.url3.com', 'firstName_3', 'lastName_3']]
Notice the leading and trailing spaces are removed.
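As an alternative sketch, the stdlib csv module can do the splitting and handle the spaces after the commas via skipinitialspace (sample lines are made up):

```python
import csv

# made-up sample lines
lines = [
    "www.url1.com, firstName_1, lastName_1",
    "www.url2.com, firstName_2, lastName_2",
]
# skipinitialspace=True drops the space that follows each comma
data = [row for row in csv.reader(lines, skipinitialspace=True)]
print(data)
```

Note that skipinitialspace only removes the space immediately after each delimiter; trailing spaces would still need strip().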