I have a running Python script that reads in a file of phone numbers. Some of these phone numbers are invalid.
import re

def IsValidNumber(number, pattern):
    isMatch = re.search(pattern, number)
    if isMatch is not None:
        return number

phoneNumbers = [line.strip() for line in open('..\\phoneNumbers.txt', 'r')]
Then I use another list comprehension to filter out the bad numbers:
phonePattern = r'^\d{10}$'
validPhoneNumbers = [IsValidNumber(x, phonePattern) for x in phoneNumbers
                     if IsValidNumber(x, phonePattern) is not None]

for x in validPhoneNumbers:
    print x
Due to formatting, the second list comprehension spans two lines.
The problem is that although IsValidNumber should only return the number when the match is valid, it also returns None on invalid matches. So I had to modify the second list comprehension to include:
if IsValidNumber(x, phonePattern) is not None
While this works, the problem is that for each iteration in the list, the function is executed twice. Is there a cleaner approach to doing this?
Your isValidNumber function should return True/False (as its name suggests). That way your list comprehension becomes:
valid = [num for num in phoneNumbers if isValidNumber(num, phonePattern)]
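For example, the modified validator could look like this (a sketch; the answer doesn't spell it out):

import re

def isValidNumber(number, pattern):
    # return a boolean rather than the matched number
    return re.search(pattern, number) is not None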
While you're at it, modify phoneNumbers to be a generator expression instead of a list comprehension (since you're interested in efficiency):
phoneNumbers = (line.strip() for line in open("..\\phoneNumbers.txt"))
Try this:
validPhoneNumbers = [x for x in phoneNumbers if isValidNumber(x, phonePattern)]
Since isValidNumber returns the same number that's passed in, without modification, you don't actually need that number. You just need to know that a number is returned at all (meaning the number is valid).
You may be able to combine the whole thing as well, with:
validPhoneNumbers = [x.strip() for x in open('..\\phoneNumbers.txt', 'r') if isValidNumber(x.strip(), phonePattern)]
I would change your validity check method to simply return whether the number matches or not, but not return the number itself.
def is_valid_number(number):
    return re.search(r'^\d{10}$', number) is not None
Then you can filter out the invalid numbers in the first list comprehension:
numbers = [line.strip() for line in open('..\\phoneNumbers.txt', 'r')
           if is_valid_number(line.strip())]
There are many options to work with here, including filter(None, map(isValidNumber, lines)). Most efficient is probably to let the regular expression do all the work:
import re
numpat = re.compile(r'^\s*(\d{10})\s*$', re.MULTILINE)
filecontents = open('phonenumbers.txt', 'r').read()
validPhoneNumbers = numpat.findall(filecontents)
This way there is no need for a Python loop, and you get precisely the validated numbers.
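For reference, the filter(None, map(...)) option mentioned above also avoids calling the validator twice; a sketch using the original IsValidNumber (which returns the number or None) and the names defined in the question:

from functools import partial

# validate each number once, then drop the None results for invalid ones
validPhoneNumbers = list(filter(None, map(partial(IsValidNumber, pattern=phonePattern), phoneNumbers)))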
I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name.
If a name and its sequence do not appear in one of the sub-lists (e.g. no redFish or blueFish in the first list of testNames), I want to add in a string of hyphens the same length as the sequences in that list. This would give me this output:
['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)
print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?
The main issue with your code, which makes it very hard to understand, is that you're not really leveraging the language features that make Python so strong.
Here's a solution to your problem that works:
test_sequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])

result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]
print(result)
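With the example data above, this prints (the sample sequences have uneven lengths, so the result differs slightly from the expected output quoted in the question):
['aaaa--AAAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGGGG']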
The generator get_seqs pairs up the lists from test_sequences and test_names; for each pair it tries to find the sequence (seq) whose name (name) matches the taxon and yields it, or else yields a string of hyphens of the right length for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.
Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []
for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):  # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))

for array in data:
    final.append(''.join(array))

print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']
Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like that: splitting between each ><. But in Python, I don't know of a way to do that. I can only split at that string, which removes it from the output. I want to keep those characters and split between the two angle brackets.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.
# initial input
a = "<head><title>Example Title</title></head>"
# split list
b = a.split('><')
# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with formatted elements
for i in range(len(b)):
    a_list.append('<{}>'.format(b[i]))
This will output the given list in Python 2.7.2, but it should work in Python 3 as well.
You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']
The shortest approach, using the re.findall() function on an extended example:
import re

# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output:
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.
Here, I got rid of one extra argument by combining the two characters into a string. Anyways,
def split_between(string, chars):
    if len(chars) != 2:
        raise IndexError("Argument chars must contain two characters.")
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list
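For example, a quick check of the function above:

print(split_between("<head><title>Example Title</title></head>", "><"))
# ['<head>', '<title>Example Title</title>', '</head>']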
Credit goes to @cforeman and @Ajax1234.
Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
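For completeness, a zero-width regex split avoids re-adding the characters entirely. This isn't from the answers above, just a sketch using lookarounds (it needs Python 3.7+, where re.split accepts empty matches):

import re

s = "<head><title>Example Title</title></head>"
# split at the empty position between '>' and '<', so nothing is removed
parts = re.split(r'(?<=>)(?=<)', s)
print(parts)  # ['<head>', '<title>Example Title</title>', '</head>']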
I have a file named ping.txt which contains values showing the time taken to ping an IP n times.
I have my ping.txt contains:
time=35.9
time=32.4
I have written Python code to extract just this floating-point number and add the values up using a regular expression. But I feel that the code below is an indirect way of completing my task. The findall regex I am using here outputs a list, which is then converted, joined and added.
import re

add, tmp = 0, 0
with open("ping.txt", "r+") as pingfile:
    for i in pingfile.readlines():
        tmp = re.findall(r'\d+\.\d+', i)
        add = add + float("".join(tmp))
print("The sum of the times is :", add)
My question is: how can I solve this problem without using regex, or in some other way that reduces the number of lines in my code and makes it more efficient?
In other words, can I use a different regex or some other method to do this operation?
You can use the following:
with open('ping.txt', 'r') as f:
    s = sum(float(line.split('=')[1]) for line in f)
Output:
>>> with open('ping.txt', 'r') as f:
... s = sum(float(line.split('=')[1]) for line in f)
...
>>> s
68.3
Note: I assume each line of your file contains time=some_float_number
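If some lines might not have that form, a small guard skips them (a sketch; this assumption goes beyond the original answer):

with open('ping.txt') as f:
    # ignore lines without an '=' separator
    s = sum(float(line.split('=')[1]) for line in f if '=' in line)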
You could do it like this:
import re
total = sum(float(s) for s in re.findall(r'\d+(?:\.\d+)?', open("ping.txt", "r+").read()))
If you have the string:
>>> s='time=35.9'
Then to get the value, you just need:
>>> float(s.split('=')[1])
35.9
You don't need regular expressions for something with a simple delimiter.
You can use the string split to split each line at '=' and append them to a list. At the end, you can simply call the sum function to print the sum of elements in the list
temp = []
with open("test.txt", "r+") as pingfile:
    for i in pingfile.readlines():
        temp.append(float(i.split('=')[1]))
print("The sum of the times is :", sum(temp))
Use this regex:
tmp = re.findall(r"[0-9]+\.[0-9]+", i)
After that, run a loop to add up the matches:
total = 0
for each in tmp:
    total = total + float(each)
My code below is extracting some portion from a file and displaying the result in separate lists.
I want to form a list of all these lists which were filtered out. I tried to form it in my code but when I am trying to print it out, I am getting an empty list.
import re

hand = open('mbox.txt')
for line in hand:
    my_list = list()
    line = line.rstrip()
    #Extracting out the data from file
    x = re.findall('^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])', line)
    #checking the length and checking if the data is not present in the list
    if len(x) != 0 and x not in my_list:
        my_list.append(x[0])
    print my_list
Filtered list is:
['15:46:24']
['15:03:18']
['14:50:18']
['11:37:30']
['11:35:08']
['11:12:37']
and so on.
A couple of things to note. If you are repeatedly doing regex matching, I suggest you compile the pattern first and then do the matching. Also, you don't need to check length of a container manually to get its bool value - just do if container:. Use builtin filter to remove empty items. Or you can use a set that avoids duplicates automatically. I am also not sure why you are stripping the space characters before doing the regex match. Is that necessary?
import re

pattern = re.compile(r"^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")
data = []
with open("mbox.txt") as f:
    for line in f.readlines():
        match = list(filter(None, pattern.findall(line)))
        if match:
            data.append(match)
print(data)
This is all you need to get that list of lists. The use of filter and a precompiled pattern keeps the code compact.
Just move my_list = list() out of the for loop.
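Applied to the code in the question, that looks like this (a sketch; it also compares x[0] rather than x so the duplicate check works against the strings already stored):

import re

hand = open('mbox.txt')
my_list = list()          # create the list once, before the loop
for line in hand:
    line = line.rstrip()
    x = re.findall('^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])', line)
    if len(x) != 0 and x[0] not in my_list:
        my_list.append(x[0])
print(my_list)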
I have a log file and at the end of each line in the file there is this string:
Line:# where # is the line number.
I am trying to get the # and compare it to the previous line's number. What would be the best way to do that in Python?
I would probably use str.split because it seems easy:
with open('logfile.log') as fin:
    numbers = [int(line.split(':')[-1]) for line in fin]
Now you can use zip to compare one number with the next one:
for num1, num2 in zip(numbers, numbers[1:]):
    compare(num1, num2)  # do comparison here.
Of course, this isn't lazy (you store every line number in the file at once when you really only need 2 at a time), so it might take up a lot of memory if your files are HUGE. It wouldn't be hard to make it lazy though:
def elem_with_next(iterable):
    ii = iter(iterable)
    prev = next(ii)
    for here in ii:
        yield prev, here
        prev = here

with open('logfile.log') as fin:
    numbers = (int(line.split(':')[-1]) for line in fin)
    for num1, num2 in elem_with_next(numbers):
        compare(num1, num2)
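On Python 3.10+, the standard library provides the same pairing via itertools.pairwise; this isn't part of the original answer, just a sketch:

from itertools import pairwise  # Python 3.10+

with open('logfile.log') as fin:
    numbers = (int(line.split(':')[-1]) for line in fin)
    for num1, num2 in pairwise(numbers):
        compare(num1, num2)  # same placeholder comparison as above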
I'm assuming that you don't have something convenient to split a string on, meaning a regular expression might make more sense. That is, if the lines in your log file are structured like:
date: 1-15-2013, error: mildly_annoying, line: 121
date: 1-16-2013, error: err_something_bad, line: 123
Then you won't be able to use line.split('#') as mgilson suggested, although if there is always a colon, line.split(':') might work. In any case, a regular expression solution would look like:
import re
numbers = []
for line in log:
    digit_match = re.search(r"(\d+)$", line)
    if digit_match is not None:
        numbers.append(int(digit_match.group(1)))
Here the expression "(\d+)$" is matching some number of digits and then the end of the line. We extract the digits with the group(1) method on the returned match object and then add them to our list of line numbers.
If you're not confident that the "Line: #" will always come at the end of the log, you could replace the regular expression used above with something akin to "Line:\s*(\d+)" which checks for the string "Line:" then some (or no) whitespace, and then any number of digits.