Matching in Python lists when there are extra characters

I am trying to write Python code to match items from two lists.
One tab-delimited file looks like this:
COPB2
KLMND7
BLCA8
while the other file, file2, has a long list of similar-looking "names", if you will. There should be some identical matches between the files, which I have succeeded in identifying and writing out to a new file. The problem is when there are additional characters at the end of one of the "names". For example, COPB2 from above should match COPB2A in file2, but it does not. Similarly, KLMND7 should match KLMND79. Should I use regular expressions? Make them into strings? Any ideas are helpful, thank you!
What I have worked on so far, after the first response seen below:
with open(in_file1, "r") as names:
    for line in names:
        file1_list = [i.strip() for i in line.split()]
        file1_str = str(file1_list)

with open(in_file2, "r") as symbols:
    for line in symbols:
        items = line.split("\t")
        items = str(items)
        matches = items.startswith(file1_str)
        print matches
This code returns False when I know there should be some matches.

Use str.startswith(); no need for regex if it's only trailing characters:
>>> g = "COPB2A"
>>> f = "COPB2"
>>> g.startswith(f)
True
Here is a working piece of code:
file1_list = []
with open(in_file1, "r") as names:
    for line in names:
        line_items = line.split()
        for item in line_items:
            file1_list.append(item)

matches = []
with open(in_file2, "r") as symbols:
    for line in symbols:
        file2_items = line.split()
        for file2_item in file2_items:
            for file1_item in file1_list:
                if file2_item.startswith(file1_item):
                    matches.append(file2_item)
                    print file2_item

print matches
It may be quite slow for large files. If it's unacceptable, I could try to think about how to optimize it.
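If it does turn out to be too slow, one option (my own sketch, not part of the original answer) is to bucket the file1 names by their first character, so each file2 token is only compared against candidates that could actually be a prefix of it:

from collections import defaultdict

# Group the file1 names by their first character; a name can only be a
# prefix of a file2 token if the two share that first character.
buckets = defaultdict(list)
for name in file1_list:
    if name:
        buckets[name[0]].append(name)

matches = []
with open(in_file2, "r") as symbols:
    for line in symbols:
        for file2_item in line.split():
            for candidate in buckets.get(file2_item[:1], []):
                if file2_item.startswith(candidate):
                    matches.append(file2_item)
                    break
print(matches)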

You might take a look at difflib if you need a more generic solution. Keep in mind it is a big import with lots of overhead so only use it if you really need to. Here is another question that is somewhat similar.
https://stackoverflow.com/questions/1209800/difference-between-two-strings-in-python-php
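As a rough illustration of that route (the names and the cutoff of 0.8 below are just made-up values for this sketch), difflib.get_close_matches can suggest near-matches without any custom prefix logic:

import difflib

names = ["COPB2", "KLMND7", "BLCA8"]
candidates = ["COPB2A", "KLMND79", "XYZ1"]

for candidate in candidates:
    # get_close_matches returns up to n suggestions, best first;
    # cutoff (0.0-1.0) controls how fuzzy the matching is allowed to be.
    close = difflib.get_close_matches(candidate, names, n=1, cutoff=0.8)
    if close:
        print("%s matches %s" % (candidate, close[0]))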

Assuming you loaded the files into lists X, Y.
## match if a or b is equal to or a substring of the other, in a case-sensitive way
def Match(a, b):
    # str.find returns -1 when there is no match
    return a.find(b[0:min(len(a), len(b)) - 1])

common_words = {}
for a in X:
    common_words[a] = []
    for b in Y:
        if Match(a, b) != -1:
            common_words[a].append(b)
If you want to use regular expressions to do the matching, you want the "beginning of string" anchor "^".
import re

def MatchRe(a, b):
    # make sure the longer string is in 'a'.
    if len(a) < len(b):
        a, b = b, a
    exp = "^" + b
    q = re.match(exp, a)
    if not q:
        return False  # no match
    return True  # access q.group(0) for matches
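A quick sanity check of MatchRe against the names from the question (the test values are my own):

print(MatchRe("COPB2", "COPB2A"))    # True: "COPB2" is a prefix of "COPB2A"
print(MatchRe("KLMND7", "KLMND79"))  # True
print(MatchRe("BLCA8", "COPB2A"))    # False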

Related

Comparing multiple file items using re

Currently I have a script that finds all the lines across multiple input files that have something in the format of
Matches: 500 (54.3 %) and prints out the top 10 highest matches by percentage.
I want it to also output the top 10 lines by score, e.g. Score: 4000.
import re

def get_values_from_file(filename):
    f = open(filename)
    winpat = re.compile("([\d\.]+)\%")
    xinpat = re.compile("[\d]")  # ISSUE, is this the right regex for it? Score: 500****
    values = []
    scores = []
    for line in f.readlines():
        if line.find("Matches") >= 0:
            percn = float(winpat.findall(line)[0])
            values.append(percn)
        elif line.find("Score") >= 0:
            hey = float(xinpat.findall(line)[0])
            scores.append(hey)
    return (scores, values)

all_values = []
all_scores = []

for filename in ["out0.txt", "out1.txt"]:  # and so on
    values = get_values_from_file(filename)
    all_values += values
    all_scores += scores

all_values.sort()
all_values.reverse()
all_scores.sort()  # also for scores
all_scores.reverse()
print(all_values[0:10])
print(all_scores[0:10])
Is my regex for the score format correct? I believe that's where I am having the issue, as it doesn't output both correctly.
Any thoughts? Should I split it into two functions?
Thank you.
Is my regex for the score format correct?
No, it should be r"\d+".
You don't need []. Those brackets establish a character class representing all of the characters inside the brackets. Since you only have one character type inside the bracket, they do nothing.
You only match a single character. You need a * or a + to match a sequence of characters.
You have an unescaped backslash in your string. Use the r prefix to allow the regular expression engine to see the backslash.
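A quick interactive check of the difference (the example string is my own):

>>> import re
>>> re.findall("[\d]", "Score: 4000")
['4', '0', '0', '0']
>>> re.findall(r"\d+", "Score: 4000")
['4000']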
Commentary:
If it were me, I'd let the regular expression do all of the work, and skip line.find() altogether:
# UNTESTED
import re

def get_values_from_file(filename):
    winpat = re.compile(r"Matches:\s*\d+\s*\(([\d\.]+)\s*%\)")
    xinpat = re.compile(r"Score:\s*(\d+)")
    values = []
    scores = []
    # Note: "with open() as f" automatically closes f
    with open(filename) as f:
        # Note: "for line in f" is more memory efficient
        # than "for line in f.readlines()"
        for line in f:
            win = winpat.match(line)
            xin = xinpat.match(line)
            if win: values.append(float(win.group(1)))
            if xin: scores.append(float(xin.group(1)))
    return (scores, values)
Just for fun, here is a version of the routine which calls re.findall exactly once per file:
# TESTED
# Compile this only once to save time
pat = re.compile(r'''
    (?mx)                                   # multi-line, verbose
    (?:Matches:\s*\d+\s*\(([\d\.]+)\s*%\))  # "Matches: 300 (43.2%)"
    |
    (?:Score:\s*(\d+))                      # "Score: 4000"
    ''')

def get_values_from_file(filename):
    with open(filename) as f:
        values, scores = zip(*pat.findall(f.read()))
    values = [float(value) for value in values if value]
    scores = [float(score) for score in scores if score]
    return scores, values
No. xinpat will only match single digits, so findall() will return a list of single digits, which is a bit messy. Change it to
xinpat = re.compile("[\d]+")
Actually, you don't need the square brackets here, so you could simplify it to
xinpat = re.compile("\d+")
BTW, the names winpat and xinpat are a bit opaque. The pat bit is ok, but win & xin? And hey isn't great either. But I guess xin and hey are just temporary names you made up when you decided to expand the program.
Another thing I just noticed, you don't need to do
all_values.sort()
all_values.reverse()
You can (and should) do that in one hit:
all_values.sort(reverse=True)

Something wrong in interpreting my solution to python challenge?

I'm trying to find a special character in view-source:http://www.pythonchallenge.com/pc/def/ocr.html
Here is my code:
f = open('file.txt')
lines = f.read()
k = ''.join(lines)
stat = ''
for i in k:
    if i in '#&#$!*^{}_()*+[]%':
        stat = stat + ''
    else:
        stat = stat + i
print(stat)
I'm getting the answer as "equality", but the words are very far apart. Why is that, since I'm not adding any spaces for the other characters?
You are not skipping the newlines in the file:
for i in k:
    if i not in '#&#$!*^{}_()*+[]%\n':
        stat = stat + i
Note that there is little point in appending an empty string for anything in your special characters string. Only append when the character is not in that string.
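For what it's worth, the same filtering over the text k from the question can also be written a little more compactly with str.join (same idea, just a different spelling):

skip = '#&#$!*^{}_()*+[]%\n'
stat = ''.join(c for c in k if c not in skip)
print(stat)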
You already found the solution anyway, but the challenge could have been better met with actually finding the rare characters:
from collections import Counter
import requests # external library but much more convenient than urllib2
r = requests.get('http://www.pythonchallenge.com/pc/def/ocr.html')
text = r.text.rsplit('<!--', 1)[-1].rsplit('-->', 1)[0] # extract comment
counts = Counter(text)
rare = {c for c in counts if counts[c] < 5}
print ''.join([c for c in text if c in rare])
where the characters in rare turn out to occur only once each, really.
You're not skipping linebreaks, which is why you get a sort of "stretched" output.
Here's my attempt at solving it, incorporating the comments to your question:
f = open('file.txt')
data = f.read()
stat = ''
for i in data:
    if i in '#&#$!*^{}_()+[]%]\n':
        pass
    else:
        stat = stat + i
print stat

Splitting lines in a file into string and hex and do operations on the hex values

I have a large file with several lines as given below. I want to read in only those lines which have the _INIT pattern in them, strip off the _INIT from the name, and save only the OSD_MODE_15_H part in a variable. Then I need to read the corresponding hex value, 8'h00 in this case, strip off the 8'h from it, replace it with 0x and save it in a variable.
I have been trying to strip off the _INIT, the spaces and the =, and the code is becoming really messy.
localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
Can you suggest a lean and clean method to do this?
Thanks!
The following solution uses a regular expression (compiled to speed searching up) to match the relevant lines and extract the needed information. The expression uses named groups "id" and "hexValue" to identify the data we want to extract from the matching line.
import re

expression = r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<hexValue>[0-9a-fA-F]*)"
regex = re.compile(expression)

def getIdAndValueFromInitLine(line):
    mm = regex.search(line)
    if mm is None:
        return None  # Not the ..._INIT parameter, or line was empty, or other mismatch happened
    else:
        return (mm.groupdict()["id"], "0x" + mm.groupdict()["hexValue"])
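For instance, running it over the two sample lines from the question (only the _INIT line produces a result):

lines = [
    "localparam OSD_MODE_15_H_ADDR = 16'h038d;",
    "localparam OSD_MODE_15_H_INIT = 8'h00",
]
for line in lines:
    print(getIdAndValueFromInitLine(line))
# None
# ('OSD_MODE_15_H', '0x00')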
EDIT: If I understood the next task correctly, you need to find the hexvalues of those INIT and ADDR lines whose IDs match and make a dictionary of the INIT hexvalue to the ADDR hexvalue.
regex = "(?P<init_id>\w+?)_INIT\s*?=.*?'h(?P<initValue>[0-9a-fA-F]*)"
init_dict = {}
for x in re.findall(regex, lines):
init_dict[x.groupdict()["init_id"]] = "0x" + x.groupdict()["initValue"]
regex = "(?P<addr_id>\w+?)_ADDR\s*?=.*?'h(?P<addrValue>[0-9a-fA-F]*)"
addr_dict = {}
for y in re.findall(regex, lines):
addr_dict[y.groupdict()["addr_id"]] = "0x" + y.groupdict()["addrValue"]
init_to_addr_hexvalue_dict = {init_dict[x] : addr_dict[x] for x in init_dict.keys() if x in addr_dict}
Even if this is not what you actually need, having init and addr dictionaries might help you achieve your goal more easily. If there are several _INIT (or _ADDR) lines with the same ID and different hex values, then the above dict approach will not work in a straightforward way.
Try something like this - I'm not sure what all your requirements are, but this should get you close:
with open(someFile, 'r') as infile:
    for line in infile:
        if '_INIT' in line:
            apostropheIndex = line.find("'h")
            clean_hex = '0x' + line[apostropheIndex + 2:]
In the case of "16'h038d;", clean_hex would be "0x038d;" (need to remove the ";" somehow) and in the case of "8'h00", clean_hex would be "0x00"
Edit: if you want to guard against characters like ";" you could do this and test if a character is alphanumeric:
clean_hex = '0x' + ''.join([s for s in line[apostropheIndex + 2:] if s.isalnum()])
You can use a regular expression and the re.findall() function. For example, to generate a list of tuples with the data you want just try:
import re

lines = open("your_file").read()
regex = r"([\w]+?)_INIT\s*=\s*\d+'h([\da-fA-F]*)"
res = [(x[0], "0x" + x[1]) for x in re.findall(regex, lines)]
print res
The regular expression is very specific for your input example. If the other lines in the file are slightly different you may need to change it a bit.
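For instance, against the two sample lines from the question, the pattern above would give (the _ADDR line is skipped because the pattern requires _INIT):

>>> lines = "localparam OSD_MODE_15_H_ADDR = 16'h038d;\nlocalparam OSD_MODE_15_H_INIT = 8'h00\n"
>>> [(x[0], "0x" + x[1]) for x in re.findall(regex, lines)]
[('OSD_MODE_15_H', '0x00')]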

Python - How to match and replace words from a given string?

I have an array with a large collection of substrings, and I have one input string. If any item from the collection is found in the input string, it should be replaced by the given option.
I tried the following, but it returns the wrong result:
#!/bin/python
arr = ['www.', 'http://', '.com', 'many many many....']

def str_replace(arr, replaceby, original):
    temp = ''
    for n, i in enumerate(arr):
        temp = original.replace(i, replaceby)
    return temp

main = 'www.google.com'
main1 = 'www.a.b.c.company.google.co.uk.com'

print str_replace(arr, '', main)
Output:
www.google
Expected:
google
You are deriving temp from the original every time, so only the last element of arr will be replaced in the temp that is returned. Try this instead:
def str_replace(arr, replaceby, original):
    temp = original
    for n, i in enumerate(arr):
        temp = temp.replace(i, replaceby)
    return temp
You don't even need temp (assuming the above code is the whole function):
def str_replace(search, replace, subject):
    for s in search:
        subject = subject.replace(s, replace)
    return subject
Another (probably more efficient) option is to use regular expressions:
import re

def str_replace(search, replace, subject):
    search = '|'.join(map(re.escape, search))
    return re.sub(search, replace, subject)
Do note that these functions may produce different results if replace contains substrings from search.
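A small illustration of that caveat (the values are my own, not from the question): the loop version can re-replace text introduced by an earlier replacement, while the single-pass re.sub version never rescans the replacement text.

import re

search = ['a', 'c']
replace = 'bc'
subject = 'a'

# Loop version: the 'c' introduced by the first replacement is replaced again.
looped = subject
for s in search:
    looped = looped.replace(s, replace)
print(looped)                                                       # bbc

# Regex version: one left-to-right pass over the original string.
print(re.sub('|'.join(map(re.escape, search)), replace, subject))   # bc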
temp = original.replace(i, replaceby)
It should be
temp = temp.replace(i, replaceby)
You're throwing away the previous substitutions.
Simple way :)
arr = ['www.', 'http://', '.com', 'many many many....']
main = 'http://www.google.com'

for item in arr:
    main = main.replace(item, '')

print main

How to refactor this python code block to be more efficient

This code block works - it loops through a file that has a repeating number of sets of data
and extracts out each of the 5 pieces of information for each set.
But I know that the current factoring is not as efficient as it can be, since it loops
through each key for each line found.
Wondering if some Python gurus can offer a better way to do this more efficiently.
def parse_params(num_of_params, lines):
    for line in lines:
        for p in range(1, num_of_params + 1, 1):
            nam = "model.paramName " + str(p) + " "
            par = "model.paramValue " + str(p) + " "
            opt = "model.optimizeParam " + str(p) + " "
            low = "model.paramLowerBound " + str(p) + " "
            upp = "model.paramUpperBound " + str(p) + " "
            keys = [nam, par, opt, low, upp]
            for key in keys:
                if key in line:
                    a, val = line.split(key)
                    if key == nam: names.append(val.rstrip())
                    if key == par: params.append(val.rstrip())
                    if key == opt: optimize.append(val.rstrip())
                    if key == upp: upper.append(val.rstrip())
                    if key == low: lower.append(val.rstrip())

    print "Names = ", names
    print "Params = ", params
    print "Optimize = ", optimize
    print "Upper = ", upper
    print "Lower = ", lower
Though this doesn't answer your question (other answers are getting at that), something that has helped me a lot in doing things similar to what you're doing is list comprehensions. They allow you to build lists in a concise and (I think) easy-to-read way.
For instance, the code below builds a 2-dimensional array with the values you're trying to get at. some_funct here would be a little regex, if I were doing it, that takes the index of the last space in the key as its parameter, looks ahead to collect the value you're trying to get from the line (the value which corresponds to the key currently being looked at), and appends it to the correct index in the seen_keys 2D array.
Wordy, yes, but if you get list comprehensions and you're able to construct the regex to do that, you've got a nice, concise solution.
keys = ["model.paramName ","model.paramValue ","model.optimizeParam ""model.paramLowerBound ","model.paramUpperBound "]
for line in lines:
seen_keys = [[],[],[],[],[]]
[seen_keys[keys.index(k)].some_funct(line.index(k) for k in keys if k in line]
It's not totally easy to see the expected format. From what I can see, the format is like:
lines = [
"model.paramName 1 foo",
"model.paramValue 2 bar",
"model.optimizeParam 3 bat",
"model.paramLowerBound 4 zip",
"model.paramUpperBound 5 ech",
"model.paramName 1 foo2",
"model.paramValue 2 bar2",
"model.optimizeParam 3 bat2",
"model.paramLowerBound 4 zip2",
"model.paramUpperBound 5 ech2",
]
I don't see the above code working if there is more than one value in each line. Which means the digit is not really significant unless I'm missing something. In that case this works very easily:
import re

def parse_params(num_of_params, lines):
    key_to_collection = {
        "model.paramName": names,
        "model.paramValue": params,
        "model.optimizeParam": optimize,
        "model.paramLowerBound": lower,
        "model.paramUpperBound": upper,
    }
    reg = re.compile(r'(.+?) (\d) (.+)')
    for line in lines:
        m = reg.match(line)
        key, digit, value = m.group(1, 2, 3)
        key_to_collection[key].append(value)
It's not entirely obvious from your code, but it looks like each line can have one "hit" at most; if that's indeed the case, then something like:
import re

def parse_params(num_of_params, lines):
    sn = 'Names Params Optimize Upper Lower'.split()
    ks = '''paramName paramValue optimizeParam
            paramLowerBound paramUpperBound'''.split()
    vals = dict((k, []) for k in ks)
    are = re.compile(r'model\.(%s) (\d+) (.*)' % '|'.join(ks))
    for line in lines:
        mo = are.search(line)
        if not mo: continue
        p = int(mo.group(2))
        if p < 1 or p > num_of_params: continue
        vals[mo.group(1)].append(mo.group(3).rstrip())
    for k, s in zip(ks, sn):
        print '%-8s =' % s,
        print vals[k]
might work -- I exercised it with a little code as follows:
if __name__ == '__main__':
    lines = '''model.paramUpperBound 1 ZAP
model.paramLowerBound 1 zap
model.paramUpperBound 5 nope'''.splitlines()
    parse_params(2, lines)
and it emits
Names = []
Params = []
Optimize = []
Upper = ['zap']
Lower = ['ZAP']
which I think is what you want (if some details must differ, please indicate exactly what they are and let's see if we can fix it).
The two key ideas are: use a dict instead of lots of ifs; use a re to match "any of the following possibilities" with parenthesized groups in the re's pattern to catch the bits of interest (the keyword after model., the integer number after that, and the "value" which is the rest of the line) instead of lots of if x in y checks and string manipulation.
There is a lot of duplication there, and if you ever add another key or param, you're going to have to add it in many places, which leaves you ripe for errors. What you want to do is pare down all of the places you have repeated things and use some sort of data model, such as a dict.
Some others have provided some excellent examples, so I'll just leave my answer here to give you something to think about.
Are you sure that parse_params is the bottle-neck? Have you profiled your app?
import re
from collections import defaultdict

names = ("paramName paramValue optimizeParam "
         "paramLowerBound paramUpperBound".split())
stmt_regex = re.compile(r'model\.(%s)\s+(\d+)\s+(.*)' % '|'.join(names))

def parse_params(num_of_params, lines):
    stmts = defaultdict(list)
    for m in (stmt_regex.match(s) for s in lines):
        if m and 1 <= int(m.group(2)) <= num_of_params:
            stmts[m.group(1)].append(m.group(3).rstrip())
    for k, v in stmts.iteritems():
        print "%s = %s" % (k, ' '.join(v))
The code given in the OP does multiple tests per line to try to match against the expected set of values, each of which is being constructed on the fly. Rather than construct paramValue1, paramValue2, etc. for each line, we can use a regular expression to try to do the matching in a cheaper (and more robust) manner.
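A quick run of the version above with a few made-up lines (dict iteration order is arbitrary here, so the print order may vary):

lines = [
    "model.paramName 1 foo",
    "model.paramUpperBound 1 ZAP",
    "model.paramName 2 bar",
]
parse_params(2, lines)
# paramName = foo bar
# paramUpperBound = ZAP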
Here's my code snippet, drawing from some ideas that have already been posted. This lets you add a new keyword to the key_to_collection dictionary and not have to change anything else.
import re

def parse_params(num_of_params, lines):
    pattern = re.compile(r"""
        model\.
        (\w+)     # keyword
        [ ]+      # whitespace
        (\d+)     # index to keyword
        [ ]+      # whitespace
        (.+)      # value
        """, re.VERBOSE)
    key_to_collection = {
        "paramName": names,
        "paramValue": params,
        "optimizeParam": optimize,
        "paramLowerBound": lower,
        "paramUpperBound": upper,
    }
    for line in lines:
        match = pattern.match(line)
        if not match:
            print "Invalid line: " + line
        elif match.group(1) not in key_to_collection:
            print "Invalid key: " + line
        # Not sure if you really care about enforcing this
        elif int(match.group(2)) > num_of_params:
            print "Invalid param: " + line
        else:
            key_to_collection[match.group(1)].append(match.group(3))
Full disclosure: I have not compiled/tested this.
It can certainly be made more efficient. But, to be honest, unless this function is called hundreds of times a second, or works on thousands of lines, is it necessary?
I would be more concerned about making it clear what is happening... currently, I'm far from clear on that aspect.
Just eyeballing it, the input seems to look like this:
model.paramName 1 A model.paramValue 1 B model.optimizeParam 1 C model.paramLowerBound 1 D model.paramUpperBound 1 E model.paramName 2 F model.paramValue 2 G model.optimizeParam 2 H model.paramLowerBound 2 I model.paramUpperBound 2 J
And your desired output seems to be something like:
Names = AF
Params = BG
etc...
Now, since my input certainly doesn't match yours, the output is likely off too, but I think I have the gist.
There are a few points. First, does it matter how many parameters are passed to the function? For example, if the input has two sets of parameters, do I just want to read both, or is it necessary to allow the function to only read one? For example, your code allows me to call parse_params(1,1) and have it only read parameters ending in a 1 from the same input. If that's not actually a requirement, you can skip a large chunk of the code.
Second, is it important to ONLY read the given parameters? If I, for example, have a parameter called 'paramFoo', is it bad if I read it? You can also simplify the procedure by just grabbing all parameters regardless of their name, and extracting their value.
import re

def parse_params(input):
    parameter_list = {}
    param = re.compile(r"model\.([^ ]+) [0-9]+ ([^ ]+)")
    each_parameter = param.finditer(input)
    for match in each_parameter:
        key = match.group(1)
        value = match.group(2)
        if key not in parameter_list:
            parameter_list[key] = []
        parameter_list[key].append(value)
    return parameter_list
The output, in this instance, will be something like this:
{'paramName':[A, F], 'paramValue':[B, G], 'optimizeParam':[C, H], etc...}
Notes: I don't know Python well, I'm a Ruby guy, so my syntax may be off. Apologies.
