How to match a Python list using regular expressions

I have the following lists in Python: ["john","doe","1","90"] and ["prince","2","95"]. The first numeric field is the id and the second numeric field is the score. I would like to use re in Python to parse out the fields and print them. So far, I only know how to split the fields on commas. Can anyone help?

You'd be better off using a dictionary than a regex (I don't see how a regex would be used here):
{'name': 'John Doe', 'id': '1', 'score': '90'}
Or better yet, use numbers:
{'name': 'John Doe', 'id': 1, 'score': 90}

You don't really need a regular expression here. You can just use isinstance() and slicing.
This should do what you want:
a_list = ['john', 'doe', '1', '90']
for i, elem in enumerate(a_list):
    try:
        elem = int(elem)
    except ValueError:
        pass
    if isinstance(elem, int):
        # first numeric field: everything before it is the name
        names_part = a_list[:i]
        id_and_score = a_list[i:]
        print('name(s): {0}, id: {1}, score: {2}'.format(' '.join(names_part), id_and_score[0], id_and_score[1]))
        break
That said, this solution could be improved if we knew the source of your data. If there is a way to predict the field positions, you can just turn your list into a dict as suggested. If you extract the data yourself, consider building a dict instead of a list in the first place, which saves you from having to do the above.
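The dict suggestion can be sketched as follows. This is only an illustration under one assumption that the question does not confirm: the last two fields are always id and score, and everything before them is the name. The helper name parse_record is hypothetical.

```python
def parse_record(fields):
    """Turn a list such as ['john', 'doe', '1', '90'] into a dict.

    Assumption: the last two fields are always id and score.
    """
    # Star-unpacking collects every leading field into name_parts.
    *name_parts, id_, score = fields
    return {'name': ' '.join(name_parts), 'id': int(id_), 'score': int(score)}


for record in (["john", "doe", "1", "90"], ["prince", "2", "95"]):
    print(parse_record(record))
```

If the field positions are not that predictable, the loop above that looks for the first numeric field is the safer starting point.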

Related

Sort values for both str and int by ranking appearance in a string

I have to sort keywords and values in a string.
This is my attempt:
import re
phrase='$1000 is the price of the car, it is 10 years old. And this sandwish cost me 10.34£'
list1 = re.findall(r'\d*\.?\d+', phrase)  # all the numbers in my phrase, as strings ('1000', '10', '10.34')
list2 = ['car', 'year', 'sandwish']  # all the keywords in the phrase I need to find
joinedlist = list1 + list2  # the combined key elements (numbers and keywords) of my sentence
filter1 = sorted(joinedlist, key=phrase.find)  # sort the key elements by order of appearance in my phrase
print(filter1)
Unfortunately, in some cases, because the "sorted" function works by lexical sorting, numbers are printed in the wrong order. In this case, for example, the output is:
['1000', '10', 'car', 'year', 'sandwish', '10.34']
instead of:
['1000', 'car', '10', 'year', 'sandwish', '10.34']
as the car appears before 10 in the initial phrase.

Lexical sorting has nothing to do with it, because your sorting key is the position in the original phrase; all the sorting is done by numeric values (the indices returned by find). The reason that the '10' is appearing "out of order" is that phrase.find returns the first occurrence of it, which is inside the 1000 part of the string!
Rather than breaking the sentence apart into two lists and then trying to reassemble them with a sort, why not just use a single regex that selects the different kinds of things you want to keep? That way you don't need to re-sort them at all:
>>> re.findall(r'\d*\.?\d+|car|year|sandwish', phrase)
['1000', 'car', '10', 'year', 'sandwish', '10.34']
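To make the first-occurrence point concrete, here is a small sketch using the phrase from the question. It first shows that str.find reports the same index for '1000' and '10', then uses re.finditer so that each match keeps its own position. The variable names hits and tokens are just illustrative.

```python
import re

phrase = '$1000 is the price of the car, it is 10 years old. And this sandwish cost me 10.34£'

# str.find reports the FIRST occurrence, so '10' is found inside '1000'.
print(phrase.find('1000'), phrase.find('10'))

# finditer yields matches left to right, each with its own start offset,
# so no separate sorting step is needed.
pattern = re.compile(r'\d*\.?\d+|car|year|sandwish')
hits = [(m.start(), m.group()) for m in pattern.finditer(phrase)]
tokens = [text for _, text in hits]
print(tokens)
```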

The issue is that '10' and '1000' get the same index from Python's default string lookup (str.find). Since '10' is a substring of '1000', both are found at the same position near the start of the string.
You can keep the approach you are attempting by doing the lookup with a regex instead, using \b word boundaries so that 10 only matches the standalone 10 in your string:
def finder(s):
    if m := re.search(rf'\b{s}\b', phrase):
        return m.span()[0]
    elif m := re.search(rf'\b{s}', phrase):
        return m.span()[0]
    return -1
Test it:
>>> sorted(joinedlist, key=finder)
['1000', 'car', '10', 'year', 'sandwish', '10.34']
It is easier, however, if you turn phrase into a lookup list of words. You will need some treatment for the keyword year vs. years in phrase; you can use r'\d+\.\d+|\w+' to find the words and then str.startswith() to test whether a word is close enough:
pl = re.findall(r'\d+\.\d+|\w+', phrase)
def finder2(s):
    try:  # first try an exact match
        return pl.index(s)
    except ValueError:
        pass  # not found; now try .startswith()
    try:
        return next(i for i, w in enumerate(pl) if w.startswith(s))
    except StopIteration:
        return -1
>>> sorted(joinedlist, key=finder2)
['1000', 'car', '10', 'year', 'sandwish', '10.34']

Arrange list of strings that are divided into 4 parts by the different parts?

I have a list comprised of strings that all follow the same format 'Name%Department%Age'
I would like to order the list by age, then name, then department.
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']
after sorting would output:
['Sarah%English%50', 'John%English%31', 'George%Maths%30', 'John%English%30', 'John%Maths%30']
The closest I have found to what I want is the following (found here: How to sort a list by Number then Letter in python?)
import re
def sorter(s):
    match = re.search(r'([a-zA-Z]*)(\d+)', s)
    return int(match.group(2)), match.group(1)
sorted(alist, key=sorter)
Out[13]: ['1', 'A1', '2', '3', '12', 'A12', 'B12', '17', 'A17', '25', '29', '122']
However, with my input format this only sorted the list in straight alphabetical order.
Any help appreciated,
Thanks.

You are on the right track.
Personally, I:
would first use str.split() to chop the string up into its constituent parts;
would then make the sort key produce a tuple that reflects the desired sort order.
For example:
def key(name_dept_age):
    name, dept, age = name_dept_age.split('%')
    return -int(age), name, dept
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']
print(sorted(alist, key=key))

Use name, department, age = item.split('%') on each item.
Make a dict out of them: {'name': name, 'department': department, 'age': age}
Then sort them using this code:
https://stackoverflow.com/a/1144405/277267
sorted_items = multikeysort(items, ['-age', 'name', 'department'])
Experiment once with that multikeysort function, you will see that it will come in handy in a couple of situations in your programming career.
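The linked multikeysort function is not reproduced here. As a rough stand-in with the same call shape, the sketch below is a hypothetical implementation (not the linked answer's code) that relies on the stability of list.sort: it sorts once per column, least significant column first. It also assumes age is converted to int when the dicts are built, so that ages compare numerically.

```python
from operator import itemgetter


def multikeysort(items, columns):
    """Hypothetical stand-in: sort dicts by several keys; a '-' prefix means descending."""
    result = list(items)
    # Sort by the least significant column first. list.sort is stable
    # (also with reverse=True), so earlier passes survive as tie-breakers.
    for column in reversed(columns):
        descending = column.startswith('-')
        result.sort(key=itemgetter(column.lstrip('-')), reverse=descending)
    return result


alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']

items = []
for entry in alist:
    name, department, age = entry.split('%')
    items.append({'name': name, 'department': department, 'age': int(age)})

sorted_items = multikeysort(items, ['-age', 'name', 'department'])
result = ['%'.join([d['name'], d['department'], str(d['age'])]) for d in sorted_items]
print(result)
```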

Python LOB to List

Using:
cur.execute(SQL)
response = cur.fetchall()  # response contains a LOB object
names = response[0][0].read()
I have the following SQL response as the string names:
'Mike':'Mike'
'John':'John'
'Mike/B':'Mike/B'
As you can see, it comes formatted. It is actually formatted like \\'Mike\\':\\'Mike\\'\n\\'John\\'... and so on.
I want to check whether, for example, Mike is in the list at least once (I don't care how many times, as long as it is at least once).
I would like to have something like this:
l = ['Mike', 'Mike', 'John', 'John', 'Mike/B', 'Mike/B']
so I could simply iterate over the list and ask:
for name in l:
    if 'Mike' == name:
        pass  # do something
Any ideas how I could do that?
Many thanks
Edit:
When I do:
list = names.split()
I get a list that is nearly what I want, but the elements inside still look like this:
list = ["\\'Mike\\':\\'Mike\\'", ...]

names = ["\\'Mike\\':\\'Mike\\'", ...]
for name in names:
    if "Mike" in name:
        print("Mike is here")
The \\' business is caused by MySQL escaping the '.
If you have a list of names, try this:
my_names = ["Tom", "Dick", "Harry"]
names = ["\\'Mike\\':\\'Mike\\'", ...]
for name in names:
    for my_name in my_names:
        if my_name in name:
            print(my_name, "is here")

import re
pattern = re.compile(r"[\n\\:']+")
list_of_names = pattern.split(names)
# ['', 'Mike', 'Mike', 'John', 'John', 'Mike/B', 'Mike/B', '']
# Quick tip: try not to name a list "list", as "list" is a built-in
You can keep your results this way or do a final cleanup to remove the empty strings:
clean_list = list(filter(lambda x: x != '', list_of_names))
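Put together, the split-and-clean approach might look like the sketch below. The sample value of names is an assumption reconstructed from the question (three lines, with literal backslashes before each quote); the real LOB content may differ.

```python
import re

# Assumed sample: raw text as described in the question, with literal backslashes.
names = "\\'Mike\\':\\'Mike\\'\n\\'John\\':\\'John\\'\n\\'Mike/B\\':\\'Mike/B\\'"

# Split on runs of newline, backslash, colon and single quote.
pattern = re.compile(r"[\n\\:']+")

# The split leaves empty strings at both ends; drop them.
clean_list = [part for part in pattern.split(names) if part]
print(clean_list)

# Membership test: is 'Mike' present at least once?
print('Mike' in clean_list)
```

If only the membership check matters, converting clean_list to a set would also remove the duplicates.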

Removing characters from a tuple

I'm using
Users = win32net.NetGroupGetUsers(IP,'none',0),
to get all the local users on a system. The output is a tuple:
(([{'name': u'Administrator'}, {'name': u'Guest'}, {'name': u'Tom'}], 3, 0),)
I want to clean this up so it just prints out "Administrator, Guest, Tom". I tried using strip and replace, but you can't use those on tuples. Is there a way to convert this into a string so I can manipulate it, or is there an even simpler way to go about it?

This should not end with a comma:
Users = win32net.NetGroupGetUsers(IP,'none',0),
The trailing comma turns the result into a single item tuple containing the result, which is itself a tuple.
The data you want is in Users[0].
>>> print Users[0]
[{'name': u'Administrator'}, {'name': u'Guest'}, {'name': u'Tom'}]
To unpack this list of dictionaries we use a generator expression:
Users = win32net.NetGroupGetUsers(IP,'none',0)
print ', '.join(d['name'] for d in Users[0])

', '.join(user['name'] for user in Users[0][0])

input = (([{'name': u'Administrator'}, {'name': u'Guest'}, {'name': u'Tom'}], 3, 0),)
in_list = input[0][0]
names = [x['name'] for x in in_list]
print names
[u'Administrator', u'Guest', u'Tom']
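The indexing can be tried without win32net by using the literal value from the question. The sketch below makes no API call; users is simply the nested tuple copied from the question, including the extra level caused by the trailing comma.

```python
# Sample data copied from the question; no win32net call is made here.
users = (([{'name': 'Administrator'}, {'name': 'Guest'}, {'name': 'Tom'}], 3, 0),)

# users[0] undoes the accidental one-item tuple; its first item is the list of dicts.
entries = users[0][0]

# Pull the 'name' value out of each dict and join them into one string.
line = ', '.join(entry['name'] for entry in entries)
print(line)
```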

Simple way to convert a string to a dictionary

What is the simplest way to convert a string of keyword=values to a dictionary, for example the following string:
name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"
to the following python dictionary:
{'name':'John Smith', 'age':34, 'height':173.2, 'location':'US', 'avatar':':,=)'}
The 'avatar' key is just to show that the strings can contain = and , so a simple 'split' won't do. Any ideas? Thanks!

This works for me:
import re
# s is the input string from the question
# get all the items
matches = re.findall(r'\w+=".+?"', s) + re.findall(r'\w+=[\d.]+', s)
# partition each match at '=' (findall returns plain strings, not match objects)
pairs = [m.split('=', 1) for m in matches]
# use results to make a dict
d = dict(pairs)

I would suggest a lazy way of doing this.
test_string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
eval("dict({})".format(test_string))
{'age': 34, 'location': 'US', 'avatar': ':,=)', 'name': 'John Smith', 'height': 173.2}
Hope this helps someone !
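Since eval runs whatever expression it is given, a variation worth considering is to parse the same "dict(...)" text with the standard ast module and convert each value with ast.literal_eval, which only accepts literals. This is a swapped-in technique, not the answer's method, and the helper name parse_kwargs is hypothetical.

```python
import ast


def parse_kwargs(text):
    """Parse 'key=literal, key=literal' pairs without executing any code."""
    # Wrap the text as a call expression so Python's own parser handles
    # the quoting; .body of an 'eval'-mode tree is the ast.Call node.
    call = ast.parse("dict({})".format(text), mode="eval").body
    # Each keyword argument carries its name (.arg) and a value node;
    # literal_eval accepts a node and rejects anything that is not a literal.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


test_string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
print(parse_kwargs(test_string))
```

A value that is not a plain literal (a function call, for instance) makes ast.literal_eval raise ValueError instead of being executed.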

Edit: since the csv module doesn't deal as desired with quotes inside fields, it takes a bit more work to implement this functionality:
import re
quoted = re.compile(r'"[^"]*"')
class QuoteSaver(object):
    def __init__(self):
        self.saver = dict()
        self.reverser = dict()
    def preserve(self, mo):
        s = mo.group()
        if s not in self.saver:
            self.saver[s] = '"%d"' % len(self.saver)
            self.reverser[self.saver[s]] = s
        return self.saver[s]
    def expand(self, mo):
        return self.reverser[mo.group()]
x = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
qs = QuoteSaver()
y = quoted.sub(qs.preserve, x)
kvs_strings = y.split(',')
kvs_pairs = [kv.split('=') for kv in kvs_strings]
kvs_restored = [(k, quoted.sub(qs.expand, v)) for k, v in kvs_pairs]
def converter(v):
    if v.startswith('"'): return v.strip('"')
    try: return int(v)
    except ValueError: return float(v)
thedict = dict((k.strip(), converter(v)) for k, v in kvs_restored)
for k in thedict:
    print "%-8s %s" % (k, thedict[k])
print thedict
I'm emitting thedict twice to show exactly how and why it differs from the required result; the output is:
age 34
location US
name John Smith
avatar :,=)
height 173.2
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)',
'height': 173.19999999999999}
As you see, the output for the floating point value is as requested when directly emitted with print, but it isn't and cannot be (since there IS no floating point value that would display 173.2 in such a case!-) when the print is applied to the whole dict (because that inevitably uses repr on the keys and values -- and the repr of 173.2 has that form, given the usual issues about how floating point values are stored in binary, not in decimal, etc, etc). You might define a dict subclass which overrides __str__ to specialcase floating-point values, I guess, if that's indeed a requirement.
But, I hope this distraction doesn't interfere with the core idea -- as long as the doublequotes are properly balanced (and there are no doublequotes-inside-doublequotes), this code does perform the required task of preserving "special characters" (commas and equal signs, in this case) from being taken in their normal sense when they're inside double quotes, even if the double quotes start inside a "field" rather than at the beginning of the field (csv only deals with the latter condition). Insert a few intermediate prints if the way the code works is not obvious -- first it changes all "double quoted fields" into a specially simple form ("0", "1" and so on), while separately recording what the actual contents corresponding to those simple forms are; at the end, the simple forms are changed back into the original contents. Double-quote stripping (for strings) and transformation of the unquoted strings into integers or floats is finally handled by the simple converter function.
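The two renderings of the float discussed above can be reproduced directly. The sketch below assumes Python 2.7 or 3.1+, where repr picks the shortest string that round-trips; the 17-significant-digit format reproduces the longer form shown in the output above.

```python
value = 173.2

# 17 significant digits expose the binary approximation that older reprs displayed.
long_form = format(value, '.17g')
print(long_form)

# Assumption: an interpreter whose repr uses the shortest round-tripping string.
short_form = repr(value)
print(short_form)

# Both strings denote the very same stored float.
print(float(long_form) == float(short_form) == value)
```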

Here is a more verbose approach to the problem using pyparsing. Note the parse actions
which do the automatic conversion of types from strings to ints or floats. Also, the
QuotedString class implicitly strips the quotation marks from the quoted value. Finally,
the Dict class takes each 'key = val' group in the comma-delimited list, and assigns
results names using the key and value tokens.
from pyparsing import *
key = Word(alphas)
EQ = Suppress('=')
real = Regex(r'[+-]?\d+\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
qs = QuotedString('"')
value = real | integer | qs
dictstring = Dict(delimitedList(Group(key + EQ + value)))
Now to parse your original text string, storing the results in dd. Pyparsing returns an
object of type ParseResults, but this class has many dict-like features (support for keys(),
items(), in, etc.), or can emit a true Python dict by calling asDict(). Calling dump()
shows all of the tokens in the original parsed list, plus all of the named items. The last
two examples show how to access named items within a ParseResults as if they were attributes of
a Python object.
text = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
dd = dictstring.parseString(text)
print dd.keys()
print dd.items()
print dd.dump()
print dd.asDict()
print dd.name
print dd.avatar
Prints:
['age', 'location', 'name', 'avatar', 'height']
[('age', 34), ('location', 'US'), ('name', 'John Smith'), ('avatar', ':,=)'), ('height', 173.19999999999999)]
[['name', 'John Smith'], ['age', 34], ['height', 173.19999999999999], ['location', 'US'], ['avatar', ':,=)']]
- age: 34
- avatar: :,=)
- height: 173.2
- location: US
- name: John Smith
{'age': 34, 'height': 173.19999999999999, 'location': 'US', 'avatar': ':,=)', 'name': 'John Smith'}
John Smith
:,=)

The following code produces the correct behavior, but is just a bit long! I've added a space in the avatar to show that it deals well with commas and spaces and equal signs inside the string. Any suggestions to shorten it?
import hashlib
string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
strings = {}
def simplify(value):
    try:
        return int(value)
    except:
        return float(value)
while True:
    try:
        p1 = string.index('"')
        p2 = string.index('"',p1+1)
        substring = string[p1+1:p2]
        key = hashlib.md5(substring).hexdigest()
        strings[key] = substring
        string = string[:p1] + key + string[p2+1:]
    except:
        break
d = {}
for pair in string.split(', '):
    key, value = pair.split('=')
    if value in strings:
        d[key] = strings[value]
    else:
        d[key] = simplify(value)
print d

Here is an approach with eval. I consider it unreliable, but it works for your example.
>>> import re
>>>
>>> s='name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
>>>
>>> eval("{"+re.sub('(\w+)=("[^"]+"|[\d.]+)','"\\1":\\2',s)+"}")
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)', 'height': 173.19999999999999}
>>>
Update:
Better to use the one pointed out by Chris Lutz in the comments. I believe it's more reliable, because it should work even if there are (single or double) quotes in the dict values.

Here's a somewhat more robust version of the regexp solution:
import re
keyval_re = re.compile(r'''
\s* # Leading whitespace is ok.
(?P<key>\w+)\s*=\s*( # Search for a key followed by..
(?P<str>"[^"]*"|\'[^\']*\')| # a quoted string; or
(?P<float>\d+\.\d+)| # a float; or
(?P<int>\d+) # an int.
)\s*,?\s* # Handle comma & trailing whitespace.
|(?P<garbage>.+) # Complain if we get anything else!
''', re.VERBOSE)
def handle_keyval(match):
    if match.group('garbage'):
        raise ValueError("Parse error: unable to parse: %r" %
                         match.group('garbage'))
    key = match.group('key')
    if match.group('str') is not None:
        return (key, match.group('str')[1:-1]) # strip quotes
    elif match.group('float') is not None:
        return (key, float(match.group('float')))
    elif match.group('int') is not None:
        return (key, int(match.group('int')))
It automatically converts floats & ints to the right type; handles single and double quotes; handles extraneous whitespace in various locations; and complains if a badly formatted string is supplied:
>>> s='name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
>>> print dict(handle_keyval(m) for m in keyval_re.finditer(s))
{'age': 34, 'location': 'US', 'name': 'John Smith', 'avatar': ':,=)', 'height': 173.19999999999999}

Do it step by step:
d = {}
mystring = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'
s = mystring.split(", ")
for item in s:
    i = item.split("=", 1)
    d[i[0]] = i[-1]
print d

I think you just need to set maxsplit=1, for instance the following should work.
string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
newDict = dict(map( lambda(z): z.split("=",1), string.split(", ") ))
Edit (see comment):
I didn't notice that ", " was a value under avatar, the best approach would be to escape ", " wherever you are generating data. Even better would be something like JSON ;). However, as an alternative to regexp, you could try using shlex, which I think produces cleaner looking code.
import shlex
string = 'name="John Smith", age=34, height=173.2, location="US", avatar=":, =)"'
lex = shlex.shlex ( string )
lex.whitespace += "," # Default whitespace doesn't include commas
lex.wordchars += "." # Word char should include . to catch decimal
words = [ x for x in iter( lex.get_token, '' ) ]
newDict = dict ( zip( words[0::3], words[2::3]) )

Always comma separated? Use the CSV module to split the line into parts (not checked):
import csv
import cStringIO
parts=csv.reader(cStringIO.StringIO(<string to parse>)).next()
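For reference, a Python 3 rendering of the same idea could look like the sketch below; cStringIO and the .next() method exist only in Python 2, so io.StringIO and the built-in next() are used instead. Running it on the string from the question also shows the limitation: the quotes here start in the middle of a field, so csv treats them as ordinary characters and the comma inside the avatar value still splits the field.

```python
import csv
import io

text = 'name="John Smith", age=34, height=173.2, location="US", avatar=":,=)"'

# csv.reader accepts any iterable of lines; next() pulls the first parsed row.
parts = next(csv.reader(io.StringIO(text)))

# repr makes the leading spaces and the leftover quotes visible.
for part in parts:
    print(repr(part))
```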
