python: replacing regex with BNF or pyparsing - python

I am parsing a relatively simple text, where each line describes a game unit. I have little knowledge of parsing techniques, so I used the following ad hoc solution:
import collections
import re

class ParserError(Exception):
    pass

class Unit:
    # rules is an ordered dictionary of tagged regexes, intended to be applied in the given order
    # the group named V corresponds to the value (if any) for that particular tag
    rules = (
        ('Level', r'Lv. (?P<V>\d+)'),
        ('DPS', r'DPS: (?P<V>\d+)'),
        ('Type', r'(?P<V>Tank|Infantry|Artillery)'),
        # the XXX will be expanded into a list of valid traits
        # note: (XXX| )* wouldn't work; it will match the first space it finds,
        # and stop at that if it's in front of something other than a trait
        ('Traits', r'(?P<V>(XXX)(XXX| )*)'),
        # flavor text, if any, ends with a dot
        ('FlavorText', r'(?P<V>.*\."?$)'),
    )
    rules = collections.OrderedDict(rules)
    traits = '|'.join(['All-Terrain', 'Armored', 'Anti-Aircraft', 'Motorized'])
    rules['Traits'] = re.sub('XXX', traits, rules['Traits'])
    for x in rules:
        rules[x] = re.sub('<V>', '<'+x+'>', rules[x])
        rules[x] = re.compile(rules[x])

    def __init__(self, data):
        # data looks like this:
        # Lv. 5 Tank DPS: 55 Motorized Armored
        for field, regex in Unit.rules.items():
            data = regex.sub(self.parse, data, 1)
        # anything other than leftover whitespace means part of the input was not recognized
        if data.strip():
            raise ParserError('Could not parse part of the input: ' + data)

    def parse(self, m):
        if len(m.groupdict()) != 1:
            raise Exception('Expected a single named group')
        field, value = m.groupdict().popitem()
        setattr(self, field, value)
        return ''
It works fine, but I feel I reached the limit of regex power. Specifically, in the case of Traits, the value ends up being a string that I need to split and convert into a list at a later point: e.g., obj.Traits would be set to 'Motorized Armored' in this code, but in a later function changed to ('Motorized', 'Armored').
I'm thinking of converting this code to use either EBNF or pyparsing grammar or something like that. My goals are:
make this code neater and less error-prone
avoid the ugly treatment of the case with a list of values (where I need to do a replacement inside the regex first, and later post-process the result to convert a string into a list)
What would be your suggestions about what to use, and how to rewrite the code?
P.S. I skipped some parts of the code to avoid clutter; if I introduced any errors in the process, sorry - the original code does work :)

I started to write up a coaching guide for pyparsing, but looking at your rules, they translate pretty easily into pyparsing elements themselves, without dealing with EBNF, so I just cooked up a quick sample:
from pyparsing import Word, nums, oneOf, Group, OneOrMore, Regex, Optional
integer = Word(nums)
level = "Lv." + integer("Level")
dps = "DPS:" + integer("DPS")
type_ = oneOf("Tank Infantry Artillery")("Type")
traits = Group(OneOrMore(oneOf("All-Terrain Armored Anti-Aircraft Motorized")))("Traits")
flavortext = Regex(r".*\.$")("FlavorText")
rule = (Optional(level) & Optional(dps) & Optional(type_) &
        Optional(traits) & Optional(flavortext))
I included the Regex example so you could see how a regular expression could be dropped in to an existing pyparsing grammar. The composition of rule using '&' operators means that the individual items could be found in any order (so the grammar takes care of the iterating over all the rules, instead of you doing it in your own code). Pyparsing uses operator overloading to build up complex parsers from simple ones: '+' for sequence, '|' and '^' for alternatives (first-match or longest-match), and so on.
Here is how the parsed results would look - note that I added results names, just as you used named groups in your regexen:
data = "Lv. 5 Tank DPS: 55 Motorized Armored"
parsed_data = rule.parseString(data)
print(parsed_data.dump())
print(parsed_data.DPS)
print(parsed_data.Type)
print(' '.join(parsed_data.Traits))
prints:
['Lv.', '5', 'Tank', 'DPS:', '55', ['Motorized', 'Armored']]
- DPS: 55
- Level: 5
- Traits: ['Motorized', 'Armored']
- Type: Tank
55
Tank
Motorized Armored
Please stop by the wiki and see the other examples. You can easy_install to install pyparsing, but if you download the source distribution from SourceForge, there is a lot of additional documentation.

Related

How to replace all T with U in an input string of DNA?

So, the task is quite simple. I just need to replace all "T"s with "U"s in an input string of DNA. I have written the following code:
def transcribe_dna_to_rna(s):
    base_change = {"t":"U", "T":"U"}
    replace = "".join([base_change(n,n) for n in s])
    return replace.upper()
and for some reason, I get the following error code:
'dict' object is not callable
Why is it that my dictionary is not callable? What should I change in my code?
Thanks for any tips in advance!
To correctly convert DNA to RNA nucleotides in string s, use a combination of str.maketrans and str.translate, which replaces thymine with uracil while preserving the case. For example:
s = 'ACTGactgACTG'
s = s.translate(str.maketrans("tT", "uU"))
print(s)
# ACUGacugACUG
Note that in bioinformatics, case (lower or upper) is often important and should be preserved, so keeping both t -> u and T -> U is important. See, for example:
Uppercase vs lowercase letters in reference genome
SEE ALSO:
Character Translation using Python (like the tr command)
Note that there are specialized bioinformatics tools specifically for handling biological sequences.
For example, BioPython offers transcribe:
from Bio.Seq import Seq
my_seq = Seq('ACTGactgACTG')
my_seq = my_seq.transcribe()
print(my_seq)
# ACUGacugACUG
To install BioPython, use conda install biopython or conda create --name biopython biopython.
The TypeError tells you that base_change(n,n) looks like you are trying to use base_change as the name of a function, when in fact it is a dictionary.
I guess what you wanted to say was
def transcribe_dna_to_rna(s):
    base_change = {"t":"U", "T":"U"}
    replace = "".join([base_change.get(n, n) for n in s])
    return replace.upper()
where the function is the .get(x, y) method of the dictionary, which returns the value for the key in x if it is present, and otherwise y (so in this case, return the original n if it's not in the dictionary).
But this is overcomplicating things; Python very easily lets you replace characters in strings.
def transcribe_dna_to_rna(s):
    return s.upper().replace("T", "U")
(Stole the reordering to put the .upper() first from @norie's answer; thanks!)
If your real dictionary was much larger, your original attempt might make more sense, as long chains of .replace().replace().replace()... are unattractive and eventually inefficient when you have a lot of them.
In Python 3, use str.translate:
dna = "ACTG"
rna = dna.translate(str.maketrans("T", "U")) # "ACUG"
Change s to upper and then do the replacement.
def transcribe_dna_to_rna(s):
    return s.upper().replace("T", "U")

Fuzzy Searching a Column in Pandas

Is there a way to search for a value in a dataframe column using FuzzyWuzzy or similar library?
I'm trying to find a value in one column that corresponds to the value in another while taking fuzzy matching into account.
For example, if I have State Names in one column and State Codes in another, how would I find the state code for Florida, which is FL, while catering for abbreviations like "Flor"?
So in other words, I want to find a match for a State Name corresponding to "Flor" and get the corresponding State Code "FL".
Any help is greatly appreciated.
If the abbreviations are all prefixes, you can use the .startswith() string method against either the short or long version of the state.
>>> test_value = "Flor"
>>> test_value.upper().startswith("FL")
True
>>> "Florida".lower().startswith(test_value.lower())
True
However, if you have more complex abbreviations, difflib.get_close_matches will probably do what you want!
>>> import pandas as pd
>>> import difflib
>>> df = pd.DataFrame({"states": ("Florida", "Texas"), "st": ("FL", "TX")})
>>> df
    states  st
0  Florida  FL
1    Texas  TX
>>> difflib.get_close_matches("Flor", df["states"].to_list())
['Florida']
>>> difflib.get_close_matches("x", df["states"].to_list(), cutoff=0.2)
['Texas']
>>> df["st"][df.index[df["states"]=="Texas"]].iloc[0]
'TX'
You will probably want to try/except IndexError around reading the first member of the returned list from difflib and possibly tweak the cutoff to get fewer false matches with close states (perhaps offer all the states as possibilities to some user or require more letters for close states).
You may also see the best results combining the two; testing prefixes first before trying the fuzzy match.
Putting it all together
def state_from_partial(test_text, df, col_fullnames, col_shortnames):
    if len(test_text) < 2:
        raise ValueError("must have at least 2 characters")
    # if there's exactly two characters, try to directly match short name
    # (.values is needed: `in` on a Series tests the index, not the values)
    if len(test_text) == 2 and test_text.upper() in df[col_shortnames].values:
        return test_text.upper()
    states = df[col_fullnames].to_list()
    match = None
    # this will definitely fail at least for states starting with M or New
    #for state in states:
    #    if state.lower().startswith(test_text.lower()):
    #        match = state
    #        break  # leave loop and prepare to find the prefix
    if not match:
        try:  # see if there's a fuzzy match
            match = difflib.get_close_matches(test_text, states)[0]  # cutoff=0.6
        except IndexError:
            pass  # consider matching against a list of problematic states with different cutoff
    if match:
        return df[col_shortnames][df.index[df[col_fullnames]==match]].iloc[0]
    raise ValueError("couldn't find a state matching partial: {}".format(test_text))
Beware of states which start with 'New' or 'M' (and probably others), which are all pretty close and will probably want special handling. Testing will do wonders here.
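The "prefix first, then fuzzy" idea can also be sketched without pandas, using only the standard library. This is a minimal sketch, not the answer's own code: the STATES mapping and the code_for name are made up for illustration, and an ambiguous prefix such as "New" is rejected rather than guessed.

```python
import difflib

# Hypothetical sample data; a real table would hold all the states.
STATES = {"Florida": "FL", "Texas": "TX", "New York": "NY", "New Jersey": "NJ"}

def code_for(partial, states=STATES):
    """Return the state code for a partial name: unique prefix first, fuzzy match second."""
    lowered = partial.lower()
    prefix_hits = [name for name in states if name.lower().startswith(lowered)]
    if len(prefix_hits) == 1:
        return states[prefix_hits[0]]
    if len(prefix_hits) > 1:
        # e.g. "New" matches both New York and New Jersey
        raise ValueError("ambiguous prefix: {}".format(partial))
    # No prefix match: fall back to difflib (default cutoff of 0.6)
    close = difflib.get_close_matches(partial, list(states), n=1)
    if close:
        return states[close[0]]
    raise ValueError("couldn't find a state matching partial: {}".format(partial))

print(code_for("Flor"))    # prefix match
print(code_for("Florda"))  # fuzzy match on a misspelling
```

Raising on an ambiguous prefix is one possible design choice; offering all candidates to the user would be another.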

Trying to generate all sentences of a simple formal grammar

I am new to python and trying to generate all sentences possible in the grammar.
Here is the grammar:
#set of non terminals
N = ('<subject>', '<predicate>', '<noun phrase>', '<noun>', '<article>', '<verb>', '<direct object>')
#set of terminals
T = ('the', 'boy', 'dog', 'bit')
#productions
P = [ ('Sigma', ['<subject>', '<predicate>']), \
('<subject>', ['<noun phrase>']), \
('<predicate>', ['<verb>']), \
('<predicate>', ['<verb>','<direct object>']), \
('<noun phrase>', ['<article>','<noun>']), \
('<direct object>', ['<noun phrase>']), \
('<noun>', ['boy']), \
('<noun>', ['dog']), \
('<article>', ['the']), \
('<verb>', ['bit']) ]
Here is my attempt, I am using a queue class to implement it methodically,
# language defined by the previous grammar.
Q = Queue()
Q.enqueue(['Sigma'])
found = 0
while 0 < len(Q):
    print "One while loop done"
    # Get the next sentential form
    sf = Q.dequeue()
    sf1 = [y for y in sf]
    for production in P:
        for i in range(len(sf1)):
            if production[0] == sf1[i]:
                sf[i:i+1] = [x for x in production[1]]
                Q.enqueue(sf)
                Q.printQ()
I am getting an infinite loop, and also I am facing some issue with shallow-deep copy, if I change one copy of sf, everything in queue changes too. Any help is appreciated; any directions or tips would be great.
Here is the expected output:
The dog bit the boy
The boy bit the dog
The boy bit the boy
The dog bit the dog
The dog bit
The boy bit
I am facing some issue with shallow-deep copy, if I change one copy of sf, everything in queue changes too
Yes. In Python, a list is an object with its own identity. So:
Q.enqueue(['Sigma'])
creates a (one-element) list and enqueues a reference to it.
sf = Q.dequeue()
pops that reference from Q and assigns it to variable 'sf'.
sf[i:i+1] = ...
makes a change to that list (the one that 'sf' refers to).
Q.enqueue(sf)
enqueues a reference to that same list.
So there's only one list object involved, and Q just contains multiple references to it.
Instead, you presumably want each entry in Q to be a reference to a separate list (sentential form), so you have to create a new list for each call to Q.enqueue.
Depending on how you fix that, there might or might not be other problems in the code. Consider:
(1) Each sentence has multiple derivations, and you only need to 'find' one (e.g., the leftmost derivation).
(2) In general, though not in your example grammar, a production's RHS might have more than one occurrence of a non-terminal (e.g. if COND then STMT else STMT), and those occurrences need not derive the same sub-forms.
(3) In general, a grammar can generate an infinite set of sentences.
By the way, to copy a list in Python, instead of saying
copy = [x for x in original]
it's simpler to say:
copy = original[:]
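Putting these points together, here is a minimal sketch of one way to fix the loop. It is an illustration, not the asker's Queue class: it uses collections.deque, builds a brand-new list for every enqueued sentential form, and expands only the leftmost non-terminal so that each sentence is derived once. The function name generate_sentences is made up, and the loop only terminates because this particular grammar generates a finite language.

```python
from collections import deque

# Productions copied from the question.
P = [
    ('Sigma', ['<subject>', '<predicate>']),
    ('<subject>', ['<noun phrase>']),
    ('<predicate>', ['<verb>']),
    ('<predicate>', ['<verb>', '<direct object>']),
    ('<noun phrase>', ['<article>', '<noun>']),
    ('<direct object>', ['<noun phrase>']),
    ('<noun>', ['boy']),
    ('<noun>', ['dog']),
    ('<article>', ['the']),
    ('<verb>', ['bit']),
]
NONTERMINALS = {lhs for lhs, _ in P}

def generate_sentences(start='Sigma'):
    """Breadth-first expansion of the leftmost non-terminal in each sentential form."""
    queue = deque([[start]])
    sentences = []
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, or None if the form is a sentence.
        idx = next((i for i, sym in enumerate(form) if sym in NONTERMINALS), None)
        if idx is None:
            sentences.append(' '.join(form))
            continue
        for lhs, rhs in P:
            if lhs == form[idx]:
                # Slicing and + build a new list, so queued forms never share state.
                queue.append(form[:idx] + rhs + form[idx + 1:])
    return sentences

for sentence in generate_sentences():
    print(sentence)
```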
I created a simple grammar that makes it possible to specify different sentences in terms of alternatives and options. Sentences that are described with that grammar can be parsed. The attributed grammar is described using Coco/R, for which there is a Python version (http://www.ssw.uni-linz.ac.at/Coco/#Others). I am more familiar with C#, so I created a C# project here that can work as an example for you: https://github.com/abeham/Sentence-Generator.
For instance, parsing "(This | That) is a [nice] sentence" with the parser of that simple grammar creates four sentences:
* This is a sentence
* This is a nice sentence
* That is a sentence
* That is a nice sentence
Only finite sentences can be created with that grammar since there is no symbol for repetition.
I know that there already exists an accepted answer, but I hope that this answer will also be of value to those, like me, that arrived here looking for a generic solution. At least I didn't find anything like that on the web, which is why I created the github project.

How can I match partial strings / is there a better way?

I am pulling hotel names through the Expedia API and cross referencing results with another travel service provider.
The problem I am encountering is that many of the hotel names appear differently on the Expedia API than they do with the other provider and I cannot figure out a good way to match them.
I am storing the results of both in separate dicts with room rates. So, for example, the results from Expedia on a search for Vilnius in Lithuania might look like this:
expediadict = {'Ramada Hotel & Suites Vilnius': 120, 'Hotel Rinno': 100,
'Vilnius Comfort Hotel': 110}
But the results from the other provider might look like this:
altproviderdict = {'Ramada Vilnius': 120, 'Rinno Hotel': 100,
'Comfort Hotel LT': 110}
The only thing I can think of doing is stripping out all instances of 'Hotel', 'Vilnius', 'LT' and 'Lithuania' and then testing whether part of the expediadict key matches part of an altproviderdict key. This seems messy and not very Pythonic, so I wondered if any of you had any cleaner ideas?
>>> def simple_clean(word):
... return word.lower().replace(" ","").replace("hotel","")
...
>>> a = "Ramada Hotel & Suites Vilnius"
>>> b = "Hotel Ramada Suites Vilnous"
>>> a = simple_clean(a)
>>> b = simple_clean(b)
>>> a
'ramada&suitesvilnius'
>>> b
'ramadasuitesvilnous'
>>> import difflib
>>> difflib.SequenceMatcher(None,a,b).ratio()
0.9230769230769231
Do cleaning and normalization of the words, e.g. remove words like Hotel, The, Resort etc., and convert to lower case without spaces.
Then use a fuzzy string matching algorithm, such as Levenshtein distance or the SequenceMatcher from the difflib module (which is what the example above uses).
This method is pretty raw and just an example, you can enhance it to suit your needs for optimal results.
If you only want to match names when the words appear in the same order, you might want to use some longest common subsequence algorithm like the one used in diff tools, but with words instead of characters or lines.
If order is not important, it's simpler: Put all the words of the name into a set like this:
set(name.split())
and in order to match two names, test the size of the intersection of these two sets. Or test if the symmetric_difference only contains unimportant words.
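As a rough sketch of the set-based approach applied to the dictionaries from the question: the STOPWORDS set and the helper names below are assumptions for illustration, and the "best overlap wins" rule is just one possible way to score the intersection.

```python
# Words treated as unimportant; this list is an assumption, tune it for real data.
STOPWORDS = {"hotel", "vilnius", "lt", "lithuania", "&"}

def significant_words(name):
    """Lower-case the name, split it into words and drop the unimportant ones."""
    return set(name.lower().split()) - STOPWORDS

def match_names(names_a, names_b):
    """Map each name in names_a to the name in names_b sharing the most significant words."""
    matches = {}
    for a in names_a:
        words_a = significant_words(a)
        best = max(names_b, key=lambda b: len(words_a & significant_words(b)))
        if words_a & significant_words(best):  # require at least one shared word
            matches[a] = best
    return matches

expediadict = {'Ramada Hotel & Suites Vilnius': 120, 'Hotel Rinno': 100,
               'Vilnius Comfort Hotel': 110}
altproviderdict = {'Ramada Vilnius': 120, 'Rinno Hotel': 100,
                   'Comfort Hotel LT': 110}

for ours, theirs in match_names(expediadict, altproviderdict).items():
    print(ours, '->', theirs)
```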

How to make unique short URL with Python?

How can I make unique URL in Python a la http://imgur.com/gM19g or http://tumblr.com/xzh3bi25y
When using uuid from python I get a very large one. I want something shorter for URLs.
Edit: Here, I wrote a module for you. Use it. http://code.activestate.com/recipes/576918/
Counting up from 1 will guarantee short, unique URLs. /1, /2, /3 ... etc.
Adding uppercase and lowercase letters to your alphabet will give URLs like those in your question. And you're just counting in base-62 instead of base-10.
Now the only problem is that the URLs come consecutively. To fix that, read my answer to this question here:
Map incrementing integer range to six-digit base 26 max, but unpredictably
Basically the approach is to simply swap bits around in the incrementing value to give the appearance of randomness while maintaining determinism and guaranteeing that you don't have any collisions.
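To illustrate the principle only (this is not the scheme from the linked answer, which is not reproduced here), reversing the bits of the counter within a fixed width is one simple rearrangement that is a bijection, so no two counters can collide. The function name scramble and the 32-bit width are assumptions.

```python
def scramble(n, width=32):
    """Reverse the bits of n within a fixed width; a bijection on range(2**width)."""
    if not 0 <= n < (1 << width):
        raise ValueError("n must fit in {} bits".format(width))
    bits = format(n, '0{}b'.format(width))  # zero-padded binary string
    return int(bits[::-1], 2)

# Consecutive ids map to widely separated values, and the mapping is its own inverse.
for i in range(1, 4):
    print(i, scramble(i), scramble(scramble(i)))
```

Plain bit reversal is easy to guess, so it only hides the sequence from casual inspection.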
I'm not sure most URL shorteners use a random string. My impression is they write the URL to a database, then use the integer ID of the new record as the short URL, encoded base 36 or 62 (letters+digits).
Python code to convert an int to a string in arbitrary bases is here.
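A minimal sketch of such a conversion for base 62, together with the matching decoder. The alphabet order (digits, then lowercase, then uppercase) and the function names are assumptions; any fixed ordering of the 62 characters works as long as encoding and decoding agree.

```python
import string

ALPHABET = string.digits + string.ascii_letters  # 62 characters
BASE = len(ALPHABET)

def encode_base62(n):
    """Convert a non-negative integer to a base-62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, BASE)
        digits.append(ALPHABET[r])
    return ''.join(reversed(digits))

def decode_base62(s):
    """Convert a base-62 string back to the integer it encodes."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n

record_id = 1254298969501          # e.g. a database id or a millisecond timestamp
short = encode_base62(record_id)
print(short, decode_base62(short) == record_id)
```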
Python's short_url is awesome.
Here is an example:
import short_url
id = 20 # your object id
domain = 'mytiny.domain'
shortened_url = "http://{}/{}".format(
    domain,
    short_url.encode_url(id)
)
And to decode the code:
decoded_id = short_url.decode_url(param)
That's it :)
Hope this will help.
Hashids is an awesome tool for this.
Edit:
Here's how to use Hashids to generate a unique short URL with Python:
from hashids import Hashids
pk = 123 # Your object's id
domain = 'imgur.com' # Your domain
hashids = Hashids(salt='this is my salt', min_length=6)
link_id = hashids.encode(pk)
url = 'http://{domain}/{link_id}'.format(domain=domain, link_id=link_id)
This module will do what you want, guaranteeing that the string is globally unique (it is a UUID):
http://pypi.python.org/pypi/shortuuid/0.1
If you need something shorter, you should be able to truncate it to the desired length and still get something that will reasonably probably avoid clashes.
This answer comes pretty late, but I stumbled upon this question when I was planning to create a URL shortener project. Now that I have implemented a fully functional URL shortener (source code at amitt001/pygmy), I am adding an answer here for others.
The basic principle behind any URL shortener is to get an int from the long URL, then use base62 (or base32, etc.) encoding to convert this int to a more readable short URL.
How is this int generated?
Most URL shorteners use an auto-incrementing datastore: the URL is added to the datastore, and the auto-increment id is then base62-encoded.
A sample base62 encoding program:
# Base-62 hash
import string
import time
_BASE = 62
class HashDigest:
    """Base 62 hash library."""
    def __init__(self):
        self.base = string.ascii_letters + string.digits
        self.short_str = ''
    def encode(self, j):
        """Returns the repeated div mod of the number.
        :param j: int
        :return: list
        """
        if j == 0:
            return [j]
        r = []
        dividend = j
        while dividend > 0:
            dividend, remainder = divmod(dividend, _BASE)
            r.append(remainder)
        r = list(reversed(r))
        return r
    def shorten(self, i):
        """
        :param i:
        :return: str
        """
        self.short_str = ""
        encoded_list = self.encode(i)
        for val in encoded_list:
            self.short_str += self.base[val]
        return self.short_str
This is just the partial code showing base62 encoding. Check out the complete base62 encoding/decoding code at core/hashdigest.py
All the links in this answer are shortened with the project I created.
The reason UUIDs are long is because they contain lots of information so that they can be guaranteed to be globally unique.
If you want something shorter, then you'll need to do something like generate a random string, checking whether it is in the universe of already generated strings, and repeating until you get an unused string. You'll also need to watch out for concurrency here (what if the same string gets generated by a separate process before you inserted into the set of strings?).
If you need some help generating random strings in Python, this other question might help.
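The generate-check-repeat loop described above might look like the following minimal sketch. The names new_unique_id and used are made up for illustration, and the in-memory set stands in for whatever store holds the ids already handed out; it does not address the concurrency issue, which a real service would typically solve with a uniqueness constraint in the database.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def new_unique_id(used, length=6, max_attempts=100):
    """Draw random strings until one is found that is not in `used`, then record it."""
    for _ in range(max_attempts):
        candidate = ''.join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in used:
            used.add(candidate)
            return candidate
    raise RuntimeError("could not find an unused id; consider a longer length")

used_ids = set()
print(new_unique_id(used_ids))
print(new_unique_id(used_ids))
```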
It doesn't really matter that this is Python, but you just need a hash function that maps to the length you want. For example, maybe use MD5 and then take just the first n characters. You'll have to watch out for collisions in that case, though, so you might want to pick something a little more robust in terms of collision detection (like using primes to cycle through the space of hash strings).
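A minimal sketch of the "hash and truncate" idea using hashlib from the standard library. The function name short_hash and the default length are assumptions, and the sketch does nothing about collisions, which become more likely the shorter the prefix is.

```python
import hashlib

def short_hash(url, n=7):
    """Return the first n hex characters of the MD5 digest of url."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return digest[:n]

# The same URL always maps to the same short code.
print(short_hash("http://example.com/some/long/path"))
```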
I don't know if you can use this, but we generate content objects in Zope that get unique numeric ids based on current time strings, in millis (eg, 1254298969501)
Maybe you can guess the rest. Using the recipe described here:
How to convert an integer to the shortest url-safe string in Python?, we encode and decode the real id on the fly, with no need for storage. A 13-digit integer is reduced to 7 alphanumeric chars in base 62, for example.
To complete the implementation, we registered a short (xxx.yy) domain name, that decodes and does a 301 redirect for "not found" URLs,
If I was starting over, I would subtract the "starting-over" time (in millis) from the numeric id prior to encoding, then re-add it when decoding. Or else when generating the objects. Whatever. That would be way shorter..
You can generate a random string of length N:
import string
import random
def short_random_string(N:int) -> str:
    return ''.join(random.SystemRandom().choice(
        string.ascii_letters + \
        string.digits) for _ in range(N)
    )
so,
print (short_random_string(10) )
#'G1ZRbouk2U'
all lowercase
print (short_random_string(10).lower() )
#'pljh6kp328'
Try this http://code.google.com/p/tiny4py/ ... It's still under development, but very useful!!
My Goal: Generate a unique identifier of a specified fixed length consisting of the characters 0-9 and a-z. For example:
zcgst5od
9x2zgn0l
qa44sp0z
61vv1nl5
umpprkbt
ylg4lmcy
dec0lu1t
38mhd8i5
rx00yf0e
kc2qdc07
Here's my solution. (Adapted from this answer by kmkaplan.)
import random
class IDGenerator(object):
    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
    def __init__(self, length=8):
        self._alphabet_length = len(self.ALPHABET)
        self._id_length = length
    def _encode_int(self, n):
        # Adapted from:
        # Source: https://stackoverflow.com/a/561809/1497596
        # Author: https://stackoverflow.com/users/50902/kmkaplan
        encoded = ''
        while n > 0:
            n, r = divmod(n, self._alphabet_length)
            encoded = self.ALPHABET[r] + encoded
        return encoded
    def generate_id(self):
        """Generate an ID without leading zeros.
        For example, for an ID that is eight characters in length, the
        returned values will range from '10000000' to 'zzzzzzzz'.
        """
        start = self._alphabet_length**(self._id_length - 1)
        end = self._alphabet_length**self._id_length - 1
        return self._encode_int(random.randint(start, end))
if __name__ == "__main__":
    # Sample usage: Generate ten IDs each eight characters in length.
    idgen = IDGenerator(8)
    for i in range(10):
        print(idgen.generate_id())
