I'm new to Python and find myself in the following situation. I work with equations stored as strings, such as:
>>> my_eqn = "A + 3.1B - 4.7D"
I'm looking to parse the string and store the numeric and alphabetic parts separately in two lists, or some other container. A (very) rough sketch of what I'm trying to put together would look like:
>>> foo = parse_and_separate(my_eqn)
>>> foo.numbers
[1.0, 3.1, -4.7]
>>> foo.letters
['A', 'B', 'D']
Any resources/references/pointers would be much appreciated.
Thanks!
Update
Here's one solution I came up with that's probably overly complicated but seems to work. Thanks again to all those who responded!
import re
my_eqn = "A + 3.1B - 4.7D"
# add a "1" in front of single letters (flags must be passed by keyword;
# the fourth positional argument of re.sub is count)
my_eqn = re.sub(r"(\b[A-Z]\b)", "1" + r"\1", my_eqn, flags=re.I)
# store the coefficients and variable names separately via regex
variables = re.findall(r"[a-z]", my_eqn, re.I)
# group the alternation so the optional sign applies to integers as well
coeffs = re.findall(r"[-+]?\s?(?:\d*\.\d+|\d+)", my_eqn)
# strip out '+' characters and white space
coeffs = [s.strip('+').replace(' ', '') for s in coeffs]
# coefficients should be floats
coeffs = list(map(float, coeffs))
# confirm answers
print(variables)
print(coeffs)
This works for your simple scenario if you don't want to use any non-standard Python libraries.
class Foo:
    def __init__(self):
        self.numbers = []
        self.letters = []

def split_symbol(symbol, operator):
    name = ''
    multiplier = ''
    for letter in symbol:
        if letter.isalpha():
            name += letter
        else:
            multiplier += letter
    if not multiplier:
        multiplier = '1.0'
    if operator == '-':
        multiplier = operator + multiplier
    return name, float(multiplier)

def parse_and_separate(my_eqn):
    foo = Foo()
    equation = my_eqn.split()
    operator = ''
    for symbol in equation:
        if symbol in ['-', '+']:
            operator = symbol
        else:
            letter, number = split_symbol(symbol, operator)
            foo.numbers.append(number)
            foo.letters.append(letter)
    return foo

foo = parse_and_separate("A + 3.1B - 4.7D + 45alpha")
print(foo.numbers)
print(foo.letters)
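If a single pass is preferred, the sign, coefficient, and variable name of each term can be captured by one pattern. This is only a minimal sketch and not taken from the answers above: the names TERM and parse_terms are made up for illustration, and it assumes every term has the shape "optional sign, optional number, letters".

```python
import re

# Hypothetical helper: one regex captures sign, coefficient and name per term.
TERM = re.compile(r"([+-]?)\s*(\d+(?:\.\d*)?|\.\d+)?\s*([A-Za-z]+)")


def parse_terms(eqn):
    """Return (numbers, letters) for an expression such as 'A + 3.1B - 4.7D'."""
    numbers, letters = [], []
    for sign, coeff, name in TERM.findall(eqn):
        # A missing coefficient means an implicit 1.
        numbers.append(float(sign + (coeff or "1")))
        letters.append(name)
    return numbers, letters


numbers, letters = parse_terms("A + 3.1B - 4.7D + 45alpha")
print(numbers)  # [1.0, 3.1, -4.7, 45.0]
print(letters)  # ['A', 'B', 'D', 'alpha']
```

Anything that does not fit the assumed term shape (parentheses, exponents, products of variables) is silently skipped, so this only suits equally simple input.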
Related
I am working on an NLP text-processing project in Python in which I need to clean the data before feature extraction.
I clean special characters and separate numbers from letters using regex operations, but I do this in many separate steps, which makes it slow. I want to do it in as few operations as possible, or in some faster way.
My code is as follows:
def remove_special_char(x):
    if type(x) is str:
        x = x.replace('-', ' ').replace('(', ',').replace(')', ',')
        x = re.compile(r"\s+").sub(" ", x).strip()
        x = re.sub(r'[^A-Z a-z 0-9-,.x]+', '', x).lower()
        x = re.sub(r"([0-9]+(\.[0-9]+)?)",r" \1 ", x).strip()
        x = x.replace(",,",",")
        return x
    else:
        return x
Can anyone help me?
In addition to preparing the compiled patterns outside the function, you can gain some performance by using translate for all the one-to-one or one-to-none conversions:
import string
mappings = {'-':' ', '(':',', ')':','} # add more mappings as needed
mappings.update({ c:' ' for c in string.whitespace }) # white spaces become spaces
mappings.update({c:c.lower() for c in string.ascii_uppercase}) # set to lowercase
specialChars = str.maketrans(mappings)
def remove_special_char(x):
    x = x.translate(specialChars)
    ...
    return x
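For illustration, here is one way the translate table and the remaining regex steps might fit together. This is a sketch under assumptions rather than the answerer's code: the names _TABLE, _SPACES, _DISALLOWED and _NUMBER are invented, and the steps are only intended to mirror the function in the question, not to reproduce it character for character.

```python
import re
import string

# One-to-one conversions handled by a single translate() call.
_mapping = {"-": " ", "(": ",", ")": ","}
_mapping.update({c: " " for c in string.whitespace})
_mapping.update({c: c.lower() for c in string.ascii_uppercase})
_TABLE = str.maketrans(_mapping)

# Patterns compiled once, outside the function.
_SPACES = re.compile(r" +")
_DISALLOWED = re.compile(r"[^a-z0-9 ,.]+")
_NUMBER = re.compile(r"([0-9]+(?:\.[0-9]+)?)")


def remove_special_char(x):
    if not isinstance(x, str):
        return x
    x = x.translate(_TABLE)              # map, lowercase, unify whitespace
    x = _SPACES.sub(" ", x).strip()      # collapse runs of spaces
    x = _DISALLOWED.sub("", x)           # drop everything else
    x = _NUMBER.sub(r" \1 ", x).strip()  # put spaces around numbers
    return x.replace(",,", ",")


print(remove_special_char("Foo-Bar (12.5kg)"))  # foo bar , 12.5 kg,
```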
You have different replacement strings for the various operations, so you can't really merge them.
You can pre-compile all of the regexps beforehand though, but I suspect it won't make much of a difference:
paren_re = re.compile(r"[()]")
whitespace_re = re.compile(r"\s+")
ident_re = re.compile(r"[^A-Za-z0-9-,.x]+")
number_re = re.compile(r"([0-9]+(\.[0-9]+)?)")
def remove_special_char(x):
    if isinstance(x, str):
        x = x.replace("-", " ")
        x = paren_re.sub(",", x)
        x = whitespace_re.sub(" ", x)
        x = ident_re.sub("", x).lower()
        x = number_re.sub(r" \1 ", x).strip()
        x = x.replace(",,", ",")
    return x
Have you profiled your program to see that this is the bottleneck?
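A quick way to answer that is to time the function on representative input with the standard timeit module. The helper name time_call below is made up for this sketch, and str.lower merely stands in for whichever cleaning function is being measured.

```python
import timeit


def time_call(func, arg, repeats=10000):
    """Return the total seconds taken by `repeats` calls of func(arg)."""
    return timeit.timeit(lambda: func(arg), number=repeats)


sample = "Foo-Bar (12.5kg)  baz"
# Replace str.lower with each candidate implementation and compare the totals.
print(time_call(str.lower, sample))
```

If the cleaning step turns out to be a small share of the total run time, merging the regex operations is unlikely to help much.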
Background
I have some large text files used in an automation script for audio tuning. Each line in the text file looks roughly like:
A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]] BANANA # BANANA
The text gets fed to an old command-line program which searches for keywords, and swaps them out. Sample output would be:
A[0] + B[100] - C[0x1000] [[0]] 0 # 0
A[2] + B[200] - C[0x100A] [[2]] 0 # 0
Problem
Sometimes, text files have keywords that are meant to be left untouched (i.e. cases where we don't want "BANANA" substituted). I'd like to modify the text files to use some kind of keyword/delimiter that is unlikely to pop up in normal circumstances, i.e:
A[#1] + B[#2] - C[#3] [[#1]] #1 # #1
Question
Does python's text file parser have any special indexing/escape sequences I could use instead of simple keywords?
Use a regular expression replacement function with a dictionary.
Match everything between brackets (using a negated character class so the match cannot contain brackets itself) and replace it with the value from the dict, keeping the original text if the key is not found:
import re
d = {"PINEAPPLE": "20", "CHERRY": "100", "BANANA": "400"}
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
print(re.sub(r"\[([^\[\]]*)\]", lambda m: "[{}]".format(d.get(m.group(1), m.group(1))), s))
prints:
A[400] + B[20] - C[100] [[400]]
You can use re.sub to perform the substitution. This answer creates a list of randomized values to demonstrate; the list can be replaced with the data you are using:
import re
import random
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
new_s = re.sub(r'(?<=\[)[a-zA-Z0-9]+(?=\])', '{}', s)
random_data = [[random.randint(1, 2000) for i in range(4)] for _ in range(10)]
final_results = [new_s.format(*i) for i in random_data]
for command in final_results:
    print(command)
Output:
A[51] + B[134] - C[864] [[1344]]
A[468] + B[1761] - C[1132] [[1927]]
A[1236] + B[34] - C[494] [[1009]]
A[1330] + B[1002] - C[1751] [[1813]]
A[936] + B[567] - C[393] [[560]]
A[1926] + B[936] - C[906] [[1596]]
A[1532] + B[1881] - C[871] [[1766]]
A[506] + B[1505] - C[1096] [[491]]
A[290] + B[1841] - C[664] [[38]]
A[1552] + B[501] - C[500] [[373]]
Just use
\[([^][]+)\]
And replace this with the desired result, e.g. 123.
Broken down, this says
\[ # opening bracket
([^][]+) # capture anything not brackets, 1+ times
\] # closing bracket
See a demo on regex101.com.
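As a small illustration of that pattern in Python (the variable names are arbitrary), note that the match includes the surrounding brackets, so the replacement text has to put them back:

```python
import re

rx = re.compile(r"\[([^][]+)\]")
line = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"

# Group 1 holds the text between the brackets.
keywords = rx.findall(line)
print(keywords)  # ['BANANA', 'PINEAPPLE', 'CHERRY', 'BANANA']

# The whole match, brackets included, is replaced.
replaced = rx.sub("[123]", line)
print(replaced)  # A[123] + B[123] - C[123] [[123]]
```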
For your changed requirements, you could use an OrderedDict:
import re
from collections import OrderedDict
rx = re.compile(r'\[([^][]+)\]')
d = OrderedDict()
def replacer(match):
    item = match.group(1)
    d[item] = 1
    return '[#{}]'.format(list(d.keys()).index(item) + 1)
string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
string = rx.sub(replacer, string)
print(string)
Which yields
A[#1] + B[#2] - C[#3] [[#1]]
The idea here is to put every (potentially) new item in the dict, then look up its index. An OrderedDict remembers insertion order.
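A variation on the same idea, sketched here as an assumption rather than as part of the answer, stores the number itself as the dict value so that no index search is needed. The function name number_keywords is hypothetical.

```python
import re


def number_keywords(text):
    """Replace each bracketed keyword with [#n], numbering by first appearance."""
    seen = {}

    def replacer(match):
        # setdefault returns the existing number, or stores the next free one.
        idx = seen.setdefault(match.group(1), len(seen) + 1)
        return "[#{}]".format(idx)

    return re.sub(r"\[([^][]+)\]", replacer, text), seen


result, mapping = number_keywords("A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]")
print(result)   # A[#1] + B[#2] - C[#3] [[#1]]
print(mapping)  # {'BANANA': 1, 'PINEAPPLE': 2, 'CHERRY': 3}
```

Returning the mapping alongside the text makes it possible to translate the placeholders back to the original keywords later.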
For the sake of academic completeness, you could do it all on your own as well:
import re
class Replacer:
    rx = re.compile(r'\[([^][]+)\]')
    keywords = []

    def do_replace(self, match):
        idx = self.lookup(match.group(1))
        return '[#{}]'.format(idx + 1)

    def replace(self, string):
        return self.rx.sub(self.do_replace, string)

    def lookup(self, item):
        for idx, key in enumerate(self.keywords):
            if key == item:
                return idx
        self.keywords.append(item)
        return len(self.keywords)-1

string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
rpl = Replacer()
string = rpl.replace(string)
print(string)
This can also be done using pyparsing.
This parser defines noun as an uppercase word within square brackets, then defines one line of input, complete, as a sequence of them.
To replace the identified items, define a class derived from dict so that any key not in the dict is returned unchanged.
>>> import pyparsing as pp
>>> noun = pp.Word(pp.alphas.upper())
>>> between = pp.CharsNotIn('[]')
>>> leftbrackets = pp.OneOrMore('[')
>>> rightbrackets = pp.OneOrMore(']')
>>> stmt = 'A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]'
>>> one = between + leftbrackets + noun + rightbrackets
>>> complete = pp.OneOrMore(one)
>>> complete.parseString(stmt)
(['A', '[', 'BANANA', ']', ' + B', '[', 'PINEAPPLE', ']', ' - C', '[', 'CHERRY', ']', ' ', '[', '[', 'BANANA', ']', ']'], {})
>>> class Replace(dict):
...     def __missing__(self, key):
...         return key
...
>>> replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
>>> new = []
>>> for item in complete.parseString(stmt).asList():
...     new.append(replace[item])
...
>>> ''.join(new)
'A[1] + B[2] - C[CHERRY] [[1]]'
I think it's easier — and clearer — using plex. The snag is that it appears to be available only for Py2. It took me an hour or two to make sufficient conversion work to Py3 to get this.
Just three types of tokens to watch for, then a similar number of branches within a while statement.
from plex import *
from io import StringIO
stmt = StringIO('A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]')
lexicon = Lexicon([
    (Rep1(AnyBut('[]')), 'not_brackets'),
    (Str('['), 'left_bracket'),
    (Str(']'), 'right_bracket'),
])
class Replace(dict):
    def __missing__(self, key):
        return key
replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
scanner = Scanner(lexicon, stmt)
new_statement = []
while True:
    token = scanner.read()
    if token[0] is None:
        break
    elif token[0]=='not_brackets':
        # the token name must match the one declared in the lexicon
        new_statement.append(replace[token[1]])
    else:
        new_statement.append(token[1])
print (''.join(new_statement))
Result:
A[1] + B[2] - C[CHERRY] [[1]]
I have created a function that takes four strings. The first two are long strings that can be anything; the last two are referred to as boundaries. I want to take everything in string1 between the boundaries (boundaries included) and use it to replace everything in string2 between the same boundaries. The segment taken from string1 is removed from it, and the segment it replaces in string2 is discarded. An example of this function is below:
bound('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA') --> returns ('DMA', 'ROOGYOMAD', 'OG', 'MA')
This is the function I have created to do what I wrote above
def bound(st,sz,a,b):
    s1=''.join(st)
    s2=''.join(sz)
    if a in s1 and b in s1 and a in s2 and b in s2:
        f1=s1.find(a)
        l1=s1.find(b)
        f2=s2.find(a)
        l2=s2.find(b)
        blen1 = len(b)
        blen2 = len(b)
        s1_n = s1[:f1] +s1[l1+blen1:]
        # slice to the end of s2, not a single character
        s2_n = s2[:f2] + s1[f1:l1 + blen1] +s2[l2+blen2:]
        return s1_n, s2_n, a, b
print(bound('DOGYOMAMA','ROOGMEMAD', 'OG','MA'))
My problem is that I also need this to work in reverse, so if I have ('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA') it should also look for ('AMAMOYGOD', 'DAMEMGOOR', 'GO', 'AM'). Also, if the string can be spliced both ways, it should take only the sequence that is spliced at the lowest index.
Try this. When you have several results to return, collect them in a list and return the list at the end, as done here:
def bound(st,sz,a,b):
    result=[]
    string_s = [''.join(st), ''.join(sz), ''.join(st)[::-1], ''.join(sz)[::-1]]
    boundaries = [a, b, a[::-1], b[::-1]]
    for chunk in range(0, len(string_s), 2):
        word = string_s[chunk:chunk + 2]
        bound = boundaries[chunk:chunk + 2]
        if bound[0] in word[0] and bound[1] in word[0] and bound[0] in word[1] and bound[1] in word[1]:
            f1 = word[0].find(bound[0])
            l1 = word[0].find(bound[1])
            f2 = word[1].find(bound[0])
            l2 = word[1].find(bound[1])
            blen1 = len(bound[1])
            blen2 = len(bound[1])
            s1_n = word[0][:f1] + word[0][l1 + blen1:]
            s2_n = word[1][:f2] + word[0][f1:l1 + blen1] + word[1][l2 + blen2]
            result.append([s1_n, s2_n, bound[0], bound[1]])
    return result
print(bound('DOGYOMAMA','ROOGMEMAD', 'OG','MA'))
output:
[['DMA', 'ROOGYOMAD', 'OG', 'MA'], ['AMAMOYAMOYGOD', 'DAMEME', 'GO', 'AM']]
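The second entry above comes from a reversed reading in which the closing boundary is found before the opening one. A stricter variant, sketched below under explicit assumptions, searches for the closing boundary only after the opening one and keeps a single result. The names splice and bound_either_way are hypothetical, "lowest index" is assumed to mean the lowest start position in the first string (with the forward reading winning ties), and a reversed result is returned in reversed orientation.

```python
def splice(s1, s2, a, b):
    """Move the a..b segment of s1 into the a..b slot of s2.

    Returns (start index in s1, new s1, new s2), or None if a boundary is missing.
    """
    f1 = s1.find(a)
    f2 = s2.find(a)
    if f1 == -1 or f2 == -1:
        return None
    # Look for the closing boundary only after the opening one.
    l1 = s1.find(b, f1 + len(a))
    l2 = s2.find(b, f2 + len(a))
    if l1 == -1 or l2 == -1:
        return None
    end1, end2 = l1 + len(b), l2 + len(b)
    return f1, s1[:f1] + s1[end1:], s2[:f2] + s1[f1:end1] + s2[end2:]


def bound_either_way(s1, s2, a, b):
    """Try the forward and the reversed reading; keep the lowest start index."""
    attempts = [
        (splice(s1, s2, a, b), a, b),
        (splice(s1[::-1], s2[::-1], a[::-1], b[::-1]), a[::-1], b[::-1]),
    ]
    found = [(hit, x, y) for hit, x, y in attempts if hit is not None]
    if not found:
        return None
    (_, new1, new2), x, y = min(found, key=lambda item: item[0][0])
    return new1, new2, x, y


print(bound_either_way('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA'))
# ('DMA', 'ROOGYOMAD', 'OG', 'MA')
```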
I work with Ruby every day, but I have an issue in Python. I find these languages very similar, but I have some problems migrating from Ruby :)
Please help me convert this operation to Python:
string = "qwerty2012"
(var, some_var, another_var) = string.unpack("a1a4a*")
This should return three variables with the unpacked values from the string:
var = "q" # a1
some_var = "wert" # a4
another_var = "y2012" # a*
Help me to represent it in Python
Thank you!
s = "qwerty2012"
(a, b, c) = s[:1], s[1:5], s[5:]
Python does have a similar module named struct. It lacks the ability to grab the rest of the string in the same way that Ruby and PHP lifted from Perl. You can almost get there though:
>>> import struct
>>> s = 'qwerty2012'
>>> struct.unpack_from('1s4s', s)
('q', 'wert')
>>> def my_unpack(format, packed_string):
...     result = []
...     result.extend(struct.unpack_from(format, packed_string))
...     chars_gobbled = struct.calcsize(format)
...     rest = packed_string[chars_gobbled:]
...     if rest:
...         result.append(rest)
...     return result
...
>>> my_unpack('1s4s', 'qwerty2012')
['q', 'wert', 'y2012']
>>> my_unpack('1s4s', 'qwert')
['q', 'wert']
>>> [hex(x) for x in my_unpack('<I', '\xDE\xAD\xBE\xEF')]
['0xefbeadde']
I wish that the struct module implemented the rest of Perl's unpack and pack since they were incredibly useful functions for ripping apart binary packets but alas.
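The session above is Python 2, where str is a byte string. On Python 3 the struct module works on bytes, so the same idea might look like the sketch below; the function name unpack_with_rest is made up here, and the input is assumed to be a bytes object.

```python
import struct


def unpack_with_rest(fmt, data):
    """Unpack `fmt` from the start of `data` and append any leftover bytes."""
    fields = list(struct.unpack_from(fmt, data))
    rest = data[struct.calcsize(fmt):]
    if rest:
        fields.append(rest)
    return fields


print(unpack_with_rest('1s4s', b'qwerty2012'))  # [b'q', b'wert', b'y2012']
print(unpack_with_rest('1s4s', b'qwert'))       # [b'q', b'wert']
```

Text would need to be encoded first (for example with str.encode) and the resulting fields decoded afterwards.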
s = "qwerty2012"
var, some_var, another_var = s[:1], s[1:5], s[5:]
will do the assignment and yield respectively:
q
wert
y2012
The above assignment makes use of slice notation as described by the Python Docs. This SO post Good Primer for Python Slice Notation gives a good explanation too.
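When there are more than a few fields, the slices can be generated from a list of widths instead of being written by hand. The helper name split_widths is hypothetical; the final element plays the role of Ruby's a*.

```python
def split_widths(s, widths):
    """Cut s into pieces of the given widths, plus whatever remains."""
    parts, pos = [], 0
    for width in widths:
        parts.append(s[pos:pos + width])
        pos += width
    parts.append(s[pos:])  # the rest of the string, like a*
    return parts


var, some_var, another_var = split_widths("qwerty2012", [1, 4])
print(var, some_var, another_var)  # q wert y2012
```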
Here's a preliminary recreation of unpack:
import re
import StringIO
def unpack(s, fmt):
    fs = StringIO.StringIO(s)
    res = []
    for do,num in unpack.pattern.findall(fmt):
        if num == '*':
            num = len(s)
        elif num == '':
            num = 1
        else:
            num = int(num)
        this = unpack.types[do](num, fs)
        if this is not None:
            res.append(this)
    return res

unpack.types = {
    '#': lambda n,s: s.seek(n), # skip to offset
    'a': lambda n,s: s.read(n), # string
    'A': lambda n,s: s.read(n).rstrip(), # string, right-trimmed
    'b': lambda n,s: bin(reduce(lambda x,y:256*x+ord(y), s.read(n), 0))[2:].zfill(8*n)[::-1], # binary, LSB first
    'B': lambda n,s: bin(reduce(lambda x,y:256*x+ord(y), s.read(n), 0))[2:].zfill(8*n) # binary, MSB first
}
unpack.pattern = re.compile(r'([a-zA-Z#](?:_|!|<|>|!<|!>|0|))(\d+|\*|)')
It works for your given example,
unpack("qwerty2012", "a1a4a*") # -> ['q', 'wert', 'y2012']
but has a long list of datatypes not yet implemented (see the documentation).
This might ease your migration from Ruby:
import re
import struct
def unpack(format, a_string):
    pattern = r'''a(\*|\d+)\s*'''
    # compare strings with !=, not identity
    widths = [int(w) if w != '*' else 0 for w in re.findall(pattern, format)]
    if not widths[-1]: widths[-1] = len(a_string) - sum(widths)
    fmt = ''.join('%ds' % f for f in widths)
    return struct.unpack_from(fmt, a_string)
(var, some_var, another_var) = unpack('a1a4a*', 'qwerty2012') # also 'a1 a4 a*' OK
print (var, some_var, another_var)
Output:
('q', 'wert', 'y2012')
Quick question. I'm trying to find or write an encoder in Python to shorten a string of numbers by using upper and lower case letters. The numeric strings look something like this:
20120425161608678259146181504021022591461815040210220120425161608667
The length is always the same.
My initial thought was to write some simple encoder to utilize upper and lower case letters and numbers to shorten this string into something that looks more like this:
a26Dkd38JK
That was completely arbitrary, just trying to be as clear as possible.
I'm certain that there is a really slick way to do this, probably already built in. Maybe this is an embarrassing question to even be asking.
Also, I need to be able to take the shortened string and convert it back to the longer numeric value.
Should I write something and post the code, or is this a one line built in function of Python that I should already know about?
Thanks!
This is a pretty good compression:
import base64
def num_to_alpha(num):
    num = hex(num)[2:].rstrip("L")
    if len(num) % 2:
        num = "0" + num
    return base64.b64encode(num.decode('hex'))
It first turns the integer into a bytestring and then base64 encodes it. Here's the decoder:
def alpha_to_num(alpha):
    num_bytes = base64.b64decode(alpha)
    return int(num_bytes.encode('hex'), 16)
Example:
>>> num_to_alpha(20120425161608678259146181504021022591461815040210220120425161608667)
'vw4LUVm4Ea3fMnoTkHzNOlP6Z7eUAkHNdZjN2w=='
>>> alpha_to_num('vw4LUVm4Ea3fMnoTkHzNOlP6Z7eUAkHNdZjN2w==')
20120425161608678259146181504021022591461815040210220120425161608667
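The functions above rely on Python 2's 'hex' codec. On Python 3 the same approach (integer to big-endian bytes, then base64) can be sketched with int.to_bytes and int.from_bytes; this is an adaptation written for illustration, not the answerer's code.

```python
import base64


def num_to_alpha(num):
    """Encode a non-negative int as base64 text."""
    length = max(1, (num.bit_length() + 7) // 8)  # minimal number of bytes
    return base64.b64encode(num.to_bytes(length, "big")).decode("ascii")


def alpha_to_num(alpha):
    """Decode the base64 text back into an int."""
    return int.from_bytes(base64.b64decode(alpha), "big")


n = 20120425161608678259146181504021022591461815040210220120425161608667
encoded = num_to_alpha(n)
print(encoded)
print(alpha_to_num(encoded) == n)  # True
```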
There are two functions that are custom (not based on base64), but produce shorter output:
chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = len(chrs)
def int_to_cust(i):
    result = ''
    while i:
        result = chrs[i % l] + result
        i = i // l
    if not result:
        result = chrs[0]
    return result

def cust_to_int(s):
    result = 0
    for char in s:
        result = result * l + chrs.find(char)
    return result
And the results are:
>>> int_to_cust(20120425161608678259146181504021022591461815040210220120425161608667)
'9F9mFGkji7k6QFRACqLwuonnoj9SqPrs3G3fRx'
>>> cust_to_int('9F9mFGkji7k6QFRACqLwuonnoj9SqPrs3G3fRx')
20120425161608678259146181504021022591461815040210220120425161608667L
You can also shorten the generated string, if you add other characters to the chrs variable.
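How much a larger alphabet helps can be estimated in advance: a number with d decimal digits needs at most ceil(d * log(10) / log(base)) characters. The helper name encoded_length is made up for this sketch.

```python
import math


def encoded_length(decimal_digits, base):
    """Upper bound on the characters needed for a number with that many decimal digits."""
    return math.ceil(decimal_digits * math.log(10) / math.log(base))


print(encoded_length(68, 62))  # 38
```

For the 68-digit input and the 62-character alphabet this gives 38, which matches the length of the encoded string shown above.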
Do it with 'class':
VALID_CHRS = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
BASE = len(VALID_CHRS)
MAP_CHRS = {k: v
            for k, v in zip(VALID_CHRS, range(BASE + 1))}

class TinyNum:
    """Compact number representation in alphanumeric characters."""
    def __init__(self, n):
        result = ''
        while n:
            result = VALID_CHRS[n % BASE] + result
            n //= BASE
        if not result:
            result = VALID_CHRS[0]
        self.num = result

    def to_int(self):
        """Return the number as an int."""
        result = 0
        for char in self.num:
            result = result * BASE + MAP_CHRS[char]
        return result
Sample usage:
>>> n = 4590823745
>>> tn = TinyNum(n)
>>> print(n)
4590823745
>>> print(tn.num)
50GCYh
>>> print(tn.to_int())
4590823745
(Based on Tadeck's answer.)
>>> s="20120425161608678259146181504021022591461815040210220120425161608667"
>>> import base64, zlib
>>> base64.b64encode(zlib.compress(s))
'eJxly8ENACAMA7GVclGblv0X4434WrKFVW5CtJl1HyosrZKRf3hL5gLVZA2b'
>>> zlib.decompress(base64.b64decode(_))
'20120425161608678259146181504021022591461815040210220120425161608667'
so zlib isn't real smart at compressing strings of digits :(