I am working on an NLP project of text processing using python in which I need to do a data cleaning before feature extractions.
I am doing the cleaning of special characters and number separations with chars using regex operation but I am doing all these in many operations separately which is making it slow. I want to make it in as few as possible operations or in a faster way.
my code is as follows
def remove_special_char(x):
if type(x) is str:
x = x.replace('-', ' ').replace('(', ',').replace(')', ',')
x = re.compile(r"\s+").sub(" ", x).strip()
x = re.sub(r'[^A-Z a-z 0-9-,.x]+', '', x).lower()
x = re.sub(r"([0-9]+(\.[0-9]+)?)",r" \1 ", x).strip()
x = x.replace(",,",",")
return x
else:
return x
Can anyone help me?
In addition to preparing the compiled patterns outside the function, you can gain some performance by using translate for all the one-to-one or one-to-none conversions:
import string
mappings = {'-':' ', '(':',', ')':','} # add more mappings as needed
mappings.update({ c:' ' for c in string.whitespace }) # white spaces become spaces
mappings.update({c:c.lower() for c in string.ascii_uppercase}) # set to lowercase
specialChars = str.maketrans(mappings)
def remove_special_char(x):
x = x.translate(specialChars)
...
return x
You have different replacement strings for the various operations, so you can't really merge them.
You can pre-compile all of the regexps beforehand though, but I suspect it won't make much of a difference:
paren_re = re.compile(r"[()]")
whitespace_re = re.compile(r"\s+")
ident_re = re.compile(r"[^A-Za-z0-9-,.x]+")
number_re = re.compile(r"([0-9]+(\.[0-9]+)?)")
def remove_special_char(x):
if isinstance(x, str):
x = x.replace("-", " ")
x = paren_re.sub(",", x)
x = whitespace_re.sub(" ", x)
x = ident_re.sub("", x).lower()
x = number_re.sub(r" \1 ", x).strip()
x = x.replace(",,", ",")
return x
Have you profiled your program to see that this is the bottleneck?
Related
I am looking to be able to recursively remove adjacent letters in a string that differ only in their case e.g. if s = AaBbccDd i would want to be able to remove Aa Bb Dd but leave cc.
I can do this recursively using lists:
I think it aught to be able to be done using regex but i am struggling:
with test string 'fffAaaABbe' the answer should be 'fffe' but the regex I am using gives 'fe'
def test(line):
res = re.compile(r'(.)\1{1}', re.IGNORECASE)
#print(res.search(line))
while res.search(line):
line = res.sub('', line, 1)
print(line)
The way that works is:
def test(line):
result =''
chr = list(line)
cnt = 0
i = len(chr) - 1
while i > 0:
if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
cnt += 1
chr.pop(i)
chr.pop(i - 1)
i -= 2
else:
i -= 1
if cnt > 0: # until we can't find any duplicates.
return test(''.join(chr))
result = ''.join(chr)
print(result)
Is it possible to do this using a regex?
re.IGNORECASE is not way to solve this problem, as it will treat aa, Aa, aA, AA same way. Technically it is possible using re.sub, following way.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly defined all possible letters pairs, because so far I know there is no way to say "inverted case letter" using just re pattern. Maybe other user will be able to provide more concise re-based solution.
I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby
def test(line):
res = []
for k, g in groupby(line, key=lambda x: x.lower()):
g = list(g)
if all(x == x.lower() for x in g):
res.append(''.join(g))
print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even if we just do r'[A-z]\1{1}', this would still be bad because you would match any sequence of the same letter twice, but it would catch xx and XX -- you don't want to match consecutive same characters with matching case, as you said in the original question.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on #Daweo's answer, you can generate the regex pattern needed to match pairs of same letters with non-matching case to get the final pattern of aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string
def consecutiveLettersNonMatchingCase():
# Get all 'xX|Xx' with a list comprehension
# and join them with '|'
return '|'.join(['{0}{1}|{1}{0}'.format(s, t)\
# Iterate through the upper/lowercase characters
# in lock-step
for s, t in zip(
string.ascii_lowercase,
string.ascii_uppercase)])
def test(line):
res = re.compile(consecutiveLettersNonMatchingCase())
print(res.search(line))
while res.search(line):
line = res.sub('', line, 1)
print(line)
print(consecutiveLettersNonMatchingCase())
I'm trying to format any number by inserting ',' every 3 numbers from the end by not using format()
123456789 becomes 123,456,789
1000000 becomes 1,000,000
What I have so far only seems to go from the start, I've tried different ideas to get it to reverse but they seem to not work as I hoped.
def format_number(number):
s = [x for x in str(number)]
for a in s[::3]:
if s.index(a) is not 0:
s.insert(s.index(a), ',')
return ''.join(s)
print(format_number(1123456789))
>> 112,345,678,9
But obviously what I want is 1,123,456,789
I tried reversing the range [:-1:3] but I get 112,345,6789
Clarification: I don't want to use format to structure the number, I'd prefer to understand how to do it myself just for self-study's sake.
Here is a solution for you, without using built-in functions:
def format_number(number):
s = list(str(number))[::-1]
o = ''
for a in range(len(s)):
if a and a % 3 == 0:
o += ','
o += s[a]
return o[::-1]
print(format_number(1123456789))
And here is the same solution using built-in functions:
def format_number(number):
return '{:,}'.format(number)
print(format_number(1123456789))
I hope this helps. :D
One way to do it without built-in functions at all...
def format_number(number):
i = 0
r = ""
while True:
r = "0123456789"[number % 10] + r
number //= 10
if number == 0:
return r
i += 1
if i % 3 == 0:
r = "," + r
Here's a version that's almost free of built-in functions or methods (it does still have to use str)
def format_number(number):
i = 0
r = ""
for character in str(number)[::-1]:
if i > 0 and i % 3 == 0:
r = "," + r
r = character + r
i += 1
return r
Another way to do it without format but with other built-ins is to reverse the number, split it into chunks of 3, join them with a comma, and reverse it again.
def format_number(number):
backward = str(number)[::-1]
r = ",".join(backward[i:i+3] for i in range(0, len(backward), 3))
return r[::-1]
Your current approach has following drawbacks
checking for equality/inequality in most cases (especially for int) should be made using ==/!= operators, not is/is not ones,
using list.index returns first occurence from the left end (so s.index('1') will be always 0 in your example), we can iterate over range if indices instead (using range built-in).
we can have something like
def format_number(number):
s = [x for x in str(number)]
for index in range(len(s) - 3, 0, -3):
s.insert(index, ',')
return ''.join(s)
Test
>>> format_number(1123456789)
'1,123,456,789'
>>> format_number(6789)
'6,789'
>>> format_number(135)
'135'
If range, list.insert and str.join are not allowed
We can replace
range with while loop,
list.insert using slicing and concatenation,
str.join with concatenation,
like
def format_number(number):
s = [x for x in str(number)]
index = len(s) - 3
while index > 0:
s = s[:index] + [','] + s[index:]
index -= 3
result = ''
for character in s:
result += character
return result
Using str.format
Finally, following docs
The ',' option signals the use of a comma for a thousands separator. For a locale aware separator, use the 'n' integer presentation type instead.
your function can be simplified to
def format_number(number):
return '{:,}'.format(number)
and it will even work for floats.
I'm new to Python and find myself in the following situation. I work with equations stored as strings, such as:
>>> my_eqn = "A + 3.1B - 4.7D"
I'm looking to parse the string and store the numeric and alphabetic parts separately in two lists, or some other container. A (very) rough sketch of what I'm trying to put together would look like:
>>> foo = parse_and_separate(my_eqn);
>>> foo.numbers
[1.0, 3.1, -4.7]
>>> foo.letters
['A', 'B', 'D']
Any resources/references/pointers would be much appreciated.
Thanks!
Update
Here's one solution I came up with that's probably overly-complicated but seems to work. Thanks again to all those responded!
import re my_eqn = "A + 3.1B - 4.7D"
# add a "1.0" in front of single letters
my_eqn = re.sub(r"(\b[A-Z]\b)","1"+ r"\1", my_eqn, re.I)
# store the coefficients and variable names separately via regex
variables = re.findall("[a-z]", my_eqn, re.I)
coeffs = re.findall("[-+]?\s?\d*\.\d+|\d+", my_eqn)
# strip out '+' characters and white space
coeffs = [s.strip('+') for s in coeffs]
coeffs = [s.replace(' ', '') for s in coeffs]
# coefficients should be floats
coeffs = list(map(float, coeffs))
# confirm answers
print(variables)
print(coeffs)
This works for your simple scenario if you don't want to include any non standard python libraries.
class Foo:
def __init__(self):
self.numbers = []
self.letters = []
def split_symbol(symbol, operator):
name = ''
multiplier = ''
for letter in symbol:
if letter.isalpha():
name += letter
else:
multiplier += letter
if not multiplier:
multiplier = '1.0'
if operator == '-':
multiplier = operator + multiplier
return name, float(multiplier)
def parse_and_separate(my_eqn):
foo = Foo()
equation = my_eqn.split()
operator = ''
for symbol in equation:
if symbol in ['-', '+']:
operator = symbol
else:
letter, number = split_symbol(symbol, operator)
foo.numbers.append(number)
foo.letters.append(letter)
return foo
foo = parse_and_separate("A + 3.1B - 4.7D + 45alpha")
print(foo.numbers)
print(foo.letters)
I have the program below, which is passed on to another function which simply prints out the original and encrypted messages. I want to know how I can simplify this program, specifically the "match = zip" and "change = (reduce(lambda" lines. If possible to do this without using lambda, how can I?
from itertools import cycle
alphabet = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
def vigenereencrypt(message,keyword):
output = ""
match = zip(message.lower(),cycle(keyword.lower()))
for i in match:
change = (reduce(lambda x, y: alphabet.index(x) + alphabet.index(y), i)) % 26
output = output + alphabet[change]
return output.lower()
Two things:
You dont need to have a local variable match, just loop zip
Your can split up your two indices x and y in your for loop definition rather than using reduce; reduce is normally used for larger iterables and since you only have 2 items in i, it's adding unnecessary complexity.
ie, you can change your for loop definition to:
for x, y in zip(...):
and your definition of change to:
change = (alphabet.index(x) + alphabet.index(y)) % 26
Starting with what R Nar said:
def vigenereencrypt(message,keyword):
output = ""
for x, y in zip(message.lower(), cycle(keyword.lower())):
change = (alphabet.index(x) + alphabet.index(y)) % 26
output = output + alphabet[change]
return output.lower()
We can be more efficient by using a list and then joining it, instead of adding to a string, and also by noticing that the output is already lowercase:
def vigenereencrypt(message,keyword):
output = []
for x, y in zip(message.lower(), cycle(keyword.lower())):
change = (alphabet.index(x) + alphabet.index(y)) % 26
output.append(alphabet[change])
return "".join(output)
Then we can reduce the body of the loop to one line..
def vigenereencrypt(message,keyword):
output = []
for x, y in zip(message.lower(), cycle(keyword.lower())):
output.append(alphabet[(alphabet.index(x) + alphabet.index(y)) % 26])
return "".join(output)
... so we can turn it into a list comprehension:
def vigenereencrypt(message,keyword):
output = (
alphabet[(alphabet.index(x) + alphabet.index(y)) % 26]
for x, y in zip(message.lower(), cycle(keyword.lower()))
)
return "".join(output)
I feel like there's something we could do with map(alphabet.index, ...) but I can't think of a way that's any better than the list comprehension.
you could do it with a bunch of indexing instead of zip...
alphabet = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
alphaSort = {k:n for n,k in enumerate(alphabet)}
alphaDex = {n:k for n,k in enumerate(alphabet)}
def vigenereencrypt(message,keyword):
output = ""
#match = zip(message.lower(),cycle(keyword.lower())) # zip(a,cycle(b)) Creates [(a[n],b[n%len(b)]) for k in range(len(a)) ]
op = "" # So lets start with for k in range(len(a))
for k in range(len(message)):
op += alphaDex[(alphaSort[message.lower()[k]]+alphaSort[keyword.lower()[k%len(keyword)]])%len(alphabet)]
return(op)
Is there a Python-way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
Seems like this is the most readable way, the alternative is regex)
Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
n=2
s='20_231_myString_234'
m=re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m: print m.groups()
or have a nice function:
import re
def nthofchar(s, c, n):
regex=r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c,c,n-1,c,c)
l = ()
m = re.match(regex, s)
if m: l = m.groups()
return l
s='20_231_myString_234'
print nthofchar(s, '_', 2)
Or without regexes, using iterative find:
def nth_split(s, delim, n):
p, c = -1, 0
while c < n:
p = s.index(delim, p + 1)
c += 1
return s[:p], s[p + 1:]
s1, s2 = nth_split('20_231_myString_234', '_', 2)
print s1, ":", s2
I like this solution because it works without any actuall regex and can easiely be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2 # on which occourence you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print (part1, ' ', part2)
I thought I would contribute my two cents. The second parameter to split() allows you to limit the split after a certain number of strings:
def split_at(s, delim, n):
r = s.split(delim, n)[n]
return s[:-len(r)-len(delim)], r
On my machine, the two good answers by #perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.
>>>import re
>>>str= '20_231_myString_234'
>>> occerence = [m.start() for m in re.finditer('_',str)] # this will give you a list of '_' position
>>>occerence
[2, 6, 15]
>>>result = [str[:occerence[1]],str[occerence[1]+1:]] # [str[:6],str[7:]]
>>>result
['20_231', 'myString_234']
It depends what is your pattern for this split. Because if first two elements are always numbers for example, you may build regular expression and use re module. It is able to split your string as well.
I had a larger string to split ever nth character, ended up with the following code:
# Split every 6 spaces
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
n_split_groups.append(sep.join(groups[:n]))
groups = groups[n:]
print n_split_groups
Thanks #perreal!
In function form of #AllBlackt's solution
def split_nth(s, sep, n):
n_split_groups = []
groups = s.split(sep)
while len(groups):
n_split_groups.append(sep.join(groups[:n]))
groups = groups[n:]
return n_split_groups
s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print (split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']
As #Yuval has noted in his answer, and #jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter maxsplit to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
s = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0] # Gives '20_231'
second_part = text.split('_', 2)[2] # Gives 'myString_234'
This is not only simple, but also avoids performance hits of regex solutions and other solutions using join to undo unnecessary splits.