Using parentheses as delimiter in re or str.split() python

I am trying to split a string such as: add(ten)sub(one) into add(ten) sub(one).
I can't figure out how to match the closing parenthesis. I have used re.sub(r'\\)', '\\) ') and every variation of escaping the parentheses I can think of. It is hard to tell in this font, but I am trying to add a space between these commands so I can split the string into a list later.

There's no need to escape ) in the replacement string. ) has a special meaning only in the regex pattern, so it must be escaped there to match a literal ) in the string; in the replacement string it can be used as is.
>>> strs = "add(ten)sub(one)"
>>> re.sub(r'\)(?=\S)',r') ', strs)
'add(ten) sub(one)'
As @StevenRumbalski pointed out in the comments, the same operation can be done more simply with str.replace and str.strip:
>>> strs.replace(')',') ').strip()
'add(ten) sub(one)'
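Since the stated goal is a list, the substitution and the split can be chained. A minimal sketch, reusing the sample string from the question (the variable name commands is only illustrative):

```python
import re

strs = "add(ten)sub(one)"

# Insert a space after every ')' that is followed by a non-space character,
# then split on whitespace to get the individual commands.
commands = re.sub(r'\)(?=\S)', ') ', strs).split()
print(commands)  # ['add(ten)', 'sub(one)']
```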

d = ')'
my_str = 'add(ten)sub(one)'
result = [t + d for t in my_str.split(d) if len(t) > 0]
# result == ['add(ten)', 'sub(one)']

Create a list of all substrings:
import re
a = 'add(ten)sub(one)'
print(re.findall(r'(.+?\(.+?\))', a))
Output:
['add(ten)', 'sub(one)']

Related

Python regex: how to remove all zeros from the beginning?

I have a lot of strings like "01568460144" and "0005855048560".
I want to remove all zeros from the beginning. I tried the following, which removes only one zero from the beginning, but I also have other strings with multiple zeros at the beginning.
re.sub(r'0','',number)
So for a string like "0005855048560" my expected result is "5855048560".

If the goal is to remove all leading zeroes from a string, skip the regex, and just call .lstrip('0') on the string. The *strip family of functions are a little weird when the argument isn't a single character, but for the purposes of stripping leading/trailing copies of a single character, they're perfect:
>>> s = '000123'
>>> s = s.lstrip('0')
>>> s
'123'
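The "weird" behaviour mentioned above is that the argument is treated as a set of characters rather than as a prefix. A short sketch with made-up sample strings, which also shows what happens to an all-zero string:

```python
# Single-character argument: strips every leading '0'.
a = '000123'.lstrip('0')        # '123'

# Multi-character argument: treated as a set of characters, not a prefix,
# so every leading '0' and 'x' is removed.
b = '0x00ff'.lstrip('0x')       # 'ff'

# A string made only of zeros becomes empty; fall back to '0' if needed.
c = '0000'.lstrip('0') or '0'   # '0'

print(a, b, c)
```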

>>> v = '0001111110'
>>> str(int(v))
'1111110'
>>> str(int('0005855048560'))
'5855048560'

If the string should contain only digits, you can use either isnumeric() or use re.sub and match only digits:
import re
strings = [
    "01568460144",
    "0005855048560",
    "00test",
    "00000",
    "0"
]
for s1 in strings:
    if s1.isnumeric():
        print(f"'{s1.lstrip('0')}'")
    else:
        print(f"'{s1}'")
print("----------------------------")
for s2 in strings:
    res = re.sub(r"^0+(\d*)$", r"\1", s2)
    print(f"'{res}'")
Output:
'1568460144'
'5855048560'
'00test'
''
''
----------------------------
'1568460144'
'5855048560'
'00test'
''
''
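Both loops above reduce "00000" and "0" to an empty string. If a single zero should survive in that case (an assumption, since the question does not say), a lookahead variant of the pattern is one possible sketch:

```python
import re

strings = ["01568460144", "0005855048560", "00test", "00000", "0"]

# Remove leading zeros only when the rest of the string is one or more
# digits, so the last digit of an all-zero string is never consumed.
results = [re.sub(r"^0+(?=\d+$)", "", s) for s in strings]
print(results)
# ['1568460144', '5855048560', '00test', '0', '0']
```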

How to parse values that appear after the same string in Python?

I have an input text like this (the actual text file also contains tons of garbage characters surrounding these two strings):
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how I can parse the text so that the output is:
value=xxx
value=yyy

Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
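If only the part after value= is wanted, as in the value1="xxx" and value2="yyy" goal from the question, a capturing group can be added. A minimal sketch on the same sample string (the names value1 and value2 come from the question; unpacking into exactly two names assumes exactly two matches):

```python
import re

s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"

# With one capturing group, findall returns only the captured text.
values = re.findall(r"\bvalue=([^;]*)", s)
value1, value2 = values
print(value1, value2)  # xxx yyy
```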

Use regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx;38u952035983049;3^&^*(^%$value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx
value=yyy
Summary of the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters

For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
    print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.

You already have good answers based on the re module, which is certainly the simplest way.
If for any reason (performance?) you prefer to use str methods, that is also possible, but you must start the second search past the end of the first match:
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
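The same idea extends to any number of occurrences by looping until find returns -1. A sketch under the assumption that every value ends at the next ';' (the function name find_values and the sample content are hypothetical):

```python
def find_values(content, key='value', terminator=';'):
    """Collect every key...terminator slice using only str.find."""
    results = []
    start = content.find(key)
    while start != -1:
        end = content.find(terminator, start)
        if end == -1:          # no terminator left: take the rest
            end = len(content)
        results.append(content[start:end])
        # Resume the search after the slice just taken.
        start = content.find(key, end)
    return results

content = 'junk value=xxx;more junk value=yyy;tail'
print(find_values(content))  # ['value=xxx', 'value=yyy']
```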

Understanding strip()

I am trying the strip() method on this string but it doesn't give the desired output.
s = 'www.yahoo.com'
s = s.rstrip('.com')
print s # The desired output is 'www.yahoo' but this is showing 'www.yah'
Along with the solution please provide the reason for current output.

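str.strip('.com') does not remove the substring .com. It treats its argument as a set of characters and removes any of ., c, o and m from the beginning and the end of the string (str.rstrip does the same at the end only). That is why the trailing oo of yahoo is stripped as well.
To remove .com, use str.replace.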
>>> s = 'www.yahoo.com'
>>> s.replace('.com', '') # Replace `.com` with empty string.
'www.yahoo'
UPDATE
As Marcin Fabrykowski and David Zwicker pointed out, the above solution will turn www.company.com into wwwpany.
To address that, you can use Marcin Fabrykowski's solution, or a regular expression:
>>> import re
>>> re.sub(r'\.com$', '', 'www.company.com')
'www.company'
>>> re.sub(r'\.com$', '', 'www.company.com.com')
'www.company.com'
>>> re.sub(r'(\.com)+$', '', 'www.company.com.com') # Remove repeated trailing .com.
'www.company'
\.com$ matches .com at the end of the string ($). The . is escaped because . has a special meaning in a regular expression (match any character).
NOTE: I used a raw string literal (r'...'); r'\.com' == '\\.com'.

You can try:
if x.endswith('.com'): print(x[:-4])
because str.replace also removes .com from the middle of the string:
x = "www.computers.com"
print(x.replace('.com',''))
wwwputers
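The endswith check generalizes to any suffix. A minimal sketch (the helper name strip_suffix is made up for illustration); on Python 3.9 and later, the built-in str.removesuffix does the same job:

```python
def strip_suffix(text, suffix='.com'):
    """Remove suffix once, and only if text actually ends with it."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

print(strip_suffix('www.yahoo.com'))      # www.yahoo
print(strip_suffix('www.computers.com'))  # www.computers
print(strip_suffix('www.yahoo.org'))      # www.yahoo.org
```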

Python, can't replace generator object

I need to replace a string's punctuation marks with spaces.
The problem is that I need to do it in one line.
For example, given the string 'H,+-=/e^##%ll-!!..o'
the result should be 'H-----e----ll-----o'
where '-' symbolizes ' ' (space).
When I do
replace((c for c in string.punctuation),' ')
I get the error:
TypeError: Can't convert 'generator' object to str implicitly
I tried putting it in a list, in a set, even in a dict,
but this error keeps coming back.
How can I get around this?

str.replace() doesn't take a list or generator; it only takes a string, and even then it won't do what you want. The method replaces one whole sequence of characters with another, so even x.replace(string.punctuation, '-') would only replace whole occurrences of the string.punctuation string in x with a single dash.
Use string.maketrans() and str.translate() instead (string.maketrans() exists only in Python 2):
import string
translationmap = string.maketrans(string.punctuation, '-' * len(string.punctuation))
x = x.translate(translationmap)
Demo:
>>> import string
>>> x = 'H,+-=/e^##%ll-!!..o'
>>> translationmap = string.maketrans(string.punctuation, '-' * len(string.punctuation))
>>> x.translate(translationmap)
'H-----e----ll-----o'
str.translate() is hands-down the fastest method to map characters to other characters, or delete characters from a string.
On Python 3, str.translate() (or in Python 2, unicode.translate()) takes a mapping instead:
translationmap = {ord(c): '-' for c in string.punctuation}
x.translate(translationmap)
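On Python 3 the mapping can also be built with the static method str.maketrans, which mirrors the two-argument form used above. A sketch that replaces each punctuation mark with a space, as the question actually asks:

```python
import string

x = 'H,+-=/e^##%ll-!!..o'

# str.maketrans builds the ordinal-to-replacement mapping in Python 3.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
result = x.translate(table)
print(repr(result))
```

The result has the same length as the input, with every punctuation character swapped for a single space.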

Try the following:
import string
''.join(map(lambda x: '-' if x in string.punctuation else x,
            'H,+-=/e^##%ll-!!..o'))

You could also use re.sub for this:
>>> from re import sub
>>> sub(r"\W", "-", "H,+-=/e^##%ll-!!..o")
'H-----e----ll-----o'
\W matches any non-word character.
Note that the above code will keep underscores. If you don't want them, replace \W with [\W_].

Remove all special characters, punctuation and spaces from string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.

This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.

Here is a regex to match a string of characters that are not letters or numbers:
[^A-Za-z0-9]+
Here is the Python command to do a regex substitution:
re.sub('[^A-Za-z0-9]+', '', mystring)

A shorter way:
import re
cleanString = re.sub(r'\W+', '', string)
If you want spaces between words and numbers, replace '' with ' '.

TLDR
I timed the provided answers.
import re
re.sub(r'\W+', '', string)
is typically 3x faster than the next fastest provided top answer.
Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped by this method.
After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:
string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'
Example 1
''.join(e for e in string if e.isalnum())
string1 - Result: 10.7061979771
string2 - Result: 7.78372597694
Example 2
import re
re.sub('[^A-Za-z0-9]+', '', string)
string1 - Result: 7.10785102844
string2 - Result: 4.12814903259
Example 3
import re
re.sub(r'\W+', '', string)
string1 - Result: 3.11899876595
string2 - Result: 2.78014397621
The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)
Example 3 can be 3x faster than Example 1.
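The caution about ø comes from \W being Unicode-aware for str patterns in Python 3: ø counts as a word character, and so does the underscore. A sketch of the difference, using an invented sample string and the re.ASCII flag:

```python
import re

text = 'smørrebrød_1 $#! 23'

# Default (Unicode) matching: ø and _ count as word characters and stay.
unicode_result = re.sub(r'\W+', '', text)

# re.ASCII limits \w to [a-zA-Z0-9_]; adding _ to the class drops it too.
ascii_result = re.sub(r'[\W_]+', '', text, flags=re.ASCII)

print(unicode_result)  # smørrebrød_123
print(ascii_result)    # smrrebrd123
```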

Python 2.*
I think just filter(str.isalnum, string) works:
In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'
Python 3.*
In Python 3, filter() returns an iterator instead of a string, so the result has to be joined back into a string:
''.join(filter(str.isalnum, string))
or, to pass a list to join (not sure, but it may be a bit faster):
''.join([*filter(str.isalnum, string)])
Note: unpacking in [*args] is valid from Python 3.5 onwards.

#!/usr/bin/python
import re
strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print(strs)
nstr = re.sub(r'[?$.!]', '', strs)
print(nstr)
nestr = re.sub(r'[^a-zA-Z0-9 ]', '', nstr)
print(nestr)
You can add more special characters to the character class; each one is replaced by '' (nothing), i.e. removed.

Unlike the other regex answers, I would exclude every character that is not what I want, instead of explicitly enumerating what I don't want.
For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:
import re
s = re.sub(r"[^a-zA-Z0-9]","",s)
This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".
In fact, if you put the special character ^ first inside a character class, you get its negation.
Extra tip: if you also need to lowercase the result, lowercase the string first; the regex then becomes simpler, since no uppercase letters are left to match.
import re
s = re.sub(r"[^a-z0-9]","",s.lower())

string.punctuation contains the following characters:
'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'
You can use the translate and maketrans functions to delete punctuation characters:
import string
'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))
Output:
'This is A test'
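The example above keeps spaces. To drop them as well, as the question asks, whitespace can be added to the deletion list. A small sketch on the same sample sentence:

```python
import string

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', string.punctuation + string.whitespace)
cleaned = 'This, is. A test!'.translate(table)
print(cleaned)  # ThisisAtest
```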

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:
>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:
import unicodedata
# Strip unwanted characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien)
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))
def filter_non_printable(s):
    result = []
    for c in s:
        # Keep characters in a printable category; mark the rest with '#'.
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    # Turn every '#' marker into a space.
    return u''.join(result).replace(u'#', u' ')
Look at the URL given above for all related categories. You can of course also filter by the punctuation categories.
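For the original goal of keeping only letters and numbers, the category check can be reduced to the first letter of the category code. A Python 3 sketch (the function name is hypothetical, and note that 'N' covers all numeric categories, not just decimal digits):

```python
import unicodedata

def keep_letters_and_digits(s):
    """Keep characters whose Unicode category starts with L or N."""
    return ''.join(c for c in s if unicodedata.category(c)[0] in 'LN')

print(keep_letters_and_digits('Grüße, 世界! 123'))  # Grüße世界123
```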

For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü, ä, ö) simply add these to the regex search string:
Example for German:
re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)

This removes all special characters, punctuation, and spaces from a string, leaving only letters and numbers.
import re
sample_str = "Hel&&lo %% Wo$#rl#d"
# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))
# using regex
op2 = re.sub("[^A-Za-z0-9]", "", sample_str)
print("op2 = ", op2)
special_char_list = ["$", "@", "#", "&", "%"]
# using list comprehension (removes only the listed characters, so spaces stay)
op1 = "".join([k for k in sample_str if k not in special_char_list])
print("op1 = ", op1)
# using a lambda function (same filter as above)
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print("op3 = ", op3)

Use translate (Python 2):
import string
def clean(instr):
    return instr.translate(None, string.punctuation + ' ')
Caveat: this only works on ASCII (byte) strings.

This will remove all non-alphanumeric characters except spaces.
string = "Special $#! characters spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))
'Special  characters spaces 888323'

import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
same as double quotes."""
# count the word "python", whether or not it ends with ',' or '.'
text = my_string.lower().split()
for count, word in enumerate(text):
    if word.endswith((".", ",")):
        text[count] = re.sub(r"^([a-z]+)(.)?$", r"\1", word)
print("The count of Python : ", text.count("python"))

Ten years on, I believe the solution below is the best one.
You can remove/clean all special characters, punctuation, ASCII characters and spaces from the string.
from cleantext import clean
string = 'Special $#! characters spaces 888323'
new = clean(string,lower=False,no_currency_symbols=True, no_punct = True,replace_with_currency_symbol='')
print(new)
Output ==> 'Special characters spaces 888323'
You can remove the spaces as well if you want:
update = new.replace(' ','')
print(update)
Output ==> 'Specialcharactersspaces888323'

function regexFuntion(st) {
  const regx = /[^\w\s]/gi; // allow: [a-zA-Z0-9_, whitespace]
  st = st.replace(regx, ''); // remove everything except [a-zA-Z0-9_, whitespace]
  st = st.replace(/\s\s+/g, ' '); // collapse multiple spaces
  return st;
}
console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67

abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%", "")
print(ddd)
and you will see the result:
askhnlaskdjalsdk
