how to replace +xx in pandas str replace


I'm using Python 2.7.12 and pandas 0.20.3. I have a data frame like the one below, and I want to remove the +91 prefix from the column called number (its dtype is object). When I try to replace +91 in that column I get the error shown below.
number
0 +9185600XXXXX
1 +9199651XXXXX
2 99211XXXXX
3 99341XXXXX
4 +9199651XXXXX
sre_constants.error: nothing to repeat
full trace,
Traceback (most recent call last):
File "encoder.py", line 21, in
df['number']=df['number'].str.replace('+91','')
File "/home/hduser/.local/lib/python2.7/site-packages/pandas/core/strings.py", line 1574, in replace
flags=flags)
File "/home/hduser/.local/lib/python2.7/site-packages/pandas/core/strings.py", line 424, in str_replace
regex = re.compile(pat, flags=flags)
File "/usr/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
Replacing 91 works as I expected; it fails only when I put + in front of it.
Please help me solve this problem.
The error occurs at:
df['number']=df['number'].str.replace('+91','')

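In a regular expression, + is a quantifier (one or more repetitions), so it has to be escaped to match a literal plus sign (on pandas 2.0 and later, where regex defaults to False, also pass regex=True):
df['number'] = df['number'].str.replace(r'\+91', '')
Or pass regex=False to treat the pattern as a literal string (the regex parameter was added in pandas 0.23, so this needs a newer version than the 0.20.3 in the question):
df['number'] = df['number'].str.replace('+91', '', regex=False)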

import pandas as pd
data = {'number': ['+9185600XXXXX', '+9199651XXXXX']}
f = pd.DataFrame(data)
# The raw string keeps the backslash intact; on pandas 2.0+ also pass regex=True
f['number'] = f['number'].str.replace(r'\+91', '')
print(f)
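The traceback in the question shows pandas handing the pattern to re.compile, so the failure can be reproduced with the standard re module alone. The following is a minimal sketch, not pandas code: the sample list and the names compiled, message and cleaned are made up for illustration, and anchoring the pattern with ^ is an extra assumption (that only a leading +91 should be removed).

```python
import re

numbers = ["+9185600XXXXX", "99211XXXXX"]

# A bare "+" at the start of a pattern has nothing to repeat.
compiled = True
message = ""
try:
    re.compile("+91")
except re.error as exc:
    compiled = False
    message = str(exc)

# re.escape() backslash-escapes the metacharacter for us.
# The "^" anchor (an assumption) restricts the match to a leading prefix.
pattern = re.compile("^" + re.escape("+91"))
cleaned = [pattern.sub("", n) for n in numbers]

print(compiled, message)
print(cleaned)
```

Building the pattern with re.escape is handy when the prefix comes from a variable and may contain other special characters.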

Related

Columnwise combining of three strings throws 'TypeError: '<' not supported between instances of 'str' and 'int'

I am puzzled and have no idea what is happening. My script contains the following line. It is used to combine contents of three columns of a dataframe into one of them (only for rows that fulfill the specified condition):
share_data_sm[yr]['MMR']= np.where((share_data_sm[yr]['MC']!='MA') & (share_data_sm[yr]['MC']!=' ') & (share_data_sm[yr]['MY']!=' '), share_data_sm[yr]['MC'].astype(str) + share_data_sm[yr]['N'].astype(str) + share_data_sm[yr]['MFR'].astype(str), share_data_sm[yr]['MFR'])
'share_data_sm' is a dictionary of dataframes, one table per 'yr'. What puzzles me most is that the error is thrown for only one particular value of 'yr': the command is part of a loop over several values of 'yr', and the script runs smoothly for all of them except 2021. I thought there might be some peculiarity in the contents of the 2021 dataframe, but I see nothing exceptional there; everything looks exactly like the others. The following is the traceback from the console:
Traceback (most recent call last):
File "…ipykernel_1380/3858926177.py", line 1, in <module>
runfile('…work_folder/Groups/Structure/shareholding.py', wdir='…_work_folder/Groups/Structure')
File "…pydevd\_pydev_bundle\pydev_umd.py", line 167, in runfile
execfile(filename, namespace)
File "…pydevd\_pydev_imps\_pydev_execfile.py", line 25, in execfile
exec(compile(contents + "\n", file, 'exec'), glob, loc)
File "…work_folder/Groups/Structure/shareholding.py", line 281, in <module>
share_data_sm[yr]['MMR']= np.where((share_data_sm[yr]['MC']!='MA') & (share_data_sm[yr]['MC']!=' ') & (share_data_sm[yr]['MY']!=' '), share_data_sm[yr]['MC'].astype(str) + share_data_sm[yr]['N'].astype(str) + share_data_sm[yr]['MFR'].astype(str), share_data_sm[yr]['MFR'])
File "…pandas\core\ops\common.py", line 69, in new_method
return method(self, other)
File "…pandas\core\arraylike.py", line 96, in __radd__
return self._arith_method(other, roperator.radd)
File "…pandas\core\frame.py", line 6864, in _arith_method
self, other = ops.align_method_FRAME(self, other, axis, flex=True, level=None)
File "…pandas\core\ops\__init__.py", line 306, in align_method_FRAME
left, right = left.align(
File "…pandas\core\frame.py", line 4677, in align
return super().align(
File "…pandas\core\generic.py", line 8591, in align
return self._align_series(
File "…pandas\core\generic.py", line 8708, in _align_series
join_index, lidx, ridx = join_index.join(
File "…pandas\core\indexes\base.py", line 207, in join
join_index, lidx, ridx = meth(self, other, how=how, level=level, sort=sort)
File "…pandas\core\indexes\base.py", line 3987, in join
return this.join(other, how=how, return_indexers=True)
File "…pandas\core\indexes\base.py", line 207, in join
join_index, lidx, ridx = meth(self, other, how=how, level=level, sort=sort)
File "…pandas\core\indexes\base.py", line 3995, in join
return self._join_monotonic(other, how=how)
File "…pandas\core\indexes\base.py", line 4327, in _join_monotonic
join_array, lidx, ridx = self._outer_indexer(other)
File "…pandas\core\indexes\base.py", line 345, in _outer_indexer
joined_ndarray, lidx, ridx = libjoin.outer_join_indexer(sv, ov)
File "…pandas\_libs\join.pyx", line 562, in pandas._libs.join.outer_join_indexer
TypeError: '<' not supported between instances of 'str' and 'int'
I'll appreciate any help - how can I overcome the problem?

I think I see it.
The code can be reformatted as:
condition = \
(share_data_sm[yr]['MC']!='MA') & \
(share_data_sm[yr]['MC']!=' ') & \
(share_data_sm[yr]['MY']!=' ')
val_if_true = share_data_sm[yr]['MC'].astype(str) + share_data_sm[yr]['N'].astype(str) + share_data_sm[yr]['MFR'].astype(str)
val_if_false = share_data_sm[yr]['MFR']
share_data_sm[yr]['MMR'] = np.where(condition, val_if_true, val_if_false)
Now you can see that the value types of val_if_true and val_if_false are different - in the first case, you add together 3 str values. In the second, you're keeping the datatype of share_data_sm[yr]['MFR'].
I bet it complains when you try to add both types into the same array.

The traceback says the error is in the complicated
np.where((share_data_sm[yr]['MC']!='MA') & (share_data_sm[yr]['MC']!=' ') & (share_data_sm[yr]['MY']!=' '), share_data_sm[yr]['MC'].astype(str) + share_data_sm[yr]['N'].astype(str) + share_data_sm[yr]['MFR'].astype(str), share_data_sm[yr]['MFR'])
But keep in mind that it has to evaluate each of the 3 arguments before passing them to where.
(share_data_sm[yr]['MC']!='MA') & (share_data_sm[yr]['MC']!=' ') & (share_data_sm[yr]['MY']!=' ')
share_data_sm[yr]['MC'].astype(str) + share_data_sm[yr]['N'].astype(str) + share_data_sm[yr]['MFR'].astype(str)
share_data_sm[yr]['MFR']
It's a little hard to read the traceback, but the __radd__ frame suggests the problem is in the middle argument, where the values are added, even though you are adding string values.
The join and index frames suggest that pandas is trying to line up the indices of the operands. So the indices may be mostly strings, with an oddball numeric label. But this is just a guess; I'm not a pandas expert.
I'd suggest evaluating those 3 arguments beforehand, before using them in the where, to better isolate the problem. Expressions that extend over many lines are hard to debug.
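The suggestion to evaluate the pieces separately can be sketched roughly as follows. The helper name evaluate_separately and the stand-in lambdas are hypothetical; in the real script each lambda would wrap one of the three np.where arguments. The middle stand-in compares a str with an int, which is the kind of comparison that produces this exact TypeError message, for instance when mixed labels have to be ordered.

```python
def evaluate_separately(pieces):
    """Run each zero-argument callable and report its result type or its error."""
    report = {}
    for name, piece in pieces.items():
        try:
            report[name] = "ok: " + type(piece()).__name__
        except Exception as exc:
            report[name] = "failed: {}: {}".format(type(exc).__name__, exc)
    return report


# Stand-ins only (assumptions): replace each lambda body with the real expression.
pieces = {
    "condition": lambda: [True, False],
    "val_if_true": lambda: "a" < 1,  # mimics ordering a str label against an int label
    "val_if_false": lambda: ["x", "y"],
}

report = evaluate_separately(pieces)
for name, outcome in report.items():
    print(name, "->", outcome)
```

Whichever entry reports a failure is the expression to inspect further, for example by checking its type and its index or column labels.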

how do I get words with meta expressions from string regardless of spacing

I want to find the start position of a phrase inside a string, regardless of spacing.
For example,
context = "abcd e f g ( $ 150 )"
answer = "g($150)"
I want the start index of answer within context, which should be 9.
I tried something like this:
answer = ' ?'.join(answer)
try:
    answer = re.sub('[$]', '\$', answer)
    answer = re.sub('[(]', '\(', answer)
    answer = re.sub('[)]', '\)', answer)
except:
    pass
start_point = re.search(answer, context).span()[0]
Because some answers contain regex metacharacters and some do not, I wrapped this in try/except.
I used lines like
answer = re.sub('[(]', '\(', answer)
because without them re.search(answer, context) can't find my answer in context.
Then I get this error:
Traceback (most recent call last):
File "mc_answer_v2.py", line 42, in <module>
match = re.search(spaced_answer_text, mc_paragraph_text)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/re.py", line 182, in search
return _compile(pattern, flags).search(string)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 619, in _parse
source.tell() - here + len(this))
sre_constants.error: multiple repeat at position 3
How do I fix it and is there any other good way to get the start index?

It seems possible to do it by putting \s* (any amount of whitespace) after each escaped character of the answer string.
import re
def findPosition(context, answer):
    regex = r"\s*"
    # Escape every character, then allow optional whitespace after each one
    regAnswer = regex.join([re.escape(w) for w in answer]) + regex
    # print(regAnswer)
    return re.search(regAnswer, context).start()
context = "abcd e f g ( $ 150 )"
answer = "g($150)"
print(findPosition(context, answer))
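A slightly more defensive variant, as a sketch: hand-escaping only $, ( and ) leaves other metacharacters such as + or * unescaped, whereas re.escape covers all of them. The names spaced_pattern and find_start are hypothetical, and returning -1 when nothing matches and ignoring whitespace inside the answer are assumptions about the desired behaviour.

```python
import re


def spaced_pattern(answer):
    """Build a regex matching the answer's characters with any whitespace between them."""
    # Escape every non-space character so metacharacters are matched literally.
    chars = [re.escape(ch) for ch in answer if not ch.isspace()]
    return re.compile(r"\s*".join(chars))


def find_start(context, answer):
    """Return the start index of answer in context, or -1 if it is absent."""
    match = spaced_pattern(answer).search(context)
    return match.start() if match else -1


context = "abcd e f g ( $ 150 )"
print(find_start(context, "g($150)"))
print(find_start(context, "a+"))
```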

Use map to escape each character.
Use a regex substitution to replace the spaced-out occurrence in the original string with the target string.
Then look for the target string with the string find method. If the target string does not exist, find returns -1 instead of raising an exception.
>>> import re
>>> context = 'abcd e f g ( $ 150 )'
>>> answer = 'g($150)'
>>> findSpacing = lambda target, src :re.sub("\s*".join(map(re.escape, target)), target, src).find(target)
>>> findSpacing(answer, context)
9
>>> findSpacing("FLAG", context)
-1
>>>

sre_constants.error: missing ),unterminated subpattern

I am Gaurav and I am learning programming. I was reading about regular expressions in Dive Into Python 3, so I thought I would try something myself. I wrote this code in Eclipse, but I got a lot of errors. Can anyone please help me?
import re
def add_shtner(add):
    return re.sub(r"\bROAD\b","RD",add)
print(add_shtner("100,BROAD ROAD"))
# a code to check valid roman no.
ptn=r"^(M{0,3})(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}$)"
def romancheck(num):
    num=num.upper()
    if re.search(ptn,num):
        return "VALID"
    else:
        return "INVALID"
print(romancheck("MMMLXXVIII"))
print(romancheck("MMMLxvviii"))
mul_line_str='''adding third argument
re.VERBOSE in re.search()
will ignore whitespace
and comments'''
print(re.search("re.search()will",mul_line_str,re.VERBOSE))
print(re.search("re.search() will",mul_line_str,re.VERBOSE))
print(re.search("ignore",mul_line_str,re.VERBOSE))
ptn='''
^ #beginning of the string
M{0,3} #thousands-0 to 3 M's
(CM|CD|D?C{0,3} #hundreds
(XC|XL|L?XXX) #tens
(IX|IV|V?III) #ones
$ #end of the string
'''
print(re.search(ptn,"MMMCDLXXIX",re.VERBOSE))
def romanCheck(num):
    num=num.upper()
    if re.search(ptn,num,re.VERBOSE):
        return "VALID"
    else:
        return "INVALID"
print(romanCheck("mmCLXXXIV"))
print(romanCheck("MMMCCLXXXiv"))
I wrote this code and ran it, but I got this:
100,BROAD RD
VALID
INVALID
None
None
<_sre.SRE_Match object; span=(120, 126), match='ignore'>
Traceback (most recent call last):
File "G:\pydev\xyz\rts\regular_expressions.py", line 46, in <module>
print(re.search(ptn,"MMMCDLXXIX",re.VERBOSE))
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\re.py", line 182, in search
return _compile(pattern, flags).search(string)
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 856, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 768, in _parse
source.tell() - start)
sre_constants.error: missing ), unterminated subpattern at position 113 (line 4, column 6)
What do these errors mean? Can anyone help me?
I understand all of the output, but I am not able to understand this error.

The error means that you have passed a malformed regular expression to the search() function in line 46.
Although you have defined a valid regex in this line:
ptn=r"^(M{0,3})(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}$)"
you later overwrite this pattern (ptn) with a multi-line verbose pattern:
ptn='''
^ #beginning of the string
M{0,3} #thousands-0 to 3 M's
(CM|CD|D?C{0,3} #hundreds
(XC|XL|L?XXX) #tens
(IX|IV|V?III) #ones
$ #end of the string
'''
This is not a valid regex pattern: it is missing a closing parenthesis after (CM|CD|D?C{0,3}.
You pass this new string as the regular expression in the next line, print(re.search(ptn,"MMMCDLXXIX",re.VERBOSE)), and compiling it raises that error.
If you give this second string a different variable name in line 27 (based on your sample code, or line 38 based on your stack trace), so that the original ptn is still the one passed to re.search, everything looks fine:
100,BROAD RD
VALID
INVALID
None
None
<_sre.SRE_Match object; span=(85, 91), match='ignore'>
<_sre.SRE_Match object; span=(0, 10), match='MMMCDLXXIX'>
VALID
VALID
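Alternatively, the verbose pattern itself can be repaired by closing the hundreds group. The sketch below also assumes that the tens and ones groups were meant to use X{0,3} and I{0,3}, as in the single-line pattern, instead of the literal XXX and III; the names ROMAN and roman_check are made up for illustration.

```python
import re

# Verbose pattern with every group closed; whitespace and comments are ignored.
ROMAN = re.compile(r"""
    ^                   # beginning of the string
    M{0,3}              # thousands: 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens (assumed X{0,3}, as in the one-line pattern)
    (IX|IV|V?I{0,3})    # ones (assumed I{0,3}, as in the one-line pattern)
    $                   # end of the string
""", re.VERBOSE)


def roman_check(num):
    """Return "VALID" if num looks like a Roman numeral, else "INVALID"."""
    return "VALID" if ROMAN.search(num.upper()) else "INVALID"


print(roman_check("MMMCDLXXIX"))
print(roman_check("mmCLXXXIV"))
print(roman_check("MMMLxvviii"))
```

Compiling the pattern once also surfaces a malformed expression right where it is defined instead of at the first search call.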

I have had this issue when using re.VERBOSE. I suggest not using it in that format. Create the pattern in a single line rather than over multiple lines and don't pass the verbose parameter.

convert date from numpyarray into datetime type -> getting mystic error

I load a file f with the numpy.loadtxt function and wanted to extract some dates.
The date has a format like this: 15.08. - 21.08.2011
numpyarr = loadtxt(f, dtype=str, delimiter=";", skiprows = 1)
alldate = numpyarr[:,0][0]
dat = datetime.datetime.strptime(alldate,"%d.%m. - %d.%m.%Y")
And here is the whole error:
Traceback (most recent call last):
File "C:\PYTHON\Test DATETIME_2.py", line 52, in <module>
dat = datetime.datetime.strptime(alldate,"%d.%m. - %d.%m.%Y")
File "C:\Python27\lib\_strptime.py", line 308, in _strptime
format_regex = _TimeRE_cache.compile(format)
File "C:\Python27\lib\_strptime.py", line 265, in compile
return re_compile(self.pattern(format), IGNORECASE)
File "C:\Python27\lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "C:\Python27\lib\re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: redefinition of group name 'd' as group 3; was group 1
Does somebody have an idea what is going on?

A datetime holds a single date and time, while your field contains two dates, and you are trying to extract them into a single variable. Specifically, the error you're getting is because you've used %d and %m twice: strptime turns each directive into a named regex group, so the group names collide.
You can try something along the lines of:
a = datetime.datetime.strptime(alldate.split('-')[0],"%d.%m. ")
b = datetime.datetime.strptime(alldate.split('-')[1]," %d.%m.%Y")
a = datetime.datetime(b.year, a.month, a.day)
(it's not the best code, but it demonstrates the fact that there are two dates in two different formats hiding in your string).
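The same idea can be written as a small function. This is a sketch under one loudly stated assumption: both dates fall in the same year, so the year of the end date is appended to the start date before parsing. The name parse_range is hypothetical.

```python
import datetime


def parse_range(text):
    """Parse a string like '15.08. - 21.08.2011' into (start, end) datetimes."""
    start_text, end_text = [part.strip() for part in text.split(" - ")]
    end = datetime.datetime.strptime(end_text, "%d.%m.%Y")
    # Assumption: the start date shares the end date's year.
    start = datetime.datetime.strptime(start_text + str(end.year), "%d.%m.%Y")
    return start, end


start, end = parse_range("15.08. - 21.08.2011")
print(start.date(), end.date())
```

Parsing the start date together with a year avoids relying on the default year that strptime fills in when the format has none.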

Splitting up lines in a regular expression

I'm trying to break up a long regex into smaller chunks. Is it possible/good practice to change A to B?
A:
line = re.sub(r'\$\{([0-9]+)\}|\$([0-9]+)|\$\{(\w+?\=\w?+)\}|[^\\]\$(\w[^-]+)|[^\\]\$\{(\w[^-]+)\}',replace,line)
B:
line = re.sub(r'\$\{([0-9]+)\}|'
r'\$([0-9]+)|'
r'\$\{(\w+?\=\w?+)\}|'
r'[^\\]\$(\w[^-]+)|'
r'[^\\]\$\{(\w[^-]+)\}',replace,line)
Edit:
I receive the following error when running this in Python 2:
def main():
    while(1):
        line = raw_input("(%s)$ " % ncmd)
        line = re.sub(r'''
\$\{([0-9]+)\}|
\$([0-9]+)|
\$\{(\w+?\=\w?+)\}|
[^\\]\$(\w[^-]+)|
[^\\]\$\{(\w[^-]+)\}
''',replace,line,re.VERBOSE)
        print '>> ' + line
Error:
(1)$ abc
Traceback (most recent call last):
File "Test.py", line 4, in <module>
main()
File "Test.py", line 2, in main
[^\\]\$\{(\w[^-]+)\}''',replace,line,re.VERBOSE)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: multiple repeat

You can use a triple-quoted (multi-line) string and set the re.VERBOSE flag, which allows you to break a regex pattern over multiple lines. Pass the flag by keyword, because the fourth positional argument of re.sub is count, not flags:
line = re.sub(r'''
\$\{([0-9]+)\}|
\$([0-9]+)|
\$\{(\w+?\=\w?+)\}|
[^\\]\$(\w[^-]+)|
[^\\]\$\{(\w[^-]+)\}
''', replace, line, flags=re.VERBOSE)
You can even include comments directly inside the string:
line = re.sub(r'''
\$\{([0-9]+)\}| # Pattern 1
\$([0-9]+)| # Pattern 2
\$\{(\w+?\=\w?+)\}| # Pattern 3
[^\\]\$(\w[^-]+)| # Pattern 4
[^\\]\$\{(\w[^-]+)\} # Pattern 5
''', replace, line, flags=re.VERBOSE)
The "multiple repeat" error itself comes from \w?+, which stacks one quantifier on another (Python 3.11 and later read ?+ as a possessive quantifier instead); \w+? may be what was intended.
Lastly, it should be noted that you can likewise activate the verbose flag by using re.X or by placing (?x) at the start of your regex pattern.
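One pitfall with re.sub is that its fourth positional parameter is count, so a flag passed there is treated as a replacement limit rather than as a flag. A minimal sketch that sidesteps this by compiling the pattern first: the replace callback is a hypothetical stand-in, only three of the alternatives are shown, and the third one uses \w+? on the assumption that this is what \w?+ was meant to be.

```python
import re

# Compile once with the flag attached to the pattern itself.
PATTERN = re.compile(r'''
    \$\{([0-9]+)\}      |   # braces around a number
    \$([0-9]+)          |   # a bare number
    \$\{(\w+?=\w+?)\}       # name=value inside braces (assumed intent)
''', re.VERBOSE)


def replace(match):
    # Hypothetical callback: wrap whatever was matched in angle brackets.
    return "<" + match.group(0) + ">"


result = PATTERN.sub(replace, "echo $1 ${2} ${a=b}")
print(result)
```

With a compiled pattern there is no flags argument left to misplace in the sub call.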

You can also split the expression across lines by relying on Python's implicit concatenation of adjacent string literals, like the following:
line = re.sub(r"\$\{([0-9]+)\}|\$([0-9]+)|"
r"\$\{(.+-.+)\}|"
r"\$\{(\w+?\=\w+?)\}|"
r"\$(\w[^-]+)|\$\{(\w[^-]+)\}",replace,line)
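As a quick check that splitting a pattern this way changes nothing, adjacent string literals are joined by the parser before the regex engine ever sees them. The variable names below are made up for illustration, and only the first two alternatives are used to keep the sketch short.

```python
import re

one_line = r'\$\{([0-9]+)\}|\$([0-9]+)'

# Adjacent literals inside parentheses are concatenated at compile time.
split_up = (r'\$\{([0-9]+)\}|'
            r'\$([0-9]+)')

same = one_line == split_up
groups = re.search(split_up, "cost: $42").groups()
print(same, groups)
```

Each piece needs its own r prefix, since the prefix applies per literal and not to the concatenated result.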
