I am extracting a substring from an Excel cell and the entire string says this:
The bolts are 5" long each and 3" apart
I want to extract the length of the bolt which is 5". And I use the following code to get that
df['Bolt_Length'] = df['Description'].str.extract(r'(\s[0-9]")',expand=False)
But if the string says the following:
The bolts are 10" long each and 3" apart
and I try to use the following code:
df['Bolt_Length'] = df['Description'].str.extract(r'(\s(\d{1,2})")',expand=False)
I get the following error message:
ValueError: Columns must be same length as key
I think Python doesn't know which number to acquire. The 10" or 3"
How can I fix this? How do I tell Python to only go for the first "?
On another note what if I want to get both the bolt length and distance from another bolt? How do I extract the two at the same time?
Your problem is that you have two capture groups in your second regular expression (\s(\d{1,2})"), not one. So basically, you're telling Python to get the number with the ", and the same number without the ":
>>> df['Description'].str.extract(r'(\s(\d{1,2})")', expand=False)
     0   1
0   5"   5
1  10"  10
You can add ?: right after the opening parenthesis of a group to make it so that it does not capture anything, though it still functions as a group. The following makes it so that the inner group, which excludes the ", does not capture:
# notice                                vv
>>> df['Description'].str.extract(r'(\s(?:\d{1,2})")', expand=False)
0 5"
1 10"
Name: Description, dtype: object
The error occurs because your regex contains two capturing groups, that extract two column values, BUT you assign those to a single column, df['Bolt_Length'].
You need to use as many capturing groups in the regex pattern as there are columns you assign the values to:
df['Bolt_Length'] = df['Description'].str.extract(r'\s(\d{1,2})"',expand=False)
The \s(\d{1,2})" regex only contains one pair of unescaped parentheses that form a capturing group, so this works fine since this single value is assigned to a single Bolt_Length column.
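The group counting described above can be checked with the plain re module, which follows the same rules as str.extract. This is only a minimal sketch: the list name descriptions and the two-group pattern are illustrative assumptions, not part of the original code. The second pattern also suggests one way to get the bolt length and the spacing in a single pass, with one capture group per value; passed to str.extract, a pattern like this would produce two columns.

```python
import re

descriptions = [
    'The bolts are 5" long each and 3" apart',
    'The bolts are 10" long each and 3" apart',
]

# Nested parentheses create two capture groups, so two values come back.
nested = re.search(r'(\s(\d{1,2})")', descriptions[1])
print(nested.groups())  # (' 10"', '10')

# One capture group per wanted value: the first number followed by a
# double quote is taken as the length, the next one as the spacing.
both = re.compile(r'(\d{1,2})"\D*(\d{1,2})"')

pairs = [both.search(text).groups() for text in descriptions]
print(pairs)  # [('5', '3'), ('10', '3')]
```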
I want to write code that can parse American phone numbers (ie. "(664)298-4397") . Below are the constraints:
allow leading and trailing white spaces
allow white spaces that appear between area code and local numbers
no white spaces in area code or the seven digit number XXX-XXXX
Ultimately I want to print a tuple of strings (area_code, first_three_digits_local, last_four_digits_local)
I have two sets of questions.
Question 1:
Below are inputs my code should accept and print the tuple for:
'(664) 298-4397', '(664)298-4397', ' (664) 298-4397'
Below is the code I tried:
regex_parse1 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664) 298-4397')
print (f' groups are: {regex_parse1.groups()} \n')
regex_parse2 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664)298-4397')
print (f' groups are: {regex_parse2.groups()} \n')
regex_parse3 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' (664) 298-4397')
print (f' groups are: {regex_parse3.groups()}')
The string input for all three are valid and should return the tuple:
('664', '298', '4397')
But instead I'm getting the output below for all three:
groups are: ('', '', '4397')
What am I doing wrong?
Question 2:
The following two chunks of code should output an 'NoneType' object has no attribute 'group' error because the input phone number string violates the constraints. But instead, I get outputs for all three.
regex_parse4 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(404)555 -1212')
print (f' groups are: {regex_parse4.groups()}')
regex_parse5 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' ( 404)121-2121')
print (f' groups are: {regex_parse5.groups()}')
Expected output: should be an error but I get this instead for all three:
groups are: ('', '', '2121')
What is wrong with my regex code?
In general, your regex overuses the asterisk *. Details as follows:
You have 3 capturing groups:
([\s]*[(]*[0-9]*[)]*[\s]*)
([\s]*[0-9]*)
([0-9]*[\s]*)
You use an asterisk on every single item, including the opening and closing parentheses. In fact, almost everything in your regex is quantified with an asterisk, so the capturing groups can also match empty strings. That is why your first and second capturing groups return empty strings. The only item without an asterisk is the hyphen - just before the third capturing group, which is also why the regex does capture the third group, as in 4397 and 2121.
To solve your problem, use the asterisk only where it is needed.
In fact, your regex still has plenty of room for improvement. For example, it currently matches digit runs of any length (instead of chunks of 3 or 4 digits). It also allows an area code that is not enclosed in parentheses (because of the asterisks on the parenthesis symbols).
For this kind of common regex, you don't need to reinvent the wheel. You can refer to ready-made regexes that are easy to find on the Internet. For example, you can refer to this post. Although the post uses JavaScript instead of Python, the regex is much the same.
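As a rough illustration of using quantifiers only where they are needed, the sketch below fixes the digit counts with {3} and {4}, requires both parentheses, and allows optional whitespace only at the start, at the end, and between the area code and the local number. The names PHONE and parse_phone are hypothetical, and the pattern covers only the constraints listed in the question.

```python
import re

# Hypothetical pattern: \s* appears only where whitespace is allowed.
PHONE = re.compile(r'\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*')

def parse_phone(text):
    """Return (area, prefix, line), or None when the input breaks the rules."""
    match = PHONE.fullmatch(text)
    return match.groups() if match else None

for candidate in ['(664) 298-4397', '(664)298-4397', ' (664) 298-4397',
                  '(404)555 -1212', ' ( 404)121-2121']:
    print(repr(candidate), parse_phone(candidate))
```

Returning None instead of raising keeps the check explicit; calling .groups() directly on a failed match would raise the AttributeError mentioned in the question.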
Try:
regex_parse4 = re.match(r'([(]*[0-9]{3}[)])\s*([0-9]{3}).([0-9]{4})', number)
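This assumes a 3-digit area code in parentheses, followed by XXX-XXXX. Note that the first group as written also captures the parentheses, and the unescaped . accepts any separator character, not only a hyphen.
re.match returns None when there is no match.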
If above does not work, here is a helpful regex tool:
https://regex101.com
Edit:
Another suggestion is to clean data prior to applying a new regex. This helps with instances of abnormal spacing, gets rid of parentheses, and '-'.
clean_number = re.sub("[^0-9]", "", original_number)
regex_parse = re.match(r'([0-9]{3})([0-9]{3})([0-9]{4})', clean_number)
print(f'groups are: {regex_parse.groups()}')
>>> ('xxx', 'xxx', 'xxxx')
I've used BeautifulSoup and pandas to create a csv with columns that contain error codes and corresponding error messages.
Before formatting, the columns look something like this
-132456ErrorMessage
-3254Some other Error
-45466You've now used 3 different examples. 2 more to go.
-10240 This time there was a space.
-1232113That was a long number.
I've successfully isolated the text of the codes like this:
dfDSError['text'] = dfDSError['text'].map(lambda x: x.lstrip('-0123456789'))
This returns just what I want.
But I've been struggling to come up with a solution for the codes.
I tried this:
dfDSError['codes'] = dfDSError['codes'].replace(regex=True,to_replace= r'\D',value=r'')
But that will append numbers from the error message to the end of the code number. So for the third example above instead of 45466 I would get 4546632. Also I would like to keep the leading minus sign.
I thought maybe that I could somehow combine rstrip() with a regex to find where there was a nondigit or a space next to a space and remove everything else, but I've been unsuccessful.
for_removal = re.compile(r'\d\D*')
dfDSError['codes'] = dfDSError['codes'].map(lambda x: x.rstrip(re.findall(for_removal,x)))
TypeError: rstrip arg must be None, unicode or str
Any suggestions? Thanks!
You can use extract:
dfDSError[['code','text']] = dfDSError.text.str.extract('([-0-9]+)(.*)', expand=True)
print (dfDSError)
                                                text      code
0                                       ErrorMessage   -132456
1                                   Some other Error     -3254
2  You've now used 3 different examples. 2 more t...    -45466
3                       This time there was a space.    -10240
4                            That was a long number.  -1232113
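The same split can be sketched with the plain re module to show why digits inside the message do not leak into the code. The pattern here is a slightly stricter variant chosen for illustration (an optional leading minus, then digits), and the names rows, pattern and parsed are assumptions rather than part of the original answer.

```python
import re

rows = [
    "-132456ErrorMessage",
    "-3254Some other Error",
    "-45466You've now used 3 different examples. 2 more to go.",
    "-10240 This time there was a space.",
    "-1232113That was a long number.",
]

# Optional minus sign, the leading run of digits, optional whitespace,
# then everything else as the message text.
pattern = re.compile(r'(-?\d+)\s*(.*)')

parsed = [pattern.match(row).groups() for row in rows]
for code, text in parsed:
    print(code, '|', text)
```

Because the first group stops at the first non-digit, the 3 and 2 inside the third message stay in the text part.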
How do I replace only particular dots in a text string?
string_example = '123|4.3|123.54|sdflk|hfghjkkf.ffg..t.s..9.7..tg..3..654..2.fd'
I need only the single dots that sit between two digits (4.3 from |4.3|, 3.5 from 123.54, etc.)
to be replaced by commas in the original string. Is it possible?
If so, how?
So, the result string must be:
string_final = '123|4,3|123,54|sdflk|hfghjkkf.ffg..t.s..9,7..tg..3...654..2.fd'
Thanks in advance.
import re
string_example = '123|4.3|123.54|sdflk|hfghjkkf.ffg..t.s..4..tg..3...654..2.fd'
string_final = re.sub(r'(\d)\.(\d)', r'\1,\2', string_example)
print(string_final)
123|4,3|123,54|sdflk|hfghjkkf.ffg..t.s..4..tg..3...654..2.fd
We use a regular expression to find "digit . digit" (the digits are captured into groups with parentheses) and replace them with "group 1 , group 2" (the groups are the corresponding digits).
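A possible variation on the substitution above uses lookbehind and lookahead instead of capture groups. This is a sketch, not a replacement for the answer: the neighbouring digits are only checked, not consumed, which matters for chained values such as 1.2.3 (an input invented here for illustration).

```python
import re

text = '123|4.3|123.54|sdflk|hfghjkkf.ffg..t.s..9.7..tg..3..654..2.fd'

# Lookbehind/lookahead test the neighbouring digits without consuming them.
result = re.sub(r'(?<=\d)\.(?=\d)', ',', text)
print(result)

# With capture groups the middle digit is consumed by the first match,
# so the second dot in a chain is skipped.
chained = re.sub(r'(?<=\d)\.(?=\d)', ',', '1.2.3')
grouped = re.sub(r'(\d)\.(\d)', r'\1,\2', '1.2.3')
print(chained, grouped)  # 1,2,3 1,2.3
```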
I have a string like this:
string = 'The stock item "28031 (111111: Test product)" was added successfully.'
I need to take the first 5 digits from the string (for example "28031") and save them to another string.
This is because I am a Selenium tester, and every time I create a new stock item it has a different first 5 digits.
Thank you for your help.
Filip
import re
m = re.search(r"\d+", string)
print(m.group(0))
This prints 28031.
It just selects the first run of digits, regardless of its length (2803 would be selected, too).
Firstly I am assuming all these strings have exactly the same format. If so the simplest way to get your stock item number is:
stocknumber = string.split()[3][1:]
After sehe's answer I am leaving mine, edited just to show how to match exactly 5 digits:
import re
re.search(r'\d{5}', string).group(0)
EDIT: neurino's solution is the smartest! Use it.
EDIT: sehe's solution is smart and perfect; you can add this line to get only the first 5 digits:
print(m.group(0)[0:5])
[0:5] takes the characters from index 0 up to, but not including, index 5 (the first 5 characters).
Use the built-in str.isdigit method:
string = "The stock item 28031 "
Digitstring = ''
for i in string:
    if i.isdigit():
        Digitstring += i
print(Digitstring)
Output:
28031
You can count up to the first x digits you need and then stop.
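The "count and then stop" idea can be sketched as follows. The helper name first_digits is hypothetical, and itertools.islice is just one way to stop early; a counter with break would work as well.

```python
from itertools import islice

def first_digits(text, count):
    """Collect digits from left to right and stop after `count` of them."""
    digits = (ch for ch in text if ch.isdigit())
    return ''.join(islice(digits, count))

message = 'The stock item "28031 (111111: Test product)" was added successfully.'
print(first_digits(message, 5))  # 28031
```

Note that this gathers digits wherever they occur, so it relies on the stock number being the first digits in the message.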
I'm new to regex, and I'm starting to sort of get the hang of things. I have a string that looks like this:
This is a generated number #123 which is an integer.
The text that I've shown here around the 123 will always stay exactly the same, but it may have further text on either side. The number itself may be 123, 597392, or really any run of one or more digits. I believe I can match the number and the following text using \d+(?= which is an integer.), but how do I write the look-behind part?
When I try (?<=This is a generated number #)\d+(?= which is an integer.), it does not match using regexpal.com as a tester.
Also, how would I use python to get this into a variable (stored as an int)?
NOTE: I only want to find the numbers that are sandwiched in between the text I've shown. The string might be much longer with many more numbers.
You don't really need a fancy regex. Just use a group on what you want.
re.search(r'#(\d+)', 'This is a generated number #123 which is an integer.').group(1)
If you want to match a number in the middle of some known text, follow the same rule:
r'some text you know (\d+) other text you also know'
res = re.search(r'#(\d+)', 'This is a generated number #123 which is an integer.')
if res is not None:
    integer = int(res.group(1))
You can just use findall() from the re module.
string = "This is a string that contains #134534 and other things"
match = re.findall(r'#\d+ .+', string)
print(match)
The output would be ['#134534 and other things'].
This will match a number of any length (#123 or #123235345), then a space, then the rest of the line until it hits a newline character.
If you want to get the numbers only when they follow the text "This is a generated number #" AND are followed by " which is an integer.", you don't need look-behind and lookahead. You can simply match the whole string, like:
r"This is a generated number #(\d+) which is an integer\."
I am not sure if I understood what you really want though. :)
updated
In [16]: a='This is a generated number #123 which is an integer.'
In [17]: b='This should be a generated number #123 which could be an integer.'
In [18]: exp=r"This is a generated number #(\d+) which is an integer\."
In [19]: result =re.search(exp, a)
In [20]: int(result.group(1))
Out[20]: 123
In [21]: result = re.search(exp,b)
In [22]: result is None
Out[22]: True
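For completeness, the look-behind form asked about in the question does work with Python's re module, since the look-behind text has a fixed length. The sketch below uses an invented sample string with extra text and a second number around the target; one possible reason an online tester rejected the pattern is that JavaScript-based testers historically lacked look-behind support, though that is an assumption about the tool.

```python
import re

text = 'Intro. This is a generated number #123 which is an integer. Also #999.'

# Fixed-width look-behind plus lookahead; the final dot is escaped so it
# matches a literal period only.
pattern = re.compile(
    r'(?<=This is a generated number #)\d+(?= which is an integer\.)'
)

match = pattern.search(text)
number = int(match.group()) if match else None
print(number)  # 123
```

Only the digits are part of the match here, so match.group() can be passed to int() directly without a capture group.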