Split string from digits/number according to sentence length - python

I have cases where I need to separate chars/words from digits/numbers that are written consecutively, but only when the char/word length is more than 3.
For example,
input
ferrari03
output must be:
ferrari 03
However, it shouldn't do any action for the followings:
fe03, 03fe, 03ferrari etc.
Can you help me with this one? I'm trying to do this without hand-coding any logic, using only the re library in Python.

Using re.sub() we can try:
import re
inp = ["ferrari03", "fe03", "03ferrari", "03fe"]
output = [re.sub(r'^([A-Za-z]{3,})([0-9]+)$', r'\1 \2', i) for i in inp]
print(output) # ['ferrari 03', 'fe03', '03ferrari', '03fe']
Given an input word, the above regex will match should that word begin with 3 or more letters and end in 1 or more digits. In that case, we capture the letters and numbers in the \1 and \2 capture groups, respectively. We replace by inserting a separating space.
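As a minimal sketch of the same idea, the letter threshold can be made a parameter. The helper name split_trailing_digits and the min_letters argument are made up for illustration; pass min_letters=4 if "more than 3" is meant strictly.

```python
import re

def split_trailing_digits(word, min_letters=3):
    # Build the same pattern as above, with a configurable minimum letter count.
    pattern = rf'^([A-Za-z]{{{min_letters},}})([0-9]+)$'
    # If the pattern does not match, re.sub returns the word unchanged.
    return re.sub(pattern, r'\1 \2', word)

for w in ["ferrari03", "fe03", "03ferrari", "03fe"]:
    print(split_trailing_digits(w))
```

With the default threshold, only ferrari03 is changed; the other three words come back untouched.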

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings. I currently do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX': it returns a='' instead of the a='202' I want. Therefore, I want the code to split the string on a "2022" that is not at the beginning of the non-whitespace characters in the string.
Can you please guide me with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find, and since nothing stops it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022 itself, but in the digits before it: the \d+ will match as long a string of digits as it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that 2022 won't be part of the match; this is a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
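To see how the two answers so far differ, here is a small sketch (the variable names are mine) that runs the lazy capture and the greedy lookahead on that same string:

```python
import re

sample = ' 920220220519AX 12022'

# Lazy capture: stop at the first place where "2022" can follow the digits.
lazy = re.search(r'^\s*(\d+?)2022', sample)
lazy_digits = lazy.group(1) if lazy else None

# Greedy lookahead: from each starting point, take the longest run of digits
# that is still followed by "2022".
greedy_digits = re.findall(r'\d+(?=2022)', sample)

print(lazy_digits)
print(greedy_digits)
```

The lazy version gives 9, while the greedy lookahead gives ['9202', '1'], so which one is right depends on which "2022" you consider the separator.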
You could use re.split() with a lookbehind asserting that the match is not at the start of the string (after calling strip()), or you could match the first occurrence of one or more digits from the start of the string, in case there are more occurrences of 2022:
import re
strings = [
    ' 1020220519AX',
    ' 20220220519AX'
]
for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])
for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits; the string is only split.
If the string consists of only word characters, splitting on \B2022, where \B means not a word boundary, will also prevent splitting at the start of the example string.
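A quick sketch of the \B variant, assuming the stripped strings contain only word characters (letters, digits, underscore); first_part is a name chosen for illustration:

```python
import re

def first_part(s):
    # \B cannot match at the start of a string that begins with a word
    # character, so a leading "2022" is never used as the split point.
    return re.split(r'\B2022', s.strip(), maxsplit=1)[0]

print(first_part(' 1020220519AX'))
print(first_part(' 20220220519AX'))
```

This prints 10 and 202, the same as the two variants above.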

Python regex - group

I scraped some text from pdfs and accents/umlaut on characters get scraped after their letter, e.g.: "Jos´e" and "Mu¨ller". Because there are just a few of these characters, I would like to fix them to e.g. "José" and "Müller".
I am trying to adapt the pattern here Regex to match words with hyphens and/or apostrophes.
pattern = r"(?=\S*[´])([a-zA-Z´]+)"
ms = re.finditer(pattern, "Jos´e Vald´ez")
for m in ms:
    m.group() # returns "Jos´e" and "Vald´ez"
    m.start() # returns 0 and 6, but I want 3 and 10
In the example above, what pattern can I use to get the position of the '´' character? Then I can check the subsequent letter and replace the text accordingly.
My texts are scraped from scientific papers and could contain those characters elsewhere, for example in code. That is the reason why I am using regex instead of .replace or text normalization with e.g. unicodedata, because I want to make sure I am replacing "words" (more precisely the authors' first and last names).
EDIT: I can relax these conditions and simply replace those characters everywhere because, if they appear in non-words such as "F=m⋅x¨", I will discard non-words anyway. Therefore, I can use a simple replace approach
I suggest using
import re
d = {'´e': 'é', 'u¨' : 'ü'}
pattern = "|".join([x for x in d])
print( re.sub(pattern, lambda m: d[m.group()], "Jos´e Vald´ez") )
# => José Valdéz
If you need to make sure there are word boundaries, you may consider using
pattern = r"\b´e|u¨\b"
\b before ´ and after ¨ makes sure there is a word character immediately before/after them.
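If more accent/letter combinations turn up than are practical to list in d, one possible generalisation (my assumption, not part of the answer above) is to turn each stray spacing accent into the matching Unicode combining mark and let unicodedata.normalize compose the result. The sketch assumes, as in the dictionary above, that ´ is scraped before its letter and ¨ after it; repair_accents is a hypothetical helper name.

```python
import re
import unicodedata

ACUTE = "\u00b4"      # spacing acute accent, as in "Jos´e"
DIAERESIS = "\u00a8"  # spacing diaeresis, as in "Mu¨ller"

def repair_accents(text):
    # Assumption: the acute accent was scraped *before* its letter.
    text = re.sub(ACUTE + r"([A-Za-z])", "\\1\u0301", text)
    # Assumption: the diaeresis was scraped *after* its letter.
    text = re.sub(r"([A-Za-z])" + DIAERESIS, "\\1\u0308", text)
    # NFC composes letter + combining mark into one character where one exists.
    return unicodedata.normalize("NFC", text)

print(repair_accents("Jos\u00b4e Vald\u00b4ez Mu\u00a8ller"))
```

Accents that are not next to an ASCII letter are left alone, so something like "F=m⋅x¨" would still be changed (x is a letter); filter non-words first if that matters.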
A quick fix to the pattern returns the indexes you are looking for. Instead of matching the whole word, the group captures only the accent characters.
import re
pattern = r"(?=\S*[´])[a-zA-Z]+([´]+)[a-zA-Z]+"
ms = re.finditer(pattern, "Jos´e Vald´ez")
for m in ms:
    print(m.group()) # returns "Jos´e" and "Vald´ez"
    print(m.start(1)) # returns 3 and 10

How to match patterns in one sentence using regex in python?

Here are 2 examples,
1. I need to take this apple. I just finished the first one.
2. I need to get some sleep. apple is not working.
I want to match the text with need and apple in the same sentence.
By using need.*apple it will match both examples. But I want it to work only for the first one. How do I change the code, or do we have other string methods in Python?
The comment posted by @ctwheels about splitting on . and then testing whether each piece contains apple and need is a good one that does not require regular expressions. I would first, however, split again on white space and then test these words against the resulting list to ensure you do not match against applesauce. But here is a regex solution:
import re
text = """I need to take this apple. I just finished the first one.
I need to get some sleep. apple is not working."""
regex = re.compile(r"""
[^.]* # match 0 or more non-period characters
(
\bneed\b # match 'need' on a word boundary
[^.]* # match 0 or more non-period characters
\bapple\b # match 'apple' on a word boundary
| # or
\bapple\b # match 'apple' on a word boundary
[^.]* # match 0 or more non-period characters
\bneed\b # match 'need' on a word boundary
)
[^.]* # match 0 or more non-period characters
\. # match a period
""", flags=re.VERBOSE)
for m in regex.finditer(text):
    print(m.group(0))
Prints:
I need to take this apple.
Both of these solutions have a problem if the sentence contains a period used for something other than ending a sentence, such as I need to take John Q. Public's apple. In this case you need a more powerful mechanism for dividing the text up into sentences. The regex that operates against these sentences then, of course, becomes simpler, but splitting on white space still seems to make the most sense.
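For completeness, the non-regex approach described at the top of this answer could look roughly like the sketch below. The function name sentences_containing is made up, and the sketch shares the limitation just mentioned: every period is treated as a sentence end.

```python
def sentences_containing(text, required_words):
    matches = []
    # Naive sentence split: every period ends a sentence.
    for sentence in text.split("."):
        # Split on white space so "apple" does not match "applesauce".
        words = set(sentence.split())
        if required_words <= words:
            matches.append(sentence.strip() + ".")
    return matches

text = """I need to take this apple. I just finished the first one.
I need to get some sleep. apple is not working."""
print(sentences_containing(text, {"need", "apple"}))
```

Words with punctuation attached (for example "apple,") would not be recognised by this sketch; strip punctuation from each word first if the text needs it.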

insert space between regex match

I want to un-join typos in my string by locating them using regex and insert a space character between the matched expression.
I tried the solution to a similar question (Insert space between characters regex), which is to use '\1 \2' as the replacement string in re.sub, but it did not work for me.
import re
corpus = '''
This is my corpus1a.I am looking to convert it into a 2corpus 2b.
'''
clean = re.compile('\.[^(\d,\s)]')
corpus = re.sub(clean,' ', corpus)
clean2 = re.compile('\d+[^(\d,\s,\.)]')
corpus = re.sub(clean2,'\1 \2', corpus)
EXPECTED OUTPUT:
This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
You need to put the capture group parentheses around the patterns that match each string that you want to copy to the result.
There's also no need to use + after \d. You only need to match the last digit of the number.
clean = re.compile(r'(\d)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
I'm not sure about other possible inputs, we might be able to add spaces using an expression similar to:
(\d+)([a-z]+)\b
after that we would replace any two spaces with a single space and it might work, not sure though:
import re
print(re.sub(r"\s{2,}", " ", re.sub(r"(\d+)([a-z]+)\b", " \\1 \\2", "This is my corpus1a.I am looking to convert it into a 2corpus 2b")))
Capture groups, marked by parentheses ( and ), should be around the patterns you want to reuse in the replacement.
So this should work for you:
clean = re.compile(r'(\d+)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
The regex (\d+)([^\d,\s]) reads: match 1 or more digits (\d+) as group 1 (first set of parentheses), then match one character that is not a digit, comma, or whitespace as group 2. The replacement must be a raw string (or use doubled backslashes); otherwise Python reads '\1' and '\2' as control characters rather than group references.
The reason yours didn't work is that you did not have parentheses surrounding the patterns you want to reuse.
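Putting the pieces together, here is one possible sketch that reaches the expected output from the question. The three-step split (letter before digit, digit before letter, period glued to the next word) is my own assumption about what the asker wants, not something taken from the answers above.

```python
import re

corpus = "This is my corpus1a.I am looking to convert it into a 2corpus 2b."

# In a normal string literal, \1 and \2 are octal escapes for control
# characters, which is why the replacement needs a raw string.
assert '\1 \2' == '\x01 \x02'

spaced = re.sub(r'([A-Za-z])(\d)', r'\1 \2', corpus)  # letter followed by digit
spaced = re.sub(r'(\d)([A-Za-z])', r'\1 \2', spaced)  # digit followed by letter
spaced = re.sub(r'\.(?=\S)', '. ', spaced)            # period glued to the next word
print(spaced)
```

This prints "This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b." Note that the last step would also split decimals such as 3.14, so it only suits text without them.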

Regex sub phone number format multiple times on same string

I'm trying to use regular expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)\d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall, and re.sub but had no luck. I'm confused on how to deal with the circumstance of there being multiple matches on a line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]
Here's a simpler solution that works for all of those cases, though it is a little naïve (and doesn't care about matching brackets).
\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\1)\2-\3
Explanation:
It works by first checking for 3 digits, optionally surrounded by brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character, and then another 3 digits, also stored in a capturing group: [ .-]?(\d{3}). The hyphen is placed last in the character class so that it is read as a literal hyphen; written as [ -.], it would instead denote the range of characters from space to period.
And lastly, it does the previous step again, but with 4 digits instead of 3: [ .-]?(\d{4})
Python:
To use it in Python, compile the pattern above as p, then iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\2)\3-\4
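As a runnable sketch of the simpler approach applied to the sample list from the question (the names PHONE, numbers and formatted are mine, and the separator class is written with the hyphen last so that it is a literal hyphen):

```python
import re

# Optional brackets around the area code, optional separator, 3 digits,
# optional separator, 4 digits.
PHONE = re.compile(r'\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})')

numbers = [
    "(123)456-7890 (321)-654-0987",
    "(111) 111-1111",
    "222-222-2222",
    "(333)333.3333",
    "(444).444.4444",
    "blah blah blah (555) 555.5555",
    "666.666.6666 random text",
]

# sub() replaces every match in a string, so several numbers on one line
# are all reformatted in a single call.
formatted = [PHONE.sub(r'(\1)\2-\3', s) for s in numbers]
for line in formatted:
    print(line)
```

This reproduces the desired output from the question. It does not enforce the "space or start/end of string" requirement, so digits embedded in longer tokens could also be rewritten.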
