remove n before a string - python

I want to remove the unwanted r and n at the beginning of each capitalized word and number in this string. I tried a regex, but I am not sure whether a regex or some other method is the right tool here.
This is the code I am trying to use:
import re
text = "nFamily n49 new nTom"
regex_pattern = re.compile(r'.*n[A-Z][a-z]*|[0-9]*\s')
matches = regex_pattern.findall(text)
for match in matches:
    text = text.replace(match," ")
print(text)
Expected output:
Family 49 new Tom

You can use
text = re.sub(r'\bn(?=[A-Z0-9])', '', text)
See the regex demo.
Details:
\b - a word boundary; here, the start of a word
n - the letter n
(?=[A-Z0-9]) - a positive lookahead that requires an uppercase ASCII letter or a digit to be present immediately to the right of the current location.
See the Python demo:
import re
rx = r"\bn(?=[A-Z0-9])"
text = "nFamily n49 new nTom"
print(re.sub(rx, '', text))
# => Family 49 new Tom
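The question also mentions a stray r. As a minimal sketch, assuming the debris is always a run of r and/or n glued to the front of a capitalized word or a number, the character class can be widened. The sample string is made up for illustration, and a legitimate word that starts with a lowercase r or n followed by a capital letter would be altered too.

```python
import re

# Hypothetical sample: the leading "rn", "n" and "r" are assumed to be debris.
sample = "rnFamily n49 new rTom"

# [rn]+ removes a run of r/n letters, but only at the start of a word and
# only when an uppercase ASCII letter or a digit follows immediately.
cleaned = re.sub(r"\b[rn]+(?=[A-Z0-9])", "", sample)
print(cleaned)
# => Family 49 new Tom
```

The word "new" is left alone because its n is followed by a lowercase letter.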

Related

Regex pattern to find n non-space characters of x length after a certain substring

I am using the regex pattern pattern = r'cig[\s:.]*(\w{10})' to extract the 10 characters after "cig" in each line of my dataframe. This pattern covers every case except the ones where the 10-character substring itself contains spaces.
For example, I am trying to extract Z9F27D2198 from the string
/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031
Stack Overflow seems to have collapsed the whitespace in the string above; there should be 17 whitespace characters between F and 2, after CIG.
Could you help me to edit the regex pattern in order to account for the white spaces in that 10-characters substring? I am also using flags=re.I to ignore the case of the strings in my re.findall calls.
To give an example string for which this pattern works:
CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E
and it outputs what I want: 7826328A2B.
Thanks in advance.
You can use
r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
See the regex demo. Details:
cig - a cig string
[\s:.]* - zero or more whitespaces, : or .
(\S(?:\s*\S){9}) - Group 1: a non-whitespace char and then nine occurrences of zero or more whitespaces followed with a non-whitespace char
(?!\S) - immediately to the right, there must be a whitespace or end of string.
In Python, you can use
import re
text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
matches = re.finditer(pattern, text, re.I)
for match in matches:
    # Strip the inner whitespace from the captured 10-character code
    print(re.sub(r'\s+', '', match.group(1)), 'found at', match.span(1))
# => Z9F27D2198 found at (32, 43)
# (the end index grows with the amount of whitespace inside the code)
See the Python demo.
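Since the question applies the pattern to many lines, here is a minimal sketch that wraps the same regex in a helper returning the cleaned code or None. The names CIG_RE and extract_cig are hypothetical, and the first test line rebuilds the 17 whitespace characters described in the question.

```python
import re

# Same pattern as in the answer, compiled once with IGNORECASE.
CIG_RE = re.compile(r"cig[\s:.]*(\S(?:\s*\S){9})(?!\S)", re.I)

def extract_cig(line):
    """Return the 10 characters after 'cig' with inner whitespace removed, or None."""
    m = CIG_RE.search(line)
    if m is None:
        return None
    return re.sub(r"\s+", "", m.group(1))

lines = [
    # 17 spaces between "Z9F" and "27D2198", as described in the question
    "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031",
    "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E",
    "no code here",
]
codes = [extract_cig(line) for line in lines]
print(codes)
# => ['Z9F27D2198', '7826328A2B', None]
```

On a dataframe column, a function like this could presumably be applied with Series.map, though that depends on how the data is stored.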
What about:
# Remove all spaces with replace(), then take the 10 characters after "CIG"
x = 'CIG7826328A2B FORNITURA ENERGIA ELETTRICA U'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = '7826328A2B'
x = '/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = 'Z9F27D2198'
This works fine if there is only one "CIG" in the string.

Python: using regex and while loop to break out paragraphs in a PDF

I have a PDF file with 82 paragraphs, and my goal is to use Python to break each paragraph into its own block of text. I have already extracted the text using PyPDF2.
The paragraphs all begin with a number and a period (1. 42. 76. etc.). The code below works for most of the paragraphs, but it doesn't always take the period into account. For example, the match output for number 18 is: "18 (06/". This shouldn't have been picked up, because there is no period after the 18. Any suggestions?
Code for finding positions:
i = 1
all_positions = []
found = "found"
while found == "found":
    matches = []
    matches_positions = []
    standard_length = 0
    substring = str(i) + "."
    matches = re.finditer(substring, text, re.IGNORECASE)
    matches_positions = [match.start() for match in matches]
    standard_length = len(matches_positions)
    if standard_length > 0:
        all_positions.append(matches_positions[0])
        i += 1
    else:
        found = "not found"
Code for printing output:
for i in range(0,len(all_positions)):
    print('---')
    print(text[all_positions[i]:all_positions[i+1]])
You can use the below regex to achieve your result:
^\d+\. ?(.*)
Explanation of the above regex:
^ - Matches the start of the string, or the start of each line when re.MULTILINE is set.
\d+ - Matches one or more digits.
\. - Matches a literal dot.
" ?" - Matches zero or one space character.
(.*) - Capturing group that greedily captures the rest of the line, i.e. the paragraph text.
You can find the regex demo here.
IMPLEMENTATION IN PYTHON:
import re
pattern = re.compile(r"^\d+\. ?(.*)", re.MULTILINE)
match = pattern.findall("1. Hellow World\n"
                        "23. This is loremIpsum text\n"
                        "9001. Some random textbcjsbcskcbksck sbcksbcksckscsk\n"
                        "90 (89. Some other")
print(match)
# Output - ['Hellow World', 'This is loremIpsum text', 'Some random textbcjsbcskcbksck sbcksbcksckscsk']
You can find the implementation of the above code here.
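The "18 (06/" match in the question comes from the unescaped period: in a regex, "." matches any character, so the pattern "18." also matches "18 ". A minimal sketch of the asker's loop with that fixed, assuming each paragraph number sits at the start of a line in the extracted text (the sample text is made up):

```python
import re

# An unescaped "." matches any character; re.escape makes it a literal period.
unescaped_hit = re.search("18.", "18 (06/")
escaped_hit = re.search(re.escape("18."), "18 (06/")
print(unescaped_hit is not None, escaped_hit is None)

# Hypothetical stand-in for the text extracted from the PDF.
text = (
    "1. First paragraph, which mentions 2 (06/2020) in passing.\n"
    "2. Second paragraph.\n"
    "3. Third paragraph.\n"
)

positions = []
i = 1
while True:
    # Look for "<i>." only at the start of a line, with the period escaped.
    m = re.search(r"^" + re.escape(f"{i}."), text, re.MULTILINE)
    if m is None:
        break
    positions.append(m.start())
    i += 1

# Add the end of the text so the last paragraph has a closing boundary.
positions.append(len(text))
paragraphs = [text[a:b].strip() for a, b in zip(positions, positions[1:])]
for p in paragraphs:
    print('---')
    print(p)
```

Pairing each position with the next one through zip also avoids indexing past the end of the list on the last paragraph.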

How to insert space between alphabet characters and numeric character using regex?

I'm trying to insert a space between the alphabetic and the numeric characters so that I can later convert the numbers to words, like this:
Input :
subject101
street45
Output :
subject 101
street 45
I tried this one
re.sub('[a-z][\d]|[\d][a-z]',' ','subject101')
but the output was like this :
subjec 01
How can I do it using python?
Try this Regex:
(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)
Replace each match with a space
Explanation:
(?i) - modifier to make the matches case-insensitive
(?<=\d)(?=[a-z]) - finds the position just preceded by a digit and followed by a letter
| - OR
(?<=[a-z])(?=\d) - finds the position just preceded by a letter and followed by a digit
Code output
import re
regex = r"(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)"
test_str = ("subject101\n"
            " street45")
subst = " "
# Every match is a zero-width position, so the substitution only inserts a space
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
You can use a conditional group, (?(1)yes|no), in the regex to check whether the preceding character was a letter or a digit.
Regex: (?<=([a-z])|\d)(?=(?(1)\d|[a-z]))
Python code:
import re
def addSpace(text):
    return re.sub(r'(?<=([a-z])|\d)(?=(?(1)\d|[a-z]))', ' ', text)
Output:
addSpace('subject101')
>>> subject 101
addSpace('101subject')
>>> 101 subject
A way to do this would be to pass a callable to re.sub. This allows you to reuse the matched substring to generate the replacement value.
subject = '101subject101'
s = re.sub(r'[a-zA-Z]\d|\d[a-zA-Z]', lambda m: ' '.join(m.group()), subject )
# s: '101 subject 101'
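One difference between the approaches is worth checking on your own data: the callable version consumes two characters per match, so in a string where letters and digits alternate one by one it can skip a boundary, while the zero-width lookaround version does not. A small sketch comparing the two (the function names are hypothetical):

```python
import re

# Zero-width positions between a digit and a letter, in either order.
LOOKAROUND = re.compile(r"(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)", re.I)
# Two-character letter/digit pairs, as in the callable approach.
PAIRS = re.compile(r"[a-zA-Z]\d|\d[a-zA-Z]")

def space_by_lookaround(text):
    return LOOKAROUND.sub(" ", text)

def space_by_pairs(text):
    return PAIRS.sub(lambda m: " ".join(m.group()), text)

print(space_by_lookaround("subject101"), "|", space_by_pairs("subject101"))
# => subject 101 | subject 101
print(space_by_lookaround("a1b2"), "|", space_by_pairs("a1b2"))
# => a 1 b 2 | a 1b 2
```

For inputs like subject101 or 101subject101 both give the same result.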

Delete the repetition of a specific word in a row

For example I have a string:
my_str = 'my example example string contains example some text'
What I want to do is delete all duplicates of a specific word, but only when they appear consecutively. Result:
my example string contains example some text
I tried next code:
import re
my_str = re.sub(' example +', ' example ', my_str)
or
my_str = re.sub('\[ example ]+', ' example ', my_str)
But it doesn't work.
I know there are a lot of questions about re, but I still can't implement them to my case correctly.
You need to create a group and quantify it:
import re
my_str = 'my example example string contains example some text'
my_str = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', my_str)
print(my_str) # => my example string contains example some text
# To build the pattern dynamically, if your word is not static
word = "example"
my_str = re.sub(r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word)), r'\1', my_str)
See the Python demo
I added word boundaries as - judging by the spaces in the original code - whole word matches are expected.
See the regex demo here:
\b - word boundary (replaced with (?<!\w) - no word char before the current position is allowed - in the dynamic approach since re.escape might also support "words" like .word. and then \b might stop the regex from matching)
(example) - Group 1 (referred to with \1 from the replacement pattern):
the example word
(?:\s+\1)+ - 1 or more occurrences of
\s+ - 1+ whitespaces
\1 - a backreference to the Group 1 value, that is, an example word
\b - word boundary (replaced with (?!\w) - no word char after the current position is allowed).
Remember that in Python 2.x, you need to use re.U if you need to make \b word boundary Unicode-aware.
Regex: \b(\w+)(?:\s+\1)+\b or \b(example)(?:\s+\1)+\b Substitution: \1
Details:
\b Assert position at a word boundary
\w Matches any word character ([a-zA-Z0-9_] in ASCII mode; for Python 3 str patterns, Unicode letters and digits as well)
\s Matches any whitespace character
+ Matches one or more times
\1 Backreference to the text matched by Group 1.
Python code:
import re
text = 'my example example string contains example some text'
text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)
Output:
my example string contains example some text
You could also do this in pure Python (without a regex), by creating a list of words and then generating a new string - applying your rules.
>>> words = my_str.split()
>>> ' '.join(w for i, w in enumerate(words) if w != words[i-1] or i == 0)
'my example string contains example some text'
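Note that the generator expression above drops consecutive repeats of every word, not just one. If only a specific word should be collapsed, a sketch along these lines might do; dedupe_word is a hypothetical helper name, and like the version above it rejoins the words with single spaces.

```python
def dedupe_word(text, word):
    """Collapse consecutive repeats of `word` only; other repeats are kept."""
    out = []
    for w in text.split():
        # Skip this token only if it is the target word and so was the previous one.
        if w == word and out and out[-1] == word:
            continue
        out.append(w)
    return ' '.join(out)

my_str = 'my example example string contains example some some text'
print(dedupe_word(my_str, 'example'))
# => my example string contains example some some text
```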
Why not use the .replace method:
my_str = 'my example example string contains example some text'
print(my_str.replace("example example", "example"))
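One caveat with a single replace call: it only handles an exact doubled word separated by one space. A quick sketch with a made-up string where the word appears three times in a row, compared against the backreference regex from the earlier answer:

```python
import re

tripled = 'my example example example string'

# replace() works left to right on non-overlapping matches, so one repeat survives.
by_replace = tripled.replace("example example", "example")
# The quantified backreference collapses the whole run in one pass.
by_regex = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', tripled)

print(by_replace)  # => my example example string
print(by_regex)    # => my example string
```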

Python regular expression to find letters and numbers

Given an input string, I used findall to find the words that consist only of letters and numbers (the number of words to find is not fixed).
I wrote:
words = re.findall(r"\w*\s", x) # x is the input string
If I enter "asdf1234 cdef11dfe a = 1 b = 2",
the string is separated into asdf1234, cdef11dfe, a =, 1, b =, 2.
I would like to pick out only asdf1234 and cdef11dfe.
How should I write the regular expression?
Try /[a-zA-Z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). Requiring two or more characters is what filters out the one-letter words in the string.
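In Python the pattern is passed without the surrounding slashes. A minimal sketch on the string from the question; note that the upper-case range has to be A-Z, because a range written as A-z also admits the punctuation characters that sit between Z and a in ASCII, including the underscore.

```python
import re

x = "asdf1234 cdef11dfe a = 1 b = 2"

# Runs of at least two ASCII letters or digits.
words = re.findall(r"[a-zA-Z0-9]{2,}", x)
print(words)
# => ['asdf1234', 'cdef11dfe']

# For comparison: the range A-z also matches "_" and "^".
sloppy = re.findall(r"[A-z]+", "a_b^c")
strict = re.findall(r"[A-Za-z]+", "a_b^c")
print(sloppy, strict)
# => ['a_b^c'] ['a', 'b', 'c']
```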
The problem with \w is that it includes underscores.
This one should work: (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_]+) One or more characters that are neither non-word characters nor underscores, i.e. letters and digits (non-capturing group)
(?<![\"=\w]) not preceded by ", = or a word character
(?![\"=\w]) not followed by ", = or a word character
Sample code:
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for match in matches:
    print(match.group())
