I need to convert a text like
photo_id 102297_skdjksd223238 text black dog in a water
to
photo_id 102297 text black dog in a water
by removing the substring after the underscore
inputFile = open("text.txt", "r")
exportFile = open("result", "w")
sub_str = "_"
for line in inputFile:
    new_line = line[:line.index(sub_str) + len(sub_str)]
    exportFile.writelines(new_line)
but I couldn't get to the second underscore: this cuts at the first underscore, so all the text after photo_ is removed.
Note: The question was tagged regex when I wrote this:
_[^\s]*
_ - a literal _
[^\s]* - (or \S* if supported) any character but whitespaces - zero or more times
Substitute with a blank string.
Demo
import re
inp = 'photo_id 102297_skdjksd223238 text black dog in a water foo_baz bar'
res = re.sub(r'_[^\s]*', '', inp)
print(res)
Output
photo 102297 text black dog in a water foo bar
You could split on the first underscore from the right (that is, the last underscore in the string):
s = "photo_id 102297_skdjksd223238 text black dog in a water"
prefix, suffix = s.rsplit('_', 1)
print(f"{prefix} {suffix.split(' ', 1)[-1]}")
Out:
photo_id 102297 text black dog in a water
You might use a pattern to capture the leading digits to make it a bit more specific, and then match the underscore followed by optional non whitespace characters.
In the replacement use the first capture group.
\b(\d+)_\S*
Explanation
\b A word boundary to prevent a partial word match
(\d+) Capture group 1, match 1+ digits
_\S* Match an underscore and optional non whitespace characters
See a regex101 demo.
import re
pattern = r"\b(\d+)_\S*"
s = "photo_id 102297_skdjksd223238 text black dog in a water"
result = re.sub(pattern, r"\1", s)
if result:
    print(result)
Output
photo_id 102297 text black dog in a water
Another option including photo_id and matching until the first underscore:
\b(photo_id\s+[^_\s]+)_\S*
See another regex101 demo.
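As a rough sketch of how this second pattern could be applied line by line, as the loop in the question does: the helper name strip_photo_suffix and the sample lines are made up for illustration, and the file handling is left out so the snippet stays self-contained.

```python
import re

# Pattern from the answer above: keep "photo_id <id>" and drop the "_suffix".
PATTERN = re.compile(r"\b(photo_id\s+[^_\s]+)_\S*")

def strip_photo_suffix(line):
    """Hypothetical helper: remove the underscore suffix after the photo id."""
    return PATTERN.sub(r"\1", line)

# Sample lines standing in for the contents of text.txt (assumed data).
lines = [
    "photo_id 102297_skdjksd223238 text black dog in a water\n",
    "photo_id 55_abc text foo_baz bar\n",
]

cleaned = [strip_photo_suffix(line) for line in lines]
for line in cleaned:
    print(line, end="")
```

Because the pattern is anchored on photo_id, other words containing an underscore (foo_baz here) should be left alone, unlike with the more general _[^\s]* substitution.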
Related
I have a text file and the content is,
Submitted By,Assigned,Closed
Name1,10,5
Name2,20,10
Name3,30,15
I have written a Regex Pattern, to extract the value between first , and second ,
^\w+,(\w+),.*$
My Python code is
import re
f=r'sample.txt'
rePat = re.compile(r'^\w+,(\w+),.*$', re.MULTILINE)
text = open(f, 'r').read()
output = re.findall(rePat, text)
print (f)
print (output)
Expected Output:
Assigned
10
20
30
But I am getting
10
20
30
Why is it missing the first line?
The problem is that \w+ matches one or more word chars (basically letters, digits, underscores and also some diacritics) but not whitespace. The first field of your header line, Submitted By, contains a space before the first comma, so I suggest matching any chars between commas with [^,\n]+ (the \n here is to make sure we stay within the same line).
You can use
rePat = re.compile(r'^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)
Or, a bit simplified if you do not need to extract anything else:
rePat = re.compile(r'^[^,\n]+,([^,\n]+)', re.MULTILINE)
See this regex demo. Details:
^ - start of a line
[^,\n]+ - one or more chars other than , and LF
, - a comma
([^,\n]+) - Group 1: one or more chars other than , and LF.
See a Python demo:
import re
text = r"""Submitted By,Assigned,Closed
Name1,10,5
Name2,20,10
Name3,30,15"""
rePat = re.compile(r'^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)
output = re.findall(rePat, text)
print (output)
# => ['Assigned', '10', '20', '30']
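To see why the original pattern skipped the header, here is a small sketch that runs both patterns on the header line alone. The variable names are illustrative only.

```python
import re

header = "Submitted By,Assigned,Closed"

# \w+ cannot cross the space in "Submitted By", so the original pattern fails here.
original = re.compile(r"^\w+,(\w+),.*$", re.MULTILINE)
# [^,\n]+ accepts any character except a comma or a line feed, including spaces.
fixed = re.compile(r"^[^,\n]+,([^,\n]+)", re.MULTILINE)

print(original.findall(header))
print(fixed.findall(header))
```

The first call should find nothing and the second should return the Assigned field.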
You could add matching optional spaces and word characters after the first \w+ to match till the first comma.
^\w+(?: \w+)*,(\w+),.*$
^ Start of the string (or of each line when re.MULTILINE is used)
\w+ Match 1+ word chars
(?: \w+)* Optionally repeat matching a space and 1+ word chars
,(\w+), Match a comma and capture 1+ word chars in group 1
.*$ ( You could omit this part)
Regex demo
import re
f = r'sample.txt'
rePat = re.compile(r'^\w+(?: \w+)*,(\w+),.*$', re.MULTILINE)
text = open(f, 'r').read()
output = re.findall(rePat, text)
print(output)
Output
['Assigned', '10', '20', '30']
In Python, I'm attempting to clean (and later compare) artists' names and want to remove:
non alpha characters, or
white spaces, or
the word "and"
INPUT STRING: Bootsy Collins and The Rubber Band
DESIRED OUTPUT: BootsyCollinsTheRubberBand
import re
s = 'Bootsy Collins and The Rubber Band'
res1 = re.sub(r'[^\w]|\s|\s+(and)\s', "", s)
res2 = re.sub(r'[^\w]|\s|\sand\s', "", s)
res3 = re.sub(r'[^\w]|\s|(and)', "", s)
print("\b", s, "\n"
, "1st: ", res1, "\n"
, "2nd: ", res2, "\n"
, "3rd: ", res3)
Output:
Bootsy Collins and The Rubber Band
1st: BootsyCollinsandTheRubberBand
2nd: BootsyCollinsandTheRubberBand
3rd: BootsyCollinsTheRubberB
To support the rules you set out in general, rather than only for the sample text quoted, you need a more general regex and the right flags setting in the re.sub call:
re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
Explanation
The flag re.IGNORECASE is set so that you can also remove "And" (and other uppercase/lowercase combination variations) in the sentence. In case you want to remove only "and" but not any variations of it, you can remove this flag setting.
\band\b the word "and" enclosed by the word boundary token \b on both sides. This matches the three-character sequence "and" as an independent word rather than as a substring of another word. Using \b to isolate the word, instead of surrounding it with whitespace as in \s+and\s, has the advantage that \b also detects the word boundary in strings like "and," which \s+and\s cannot match, because a comma is not whitespace.
As whitespace \s is also a kind of non-word character \W (\w is roughly [a-zA-Z0-9_]; for str patterns in Python 3 it also covers Unicode letters and digits unless re.ASCII is set), you don't need separate regex tokens for both. \W already includes \s, so you can simplify the regex by dropping the separate \s.
Demo
Test case #1:
s = 'Bootsy Collins and The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)
Output:
'BootsyCollinsTheRubberBand'
Test case #2 ('And' got removed) :
s = 'Bootsy Collins And The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)
Output:
'BootsyCollinsTheRubberBand'
Test case #3 ('and,' [with comma after 'and'] got removed)
s = 'Bootsy Collins and, The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)
Output:
'BootsyCollinsTheRubberBand'
Counter Test case: (regex using white space \s+ or \s instead of \b for word boundary)
s = 'Bootsy Collins and, The Rubber Band'
res = re.sub(r'\s+(and)\s|\W', '',s)
print(res)
Output: 'and' is NOT removed
'BootsyCollinsandTheRubberBand'
Your first two regular expressions don't match the " and " because, on arriving at that position in the string, the earlier alternatives ([^\w] or \s) match the space before "and" before the \s+(and)\s part of your regex is ever tried.
You simply need to change the order, so that the latter is tried first. Also, \s is part of [^\w], so you don't need to match \s separately. And finally, \W is the shorter form of [^\w]. So use:
\s+(and)\s|\W
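A quick sketch of the effect of the ordering, using the sample string from the question. The variable names before and after are only for illustration.

```python
import re

s = 'Bootsy Collins and The Rubber Band'

# Original order: a single-character alternative consumes the space before "and" first.
before = re.sub(r'[^\w]|\s|\s+(and)\s', '', s)
# Reordered: the " and " alternative is tried before the single-character \W.
after = re.sub(r'\s+(and)\s|\W', '', s)

print(before)
print(after)
```

Note that this variant is case-sensitive and still needs whitespace on both sides of "and", so it would not remove "And" or "and,".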
I am trying to extract the name and profession as a list of tuples from the below string using regex.
Input string
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
As you can see, the first word is the name followed by the profession, and this repeats in a comma-separated fashion. The problem is that I want to get rid of the adjectives that come along with the profession, for example "amazing" in this example.
Expected output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
I stripped out the adjective from the text using "replace" and used the below code using "regex" to get the output. But I am looking for a single regex function to avoid running the string replace. I figured that this has something to do with look ahead in regex but couldn't make it work. Any help would be appreciated.
text = text.replace("amazing ", "")
txt_new = re.findall(r"([\w\s]+),([\w\s]+)", text)
If you only want to use word and whitespace characters, this could be another option:
(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)
Explanation
( Capture group 1
\w+(?:\s+\w+)* Match 1+ word chars and optionally repeat 1+ whitespace chars and 1+ word chars
) Close group 1
\s*,\s* Match a comma between optional whitespace chars
(?:\w+\s+)* Optionally repeat 1+ word and 1+ whitespace chars
(\w+) Capture group 2, match 1+ word chars
Regex demo | Python demo
import re
regex = r"(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)"
s = ("Mr John,Carpenter,Mrs Liza,amazing painter")
print(re.findall(regex, s))
Output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
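As a hedged check of how this pattern copes with input that is a little messier than the sample, the sketch below wraps it in a hypothetical helper, extract_pairs, and feeds it a made-up string with spaces around the commas and two adjectives.

```python
import re

# Same pattern as above, compiled once.
PAIR = re.compile(r"(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)")

def extract_pairs(text):
    """Hypothetical helper: return (name, last word of profession) tuples."""
    return PAIR.findall(text)

# Made-up input: extra spaces around commas and more than one adjective.
sample = "Mr John , Carpenter,Mrs Liza, truly amazing painter"
pairs = extract_pairs(sample)
print(pairs)
```

Only the last word of each profession field is kept, so a profession that is itself two words long would be truncated to its final word.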
Here is one regex approach using re.findall:
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
matches = re.findall(r'\s*([^,]+?)\s*,\s*.*?(\S+)\s*(?![^,])', text)
print(matches)
This prints:
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
Here is an explanation of the regex pattern:
\s* match optional whitespace
([^,]+?) match the name
\s* optional whitespace
, first comma
\s* optional whitespace
.*? consume all content up until
(\S+) the last profession word
\s* optional whitespace
(?![^,]) assert that what follows is either comma or the end of the input
the co[njuring](media_title)
I want a regex to detect if a pattern like above exist.
Currently I have two regexes that turn
line = "Can I please eat at[ warunk upnormal](restaurant_name)"
line = re.sub(r'\[\s*(.*?)\s*\]', r'[\1]', line)
line = re.sub(r'(\w)\[', r'\1 [', line)
into
Can I please eat at [warunk upnormal](restaurant_name)
Notice how the spaces just inside the brackets are gone, which is good, and how a space is inserted between a word character and the opening bracket, e.g. x[ becomes x [.
What I want is to change the above two regexes so that they do not make the change when there is a sentence like this
the co[njuring](media_title)
the co[njuring](media_title) and che[ese dog]s(food)
Notice how there is a brace in there. Basically, I want to know how can I improve these regexes to take this into account.
line = re.sub(r'\[\s*(.*?)\s*\]', r'[\1]', line)
line = re.sub(r'(\w)\[', r'\1 [', line)
For the 2 patterns that you use, you could also use a single pattern with 2 capturing groups.
(\w)\[\s*(.*?)\s*\]
Regex demo and a Python demo
In the replacement use the 2 capturing groups \1 [\2]
Example code
line = re.sub(r'(\w)\[\s*(.*?)\s*\]', r'\1 [\2]', line)
The difference I see in the given format is that there is an underscore present between the parentheses in (restaurant_name) and (media_title), but not in (food).
If that is the case, you can use a third capturing group, matching the value in parenthesis with at least a single underscore present, not at the start and not at the end.
(\w)\[\s*(.*?)\s*\](\([^_\s()]+(?:_[^_\s()]+)+\))
Explanation
(\w) Capture group 1, match a word char
\[\s* Match [ and 0+ whitespace chars
(.*?) Capture group 2, match any char except a newline non greedy
\s*\] Match 0+ whitespace chars and ]
( Capture group 3
\( Match (
[^_\s()]+ Match 1+ times any char except an underscore, whitespace char or parenthesis
(?:_[^_\s()]+)+ Repeat 1+ times the previous pattern with an underscore prepended
\) Match )
) Close group
In the replacement use the 3 capturing groups \1 [\2]\3
Regex demo and a Python demo
Example code
import re
regex = r"(\w)\[\s*(.*?)\s*\](\([^_\s()]+(?:_[^_\s()]+)+\))"
test_str = ("Can I please eat at[ warunk upnormal](restaurant_name)\n"
"Can I please eat at[ warunk upnormal ](restaurant_name)\n"
"the co[njuring](media_title)\n"
"the co[njuring](media_title) and che[ese dog]s(food)")
result = re.sub(regex, r"\1 [\2]\3", test_str)
if result:
    print(result)
Output
Can I please eat at [warunk upnormal](restaurant_name)
Can I please eat at [warunk upnormal](restaurant_name)
the co [njuring](media_title)
the co [njuring](media_title) and che[ese dog]s(food)
InputString = r'On <ENAMEX TYPE="DATE">August 17</ENAMEX> , <ENAMEX TYPE="GPE">Tai wan</ENAMEX> is investigation department.'
p1 = r'<ENAMEX TYPE="(\S+)">(.+?)</ENAMEX>'
p2 = '_'.join(r'\2'.split(' '))
plain_text = re.sub(p1,p2,InputString)
Expected output:
On August_17 , Tai_wan is investigation department.
Unfortunately, I get the result:
On August 17 , Tai wan is investigation department.
How to split the captured group '\2'?
It seems you want to just replace the matches with the second group (text between ENAMEX tags) and replace all spaces with _.
You may use
import re
InputString = r'On <ENAMEX TYPE="DATE">August 17</ENAMEX> , <ENAMEX TYPE="GPE">Tai wan</ENAMEX> is investigation department.'
p1 = r'<ENAMEX TYPE="[^"]+">(.*?)</ENAMEX>'
plain_text = re.sub(p1,lambda p2: p2.group(1).replace(' ', '_'),InputString)
print(plain_text)
# => On August_17 , Tai_wan is investigation department.
See the Python demo.
Here, <ENAMEX TYPE="[^"]+">(.*?)</ENAMEX> matches <ENAMEX TYPE=", any 1+ chars other than " up to and including a ", then matches a > and then captures any 0+ chars other than line break chars into Group 1. Then, </ENAMEX> substring is matched. The lambda expression only pastes back the contents of Group 1 with literal spaces replaced with underscores. Note you may use re.sub(r'\s', '_', p2.group(1)) in case you want to replace any whitespace char with an underscore.
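The whitespace variant mentioned in the note can be sketched with a named replacement function instead of a lambda. The function name join_entity and the tab in the sample input are assumptions made for the illustration.

```python
import re

ENAMEX = re.compile(r'<ENAMEX TYPE="[^"]+">(.*?)</ENAMEX>')

def join_entity(match):
    """Hypothetical helper: replace any whitespace char in the tag body with "_"."""
    return re.sub(r'\s', '_', match.group(1))

# Sample input with a tab inside the first entity (assumed data).
tagged = ('On <ENAMEX TYPE="DATE">August\t17</ENAMEX> , '
          '<ENAMEX TYPE="GPE">Tai wan</ENAMEX> is investigation department.')

plain = ENAMEX.sub(join_entity, tagged)
print(plain)
```

re.sub calls the function once per match and inserts its return value, which is why the string methods run on the actual captured text rather than on the literal r'\2'.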