Extract a portion of string from another string using regex

Extract a portion of string from another string using regex - python

Let's assume I have a string as follows:
s = '23092020_indent.xlsx'
I want to extract only indent from the above string. Now there are many approaches:
# Via re.split()
s_f = re.split('_ |. ',s)  # <--- returns only s itself, not the desired output
# Via re.findall()
s_f = re.findall(r'[A-Za-z]',s,re.I)
s_f
['i','n','d','e','n','t','x','l','s','x']
s_f = ''.join(s_f)  # <--- returns 'indentxlsx', not the desired output
Am I missing anything? Or do I need to use regex at all?
P.S. In s, only the '.' delimiter is constant. All other delimiters can change.

Use os.path.splitext and then str.split:
import os
name, ext = os.path.splitext(s)
name.split("_")[1] # If the position is always fixed
Output:
"indent"

I LOVE regexes, so that's definitely the way I'd go.
The exactly right answer requires more information as to all possible input strings and what the right thing to extract is for each of them. Here's a solution that assumes:
1. one or more digits, then
2. a single underscore, then
3. a group of chars not containing a '.', then
4. a '.', then
5. anything besides a '.', but at least one char
The #3 part is captured.
import re
s = '23092020_indent.xlsx'
exp = re.compile(r"^\d+_(.*?)\.[^.]+$")
m = exp.match(s)
if m:
    print(m.group(1))
Result:
indent
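
If the delimiter between the digits and the name can vary, as the question's P.S. suggests, the underscore in the pattern above could be widened to any single non-alphanumeric character. This is only a sketch under that assumption; the names NAME_RE and extract_name are made up for illustration.

```python
import re

# Digits, one non-alphanumeric delimiter, the captured name, then the extension
NAME_RE = re.compile(r"^\d+[^A-Za-z0-9](.*?)\.[^.]+$")

def extract_name(filename):
    """Return the captured name, or None when the file name has another shape."""
    m = NAME_RE.match(filename)
    return m.group(1) if m else None

for sample in ["23092020_indent.xlsx", "23092020-indent.xlsx", "indent.xlsx"]:
    print(sample, "->", extract_name(sample))
```

The last sample has no leading digits, so the helper returns None instead of guessing.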

Related

How I can use regex to remove repeated characters from string

I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb";
for i in input :
    input = re.sub("(.)\\1+", "",input);
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb";
k=3
for i in input :
    input= re.sub("(.)\\1+{"+(k-1)+"}", "",input)
print(input)

The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
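
To let the user pass k, the same pattern can be built inside a small function. A minimal sketch: the name remove_runs and the k < 2 guard are my additions, and it does a single pass only, which may or may not be what the original loop was meant to do.

```python
import re

def remove_runs(text, k):
    """Remove each non-overlapping group of k identical consecutive characters."""
    if k < 2:
        # With k=1 the pattern would match every single character
        raise ValueError("k must be at least 2")
    # For k=3 this f-string produces the pattern (.)\1{2}
    pattern = fr"(.)\1{{{k - 1}}}"
    return re.sub(pattern, "", text)

print(remove_runs("abccbcbbb", 3))  # abccbc
print(remove_runs("abccbcbbb", 2))  # abbcb
```

With k=2 the trailing "bbb" loses only two characters, because the pattern removes groups of exactly k and leaves the remainder of a longer run.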

If I were you, I would do it as suggested above. But since I've already spent time answering this question, here is my handmade solution.
The pattern below creates a named group called "letter" that matches a single lowercase letter. The lookahead then checks whether that same letter is repeated immediately afterwards.
So every letter that is directly followed by a copy of itself is replaced with an empty string, which leaves one letter from each run.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result

Just to make sure I have the question correct: you mean to turn "abccbcbbb" into "abcbcb", removing only sequential duplicate characters? Is there a reason you need to use regex? You could likely do it with a simple loop. This is a really quick and dirty way to do it, but you could just write
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
    # Keep a letter only when it differs from the one before it
    if letter != previous:
        result.append(letter)
    previous = letter
result = "".join(result)
and with a method like this, I'd assume you could make it easier to read and faster with a bit of modification.
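
One possible way to tidy that loop, as a sketch rather than the answerer's own code, is itertools.groupby from the standard library, which groups consecutive equal items. The function name collapse_repeats is hypothetical.

```python
from itertools import groupby

def collapse_repeats(text):
    """Keep one character from each run of identical consecutive characters."""
    # groupby yields one (key, group) pair per run of equal characters
    return "".join(key for key, _group in groupby(text))

print(collapse_repeats("abccbcbbb"))  # abcbcb
```

Unlike the pop(0) version, this also handles an empty string without raising an error.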

Remove Characters From A String Until A Specific Format is Reached

So I have the following strings and I have been trying to figure out how to manipulate them in such a way that I get a specific format.
string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo
I want to get rid of the last part of each string so I am left with everything up to the month and year, like below:
string1-itd_jan2021
string2itd_mar2021
string3itd_feb2021
string4-itd_mar2021
string5itd_jun2021
string6-itd_feb2021
I thought about using string.split on the - but then realized that this wouldn't work for some strings. I also thought about removing a set number of characters by slicing, but the ending varies in length.
Is there a way to do this with regex or any other Python module?

Use str.rsplit with the appropriate maxsplit parameter:
s = s.rsplit("-", 1)[0]
You could also use str.split (even though this is clearly the worse choice):
s = "-".join(s.split("-")[:-1])
Or using regular expressions:
s = re.sub(r'-[^-]*$', '', s)
# "-[^-]*" a "-" followed by any number of non-"-"
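
The three variants agree on the sample strings but not on a string with no dash at all. A quick sketch to compare them; the function names and the "nodash" sample are mine, not from the question.

```python
import re

def strip_rsplit(s):
    return s.rsplit("-", 1)[0]

def strip_split_join(s):
    return "-".join(s.split("-")[:-1])

def strip_regex(s):
    return re.sub(r"-[^-]*$", "", s)

for sample in ["string1-itd_jan2021-internal", "string2itd_mar2021-space", "nodash"]:
    # repr makes the empty string returned by strip_split_join visible
    print(sample, repr(strip_rsplit(sample)), repr(strip_split_join(sample)), repr(strip_regex(sample)))
```

For "nodash", rsplit and the regex return the string unchanged, while the split/join version returns an empty string, which is one more reason to prefer rsplit.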

With a regex:
import re
re.sub(r'([0-9]{4}).*$', r'\1', s)
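
Note that this pattern cuts after the first run of four digits, so it relies on the prefix never containing four digits in a row. If that could happen, one option is to anchor on the month and year token instead. A sketch, assuming the token is always an underscore, three lowercase letters, and four digits; MONTH_YEAR_RE and keep_through_month_year are hypothetical names.

```python
import re

# Lazily match up to the first "_" + three lowercase letters + four digits
MONTH_YEAR_RE = re.compile(r"^(.*?_[a-z]{3}[0-9]{4})")

def keep_through_month_year(s):
    m = MONTH_YEAR_RE.match(s)
    # Leave the string unchanged when the expected token is absent
    return m.group(1) if m else s

print(keep_through_month_year("string1-itd_jan2021-internal"))
print(keep_through_month_year("string2021-itd_jan2021-internal"))
```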

Use re.sub like so:
import re
lines = '''string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo'''
for old in lines.split('\n'):
    new = re.sub(r'[-][^-]+$', '', old)
    print('\t'.join([old, new]))
Prints:
string1-itd_jan2021-internal string1-itd_jan2021
string2itd_mar2021-space string2itd_mar2021
string3itd_feb2021-internal string3itd_feb2021
string4-itd_mar2021-moon string4-itd_mar2021
string5itd_jun2021-internal string5itd_jun2021
string6-itd_feb2021-apollo string6-itd_feb2021
Explanation:
r'[-][^-]+$' : Literal dash (-), followed by any character other than a dash ([^-]) repeated 1 or more times, followed by the end of the string ($).

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements that contain numbers and all elements that end with dots.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?

This should work:
reg = re.compile(r'[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it deletes any string with a dot inside it.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also wrong as it misses a "+" and a ",", which I included in the above solution.
Also, if you assume that elements which end with a dot might contain some numbers, the complete solution should be:
reg = re.compile(r'[a-zA-Z0-9]+\.|[0-9,]+')
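
The answer does not say how the compiled pattern is applied to the list. One way, sketched here as an assumption, is to drop every element that the first combined pattern matches in full with fullmatch:

```python
import re

a = ['35', 'years', 'opened', '7,000', 'churches', 'rev.', 'mr.', 'brandt', 'said',
     'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every',
     '10,000', 'persons']

reg = re.compile(r'[a-zA-Z]+\.|[0-9,]+')

# fullmatch requires the whole element to match one of the two alternatives
kept = [item for item in a if not reg.fullmatch(item)]
print(kept)
```

Using fullmatch rather than match avoids dropping an element that merely starts with digits or with letters followed by a dot.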

If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
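
Because this pattern is not anchored at the start, it would be used with re.search rather than re.match. A short sketch; the sample list is made up to show the edge cases.

```python
import re

pattern = re.compile(r'\.$|\d')

samples = ['35', 'rev.', '7,000', 'per.sons', 'a1b', 'years']

# search looks anywhere in the string: a trailing dot or any digit disqualifies it
kept = [item for item in samples if not pattern.search(item)]
print(kept)  # ['per.sons', 'years']
```

An element with a dot in the middle survives, while one with a digit anywhere is dropped.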

You could use:
(?:[^\d\n]*\d)|.*\.$
See a demo on regex101.com.

Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
    if not re.search(r'^(?:[\d,]+|.*\.)$', s):
        b.append(s)
print(b)
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']

Splitting based on particular pattern and editing string

I am trying to split a string based on a particular pattern in an effort to rejoin it later after adding a few characters.
Here's a sample of my string: "123\babc\b:123" which I need to convert to "123\babc\\"b\":123". I need to do it several times in a long string. I have tried variations of the following:
regex = r"(\\b[a-zA-Z]+)\\b:"
test_str = "123\\babc\\b:123"
x = re.split(regex, test_str)
but it doesn't split at the right positions for me to join. Is there another way of doing this/another way of splitting and joining?

You're right, you can do it with re.split as suggested. You can split by \b and then rebuild your output with a specific separator (and keep the \b when you want to).
Here is an example:
# Import module
import re
string = "123\\babc\\b:123"
# Split by "\b"
list_sliced = re.split(r'\\b', string)
print(list_sliced)
# ['123', 'abc', ':123']
# Define your custom separator
custom_sep = '\\\\"b\\"'
# Build your new output
output = list_sliced[0]
# Iterate over each word
for i, word in enumerate(list_sliced[1:]):
    # Choose the separator according to the parity (since we don't want to change the first "\b")
    sep = "\\\\b"
    if i % 2 == 1:
        sep = custom_sep
    # Update output
    output += sep + word
print(output)
# 123\\babc\\"b\":123

Maybe, the following expression,
^([\\]*)([^\\]+)([\\]*)([^\\]+)([\\]*)([^:]+):(.*)$
and a replacement of,
\1\2\3\4\5\\"\6\\":\7
with a re.sub might return our desired output.
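
Put together in Python, that could look like the sketch below. Raw string literals keep the backslashes readable; the variable names are mine.

```python
import re

pattern = r'^([\\]*)([^\\]+)([\\]*)([^\\]+)([\\]*)([^:]+):(.*)$'
# In the replacement template, \\ produces one literal backslash
replacement = r'\1\2\3\4\5\\"\6\\":\7'

text = r'123\babc\b:123'
result = re.sub(pattern, replacement, text)
print(result)  # 123\babc\\"b\":123
```

Since the pattern is anchored with ^ and $, this handles one such segment per string; a long string with several occurrences would need a different pattern.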

Updating a string using regular expressions in Python

I'm pretty sure that my question is very straightforward but I cannot find the answer to it. Let's say we have an input string like:
input = "This is an example"
Now, I want to simply replace every word --generally speaking, every substring using a regular expression, "word" here is just an example-- in the input with another string which includes the original string too. For instance, I want to add an # to the left and right of every word in input. And, the output would be:
output = "#This# #is# #an# #example#"
What is the solution? I know how to use re.sub or replace, but I do not know how I can use them in a way that I can update the original matched strings and not completely replace them with something else.

You can use capture groups for that.
import re
input = "This is an example"
output = re.sub(r"(\w+)", r"#\1#", input)
A capture group is something that you can later reference, for example in the substitution string. In this case, I'm matching a word, putting it into a capture group and then replacing it with the same word, but with # added as a prefix and a suffix.
You can read more about regexps in Python in the docs.
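
Two related variations, offered as a sketch: \g<0> refers to the whole match, so the explicit group can be dropped, and re.sub also accepts a function when the replacement needs more than a template. The upper-casing is only there to show something a template cannot do.

```python
import re

text = "This is an example"

# \g<0> stands for the entire match, so no capture group is needed
with_backref = re.sub(r"\w+", r"#\g<0>#", text)

# A function receives each match object and returns the replacement string
with_function = re.sub(r"\w+", lambda m: "#" + m.group(0).upper() + "#", text)

print(with_backref)   # #This# #is# #an# #example#
print(with_function)  # #THIS# #IS# #AN# #EXAMPLE#
```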

Here is an option using re.sub with lookarounds:
input = "This is an example"
output = re.sub(r'(?<!\w)(?=\w)|(?<=\w)(?!\w)', '#', input)
print(output)
#This# #is# #an# #example#

This is without re library
a = "This is an example"
l=[]
for i in a.split(" "):
    l.append('#'+i+'#')
print(" ".join(l))

You can match only word boundaries with \b:
import re
input = "This is an example"
output = re.sub(r'\b', '#', input)
print(output)
#This# #is# #an# #example#
