retrieve subset of string with regex - python [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
p = r"\home\gef\Documents\abc_this_word_dfg.gz.tar"
I'm looking for a way to retrieve this_word.
import os
base = os.path.basename(p)
base1 = base.replace("abc_", "")
base1 = base1.replace("_dfg.gz.tar", "")
This works, but it's not ideal because I would need to know in advance which strings I want to remove. Maybe a regex would be appropriate here?

You don't give much information, but from what is shown can't you just use string slicing?
Maybe like this:
>>> import os
>>> p = os.path.join('home', 'gef', 'Documents', 'abc_this_word_dfg.gz.tar')
>>> p
'home/gef/Documents/abc_this_word_dfg.gz.tar'
>>> os.path.dirname(p)
'home/gef/Documents'
>>> os.path.basename(p)
'abc_this_word_dfg.gz.tar'
>>> os.path.basename(p)[4:-11]
'this_word'

You don't give much information, but from what is shown can't you just split on _ chars?
Maybe like this:
>>> import os
>>> p = os.path.join('home', 'gef', 'Documents', 'abc_this_word_dfg.gz.tar')
>>> p
'home/gef/Documents/abc_this_word_dfg.gz.tar'
>>> os.path.dirname(p)
'home/gef/Documents'
>>> os.path.basename(p)
'abc_this_word_dfg.gz.tar'
>>> '_'.join(
... os.path.basename(p).split('_')[1:-1])
'this_word'
This splits the basename on underscores, discards the first and last parts, and joins the remaining parts back together with underscores. If this_word contained no underscores, only one part would remain and the join would return it unchanged.
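Since the question asks about a regex, here is a minimal sketch of one. It assumes the basename always has the shape prefix_middle_suffix, where neither the prefix nor the suffix contains an underscore. The helper name middle_part is hypothetical and not from the original answers.

```python
import os
import re

def middle_part(path):
    """Return the text between the first and last underscore of the basename."""
    base = os.path.basename(path)
    # [^_]+ matches the underscore-free prefix, (.+) greedily captures the
    # middle, and _[^_]+ matches the underscore-free suffix at the end.
    match = re.fullmatch(r'[^_]+_(.+)_[^_]+', base)
    return match.group(1) if match else None

p = 'home/gef/Documents/abc_this_word_dfg.gz.tar'
result = middle_part(p)
print(result)
```

Unlike the replace-based approach, this does not need to know the prefix or suffix in advance, only that underscores delimit them. It returns None when the basename does not fit that shape.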

Related

how to check if both query strings (with wildcards) are present using regex in python [duplicate]

This question already has answers here:
Regex to match string containing two names in any order
(9 answers)
Closed 1 year ago.
I have two query strings, both of which contain wildcards. We can check whether either of the two query strings is present like this:
import re
def wildcards2regex(pattern):
    return pattern.replace('?', '[a-zA-Z]{1}').replace('*', '.*')
string = 'christopher susan'
s1='chris*'
r1 = wildcards2regex(s1)
s2 = 'sus??'
r2 = wildcards2regex(s2)
q = r1+'|'+r2
bool(re.search(q, string))
Now I wonder what to do if I want to check whether both query strings are present. Obviously, replacing '|' with '&' does not work. Does anyone know how to achieve that?
You may consider this code:
>>> import re
>>> def wildcards2regex(pattern):
...     return pattern.replace('?', '[a-zA-Z]').replace('*', '.*')
...
>>> string = 'christopher susan'
>>> s1='chris*'
>>> r1 = wildcards2regex(s1)
>>> s2 = 'sus??'
>>> r2 = wildcards2regex(s2)
>>> q = re.compile(r'(?=.*{})(?=.*{})'.format(r1, r2))
>>> bool(re.search(q, string))
True
Take note of this regex building:
q = re.compile(r'(?=.*{})(?=.*{})'.format(r1, r2))
Here we build a regex with two conditions, each written as a positive lookahead assertion, so the match succeeds only if both of your query strings are present.
You can combine multiple independently positioned search terms into one regex using positive lookahead, since that doesn't consume characters or advance the conceptual cursor.
^(?=.*term1)(?=.*term2)
Demonstration: https://pythex.org/?regex=%5E(%3F%3D.*term1)(%3F%3D.*term2)&test_string=Here%20are%20term1%20and%20term2.
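The lookahead approach extends to any number of query strings. The following is a sketch under some assumptions: the helper contains_all is hypothetical, and wildcards2regex is changed from the question's version to call re.escape first, so that characters such as '.' in a query are matched literally.

```python
import re

def wildcards2regex(pattern):
    # Escape regex metacharacters first, then turn the (now escaped)
    # wildcards back into their regex equivalents.
    escaped = re.escape(pattern)
    return escaped.replace(r'\?', '[a-zA-Z]').replace(r'\*', '.*')

def contains_all(text, wildcard_patterns):
    # One positive lookahead per pattern; none of them consumes input,
    # so every pattern is checked from the same starting position.
    lookaheads = ''.join(
        '(?=.*{})'.format(wildcards2regex(p)) for p in wildcard_patterns
    )
    return re.search(lookaheads, text) is not None

print(contains_all('christopher susan', ['chris*', 'sus??']))
print(contains_all('christopher sue', ['chris*', 'sus??']))
```

If a single combined regex is not required, calling re.search once per pattern inside all(...) gives the same answer and may be easier to read.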

Python regex string matching with varying search string

Is there any way in Python to make
"DDx" match "01x", "10x", "11x", "00x"
in an elegant way?
The easiest way I see to do this is by using a regex, which in this case would be:
re.search(r'\d\dx', line)
Is there any way to update this regex dynamically?
In case the input is:
"D0x" then regex: \d0x
Please help.
Using Python 2.7
EDIT
In simpler terms, my question is:
>>> str = "DDx"
>>> str.replace('\d','D')
>>> re.search(<use str here>,line)
Or any alternate approach
I think I found the answer:
>>> import re
>>> s = "DDx"
>>> s = s.replace('D', r'\d')
>>> p = "01x"
>>> c = re.search(s, p)
>>> print(c.group(0))
01x
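The replace-based answer can be wrapped in a small helper. This is a sketch in Python 3 (the question used 2.7). The name template_to_regex is hypothetical, and it assumes that 'D' is the only placeholder and that every other character should match literally.

```python
import re

def template_to_regex(template):
    # 'D' stands for any single digit; every other character is escaped
    # so that it is matched literally.
    return ''.join(r'\d' if ch == 'D' else re.escape(ch) for ch in template)

pattern = re.compile(template_to_regex('DDx'))

for candidate in ['01x', '10x', '11x', '00x', 'a1x']:
    # fullmatch requires the whole candidate to fit the template.
    print(candidate, bool(pattern.fullmatch(candidate)))
```

Escaping the non-placeholder characters matters if a template ever contains something like '.' or '+', which a plain str.replace would leave as regex metacharacters.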

Ignoring or removing line breaks python [duplicate]

This question already has answers here:
Remove all newlines from inside a string
(8 answers)
Closed 1 year ago.
I'm sorry for the noobish question, but none of the answers I've looked at seem to fix this. I'd like to take a multi-line string like this:
myString = """a
b
c
d
e"""
And get a result that looks like or that is at least interpreted as this:
myString = "abcde"
myString.rstrip(), myString.rstrip("\n"), and myString.rstrip("\r") don't seem to change anything when I print this little "abcde" test string. Some of the other solutions I've read involve entering the string like this:
myString = ("a"
"b"
"c")
But this solution is impractical because I'm working with very large sets of data. I need to be able to copy a dataset and paste it into my program, and have python remove or ignore the line breaks.
Am I entering something in wrong? Is there an elegant solution to this? Thanks in advance for your patience.
Use the replace method:
myString = myString.replace("\n", "")
For example:
>>> s = """
... test
... test
... test
... """
>>> s.replace("\n", "")
'testtesttest'
>>> s
'\ntest\ntest\ntest\n' # warning! replace does not alter the original
>>> myString = """a
... b
... c
... d
... e"""
>>> ''.join(myString.splitlines())
'abcde'
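A short sketch comparing the approaches may help explain why rstrip appeared to do nothing: it only removes characters from the end of the string, never from the middle. The sample text below is made up and mixes Unix, Windows, and old Mac line endings.

```python
text = "a\r\nb\nc\rd\n"

# splitlines() splits on \n, \r\n and \r (and a few other line
# boundaries), so joining the pieces drops every line break.
joined = ''.join(text.splitlines())

# replace() needs one call per character to remove.
replaced = text.replace('\r', '').replace('\n', '')

# rstrip() only trims the trailing line break; inner ones remain.
only_end = text.rstrip('\r\n')

print(repr(joined))
print(repr(replaced))
print(repr(only_end))
```

In all three cases the original string is left untouched, because Python strings are immutable. The result has to be assigned to a name to be kept.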

Removing a prefix from a string [duplicate]

This question already has answers here:
Remove a prefix from a string [duplicate]
(6 answers)
Closed 6 months ago.
Trying to strip the "0b1" from the left end of a binary number.
The following code strips the entire binary string (not good).
>>> bbn = '0b1000101110100010111010001' # converted from bin(2**24 + 2**24/11)
>>> aan=bbn.lstrip("0b1") #Try stripping all left-end junk at once.
>>> print aan #oops all gone.
''
So I did the .lstrip() in two steps:
>>> bbn = '0b1000101110100010111010001' # Same fraction expansion
>>> aan=bbn.lstrip("0b")# Had done this before.
>>> print aan #Extra "1" still there.
'1000101110100010111010001'
>>> aan=aan.lstrip("1")# If at first you don't succeed...
>>> print aan #YES!
'000101110100010111010001'
What's the deal?
Thanks again for solving this in one simple step. (see my previous question)
The strip family treats its argument as a set of characters to be removed. The default set is "all whitespace characters".
You want:
if strg.startswith("0b1"):
    strg = strg[3:]
No. Stripping removes all characters in the sequence passed, not just the literal sequence. Slice the string if you want to remove a fixed length.
In Python 3.9 and later you can use bbn.removeprefix('0b1').
(Actually this question has been mentioned as part of the rationale in PEP 616.)
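The two suggestions above can be combined into one helper that also runs on interpreters older than 3.9. This is only a sketch, and the name remove_prefix is hypothetical.

```python
def remove_prefix(text, prefix):
    # Prefer the built-in method where it exists (Python 3.9+).
    if hasattr(text, 'removeprefix'):
        return text.removeprefix(prefix)
    # Fallback: slice off the prefix only if it is actually present.
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

bbn = '0b1000101110100010111010001'

# lstrip treats '0b1' as the character set {'0', 'b', '1'}, and the
# whole string consists of those characters, so nothing is left.
stripped = bbn.lstrip('0b1')

# Removing the literal prefix keeps the rest of the digits.
trimmed = remove_prefix(bbn, '0b1')

print(repr(stripped))
print(repr(trimmed))
```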
This is the way lstrip works. It removes any of the characters in the parameter, not necessarily the string as a whole. In the first example, since the input consisted of only those characters, nothing was left.
Lstrip is removing any of the characters in the string. So, as well as the initial 0b1, it is removing all zeros and all ones. Hence it is all gone!
@Harryooo: lstrip only takes characters off the left-hand end. So, because there's only one 1 before the first 0, it removes just that one. If the number started 0b11100101..., calling a.lstrip('0b').lstrip('1') would remove the first three ones, so you'd be left with 00101....
>>> i = 0b1000101110100010111010001
>>> print(bin(i))
0b1000101110100010111010001
>>> print(format(i, '#b'))
0b1000101110100010111010001
>>> print(format(i, 'b'))
1000101110100010111010001
From the standard documentation (see the standard documentation for the function bin()):
bin(x)
Convert an integer number to a binary string prefixed with “0b”. The result is a valid Python expression. If x is not a Python int object, it has to define an __index__() method that returns an integer. Some examples:
>>> bin(3)
'0b11'
>>> bin(-10)
'-0b1010'
If prefix “0b” is desired or not, you can use either of the following ways.
>>> format(14, '#b'), format(14, 'b')
('0b1110', '1110')
>>> f'{14:#b}', f'{14:b}'
('0b1110', '1110')
See also format() for more information.

str.strip() strange behavior [duplicate]

This question already has answers here:
How do the .strip/.rstrip/.lstrip string methods work in Python?
(4 answers)
Closed 28 days ago.
>>> t1 = "abcd.org.gz"
>>> t1
'abcd.org.gz'
>>> t1.strip("g")
'abcd.org.gz'
>>> t1.strip("gz")
'abcd.org.'
>>> t1.strip(".gz")
'abcd.or'
Why is the 'g' of '.org' gone?
strip(".gz") removes any of the characters ., g and z from the beginning and end of the string.
x.strip(y) will remove all characters that appear in y from the beginning and end of x.
That means
'foo42'.strip('1234567890') == 'foo'
because '4' and '2' both appear in '1234567890'.
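To make the character-set behavior concrete, here is a rough pure-Python equivalent of str.strip with an argument. It is a sketch for illustration only: strip_chars is a hypothetical name, and the real method is implemented in C.

```python
def strip_chars(text, chars):
    start, end = 0, len(text)
    # Advance from the left while the current character is in the set.
    while start < end and text[start] in chars:
        start += 1
    # Retreat from the right while the current character is in the set.
    while end > start and text[end - 1] in chars:
        end -= 1
    return text[start:end]

t1 = 'abcd.org.gz'
# From the right, 'z', 'g', '.' and then the 'g' of 'org' are all in
# the set '.gz', so they are removed; 'r' is not, so trimming stops.
print(strip_chars(t1, '.gz'))
print(strip_chars(t1, '.gz') == t1.strip('.gz'))
```

The order of the characters in the argument is irrelevant here, which is why strip('.gz') and strip('zg.') behave the same.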
Use os.path.splitext if you want to remove the file extension.
>>> import os.path
>>> t1 = "abcd.org.gz"
>>> os.path.splitext(t1)
('abcd.org', '.gz')
Python 3.9 added two string methods, .removeprefix() and .removesuffix(), which remove a given substring from the beginning or end of a string, respectively. Thankfully, this time the method names make it clear what the methods do.
>>> print (sys.version)
3.9.0
>>> t1 = "abcd.org.gz"
>>> t1.removesuffix('gz')
'abcd.org.'
>>> t1
'abcd.org.gz'
>>> t1.removesuffix('gz').removesuffix('.gz')
'abcd.org.' # No unexpected effect from last removesuffix call
The argument given to strip is a set of characters to be removed, not a substring. From the docs:
The chars argument is a string specifying the set of characters to be removed.
As far as I know, strip removes characters from the beginning or end of a string only. If you want to remove them from anywhere in the string, use replace.
