How to extract specific characters from a string that can vary - python

I'm trying to extract a specific part of a file name that can contain a varying number of underscores. I previously chained partition/rpartition calls to strip everything before and after particular underscores, but I didn't account for file names with a different number of underscores.
The purpose of the code is to extract specific characters between underscores.
filename = os.path.basename(files).partition('_')[2].rpartition('_')[0].rpartition('_')[0].rpartition('_')[0]
The above is my current code. A typical name of the file looks like:
P0_G12_190325184517_t20190325_5
or it can also have
P0_G12_190325184517_5
From what I understand, the number of rpartition calls in my current code has to match the number of underscores in the file name. It works for the first file, but obviously the same code doesn't work for the second.
I want to extract
G12
This part can also be just two characters, like G1, so I need two to three characters from file names of the types above.

You can use:
os.path.basename(files).split('_')[1]
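As a minimal sketch of why this works for both file names: the wanted part is always the second underscore-separated field, no matter how many underscores follow it. The helper name extract_group is made up for illustration, and the third sample name (with G1) is an assumed variant based on the question's description.

```python
import os

def extract_group(path):
    # Drop any directory part, then take the second underscore-separated field.
    return os.path.basename(path).split('_')[1]

for name in ("P0_G12_190325184517_t20190325_5",
             "P0_G12_190325184517_5",
             "P0_G1_190325184517_5"):
    print(extract_group(name))
# prints G12, G12, G1 (one per line)
```

A name with no underscore at all would raise IndexError here, so real code may want to check the length of the split result first.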

You could either use split to create a list with the separate parts, like this:
files.split('_')
Or you could use regex:
https://regex101.com/r/jiUNLV/1
And do like this:
import re
pattern = r'.*_(\w{2,3})_\d+.*'
match = re.match(pattern, files)
if match:
    print(match.group(1))

Related

Regex pattern to require either of two specific strings in different positions

I have a string that can exist in either of the following two formats within a larger body of text:
OptionalSpecificString1 1234
1234 OptionalSpecificString2
The text here is all placeholders. I'm looking for a numerical string that's either preceded or followed by a specific optional string. One of the two optional specific strings will always be present and is needed to locate and capture the numerical string-of-interest. Is there a single regex pattern that exists that can capture this behavior?
Something like:
(?:OptionalSpecificString1)? (\d+) (?:OptionalSpecificString2)?
almost does it, but doesn't require that one of the two optional strings is present, and so it could end up matching any other numerical string in the body of the text. I know I could do something like:
(OptionalSpecificString1 (\d+)|(\d+) OptionalSpecificString2)
but I guess I'm just wondering if there's something a little more elegant. I'm doing this with the Python re module, so code can be a bit simpler too when I can express a single capture group for the same pattern.
If Python supported redefining a named group, the solution could be OptionalSpecificString1\s*(?P<numeric>\d+)|(?P<numeric>\d+)\s*OptionalSpecificString2, which simply makes the two formats alternatives of one regex.
As it doesn't, you can capture the numerical values into different groups, named or not, and pick the non-empty one in Python code, like this:
import re
text = r'''
OptionalSpecificString1 1234
An irrelevant line
5678 OptionalSpecificString2
Another irrelevant line
'''
pattern = r'OptionalSpecificString1\s*(?P<numeric1>\d+)|(?P<numeric2>\d+)\s*OptionalSpecificString2'
numerics = []
for match in re.finditer(pattern, text):
    numerics.append(match.group('numeric1') or match.group('numeric2'))
print(numerics)
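Another possible approach, sketched here under the assumption that exactly one space separates the number from the marker string, is to move the marker strings into lookarounds. The whole match is then just the number, so no capture groups are needed. Python's re module requires fixed-width lookbehinds, which is why this sketch hard-codes a single space instead of \s*.

```python
import re

text = """
OptionalSpecificString1 1234
An irrelevant line with 9999
5678 OptionalSpecificString2
"""

# The lookbehind requires the first marker before the digits;
# the lookahead requires the second marker after them.
pattern = r'(?<=OptionalSpecificString1 )\d+|\d+(?= OptionalSpecificString2)'

numerics = re.findall(pattern, text)
print(numerics)  # ['1234', '5678']
```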

Regex filter containing word at beginning but not containing another word

Suppose I have the following string:
GPH_EPL_GK_FIN
I want a regex, to be used in Python, that looks through records from a CSV file (not relevant to this question) for strings that start with GPH but DON'T contain EPL.
I know the caret ^ is used for matching at the beginning,
so I have something like this:
^GPH_.*
I want to include the "does NOT contain" part as well. How do I chain the regex?
i.e.
(^GPH_.*)(?!EPL)
Eventually I would like to take this a step further: for any records that are returned without EPL, i.e.
GPH_ABC_JKL_OPQ
I want to insert the EPL part right AFTER GPH_,
i.e. the desired result is
GPH_EPL_ABC_JKL_OPQ
To cover both requirements:
compose a pattern to match lines that start with GPH but DONT contain EPL
insert EPL_ part into matched line to a particular position
import re
# sample string containing lines
s = '''GPH_EPL_GK_FIN
GPH_ABC_JKL_OPQ'''
pat = re.compile(r'^(GPH_)(?!.*EPL.*)')
for line in s.splitlines():
    print(pat.sub(r'\1EPL_', line))
The output:
GPH_EPL_GK_FIN
GPH_EPL_ABC_JKL_OPQ
This should do, I think:
^GPH_(?!EPL).*
This matches any string that starts with GPH_ and does not have EPL immediately after GPH_. Note that an EPL later in the string is not excluded.
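To illustrate the difference between rejecting EPL directly after GPH_ and rejecting it anywhere in the record, here is a small sketch. The record list and the variable names are invented for the example; only the first two records come from the question.

```python
import re

# Rejects EPL only when it directly follows GPH_.
right_after = re.compile(r'^GPH_(?!EPL)')
# Rejects EPL anywhere after GPH_.
anywhere = re.compile(r'^GPH_(?!.*EPL)')

records = ['GPH_EPL_GK_FIN', 'GPH_ABC_JKL_OPQ', 'GPH_ABC_EPL_XYZ', 'XYZ_ABC']

kept_right_after = [r for r in records if right_after.match(r)]
kept_anywhere = [r for r in records if anywhere.match(r)]

print(kept_right_after)  # ['GPH_ABC_JKL_OPQ', 'GPH_ABC_EPL_XYZ']
print(kept_anywhere)     # ['GPH_ABC_JKL_OPQ']
```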
I'm just guessing that one option would be,
(?<=^GPH_(?!EPL))
and re.sub with,
EPL_
Test
import re
print(re.sub(r"(?<=^GPH_(?!EPL))", "EPL_", "GPH_ABC_JKL_OPQ"))
Output
GPH_EPL_ABC_JKL_OPQ
Simply use this:
https://regex101.com/r/GwBsg2/2
pattern: ^(?!^(?:[^_\n]+_)*EPL_?(?:[^_\n]+_?)*)(.*)GPH
substitute: \1GPH_EPL
flags: gm

Filename match with Python regex

I have a text file scraped from my email which contains one attachment per mail. The attachment is present under different names with different formats, for example:
filename="John_wheeler 11041997 resume.pdf";
filename="Kujal_newResume(1).pdf";
filename=JohnKrasinski_Resume.pdf
My question is: is there any way to find a RegEx pattern that would start searching from filename= and go until the dot character (that separates from file extension)? Getting file extension would be next task, but I can hold that for now.
You could try this pattern: filename="?([^.]+)
It assumes that dot separates filename from extension.
Explanation:
filename="? - match filename= literally and then match 0 or 1 double quote "
([^.]+) - match one or more characters that are not a dot (match everything up to the dot) and store them in a capturing group
Your desired filename will be stored in the first capturing group.
EXTRA: to capture also file extension, you could use such pattern: filename="?([^.]+)\.([^";]+)
The additional part here is \.([^";]+): it matches a dot literally with \., then matches one or more characters other than " or ; with the pattern [^";]+ and stores them in the second capturing group.
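A short sketch of how the second pattern could be applied in Python to the three sample lines from the question. Each line is searched separately here, because [^";]+ would also match a newline and run into the next line if the whole file were searched as a single string.

```python
import re

pattern = re.compile(r'filename="?([^.]+)\.([^";]+)')

lines = [
    'filename="John_wheeler 11041997 resume.pdf";',
    'filename="Kujal_newResume(1).pdf";',
    'filename=JohnKrasinski_Resume.pdf',
]

results = []
for line in lines:
    match = pattern.search(line)
    if match:
        # group(1) is the name up to the first dot, group(2) the extension
        results.append((match.group(1), match.group(2)))

print(results)
```

As the answer notes, this assumes the only dot in the name is the one before the extension; a name such as my.resume.pdf would be cut at the first dot.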
How about the following:
(?:filename=)([^\.]*)\.(\w*)
This REGEX returns different groups containing the different elements you're interested in.
I'm not sure what output you expect, but this may help:
(?<=filename=)[\"]?(\w.*[.].*)(?<=\w)[\"]?
Or if you want to ignore the file extension:
(?<=filename=)[\"]?(\w.*)[\.]

Edit file names in Python according to certain rules

I have a great number of files whose names are structured as follows:
this_is_a_file.extension
I need to strip everything from the last underscore onwards (underscore included), preserving the extension, and save the file under the new name in another directory.
Note that these names have variable length, so I cannot rely on the position of individual characters.
Also, they have a different number of underscores; otherwise I'd have applied something similar to the approach in "split a file name".
How can I do it?
You could create a function that splits the original filename along underscores, and splits the last segment along periods. Then you can join it all back together again like so:
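def myJoin(filename):
    # Split the name on underscores; the last piece holds the extension.
    splitFilename = filename.split('_')
    extension = splitFilename[-1].split('.')
    # Drop the last underscore-delimited segment.
    splitFilename.pop(-1)
    return '_'.join(splitFilename) + '.' + extension[-1]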
Some examples to show it working:
>>> p="this_is_a_file.extension"
>>> myJoin(p)
'this_is_a.extension'
>>> q="this_is_a_file_with_more_segments.extension"
>>> myJoin(q)
'this_is_a_file_with_more.extension'
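A possible alternative, shown only as a sketch, separates the extension with pathlib and cuts the stem at the last underscore with rpartition. The function name strip_last_segment is made up for this example, and leaving names without any underscore unchanged is an assumption about the desired behavior.

```python
from pathlib import PurePath

def strip_last_segment(filename):
    path = PurePath(filename)
    # rpartition splits at the LAST underscore; sep is '' if there is none.
    head, sep, _tail = path.stem.rpartition('_')
    new_stem = head if sep else path.stem
    return new_stem + path.suffix

print(strip_last_segment("this_is_a_file.extension"))
# this_is_a.extension
print(strip_last_segment("this_is_a_file_with_more_segments.extension"))
# this_is_a_file_with_more.extension
```

To save each file under its new name in another directory, the result could be joined with a destination path and passed to something like shutil.copy; the destination directory is not specified in the question.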

alternative regex to match all text in between first two dashes

I'm trying to use the following regex: \-(.*?)-|\-(.*?)*. It seems to work fine on regexr, but Python says there's nothing to repeat.
I'm trying to match all text between the first two dashes or, if there is no second dash after the first, all text from the first - onwards.
Also, the regex above includes the dashes, but I would preferably like to exclude them so I don't have to do an extra replace etc.
You can use re.search with this pattern:
-([^-]*)
Note that - doesn't need to be escaped.
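A minimal sketch of this pattern with re.search follows; the helper name between_dashes is hypothetical. Because [^-]* stops at the next dash or at the end of the string, it covers both cases from the question, and the capture group excludes the dashes.

```python
import re

def between_dashes(s):
    # Capture everything after the first dash up to the next dash (or the end).
    match = re.search(r'-([^-]*)', s)
    return match.group(1) if match else None

print(between_dashes('aaaaa-bbbbbb-ccccc-ddddd'))  # bbbbbb
print(between_dashes('aaaaa-bbbbbb'))              # bbbbbb
print(between_dashes('aaaaa'))                     # None
```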
Another way is to find only the positions of the first two dashes and extract the substring between them. Or you can use split:
>>> 'aaaaa-bbbbbb-ccccc-ddddd'.split('-')[1]
'bbbbbb'
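The position-based variant mentioned above could look roughly like this, using str.find. The function name is invented for the example, and returning None when there is no dash at all is an assumption.

```python
def between_dashes_find(s):
    start = s.find('-')
    if start == -1:
        return None  # no dash at all
    end = s.find('-', start + 1)
    # No second dash: take everything after the first one.
    return s[start + 1:] if end == -1 else s[start + 1:end]

print(between_dashes_find('aaaaa-bbbbbb-ccccc-ddddd'))  # bbbbbb
print(between_dashes_find('aaaaa-bbbbbb'))              # bbbbbb
```

Unlike split('-')[1], this does not raise IndexError when the string contains no dash.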
