Python2: Regex pattern to match all names

Below are the files in a folder. I want to write a regex pattern that matches all of the filenames
and splits each one into 4 groups:
Groups:
Text before date pattern
Date Pattern
Text after date pattern
Extension (any or no extension)
Names:
XYZ_XY__T_20180808_88
GYG_20180813.csv
JENNY_BH_COSTUMES_T_20180808_88.csv
JKS9KS9_DDD_20180809_2.txt
AMY_BH_MAKEUP_T_20180808_88.dat
UUB-134941099-00002531-003_20180814
usa-Nasa_Y_20180806_01.csv
usa-Tpkyo-HHDY_Y_20180806_01.csv
I tried this:
(\w+)(-?)_(\d{4}\d{2}\d{2})(\w+)?(\.csv|\.dat|\.txt)?
but it doesn't seem to work. How should I go about this?

Since you want to capture all the text before the date substring regardless, all you have to do is allow dashes in the initial group: ([\w-]+) (underscores are already covered by \w), and drop the following group that captures an optional standalone dash. Keep the dash last in the class: Python's re module rejects [\w-_] as a bad character range.
You may also use ^ and $ to ensure that the match spans the entire line:
^([\w-]+)(\d{4}\d{2}\d{2})(\w+)?(\.csv|\.dat|\.txt)?$
https://regex101.com/r/M5kIo6/1
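As a rough sketch of how the anchored pattern could be applied in Python, the snippet below runs it over the names from the question. The helper name split_name is made up for illustration, and the character class is written as [\w-] on the assumption that this is the intended meaning (\w already matches the underscore). Note that the first group keeps its trailing underscore, and that only the three listed extensions are accepted.

```python
import re

# Assumed form of the pattern: [\w-] covers letters, digits, "_" and "-".
PATTERN = re.compile(r'^([\w-]+)(\d{4}\d{2}\d{2})(\w+)?(\.csv|\.dat|\.txt)?$')

NAMES = [
    'XYZ_XY__T_20180808_88',
    'GYG_20180813.csv',
    'JENNY_BH_COSTUMES_T_20180808_88.csv',
    'JKS9KS9_DDD_20180809_2.txt',
    'AMY_BH_MAKEUP_T_20180808_88.dat',
    'UUB-134941099-00002531-003_20180814',
    'usa-Nasa_Y_20180806_01.csv',
    'usa-Tpkyo-HHDY_Y_20180806_01.csv',
]


def split_name(name):
    """Return (before, date, after, extension), or None if there is no match.

    Groups that did not take part in the match come back as None.
    """
    m = PATTERN.match(name)
    return m.groups() if m else None


for name in NAMES:
    print(split_name(name))
```

If the trailing underscore is unwanted, it can be stripped afterwards with rstrip('_') on the first group.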

Related

How to extract specific characters from a string that can vary

I'm trying to extract a specific part of a file name that can contain a varying number of '_' characters. I previously used partition/rpartition to strip everything before and after the underscores, but I didn't account for file names with a different number of underscores.
The purpose of the code is to extract the characters between two particular underscores.
filename = os.path.basename(files).partition('_')[2].rpartition('_')[0].rpartition('_')[0].rpartition('_')[0]
The above is my current code. A typical name of the file looks like:
P0_G12_190325184517_t20190325_5
or it can also have
P0_G12_190325184517_5
As I understand it, the number of rpartition calls in my current code has to match the number of underscores in the first file name, so the same code obviously doesn't work for the second one.
I want to extract
G12
This part can also be just two characters, like G1, so I need two to three characters from file names of the above types.

You can use:
os.path.basename(files).split('_')[1]

You could either use split to create a list with the separate parts, like this:
files.split('_')
Or you could use regex:
https://regex101.com/r/jiUNLV/1
And use it like this:
import re
pattern = r'.*_(\w{2,3})_\d+.*'
match = re.match(pattern, files)
if match:
    print(match.group(1))
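To compare the two suggestions side by side, here is a minimal sketch that wraps each one in a function. The function names, the /data/ directory, and the G1 file name are all made up for illustration; the regex is an anchored variant rather than the exact pattern above, and assumes the wanted field is always the second one.

```python
import os
import re


def tag_by_split(path):
    """Return the second underscore-separated field of the base name."""
    return os.path.basename(path).split('_')[1]


def tag_by_regex(path):
    """Return the second field only if it is 2-3 characters long, else None."""
    m = re.match(r'[^_]+_([^_]{2,3})_', os.path.basename(path))
    return m.group(1) if m else None


# Hypothetical paths built from the file names in the question.
paths = (
    '/data/P0_G12_190325184517_t20190325_5',
    '/data/P0_G12_190325184517_5',
    '/data/P0_G1_190325184517_5',
)
for p in paths:
    print('%s %s' % (tag_by_split(p), tag_by_regex(p)))
```

The split version is shorter, while the regex version also checks the length of the field, which may or may not matter for your files.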

Filename match with Python regex

I have a text file scraped from my email which contains 1 attachment/mail. The attachment is present under different names with different formats, for example:
filename="John_wheeler 11041997 resume.pdf";
filename="Kujal_newResume(1).pdf";
filename=JohnKrasinski_Resume.pdf
My question is: is there any way to find a RegEx pattern that would start searching from filename= and go until the dot character (that separates from file extension)? Getting file extension would be next task, but I can hold that for now.

You could try this pattern: filename="?([^.]+)
It assumes that a dot separates the filename from the extension.
Explanation:
filename="? - matches filename= literally, then matches zero or one double quote "
([^.]+) - matches one or more characters that are not a dot (everything up to the first dot) and stores them in a capturing group
Your desired filename will be stored in the capturing group.
EXTRA: to also capture the file extension, you could use this pattern: filename="?([^.]+)\.([^";]+)
The additional part here is \.([^";]+): \. matches the dot literally, then [^";]+ matches one or more characters other than " or ; and stores them in the second capturing group.
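A small sketch of how the two-group pattern might be run over the scraped text in Python. The variable names are illustrative, and one change is my own assumption rather than part of the answer: \s is added to the last character class so that an unquoted name at the end of a line does not pull the newline into the extension.

```python
import re

# filename, optional opening quote, name up to the first dot, then extension.
# The \s in the last class is an added assumption (see the note above).
FILENAME_RE = re.compile(r'filename="?([^.]+)\.([^";\s]+)')

text = '''filename="John_wheeler 11041997 resume.pdf";
filename="Kujal_newResume(1).pdf";
filename=JohnKrasinski_Resume.pdf
'''

# findall returns one (name, extension) tuple per match.
for name, ext in FILENAME_RE.findall(text):
    print('%s | %s' % (name, ext))
```

Because the name is defined as "everything up to the first dot", a file name that itself contains a dot would be cut short.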

How about the following:
(?:filename=)([^\.]*)\.(\w*)
This regex returns separate groups containing the elements you're interested in.

I'm not sure what output you expect, but this may help:
(?<=filename=)[\"]?(\w.*[.].*)(?<=\w)[\"]?
Or if you want to ignore the file extension:
(?<=filename=)[\"]?(\w.*)[\.]

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/ (I am aware that I am testing it with \author there and not \maketitle).
In python, I try the following:
import re
# Raw triple-quoted string, so the backslashes and newlines are kept literally
text = r'''
\author{
\small
}
\maketitle
'''
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
         re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
    for m in p.finditer(text):
        print(m.group())
Python freezes. I suspect this has something to do with my pattern and that the SRE engine fails on it.
EDIT: Is there something wrong with my regex? Can it be improved so that it actually works? I still get the same results on my machine.
EDIT 2: Can this be fixed so that the pattern supports an optional "followed by", using ?: or ?= lookaheads, so that both parts can be captured?

After reading the heading, "Parentheses Create Numbered Capturing Groups", on this site: https://www.regular-expressions.info/brackets.html, I managed to find the answer which is:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.
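The Set(Value)? example from the quote can be checked directly in Python. This is only an illustration of the quoted behaviour, with made-up variable names; one detail specific to Python's re module is that a group which did not take part in the match is reported as None rather than as an empty string.

```python
import re

pattern = re.compile(r'Set(Value)?')

short = pattern.match('Set')        # optional group does not participate
full = pattern.match('SetValue')    # optional group matches "Value"

print('%r %r' % (short.group(0), short.group(1)))
print('%r %r' % (full.group(0), full.group(1)))
```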

alternative regex to match all text in between first two dashes

I'm trying to use the following regex: \-(.*?)-|\-(.*?)* It seems to work fine on regexr, but Python says there's "nothing to repeat".
I'm trying to match all the text between the first two dashes or, if there is no second dash after the first, all the text from the first - onwards.
Also, the regex above includes the dashes, but I would preferably like to exclude them so I don't have to do an extra replace, etc.

You can use re.search with this pattern:
-([^-]*)
Note that - doesn't need to be escaped.
Another way is to find only the positions of the first two dashes and extract the substring between them. Or you can use split:
>>> 'aaaaa-bbbbbb-ccccc-ddddd'.split('-')[1]
'bbbbbb'
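A short sketch of the two non-split approaches described above, covering both the two-dash case and the single-dash case. The function names are hypothetical, and both functions return None when the string has no dash at all, which is an assumption about the desired behaviour.

```python
import re


def between_dashes_re(s):
    """Text after the first dash, up to the next dash or the end of the string."""
    m = re.search(r'-([^-]*)', s)
    return m.group(1) if m else None


def between_dashes_find(s):
    """Same result, using the positions of the first two dashes."""
    start = s.find('-')
    if start == -1:
        return None
    end = s.find('-', start + 1)
    # No second dash: take everything after the first one.
    return s[start + 1:] if end == -1 else s[start + 1:end]


for sample in ('aaaaa-bbbbbb-ccccc-ddddd', 'aaaaa-bbbbbb', 'aaaaa'):
    print('%r %r' % (between_dashes_re(sample), between_dashes_find(sample)))
```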

regex for capturing group that is only sometimes present

I have a set of filenames like:
PATJVI_RNA_Tumor_8_3_63BJTAAXX.310_BUSTARD-2012-02-19.fq.gz
PATMIF_RNA_Tumor_CGTGAT_2_1_BC0NKBACXX.334_BUSTARD-2012-05-07.fq.gz
I would like to have a single regex (in python, fyi) that can capture each of the groups between the "_" characters. However, note that in the second filename, there is a group that is present that is not present in the first filename. Of course, one can use a string split, etc., but I would like to do this with a single regex. The regex for the first filename is something like:
(\w+)_(\w+)_(\w+)_(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
And the second will be:
(\w+)_(\w+)_(\w+)_(\w+)_(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
I'd like the regex group to be empty when the optional group is absent and to contain the optional group when it is present (so that I can use it later in constructing a new filename with \4).

To make a group optional, you can add ? after the desired group. Like this:
(\w+)?
But your example has an underscore that should be optional as well. To deal with it, you can group it together with the optional group.
((\w+)_)?
However, this will add a new group to your match results. To avoid that, use a non-capturing group:
(?:(\w+)_)?
The final result will look like this:
(\w+)_(\w+)_(\w+)_(?:(\w+)_)?(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
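One caveat that seems worth checking before relying on \4: \w also matches the underscore, so with greedy matching an earlier group can absorb a whole extra field and leave the optional group unmatched. The sketch below compares the pattern as written (LOOSE) with a tightened variant (STRICT) that uses [^\W_]+, meaning "word characters except the underscore". The names LOOSE, STRICT and SEG are made up here, and the tightening is my own assumption about what the fields may contain.

```python
import re

# Pattern as given in the answer.
LOOSE = re.compile(
    r'(\w+)_(\w+)_(\w+)_(?:(\w+)_)?(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz')

# Assumed tightening: a field is one or more word characters, excluding "_".
SEG = r'[^\W_]+'
STRICT = re.compile(
    r'({s})_({s})_({s})_(?:({s})_)?(\d)_(\d)_({s})\.(\d+)_(\S+)\.fq\.gz'.format(s=SEG))

first = 'PATJVI_RNA_Tumor_8_3_63BJTAAXX.310_BUSTARD-2012-02-19.fq.gz'
second = 'PATMIF_RNA_Tumor_CGTGAT_2_1_BC0NKBACXX.334_BUSTARD-2012-05-07.fq.gz'

for name in (first, second):
    # Show groups 1 and 4 for each pattern to make the difference visible.
    print('loose:  %r %r' % LOOSE.match(name).group(1, 4))
    print('strict: %r %r' % STRICT.match(name).group(1, 4))
```

With the strict variant, group 4 is None for the first filename and holds the extra field for the second, which is the behaviour the question asks for.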
