Filename match with Python regex - python

I have a text file scraped from my email which contains one attachment per mail. The attachment appears under different names and in different formats, for example:
filename="John_wheeler 11041997 resume.pdf";
filename="Kujal_newResume(1).pdf";
filename=JohnKrasinski_Resume.pdf
My question is: is there any way to find a RegEx pattern that would start searching from filename= and go until the dot character (that separates from file extension)? Getting file extension would be next task, but I can hold that for now.

You could try this pattern: filename="?([^.]+)
It assumes that dot separates filename from extension.
Explanation:
filename="? - match filename= literally and then match 0 or 1 double quote "
([^.]+) - match one or more characters that are not a dot (everything up to the first dot) and store them in a capturing group
Your desired filename will be stored in the capturing group.
EXTRA: to also capture the file extension, you could use this pattern: filename="?([^.]+)\.([^";]+)
The additional part here is \.([^";]+): \. matches the dot literally, then [^";]+ matches one or more characters other than " or ; and stores them in the second capturing group.
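A minimal sketch of how the two-group pattern could be applied in Python to the three sample lines from the question. The helper name parse and the variable names are illustrative, not part of the original answer.

```python
import re

# Pattern from the answer: group 1 = name up to the first dot, group 2 = extension
pattern = re.compile(r'filename="?([^.]+)\.([^";]+)')

lines = [
    'filename="John_wheeler 11041997 resume.pdf";',
    'filename="Kujal_newResume(1).pdf";',
    'filename=JohnKrasinski_Resume.pdf',
]

def parse(line):
    """Return (name, extension) or None when the line has no filename= entry."""
    m = pattern.search(line)
    return m.groups() if m else None

results = [parse(line) for line in lines]
for name_and_ext in results:
    print(name_and_ext)
```

Note that a name containing a dot of its own (for example report.v2.pdf) would be cut at the first dot, which is the assumption the answer already states.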

How about the following:
(?:filename=)([^\.]*)\.(\w*)
This REGEX returns different groups containing the different elements you're interested in.

I'm not sure what output you expect, but this may help:
(?<=filename=)[\"]?(\w.*[.].*)(?<=\w)[\"]?
Or if you want to ignore the file extension:
(?<=filename=)[\"]?(\w.*)[\.]

Related

Match only words (sometimes with dots separating) regex

I have a list like so:
example.com=120.0.0.0
ben.example.com=120.0.0.0
+ben.example=120.0.0.0
+ben.example.com.np=120.0.0.0
ben=120.0.0.0
ben-example.com=120.0.0.0
ben43.example.com=120.0.0.0
I need to find only the words (separated by dots).
No IPs, =, + and so on.
Some FQDNs have multiple dots, some have none at all.
Is this possible?
If the regex works well, I want to get only these:
ben.example.com.np
ben.example
ben.example.com
example.com
ben
ben43.example.com
I want to parse the file into IPs and FQDNs via Python regex so I can work with it and check whether the IPs are available for the domain.
This is very straightforward
import re
fqdns = re.findall(r"[a-zA-Z\.-]{2,}", text, flags=re.M)
gives
['example.com', 'ben.example.com', 'ben.example', 'ben-example.com.np', 'ben']
The group matches all characters in the ranges a-z and A-Z, along with dot . and -. The {2,} means match at least 2 characters in a row, so it won't match the dots in the IPs.
EDIT: After I wrote this answer the parameters of the question changed slightly, as some of the names contained numbers. So, instead of using re.findall() to get all matches in a (potentially multi-line) input, use re.search().group() with a slightly altered regex and process the input line by line:
import re
with open("path/to/file", "r") as f:
    # first run of letters, digits, dots and hyphens on each line
    matches = (re.search(r"[a-zA-Z0-9.-]{2,}", line) for line in f)
    fqdns = [m.group() for m in matches if m]
re.search() returns the first match found anywhere in the line, so a leading + is skipped; re.match() only matches at the very start of the line and would return None for those lines. .group() is the way you access the matched string.
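Since the question also asks for the IPs, here is a hedged sketch that splits each line into a (name, IP) pair in one pass. The pattern LINE_RE and the variable names are assumptions for illustration; it only checks that the value looks like a dotted quad, not that it is a valid address.

```python
import re

# Optional leading "+", then the name, "=", then four dot-separated numbers.
LINE_RE = re.compile(r"^\+?([A-Za-z0-9.-]+)=(\d{1,3}(?:\.\d{1,3}){3})$")

sample = """\
example.com=120.0.0.0
ben.example.com=120.0.0.0
+ben.example=120.0.0.0
+ben.example.com.np=120.0.0.0
ben=120.0.0.0
ben-example.com=120.0.0.0
ben43.example.com=120.0.0.0
"""

pairs = []
for line in sample.splitlines():
    m = LINE_RE.match(line.strip())
    if m:  # skip lines that do not have the name=ip shape
        pairs.append(m.groups())

fqdns = [name for name, _ in pairs]
ips = [ip for _, ip in pairs]
print(fqdns)
```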

python regex expression to match (first multipart or simple part) rar archive

I would like to match
the first element in a multipart rar archive,
regex (.*.)part0*1.rar
or
a single-part rar archive,
but not strings that match ^.*(part\d+).rar$
I use this regex:
regex = r"(.*)(?:part0*1|.*[^(part\d+)])\.rar"
I've got some issues:
apps.rar matches, but apps2.rar doesn't match and should
LA460.6.7.rar doesn't match and should
apps.rar should match with group(1)="apps", not group(1)="app"
Could you find the error in the regex?
Thanks
The reason that you sometimes match the last character is because the pattern (.*)(?:part0*1|.*[^(part\d+)])\.rar that you tried, first captures the whole line in capture group 1.
That capture group is followed by an alternation matching either part0*1 or .*[^(part\d+)]
You can see that the lines that have part followed by a digit at the end are matched.
But, when there is no match for part0*1 the next alternative is tried which is .*[^(part\d+)].
The second alternative matches until the end of the string (where it already is), and then matches a single character of [^(part\d+)] because using the square brackets makes it a character class without a quantifier.
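The behaviour described above can be reproduced with a short check. This is only a sketch using the pattern from the question and two of its sample names; the variable names are illustrative.

```python
import re

# Pattern from the question; [^(part\d+)] is a character class that matches
# exactly ONE character that is not "(", "p", "a", "r", "t", a digit, "+" or ")".
original = re.compile(r"(.*)(?:part0*1|.*[^(part\d+)])\.rar")

m = original.match("apps.rar")
group1 = m.group(1)  # the class consumes the final "s", leaving "app"

# In "apps2.rar" the character before ".rar" is a digit, which the class excludes.
apps2 = original.match("apps2.rar")

print(group1, apps2)
```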
One option is to use a negative lookahead asserting that the string does not contain part followed by optional zeroes and then either a digit 2-9 with optional further digits, or a digit 1-9 followed by one or more digits:
^(?!.*part0*(?:[2-9]\d*|[1-9]\d+)\.rar)(.+)\.rar$
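A quick way to check the lookahead pattern against the names from the question. The movie.partN.rar names are made-up test inputs, and the helper stem is illustrative.

```python
import re

first_or_single = re.compile(r"^(?!.*part0*(?:[2-9]\d*|[1-9]\d+)\.rar)(.+)\.rar$")

names = [
    "apps.rar",
    "apps2.rar",
    "LA460.6.7.rar",
    "movie.part1.rar",
    "movie.part01.rar",
    "movie.part2.rar",
    "movie.part10.rar",
]

def stem(name):
    """Return group 1 for single-part or first-part archives, else None."""
    m = first_or_single.match(name)
    return m.group(1) if m else None

stems = {name: stem(name) for name in names}
for name, value in stems.items():
    print(name, "->", value)
```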
You can search for filenames that either contain the word 'part' followed by 01/1 or don't contain the word 'part' at all.
Please try the regex below:
(.*part0?1|^(?!.*part.*).*)\.rar

How to extract specific characters from a string that can vary

I'm trying to extract the specific part of the name of the file that can have varying number of '_'. I previously used partition/rpartition to strip everything before and after underscore bars, but I didn't take into account the possibilities of different underscore bar numbers.
The purpose of the code is to extract specific characters in between underscore bars.
filename = os.path.basename(files).partition('_')[2].rpartition('_')[0].rpartition('_')[0].rpartition('_')[0]
The above is my current code. A typical name of the file looks like:
P0_G12_190325184517_t20190325_5
or it can also have
P0_G12_190325184517_5
From what I understand, my current code's rpartition needs to match the number of underscore bars of the file for the first file, but the same code doesn't work for the second file obviously.
I want to extract
G12
this part can also be just two characters like G1 so two to three characters from the above types of filenames.
You can use:
os.path.basename(files).split('_')[1]
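A small sketch of the split approach with a guard for names that contain no underscore. The function name second_field and the sample paths are hypothetical.

```python
import os

def second_field(path):
    """Return the text between the first and second underscore of the file name."""
    parts = os.path.basename(path).split('_')
    # A name without any underscore has only one part, so there is no second field.
    return parts[1] if len(parts) > 1 else None

print(second_field('data/P0_G12_190325184517_t20190325_5'))
print(second_field('data/P0_G1_190325184517_5'))
```

Because split does not care how many underscores follow, the same call works for both filename layouts.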
You could either use split to create a list with the separate parts, like this:
files.split('_')
Or you could use regex:
https://regex101.com/r/jiUNLV/1
And do like this:
import re
pattern = r'.*_(\w{2,3})_\d+.*'
match = re.match(pattern, files)
if match:
    print(match.group(1))

Python2: Regex pattern to match all names

Below are the files in a folder. I want to write a regex pattern that matches all the filenames
and separates each one into 4 groups:
Groups:
Text before date pattern
Date Pattern
Text after date pattern
Extension (any or no extension)
Names:
XYZ_XY__T_20180808_88
GYG_20180813.csv
JENNY_BH_COSTUMES_T_20180808_88.csv
JKS9KS9_DDD_20180809_2.txt
AMY_BH_MAKEUP_T_20180808_88.dat
UUB-134941099-00002531-003_20180814
usa-Nasa_Y_20180806_01.csv
usa-Tpkyo-HHDY_Y_20180806_01.csv
Tried this -
(\w+)(-?)_(\d{4}\d{2}\d{2})(\w+)?(\.csv|\.dat|\.txt)?
but doesn't seem to work. How to go about this?
Since you want to capture all the text before the date substring regardless, it looks like all you have to do is include dashes in the initial group: ([\w-]+) (and drop the following group that captures an optional standalone dash). \w already covers underscores, and the hyphen goes last in the class because Python's re rejects [\w-_] as a bad character range.
You may also use ^ and $ to ensure that matches span the entire line:
^([\w-]+)(\d{4}\d{2}\d{2})(\w+)?(\.csv|\.dat|\.txt)?$
https://regex101.com/r/M5kIo6/1
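A hedged sketch that runs the four-group pattern over the file names from the question. It writes the first class as [\w-] (hyphen last, since \w already includes the underscore), an assumption made so that the pattern compiles with Python's re module; \d{8} is shorthand for \d{4}\d{2}\d{2}.

```python
import re

pattern = re.compile(r'^([\w-]+)(\d{8})(\w+)?(\.csv|\.dat|\.txt)?$')

names = [
    "XYZ_XY__T_20180808_88",
    "GYG_20180813.csv",
    "JENNY_BH_COSTUMES_T_20180808_88.csv",
    "JKS9KS9_DDD_20180809_2.txt",
    "AMY_BH_MAKEUP_T_20180808_88.dat",
    "UUB-134941099-00002531-003_20180814",
    "usa-Nasa_Y_20180806_01.csv",
    "usa-Tpkyo-HHDY_Y_20180806_01.csv",
]

parsed = {}
for name in names:
    m = pattern.match(name)
    # Groups: text before date, date, text after date, extension (None if absent)
    parsed[name] = m.groups() if m else None

for name, groups in parsed.items():
    print(name, groups)
```

The first group is greedy, so it ends at the last run of eight digits that still lets the rest of the name match; the separating underscore stays at the end of group 1.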

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
but I got null.
If I understood the requirement correctly:
a1=re.search(r'_abc(.*)xyz',line,re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
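The effect of the flag can be seen with a tiny made-up input (the text value below is illustrative, not the asker's file):

```python
import re

text = "start_abc:\nmiddle\nxyz"

# Without re.DOTALL the dot stops at the first newline, so nothing matches.
without_flag = re.search(r"_abc:(.*)xyz", text)

# With re.DOTALL the dot also matches newlines, so the group spans the lines.
with_flag = re.search(r"_abc:(.*)xyz", text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(1)))
```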
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
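If only the text between the two markers is wanted, a variant with a capturing group might look like the sketch below. The function name text_between and the shortened sample are assumptions; the non-greedy .*? stops at the first occurrence of the suffix, which may or may not be what you want if the suffix appears more than once.

```python
import re

def text_between(prefix, suffix, text):
    """Return the text between prefix and the first following suffix, or None."""
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None

sample = "sjaskdjajldlj_abc:\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\nxyz\n"
between = text_between("_abc:", "xyz", sample)
print(repr(between))
```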
