Match only words (sometimes with dots separating) regex - python

I have a list like so:
example.com=120.0.0.0
ben.example.com=120.0.0.0
+ben.example=120.0.0.0
+ben.example.com.np=120.0.0.0
ben=120.0.0.0
ben-example.com=120.0.0.0
ben43.example.com=120.0.0.0
I need to find only the words (separated by dots).
No IPs, =, + and so on.
Some FQDN have multiple dots, some none at all and so on.
Is this possible?
If the script works, when I run the regex I want to get only these:
ben.example.com.np
ben.example
ben.example.com
example.com
ben
ben43.example.com
I want to parse the file into IPs and FQDNs via Python regex so I can work with it and check if the IPs are available for the domain.

This is very straightforward
import re
fqdns = re.findall(r"[a-zA-Z\.-]{2,}", text, flags=re.M)
gives
['example.com', 'ben.example.com', 'ben.example', 'ben-example.com.np', 'ben']
The group matches all characters in the ranges a-z and A-Z, along with dot . and -. The {2,} means match at least 2 characters in a row, so it won't match the dots in the IPs.
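To check what this character class does and does not pick up, here is a small sketch. The two sample lines are taken from the question; the variable names are only illustrative.

```python
import re

pattern = r"[a-zA-Z\.-]{2,}"

# Single dots between digits never reach two characters in a row,
# so nothing in the IP part is matched.
plain = re.findall(pattern, "ben.example.com=120.0.0.0")
print(plain)  # ['ben.example.com']

# Digits are not in the class, so a name containing them is split in two.
with_digits = re.findall(pattern, "ben43.example.com=120.0.0.0")
print(with_digits)  # ['ben', '.example.com']
```

The second result is why names containing digits need the digits added to the character class.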
EDIT: After I wrote this answer the parameters of the question changed slightly, as some of the hostnames contained numbers. So, instead of using re.findall() to get all matches in a (potentially multi-line) input, you should use re.search().group() with a slightly altered regex and process the input line by line:
import re
with open("path/to/file", "r") as f:
    # re.search finds the first run anywhere in the line, so a leading "+" is skipped
    matches = (re.search(r"[a-zA-Z0-9.-]{2,}", line) for line in f)
    fqdns = [m.group() for m in matches if m]
re.search() returns the first match in the line, or None if there is none. (re.match() only matches at the very start of the string, so it would return None for the lines that begin with +.) .group() is the way you access the matched string.
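Since the stated goal is to split the file into names and IPs, one possible approach is to capture both sides of the = in a single pass. This is only a sketch: it assumes every line has the form name=ip with an optional leading +, and LINE_RE, pairs, fqdns and ips are names made up for the example.

```python
import re

# Sample lines copied from the question; the "name=ip" layout is assumed for every line.
text = """example.com=120.0.0.0
ben.example.com=120.0.0.0
+ben.example=120.0.0.0
+ben.example.com.np=120.0.0.0
ben=120.0.0.0
ben-example.com=120.0.0.0
ben43.example.com=120.0.0.0"""

# Optional leading "+", then the name (letters, digits, dots, hyphens), "=", then the IP.
LINE_RE = re.compile(r"^\+?([A-Za-z0-9.-]+)=([0-9.]+)$", re.M)

pairs = LINE_RE.findall(text)          # list of (name, ip) tuples
fqdns = [name for name, ip in pairs]
ips = [ip for name, ip in pairs]

print(fqdns)
print(ips)
```

With re.M, ^ and $ anchor at each line, so lines that do not fit the assumed layout are simply skipped rather than partially matched.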

Related

Filename match with Python regex

I have a text file scraped from my email which contains 1 attachment/mail. The attachment is present under different names with different formats, for example:
filename="John_wheeler 11041997 resume.pdf";
filename="Kujal_newResume(1).pdf";
filename=JohnKrasinski_Resume.pdf
My question is: is there any way to find a RegEx pattern that would start searching from filename= and go until the dot character (that separates from file extension)? Getting file extension would be next task, but I can hold that for now.
You could try this pattern: filename="?([^.]+)
It assumes that dot separates filename from extension.
Explanation:
filename="? - match filename= literally and then match 0 or 1 double quote "
([^.]+) - match one or more characters that are not a dot (match everything until the dot) and store them in a capturing group
Your desired filename will be stored in capturing group.
EXTRA: to capture also file extension, you could use such pattern: filename="?([^.]+)\.([^";]+)
The additional part here is \.([^";]+): it matches the dot literally with \., then matches one or more characters other than " or ; with the pattern [^";]+ and stores them in the second capturing group.
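In Python, the pattern with both groups could be used roughly like this. It is a sketch run against the three sample lines from the question; FILENAME_RE, lines and results are illustrative names, and it keeps the assumption that the first dot separates the name from the extension.

```python
import re

# Pattern from the answer above: group 1 is the name, group 2 the extension.
FILENAME_RE = re.compile(r'filename="?([^.]+)\.([^";]+)')

lines = [
    'filename="John_wheeler 11041997 resume.pdf";',
    'filename="Kujal_newResume(1).pdf";',
    'filename=JohnKrasinski_Resume.pdf',
]

results = []
for line in lines:
    m = FILENAME_RE.search(line)
    if m:  # skip lines without a filename= field
        results.append((m.group(1), m.group(2)))

print(results)
```

When reading real lines from a file, it is probably worth calling line.strip() first, because the second group would otherwise swallow the trailing newline of an unquoted filename.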
How about the following:
(?:filename=)([^\.]*)\.(\w*)
This REGEX returns different groups containing the different elements you're interested in.
I'm not sure what output you expect, but this may help:
(?<=filename=)[\"]?(\w.*[.].*)(?<=\w)[\"]?
Or if you want to ignore the file extension:
(?<=filename=)[\"]?(\w.*)[\.]

Python regex match pattern "X<string1>:X<string2>"

I'm parsing a file which has text "$string1:$string2"
How do I regex match this string and extract "string1" and "string2" from it, basically regex match this pattern : "$*:$*"
You were nearly there with your own pattern, it needs three alterations in order to work as you want it.
First, the star in regexes isn't a glob, as you might expect from shell scripting; it's a Kleene star. That means it needs some character or group that it can apply its "zero to n times" logic to. In your case, the alphanumeric character class \w should work. If that's too restrictive, use . instead, which matches any character except line breaks.
Secondly, you need to apply the regex in a way that you can easily extract the results you want. The usual way to go about it is to define groups, using parentheses.
Last but not least, the $ sign is a meta-character in regexes, so if you want to match it literally, you need to write a backslash in front of it.
In working code, it'll look like this:
import re
s = "$string1:$string2"
r = re.compile(r"\$(\w*):\$(\w*)")
match = r.match(s)
print(match.group(1)) # print the first group that was matched
print(match.group(2)) # print the second group that was matched
Output:
string1
string2
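To see why the backslash matters, here is a short comparison of the escaped and unescaped patterns on the same input. It is only a sketch; the variable names are illustrative.

```python
import re

s = "$string1:$string2"

# Unescaped, "$" is an end-of-string anchor, so this cannot match at position 0.
unescaped = re.match(r"$(\w*):$(\w*)", s)
print(unescaped)  # None

# Escaped, "\$" matches a literal dollar sign.
escaped = re.match(r"\$(\w*):\$(\w*)", s)
if escaped:  # re.match returns None when the text does not fit the pattern
    first, second = escaped.groups()
    print(first, second)  # string1 string2
```

Checking the result before calling .group() or .groups() avoids an AttributeError on lines that do not follow the pattern.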

Using regex to find multiple matches on the same line

I need to build a program that can read multiple lines of code, and extract the right information from each line.
Example text:
no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>
For this case, the program should detect <'found'>, <'one'>, <'three'>, <'matches'> and <'found'> as matches because they all have "<" and "'".
However, I cannot work out a system using regex to account for multiple matches on the same line. I was using something like:
re.search('^<.*>$')
But if there are multiple matches on one line, the extra "<'" and "'>" are taken as part of the .*, without counting them as separate matches. How do I fix this?
This works -
>>> r = re.compile(r"\<\'.*?\'\>")
>>> r.findall(s)
["<'found'>", "<'one'>", "<'three'>", "<'matches'>", "<'found'>"]
Use findall instead of search:
re.findall( r"<'.*?'>", str )
You can use re.findall and match on non > characters inside of the angle brackets:
>>> re.findall('<[^>]*>', "<'three'><'matches'><'found'>")
["<'three'>", "<'matches'>", "<'found'>"]
Non-greedy quantifier '?' as suggested by anubhava is also an option.
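For comparison, here is a small sketch that runs the greedy pattern, the non-greedy pattern and the [^>]* pattern over the example text from the question. The variable names are illustrative.

```python
import re

text = """no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>"""

greedy = re.findall(r"<'.*'>", text)    # .* runs to the last '> on the line
lazy = re.findall(r"<'.*?'>", text)     # .*? stops at the first '>
no_gt = re.findall(r"<[^>]*>", text)    # anything between < and >, quotes not required

print(greedy)  # ["<'found'>", "<'one'>", "<'three'><'matches'><'found'>"]
print(lazy)    # ["<'found'>", "<'one'>", "<'three'>", "<'matches'>", "<'found'>"]
print(no_gt)   # ["<'found'>", "<'one'>", '<found>', "<'three'>", "<'matches'>", "<'found'>"]
```

Note that the [^>]* version also returns <found>, which has no quotes, so it only fits if unquoted tags should count as matches too.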

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1 = re.search(r'_abc(.*)xyz', line, re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
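You called re.escape on parts of your pattern that contain special characters, which turns them into literals, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want. (On Python 3.7 and later the colon is left unescaped, so the result is '\\*_abc:'; the effect is the same.)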
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
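A few quick checks illustrate the difference between * as a repeat operator and .* as "anything". This is just a sketch, and the compiled flag is a name made up for the example.

```python
import re

# "*" repeats the item before it: "ab*" is an "a" followed by zero or more "b".
print(bool(re.fullmatch(r"ab*", "abbb")))  # True
print(bool(re.fullmatch(r"ab*", "abcd")))  # False

# ".*" is "any character, zero or more times".
print(bool(re.fullmatch(r"a.*", "abcd")))  # True

# A "*" with nothing in front of it is rejected when the pattern is compiled.
try:
    re.compile("*abc")
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False
```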
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
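If only the text between the two markers is wanted, a capturing group can be added. The following is a sketch of that variant, not part of the answer above: text_between is a hypothetical helper, the sample string is a shortened version of the file from the question, and the non-greedy .*? is a choice made here so the match stops at the first suffix.

```python
import re

def text_between(prefix, suffix, text):
    """Return the text between prefix and the first following suffix, or None."""
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    m = re.search(pattern, text, re.DOTALL)  # DOTALL lets . cross line breaks
    return m.group(1) if m else None

# Shortened sample in the shape of the file from the question.
sample = "sjaskdjajldlj_abc:\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\nxyz\n"

middle = text_between("_abc:", "xyz", sample)
print(repr(middle))  # '\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\n'

# Without re.DOTALL the same pattern finds nothing, because . stops at newlines.
without_flag = re.search(r"_abc:(.*?)xyz", sample)
print(without_flag)  # None
```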

Editing a text file using python

I have an auto generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to make the program to detect such a pattern and change the citekey form to xxxxx:2009?
It's not quite clear to me which expression you want to match, but you can do all of this with a regex, using import re and re.sub as shown. [0-9]{4} matches exactly 4 digits.
(Edit, to incorporate suggestions)
import re
inf = 'temp.txt'
outf = 'out.txt'
with open(inf) as f, open(outf, 'w') as o:
    text = f.read()
    # keep the key and the four-digit year, drop the trailing "tb"
    text = re.sub(r"(xxxxx:[0-9]{4})tb", r"\1", text) # match your regex here
    o.write(text)
You actually just want to remove the two letters after the year in a reference. Supposing we can uniquely identify a reference as a colon followed by four digits and two letters, then the following regular expression would work (at least it works in this example code):
import re
s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)
Explanation: match a colon : followed by four digits [0-9]{4} followed by any two "word" characters \w{2}. The parentheses capture just the part you want to keep, and r'\1' means you are replacing each whole match with the smaller part of it that is in the first (and only) group of parentheses. The r before the string is there because \1 must be read as a raw string, and not as an escape sequence.
Hope this helps!
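A slightly stricter variant is sketched below. It rests on an assumption that is not in the question: the suffix is exactly two lowercase letters that end at a word boundary. It uses re.subn so that the number of changed keys can be checked.

```python
import re

s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""

# Assumption: the suffix is exactly two lowercase letters ending at a word boundary.
new_s, count = re.subn(r"(:[0-9]{4})[a-z]{2}\b", r"\1", s)

print(count)  # 3
print(new_s)
```

The \b keeps the pattern from trimming a key whose year is followed by a longer run of letters, which the \w{2} version would shorten by two characters.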
