How can I write a regular expression to replace hash-like strings - python

There are some Windows files and folders with names like:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157
c:\windows\system32\config\systemprofile\appdata\locallow\microsoft\cryptneturlcache\metadata\be7ffd2fd84d3b32fd43dc8f575a9f28
c:\windows\softwaredistribution\download\ab1b092b40dee3ba964e8305ecc7d0d9
Notice how they end with a string that looks like a hash:
57c8edb95df3f0ad4ee2dc2b8cfd4157, be7ffd2fd84d3b32fd43dc8f575a9f28,
ab1b092b40dee3ba964e8305ecc7d0d9
I am not good with regex and I would like to know if there is a way to write a regex that would replace these hash-like names within a path with something like
"##HASH##"
The paths do not necessarily end with these, as these are usually folders/subfolders containing other folders of their own.
So my goal is to essentially get a path looking like:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\some_subfolder\some_file.inf
to become:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##HASH##\some_subfolder\some_file.inf
Is there a way to do that in Python?
Thanks in advance.

Note that the "hashes" are all 32 characters long. If that holds for all of them, the regex is pretty straightforward.
For example, with the last path you posted:
import re
text = r'c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\some_subfolder\some_file.inf'
res = re.sub(r'\w{32}', '##HASH##', text)
print(res)
prints:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##HASH##\some_subfolder\some_file.inf
Notice the r prefix on the path: in a raw string, backslashes are kept literally. Without it, Python reads \a as a bell character and \57 as an octal escape, which corrupts the path.
The \w{32} regex means "match exactly 32 consecutive word characters". Since \w also matches g-z and underscores, [0-9a-f]{32} is a tighter pattern if the names are always lowercase hex.

This might help:
import os
import re
uuid = re.compile(r'[0-9a-f]{32}\Z', re.I)
A = r"c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\sub_folder"
path = os.path.normpath(A)
path = path.split(os.sep)
path = "\\".join(["##"+i+"##" if uuid.match(i) else i for i in path])
print(path)
Result (on Windows, where os.sep is a backslash):
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##57c8edb95df3f0ad4ee2dc2b8cfd4157##\sub_folder
Note: the pattern matches names of exactly 32 hex characters; you can change that length in re.compile. The path is a raw string (r"...") so that \a and \57 are not interpreted as escape sequences.
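A tighter variant of the two approaches above, as a minimal sketch: it replaces a path component only when the whole component is exactly 32 hex digits, so a long ordinary folder name is left alone. The names mask_hashes and HASH_COMPONENT are made up for illustration, and the 32-digit length is an assumption taken from the sample paths.

```python
import re

# Assumption: a "hash" is a whole path component of exactly 32 hex digits,
# preceded by a backslash and followed by a backslash or the end of the path.
HASH_COMPONENT = re.compile(r'(?<=\\)[0-9a-f]{32}(?=\\|$)', re.IGNORECASE)

def mask_hashes(path, placeholder='##HASH##'):
    """Replace every hash-like component of a backslash-separated path."""
    return HASH_COMPONENT.sub(placeholder, path)

path = (r'c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft'
        r'\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157'
        r'\some_subfolder\some_file.inf')
print(mask_hashes(path))
```

The lookbehind and lookahead check the separators without consuming them, so the surrounding backslashes stay in the result.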

Related

Match Naming Convention of Text File Regex

I am trying to find text files with a certain naming convention using regex, but have so far been unsuccessful.
The naming convention is file_[year]-[month]-[day].txt (eg. file_2010-09-15.txt).
Here is what I have so far: ^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$
I'm trying to use it in my code like this:
for text_file in os.listdir(path):
    if fnmatch.fnmatch(text_file, '^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$'):
        # print number of files found
I think the issue is the pattern type that fnmatch expects. The documentation states the following:
This module provides support for Unix shell-style wildcards, which are not the same as regular expressions (which are documented in the re module). The special characters used in shell-style wildcards are:
Pattern    Meaning
*          matches everything
?          matches any single character
[seq]      matches any character in seq
[!seq]     matches any character not in seq
You could keep using fnmatch and rewrite the pattern in the shell-wildcard style it supports, i.e.:
for text_file in os.listdir(path):
    if fnmatch.fnmatch(text_file, 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'):
        print(text_file)  # or count the files found
Or, as I would suggest, use re.match like so:
regex = re.compile(r'^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$')
for text_file in os.listdir(path):
    if regex.match(text_file):
        print(text_file)
fnmatch translates the shell-style pattern into a regular expression for Python's re module. Take a look at the source code here. Basically, the shortcuts supported are:
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
Your pattern should be: 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'
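To see the translation for yourself, fnmatch.translate() returns the regular expression that a shell-style pattern is converted to. A minimal sketch follows; the exact regex text differs between Python versions, so it is printed rather than assumed.

```python
import fnmatch
import re

pattern = 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'

# translate() converts the shell-style pattern into a regex string.
regex_text = fnmatch.translate(pattern)
print(regex_text)

# The translated regex accepts and rejects the same names as fnmatch itself.
compiled = re.compile(regex_text)
print(bool(compiled.match('file_2010-09-15.txt')))          # True
print(bool(compiled.match('file_2010-9-15.txt')))           # False
print(fnmatch.fnmatchcase('file_2010-09-15.txt', pattern))  # True
```

Note that in a shell-style pattern the dot is a literal character, so it does not need escaping.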
Alternatively, you can use re directly, without fnmatch. Take the code below as a starting point; there is room for improvement, such as checking that the year is valid, the month is between 1 and 12, and the day is between 1 and 28, 29, 30, or 31, depending on the month:
import re
example_file = 'file_2010-09-15.txt'
myregex = r'file_\d\d\d\d-\d\d-\d\d\.txt'
result = re.match(myregex, example_file)
print(result.group(0))
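The validity check mentioned above can be delegated to datetime.strptime, which rejects impossible dates such as month 13 or February 30. This is only a sketch of one way to do it; parse_dated_filename and FILE_RE are hypothetical names chosen for illustration.

```python
import re
from datetime import datetime

FILE_RE = re.compile(r'file_(\d{4}-\d{2}-\d{2})\.txt\Z')

def parse_dated_filename(name):
    """Return the date encoded in name, or None if the name does not fit."""
    m = FILE_RE.match(name)
    if m is None:
        return None
    try:
        # strptime raises ValueError for dates that do not exist.
        return datetime.strptime(m.group(1), '%Y-%m-%d').date()
    except ValueError:
        return None

print(parse_dated_filename('file_2010-09-15.txt'))
print(parse_dated_filename('file_2010-13-40.txt'))
```

The regex handles the naming convention and strptime handles the calendar rules, so neither has to do the other's job.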

Look for files not having a given extension

I've tried the following code.
import re
regobj = re.compile(r"^.+\.(oth|xyz)$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 1:", test)
regobj = re.compile(r"^.+\.[^txt]$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 2:", test)
I would like the second method to find any file that does not have the extension txt, but my attempt doesn't work. What am I doing wrong?
Regular expressions are overkill here. Use the str.endswith() method:
if not test.endswith('.txt'):
Your regular expression uses a negative character class, which is a set of characters that should not be matched. [^txt] matches exactly one character that is neither t nor x, so only a single-character extension can satisfy that test. You could have explicitly matched .txt and used not to exclude rather than include:
regobj = re.compile(r"^.+\.txt$")
if not regobj.match(test):
If all you can use is regular expressions, use a negative look-ahead assertion:
regobj = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")
Here (?!...) only matches locations where there is no literal txt following, all the way to the end of the string. The [^.]+ then matches any number of characters that is not a . character until the end of the string.
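As a quick check of the look-ahead pattern above, here it is run against the question's file names plus one extra name (notes.txtx, added here purely for illustration):

```python
import re

not_txt = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")

names = ["text.txt", "other.oth", "abc.xyz", "notes.txtx"]
matches = [name for name in names if not_txt.match(name)]
print(matches)  # ['other.oth', 'abc.xyz', 'notes.txtx']
```

Because both parts use [^.]+, a name with more than one dot, such as archive.tar.gz, does not match this pattern at all.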
Change the second regex to,
regobj = re.compile(r"^.+\.(?!txt$)[^.]+$")
[^txt] matches any single character other than t or x. (?!txt$) asserts that the dot is not followed by txt at the end of the string. The [^.]+ after \. requires at least one character after the dot. So this matches file names with any extension other than .txt.
As Martijn Pieters mentioned, regex is overkill here, considering there are simpler ways:
fileName, fileExt = os.path.splitext(string)
Using splitext, it's simple to isolate the extension.
import os
fileDict = ["text.txt", "other.oth", "abc.xyz"]
matchExt = ".txt"
for eachFile in fileDict:
    fileName, fileExt = os.path.splitext(eachFile)
    if fileExt != matchExt:
        print("(not %s) %s %s" % (matchExt, fileExt, fileName))
You can easily add an else clause to handle the other case, which I'll leave up to you.
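The same idea can be written with pathlib instead of os.path.splitext. This swaps in a different standard-library module than the answer uses, and README is an extra name added here to show the no-extension case:

```python
from pathlib import PurePath

names = ["text.txt", "other.oth", "abc.xyz", "README"]

# PurePath.suffix is the last extension including the dot, or '' if none.
not_txt = [n for n in names if PurePath(n).suffix != ".txt"]
print(not_txt)  # ['other.oth', 'abc.xyz', 'README']
```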

Python Regex: Matching from end of string (reverse)

I want to match a string with the following criteria:
Match any letters, followed by a '.', followed by letters, followed by end-of-line.
For example, for the string 'www.stackoverflow.com', the regex should return 'stackoverflow.com'. I have the following code that works:
>>> import re
>>> my_string = '''
... 123.domain.com
... 123.456.domain.com
... domain.com
... '''
>>> for i in my_string.split():
...     re.findall(r'[A-Za-z\.]*?([A-Za-z]+\.[a-z]+)$', i)
...
['domain.com']
['domain.com']
['domain.com']
>>>
The code snippet above works perfectly. But I'm sure there must be a more elegant way to achieve the same.
Is it possible to start the regex search/match starting from the end of the string, moving towards the start of the string? How would one code that type of regex? Or should I be using regex at all?
Your regex won't account for domains like domain.co.uk, so I would consider using something a little more robust. If you don't mind adding more dependencies to your script, there's a module named tldextract (pip install tldextract) that makes this pretty simple:
import tldextract
def get_domain(url):
    result = tldextract.extract(url)
    # current tldextract releases call this attribute suffix (formerly tld)
    return result.domain + '.' + result.suffix
I'm not sure from your example if you're just trying to get the last two parts of the domain name, or if you're trying to remove the numbers. If you just want the last parts of the domain, you can do something like:
for i in my_string.split():
    '.'.join(i.split('.')[-2:])
This:
splits each string into a list of words, split where the '.' was originally, then
combines the final two words into a single string, with a '.' separator.
Or, like this:
>>> my_string = ['123.domain.com', '123.456.domain.com', 'domain.com', 'www.stackoverflow.com']
>>> ['.'.join(i.split('.')[-2:]) for i in my_string]
['domain.com', 'domain.com', 'domain.com', 'stackoverflow.com']
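On the question of working from the end of the string: str.rsplit splits from the right, so the same result can be had without a regex. A minimal sketch, where last_two_labels is a hypothetical helper name:

```python
def last_two_labels(hostname):
    """Return the last two dot-separated labels of hostname."""
    # rsplit with maxsplit=2 splits from the right, at most twice.
    return '.'.join(hostname.rsplit('.', 2)[-2:])

for host in ['123.domain.com', '123.456.domain.com', 'domain.com',
             'www.stackoverflow.com']:
    print(last_two_labels(host))
```

Like the split-based versions above, this returns co.uk for domain.co.uk, which is the limitation that a public-suffix library such as tldextract addresses.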

How to remove special characters from txt files using Python

from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print "There are" ,sum(map(countwords, filelist)), "words in the files. " "From directory",pattern
import os
uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
print "There are" ,len(uniquewords), "unique words in the files." "From directory", pattern
This is my code so far. It counts the total number of words and the number of unique words in D:\report\shakeall\*.txt
The problem is that this code treats, for example, code, code. and code! as different words, so it can't give an exact count of unique words.
I'd like to either remove the special characters from the 42 text files using a Windows text editor,
or add an exception rule that solves this problem.
If I go with the latter, how should I write my code?
Should it modify the text files directly, or add an exception so that special characters are not counted?
import re
with open('a.txt') as f:
    string = f.read()
new_str = re.sub(r'[^a-zA-Z0-9\n.]', ' ', string)
with open('b.txt', 'w') as f:
    f.write(new_str)
This replaces every character other than letters, digits, newlines, and periods with a space.
I'm pretty new and I doubt this is very elegant, but one option would be to take your string(s) after reading them in and run them through string.translate() to strip out the punctuation. Here is the Python documentation for it for version 2.7 (which I think you're using).
As for the actual code, it might be something like this (but maybe someone better than me can confirm or improve on it):
fileString.translate(None, string.punctuation)
where "fileString" is the string that your open(fp) read in. "None" is provided in place of a translation table (which would normally be used to change some characters into others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols), is the set of characters that will be deleted from your string.
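In the event that the above doesn't work, you could map each punctuation character to a space instead:
import string
inChars = string.punctuation
# Python 2: maketrans needs two strings of equal length
outChars = ' ' * len(inChars)
translateTable = string.maketrans(inChars, outChars)
fileString.translate(translateTable)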
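The calls above are Python 2 forms. Assuming Python 3 instead, str.maketrans accepts a third argument listing characters to delete, which gives a short way to count unique words with punctuation ignored. This is a sketch only; unique_words and STRIP_PUNCTUATION are names invented for illustration.

```python
import string

# Assumption: Python 3, where the third argument of str.maketrans lists
# the characters to delete.
STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)

def unique_words(text):
    """Return the set of words in text with punctuation removed."""
    words = (w.translate(STRIP_PUNCTUATION) for w in text.split())
    # Drop tokens that were punctuation only and are now empty.
    return {w for w in words if w}

print(unique_words('code code. code!'))  # {'code'}
```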
There are a couple of other answers to similar questions that I found via a quick search. I'll link them here, too, in case you can get more from them.
Removing Punctuation From Python List Items
Remove all special characters, punctuation and spaces from string
Strip Specific Punctuation in Python 2.x
Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try what I've said and become frustrated.
import re
Then replace
[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
with
[uniquewords.add(re.sub('[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root,name)).read().split()]
This will strip all trailing non-alphanumeric characters from each word before adding it to the set.
When working in Linux, some system files under /proc contain characters with ASCII value 0. The following replaces them with spaces in the first line of a file:
full_file_path = 'test.txt'
result = []
with open(full_file_path, encoding='utf-8') as f:
    line = f.readline()
    for c in line:
        if ord(c) == 0:
            result.append(' ')
        else:
            result.append(c)
print(''.join(result))

Regular expression to match start of filename and filename extension

What is the regular expression to match strings (in this case, file names) that start with 'Run' and have a filename extension of '.py'?
The regular expression should match any of the following:
RunFoo.py
RunBar.py
Run42.py
It should not match:
myRunFoo.py
RunBar.py1
Run42.txt
The SQL equivalent of what I am looking for is ... LIKE 'Run%.py' ....
For a regular expression, you would use:
re.match(r'Run.*\.py$', filename)
A quick explanation:
. means match any character.
* means match zero or more repetitions of the preceding element (hence .* means any sequence of characters)
\ escapes the dot that follows, so \. matches a literal dot
$ indicates "end of the string", so we don't match "Run_foo.py.txt"
re.match only matches at the beginning of the string, so no leading ^ is needed.
However, for this task, you're probably better off using simple string methods, i.e.
filename.startswith("Run") and filename.endswith(".py")
Note: if you want case insensitivity (i.e. matching "run.PY" as well as "Run.py"), use the re.I option to the regular expression, or convert to a specific case (e.g. filename.lower()) before using string methods.
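A short check of both approaches against the file names from the question. The variable names here are chosen for illustration only:

```python
import re

run_py = re.compile(r'Run.*\.py$')

should_match = ['RunFoo.py', 'RunBar.py', 'Run42.py']
should_not_match = ['myRunFoo.py', 'RunBar.py1', 'Run42.txt']

for name in should_match + should_not_match:
    by_regex = bool(run_py.match(name))
    by_methods = name.startswith('Run') and name.endswith('.py')
    # Both columns agree for every name in the question.
    print(name, by_regex, by_methods)
```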
I don't really understand why you want a regular expression to solve this 'problem'. You're just after a way to find all .py files that start with 'Run'. So this is a simple solution that will work, without resorting to compiling and running a regular expression:
import os
for filename in os.listdir(dirname):
    root, ext = os.path.splitext(filename)
    if root.startswith('Run') and ext == '.py':
        print(filename)
Warning:
jobscry's answer ("^Run.?.py$") is incorrect (will not match "Run123.py", for example).
orlandu63's answer ("/^Run[\w]*?.py$/") will not match "RunFoo.Bar.py".
(I don't have enough reputation to comment, sorry.)
/^Run.*\.py$/
Or, in Python specifically:
import re
re.match(r"^Run.*\.py$", stringtocheck)
This will match "Runfoobar.py", but not "runfoobar.PY". To make it case insensitive, instead use:
re.match(r"^Run.*\.py$", stringtocheck, re.I)
You don't need a regular expression; you can use glob, which takes wildcards, e.g. Run*.py
For example, to get those files in your current directory...
import os, glob
files = glob.glob(os.path.join(os.getcwd(), "Run*.py"))
If you write a slightly more complex regular expression, you can get an extra feature: extract the bit between "Run" and ".py":
>>> import re
>>> regex = r'^Run(?P<name>.*)\.py$'
>>> m = re.match(regex, 'RunFoo.py')
>>> m.group('name')
'Foo'
(the extra bit is the parentheses and everything between them, except for '.*' which is as in Rob Howard's answer)
This probably doesn't fully comply with file-naming standards, but here it goes:
/^Run[\w]*?\.py$/
maybe:
^Run.*\.py$
just a quick try
