using a regex wildcard within a specific pattern match

my code:
f = open("file.bin", 'rb')
s = f.read()
str1 = ''.join(re.findall( b'\x00\x00\x00\x12\x00\x00\x00(.*?)\x00\x01\x00\x00', s )[0])
I have some binary files from which I want to extract information (strings). The information/strings in this file looks like "[DELIMITER]String1[DELIMITER]STRING2"... The delimiters used in these files are always different but the 00's are always the same so a good workaround would be to tell regex that \x12 and \x01 can be anything.
So what I would need is
str1 = ''.join(re.findall( b'\x00\x00\x00\x[ANYTHING]\x00\x00\x00(.*?)\x00\x[ANYTHING]\x00\x00', s )[0])
How can I do this in regex?

You could try
str1 = re.findall(b'\x00\x00\x00.\x00\x00\x00(.*?)\x00.\x00\x00', s, re.S)[0]
The re.S flag is needed for . to match absolutely any character (or byte in this case), including \n (aka \x0a). It must be passed to re.findall itself.
(Notice that to the regular expression engine, each \xnn is just one character, so you cannot use any operators within such an escape.)
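As a minimal sketch of the approach above, the snippet below runs the wildcard pattern against made-up data. The two sample records and their delimiter bytes are assumptions for illustration, not taken from the asker's files, and the pattern is compiled once so the flag is attached to it.

```python
import re

# Hypothetical sample data: two records whose delimiter bytes differ
# (0x12/0x01 in the first, 0x0a/0x7f in the second).
data = (
    b"\x00\x00\x00\x12\x00\x00\x00first\x00\x01\x00\x00"
    b"\x00\x00\x00\x0a\x00\x00\x00second\nline\x00\x7f\x00\x00"
)

# "." stands in for the single variable byte; re.S lets it match b"\n" (0x0a) too.
pattern = re.compile(rb"\x00\x00\x00.\x00\x00\x00(.*?)\x00.\x00\x00", re.S)

strings = pattern.findall(data)
print(strings)  # [b'first', b'second\nline']
```

Because the input is bytes, each result is a bytes object; call .decode() with the appropriate encoding if text is needed.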

Related

Python regex numbers and underscores

I'm trying to get a list of files from a directory whose file names follow this pattern:
PREFIX_YYYY_MM_DD.dat
For example
FOO_2016_03_23.dat
Can't seem to get the right regex. I've tried the following:
pattern = re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
>>> []
pattern = re.compile(r'*(\d{4})_(\d{2})_(\d{2}).dat')
>>> sre_constants.error: nothing to repeat
Regex is certainly a weak point for me. Can anyone explain where I'm going wrong?
To get the files, I'm doing:
files = [f for f in os.listdir(directory) if pattern.match(f)]
PS, how would I allow for .dat and .DAT (case insensitive file extension)?
Thanks

You have two issues with your expression:
re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
The first one, as a previous comment stated, is that the . right before dat should be escaped by putting a backslash (\) before it. Otherwise, Python will treat it as a special character, because in regex . represents "any character".
Besides that, your expression doesn't handle an uppercase extension. You should make a group for this with dat and DAT as possible choices.
With both changes made, it should look like:
re.compile(r'(\d{4})_(\d{2})_(\d{2})\.(?:dat|DAT)')
As an extra note, I added ?: at the beginning of the group to make it non-capturing, so it does not appear in the match results.

Use pattern.search() instead of pattern.match().
pattern.match() always matches from the start of the string (which includes the PREFIX).
pattern.search() searches anywhere within the string.
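A short sketch of the difference, using the file name from the question. The re.IGNORECASE flag is added here only to cover the .DAT case and is not part of this answer.

```python
import re

pattern = re.compile(r'(\d{4})_(\d{2})_(\d{2})\.dat', re.IGNORECASE)
name = 'FOO_2016_03_23.DAT'

# match() anchors at position 0, where the string has "FOO_" rather than a digit.
print(pattern.match(name))            # None

# search() scans forward until the pattern fits somewhere in the string.
print(pattern.search(name).groups())  # ('2016', '03', '23')
```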

Does this do what you want?
>>> import re
>>> pattern = r'\A[a-z]+_\d{4}_\d{2}_\d{2}\.dat\Z'
>>> string = 'FOO_2016_03_23.dat'
>>> re.search(pattern, string, re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 18), match='FOO_2016_03_23.dat'>
>>>
It appears to match the format of the string you gave as an example.

The following should match what you requested.
[^_]+[_]\d{4}[_]\d{2}[_]\d{2}[\.]\w+
I recommend using https://regex101.com/ (for python regular expressions) or http://regexr.com/ (for javascript regular expressions) in the future if you want to validate your regular expressions.

python re.sub newline multiline dotall

I have a CSV file containing the following lines (note the newline, \n):
"<a>https://google.com</a>",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Dirección
I am trying to delete all those commas and move the address up one row. In Python I am using this:
with open('Reutput.csv') as e, open('Put.csv', 'w') as ee:
    text = e.read()
    text = str(text)
    re.compile('<a/>*D', re.MULTILINE|re.DOTALL)
    replace = re.sub('<a/>*D','<a/>",D',text) # fix commas between fields
    replace = str(replace)
    ee.write(replace)
f.close()
As far as I know, re.MULTILINE and re.DOTALL are needed to handle the \n. I am using re.compile because it is the only way I know to add them, but obviously compiling it is not needed here.
How could I finish with this text?
"<a>https://google.com</a>",Dirección

You don't need the compile statement at all, because you aren't using it. You can put either the compiled pattern or the raw pattern in the re.sub function. You also don't need the MULTILINE flag, which has to do with the interpretation of the ^ and $ metacharacters, which you don't use.
The heart of the problem is that you are compiling the flag into a regular expression pattern, but since you aren't using the compiled pattern in your substitute command, it isn't getting recognized.
One more thing. re.sub returns a string, so replace = str(replace) is unnecessary.
Here's what worked for me:
import re
with open('Reutput.csv') as e:
    text = e.read()
    text = str(text)
s = re.compile('</a>".*D',re.DOTALL)
replace = re.sub(s, '</a>"D',text) # fix commas between fields
print(replace)
If you just call re.sub without compiling, you need to call it like
re.sub('</a>".*D', '</a>"D', text, flags=re.DOTALL)
I don't know exactly what your application is, of course, but if all you want to do is to delete all the commas and newlines, it might be clearer to write
replace = ''.join((c for c in text if c not in ',\n'))

When you use re.compile you need to save the returned Regular Expression object and then call sub on that. You also need to have a .* to match any character instead of matching close html tags. The re.MULTILINE flag is only for the begin and end string symbols (^ and $) so you do not need it in this case.
regex = re.compile('</a>.*D',re.DOTALL)
replace = regex.sub('</a>",D',text)
That should work. You don't need to convert replace to a string since it is already a string.
Alternatively, you can write a regular expression that doesn't use .:
replace = re.sub('"(,|\n)*D','",D',text)
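To compare the two approaches side by side, here is a small sketch on text modeled on the question. The sample string is shortened, the lazy .*? is my substitution so the match stops at the first D, and the variable names are illustrative.

```python
import re

# Sample text modeled on the question: a quoted link, a run of commas,
# a newline, more commas, then the address.
text = '"<a>https://google.com</a>",,,,\n,,Dirección\n'

# Approach 1: with DOTALL, "." also matches the newline; ".*?" stops at the first "D".
with_dotall = re.sub(r'</a>".*?D', '</a>",D', text, flags=re.DOTALL)

# Approach 2: match only commas and newlines, so no flag is needed.
without_dot = re.sub(r'"[,\n]*D', '",D', text)

print(with_dotall)  # "<a>https://google.com</a>",Dirección
```

Both calls give the same result on this input. The second one is stricter, since it refuses to swallow anything other than commas and newlines between the closing quote and the D.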

This worked for me using re.sub with multiline text:
#!/usr/bin/env python3
import re

# Read the whole input file into a single string.
with open("myfile.txt") as infile:
    text = infile.read()

# The group captures the newline and indentation so the replacement can reuse them.
replace = re.sub(r"value1(\n\s +)nickname", r"value\1name", text)

with open("newFile.txt", "w") as outfile:
    outfile.write(replace)

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file.
But it also writes bullets, so when I read the file to index each word, some escape sequences (like '\x01') get indexed too. I know it's because of the bullets (•).
I want only the text, so is there any way to remove these escape sequences? I have tried something like this:
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this does not remove the escape sequences.
Thanks in advance.

The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
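The distinction between the escape notation and the character it denotes can be checked directly. In this sketch the sample line is hypothetical, standing in for what a tool might emit.

```python
import re

control = '\x0c'   # one character: the form feed itself
literal = r'\x0c'  # four characters: backslash, x, 0, c

print(len(control), len(literal))  # 1 4

# Hypothetical sample line: control characters around plain text.
text = '\x01 First item\x0c'
cleaned = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
print(repr(cleaned))  # ' First item'
```

The character class works on the actual control characters in the string, which is why a pattern looking for a literal backslash followed by x finds nothing in such text.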

Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
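A quick check that the two spellings produce the same pattern. Here [0-9a-f] is used as an equivalent shorthand for the longer character class above.

```python
import re

raw_pattern = r'\\x[0-9a-f]+'     # raw string: the regex compiler sees \\x...
plain_pattern = '\\\\x[0-9a-f]+'  # normal string: four backslashes collapse to two

print(raw_pattern == plain_pattern)  # True

s = r'+\x01+'  # a literal backslash, x, 0, 1 between two plus signs
result = re.sub(raw_pattern, ' ', s)
print(result)  # + +
```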
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly two or exactly four. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-capturing group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern).
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by @nneonneo instead of this one.

If you're working with 8-bit character values, it's possible to forgo regexes by building a simple table beforehand and then using it in conjunction with the str.translate() method to remove unwanted characters from strings very quickly and easily:
import random
import string

allords = list(range(256))
printableords = {ord(ch) for ch in string.printable}
# str.translate() deletes every code point that is mapped to None
deletetable = {i: None for i in allords if i not in printableords}
test = ''.join(chr(random.choice(allords)) for _ in range(30))  # random string
print(test.translate(deletetable))

Not enough reputation to comment, but the accepted answer also removes printable non-ASCII characters.
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'

C Preprocessing with Python Regular Expressions

I've never used regular expressions before and I'm struggling to make sense of them. I have strings in the form of 'define(__arch64__)' and I just want the __arch64__.
import re
mystring = 'define(this_symbol)||define(that_symbol)'
pattern = 'define\(([a-zA-Z_]\w*)\)'
re.search(mystring, pattern).groups()
(None, None)
Why doesn't search return 'this_symbol' and 'that_symbol'?

You have the parameters of search() in the wrong order, it should be:
re.search(pattern, mystring)
Also, backslashes are escape characters in Python strings (for example "\n" will be a string containing a newline). If you want literal backslashes, like in the regular expression, you have to escape them with another backslash. Alternatively you can use raw strings, which are marked by an r in front of them and don't treat backslashes as escape characters:
pattern = r'define\(([a-zA-Z_]\w*)\)'

You must differentiate between the symbol ( and the regexp group characters. Also, the pattern goes first in re.search:
pattern = 'define\\(([a-zA-Z_]\w*)\\)'
re.search(pattern, mystring).groups()
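Note that re.search stops at the first match, so it only yields this_symbol. A sketch using re.findall, which is one way to collect both symbols from the question's string:

```python
import re

mystring = 'define(this_symbol)||define(that_symbol)'
pattern = r'define\(([a-zA-Z_]\w*)\)'

# search() returns only the first match.
first = re.search(pattern, mystring).group(1)
print(first)  # this_symbol

# findall() returns the captured group from every non-overlapping match.
symbols = re.findall(pattern, mystring)
print(symbols)  # ['this_symbol', 'that_symbol']
```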

Python regex findall numbers and dots

I'm using re.findall() to extract some version numbers from an HTML file:
>>> import re
>>> text = "<table><td>Test0.2.1.zip</td><td>Test0.2.1</td></table> Test0.2.1"
>>> re.findall("Test([\.0-9]*)", text)
['0.2.1.', '0.2.1', '0.2.1']
but I would like to only get the ones that do not end in a dot.
The filename might not always be .zip so I can't just stick .zip in the regex.
I wanna end up with:
['0.2.1', '0.2.1']
Can anyone suggest a better regex to use? :)

re.findall(r"Test([0-9.]*[0-9]+)", text)
or, a bit shorter:
re.findall(r"Test([\d.]*\d+)", text)
By the way - you do not need to escape the dot in a character class. Inside [] the . has no special meaning, it just matches a literal dot. Escaping it has no effect.
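A small check of both points, using the text from the question. Since findall returns every occurrence, the list has three entries here, none of them ending in a dot.

```python
import re

text = "<table><td>Test0.2.1.zip</td><td>Test0.2.1</td></table> Test0.2.1"

# The escaped and unescaped dot behave identically inside a character class.
escaped = re.findall(r"Test([\.\d]*\d+)", text)
unescaped = re.findall(r"Test([\d.]*\d+)", text)

print(unescaped)             # ['0.2.1', '0.2.1', '0.2.1']
print(escaped == unescaped)  # True
```

The trailing \d+ forces each match to end on a digit, so the dot before zip is given back through backtracking.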
