stripping away code in python using "re.sub" - python

I read this:
Stripping everything but alphanumeric chars from a string in Python
and this:
Python: Strip everything but spaces and alphanumeric
I didn't quite understand them, but I tried a bit on my own code, which now looks like this:
import re
decrypt = str(open("crypt.txt"))
crypt = re.sub(r'([^\s\w]|_)+', '', decrypt)
print(crypt)
When I run the script, it comes back with this answer:
C:\Users\Adrian\Desktop\python>python tick.py
ioTextIOWrapper namecrypttxt moder encodingcp1252
I am trying to strip all the extra characters from the document and keep just the numbers and letters. The document contains the following text: http://pastebin.com/Hj3SjhxC
I am trying to solve the assignment here:
http://www.pythonchallenge.com/pc/def/ocr.html
Does anyone know what "ioTextIOWrapper namecrypttxt moder encodingcp1252" means?
And how should I write the code to properly strip everything except letters and numbers?
Sincerely

str(open("file.txt")) doesn't do what you think it does. open() returns a file object. str gives you the string representation of that file object, not the contents of the file. If you want to read the contents of the file use open("file.txt").read().
Or, more safely, use a with statement:
with open("file.txt") as f:
    decrypt = f.read()
crypt = ...
# etc.
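Putting the two fixes together, a minimal sketch might look like the following. It assumes Python 3 and that only ASCII letters and digits should survive; the helper names keep_alphanumerics and read_and_strip are made up for this example.

```python
import re


def keep_alphanumerics(text):
    """Remove every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]+", "", text)


def read_and_strip(path):
    """Read the file's contents (not the file object) and strip them."""
    with open(path) as f:
        return keep_alphanumerics(f.read())


# str() of a file object only describes the object, e.g.
# "<_io.TextIOWrapper name='crypt.txt' mode='r' encoding='cp1252'>",
# which is where the odd output in the question comes from.
print(keep_alphanumerics("%%a_1$#\n b2!"))  # prints a1b2
```

Calling read_and_strip("crypt.txt") would then return the cleaned contents of the file rather than a cleaned description of the file object.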

You could just search for the alphanumeric chars instead. Like this:
print(''.join(re.findall('[A-Za-z0-9]', decrypt)))
And you also want:
decrypt = open("crypt.txt").read()

Related

Why does python add additional backslashes to the path?

I have a text file with a path that goes like this:
r"\\user\data\t83\rf\Desktop\QA"
When I try to read this file and print a line, it returns the following string, and I'm unable to open the file from this location:
'r"\\\\user\\data\\t83\\rf\\Desktop\\QA"\n'
It seems you've got Python code in your text file, so either sanitize the file so that it only includes the actual path (not a Python string representation), fiddle with string replacement until you're satisfied, or just evaluate the Python string.
Note that using eval() opens Pandora's box (it is as unsafe as it gets); it's safer to use ast.literal_eval() instead.
import ast
file_content = 'r"\\\\user\\data\\t83\\rf\\Desktop\\QA"\n'
print(eval(file_content)) # do not use this, it's only shown for the sake of completeness
print(ast.literal_eval(file_content))
Output:
\\user\data\t83\rf\Desktop\QA
\\user\data\t83\rf\Desktop\QA
Personally, I'd prefer to sanitize the file, so it only contains \\user\data\t83\rf\Desktop\QA
In a Python string literal, a backslash starts an escape sequence: it combines with the next character to form one like \n (new line) or \t (tab). A literal backslash is therefore written as \\, which is why the representation of your string shows every backslash doubled.
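To check that the extra backslashes exist only in the displayed representation, a small sketch like this may help. The variable file_line is a stand-in for the line read from the text file.

```python
import ast

# The line exactly as it is stored in the text file, plus the trailing newline.
file_line = r'r"\\user\data\t83\rf\Desktop\QA"' + "\n"

print(repr(file_line))  # the representation doubles every backslash
print(file_line)        # the actual characters: nothing was added

# literal_eval interprets the text as a Python string literal,
# here a raw string, so the backslashes are taken literally.
path = ast.literal_eval(file_line.strip())
print(path)
```

The string itself holds 7 backslashes; only repr() shows them doubled.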

replacing string from '\' into '/' Python

I've been struggling with some code where I need to change a simple \ into / in Python. It's a file path. Python doesn't read file paths the Windows way, so I simply want to convert the Windows path so that Python reads the file correctly.
I want to parse some text from a game to count statistics. I'm doing it this way:
import re
pathNumbers = "D:\Gry\Tibia\packages\TibiaExternal\log\test server.txt"
pathNumbers = re.sub(r"\\", r"/",pathNumbers)
fileNumbers = open (pathNumbers, "r")
print(fileNumbers.readline())
fileNumbers.close()
But the error I get back is:
----> 6 fileNumbers = open (pathNumbers, "r")
OSError: [Errno 22] Invalid argument: 'D:/Gry/Tibia/packages/TibiaExternal\test server.txt'
The problem is that both re.sub() and .replace() give the same result: almost the whole path is converted, but the last backslash always stays untouched.
Do you have a solution for this? It seems that changing those characters is a sensitive point for Python.
Simple answer:
If you want to use paths on different platforms, join them with
os.path.join(path,*paths)
This way you don't have to work with the different separators at all.
Answer to what you intended to do:
The actual problem is that your pathNumbers variable is not a raw string (no leading r in the definition), meaning that the backslashes are treated as escape characters. In most cases this does not change anything, because the combination with the following character has no special meaning. But \t is the tab character and \n would be the newline character, so in \test the backslash is no longer a simple backslash, and replacing backslashes cannot find it.
So simply write
pathNumbers = r"D:\Gry\Tibia\packages\TibiaExternal\log\test server.txt"
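As a rough illustration of the difference, here is a sketch using a shortened, made-up path (not the full one from the question). The pathlib call is an alternative that the answer above does not mention, shown only as one possible way to get forward slashes.

```python
from pathlib import PureWindowsPath

plain = "log\test server.txt"  # \t is read as a tab character
raw = r"log\test server.txt"   # raw string: the backslash stays a backslash

print("\t" in plain)  # True
print("\\" in plain)  # False, so there is no backslash left to replace

print(raw.replace("\\", "/"))           # log/test server.txt
print(PureWindowsPath(raw).as_posix())  # log/test server.txt
```

With the raw string, a plain str.replace is enough and re.sub is not needed.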

Python TypeError: expected a string or other character buffer object when importing text file

I am pretty new to Python. For this task, I am trying to import a text file, add <s> and </s> to it, and remove punctuation from the text. I tried the method from "How to strip punctuation from a text file".
import string
def readFile():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    with open('out_file.txt', 'w') as out_file:
        with open('moviereview.txt') as file:
            for line in file:
                line = ' '.join(line.split(' '))
                line = line.translate(translate_table)
                out_file.write("<s>" + line.rstrip('\n') + "</s>" + '\n')
    return out_file
However, I get an error saying:
TypeError: expected a string or other character buffer object
My thought is that after I split and join the line, I get a list of strings, so I cannot use str.translate() to process it. But it seems like everyone else does the same thing and it works,
ex. https://appliedmachinelearning.blog/2017/04/30/language-identification-from-texts-using-bi-gram-model-pythonnltk/ in example code from line 13.
So I am really confused, can anyone help? Thanks!
On Python 2, only unicode types have a translate method that takes a dict. If you intend to work with arbitrary text, the simplest solution here is to just use the Python 3 version of open on Py2; it will seamlessly decode your inputs and produce unicode instead of str.
As of Python 2.6+, replacing the normal built-in open with the Python 3 version is simple. Just add:
from io import open
to the imports at the top of your file. You can also remove line = ' '.join(line.split(' ')); that's definitionally a no-op (it splits on single spaces to make a list, then rejoins on single spaces). You may also want to add:
from __future__ import unicode_literals
to the very top of your file (before all of your code); that will make all of your uses of plain quotes automatically unicode literals, not str literals (prefix actual binary data with b to make it a str literal on Py2, bytes literal on Py3).
The above solution is best if you can swing it, because it will make your code work correctly on both Python 2 and Python 3. If you can't do it for whatever reason, then you need to change your translate call to use the API Python 2's str.translate expects, which means removing the definition of translate_table entirely (it's not needed) and just doing:
line = line.translate(None, string.punctuation)
For Python 2's str.translate, the arguments are a one-to-one mapping table for all values from 0 to 255 inclusive as the first argument (None if no mapping needed), and the second argument is a string of characters to delete (which string.punctuation already provides).
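For comparison, on Python 3 the same deletion can be expressed with str.maketrans. This is only a sketch under the assumption that the script runs on Python 3; the function name tag_line is made up for illustration.

```python
import string

# Three-argument form: every character in the third argument is mapped to None,
# which makes str.translate delete it.
DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def tag_line(line):
    """Strip punctuation from one line and wrap it in <s>...</s>."""
    cleaned = line.rstrip("\n").translate(DELETE_PUNCTUATION)
    return "<s>" + cleaned + "</s>"


print(tag_line("Hello, world!\n"))  # <s>Hello world</s>
```

The tags are added after translating, so the angle brackets in <s> and </s> are not removed along with the punctuation.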
Answering here because a comment doesn't let me format code properly:
import string

def r():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    a = []
    with open('out.txt', 'w') as of:
        with open('test.txt', 'r') as f:
            for l in f:
                l = l.translate(translate_table)
                a.append(l)
                of.write(l)
    return a
This code runs fine for me with no errors. Can you try running that, and responding with a screenshot of the code you ran?

python 3 regex not finding confirmed matches

So I'm trying to parse a bunch of citations from a text file using the re module in python 3.4 (on, if it matters, a mac running mavericks). Here's some minimal code. Note that there are two commented lines: they represent two alternative searches. (Obviously, the little one, r'Rawls', is the one that works)
def makeRefList(reffile):
    print(reffile)
    # namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'Rawls'
    refsTuplesList = re.findall(namepattern, reffile, re.MULTILINE)
    print(refsTuplesList)
The string in question is ugly, and so I stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae
As noted, the search string r'Rawls' produces expected output ['Rawls', 'Rawls']. However, the other search string just produces an empty list.
I've confirmed this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kP4nO0/1 -- this matches what I expect it to match. Since it works in the tester, it should work in the code, right?
(n.b. I copied the text from terminal output from the first print command, then manually replaced \n characters in the string with carriage returns for regex101.)
One possible issue is that python has appended the bytecode flag (is the little b called a "flag?") to the string. This is an artifact of my attempt to convert the text from utf-8 to ascii, and I haven't figured out how to make it go away.
Yet re clearly is able to parse strings in that form. I know this because I'm converting two text files from utf-8 to ascii, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:
def makeCiteList(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawCitelist = re.findall(citepattern, citefile)
    cleanCitelist = cleanup(rawCitelist)
    finalCiteList = list(set(cleanCitelist))
    print(finalCiteList)
    return(finalCiteList)
The other chunk of text, which the code immediately above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee
The only hypothesis I can come up with is that the first, broken, regex expression is puking on the combination of newline characters and the string being treated as a byte object, even though a) I know the regex is correct for newlines (because, confirmation from the linked regex101), and b) I know it's matching the strings (because, confirmation from the successful match on the other string).
If that's true, though, I don't know what to do about it.
Thus, questions:
1) Is my hypothesis right that it's the combination of newlines and b that blows up my regex? If not, what is?
2) How do I fix that?
a) replace the newlines with something in the string?
b) rewrite the regex somehow?
c) somehow get rid of that b and make it into a normal string again? (how?)
thanks!
Addition
In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert to ascii, replacing non-ascii characters:
this function gets called on utf-8 .txt files saved by textwrangler in mavericks
def makeCorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi
and then this function gets called on each element of the list the above function returns.
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring
Aah. I've tracked it down and answered my own question. Apparently one needs to call decode on the encoded bytes object. The following code produces an actual string, with newlines and everything, out the other end (though now I have to fix a bunch of other bugs before I can figure out if the final output is as expected):
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    newstring = bigstring.decode('ascii', 'foreign')
    return newstring
Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite an answer to "How to make new line commands work in a .txt file opened from the internet?", which suggests that it does.
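A likely explanation is that in Python 3, str() applied to a bytes object returns its printable representation (the b'...' form, with each newline shown as the two characters \ and n) instead of decoding it. The sketch below illustrates this with two made-up citation lines and a simplified stand-in for the pattern in the question.

```python
import re

# Made-up sample text; not taken from the gists linked in the question.
encoded = "Rawls, J. (1971).\nSmith, A. (2009).".encode("ascii")

as_repr = str(encoded)             # "b'Rawls, J. (1971).\\nSmith, A. (2009).'"
as_text = encoded.decode("ascii")  # real text with a real newline

# Simplified pattern: a capitalised name at the start of a line, then a year.
pattern = r"^([A-Z][A-Za-z]*),.*(\(\d{4}\))"

print(re.findall(pattern, as_repr, re.MULTILINE))  # []
print(re.findall(pattern, as_text, re.MULTILINE))
# [('Rawls', '(1971)'), ('Smith', '(2009)')]
```

In as_repr the only line start is the leading b, so the ^ anchor never sits in front of a name, while a pattern without ^ (like r'Rawls' or the citation pattern) can still match.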

How do I get a hex signature like in the clamav database from a file with python

How do I get a hex signature from a file, like the ones in these ClamAV database entries:
Exploit.HTML.ObjectType:3:*:3c6f626a65637420747970653d222f2f2f2f2f2f2f2f2f2f2f2f
HTML.Phishing.Bank-1:3:*:3c6d6170206e616d653d22{-36}223e3c6172656120636f6f7264733d22302c20302c20{4-12}222073686170653d22726563742220687265663d22{-160}3c2f6d61703e3c696d67207372633d226369643a
Exploit.HTML.MHTRedir.1n:3:*:6d732d6974733a6d68746d6c3a66696c653a2f2f633a5c*21687474703a2f2f
Exploit.HTML.MHTRedir.2n:3:*:646174613d226d732d6974733a6d68746d6c3a66696c653a2f2f(63|64)3a5c
Exploit.HTML.MHTRedir.3n:3:*:7372633d226d732d6974733a6d68746d6c3a66696c653a2f2f633a5c
Exploit.HTML.DragDrop:3:*:6265686176696f723a75726c282364656661756c7423616e63686f72636c69636b293b*666f6c6465723d227368656c6c3a
HTML.Phishing.Bank-4:3:*:7468697320656d61696c20697320666f72206e6f74696669636174696f6e206f6e6c792e20746f20636f6e746163742075732c20706c65617365206c6f6720696e746f20796f7572206163636f756e7420616e642073656e6420612062616e6b206d61696c2e203c2f7072653e
W32.MyLife.E:1:*:7a6172793230*40656d61696c2e636f6d
I know the signature starts after the 3rd ':'.
I'm trying to make a simple virus scanner in Python with the ClamAV database, but I can't get a signature like the ones in the database...
I already tried binascii.hexlify(file.read()), but it gives one long hex string.
One way to do it is this. It can be written shorter, but I think this is more clear.
signatures = {}
with open('clamavfile', 'r') as f:
    for l in f:
        data = l.strip().split(':')  # strip whitespace from the line, and split by :
        signatures[data[0]] = data[-1]  # the signature name is the first item, the hex signature is the last item
You could also use the re module if simply splitting by : is not enough.
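Once the signatures are parsed, one way to check a file against them is to convert the hex signature back to bytes and look for those bytes in the file's contents. The following is only a sketch: it assumes four colon-separated fields as in the lines above, it handles only plain hex signatures (no wildcards such as *, {4-12} or (63|64)), and all function names are made up for illustration.

```python
def parse_signature_line(line):
    """Split a 'name:type:offset:hexsignature' line into (name, hex signature)."""
    name, _target_type, _offset, hex_sig = line.strip().split(":", 3)
    return name, hex_sig


def is_plain_hex(hex_sig):
    """True if the signature consists only of complete hex byte pairs."""
    hex_digits = "0123456789abcdefABCDEF"
    return len(hex_sig) % 2 == 0 and all(c in hex_digits for c in hex_sig)


def data_matches(data, hex_sig):
    """Check whether the signature's bytes occur anywhere in the given bytes."""
    return bytes.fromhex(hex_sig) in data


line = "Exploit.HTML.MHTRedir.3n:3:*:7372633d226d732d6974733a6d68746d6c3a66696c653a2f2f633a5c"
name, hex_sig = parse_signature_line(line)

# Made-up file contents that embed the signature's bytes, for demonstration only.
sample = b"<img " + bytes.fromhex(hex_sig) + b"evil.mht>"

if is_plain_hex(hex_sig):
    print(name, data_matches(sample, hex_sig))
```

For a real file, the bytes would come from open(path, 'rb').read(). Comparing bytes avoids accidental matches that start in the middle of a byte, which can happen when searching inside the long string returned by binascii.hexlify.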
