I'm trying to implement the Vigenère cipher. I want to be able to obfuscate every single character in a file, not just alphabetic characters.
I think I'm missing something about the different text encodings involved. I've made some test cases, and some characters come out wrong in the final result.
This is one test case:
,.-´`1234678abcde^*{}"¿?!"·$%&/\º
end
And this is the result I'm getting:
).-4`1234678abcde^*{}"??!"7$%&/:
end
As you can see, ',' comes back as ')', and some other characters are wrong as well.
My guess is that the others (for example, '¿' turning into '?') fail because the original character is outside the range [0, 127], so it's expected that those change. But I don't understand why ',' is failing.
My intent is to obfuscate CSV files, so the ',' problem is the one I'm mainly concerned about.
In the code below I'm using modulus 128, but I'm not sure that's correct. To run it, put a file named "OriginalFile.txt" containing the content to cipher in the same folder and execute the script. Two files will be generated: Ciphered.txt and Deciphered.txt.
"""
Attempt to implement Vigenere cipher in Python.
"""
import os
key = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
fileOriginal = "OriginalFile.txt"
fileCiphered = "Ciphered.txt"
fileDeciphered = "Deciphered.txt"
# CIPHER PHASE
if os.path.isfile(fileCiphered):
    os.remove(fileCiphered)
keyToUse = 0
with open(fileOriginal, "r") as original:
    with open(fileCiphered, "a") as ciphered:
        while True:
            c = original.read(1)  # read one character
            if not c:
                break
            k = key[keyToUse]
            protected = chr((ord(c) + ord(k)) % 128)
            ciphered.write(protected)
            keyToUse = (keyToUse + 1) % len(key)
print("Cipher successful")
# DECIPHER PHASE
if os.path.isfile(fileDeciphered):
    os.remove(fileDeciphered)
keyToUse = 0
with open(fileCiphered, "r") as ciphered:
    with open(fileDeciphered, "a") as deciphered:
        while True:
            c = ciphered.read(1)  # read one character
            if not c:
                break
            k = key[keyToUse]
            unprotected = chr((128 + ord(c) - ord(k)) % 128)  # +128 so that we don't get into negative numbers
            deciphered.write(unprotected)
            keyToUse = (keyToUse + 1) % len(key)
print("Decipher successful")
Assumption: you're trying to produce a new, valid CSV with the contents of each cell enciphered via Vigenère, not to encipher the whole file.
In that case, you should check out the csv module, which handles reading and writing CSV files properly for you (including cells that contain commas in the value, which, as you've seen, can happen after you encipher a cell's contents). Very briefly, you can do something like:
import csv

with open("...", "r") as fpin, open("...", "w") as fpout:
    reader = csv.reader(fpin)
    writer = csv.writer(fpout)
    for row in reader:
        # row will be a list of strings, one per column in the row
        ciphered = [encipher(cell) for cell in row]
        writer.writerow(ciphered)
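Note that encipher above is a placeholder for whatever per-cell transformation you choose; it isn't defined in this answer. A minimal sketch (the key and the printable-range arithmetic are my assumptions, not part of the question) could be:
def encipher(text, key="secret"):
    # Hypothetical helper: Vigenere-shift each character within printable
    # ASCII (32-126) so the ciphertext contains no control characters and
    # survives a text-mode round trip.
    out = []
    for i, ch in enumerate(text):
        shift = ord(key[i % len(key)])
        out.append(chr((ord(ch) - 32 + shift) % 95 + 32))
    return "".join(out)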
When using the csv module you should be aware of the notion of "dialects" -- ways that different programs (usually spreadsheet-like things, think Excel) handle CSV data. csv.reader() usually does a fine job of inferring the dialect you have in the input file, but you might need to tell csv.writer() what dialect you want for the output file. You can get the list of built-in dialects with csv.list_dialects() or you can make your own by creating a custom Dialect object.
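For instance (a short illustration; the dialect name and parameters are made up):
import csv

print(csv.list_dialects())  # built-in dialects, e.g. ['excel', 'excel-tab']

# Register a hypothetical custom dialect and write with it
csv.register_dialect("semicolon", delimiter=";", quotechar='"',
                     quoting=csv.QUOTE_MINIMAL)
with open("output.csv", "w") as fpout:
    writer = csv.writer(fpout, dialect="semicolon")
    writer.writerow(["a", "b;c"])  # the second cell gets quoted: a;"b;c"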
Related
Well, I'm learning Python, and I'm working on a project that consists of extracting some numbers from PDF files into an xlsx file, placing them in their corresponding columns, with rows determined by the row heading.
The idea I came up with is to convert the PDF files to txt and build a dictionary from the txt files, whose keys are part of the file names (because each contains part of the row header) and whose values are the numbers I need.
I have already managed to convert the files to txt; now I'm working on the script that builds the dictionary. At the moment it looks like this:
import os
import re

p = re.compile(r'\w+\f+')
'''
I'm not entirely sure at the moment how re.compile works, but I know I'm
missing something to indicate that what I want is immediately to the right.
I'm also not sure whether the keywords will be ignored; I just want to pull
out the numbers.
'''
m = p.match('Theese are the keywords' or 'That are immediately to the left' or 'The numbers I want')

def IsinDict(txtDir):
    ToData = ()
    if txtDir == "":
        txtDir = os.getcwd() + "\\"
    for txt in os.listdir(txtDir):
        ToKey = txt[9:21]
        if ToKey == (r"\w+"):
            Data = open(txt, "r")
            for string in Data:
                ToData += m.group()
    Diccionary = dict.fromkeys(ToKey, ToData)
    return Diccionary

txtDir = "Absolute/Path/OfTheText/Files"
IsinDict(txtDir)
Any contribution is welcome, thanks for your attention.
I am cataloging attribute fields for each feature class in the input list below, then writing the output to a spreadsheet recording the occurrence of each attribute in one or more of the feature classes.
import arcpy, collections, re

arcpy.env.overwriteOutput = True

input = [list of feature classes]
outfile = # path to csv file
f = open(outfile, 'w')
f.write('ATTRIBUTE,FEATURE CLASS\n\n')
mydict = collections.defaultdict(list)
for fc in input:
    cmp = []
    lstflds = arcpy.ListFields(fc)
    for fld in lstflds:
        cmp.append(fld.name)
    for item in cmp:
        mydict[item].append(fc)
for keys, vals in mydict.items():
    # remove these characters
    char_removal = ["[", "'", ",", "]"]
    new_char = '[' + re.escape(''.join(char_removal)) + ']'
    v = re.sub(new_char, '', str(vals))
    line = ','.join([keys, v]) + '\n'
    print line
    f.write(line)
f.close()
This code gets me 90% of the way to the intended solution. I still cannot get the feature classes (values) separated by commas within the same cell (being comma-delimited, each value shifts over into the next column, as I mentioned). In this particular code, the "v" values (feature class names) are written to the spreadsheet separated by a space (" ") in the same cell. Not a huge deal, because replacing " " with "," can be done quickly in the spreadsheet itself, but it would be nice to work this into the code to improve reusability.
For a CSV file, use double-quotes around the cell content to preserve interior commas within, like this:
content1,content2,"content3,contains,commas",content4
Generally speaking, many libraries that output CSV just put all contents in quotes, like this:
"content1","content2","content3,contains,commas","content4"
As a side note, I'd strongly recommend using an existing library to create CSV files instead of reinventing the wheel. One such library, the csv module, is built into the Python standard library.
As they say, "Good coders write. Great coders reuse."
import arcpy, collections, re, csv

arcpy.env.overwriteOutput = True

input = [# list of feature classes]
outfile = # path to output csv file
f = open(outfile, 'wb')
csv_write = csv.writer(f)
csv_write.writerow(['Field', 'Feature Class'])
csv_write.writerow('')
mydict = collections.defaultdict(list)
for fc in input:
    cmp = []
    lstflds = arcpy.ListFields(fc)
    for fld in lstflds:
        cmp.append(fld.name)
    for item in cmp:
        mydict[item].append(fc)
for keys, vals in mydict.items():
    # remove these characters
    char_removal = ["[", "'", "]"]
    new_char = '[' + re.escape(''.join(char_removal)) + ']'
    v = re.sub(new_char, '', str(vals))
    csv_write.writerow([keys, v])
f.close()
I have created a dictionary whose keys are UTF-8-encoded words:
import os.path
import codecs
import pickle
from collections import Counter

wordDict = {}

def pathFilesList():
    source = 'StemmedDataset'
    retList = []
    for r, d, f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList
# Parses the corpus: counts the frequency of each word together with the
# date of the data (the date is the file name), then stores each word as a
# dictionary key whose value is a list of (date, freq) tuples.
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        # One word per line, strip space. No empty lines.
        fw = codecs.open(f, mode='r', encoding='utf-8')
        fileWords = Counter(w for w in fw.read().split())
        # For each unique word, count occurrences and store in dict.
        for stemWord, stemFreq in fileWords.items():
            if stemWord not in wordDict:
                wordDict[stemWord] = [(date_stamp, stemFreq)]
            else:
                wordDict[stemWord].append((date_stamp, stemFreq))
        # Close file and move on to the next.
        fw.close()

if __name__ == "__main__":
    # Parse all files and store the result in wordDict.
    parsing()
    output = open('data.pkl', 'wb')
    print "Dumping wordDict of size {0}".format(len(wordDict))
    pickle.dump(wordDict, output)
    output.close()
When I unpickle the pickled data and query the dictionary, I can't look up alphabetical words, even words I'm sure are in the dictionary; it always returns False. Numeric queries work fine. Here is how I unpickle the data and query it:
import codecs
import pickle
import pprint

pkl_file = codecs.open('data.pkl', 'rb')
wd = pickle.load(pkl_file)
pprint.pprint(wd)  # to make sure wd is correct and has been created
print type(wd)  # making sure of the type of the data structure
pkl_file.close()

# tried lots of other ways to query, like wd.has_key('some_encoded_word')
value = None
inputList = ['اندیمشک', '16', 'درحوزه']
for i in inputList:
    if i in wd:
        value = wd[i]
        print value
    else:
        print 'False'
Here is my output:
pa#pa:~/Desktop$ python unpickle.py
False
[('2000-05-07', 5), ('2000-07-05', 2)]
False
So I'm quite sure there's something wrong with the encoded words.
Your problem is that you're using codecs.open. That function is specifically for opening a text file and automatically decoding the data to Unicode. (As a side note, you usually want io.open, not codecs.open, even for that case.)
To open a binary file to be passed as bytes to something like pickle.load, just use the builtin open function, not codecs.open.
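In other words, a minimal sketch of the fix:
import pickle

# Binary data: use the builtin open in 'rb' mode, with no codec involved
with open('data.pkl', 'rb') as pkl_file:
    wd = pickle.load(pkl_file)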
Meanwhile, it usually makes things simpler to use Unicode strings throughout your program, and only use byte strings at the edges, decoding input as soon as you get it and encoding output as late as possible.
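For example (a sketch, assuming UTF-8 files like the ones in your parser):
import io

# Decode at the boundary: io.open with an encoding yields unicode strings
with io.open('words.txt', 'r', encoding='utf-8') as f:
    words = f.read().split()  # work with unicode internally

# Encode as late as possible, i.e. only when writing back out
with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(words))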
Also, you have literal Unicode characters in non-Unicode string literals. Never, ever, ever do this. That's the surest way to create invisible mojibake strings. Always use Unicode literals (like u'abc' instead of 'abc') when you have non-ASCII characters.
Plus, if you use non-ASCII characters in the source, always use a coding declaration (and of course make sure your editor is actually saving the file in the same encoding you put in the coding declaration).
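Applied to the lookup list from your code, that means something like:
# -*- coding: utf-8 -*-
# The coding declaration must match the file's actual encoding, and the
# u prefix makes these unicode literals rather than mojibake-prone bytes.
inputList = [u'اندیمشک', u'16', u'درحوزه']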
Also, keep in mind that == (and dict lookup) may not do what you want for Unicode. If you have a word stored in NFC, and look up the same word in NFD, the characters won't be identical even though they represent the same string. Without knowing the exact strings you're using and how they're represented it's hard to know if this is a problem in your code, but it's pretty common with, e.g., Mac programs that get strings out of filenames or Cocoa GUI apps.
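unicodedata.normalize makes the issue easy to see, and to fix (a small illustration):
import unicodedata

nfc = u'caf\xe9'     # e-acute as one precomposed code point (NFC)
nfd = u'cafe\u0301'  # 'e' followed by a combining acute accent (NFD)
print(nfc == nfd)                                # False
print(unicodedata.normalize('NFC', nfd) == nfc)  # True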
I have an abstract which I've split into sentences in Python. I want to write to two tables. One has the following columns: abstract ID (the file number I extracted from my document), sentence ID (automatically generated), and each sentence of the abstract on its own row.
I want a table that looks like this:
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by (1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on its own row) to the table and assign sentence IDs as shown above?
This is my code:
import glob
import re
import json

org = "NSF Org"
fileNo = "File"
AbstractString = "Abstract"
abstractFlag = False
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt'
files = glob.glob(path)
for name in files:
    fileA = open(name, 'r')
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content) - 1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split()
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:].strip()
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments
    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', ' ').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify the table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
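Something in this direction, for instance (a sketch only; the label patterns are guesses, since we can't see the file):
import re

with open(filename, 'r') as fh:
    header = fh.read()

# Hypothetical patterns: capture the value to the right of each label
file_match = re.search(r'File\s*:\s*(\S+)', header)
org_match = re.search(r'NSF Org\s*:\s*(\S+)', header)
if file_match:
    absID = file_match.group(1)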
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, collecting cruft that spreads state around. To understand what line N does, I have to glance ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. That creates a lot of action at a distance, which, for the reasons cited, is best avoided. In the condensed version, you can see how I substituted string literals in places where they are used only once and called print immediately instead of storing results for later. The result is usually more concise and more easily understood.
This is the Python script:
f = open('csvdata.csv', 'rb')
fo = open('out6.csv', 'wb')
for line in f:
    bits = line.split(',')
    bits[1] = '"input"'
    fo.write(','.join(bits))
f.close()
fo.close()
I have a CSV file and I'm replacing the content of the 2nd column with the string "input". However, I need to grab some information from that column content first.
The content might look like this:
failurelog_wl","inputfile/source/XXXXXXXX"; "**X_CORD2**"; "Invoice_2M";
"**Y_CORD42**"; "SIZE_ID37""
It has weird type of data as you can see, especially that it has 2 double quotes at the end of the line instead of just one that you would expect.
I need to extract the XCORD and YCORD information, like XCORD = 2 and YCORD = 42, before replacing the column value. I then want to insert an extra column, named X_Y, which represents (2_42).
How can I modify my script to do that?
If I understand your question correctly, you can use a simple regular expression to pull out the numbers you want:
import re

f = open('csvdata.csv', 'rb')
fo = open('out6.csv', 'wb')
for line in f:
    bits = line.rstrip('\n').split(',')
    x_y_matches = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
    assert x_y_matches is not None, 'Line had unexpected format: {0}'.format(bits[1])
    x_y = '({0}_{1})'.format(x_y_matches.group(1), x_y_matches.group(2))
    bits[1] = '"input"'
    bits.append(x_y)  # append the new X_Y column
    fo.write(','.join(bits) + '\n')
f.close()
fo.close()
Note that this will only work if column 2 always says 'X_CORD' and 'Y_CORD' immediately before the numbers. If it is sometimes a slightly different format, you'll need to adjust the regular expression to allow for that. I added the assert to give a more useful error message if that happens.
You mentioned wanting the column to be named X_Y. Your script appears to assume that there is no header, and my modified version definitely makes this assumption. Again, you'd need to adjust for that if there is a header line.
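If there were a header line, one hypothetical way to handle it would be to copy it through before the loop shown earlier, appending the new column's name:
header = next(f)  # consume the original header line first
fo.write(header.rstrip('\n') + ',X_Y\n')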
And, yes, I agree with the other commenters that using the csv module would be cleaner, in general, for reading and writing csv files.
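For reference, the same loop with the csv module might look like this (a sketch under the same no-header assumption; csv.writer also handles quoting for you, so the manual quote characters can go):
import csv
import re

with open('csvdata.csv', 'rb') as f, open('out6.csv', 'wb') as fo:
    writer = csv.writer(fo)
    for bits in csv.reader(f):
        m = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
        assert m is not None, 'Line had unexpected format: {0}'.format(bits[1])
        bits[1] = 'input'
        writer.writerow(bits + ['({0}_{1})'.format(m.group(1), m.group(2))])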