unpickling a dictionary of encoded words - python

I have created a dictionary whose keys are UTF-8 encoded words:
import os.path
import codecs
import pickle
from collections import Counter
wordDict = {}
def pathFilesList():
    source='StemmedDataset'
    retList = []
    for r, d, f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList
# Starts to parse a corpus, it counts the frequency of each word and
# the date of the data (the date is the file name.) then saves words
# as keys of dictionary and the tuple of (freq,date) as values of each
# key.
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        # One word per line, strip space. No empty lines.
        fw = codecs.open(f, mode = 'r' , encoding='utf-8')
        fileWords = Counter(w for w in fw.read().split())
        # For each unique word, count occurrence and store in dict.
        for stemWord, stemFreq in fileWords.items():
            if stemWord not in wordDict:
                wordDict[stemWord] = [(date_stamp, stemFreq)]
            else:
                wordDict[stemWord].append((date_stamp, stemFreq))
        # Close file and do next.
        fw.close()
if __name__ == "__main__":
    # Parse all files and store in wordDict.
    parsing()
    output = open('data.pkl', 'wb')
    print "Dumping wordDict of size {0}".format(len(wordDict))
    pickle.dump(wordDict, output)
    output.close()
When I unpickle the data and query this dictionary, I can't look up alphabetical words, even words that I'm sure are in the dictionary: the lookup always returns False. Numeric queries work fine. Here is how I unpickle the data and query it:
pkl_file=codecs.open('data.pkl' , 'rb' )
wd=pickle.load(pkl_file)
pprint.pprint(wd) #to make sure the wd is correct and it has been created
print type(wd) #making sure of the type of data structure
pkl_file.close()
#tried lots of other ways to query like if wd.has_key('some_encoded_word')
value= None
inputList= ['اندیمشک' , '16' , 'درحوزه' ]
for i in inputList :
    if i in wd :
        value = wd[i]
        print value
    else:
        print 'False'
Here is my output:
pa#pa:~/Desktop$ python unpickle.py
False
[('2000-05-07', 5), ('2000-07-05', 2)]
False
So I'm quite sure there's something wrong with the encoded words.

Your problem is that you're using codecs.open. That function is specifically for opening a text file and automatically decoding the data to Unicode. (As a side note, you usually want io.open, not codecs.open, even for that case.)
To open a binary file to be passed as bytes to something like pickle.load, just use the builtin open function, not codecs.open.
Meanwhile, it usually makes things simpler to use Unicode strings throughout your program, and only use byte strings at the edges, decoding input as soon as you get it and encoding output as late as possible.
Also, you have literal Unicode characters in non-Unicode string literals. Never, ever, ever do this. That's the surest way to create invisible mojibake strings. Always use Unicode literals (like u'abc' instead of 'abc') when you have non-ASCII characters.
Plus, if you use non-ASCII characters in the source, always use a coding declaration (and of course make sure your editor is using the same encoding you put in the coding declaration).
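A minimal sketch of how these points fit together. It is written so that it also runs on Python 3, where every str is already Unicode and the u prefix is optional; in Python 2 the prefix is what makes the literal a Unicode string. The file name data.pkl and the query words come from the question; the temporary directory and the small dictionary (including the value stored for the first word) are stand-ins for illustration only.

```python
# -*- coding: utf-8 -*-
import os
import pickle
import tempfile

# Stand-in for the real wordDict; keys are Unicode strings, not byte strings.
word_dict = {
    u"اندیمشک": [("2000-05-07", 5)],  # made-up value for illustration
    u"16": [("2000-05-07", 5), ("2000-07-05", 2)],
}

# Hypothetical location so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "data.pkl")

# Plain built-in open() in binary mode for both dumping and loading.
with open(path, "wb") as output:
    pickle.dump(word_dict, output)
with open(path, "rb") as pkl_file:
    wd = pickle.load(pkl_file)

# Query with Unicode literals so the lookups compare like with like.
results = [wd.get(word, False) for word in [u"اندیمشک", u"16", u"درحوزه"]]
for value in results:
    print(value)
```

The first two lookups return their stored lists; the third prints False only because that word is not in the stand-in dictionary.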
Also, keep in mind that == (and dict lookup) may not do what you want for Unicode. If you have a word stored in NFC, and look up the same word in NFD, the characters won't be identical even though they represent the same string. Without knowing the exact strings you're using and how they're represented it's hard to know if this is a problem in your code, but it's pretty common with, e.g., Mac programs that get strings out of filenames or Cocoa GUI apps.
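To illustrate the normalization point, here is a small sketch using the standard unicodedata module. normalize_key is a hypothetical helper and the sample word is made up; the idea is to bring keys to one form (NFC here, an arbitrary choice) both when storing and when looking up.

```python
import unicodedata


def normalize_key(word):
    """Return the word in NFC so equal-looking strings compare equal."""
    return unicodedata.normalize("NFC", word)


composed = u"caf\u00e9"     # 'é' as a single code point (NFC)
decomposed = u"cafe\u0301"  # 'e' followed by a combining acute accent (NFD)

# Store the key in normalized form.
word_dict = {normalize_key(composed): [("2000-05-07", 5)]}

raw_hit = decomposed in word_dict                        # code points differ
normalized_hit = normalize_key(decomposed) in word_dict  # same form on both sides
print(raw_hit, normalized_hit)
```

Whether this matters for the question's data depends on how the corpus files and the query strings were produced.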

Related

How to get an unknown substring between two known substrings, within a giant string/file

I'm trying to get all the substrings under a "customLabel" tag, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...
Unusual issue: this is a 1071552 characters long ndjson file, of a single line ("for line in file:" is pointless since there's only one).
The best I found was that:
How to find a substring of text with a known starting point but unknown ending point in python
but if I use this, the result obviously doesn't stop (at Month) and keeps writing the whole remainder of the file, the same as when using partition()[2].
Just know that Month is only an example, customLabel has about 300 variants and they are not listed (I'm actually doing this to list them...)
To give some details here's my script so far:
with open("file.ndjson","rt", encoding='utf-8') as ndjson:
    filedata = ndjson.read()
x="customLabel"
count=filedata.count(x)
for i in range (count):
    if filedata.find(x)>0:
        print("Found "+str(i+1))
So right now it properly tells me how many occurrences of customLabel there are. Instead, I'd like to get the substring that comes after customLabel":" (Month in the example) and put them all in a list, to locate them more easily and to enable the use of replace() for translations later on.
I'd guess regular expressions are the solution, but I'm pretty new to them, so I'm posting this question while I learn about them...
If you want to search for all (even nested) customLabel values like this:
{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}
you can use RegEx patterns with the re module:
import re
label_values = []
# Matches the key, an optional space around the colon, then a run of
# letters, digits, and double quotes as the value.
regex_pattern = r"\"customLabel\"[ ]?:[ ]?([0-9a-zA-Z\"]+)"
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        values = re.findall(regex_pattern, line)
        label_values.extend(values)
print(label_values) # ['"Month"', '23525235']
# If you don't want the items to have quotations
label_values = [i.replace('"', "") for i in label_values]
print(label_values) # ['Month', '23525235']
Note: If you're only talking about ndjson files and not nested searching, then it'd be better to use the json module to parse the lines and then easily get the value of your specific key which is customLabel.
import json
label = "customLabel"
label_values = []
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        line_json = json.loads(line)
        if line_json.get(label) is not None:
            label_values.append(line_json.get(label))
print(label_values) # ['Month']
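If only quoted string values are of interest, a variant of the regex approach can capture the text between the quotes directly, so no cleanup pass is needed. This is a sketch under assumptions: extract_labels is a hypothetical helper, the second label in the sample is invented, and values containing escaped quotes (\") are not handled.

```python
import re

# Key, optional whitespace around the colon, then a quoted value.
# The group captures everything up to the next double quote.
LABEL_PATTERN = re.compile(r'"customLabel"\s*:\s*"([^"]*)"')


def extract_labels(text):
    """Return every quoted customLabel value found in text, in order."""
    return LABEL_PATTERN.findall(text)


# Shortened stand-in for the single-line file content.
sample = '...,"customLabel":"Month"},"schema":"metric...,"customLabel" : "Week of year"}'
labels = extract_labels(sample)
print(labels)  # ['Month', 'Week of year']
```

Because findall scans the whole string, this works the same whether the file is read line by line or in one read() call.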

How can I effectively pull out human readable strings/terms from code automatically?

I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can return strings with periods or other characters between them as well, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:
strings.get
How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down but I'm currently just doing the tallying by unique line instead of "term":
strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1
for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in fit the regular expression
\w+\.?\w*
which basically means "one or more word characters (letters, digits, or _), optionally followed by a . and then more word characters".
Note that this also matches number literals like 42 or 7.6, but those are easy enough to filter out afterwards.
Then you can use collections.Counter to do the actual counting for you:
import collections
import re
pattern = re.compile(r"\w+\.?\w*")
#here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))
for token, count in tokens.most_common(5): #show only the top 5
    print(token, count)
Running python version 3.6.0a1 the output is this:
self 226
def 173
return 170
self.data 129
if 102
This makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that the pattern captures self.data, which fits the construct you are interested in.
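To make this tunable and usable one line at a time, something like the following could work. It is only a sketch: count_terms and its keep_dotted flag are hypothetical names, and the dotted pattern is a slight variant of the one above (it requires word characters after each dot, so it allows chains like a.b.c and never ends a token with a dot). Tokens that start with a digit are skipped as a simple way to drop number literals.

```python
import re
from collections import Counter

DOTTED = re.compile(r"\w+(?:\.\w+)*")  # keeps object.attr together
PLAIN = re.compile(r"\w+")             # splits on every non-word character


def count_terms(lines, keep_dotted=True):
    """Count terms across an iterable of lines (for example, an open file)."""
    pattern = DOTTED if keep_dotted else PLAIN
    counts = Counter()
    for line in lines:
        for token in pattern.findall(line):
            if token[0].isdigit():  # skip number literals such as 42 or 7.6
                continue
            counts[token] += 1
    return counts


sample = ["for w in sorted(strings, key=strings.get, reverse=True):"]
dotted_counts = count_terms(sample)
plain_counts = count_terms(sample, keep_dotted=False)
print(dotted_counts["strings.get"], plain_counts["strings"])  # 1 2
```

Passing an open file object as lines reads it lazily, which suits large files.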

Rosalind Profile and Consensus: Writing long strings to one line in Python (Formatting)

I'm trying to tackle a problem on Rosalind where, given a FASTA file of at most 10 sequences at 1kb, I need to give the consensus sequence and profile (how many of each base do all the sequences have in common at each nucleotide). In the context of formatting my response, what I have as my code works for small sequences (verified).
However, I have issues in formatting my response when it comes to large sequences.
What I expect to return, regardless of length, is:
"consensus sequence"
"A: one line string of numbers without commas"
"C: one line string """" "
"G: one line string """" "
"T: one line string """" "
All aligned with each other and on their own respective lines, or at least some formatting that allows me to carry this formatting as a unit onward to maintain the integrity of aligning.
But when I run my code for a large sequence, each separate string below the consensus sequence comes out broken up by newlines, presumably because the string itself is too long. I've been struggling to think of ways to circumvent the issue, but my searches have been fruitless. I'm thinking about some iterative writing algorithm that writes the entirety of the above expected output, but in chunks. Any help would be greatly appreciated. I have attached the entirety of my code below for the sake of completeness, with block comments as needed.
def cons(file):
    #returns consensus sequence and profile of a FASTA file
    import os
    path = os.path.abspath(os.path.expanduser(file))
    with open(path,"r") as D:
        F=D.readlines()
    #initialize list of sequences, list of all strings, and a temporary storage
    #list, respectively
    SEQS=[]
    mystrings=[]
    temp_seq=[]
    #get a list of strings from the file, stripping the newline character
    for x in F:
        mystrings.append(x.strip("\n"))
    #if the string in question is a nucleotide sequence (without ">")
    #i'll store that string into a temporary variable until I run into a string
    #with a ">", in which case I'll join all the strings in my temporary
    #sequence list and append to my list of sequences SEQS
    for i in range(1,len(mystrings)):
        if ">" not in mystrings[i]:
            temp_seq.append(mystrings[i])
        else:
            SEQS.append(("").join(temp_seq))
            temp_seq=[]
    SEQS.append(("").join(temp_seq))
    #set up list of nucleotide counts for A,C,G and T, in that order
    ACGT= [[0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))]]
    #assumed to be equal length sequences. Counting amount of shared nucleotides
    #in each column
    for i in range(0,len(SEQS[0])-1):
        for j in range(0, len(SEQS)):
            if SEQS[j][i]=="A":
                ACGT[0][i]+=1
            elif SEQS[j][i]=="C":
                ACGT[1][i]+=1
            elif SEQS[j][i]=="G":
                ACGT[2][i]+=1
            elif SEQS[j][i]=="T":
                ACGT[3][i]+=1
    ancstr=""
    TR_ACGT=list(zip(*ACGT))
    acgt=["A: ","C: ","G: ","T: "]
    for i in range(0,len(TR_ACGT)-1):
        comp=TR_ACGT[i]
        if comp.index(max(comp))==0:
            ancstr+=("A")
        elif comp.index(max(comp))==1:
            ancstr+=("C")
        elif comp.index(max(comp))==2:
            ancstr+=("G")
        elif comp.index(max(comp))==3:
            ancstr+=("T")
    '''
    writing to file... trying to get it to write as
    consensus sequence
    A: blah(1line)
    C: blah(1line)
    G: blah(1line)
    T: blah(line)
    which works for small sequences. but for larger sequences
    python keeps adding newlines if the string in question is very long...
    '''
    myfile="myconsensus.txt"
    writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
    with open(myfile,'w') as D:
        D.writelines(ancstr)
        D.writelines("\n")
        for i in range(0,len(writing_strings)):
            D.writelines(writing_strings[i])
            D.writelines("\n")
cons("rosalind_cons.txt")
Your code is fine except for this line:
writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
You accidentally replicate your data: the inner for i in range(0,len(ACGT)) emits every count once per row of ACGT, so each line comes out four times longer than it should be. Try replacing it with:
writing_strings=[acgt[i] + str(ACGT[i])[1:-1] for i in range(0,len(ACGT))]
and then write it to your output file as follows:
D.write(writing_strings[i])
Slicing with [1:-1] is a lazy way to get rid of the brackets from the string form of your list. Note that the commas between the numbers remain.
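If the counts should appear without commas, as the question asks, joining the numbers yourself avoids the list's string form altogether. A minimal sketch; format_profile is a hypothetical helper and the counts are made-up sample data, not Rosalind output:

```python
def format_profile(counts, labels=("A", "C", "G", "T")):
    """Build one 'X: n n n ...' line per nucleotide, numbers separated by spaces."""
    return [
        "{0}: {1}".format(label, " ".join(str(n) for n in row))
        for label, row in zip(labels, counts)
    ]


# Made-up counts for four columns, in A, C, G, T order.
ACGT = [[5, 1, 0, 0], [0, 0, 1, 4], [1, 1, 6, 3], [1, 5, 0, 0]]
lines = format_profile(ACGT)
print("\n".join(lines))
```

Each entry of lines is a single string with no embedded newline, so writing them with "\n".join(lines) produces exactly one line per nucleotide regardless of sequence length.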

Vigenere Cipher in Python bug

I'm trying to implement Vigenere's Cipher. I want to be able to obfuscate every single character in a file, not just alphabetic characters.
I think I'm missing something with the different types of encoding. I have made some test cases and some characters are getting replaced badly in the final result.
This is one test case:
,.-´`1234678abcde^*{}"¿?!"·$%&/\º
end
And this is the result I'm getting:
).-4`1234678abcde^*{}"??!"7$%&/:
end
As you can see, ',' is being replaced badly with ')' as well as some other characters.
My guess is that the others (for example, '¿' being replaced with '?') come from the original character not being in the range [0, 127], so it's normal that those are changed. But I don't understand why ',' is failing.
My intent is to obfuscate CSV files, so the ',' problem is the one I'm mainly concerned about.
In the code below, I'm using modulus 128, but I'm not sure if that's correct. To execute it, put a file named "OriginalFile.txt" in the same folder with the content to cipher and run the script. Two files will be generated, Ciphered.txt and Deciphered.txt.
"""
Attempt to implement Vigenere cipher in Python.
"""
import os
key = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
fileOriginal = "OriginalFile.txt"
fileCiphered = "Ciphered.txt"
fileDeciphered = "Deciphered.txt"
# CIPHER PHASE
if os.path.isfile(fileCiphered):
    os.remove(fileCiphered)
keyToUse = 0
with open(fileOriginal, "r") as original:
    with open(fileCiphered, "a") as ciphered:
        while True:
            c = original.read(1) # read char
            if not c:
                break
            k = key[keyToUse]
            protected = chr((ord(c) + ord(k))%128)
            ciphered.write(protected)
            keyToUse = (keyToUse + 1)%len(key)
print("Cipher successful")
# DECIPHER PHASE
if os.path.isfile(fileDeciphered):
    os.remove(fileDeciphered)
keyToUse = 0
with open(fileCiphered, "r") as ciphered:
    with open(fileDeciphered, "a") as deciphered:
        while True:
            c = ciphered.read(1) # read char
            if not c:
                break
            k = key[keyToUse]
            unprotected = chr((128 + ord(c) - ord(k))%128) # +128 so that we don't get into negative numbers
            deciphered.write(unprotected)
            keyToUse = (keyToUse + 1)%len(key)
print("Decipher successful")
Assumption: you're trying to produce a new, valid CSV with the contents of cells enciphered via Vigenere, not to encipher the whole file.
In that case, you should check out the csv module, which will handle properly reading and writing CSV files for you (including cells that contain commas in the value, which might happen after you encipher a cell's contents, as you see). Very briefly, you can do something like:
import csv
with open("...", "r", newline="") as fpin, open("...", "w", newline="") as fpout:
    reader = csv.reader(fpin)
    writer = csv.writer(fpout)
    for row in reader:
        # row will be a list of strings, one per column in the row
        # encipher() stands for your own per-cell cipher function
        ciphered = [encipher(cell) for cell in row]
        writer.writerow(ciphered)
When using the csv module you should be aware of the notion of "dialects": ways that different programs (usually spreadsheet-like things, think Excel) handle CSV data. csv.reader() and csv.writer() both default to the "excel" dialect, which suits most files; if your input looks different, csv.Sniffer can try to deduce its dialect from a sample, and you can tell csv.writer() what dialect you want for the output file. You can get the list of built-in dialects with csv.list_dialects() or you can make your own by creating a custom Dialect object.
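As for why ',' comes back as ')': a likely explanation, assuming the files are opened in text mode with default newline handling as in the question, is that the ciphered comma is a carriage return, which Python translates to a line feed when reading. The sketch below reproduces that arithmetic and then shows one possible alternative that works on bytes modulo 256 so every byte survives the round trip. vigenere_bytes is a hypothetical helper, not part of any library, and the temporary path is only there to keep the demo self-contained.

```python
import os
import tempfile

KEY = b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"  # the question's key, as bytes

# Part 1: follow ',' through the text-mode round trip.
ciphered_comma = chr((ord(",") + ord("a")) % 128)  # (44 + 97) % 128 == 13
path = os.path.join(tempfile.mkdtemp(), "Ciphered.txt")
with open(path, "w") as f:
    f.write(ciphered_comma)
with open(path, "r") as f:
    read_back = f.read(1)  # universal newlines turns '\r' into '\n'
deciphered_comma = chr((128 + ord(read_back) - ord("a")) % 128)  # 41
print(repr(ciphered_comma), repr(read_back), repr(deciphered_comma))
# '\r' '\n' ')'


# Part 2: shift raw bytes modulo 256 instead of characters modulo 128.
def vigenere_bytes(data, key, decrypt=False):
    """Add (or subtract) the repeating key to each byte, modulo 256."""
    sign = -1 if decrypt else 1
    return bytes(
        (b + sign * key[i % len(key)]) % 256 for i, b in enumerate(data)
    )


original = u',.-´`1234678abcde^*{}"¿?!"·$%&/\\º\nend'.encode("utf-8")
ciphered = vigenere_bytes(original, KEY)
round_trip = vigenere_bytes(ciphered, KEY, decrypt=True)
print(round_trip == original)  # True
```

With this approach the files would be opened in "rb" and "wb" mode, so no newline or encoding translation happens between the two phases.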

When writing to a csv file, why is each letter in a column?

The code I am using:
import urllib2
import csv
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
soup = BeautifulSoup(urllib2.urlopen(url))
fl = open('locations.csv', 'w')
def unique(countries):
    seen = set()
    for country in countries:
        l = country.lower()
        if l in seen:
            continue
        seen.add(l)
        yield country
locs = []
for row in soup.select('table.wikitable tr'):
    cells = row.find_all('td')
    if cells:
        for location in cells[3].find_all(text=True):
            locs.extend(location.split())
locs2 = []
for locations in unique(locs):
    locations = locs2.extend(locations.split())
print sorted(locs2)
writer = csv.writer(fl)
writer.writerow(['location'])
for values in sorted(locs2):
    writer.writerow(values)
fl.close()
When I print the data I am writing, I get a u' in front of each element, which I think is why the output comes out this way. I tried using .strip(u'') but it gives me an error saying that .strip cannot be used because it is a list.
What am I doing wrong?
locs2 is a list with strings, not a list of lists. As such you are trying to write individual strings as a row:
for values in sorted(locs2):
    writer.writerow(values)
Here values is a string, and writerow() treats it as a sequence. Each element of whatever sequence you pass to that function will be treated as a separate column.
If you wanted to write all locations as one row, pass the whole list to writer.writerow():
writer.writerow(sorted(locs2))
If you wanted to write a new row for each individual location, wrap it in a list first:
for location in sorted(locs2):
    writer.writerow([location])
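The difference is easy to see with an in-memory buffer. This is a small Python 3 sketch; the location name is an arbitrary example, and lineterminator is set only to keep the output readable:

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

writer.writerow("Iraq")    # a string is a sequence of characters: 4 columns
writer.writerow(["Iraq"])  # a one-element list: 1 column

output = buffer.getvalue()
print(output)
# I,r,a,q
# Iraq
```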
You don't need to strip u prefixes from strings; that's just Python telling you that you have Unicode string objects, not byte string objects:
>>> 'ASCII byte string'
'ASCII byte string'
>>> 'ASCII unicode string'.decode('ascii')
u'ASCII unicode string'
See the following information if you want to learn more about Python and Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
