Enumerating and replacing all tokens in a text file in Python

I have a question for you, dear python lovers.
I have a corpus file, as the following:
Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon 's coordinator in here ?
Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?
I want to assign a specific number to each token and replace every token in the file with its assigned number.
By token I mean each group of characters in the file separated by ' ' (a space). So, for example, ? is a token, and Excuse is a token as well.
The corpus file contains more than 4 million lines like the ones above. Can you show me the fastest way to do what I want?
Thanks,

Might be overkill but you could write your own classifier:
# Python 3.x
class Classifier(dict):
    def __init__(self, args=None):
        '''args is an iterable of keys (only)'''
        self.n = 1
        super().__init__()
        if args:
            for thing in args:
                self[thing] = self.n
    def __setitem__(self, key, value=None):
        if key not in self:
            super().__setitem__(key, self.n)
            self.n += 1
    def setdefault(self, key, default=None):
        increment = key not in self
        n = super().setdefault(key, self.n)
        self.n += int(increment)
        return n
    def update(self, other):
        for k, v in other:
            self.setdefault(k)
    def transpose(self):
        return {v: k for k, v in self.items()}
Usage:
c = Classifier()
with open('foo.txt') as infile, open('classified.txt', 'w+') as outfile:
    for line in infile:
        line = (str(c.setdefault(token)) for token in line.strip().split())
        outfile.write(' '.join(line))
        outfile.write('\n')
To reduce the number of writes, you could accumulate lines in a list and flush them with writelines() once the list reaches a set length, as sketched below.
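A minimal sketch of that batching idea (the 10,000-line buffer size is an arbitrary choice, not from the original):
c = Classifier()
buffer = []
with open('foo.txt') as infile, open('classified.txt', 'w+') as outfile:
    for line in infile:
        tokens = (str(c.setdefault(token)) for token in line.strip().split())
        buffer.append(' '.join(tokens) + '\n')
        if len(buffer) >= 10000:  # flush every 10k lines
            outfile.writelines(buffer)
            buffer.clear()
    outfile.writelines(buffer)  # flush whatever is left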
If you have enough memory, you could read the entire file in and split it then feed that to Classifier.
De-classify
z = c.transpose()
with open('classified.txt') as f:
    for line in f:
        line = (z[int(n)] for n in line.strip().split())
        print(' '.join(line))
For Python 2.7, super() requires arguments: replace super() with super(Classifier, self).
If you are going to be working mainly with strings for the token numbers, the class could store str(self.n) instead; then you won't have to convert back and forth between strings and ints in your working code.
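A sketch of that tweak as a subclass (StrClassifier is a made-up name; only the two methods that touch self.n change, and they call dict directly to skip Classifier's int-storing versions):
class StrClassifier(Classifier):
    '''Hypothetical variant that stores token ids as strings.'''
    def __setitem__(self, key, value=None):
        if key not in self:
            dict.__setitem__(self, key, str(self.n))
            self.n += 1
    def setdefault(self, key, default=None):
        increment = key not in self
        n = dict.setdefault(self, key, str(self.n))
        self.n += int(increment)
        return n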
You also may be able to use LabelEncoder from sklearn.
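For instance, a minimal LabelEncoder sketch (note that it assigns ids in sorted token order, not in order of first appearance):
from sklearn.preprocessing import LabelEncoder

tokens = "Excuse me , aren 't you Chae Yoon 's coordinator ?".split()
le = LabelEncoder()
ids = le.fit_transform(tokens)  # numpy array of integer ids
print(' '.join(str(i) for i in ids))
print(' '.join(le.inverse_transform(ids)))  # recover the original tokens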

If you already have a specific dictionary mapping tokens to numbers, you simply need to apply it. Note that the values must be strings, since str.replace does not accept ints:
mapping = {'?': '1', 'Excuse': '2', ...}
for k, v in mapping.items():
    my_string = my_string.replace(k, v)
If you want to create a brand new dictionary:
mapping = list(set(my_string.split(' ')))
mapping = {x: str(i) for i, x in enumerate(mapping)}
for k, v in mapping.items():
    my_string = my_string.replace(k, v)

from collections import defaultdict
from itertools import count

with open(filename) as f, open(output, 'w+') as out:
    c = count()
    d = defaultdict(c.__next__)
    for line in f:
        line = line.split()
        line = ' '.join(str(d[token]) for token in line)
        out.write(line + '\n')
Using a defaultdict, we remember what tokens we've seen. Every time we see a new token, we get the next number and assign it to that token. This writes output to a different file.
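A quick check of the trick (an assumed interactive session; ids start at 0 and repeated tokens reuse theirs):
from collections import defaultdict
from itertools import count

c = count()
d = defaultdict(c.__next__)
print(d['Excuse'], d['me'], d['Excuse'])  # prints: 0 1 0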

split = "super string".split(' ')
map = []
result = ''
foreach word in split:
if not map.__contains__(word):
map[word] = len(map)
result += ' ' + str(map[word]
this way avoid to do my_string = my_string.replace(k, v) that makes it slow

Try the following: it assigns a number to each token, then replaces each token with its corresponding number (note the plain split() below, which splits on newlines as well as spaces):
a = """Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon 's coordinator in here ?
Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?""".split()
key_map = {token: str(i) for i, token in enumerate(set(a))}
" ".join(map(lambda x: key_map[x], a))
i.e. first map each unique token to a number, then you can use the key_map to assign the numeric value to each token

Related

Checking a file of words for those that can be formed with a given set of letters: strange result?

I’m quite new to python and am getting some strange results, surely due to a basic error on my part…
Basically, in Python 3.x, I must define a function (best_words(ltr_set,word_file)) that takes a set of letters (a list of characters) and searches a .txt file of words (1 word per line) for those that can be formed with those letters.
I first defined a function that checks if a given word can be made from a given set of letters. The word to be checked must be fed into this function as a list of characters (lsta), so it can be checked against the set of letters available (lstb):
def can_make_lsta_wrd_frm_lstb(lsta, lstb):
    result = True
    i = 0
    while i < len(lsta) and result == True:
        if lsta[i] in lstb:
            lstb.remove(lsta[i])
            i += 1
        else:
            result = False
    return result
I also defined a function that takes any given string and converts it into a list of its characters:
def lst(string):
    ls = []
    for c in string:
        ls.append(c)
    return ls
The idea behind the main best_words function is therefore to take a given set of letters and apply the above function to every line in a file of words, with the aim of filtering down to only those that can be made from the letters available...
def best_words(ltr_set, word_file):
    possible_words = []
    f = open(word_file)
    lines = f.readlines()
    i = 0
    while i < len(lines):
        lines[i] = lines[i].strip('\n')
        i += 1
    for item in lines:
        if can_make_lsta_wrd_frm_lstb(lst(item), ltr_set):
            possible_words.append(item)
    return possible_words
However, I keep getting an unexpected result, as if a loop is not continued as it should be…
For instance, if I take a file short_dictionnary.txt with the following words:
AA
AAS
ABACA
ABACAS
ABACOST
ABACOSTS
ABACULE
ABACULES
ABAISSA
ABAISSABLE
and call the function:
best_words(['A', 'C', 'B', 'A', 'S', 'A'], "short_dictionnary.txt")
The possible_words list consists solely of "AA", whilst AAS, ABACA and ABACAS could also be formed...
If anyone can see what's going on, their input would be greatly appreciated!
I would convert letter_set to a Counter and then for each letter in possible word, check that there are enough of that letter in letter_set to make that word. You're also leaking file references.
from collections import Counter

def can_make_word(c, word):
    return all(c[letter] >= count for letter, count in Counter(word).most_common())

def best_words(ltr_set, word_file):
    possible_words = []
    c = Counter(ltr_set)
    with open(word_file) as f:
        lines = f.readlines()
    lines = [line.strip() for line in lines]
    for item in lines:
        if can_make_word(c, item):
            possible_words.append(item)
    return possible_words
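A quick sanity check against the sample data (assuming short_dictionnary.txt contains the ten words listed in the question):
print(best_words(['A', 'C', 'B', 'A', 'S', 'A'], 'short_dictionnary.txt'))
# ['AA', 'AAS', 'ABACA', 'ABACAS']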
Thanks everyone, I now understand what I had to do!
I essentially needed to ensure the original ltr_set wasn't modified, which was achieved by making a simple copy. I don't know if answering my own question is of any use (I'm quite new to this forum), but here's the corrected can_make... function, should anyone find it useful for resolving a similar issue:
def can_make_lsta_wrd_frm_lstb(lsta, lstb):
    lstb_copy = lstb[:]
    result = True
    i = 0
    while i < len(lsta) and result == True:
        if lsta[i] in lstb_copy:
            lstb_copy.remove(lsta[i])
            i += 1
        else:
            result = False
    return result

Pythonic way to update nested dictionaries in Python

I have some data like this:
FeatureName,Machine,LicenseHost
Feature1,host1,lichost1
Feature1,host2,lichost1
Feature2,host1,lichost2
Feature1,host1,lichost1
and so on...
I want to maintain a nested dictionary where the first level of key is the feature name, next is machine name, finally license host, and the value is the number of times that combination occurs.
Something like:
dictionary['Feature1']['host1']['lichost1'] = 2
dictionary['Feature1']['host2']['lichost1'] = 1
dictionary['Feature2']['host1']['lichost2'] = 1
The obvious way of creating/updating such a dictionary is (assuming I am reading the data line by line from the CSV):
for line in file:
    feature, machine, license = line.split(',')
    if feature not in dictionary:
        dictionary[feature] = {}
    if machine not in dictionary[feature]:
        dictionary[feature][machine] = {}
    if license not in dictionary[feature][machine]:
        dictionary[feature][machine][license] = 1
    else:
        dictionary[feature][machine][license] += 1
This ensures that I will never run into key not found errors at any level.
What is the best way to do the above (for any number of nested levels) ?
You could use defaultdict:
from collections import defaultdict
import csv

def d1(): return defaultdict(int)
def d2(): return defaultdict(d1)
def d3(): return defaultdict(d2)

dictionary = d3()

with open('input.csv') as input_file:
    next(input_file)  # skip the header row
    for line in csv.reader(input_file):
        dictionary[line[0]][line[1]][line[2]] += 1

assert dictionary['Feature1']['host1']['lichost1'] == 2
assert dictionary['Feature1']['host2']['lichost1'] == 1
assert dictionary['Feature2']['host1']['lichost2'] == 1
assert dictionary['InvalidFeature']['host1']['lichost1'] == 0
If the multiple function defs bother you, you can say the same thing more succinctly:
dictionary = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
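Since the question asks about any number of nested levels, here is a hedged sketch of one way to generalize (nested_counter is a made-up helper, not part of the answer above):
from collections import defaultdict

def nested_counter(levels):
    # a defaultdict nested `levels` deep whose leaves are int counters
    if levels == 1:
        return defaultdict(int)
    return defaultdict(lambda: nested_counter(levels - 1))

dictionary = nested_counter(3)
dictionary['Feature1']['host1']['lichost1'] += 1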

appending a list from a read text file python3

I am attempting to read a txt file and create a dictionary from the text. A sample txt file is:
John likes Steak
John likes Soda
John likes Cake
Jane likes Soda
Jane likes Cake
Jim likes Steak
My desired output is a dictionary with the name as the key, and the "likes" as a list of the respective values:
{'John':('Steak', 'Soda', 'Cake'), 'Jane':('Soda', 'Cake'), 'Jim':('Steak')}
I keep running into an error when adding my stripped word to my list, and have tried a few different ways:
pred = ()
prey = ()
spacedLine = inf.readline()
line = spacedLine.rstrip('\n')
while line != "":
    line = line.split()
    pred.append = (line[0])
    prey.append = (line[2])
    spacedLine = inf.readline()
    line = spacedLine.rstrip('\n')
and also:
spacedLine = inf.readline()
line = spacedLine.rstrip('\n')
while line != "":
    line = line.split()
    if line[0] in chain:
        chain[line[0]] = [0, line[2]]
    else:
        chain[line[0]] = line[2]
    spacedLine = inf.readline()
    line = spacedLine.rstrip('\n')
any ideas?
This will do it (without needing to read the entire file into memory first):
likes = {}
for who, _, what in (line.split()
                     for line in (line.strip()
                                  for line in open('likes.txt', 'rt'))):
    likes.setdefault(who, []).append(what)
print(likes)
Output:
{'Jane': ['Soda', 'Cake'], 'John': ['Steak', 'Soda', 'Cake'], 'Jim': ['Steak']}
Alternatively, to simplify things slightly, you could use a temporary collections.defaultdict:
from collections import defaultdict

likes = defaultdict(list)
for who, _, what in (line.split()
                     for line in (line.strip()
                                  for line in open('likes.txt', 'rt'))):
    likes[who].append(what)
print(dict(likes)) # convert to plain dictionary and print
Your input is a sequence of sequences. Parse the outer sequence first, parse each item next.
Your outer sequence is:
Statement
<empty line>
Statement
<empty line>
...
Assume that f is the open file with the data. Read each statement and return a list of them:
def parseLines(f):
    result = []
    for line in f:  # file objects iterate over text lines
        line = line.strip()  # raw lines keep their '\n', so strip first
        if line:  # keep only non-empty lines
            result.append(line)
    return result
Note that the function above accepts a much wider grammar: it allows arbitrarily many empty lines between non-empty lines, and two non-empty lines in a row. But it does accept any correct input.
Then, your statement is a triple: X likes Y. Parse it by splitting it by whitespace, and checking the structure. The result is a correct pair of (x, y).
def parseStatement(s):
    parts = s.split()  # by default, it splits by all whitespace
    assert len(parts) == 3, "Syntax error: %r is not three words" % s
    x, likes, y = parts  # unpack the list of 3 items into variables
    assert likes == "likes", "Syntax error: %r instead of 'likes'" % likes
    return x, y
Make a list of pairs for each statement:
pairs = [parseStatement(s) for s in parseLines(f)]
Now you need to group values by key. Let's use defaultdict which supplies a default value for any new key:
from collections import defaultdict

the_answer = defaultdict(list)  # the default value is an empty list
for key, value in pairs:
    the_answer[key].append(value)
    # we can append because the_answer[key] is set to an empty list on first access
So here the_answer is what you need, only it uses lists as dict values instead of tuples. This must be enough for you to understand your homework.
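If you really want tuples as values, as in the desired output, one extra line converts them at the end (an addition for completeness, not part of the answer above):
the_answer = {k: tuple(v) for k, v in the_answer.items()}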
dic = {}
for i in f.readlines():
    if i.strip():  # skip blank lines ('\n' alone is truthy, so strip first)
        words = i.split()
        if words[0] in dic:
            dic[words[0]].append(words[2])
        else:
            dic[words[0]] = [words[2]]
print(dic)
This should do it.
Here we iterate through the lines of the file object f, and on each line we fill up the dictionary, using the first word of the line as the key and appending the third word to that key's list.

Find if groups of characters are repeated in a string in python

I am a beginner and I want to know how to determine whether a string contains a group of characters that is repeated in a pattern.
Example: "aabcdabcdabcdabcd"
Here four characters, 'abcd', are repeated.
But I do not know how many characters are repeated.
The pattern is not certain; I do not know it in advance. "abcd" is just an example.
The pattern can be in any order.
Please help.
My code is (I don't actually know the string in advance!):
s1 = str("aabcdabcdabcd")
x = 0
z = ""
for i in range(1, len(s1)):
    z = s1[i:i+5]
    s1.replace(z, "", 1)
    if z in s1:
        x += 1
if x != 0:
    print "yes"
else:
    print "no"
The above program works only for the given string. I want it to be able to evaluate any string.
This will find all repeats of letters - then you can filter the sets you want.
cstr = 'aabcdabcdabcdabcd'
dd = {}
for ii, ch in enumerate(cstr):
    # collect all substrings 3-6 characters long starting at this position
    for jj in range(3, 7):
        wrd = cstr[ii:ii+jj]
        if not len(wrd) == jj:
            break
        dd.setdefault(wrd, 0)
        dd[wrd] += 1
# report any "word" that occurs more than twice
for k, v in dd.iteritems():
    if v > 2:
        print k, v
I'm relatively new to Python myself, and one of the things I found most exciting initially was that it's very easy to start working with the characters which make up strings.
To answer your problem, I would start with:
for letter in string:
# work through the string and check for repeated patterns
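To make that concrete, here is a sketch of a classic trick for the whole-string case (a different approach from the scanning answer above, and it only detects strings that consist entirely of a repeated block): a string is such a repetition exactly when it occurs inside a doubled copy of itself at an offset smaller than its own length.
def is_repeated_pattern(s):
    # e.g. 'abcdabcdabcd' occurs at offset 4 inside its doubled copy
    return (s + s).find(s, 1) < len(s)

print(is_repeated_pattern('abcdabcdabcd'))  # True: 'abcd' repeats
print(is_repeated_pattern('aabcdabcd'))     # False: the leading 'a' breaks it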
In Natural Language Processing these are called n-grams; for most common NLP tasks the nltk library is very useful:
from nltk.util import ngrams
from collections import Counter

s = 'aabcdabcdabcdabcd'
max_ngram = 5
minimum_count = 2

ngrams_found = Counter()
for x in range(max_ngram - 1):
    ngrams_found += Counter("".join(ngram) for ngram in ngrams(s, x + minimum_count))

# iterate over a copy so entries can be deleted while looping
for key, val in list(ngrams_found.items()):
    if val < minimum_count:
        del ngrams_found[key]
    else:
        print(key, val)
The Counter object also allows you to print the x most common ngrams:
ngrams_found.most_common(5)

Add new item to Dictionary dynamically using a variable

I'm currently reading values from a file and splitting them into a parameter and a value, e.g. #id=7 becomes param = #id, value = 7. I would like to use the param variable as a new key in the dictionary. However, it is not working as expected. I'm using the following code.
list1 = {}
with open('C:/Temp/file1.txt') as f:
    lines = f.read().splitlines()
    for line in lines:
        middle = line.find("=")
        param = line[:middle]
        value = line[middle+1:]
        list1[param] = value
With this code, both the dictionary key and the value come out as 7.
Thanks in advance.
You have to define your dictionary (d is a nice name). You can do it this way:
with open('C:/Temp/file1.txt') as f:
    d = dict(line.strip().split('=', 1) for line in f)
for k, v in d.items():
    print("param = {0}, value = {1}".format(k, v))
If you are defining list1 as a dict (list1 = {}), then your print statement is incorrect:
print("param = " + list1[param] + ", value = " + value)
Both list1[param] and value would be 7, since list1[param] gives you the value stored under that key, not the key itself.
Try looking at the dictionary afterwards by printing it.
print(list1)
I know this wasn't what you were asking for, but I would suggest to look at ConfigParser. This is a standard way to use configuration files.
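A minimal configparser sketch, under two assumptions: the file holds plain param=value lines with no [section] header (so a dummy one is prepended), and keys like #id must survive (so '#'-comments are disabled):
import configparser

# '#' normally starts a comment line, so disable comment prefixes
parser = configparser.ConfigParser(comment_prefixes=())
with open('C:/Temp/file1.txt') as f:
    parser.read_string('[defaults]\n' + f.read())  # fake section header

d = dict(parser['defaults'])
print(d)  # e.g. {'#id': '7'}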
