I am using a MapReduce implementation called mincemeat.py. It contains a map function and a reduce function. First, I'll explain what I am trying to accomplish. I am taking a Coursera course on big data that has a programming assignment. The task is this: there are hundreds of files containing data of the form paperid:::author1::author2::author3:::papertitle
We have to go through all the files and, for each author, find the word they have used most often in their paper titles. So I wrote the following code for it.
import re
import glob
import mincemeat
from collections import Counter

text_files = glob.glob('test/*')

def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

datasource = dict((file_name, file_contents(file_name)) for file_name in text_files)

def mapfn(key, value):
    for line in value.splitlines():
        wordsinsentence = line.split(":::")
        authors = wordsinsentence[1].split("::")
        # print authors
        words = str(wordsinsentence[2])
        words = re.sub(r'([^\s\w-])+', '', words)
        # re.sub(r'[^a-zA-Z0-9: ]', '', words)
        words = words.split(" ")
        for author in authors:
            for word in words:
                word = word.replace("-", " ")
                word = word.lower()
                yield author, word

def reducefn(key, value):
    return Counter(value)

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")
# print results

i = open('outfile', 'w')
i.write(str(results))
i.close()
My problem is that reducefn should receive an author name along with all the words that author has used in titles, for every author. So I expected output like
{authorname: Counter({'word1': countofword1, 'word2': countofword2, 'word3': countofword3, ...})}
But what I get is
authorname: (authorname, Counter({'word1': countofword1,'word2':countofword2}))
Can someone tell me why it is happening like that? I don't need help solving the assignment; I just need to understand why this happens!
I ran your code and I see it working as expected. The output looks like {authorname: Counter({'word1': countofword1, 'word2': countofword2, 'word3': countofword3, ...})}.
That said, please remove the code from your question, as posting assignment solutions violates the Coursera Honor Code.
Check the structure of value in reducefn before passing it to Counter:
def reducefn(key, value):
    print(value)
    return Counter(value)
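If value really were a plain list of words, Counter alone would produce the mapping you expect. A minimal standalone sketch, independent of mincemeat, just to show what reducefn is normally assumed to receive:

from collections import Counter

def reducefn(key, value):
    return Counter(value)

# hypothetical input: the author as key, a plain list of mapped words as value
print(reducefn('some author', ['graph', 'mining', 'graph']))
# Counter({'graph': 2, 'mining': 1})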
I want to ask about translating a string using Python. I have a CSV file containing an abbreviation dictionary, like this:
before, after
ROFL, Rolling on floor laughing
STFU, Shut the freak up
LMK, Let me know
...
I want to translate a string so that any word matching the "before" column is replaced with the corresponding word in the "after" column. I tried this code, but it doesn't change anything.
def replace_abbreviation(tweet):
    dictionary = pd.read_csv("dict.csv", encoding='latin1')
    dictionary['before'] = dictionary['before'].apply(lambda val: unicodedata.normalize('NFKD', val).encode('ascii', 'ignore').decode())
    tmp = dictionary.set_index('before').to_dict('split')
    tweet = tweet.translate(tmp)
    return tweet
For example:
Input = "lmk your test result please"
Output = "let me know your test result please"
You can read the contents into a dict and then use the following code:
res = {}
with open('dict.csv') as file:
    next(file)  # skip the first line "before, after"
    for line in file:
        k, v = line.strip().split(', ')
        res[k] = v

def replace(tweet):
    return ' '.join(res.get(x.upper(), x) for x in tweet.split())

print(replace('stfu and lmk your test result please'))
Output
Shut the freak up and Let me know your test result please
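As a side note on why the original attempt changed nothing: str.translate() looks characters up by their Unicode ordinal, so a dict keyed by whole words never matches anything; it is meant for character-level substitutions built with str.maketrans(). A small illustration:

word_map = {"lmk": "let me know"}
print("lmk your result".translate(word_map))  # unchanged: keys are not character codes

table = str.maketrans({"l": "L"})             # character-level mapping
print("lmk your result".translate(table))     # "Lmk your resuLt"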
I'm running the following Python script on a large dataset (around 100,000 items). Currently the execution is unacceptably slow; it would probably take at least a month to finish (no exaggeration). Obviously I would like it to run faster.
I've added a comment below to highlight where I think the bottleneck is. I have written my own database functions, which are imported.
Any help is appreciated!
# -*- coding: utf-8 -*-
import database
from gensim import corpora, models, similarities, matutils
from gensim.models.ldamulticore import LdaMulticore
import pandas as pd
from sklearn import preprocessing

def getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary):
    vec_bow = dictionary.doc2bow([researcher['full_proposal_text']])
    vec_lda = ldamodel[vec_bow]
    # normalization
    try:
        vec_lda = preprocessing.normalize(vec_lda)
    except:
        pass

    similar_authors = []
    for index, other_author in authors.iterrows():
        if(other_author['id'] != author['id']):
            other_vec_bow = dictionary.doc2bow([other_author['full_proposal_text']])
            other_vec_lda = ldamodel[other_vec_bow]
            # normalization
            try:
                other_vec_lda = preprocessing.normalize(vec_lda)
            except:
                pass
            sim = matutils.cossim(vec_lda, other_vec_lda)
            similar_authors.append({'id': other_author['id'], 'cosim': sim})

    similar_authors = sorted(similar_authors, key=lambda k: k['cosim'], reverse=True)
    return similar_authors[:5]

def get_top_five_similar(author, authors, ldamodel, dictionary):
    top_five_similar_authors = getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary)
    database.insert_top_five_similar_authors(author['id'], top_five_similar_authors, cursor)

connection = database.connect()
authors = []
authors = pd.read_sql("SELECT id, full_text FROM author WHERE full_text IS NOT NULL;", connection)

# create the dictionary
dictionary = corpora.Dictionary([authors["full_text"].tolist()])

# create the corpus/ldamodel
author_text = []
for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
    author_text.append(word_list)

corpus = [dictionary.doc2bow(text) for text in author_text]
ldamodel = LdaMulticore(corpus, num_topics=50, id2word=dictionary, workers=30)

#BOTTLENECK: the script hangs after this point.
authors.apply(lambda x: get_top_five_similar(x, authors, ldamodel, dictionary), axis=1)
I noticed these problems in your code, but I'm not sure they are the reason for the slow execution.
This loop here is useless; it will never run:
for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
    author_text.append(word_list)
Also, there is no need to loop over the words of a text; it is enough to call split() on it and you get a list of words. Do this by looping over the authors.
Try to write it like this:
First:
all_authors_text = []
for _, author in authors.iterrows():  # iterate over DataFrame rows, not column names
    all_authors_text.append(author['full_text'].split())
and after that make the dictionary:
dictionary = corpora.Dictionary(all_authors_text)
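From there, a hedged sketch (reusing the names above; the worker and topic counts are just placeholders) of how the corpus and model could be built once from the pre-split texts, and how each author's LDA vector could be cached so it is not recomputed inside the pairwise loop:

corpus = [dictionary.doc2bow(text) for text in all_authors_text]
ldamodel = LdaMulticore(corpus, num_topics=50, id2word=dictionary, workers=4)

# compute each author's topic vector once, instead of once per author pair
author_vecs = [ldamodel[bow] for bow in corpus]
# a pairwise similarity then only reuses the cached vectors:
# sim = matutils.cossim(author_vecs[i], author_vecs[j])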
Novice coder here, trying to sort out issues I've found with a simple spam detection Python script from YouTube.
Naive Bayes cannot be applied because the list isn't being generated correctly. I know the problem step is
featuresets = [(email_features(n),g) for (n,g) in mixedemails]
Could someone help me understand why that line is failing to generate anything?
def email_features(sent):
    features = {}
    wordtokens = [wordlemmatizer.lemmatize(word.lower()) for word in word_tokenize(sent)]
    for word in wordtokens:
        if word not in commonwords:
            features[word] = True
    return features

hamtexts = []
spamtexts = []

for infile in glob.glob(os.path.join('ham/', '*.txt')):
    text_file = open(infile, "r")
    hamtexts.append(text_file.read())
    text_file.close()

for infile in glob.glob(os.path.join('spam/', '*.txt')):
    text_file = open(infile, "r")
    spamtexts.append(text_file.read())
    text_file.close()

mixedemails = ([(email, 'spam') for email in spamtexts] + [(email, 'ham') for email in hamtexts])
featuresets = [(email_features(n), g) for (n, g) in mixedemails]
I converted your problem into a minimal, runnable example:
commonwords = []

def lemmatize(word):
    return word

def word_tokenize(text):
    return text.split(" ")

def email_features(sent):
    wordtokens = [lemmatize(word.lower()) for word in word_tokenize(sent)]
    features = dict((word, True) for word in wordtokens if word not in commonwords)
    return features

hamtexts = ["hello test", "test123 blabla"]
spamtexts = ["buy this", "buy that"]

mixedemails = [(email, 'spam') for email in spamtexts] + [(email, 'ham') for email in hamtexts]
featuresets = [(email_features(n), g) for (n, g) in mixedemails]
print len(mixedemails), len(featuresets)
Executing that example prints 4 4 on the console. Therefore, most of your code seems to work, and the exact cause of the error cannot be determined from what you posted. I would suggest you look at the following points to find the bug:
Maybe your spam and ham files are not read properly (e.g. your path might be wrong). To rule this out, add print hamtexts, spamtexts before mixedemails = .... Both variables should contain non-empty lists of strings.
Maybe your implementation of word_tokenize() always returns an empty list. Add print sent, wordtokens after wordtokens = [...] in email_features() to make sure that sent contains a string and that it gets correctly converted to a list of tokens.
Maybe commonwords contains every single word from your ham and spam emails. To rule this out, add the previous print sent, wordtokens before the loop in email_features() and print features after the loop. All three variables should (usually) be non-empty. (A sketch with these prints added is shown after this list.)
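A hedged sketch of email_features() with those prints added (Python 2 print statements, to match the original code; wordlemmatizer, word_tokenize and commonwords are assumed to be defined as in the question):

def email_features(sent):
    wordtokens = [wordlemmatizer.lemmatize(word.lower()) for word in word_tokenize(sent)]
    print sent, wordtokens  # sent should be a string, wordtokens a non-empty list
    features = {}
    for word in wordtokens:
        if word not in commonwords:
            features[word] = True
    print features          # should usually be non-empty
    return features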
I have this code, which is meant to open a specified file and count every while loop it contains, finally outputting the total number of while loops in that file. I decided to convert the input file into a dictionary, and then create a for loop so that every time the word while followed by a space was seen, it would add 1 to WHILE_, before finally printing WHILE_ at the end.
However, this did not work, and I am at a loss as to why. Any help fixing this would be much appreciated.
This is the code I have at the moment:
WHILE_ = 0
INPUT_ = input("Enter file or directory: ")
OPEN_ = open(INPUT_)
READLINES_ = OPEN_.readlines()
STRING_ = (str(READLINES_))
STRIP_ = STRING_.strip()
input_str1 = STRIP_.lower()

dic = dict()
for w in input_str1.split():
    if w in dic.keys():
        dic[w] = dic[w] + 1
    else:
        dic[w] = 1
DICT_ = (dic)

for LINE_ in DICT_:
    if ("while\\n',") in LINE_:
        WHILE_ += 1
    elif ('while\\n",') in LINE_:
        WHILE_ += 1
    elif ('while ') in LINE_:
        WHILE_ += 1

print ("while_loops {0:>12}".format((WHILE_)))
This is the input file I was working from:
'''A trivial test of metrics
Author: Angus McGurkinshaw
Date: May 7 2013
'''

def silly_function(blah):
    '''A silly docstring for a silly function'''
    def nested():
        pass
    print('Hello world', blah + 36 * 14)
    tot = 0  # This isn't a for statement
    for i in range(10):
        tot = tot + i
    if_im_done = false  # Nor is this an if
    print(tot)

blah = 3
while blah > 0:
    silly_function(blah)
    blah -= 1
while True:
    if blah < 1000:
        break
The output should be 2, but my code at the moment prints 0
This is an incredibly bizarre design. You're calling readlines to get a list of strings, then calling str on that list, which will join the whole thing up into one big string with the quoted repr of each line joined by commas and surrounded by square brackets, then splitting the result on spaces. I have no idea why you'd ever do such a thing.
Your bizarre variable names, extra useless lines of code like DICT_ = (dic), etc. only serve to obfuscate things further.
But I can explain why it doesn't work. Try printing out DICT_ after you do all that silliness, and you'll see that the only keys that include while are while and 'while. Since neither of these match any of the patterns you're looking for, your count ends up as 0.
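For instance (an illustrative two-line file, not the OP's exact input), this is roughly what that round trip does:

lines = ['while True:\n', '    break\n']  # what readlines() returns
print(str(lines))
# prints: ['while True:\n', '    break\n']
# after .split(), the "words" are tokens like "['while" and "break\n']",
# which is why the hand-written patterns never match.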
It's also worth noting that you only add 1 to WHILE_ even if there are multiple instances of the pattern, so your whole dict of counts is useless.
This will be a lot easier if you don't obfuscate your strings, try to recover them, and then try to match the incorrectly-recovered versions. Just do it directly.
While I'm at it, I'm also going to fix some other problems so that your code is readable, and simpler, and doesn't leak files, and so on. Here's a complete implementation of the logic you were trying to hack up by hand:
import collections

filename = input("Enter file: ")
counts = collections.Counter()
with open(filename) as f:
    for line in f:
        counts.update(line.strip().lower().split())
print('while_loops {0:>12}'.format(counts['while']))
When you run this on your sample input, you correctly get 2. And extending it to handle if and for is trivial and obvious.
However, note that there's a serious problem in your logic: Anything that looks like a keyword but is in the middle of a comment or string will still get picked up. Without writing some kind of code to strip out comments and strings, there's no way around that. Which means you're going to overcount if and for by 1. The obvious way of stripping—line.partition('#')[0] and similarly for quotes—won't work. First, it's perfectly valid to have a string before an if keyword, as in "foo" if x else "bar". Second, you can't handle multiline strings this way.
These problems, and others like them, are why you almost certainly want a real parser. If you're just trying to parse Python code, the ast module in the standard library is the obvious way to do this. If you want to write quick&dirty parsers for a variety of different languages, try pyparsing, which is very nice and comes with some great examples.
Here's a simple example:
import ast

filename = input("Enter file: ")
with open(filename) as f:
    tree = ast.parse(f.read())
while_loops = sum(1 for node in ast.walk(tree) if isinstance(node, ast.While))
print('while_loops {0:>12}'.format(while_loops))
Or, more flexibly:
import ast
import collections

filename = input("Enter file: ")
with open(filename) as f:
    tree = ast.parse(f.read())
counts = collections.Counter(type(node).__name__ for node in ast.walk(tree))
print('while_loops {0:>12}'.format(counts['While']))
print('for_loops {0:>14}'.format(counts['For']))
print('if_statements {0:>10}'.format(counts['If']))
So I have a problem.
I want to do something like this, where I look up a value and it prints out the keys associated with that value. I can even get it working:
def test(pet):
    dic = {'Dog': ['der Hund', 'der Katze'], 'Cat': ['der Katze'], 'Bird': ['der Vogel']}
    items = dic.items()
    key = dic.keys()
    values = dic.values()
    for x, y in items:
        for item in y:
            if item == pet:
                print x
However, when I incorporate this same code format into a larger program it stops working:
def movie(movie):
    file = open('/Users/Danrex/Desktop/Text.txt', 'rt')
    read = file.read()
    list = read.split('\n')
    actorList = []
    for item in list:
        actorList = actorList + [item.split(',')]
    actorDict = dict()
    for item in actorList:
        if item[0] in actorDict:
            actorDict[item[0]].append(item[1])
        else:
            actorDict[item[0]] = [item[1]]
    items = actorDict.items()
    for x, y in items:
        for item in y:
            if item == movie:
                print x
I have printed out actorDict, items, x, y, and item, and they all seem to follow the same format as in the previous code, so I can't figure out why this isn't working! So confused. And, please, when you explain it to me, do it as if I am a complete idiot, which I probably am.
Cleaning up the code with some more idiomatic Python will sometimes clarify things. This is how I would write it in Python 2.7:
from collections import defaultdict

def movie(movie):
    actorDict = defaultdict(list)
    movie_info_filename = '/Users/Danrex/Desktop/Text.txt'
    with open(movie_info_filename, 'rt') as fin:
        for line_item in fin:
            split_items = line_item.split(',')
            actorDict[split_items[0]].append(split_items[1])
    for actor, actor_info in actorDict.items():
        for info_item in actor_info:
            if info_item == movie:
                print actor
In this case, what mostly boiled out were temporary objects created for making the actorDict. defaultdict creates a dictionary-like object that allows one to specify a function to generate the default value for a key that isn't currently present. See the collections documentation for more info.
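For example, a quick illustration of that default behavior (the names and titles here are made up):

from collections import defaultdict

d = defaultdict(list)              # missing keys get a fresh empty list
d['Harrison Ford'].append('Star Wars')
d['Harrison Ford'].append('Indiana Jones')
print d['Harrison Ford']           # ['Star Wars', 'Indiana Jones']
print d['Someone Else']            # [] -- created on first access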
What it looks like you're trying to do is print out some actor value for each time they are listed with a particular movie in your text file.
If you're going to check more than one movie, make the actorDict once and reference your movies against that existing actorDict. This will save you trips to disk.
from collections import defaultdict

def make_actor_dict():
    actorDict = defaultdict(list)
    movie_info_filename = '/Users/Danrex/Desktop/Text.txt'
    with open(movie_info_filename, 'rt') as fin:
        for line_item in fin:
            split_items = line_item.split(',')
            actorDict[split_items[0]].append(split_items[1])
    return actorDict  # without this return, main() below would get None

def movie(movie, actorDict):
    for actor, actor_info in actorDict.items():
        for info_item in actor_info:
            if info_item == movie:
                print actor

def main():
    actorDict = make_actor_dict()
    movie('Star Wars', actorDict)
    movie('Indiana Jones', actorDict)
If you only care that the actor was in that movie, you don't have to iterate through the movie list manually, you can just check that movie is in actor_info:
def movie(movie, actorDict):
    for actor in actorDict:
        if movie in actorDict[actor]:
            print actor
Of course, you already figured out that the problem was the movie name not being an exact match for the text read from the file. If you want to allow less-than-exact matches, you should consider normalizing both your movie string and the data strings from the file. The string methods strip() and lower() can be really helpful there.
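A hedged sketch of what that normalization might look like (Python 2 style to match the code above; the exact strip()/lower() choices are illustrative):

def movie(movie, actorDict):
    target = movie.strip().lower()
    for actor, actor_info in actorDict.items():
        # compare normalized strings so stray whitespace or case differences don't matter
        if any(info_item.strip().lower() == target for info_item in actor_info):
            print actor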