I have a problem using the coherence model. My code is:
def compute_coherence_values(dictionary, corpus, texts, limit, start, step):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence="c_v")
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

coherence_values = []
model_list = []

# topic number
nt = pre_nt
start_ = nt
limit_ = nt + 1
step_ = 1

model_list1, coherence_values1 = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=texts_wi_new,
                                                          start=start_, limit=limit_, step=step_)
and the error is
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\spawn.py", line 105, in spawn_main
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "<input>", line 92, in compute_coherence_values
File "D:\All Python\venv\lib\site-packages\gensim\models\coherencemodel.py", line 609, in get_coherence
confirmed_measures = self.get_coherence_per_topic()
File "D:\All Python\venv\lib\site-packages\gensim\models\coherencemodel.py", line 569, in get_coherence_per_topic
self.estimate_probabilities(segmented_topics)
File "D:\All Python\venv\lib\site-packages\gensim\models\coherencemodel.py", line 541, in estimate_probabilities
self._accumulator = self.measure.prob(**kwargs)
File "D:\All Python\venv\lib\site-packages\gensim\topic_coherence\probability_estimation.py", line 156, in p_boolean_sliding_window
return accumulator.accumulate(texts, window_size)
File "D:\All Python\venv\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 444, in accumulate
workers, input_q, output_q = self.start_workers(window_size)
File "D:\All Python\venv\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 478, in start_workers
worker.start()
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe
exitcode = _main(fd)
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\runpy.py", line 261, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "C:\Users\lee96\AppData\Local\Programs\Python\Python37\Lib\runpy.py", line 231, in _get_code_from_file
with open(fname, "rb") as f:
OSError: [Errno 22] Invalid argument: 'D:\\All Python\\<input>'
The error occurs at this call:
coherencemodel.get_coherence()
I use PyCharm. How can I solve it?
I'm having the exact same issue with the exact same code. The code works perfectly fine when I run it from my Spyder IDE, but when I plug it into Power BI, it errors out. So far, I've broken it out of the function and loop into the basic lines below. The LDA and coherence models run fine, but for some reason the call to get_coherence() errors out.
model = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)
coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
test = coherencemodel.get_coherence()
Below is part of the error message I received back:
RuntimeError: An attempt has been made to start a new process before
the current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your child
processes and you have forgotten to use the proper idiom in the main
module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program is not going
to be frozen to produce an executable.
Details:
DataSourceKind=Python
DataSourcePath=Python
Message=Python script error.
I did some more research on this and found a few other articles that were helpful for me, but ultimately it seems like the errors come down to how multiprocessing behaves on Windows:
Where to put freeze_support() in a Python script?
https://docs.python.org/2/library/multiprocessing.html#windows
What worked for me was placing all of my code under the following guard:
if __name__ == '__main__':
    freeze_support()
    model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=texts, start=start, limit=limit, step=step)
    max_value = max(coherence_values)
    max_index = coherence_values.index(max_value)
    best_model = model_list[max_index]
    ldamodel = best_model
I'm not the greatest Python developer, but I got it working for what I needed. If others have better suggestions, I'm all ears :)
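For anyone who wants a complete, self-contained version of this fix, here is a minimal sketch of the guarded workflow. The tiny example texts and the processes=1 argument (which, in the gensim versions I've used, keeps CoherenceModel from spawning worker processes at all) are illustrative assumptions, not part of the original code:
# Sketch only: everything that spawns processes lives under the __main__ guard
# so Windows' spawn start method can re-import the module safely.
import gensim
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from multiprocessing import freeze_support

def compute_coherence_values(dictionary, corpus, texts, limit, start, step):
    coherence_values, model_list = [], []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                                num_topics=num_topics)
        model_list.append(model)
        # processes=1 keeps the coherence estimation single-process (assumption:
        # your gensim version exposes this parameter); drop it to use the default
        cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                            coherence="c_v", processes=1)
        coherence_values.append(cm.get_coherence())
    return model_list, coherence_values

if __name__ == "__main__":
    freeze_support()  # only needed if the script is frozen into an executable
    texts = [["human", "interface", "computer"], ["survey", "user", "computer"]]
    id2word = Dictionary(texts)
    corpus = [id2word.doc2bow(t) for t in texts]
    models, values = compute_coherence_values(dictionary=id2word, corpus=corpus,
                                              texts=texts, start=2, limit=4, step=1)
    print(values)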
Related
Traceback (most recent call last):
File "C:\Users\HP\Downloads\1\main.py", line 27, in <module>
p.start()
File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\context.py", line 326, in _Popen
return Popen(process_obj)
File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
pickle.PicklingError: Can't pickle <function <lambda> at 0x039E9538>: attribute lookup <lambda> on __main__ failed
C:\Users\HP\Downloads\1>Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\spawn.py", line 102, in spawn_main
source_process = _winapi.OpenProcess(
OSError: [WinError 87] The parameter is incorrect
""
I'm trying to open multiple chromedrivers/file_N.py scripts at the same time, but it shows the error above.
The code takes template.py as a sample, creates multiple copies of it like file_1.py, and so on, and then runs all of the files it created, but it is not working.
The code is:
for proxy in file:
    shutil.copyfile('/home/hp/Documents/pr/proxy_project/template.py', '/home/hp/Documents/pr/proxy_project/file_{}.py'.format(i))
    file_name = "file_{}".format(i)
    file_list.append(file_name)
    with open('/home/hp/Documents/pr/proxy_project/file_{}.py'.format(i), 'r') as replicated_file:
        data = replicated_file.readlines()
    data_to_be_added = "proxy = '{}'\n".format(proxy.strip("\n"))
    data[16] = data_to_be_added
    with open('/home/hp/Documents/pr/proxy_project/file_{}.py'.format(i), 'w') as new_replicated_file:
        new_replicated_file.writelines(data)
    i += 1

for file_py in tuple(file_list):
    print("File name = {}".format(file_py))
    p = multiprocessing.Process(target=lambda: __import__(file_py))
    p.start()
There is another file, template.py, that gets replicated when this code runs.
On Windows, when using multiprocessing you have to protect your main code with if __name__ == '__main__': - try reading the multiprocessing documentation. – barny Mar 23 at 19:46
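For what it's worth, the PicklingError above happens because, with the spawn start method, the child process has to pickle the target, and a lambda cannot be pickled. A minimal sketch of one way around both issues, assuming the generated file_N.py modules are importable from the working directory, is to use a named module-level function and keep the launching code under the main guard:
# Sketch only: import each generated script in its own process.
import importlib
import multiprocessing

def run_module(module_name):
    # a named, module-level function can be pickled by the spawn start method,
    # unlike a lambda defined inline
    importlib.import_module(module_name)

if __name__ == '__main__':
    file_list = ["file_0", "file_1"]  # placeholder for the generated names
    processes = []
    for file_py in file_list:
        p = multiprocessing.Process(target=run_module, args=(file_py,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()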
I have a large dataframe with a column "image"; the data in "image" is the file name (with extension "jpg" or "jpeg") of a large number of files. Some files exist with the right extension, but others do not, so I have to check whether the "image" data is right. It takes 30 seconds with a single process, so I decided to parallelize it with multiprocessing.
I have written code in Python (3.6.5) to check this; it runs well when I execute it from the command line, but an error occurs when I execute it in Spyder (3.2.8). What can I do to avoid this?
Here is my code:
# -*- coding: utf-8 -*-
import multiprocessing
import numpy as np
import os
import pandas as pd
from multiprocessing import Pool

# some large-scale DataFrame, the size is about (600, 15)
waferDf = pd.DataFrame({"image": ["aaa.jpg", "bbb.jpeg", "ccc.jpg", "ddd.jpeg", "eee.jpg", "fff.jpg", "ggg.jpeg", "hhh.jpg"]})
waferDf["imagePath"] = np.nan

# to parallelize the whole process
def parallelize(func, df, uploadedDirPath):
    partitionCount = multiprocessing.cpu_count()
    partitions = np.array_split(df, partitionCount)
    paras = [(part, uploadedDirPath) for part in partitions]
    pool = Pool(partitionCount)
    df = pd.concat(pool.starmap(func, paras))
    pool.close()
    pool.join()
    return df

# check whether files exist
def checkImagePath(partialDf, uploadedDirPath):
    for index in partialDf.index.values:
        print(index)
        if os.path.exists(os.path.join(uploadedDirPath, partialDf.loc[index, ["image"]][0].replace(".jpeg\n", ".jpeg"))):
            partialDf.loc[index, ["imagePath"]][0] = os.path.join(uploadedDirPath, partialDf.loc[index, ["image"]][0].replace(".jpeg\n", ".jpeg"))
        elif os.path.exists(os.path.join(uploadedDirPath, partialDf.loc[index, ["image"]][0].replace(".jpeg\n", ".jpg"))):
            partialDf.loc[index, ["imagePath"]][0] = os.path.join(uploadedDirPath, partialDf.loc[index, ["image"]][0].replace(".jpeg\n", ".jpg"))
    print(partialDf)
    return partialDf

if __name__ == '__main__':
    waferDf = parallelize(checkImagePath, waferDf, "/eap/uploadedFiles/")
    print(waferDf)
and here is the error:
runfile('C:/Users/00048564/Desktop/Multi-Threading.py', wdir='C:/Users/00048564/Desktop')
Traceback (most recent call last):
File "<ipython-input-24-732edc0ea3ea>", line 1, in <module>
runfile('C:/Users/00048564/Desktop/Multi-Threading.py', wdir='C:/Users/00048564/Desktop')
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/00048564/Desktop/Multi-Threading.py", line 35, in <module>
waferDf = parallelize(checkImagePath, waferDf, "/eap/uploadedFiles/")
File "C:/Users/00048564/Desktop/Multi-Threading.py", line 17, in parallelize
pool = Pool(partitionCount)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 174, in __init__
self._repopulate_pool()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
w.start()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 172, in get_preparation_data
main_mod_name = getattr(main_module.__spec__, "name", None)
AttributeError: module '__main__' has no attribute '__spec__'
In most cases, when you run a Python script from the command line with python YourFile.py, the script is executed as the main program; hence it was able to call the required modules, such as multiprocessing and the other modules shown in your error trace.
However, your Spyder configuration could be different, and your instruction to run the script as the main program is not taking effect.
Were you able to successfully run any script from Spyder that has
if __name__ == '__main__':
Read the accepted answer on this thread: https://stackoverflow.com/a/419185/9968677
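If it helps to isolate the problem, here is a minimal, self-contained test (the file name and numbers are arbitrary) that exercises the same spawn machinery; if it runs from a terminal but fails inside Spyder, the issue is the IDE's run configuration rather than your code:
# mp_smoke_test.py -- hypothetical minimal check that spawn-based
# multiprocessing works in the current environment
import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(2) as pool:
        print(pool.map(square, range(10)))  # expected: [0, 1, 4, ..., 81]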
I'm generating a list filled with sublists of randomly generated 0s and 1s, and then trying to compare each list with every other list to determine their similarity, efficiently.
I know that my code works with a single process (i.e. without involving multiprocessing), but once I start involving multiprocessing.Pool() or multiprocessing.Process(), everything starts to break.
I want to compare how long a single process would take compared to multiple processes. I've tried this with threading, but a single process actually ended up taking less time, probably due to the Global Interpreter Lock.
Here's my code:
import difflib
import secrets
import timeit
import multiprocessing
import numpy

random_lists = [[secrets.randbelow(2) for _ in range(500)] for _ in range(500)]
random_lists_split = numpy.array_split(numpy.array(random_lists), 5)

def get_similarity_value(lists_to_check, sublists_to_check) -> list:
    ratios = []
    matcher = difflib.SequenceMatcher()
    for sublist_major in sublists_to_check:
        try:
            sublist_major = sublist_major.tolist()
        except AttributeError:
            pass
        for sublist_minor in lists_to_check:
            if sublist_major == sublist_minor or [lists_to_check.index(sublist_major), lists_to_check.index(sublist_minor)] in [ratios[i][1] for i in range(len(ratios))] or [lists_to_check.index(sublist_minor), lists_to_check.index(sublist_major)] in [ratios[i][1] for i in range(len(ratios))]:  # or lists_to_check.index(sublist_major.tolist()) > lists_to_check.index(sublist_minor):
                pass
            else:
                matcher.set_seqs(sublist_major, sublist_minor)
                ratios.append([matcher.ratio(), sorted([lists_to_check.index(sublist_major), lists_to_check.index(sublist_minor)])])
    return ratios

def start():
    test = multiprocessing.Pool(4)
    data = [(random_lists, random_lists_split[i]) for i in range(len(random_lists_split))]
    print(test.map(get_similarity_value, data))

statement = timeit.Timer(start)
print(statement.timeit(1))

statement2 = timeit.Timer(lambda: get_similarity_value(random_lists, random_lists))
print(statement2.timeit(1))
And here's the error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "timings.py", line 38, in <module>
print(statement.timeit(1))
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\timeit.py", line 178, in timeit
timing = self.inner(it, self.timer)
File "<timeit-src>", line 6, in inner
File "timings.py", line 32, in start
test = multiprocessing.Pool(4)
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\pool.py", line 174, in __init__
self._repopulate_pool()
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
w.start()
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
N.B. I have tried using multiprocessing.freeze_support() but it results in the same error. The code also seems to be attempting to run indefinitely, as the error appears over and over again.
Thanks!
The problem is that your top-level code—including the code that creates the child Process—is not protected from being run in the child processes.
As the docs explain, if you're not using the fork start method (and since you're on Windows, you're not):
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
In fact, it's nearly identical to the example that follows that warning. You're launching a whole pool of children instead of just one, but it's the same problem. Every child in the pool tries to launch a new pool, and, fortunately, multiprocessing figures out that something bad is going on and fails with a RuntimeError instead of exponentially spawning processes until Windows refuses to spawn anymore or its scheduler just falls down.
As the docs say:
Instead one should protect the “entry point” of the program by using if __name__ == '__main__':
In your case, that means this part:
if __name__ == '__main__':
    statement = timeit.Timer(start)
    print(statement.timeit(1))

    statement2 = timeit.Timer(lambda: get_similarity_value(random_lists, random_lists))
    print(statement2.timeit(1))
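As a side note, with the spawn start method every worker re-imports your module, so the module-level generation of random_lists runs once in each child as well. A minimal sketch of one way to avoid that, under the assumption that you are happy to build the data inside the guard and let Pool.map ship it to the workers (note the worker here takes a single packed argument and uses a placeholder body, unlike the original), is:
import multiprocessing
import secrets
import numpy

def get_similarity_value(args):
    # placeholder body: unpack the single argument Pool.map passes,
    # then run the same comparison logic as in the question
    lists_to_check, sublists_to_check = args
    return len(sublists_to_check)  # stand-in result

if __name__ == '__main__':
    random_lists = [[secrets.randbelow(2) for _ in range(500)] for _ in range(500)]
    random_lists_split = numpy.array_split(numpy.array(random_lists), 5)
    data = [(random_lists, part) for part in random_lists_split]
    with multiprocessing.Pool(4) as pool:
        print(pool.map(get_similarity_value, data))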
To get more hands-on experience, I wanted to try a word-count project.
Here is the sample data I have:
The United Nations (UN) is an intergovernmental organisation
established on 24 October 1945 to promote international cooperation. A
replacement for the ineffective League of Nations, the organisation
was created following World War II to prevent another such conflict.
[...]
and I used the following Python code to get my result:
from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieRatings(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings),
        ]

    def mapper_get_ratings(self, _, line):
        (word) = line.split(' ')
        yield word, 1

    def reducer_count_ratings(self, key, values):
        yield Key, sum(values)

if __name__ == '__main__':
    MovieRatings.run()
I am getting the following error with Python 2:
[root@localhost Desktop]# python RatingsBreakdown.py UN.txt
Traceback (most recent call last):
File "RatingsBreakdown.py", line 1, in <module>
from mrjob.job import MRJob
File "/usr/lib/python2.6/site-packages/mrjob/job.py", line 1106
for k, v in unfiltered_jobconf.items() if v is not None
^
SyntaxError: invalid syntax
And with Python 3:
[root@localhost Desktop]# python3 RatingsBreakdown.py UN.txt
No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 2...
Creating temp directory /tmp/RatingsBreakdown.training.20171128.083536.602598
Error while reading from /tmp/RatingsBreakdown.training.20171128.083536.602598/step/000/mapper/00000/input:
Traceback (most recent call last):
File "RatingsBreakdown.py", line 25, in <module>
RatingsBreakdown.run()
File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 424, in run
mr_job.execute()
File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 445, in execute
super(MRJob, self).execute()
File "/usr/lib/python3.4/site-packages/mrjob/launch.py", line 185, in execute
self.run_job()
File "/usr/lib/python3.4/site-packages/mrjob/launch.py", line 233, in run_job
runner.run()
File "/usr/lib/python3.4/site-packages/mrjob/runner.py", line 511, in run
self._run()
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 144, in _run
self._run_mappers_and_combiners(step_num, map_splits)
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 185, in _run_mappers_and_combiners
for task_num, map_split in enumerate(map_splits)
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 120, in _run_multiple
func()
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 662, in _run_mapper_and_combiner
run_mapper()
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 685, in _run_task
stdin, stdout, stderr, wd, env)
File "/usr/lib/python3.4/site-packages/mrjob/inline.py", line 92, in invoke_task
task.execute()
File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 433, in execute
self.run_mapper(self.options.step_num)
File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 517, in run_mapper
for out_key, out_value in mapper(key, value) or ():
File "RatingsBreakdown.py", line 13, in mapper_get_ratings
(userID, movieID, rating, timestamp) = line.split('\t')
ValueError: need more than 1 value to unpack
And with my MovieRatings.py:
[root@localhost Desktop]# python3 MovieRatings.py UN.txt
No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/MovieRatings.training.20171128.083635.368889
Error while reading from /tmp/MovieRatings.training.20171128.083635.368889/step/000/reducer/00000/input:
Traceback (most recent call last):
File "MovieRatings.py", line 20, in <module>
MovieRatings.run()
File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 424, in run
mr_job.execute()
File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 445, in execute
super(MRJob, self).execute()
File "/usr/lib/python3.4/site-packages/mrjob/launch.py", line 185, in execute
self.run_job()
File "/usr/lib/python3.4/site-packages/mrjob/launch.py", line 233, in run_job
runner.run()
File "/usr/lib/python3.4/site-packages/mrjob/runner.py", line 511, in run
self._run()
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 150, in _run
self._run_reducers(step_num, num_reducer_tasks)
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 246, in _run_reducers
for task_num in range(num_reducer_tasks)
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 120, in _run_multiple
func()
File "/usr/lib/python3.4/site-packages/mrjob/sim.py", line 685, in _run_task
stdin, stdout, stderr, wd, env)
File "/usr/lib/python3.4/site-packages/mrjob/inline.py", line 92, in invoke_task
task.execute()
File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 439, in execute
self.run_reducer(self.options.step_num)
File "/usr/lib/python3.4/site-packages/mrjob/job.py", line 560, in run_reducer
for out_key, out_value in reducer(key, values) or ():
File "MovieRatings.py", line 17, in reducer_count_ratings
yield Key, sum(values)
NameError: name 'Key' is not defined
I would like to solve the error and understand what my mistake is.
It seems like this library only works in Python 3.
File "RatingsBreakdown.py", line 13, in mapper_get_ratings
(userID, movieID, rating, timestamp) = line.split('\t')
ValueError: need more than 1 value to unpack
First, you ran RatingsBreakdown.py, not the MovieRatings code you posted. Also, the input you show contains no tabs, yet you tried to unpack four tab-separated columns; it's not really clear what you expected here.
File "MovieRatings.py", line 17, in reducer_count_ratings
yield Key, sum(values)
NameError: name 'Key' is not defined
Self-explanatory: your variable is lowercase key, not Key.
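For reference, the corrected reducer only needs to use the parameter name it actually defines:
def reducer_count_ratings(self, key, values):
    # 'key' (lowercase) is the parameter defined above; 'Key' was never defined
    yield key, sum(values)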
You are trying the examples in this course (link) right?
According to this issue, mrjob has dropped support for Python 2.6.
How I fixed it was by installing Python 2.7 (reference) on the CentOS VM, then setting up pip (reference) and installing mrjob again.
Everything works now by running this:
python2.7 RatingsBreakdown.py u.data
I worked on the same problem but without using the steps function, and it worked.
from mrjob.job import MRJob

class wordcount(MRJob):

    def mapper(self, _, line):
        (word) = line.split(' ')
        yield word, 1

    def reducer(self, x, count):
        yield x, sum(count)

if __name__ == '__main__':
    wordcount.run()
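One caveat: line.split(' ') returns the whole list of words, so the mapper above emits one pair per line (keyed by the full word list) rather than one pair per word. A minimal per-word variant, sketched on the same mrjob structure, would be:
from mrjob.job import MRJob

class WordCount(MRJob):

    def mapper(self, _, line):
        # emit one (word, 1) pair per word instead of the whole list at once
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()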
I'm trying to multiprocess some code I've just written in Python, using the multiprocessing library.
Here is the part of the code I'm trying to multiprocess:
for b in (Bulle):
    p = []
    for i in range(len(X)):
        print(i, "/", len(X))
        for j in range(len(Y)):
            for n in b.voisins:
                c_pos = [X[i][j], Y[i][j]]
                if __name__ == "__main__":
                    p.append(Process(target=fonction_courant2, args=(b, n, c_pos)))
    for i in (range(len(p))): p[i].start()
    for i in (range(len(p))): p[i].join()
Output:
NotifierThreadProc: could not create trigger pipe
Traceback (most recent call last):
File "multi_t_bubbles.py", line 113, in <module>
for i in (range(len(p))): p[i].start()
File "/usr/lib/python3.4/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.4/multiprocessing/context.py", line 212, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.4/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.4/multiprocessing/popen_fork.py", line 21, in __init__
self._launch(process_obj)
File "/usr/lib/python3.4/multiprocessing/popen_fork.py", line 69, in _launch
parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files
I'm fairly new to multiprocessing (I took https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Process as a base for this code) and I don't understand this output. It seems that the creation of the Processes is OK, but not the start.
I would be really grateful if you have any suggestions!
Tom
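A general note for anyone hitting the same limit: every Process needs its own pipes, so creating one per (i, j, n) combination exhausts file descriptors quickly. A minimal sketch of the usual pattern, bounding concurrency with a Pool (all names and data below are placeholders standing in for Bulle, X, Y, and fonction_courant2, not the original implementation), is:
# Sketch: bound the number of simultaneous workers with a Pool instead of
# one Process per grid cell.
from multiprocessing import Pool

class Bubble:
    # placeholder for the question's bubble objects
    def __init__(self, voisins):
        self.voisins = voisins

def fonction_courant2(b, n, c_pos):
    return (n, c_pos)  # stand-in for the real computation

if __name__ == "__main__":
    # tiny placeholder data so the sketch runs on its own
    X = [[0.0, 1.0], [2.0, 3.0]]
    Y = [[0.0, 1.0], [2.0, 3.0]]
    Bulle = [Bubble(voisins=[1, 2]), Bubble(voisins=[3])]

    tasks = [(b, n, [X[i][j], Y[i][j]])
             for b in Bulle
             for i in range(len(X))
             for j in range(len(Y))
             for n in b.voisins]

    with Pool() as pool:  # defaults to one worker per CPU core
        results = pool.starmap(fonction_courant2, tasks)
    print(len(results), "tasks completed")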