I am attempting to write a Python script to download and unzip hundreds of files from an AWS server. As I understand it, these are I/O-bound tasks, so I would like to multithread the work to speed up processing.
Since I am new to Python, I've been reading guides like this one and that one on multithreading and multiprocessing.
Both of the above links suggest code that imports from the multiprocessing library, but I am running into trouble completing these imports. The second link above suggests the following code to illustrate multithreading:
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool
from urllib.request import urlopen

def run_tasks(function, args, pool, chunk_size=None):
    results = pool.map(function, args, chunk_size)
    return results

def work(n):
    with urlopen(f"https://www.google.com/#{n}") as f:
        contents = f.read(32)
    return contents

if __name__ == '__main__':
    numbers = [x for x in range(1, 100)]
    # Run the task using a thread pool
    t_p = ThreadPool()
    result = run_tasks(work, numbers, t_p)
    print(result)
    t_p.close()
When I tried running this script, I got the following error with traceback:
PS C:\Users\USERNAME> & "C:/Users/USERNAME/AppData/Local/Continuum/anaconda3/python.exe" "h:/Post-Processing/API Query/Python Test/subprocess_test/subprocess.py"
Traceback (most recent call last):
File "h:/Post-Processing/API Query/Python Test/subprocess_test/subprocess.py", line 38, in <module>
t_p = ThreadPool()
File "C:\Users\USERNAME\AppData\Local\Continuum\anaconda3\lib\multiprocessing\dummy\__init__.py", line 123, in Pool
from ..pool import ThreadPool
File "C:\Users\USERNAME\AppData\Local\Continuum\anaconda3\lib\multiprocessing\pool.py", line 26, in <module>
from . import util
File "C:\Users\USERNAME\AppData\Local\Continuum\anaconda3\lib\multiprocessing\util.py", line 17, in <module>
from subprocess import _args_from_interpreter_flags
ImportError: cannot import name '_args_from_interpreter_flags' from 'subprocess' (h:\PSO Post-Processing\API Query\Python Test\subprocess_test\subprocess.py)
I found this SO thread, in which the answer suggests adding
from subprocess import _args_from_interpreter_flags
to the list of imports. However, when I added this line, the import error seems to shift into my current script:
Traceback (most recent call last):
File "h:/Post-Processing/API Query/Python Test/subprocess_test/subprocess.py", line 20, in <module>
from subprocess import _args_from_interpreter_flags
File "h:\Post-Processing\API Query\Python Test\subprocess_test\subprocess.py", line 20, in <module>
from subprocess import _args_from_interpreter_flags
ImportError: cannot import name '_args_from_interpreter_flags' from 'subprocess' (h:\PSO Post-Processing\API Query\Python Test\subprocess_test\subprocess.py)
I am now suspecting that something is wrong with my Python installation, but I am not sure how to troubleshoot it.
I am running Windows 10 on a work computer and using Visual Studio Code as my editor. According to Visual Studio Code, I'm running Python 3.7.6 64-bit ('Continuum': virtualenv). I found that I have subprocess.py installed at
"C:\Users\USER\AppData\Local\Continuum\anaconda3\Lib\subprocess.py"
and this subprocess.py file indeed has a segment with
def _args_from_interpreter_flags():
    """Return a list of command-line arguments reproducing the current
    settings in sys.flags, sys.warnoptions and sys._xoptions."""
    flag_opt_map = {
        'debug': 'd',
        # 'inspect': 'i',
        # 'interactive': 'i',
        'dont_write_bytecode': 'B',
        'no_site': 'S',
        'verbose': 'v',
        'bytes_warning': 'b',
        'quiet': 'q',
        # -O is handled in _optim_args_from_interpreter_flags()
    }
    args = _optim_args_from_interpreter_flags()
    for flag, opt in flag_opt_map.items():
        v = getattr(sys.flags, flag)
        if v > 0:
            args.append('-' + opt * v)

    if sys.flags.isolated:
        args.append('-I')
    else:
        if sys.flags.ignore_environment:
            args.append('-E')
        if sys.flags.no_user_site:
            args.append('-s')

    # -W options
    warnopts = sys.warnoptions[:]
    bytes_warning = sys.flags.bytes_warning
    xoptions = getattr(sys, '_xoptions', {})
    dev_mode = ('dev' in xoptions)

    if bytes_warning > 1:
        warnopts.remove("error::BytesWarning")
    elif bytes_warning:
        warnopts.remove("default::BytesWarning")
    if dev_mode:
        warnopts.remove('default')
    for opt in warnopts:
        args.append('-W' + opt)

    # -X options
    if dev_mode:
        args.extend(('-X', 'dev'))
    for opt in ('faulthandler', 'tracemalloc', 'importtime',
                'showalloccount', 'showrefcount', 'utf8'):
        if opt in xoptions:
            value = xoptions[opt]
            if value is True:
                arg = opt
            else:
                arg = '%s=%s' % (opt, value)
            args.extend(('-X', arg))
    return args
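For good measure, a minimal check (a separate throwaway script, saved under a different name and run from the same folder) shows which file Python actually resolves the name subprocess to, without importing it:

import importlib.util

# If this prints h:\...\subprocess_test\subprocess.py rather than the copy
# under anaconda3\Lib, the local file is shadowing the standard-library module.
print(importlib.util.find_spec("subprocess").origin)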
Given all this information, I am sure that I'm missing a simple detail that's stopping the threading code from working. I appreciate any help you can give.
Thank you!!
Related
I'm new to PyFlink. I'm trying to write a Python program that reads data from a Kafka topic and prints the data to stdout. I followed the link Flink Python Datastream API Kafka Producer Sink Serializaion, but I keep seeing a NoSuchMethodError due to a version mismatch. I have added the flink-sql-kafka-connector available at https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.13.0/flink-sql-connector-kafka_2.11-1.13.0.jar. Can someone help me with a proper example to do this? Following is my code:
import json
import os

from pyflink.common import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
from pyflink.common.typeinfo import Types

def my_map(obj):
    json_obj = json.loads(json.loads(obj))
    return json.dumps(json_obj["name"])

def kafkaread():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.add_jars("file:///automation/flink/flink-sql-connector-kafka_2.11-1.10.1.jar")
    deserialization_schema = SimpleStringSchema()
    kafkaSource = FlinkKafkaConsumer(
        topics='test',
        deserialization_schema=deserialization_schema,
        properties={'bootstrap.servers': '10.234.175.22:9092', 'group.id': 'test'}
    )
    ds = env.add_source(kafkaSource).print()
    env.execute('kafkaread')

if __name__ == '__main__':
    kafkaread()
But Python doesn't recognize the jar file and throws the following error:
Traceback (most recent call last):
File "flinkKafka.py", line 31, in <module>
kafkaread()
File "flinkKafka.py", line 20, in kafkaread
kafkaSource = FlinkKafkaConsumer(
File "/automation/flink/venv/lib/python3.8/site-packages/pyflink/datastream/connectors.py", line 186, in __init__
j_flink_kafka_consumer = _get_kafka_consumer(topics, properties, deserialization_schema,
File "/automation/flink/venv/lib/python3.8/site-packages/pyflink/datastream/connectors.py", line 336, in _get_kafka_consumer
j_flink_kafka_consumer = j_consumer_clz(topics,
File "/automation/flink/venv/lib/python3.8/site-packages/pyflink/util/exceptions.py", line 185, in wrapped_call
raise TypeError(
TypeError: Could not found the Java class 'org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer'. The Java dependencies could be specified via command line argument '--jarfile' or the config option 'pipeline.jars'
What is the correct location to add the jar file?
I see that you downloaded flink-sql-connector-kafka_2.11-1.13.0.jar, but the code loads flink-sql-connector-kafka_2.11-1.10.1.jar.
Maybe you can check that.
You just need to check the path to the flink-sql-connector jar.
You should add the jar file of flink-sql-connector-kafka that matches your PyFlink and Scala versions. If the versions are right, check the path passed to the add_jars function and make sure the jar package is actually there.
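For instance, since the question downloaded the 1.13.0 connector, the add_jars call should reference that same file (a minimal sketch; the file:/// path is taken from the question and should match wherever the jar actually lives):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Load the jar that was actually downloaded (1.13.0), not the old 1.10.1 one.
env.add_jars("file:///automation/flink/flink-sql-connector-kafka_2.11-1.13.0.jar")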
I would like to implement multiprocessing in a simulation which I have written in Python. The simulation is very extensive, and to keep the code clean I have split it into a number of modules.
One of the modules is now supposed to do some number crunching, so I'd like to use multiprocessing there. However, I keep running into an issue, as I cannot employ an if __name__ == "__main__" guard within the module.
I can reproduce the error by running the following:
# filename: test_mp_module.py
import concurrent.futures

def test_fct(arg):
    return arg

class TestMpModule():
    def __init__(self):
        pass

    def do(self):
        para = [1, 2, 3]
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(test_fct, para)
            for result in results:
                print(result)
and
# filename: main.py
from test_mp_module import TestMpModule
test = TestMpModule()
test.do()
The Exception displayed states:
runfile('C:/XXX/test_mp.py', wdir='C:/XXX')
Reloaded modules: test_mp_module
Traceback (most recent call last):
File "C:\XXX\test_mp.py", line 17, in <module>
test.do()
File "C:\XXX\test_mp_module.py", line 22, in do
for result in results:
File "C:\YYY\Anaconda3\lib\concurrent\futures\process.py", line 484, in _chain_from_iterable_of_lists
for element in iterable:
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 611, in result_iterator
yield fs.pop().result()
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 439, in result
return self.__get_result()
File "C:\YYY\Anaconda3\lib\concurrent\futures\_base.py", line 388, in __get_result
raise self._exception
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm using Python 3.8.3, usually execute my code in Spyder, and run on a Windows machine.
How may I adapt my code to utilise multiprocessing within a module? Would that even be possible in the first place? I found very conflicting statements.
Any help is appreciated, cheers.
Try this for your file "main.py" (on Windows, multiprocessing starts child processes by spawning a fresh interpreter that re-imports your modules, so anything that creates processes must sit behind an if __name__ == '__main__': guard, otherwise the children re-execute the top-level code and the pool breaks):
if __name__ == '__main__':
    test = TestMpModule()
    test.do()
For the multiprocessing part, I recommend using the multiprocessing package. Here is a little example of how to use it:
import multiprocessing

def my_func(i):
    return i

if __name__ == '__main__':
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        outputs = p.starmap(my_func, [(i, ) for i in range(5)])
    print(outputs)  # > [0, 1, 2, 3, 4]
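Since my_func takes a single argument, p.map(my_func, range(5)) would do the same job here; starmap is the variant to reach for when the worker function takes several arguments.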
I found a solution, though I am not sure it is considered pretty. The name guard needs to be carried into the module as follows:
# filename: test_mp_module.py
import concurrent.futures

def test_fct(i):
    return i

class TestMpModule():
    def __init__(self):
        pass

    def do(self, name_guard):
        para = [1, 2, 3]
        if name_guard == 'parent_module_name':  # check parent module name here
            with concurrent.futures.ProcessPoolExecutor() as executor:
                results = executor.map(test_fct, para)
                for result in results:
                    print(result)
and
from test_mp_module import TestMpModule

if __name__ == "__main__":
    name_guard = "parent_module_name"  # insert __name__ here
    test = TestMpModule()
    test.do(name_guard)
Works fine now.
I have a Python script that is being run on some Macs by my MDM tool, which means the script runs as root. Part of the script needs to run as the currently logged-in local user. I found the article below on using Popen to do this:
Run child processes as a different user from a long-running process
However, I get an error when I attempt to use this method on any pre-macOS 10.13 computer. These are still modern OS versions, such as 10.12 and 10.11. I have not been able to track this error down. Please see the code below.
Note: There are likely some extra import statements as this is pulled from a larger script. This snippet should work as-is.
#!/usr/bin/python
import subprocess
import platform
import os
import pwd
import sys
import hashlib
import plistlib
import time
from SystemConfiguration import SCDynamicStoreCopyConsoleUser
from distutils.version import StrictVersion as SV

def getLoggedInUserUID():
    userUID = SCDynamicStoreCopyConsoleUser(None, None, None)[1]
    return userUID

def getLoggedInUsername():
    username = (SCDynamicStoreCopyConsoleUser(None, None, None) or [None])[0]
    username = [username, ""][username in [u"loginwindow", None, u""]]
    return username

def getLoggedInUserGID():
    username = getLoggedInUsername()
    pwRecord = pwd.getpwnam(username)
    userGID = pwRecord.pw_gid
    return userGID

def getLoggedInUserHomeDir():
    username = getLoggedInUsername()
    pwRecord = pwd.getpwnam(username)
    homeDir = pwRecord.pw_dir
    return homeDir

def demote():
    def result():
        os.setgid(getLoggedInUserGID())
        os.setuid(getLoggedInUserUID())
    return result

def setupEnvironment():
    environment = os.environ.copy()
    environment['HOME'] = str(getLoggedInUserHomeDir())
    environment['LOGNAME'] = str(getLoggedInUsername())
    environment['PWD'] = str(getLoggedInUserHomeDir())
    environment['USER'] = str(getLoggedInUsername())
    return environment

def launchCommand():
    command = ['echo', 'whoami']
    process = subprocess.Popen(command,
                               stdout=subprocess.PIPE,
                               preexec_fn=demote(),
                               cwd=str(getLoggedInUserHomeDir()),
                               env=setupEnvironment())

def main():
    launchCommand()

if __name__ == "__main__":
    main()
The error that I get is:
Traceback (most recent call last):
File "testScript.py", line 60, in <module>
main()
File "testScript.py", line 57, in main
launchCommand()
File "testScript.py", line 54, in launchCommand
env=setupEnvironment())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
KeyError: 'getpwnam(): name not found: '
It looks like it is missing some key value, but I cannot for the life of me figure out what it is. Any help in tracking this down so I can run the command as the logged-in user would be greatly appreciated.
Thanks in Advance,
Ed
The way that I did it in my small MDM-managed environment is this:
I developed a small, windowless LoginItem helper app that starts for every open user session on the Mac and listens for a custom distributed system notification. It also has a function for executing terminal commands without showing a terminal window (you can find examples of this on Stack Overflow).
In the notification I transmit two parameters to all apps currently running on the system: a username and a terminal command string. All of the running user instances receive the notification, then each checks whether it is running as that user, and the one that is executes the command in that user's name.
Try this if it fits your requirement.
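As a rough sketch of the posting side only (this assumes the pyobjc bridge available to the macOS system Python; the notification name and payload keys are made up for illustration):

from Foundation import NSDistributedNotificationCenter

# "com.example.runAsUser" and the payload keys are hypothetical; the helper
# app running in each user session observes this notification name, compares
# the username to its own, and only the matching instance runs the command.
center = NSDistributedNotificationCenter.defaultCenter()
center.postNotificationName_object_userInfo_deliverImmediately_(
    "com.example.runAsUser",
    None,
    {"username": "jappleseed", "command": "say hello"},
    True,
)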
I am having the hardest time sharing a string between processes. I looked at the following Q1, Q2, and Q3 but still fail in an actual example. If I understood the docs correctly, locking should be done automatically, so there should be no race condition when setting/reading. Here's what I've got:
#!/usr/bin/env python3
import multiprocessing
import ctypes
import time

SLEEP = 0.1
CYCLES = 20

def child_process_fun(share):
    for i in range(CYCLES):
        time.sleep(SLEEP)
        share.value = str(time.time())

if __name__ == '__main__':
    share = multiprocessing.Value(ctypes.c_wchar_p, '')
    process = multiprocessing.Process(target=child_process_fun, args=(share,))
    process.start()
    for i in range(CYCLES):
        time.sleep(SLEEP)
        print(share.value)
which produces:
Traceback (most recent call last):
File "test2.py", line 23, in <module>
print(share.value)
File "<string>", line 5, in getvalue
ValueError: character U+e479b7b0 is not in range [U+0000; U+10ffff]
Edit: id(share.value) is different for each process. However, if I use a double as the shared variable instead, they are the same and it works like a charm. Could this be a Python bug?
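A likely explanation rather than a Python bug: ctypes.c_wchar_p puts a raw pointer into shared memory, and that pointer is only meaningful inside the process that assigned it, which is why a plain double works while a string does not. A minimal sketch of one workaround, using a Manager so the string is proxied between processes (names and timings reused from the example above):

import multiprocessing
import time

SLEEP = 0.1
CYCLES = 20

def child_process_fun(share):
    for i in range(CYCLES):
        time.sleep(SLEEP)
        share.value = str(time.time())  # assignment goes through the proxy

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        # The first argument is a typecode that the proxy does not interpret;
        # the value itself can be an arbitrary Python string.
        share = manager.Value(str, '')
        process = multiprocessing.Process(target=child_process_fun, args=(share,))
        process.start()
        for i in range(CYCLES):
            time.sleep(SLEEP)
            print(share.value)
        process.join()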
When using python's sh module (not a part of stdlib), I can call a program in my path as a function and run it in the background:
from sh import sleep
# doesn't block
p = sleep(3, _bg=True)
print("prints immediately!")
p.wait()
print("...and 3 seconds later")
And I can use sh's Command wrapper and pass in the absolute path of an executable (helpful if the executable isn't in my path or has characters such as .):
import sh
run = sh.Command("/home/amoffat/run.sh")
run()
But trying to run the wrapped executable in the background, as follows:
import sh
run = sh.Command("/home/amoffat/run.sh", _bg=True)
run()
Fails with a traceback error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument '_bg'
How can I run an executable wrapped by sh.Command in the background? Looking for an elegant solution.
EDIT:
I used the Python interpreter to test passing _bg to the command (not the wrapper), which I now realize is a bad way to test for blocking vs. non-blocking processes:
>>> import sh
>>> hello = sh.Command("./hello.py")
>>> hello(_bg=True) # 5 second delay before the following prints and prompt is returned
HI
HI
HI
HI
HI
With hello.py being as follows:
#!/usr/bin/python
import time

for i in xrange(5):
    time.sleep(1)
    print "HI"
import sh
run = sh.Command("/home/amoffat/run.sh", _bg=True)  # this isn't your command, so _bg does not apply
run()
Instead, do
import sh
run = sh.Command("/home/amoffat/run.sh")
run(_bg=True)
(BTW, the subprocess module provides a much less magical way to do such things.)
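For completeness, a minimal sketch of that subprocess route, reusing the hypothetical script path from above: Popen returns as soon as the child is started, and wait() blocks until it exits.

import subprocess

# Start the script in the background; Popen does not block.
p = subprocess.Popen(["/home/amoffat/run.sh"])
print("prints immediately!")
p.wait()  # ...and blocks here until the script finishes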