Running Python on Hadoop

I am trying to run a very simple python script via hive and hadoop.
This is my script:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    line = line.strip()
    nums = line.split()
    i = nums[0]
    print i
And I want to run it on the following table:
hive> select * from test;
OK
1 3
2 2
3 1
Time taken: 0.071 seconds
hive> desc test;
OK
col1 int
col2 string
Time taken: 0.215 seconds
I am running:
hive> select transform (col1, col2) using './proba.py' from test;
But always get something like:
...
2011-11-18 12:23:32,646 Stage-1 map = 0%, reduce = 0%
2011-11-18 12:23:58,792 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201110270917_20215 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
I have tried many different modifications of this procedure, but it always fails. :(
Am I doing something wrong, or is there a problem with my hive/hadoop installation?

A few things I'd check if I were debugging this:
1) Is the Python file set to be executable (chmod +x file.py)?
2) Make sure the Python file is in the same place on all machines. Probably better: put the file in HDFS, then you can use " using 'hdfs://path/to/file.py' " instead of a local path.
3) Take a look at your job on the Hadoop dashboard (http://master-node:9100). If you click on a failed task, it will give you the actual Java error and stack trace, so you can see what went wrong with the execution.
4) Make sure Python is installed on all the slave nodes! (I always overlook this one.)
Hope that helps...
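One more check that needs no cluster at all is to feed the script a few sample rows locally. By default Hive's TRANSFORM hands each row to the script as one line on stdin with the columns separated by tabs, so a minimal sketch of a more defensive version of the script could look like this. The function name first_column and the sample rows are made up for illustration, and skipping empty lines is an assumption about what is wanted, not a known cause of the failure above.

```python
import io


def first_column(lines, out):
    """Write the first tab-separated field of every non-empty input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # An empty line yields [""], which would otherwise print a blank row.
        if fields[0]:
            out.write(fields[0] + "\n")


# Local dry run with rows shaped like the test table (col1 <TAB> col2).
demo_out = io.StringIO()
first_column(io.StringIO("1\t3\n2\t2\n\n3\t1\n"), demo_out)
print(demo_out.getvalue(), end="")
```

In the real script the last lines would be replaced by a call such as first_column(sys.stdin, sys.stdout), after importing sys.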

Check hive.log and/or the log from the hadoop job (job_201110270917_20215 in your example) for a more detailed error message.

"FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask" is a generic error that Hive returns when something goes wrong in the underlying map/reduce task. You need to go to the Hive log files (located on the HiveServer2 machine) and find the actual exception stack trace.

Related

How to catch the terminal output?

I'm working on https://github.com/JsBergbau/MiTemperature2 with a Raspberry Pi 3 Model B. It works properly in its own infinite loop, but I am not able to capture the output from the terminal. How can I get at that output using Python?
Here is the part of printing:
measurement_time = datetime.datetime.fromtimestamp(measurement.timestamp)
print(measurement_time)
humidity=int.from_bytes(data[2:3],byteorder='little')
print("Temperature: " + str(temp))
print("Humidity: " + str(humidity))
voltage=int.from_bytes(data[3:5],byteorder='little') / 1000.
print("Battery voltage:",voltage,"V")
measurement.temperature = temp
measurement.humidity = humidity
measurement.voltage = voltage
measurement.sensorname = args.name
batteryLevel = min(int(round((voltage - 2.1),2) * 100), 100) #3.1 or above --> 100% 2.1 --> 0 %
measurement.battery = batteryLevel
print("Battery level:",batteryLevel)
measurement_time = datetime.datetime.fromtimestamp(measurement.timestamp)
Here is the script I run on terminal:
python3 LYWSD03MMC.py -d AA:BB:CC:DD:EE:FF
And here is the output:
2021-08-05 11:21:24
Temperature: 24.79
Humidity: 47
Battery voltage: 3.092 V
Battery level: 99
That is the run command and a sample of the output. Thanks for any help, best regards.

Change your code so it returns the information instead of just printing it. If you have code which looks like
something = some_function_call(123)
print(something)
other_one = different_function("some data here?").strip()
print(other_one)
probably refactor to
def get_something(number):
    return some_function_call(number)
def get_other_one():
    return different_function("some data here?").strip()
if __name__ == '__main__':
    print(get_something(123))
    print(get_other_one())
Now, you can create additional code which retrieves these values without printing them, and does whatever it wants with them. Put them on a web site? Upload them to a database? Rot13 encrypt them and send an email to Bill Gates? Your imagination is the limit.
How exactly you design your code is a broad topic where many books have been written, and more will be. A common arrangement is to make sure the useful parts are in modular functions which do one thing only (ideally without any side effects) so you can import this code and use it from other programs. (That's why the if __name__ part is useful. It makes sure code inside the block doesn't run when you import this file.)

Have you had a closer look at the code? There is a callback option. This is the easiest way to get values from this script. Or is this question more academically on how to capture python output?
If not, that should help you:
Documentation where callback is described:
https://github.com/JsBergbau/MiTemperature2#callback-for-processing-the-data
Accessing the single values:
sendToInflux.sh (https://github.com/JsBergbau/MiTemperature2/blob/master/sendToInflux.sh) is an example that shows which argument holds which value (temperature and so on).
Or, when using sendToFile.sh, it writes the values line by line:
sensorname,temperature,humidity,voltage,humidityCalibrated,timestamp
MySensor 20.61 54 2.944 49 1582120122
That data should be easy to process by python or awk.
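As a rough illustration of processing such a line in Python, here is a minimal sketch. It assumes the values arrive space-separated in the order of the field names listed above, that the sensor name contains no spaces, and that humidity and timestamp are whole numbers; parse_measurement is a made-up helper name, not part of MiTemperature2.

```python
FIELDS = "sensorname,temperature,humidity,voltage,humidityCalibrated,timestamp".split(",")


def parse_measurement(line):
    """Turn one space-separated line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    # Convert the numeric fields; the sensor name stays a string.
    for name in ("temperature", "voltage"):
        record[name] = float(record[name])
    for name in ("humidity", "humidityCalibrated", "timestamp"):
        record[name] = int(record[name])
    return record


sample = parse_measurement("MySensor 20.61 54 2.944 49 1582120122")
print(sample["sensorname"], sample["temperature"], sample["humidity"])
```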

Append this to the command line:
2>&1 | tee result.txt
It saves the command-line output to result.txt while still showing it in the terminal.

If you are running a command from Python, you can use subprocess.check_output to get the output it prints to the terminal. This does not work if the called script runs forever.
Like this:
output = subprocess.check_output([sys.executable, 'LYWSD03MMC.py', '-d', 'AA:BB:CC:DD:EE:FF']).decode()
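Because check_output only returns once the child process exits, a script that loops forever needs its output read line by line instead. The following is a minimal sketch of that idea using subprocess.Popen. The helper names python_cmd, stream_lines and parse_reading are made up for illustration, and the parsing assumes lines shaped like the sample output in the question ("Temperature: 24.79").

```python
import subprocess
import sys


def python_cmd(script, *args):
    """Build a command that runs a Python script with unbuffered output (-u)."""
    return [sys.executable, "-u", script, *args]


def stream_lines(cmd):
    """Yield each line the child process prints, as soon as it is printed."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, bufsize=1) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")


def parse_reading(line):
    """Split 'Temperature: 24.79' into ('Temperature', '24.79'); None otherwise."""
    key, sep, value = line.partition(": ")
    return (key, value) if sep else None
```

With these helpers, looping over stream_lines(python_cmd("LYWSD03MMC.py", "-d", "AA:BB:CC:DD:EE:FF")) and passing each line to parse_reading would give one (name, value) pair per printed measurement while the script keeps running. The -u flag matters here: without it the child may buffer its prints when its output is a pipe.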

Floating point exception when using SQL when run but not when debugged

As part of my Python program, I have created a method which runs sql queries on a Db2 server. Here it is:
import ibm_db as db
import pandas as pd

def run_query(c, query, return_results=False):
    stmt = db.exec_immediate(c, query)
    if return_results:
        df = {}
        row = db.fetch_assoc(stmt)
        # one empty list per (lower-cased) column name
        for key in [key.lower() for key in row.keys()]:
            df[key] = []
        # fetch_assoc returns False once the result set is exhausted
        while row:
            for key in [key.lower() for key in row.keys()]:
                df[key].append(row[key.upper()])
            row = db.fetch_assoc(stmt)
        return pd.DataFrame(df)
It uses the ibm_db API library and its goal is to run an SQL query. If the results are wanted, it converts the resultset into a pandas dataframe for use in the program. When I run the program to print out the returned dataframe with print(run_query(conn, "SELECT * FROM ppt_products;", True)), it does not print anything but instead exits with this error code: Process finished with exit code 136 (interrupted by signal 8: SIGFPE) (btw I am using PyCharm Professional). However, when I debug the program with pydev debugger in PyCharm, the program runs smoothly and prints out the desired output, which should look like:
id brand model url
0 2392 sdf rtsg asdfasdfasdf
1 23452345 sdf rtsg asdfasdfasdf
2 6245 sdf rtsg asdfasdfasdf
3 8467 sdf rtsg asdfasdfasdf
I tried debugging the floating-point exception, but could only find solutions for Python 2 that use a module called fpectl, which can turn floating-point exceptions on and off.
I would appreciate any assistance.

The error was only occurring in PyCharm. When I run it using the command line, the error did not occur. This leads me to believe that the error may have been in the JetBrains mechanism for running scripts. Thank you data_henrik for the suggestion to use pandas.read_sql because it simplified the process of getting the result set from the SQL queries.
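For readers wondering what the pandas.read_sql variant looks like, here is a minimal sketch. It uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; with Db2 the conn argument would instead be a DB-API style connection (for example one created through ibm_db_dbi, which is an assumption about the setup and not shown here). The helper name query_to_frame and the sample row are made up for illustration.

```python
import sqlite3

import pandas as pd


def query_to_frame(conn, query):
    """Run a query and return the result set as a DataFrame with lower-case column names."""
    df = pd.read_sql(query, conn)
    df.columns = [name.lower() for name in df.columns]
    return df


# Stand-in database (SQLite, not Db2) with a table shaped like ppt_products.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ppt_products (ID INTEGER, BRAND TEXT, MODEL TEXT, URL TEXT)")
conn.execute("INSERT INTO ppt_products VALUES (2392, 'sdf', 'rtsg', 'asdfasdfasdf')")
frame = query_to_frame(conn, "SELECT * FROM ppt_products")
print(frame)
```

This replaces the manual fetch_assoc loop with a single call, at the cost of leaving the type conversion to pandas.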

How to link interactive problems (w.r.t. CodeJam)?

I'm not sure whether it's allowed to ask for help (if not, I don't mind not getting an answer until the competition period is over).
I was solving the Interactive Problem (Dat Bae) on CodeJam. On my local files, I can run the judge (testing_tool.py) and my program (<name>.py) separately and copy paste the I/O manually. However, I assume I need to find a way to make it automatically.
Edit: To make it clear, I want every output of x file to be input in y file and vice versa.
Some details:
I've used sys.stdout.write / sys.stdin.readline instead of print / input throughout my program
I tried running interactive_runner.py, but I can't figure out how to use it.
I tried running it on their server, with my program in the first tab and the judge file in the second. It always throws a TLE error.
I can't find any tutorial on how to do this either. Any help will be appreciated! :/

The usage is documented in comments inside the scripts:
interactive_runner.py
# Run this as:
# python interactive_runner.py <cmd_line_judge> -- <cmd_line_solution>
#
# For example:
# python interactive_runner.py python judge.py 0 -- ./my_binary
#
# This will run the first test set of a python judge called "judge.py" that
# receives the test set number (starting from 0) via command line parameter
# with a solution compiled into a binary called "my_binary".
testing_tool.py
# Usage: `testing_tool.py test_number`, where the argument test_number
# is 0 for Test Set 1 or 1 for Test Set 2.
So use them like this:
python interactive_runner.py python testing_tool.py 0 -- ./dat_bae.py
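To make clear what such a runner does, here is a minimal sketch of the wiring, in which each process's stdout becomes the other's stdin. It is not the code of interactive_runner.py, and the toy judge and solution are stand-ins invented for the example, not the Code Jam files. Note the flush=True in both toy programs: with sys.stdout.write an explicit sys.stdout.flush() is needed after each answer, and a missing flush is a common reason for an interactive solution to hang until it times out.

```python
import subprocess
import sys


def run_interactive(judge_cmd, solution_cmd):
    """Connect judge stdout -> solution stdin and solution stdout -> judge stdin."""
    judge = subprocess.Popen(judge_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    solution = subprocess.Popen(solution_cmd, stdin=judge.stdout, stdout=judge.stdin)
    # The parent does not talk to either process, so it closes its own pipe ends.
    judge.stdout.close()
    judge.stdin.close()
    return judge.wait(), solution.wait()


# Toy stand-ins: the judge sends 3 and expects the doubled value back.
TOY_JUDGE = [sys.executable, "-c",
             "import sys; print(3, flush=True); sys.exit(0 if int(input()) == 6 else 1)"]
TOY_SOLUTION = [sys.executable, "-c", "print(2 * int(input()), flush=True)"]

exit_codes = run_interactive(TOY_JUDGE, TOY_SOLUTION)
print(exit_codes)
```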

improving performance of behave tests

We run behave BDD tests in our pipeline. We run the tests in a Docker container as part of the Jenkins pipeline. Currently it takes ~10 minutes to run all the tests. We are adding a lot of tests, and in a few months it might go up to 30 minutes. It outputs a lot of information. I believe that if I reduce the amount of information it outputs, I can get the tests to run faster. Is there a way to control the amount of information behave outputs? I want to print the information only if something fails.
I took a look at behave-parallel. Looks like it is in python 2.7. We are in python3.
I was looking at various options behave provides.
behave -verbose=false folderName (I assumed that it will not output all the steps)
behave --logging-level=ERROR TQXYQ (I assumed it will print only if there is an error)
behave --logging-filter="Test Step" TQXYQ (I assumed it will print only the tests that has "Test Step" in it)
None of the above worked.
The current output looks like this
Scenario Outline: IsError is populated correctly based on Test Id -- #1.7 # TestName/Test.feature:187
Given the test file folder is set to /TestName/steps/ # common/common_steps.py:22 0.000s
And Service is running # common/common_steps.py:10 0.000s
Given request used is current.json # common/common_steps.py:26 0.000s
And request is modified to set X to q of type str # common/common_steps.py:111 0.000s
And request is modified to set Y to null of type str # common/common_steps.py:111 0.000s
And request is modified to set Z to USD of type str # common/common_steps.py:111 0.000s
When make a modification request # common/common_steps.py:37 0.203s
Then it returns 200 status code # common/common_steps.py:47 0.000s
And transformed result has IsError with 0 of type int # common/common_steps.py:92 0.000s
And transformed result has ErrorMessages contain [] # common/common_steps.py:52 0.000s
I want to print all of this only if there is an error. If everything passes, I don't want to display this information.

I think the default log level INFO will not impact the performance of your tests.
I am also using a Docker container to run the regression suite, and it takes about 2 hours to run 2300 test scenarios. It took nearly a day before, and here is what I did:
1. Run the whole test suite in parallel.
This is the most important change for reducing the execution time.
We spent a lot of effort making the regression suite parallelizable.
- Make tests atomic, autonomous and independent so that you can run all of them in parallel effectively.
- Create a parallel runner to run tests on multiple processes. I am using the multiprocessing and subprocess libraries to do this.
I would not recommend behave-parallel because it is no longer actively supported.
You can refer to this link:
http://blog.crevise.com/2018/02/executing-parallel-tests-using-behave.html?m=1
- Use Docker Swarm to add more nodes to the Selenium Grid.
You can scale up to add more nodes; the maximum number of nodes depends on the number of CPUs. The best practice is number of nodes = number of CPUs.
I have 4 PCs, each with 4 cores, so I can scale up to 1 hub and 15 nodes.
2. Optimize synchronization in your framework.
Remove time.sleep().
Remove implicit waits. Use explicit waits instead.
Hope it helps.
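A parallel runner of the kind described in point 1 can be sketched in a few lines. This is not the runner from the answer: it uses a thread pool rather than multiprocessing, since each worker only waits for a child process, and the names run_quietly, run_all and report are made up for illustration. It captures the output of every command and keeps it only for commands that fail, which also covers the wish to print details only on errors. It assumes the test command exits with a non-zero status when something fails.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_quietly(cmd):
    """Run one command, capturing its output instead of printing it."""
    done = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    return cmd, done.returncode, done.stdout


def run_all(commands, workers=4):
    """Run the commands concurrently; return (cmd, output) for those that failed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_quietly, commands))
    return [(cmd, output) for cmd, code, output in results if code != 0]


def report(failures):
    """Print the captured output of the failed commands only."""
    for cmd, output in failures:
        sys.stderr.write("FAILED: %s\n%s\n" % (" ".join(cmd), output))
```

A call such as report(run_all([["behave", path] for path in feature_files])) would then run one behave process per feature file, where feature_files is a hypothetical list of paths. This only pays off if the features are independent of each other, as the answer stresses.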

Well, I have solved this in a traditional way, but I'm not sure how effective it is. I only started on this yesterday and am now working on building the reports from it. The approach is below; suggestions are welcome.
This also handles parallel execution at the level of individual examples.
parallel_behave.py
Run command (mimics all the params of the behave command):
py parallel_behave.py -t -d -f ......
import json
import subprocess
import time
from multiprocessing import Pool

initial_command = 'behave -d -t <tags>'
'''
the above command returns the eligible cases. may not be the right approach, but works well for me
'''
ts = int(time.time())

def collect_scenarios():
    r = subprocess.Popen(initial_command.split(' '), stdout=subprocess.PIPE, bufsize=0)
    finalsclist = []
    _tmpstr = ''
    for out in r.stdout:
        out = out.decode('utf-8')
        # keep only the lines of the JSON array (this expects JSON formatter output)
        if out.startswith('[') or out.startswith('{'):
            _tmpstr += out
        if out.startswith(']'):
            _tmpstr += out
            break
    scenarionamedt = json.loads(_tmpstr)
    for sc in scenarionamedt:
        finalsclist.extend(s['name'] for s in sc['elements'])
    # now finalsclist contains the scenario names
    return finalsclist

def foo(item):
    index, scenarioname = item
    # one report file per scenario, so parallel runs do not overwrite each other
    report = './report/output{}_{}.json'.format(ts, index)
    return subprocess.call(['behave', '-n', scenarioname, '-o', report])

if __name__ == '__main__':
    pool = Pool()  # sized from the number of CPUs by default
    pool.map(foo, enumerate(collect_scenarios()))
This creates one behave call per scenario, spread over the pool's processes, and generates the JSON output under the report folder.
Note: this is based on https://github.com/hugeinc/behave-parallel, which works at the feature level. I just extended it to scenarios and examples.

How to debug a SQL/Python UDF in MonetDB

Using native Python code in SQL UDFs in Monetdb is really powerful. BUT, debugging such UDFs could benefit from more support. In particular, if I use the old-fashioned print('debugging info') it disappears in the big black void.
create function dummy()
returns string
language python{
    print('Entering the dummy UDF')
    return 'hello';
};
How can I retrieve this information from the server or the MonetDB client?

I was debugging some Python UDF last week :)
Step 1: first make sure your Python code at least works in a Python interpreter.
Step 2: in a Python UDF, write your debugging info. to a file, e.g.:
f = open('/tmp/debug.out', 'w')
f.write('my debugging info\n')
f.close()
This isn't ideal, but it works. Also, I used this to export the parameter values of my Python UDF. In this way, I can run the body of my Python UDF in a Python interpreter with the exact data I receive from MonetDB.

In case someone is still interested in this problem.
There are two novel ways of debugging MonetDB's Python/UDFs.
1) Using the python client pymonetdb (https://github.com/gijzelaerr/pymonetdb).
You can install it through pip:
pip install pymonetdb
To use it, think of the following setting with a table that holds an integer and a UDF that computes the mean absolute deviation of a given column.
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (3), (6), (8), (10);
CREATE OR REPLACE FUNCTION mean_deviation(column INTEGER)
RETURNS DOUBLE LANGUAGE PYTHON {
    mean = 0.0
    for i in range (0, len(column)):
        mean += column[i]
    mean = mean / len(column)
    distance = 0.0
    for i in range (0, len(column)):
        distance += column[i] - mean
    deviation = distance/len(column)
    return deviation;
};
To debug your function using terminal debugging (i.e., pdb) you just need to open a database connection using pymonetdb.connect(), later you get a cursor object from the connection, and through the cursor object you call the debug() function, sending as parameters the SQL you want to examine and the UDF name you wish to debug.
import pymonetdb
conn = pymonetdb.connect(database='demo') #Open Database connection
c = conn.cursor()
sql = 'select mean_deviation(i) from integers;'
c.debug(sql, 'mean_deviation') #Console Debugging
There is an optional sampling step that only transfers a uniform random sample of the data instead of the full input data set. If you wish to sample you just need to send the number of elements you wish to get from the sampling (e.g., c.debug(sql, 'mean_deviation', 10) in case you desire the subset of 10 elements)
2) Using a POC plugin for PyCharm called devudf, which you can install through the plugin page of PyCharm or by going directly to the JetBrains page: https://plugins.jetbrains.com/plugin/12063-devudf. It adds an option called "UDF Development" to the main menu and allows you to import and export UDFs directly between your database and PyCharm, and to use the IDE's debugging capabilities.
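Whichever of the two routes is used, the debugging session ends up stepping through the UDF body as ordinary Python. A minimal sketch of that body as a plain function, using a list in place of the column MonetDB would pass in, is shown below. Running it makes one thing visible: summing the signed differences, as the UDF above does, always gives roughly zero because the differences cancel out, so a mean absolute deviation needs abs(). The function names signed_deviation and mean_absolute_deviation are made up for illustration.

```python
def signed_deviation(column):
    """The UDF body as written above: signed differences cancel out."""
    mean = sum(column) / len(column)
    distance = 0.0
    for value in column:
        distance += value - mean
    return distance / len(column)


def mean_absolute_deviation(column):
    """Same loop, but accumulating the absolute distance to the mean."""
    mean = sum(column) / len(column)
    distance = 0.0
    for value in column:
        distance += abs(value - mean)
    return distance / len(column)


values = [1, 3, 6, 8, 10]  # the rows inserted into the integers table
print(signed_deviation(values))
print(mean_absolute_deviation(values))
```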
