Apache Hive getting error while using Python UDF - python

I am using a Python user-defined function in Apache Hive to convert characters from lower case to upper case. I am getting the error "Hive Runtime Error while closing operators".
Below are the queries I tried:
describe table1;
OK
item string
count int
city string
select * from table1;
aaa 1 tokyo
aaa 2 london
bbb 3 washington
ccc 4 moscow
ddd 5 bejing
From the above table, the item and city fields should be converted from lower case to upper case, and count should be incremented by 10.
Python script used:
cat caseconvert.py
import sys
import string
for line in sys.stdin:
    line = line.strip()
    item,count,city=line.split('\t')
    ITEM1=item.upper()
    COUNT1=count+10
    CITY1=city.upper()
    print '\t'.join([ITEM1,str(COUNT1),FRUIT1])
Inserting table1 data to table2
create table table2(ITEM1 string, COUNT1 int, CITY1 string) row format delimited fields terminated by ',';
add FILE caseconvert.py
insert overwrite table table2 select TRANSFORM(item,count,city) using 'python caseconvert.py' as (ITEM1,COUNT1,CITY1) from table1;
When I execute it I get the following error. I couldn't trace the issue. Can anyone tell me where it is going wrong?
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201508151858_0014, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201508151858_0014
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201508151858_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-08-15 22:24:06,212 Stage-1 map = 0%, reduce = 0%
2015-08-15 22:25:01,559 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201508151858_0014 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201508151858_0014
Examining task ID: task_201508151858_0014_m_000002 (and more) from job job_201508151858_0014
Task with the most failures(4):
-----
Task ID:
task_201508151858_0014_m_000000
URL:
http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_201508151858_0014&tipid=task_201508151858_0014_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:224)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:488)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:570)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:5
FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.MapRedTask. An error occurred when trying to close the Operator running your custom script.
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

In the last line of your Python script, where you print the output to STDOUT, you call FRUIT1 without having defined it. This should be CITY1. You have also imported string but not used it. I'd write the script a bit differently:
import sys
import string
while True:
    line = sys.stdin.readline()
    if not line:
        break
    line = string.strip(line, '\n ')
    item, count, city = string.split(line, '\t')
    ITEM1 = item.upper()
    COUNT1 = int(count) + 10  # count arrives as a string, so convert it before adding
    CITY1 = city.upper()
    print '\t'.join([ITEM1, str(COUNT1), CITY1])
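As an aside, here is a slightly more defensive sketch (my own variation, not part of the original script) that runs on both Python 2 and 3 and skips malformed rows; an uncaught exception in the script is typically what surfaces as Hive's "error while closing operators":

import sys

# Read tab-separated rows from Hive on stdin and emit the transformed row.
for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) != 3:
        continue  # skip blank or malformed rows instead of crashing the mapper
    item, count, city = fields
    sys.stdout.write('\t'.join([item.upper(), str(int(count) + 10), city.upper()]) + '\n')

You can also sanity-check the script locally before registering it with Hive, for example with: printf 'aaa\t1\ttokyo\n' | python caseconvert.py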
Then, I'd use a CREATE TABLE AS SELECT query (assuming both TABLE1 and your Python script live in HDFS):
create table TABLE2
as select transform(item, count, city)
using 'hdfs:///user/username/caseconvert.py'
as (item1 string, count1 string, city1 string)
FROM TABLE1;
This has worked for me. However, there is a much easier way to make the transformations you want using Hive built-in functions:
upper(string A) >>> returns the string resulting from converting all characters of A to upper case. For example, upper('fOoBaR') results in 'FOOBAR'.
And of course for count you could make it: (count + 10) AS count1.
So, TABLE2 could be created as follows:
CREATE TABLE TABLE2
AS SELECT
UPPER(ITEM) AS ITEM1,
COUNT + 10 AS COUNT1,
UPPER(CITY) AS CITY1
FROM TABLE1;
Much less trouble than writing your custom UDF.

Related

mySQL Load Data - Row 1 doesn't contain data for all columns

I've looked at many similar questions on this topic.. But none appear to apply.
Here are the details:
I have a table with 8 columns.
create table test (
node_name varchar(200),
parent varchar(200),
actv int(11),
fid int(11),
cb varchar(100),
co datetime,
ub varchar(100),
uo datetime
);
There is a trigger on the table:
CREATE TRIGGER before_insert_test
BEFORE INSERT ON test
FOR EACH ROW SET NEW.co = now(), NEW.uo = now(), NEW.cb = user(), NEW.ub = user()
I have a csv file to load into this table. Its got just 2 columns in it.
First few rows:
node_name,parent
West,
East,
BBB: someone,West
Quebec,East
Ontario,East
Manitoba,West
British Columbia,West
Atlantic,East
Alberta,West
I have this all set up in a MySQL 5.6 environment. Using Python and SQLAlchemy, I run the load of the file without issue. It LOADS ALL RECORDS, with empty strings for the second field in the first 2 records. All as expected.
I have a mysql 8 environment, and run the exact same routine. All the same statements, etc. It fails with the 'Row 1 doesn't contain data for all columns' error.
The connection is made using this:
engine = create_engine(
connection_string,
pool_size=6, max_overflow=10, encoding='latin1', isolation_level='AUTOCOMMIT',
connect_args={"local_infile": 1}
)
db_connection = engine.connect()
The Command I place in the sql variable is:
LOAD DATA INFILE 'test.csv'
INTO TABLE test
FIELDS TERMINATED BY ',' ENCLOSED BY '\"' IGNORE 1 LINES SET fid = 526, actv = 1;
And execute it with:
db_connection.execute(sql)
So.. I basically load the first two columns from the file.. I set the next 2 columns in the load statement, and the final 4 others are handled by the trigger.
I repeat - this is working fine in the mysql 5 environment, but not the mysql 8.
I checked mysql character set variables in both db environments, and they are equivalent (just in case the default character set change between 5.6 and 8 had an impact).
I will say that the mySQL 5 db is running on ubuntu 18.04.5 while mySQL 8 is running on ubuntu 20.02.2 - could there be something there??
I have tried all sorts of fiddling with the LOAD DATA statement.. I tried filling in data for the first two records in the file in case that was it.. I tried using different line terminators in the LOAD statement.. I'm at a loss for the next thing to look into..
Thanks for any pointers..
MySQL will assume that each row in your CSV contains a value for every column in the table, unless you tell it otherwise.
Give the query a column list:
LOAD DATA INFILE 'test.csv'
INTO TABLE test
FIELDS TERMINATED BY ','
ENCLOSED BY '\"'
IGNORE 1 LINES
(node_name, parent)
SET fid = 526, actv = 1;
In addition to Tangentially Perpendicular's answer, there are other options:
add the IGNORE keyword as per:
https://dev.mysql.com/doc/refman/8.0/en/sql-mode.html#ignore-effect-on-execution
it should come just before the 'INTO' in the LOAD DATA statement, as per https://dev.mysql.com/doc/refman/8.0/en/load-data.html (see the sketch below).
or, altering the sql_mode to be less strict will also work.
Due to the strict sql_mode, LOAD DATA isn't smart enough to realize that triggers are handling a couple of columns. It would be nice if they enhanced it to be that smart, but alas.
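For illustration, a rough sketch of the IGNORE variant issued through the same db_connection as in the question (nothing else about the connection setup changes):

# Sketch only: the IGNORE keyword goes just before INTO TABLE; columns the CSV
# does not supply then fall back to their defaults, producing warnings instead of an error.
sql = """
    LOAD DATA INFILE 'test.csv'
    IGNORE INTO TABLE test
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    IGNORE 1 LINES
    SET fid = 526, actv = 1
"""
db_connection.execute(sql)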

How to extract more than 1 variable from Python launched in a SQL query into a SQL table

Could you advise how to extract more than one variable value to SQL from a Python script written inside SQL (e.g. the tn variable; actually, I want to extract 6 variables)?
The query below extracts the value of 1 variable, dataset_with_pred2, perfectly.
I've tried to list the variables separated by commas but got this error:
Msg 39012, Level 16, State 1, Line 92
Unable to communicate with the runtime for 'Python' script. Please
check the requirements of 'Python' runtime. Query was canceled by
user.
Thank you very much.
INSERT INTO [dbo].[table_name]
([col1]
,[col2]
,[...]
,[coln])
exec sp_execute_external_script
@language = N'Python',
@script = N'
# Python script here
...
tn = 5
dataset_with_pred2 = dataframe
'
,
@input_data_1 = N'
select col1
,col2
from dbo.table_name
'
,@input_data_1_name = N'df'
,@output_data_1_name = N'dataset_with_pred2'
Only 1 variable can be returned from the Python script, and its type must always be a pandas DataFrame.
This means that if you want to extract anything other than a table from the Python script (e.g. a string), you will have to put it into that table and then read it back out of the table.
Unfortunately, I only found documentation in Russian. Hopefully, in your country the link will redirect you to the English version: MS documentation
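For illustration, a rough sketch of the Python side (the column names and example values below are invented, not from the question): collect the six scalar results into the single DataFrame that sp_execute_external_script returns, and give the INSERT a matching six-column list.

import pandas as pd

# Hypothetical example values standing in for the six results computed in the script.
tn, fp, fn, tp, accuracy, threshold = 5, 1, 2, 40, 0.93, 0.5

# Pack everything into the one DataFrame named by @output_data_1_name.
dataset_with_pred2 = pd.DataFrame({
    "tn": [tn],
    "fp": [fp],
    "fn": [fn],
    "tp": [tp],
    "accuracy": [accuracy],
    "threshold": [threshold],
})

The column list in the INSERT then has to name six columns in the same order.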

I have a table with a status, but I want it to change status every 10 minutes. I'm using PostgreSQL

I want to change the value of a column in a table from one status to another: type 'A', and after 10 minutes I want it to become 'B', in Postgres with Python.
I can see what you're after although there is a big omission: What happens when the status has reached 'Z' and it is time to update? More on that later.
Your request has 2 components: actually getting a process to run, and a rolling update procedure (script) for the status. Well, Postgres has no native functionality for initiating a script. You'll have to set up a cron job, or an entry in whatever job scheduler you have. The update process is not all that difficult, except for the undefined 'Z' status problem. (When that happens I'll just repeat A-Z, extending the code length, sort of like how Excel names columns.)
The basic update needed is to simply add 1 to the current value. But of course the statement 'A'+1 doesn't work; the result can be obtained with the CHR and ASCII functions. chr(ascii('A')+1) effectively accomplishes that, so your update could be written as:
Update table_name set status = chr(ascii(status)+1);
However, that fails once the status reaches 'Z'. Well, it doesn't fail in the sense of generating an error, but it produces '['. The following script produces 'AA' in that case, and each time status reaches '...Z' the next status becomes '...AA'.
--- setup
drop table if exists current_stat;
create table current_stat(id serial,status text,constraint status_alpha_ck check( status ~ '^[A-Z]+'));
insert into current_stat(status) values (null), ('A'), ('B'), ('Y'), ('Z'), ('AA'), ('ABZ');
--- Update SQL
with curr_stat as
( select id, status
, chr(ascii(substring(status,char_length(status),1))+1) nstat
, char_length(status) lstat from current_stat)
update current_stat cs
set status = ( select case when status is null or lstat < 1 then 'A'
when substring(status,lstat,1) = 'Z' then overlay( status placing 'AA' from lstat for 2)
else overlay( status placing nstat from lstat for 1)
end
from curr_stat
where cs.id = curr_stat.id
);
select * from current_stat;
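Since Postgres can't launch this on its own, here's a minimal sketch (assuming psycopg2 and placeholder connection details) of a small job that cron could run every 10 minutes, e.g. with a crontab entry like */10 * * * * python /path/to/advance_status.py (the path is a placeholder):

import psycopg2

# Hypothetical sketch: run the rolling status update once per invocation.
# Swap in the CTE-based statement above if you need the 'Z' -> 'AA' rollover.
ROLL_STATUS_SQL = "update current_stat set status = chr(ascii(status)+1);"

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection details
try:
    with conn, conn.cursor() as cur:
        cur.execute(ROLL_STATUS_SQL)
finally:
    conn.close()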

Python sqlite3 never returns an inner join with 28 million+ rows

Sqlite database with two tables, each over 28 million rows long. Here's the schema:
CREATE TABLE MASTER (ID INTEGER PRIMARY KEY AUTOINCREMENT,PATH TEXT,FILE TEXT,FULLPATH TEXT,MODIFIED_TIME FLOAT);
CREATE TABLE INCREMENTAL (INC_ID INTEGER PRIMARY KEY AUTOINCREMENT,INC_PATH TEXT,INC_FILE TEXT,INC_FULLPATH TEXT,INC_MODIFIED_TIME FLOAT);
Here's an example row from MASTER:
ID PATH FILE FULLPATH MODIFIED_TIME
---------- --------------- ---------- ----------------------- -------------
1 e:\ae/BONDS/0/0 100.bin e:\ae/BONDS/0/0/100.bin 1213903192.5
The tables have mostly identical data, with some differences between MODIFIED_TIME in MASTER and INC_MODIFIED_TIME in INCREMENTAL.
If I execute the following query in sqlite, I get the results I expect:
select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;
That query will pause for a minute or so, return a number of rows, pause again, return some more, etc., and finish without issue. Takes about 2 minutes to fully return everything.
However, if I execute the same query in Python:
changed_files = conn.execute("select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;")
It will never return - I can leave it running for 24 hours and still have nothing. The python32.exe process doesn't start consuming a large amount of cpu or memory - it stays pretty static. And the process itself doesn't actually seem to go unresponsive - however, I can't Ctrl-C to break, and have to kill the process to actually stop the script.
I do not have these issues with a small test database - everything runs fine in Python.
I realize this is a large amount of data, but if sqlite is handling the actual queries, python shouldn't be choking on it, should it? I can do other large queries from python against this database. For instance, this works:
new_files = conn.execute("SELECT DISTINCT INC_FULLPATH, INC_PATH, INC_FILE from INCREMENTAL where INC_FULLPATH not in (SELECT DISTINCT FULLPATH from MASTER);")
Any ideas? Are the pauses in between sqlite returning data causing a problem for Python? Or is something never occurring at the end to signal the end of the query results (and if so, why does it work with small databases)?
Thanks. This is my first stackoverflow post and I hope I followed the appropriate etiquette.
Python tends to have older versions of the SQLite library, especially Python 2.x, where it is not updated.
However, your actual problem is that the query is slow.
Use the usual mechanisms to optimize it, such as creating a two-column index on INC_FULLPATH and INC_MODIFIED_TIME.
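For instance, a rough sketch (the index name and database path are my own placeholders) of creating that index from Python and re-running the original query:

import sqlite3

conn = sqlite3.connect("files.db")  # placeholder path to the question's database

# A two-column index lets SQLite look up each MASTER row's match in INCREMENTAL
# instead of scanning tens of millions of rows per probe.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_inc_fullpath_mtime "
    "ON INCREMENTAL (INC_FULLPATH, INC_MODIFIED_TIME)"
)
conn.commit()

changed_files = conn.execute(
    "SELECT ID FROM MASTER INNER JOIN INCREMENTAL "
    "ON FULLPATH = INC_FULLPATH AND MODIFIED_TIME != INC_MODIFIED_TIME"
)
for (row_id,) in changed_files:
    print(row_id)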

how to read sql query output in python

I am using sqlplus in my Python code to connect to the database, execute a query, and read the results. Can anyone help me with how to read the data from stdout?
My code is like this:
import os

stdout = os.popen(cmd)
for line in stdout:
    print line
stdout.close()
But in the output the header is repeated after every three rows, like this:
Name ID
---- ---
AB 1
AC 2
AD 3
Name ID
---- ---
BC 1
BD 2
like this.
Is it possible to control this so that the header is not repeated? The header should appear only once, at the beginning.
What you are doing:
Launching a standalone program which queries the database and prints the results to stdout
Reading the stdout of that program and thinking about parsing it.
What you should be doing:
Using a database API in Python.
This page contains a list of Oracle DB APIs you could use: https://wiki.python.org/moin/Oracle
Many benefits will come from using a real API to query the database, such as better opportunities to handle errors, probably better performance, and future maintainers of your code not being upset with you.
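For example, a minimal sketch using cx_Oracle, one of the drivers listed on that page (the connection string, table, and column names below are placeholders, not from the question):

import cx_Oracle

# Placeholder credentials and DSN -- replace with your own.
conn = cx_Oracle.connect("scott", "tiger", "dbhost.example.com/orclpdb1")
cursor = conn.cursor()

cursor.execute("SELECT name, id FROM my_table")  # assumed table and columns
print("Name ID")  # the header is printed exactly once, by you
for name, row_id in cursor:
    print(name, row_id)

cursor.close()
conn.close()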
