We have ETL jobs in Python (Luigi). They all connect to the Hive Metastore to get partition info.
Code:
from hive_metastore import ThriftHiveMetastore
client = ThriftHiveMetastore.Client(protocol)
partitions = client.get_partition_names('sales', 'salesdetail', -1)
The -1 is max_parts (the maximum number of partitions to return).
It randomly times out like this:
File "/opt/conda/envs/etl/lib/python2.7/site-packages/luigi/contrib/hive.py", line 210, in _existing_partitions
partition_strings = client.get_partition_names(database, table, -1)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1703, in get_partition_names
return self.recv_get_partition_names()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1716, in recv_get_partition_names
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
sz = self.readI32()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
buff = self.trans.readAll(4)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz - have)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 159, in read
self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 105, in read
buff = self.handle.recv(sz)
timeout: timed out
This error happens occasionally.
There is a 15-minute timeout on the Hive Metastore.
When I run get_partition_names separately to investigate, it returns data within a few seconds.
Even when I set the socket timeout to 1 or 2 seconds, the query completes.
There is no record of a socket-close message in the Hive Metastore logs (cat /var/log/hive/..log.out).
The tables it usually times out on have a large number of partitions (~10K+). But as mentioned before, they only time out randomly, and they return partition metadata quickly when that portion of the code is tested alone.
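For reference, a minimal sketch of that standalone check, assuming direct Thrift access to the metastore (the host name is a placeholder; 9083 is the usual metastore port, and setTimeout takes milliseconds):

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hive_metastore import ThriftHiveMetastore

sock = TSocket.TSocket('metastore-host.example.com', 9083)  # placeholder host
sock.setTimeout(2000)  # 2 seconds, in milliseconds
transport = TTransport.TBufferedTransport(sock)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ThriftHiveMetastore.Client(protocol)

transport.open()
print(client.get_partition_names('sales', 'salesdetail', -1))
transport.close()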
Any ideas why it times out randomly, how to catch these timeout errors in the metastore logs, or how to fix them?
The issue was thread overlap in Luigi.
We used a singleton to implement a poor man's connection pool, but Luigi's different worker threads stepped on each other, causing strange behavior when one thread's get_partition_names call conflicted with another's.
We fixed the issue by ensuring that each thread's connection object gets its own key in the connection pool (instead of all threads sharing the process-ID key).
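A minimal sketch of the shape of that fix (illustrative only, not our actual code; the class and method names are made up):

import threading

class MetastoreClientPool(object):
    """Poor man's pool: one Thrift client per worker thread."""
    _clients = {}
    _lock = threading.Lock()

    @classmethod
    def get_client(cls):
        # key on the thread ident rather than the process id, so Luigi
        # worker threads never share (and never clobber) a client
        key = threading.current_thread().ident
        with cls._lock:
            if key not in cls._clients:
                cls._clients[key] = cls._build_client()
            return cls._clients[key]

    @classmethod
    def _build_client(cls):
        # open a Thrift transport and return ThriftHiveMetastore.Client(protocol),
        # as in the snippet at the top of the question
        raise NotImplementedError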
Related
I am trying to create, edit, and read from an SQLite file using a Python script. My code is a server-client model in which the client writes to the database on receiving a command from the server.
Each command from the server is received on a separate thread to allow parallel operation.
The client never restarts unless the system reboots, but the server program is launched whenever the user needs it.
Now my problem arises because sqlite for Python is not thread-safe, so I have a consumer queue to the database for all write operations.
I cannot provide the full code because it is really long and very hard to decouple into a complete working copy.
But here is a snippet of the code where the error occurs:
def writedata(self, _arg1, _arg2, _arg3):
    # self.sql_report is the fully qualified path of the sqlite file
    db = sqlite3.connect(self.sql_report)
    c = db.cursor()
    res = c.execute("Select id from State")
    listRowId = []
    for element in res.fetchall():
        listRowId.append(element[0])
    self.currentState = max(listRowId)
    sql = "INSERT INTO Analysis (type, reason, Component_id, State_id) VALUES (?, ?, ifnull((SELECT id from `Component` WHERE name = ? LIMIT 1), 1), ?)"
    # call to the write queue.
    Report.strReference.writeToDb(sql, [(_arg1, _arg3, _arg2, self.currentState)])
The error I am receiving is
File "/usr/lib/python2.6/threading.py", line 525, in __bootstrap_inner
self.run()
File "/usr/lib/python2.6/threading.py", line 477, in run
self.__target(*self.__args, **self.__kwargs)
File "/nameoffile", line 357, in nameofmethod
Report().writedata("test","text","analysis")
File "./nameofscript/Report.py", line 81, in writedata
ValueError: database parameter must be string or APSW Connection object
Line 81 is:
# first line of the snippet pasted above
db = sqlite3.connect(self.sql_report)
I don't know why this error comes up. One point to note, though, is that it only appears after the server has been run a few times.
The error is exactly what it says: you are passing self.sql_report as the string database filename to use, but at the time of making the call it is not a string.
You'll need to find out what it really is, which is standard Python debugging; use whatever you normally use for that. Here is also a suggestion that will print what it is and drop into an interactive debugger so you can examine it further:
try:
    db = sqlite3.connect(self.sql_report)
except ValueError:
    print(repr(self.sql_report))
    import pdb; pdb.set_trace()
    raise
This is my full trace:
Traceback (most recent call last):
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/app/trace.py", line 283, in trace_task
uuid, retval, SUCCESS, request=task_request,
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/base.py", line 256, in store_result
request=request, **kwargs)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/base.py", line 490, in _store_result
self.set(self.get_key_for_task(task_id), self.encode(meta))
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 160, in set
return self.ensure(self._set, (key, value), **retry_policy)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 149, in ensure
**retry_policy
File "/home/server/backend/venv/lib/python3.4/site-packages/kombu/utils/__init__.py", line 243, in retry_over_time
return fun(*args, **kwargs)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 169, in _set
pipe.execute()
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/client.py", line 2593, in execute
return execute(conn, stack, raise_on_error)
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/client.py", line 2447, in _execute_transaction
connection.send_packed_command(all_cmds)
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/connection.py", line 532, in send_packed_command
self.connect()
File "/home/pserver/backend/venv/lib/python3.4/site-packages/redis/connection.py", line 436, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 0 connecting to localhost:6379. Error.
[2016-09-21 10:47:18,814: WARNING/Worker-747] Data collector is not contactable. This can be because of a network issue or because of the data collector being restarted. In the event that contact cannot be made after a period of time then please report this problem to New Relic support for further investigation. The error raised was ConnectionError(ProtocolError('Connection aborted.', BlockingIOError(11, 'Resource temporarily unavailable')),).
I searched extensively for this ConnectionError but found no problem matching mine.
My platform is Ubuntu 14.04. This is part of my Redis config. (I can share the whole redis.conf file if you need it. By the way, all parameters in the LIMITS section are commented out.)
# By default Redis listens for connections from all the network interfaces
# available on the server. It is possible to listen to just one or multiple
# interfaces using the "bind" configuration directive, followed by one or
# more IP addresses.
#
# Examples:
#
# bind 192.168.1.100 10.0.0.1
bind 127.0.0.1
# Specify the path for the unix socket that will be used to listen for
# incoming connections. There is no default, so Redis will not listen
# on a unix socket when not specified.
#
# unixsocket /var/run/redis/redis.sock
# unixsocketperm 755
# Close the connection after a client is idle for N seconds (0 to disable)
timeout 0
# TCP keepalive.
#
# If non-zero, use SO_KEEPALIVE to send TCP ACKs to clients in absence
# of communication. This is useful for two reasons:
#
# 1) Detect dead peers.
# 2) Take the connection alive from the point of view of network
# equipment in the middle.
#
# On Linux, the specified value (in seconds) is the period used to send ACKs.
# Note that to close the connection the double of the time is needed.
# On other kernels the period depends on the kernel configuration.
#
# A reasonable value for this option is 60 seconds.
tcp-keepalive 60
This is my mini redis wrapper:
import redis
from django.conf import settings
# created once at import time and shared by everything that imports this module
REDIS_POOL = redis.ConnectionPool(host=settings.REDIS_HOST, port=settings.REDIS_PORT)

def get_redis_server():
    return redis.Redis(connection_pool=REDIS_POOL)
And this is how I use it:
from redis_wrapper import get_redis_server
# the view and the task run in different, independent processes
def sample_view(request):
    rs = get_redis_server()
    # some get-set stuff with redis

@shared_task
def sample_celery_task():
    rs = get_redis_server()
    # some get-set stuff with redis
Package versions:
celery==3.1.18
django-celery==3.1.16
kombu==3.0.26
redis==2.10.3
So the problem is this: the connection error occurs some time after the Celery workers start, and after its first appearance, every task fails with it until I restart all of my Celery workers. (Interestingly, Celery Flower also fails during that problematic period.)
I suspect my Redis connection pool usage, my Redis configuration, or (less probably) network issues. Any ideas about the cause? What am I doing wrong?
(PS: I will add the redis-cli info results when I see the error again today.)
UPDATE:
I temporarily solved this problem by adding the --maxtasksperchild parameter to my worker start command and setting it to 200. Of course it is not the proper way to solve the problem, just a symptomatic cure: it periodically refreshes the worker instance (closing the old process and creating a new one once it has handled 200 tasks), which also refreshes my global Redis pool and connections. So I think I should focus on how the global Redis connection pool is used, and I'm still waiting for new ideas and comments.
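For reference, the worker start command ended up looking roughly like this (the app name is a placeholder):

celery -A myproject worker --loglevel=INFO --maxtasksperchild=200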
Sorry for my bad English and thanks in advance.
Have you enabled the RDB background save method in Redis?
If so, check the size of the dump.rdb file in /var/lib/redis.
Sometimes the file grows until it fills the partition, and the Redis instance can no longer save to it.
You can stop Redis from refusing writes when a background save fails by issuing
config set stop-writes-on-bgsave-error no
in redis-cli.
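For example (paths as described above; adjust them to wherever your RDB file actually lives):

ls -lh /var/lib/redis/dump.rdb      # how large has the snapshot grown?
df -h /var/lib/redis                # is the filesystem holding it full?
redis-cli config set stop-writes-on-bgsave-error no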
This is the problem:
File "/app/.heroku/python/lib/python2.7/site-packages/django/utils/six.py", line 535, in next
return type(self).__next__(self)
File "/app/.heroku/python/lib/python2.7/site-packages/django/http/multipartparser.py", line 344, in __next__
output = next(self._producer)
File "/app/.heroku/python/lib/python2.7/site-packages/django/utils/six.py", line 535, in next
return type(self).__next__(self)
File "/app/.heroku/python/lib/python2.7/site-packages/django/http/multipartparser.py", line 406, in __next__
data = self.flo.read(self.chunk_size)
File "/app/.heroku/python/lib/python2.7/site-packages/django/http/request.py", line 267, in read
six.reraise(UnreadablePostError, UnreadablePostError(*e.args), sys.exc_info()[2])
File "/app/.heroku/python/lib/python2.7/site-packages/django/http/request.py", line 265, in read
return self._stream.read(*args, **kwargs)
File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 59, in read
result = self.buffer + self._read_limited(size - len(self.buffer))
File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 47, in _read_limited
result = self.stream.read(size)
UnreadablePostError: error during read(65536) on wsgi.input
My current configuration reads like this:
[uwsgi]
http-socket = :$(PORT)
master = true
processes = 4
die-on-term = true
module = app.wsgi:application
memory-report = true
chunked-input-limit = 25000000
chunked-input-timeout = 300
socket-timeout = 300
Python: 2.7.x | uWsgi: 2.0.10
And to make the problem even more specific: this happens when I process images synchronously along with an image upload. I know that ideally I should do this using Celery, but because of a business requirement I cannot. So I need to configure the timeout in such a way that it lets me accept a large image file, process it, and then return the response.
Any light shed on this will be extremely helpful. Thank you.
The error quoted in the description isn't the full picture; the relevant part is this log entry:
[uwsgi-body-read] Error reading 65536 bytes … message: Client closed connection uwsgi_response_write_body_do() TIMEOUT
This specific error is being raised because (most probably) the client, or something between it and uWSGI, aborted the request.
There are a number of possible causes for this:
A buggy client
Network-level filtering (DPI or some misconfigured firewall)
Bugs / misconfiguration in the server in front of uWSGI
The last one is covered in the uWSGI docs:
If you plan to put uWSGI behind a proxy/router be sure it supports chunked input requests (or generally raw HTTP requests).
To verify that the issue really isn't in uWSGI, try uploading the file from a console on the server hosting your uWSGI application. Hit the HTTP endpoint directly, bypassing nginx/HAProxy and friends.
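For example, something along these lines from the box itself (port, path, and form-field name are placeholders for your setup):

curl -v -F "image=@/tmp/big-upload.jpg" http://127.0.0.1:8000/your-upload-endpoint/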
Update 3/4:
I've done some testing and proved that using a checkout event handler to check for disconnects works with Elixir. I was beginning to think my problem had something to do with calling session.commit() from a subprocess. Update: I just disproved that by calling session.commit() in a subprocess; the example below is updated. I'm using the multiprocessing module to create the subprocess.
Here's the code that shows how it should work (without even using pool_recycle!):
from sqlalchemy import exc
from sqlalchemy import event
from sqlalchemy.pool import Pool
from elixir import *
import multiprocessing as mp
class SubProcess(mp.Process):
    def run(self):
        a3 = TestModel(name="monkey")
        session.commit()

class TestModel(Entity):
    name = Field(String(255))

@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except:
        # optional - dispose the whole pool
        # instead of invalidating one at a time
        # connection_proxy._pool.dispose()
        # raise DisconnectionError - pool will try
        # connecting again up to three times before raising.
        raise exc.DisconnectionError()
    cursor.close()

from sqlalchemy import create_engine
metadata.bind = create_engine("mysql://foo:bar@localhost/some_db", echo_pool=True)

setup_all(True)

subP = SubProcess()

a1 = TestModel(name='foo')
session.commit()
# pool size is now three.

print "Restart the server"
raw_input()

subP.start()
#a2 = TestModel(name='bar')
#session.commit()
Update 2:
I'm forced to find another solution, as post-1.2.2 versions of MySQL-python drop support for the reconnect param. Anyone got a solution? :\
Update 1 (old solution, doesn't work for MySQL-python versions > 1.2.2):
Found a solution: passing connect_args={'reconnect': True} to the create_engine call fixes the problem and automagically reconnects. I don't even seem to need the checkout event handler.
So, in the example from the question:
metadata.bind = create_engine("mysql://foo:bar@localhost/db_name", pool_size=100, pool_recycle=3600, connect_args={'reconnect': True})
Original question:
I've done quite a bit of Googling for this problem and haven't seemed to find a solution specific to Elixir. I'm trying to use the "Disconnect Handling - Pessimistic" example from the SQLAlchemy docs to handle MySQL disconnects. However, when I test this (by restarting the MySQL server), the "MySQL server has gone away" error is raised before my checkout event handler runs.
Here's the code I use to initialize elixir:
##### Initialize elixir/SQLAlchemy
# Disconnect handling
from sqlalchemy import exc
from sqlalchemy import event
from sqlalchemy.pool import Pool
@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    logging.debug("***********ping_connection**************")
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except:
        logging.debug("######## DISCONNECTION ERROR #########")
        # optional - dispose the whole pool
        # instead of invalidating one at a time
        # connection_proxy._pool.dispose()
        # raise DisconnectionError - pool will try
        # connecting again up to three times before raising.
        raise exc.DisconnectionError()
    cursor.close()

metadata.bind = create_engine("mysql://foo:bar@localhost/db_name", pool_size=100, pool_recycle=3600)
setup_all()
I create Elixir entity objects and save them with session.commit(), during which I see the "ping_connection" message generated by the event defined above. However, when I restart the MySQL server and test again, it fails with the "MySQL server has gone away" message just before the ping_connection event.
Here's the stack trace starting from the relevant lines:
File "/usr/local/lib/python2.6/dist-packages/elixir/entity.py", line 1135, in get_by
return cls.query.filter_by(*args, **kwargs).first()
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 1963, in first
ret = list(self[0:1])
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 1857, in __getitem__
return list(res)
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 2032, in __iter__
return self._execute_and_instances(context)
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 2047, in _execute_and_instances
result = conn.execute(querycontext.statement, self._params)
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/engine/base.py", line 1399, in execute
params)
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/engine/base.py", line 1532, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/engine/base.py", line 1640, in _execute_context
context)
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/engine/base.py", line 1633, in _execute_context
context)
File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/engine/default.py", line 330, in do_execute
cursor.execute(statement, parameters)
File "/usr/lib/pymodules/python2.6/MySQLdb/cursors.py", line 166, in execute
self.errorhandler(self, exc, value)
File "/usr/lib/pymodules/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler
raise errorclass, errorvalue
OperationalError: (OperationalError) (2006, 'MySQL server has gone away')
The final workaround was calling session.remove() at the start of methods, before manipulating and loading Elixir entities. This returns the connection to the pool, so that when it is used again the pool's checkout event fires and our handler detects the disconnection. From the SQLAlchemy docs:
It’s not strictly necessary to remove the session at the end of the request - other options include calling Session.close(), Session.rollback(), Session.commit() at the end so that the existing session returns its connections to the pool and removes any existing transactional context. Doing nothing is an option too, if individual controller methods take responsibility for ensuring that no transactions remain open after a request ends.
Quite an important little piece of information; I wish it were mentioned in the Elixir docs. But then I guess it assumes prior knowledge of SQLAlchemy?
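A minimal sketch of the workaround, in terms of the example above (the function name is made up):

from elixir import session

def load_test_models():
    # hand any pooled connection back first, so the next use triggers a fresh
    # checkout and the "checkout" ping handler can catch a dead connection
    session.remove()
    return TestModel.query.all()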
The actual problem is that SQLAlchemy gives you the same session every time you call the sessionmaker factory. Because of this, a later query can end up running on a session opened much earlier, as long as you did not call session.remove() on it. Having to remember to call remove() every time you request a session is no fun, though, and SQLAlchemy provides something much simpler: contextual "scoped" sessions.
To create a scoped session simply wrap your sessionmaker:
from sqlalchemy.orm import scoped_session, sessionmaker
Session = scoped_session(sessionmaker())
This way you get a contextually bound session every time you call the factory, meaning SQLAlchemy calls session.remove() for you as soon as the calling function exits. See here: sqlalchemy - lifespan of a contextual session.
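A minimal usage sketch (the engine URL and the query are placeholders, just to show the call pattern):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("mysql://foo:bar@localhost/db_name", pool_recycle=3600)
Session = scoped_session(sessionmaker(bind=engine))

def do_work():
    session = Session()              # repeated calls in the same scope return the same session
    session.execute(text("SELECT 1"))
    session.commit()
    Session.remove()                 # dispose of the contextual session when the unit of work ends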
Are you using the same session for both operations (before and after the mysqld restart)? If so, the "checkout" event occurs only when a new transaction is started. When you call commit(), a new transaction is started (unless you use autocommit mode) and a connection is checked out. So you are restarting mysqld after the checkout.
A simple hack is to call commit() or rollback() just before the second operation (and after restarting mysqld); that should solve your problem. Otherwise, consider using a fresh session whenever you have waited a long time since the previous commit.
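Illustratively, in terms of the example from the question (assuming the server was just restarted):

# nudge the session to give up its stale connection before the next real query
session.rollback()            # or session.commit()

# the next statement checks a connection out again, firing ping_connection
a2 = TestModel(name='bar')
session.commit()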
I'm not sure if this is the same problem that I had, but here goes:
When I encountered MySQL server has gone away, I solved it using create_engine(..., pool_recycle=3600), see http://www.sqlalchemy.org/docs/dialects/mysql.html#connection-timeouts
I am running Celery 2.2.4 / django-celery 2.2.4, using RabbitMQ 2.1.1 as a backend. I recently brought two new Celery servers online. I had been running 2 workers across two machines with a total of ~18 threads, and on my new souped-up boxes (36 GB RAM + dual hyper-threaded quad-core) I am running 10 workers with 8 threads each, for a total of 180 threads. My tasks are all pretty small, so this should be fine.
The nodes have been running fine for the last few days, but today I noticed that .delay() is hanging. When I interrupt it, I see a traceback that points here:
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 324, in delay
return self.apply_async(args, kwargs)
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 449, in apply_async
publish.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/kombu/compat.py", line 108, in close
self.backend.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/channel.py", line 194, in close
(20, 41), # Channel.close_ok
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
self.channel_id, allowed_methods)
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
self.method_reader.read_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 212, in read_method
self._next_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 127, in _next_method
frame_type, channel, payload = self.source.read_frame()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 109, in read_frame
frame_type, channel, size = unpack('>BHI', self._read(7))
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 200, in _read
s = self.sock.recv(65536)
I've checked the RabbitMQ logs, and I see the process trying to connect:
=INFO REPORT==== 12-Jun-2011::22:58:12 ===
accepted TCP connection on 0.0.0.0:5672 from x.x.x.x:48569
I have my Celery log level set to INFO, but I don't see anything particularly interesting in the Celery logs EXCEPT that 2 of the workers can't connect to the broker:
[2011-06-12 22:41:08,033: ERROR/MainProcess] Consumer: Connection to broker lost. Trying to re-establish connection...
All of the other nodes are able to connect without issue.
I know there was a posting (RabbitMQ / Celery with Django hangs on delay/ready/etc - No useful log info) last year of a similar nature, but I'm pretty certain this is different. Could it be that the sheer number of workers is creating some sort of race condition in amqplib? I found this thread which seems to indicate that amqplib is not thread-safe; I'm not sure whether this matters for Celery.
EDIT: I've tried celeryctl purge on both nodes -- on one it succeeds, but on the other it fails with the following AMQP error:
AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
amqplib.client_0_8.exceptions.AMQPConnectionException:
(530, u"NOT_ALLOWED - cannot redeclare exchange 'XXXXX' in vhost 'XXXXX'
with different type, durable or autodelete value", (40, 10), 'Channel.exchange_declare')
On both nodes, inspect stats hangs with the "can't close connection" traceback above. I'm at a loss here.
EDIT 2: I was able to delete the offending exchange using exchange.delete from camqadm, and now the second node hangs too :(.
EDIT 3: One other thing that recently changed is that I added an additional vhost to RabbitMQ, which my staging node connects to.
Hopefully this will save somebody a lot of time... though it certainly does not save me any embarrassment:
/var was full on the server that was running RabbitMQ. With all of the nodes I added, RabbitMQ was doing a lot more logging, and it filled up /var; I couldn't write to /var/lib/rabbitmq, and so no messages were going through.
I had the same symptoms, but not the same cause. For anyone else who stumbles upon this, mine was solved by https://stackoverflow.com/a/63591450/284164 -- I wasn't importing the celery app at the project level, and .delay() was hanging until I added that.