MySQL trigger works after console insert, but not after script insert - Python

I have a problem with a trigger.
I set up a trigger to update other tables after an insert into a table.
If I make an insert from the MySQL console, everything works fine, but if I do inserts, even with the same data, from an external Python script, the trigger does nothing, as you can see below.
I tried changing the definer to 'user'@'%' and 'root'@'%', but it still does nothing.
mysql> select vid_visit,vid_money from videos where video_id=487;
+-----------+-----------+
| vid_visit | vid_money |
+-----------+-----------+
|        21 |     0.297 |
+-----------+-----------+
1 row in set (0,01 sec)
mysql> INSERT INTO `table`.`validEvents` ( `id` , `campaigns_id` , `video_id` , `date` , `producer_id` , `distributor_id` , `money_producer` , `money_distributor` , `type` ) VALUES ( NULL , '30', '487', '2010-05-20 01:20:00', '1', '0', '0.009', '0.000', 'PRE' );
Query OK, 1 row affected (0,00 sec)
mysql> select vid_visit,vid_money from videos where video_id=487;
+-----------+-----------+
| vid_visit | vid_money |
+-----------+-----------+
|        22 |     0.306 |
+-----------+-----------+
DROP TRIGGER IF EXISTS `updateVisitAndMoney`//
CREATE TRIGGER `updateVisitAndMoney` BEFORE INSERT ON `validEvents`
FOR EACH ROW BEGIN
  IF (NEW.type = 'PRE') THEN
    SET @eventcash = NEW.money_producer + NEW.money_distributor;
    UPDATE campaigns SET cmp_visit_distributed = cmp_visit_distributed + 1, cmp_money_distributed = cmp_money_distributed + NEW.money_producer + NEW.money_distributor WHERE idcampaigns = NEW.campaigns_id;
    UPDATE offer_producer SET ofp_visit_procesed = ofp_visit_procesed + 1, ofp_money_procesed = ofp_money_procesed + NEW.money_producer WHERE ofp_video_id = NEW.video_id AND ofp_money_procesed = NEW.campaigns_id;
    UPDATE videos SET vid_visit = vid_visit + 1, vid_money = vid_money + @eventcash WHERE video_id = NEW.video_id;
    IF (NEW.distributor_id != '') THEN
      UPDATE agreements SET visit_procesed = visit_procesed + 1, money_producer = money_producer + NEW.money_producer, money_distributor = money_distributor + NEW.money_distributor WHERE id_campaigns = NEW.campaigns_id AND id_video = NEW.video_id AND ag_distributor_id = NEW.distributor_id;
      UPDATE eventForDay SET visit = visit + 1, money = money + NEW.money_distributor WHERE date = SYSDATE() AND campaign_id = NEW.campaigns_id AND user_id = NEW.distributor_id;
      UPDATE eventForDay SET visit = visit + 1, money = money + NEW.money_producer WHERE date = SYSDATE() AND campaign_id = NEW.campaigns_id AND user_id = NEW.producer_id;
    ELSE
      UPDATE eventForDay SET visit = visit + 1, money = money + NEW.money_producer WHERE date = SYSDATE() AND campaign_id = NEW.campaigns_id AND user_id = NEW.producer_id;
    END IF;
  END IF;
END
//

I think that it's far more likely that you are encountering an uncaught error, rather than that the trigger is not executing, particularly since it executes successfully from the console.
You need to isolate where the error occurs - in the trigger itself, or in the calling script.
In your Python script, print out the SQL statement that Python sends to MySQL so you can confirm it is constructed as you expect. For example, if NEW.type does not equal 'PRE', the trigger will have executed but will not have resulted in any updates.
Also ensure that you are checking for errors on the insert. I'm not a Python programmer, so I can't tell you exactly how that is done, but this seems to be what you're looking for.
If neither of these leads you to the problem, comment out the whole if (NEW.type = 'PRE') THEN block and do a simple modification, such as setting NEW.type to 'debug'. After ensuring that the trigger does in fact execute, retest successively with more of the real code added back in until you isolate the problem.
Also, with regard to Marco's comment, I would be surprised if the script didn't auto-commit upon successful completion. Indeed, I would make this statement about any script/language.
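For illustration, a rough sketch of that kind of check in Python, assuming the mysql-connector-python driver (the question does not say which driver the script actually uses, so treat the connection details below as placeholders):
import mysql.connector

# Hypothetical connection details; adjust to match the script's configuration.
conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cur = conn.cursor()

sql = ("INSERT INTO validEvents (campaigns_id, video_id, date, producer_id, distributor_id, "
       "money_producer, money_distributor, type) "
       "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")
params = ('30', '487', '2010-05-20 01:20:00', '1', '0', '0.009', '0.000', 'PRE')

try:
    cur.execute(sql, params)
    print("Statement sent to MySQL:", cur.statement)  # confirm what was actually executed
    conn.commit()  # without a commit (or autocommit), the insert is rolled back when the script exits
except mysql.connector.Error as err:
    print("INSERT failed:", err)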

For anyone finding this question in the future: I am guessing the solution is setting MySQL autocommit to true.
After opening the connection, execute the following:
SET autocommit = 1
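In Python this can also be done when the connection is created; for example, with mysql-connector-python (a sketch only, since the asker's driver is not known):
import mysql.connector

# Either enable autocommit directly on the connection...
conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb",
                               autocommit=True)

# ...or issue the SET statement manually right after connecting:
cur = conn.cursor()
cur.execute("SET autocommit = 1")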

Shell commands only run in the console, not on the actual server. You can use a UDF or polling to accomplish what you need.

Related

How to make a list of commands and then display it in a table

So I've decided to try to make a nice cmd menu on Windows in Python, but I got stuck on one of the first things. I want to create a list of commands and then display them in a table. I am using prettytable to create the tables.
So I would like my output to look like this:
+---------+-------------------------------+
| Command | Usage                         |
+---------+-------------------------------+
| Help    | /help                         |
| Help2   | /help 2                       |
| Help3   | /help 3                       |
+---------+-------------------------------+
But I cannot figure out how to create and work with the list. The code currently looks like this
from prettytable import PrettyTable
_cmdTable = PrettyTable(["Command", "Usage"])
#Here I create the commands
help = ['Help','/help']
help2 = ['Help2','/help2']
help3 = ['Help2','/help3']
#And here I add rows and print it
_cmdTable.add_row([help[0], help[1]])
_cmdTable.add_row([help2[0], help2[1]])
_cmdTable.add_row([help3[0], help3[1]])
print(_cmdTable)
But this is way too much work. I would like to make it easier, but I cannot figure out how. I'd imagine it to look something like this:
from prettytable import PrettyTable
_cmdTable = PrettyTable(["Command", "Usage"])
commands = {["Help", "/help"], ["Help2", "/help2"], ["Help3", "/help3"]}
for cmd in commands:
    _cmdTable.add_row([cmd])
print(_cmdTable)
I know it's possible, I just don't know how. It doesn't have to use the same module for tables; if you know one that's better or fits this request more, use it.
I basically want to make the process easier, not do it manually every time I add a new command. Hope I explained it clearly. Thanks!
You can have more manual control using string formatting
header = ['Command', 'Usage']
rows = [['Help', '/help'], ['Help2', '/help 2'], ['Help3', '/help 3']]
spacer1 = 10
spacer2 = 20
line = '+' + '-'*spacer1 + '+' + '-'*spacer2 + '+\n'
header = f'| {header[0]:<{spacer1-1}}|{header[1]:^{spacer2}}|\n'
table_rows = ''
for row in rows:
    table_rows += f'| {row[0]:<{spacer1-1}}|{row[1]:^{spacer2}}|\n'
print(line + header + line + table_rows + line)
Edit: Added spacing control with variables.
You can't put lists in a set. commands should either be a list of lists, or a set of tuples. Using a list is probably more appropriate in this application, because you may want the table items in a specific order.
You shouldn't put cmd inside another list. Each element of commands is already a list.
commands = [["Help", "/help"], ["Help2", "/help2"], ["Help3", "/help3"]]
for cmd in commands:
    _cmdTable.add_row(cmd)
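Putting it together, a minimal self-contained version of the corrected snippet looks like this:
from prettytable import PrettyTable

commands = [["Help", "/help"], ["Help2", "/help2"], ["Help3", "/help3"]]

cmd_table = PrettyTable(["Command", "Usage"])
for cmd in commands:
    cmd_table.add_row(cmd)

# Recent prettytable releases also provide add_rows(), which, if available in
# your version, replaces the loop: cmd_table.add_rows(commands)
print(cmd_table)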

Making comparing 2 tables faster (Postgres/SQLAlchemy)

I wrote code in Python to manipulate a table I have in my database, using SQLAlchemy. Basically, I have table 1 with 2,500,000 entries and table 2 with 200,000 entries. What I am trying to do is compare the source IP and dest IP in table 1 with the source IP and dest IP in table 2. If there is a match, I replace the source IP and dest IP in table 1 with the matching data from table 2 and add the entry to table 3. My code also checks whether the entry is already in the new table; if so, it skips it and goes on to the next row.
My problem is that it's extremely slow. I launched my script yesterday, and in 24 hours it only got through 47,000 entries out of 2,500,000. I am wondering if there is any way I can speed up the process. It's a Postgres DB and I can't tell whether the script taking this much time is reasonable or something is wrong. If anyone has had a similar experience with something like this, how much time did it take to complete?
Many thanks.
session = Session()
i = 0
start_id = 1
flows = session.query(Table1).filter(Table1.id >= start_id).all()
result_number = len(flows)
vlan_list = {"['0050']", "['0130']", "['0120']", "['0011']", "['0110']"}
while i < result_number:
    for flow in flows:
        if flow.vlan_destination in vlan_list:
            usage = session.query(Table2).filter(Table2.ip == str(flow.ip_destination)).all()
            if len(usage) > 0:
                usage = usage[0].usage
            else:
                usage = str(flow.ip_destination)
            usage_ip_src = session.query(Table2).filter(Table2.ip == str(flow.ip_source)).all()
            if len(usage_ip_src) > 0:
                usage_ip_src = usage_ip_src[0].usage
            else:
                usage_ip_src = str(flow.ip_source)
            if flow.protocol == "17":
                protocol = func.REPLACE(flow.protocol, "17", 'UDP')
            elif flow.protocol == "1":
                protocol = func.REPLACE(flow.protocol, "1", 'ICMP')
            elif flow.protocol == "6":
                protocol = func.REPLACE(flow.protocol, "6", 'TCP')
            else:
                protocol = flow.protocol
            is_in_db = session.query(Table3).filter(Table3.protocol == protocol)\
                .filter(Table3.application == flow.application)\
                .filter(Table3.destination_port == flow.destination_port)\
                .filter(Table3.vlan_destination == flow.vlan_destination)\
                .filter(Table3.usage_source == usage_ip_src)\
                .filter(Table3.state == flow.state)\
                .filter(Table3.usage_destination == usage).count()
            if is_in_db == 0:
                to_add = Table3(usage_ip_src, usage, protocol, flow.application, flow.destination_port,
                                flow.vlan_destination, flow.state)
                session.add(to_add)
                session.flush()
                session.commit()
                print("added " + str(i))
            else:
                print("usage already in DB")
        i = i + 1
session.close()
EDIT: As requested, here are more details. Table 1 has 11 columns; the two we are interested in are source IP and dest IP (see the Table 1 screenshot below).
Here is Table 2 (screenshot below). It has an IP and a usage. What my script does is take the source IP and dest IP from Table 1 and look up whether there is a match in Table 2. If so, it replaces the IP address with the usage and adds this, along with some of the columns of Table 1, to Table 3 (screenshot below).
While doing this, when adding the protocol column to Table 3, it writes the protocol name instead of the number, just to make it more readable.
EDIT 2: I am trying to think about this differently, so I made a diagram of my problem (the X problem).
What I am trying to figure out is whether my code (the Y solution) is working as intended. I've been coding in Python for only a month and I feel like I am messing something up. My code is supposed to take every row from Table 1, compare it to Table 2 and add data to Table 3. Table 1 has over 2 million entries, so it's understandable that this should take a while, but it's too slow. For example, when I had to load the data from the API into the DB, it went faster than the comparisons I'm trying to do with everything that is already in the DB. I am running my code on a virtual machine that has sufficient memory, so I am sure it's my code that is lacking, and I need direction as to what can be improved. Screenshots of my tables:
(screenshots of Table 2, Table 3 and Table 1)
EDIT 3: PostgreSQL query
SELECT
    coalesce(table2_1.usage, table1.ip_source) AS coalesce_1,
    coalesce(table2_2.usage, table1.ip_destination) AS coalesce_2,
    CASE table1.protocol
        WHEN %(param_1)s THEN %(param_2)s
        WHEN %(param_3)s THEN %(param_4)s
        WHEN %(param_5)s THEN %(param_6)s
        ELSE table1.protocol
    END AS anon_1,
    table1.application AS table1_application,
    table1.destination_port AS table1_destination_port,
    table1.vlan_destination AS table1_vlan_destination,
    table1.state AS table1_state
FROM
    table1
    LEFT OUTER JOIN table2 AS table2_2 ON table2_2.ip = table1.ip_destination
    LEFT OUTER JOIN table2 AS table2_1 ON table2_1.ip = table1.ip_source
WHERE
    table1.vlan_destination IN (
        %(vlan_destination_1)s,
        %(vlan_destination_2)s,
        %(vlan_destination_3)s,
        %(vlan_destination_4)s,
        %(vlan_destination_5)s
    )
    AND NOT (
        EXISTS (
            SELECT 1
            FROM table3
            WHERE table3.usage_source = coalesce(table2_1.usage, table1.ip_source)
              AND table3.usage_destination = coalesce(table2_2.usage, table1.ip_destination)
              AND table3.protocol = CASE table1.protocol
                                        WHEN %(param_1)s THEN %(param_2)s
                                        WHEN %(param_3)s THEN %(param_4)s
                                        WHEN %(param_5)s THEN %(param_6)s
                                        ELSE table1.protocol
                                    END
              AND table3.application = table1.application
              AND table3.destination_port = table1.destination_port
              AND table3.vlan_destination = table1.vlan_destination
              AND table3.state = table1.state
        )
    )
Given the current question, I think this at least comes close to what you might be after. The idea is to perform the entire operation in the database, instead of fetching everything – the whole 2,500,000 rows – and filtering in Python etc.:
from sqlalchemy import func, case, insert
from sqlalchemy.orm import aliased

def newhotness(session, vlan_list):
    # The query needs to join Table2 twice, so it has to be aliased
    dst = aliased(Table2)
    src = aliased(Table2)
    # Prepare required SQL expressions
    usage = func.coalesce(dst.usage, Table1.ip_destination)
    usage_ip_src = func.coalesce(src.usage, Table1.ip_source)
    protocol = case({"17": "UDP",
                     "1": "ICMP",
                     "6": "TCP"},
                    value=Table1.protocol,
                    else_=Table1.protocol)
    # Form a query producing the data to insert to Table3
    flows = session.query(
        usage_ip_src,
        usage,
        protocol,
        Table1.application,
        Table1.destination_port,
        Table1.vlan_destination,
        Table1.state).\
        outerjoin(dst, dst.ip == Table1.ip_destination).\
        outerjoin(src, src.ip == Table1.ip_source).\
        filter(Table1.vlan_destination.in_(vlan_list),
               ~session.query(Table3).
               filter_by(usage_source=usage_ip_src,
                         usage_destination=usage,
                         protocol=protocol,
                         application=Table1.application,
                         destination_port=Table1.destination_port,
                         vlan_destination=Table1.vlan_destination,
                         state=Table1.state).
               exists())
    stmt = insert(Table3).from_select(
        ["usage_source", "usage_destination", "protocol", "application",
         "destination_port", "vlan_destination", "state"],
        flows)
    return session.execute(stmt)
If the vlan_list is selective, or in other words filters out most rows, this will perform far fewer operations in the database. Depending on the size of Table2 you may benefit from indexing Table2.ip, but do test first. If it is relatively small, I would guess that PostgreSQL will perform a hash or nested loop join there. If some column of the ones used to filter out duplicates in Table3 is unique, you could perform an INSERT ... ON CONFLICT ... DO NOTHING instead of removing duplicates in the SELECT using the NOT EXISTS subquery expression (which PostgreSQL will perform as an antijoin). If there is a possibility that the flows query may produce duplicates, add a call to Query.distinct() to it.
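For illustration, a rough sketch of that ON CONFLICT variant, assuming Table3 has a suitable unique constraint (named uq_table3_flow here purely for the sake of the example):
from sqlalchemy.dialects.postgresql import insert as pg_insert

# 'flows' is the same SELECT as above, but the NOT EXISTS subquery can be dropped,
# since duplicates are now discarded by ON CONFLICT DO NOTHING instead.
stmt = pg_insert(Table3).from_select(
    ["usage_source", "usage_destination", "protocol", "application",
     "destination_port", "vlan_destination", "state"],
    flows
).on_conflict_do_nothing(constraint="uq_table3_flow")

session.execute(stmt)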

Apache Beam per-user session windows are unmerged

We have an app that has users; each user uses our app for something like 10-40 minutes per go, and I would like to count the distribution/occurrences of events happening per such session, based on specific events having happened (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session").
(After this I'd like to count these higher-level events per day, but that's a separate question)
For this I've been looking into session windows, but all the docs seem geared towards global session windows, whereas I'd like to create them per user (which is also a natural partitioning).
I'm having trouble finding docs (Python preferred) on how to do this. Could you point me in the right direction?
Or in other words: How do I create per-user per-session windows that can output more structured (enriched) events?
What I have
class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        _, x = element
        logging.info(">>> Received %s %s with window=%s", x['jsonPayload']['value'], x['timestamp'], window)
        yield element

def sum_by_event_type(user_session_events):
    logging.debug("Received %i events: %s", len(user_session_events), user_session_events)
    d = {}
    for key, group in groupby(user_session_events, lambda e: e['jsonPayload']['value']):
        d[key] = len(list(group))
    logging.info("After counting: %s", d)
    return d
# ...
by_user = valid \
    | 'keyed_on_user_id' >> beam.Map(lambda x: (x['jsonPayload']['userId'], x))

session_gap = 5 * 60  # [s]; 5 minutes

user_sessions = by_user \
    | 'user_session_window' >> beam.WindowInto(beam.window.Sessions(session_gap),
                                               timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW) \
    | 'debug_printer' >> beam.ParDo(DebugPrinter()) \
    | beam.CombinePerKey(sum_by_event_type)
What it outputs
INFO:root:>>> Received event_1 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_2 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_3 2019-03-12T08:54:30.400Z with window=[1552380870.4, 1552381170.4)
INFO:root:>>> Received event_4 2019-03-12T08:54:36.300Z with window=[1552380876.3, 1552381176.3)
INFO:root:>>> Received event_5 2019-03-12T08:54:38.100Z with window=[1552380878.1, 1552381178.1)
So as you can see, the Sessions() window doesn't expand the window, but only groups very close events together... What's being done wrong?
You can get it to work by adding a Group By Key transform after the windowing. You have assigned keys to the records but haven't actually grouped them together by key and session windowing (which works per-key) does not know that these events need to be merged together.
To confirm this I did a reproducible example with some in-memory dummy data (to isolate Pub/Sub from the problem and be able to test it more quickly). All five events will have the same key or user_id but they will "arrive" sequentially 1, 2, 4 and 8 seconds apart from each other. As I use session_gap of 5 seconds I expect the first 4 elements to be merged into the same session. The 5th event will take 8 seconds after the 4th one so it has to be relegated to the next session (gap over 5s). Data is created like this:
data = [{'user_id': 'Thanos', 'value': 'event_{}'.format(event), 'timestamp': time.time() + 2**event} for event in range(5)]
We use beam.Create(data) to initialize the pipeline and beam.window.TimestampedValue to assign the "fake" timestamps. Again, we are just simulating streaming behavior with this. After that, we create the key-value pairs thanks to the user_id field, we window into window.Sessions and we add the missing beam.GroupByKey() step. Finally, we log the results with a slightly modified version of DebugPrinter. The pipeline now looks like this:
events = (p
          | 'Create Events' >> beam.Create(data) \
          | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp'])) \
          | 'keyed_on_user_id' >> beam.Map(lambda x: (x['user_id'], x))
          | 'user_session_window' >> beam.WindowInto(window.Sessions(session_gap),
                                                     timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW) \
          | 'Group' >> beam.GroupByKey()
          | 'debug_printer' >> beam.ParDo(DebugPrinter()))
where DebugPrinter is:
class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        for x in element[1]:
            logging.info(">>> Received %s %s with window=%s", x['value'], x['timestamp'], window)
        yield element
If we test this without grouping by key we get the same behavior:
INFO:root:>>> Received event_0 1554117323.0 with window=[1554117323.0, 1554117328.0)
INFO:root:>>> Received event_1 1554117324.0 with window=[1554117324.0, 1554117329.0)
INFO:root:>>> Received event_2 1554117326.0 with window=[1554117326.0, 1554117331.0)
INFO:root:>>> Received event_3 1554117330.0 with window=[1554117330.0, 1554117335.0)
INFO:root:>>> Received event_4 1554117338.0 with window=[1554117338.0, 1554117343.0)
But after adding it, the windows now work as expected. Events 0 to 3 are merged together in an extended 12s session window. Event 4 belongs to a separate 5s session.
INFO:root:>>> Received event_0 1554118377.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_1 1554118378.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_3 1554118384.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_2 1554118380.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_4 1554118392.37 with window=[1554118392.37, 1554118397.37)
Full code here
Two additional things worth mentioning. The first one is that, even if running this locally in a single machine with the DirectRunner, records can come unordered (event_3 is processed before event_2 in my case). This is done on purpose to simulate distributed processing as documented here.
The last one is that if you get a stack trace like this:
TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'Write Results/Write/WriteImpl/WriteBundles']
downgrade from 2.10.0/2.11.0 SDK to 2.9.0. See this answer for example.

SQL Query translation to SQLAlchemy

Hello, I am trying to translate the following relatively simple query to SQLAlchemy, but I get:
('Unexpected error:', <class 'sqlalchemy.exc.InvalidRequestError'>)
SELECT model, COUNT(model) AS count FROM log.logs
WHERE SOURCE = "WEB" AND year(timestamp) = 2015 AND month(timestamp) = 1
and account = "Test" and brand = "Nokia" GROUP BY model ORDER BY count DESC limit 10
This is what I wrote but it is not working. What is wrong ?
devices = db.session.query(Logs.model).filter_by(source=source).filter_by(account=acc).filter_by(brand=brand).\
    filter_by(year=year).filter_by(month=month).group_by(Logs.model).order_by(Logs.model.count().desc()).all()
It's a bit hard to tell from your code sample, but the following is hopefully the correct SQLAlchemy code. Try:
from sqlalchemy.sql import func
devices = (db.session
    .query(Logs.model, func.count(Logs.model).label('count'))
    .filter_by(source=source)
    .filter_by(account=acc)
    .filter_by(brand=brand)
    .filter_by(year=year)
    .filter_by(month=month)
    .group_by(Logs.model)
    .order_by(func.count(Logs.model).desc())
    .all())
Note that I've enclosed the query in a (...) to avoid having to use \ at the end of each line.
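If Logs has only a timestamp column rather than separate year and month columns (as the year(timestamp)/month(timestamp) calls in the original SQL suggest), a variant using extract() that also keeps the LIMIT 10 from the original query might look like this (a sketch; the column names are assumed from the SQL above):
from sqlalchemy.sql import func

devices = (db.session
    .query(Logs.model, func.count(Logs.model).label('count'))
    .filter(Logs.source == source,
            Logs.account == acc,
            Logs.brand == brand,
            func.extract('year', Logs.timestamp) == year,
            func.extract('month', Logs.timestamp) == month)
    .group_by(Logs.model)
    .order_by(func.count(Logs.model).desc())
    .limit(10)
    .all())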

Large Data Analytics

I'm trying to analyze a large amount of GitHub Archive Data and am stumped by many limitations.
So my analysis requires me to search a 350 GB dataset. I have a local copy of the data, and there is also a copy available via Google BigQuery. The local dataset is split into 25,000 individual files. The dataset is a timeline of events.
I want to plot the number of stars each repository has since its creation. (Only for repos with > 1000 currently)
I can get this result very quickly using Google BigQuery, but it "analyzes" 13.6GB of data each time. This limits me to <75 requests without having to pay $5 per additional 75.
My other option is to search through my local copy, but searching through each file for a specific string (repository name) takes way too long. It took over an hour on an SSD to get through half the files before I killed the process.
What is a better way I can approach analyzing such a large amount of data?
Python Code for Searching Through all Local Files:
for yy in range(11,15):
    for mm in range(1,13):
        for dd in range(1,32):
            for hh in range(0,24):
                counter = counter + 1
                if counter < startAt:
                    continue
                if counter > stopAt:
                    continue
                #print counter
                strHH = str(hh)
                strDD = str(dd)
                strMM = str(mm)
                strYY = str(yy)
                if len(strDD) == 1:
                    strDD = "0" + strDD
                if len(strMM) == 1:
                    strMM = "0" + strMM
                #print strYY + "-" + strMM + "-" + strDD + "-" + strHH
                try:
                    f = json.load(open("/Volumes/WD_1TB/GitHub Archive/20"+strYY+"-"+strMM+"-"+strDD+"-"+strHH+".json", 'r'), cls=ConcatJSONDecoder)
                    for each_event in f:
                        if(each_event["type"] == "WatchEvent"):
                            try:
                                num_stars = int(each_event["repository"]["watchers"])
                                created_at = each_event["created_at"]
                                json_entry[4][created_at] = num_stars
                            except Exception, e:
                                print e
                except Exception, e:
                    print e
Google BigQuery SQL command:
SELECT repository_owner, repository_name, repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE type = "WatchEvent"
AND repository_owner = "mojombo"
AND repository_name = "grit"
ORDER BY created_at
I am really stumped so any advice at this point would be greatly appreciated.
If most of your BigQuery queries only scan a subset of the data, you can do one initial query to pull out that subset (use "Allow Large Results"). Then subsequent queries against your small table will cost less.
For example, if you're only querying records where type = "WatchEvent", you can run a query like this:
SELECT repository_owner, repository_name, repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE type = "WatchEvent"
And set a destination table as well as the "Allow Large Results" flag. This query will scan the full 13.6 GB, but the output is only 1 GB, so subsequent queries against the output table will only charge you for 1 GB at most.
That still might not be cheap enough for you, but just throwing the option out there.
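For example, if the filtered copy were written to a hypothetical table mydataset.watch_events, later lookups could be run against that much smaller table; a sketch using the google-cloud-bigquery client (the original answer predates this client, but the idea is the same):
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT repository_owner, repository_name, repository_watchers, created_at
    FROM `myproject.mydataset.watch_events`
    WHERE repository_owner = 'mojombo' AND repository_name = 'grit'
    ORDER BY created_at
"""
for row in client.query(query).result():
    print(row.repository_owner, row.repository_name, row.repository_watchers, row.created_at)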
I found a solution to this problem: using a database. I imported the relevant data from my 360+ GB of JSON data into a MySQL database and queried that instead. What used to be a 3+ hour query time per element became <10 seconds.
MySQL wasn't the easiest thing to set up, and the import took approximately 7.5 hours, but the results made it well worth it for me.
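As a rough sketch of what those fast MySQL lookups can look like (the table and column names here are hypothetical, assuming the WatchEvent rows were imported into a table with an index on the owner/name pair):
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="githubarchive")
cur = conn.cursor()
cur.execute(
    "SELECT repository_owner, repository_name, repository_watchers, created_at "
    "FROM watch_events "
    "WHERE repository_owner = %s AND repository_name = %s "
    "ORDER BY created_at",
    ("mojombo", "grit"))
for owner, name, watchers, created_at in cur:
    print(owner, name, watchers, created_at)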
