Python & MySQL -- find all rows within 3 seconds of one another

I have physical devices that are part of a game. Players get points for activating one of these devices. I am trying to incorporate logic to flag 'cheaters'. All activations of a device count as one 'event' if they occur within a 2.5 second span. I am trying to identify the activation spans of a 'cheating' player (someone rapidly jamming the button over and over). In the database, this would look like multiple rows with the same deviceAddress, with eventValues of 3 or 4, and with eventTimes within 3 seconds of one another.
I am having a very hard time trying to figure out how to appropriately find events within 3 seconds of one another. Here is the error message I'm currently getting:
if delta <= (timePreviousRow - timedelta(seconds=3)) and devPreviousRow == devThisRow:
TypeError: can't compare datetime.timedelta to datetime.datetime
Any help would be greatly appreciated.
Here is my MySQL query, set to the variable cheatFinder:
SELECT
    eventTime,
    deviceAddress,
    node_id,
    eventType,
    eventValue,
    sequence
FROM device_data
WHERE eventTime >= DATE_SUB(NOW(), INTERVAL 1 DAY)
  AND eventType = 1
  AND eventValue BETWEEN 3 AND 4
ORDER BY deviceAddress, eventTime;
Here are the first 10 results:
+---------------------+------------------+------------+-----------+------------+----------+
| eventTime           | deviceAddress    | node_id    | eventType | eventValue | sequence |
+---------------------+------------------+------------+-----------+------------+----------+
| 2017-01-26 21:19:03 | 776128           | 221        | 1         | 03         | 3        |
| 2017-01-26 21:48:19 | 776128           | 221        | 1         | 03         | 4        |
| 2017-01-27 06:45:50 | 776128           | 221        | 1         | 04         | 18       |
| 2017-01-27 12:41:03 | 776128           | 221        | 1         | 03         | 24       |
| 2017-01-26 23:03:18 | 6096372          | 301        | 1         | 03         | 119      |
| 2017-01-27 00:21:47 | 6096372          | 301        | 1         | 03         | 120      |
| 2017-01-26 23:50:27 | 27038894         | 157        | 1         | 03         | 139      |
| 2017-01-27 01:19:42 | 29641083         | 275        | 1         | 03         | 185      |
| 2017-01-27 00:10:13 | 30371381         | 261        | 1         | 03         | 82       |
| 2017-01-27 00:53:45 | 30371381         | 261        | 1         | 03         | 87       |
+---------------------+------------------+------------+-----------+------------+----------+
Here is the Python method in question:
import mysql.connector
import pandas
from datetime import datetime, timedelta
. . .
def findBloodyCheaters(self):
    self.cursor.execute(cheatFinder)
    resultSet = self.cursor.fetchall()
    targetDF = pandas.DataFrame(columns=['eventTime', 'deviceAddress', 'gateway_id',
                                         'eventType', 'eventValue', 'sequence'])
    timePreviousRow = None
    devPreviousRow = None
    for row in resultSet:
        if timePreviousRow is None and devPreviousRow is None:
            timePreviousRow = row[0]
            devPreviousRow = row[1]
        elif timePreviousRow is not None and devPreviousRow is not None:
            timeThisRow = row[0]
            devThisRow = row[1]
            delta = timePreviousRow - timeThisRow
            print(delta)
            if delta <= (timePreviousRow - timedelta(seconds=3)) and devPreviousRow == devThisRow:
                targetDF.append(row)
            timePreviousRow = timeThisRow
            devPreviousRow = devThisRow
    print(targetDF)

This is happening because you're trying to compare delta, which is a timedelta object, to (timePreviousRow - timedelta(seconds=3)), which is a datetime object. I would compare delta to just timedelta(seconds=3), so your code would look like:
if delta <= timedelta(seconds=3) and devPreviousRow == devThisRow:
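Beyond the type fix, note that the query orders rows by deviceAddress, eventTime ascending, so timePreviousRow - timeThisRow is never positive; subtracting the other way round keeps the delta non-negative. A minimal sketch of the corrected loop (flag_rapid_events is a hypothetical helper, assuming the row layout from the query above):

from datetime import timedelta

THRESHOLD = timedelta(seconds=3)

def flag_rapid_events(rows):
    # Assumes rows are (eventTime, deviceAddress, ...) tuples sorted by
    # deviceAddress, then eventTime ascending, as cheatFinder returns them.
    flagged = []
    prevTime, prevDev = None, None
    for row in rows:
        thisTime, thisDev = row[0], row[1]
        # Same device and fired within 3 seconds of the previous event.
        if thisDev == prevDev and (thisTime - prevTime) <= THRESHOLD:
            flagged.append(row)
        prevTime, prevDev = thisTime, thisDev
    return flagged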

Related

Python - pandas remove duplicate rows based on condition

I have a CSV with data that looks like this:
| id | code | date                      |
|----+------+---------------------------|
| 1  | 2    | 2022-10-05 07:22:39+00:00 |
| 1  | 0    | 2022-11-05 02:22:35+00:00 |
| 2  | 3    | 2021-01-05 10:10:15+00:00 |
| 2  | 0    | 2019-01-11 10:05:21+00:00 |
| 2  | 1    | 2022-01-11 10:05:22+00:00 |
| 3  | 2    | 2022-10-10 11:23:43+00:00 |
I want to remove duplicate ids based on the following condition:
For the code column, keep a value that is not equal to 0, choosing the one with the latest timestamp if there are several.
Add another column prev_code, which contains a list of all the remaining code values that did not end up in the code column.
Something like this:
| id | code | prev_code |
|----+------+-----------|
| 1  | 2    | [0]       |
| 2  | 1    | [0,2]     |
| 3  | 2    | []        |
There is probably a sleeker solution but something along the following lines should work.
import pandas as pd

df = pd.read_csv('file.csv')
# Per id, take the code of the row with the latest date among non-zero codes.
lastcode = df[df.code != 0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])
# Collect every other code value per id into a list.
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': map(lambda x: x[0], lastcode.index.values), 'code': lastcode.values, 'prev_code': prev_codes.values})
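If a sleeker route is wanted, here is a hedged alternative sketch using sort_values and drop_duplicates (assumes the same file and column names as above; an id whose only code is 0 would get no code value):

import pandas as pd

df = pd.read_csv('file.csv', parse_dates=['date'])

# Latest non-zero code per id: sort by date, keep the last row per id.
latest = (df[df['code'] != 0]
          .sort_values('date')
          .drop_duplicates('id', keep='last')
          .set_index('id')['code'])

# Every other code per id goes into prev_code.
prev = (df.groupby('id')['code']
          .apply(lambda s: [c for c in s if c != latest.get(s.name)])
          .rename('prev_code'))

result = pd.concat([latest, prev], axis=1).reset_index()
print(result)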

How to assign time stamp to the command in Python?

There are several types of commands in the third column of the text file, so I am using regular expressions to count the occurrences of each type of command. For example, ACTIVE occurs 3 times and REFRESH 2 times. I want to make the program more flexible by also recording the time of each command. Since one command can occur more than once, if the script maps each command to its times, users will know which ACTIVE occurred at what time. Any guidance or suggestions are welcome.
My code:
import re

a = a_1 = b = b_1 = c = d = e = 0
lines = open("page_stats.txt", "r").readlines()
for line in lines:
    if re.search(r"WRITING_A", line):
        a_1 += 1
    elif re.search(r"WRITING", line):
        a += 1
    elif re.search(r"READING_A", line):
        b_1 += 1
    elif re.search(r"READING", line):
        b += 1
    elif re.search(r"PRECHARGE", line):
        c += 1
    elif re.search(r"ACTIVE", line):
        d += 1
File content:
-----------------------------------------------------------------
| Number | Time | Command   | Data |
-----------------------------------------------------------------
|   1    | 0015 | ACTIVE    |      |
|   2    | 0030 | WRITING   |      |
|   3    | 0100 | WRITING_A |      |
|   4    | 0115 | PRECHARGE |      |
|   5    | 0120 | REFRESH   |      |
|   6    | 0150 | ACTIVE    |      |
|   7    | 0200 | READING   |      |
|   8    | 0314 | PRECHARGE |      |
|   9    | 0318 | ACTIVE    |      |
|  10    | 0345 | WRITING_A |      |
|  11    | 0430 | WRITING_A |      |
|  12    | 0447 | WRITING   |      |
|  13    | 0503 | PRECHARGE |      |
|  14    | 0610 | REFRESH   |      |
Assuming you want to count the occurrences of each command and store the timestamps of each command as well, would you please try:
import re

count = {}
timestamps = {}
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] = count[m[3]] + 1 if m[3] in count else 1
            if m[3] in timestamps:
                timestamps[m[3]].append(m[2])
            else:
                timestamps[m[3]] = [m[2]]
# see the limited result (example)
#print(count["ACTIVE"])
#print(timestamps["ACTIVE"])
# see the results
for key in count:
    print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))
Output:
REFRESH   :  2, ['0120', '0610']
WRITING   :  2, ['0030', '0447']
PRECHARGE :  3, ['0115', '0314', '0503']
ACTIVE    :  3, ['0015', '0150', '0318']
READING   :  1, ['0200']
WRITING_A :  3, ['0100', '0345', '0430']
m = re.split(r"\s*\|\s*", line) splits line on a pipe character which may be preceded and/or followed by blank characters.
The list elements m[1], m[2], and m[3] then correspond to Number, Time, and Command, in that order.
The condition if len(m) > 3 and re.match(r"\d+", m[1]) skips the header lines.
Then the dictionary variables count and timestamps are assigned, incremented, or appended to, one entry at a time.
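As a side note, the same bookkeeping can be written more compactly with the standard library's collections module; a minimal sketch (an alternative, not the original answer's code):

import re
from collections import Counter, defaultdict

count = Counter()
timestamps = defaultdict(list)
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] += 1               # Counter starts missing keys at 0
            timestamps[m[3]].append(m[2])  # defaultdict creates the list lazily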

Translating conditional RANK Window-Function from SQL to Pandas

I'm trying to translate a window function from SQL to pandas. It is only applied when a match is possible; otherwise a NULL (None) value is inserted.
SQL code (example)
SELECT
    [ID_customer],
    [cTimestamp],
    [TMP_Latest_request].[ID_req] AS [ID of Latest request]
FROM [table].[Customer] AS [Customer]
LEFT JOIN (
    SELECT * FROM (
        SELECT [ID_req], [ID_customer], [rTimestamp],
               RANK() OVER (PARTITION BY ID_customer ORDER BY rTimestamp DESC) AS rnk
        FROM [table].[Customer_request]
    ) AS [Q]
    WHERE rnk = 1
) AS [TMP_Latest_request]
ON [Customer].[ID_customer] = [TMP_Latest_request].[ID_customer]
Example
Joining the ID of the latest customer request (if exists) to the customer.
table: Customer
+-------------+------------+
| ID_customer | cTimestamp |
+-------------+------------+
| 1           | 2014       |
| 2           | 2014       |
| 3           | 2015       |
+-------------+------------+
table: Customer_request
+--------+-------------+------------+
| ID_req | ID_customer | rTimestamp |
+--------+-------------+------------+
| 1      | 1           | 2012       |
| 2      | 1           | 2013       |
| 3      | 1           | 2014       |
| 4      | 2           | 2014       |
+--------+-------------+------------+
Result: table: merged
+-------------+------------+----------------------+
| ID_customer | cTimestamp | ID of Latest request |
+-------------+------------+----------------------+
| 1           | 2014       | 3                    |
| 2           | 2014       | 4                    |
| 3           | 2015       | None/NULL            |
+-------------+------------+----------------------+
What is the equivalent in Python Pandas?
Instead of using the RANK() function, you can simply use the query below, which is easier to convert.
SELECT A.ID_Customer, A.cTimeStamp, B.ID_req
FROM Customer A
LEFT JOIN (
    SELECT ID_Customer, MAX(ID_req) AS ID_req
    FROM Customer_request
    GROUP BY ID_Customer
) B
ON A.ID_Customer = B.ID_Customer
Try the query above; if you are facing any issues, ask me in the comments.
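To answer the pandas part of the question directly, here is a minimal hedged sketch of the equivalent (the DataFrames and column names simply mirror the example tables above; groupby().rank() plays the role of RANK() OVER (PARTITION BY ...)):

import pandas as pd

customer = pd.DataFrame({'ID_customer': [1, 2, 3],
                         'cTimestamp': [2014, 2014, 2015]})
customer_request = pd.DataFrame({'ID_req': [1, 2, 3, 4],
                                 'ID_customer': [1, 1, 1, 2],
                                 'rTimestamp': [2012, 2013, 2014, 2014]})

# RANK() OVER (PARTITION BY ID_customer ORDER BY rTimestamp DESC) == 1
rnk = (customer_request.groupby('ID_customer')['rTimestamp']
                       .rank(method='min', ascending=False))
latest = customer_request.loc[rnk == 1, ['ID_customer', 'ID_req']]

# LEFT JOIN keeps customers without requests (ID_customer 3 -> NaN/None).
merged = (customer.merge(latest, on='ID_customer', how='left')
                  .rename(columns={'ID_req': 'ID of Latest request'}))
print(merged)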

Points inside polygon in PostGIS

I have a table samplecol that contains (a sample):
 vessel_hash  | status | station | speed | latitude    | longitude   | course | heading | timestamp                | the_geom
--------------+--------+---------+-------+-------------+-------------+--------+---------+--------------------------+----------------------------------------------------
 103079215239 | 99     | 841     | 5     | -5.41844510 | 36.12160900 | 314    | 511     | 2016-06-12T06:31:04.000Z | 0101000020E61000001BF33AE2900F424090AF4EDF7CAC15C0
 103079215239 | 99     | 3008    | 0     | -5.41778710 | 36.12144900 | 117    | 511     | 2016-06-12T06:43:27.000Z | 0101000020E6100000E2900DA48B0F424042C3AC61D0AB15C0
 103079215239 | 99     | 841     | 17    | -5.42236900 | 36.12356900 | 259    | 511     | 2016-06-12T06:50:27.000Z | 0101000020E610000054E6E61BD10F42407C60C77F81B015C0
 103079215239 | 99     | 841     | 17    | -5.41781710 | 36.12147900 | 230    | 511     | 2016-06-12T06:27:03.000Z | 0101000020E61000004D13B69F8C0F424097D6F03ED8AB15C0
 103079215239 | 99     | 841     | 61    | -5.42201900 | 36.13256100 | 157    | 511     | 2016-06-12T06:08:04.000Z | 0101000020E6100000CFDC43C2F71042409929ADBF25B015C0
 103079215239 | 99     | 841     | 9     | -5.41834020 | 36.12225000 | 359    | 511     | 2016-06-12T06:33:03.000Z | 0101000020E6100000CFF753E3A50F42408D68965F61AC15C0
I am trying to fetch all points inside a polygon with:
poisInpolygon = """SELECT col.vessel_hash,col.longitude,col.latitude,
ST_Contains(ST_GeomFromEWKT('SRID=4326; POLYGON((-15.0292969 47.6357836,-15.2050781 47.5172007,-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836,-15.0292969 47.6357836))'),
ST_GeomFromEWKT(col.the_geom)) FROM samplecol As col;"""
The output is:
(103079215291L, Decimal('40.87123100'), Decimal('29.24107000'), False)
(103079215291L, Decimal('40.86702000'), Decimal('29.23967000'), False)
(103079215291L, Decimal('40.87208200'), Decimal('29.22113000'), False)
(103079215291L, Decimal('40.86973200'), Decimal('29.23963000'), False)
(103079215291L, Decimal('40.87770800'), Decimal('29.20229900'), False)
I can't figure out why the results say False. Is this the correct way, or am I doing something wrong?
Also, does this query use the index on the field the_geom?
The query returns false because all points from your sample are outside of the given polygon: your points lie somewhere in the northeast of Tanzania, while the polygon covers southern Europe and northern Africa.
To test your query, I added another point somewhere in Málaga, which is inside your polygon, and it returned true just as expected (the last geometry in the INSERT statement, given as EWKT). This is the script:
CREATE TEMPORARY TABLE t (the_geom GEOMETRY);
INSERT INTO t VALUES ('0101000020E61000001BF33AE2900F424090AF4EDF7CAC15C0'),
('0101000020E6100000E2900DA48B0F424042C3AC61D0AB15C0'),
('0101000020E610000054E6E61BD10F42407C60C77F81B015C0'),
('0101000020E61000004D13B69F8C0F424097D6F03ED8AB15C0'),
('0101000020E6100000CFDC43C2F71042409929ADBF25B015C0'),
('0101000020E6100000CFF753E3A50F42408D68965F61AC15C0'),
(ST_GeomFromEWKT('SRID=4326;POINT(-4.4427 36.7233)'));
And here is your query:
db=# SELECT
ST_Contains(ST_GeomFromEWKT('SRID=4326; POLYGON((-15.0292969 47.6357836,-15.2050781 47.5172007,-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836,-15.0292969 47.6357836))'),
ST_GeomFromEWKT(col.the_geom))
FROM t As col;
st_contains
-------------
f
f
f
f
f
f
t
(7 rows)
Btw: storing the same coordinates as GEOMETRY and as NUMERIC is totally redundant. You might want to get rid of the columns latitude and longitude and extract their values with ST_X and ST_Y on demand.
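For illustration, a hedged sketch of that suggestion from the Python side (assuming the rows are fetched with psycopg2, which the Decimal/long tuples above suggest, and that the_geom is a GEOMETRY column; the connection string is hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical credentials
cur = conn.cursor()

# Derive the coordinates from the geometry itself instead of keeping
# redundant latitude/longitude NUMERIC columns.
cur.execute("""
    SELECT vessel_hash,
           ST_X(the_geom) AS longitude,
           ST_Y(the_geom) AS latitude
    FROM samplecol;
""")
for vessel_hash, lon, lat in cur.fetchall():
    print(vessel_hash, lon, lat)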

MySQL data to a python dict structure

The structure of my mysql table looks like this:
id | mid | liters | timestamp
---+-----+--------+--------------------
1  | 20  | 50     | 2016-10-11 10:53:25
2  | 30  | 60     | 2016-10-11 10:40:20
3  | 20  | 100    | 2016-10-11 10:09:27
4  | 30  | 110    | 2016-10-11 09:55:07
5  | 40  | 80     | 2016-10-11 09:44:46
6  | 40  | 90     | 2016-10-11 07:56:14
7  | 20  | 120    | 2016-04-08 13:27:41
8  | 20  | 130    | 2016-04-08 15:35:28
My desired output is like this:
dict = {
    20: {50: ['2016-10-11 10:53:25', '2016-10-11 10:53:25'],
         100: ['2016-10-11 10:53:25', '2016-10-11 10:09:27'],
         120: ['2016-10-11 10:09:27', '2016-04-08 13:27:41'],
         130: ['2016-04-08 13:27:41', '2016-04-08 15:35:28']},
    30: {60: ['2016-10-11 10:40:20', '2016-10-11 10:40:20'],
         110: ['2016-10-11 10:40:20', '2016-10-11 09:55:07']},
    40: {80: ['2016-10-11 09:44:46', '2016-10-11 09:44:46'],
         90: ['2016-10-11 09:44:46', '2016-10-11 07:56:14']}
}
If 50: ['2016-10-11 10:53:25', '2016-10-11 10:53:25'] (duplicating the timestamp when there is no previous one) is hard to produce, then just 50: ['2016-10-11 10:53:25'] is fine.
How can I extract this into a Python dict from my DB table?
I tried something like:
query_all = "SELECT GROUP_CONCAT(mid, liters, timestamp) FROM `my_table` GROUP BY mid ORDER BY mid"
But I don't know how to order this. Thanks for your time.
GROUP_CONCAT may have its own ORDER BY, e.g.:
GROUP_CONCAT(mid, liters, timestamp ORDER BY liters)
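For the dict-building step on the Python side, a minimal hedged sketch (assumes a DB-API cursor named cursor and the column order mid, liters, timestamp; pairing each liters value with the previous row's timestamp, duplicated for the first row of each mid, follows the pattern in the desired output above):

from collections import defaultdict

cursor.execute("""
    SELECT mid, liters, timestamp
    FROM my_table
    ORDER BY mid, timestamp DESC
""")

result = defaultdict(dict)
prev_mid, prev_ts = None, None
for mid, liters, ts in cursor.fetchall():
    if mid != prev_mid:
        prev_ts = ts  # first row for this mid: duplicate its own timestamp
    result[mid][liters] = [prev_ts, ts]
    prev_mid, prev_ts = mid, ts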
