The structure of my mysql table looks like this:
id | mid | liters | timestamp
1 | 20 | 50 | 2016-10-11 10:53:25
2 | 30 | 60 | 2016-10-11 10:40:20
3 | 20 | 100 | 2016-10-11 10:09:27
4 | 30 | 110 | 2016-10-11 09:55:07
5 | 40 | 80 | 2016-10-11 09:44:46
6 | 40 | 90 | 2016-10-11 07:56:14
7 | 20 | 120 | 2016-04-08 13:27:41
8 | 20 | 130 | 2016-04-08 15:35:28
My desired output is like this :
dict = {
20:{50:[2016-10-11 10:53:25,2016-10-11 10:53:25],100:[2016-10-11 10:53:25,2016-10-11 10:09:27],120:[2016-10-11 10:09:27,2016-04-08 13:27:41],130:[2016-04-08 13:27:41,2016-04-08 15:35:28]},
30:{60:[2016-10-11 10:40:20,2016-10-11 10:40:20],110:[2016-10-11 10:40:20,2016-10-11 09:55:07]}
40:{80:[2016-10-11 09:44:46,2016-10-11 09:44:46],90:[2016-10-11 09:44:46,2016-10-11 07:56:14]}
}
If 50:[2016-10-11 10:53:25,2016-10-11 10:53:25] (the case where there is no previous timestamp and the current one has to be duplicated) is hard to produce, just 50:[2016-10-11 10:53:25] is also fine.
How can I extract this into a Python dict from my DB table?
I tried something like:
query_all = "SELECT GROUP_CONCAT(mid,liters,timestamp) FROM `my_table` GROUP BY mid ORDER BY mid "
But I don't know how to order this. Thanks for your time.
GROUP_CONCAT may have its own ORDER BY, e.g.:
GROUP_CONCAT(mid, liters, timestamp ORDER BY liters)
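If you then want to build the nested dict on the Python side, a minimal sketch could look like the following (assuming the mysql-connector-python driver, placeholder connection details, and liters values that are unique per mid, as in the sample data):
from collections import defaultdict
import mysql.connector

conn = mysql.connector.connect(user='user', password='pass', database='db')  # placeholder credentials
cur = conn.cursor()
cur.execute("SELECT mid, liters, timestamp FROM my_table ORDER BY mid, id")

result = defaultdict(dict)
previous_ts = {}  # last timestamp seen for each mid
for mid, liters, ts in cur:
    prev = previous_ts.get(mid, ts)  # no previous timestamp -> duplicate the current one
    result[mid][liters] = [prev, ts]
    previous_ts[mid] = ts
print(dict(result))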
I have two data frames: one has data about railways and coordinates, and the other has the city code and coordinates. These coordinates don't match exactly, so I need to compute the difference between every coordinate in dataframe b and each row of dataframe a, and choose the city code with the smallest difference.
Dataframe a:
| FROMNODENO | TONODENO | LON | LAT |
| 3 | 4 | -46.720863 | -23.653625 |
| 3 | 5 | -46.868323 | -23.270917 |
| 4 | 6 | -46.869839 | -23.274121 |
Dataframe b:
| COD | LON | LAT |
| 5200050 | -16.75730 | -49.4412 |
| 3100104 | -18.48310 | -47.3916 |
| 5200100 | -16.19700 | -48.7057 |
I need the final dataframe to be something like this:
| FROMNODENO | TONODENO | LON | LAT | COD |
| 3 | 4 | -46.720863 | -23.653625 | 5200050 |
I imagine I need to do a for loop, but I don't know how to do that.
You can use a package like geopandas to solve this efficiently. However, if you can't or don't want to install another third-party dependency, then you can:
cross join these DataFrames,
calculate the absolute LAT/LON distance for each pair,
then filter that data down to the minimum for each node.
print(cities)
COD LON LAT
0 5200050 -16.7573 -49.4412
1 3100104 -18.4831 -47.3916
2 5200100 -16.1970 -48.7057
print(nodes)
FROMNODENO TONODENO LON LAT
0 3 4 -46.720863 -23.653625
1 3 5 -46.868323 -23.270917
2 4 6 -46.869839 -23.274121
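(For reproducibility, the two sample frames above can be rebuilt from the question's data, e.g.:)
import pandas as pd

cities = pd.DataFrame({
    "COD": [5200050, 3100104, 5200100],
    "LON": [-16.7573, -18.4831, -16.1970],
    "LAT": [-49.4412, -47.3916, -48.7057],
})
nodes = pd.DataFrame({
    "FROMNODENO": [3, 3, 4],
    "TONODENO": [4, 5, 6],
    "LON": [-46.720863, -46.868323, -46.869839],
    "LAT": [-23.653625, -23.270917, -23.274121],
})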
out = (
pd.merge(cities, nodes, how="cross", suffixes=("_city", "_node"))
.eval("combined_abs_dist = abs(LON_city - LON_node) + abs(LAT_city - LAT_node)")
.loc[lambda df:
df.groupby(["FROMNODENO", "TONODENO"])["combined_abs_dist"].idxmin()
]
)
print(out)
COD LON_city LAT_city FROMNODENO TONODENO LON_node LAT_node combined_abs_dist
3 3100104 -18.4831 -47.3916 3 4 -46.720863 -23.653625 51.975738
4 3100104 -18.4831 -47.3916 3 5 -46.868323 -23.270917 52.505906
5 3100104 -18.4831 -47.3916 4 6 -46.869839 -23.274121 52.504218
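If you want exactly the column layout asked for in the question, you can then drop the helper columns and rename the node coordinates, e.g.:
final = (
    out[["FROMNODENO", "TONODENO", "LON_node", "LAT_node", "COD"]]
    .rename(columns={"LON_node": "LON", "LAT_node": "LAT"})
    .reset_index(drop=True)
)
print(final)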
How can I transpose rows into columns in PySpark?
My original dataframe looks like this:
ID | DATE | APP | DOWNLOADS | ACTIVE_USERS
___________________________________________________________
0 | 2021-01-10 | FACEBOOK | 1000 | 5000
1 | 2021-01-10 | INSTAGRAM | 9000 | 90000
2 | 2021-02-10 | FACEBOOK | 9000 | 72000
3 | 2021-02-10 | INSTAGRAM | 16000 | 500000
But I need it like this:
ID | DATE | FACEBOOK - DOWNLOADS | FACEBOOK - ACTIVE_USERS | INSTAGRAM - DOWNLOADS | INSTAGRAM - ACTIVE_USERS
___________________________________________________________________________________________________________________
0 | 2021-01-10 | 1000 | 5000 | 9000 | 90000
1 | 2021-02-10 | 9000 | 72000 | 16000 | 500000
I tried using the answer on this question: Transpose pyspark rows into columns, but I couldn't make it work.
Could you help me please? Thank you!
From your example I assume the "ID" column is not needed to group on, as it appears to be reset in your desired output. That would make the query something like this:
import pyspark.sql.functions as F
df.groupBy("DATE").pivot('APP').agg(
F.first('DOWNLOADS').alias("DOWNLOADS"),
F.first("ACTIVE_USERS").alias("ACTIVE_USERS")
)
We group by the date, pivot on app, and retrieve the first value for downloads and active users.
Outcome:
+----------+------------------+---------------------+-------------------+----------------------+
| DATE|FACEBOOK_DOWNLOADS|FACEBOOK_ACTIVE_USERS|INSTAGRAM_DOWNLOADS|INSTAGRAM_ACTIVE_USERS|
+----------+------------------+---------------------+-------------------+----------------------+
|2021-02-10| 9000| 72000| 16000| 500000|
|2021-01-10| 1000| 5000| 9000| 90000|
+----------+------------------+---------------------+-------------------+----------------------+
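If you also need the exact headers from the question (e.g. "FACEBOOK - DOWNLOADS") and a fresh ID column, a possible follow-up sketch (the renaming rule and the window-based ID are assumptions on my part, not part of the original answer):
from pyspark.sql import functions as F, Window

pivoted = df.groupBy("DATE").pivot("APP").agg(
    F.first("DOWNLOADS").alias("DOWNLOADS"),
    F.first("ACTIVE_USERS").alias("ACTIVE_USERS"),
)
for c in pivoted.columns:
    if c != "DATE":
        # e.g. FACEBOOK_DOWNLOADS -> "FACEBOOK - DOWNLOADS"
        pivoted = pivoted.withColumnRenamed(c, c.replace("_", " - ", 1))
# re-create an ID from the date order (single-partition window; fine for small data)
pivoted = pivoted.withColumn("ID", F.row_number().over(Window.orderBy("DATE")) - 1)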
Is there a way to identify which GPS coordinates represent the same location? E.g., given the following DataFrame, how can I tell which Ids are from the same source location?
+-----+--------------+-------------+
| Id | VehLat | VehLong |
+-----+--------------+-------------+
| 66 | 63.3917005 | 10.4264724 |
| 286 | 63.429603 | 10.4167367 |
| 61 | 33.6687838 | 73.0755573 |
| 67 | 63.4150316 | 10.3980401 |
| 5 | 64.048128 | 10.083776 |
| 8 | 63.4332386 | 10.3971859 |
| 9 | 63.4305769 | 10.3927124 |
| 6 | 63.4293578 | 10.4164764 |
| 1 | 64.048254 | 10.084230 |
+-----+--------------+-------------+
Now, Ids 5 and 1 are basically the same location, but what's the best approach to classify these two as the same?
IIUC, you need this.
df[['VehLat','VehLong']].round(3).duplicated(keep=False)
You can change the number passed to round to adjust what you consider "the same".
Output
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
If you want the df itself with duplicate values, you can do as below
df[df[['VehLat','VehLong']].round(2).duplicated(keep=False)]
OR
df.loc[df[['VehLat','VehLong']].round(2).duplicated(keep=False)]
Output
id VehLat VehLong
1 286 63.429603 10.416737
4 5 64.048128 10.083776
7 6 63.429358 10.416476
8 1 64.048254 10.084230
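As a possible extension of this rounding approach (a sketch, not part of the original answer), you can also give every set of matching rows a shared label by grouping on the rounded coordinates:
df["loc_group"] = df.groupby(
    [df["VehLat"].round(3), df["VehLong"].round(3)]
).ngroup()  # rows sharing a rounded lat/lon get the same group id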
Use DataFrame.sort_values + Series.between; this gives you greater flexibility when establishing the criteria for considering two coordinates equivalent.
df2=df[['VehLat','VehLong']].sort_values(['VehLong','VehLat'])
eq=df2.apply(lambda x: x.diff().between(-0.001,0.001)).all(axis=1)
df2[eq|eq.shift(-1)]
This returns a data frame with the equivalent coordinates:
VehLat VehLong
4 64.048128 10.083776
8 64.048254 10.084230
7 63.429358 10.416476
1 63.429603 10.416737
df2[~(eq|eq.shift(-1))]
This returns the unique coordinates:
VehLat VehLong
6 63.430577 10.392712
5 63.433239 10.397186
3 63.415032 10.398040
0 63.391700 10.426472
2 33.668784 73.075557
You can restore the original order using DataFrame.sort_index:
df_noteq=df2[~(eq|eq.shift(-1))].sort_index()
print(df_noteq)
VehLat VehLong
0 63.391700 10.426472
2 33.668784 73.075557
3 63.415032 10.398040
5 63.433239 10.397186
6 63.430577 10.392712
I have physical devices that are part of a game. Players get points for activating one of these devices. I am trying to incorporate logic to flag 'cheaters'. All activations of a device count as one 'event' if they occur within a 2.5 second span. I am trying to identify spans of a 'cheating' player (someone just rapidly jamming the button over and over). In the database, this would look like multiple rows with the same deviceAddress, with eventValues of 3 or 4 and with eventTimes within 3 seconds of one another.
I am having a very hard time trying to figure out how to appropriately find events within 3 seconds of one another. Here is the error message I'm currently getting:
if delta <= (timePreviousRow - timedelta(seconds=3)) and devPreviousRow == devThisRow:
TypeError: can't compare datetime.timedelta to datetime.datetime
Any help would be greatly appreciated.
Here is my MySQL query, set to the variable cheatFinder:
SELECT
eventTime,
deviceAddress,
node_id,
eventType,
eventValue,
sequence
FROM device_data
WHERE eventTime >= DATE_SUB(NOW(), INTERVAL 1 DAY)
AND eventType = 1
AND eventValue BETWEEN 3 AND 4
ORDER BY deviceAddress, eventTime;
Here are the first 10 results:
+---------------------+------------------+------------+-----------+------------+----------+
| eventTime | deviceAddress | node_id | eventType | eventValue | sequence |
+---------------------+------------------+------------+-----------+------------+----------+
| 2017-01-26 21:19:03 | 776128 | 221 | 1 | 03 | 3 |
| 2017-01-26 21:48:19 | 776128 | 221 | 1 | 03 | 4 |
| 2017-01-27 06:45:50 | 776128 | 221 | 1 | 04 | 18 |
| 2017-01-27 12:41:03 | 776128 | 221 | 1 | 03 | 24 |
| 2017-01-26 23:03:18 | 6096372 | 301 | 1 | 03 | 119 |
| 2017-01-27 00:21:47 | 6096372 | 301 | 1 | 03 | 120 |
| 2017-01-26 23:50:27 | 27038894 | 157 | 1 | 03 | 139 |
| 2017-01-27 01:19:42 | 29641083 | 275 | 1 | 03 | 185 |
| 2017-01-27 00:10:13 | 30371381 | 261 | 1 | 03 | 82 |
| 2017-01-27 00:53:45 | 30371381 | 261 | 1 | 03 | 87 |
+---------------------+------------------+------------+-----------+------------+----------+
Here is the Python method in question:
import mysql.connector
import pandas
from datetime import datetime,timedelta
. . .
def findBloodyCheaters(self):
    self.cursor.execute(cheatFinder)
    resultSet = self.cursor.fetchall()
    targetDF = pandas.DataFrame(columns=['eventTime', 'deviceAddress', 'gateway_id', 'eventType', 'eventValue',
                                         'sequence'])
    timePreviousRow = None
    devPreviousRow = None
    for row in resultSet:
        if timePreviousRow is None and devPreviousRow is None:
            timePreviousRow = row[0]
            devPreviousRow = row[1]
        elif timePreviousRow is not None and devPreviousRow is not None:
            timeThisRow = row[0]
            devThisRow = row[1]
            delta = timePreviousRow - timeThisRow
            print(delta)
            if delta <= (timePreviousRow - timedelta(seconds=3)) and devPreviousRow == devThisRow:
                targetDF.append(row)
            timePreviousRow = timeThisRow
            devPreviousRow = devThisRow
    print(targetDF)
This is happening because you're trying to compare delta, which is a timedelta object, to (timePreviousRow - timedelta(seconds=3)), which is a datetime object. Compare delta to just timedelta(seconds=3) instead, so your code would look like:
if delta <= timedelta(seconds=3) and devPreviousRow == devThisRow:
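A minimal sketch of that fix in context (note that delta is also reversed here so it is non-negative for time-ascending rows of the same device, and that the matching row is stored explicitly, since DataFrame.append does not modify targetDF in place):
delta = timeThisRow - timePreviousRow  # timedelta between consecutive rows
if delta <= timedelta(seconds=3) and devPreviousRow == devThisRow:
    targetDF.loc[len(targetDF)] = row  # keep the suspicious row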
I am trying to select a grouped average.
a1_avg = session.query(func.avg(Table_A.a1_value).label('a1_avg'))\
.filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))\
.group_by(Table_A.a1_group)
I have tried a few different iterations of this query and this is as close as I can get to what I need. I am fairly certain the group_by is creating the issue, but I am unsure how to correctly implement the query using SQLA. The table structure and expected output:
TABLE A
A1_ID | A1_VALUE | A1_DATE | A1_LOC | A1_GROUP
1 | 5 | 2011-10-05 | 5 | 6
2 | 15 | 2011-10-14 | 5 | 6
3 | 2 | 2011-10-21 | 6 | 7
4 | 20 | 2011-11-15 | 4 | 8
5 | 6 | 2011-10-27 | 6 | 7
EXPECTED OUTPUT
A1_LOC | A1_GROUP | A1_AVG
5 | 6 | 10
6 | 7 | 4
I would guess that you are just missing the group identifier (a1_group) in the result. Also (assuming I understand your model correctly), you need to add a group_by clause for the a1_loc column as well:
Edit 1: updated the query based on the clarified question.
a1_avg = (
    session.query(Table_A.a1_loc, Table_A.a1_group,
                  func.avg(Table_A.a1_value).label('a1_avg'))
    .filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))
    # .filter(Table_A.a1_id == '12')  # note: you do NOT need this
    .group_by(Table_A.a1_loc)  # note: you NEED this
    .group_by(Table_A.a1_group)
)
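a1_avg is a Query object, so you can iterate it (or call .all()) to get the grouped rows, e.g.:
for a1_loc, a1_group, avg_value in a1_avg.all():
    print(a1_loc, a1_group, avg_value)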