I have a db with the following info:
+----+------+-----+-----+------+------+-------+
| id | time | lat | lon | dep  | dest | reg   |
+----+------+-----+-----+------+------+-------+
| a  | 1    | 10  | 20  | home | work | alpha |
| a  | 6    | 11  | 21  | home | work | alpha |
| a  | 11   | 12  | 22  | home | work | alpha |
| b  | 2    | 70  | 80  | home | cine | beta  |
| b  | 8    | 70  | 85  | home | cine | beta  |
| b  | 13   | 70  | 90  | home | cine | beta  |
+----+------+-----+-----+------+------+-------+
Is it possible to extract the following info:
+----+------+------+----------+----------+----------+----------+------+------+-------+
| id | tmin | tmax | lat_tmin | lon_tmin | lat_tmax | lon_tmax | dep  | dest | reg   |
+----+------+------+----------+----------+----------+----------+------+------+-------+
| a  | 1    | 11   | 10       | 20       | 12       | 22       | home | work | alpha |
| b  | 2    | 13   | 70       | 80       | 70       | 90       | home | cine | beta  |
+----+------+------+----------+----------+----------+----------+------+------+-------+
What if dep & dest were varying - how would you select them?
Thanks
You could use a window function for that:
SELECT t0.id,
       t0.time as tmin, t1.time as tmax,
       t0.lat as lat_tmin, t1.lat as lat_tmax,
       t0.lon as lon_tmin, t1.lon as lon_tmax,
       t0.dep,
       t0.dest,
       t0.reg
FROM ( SELECT *,
              row_number() over (partition by id order by time asc) as rn
       FROM t) AS t0
INNER JOIN ( SELECT *,
                    row_number() over (partition by id order by time desc) as rn
             FROM t) AS t1
        ON t0.id = t1.id
WHERE t0.rn = 1
  AND t1.rn = 1
This will return the data from the first and last row for each id, when sorted by id, time.
The values for dep, dest and reg are taken from the first row (per id) only.
If you also want separate rows when -- for the same id -- there are different values of dep or dest, then add those columns to the partition by clauses, and also to the join condition so the first and last rows are matched within the same group. It all depends on which output you expect in that case:
SELECT t0.id,
       t0.time as tmin, t1.time as tmax,
       t0.lat as lat_tmin, t1.lat as lat_tmax,
       t0.lon as lon_tmin, t1.lon as lon_tmax,
       t0.dep,
       t0.dest,
       t0.reg
FROM ( SELECT *,
              row_number() over (partition by id, dep, dest, reg
                                 order by time asc) as rn
       FROM t) AS t0
INNER JOIN ( SELECT *,
                    row_number() over (partition by id, dep, dest, reg
                                       order by time desc) as rn
             FROM t) AS t1
        ON t0.id   = t1.id
       AND t0.dep  = t1.dep
       AND t0.dest = t1.dest
       AND t0.reg  = t1.reg
WHERE t0.rn = 1
  AND t1.rn = 1
Please consider using a different name for the column time because of this remark in the documentation:
The following keywords could be reserved in future releases of SQL Server as new features are implemented. Consider avoiding the use of these words as identifiers.
... TIME ...
select min(t.time) as tmin, max(t.time) as tmax, (...) from table_name t group by t.id
(...) just repeat for the other columns
Let's say I have a table which looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only A and B types for a specific ID. What I want to achieve is to group those two rows and calculate a new type C using the formula A/B, applied to both value_one and value_two, so the table afterwards should look like:
| id | value_one | type | value_two|
|----|-----------| -----|----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark, and so far I haven't been able to achieve the described result; I would appreciate any tips/solutions.
You can divide the original dataframe into two parts according to type and then use SQL statements to implement the calculation logic:
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
tmp1.id
,tmp1.value_one / tmp2.value_one as value_one
,'C' as type
,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
result_df = spark.sql(sql)
result_df.show(truncate=False)
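If you prefer to stay in the DataFrame API instead of temp views, the same division can also be done by pivoting on type -- a sketch, assuming the dataframe from the question is named df:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# the sample data from the question
df = spark.createDataFrame(
    [(1, 2, "A", 1), (1, 4, "B", 1), (2, 3, "A", 2), (2, 1, "B", 3)],
    ["id", "value_one", "type", "value_two"],
)

# pivot the A/B rows onto one row per id, then divide A by B
wide = (df.groupBy("id")
          .pivot("type", ["A", "B"])
          .agg(F.first("value_one").alias("v1"), F.first("value_two").alias("v2")))

result = wide.select(
    "id",
    (F.col("A_v1") / F.col("B_v1")).alias("value_one"),
    F.lit("C").alias("type"),
    (F.col("A_v2") / F.col("B_v2")).alias("value_two"),
)
result.show()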
I have a table as below.
If I do a group by operation on the name field, the key for b is 11, but what I need to keep is 12, because 12 has already appeared in other records.
What should I do to achieve this result without using the max aggregation method?
To explain the meaning of the table:
key 12 provides both name a and name b, while key 11 only provides name b.
For name c, there are three keys that can provide it, none of which appears anywhere else.
|name|key|
| a | 12 |
| b | 11 |
| b | 12 |
| c | 15 |
| c | 14 |
| c | 17 |
....
What I hope is that the result obtained through group by is:
|name|key |
| a | 12 |
| b | 12 |
| c | 15 |
When grouping by the name field,
b needs to keep key 12, because key 12 provides both name a and name b,
so key 11 is not needed.
For name c, there are three keys that can provide it, none of which is repeated elsewhere, so just use the one that appears first.
Assuming that key is unique for each name, you can use the COUNT() and ROW_NUMBER() window functions:
select name, key
from (
select *, row_number() over (partition by name order by counter desc, rowid) rn
from (
select *, rowid, count(*) over (partition by key) counter
from tablename
)
)
where rn = 1
Results:
> name | key
> :--- | --:
> a | 12
> b | 12
> c | 15
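For reference, the same idea can be sketched in pandas, assuming a dataframe df with the name and key columns from the question: count how many names each key serves, then keep, per name, the key with the highest count, breaking ties by first appearance.
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "b", "c", "c", "c"],
                   "key":  [12, 11, 12, 15, 14, 17]})

# how many rows (names) each key appears in
df["counter"] = df.groupby("key")["key"].transform("size")

# per name, keep the key with the highest count; a stable sort keeps
# the original order for ties, so the first-seen key wins
result = (df.sort_values(["name", "counter"], ascending=[True, False], kind="stable")
            .drop_duplicates("name")[["name", "key"]])
print(result)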
I'm trying to merge two tables where the rows of the left side stay unchanged and one column gets updated based on the right side. The column of the left table takes the value from the right side if it is the highest value seen so far (i.e., higher than the current one on the left side) but still below an individually set threshold.
The threshold is set by the column "Snapshot"; the column "Latest value found" holds the highest value observed so far (within the threshold).
To be memory efficient, the process works over many small chunks of data and needs to iterate over a list of dataframes. In each chunk the origin is recorded in the column "Table ID". When the main dataframe finds a value, it stores the origin in its column "Found in".
Example
Main table (left side)
+----+-------------------------------------------+--------------------+----------+
| ID | Snapshot timestamp (Maximum search value) | Latest value found | Found in |
+----+-------------------------------------------+--------------------+----------+
| 1  | Aug-18                                    | NULL               | NULL     |
| 2  | Aug-18                                    | NULL               | NULL     |
| 3  | May-18                                    | NULL               | NULL     |
| 4  | May-18                                    | NULL               | NULL     |
| 5  | May-18                                    | NULL               | NULL     |
+----+-------------------------------------------+--------------------+----------+
First data chunk
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
| 1   | Table1   | 1           | Jan-14             |
| 2   | Table1   | 1           | Feb-14             |
| 3   | Table1   | 2           | Jan-14             |
| 4   | Table1   | 2           | Feb-14             |
| 5   | Table1   | 3           | Mar-14             |
+-----+----------+-------------+--------------------+
Result: Left-side after first merge
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
| 1  | Aug-18             | Feb-14             | Table1   |
| 2  | Aug-18             | Feb-14             | Table1   |
| 3  | May-18             | Mar-14             | Table1   |
| 4  | May-18             | NULL               | NULL     |
| 5  | May-18             | NULL               | NULL     |
+----+--------------------+--------------------+----------+
Second data chunk
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
| 1   | Table2   | 1           | Mar-15             |
| 2   | Table2   | 1           | Apr-15             |
| 3   | Table2   | 2           | Feb-14             |
| 4   | Table2   | 3           | Feb-14             |
| 5   | Table2   | 4           | Aug-19             |
+-----+----------+-------------+--------------------+
Result: Left-side after second merge
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
| 1  | Aug-18             | Apr-15             | Table2   |
| 2  | Aug-18             | Feb-14             | Table1   |
| 3  | May-18             | Mar-14             | Table1   |
| 4  | May-18             | NULL               | NULL     |
| 5  | May-18             | NULL               | NULL     |
+----+--------------------+--------------------+----------+
Code
import pandas as pd
import numpy as np
# Main dataframe
df = pd.DataFrame({"ID": [1,2,3,4,5],
"Snapshot": ["2019-08-31", "2019-08-31","2019-05-31","2019-05-31","2019-05-31"], # the maximum interval than can be used
"Latest_value_found": [None,None,None,None,None],
"Found_in": [None,None,None,None,None]}
)
# Data chunks used for updates
Table1 = pd.DataFrame({"Idx": [1,2,3,4,5],
"Table_ID": ["Table1", "Table1", "Table1", "Table1", "Table1"],
"Customer_ID": [1,1,2,2,3],
"Snapshot_timestamp": ["2019-01-31","2019-02-28","2019-01-31","2019-02-28","2019-03-30"]}
)
Table2 = pd.DataFrame({"Idx": [1,2,3,4,5],
"Table_ID": ["Table2", "Table2", "Table2", "Table2", "Table2"],
"Customer_ID": [1,1,2,3,4],
"Snapshot_timestamp": ["2019-03-31","2019-04-30","2019-02-28","2019-02-28","2019-08-31"]}
)
list_of_data_chunks = [Table1, Table2]
# work: iteration
for data_chunk in list_of_data_chunks:
    pass
    # here the merging is performed iteratively
Here is my workaround, although I would try not to do this in a loop if it's just two tables. I removed your "idx" column from the joining tables.
# Note: this assumes the Customer_ID column in Table1/Table2 has been renamed to ID,
# so the merge key below matches the main dataframe.
df_list = [df, Table1, Table2]
main_df = df_list[0]
count_ = 0
for i in df_list[1:]:
    main_df = main_df.merge(i, how='left', on='ID').sort_values(by=['ID', 'Snapshot_timestamp'], ascending=[True, False])
    main_df['rownum'] = main_df.groupby(['ID']).cumcount() + 1
    if count_ < 1:
        # first chunk: keep the latest row per ID and accept it only if it is below the threshold
        main_df = main_df[main_df['rownum'] == 1].drop(columns=['rownum', 'Latest_value_found', 'Found_in'])
        main_df['Latest_value_found'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Snapshot_timestamp'], pd.NaT)
        main_df['Found_in'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Table_ID'], np.nan)
        main_df = main_df.drop(columns=['Snapshot_timestamp', 'Table_ID']).reset_index(drop=True)
        count_ += 1
    else:
        # later chunks: compare the chunk's latest row per ID with the value found so far
        main_df = main_df[main_df['rownum'] == 1].drop(columns='rownum').reset_index(drop=True)
        this_table = []
        this_date = []
        for idx in main_df.index:
            curr_snapshot = pd.to_datetime(main_df.loc[idx, 'Snapshot'])
            curr_latest_val = pd.to_datetime(main_df.loc[idx, 'Latest_value_found'])
            curr_foundin = main_df.loc[idx, 'Found_in']
            next_foundin = main_df.loc[idx, 'Table_ID']
            next_snapshot = pd.to_datetime(main_df.loc[idx, 'Snapshot_timestamp'])
            if curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val == next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val > next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val < next_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            elif pd.isnull(curr_latest_val) and next_snapshot < curr_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            else:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
        main_df = main_df.drop(columns=['Latest_value_found', 'Found_in', 'Table_ID', 'Snapshot_timestamp'])
        main_df = pd.concat([main_df, pd.Series(this_date), pd.Series(this_table)], axis=1).rename(columns={0: 'Latest_value_found', 1: 'Found_in'})
        count_ += 1
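For reference, the per-row comparison can also be written without the inner loop: filter each chunk against the per-ID thresholds first, keep the latest remaining row per customer, and only overwrite when it beats the value found so far. A sketch, built on the df, Table1, Table2 and list_of_data_chunks defined in the question:
import numpy as np
import pandas as pd

def apply_chunk(main_df, chunk):
    # attach each customer's threshold to the chunk rows and keep only
    # timestamps strictly below it, then take the latest row per customer
    joined = chunk.merge(main_df[["ID", "Snapshot"]],
                         left_on="Customer_ID", right_on="ID")
    joined = joined[pd.to_datetime(joined["Snapshot_timestamp"])
                    < pd.to_datetime(joined["Snapshot"])]
    best = (joined.sort_values("Snapshot_timestamp")
                  .groupby("ID", as_index=False)
                  .last()[["ID", "Snapshot_timestamp", "Table_ID"]])
    out = main_df.merge(best, on="ID", how="left")
    candidate = pd.to_datetime(out["Snapshot_timestamp"])
    current = pd.to_datetime(out["Latest_value_found"])
    # overwrite only when the chunk offers something newer than what we already have
    take = candidate.notna() & (current.isna() | (candidate > current))
    out["Latest_value_found"] = np.where(take, candidate, current)
    out["Found_in"] = np.where(take, out["Table_ID"], out["Found_in"])
    return out[["ID", "Snapshot", "Latest_value_found", "Found_in"]]

result = df
for data_chunk in list_of_data_chunks:
    result = apply_chunk(result, data_chunk)
print(result)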
I'm trying to translate a window function from SQL to Pandas that is only applied when a match is possible; otherwise a NULL (None) value is inserted.
SQL-Code (example)
SELECT
    [Customer].[ID_customer],
    [Customer].[cTimestamp],
    [TMP_Latest_request].[ID_req] AS [ID of Latest request]
FROM [table].[Customer] AS [Customer]
LEFT JOIN (
    SELECT * FROM (
        SELECT [ID_req], [ID_customer], [rTimestamp],
               RANK() OVER (PARTITION BY ID_customer ORDER BY rTimestamp DESC) AS rnk
        FROM [table].[Customer_request]
    ) AS [Q]
    WHERE rnk = 1
) AS [TMP_Latest_request]
ON [Customer].[ID_customer] = [TMP_Latest_request].[ID_customer]
Example
Joining the ID of the latest customer request (if exists) to the customer.
table:Customer
+-------------+------------+
| ID_customer | cTimestamp |
+-------------+------------+
| 1           | 2014       |
| 2           | 2014       |
| 3           | 2015       |
+-------------+------------+
table: Customer_request
+--------+-------------+------------+
| ID_req | ID_customer | rTimestamp |
+--------+-------------+------------+
| 1      | 1           | 2012       |
| 2      | 1           | 2013       |
| 3      | 1           | 2014       |
| 4      | 2           | 2014       |
+--------+-------------+------------+
Result: table:merged
+-------------+------------+----------------------+
| ID_customer | cTimestamp | ID of Latest request |
+-------------+------------+----------------------+
| 1           | 2014       | 3                    |
| 2           | 2014       | 4                    |
| 3           | 2015       | None/NULL            |
+-------------+------------+----------------------+
What is the equivalent in Python Pandas?
Instead of using the RANK() function, you can simply use the query below, which is easier to convert.
SELECT A.ID_Customer, A.cTimeStamp, B.ID_req
FROM Customer A
LEFT JOIN (
    SELECT ID_Customer, MAX(ID_req) AS ID_req
    FROM Customer_request
    GROUP BY ID_Customer
) B
    ON A.ID_Customer = B.ID_Customer
Try the query above; if you face any issues, ask me in the comments.
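On the Pandas side, a possible translation of that simplified query is below -- a sketch assuming the two tables are loaded as dataframes named customer and customer_request, and (like the SQL above) that a larger ID_req always means a later rTimestamp:
import pandas as pd

# sample data from the question (assumed dataframe names)
customer = pd.DataFrame({"ID_customer": [1, 2, 3],
                         "cTimestamp": [2014, 2014, 2015]})
customer_request = pd.DataFrame({"ID_req": [1, 2, 3, 4],
                                 "ID_customer": [1, 1, 1, 2],
                                 "rTimestamp": [2012, 2013, 2014, 2014]})

# GROUP BY ID_customer / MAX(ID_req): latest request per customer
latest = (customer_request.groupby("ID_customer", as_index=False)["ID_req"].max()
          .rename(columns={"ID_req": "ID of Latest request"}))

# LEFT JOIN keeps customers without any request (NaN in the new column)
merged = customer.merge(latest, on="ID_customer", how="left")
print(merged)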
I have a table (from a log file) with emails and three other columns that contain states of that user's interaction with a system. An email (user) may have 100 or 1000 entries, and each entry contains a combination of those three values, which may repeat over and over for the same email and for others.
It looks something like this:
+-------+-----------+------+-------+
| email | val1      | val2 | val3  |
+-------+-----------+------+-------+
| jal#h | cast      | core | cam   |
| hal#b | little ja | qar  | ja sa |
| bam#t | cast      | core | cam   |
| jal#h | little ja | qar  | jaja  |
+-------+-----------+------+-------+
And so the emails repeat, the values repeat, and there are 40+ possible values for each column, all strings. I want one row per distinct email, with every possible value as a column name and under it a count of how many times that value occurred for that particular email, like so:
+-------+------+------+-----+-----------+-----+-------+--------+
| email | cast | core | cam | little ja | qar | ja sa | blabla |
+-------+------+------+-----+-----------+-----+-------+--------+
| jal#h | 55   | 2    | 44  | 244       | 1   | 200   | 12     |
| hal#b | 900  | 513  | 101 | 146       | 2   | 733   | 833    |
| bam#t | 1231 | 33   | 433 | 411       | 933 | 833   | 53     |
+-------+------+------+-----+-----------+-----+-------+--------+
I have tried MySQL, but I only managed to count the total occurrences of one particular value for each email, not counts of all possible values in every column:
SELECT
distinct email,
count(val1) as "cast"
FROM table1
where val1 = 'cast'
group by email
This query clearly doesn't do it, as it only counts the single value 'cast' from the first column val1. What I'm looking for is for all distinct values in the first, second, and third columns to become column headers, with the row values being the total count of that value for a certain email (user).
There is the pivot-table thing, but I couldn't get it to work.
I'm dealing with this data as a table in MySQL, but it is also available as a CSV file, so if it isn't possible with a query, Python would be a possible solution (preferred after SQL).
Update
In Python, is it possible to output the data as:
+-------+--------------------+------------+--------------------+
|       |        val1        |    val2    |        val3        |
+-------+--------+-----------+------+-----+-----+-------+------+
| email | cast   | little ja | core | qar | cam | ja sa | jaja |
+-------+--------+-----------+------+-----+-----+-------+------+
| jal#h | 55     | 2         | 44   | 244 | 1   | 200   | 12   |
| hal#b | 900    | 513       | 101  | 146 | 2   | 733   | 833  |
| bam#t | 1231   | 33        | 433  | 411 | 933 | 833   | 53   |
+-------+--------+-----------+------+-----+-----+-------+------+
I'm not very familiar with python.
If you use pandas, you can do a value_counts after grouping your data frame by email and then unstack/pivot it to wide format:
(df.set_index("email").stack().groupby(level=0).value_counts()
.unstack(level=1).reset_index().fillna(0))
To get the updated result, you can group by both the email and val* columns after the stack:
(df.set_index("email").stack().groupby(level=[0, 1]).value_counts()
.unstack(level=[1, 2]).fillna(0).sort_index(axis=1))
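For context, a small harness that runs the first expression on the question's four sample rows; each distinct string becomes a column holding its per-email count:
import pandas as pd

# the four sample rows from the question
df = pd.DataFrame({"email": ["jal#h", "hal#b", "bam#t", "jal#h"],
                   "val1":  ["cast", "little ja", "cast", "little ja"],
                   "val2":  ["core", "qar", "core", "qar"],
                   "val3":  ["cam", "ja sa", "cam", "jaja"]})

counts = (df.set_index("email").stack().groupby(level=0).value_counts()
            .unstack(level=1).reset_index().fillna(0))
print(counts)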
I'd reconstruct the dataframe, then groupby and unstack with pd.value_counts:
v = df.values
s = pd.Series(v[:, 1:].ravel(), v[:, 0].repeat(3))
s.groupby(level=0).value_counts().unstack(fill_value=0)
cam cast core ja sa jaja little ja qar
bam#t 1 1 1 0 0 0 0
hal#b 0 0 0 1 0 1 1
jal#h 1 1 1 0 1 1 1
If you know the list of values, you can calculate it using GROUP BY:
SELECT email,
sum(val1 = 'cast') as `cast`,
sum(val1 = 'core') as `core`,
sum(val1 = 'cam') as `cam`,
. . .
FROM table1
GROUP BY email;
The . . . is for you to fill in the remaining values.
You can use this query to generate a dynamic PREPARED statement from the values of val1-val3 in your table:
SELECT
CONCAT( "SELECT email,\n",
GROUP_CONCAT(
CONCAT (" SUM(IF('",val1,"' IN(val1,val2,val3),1,0)) AS '",val1,"'")
SEPARATOR ',\n'),
"\nFROM table1\nGROUP BY EMAIL\nORDER BY email") INTO #myquery
FROM (
SELECT val1 FROM table1
UNION SELECT val2 FROM table1
UNION SELECT val3 FROM table1
) AS vals
ORDER BY val1;
-- ONLY TO VERIFY QUERY
SELECT #myquery;
PREPARE stmt FROM #myquery;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
sample table
mysql> SELECT * FROM table1;
+----+-------+-----------+------+-------+
| id | email | val1      | val2 | val3  |
+----+-------+-----------+------+-------+
|  1 | jal#h | cast      | core | cam   |
|  2 | hal#b | little ja | qar  | ja sa |
|  3 | bam#t | cast      | core | cam   |
|  4 | jal#h | little ja | qar  | cast  |
+----+-------+-----------+------+-------+
4 rows in set (0,00 sec)
generate query
mysql> SELECT
-> CONCAT( "SELECT email,\n",
-> GROUP_CONCAT(
-> CONCAT (" SUM(IF('",val1,"' IN(val1,val2,val3),1,0)) AS '",val1,"'")
-> SEPARATOR ',\n'),
-> "\nFROM table1\nGROUP BY EMAIL\nORDER BY email") INTO #myquery
-> FROM (
-> SELECT val1 FROM table1
-> UNION SELECT val2 FROM table1
-> UNION SELECT val3 FROM table1
-> ) AS vals
-> ORDER BY val1;
Query OK, 1 row affected (0,00 sec)
verify query
mysql> -- ONLY TO VERIFY QUERY
mysql> SELECT #myquery;
SELECT email,
SUM(IF('cast' IN(val1,val2,val3),1,0)) AS 'cast',
SUM(IF('little ja' IN(val1,val2,val3),1,0)) AS 'little ja',
SUM(IF('core' IN(val1,val2,val3),1,0)) AS 'core',
SUM(IF('qar' IN(val1,val2,val3),1,0)) AS 'qar',
SUM(IF('cam' IN(val1,val2,val3),1,0)) AS 'cam',
SUM(IF('ja sa' IN(val1,val2,val3),1,0)) AS 'ja sa'
FROM table1
GROUP BY EMAIL
ORDER BY email
1 row in set (0,00 sec)
execute query
mysql> PREPARE stmt FROM #myquery;
Query OK, 0 rows affected (0,00 sec)
Statement prepared
mysql> EXECUTE stmt;
+-------+------+-----------+------+-----+-----+-------+
| email | cast | little ja | core | qar | cam | ja sa |
+-------+------+-----------+------+-----+-----+-------+
| bam#t |    1 |         0 |    1 |   0 |   1 |     0 |
| hal#b |    0 |         1 |    0 |   1 |   0 |     1 |
| jal#h |    2 |         1 |    1 |   1 |   1 |     0 |
+-------+------+-----------+------+-----+-----+-------+
3 rows in set (0,00 sec)
mysql> DEALLOCATE PREPARE stmt;
Query OK, 0 rows affected (0,00 sec)
mysql>