Let's say I have a table that looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that each ID has only types A and B. What I want is to group those two rows and compute a new type C using the formula A/B, applied to both value_one and value_two, so the table afterwards should look like:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark and so far haven't been able to achieve the described result. I would appreciate any tips or solutions.
You can consider dividing the original dataframe into two parts according to type, and then use SQL statements to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
tmp1.id
,tmp1.value_one / tmp2.value_one as value_one
,'C' as type
,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
result_df = spark.sql(sql)
result_df.show(truncate=False)
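As a quick check of the join logic, the sketch below runs an equivalent query on the question's sample data using Python's built-in sqlite3 module instead of a Spark session (an assumption made only to keep the example self-contained). One difference to keep in mind: SQLite performs integer division on integer columns, so the sketch multiplies by 1.0, whereas the `/` operator in Spark SQL returns a double.

```python
import sqlite3

# In-memory SQLite stands in for Spark here (assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df (id INTEGER, value_one INTEGER, type TEXT, value_two INTEGER)"
)
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?, ?)",
    [(1, 2, "A", 1), (1, 4, "B", 1), (2, 3, "A", 2), (2, 1, "B", 3)],
)

# Self-join the A rows to the B rows on id, then divide A by B.
query = """
SELECT a.id,
       1.0 * a.value_one / b.value_one AS value_one,
       'C' AS type,
       1.0 * a.value_two / b.value_two AS value_two
FROM df AS a
JOIN df AS b ON a.id = b.id
WHERE a.type = 'A' AND b.type = 'B'
ORDER BY a.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the join is an inner join, an id that lacks either its A or its B row simply drops out of the result.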
I have a dataframe somewhat like this:
ID | Relationship | First Name | Last Name | DOB | Address | Phone
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891
1 | 2 | Spouse | Bulma | Saiyan | 04/20/1969 | Saiyan Planet | 123-456-7891
2 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321
3 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870
4 | 4 | Child | Gohan | Kakarot | 04/02/2001 | Planet Earth | 321-654-9870
5 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568
I'm looking to have the rows with same ID appended to the right of the first row with that ID.
Example:
ID | Relationship | First Name | Last Name | DOB | Address | Phone | Spouse_First Name | Spouse_Last Name | Spouse_DOB | Child_First Name | Child_Last Name | Child_DOB |
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891 | Bulma | Saiyan | 04/20/1969 | | |
1 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321 | | | | | |
2 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870 | | | | Gohan | Kakarot | 04/02/2001 |
3 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568 | | | | | |
My real dataframe has more columns, but they all hold the same information when two rows share the same ID, so there is no need to duplicate those in the other rows. I only need to add to the right the columns that I choose, which in this case would be First Name, Last Name, and DOB, with the new column label prefixed by whatever is in the 'Relationship' column (I can rename them later if it's not possible to do directly; I just wanted to illustrate my point).
That said, I have tried different approaches, and it seems like unstack or pivot is the way to go, but I have not been successful in making it work.
Any help would be greatly appreciated.
This solution assumes that the DataFrame is indexed by the ID column.
not_self = (
df.query("Relationship != 'Self'")
.pivot(columns='Relationship')
.swaplevel(axis=1)
.reindex(
pd.MultiIndex.from_product(
(
set(df['Relationship'].unique()) - {'Self'},
df.columns.to_series().drop('Relationship')
)
),
axis=1
)
)
not_self.columns = [' '.join((a, b)) for a, b in not_self.columns]
result = df.query("Relationship == 'Self'").join(not_self)
Please let me know if this is not what was wanted.
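For comparison, here is a minimal sketch of the same idea on the question's sample data, restricted to the three columns the asker wants appended. It assumes a pandas version where `pivot` accepts a list for `values`, leaves out the Address and Phone columns for brevity, and uses the hypothetical names `keep`, `others`, and `wide`.

```python
import pandas as pd

# Sample data from the question (Address and Phone omitted for brevity).
df = pd.DataFrame(
    {
        "ID": [2, 2, 3, 4, 4, 5],
        "Relationship": ["Self", "Spouse", "Self", "Self", "Child", "Self"],
        "First Name": ["Vegeta", "Bulma", "Krilin", "Goku", "Gohan", "Freezer"],
        "Last Name": ["Saiyan", "Saiyan", "Human", "Kakarot", "Kakarot", "Fridge"],
        "DOB": ["01/01/1949", "04/20/1969", "08/21/1992",
                "05/04/1975", "04/02/2001", "09/15/1955"],
    }
).set_index("ID")

keep = ["First Name", "Last Name", "DOB"]  # columns to append to the right

# Pivot only the non-Self rows: one column per (field, relationship) pair.
others = df[df["Relationship"] != "Self"]
wide = others.pivot(columns="Relationship", values=keep)

# Flatten the two-level header into labels such as "Spouse_First Name".
wide.columns = [f"{rel}_{col}" for col, rel in wide.columns]

# Attach the pivoted columns to the Self rows; IDs without relatives get NaN.
result = df[df["Relationship"] == "Self"].join(wide)
print(result)
```

This only works if each ID has at most one row per relationship; two children under the same ID would make the pivot fail with a duplicate-entry error.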
I have a 'master table' that contains just one column of ids from all my other tables. I also have several other tables that contain some of the ids, along with other columns of data. I am trying to iterate through all of the ids for each smaller table, create a new column for the smaller table, check if the id exists on that table and create a binary entry in the master table. (0 if the id doesn't exist, and 1 if the id does exist on the specified table)
That seems pretty confusing, but the application of this is to check if a user exists on the table for a specific date, and keep track of this information day to day.
Right now I am iterating through the dates, and inside each iteration I am iterating through all of the ids to check if they exist for that date. This is likely going to be incredibly slow, and there is probably a better way to do it. My code looks like this:
def main():
    dates = init()
    id_list = getids()
    print(dates)
    for date in reversed(dates):
        cursor.execute("ALTER TABLE " + table + " ADD " + date + " BIT;")
        cnx.commit()
        for ID in id_list:
            (...)
I know that the next step will be to generate a query using each id that looks something like:
SELECT id FROM [date]_table
WHERE EXISTS (SELECT 1 FROM master_table WHERE master_table.id = [date]_table.id)
I've been stuck on this problem for a couple days and so far I cannot come up with a query that gives a useful result.
For an example, if I had three tables for three days...
Monday:
+------+-----+
| id | ... |
+------+-----+
| 1001 | ... |
| 1002 | ... |
| 1003 | ... |
| 1004 | ... |
| 1005 | ... |
+------+-----+
Tuesday:
+------+-----+
| id | ... |
+------+-----+
| 1001 | ... |
| 1003 | ... |
| 1005 | ... |
+------+-----+
Wednesday:
+------+-----+
| id | ... |
+------+-----+
| 1002 | ... |
| 1004 | ... |
+------+-----+
I'd like to end up with a master table like this:
+------+--------+---------+-----------+
| id | monday | tuesday | wednesday |
+------+--------+---------+-----------+
| 1001 | 1 | 1 | 0 |
| 1002 | 1 | 0 | 1 |
| 1003 | 1 | 1 | 0 |
| 1004 | 1 | 0 | 1 |
| 1005 | 1 | 1 | 0 |
+------+--------+---------+-----------+
Thank you ahead of time for any help with this issue. And since it's sort of a confusing problem, please let me know if there are any additional details I can provide.
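One possible direction, sketched below under several assumptions: instead of looping over every id, a single query can compute all the flags at once with one EXISTS subquery per day table. The sketch uses Python's built-in sqlite3 as a stand-in for the real database, and the table names `monday`, `tuesday`, and `wednesday` are hypothetical placeholders for the per-date tables.

```python
import sqlite3

# Example data from the question; table names are placeholders.
days = {
    "monday": [1001, 1002, 1003, 1004, 1005],
    "tuesday": [1001, 1003, 1005],
    "wednesday": [1002, 1004],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master_table (id INTEGER PRIMARY KEY)")
all_ids = sorted(set().union(*days.values()))
conn.executemany("INSERT INTO master_table VALUES (?)", [(i,) for i in all_ids])

for day, ids in days.items():
    # Identifiers cannot be bound as parameters, so the names must be trusted.
    conn.execute(f"CREATE TABLE {day} (id INTEGER)")
    conn.executemany(f"INSERT INTO {day} VALUES (?)", [(i,) for i in ids])

# One EXISTS flag per day table: 1 if the id is present, 0 otherwise.
flags = ",\n       ".join(
    f"EXISTS (SELECT 1 FROM {day} AS d WHERE d.id = m.id) AS {day}"
    for day in days
)
query = f"SELECT m.id,\n       {flags}\nFROM master_table AS m\nORDER BY m.id"

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The result can be read directly or written to a separate table, which avoids altering the master table once per date.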
I have a table (from a log file) with emails and three other columns that contain the states of that user's interaction with a system. An email (user) may have 100 or 1000 entries, and each entry contains a combination of those three values, which may repeat over and over for the same email and for others.
It looks something like this:
+-------+-----------+------+-------+
| email | val1      | val2 | val3  |
+-------+-----------+------+-------+
| jal#h | cast      | core | cam   |
| hal#b | little ja | qar  | ja sa |
| bam#t | cast      | core | cam   |
| jal#h | little ja | qar  | jaja  |
+-------+-----------+------+-------+
So the emails repeat, all values repeat, and there are 40+ possible values for each column, all strings. I want to list each distinct email once, turn every possible value into a column name, and put under it a count of how many times that value occurred for that email, like so:
+-------+------+------+-----+-----------+-----+-------+--------+
| email | cast | core | cam | little ja | qar | ja sa | blabla |
+-------+------+------+-----+-----------+-----+-------+--------+
| jal#h | 55   | 2    | 44  | 244       | 1   | 200   | 12     |
| hal#b | 900  | 513  | 101 | 146       | 2   | 733   | 833    |
| bam#t | 1231 | 33   | 433 | 411       | 933 | 833   | 53     |
+-------+------+------+-----+-----------+-----+-------+--------+
I tried MySQL, but I only managed to count the total occurrences of one particular value for each email, not all possible values in each column:
SELECT
distinct email,
count(val1) as "cast"
FROM table1
where val1 = 'cast'
group by email
This query clearly doesn't do it, as it only outputs the count for the value 'cast' from the first column, val1. What I'm looking for is for all distinct values in the first, second, and third columns to become column headers, with each row holding the total count of that value for a given email (user).
I know there is such a thing as a pivot table, but I couldn't get it to work.
I'm working with this data as a MySQL table, but it is also available as a CSV file, so if this isn't possible with a query, Python would be an acceptable solution, though SQL is preferred.
Update:
In Python, is it possible to output the data as:
+-------+------------------+------------+--------------------+
|       | val1             | val2       | val3               |
+-------+------+-----------+------+-----+-----+-------+------+
| email | cast | little ja | core | qar | cam | ja sa | jaja |
+-------+------+-----------+------+-----+-----+-------+------+
| jal#h | 55   | 2         | 44   | 244 | 1   | 200   | 12   |
| hal#b | 900  | 513       | 101  | 146 | 2   | 733   | 833  |
| bam#t | 1231 | 33        | 433  | 411 | 933 | 833   | 53   |
+-------+------+-----------+------+-----+-----+-------+------+
I'm not very familiar with python.
If you use pandas, you can do a value_counts after grouping your data frame by email and then unstack/pivot it to wide format:
(df.set_index("email").stack().groupby(level=0).value_counts()
.unstack(level=1).reset_index().fillna(0))
To get the updated result, you can group by both the email and val* columns after the stack:
(df.set_index("email").stack().groupby(level=[0, 1]).value_counts()
.unstack(level=[1, 2]).fillna(0).sort_index(axis=1))
I'd reconstruct the dataframe as a single Series indexed by email, then use groupby, value_counts, and unstack:
v = df.values
s = pd.Series(v[:, 1:].ravel(), v[:, 0].repeat(3))
s.groupby(level=0).value_counts().unstack(fill_value=0)
cam cast core ja sa jaja little ja qar
bam#t 1 1 1 0 0 0 0
hal#b 0 0 0 1 0 1 1
jal#h 1 1 1 0 1 1 1
If you know the list you can calculate it using group by:
SELECT email,
sum(val1 = 'cast') as `cast`,
sum(val1 = 'core') as `core`,
sum(val1 = 'cam') as `cam`,
. . .
FROM table1
GROUP BY email;
The . . . is for you to fill in the remaining values.
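If the list of values is known but long, the repeated columns could be generated rather than typed by hand. The following is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL; the `known` mapping of columns to values is hypothetical and would need to be filled in with the real 40+ values per column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (email TEXT, val1 TEXT, val2 TEXT, val3 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        ("jal#h", "cast", "core", "cam"),
        ("hal#b", "little ja", "qar", "ja sa"),
        ("bam#t", "cast", "core", "cam"),
        ("jal#h", "little ja", "qar", "jaja"),
    ],
)

# Hypothetical list of known values per column (assumption for this sketch).
known = {
    "val1": ["cast", "little ja"],
    "val2": ["core", "qar"],
    "val3": ["cam", "ja sa", "jaja"],
}

# Build one SUM(<col> = ?) expression per value; the value itself is bound
# as a parameter, and only the alias is interpolated into the SQL text.
columns, params = [], []
for col, values in known.items():
    for value in values:
        columns.append(f'SUM({col} = ?) AS "{value}"')
        params.append(value)

query = (
    "SELECT email, " + ", ".join(columns)
    + " FROM table1 GROUP BY email ORDER BY email"
)
cursor = conn.execute(query, params)
header = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(header)
for row in rows:
    print(row)
```

The aliases are interpolated as text, so this sketch assumes the values contain no double quotes; values coming from untrusted input would need escaping first.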
You can use this query to dynamically generate a prepared statement from the val1-val3 values in your table:
SELECT
CONCAT( "SELECT email,\n",
GROUP_CONCAT(
CONCAT (" SUM(IF('",val1,"' IN(val1,val2,val3),1,0)) AS '",val1,"'")
SEPARATOR ',\n'),
"\nFROM table1\nGROUP BY EMAIL\nORDER BY email") INTO @myquery
FROM (
SELECT val1 FROM table1
UNION SELECT val2 FROM table1
UNION SELECT val3 FROM table1
) AS vals
ORDER BY val1;
-- ONLY TO VERIFY QUERY
SELECT @myquery;
PREPARE stmt FROM @myquery;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
sample table
mysql> SELECT * FROM table1;
+----+-------+-----------+------+-------+
| id | email | val1 | val2 | val3 |
+----+-------+-----------+------+-------+
| 1 | jal#h | cast | core | cam |
| 2 | hal#b | little ja | qar | ja sa |
| 3 | bam#t | cast | core | cam |
| 4 | jal#h | little ja | qar | cast |
+----+-------+-----------+------+-------+
4 rows in set (0,00 sec)
generate query
mysql> SELECT
-> CONCAT( "SELECT email,\n",
-> GROUP_CONCAT(
-> CONCAT (" SUM(IF('",val1,"' IN(val1,val2,val3),1,0)) AS '",val1,"'")
-> SEPARATOR ',\n'),
-> "\nFROM table1\nGROUP BY EMAIL\nORDER BY email") INTO @myquery
-> FROM (
-> SELECT val1 FROM table1
-> UNION SELECT val2 FROM table1
-> UNION SELECT val3 FROM table1
-> ) AS vals
-> ORDER BY val1;
Query OK, 1 row affected (0,00 sec)
verify query
mysql> -- ONLY TO VERIFY QUERY
mysql> SELECT @myquery;
SELECT email,
SUM(IF('cast' IN(val1,val2,val3),1,0)) AS 'cast',
SUM(IF('little ja' IN(val1,val2,val3),1,0)) AS 'little ja',
SUM(IF('core' IN(val1,val2,val3),1,0)) AS 'core',
SUM(IF('qar' IN(val1,val2,val3),1,0)) AS 'qar',
SUM(IF('cam' IN(val1,val2,val3),1,0)) AS 'cam',
SUM(IF('ja sa' IN(val1,val2,val3),1,0)) AS 'ja sa'
FROM table1
GROUP BY EMAIL
ORDER BY email
1 row in set (0,00 sec)
execute query
mysql> PREPARE stmt FROM @myquery;
Query OK, 0 rows affected (0,00 sec)
Statement prepared
mysql> EXECUTE stmt;
+-------+------+-----------+------+------+------+-------+
| email | cast | little ja | core | qar | cam | ja sa |
+-------+------+-----------+------+------+------+-------+
| bam#t | 1 | 0 | 1 | 0 | 1 | 0 |
| hal#b | 0 | 1 | 0 | 1 | 0 | 1 |
| jal#h | 2 | 1 | 1 | 1 | 1 | 0 |
+-------+------+-----------+------+------+------+-------+
3 rows in set (0,00 sec)
mysql> DEALLOCATE PREPARE stmt;
Query OK, 0 rows affected (0,00 sec)
mysql>
I have a db with the following info:
+----+------+-----+-----+------+------+-----------+
| id | time | lat | lon | dep | dest | reg |
+----+------+-----+-----+------+------+-----------+
| a | 1 | 10 | 20 | home | work | alpha |
| a | 6 | 11 | 21 | home | work | alpha |
| a | 11 | 12 | 22 | home | work | alpha |
| b | 2 | 70 | 80 | home | cine | beta |
| b | 8 | 70 | 85 | home | cine | beta |
| b | 13 | 70 | 90 | home | cine | beta |
+----+------+-----+-----+------+------+-----------+
Is it possible to extract the following info:
+----+------+------+----------+----------+----------+----------+------+------+--------+
| id | tmin | tmax | lat_tmin | lon_tmin | lat_tmax | lon_tmax | dep | dest | reg |
+----+------+------+----------+----------+----------+----------+------+------+--------+
| a | 1 | 11 | 10 | 20 | 12 | 22 | home | work | alpha |
| b | 2 | 13 | 70 | 80 | 70 | 90 | home | cine | beta |
+----+------+------+----------+----------+----------+----------+------+------+--------+
What if dep and dest varied? How would you select them?
Thanks
You could use a window function for that:
SELECT t0.id,
t0.time as tmin, t1.time as tmax,
t0.lat as lat_tmin, t1.lat as lat_tmax,
t0.lon as lon_tmin, t1.lon as lon_tmax,
t0.dep,
t0.dest,
t0.reg
FROM ( SELECT *,
row_number() over (partition by id order by time asc) as rn
FROM t) AS t0
INNER JOIN ( SELECT *,
row_number() over (partition by id order by time desc) as rn
FROM t) AS t1
ON t0.id = t1.id
WHERE t0.rn = 1
AND t1.rn = 1
This will return the data from the first and last row for each id, when sorted by id, time.
The values for dep, dest and reg are taken from the first row (per id) only.
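To try the query on the question's sample data, the sketch below runs it through Python's built-in sqlite3 module. This assumes a bundled SQLite of version 3.25 or later (the first with window functions) and uses SQLite only as a stand-in for the asker's database; the select list is reordered to match the column order of the requested output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id TEXT, time INTEGER, lat INTEGER, lon INTEGER, "
    "dep TEXT, dest TEXT, reg TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("a", 1, 10, 20, "home", "work", "alpha"),
        ("a", 6, 11, 21, "home", "work", "alpha"),
        ("a", 11, 12, 22, "home", "work", "alpha"),
        ("b", 2, 70, 80, "home", "cine", "beta"),
        ("b", 8, 70, 85, "home", "cine", "beta"),
        ("b", 13, 70, 90, "home", "cine", "beta"),
    ],
)

# t0 ranks rows per id from earliest to latest, t1 from latest to earliest;
# keeping rank 1 on both sides pairs the first row with the last row.
query = """
SELECT t0.id,
       t0.time AS tmin, t1.time AS tmax,
       t0.lat AS lat_tmin, t0.lon AS lon_tmin,
       t1.lat AS lat_tmax, t1.lon AS lon_tmax,
       t0.dep, t0.dest, t0.reg
FROM (SELECT *, row_number() OVER (PARTITION BY id ORDER BY time ASC) AS rn
      FROM t) AS t0
JOIN (SELECT *, row_number() OVER (PARTITION BY id ORDER BY time DESC) AS rn
      FROM t) AS t1
  ON t0.id = t1.id
WHERE t0.rn = 1 AND t1.rn = 1
ORDER BY t0.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```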
If you want to also have separate rows for when -- for the same id -- you have different values of dep or dest, then just add those in the partition by clauses. All depends on which output you expect in that case:
SELECT t0.id,
t0.time as tmin, t1.time as tmax,
t0.lat as lat_tmin, t1.lat as lat_tmax,
t0.lon as lon_tmin, t1.lon as lon_tmax,
t0.dep,
t0.dest,
t0.reg
FROM ( SELECT *,
row_number() over (partition by id, dep, dest, reg
order by time asc) as rn
FROM t) AS t0
INNER JOIN ( SELECT *,
row_number() over (partition by id, dep, dest, reg
order by time desc) as rn
FROM t) AS t1
ON t0.id = t1.id
AND t0.dep = t1.dep
AND t0.dest = t1.dest
AND t0.reg = t1.reg
WHERE t0.rn = 1
AND t1.rn = 1
Please consider using a different name for the column time because of this remark in the documentation:
The following keywords could be reserved in future releases of SQL Server as new features are implemented. Consider avoiding the use of these words as identifiers.
... TIME ...
select min(t.time) as tmin, max(t.time) as tmax, (...) from table_name t group by t.dest
(...) just repeat for the other columns