Hi, I have a PySpark DataFrame. I would like to add two columns from different rows, with a special condition. One of the columns is a date type.
Here is an example of the data:
+------+------------+------+
| flag | date       | diff |
+------+------------+------+
|    1 | 2014-05-31 |    0 |
|    2 | 2014-06-02 |    2 |
|    3 | 2016-01-14 |  591 |
|    1 | 2016-07-08 |    0 |
|    2 | 2016-07-12 |    4 |
+------+------------+------+
Currently I only know how to add the two columns by using this code:
from pyspark.sql.functions import expr
dataframe.withColumn("new_column", expr("date_add(date_column, int_column)"))
The expected result: a new column called "new_date", which is the result of adding the "diff" column to the "date" column.
The catch is a special condition: if "flag" is 1, "date" and "diff" come from the same row; if not, "date" comes from the previous row.
I am aware that in this scenario my data has to be correctly sorted.
If anyone could help me, I would be very grateful. Thank you.
You just have to create a column with the previous date using a Window, and construct the new column depending on the value of 'flag':
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window().partitionBy().orderBy(F.col('date'))
dataframe = dataframe.withColumn('previous_date', F.lag('date', 1).over(w))
dataframe = dataframe.withColumn('new_date',
                                 F.when(F.col('flag') == 1,
                                        F.expr("date_add(previous_date, diff)")
                                 ).otherwise(F.expr("date_add(date, diff)"))
                                ).drop('previous_date')
In case you have the same issue with Xavier's answer: the idea is the same, but I removed some unnecessary conditions from the Window and fixed the syntax error, as well as the date_add error I faced when I tried his version.
import datetime
from pyspark.sql.functions import *
from pyspark.sql.window import Window

df1 = spark.createDataFrame([(1, datetime.date(2014, 5, 31), 0), (2, datetime.date(2014, 6, 2), 2),
                             (3, datetime.date(2016, 1, 14), 591), (1, datetime.date(2016, 7, 8), 0),
                             (2, datetime.date(2016, 7, 12), 4)], ["flag", "date", "diff"])
w = Window.orderBy(col("date"))
df1 = df1.withColumn('previous_date', lag('date', 1).over(w))
df1 = df1.withColumn('new_date',
                     when(col('flag') == 1, expr('date_add(date, diff)'))
                     .otherwise(expr('date_add(previous_date, diff)'))).drop('previous_date')
df1.show()
Output:
+----+----------+----+----------+
|flag| date|diff| new_date|
+----+----------+----+----------+
| 1|2014-05-31| 0|2014-05-31|
| 2|2014-06-02| 2|2014-06-02|
| 3|2016-01-14| 591|2016-01-14|
| 1|2016-07-08| 0|2016-07-08|
| 2|2016-07-12| 4|2016-07-12|
+----+----------+----+----------+
I have two DataFrame objects, df and df2, each with their own set of data, and each with a column called 'Date Time'. df2 is a subset of df, and I am trying to highlight the entire row where df['Date Time'] == df2['Date Time'], as it is a huge set of data and I want to easily find where they match. I have merged both sets of data (all df columns plus the df2 columns I want) into df; I removed the duplicated column in the merge, so I just have the remaining columns I want in the correct rows.
Here is a little code snippet of what I have:
def highlight_row(rows):
    return ['background-color: green' for row in rows]

df[df['Date Time'] == df2['Date Time']].style.apply(highlight_row)
I'm getting an error, ValueError: Can only compare identically-labeled Series objects, and I'm not sure what it's referring to that isn't identically labeled, or whether I've made a simple mistake. In theory it should just highlight the entire row wherever a df2 row is present; down the line I might need something more conditional, but I thought this could work.
As requested, here is some sample data:
df
| Date Time | Motion Detected | Alert |
| ---------------- | -------------- | ----- |
| 22-03-05 2:13:04 | False | No |
| 22-03-05 2:14:00 | True | Yes |
df2
| Date Time | WiFi Connection |
| --------- | --------------- |
| 22-03-05 2:14:00| Connected |
The actual data won't make much sense to you, so here is example data with a similar purpose. Basically, with this example data standing in for my actual data, I have a table that looks like this in df:
| Date Time | Motion Detected | Alert | WiFi Connection |
| --------- | --------------- | ----- | --------------- |
| 22-03-05 2:13:04 | False | No | |
| 22-03-05 2:14:00 | True | Yes | Connected |
What I want is for any row (assume a larger scale of data like this) where the 'Date Time' columns match (i.e. where there is data in 'WiFi Connection') to be highlighted a certain color.
Edit: in case the table formatting isn't showing properly, you can view the requested information here
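One way this could be approached, as a sketch built from the sample tables above (the styling helper and the green colour come from the snippet in the question; everything else is assumed for illustration): checking membership with isin sidesteps the "identically-labeled Series" error, because it does not try to align two differently-indexed Series, and a row-wise Styler function can then colour the matching rows.
import pandas as pd

# Frames rebuilt from the sample tables above, purely for illustration
df = pd.DataFrame({'Date Time': ['22-03-05 2:13:04', '22-03-05 2:14:00'],
                   'Motion Detected': [False, True],
                   'Alert': ['No', 'Yes'],
                   'WiFi Connection': [None, 'Connected']})
df2 = pd.DataFrame({'Date Time': ['22-03-05 2:14:00'],
                    'WiFi Connection': ['Connected']})

# Membership test instead of ==, so no index alignment and no ValueError
matches = df['Date Time'].isin(df2['Date Time'])

def highlight_row(row):
    # colour every cell of a row whose 'Date Time' appears in df2
    return ['background-color: green' if matches[row.name] else '' for _ in row]

styled = df.style.apply(highlight_row, axis=1)  # render `styled` in a notebook to see the highlighting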
I want to get all the possible combinations of size 2 of a column in a PySpark DataFrame.
My pyspark dataframe looks like
| id |
| 1 |
| 2 |
| 3 |
| 4 |
For the above input, I want to get the output as
| id1 | id2 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 3 |
and so on..
One way would be to collect the values into a Python iterable (list, pandas DataFrame) and use itertools.combinations to generate all the combinations.
import itertools
from pyspark.sql import functions as F

values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))
However, I want to avoid collecting the dataframe column to the driver since the rows can be extremely large. Is there a better way to achieve this using spark APIs?
You can use the crossJoin method and then keep only the rows where id1 < id2, which drops self-pairs and duplicate orderings.
df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
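For the sample id column above, a quick check could look like this (assuming an active spark session; crossJoin gives no ordering guarantee, hence the orderBy for display):
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])
pairs = df.toDF("id1").crossJoin(df.toDF("id2")).filter("id1 < id2")
pairs.orderBy("id1", "id2").show()
# +---+---+
# |id1|id2|
# +---+---+
# |  1|  2|
# |  1|  3|
# |  1|  4|
# |  2|  3|
# |  2|  4|
# |  3|  4|
# +---+---+
Note that this still materialises all n*(n-1)/2 pairs, so it trades the driver-side collect for a potentially large join on the cluster.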
My question is very similar to the one asked but unanswered here
Replicating GROUP_CONCAT for pandas.DataFrame
I have a Pandas DataFrame which I want to group-concat into a data frame:
+------+---------+
| team | user |
+------+---------+
| A | elmer |
| A | daffy |
| A | bugs |
| B | dawg |
| A | foghorn |
+------+---------+
Becoming
+------+---------------------------------------+
| team | group_concat(user) |
+------+---------------------------------------+
| A | elmer,daffy,bugs,foghorn |
| B | dawg |
+------+---------------------------------------+
As answered in the original topic, it can be done via any of these:
df.groupby('team').apply(lambda x: ','.join(x.user))
df.groupby('team').apply(lambda x: list(x.user))
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
But the resulting object is not a Pandas Dataframe anymore.
How can I get the GROUP_CONCAT results in the original Pandas DataFrame as a new column?
Cheers
You can apply a join after grouping by, then reset_index to get a DataFrame back.
output_df = df.groupby('team')['user'].apply(lambda x: ",".join(list(x))).reset_index()
output_df = output_df.rename(columns={'user': 'group_concat(user)'})
output_df
  team        group_concat(user)
0    A  elmer,daffy,bugs,foghorn
1    B                      dawg
Let's break down the code below:
Firstly, group by team and use apply on user to join its elements with a ','.
Then reset the index, and rename the resulting column (axis=1 refers to columns, not rows).
res = (df.groupby('team')['user']
         .apply(lambda x: ','.join(str(i) for i in x))
         .reset_index()
         .rename({'user': 'group_concat(user)'}, axis=1))
Output:
  team        group_concat(user)
0    A  elmer,daffy,bugs,foghorn
1    B                      dawg
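If what you actually need is a new column on the original DataFrame (the concatenated string repeated on every row of its team) rather than one aggregated row per team, a transform-based sketch over the same sample data could look like this:
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'A'],
                   'user': ['elmer', 'daffy', 'bugs', 'dawg', 'foghorn']})

# transform broadcasts the joined string back onto every row of its group
df['group_concat(user)'] = df.groupby('team')['user'].transform(','.join)
# every team-A row now carries 'elmer,daffy,bugs,foghorn'; the B row carries 'dawg'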
Let's say I have a dataframe formatted the following way:
id | name | 052017 | 062017 | 072017 | 092017 | 102017
20 | abcd |      0 |    100 |    200 |     50 |      0
I need to retrieve the column name of the last month an organization had any transactions. In this case, I would like to add a column called "date_string" that would have 092017 as its contents.
Any way to achieve this?
Thanks!
Replace 0 with np.nan, then use last_valid_index:
df.replace(0, np.nan).apply(lambda x: x.last_valid_index(), axis=1)
Out[602]:
0    092017
dtype: object
#df['newcol'] = df.replace(0, np.nan).apply(lambda x: x.last_valid_index(), axis=1)
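For a fully self-contained version, here is a sketch that rebuilds the single sample row from the question (column names assumed) and restricts the replace to the month columns, so that id and name can never be picked up as the "last valid" label:
import numpy as np
import pandas as pd

df = pd.DataFrame([[20, 'abcd', 0, 100, 200, 50, 0]],
                  columns=['id', 'name', '052017', '062017', '072017', '092017', '102017'])

month_cols = ['052017', '062017', '072017', '092017', '102017']
# 0 -> NaN, then take the label of the last non-NaN column in each row
df['date_string'] = df[month_cols].replace(0, np.nan).apply(lambda x: x.last_valid_index(), axis=1)
print(df['date_string'])   # 0    092017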
I'm sure what I'm trying to do is fairly simple for those with better knowledge of pandas, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea how to achieve this...
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DataFrame is loaded from an SQLAlchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
Count per (Date, Value) pair, then pivot Value into columns:
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the parameter sort=False can also be helpful if you want to keep the groups in their order of appearance:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print (df1)
Value       Total  a  b  c
Date
01/01/2016      3  1  1  1
The difference between size and count is:
size counts NaN values, count does not.
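To make that distinction concrete, a tiny illustration on a made-up frame with one missing value:
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'Date': ['01/01/2016'] * 3,
                    'Value': ['a', np.nan, 'c']})

print(tmp.groupby('Date')['Value'].size())    # 3 -> the NaN row is included
print(tmp.groupby('Date')['Value'].count())   # 2 -> the NaN row is excluded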