I have a dataframe like this:
+----+------+
|name| value|
+----+------+
|   x|  down|
|   y|normal|
|   z|  down|
|   x|normal|
|   y|  down|
+----+------+
If the names are the same, I want to assign the same number (1, 2, 3, ...) to each name. The new column must look like this:
+----+------+------+
|name| value|newCol|
+----+------+------+
|   x|  down|     1|
|   y|normal|     2|
|   z|  down|     3|
|   x|normal|     1|
|   y|  down|     2|
+----+------+------+
from pyspark.sql import Window
from pyspark.sql.functions import count

win = Window.partitionBy("name").orderBy("name")
dp_df_classification_agg_join = dp_df_classification_agg_join.withColumn("newCol", count("name").over(win))
First, replace the count("name") function with the dense_rank() function.
Then, replace win = Window.partitionBy("name").orderBy("name") with win = Window.partitionBy().orderBy("name"), so the rank is computed over the whole frame ordered by name and every row with the same name gets the same rank.
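Putting both changes together, a minimal sketch (reusing your dataframe name) would be:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# no partition: rank over the whole frame ordered by name,
# so rows with the same name share a rank (x -> 1, y -> 2, z -> 3)
win = Window.partitionBy().orderBy("name")
dp_df_classification_agg_join = dp_df_classification_agg_join.withColumn("newCol", dense_rank().over(win))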
I have a pyspark dataframe as below, df
| D1 | D2 | D3 |Out|
| 2 | 4 | 5 |D2 |
| 5 | 8 | 4 |D3 |
| 3 | 7 | 8 |D1 |
And I would like to fill a Result column where, for each row, the value comes from the column whose name is given in the Out column of that same row.
| D1 | D2 | D3 |Out|Result|
| 2 | 4 | 5 |D2 |4 |
| 5 | 8 | 4 |D3 |4 |
| 3 | 7 | 8 |D1 |3 |
df_lag=df.rdd.map(lambda row: row + (row[row.Out],)).toDF(df.columns + ["Result"])
I have tried the code above and it produces the result, but when I try to save to CSV it keeps failing with the error "Job aborted due to......", so I would like to ask whether there is another method that obtains the same result. Thanks!
You can use chained when statements generated dynamically from the column names using reduce:
from functools import reduce
import pyspark.sql.functions as F
df2 = df.withColumn(
    'Result',
    reduce(
        lambda x, y: x.when(F.col('Out') == y, F.col(y)),
        df.columns[:-1],
        F  # seed the chain with the functions module so the first call is F.when(...)
    )
)
df2.show()
+---+---+---+---+------+
| D1| D2| D3|Out|Result|
+---+---+---+---+------+
| 2| 4| 5| D2| 4|
| 5| 8| 4| D3| 4|
| 3| 7| 8| D1| 3|
+---+---+---+---+------+
I have a data frame that looks like
+-------+-------+
| Code1 | Code2 |
+-------+-------+
| A | 1 |
| B | 1 |
| A | 2 |
| B | 2 |
| C | 2 |
| D | 2 |
| D | 3 |
| F | 3 |
| G | 3 |
+-------+-------+
I then want to apply a unique set of filters like so:
Scenario 1 -> filter on Code1 IN (A,B)
Scenario 2 -> filter on Code1 IN (A,D) and Code2 IN (2,3)
Scenario 3 -> filter on Code2 = 2
The result of applying the filter should be a data frame that looks like:
+-------+-------+----------+
| Code1 | Code2 | Scenario |
+-------+-------+----------+
| A | 1 | 1 |
| B | 1 | 1 |
| A | 2 | 1 |
| B | 2 | 1 |
| A | 2 | 2 |
| D | 2 | 2 |
| D | 3 | 2 |
| A | 2 | 3 |
| B | 2 | 3 |
| C | 2 | 3 |
| D | 2 | 3 |
+-------+-------+----------+
QUESTION: What is the most efficient way to do this with Spark via Python?
I am new to Spark, so I am really asking at a conceptual level and don't need an explicit solution. I am aiming to achieve as much parallelism as possible in the operation. My real-life example uses an initial data frame with 38 columns that ranges from roughly 100MB to a couple of GB as a CSV file, and I typically have at most 100-150 scenarios.
The original design of the solution was to process each scenario filter sequentially and union the resulting filtered data frames together, but I feel like that negates the whole point of using spark.
EDIT: Does it though? For each scenario, I would filter and then union, which are both transformations (lazy eval). Would the eventual execution plan be smart enough to automatically parallelize the multiple unique filters?
Isn't there a way we can apply the filters in parallel, e.g., apply scenario filter 1 at the same time as applying filters 2 and 3? Would we have to "blow up" the initial dataframe N times, where N = # of scenario filters, append a Scenario # column to the new data frame, and apply one big filter that looks something like:
WHERE (Scenario = 1 AND Code1 IN (A,B)) OR
(Scenario = 2 AND Code1 IN (A,D) AND Code2 IN (2,3)) OR
(Scenario = 3 AND Code2 = 2)
And if that does end up being the most efficient way, isn't it also dependent on how much memory the "blown up" data frame takes? If the "blown up" data frame takes up more memory than what my cluster has, am I going to have to process only as many scenarios as can fit in memory?
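To make the idea concrete, a hypothetical PySpark sketch of that "blown up" approach (cross-joining a small scenarios dataframe and applying one big filter, assuming the original data frame is df and a SparkSession named spark) could look like this:

from pyspark.sql import functions as F

# hypothetical scenarios dataframe: one row per scenario id
scenarios = spark.createDataFrame([(1,), (2,), (3,)], ["Scenario"])

blown_up = df.crossJoin(scenarios)
result = blown_up.filter(
    ((F.col("Scenario") == 1) & F.col("Code1").isin("A", "B"))
    | ((F.col("Scenario") == 2) & F.col("Code1").isin("A", "D") & F.col("Code2").isin(2, 3))
    | ((F.col("Scenario") == 3) & (F.col("Code2") == 2))
)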
You can apply all the filters at once (Scala syntax shown here):
data.withColumn("scenario",
when('code1.isin("A", "B"), 1).otherwise(
when('code1.isin("A", "D") && 'code2.isin("2","3"), 2).otherwise(
when('code2==="2",3)
)
)
).show()
But you have one more problem: for example, the values (A, 2) could fall into all of your scenarios 1, 2 and 3. In that case you could try something like this:
data.withColumn("s1", when('code1.isin("A", "B"), 1).otherwise(0))
.withColumn("s2",when('code1.isin("A", "D") && 'code2.isin("2","3"), 1).otherwise(0))
.withColumn("s3",when('code2==="2",1).otherwise(0))
.show()
output:
+-----+-----+---+---+---+
|code1|code2| s1| s2| s3|
+-----+-----+---+---+---+
| A| 1| 1| 0| 0|
| B| 1| 1| 0| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| A| 2| 1| 1| 1|
| D| 2| 0| 1| 1|
| D| 3| 0| 1| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| C| 2| 0| 0| 1|
| D| 2| 0| 1| 1|
+-----+-----+---+---+---+
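Since the question asks about Python, a rough PySpark equivalent of the flag-column version above (assuming the same lowercase column names used in this answer) might be:

from pyspark.sql import functions as F

flagged = (
    data.withColumn("s1", F.when(F.col("code1").isin("A", "B"), 1).otherwise(0))
        .withColumn("s2", F.when(F.col("code1").isin("A", "D") & F.col("code2").isin("2", "3"), 1).otherwise(0))
        .withColumn("s3", F.when(F.col("code2") == "2", 1).otherwise(0))
)
flagged.show()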
In the EDIT to my question, I questioned whether the lazy evaluation is the key to the problem. After doing some research in the Spark UI, I have concluded that even though my original solution looks like it is applying transformations (filter then union) sequentially for each scenario, it is actually applying all the transformations simultaneously, once an action (e.g., dataframe.count()) is called. The screenshot here represents the Event Timeline from the transformation phase of the dataframe.count() job.
The job includes 96 scenarios each with a unique filter on the original data frame. You can see that my local machine is running 8 tasks simultaneously, where each task represents a filter from one of the scenarios.
In conclusion, Spark takes care of optimizing the filters to run in parallel, once an action has been called on the resulting dataframe.
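For reference, a minimal sketch of the filter-then-union pattern described above (with a hypothetical list of scenario conditions, and assuming the original data frame is named df) is shown below; Spark builds one lazy plan and schedules the per-scenario filters in parallel once an action runs:

from functools import reduce
from pyspark.sql import functions as F

# hypothetical (scenario id, filter condition) pairs
scenarios = [
    (1, F.col("Code1").isin("A", "B")),
    (2, F.col("Code1").isin("A", "D") & F.col("Code2").isin(2, 3)),
    (3, F.col("Code2") == 2),
]

parts = [df.filter(cond).withColumn("Scenario", F.lit(i)) for i, cond in scenarios]
result = reduce(lambda a, b: a.unionByName(b), parts)
result.count()  # the action that triggers evaluation of the whole plan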
I have a pyspark dataframe with a list of customers, days, and transaction types.
+----------+-----+------+
| Customer | Day | Type |
+----------+-----+------+
| A | 2 | X11 |
| A | 4 | X2 |
| A | 9 | Y4 |
| A | 11 | X1 |
| B | 3 | Y4 |
| B | 7 | X1 |
+----------+-----+------+
I'd like to create a column that has "most recent X type" for each customer, like so:
+----------+-----+------+-------------+
| Customer | Day | Type | MostRecentX |
+----------+-----+------+-------------+
| A | 2 | X11 | X11 |
| A | 4 | X2 | X2 |
| A | 9 | Y4 | X2 |
| A | 11 | X1 | X1 |
| B | 3 | Y4 | - |
| B | 7 | X1 | X1 |
+----------+-----+------+-------------+
So for the X types it just takes the one from the current row, but for the Y type it takes the type from the most recent X row for that customer (and if there isn't one, it gets a blank or something). I imagine I need some sort of window function, but I'm not very familiar with PySpark.
You can achieve this by taking the last Type value that starts with the letter "X" over a Window that partitions by Customer and orders by Day. Specify the Window to start at the beginning of the partition and stop at the current row.
from pyspark.sql import Window
from pyspark.sql.functions import col, last, when
w = Window.partitionBy("Customer").orderBy("Day").rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn(
    "MostRecentX",
    last(when(col("Type").startswith("X"), col("Type")), ignorenulls=True).over(w)
)
df.show()
#+--------+---+----+-----------+
#|Customer|Day|Type|MostRecentX|
#+--------+---+----+-----------+
#| A| 2| X11| X11|
#| A| 4| X2| X2|
#| A| 9| Y4| X2|
#| A| 11| X1| X1|
#| B| 3| Y4| null|
#| B| 7| X1| X1|
#+--------+---+----+-----------+
The trick here is to use when to return the Type column only if it starts with "X". By default, when will return null. Then we can use last with ignorenulls=True to get the value for MostRecentX.
If you want to replace the null with "-" as shown in your question, just call fillna on the MostRecentX column:
df.fillna("-", subset=["MostRecentX"]).show()
#+--------+---+----+-----------+
#|Customer|Day|Type|MostRecentX|
#+--------+---+----+-----------+
#| A| 2| X11| X11|
#| A| 4| X2| X2|
#| A| 9| Y4| X2|
#| A| 11| X1| X1|
#| B| 3| Y4| -|
#| B| 7| X1| X1|
#+--------+---+----+-----------+
I have a Spark dataframe that adheres to the following structure:
+------+-----------+-----------+-----------+------+
|ID | Name1 | Name2 | Name3 | Y |
+------+-----------+-----------+-----------+------+
| 1 | A,1 | B,1 | C,4 | B |
| 2 | D,2 | E,2 | F,8 | D |
| 3 | G,5 | H,2 | I,3 | H |
+------+-----------+-----------+-----------+------+
For every row I want to find the column in which the value of Y appears as the first element. So, ideally, I want to retrieve a list like: [Name2, Name1, Name2].
I am not sure whether it would work to first convert to an RDD, then use a map function, and convert the result back to a DataFrame.
Any ideas are welcome.
You can probably try this piece of code:
df.show()
+---+-----+-----+-----+---+
| ID|Name1|Name2|Name3| Y|
+---+-----+-----+-----+---+
| 1| A,1| B,1| C,4| B|
| 2| D,2| E,2| F,8| D|
| 3| G,5| H,2| I,3| H|
+---+-----+-----+-----+---+
from pyspark.sql import functions as F
name_cols = ["Name1", "Name2", "Name3"]
cond = F
for col in name_cols:
    cond = cond.when(F.split(F.col(col), ',').getItem(0) == F.col("Y"), col)
df.withColumn("whichName", cond).show()
+---+-----+-----+-----+---+---------+
| ID|Name1|Name2|Name3| Y|whichName|
+---+-----+-----+-----+---+---------+
| 1| A,1| B,1| C,4| B| Name2|
| 2| D,2| E,2| F,8| D| Name1|
| 3| G,5| H,2| I,3| H| Name2|
+---+-----+-----+-----+---+---------+
I have a PySpark dataframe like the following, where c1, c2, c3, c4, c5, c6 are the columns:
+----------------------------+
|c1 | c2 | c3 | c4 | c5 | c6 |
|----------------------------|
| a | x | y | z | g | h |
| b | m | f | l | n | o |
| c | x | y | z | g | h |
| d | m | f | l | n | o |
| e | x | y | z | g | i |
+----------------------------+
I want to extract c1 values for the rows which have same c2,c3,c4,c5 values but different c1 values.
For example, the 1st, 3rd and 5th rows have the same values for c2, c3, c4 and c5 but different c1 values, so the output should be a, c and e.
(update)
Similarly, the 2nd and 4th rows have the same values for c2, c3, c4 and c5 but different c1 values, so the output should also contain b and d.
How can I obtain such a result? I have tried applying groupBy but I don't understand how to obtain the distinct c1 values.
UPDATE:
Output should be a Dataframe of c1 values
# +-------+
# |c1_dups|
# +-------+
# | a,c,e|
# | b,d|
# +-------+
My Approach:
m = data.groupBy('c2', 'c3', 'c4', 'c5')
but I don't understand how to retrieve the values from m. I'm new to PySpark dataframes, hence very confused.
This is actually very simple; let's create some data first:
schema = ['c1','c2','c3','c4','c5','c6']
rdd = sc.parallelize(["a,x,y,z,g,h","b,x,y,z,l,h","c,x,y,z,g,h","d,x,f,y,g,i","e,x,y,z,g,i"]) \
.map(lambda x : x.split(","))
df = sqlContext.createDataFrame(rdd,schema)
# +---+---+---+---+---+---+
# | c1| c2| c3| c4| c5| c6|
# +---+---+---+---+---+---+
# | a| x| y| z| g| h|
# | b| x| y| z| l| h|
# | c| x| y| z| g| h|
# | d| x| f| y| g| i|
# | e| x| y| z| g| i|
# +---+---+---+---+---+---+
Now the fun part: you just need to import some functions, then group by and explode as follows:
from pyspark.sql.functions import *

# collect c1 values as a list and count them at the same time, then keep only groups with duplicates
dupes = df.groupBy('c2', 'c3', 'c4', 'c5') \
    .agg(collect_list('c1').alias("c1s"), count('c1').alias("count")) \
    .filter(col('count') > 1)

df2 = dupes.select(explode("c1s").alias("c1_dups"))
df2.show()
# +-------+
# |c1_dups|
# +-------+
# | a|
# | c|
# | e|
# +-------+
I hope this answers your question.
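If you would rather get the grouped form shown in the question (one row per duplicate group, e.g. a,c,e), a small variation is to skip the explode and concatenate the collected list instead, for example:

from pyspark.sql.functions import concat_ws

dupes.select(concat_ws(",", "c1s").alias("c1_dups")).show()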