I have this data frame
+---------+------+-----+-------------+-----+
| LCLid|KWH/hh|Acorn|Acorn_grouped|Month|
+---------+------+-----+-------------+-----+
|MAC000002| 0.0| 0| 0| 10|
|MAC000002| 0.0| 0| 0| 10|
|MAC000002| 0.0| 0| 0| 10|
I want to group by the LCid and month's average consumption only in a certain way that a get
+---------+-----+------------------+----------+------------------+
| LCLid|Month| sum(KWH/hh)|Acorn |Acorn_grouped |
+---------+-----+------------------+----------+------------------+
|MAC000003| 10| 904.9270009999999| 0 | 0 |
|MAC000022| 2|1672.5559999999978| 1 | 0 |
|MAC000019| 4| 368.4720001000007| 1 | 1 |
|MAC000022| 9|449.07699989999975| 0 | 1 |
|MAC000024| 8| 481.7160003000004| 1 | 0 |
but what I could do is using this code
dataset=dataset.groupBy("LCLid","Month").sum()
which gave me this result
+---------+-----+------------------+----------+------------------+----------+
| LCLid|Month| sum(KWH/hh)|sum(Acorn)|sum(Acorn_grouped)|sum(Month)|
+---------+-----+------------------+----------+------------------+----------+
|MAC000003| 10| 904.9270009999999| 2978| 2978| 29780|
|MAC000022| 2|1672.5559999999978| 12090| 4030| 8060|
|MAC000019| 4| 368.4720001000007| 20174| 2882| 11528|
|MAC000022| 9|449.07699989999975| 8646| 2882| 25938|
the problem is that the sum function was calculated also on the acron and acron_grouped
have you any idea how could I make the grouping only on the KWH/hh
Depends on how you want to handle the other two columns. If you don't want to sum them, and just want any value from that column, you can do
import pyspark.sql.functions as F
dataset = dataset.groupBy("LCLid","Month").agg(
F.sum("KWH/hh"),
F.first("Acorn").alias("Acorn"),
F.first("Acorn_grouped").alias("Acorn_grouped")
)
Related
Imagine you have two datasets df and df2 like the following:
df:
ID Size Condition
1 2 1
2 3 0
3 5 0
4 7 1
df2:
aux_ID Scalar
1 2
3 2
I want to get an output where if the condition of df is 1, we multiply the size times the scalar and then return df with the changed values.
I would want to do this as efficient as possible, perhaps avoiding the join if that's possible.
output_df:
ID Size Condition
1 4 1
2 3 0
3 5 0
4 7 1
Not sure why would you want to avoid Joins in the first place. They can be efficient in there own regards.
With this said , this can be easily done with Merging the 2 datasets and building a case-when statement against the condition
Data Preparation
df1 = pd.read_csv(StringIO("""ID,Size,Condition
1,2,1
2,3,0
3,5,0
4,7,1
""")
,delimiter=','
)
df2 = pd.read_csv(StringIO("""aux_ID,Scalar
1,2
3,2
""")
,delimiter=','
)
sparkDF1 = sql.createDataFrame(df1)
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---+----+---------+
| ID|Size|Condition|
+---+----+---------+
| 1| 2| 1|
| 2| 3| 0|
| 3| 5| 0|
| 4| 7| 1|
+---+----+---------+
sparkDF2.show()
+------+------+
|aux_ID|Scalar|
+------+------+
| 1| 2|
| 3| 2|
+------+------+
Case When
finalDF = sparkDF1.join(sparkDF2
,sparkDF1['ID'] == sparkDF2['aux_ID']
,'left'
).select(sparkDF1['*']
,sparkDF2['Scalar']
,sparkDF2['aux_ID']
).withColumn('Size_Changed',F.when( ( F.col('Condition') == 1 )
& ( F.col('aux_ID').isNotNull())
,F.col('Size') * F.col('Scalar')
).otherwise(F.col('Size')
)
)
finalDF.show()
+---+----+---------+------+------+------------+
| ID|Size|Condition|Scalar|aux_ID|Size_Changed|
+---+----+---------+------+------+------------+
| 1| 2| 1| 2| 1| 4|
| 3| 5| 0| 2| 3| 5|
| 2| 3| 0| null| null| 3|
| 4| 7| 1| null| null| 7|
+---+----+---------+------+------+------------+
You can drop the unnecessary columns , I kept them for your illustration
spark = 2.x
New to pyspark.
While encoding date related columns for training DNN keep on facing error mentioned in the title.
from df
day month ...
1 1
2 3
3 1 ...
I am trying to get cos, sine value for each column in order to capture their cyclic nature.
When applying function to column in pyspark udf worked fine until now. But below code doesn't work
def to_cos(x, _max):
return np.sin(2*np.pi*x / _max)
to_cos_udf = udf(to_cos, DecimalType())
df = df.withColumn("month", to_cos_udf("month", 12))
I've tried it with IntegerType and tried it with only one variable def to_cos(x) however none of them seem to work and outputs:
Py4JJavaError: An error occurred while calling 0.24702.showString.
Since you havent shared the entire Stacktrack from the error , not sure what is the actual error which is causing the failure
However by the code snippets you have shared , Firstly you need to update your UDF definition as below -
Will passing arguments to a UDF function using it with lambda is probably the best approach towards it , apart from that you can use partial
Data Preparation
df = pd.DataFrame({
'month':[i for i in range(0,12)],
})
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+-----+
|month|
+-----+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
+-----+
Custom UDF
def to_cos(x,_max):
try:
res = np.sin(2*np.pi*x / _max)
except Exception as e:
res = 0.0
return float(res)
max_cos = 12
to_cos_udf = F.udf(lambda x: to_cos(x,max_cos),FloatType())
sparkDF = sparkDF.withColumn('month_cos',to_cos_udf('month'))
sparkDF.show()
+-----+-------------+
|month| month_cos|
+-----+-------------+
| 0| 0.0|
| 1| 0.5|
| 2| 0.8660254|
| 3| 1.0|
| 4| 0.8660254|
| 5| 0.5|
| 6|1.2246469E-16|
| 7| -0.5|
| 8| -0.8660254|
| 9| -1.0|
| 10| -0.8660254|
| 11| -0.5|
+-----+-------------+
Custom UDF - Partial
from functools import partial
partial_func = partial(to_cos,_max=max_cos)
to_cos_partial_udf = F.udf(partial_func)
sparkDF = sparkDF.withColumn('month_cos',to_cos_partial_udf('month'))
sparkDF.show()
+-----+--------------------+
|month| month_cos|
+-----+--------------------+
| 0| 0.0|
| 1| 0.49999999999999994|
| 2| 0.8660254037844386|
| 3| 1.0|
| 4| 0.8660254037844388|
| 5| 0.49999999999999994|
| 6|1.224646799147353...|
| 7| -0.4999999999999998|
| 8| -0.8660254037844384|
| 9| -1.0|
| 10| -0.8660254037844386|
| 11| -0.5000000000000004|
+-----+--------------------+
I have a spark dataframe
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
1 | CA | appel | 100
2 | CA | orange| 20
3 | CA | berry | 10
I want to get rows where fruits are apple or orange. So I use Spark SQL:
SELECT * FROM table WHERE fruit LIKE '%apple%' OR fruit LIKE '%orange%';
It returns
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
2 | CA | orange| 20
But it is supposed to return
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
1 | CA | appel | 100
2 | CA | orange| 20
as row 1 is just a misspelling.
So I plan on using fuzzywuzzy for string matching.
I know that
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
print(fuzz.partial_ratio('apple', 'apple')) -> 100
print(fuzz.partial_ratio('apple', 'appel')) -> 83
But am not sure how to apply this to a column in dataframe to get relevant rows
Since you are interested in implementing fuzzy matching as a filter, you must first decide on a threshold of how similar you would like the matches.
Approach 1
For your fuzzywuzzy import this could be 80 for the purpose of this demonstration (adjust based on your needs). You could then implement a udf to apply your imported fuzzy logic code eg
from pyspark.sql import functions as F
from pyspark.sql import types as T
F.udf(T.BooleanType())
def is_fuzzy_match(field_value,search_value, threshold=80):
from fuzzywuzzy import fuzz
return fuzz.partial_ratio(field_value, search_value) > threshold
Then apply your udf as a filter on your dataframe
df = (
df.where(
is_fuzzy_match(F.col("fruit"),F.lit("apple")) |
is_fuzzy_match(F.col("fruit"),F.lit("orange"))
)
)
Approach 2 : Recommended
However, udfs can be expensive when executed on spark and spark has already implemented the levenshtein function which is also useful here. You may start reading more about how the levenshtein distance accomplishes fuzzy matching .
With this approach your code code look like using a threshold of 3 below
from pyspark.sql import functions as F
df = df.where(
(
F.levenshtein(
F.col("fruit"),
F.lit("apple")
) < 3
) |
(
F.levenshtein(
F.col("fruit"),
F.lit("orange")
) < 3
)
)
df.show()
+---+----+------+--------+
| id|city| fruit|quantity|
+---+----+------+--------+
| 0| CA| apple| 300|
| 1| CA| appel| 100|
| 2| CA|orange| 20|
+---+----+------+--------+
For debugging purposes the result of the levenshtein has been included below
df.withColumn("diff",
F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).show()
+---+----+------+--------+----+
| id|city| fruit|quantity|diff|
+---+----+------+--------+----+
| 0| CA| apple| 300| 0|
| 1| CA| appel| 100| 2|
| 2| CA|orange| 20| 5|
| 3| CA| berry| 10| 5|
+---+----+------+--------+----+
Update 1
In response to additional sample data provided by Op in comments:
If I have a fruit like kashmir apple and want it to match with apple
Approach 3
You could try the following approach and adjust the threshold as desired.
Since you are interested in matching the possibility of a misspelled fruit throughout the text, you could attempt to apply the levenshtein to every piece throughout the fruit name. The functions below (not udfs but for readability simplifies the application of the task ) implement this approach. matches_fruit_ratio attempts to determine how much of a match is found while matches_fruit determines the maximum matches_fruit_ratio on every piece of the fruit name split by a space.
from pyspark.sql import functions as F
def matches_fruit_ratio(fruit_column,fruit_search,threshold=0.3):
return (F.length(fruit_column) - F.levenshtein(
fruit_column,
F.lit(fruit_search)
)) / F.length(fruit_column)
def matches_fruit(fruit_column,fruit_search,threshold=0.6):
return F.array_max(F.transform(
F.split(fruit_column," "),
lambda fruit_piece : matches_fruit_ratio(fruit_piece,fruit_search)
)) >= threshold
This can be used as follows:
df = df.where(
matches_fruit(
F.col("fruit"),
"apple"
) | matches_fruit(
F.col("fruit"),
"orange"
)
)
df.show()
+---+----+-------------+--------+
| id|city| fruit|quantity|
+---+----+-------------+--------+
| 0| CA| apple| 300|
| 1| CA| appel| 100|
| 2| CA| orange| 20|
| 4| CA| apply berry| 3|
| 5| CA| apple berry| 1|
| 6| CA|kashmir apple| 5|
| 7| CA|kashmir appel| 8|
+---+----+-------------+--------+
For debugging purposes, I have added additional sample data and output columns for the different components of each function while demonstrating how this function may be used
df.withColumn("length",
F.length(
"fruit"
)
).withColumn("levenshtein",
F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).withColumn("length - levenshtein",
F.length(
"fruit"
) - F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).withColumn(
"matches_fruit_ratio",
matches_fruit_ratio(
F.col("fruit"),
"apple"
)
).withColumn(
"matches_fruit_values_before_threshold",
F.array_max(F.transform(
F.split("fruit"," "),
lambda fruit_piece : matches_fruit_ratio(fruit_piece,"apple")
))
).withColumn(
"matches_fruit",
matches_fruit(
F.col("fruit"),
"apple"
)
).show()
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| id|city| fruit|quantity|length|levenshtein|length - levenshtein|matches_fruit_ratio|matches_fruit_values_before_threshold|matches_fruit|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| 0| CA| apple| 300| 5| 0| 5| 1.0| 1.0| true|
| 1| CA| appel| 100| 5| 2| 3| 0.6| 0.6| true|
| 2| CA| orange| 20| 6| 5| 1|0.16666666666666666| 0.16666666666666666| false|
| 3| CA| berry| 10| 5| 5| 0| 0.0| 0.0| false|
| 4| CA| apply berry| 3| 11| 6| 5|0.45454545454545453| 0.8| true|
| 5| CA| apple berry| 1| 11| 6| 5|0.45454545454545453| 1.0| true|
| 6| CA|kashmir apple| 5| 13| 8| 5|0.38461538461538464| 1.0| true|
| 7| CA|kashmir appel| 8| 13| 10| 3|0.23076923076923078| 0.6| true|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
I need help getting conditional output from pyspark when using groupBy. I have the following input table:
+----+-----------+-------+
|time|auth_orient|success|
+----+-----------+-------+
| 1| LogOn|Success|
| 1| LogOff|Success|
| 1| LogOff|Success|
| 1| LogOn|Success|
| 1| LogOn| Fail|
| 1| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn|Fail |
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn|Fail |
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
+----+-----------+-------+
The table below shows what I want, which only displays the logon stats:
+----+-----------+-------+
|time|Fail |success|
+----+-----------+-------+
| 1|1 |3 |
| 2|2 |8 |
+----+-----------+-------+
Overall I am trying to group on time and populate the new columns, preferably I would rather have the code populate the column names as I will not always have a complete list, with counts.
I know a portion of what I am trying to do is capable with MultilabelBinarizer, but that is not currently available in pyspark from what I have seen.
Filter the data frame down to LogOn only first and then do groupBy.pivot:
import pyspark.sql.functions as F
df.filter(
df.auth_orient == 'LogOn'
).groupBy('time').pivot('success').agg(F.count('*')).show()
+----+----+-------+
|time|Fail|Success|
+----+----+-------+
| 1| 1| 3|
| 2| 2| 8|
+----+----+-------+
With a dataframe like this,
rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,None,"201602"),(1,20,3003,"201601"), (1,20,None,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, None,"201601")])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|null| 201602|
| 1| 20|3003| 201601|
| 1| 20|null| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing values. I need to do that for several columns.
My current approach is to use na.fill:
fill_values = {column: df_data.agg({column:"mean"}).flatMap(list).collect()[0] for column in df_data.columns if column not in ['date','id']}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Well, one way or another you have to:
compute statistics
fill the blanks
It pretty much limits what you can really improve here, still:
replace flatMap(list).collect()[0] with first()[0] or structure unpacking
compute all stats with a single action
use built-in Row methods to extract dictionary
The final result could like this:
def fill_with_mean(df, exclude=set()):
stats = df.agg(*(
avg(c).alias(c) for c in df.columns if c not in exclude
))
return df.na.fill(stats.first().asDict())
fill_with_mean(df_data, ["id", "date"])
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.