I'm using the code below to collect some info:
df = (
    df
    .select(
        date_format(date_trunc('month', col("reference_date")), 'yyyy-MM-dd').alias("month"),
        col("id"),
        col("name"),
        col("item_type"),
        col("sub_group"),
        col("latitude"),
        col("longitude")
    )
)
My latitude and longitude are values with dots, like -30.130307 and -51.2060018, but I must replace the dot with a comma. I've tried both .replace() and .regexp_replace(), but neither is working. Could you help me, please?
Take the following dataframe as an example:
df.show()
+-------------------+-------------------+
| latitude| longitude|
+-------------------+-------------------+
| 85.70708380916193| -68.05674981929877|
| 57.074495803252404|-42.648691976080215|
| 2.944303748172473| -62.66186439333423|
| 119.76923402031701|-114.41179457810185|
|-138.52573939229234| 54.38429596238362|
+-------------------+-------------------+
You should be able to use the pyspark.sql functions, like the following:
from pyspark.sql import functions
df = df.withColumn("longitude", functions.regexp_replace('longitude',r'[.]',","))
df = df.withColumn("latitude", functions.regexp_replace('latitude',r'[.]',","))
df.show()
+-------------------+-------------------+
| latitude| longitude|
+-------------------+-------------------+
| 85,70708380916193| -68,05674981929877|
| 57,074495803252404|-42,648691976080215|
| 2,944303748172473| -62,66186439333423|
| 119,76923402031701|-114,41179457810185|
|-138,52573939229234| 54,38429596238362|
+-------------------+-------------------+
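The pattern [.] is a character class holding a literal dot, so it matches only the decimal separator, whereas an unescaped . would match every character. As a minimal sketch of the same substitution outside Spark, using Python's re module (dot_to_comma is a hypothetical helper name, not part of the answer above):

```python
import re


def dot_to_comma(value):
    """Return the value as text with the decimal dot replaced by a comma."""
    # str() mirrors the fact that the result of the replacement is a string,
    # not a number.
    return re.sub(r"[.]", ",", str(value))


print(dot_to_comma(-30.130307))
print(dot_to_comma("-51.2060018"))
```

Note that after the replacement the columns hold strings, so any later numeric operation would need the dot form again.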
I have a PySpark dataframe with a text column.
I want to map its values using regular expressions:
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI', 'FI'))
I also want to map specific values according to a dictionary, which I did as follows (mapper comes from create_map()):
df = df.withColumn("mapped_col", mapper.getItem(F.col("action")))
Finally, the values that are matched by neither the dictionary nor the regular expressions should be set to null. I do not know how to do this part together with the other two.
Is it possible to have something like a dictionary of regular expressions, so that I can combine the two steps?
{".*-RH": "RH", ".*FI" : "FI"}
Original Output Example
+-----------------------------+
|message                      |
+-----------------------------+
|GDF2009                      |
|GDF2014                      |
|ADS-set                      |
|ADS-set                      |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa                   |
+-----------------------------+
Expected Output Example
+-----------------------------+----------+
|message                      |status    |
+-----------------------------+----------+
|GDF2009                      |GDF       |
|GDF2014                      |GDF       |
|ADS/set                      |ADS       |
|ADS-set                      |ADS       |
|XSQXQXQSDZADAA5454546a45a4-FI|FI        |
|dadaccpjpifjpsjfefspolamml-FI|FI        |
|dqdazdaapijiejoajojp565656-RH|RH        |
|kijipiadoa                   |null or ??|
+-----------------------------+----------+
So the first four rows are mapped with a dict, and the others are mapped using regex. Unmapped values are null or ??.
Thank you.
You can achieve it using the contains function:
from pyspark.sql import functions as f
from pyspark.sql.types import StringType
df = spark.createDataFrame(
    ["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
     "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()
names = ("GDF", "ADS", "FI", "RH")
def c(col, names):
    # One when() per name: the name itself if the message contains it, otherwise "".
    return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]
# Drop the empty strings and join whatever matched into the status column.
df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+
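The question also asked for a "dictionary of regex expressions" with null for anything unmapped. The lookup order can be sketched in plain Python first: exact dictionary entries win, then the patterns are tried in order, and None stands in for null. The names EXACT, PATTERNS and map_status, and the dictionary contents, are assumptions made for illustration; the original mapper is not shown in the question.

```python
import re

# Hypothetical exact-match dictionary (stands in for the create_map() mapper).
EXACT = {"GDF2009": "GDF", "GDF2014": "GDF", "ADS-set": "ADS"}

# Dictionary of regular expressions, tried in insertion order.
PATTERNS = {r".*-RH": "RH", r".*-FI": "FI"}


def map_status(message):
    """Return the mapped status, or None when nothing matches."""
    if message in EXACT:
        return EXACT[message]
    for pattern, label in PATTERNS.items():
        # fullmatch requires the whole message to match the pattern.
        if re.fullmatch(pattern, message):
            return label
    return None


for msg in ["GDF2009", "ADS-set", "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"]:
    print(msg, "->", map_status(msg))
```

In Spark the same precedence could presumably be expressed as a chain of when(col.rlike(pattern), label) calls with no otherwise(), which leaves unmatched rows null, in contrast to the empty string produced by the contains approach above.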
I want to perform a cumulative product. Previous successful answers use logarithmic sums to do the deed. However, is there a way to use NumPy's cumprod instead? I have tried, with no clear result. Here is my code:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
def cumulative_product(x):
    """Calculation of cumulative product using numpy function cumprod.
    """
    return np.cumprod(float(x)).tolist()
spark_cumulative_product = udf(cumulative_product, ArrayType(DoubleType()))
# the dataset in question:
param.show()
Which gives me for example:
+--------------+-----+
|financial_year| wpi|
+--------------+-----+
| 2014|1.026|
| 2015|1.024|
| 2016|1.021|
| 2017|1.019|
| 2018|1.021|
+--------------+-----+
When applying
param = param.withColumn('cum_wpi', spark_cumulative_product(param['wpi']))
param.show()
I see that nothing changes, i.e.:
+--------------+-----+-------+
|financial_year| wpi|cum_wpi|
+--------------+-----+-------+
| 2014|1.026|[1.026]|
| 2015|1.024|[1.024]|
| 2016|1.021|[1.021]|
| 2017|1.019|[1.019]|
| 2018|1.021|[1.021]|
+--------------+-----+-------+
Can anyone explain what is going wrong, or whether there is a better way to do a cumulative product without using exp-sum-log?
Update:
The desired output is:
+--------------+-----+-------+
|financial_year| wpi|cum_wpi|
+--------------+-----+-------+
| 2014|1.026| 1.026 |
| 2015|1.024| 1.051 |
| 2016|1.021| 1.073 |
| 2017|1.019| 1.093 |
| 2018|1.021| 1.116 |
+--------------+-----+-------+
One way to achieve this is with the pandas Series method cumprod(), applied through a pandas grouped map UDF.
Sample DataFrame:
#+--------------+-----+
#|financial_year| wpi|
#+--------------+-----+
#| 2014|1.026|
#| 2015|1.024|
#| 2016|1.021|
#| 2017|1.019|
#| 2018|1.021|
#+--------------+-----+
I will first create a dummy column, which will be similar to our cum_wpi. I will overwrite this dummy column in the pandas udf. The use of orderBy right before the groupby and apply is there to ensure that the dataframe is sorted on financial_year.
import pandas as pd
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
# Dummy column so that df1.schema already contains cum_wpi.
df1 = df.withColumn("cum_wpi", F.lit(1.2456))
@pandas_udf(df1.schema, PandasUDFType.GROUPED_MAP)
def grouped_map(df1):
    # Overwrite the dummy column with the running product.
    df1['cum_wpi'] = df1['wpi'].cumprod().round(decimals=3)
    return df1
df1.orderBy(F.col("financial_year").asc())\
    .groupby().apply(grouped_map).show()
#+--------------+-----+-------+
#|financial_year| wpi|cum_wpi|
#+--------------+-----+-------+
#| 2014|1.026| 1.026|
#| 2015|1.024| 1.051|
#| 2016|1.021| 1.073|
#| 2017|1.019| 1.093|
#| 2018|1.021| 1.116|
#+--------------+-----+-------+
UPDATE:
You can use aggregate, as mentioned earlier by @pault; as long as we cast acc (the accumulator) to double, it can handle your values.
df.withColumn("cum_wpi", F.expr("""format_number(aggregate(collect_list(wpi)\
over (order by financial_year)\
,cast(1 as double),(acc,x)-> acc*x),3)"""))\
.show(truncate=False)
#+--------------+-----+-------+
#|financial_year|wpi |cum_wpi|
#+--------------+-----+-------+
#|2014 |1.026|1.026 |
#|2015 |1.024|1.051 |
#|2016 |1.021|1.073 |
#|2017 |1.019|1.093 |
#|2018 |1.021|1.116 |
#+--------------+-----+-------+
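Why the original UDF changed nothing can be seen without Spark: a plain UDF is called once per row, so cumprod only ever sees a single value. The sketch below contrasts that with a running product over the whole ordered column, using itertools.accumulate from the standard library; the variable names per_row and running are illustrative only.

```python
from itertools import accumulate
from operator import mul

wpi = [1.026, 1.024, 1.021, 1.019, 1.021]  # ordered by financial_year

# What a row-at-a-time UDF effectively computes: a "cumulative product"
# of one element, i.e. the value itself.
per_row = [[x] for x in wpi]

# A running product needs all earlier rows, which is what the window
# (or the grouped map over the whole frame) provides.
running = [round(v, 3) for v in accumulate(wpi, mul)]

print(per_row)
print(running)
```

The rounded running product matches the desired cum_wpi column from the question.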
In my Spark application I have a dataframe with information like this:
+------------------+---------------+
| labels | labels_values |
+------------------+---------------+
| ['l1','l2','l3'] | 000 |
| ['l3','l4','l5'] | 100 |
+------------------+---------------+
What I am trying to achieve is to create, given a label name as input, a single_label_value column that takes the value for that label from the labels_values column.
For example, for label='l3' I would like to retrieve this output:
+------------------+---------------+--------------------+
| labels | labels_values | single_label_value |
+------------------+---------------+--------------------+
| ['l1','l2','l3'] | 000 | 0 |
| ['l3','l4','l5'] | 100 | 1 |
+------------------+---------------+--------------------+
Here's what I am attempting to use:
selected_label='l3'
label_position = F.array_position(my_df.labels, selected_label)
my_df= my_df.withColumn(
    "single_label_value",
    F.substring(my_df.labels_values, label_position, 1)
)
But I am getting an error because the substring function does not accept a column as the label_position argument.
Is there any way to combine these function outputs without writing a UDF?
I hope this works for you.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark=SparkSession.builder.getOrCreate()
mydata=[[['l1','l2','l3'],'000'], [['l3','l4','l5'],'100']]
df = spark.createDataFrame(mydata,schema=["lebels","lebel_values"])
selected_label='l3'
# array_position is 1-based, like substring, so its result can be used as the position directly.
df2=df.select(
    "*",
    array_position(df.lebels,selected_label).alias("pos_val"))
df2.createOrReplaceTempView("temp_table")
df3=spark.sql("select *,substring(lebel_values,pos_val,1) as val_pos from temp_table")
df3.show()
+------------+------------+-------+-------+
| lebels|lebel_values|pos_val|val_pos|
+------------+------------+-------+-------+
|[l1, l2, l3]| 000| 3| 0|
|[l3, l4, l5]| 100| 1| 1|
+------------+------------+-------+-------+
pos_val is the 1-based location of the label. If you want the zero-based index, subtract 1 from this value.
Edit: the version above works through a temp view. I was still looking for a solution using withColumn; I hope it helps for now.
Edit 2: answer using the DataFrame API only.
df2=df.select(
    "*",
    array_position(df.lebels,selected_label).astype("int").alias("pos_val")
)
df3=df2.withColumn("asked_col",expr("substring(lebel_values,pos_val,1)"))
df3.show()
You could also try:
import pyspark.sql.functions as f
selected_label='l3'
df=df.withColumn('single_label_value', f.col('labels_values').substr(f.array_position(f.col('labels'), selected_label).cast('int'), f.lit(1)))
df.show()
(for Spark version >= 2.4)
I think lit() was the function you were missing: it turns a constant into a column. Column.substr accepts columns for both the position and the length, whereas substring expects plain integers in older PySpark versions.
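Both array_position and substring count positions from 1, which is easy to get wrong when moving between Spark and Python. A minimal plain-Python sketch of the intended lookup, assuming one character of labels_values per label (single_label_value here is a hypothetical helper, not a Spark function):

```python
def single_label_value(labels, labels_values, selected_label):
    """Return the character of labels_values that belongs to selected_label."""
    if selected_label not in labels:
        return None
    # 1-based position, as array_position would report it.
    position = labels.index(selected_label) + 1
    # 1-based substring of length 1, as substring(str, position, 1) would take it.
    return labels_values[position - 1:position]


print(single_label_value(["l1", "l2", "l3"], "000", "l3"))
print(single_label_value(["l3", "l4", "l5"], "100", "l3"))
```

A row such as labels ['l1','l2','l3'] with values '001' is a useful extra check, because an off-by-one position would return '0' instead of '1'.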
I have two pyspark dataframes:
1st dataframe: plants
+-----+--------+
|plant|station |
+-----+--------+
|Kech | st1 |
|Casa | st2 |
+-----+--------+
2nd dataframe: stations
+-------+--------+
|program|station |
+-------+--------+
|pr1 | null|
|pr2 | st1 |
+-------+--------+
What I want is to replace the null values in the second dataframe, stations, with all the values of the station column from the first dataframe, like this:
+-------+--------------+
|program|station |
+-------+--------------+
|pr1 | [st1, st2]|
|pr2 | st1 |
+-------+--------------+
I did this:
stList = plants.select(F.col('station')).rdd.map(lambda x: x[0]).collect()
stations = stations.select(
    F.col('program'),
    F.when(stations.station.isNull(), stList).otherwise(stations.station).alias('station')
)
but it gives me an error: when() doesn't accept a Python list as a parameter.
Thanks for your replies.
I've found the solution by converting the column to pandas:
stList = list(plants.select(F.col('station')).toPandas()['station'])
and then using:
F.when(stations.station.isNull(), F.array([F.lit(x) for x in stList])).otherwise(stations['station']).alias('station')
This gives an array directly.
A quick workaround is
F.lit(str(stList))
This should work.
For better type handling, use the code below.
stations = stations.select(
    F.col('program'),
    F.when(stations.station.isNull(), F.array([F.lit(x) for x in stList]))
    .otherwise(F.array(stations.station)).alias('station')
)
Firstly, you can't keep different datatypes in station column, it needs to be consistent.
+-------+--------------+
|program|station |
+-------+--------------+
|pr1 | [st1, st2]| # this is array
|pr2 | st1 | # this is string
+-------+--------------+
Secondly, this should do the trick:
from pyspark.sql import functions as F
# Create the stList as a string.
stList = ",".join(plants.select(F.col('station')).rdd.map(lambda x: x[0]).collect())
# coalesce the variables and then apply pyspark.sql.functions.split function
stations = (stations.select(
F.col('program'),
F.split(F.coalesce(stations.station, F.lit(stList)), ",").alias('station')))
stations.show()
Output:
+-------+----------+
|program| station|
+-------+----------+
| pr1|[st1, st2]|
| pr2| [st1]|
+-------+----------+
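The point about consistent types can be sketched in plain Python: every row ends up holding a list, either the full station list (for a null) or a one-element list. The function name normalise_station is hypothetical and only illustrates the rule that coalesce plus split implements above.

```python
def normalise_station(station, all_stations):
    """Return a list in every case so the column has a single type."""
    if station is None:
        # Null rows receive a copy of the complete station list.
        return list(all_stations)
    # Non-null rows are wrapped so they are lists too.
    return [station]


all_stations = ["st1", "st2"]
rows = [("pr1", None), ("pr2", "st1")]
result = [(program, normalise_station(station, all_stations)) for program, station in rows]
print(result)
```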
How do I expand a dataframe based on column values? I intend to go from this dataframe:
+---------+----------+----------+
|DEVICE_ID| MIN_DATE| MAX_DATE|
+---------+----------+----------+
| 1|2019-08-29|2019-08-31|
| 2|2019-08-27|2019-09-02|
+---------+----------+----------+
To one that looks like this:
+---------+----------+
|DEVICE_ID| DATE|
+---------+----------+
| 1|2019-08-29|
| 1|2019-08-30|
| 1|2019-08-31|
| 2|2019-08-27|
| 2|2019-08-28|
| 2|2019-08-29|
| 2|2019-08-30|
| 2|2019-08-31|
| 2|2019-09-01|
| 2|2019-09-02|
+---------+----------+
Any help would be much appreciated.
from datetime import timedelta
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
# Create a sample data row.
df = sqlContext.sql("""
    select 'dev1' as device_id,
    to_date('2020-01-06') as start,
    to_date('2020-01-09') as end""")
# Define a UDF that returns the dates as one comma-separated string.
@udf
def datelist(start, end):
    return ",".join([str(start + timedelta(days=x)) for x in range(0, 1+(end-start).days)])
# Split the string and explode the list of dates into rows.
df.select("device_id",
    F.explode(
        F.split(datelist(df["start"], df["end"]), ","))
    .alias("date")).show(10, False)
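The core of the UDF above is an inclusive date range. A minimal sketch of just that step in plain Python, checked against the rows from the question (date_range is a hypothetical helper name):

```python
from datetime import date, timedelta


def date_range(start, end):
    """Return every date from start to end, both ends included."""
    # (end - start).days is the gap in days; + 1 makes the range inclusive.
    return [start + timedelta(days=offset) for offset in range((end - start).days + 1)]


device_1 = date_range(date(2019, 8, 29), date(2019, 8, 31))
device_2 = date_range(date(2019, 8, 27), date(2019, 9, 2))
print([d.isoformat() for d in device_1])
print(len(device_2))
```

On Spark 2.4 or later, the built-in sequence function combined with explode is likely a simpler alternative to the UDF, since it generates the date array directly.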