Extracting the year from Date in Pyspark dataframe - python

I have a PySpark dataframe that contains a date column "Reported Date" (type: string). I would like to get the count of another column after extracting the year from the date.
I can get the count if I use the string date column.
crimeFile_date.groupBy("Reported Date").sum("Offence Count").show()
and I get this output
+-------------+------------------+
|Reported Date|sum(Offence Count)|
+-------------+------------------+
| 13/08/2010| 342|
| 6/10/2011| 334|
| 27/11/2011| 269|
| 12/01/2012| 303|
| 22/02/2012| 286|
| 31/07/2012| 276|
| 25/04/2013| 222|
+-------------+------------------+
To extract the year from "Reported Date" I have converted it to a date format (using this approach) and named the column "Date".
However, when I try to use the same code to group by the new column and do the count I get an error message.
crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()
TypeError: strptime() argument 1 must be str, not None
This is the data schema:
root
|-- Offence Count: integer (nullable = true)
|-- Reported Date: string (nullable = true)
|-- Date: date (nullable = true)
Is there a way to fix this error? or extract the year using another method?
Thank you

If I understand correctly, you want to extract the year from a string date column. One way is to use a regex, but that can throw your logic off if the regex doesn't handle all scenarios.
Here is the date data type approach.
Imports
import pyspark.sql.functions as f
Creating your Dataframe
l1 = [('13/08/2010',342),('6/10/2011',334),('27/11/2011',269),('12/01/2012',303),('22/02/2012',286),('31/07/2012',276),('25/04/2013',222)]
dfl1 = spark.createDataFrame(l1).toDF("dates","sum")
dfl1.show()
+----------+---+
| dates|sum|
+----------+---+
|13/08/2010|342|
| 6/10/2011|334|
|27/11/2011|269|
|12/01/2012|303|
|22/02/2012|286|
|31/07/2012|276|
|25/04/2013|222|
+----------+---+
Now you can use the to_timestamp or to_date functions from the functions package:
dfl2 = dfl1.withColumn('years',f.year(f.to_timestamp('dates', 'dd/MM/yyyy')))
dfl2.show()
+----------+---+-----+
| dates|sum|years|
+----------+---+-----+
|13/08/2010|342| 2010|
| 6/10/2011|334| 2011|
|27/11/2011|269| 2011|
|12/01/2012|303| 2012|
|22/02/2012|286| 2012|
|31/07/2012|276| 2012|
|25/04/2013|222| 2013|
+----------+---+-----+
Now, group by on years.
dfl2.groupBy('years').sum('sum').show()
+-----+--------+
|years|sum(sum)|
+-----+--------+
| 2013| 222|
| 2012| 865|
| 2010| 342|
| 2011| 603|
+-----+--------+
This is shown in multiple steps for clarity, but you can combine the year extraction and the group by into one step, as in the sketch below.
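A minimal one-step sketch, assuming the asker's column names ("Reported Date", "Offence Count") and the dd/MM/yyyy format shown above:
import pyspark.sql.functions as f
# Parse the string date, extract the year, and aggregate in a single pass.
crimeFile_date \
    .groupBy(f.year(f.to_date('Reported Date', 'dd/MM/yyyy')).alias('year')) \
    .sum('Offence Count') \
    .show()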
Happy to extend if you need some other help.

Related

GeoPandas Convert Geometry Column To Geometry Type

I currently have a geopandas dataframe that looks like this
|----|-------|-----|------------------------------------------------|
| id | name | ... | geometry |
|----|-------|-----|------------------------------------------------|
| 1 | poly1 | ... | 0101000020E6100000A6D52A40F1E16690764A7D... |
|----|-------|-----|------------------------------------------------|
| 2 | poly2 | ... | 0101000020E610000065H7D2A459A295J0A67AD2... |
|----|-------|-----|------------------------------------------------|
And when getting ready to write it to postgis, I am getting the following error:
/python3.7/site-packages/geopandas/geodataframe.py:1321: UserWarning: Geometry column does not contain geometry.
warnings.warn("Geometry column does not contain geometry.")
Is there a way to convert this geometry column to a geometry type, so that errors can be avoided when appending to the existing table with a geometry-type column? I've tried:
df['geometry'] = gpd.GeoSeries.to_wkt(df['geometry'])
But there are errors parsing the existing geometry column. Is there a correct way I am missing?
The syntax needs to be changed as below
import re
df['geometry'] = df.geometry.apply(lambda x: x.wkt).apply(lambda x: re.sub('"(.*)"', '\\1', x))
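If the column actually holds hex-encoded WKB strings, as the sample output suggests, an alternative sketch (not from the original answer, and assuming every value parses as valid WKB) is to decode them with shapely before writing:
import geopandas as gpd
from shapely import wkb
# Assumption: df['geometry'] contains hex-encoded WKB strings like the ones shown above.
df['geometry'] = df['geometry'].apply(lambda h: wkb.loads(h, hex=True))
df = gpd.GeoDataFrame(df, geometry='geometry')
The resulting GeoDataFrame can then be written to PostGIS with to_postgis as usual.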

Python / SQL - Loop to fill records based on a roll back/forward of existing populated records

I have some data in two tables: one table is a list of dates (with other fields) running from 1st Jan 2014 until yesterday. The other table contains a year's worth of numeric data (coefficients / metric data) for 2020.
A left join between the two datasets on the date column brings back all the dates, with only the 2020 year of data populated and the rest null, as expected.
What I want to do is to populate the history back to 2014 (and the future) with the 2020 data on a -364 day mapping.
For example
#+----------+-----------+
#|date |metric |
#+----------+-----------+
#|03/02/2018|null |
#|04/02/2018|null |
#|05/02/2018|null |
#|06/02/2018|null |
#|07/02/2018|null |
#|08/02/2018|null |
#|09/02/2018|null |
#|10/02/2018|null |
#|.... | |
#|02/02/2019|null |
#|03/02/2019|null |
#|04/02/2019|null |
#|05/02/2019|null |
#|06/02/2019|null |
#|07/02/2019|null |
#|08/02/2019|null |
#|09/02/2019|null |
#|... |... |
#|01/02/2020|0.071957531|
#|02/02/2020|0.086542975|
#|03/02/2020|0.023767137|
#|04/02/2020|0.109725808|
#|05/02/2020|0.005774458|
#|06/02/2020|0.056242301|
#|07/02/2020|0.086208715|
#|08/02/2020|0.010676928|
This is what I am trying to achieve:
#+----------+-----------+
#|date |metric |
#+----------+-----------+
#|03/02/2018|0.071957531|
#|04/02/2018|0.086542975|
#|05/02/2018|0.023767137|
#|06/02/2018|0.109725808|
#|07/02/2018|0.005774458|
#|08/02/2018|0.056242301|
#|09/02/2018|0.086208715|
#|10/02/2018|0.010676928|
#|.... | |
#|02/02/2019|0.071957531|
#|03/02/2019|0.086542975|
#|04/02/2019|0.023767137|
#|05/02/2019|0.109725808|
#|06/02/2019|0.005774458|
#|07/02/2019|0.056242301|
#|08/02/2019|0.086208715|
#|09/02/2019|0.010676928|
#|... |... |
#|01/02/2020|0.071957531|
#|02/02/2020|0.086542975|
#|03/02/2020|0.023767137|
#|04/02/2020|0.109725808|
#|05/02/2020|0.005774458|
#|06/02/2020|0.056242301|
#|07/02/2020|0.086208715|
#|08/02/2020|0.010676928|
Worth noting that I may eventually have to go back further than 2014, so any dynamism in the population would help!
I'm doing this in Databricks so I can use various languages, but I wanted to focus on Python / PySpark / SQL for solutions.
Any help would be appreciated.
Thanks.
CT
First create new columns month and year:
df_with_month = (df
    .withColumn("month", f.month(f.to_timestamp("date", "dd/MM/yyyy")))
    .withColumn("year", f.year(f.to_timestamp("date", "dd/MM/yyyy"))))
with import pyspark.sql.functions as f.
Create a new DataFrame with 2020's data:
df_2020 = (df_with_month
    .filter(f.col("year") == 2020)
    .withColumnRenamed("metric", "new_metric"))
Join the results on the month:
df_with_metrics = (df_with_month
    .join(df_2020, df_with_month.month == df_2020.month, "left")
    .drop("metric")
    .withColumnRenamed("new_metric", "metric"))
You can do a self join using the condition that the date difference is a multiple of 364 days:
import pyspark.sql.functions as F
df2 = df.join(
    df.toDF('date2', 'metric2'),
    F.expr("""
        datediff(to_date(date, 'dd/MM/yyyy'), to_date(date2, 'dd/MM/yyyy')) % 364 = 0
        and
        to_date(date, 'dd/MM/yyyy') <= to_date(date2, 'dd/MM/yyyy')
    """)
).select(
    'date',
    F.coalesce('metric', 'metric2').alias('metric')
).filter('metric is not null')
df2.show(999)
+----------+-----------+
| date| metric|
+----------+-----------+
|03/02/2018|0.071957531|
|04/02/2018|0.086542975|
|05/02/2018|0.023767137|
|06/02/2018|0.109725808|
|07/02/2018|0.005774458|
|08/02/2018|0.056242301|
|09/02/2018|0.086208715|
|10/02/2018|0.010676928|
|02/02/2019|0.071957531|
|03/02/2019|0.086542975|
|04/02/2019|0.023767137|
|05/02/2019|0.109725808|
|06/02/2019|0.005774458|
|07/02/2019|0.056242301|
|08/02/2019|0.086208715|
|09/02/2019|0.010676928|
|01/02/2020|0.071957531|
|02/02/2020|0.086542975|
|03/02/2020|0.023767137|
|04/02/2020|0.109725808|
|05/02/2020|0.005774458|
|06/02/2020|0.056242301|
|07/02/2020|0.086208715|
|08/02/2020|0.010676928|
+----------+-----------+
First you can add the timestamp column:
df = df.select(F.to_timestamp("date", "dd/MM/yyyy").alias('ts'), '*')
Then you can join on equal month and day:
cond = [F.dayofmonth(F.col('left.ts')) == F.dayofmonth(F.col('right.ts')),
        F.month(F.col('left.ts')) == F.month(F.col('right.ts'))]
df.select('ts', 'date').alias('left')\
    .join(df.filter(F.year('ts') == 2020).select('ts', 'metric').alias('right'), cond)\
    .orderBy(F.col('left.ts')).drop('ts').show()

Passing column to when function in pyspark [duplicate]

This question already has answers here:
How to add a constant column in a Spark DataFrame?
(3 answers)
Closed 3 years ago.
I have two pyspark dataframes:
1st dataframe: plants
+-----+--------+
|plant|station |
+-----+--------+
|Kech | st1 |
|Casa | st2 |
+-----+--------+
2nd dataframe: stations
+-------+--------+
|program|station |
+-------+--------+
|pr1 | null|
|pr2 | st1 |
+-------+--------+
What I want is to replace the null values in the second dataframe stations with all the values of the station column from the first dataframe, like this:
+-------+--------------+
|program|station |
+-------+--------------+
|pr1 | [st1, st2]|
|pr2 | st1 |
+-------+--------------+
I did this:
stList = plants.select(F.col('station')).rdd.map(lambda x: x[0]).collect()
stations = stations.select(
    F.col('program'),
    F.when(stations.station.isNull(), stList).otherwise(stations.station).alias('station')
)
but it gives me an error: when() doesn't accept a Python list as a parameter.
Thanks for your replies.
I found the solution by converting the column to pandas:
stList = list(plants.select(F.col('station')).toPandas()['station'])
and then use:
F.when(stations.station.isNull(), F.array([F.lit(x) for x in stList])).otherwise(stations['station']).alias('station')
This gives an array directly.
A quick workaround is
F.lit(str(stList))
and this should work.
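For completeness, a minimal sketch (not from the original answer) of how this workaround slots into the asker's select; note that str(stList) yields a single string such as "['st1', 'st2']" rather than an array:
stations = stations.select(
    F.col('program'),
    # str(stList) turns the collected Python list into one string literal
    F.when(stations.station.isNull(), F.lit(str(stList)))
     .otherwise(stations.station).alias('station')
)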
For better type casting, use the code below.
stations = stations.select(
    F.col('program'),
    F.when(stations.station.isNull(), F.array([F.lit(x) for x in stList]))
     .otherwise(F.array(stations.station)).alias('station')
)
Firstly, you can't keep different datatypes in the station column; it needs to be consistent.
+-------+--------------+
|program|station |
+-------+--------------+
|pr1 | [st1, st2]| # this is array
|pr2 | st1 | # this is string
+-------+--------------+
Secondly, this should do the trick:
from pyspark.sql import functions as F
# Create the stList as a string.
stList = ",".join(plants.select(F.col('station')).rdd.map(lambda x: x[0]).collect())
# coalesce the variables and then apply pyspark.sql.functions.split function
stations = (stations.select(
    F.col('program'),
    F.split(F.coalesce(stations.station, F.lit(stList)), ",").alias('station')))
stations.show()
Output:
+-------+----------+
|program| station|
+-------+----------+
| pr1|[st1, st2]|
| pr2| [st1]|
+-------+----------+

PySpark: How to convert column with Ljava.lang.Object

I created a data frame in PySpark by reading data from HDFS like this:
df = spark.read.parquet('path/to/parquet')
I expect the data frame to have two columns of strings:
+------------+------------------+
|my_column |my_other_column |
+------------+------------------+
|my_string_1 |my_other_string_1 |
|my_string_2 |my_other_string_2 |
|my_string_3 |my_other_string_3 |
|my_string_4 |my_other_string_4 |
|my_string_5 |my_other_string_5 |
|my_string_6 |my_other_string_6 |
|my_string_7 |my_other_string_7 |
|my_string_8 |my_other_string_8 |
+------------+------------------+
However, I get my_column column with some strings starting with [Ljava.lang.Object;, looking like this:
>> df.show(truncate=False)
+-----------------------------+------------------+
|my_column |my_other_column |
+-----------------------------+------------------+
|[Ljava.lang.Object;#7abeeeb6 |my_other_string_1 |
|[Ljava.lang.Object;#5c1bbb1c |my_other_string_2 |
|[Ljava.lang.Object;#6be335ee |my_other_string_3 |
|[Ljava.lang.Object;#153bdb33 |my_other_string_4 |
|[Ljava.lang.Object;#1a23b57f |my_other_string_5 |
|[Ljava.lang.Object;#3a101a1a |my_other_string_6 |
|[Ljava.lang.Object;#33846636 |my_other_string_7 |
|[Ljava.lang.Object;#521a0a3d |my_other_string_8 |
+-----------------------------+------------------+
>> df.printSchema()
root
|-- my_column: string (nullable = true)
|-- my_other_column: string (nullable = true)
As you can see, the my_other_column column looks as expected. Is there any way to convert the objects in the my_column column to human-readable strings?
Jaroslav,
I tried the following code and used a sample parquet file from here. I am able to get the desired output from the dataframe. Can you please check your code using the snippet below, and the sample file referred to above, to see if there's any other issue:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read a Parquet file").getOrCreate()
df = spark.read.parquet('E:\\...\\..\\userdata1.parquet')
df.show(10)
df.printSchema()
Replace the path to your HDFS location.
Dataframe output for your reference:

Creating a new dataframe from a pyspark dataframe column efficiently

I wonder what is the most efficient way to extract a column from a PySpark dataframe and turn it into a new dataframe? The following code runs without any problem on small datasets, but runs very slowly and even causes an out-of-memory error. How can I improve the efficiency of this code?
pdf_edges = sdf_grp.rdd.flatMap(lambda x: x).collect()
edgelist = reduce(lambda a, b: a + b, pdf_edges, [])
sdf_edges = spark.createDataFrame(edgelist)
In the PySpark dataframe sdf_grp, the column "pairs" contains information as below:
+-------------------------------------------------------------------+
|pairs |
+-------------------------------------------------------------------+
|[[39169813, 24907492], [39169813, 19650174]] |
|[[10876191, 139604770]] |
|[[6481958, 22689674]] |
|[[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]]|
|[[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]] |
+-------------------------------------------------------------------+
with a schema of
root
|-- pairs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- node1: integer (nullable = false)
| | |-- node2: integer (nullable = false)
I'd like to convert them into a new dataframe sdf_edges that looks like below:
+---------+---------+
| node1| node2|
+---------+---------+
| 39169813| 24907492|
| 39169813| 19650174|
| 10876191|139604770|
| 6481958| 22689674|
| 73450939|114203936|
| 73450939| 21226555|
| 73450939| 24367554|
| 66306616| 32911686|
| 66306616| 19319140|
| 66306616| 48712544|
+---------+---------+
The most efficient way to extract columns is to avoid collect(). When you call collect(), all the data is transferred to the driver and processed there. A better way to achieve what you want is to use the explode() function. Have a look at the example below:
from pyspark.sql import types as T
import pyspark.sql.functions as F
schema = T.StructType([
    T.StructField("pairs", T.ArrayType(
        T.StructType([
            T.StructField("node1", T.IntegerType()),
            T.StructField("node2", T.IntegerType())
        ])
    ))
])
df = spark.createDataFrame(
    [
        ([[39169813, 24907492], [39169813, 19650174]],),
        ([[10876191, 139604770]],),
        ([[6481958, 22689674]],),
        ([[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]],),
        ([[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]],)
    ], schema)
df = df.select(F.explode('pairs').alias('exploded')).select('exploded.node1', 'exploded.node2')
df.show(truncate=False)
Output:
+--------+---------+
| node1 | node2 |
+--------+---------+
|39169813|24907492 |
|39169813|19650174 |
|10876191|139604770|
|6481958 |22689674 |
|73450939|114203936|
|73450939|21226555 |
|73450939|24367554 |
|66306616|32911686 |
|66306616|19319140 |
|66306616|48712544 |
+--------+---------+
Well, I just solved it with the below:
sdf_edges = sdf_grp.select('pairs').rdd.flatMap(lambda x: x[0]).toDF()
