Auto-incrementing PySpark dataframe column values - Python

I am trying to generate an additional column in a dataframe with auto-incrementing values based on a global variable. However, all the rows are generated with the same value and the value is not incrementing.
Here is the code:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def autoIncrement():
    global rec
    if rec == 0:
        rec = 1
    else:
        rec = rec + 1
    return int(rec)

rec = 14

# UDF
autoIncrementUDF = udf(autoIncrement, IntegerType())
df1 = hiveContext.sql("select id,name,location,state,datetime,zipcode from demo.target")
df1.withColumn("id2", autoIncrementUDF()).show()
Here is the result df
+---+------+--------+----------+-------------------+-------+---+
| id| name|location| state| datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00| NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00| NULL| 15|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00| NULL| 15|
| 30|Manish| Gurgoan| Gujarat|2018-03-27 11:00:00| NULL| 15|
+---+------+--------+----------+-------------------+-------+---+
But I am expecting the below result:
+---+------+--------+----------+-------------------+-------+---+
| id| name|location| state| datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00| NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00| NULL| 16|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00| NULL| 17|
| 30|Manish| Gurgoan| Gujarat|2018-03-27 11:00:00| NULL| 18|
+---+------+--------+----------+-------------------+-------+---+
Any help is appreciated.

Global variables are bound to a single Python process. A UDF may be executed in parallel on different workers across the cluster, and it should be deterministic.
You should use the monotonically_increasing_id() function from the pyspark.sql.functions module.
Check the docs for more info.
Be careful, though: the generated ids are unique but not consecutive, and they are not stable across recomputations:
How do I add an persistent column of row ids to Spark DataFrame?
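A minimal sketch of both options against the df1 above (the choice of "datetime" as the ordering column and the +14 offset to start at 15 are assumptions based on the expected output):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Option 1: globally unique, but not consecutive, ids
df1.withColumn("id2", F.monotonically_increasing_id()).show()

# Option 2: consecutive ids starting at 15, driven by an explicit ordering
w = Window.orderBy("datetime")  # assumed ordering column; pulls all rows into one partition
df1.withColumn("id2", F.row_number().over(w) + 14).show()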

Related

Dynamically Expand Arraytype() Columns in Structured Streaming PySpark

I have the following DataFrame:
root
|-- sents: array (nullable = false)
| |-- element: integer (containsNull = true)
|-- metadata: array (nullable = true)
| |-- element: float (containsNull = true)
+----------+---------------------+
|sents |metadata |
+----------+---------------------+
|[1, -1, 0]|[0.4991, 0.5378, 0.0]|
|[-1] |[0.6281] |
|[-1] |[0.463] |
+----------+---------------------+
I want to expand each array item to its own column DYNAMICALLY so that it may look as follows:
+--------+--------+--------+-----------+-----------+-----------+
|sents[0]|sents[1]|sents[2]|metadata[0]|metadata[1]|metadata[2]|
+--------+--------+--------+-----------+-----------+-----------+
| 1| -1| 0| 0.4991| 0.5378| 0.0|
| -1| null| null| 0.6281| null| null|
| -1| null| null| 0.463| null| null|
+--------+--------+--------+-----------+-----------+-----------+
but in Structured Streaming there are many limitations on doing things dynamically.
I tried the following:
numcol = df.withColumn('phrasesNum', F.size('sents')).agg(F.max('phrasesNum')).head()
df = df.select(*[F.col('sents')[i] for i in range(numcol[0])],*[F.col('metadata')[i] for i in range(numcol[0])])
Also:
df_sizes = df.select(F.size('sents').alias('sents'))
df_max = df_sizes.agg(F.max('sents'))
nb_columns = df_max.collect()[0][0]
d = c.select(*[F.map_values(c['metadata'][i]).getItem(0).alias('confidenceIntervals'+"{}".format(j)).cast(DoubleType()) for i,j in enumerate(range(F.size('sents')))],
*[c['sents'][i].alias('phraseSents'+"{}".format(j)).cast(IntegerType()) for i,j in enumerate(range(nb_columns))])
but I cannot use things like .head(), .collect(), or .take() in Structured Streaming to create the numeric variable that indicates the number of columns to create dynamically. Any ideas?
thanks to all
The only way you can do this without collecting to the driver node (first, take, collect, etc.) is if you know the columns you need or the max size of each array column. Here I assumed both array columns have a max size of 3, so the required indices are 0, 1 and 2.
Also, in streaming you can't have a different schema (columns) between dataframes.
cols = ['0', '1', '2']
from pyspark.sql import functions as F

df.withColumn("struct1", F.struct(*[F.struct((F.col("sents")[int(x)]).alias('sents[{}]'.format(x))) for x in cols]))\
  .withColumn("struct2", F.struct(*[F.struct((F.col("metadata")[int(x)]).alias('metadata[{}]'.format(x))) for x in cols]))\
  .select(*["struct1.{}.*".format(x) for x in ['col{}'.format(int(x) + 1) for x in cols]],
          *["struct2.{}.*".format(x) for x in ['col{}'.format(int(x) + 1) for x in cols]]).show()
#+--------+--------+--------+-----------+-----------+-----------+
#|sents[0]|sents[1]|sents[2]|metadata[0]|metadata[1]|metadata[2]|
#+--------+--------+--------+-----------+-----------+-----------+
#| 1| -1| 0| 0.4991| 0.5378| 0.0|
#| -1| null| null| 0.6281| null| null|
#| -1| null| null| 0.463| null| null|
#+--------+--------+--------+-----------+-----------+-----------+
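Under the same known-max-size assumption, a simpler sketch selects the array elements directly; out-of-range indices come back as null, which matches the expected output:
from pyspark.sql import functions as F

max_size = 3  # assumed maximum array length
df.select(
    *[F.col('sents')[i].alias('sents[{}]'.format(i)) for i in range(max_size)],
    *[F.col('metadata')[i].alias('metadata[{}]'.format(i)) for i in range(max_size)]
).show()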

How to filter out the rows that do not start with a digit (CSV, PySpark). Edited: contain only a number

CSV File
In the df there is a column where some rows do not start with a digit. I want to delete those rows. I tried the code below, but it does not work.
import re
df = sqlContext.read.csv("/FileStore/tables/mtmedical_V6-16623.csv", header='true', inferSchema="true")
df.show()
import pyspark.sql.functions as f
w=df.filter(df['_c0'].isdigit()) #error1
w=df.filter(df['_c0'].startswith(('1','2','3','4','5','6','7','8','9'))) #error2
w.show()
errors:
'Column' object is not callable #no1
py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist #no2
Here is the table. You can see that the row below row 7 does not start with a digit in the column '_c0'. How can I delete such rows?
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
| _c0| description| medical_specialty| age| gender|sample_name (What has been done to patient = Treatment)| transcription| keywords|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
| 1| A 23-year-old wh...| Allergy / Immuno...| 23| female| Allergic Rhinitis |SUBJECTIVE:, Thi...|allergy / immunol...|
| 2| Consult for lapa...| Bariatrics| null| male| Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...|
| 3| Consult for lapa...| Bariatrics| 42| male| Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...|
| 4| 2-D M-Mode. Dopp...| Cardiovascular /...| null| null| 2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...|
| 5| 2-D Echocardiogram| Cardiovascular /...| null| male| 2-D Echocardiogr...|1. The left vent...|cardiovascular / ...|
| 6| Morbid obesity. ...| Bariatrics| 30| male| Laparoscopic Gas...|PREOPERATIVE DIAG...|bariatrics, gastr...|
| 7| Liposuction of t...| null| null| null| null| null| null|
|", Bariatrics,31,...| 1. Deformity| right breast rec...|2. Excess soft t...| anterior abdomen...| 3. Lipodystrophy...|POSTOPERATIVE DIA...| 1. Deformity|
| 8| 2-D Echocardiogram| Cardiovascular /...| null| male| 2-D Echocardiogr...|2-D ECHOCARDIOGRA...|cardiovascular / ...|
df.filter((f.col('_c0')).isin([x for x in range(1,df.count()+1)]))
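This keeps only the rows whose _c0 value is one of the integers 1..df.count(). If the intent (per the edited title) is just to keep rows where _c0 is entirely numeric, a regex-based sketch avoids the extra count():
w = df.filter(f.col('_c0').rlike('^[0-9]+$'))
w.show()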

Create a new column condition-wisely

I am trying to figure out how to translate my Pandas-utilising function to PySpark.
I have a Pandas DataFrame like this:
+---+----+
|num| val|
+---+----+
| 1| 0.0|
| 2| 0.0|
| 3|48.6|
| 4|49.0|
| 5|48.7|
| 6|49.1|
| 7|74.5|
| 8|48.7|
| 9| 0.0|
| 10|49.0|
| 11| 0.0|
| 12| 0.0|
+---+----+
The code in the snippet below is fairly simple: it goes forward until it finds a non-zero value, and if there is none, it goes backwards for the same purpose.
def next_non_zero(data, i, column):
    for j in range(i + 1, len(data[column])):
        res = data[column].iloc[j]
        if res != 0:
            return res
    for j in range(i - 1, 0, -1):
        res = data[column].iloc[j]
        if res != 0:
            return res

def fix_zero(data, column):
    for i, row in data.iterrows():
        if row[column] == 0:
            data.at[i, column] = next_non_zero(data, i, column)
So as a result I expect to see
+---+----+
|num| val|
+---+----+
| 1|48.6|
| 2|48.6|
| 3|48.6|
| 4|49.0|
| 5|48.7|
| 6|49.1|
| 7|74.5|
| 8|48.7|
| 9|49.0|
| 10|49.0|
| 11|49.0|
| 12|49.0|
+---+----+
So I do understand that in PySpark I have to create a new column with the desired result and replace an existing column using withColumn() for example. However, I do not understand how to properly iterate through a DataFrame.
I am trying to use functions over Window:
my_window = Window.partitionBy().orderBy('num')
df = df.withColumn('new_val', F.when(df.val == 0, F.lead(df.val).over(my_window))
                               .otherwise(F.lag(df.val).over(my_window)))
Obviously, it does not provide me with the desired result as it iterates only once.
So I tried to write some udf recursion like
def fix_zero(param):
return F.when(F.lead(param).over(my_window)!=0,F.lead(param).over(my_window)).
otherwise(fix_zero(F.lead(param).over(my_window)))
spark_udf = udf(fix_zero, DoubleType())
df = df.withColumn('new_val', F.when(df.val!=0, df.val).otherwise(fix_zero('val')))
I got
RecursionError: maximum recursion depth exceeded in comparison
I suspect that this is because I pass into the recursion not a row but the result of lead().
Anyway, I am totally stuck on this hurdle at the moment and would deeply appreciate any advice.
There is a way with Window to go through all preceding rows (or all following rows) until you reach a non-null value.
So my first step was to replace all 0 values with null.
Recreating your dataframe:
values = [
(1, 0.0),
(2,0.0),
(3,48.6),
(4,49.0),
(5,48.7),
(6,49.1),
(7, 74.5),
(8,48.7),
(9,0.0),
(10,49.0),
(11,0.0),
(12,0.0)
]
df = spark.createDataFrame(values, ['num','val'])
Replacing 0s with null
from pyspark.sql.functions import when, lit, col
df= df.withColumn('val_null', when(col('val') != 0.0,col('val')))
Then define the windows which, combined with last/first and ignorenulls, will allow us to get the last non-null value before the row and the first non-null value after the row:
from pyspark.sql import Window
from pyspark.sql.functions import last,first,coalesce
windowForward = Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)
ffilled_column = last(df['val_null'], ignorenulls=True).over(windowForward)
windowBackward = Window.rowsBetween(Window.currentRow,Window.unboundedFollowing)
bfilled_column = first(df['val_null'], ignorenulls=True).over(windowBackward)
# creating new columns in df
df = df.withColumn('ffill', ffilled_column).withColumn('bfill', bfilled_column)
# replace null with bfill if bfill is not null, otherwise fill with ffill
df = df.withColumn('val_full', coalesce('bfill', 'ffill'))
Using this technique we arrive at your expected output in column 'val_full'
+---+----+--------+-----+-----+--------+
|num| val|val_null|ffill|bfill|val_full|
+---+----+--------+-----+-----+--------+
| 1| 0.0| null| null| 48.6| 48.6|
| 2| 0.0| null| null| 48.6| 48.6|
| 3|48.6| 48.6| 48.6| 48.6| 48.6|
| 4|49.0| 49.0| 49.0| 49.0| 49.0|
| 5|48.7| 48.7| 48.7| 48.7| 48.7|
| 6|49.1| 49.1| 49.1| 49.1| 49.1|
| 7|74.5| 74.5| 74.5| 74.5| 74.5|
| 8|48.7| 48.7| 48.7| 48.7| 48.7|
| 9| 0.0| null| 48.7| 49.0| 49.0|
| 10|49.0| 49.0| 49.0| 49.0| 49.0|
| 11| 0.0| null| 49.0| null| 49.0|
| 12| 0.0| null| 49.0| null| 49.0|
+---+----+--------+-----+-----+--------+
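To finish with the original two-column shape, a small follow-up sketch keeps num and renames val_full back to val:
df.select('num', col('val_full').alias('val')).show()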

How to concatenate to a null column in pyspark dataframe

I have the below dataframe and I want to update the rows dynamically with some values.
input_frame.show()
+----------+----------+---------+
|student_id|name |timestamp|
+----------+----------+---------+
| s1|testuser | t1|
| s1|sampleuser| t2|
| s2|test123 | t1|
| s2|sample123 | t2|
+----------+----------+---------+
input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'),sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
I want to update the 'test' column with some values and apply a filter with partial matches on the column. But concatenating to the null column results in a null column again. How can we do this?
Use concat_ws, like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")
# This won't work: concat returns null if any input is null
df = df.withColumn("concat", concat(df.a, df.b))
# This won't work either
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))
# Do it like this
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()
gives:
+----+----+------+-------------+---------+
| a| b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
| 1| 2| 12| 12| 12|
| 2|null| null| null| 2|
| 3| 4| 34| 34| 34|
| 4| 5| 45| 45| 45|
|null| 6| null| null| 6|
+----+----+------+-------------+---------+
Note specifically that casting a NULL column to string doesn't work as you might wish: concat still returns NULL for the whole value if any of its inputs is null.
There's no nice way of dealing with more complicated scenarios, but note that you can use a when statement inside a concat if you're willing to suffer the verbosity of it, like this:
df.withColumn("concat_custom", concat(
when(df.a.isNull(), lit('_')).otherwise(df.a),
when(df.b.isNull(), lit('_')).otherwise(df.b))
)
To get, eg:
+----+----+-------------+
| a| b|concat_custom|
+----+----+-------------+
| 1| 2| 12|
| 2|null| 2_|
| 3| 4| 34|
| 4| 5| 45|
|null| 6| _6|
+----+----+-------------+
You can use the coalesce function, which returns the first of its arguments that is not null, and provide a literal as the second argument, which will be used in case the column has a null value.
df = df.withColumn("concat", concat(coalesce(df.a, lit('')), coalesce(df.b, lit(''))))
You can fill null values with empty strings:
import pyspark.sql.functions as f
from pyspark.sql.types import *
data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
data = data.withColumn('test', f.lit(None).cast(StringType()))
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))
Is that what you were looking for?

Pyspark: filter function error with .isNotNull() and other 2 other conditions

I'm trying to filter my dataframe in PySpark and I want to write my results to a parquet file, but I get an error every time because something is wrong with my isNotNull() condition. I have 3 conditions in the filter function, and if one of them is true the resulting row should be written to the parquet file.
I tried different versions with OR and | and different versions with isNotNull(), but nothing helped me.
This is one example I tried:
from pyspark.sql.functions import col
df.filter(
(df['col1'] == 'attribute1') |
(df['col1'] == 'attribute2') |
(df.where(col("col2").isNotNull()))
).write.save("new_parquet.parquet")
This is the other example I tried, but in that example it ignores the rows with attribute1 or attribute2:
df.filter(
(df['col1'] == 'attribute1') |
(df['col1'] == 'attribute2') |
(df['col2'].isNotNull())
).write.save("new_parquet.parquet")
This is the error message:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I hope you can help me, I'm new to the topic. Thank you so much!
First off, about the col1 filter, you could do it using isin like this:
df['col1'].isin(['attribute1', 'attribute2'])
And then:
df.filter((df['col1'].isin(['atribute1', 'atribute2']))|(df['col2'].isNotNull()))
AFAIK, dataframe.column.isNotNull() should work, but I don't have sample data to test it, sorry.
See the example below:
from pyspark.sql import functions as F
df = spark.createDataFrame([(3,'a'),(5,None),(9,'a'),(1,'b'),(7,None),(3,None)], ["id", "value"])
df.show()
The original DataFrame
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 5| null|
| 9| a|
| 1| b|
| 7| null|
| 3| null|
+---+-----+
Now we do the filter:
df = df.filter( (df['id']==3)|(df['id']=='9')|(~F.isnull('value')))
df.show()
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 9| a|
| 1| b|
| 3| null|
+---+-----+
So you see:
row(3, 'a') and row(3, null) are selected because of df['id']==3
row(9, 'a') is selected because of df['id']=='9'
row(1, 'b') is selected because of ~F.isnull('value'), but row(5, null) and row(7, null) are not selected.
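Putting it together for the original question, a hedged sketch with the OP's column names and output path would be:
df.filter(
    df['col1'].isin(['attribute1', 'attribute2']) | df['col2'].isNotNull()
).write.save("new_parquet.parquet")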
