I am trying to update some rows of a dataframe; below is my code.
dfs_ids1 = dfs_ids1.withColumn("arrival_dt", F.when(F.col("arrival_dt")=='1960-01-01', lit(None)) )
Basically, I want to update all the rows where arrival_dt is 1960-01-01 with null and leave the rest of the rows unchanged.
You need to understand the filter and when functions.
If you only want to fetch the matching rows, without caring about the others, try this:
from pyspark.sql.functions import *
dfs_ids1 = dfs_ids1.filter(col("arrival_dt") == '1960-01-01')
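An equivalent form, if you prefer passing a SQL expression string, is:
dfs_ids1 = dfs_ids1.filter("arrival_dt = '1960-01-01'")   # same filter, written as a SQL expression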
If you want to keep the matching rows' value and replace the remaining rows with a custom value (here null) or another column, combine when with otherwise:
dfs_ids1=dfs_ids1.withColumn("arrival_dt",when(col("arrival_dt")=="1960-01-01",col("arrival_dt")).otherwise(lit(None)))
# Or (when without an otherwise clause leaves non-matching rows as null by default)
dfs_ids1=dfs_ids1.withColumn("arrival_dt",when(col("arrival_dt")=="1960-01-01",col("arrival_dt")))
# Sample example
# Input df
+------+-------+-----+
| name| city|state|
+------+-------+-----+
| manoj|gwalior| mp|
| kumar| delhi|delhi|
|dhakad|chennai| tn|
+------+-------+-----+
from pyspark.sql.functions import *
opOneDf=df.withColumn("name",when(col("city")=="delhi",col("city")).otherwise(lit(None)))
opOneDf.show()
# Sample output
+-----+-------+-----+
| name| city|state|
+-----+-------+-----+
| null|gwalior| mp|
|delhi| delhi|delhi|
| null|chennai| tn|
+-----+-------+-----+
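Applied back to the original question (set arrival_dt to null where it is 1960-01-01 and keep every other row unchanged), a minimal sketch would swap the two branches:
from pyspark.sql import functions as F

dfs_ids1 = dfs_ids1.withColumn(
    "arrival_dt",
    F.when(F.col("arrival_dt") == '1960-01-01', F.lit(None))
     .otherwise(F.col("arrival_dt"))   # keep the original value for all other rows
)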
I have two dataframes that are essentially the same, but coming from two different sources. In my first dataframe the p_user_id field is LongType, the date_of_birth field is DateType, and the rest of the fields are StringType. In my second dataframe everything is StringType. I first check the row count for both dataframes based on p_user_id (that is my unique identifier).
DF1:
+--------------+
|test1_racounts|
+--------------+
| 418895|
+--------------+
DF2:
+---------+
|d_tst_rac|
+---------+
| 418915|
+---------+
Then, if there is a difference in the row count, I run a check on which p_user_id values are in one dataframe and not the other.
p_user_tst_rac.subtract(rac_p_user_df).show(100, truncate=0)
Gives me this result:
+---------+
|p_user_id|
+---------+
|661520 |
|661513 |
|661505 |
|661461 |
|661501 |
|661476 |
|661478 |
|661468 |
|661479 |
|661464 |
|661467 |
|661474 |
|661484 |
|661495 |
|661499 |
|661486 |
|661502 |
|661506 |
|661517 |
+---------+
My issue comes into play when I try to pull the rest of the corresponding fields for the difference. I want to pull the rest of the fields so that I can do a manual search in the DB and the application to see if something was overlooked. But when I add the rest of the columns, the result grows to far more than the 20-row difference. What is a better way to run the match and get the corresponding data?
Full code scope:
#racs in mysql
my_rac = spark.read.parquet("/Users/mysql.parquet")
my_rac.printSchema()
my_rac.createOrReplaceTempView('my_rac')
d_rac = spark.sql('''select distinct * from my_rac''')
d_rac.createOrReplaceTempView('d_rac')
spark.sql('''select count(*) as test1_racounts from d_rac''').show()
rac_p_user_df = spark.sql('''select
cast(p_user_id as string) as p_user_id
, record_id
, contact_last_name
, contact_first_name
from d_rac''')
#mssql_rac
sql_rac = spark.read.csv("/Users/mzn293/Downloads/kavi-20211116.csv")
#sql_rac.printSchema()
sql_rac.createOrReplaceTempView('sql_rac')
d_sql_rac = spark.sql('''select distinct
_c0 as p_user_id
, _c1 as record_id
, _c4 as contact_last_name
, _c5 as contact_first_name
from sql_rac''')
d_sql_rac.createOrReplaceTempView('d_sql_rac')
spark.sql('''select count(*) as d_tst_rac from d_sql_rac''').show()
dist_sql_rac = spark.sql('''select * from d_sql_rac''')
dist_sql_rac.subtract(rac_p_user_df).show(100, truncate=0)
With this I get far more than a 20-row difference. I also feel there is a better way to get my result, but I'm not sure what I'm missing to get the data for just those 20 rows instead of 100-plus rows.
The easiest way in this case is to use an anti join.
df_diff = df1.join(df2, df1.p_user_id == df2.p_user_id, "leftanti")
This will give you all records that exist in df1 but have no matching record in df2.
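Applied to the dataframes in the question (the names below are taken from the code above and may need adjusting), a sketch could look like this:
# CSV-sourced rows whose p_user_id has no match in the parquet-sourced frame.
# Joining on the id alone avoids extra rows caused by differences in the other
# columns (casing, whitespace, type formatting).
missing_rows = d_sql_rac.join(rac_p_user_df, on="p_user_id", how="left_anti")
missing_rows.show(100, truncate=False)
Note that subtract() compares whole rows, so any mismatch in record_id or the name columns also counts as a difference; the anti join compares only the key and still returns every column of d_sql_rac.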
For each set of coordinates in a pyspark dataframe, I need to find the closest set of coordinates in another dataframe
I have one pyspark dataframe with coordinate data like so (dataframe a):
+------------------+-------------------+
| latitude_deg| longitude_deg|
+------------------+-------------------+
| 40.07080078125| -74.93360137939453|
| 38.704022| -101.473911|
| 59.94919968| -151.695999146|
| 34.86479949951172| -86.77030181884766|
| 35.6087| -91.254898|
| 34.9428028| -97.8180194|
And another like so (dataframe b; only a few rows are shown for clarity):
+-----+------------------+-------------------+
|ident| latitude_deg| longitude_deg|
+-----+------------------+-------------------+
| 00A| 30.07080078125| -24.93360137939453|
| 00AA| 56.704022| -120.473911|
| 00AK| 18.94919968| -109.695999146|
| 00AL| 76.86479949951172| -67.77030181884766|
| 00AR| 10.6087| -87.254898|
| 00AS| 23.9428028| -10.8180194|
Is it possible to somehow merge the dataframes so that, for each row in dataframe a, the result has the closest ident from dataframe b:
+------------------+-------------------+-------------+
| latitude_deg| longitude_deg|closest_ident|
+------------------+-------------------+-------------+
| 40.07080078125| -74.93360137939453| 12A|
| 38.704022| -101.473911| 14BC|
| 59.94919968| -151.695999146| 278A|
| 34.86479949951172| -86.77030181884766| 56GH|
| 35.6087| -91.254898| 09HJ|
| 34.9428028| -97.8180194| 09BV|
What I have tried so far:
I have defined a pyspark UDF to calculate the haversine distance between two pairs of coordinates.
udf_get_distance = F.udf(get_distance)
It works like this:
df = (df.withColumn("ABS_DISTANCE", udf_get_distance(
    df.latitude_deg_a, df.longitude_deg_a,
    df.latitude_deg_b, df.longitude_deg_b)
))
I'd appreciate any kind of help. Thanks so much
You need to do a crossJoin first, something like this:
joined_df=source_df1.crossJoin(source_df2)
Then you can call your UDF as you mentioned, generate a row number based on the distance, and keep only the closest match:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

# rank the candidates for each coordinate pair of dataframe a by distance
rwindow = Window.partitionBy("latitude_deg_a", "longitude_deg_a").orderBy("ABS_DISTANCE")
udf_result_df = (joined_df
    .withColumn("ABS_DISTANCE", udf_get_distance(
        joined_df.latitude_deg_a, joined_df.longitude_deg_a,
        joined_df.latitude_deg_b, joined_df.longitude_deg_b))
    .withColumn("rownum", row_number().over(rwindow))
    .filter("rownum = 1"))
Note: add a return type to your UDF (e.g. DoubleType()); otherwise the distance comes back as a string and the row_number ordering may be lexicographic rather than numeric.
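For reference, a minimal sketch of such a UDF with an explicit return type (the question's get_distance may be implemented differently) could be:
from math import radians, sin, cos, asin, sqrt
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def get_distance(lat_a, lon_a, lat_b, lon_b):
    # haversine distance in kilometres
    lat_a, lon_a, lat_b, lon_b = map(radians, (float(lat_a), float(lon_a), float(lat_b), float(lon_b)))
    dlat, dlon = lat_b - lat_a, lon_b - lon_a
    a = sin(dlat / 2) ** 2 + cos(lat_a) * cos(lat_b) * sin(dlon / 2) ** 2
    return 6371.0 * 2 * asin(sqrt(a))

udf_get_distance = F.udf(get_distance, DoubleType())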
How can I stream into a table the following:
the difference between Column_A and Column_B, aggregated by Column_C and Column_D.
+--------+--------+--------+--------+
|Column_A|Column_B|Column_C|Column_D|
+--------+--------+--------+--------+
|      52|      67|     boy|     car|
|      44|      25|    girl|    bike|
|      98|      85|     boy|     car|
|      52|      41|    girl|     car|
+--------+--------+--------+--------+
This is my attempt, but it is not working:
difference = streamingDataF.withColumn("Difference", expr("Column_A - Column_B")).drop("Column_A").drop("Column_B").groupBy("Column_C")
differenceStream = difference.writeStream\
.queryName("diff_aggr")\
.format("memory").outputMode("append")\
.start()
I am getting this error: 'GroupedData' object has no attribute 'writeStream'
Depending on how you want to aggregate the grouped data, you can do e.g. the following.
Prerequisites (in case you didn't set them already):
from pyspark.sql import functions as F
from pyspark.sql.functions import *
For sum:
difference = streamingDataF.withColumn("Difference", expr("Column_A - Column_B")).drop("Column_A").drop("Column_B").groupBy("Column_C").agg(F.sum(F.col("Difference")).alias("Difference"))
For max:
difference = streamingDataF.withColumn("Difference", expr("Column_A - Column_B")).drop("Column_A").drop("Column_B").groupBy("Column_C").agg(F.max(F.col("Difference")).alias("Difference"))
And then:
differenceStream = difference.writeStream\
    .queryName("diff_aggr")\
    .format("memory")\
    .outputMode("complete")\
    .start()
Note that with a streaming aggregation and no watermark you have to use the complete (or update) output mode rather than append. The point is: if you do a groupBy you also need to reduce by aggregating. If you only wanted to sort your values instead, try df.sort(...).
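Since the question asks to aggregate by both Column_C and Column_D, a minimal sketch of that variant (same idea, just a second grouping column) might be:
from pyspark.sql import functions as F
from pyspark.sql.functions import expr

difference = (streamingDataF
    .withColumn("Difference", expr("Column_A - Column_B"))
    .groupBy("Column_C", "Column_D")
    .agg(F.sum("Difference").alias("Difference")))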
How do I expand a dataframe based on column values? I intend to go from this dataframe:
+---------+----------+----------+
|DEVICE_ID| MIN_DATE| MAX_DATE|
+---------+----------+----------+
| 1|2019-08-29|2019-08-31|
| 2|2019-08-27|2019-09-02|
+---------+----------+----------+
To one that looks like this:
+---------+----------+
|DEVICE_ID| DATE|
+---------+----------+
| 1|2019-08-29|
| 1|2019-08-30|
| 1|2019-08-31|
| 2|2019-08-27|
| 2|2019-08-28|
| 2|2019-08-29|
| 2|2019-08-30|
| 2|2019-08-31|
| 2|2019-09-01|
| 2|2019-09-02|
+---------+----------+
Any help would be much appreciated.
from datetime import timedelta
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
# Create a sample data row.
df = sqlContext.sql("""
select 'dev1' as device_id,
to_date('2020-01-06') as start,
to_date('2020-01-09') as end""")
# Define a UDF to return a comma-separated list of dates between start and end
@udf
def datelist(start, end):
    return ",".join([str(start + timedelta(days=x)) for x in range(0, 1 + (end - start).days)])
# explode the list of dates into rows
df.select("device_id",
          F.explode(
              F.split(datelist(df["start"], df["end"]), ","))
          .alias("date")).show(10, False)
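As an alternative, if you are on Spark 2.4 or later, the built-in sequence function can build the date range without a UDF; a sketch using the column names from the question (not the sample above) would be:
from pyspark.sql import functions as F

# one row per date between MIN_DATE and MAX_DATE, inclusive
expanded = (df
    .withColumn("DATE", F.explode(
        F.sequence(F.col("MIN_DATE"), F.col("MAX_DATE"), F.expr("interval 1 day"))))
    .select("DEVICE_ID", "DATE"))
expanded.show()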
I'm trying to concatenate two data frames and write the resulting data frame to an Excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what I'm doing wrong. I thought providing the "index = False" argument at every Excel call would eliminate the issue, but it has not.
(screenshot of the exported Excel output)
Hopefully you can see the image, if not please let me know.
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
As the other users also wrote, I don't see the index in your image either. If the index were included, you would have an output like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
If you pass the index=False option, the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
which looks like your case. Your problem could instead be related to the concatenation and the transposed matrix.
Did you check your temporary dataframe here before exporting it?
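For example, a quick way to inspect the intermediate frames and drop a stray index before exporting (a sketch reusing the question's variable names) might be:
import pandas as pd

print(test.head())           # inspect the transposed frame that gets concatenated
print(df.columns.tolist())   # check which columns the combined frame ends up with

# reset_index(drop=True) discards any leftover index labels before concatenating,
# so nothing index-like is written out even with index=False
df = pd.concat([df.reset_index(drop=True), test.reset_index(drop=True)], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)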
You might want to check whether pandas imports the time column as a time index. If you want to delete those time columns you could use df.drop and pass the unwanted columns into it, as sketched below. Does this maybe solve your problem?
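A minimal sketch of that drop (assuming the unwanted time columns are the first three) would be:
# drop the first three columns by label; columns= targets the column axis
df = df.drop(columns=df.columns[:3])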