I have a dataframe which has an ID column and a related Array column which contains the IDs of its related records.
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam | [789,999]
789 | marc | [111]
555 | dan | [333]
From the above, I need to build a relationship between all the related child IDs together to its parent ID. The resultant DF should be like
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456,789,999,111]
345 | alen | [789,111]
456 | sam | [789,999,111]
789 | marc | [111]
555 | dan | [333]
I need help figuring out the above.
You can tackle this problem with self joins and window functions.
I have divided the code into 5 steps. The algorithm is as follows:
Explode the array column so each related ID becomes its own row (no more arrays in the data)
Self join the ID column against the exploded related column (RELATED_IDLIST renamed to related)
Collect the related values sharing the same a_id into one array and those sharing the same b_id into another array
Merge the two arrays into one combined array and rank the records within each ID by the size of that combined array, largest first
Pick the records having rank 1
you can try the following code:
# importing the necessary functions for later use
from pyspark.sql.functions import explode, col, collect_set, array_union, size
from pyspark.sql.functions import dense_rank, desc
from pyspark.sql.window import Window
# cross joins need to be enabled explicitly if the Spark version is < 3
spark.conf.set("spark.sql.crossJoin.enabled", True)
############### STEP 0 #####################################
# creating the above mentioned dataframe
id_cols = [123,345,456,789,555]
name_cols = ['mike','alen','sam','marc','dan']
related_idlist_cols = [[345,456],[789],[789,999],[111],[333]]
list_of_rows = [(each_0,each_1,each_2) for each_0, each_1, each_2 in zip(id_cols,name_cols,related_idlist_cols)]
cols_name = ['ID','NAME','RELATED_IDLIST']
# this will result in above mentioned dataframe
df = spark.createDataFrame(list_of_rows,cols_name)
############### STEP 1: Explode values #####################################
# the explode function converts an array column into atomic records:
# one record whose array has two elements results in two records
#                                     +--> 123, mike, 345
# 123, mike, explode([345, 456]) -->
#                                     +--> 123, mike, 456
df_1 = df.select(col('id'), col('name'), explode(df.RELATED_IDLIST).alias('related'))
############### STEP 2 : Self Join with Data #####################################
# creating dataframes with different column names, for joining them later
a = df_1.withColumnRenamed('id','a_id').withColumnRenamed('name','a_name').withColumnRenamed('related','a_related')
b = df_1.withColumnRenamed('id','b_id').withColumnRenamed('name','b_name').withColumnRenamed('related','b_related')
# self join: left join a to b on a.a_related == b.b_id
df_2 = a.join(b, a.a_related == b.b_id, how='left').orderBy(a.a_id)
############### STEP 3 : create Array Lists #####################################
# using collect_set we can reduce values of a particular kind into one set (here we collect the 'related' values per id)
df_3 = df_2.select('a_id', 'a_name',
                   collect_set('a_related').over(Window.partitionBy(df_2.a_id)).alias('a_related_ids'),
                   collect_set('b_related').over(Window.partitionBy(df_2.b_id)).alias('b_related_ids'))
# merging the two sets into one column and also calculating the resultant array size
df_4 = df_3.select('a_id', 'a_name', array_union('a_related_ids', 'b_related_ids').alias('combined_ids')).withColumn('size', size('combined_ids'))
# ranking the records to pick the ideal records
df_5 = df_4.select('a_id','a_name','combined_ids',dense_rank().over(Window.partitionBy('a_id').orderBy(desc('size'))).alias('rank'))
############### STEP 4 : Selecting Ideal Records #####################################
# picking records of rank 1; this will still contain duplicates, so remove them with distinct and order by id
df_6 = df_5.select('a_id','a_name','combined_ids').filter(df_5.rank == 1).distinct().orderBy('a_id')
############### STEP 5 #####################################
display(df_6)
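Note that display is a notebook helper (e.g. in Databricks); in a plain PySpark session the equivalent final check would be:
# show the final result outside a notebook environment
df_6.show(truncate=False)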
I have two dataframes that are essentially the same, but come from two different sources. In my first dataframe the p_user_id field is a LongType, date_of_birth is a DateType, and the rest of the fields are StringType. In my second dataframe everything is StringType. I first check the row count for both dataframes based on p_user_id (that is my unique identifier).
DF1:
+--------------+
|test1_racounts|
+--------------+
| 418895|
+--------------+
DF2:
+---------+
|d_tst_rac|
+---------+
| 418915|
+---------+
Then if there is a difference in the row count I run a check on which p_user_id values are in one dataframe and not the other.
p_user_tst_rac.subtract(rac_p_user_df).show(100, truncate=0)
Gives me this result:
+---------+
|p_user_id|
+---------+
|661520 |
|661513 |
|661505 |
|661461 |
|661501 |
|661476 |
|661478 |
|661468 |
|661479 |
|661464 |
|661467 |
|661474 |
|661484 |
|661495 |
|661499 |
|661486 |
|661502 |
|661506 |
|661517 |
+---------+
My issue comes into play when I try to pull the rest of the corresponding fields for the difference. I want the remaining fields so that I can do a manual search in the DB and the application to see if something was overlooked. When I add the rest of the columns, my result grows to far more than the 20-row difference. What is a better way to run the match and get the corresponding data?
Full code scope:
#racs in mysql
my_rac = spark.read.parquet("/Users/mysql.parquet")
my_rac.printSchema()
my_rac.createOrReplaceTempView('my_rac')
d_rac = spark.sql('''select distinct * from my_rac''')
d_rac.createOrReplaceTempView('d_rac')
spark.sql('''select count(*) as test1_racounts_ from d_rac''').show()
rac_p_user_df = spark.sql('''select
cast(p_user_id as string) as p_user_id
, record_id
, contact_last_name
, contact_first_name
from d_rac''')
#mssql_rac
sql_rac = spark.read.csv("/Users/mzn293/Downloads/kavi-20211116.csv")
#sql_rac.printSchema()
sql_rac.createOrReplaceTempView('sql_rac')
d_sql_rac = spark.sql('''select distinct
_c0 as p_user_id
, _c1 as record_id
, _c4 as contact_last_name
, _c5 as contact_first_name
from sql_rac''')
d_sql_rac.createOrReplaceTempView('d_sql_rac')
spark.sql('''select count(*) as d_aws_rac from d_sql_rac''').show()
dist_sql_rac = spark.sql('''select * from d_sql_rac''')
dist_sql_rac.subtract(rac_p_user_df).show(100, truncate=0)
With this I get far more than a 20-row difference. Furthermore, I feel there is a better way to get my result, but I'm not sure what I'm missing to get the data for just those 20 rows instead of 100-plus rows.
The easiest way in this case is to use an anti join.
df_diff = df1.join(df2, df1.p_user_id == df2.p_user_id, "leftanti")
This will give you all rows that exist in df1 but have no matching record in df2.
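Applied to the dataframes in the question (a sketch using the names rac_p_user_df and dist_sql_rac from the posted code), the anti join keeps every column of the left dataframe for the unmatched p_user_id values, so you get the corresponding fields without the row inflation that subtracting on all columns can cause. Running it in both directions shows which side is missing which records:
# records in the CSV/MSSQL extract with no matching p_user_id on the MySQL side
missing_from_mysql = dist_sql_rac.join(rac_p_user_df, on="p_user_id", how="leftanti")
missing_from_mysql.show(100, truncate=False)

# the reverse direction: records only present on the MySQL side
missing_from_csv = rac_p_user_df.join(dist_sql_rac, on="p_user_id", how="leftanti")
missing_from_csv.show(100, truncate=False)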
I'm trying to run a transformation function in a pyspark script:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "test_csv", transformation_ctx = "datasource0")
...
dataframe = datasource0.toDF()
...
def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
#to_long(df, ["A"])
....
df = to_long(dataframe, ["Name","Type"])
My dataset looks like this:
Name |01/01(FRI)|01/02(SAT)|
ALZA CZ| 0 | 0
CLPA CZ| 1 | 5
My desired output is something like this:
Name |Type | Date. |Value |
ALZA CZ|New | 01/01(FRI) | 0
CLPA CZ|New | 01/01(FRI) | 1
ALZA CZ|Old | 01/02(SAT) | 1
CLPA CZ|Old | 01/02(SAT) | 5
However, the last code line gives me an error similar to this:
AnalysisException: Cannot resolve 'Name' given input columns 'col10'
When I check:
df.show()
I see 'col1', 'col2', etc. as the column names instead of the actual labels (["Name", "Type"]). Should I separately remove and then re-add the original column titles?
It seems like your metadata table was configured using the built-in CSV classifier. If this classifier isn't able to detect a header, it calls the columns col1, col2, etc.
Your problem lies one stage before the ETL job, so in my opinion you shouldn't remove and re-add the original column titles; instead, fix the data import / schema detection by using a custom classifier.
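If fixing the classifier isn't immediately an option, a possible stopgap (just a sketch: the expected_names list below is an assumption you would adjust to your actual CSV layout, and any stray header row would still need to be filtered out) is to rename the auto-generated columns positionally on the Spark side:
# hypothetical column names, in the order they appear in the CSV
expected_names = ["Name", "Type", "01/01(FRI)", "01/02(SAT)"]

dataframe = datasource0.toDF()
# DataFrame.toDF(*names) renames columns positionally,
# so the list length must match the column count
dataframe = dataframe.toDF(*expected_names)

df = to_long(dataframe, ["Name", "Type"])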
I have list of 100 dataframes that I am trying to merge into a single dataframe but am unable to do so. All the dataframes have different columns and are of different lengths. To give a bit of context and background, each dataframe consist of 4 sentiment scores (calculated using VaderSentiment). The dataframes have the following representation :
USER 1 DATAFRAME
created_at | positive score of user 1 tweets | negative score of user 1 tweets| neutral score of user 1 tweets | compound score of user 1 tweets |
23/2/2011 10:00 | 1.12 | 1.3 | 1.0 | 3.3 |
24/2/2011 11:00 | 1.20 | 1.1 | 0.9 | 2.5 |
USER 2 DATAFRAME
created_at | positive score of user 2 tweets | negative score of user 2 tweets| neutral score of user 2 tweets | compound score of user 2 tweets |
25/3/2011 23:00 | 0.12 | 1.1 | 0.1 | 1.1 |
26/3/2011 08:00 | 1.40 | 1.5 | 0.4 | 1.5 |
01/4/2011 19:00 | 1.80 | 0.1 | 1.9 | 3.9 |
All the dataframes have one column in common, namely created_at. What I am trying to achieve is to merge all the dataframes on the created_at column such that I get only one created_at column plus all the other columns from all the other dataframes. The result should have 400 columns of sentiment scores along with one created_at column.
My code is as follows :
import pandas as pd
import glob
import numpy as np
import os
from functools import reduce
path = r'C:\Users\Desktop\Tweets'
allFiles = glob.glob(path + "/*.csv")
list = []
frame = pd.DataFrame()
count=0
for f in allFiles:
    file = open(f, 'r')
    count = count + 1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)
    #print(df)
    list.append(df)
frame = pd.concat(list)
print(frame)
The problem is that when I run the code above, I get the desired arrangement of columns, but instead of the values I get NaN everywhere, so I essentially end up with a dataframe of 401 columns in which only the created_at column contains values.
Any and all help is appreciated.
Thank you
EDIT
I have tried various solutions from other questions posted here, but none of them seem to work, so as a last resort I have started this thread.
EDIT 2
I may have come up with a solution to my problem. Using the code below, I can append all the columns into frames. However, this creates duplicates of the created_at column, which happens to be of type object. If I could merge all the dates into one column, my troubles would be much closer to being solved.
for f in allFiles:
    file = open(f, 'r')
    count = count + 1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)
    dates = df.iloc[:, 0]
    neut = df.iloc[:, 1]
    pos = df.iloc[:, 2]
    neg = df.iloc[:, 3]
    comp = df.iloc[:, 4]
    all_frames.append(dates)
    all_frames.append(neut)
    all_frames.append(pos)
    all_frames.append(neg)
    all_frames.append(comp)
frame = pd.concat(all_frames, axis=1)
Any help would be appreciated
I strongly suggest you revisit your data model; having that many columns usually signals that something is wrong. Having said that, here's one way to do it. Also, list is a built-in type name, so don't shadow it with a variable name.
I assume that other than created_at, the columns from each file are unique.
all_frames = []
for f in allFiles:
    # read each file, parsing created_at and using it as the index
    df = pd.read_csv(f, parse_dates=['created_at'], index_col='created_at')
    all_frames.append(df)

# This will create a dataframe of size n * 400,
# where n is the total number of rows across all files
frame = pd.concat(all_frames, join='outer', copy=False, sort=False)

# If you want to line up the hour across all users
frame = frame.groupby(level=0)[frame.columns].first()
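Alternatively, since the question already imports functools.reduce, a merge-based sketch (assuming each file's score columns are uniquely named per user, so only created_at is shared) keeps a single created_at column directly:
from functools import reduce
import pandas as pd

# read each file with created_at as a regular, parsed column
frames = [pd.read_csv(f, parse_dates=['created_at']) for f in allFiles]

# successively outer-merge on created_at so only one date column survives
merged = reduce(
    lambda left, right: pd.merge(left, right, on='created_at', how='outer'),
    frames,
)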
I'm trying to concatenate two dataframes and write the resulting dataframe to an Excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what it is I'm doing wrong. I thought providing the "index = False" argument at every excel call would eliminate the issue, but it has not.
[screenshot of the resulting Excel output]
Hopefully you can see the image, if not please let me know.
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
As the other users also wrote, I don't see the index in your image either; if the index were being written, the output would look like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
If you pass the index=False option, the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
which looks like your case. Your problem could be related to the concatenation and the transpose step.
Did you check your temporary dataframe before exporting it?
You might want to check whether pandas imports the time column as a time index.
If you want to delete those time columns, you could use df.drop and pass a list of columns to it, e.g. df.drop(columns=df.columns[:3]). Does this maybe solve your problem?
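A minimal sketch of that check, using the names from the question (the column slice is only an illustration of the df.drop suggestion above, not a known fix):
# inspect the temporary dataframe before exporting
print(test.head())
print(test.index, test.columns)   # check what transpose() left as index / column labels

# illustration only: drop the first few columns if they turn out to be
# unwanted time/index columns (mirrors the df.drop suggestion above)
df = df.drop(columns=df.columns[:3])
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)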
These are the three lists I have:
# made up data
products = ['apple','banana','orange']
prices = ['£0.11','£0.07','£0.05']
dates = ['02/04/2017','14/09/2018','06/08/2016']
Important to know
The data in these lists will vary, as will their length, although the data types will stay the same.
The first elements of each list are linked, likewise for the second and third elements, etc.
Desired command line interface:
Product | Price | Date of Purchase
--------|-------|------------------
apple | £0.11 | 02/04/2017
--------|-------|------------------
banana | £0.07 | 14/09/2018
--------|-------|------------------
orange | £0.05 | 06/08/2016
I want to create a table like this. It should obviously continue if there are more elements in each list but I don't know how I would create it.
I could do
print(""" Product | Price | Date of Purchase # etc...
--------|-------|------------------
%s | %s | %s
""" % (products[0],prices[0],dates[0]))
But I think this would be hardcoding the interface, which isn't ideal because the lists have an undetermined length.
Any help?
If you want a version that doesn't rely on a library, here's a fairly simple function that makes use of some list comprehensions:
def print_table(headers, *columns):
    # Ignore any columns of data that don't have a header
    columns = columns[:len(headers)]
    # Start with a space to set the header off from the left edge, then join the header strings with " | "
    print(" " + " | ".join(headers))
    # Draw the header separator with column dividers based on header length
    print("|".join(['-' * (len(header) + 2) for header in headers]))
    # Iterate over all lists passed in, and combine them together in a tuple by row
    for row in zip(*columns):
        # Center the contents within the space available in the column based on the header width
        print("|".join([
            col.center(len(headers[idx]) + 2, ' ')
            for idx, col in enumerate(row)
        ]))
This doesn't handle cell values that are longer than the column header length + 2. But that would be easy to implement with a truncation of the cell contents (an example of string truncation can be seen here).
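For example, calling it with the lists from the question (column widths follow the header lengths, so the spacing differs slightly from the mock-up above):
products = ['apple', 'banana', 'orange']
prices = ['£0.11', '£0.07', '£0.05']
dates = ['02/04/2017', '14/09/2018', '06/08/2016']

print_table(['Product', 'Price', 'Date of Purchase'], products, prices, dates)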
Try pandas:
import pandas as pd
products = ['apple','banana','orange']
prices = ['£0.11','£0.07','£0.05']
dates = ['02/04/2017','14/09/2018','06/08/2016']
df = pd.DataFrame({"Product": products, "Price": prices, "Date of Purchase": dates})
print(df)
Output:
Product Price Date of Purchase
0 apple £0.11 02/04/2017
1 banana £0.07 14/09/2018
2 orange £0.05 06/08/2016
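If you'd rather not show the numeric row index (the 0, 1, 2 column above), to_string can suppress it:
print(df.to_string(index=False))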
from beautifultable import BeautifulTable

table = BeautifulTable()
# made up data
products = ['apple', 'banana', 'orange']
prices = ['£0.11', '£0.07', '£0.05']
dates = ['02/04/2017', '14/09/2018', '06/08/2016']
table.column_headers = ['Product', 'Price', 'Date of Purchase']
for i in zip(products, prices, dates):
    table.append_row(list(i))
print(table)
output is :
+---------+-------+------------------+
| Product | Price | Date of Purchase |
+---------+-------+------------------+
| apple | £0.11 | 02/04/2017 |
+---------+-------+------------------+
| banana | £0.07 | 14/09/2018 |
+---------+-------+------------------+
| orange | £0.05 | 06/08/2016 |
+---------+-------+------------------+