I'm trying to run a transformation function in a pyspark script:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "test_csv", transformation_ctx = "datasource0")
...
dataframe = datasource0.toDF()
...
from pyspark.sql.functions import explode, array, struct, lit, col

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
#to_long(df, ["A"])
....
df = to_long(dataframe, ["Name","Type"])
My dataset looks like this:
Name |01/01(FRI)|01/02(SAT)|
ALZA CZ| 0 | 0
CLPA CZ| 1 | 5
My desired output is something like this:
Name |Type | Date. |Value |
ALZA CZ|New | 01/01(FRI) | 0
CLPA CZ|New | 01/01(FRI) | 1
ALZA CZ|Old | 01/02(SAT) | 1
CLPA CZ|Old | 01/02(SAT) | 5
However, the last code line gives me an error similar to this:
AnalysisException: Cannot resolve 'Name' given input columns 'col10'
When I check:
df.show()
I see 'col1', 'col2', etc. in the first row instead of the actual labels (["Name","Type"]). Should I separately remove and then add the original column titles?
It seems like your metadata table was configured using the built-in CSV classifier. If this classifier isn't able to detect a header, it will name the columns col1, col2, etc.
Your problem lies one stage before your ETL job, so in my opinion you shouldn't remove and re-add the original column titles, but rather fix your data import / schema detection by using a custom classifier.
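If you need a quick workaround inside the job while you set up the classifier, you can reapply the header manually after converting to a DataFrame. This is only a sketch and assumes the original header labels ended up as the first data row:
# promote the first data row to column names (assumption: it holds the real headers)
first_row = dataframe.first()                  # e.g. Row(col0='Name', col1='01/01(FRI)', ...)
new_names = [str(v) for v in first_row]
fixed = (dataframe
         .filter(dataframe[dataframe.columns[0]] != first_row[0])   # drop the header row
         .toDF(*new_names))                                         # apply the real labels
df = to_long(fixed, ["Name"])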
I want to fill the 'references' column in df_out with the 'ID' from df_sp whenever the corresponding 'my_ID' in df_sp is contained in df_jira's 'reference_ids'.
import pandas as pd
d_sp = {'ID': [1,2,3,4], 'my_ID': ["my_123", "my_234", "my_345", "my_456"], 'references':["","","2",""]}
df_sp = pd.DataFrame(data=d_sp)
d_jira = {'my_ID': ["my_124", "my_235", "my_346"], 'reference_ids': ["my_123, my_234", "", "my_345"]}
df_jira = pd.DataFrame(data=d_jira)
df_new = df_jira[~df_jira["my_ID"].isin(df_sp["my_ID"])].copy()
df_out = pd.DataFrame(columns=df_sp.columns)
needed_cols = list(set(df_sp.columns).intersection(df_new.columns))
for column in needed_cols:
    df_out[column] = df_new[column]
df_out['Related elements_my'] = df_jira['reference_ids']
Desired output df_out:
| ID | my_ID | references |
|----|-------|------------|
| | my_124| 1, 2 |
| | my_235| |
| | my_346| 3 |
What I have tried so far is a list comprehension, but I only managed to get the reference_ids "copied" from a helper column into my 'references' column with this:
for row, entry in df_out.iterrows():
    cpl_ids = [x for x in entry['Related elements_my'].split(', ') if any(vh_id == x for vh_id in df_cpl_list['my-ID'])]
    df_out.at[row, 'Related elements'] = ', '.join(cpl_ids)
I cannot wrap my head around how to get the specific 'ID's for the matches of 'any()', or whether this is actually the way to go, since I need all of the matches, not just whether there is any match at all.
Any hints are appreciated!
I work with Python 3.9.4 on Windows (adding this in case Python 3.10 offers a different solution).
Backstory: Moving data from Jira to MS SharePoint lists. (Therefore, the 'ID' does not equal the actual index in the dataframe, but is rather assigned by SharePoint upon insertion into the list. Hence, empty after running for the new entries.)
ref_df = df_sp[["ID","my_ID"]].set_index("my_ID")
df_out.references = df_out["Related elements_my"].apply(
    lambda x: ",".join(
        list(map(lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID), x.split(",")))
    )
)
df_out[["ID","my_ID","references"]]
output:
ID my_ID references
0 NaN my_124 1,2
1 NaN my_235
2 NaN my_346 3
What is map?
map is something like [func(i) for i in lst]: it applies func to every element of lst, but lazily, which can be faster.
You can read more about it here: https://realpython.com/python-map-function/
Here, our function is: lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID)
So if y is empty (y.strip() is only there to remove spaces), it maps to an empty string ("" if y == ""), as for my_235.
Otherwise, it looks up y in ref_df and gets the corresponding ID, i.e. it maps each my_ID to its ID.
Hope this is helpful :)
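As a quick standalone illustration of that equivalence (not part of the answer's code):
ids = ["my_123", " my_234", ""]
# list comprehension
out1 = [x.strip() for x in ids]
# map applies the same function to every element
out2 = list(map(lambda x: x.strip(), ids))
assert out1 == out2    # ['my_123', 'my_234', '']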
I have two dataframes that are essentially the same, but coming from two different sources. In my first dataframe, p_user_id is longType and date_of_birth is dateType, and the rest of the fields are stringType. In my second dataframe everything is stringType. I first check the row count for both dataframes based on p_user_id (that is my unique identifier).
DF1:
+--------------+
|test1_racounts|
+--------------+
| 418895|
+--------------+
DF2:
+---------+
|d_tst_rac|
+---------+
| 418915|
+---------+
Then if there is a difference in the row count I run a check on which p_user_id values are in one dataframe and not the other.
p_user_tst_rac.subtract(rac_p_user_df).show(100, truncate=0)
Gives me this result:
+---------+
|p_user_id|
+---------+
|661520 |
|661513 |
|661505 |
|661461 |
|661501 |
|661476 |
|661478 |
|661468 |
|661479 |
|661464 |
|661467 |
|661474 |
|661484 |
|661495 |
|661499 |
|661486 |
|661502 |
|661506 |
|661517 |
+---------+
My issue comes into play when I try to pull the rest of the corresponding fields for the difference. I want to pull the remaining fields so that I can do a manual search in the DB and the application to see if something was overlooked. When I add the rest of the columns, I get far more than 20 rows of differences. What is a better way to run the match and get the corresponding data?
Full code scope:
#racs in mysql
my_rac = spark.read.parquet("/Users/mysql.parquet")
my_rac.printSchema()
my_rac.createOrReplaceTempView('my_rac')
d_rac = spark.sql('''select distinct * from my_rac''')
d_rac.createOrReplaceTempView('d_rac')
spark.sql('''select count(*) as test1_racounts_ from d_rac''').show()
rac_p_user_df = spark.sql('''select
cast(p_user_id as string) as p_user_id
, record_id
, contact_last_name
, contact_first_name
from d_rac''')
#mssql_rac
sql_rac = spark.read.csv("/Users/mzn293/Downloads/kavi-20211116.csv")
#sql_rac.printSchema()
sql_rac.createOrReplaceTempView('sql_rac')
d_sql_rac = spark.sql('''select distinct
_c0 as p_user_id
, _c1 as record_id
, _c4 as contact_last_name
, _c5 as contact_first_name
from sql_rac''')
d_sql_rac.createOrReplaceTempView('d_sql_rac')
spark.sql('''select count(*) as d_aws_rac from d_sql_rac''').show()
dist_sql_rac = spark.sql('''select * from d_sql_rac''')
dist_sql_rac.subtract(rac_p_user_df).show(100, truncate=0)
With this I get far more than a 20-row difference. Furthermore, I feel there is a better way to get my result, but I'm not sure what I'm missing to get the data for just those 20 rows instead of 100-plus rows.
The easiest way in this case is to use an anti join.
df_diff = df1.join(df2, df1.p_user_id == df2.p_user_id, "leftanti")
This will give you all the rows that exist in df1 but have no matching record in df2.
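Applied to the dataframes from the question (variable names taken from the code above, so treat this as a sketch), the anti join keeps every column, which gives you the full rows for your manual check:
# full rows present in the MySQL extract but missing from the MSSQL extract
only_in_mysql = rac_p_user_df.join(dist_sql_rac, "p_user_id", "leftanti")
# and the other direction
only_in_mssql = dist_sql_rac.join(rac_p_user_df, "p_user_id", "leftanti")
only_in_mysql.show(100, truncate=False)
only_in_mssql.show(100, truncate=False)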
I have a dataframe which has an ID column and a related Array column which contains the IDs of its related records.
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam | [789,999]
789 | marc | [111]
555 | dan | [333]
From the above, I need to link every parent ID to all of its related child IDs, including the children of those children. The resultant DF should look like this:
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456,789,999,111]
345 | alen | [789,111]
456 | sam | [789,999,111]
789 | marc | [111]
555 | dan | [333]
I need help figuring out the above.
You can tackle this problem by using self joins and window functions.
I have divided the code into 5 steps. The algorithm is as follows:
Explode the array list so that each related ID becomes its own record (no more arrays in the data)
Self join on the Id and Related columns (Related is the renamed RELATED_IDLIST column)
Reduce the records that share the same a_id into one array, and collect the matching b_related values into another array
Merge the two array columns into one combined array and rank the resulting records by the size of that combined array (largest first)
Pick the records having rank 1
You can try the following code:
# importing necessary functions for later use
from pyspark.sql.functions import explode, col, collect_set, array_union, size
from pyspark.sql.functions import dense_rank, desc
from pyspark.sql.window import Window
# need to set cross join to True if Spark version < 3
spark.conf.set("spark.sql.crossJoin.enabled", True)
############### STEP 0 #####################################
# creating the above mentioned dataframe
id_cols = [123,345,456,789,555]
name_cols = ['mike','alen','sam','marc','dan']
related_idlist_cols = [[345,456],[789],[789,999],[111],[333]]
list_of_rows = [(each_0,each_1,each_2) for each_0, each_1, each_2 in zip(id_cols,name_cols,related_idlist_cols)]
cols_name = ['ID','NAME','RELATED_IDLIST']
# this will result in above mentioned dataframe
df = spark.createDataFrame(list_of_rows,cols_name)
############### STEP 1: Explode values #####################################
# explode function converts arraylist to atomic records
# one record having array size two will result in two records
#                                    +--> (123, mike, 345)
# (123, mike, [345, 456]) explode ---+
#                                    +--> (123, mike, 456)
df_1 = df.select(col('id'),col('name'),explode(df.RELATED_IDLIST).alias('related'))
############### STEP 2 : Self Join with Data #####################################
# creating dataframes with different column names, for joining them later
a = df_1.withColumnRenamed('id','a_id').withColumnRenamed('name','a_name').withColumnRenamed('related','a_related')
b = df_1.withColumnRenamed('id','b_id').withColumnRenamed('name','b_name').withColumnRenamed('related','b_related')
# this is a left self join of the exploded frame with itself
df_2 = a.join(b, a.a_related == b.b_id, how='left').orderBy(a.a_id)
############### STEP 3 : create Array Lists #####################################
# using collect_set we can reduce values of a particular kind into one set (we are reducing 'related' records, based on 'id')
df_3 = df_2.select('a_id','a_name',collect_set('a_related').over(Window.partitionBy(df_2.a_id)).\
alias('a_related_ids'),collect_set('b_related').over(Window.partitionBy(df_2.b_id)).alias('b_related_ids'))
# merging the two sets into one column and also calculating the resulting array size
df_4 = df_3.select('a_id','a_name',array_union('a_related_ids','b_related_ids').alias('combined_ids')).withColumn('size',size('combined_ids'))
# ranking the records to pick the ideal records
df_5 = df_4.select('a_id','a_name','combined_ids',dense_rank().over(Window.partitionBy('a_id').orderBy(desc('size'))).alias('rank'))
############### STEP 4 : Selecting Ideal Records #####################################
# picking records of rank 1, but this will still have duplicates, so remove them using distinct and order by id
df_6 = df_5.select('a_id','a_name','combined_ids').filter(df_5.rank == 1).distinct().orderBy('a_id')
############### STEP 5 #####################################
df_6.show(truncate=False)  # or display(df_6) if you are in a Databricks notebook
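If you want the result to carry the original column names from the question, a small optional follow-up (just renames on df_6 from the last step):
result = (df_6
          .withColumnRenamed('a_id', 'ID')
          .withColumnRenamed('a_name', 'NAME')
          .withColumnRenamed('combined_ids', 'RELATED_IDLIST'))
result.show(truncate=False)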
I have a dataframe that comes from SharePoint (Microsoft), and it has a lot of JSON inside the cells with the metadata. I usually don't work with JSON, so I'm struggling with it.
# df sample
+-------------+----------+
| Id | Event |
+-------------+----------+
| 105 | x |
+-------------+----------+
x = {"#odata.type":"#Microsoft.Azure.Connectors.SharePoint.SPListExpandedReference","Id":1,"Value":"Digital Training"}
How do I assign just the value "Digital Training" to the cell, for example? Keep in mind that this is occurring for a lot of columns, and I need to solve it for those too. Thanks.
If the Event column contains dict objects:
df['Value'] = df.apply(lambda x: x['Event']['Value'], axis=1)
If the Event column contains string objects:
import json
df['Value'] = df.apply(lambda x: json.loads(x['Event'])['Value'], axis=1)
Both result in
    Id                                               Event             Value
0  105  {"#odata.type":"#Microsoft.Azure.Connectors.Sh...  Digital Training
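Since you mention this happens for a lot of columns, here is a minimal sketch that applies the same extraction to every JSON-bearing column (json_cols is a hypothetical list, adjust it to your data):
import json

# hypothetical list of columns whose cells hold SharePoint metadata JSON
json_cols = ['Event', 'Category', 'Owner']

def extract_value(cell):
    # cells may already be dicts or still be raw JSON strings; handle both
    if isinstance(cell, str):
        cell = json.loads(cell)
    return cell.get('Value') if isinstance(cell, dict) else cell

for c in json_cols:
    df[c] = df[c].apply(extract_value)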
I'm trying to concatenate two dataframes and write the resulting dataframe to an Excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what it is I'm doing wrong. I thought providing the "index = False" argument at every Excel call would eliminate the issue, but it has not.
[screenshot of the exported Excel output]
Hopefully you can see the image; if not, please let me know.
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
As the other users also wrote, I don't see the index in your image either; if it were there, the output would look like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
If you pass the index=False option to to_excel, the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
which looks like your case. Your problem could be related to the concatenation and the transposed matrix.
Did you check your temporary dataframe here before exporting it?
You might want to check whether pandas imports the time column as a time index.
If you want to delete those time columns, you could use df.drop and pass a list of columns into this function, e.g. df.drop(columns=df.columns[:3]). Does this maybe solve your problem?
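To make that check concrete, here is a short sketch along those lines (it reuses df and df4 from your code; the dropped columns are only a placeholder):
# inspect the intermediate frame before exporting
test = pd.DataFrame(df4).transpose()
print(test.head())   # does an unexpected index or extra row show up here?

# resetting both indexes before concatenating avoids stray index data ending up as cells
df_out = pd.concat([df.reset_index(drop=True), test.reset_index(drop=True)], axis=1)

# drop unwanted columns explicitly before writing, e.g. the first three (placeholder)
# df_out = df_out.drop(columns=df_out.columns[:3])

df_out.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)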