I'm new to AWS Glue and PySpark, so I'm having some trouble with a transformation job. I have two DynamicFrames: one of them contains values in one of its columns that need to be added as separate columns to the other one, and the value placed in each new column must be the one that corresponds to the matching id in the first table. Here's how it looks:
Table 1 Table2
+--+-----+-----+ +--+-----+-----+
|id|name |value| |id|col1 |col2 |
+--+-----+-----+ +--+-----+-----+
| 1|name1| 10 | | 1|str1 |val1 |
+--+-----+-----+ +--+-----+-----+
| 2|name2| 20 | | 2|str2 |val2 |
+--+-----+-----+ +--+-----+-----+
I need the new format to be:
Table2
+--+-----+-----+-----+-----+
|id|col1 |col2 |name1|name2|
+--+-----+-----+-----+-----+
| 1|str1 |val1 | 10 | | <--- add 10 only here because the id from the row in the first table must match the id from the second table
+--+-----+-----+-----+-----+
| 2|str2 |val2 | | 20 | <--- add 20 only here because the id from the row in the first table must match the id from the second table
+--+-----+-----+-----+-----+
Suppose the two DataFrames are named df1 and df2:
# pivot Table 1 so each name becomes its own column holding the matching value
df3 = df1.groupBy('id').pivot('name').sum('value')
# join the pivoted result onto Table 2 by id
df4 = df2.join(df3, on='id', how='inner')
df4.show(truncate=False)
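Since the inputs are Glue DynamicFrames, a minimal sketch of wrapping the same logic with toDF()/fromDF() might look like the following (the names dyf1, dyf2 and glueContext are placeholders for your own DynamicFrames and GlueContext):
from awsglue.dynamicframe import DynamicFrame

# convert the DynamicFrames to Spark DataFrames (dyf1/dyf2 are placeholder names)
df1 = dyf1.toDF()
df2 = dyf2.toDF()

# pivot Table 1 and join it onto Table 2, as above
df3 = df1.groupBy('id').pivot('name').sum('value')
df4 = df2.join(df3, on='id', how='inner')

# wrap the result back into a DynamicFrame for the rest of the Glue job
result_dyf = DynamicFrame.fromDF(df4, glueContext, "result_dyf")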
I have two DataFrames. One contains multiple columns named after samples, with rows containing values. The second DataFrame contains a column called "Sample Name" which holds the names of the samples that pass quality control.
df1
| mz   | Sample 001 | Sample 002 | ...
|:-----|:----------:|-----------:|
| 234  | 3434       | 34545      |
| 4542 | 5656563    | 4545       |
df2
| Sample Name | RT |
|-------------|----|
| Sample001   | 8  |
| Sample002   | 8  | ...
df1 contains more than 2000 rows and 200 columns; df2 contains 180 rows. I want to filter df1 to remove the columns that are NOT present in the df2 column "Sample Name".
The resulting DataFrame should be a version of df1 containing only the 180 columns listed in df2.
See if this works:
for col in df1.columns:
    # drop any df1 column whose name is not in df2's "Sample Name" list
    if col not in df2['Sample Name'].unique():
        df1.drop(columns=[col], inplace=True)
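Alternatively, a shorter sketch of the same idea (assuming the values in "Sample Name" match the df1 column names exactly) keeps only the wanted columns in one step:
wanted = set(df2['Sample Name'])
# keep only the df1 columns that appear in df2's "Sample Name" list
df1 = df1[[c for c in df1.columns if c in wanted]]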
We have a feature request where we want to pull a table from the database on demand and perform some transformations on it. These tables may have duplicate columns (columns with the same name), and I want to combine each set of duplicates into a single column.
For example, a request for the input table named ages:
+---+----+------+-----+
|age| ids | ids | ids |
+---+----+------+-----+
| 25| 1 | 2 | 3 |
+---+----+------+-----+
| 26| 4 | 5 | 6 |
+---+----+------+-----+
the output table is:
+---+-----------+
|age| ids       |
+---+-----------+
| 25| [1, 2, 3] |
+---+-----------+
| 26| [4, 5, 6] |
+---+-----------+
Next time we might get a request for an input table named names:
+----+---------+---------+
|name| company | company |
+----+---------+---------+
| abc| a       | b       |
+----+---------+---------+
| xyc| c       | d       |
+----+---------+---------+
The output table should be:
+----+----------+
|name| company  |
+----+----------+
| abc| [a, b]   |
+----+----------+
| xyc| [c, d]   |
+----+----------+
So basically, I need to find the columns with the same name and then merge the values in them.
You can convert the Spark dataframe into a pandas dataframe, perform the necessary operations, and convert it back to a Spark dataframe.
I have added the necessary comments for clarity.
Using Pandas:
import pandas as pd
from collections import Counter

pd_df = spark_df.toPandas()  # convert the Spark dataframe to a pandas dataframe
pd_df.head()

def concatDuplicateColumns(df):
    # collect the column names that occur more than once
    duplicate_cols = [col for col, count in Counter(df.columns).items() if count > 1]

    # for every duplicated name, gather the row-wise values from all of its columns
    final_dict = {cols: [] for cols in duplicate_cols}
    for cols in duplicate_cols:
        for ind in df.index.tolist():
            final_dict[cols].append(df.loc[ind, cols].tolist())

    # drop all the duplicate columns, then re-add each name once as a list column
    df.drop(duplicate_cols, axis=1, inplace=True)
    for cols in duplicate_cols:
        df[cols] = final_dict[cols]
    return df

final_df = concatDuplicateColumns(pd_df)
spark_df = spark.createDataFrame(final_df)
spark_df.show()
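If converting large tables to pandas is a concern, a pure-Spark sketch of the same idea is possible: rename the columns positionally so every name is unique, then rebuild each original name, collecting duplicates into an array. The helper below is only a sketch, not a built-in function:
from collections import Counter
from pyspark.sql import functions as F

def merge_duplicate_columns(df):
    # give every column a unique positional name, e.g. ids_0, ids_1, ids_2
    counts = Counter()
    unique_names = []
    for c in df.columns:
        unique_names.append(f"{c}_{counts[c]}")
        counts[c] += 1
    renamed = df.toDF(*unique_names)

    # rebuild each original column name once; duplicates become an array column
    exprs = []
    for original in dict.fromkeys(df.columns):  # de-duplicated, order preserved
        parts = [u for u, o in zip(unique_names, df.columns) if o == original]
        if len(parts) == 1:
            exprs.append(F.col(parts[0]).alias(original))
        else:
            exprs.append(F.array(*[F.col(p) for p in parts]).alias(original))
    return renamed.select(*exprs)

merged = merge_duplicate_columns(spark_df)
merged.show(truncate=False)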
I have a pandas dataframe that is the result of a query where one column creates duplicate rows. I need help identifying the non-duplicate values among those duplicates by name, dynamically creating new columns to hold all of the values, and then deleting the duplicate rows. Below, Mike has duplicates in column "Code" and Mark in "Lang", so I'd like one row for each, with new columns for the non-duplicate values.
ID | Name | Code | Lang |
1 | Mike | 25 | SQL |
1 | Mike | 26 | SQL |
1 | Mike | 27 | SQL |
2 | Mark | 39 | NoSQL |
2 | Mark | 39 | SQL |
Loop through and identify which column values are not duplicates, copy each non-duplicate value, write it to a new column in the first row of the near-duplicate group, then delete the duplicate rows.
ID | Name | Code | Code2 | Code3 | Lang | Lang2 |
1 | Mike | 25 | 26 | 27 | SQL | . |
2 | Mark | 39 | . | . | NoSQL | SQL |
I'm able to get to just the duplicate rows by using the below, but have done a lot of research and am having trouble getting to my result. I'm exploring pivot and melt as an option but am stuck on the dynamic column part.
dup_rows = orig_df[orig_df.duplicated(['Name'])]
We can mark the duplicates per group with groupby, duplicated and cumsum, then use pivot_table to pivot the rows to columns, and finally use pd.concat to get a single dataframe back:
columns = ['Code', 'Lang']
dfs = []
for col in columns:
    # number each distinct value within its (ID, Name) group: Code1, Code2, ...
    df['cols'] = (
        col + df.groupby(['ID', 'Name'], sort=False)
                .apply(lambda x: (~x[col].duplicated()).cumsum()).astype(str).to_numpy()
    )
    # pivot those numbered labels out to columns
    dfs.append(df.pivot_table(index=['ID', 'Name'], columns='cols', values=col, aggfunc='first'))

dfn = pd.concat(dfs, axis=1).reset_index().rename_axis(None, axis=1)
ID Name Code1 Code2 Code3 Lang1 Lang2
0 1 Mike 25.0 26.0 27.0 SQL NaN
1 2 Mark 39.0 NaN NaN NoSQL SQL
I have a Spark DataFrame, like the one shown below.
I need an algorithm such that whenever 'M' appears in a row, I select the next two columns after the 'M' and create a new row. If there are two 'M's in a single row, then I need to create two rows: one with the two columns from the first 'M' and a second with the two columns from the second 'M'.
Input Dataframe:
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+
|rownum|_c0|_c1         |_c2|_c3     |_c4     |_c5  |_c6|_c7   |_c8|_c9|_c10|_c11|...
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Drinks|30 |M  |Food|20  |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Food  |40 |M  |Bar |50  |M|Drinks|100
+------+---+------------+---+--------+--------+-----+---+------+---+---+----+----+
New Output Dataframe:
+------+---+------------+---+--------+--------+-----+---+------+---+
|rownum|_c0|_c1         |_c2|_c3     |_c4     |_c5  |_c6|_c7   |_c8|
+------+---+------------+---+--------+--------+-----+---+------+---+
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Drinks|30 |
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Food  |20 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Food  |40 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Bar   |50 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Drinks|100|
+------+---+------------+---+--------+--------+-----+---+------+---+
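One way to approach this in PySpark (a sketch only: it assumes the first seven columns are fixed and every 'M' group occupies exactly three consecutive columns, which may need adjusting for the real layout) is to build an array of (marker, label, amount) structs and explode it:
from pyspark.sql import functions as F

# columns that repeat per row in (marker, label, amount) triples after the fixed block;
# the exact split is an assumption, adjust fixed_cols to match the real layout
fixed_cols = ["rownum", "_c0", "_c1", "_c2", "_c3", "_c4", "_c5"]
repeat_cols = [c for c in df.columns if c not in fixed_cols]

triples = [
    F.struct(
        F.col(repeat_cols[i]).alias("marker"),
        F.col(repeat_cols[i + 1]).alias("label"),
        F.col(repeat_cols[i + 2]).alias("amount"),
    )
    for i in range(0, len(repeat_cols) - 2, 3)
]

# one output row per 'M' group, keeping the fixed columns on every row
exploded = (
    df.withColumn("grp", F.explode(F.array(*triples)))
      .where(F.col("grp.marker") == "M")
      .select(*fixed_cols,
              F.col("grp.marker").alias("_c6"),
              F.col("grp.label").alias("_c7"),
              F.col("grp.amount").alias("_c8"))
)
exploded.show(truncate=False)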
I have a dataframe that looks like this:
# +----+------+---------+
# |col1| col2 | col3 |
# +----+------+---------+
# | id| name | val |
# | 1 | a01 | X |
# | 2 | a02 | Y |
# +----+------+---------+
I need to create a new dataframe from it, using row[1] as the new column headers and ignoring or dropping the col1, col2, etc. row. The new table should look like this:
# +----+------+---------+
# | id | name | val |
# +----+------+---------+
# | 1 | a01 | X |
# | 2 | a02 | Y |
# +----+------+---------+
The columns can be variable, so I can't use the names to set them explicitly in the new dataframe. This is not using pandas df's.
Assuming that there is only one row with id in col1, name in col2 and val in col3, you can use the following logic (commented for clarity and explanation)
# select the row that holds the header names
header = df.filter((df['col1'] == 'id') & (df['col2'] == 'name') & (df['col3'] == 'val'))

# select the rest of the rows, i.e. everything except the header row
restDF = df.subtract(header)

# convert the header row into a Row object
headerColumn = header.first()

# loop over the columns, renaming each one to the value from the header row
for column in restDF.columns:
    restDF = restDF.withColumnRenamed(column, headerColumn[column])

restDF.show(truncate=False)
this should give you
+---+----+---+
|id |name|val|
+---+----+---+
|1 |a01 |X |
|2 |a02 |Y |
+---+----+---+
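Since the question notes that the column names can vary, a sketch that avoids hard-coding 'id', 'name' and 'val' (assuming the header row is the one returned by first(), which is not guaranteed for arbitrary data) could be:
header_row = df.first()                       # assume the first returned row is the header row
new_names = [header_row[c] for c in df.columns]

# drop the header row and rename the remaining columns positionally
restDF = df.subtract(df.limit(1)).toDF(*new_names)
restDF.show(truncate=False)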
But the best option would be to read it with the header option set to true when reading the dataframe from the source.
Did you try this? header=True
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("TSCAINV_062020.csv", header=True)
PySpark sets the column names to _c0, _c1, _c2, ... if header is not set to True, and it pushes the actual header down into the first data row.
Thanks to #Sai Kiran!
The header=True works for me:
df = spark.read.csv("TSCAINV_062020.csv",header=True)