I have a dataframe that looks like this:
# +----+------+------+
# |col1| col2 | col3 |
# +----+------+------+
# | id | name | val  |
# | 1  | a01  | X    |
# | 2  | a02  | Y    |
# +----+------+------+
I need to create a new dataframe from it, using the first row (id, name, val) as the new column headers and dropping the original col1, col2, etc. row. The new table should look like this:
# +----+------+-----+
# | id | name | val |
# +----+------+-----+
# | 1  | a01  | X   |
# | 2  | a02  | Y   |
# +----+------+-----+
The columns can be variable, so I can't use the names to set them explicitly in the new dataframe. This is not using pandas DataFrames.
Assuming that there is only one row with id in col1, name in col2, and val in col3, you can use the following logic (commented for clarity and explanation):
# select the row that holds the header names
header = df.filter((df['col1'] == 'id') & (df['col2'] == 'name') & (df['col3'] == 'val'))

# select the rest of the rows, i.e. everything except the header row
restDF = df.subtract(header)

# convert the header row into a Row object
headerColumn = header.first()

# loop over the columns, renaming each one to the value found in the header row
for column in restDF.columns:
    restDF = restDF.withColumnRenamed(column, headerColumn[column])

restDF.show(truncate=False)
This should give you:
+---+----+---+
|id |name|val|
+---+----+---+
|1 |a01 |X |
|2 |a02 |Y |
+---+----+---+
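Since the columns can be variable, the same idea can be generalised so the header row is detected instead of hard-coding every column value. A hedged sketch (not part of the original answer), assuming the header row can still be identified by one known marker value in the first column, here 'id':

# locate the header row by a single known marker value in the first column (assumption)
header = df.filter(df[df.columns[0]] == 'id')
headerRow = header.first()

# build the new column names from that row, then rename all columns positionally
newNames = [headerRow[c] for c in df.columns]
restDF = df.subtract(header).toDF(*newNames)
restDF.show(truncate=False)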
But the best option would be to read it with the header option set to true while loading the dataframe from the source with sqlContext.
Did you try this? header=True
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("TSCAINV_062020.csv", header=True)
PySpark sets the column names to _c0, _c1, _c2 if header is not set to True, and the real header row gets pushed down into the data by one row.
Thanks to @Sai Kiran!
The header=True works for me:
df = spark.read.csv("TSCAINV_062020.csv",header=True)
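If the column types matter as well, the reader's inferSchema option can be enabled alongside header (an optional addition, not part of the original answer):

df = spark.read.csv("TSCAINV_062020.csv", header=True, inferSchema=True)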
I want to remove trailing whitespace after all data in an Excel table using pandas in a Jupyter notebook.
For example:
| A header | Another header |
| -------- | -------------- |
| First**whitespace** | row |
| Second | row**whitespace** |
output:
| A header | Another header |
| -------- | -------------- |
| First | row |
| Second | row |
If all columns are strings, use rstrip in DataFrame.applymap:
df = df.applymap(lambda x: x.rstrip())
Or use Series.str.rstrip per column in DataFrame.apply:
df = df.apply(lambda x: x.str.rstrip())
If some columns may contain non-strings (non-object dtypes), you can filter the column names first:
cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda x: x.str.rstrip())
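A minimal usage sketch with made-up sample data (the column names below are just the ones from the example above):

import pandas as pd

df = pd.DataFrame({'A header': ['First ', 'Second'],
                   'Another header': ['row', 'row ']})
cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda x: x.str.rstrip())
print(df.to_dict('list'))
# {'A header': ['First', 'Second'], 'Another header': ['row', 'row']}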
I need to split the delimited (~) column values into new columns dynamically. The input is a dataframe and a column name list. We are trying to solve this using Spark dataframe functions. Please help.
Input:
|Raw_column_name|
|1~Ram~1000~US|
|2~john~2000~UK|
|3~Marry~7000~IND|
col_names = ['id', 'names', 'sal', 'country']
Output:
id | names | sal  | country
1  | Ram   | 1000 | US
2  | john  | 2000 | UK
3  | Marry | 7000 | IND
We can use split() and then use the resulting array's elements to create columns.
from pyspark.sql import functions as func

data_sdf. \
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(func.col('raw_col_split_arr').getItem(0).alias('id'),
           func.col('raw_col_split_arr').getItem(1).alias('name'),
           func.col('raw_col_split_arr').getItem(2).alias('sal'),
           func.col('raw_col_split_arr').getItem(3).alias('country')
           ). \
    show()
# +---+-----+----+-------+
# | id| name| sal|country|
# +---+-----+----+-------+
# | 1| Ram|1000| US|
# | 2| john|2000| UK|
# | 3|Marry|7000| IND|
# +---+-----+----+-------+
In case the use case is extended to a dynamic list of columns:
col_names = ['id', 'names', 'sal', 'country']

data_sdf. \
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(*[func.col('raw_col_split_arr').getItem(i).alias(k) for i, k in enumerate(col_names)]). \
    show()
# +---+-----+----+-------+
# | id|names| sal|country|
# +---+-----+----+-------+
# | 1| Ram|1000| US|
# | 2| john|2000| UK|
# | 3|Marry|7000| IND|
# +---+-----+----+-------+
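If the split values also need proper types (split() always yields strings), a cast can be added per column. A hedged sketch, where the type mapping below is an assumption about the desired types:

# cast each split element while aliasing it (type_map is assumed, not from the question)
type_map = {'id': 'int', 'names': 'string', 'sal': 'int', 'country': 'string'}

data_sdf. \
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(*[func.col('raw_col_split_arr').getItem(i).cast(type_map[k]).alias(k)
             for i, k in enumerate(col_names)]). \
    show()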
Another option is the from_csv() function. The only thing that needs to be defined is the schema, which has the added advantage that the data can be parsed to the correct types automatically:
from pyspark.sql.functions import col, from_csv

df = spark.createDataFrame([('1~Ram~1000~US',), ('2~john~2000~UK',), ('3~Marry~7000~IND',)],
                           ["Raw_column_name"])
df.show()

col_names = ['id', 'names', 'sal', 'country']
schema = ','.join([f'{name} string' for name in col_names])
# if custom type conversion is needed
# schema = "id int, names string, sal string, country string"
options = {'sep': '~'}

df2 = (df
       .select(from_csv(col('Raw_column_name'), schema, options).alias('cols'))
       .select(col('cols.*'))
       )
df2.printSchema()
df2.show()
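To illustrate the automatic type parsing mentioned above, the same pipeline can be run with a typed schema; the int columns below are an assumption about the desired types:

typed_schema = "id int, names string, sal int, country string"

df3 = (df
       .select(from_csv(col('Raw_column_name'), typed_schema, options).alias('cols'))
       .select(col('cols.*'))
       )
df3.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- names: string (nullable = true)
#  |-- sal: integer (nullable = true)
#  |-- country: string (nullable = true)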
I'm new to AWS Glue and PySpark, so I'm having some trouble with a transformation job. I have two DynamicFrames: one of them contains values in one of its columns which need to be added as separate columns in the other DF, and each value needs to go into the row whose id matches the id from the first table. Here's how it looks:
Table 1 Table2
+--+-----+-----+ +--+-----+-----+
|id|name |value| |id|col1 |col2 |
+--+-----+-----+ +--+-----+-----+
| 1|name1| 10 | | 1|str1 |val1 |
+--+-----+-----+ +--+-----+-----+
| 2|name2| 20 | | 2|str2 |val2 |
+--+-----+-----+ +--+-----+-----+
I need the new format to be:
Table2
+--+-----+-----+-----+-----+
|id|col1 |col2 |name1|name2|
+--+-----+-----+-----+-----+
| 1|str1 |val1 | 10 | | <--- add 10 only here because the id from the row in the first table must match the id from the second table
+--+-----+-----+-----+-----+
| 2|str2 |val2 | | 20 | <--- add 20 only here because the id from the row in the first table must match the id from the second table
+--+-----+-----+-----+-----+
Suppose 2 dataframes are named df1 and df2.
# pivot table1 so that each distinct name becomes its own column holding the value
df3 = df1.groupBy('id').pivot('name').sum('value')

# attach the pivoted columns to table2 by matching on id
df4 = df2.join(df3, on='id', how='inner')
df4.show(truncate=False)
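A self-contained sketch with the sample rows from the question, assuming an existing SparkSession named spark:

df1 = spark.createDataFrame([(1, 'name1', 10), (2, 'name2', 20)],
                            ['id', 'name', 'value'])
df2 = spark.createDataFrame([(1, 'str1', 'val1'), (2, 'str2', 'val2')],
                            ['id', 'col1', 'col2'])

df3 = df1.groupBy('id').pivot('name').sum('value')
df4 = df2.join(df3, on='id', how='inner')
df4.show(truncate=False)
# roughly (row order may vary):
# id, col1, col2, name1, name2
# 1,  str1, val1, 10,    null
# 2,  str2, val2, null,  20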
We have a feature request where we want to pull a table from the database on request and perform some transformations on it. But these tables may have duplicate columns [columns with the same name]. I want to combine these columns into a single column.
for example:
Request for input table named ages:
+---+-----+-----+-----+
|age| ids | ids | ids |
+---+-----+-----+-----+
| 25|  1  |  2  |  3  |
| 26|  4  |  5  |  6  |
+---+-----+-----+-----+
The output table is:
+---+-----------+
|age|    ids    |
+---+-----------+
| 25| [1, 2, 3] |
| 26| [4, 5, 6] |
+---+-----------+
Next time we might get a request for an input table named names:
+----+---------+---------+
|name| company | company |
+----+---------+---------+
| abc|    a    |    b    |
| xyc|    c    |    d    |
+----+---------+---------+
The output table should be:
+----+---------+
|name| company |
+----+---------+
| abc| [a, b]  |
| xyc| [c, d]  |
+----+---------+
So basically I need to find the columns with the same name and then merge their values.
You can convert the Spark dataframe into a pandas dataframe, perform the necessary operations, and convert it back to a Spark dataframe.
I have added necessary comments for clarity.
Using Pandas:
import pandas as pd
from collections import Counter

pd_df = spark_df.toPandas()  # converting spark dataframe to pandas dataframe
pd_df.head()

def concatDuplicateColumns(df):
    duplicate_cols = []  # to store duplicate column names
    for col in dict(Counter(df.columns)):
        if dict(Counter(df.columns))[col] > 1:
            duplicate_cols.append(col)
    final_dict = {}
    for cols in duplicate_cols:
        final_dict[cols] = []  # initialize dict
    for cols in duplicate_cols:
        for ind in df.index.tolist():
            final_dict[cols].append(df.loc[ind, cols].tolist())
    df.drop(duplicate_cols, axis=1, inplace=True)
    for cols in duplicate_cols:
        df[cols] = final_dict[cols]
    return df

final_df = concatDuplicateColumns(pd_df)
spark_df = spark.createDataFrame(final_df)
spark_df.show()
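A minimal usage sketch with the sample "ages" table from the question (pandas accepts duplicate column labels when they are passed explicitly):

sample = pd.DataFrame([[25, 1, 2, 3], [26, 4, 5, 6]],
                      columns=['age', 'ids', 'ids', 'ids'])
print(concatDuplicateColumns(sample))
# roughly:
#    age        ids
# 0   25  [1, 2, 3]
# 1   26  [4, 5, 6]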
I have an abnormal setup where I have multiple tables on the same excel sheet. I'm trying to make each table (on the same sheet) a separate pandas dataframe. For example, on one excel sheet I might have:
+--------+--------+--------+--------+--------+--------+-----+
| Col | Col | Col | Col | Col | Col | Col |
+--------+--------+--------+--------+--------+--------+-----+
| Table1 | Table1 | | | | | |
| | | Table2 | Table2 | Table2 | Table2 | |
| | Table3 | Table3 | Table3 | | | |
+--------+--------+--------+--------+--------+--------+-----+
And what I want is the tables broken out by table type (each example below is one of the multiple tables as its own pandas df). The first (corner) column of each table's header is unique, so table1 might have its header corner column named "Leads", table2 has its header corner column named "Sales", and table3 has its header corner column named "Products".
+--------+--------+--+
| Leads | Table1 | |
+--------+--------+--+
| pd.Data| pd.Data| |
| | | |
| | | |
+--------+--------+--+
+--------+--------+--------+--------+--+
| Sales | Table2 | Table2 | Table2 | |
+--------+--------+--------+--------+--+
| pd.Data| pd.Data| pd.Data| pd.Data| |
| | | | | |
| | | | | |
+--------+--------+--------+--------+--+
+---------+---------+---------+--+
| Products| Table3 | Table3 | |
+---------+---------+---------+--+
| pd.Data | pd.Data | pd.Data | |
| | | | |
| | | | |
+---------+---------+---------+--+
I know that pandas will do fine if it can assume the Excel sheet is one big table, but with multiple tables I'm stumped on the best way to partition the data into separate dataframes, especially because I can't index on a fixed row or column due to the variable length of the tables over time.
This is how far I got before I realized this only works for one table, not three:
import pandas as pd
import string

df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)

letter = list(string.ascii_uppercase)
df1 = pd.read_excel("file.xlsx")

def get_start_column(df):
    for i, column in enumerate(df.columns):
        if df[column].first_valid_index():
            return letter[i]

def get_last_column(df):
    columns = df.columns
    len_column = len(columns)
    for i, column in enumerate(columns):
        if df[column].first_valid_index():
            return letter[len_column - i]

def get_first_row(df):
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"

df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)