spark dataframe not dropping columns after df.drop() operation - python

I'm using jupyter notebook and running spark 2.4.3.
game_reviews = spark.read.format("csv").option("header", "true").load("./amazon_reviews_us_Video_Games_v1_00.tsv")
#reading is fine
game_reviews_2_columns = game_reviews.drop(
    'marketplace', 'review_id', 'product_parent', 'product_title', 'product_category',
    'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 'review_headline',
    'review_body', 'review_date')
Running this code:
game_reviews_2_columns.columns
still gives all the columns:
['marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_parent\tproduct_title\tproduct_category\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\treview_headline\treview_body\treview_date']
What am I doing wrong?

The headers were not read properly; the columns came back as a single huge string of column names separated by tabs.
game_reviews = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")  # this is the missing parameter
    .load("./amazon_reviews_us_Video_Games_v1_00.tsv"))
I could also specify a schema and pass it in, but this works fine as is.
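For reference, a minimal sketch of what an explicit schema could look like here; the column names come from the header shown above, while the integer types are assumptions:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# column names taken from the file header; integer types for the count columns are assumptions
schema = StructType([
    StructField("marketplace", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("review_id", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("product_parent", StringType(), True),
    StructField("product_title", StringType(), True),
    StructField("product_category", StringType(), True),
    StructField("star_rating", IntegerType(), True),
    StructField("helpful_votes", IntegerType(), True),
    StructField("total_votes", IntegerType(), True),
    StructField("vine", StringType(), True),
    StructField("verified_purchase", StringType(), True),
    StructField("review_headline", StringType(), True),
    StructField("review_body", StringType(), True),
    StructField("review_date", StringType(), True),
])

game_reviews = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .schema(schema)
    .load("./amazon_reviews_us_Video_Games_v1_00.tsv"))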

Related

How to convert a CSV to Parquet after renaming a column name?

I am using the below code to rename one of the column names in a CSV file.
input_filepath ='/<path_to_file>/2500.csv'
df_csv = spark.read.option('header', True).csv(input_filepath)
df1 = df_csv.withColumnRenamed("xyz", "abc")
df1.printSchema()
The above code works fine. However, I also wanted to convert the CSV to Parquet format. If I am correct, the above code makes the changes in memory and not in the actual file. Please correct me if I am wrong.
If the changes are kept in memory, then how can I put the changes to parquet file?
For the file format conversion, I will be using below code snippet
df_csv.write.mode('overwrite').parquet()
But I am not able to figure out how to use it in this case. Please suggest.
Note: I am using a Databricks notebook for all the above steps.
Hi, I am not 100% sure whether this will solve your issue, but can you try to save your csv like this:
df.to_parquet(PATH_WHERE_TO_STORE)
Let me know if that helped.
Edit: My workflow usually goes like this
Export dataframe as csv
Check visually if everything is correct
Run this function:
import pandas as pd
def convert_csv_to_parquet(path_to_csv):
df = pd.read_csv(path_to_csv)
df.to_parquet(path_to_csv.replace(".csv", ".parq"), index=False)
This goes in the direction of Rakesh Govindulas' comment.
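For completeness, a minimal sketch of the Spark-native route as well, assuming the renamed DataFrame df1 from the question and a hypothetical output path:
# hypothetical output location; replace with your own path
output_filepath = '/<path_to_output>/2500_parquet'
df1.write.mode('overwrite').parquet(output_filepath)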

Pandas read excel sheet with multiple header in row and columns and merged cells

I'm new to pandas and I'm trying to use pandas read_excel to work with a file as a df. The spreadsheet looks like this:
Excel Matrix
The problem is that this file contains double headers in the columns and rows, and the first header of each includes merged cells. I tried this:
file = 'country_sector_py.xlsx'
matrix = pd.read_excel(file, sheet_name = 'matrix', header=[0, 1], index_col=[0, 1])
The error I get is "ValueError: Length of new names must be 1, got 2." I've read some related posts that say it's because some of the headers are repeated, but I haven't been able to solve it. Any guidance would be much appreciated.
References:
Pandas read excel sheet with multiple header when first column is empty
Error when using pandas read_excel(header=[0,1])
Not an answer, but to post more details than comments allow...
Using your code I cannot recreate the error.
import pandas as pd
df = pd.read_excel('matrix.xlsx', sheet_name = 'matrix', header=[0,1], index_col=[0, 1])
df
The worst I get is that copying 'region 2' twice makes it not show again and also messes up the sub-column numbering. Example:
It must be something else in your file. Share it if you can; otherwise look around inside it, or even open it and save it as a different Excel format (maybe XLSM, or if it is already XLSM then something else).
Maybe worth checking the version of Pandas with pip show pandas
# pip show pandas
Name: pandas
Version: 1.3.0
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org

Pandas – save to csv replaces columns

I have a problem with saving a pandas DataFrame to csv. I run the code in a Jupyter notebook and everything works fine. After running the same code on the server, column values are saved to random columns…
csvPath = r''+str(pathlib.Path().absolute())+ '/products/'+brand['file_name']+'_products.csv'
productFrame.to_csv(csvPath,index=True)
I've printed the DataFrame before saving – it looks as it should. After saving, I open the file and the values are mixed.
How to make it always work in the proper way?
If you want to force the column order when exporting to csv, use
df[cols].to_csv()
where cols is a list of column names in the desired order.
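A minimal sketch of that approach, assuming the productFrame and csvPath from the question and an illustrative column list:
# hypothetical column names, listed in the desired order
cols = ['brand', 'product_name', 'price']
productFrame[cols].to_csv(csvPath, index=True)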

Unable to get correct output from tsv file using pandas

I have a tsv file which I am trying to read with the help of pandas. The first two rows of the file are of no use and need to be ignored. However, when I get the output, I get it in the form of two columns. The name of the first column is Index and the name of the second column is a random row from the csv file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name is in bold black, which is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() is better suited for tsv files because its default delimiter is a tab, whereas read_csv() defaults to commas. Then header=None makes the first row data instead of a header.
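Since header=None leaves the columns unnamed, a small sketch of supplying names yourself; the names below are placeholders:
import pandas as pd

# placeholder column names; replace with the real ones for the file
data = pd.read_table('zahlen.csv', header=None, skiprows=2,
                     names=['col_a', 'col_b', 'col_c'])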

How to write a dataframe in pyspark having null values to CSV

I'm using the below code to write to a CSV file.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("nullValue"," ").save("/home/user/test_table/")
when I execute it, I'm getting the following error:
java.lang.UnsupportedOperationException: CSV data source does not support null data type.
Could anyone please help?
I had the same problem (not using that command with the nullValue option) and I solved it by using the fillna method.
And I also realised that fillna was not working with _corrupt_record, so I dropped it since I didn't need it.
df = df.drop('_corrupt_record')
df = df.fillna("")
df.write.option('header', 'true').format('csv').save('file_csv')
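If only certain columns should be filled, fillna also accepts a dict of per-column replacement values; the column names below are placeholders:
# hypothetical column names; fill nulls only in these columns
df = df.fillna({'review_headline': '', 'review_body': ''})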
