How to convert a CSV to Parquet after renaming a column name? - python

I am using the code below to rename one of the column names in a CSV file.
input_filepath = '/<path_to_file>/2500.csv'
# Read the CSV, using its header row for column names
df_csv = spark.read.option('header', True).csv(input_filepath)
# Rename column "xyz" to "abc"
df1 = df_csv.withColumnRenamed("xyz", "abc")
df1.printSchema()
The above code works fine. However, I also want to convert the CSV to Parquet format. If I understand correctly, the above code makes the changes in memory, not in the actual file; please correct me if I am wrong.
If the changes are only kept in memory, how can I write them out to a Parquet file?
For the file format conversion, I will be using the code snippet below:
df_csv.write.mode('overwrite').parquet()
But I am not able to figure out how to use it in this case. Please suggest.
Note: I am using a Databricks notebook for all the above steps.
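For reference, a minimal sketch of how that writer call might be completed, assuming a hypothetical output path; the key point is to write the renamed DataFrame df1, not the original df_csv:
# output_path is a placeholder; note that Spark writes Parquet as a directory of part files
output_path = '/<path_to_output>/2500_parquet'
df1.write.mode('overwrite').parquet(output_path)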

Hi, I am not 100% sure if this will solve your issue, but can you try to save your CSV like this:
df.to_parquet(PATH_WHERE_TO_STORE)
Let me know if that helped.
Edit: My workflow usually goes like this:
Export the dataframe as CSV
Check visually that everything is correct
Run this function:
import pandas as pd

# Requires a Parquet engine such as pyarrow or fastparquet to be installed
def convert_csv_to_parquet(path_to_csv):
    df = pd.read_csv(path_to_csv)
    df.to_parquet(path_to_csv.replace(".csv", ".parq"), index=False)
This goes in the direction of Rakesh Govindulas' comment.
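A hypothetical call, reusing the placeholder path from the question:
# Writes /<path_to_file>/2500.parq alongside the original CSV
convert_csv_to_parquet('/<path_to_file>/2500.csv')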

Related

df.to_csv writes everything in the same column of cells

I'm trying to export the following dataframe to a CSV file with the following line of code:
df.to_csv("Result.csv", encoding = "utf-8-sig")
but the output looks really weird: instead of the data being spread across the CSV's columns, everything ends up in the same column of cells.
Any idea what I'm doing wrong? Thanks in advance.
The CSV looks good to me. Excel uses, for example, a semicolon as the list separator in some locales, so it doesn't import comma-separated files correctly there. You could use to_csv(..., sep=';') and see if that helps. In the end, it depends on the locale settings of the computer on which the CSV is opened, so you can never be sure that your code will work "correctly" with Excel.
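A sketch of that workaround, writing with a semicolon so that Excel in such locales splits the columns correctly:
# sep=';' matches the list separator Excel expects in, e.g., many European locales
df.to_csv("Result.csv", sep=';', encoding="utf-8-sig")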

Converting output into .csv file

I am new to Python, and I am currently writing code to parse through an Excel sheet of websites, look at websites that were modified more than three months ago, and then pull out the names and emails of contacts at those sites. My problem now is that whenever I run the code in my terminal, it only shows me some of the output, so I'd like to export it to a .csv file, or really anything that lets me see all the values, but I'm not sure how.
import pandas as pd

data = pd.read_csv("filename.csv")
data.sort_values("Last change", inplace=True)
filter1 = data["Last change"] < 44285
data.where(filter1, inplace=True)
print(data)
Note: the 44285 came from converting the dates to integers in Excel so I didn't have to do it in Python. Lazy, I know, but I'm learning.
You can try writing it out to a CSV file:
data.to_csv('data.csv')
Alternatively, if you just want to view more records, for example 50, you could do this:
print(data.head(50))
If you can share your parser code as well, I think we can save you the hassle of editing the Excel file in between. Or maybe something from here with a few more lines.
To solve your problem, use
data.to_csv("resultfile.csv")
or, if you want an Excel file (to_excel requires an engine such as openpyxl to be installed),
data.to_excel("resultfile.xlsx")
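If the goal is only to see every row in the terminal rather than to export, pandas' display limit can also be relaxed; a sketch using the standard display.max_rows option:
import pandas as pd

# Show all rows instead of truncating the printout
pd.set_option('display.max_rows', None)
print(data)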

What do I have to change so that Jupyter shows columns?

I just want to import this CSV file. It reads the file, but somehow it doesn't create columns. Does anyone know why?
This is my code:
import pandas as pd
songs_data = pd.read_csv('../datasets/spotify-top50.csv', encoding='latin-1')
songs_data.head(n=10)
Result that I see in Jupyter:
P.S.: I'm kind of new to Jupyter and programming, but from everything I've found, it should work properly; I don't know why it doesn't.
To load a CSV file properly, you may need to specify some parameters; for example, in your case you need to specify quotechar:
df = pd.read_csv('../datasets/spotify-top50.csv',quotechar='"',sep=',', encoding='latin-1')
df.head(10)
If you still have a problem, take another look at your CSV file and at the pandas documentation, so that you can set the parameters to match your CSV file's structure.

Issue with saving a CSV file, how can I fix this in Python pandas?

I'm having trouble dropping columns and saving the new data frame as a CSV file.
Code:
import pandas as pd

file_path = 'Downloads/editor_events.csv'
df = pd.read_csv(file_path, index_col=False, nrows=1000)
df.to_csv(file_path, index=False)
df.to_csv(file_path)
The code executes and doesn't give any error. I've looked in my root directory but can't see any new CSV file.
Check for the file in the folder from which you are running the Python script. You are saving with the same name as the input, so you can check the modified time to confirm it. Also, you are not dropping any columns in the posted code; you are just taking 1000 rows and saving them.
First: you are saving to the same file that you are reading, so you won't see any new CSV files. All you are doing right now is rewriting the same file.
But since I can guess you just show it as a simple example of what you want to do, I will move on to the second point:
Make sure that your path is correct. Try writing the full path, like 'C:\Users\AwesomeUser\Downloads\editor_events.csv', instead of just 'Downloads/editor_events.csv'.
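A sketch of one way to avoid relative-path surprises, resolving the file against the home directory with pathlib (the Downloads location and the output file name are assumptions):
import pandas as pd
from pathlib import Path

# Resolve against the home directory so the script works no matter
# where it is launched from; a 'Downloads' folder there is assumed
file_path = Path.home() / 'Downloads' / 'editor_events.csv'
df = pd.read_csv(file_path, index_col=False, nrows=1000)
# Write under a new name so the result is easy to spot
df.to_csv(Path.home() / 'Downloads' / 'editor_events_trimmed.csv', index=False)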

Spark writes out `saveAsTextFile` in a Row() format

I'm trying to copy these files over from S3 to Redshift, and they are all in the format of Row(column1=value, column2=value,...), which obviously causes issues. How do I get a DataFrame to write out as normal CSV?
I'm calling it like this:
final_data.rdd.saveAsTextFile(
    path=r's3n://inst-analytics-staging-us-standard/spark/output',
    compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec'
)
I've also tried writing out with the spark-csv module, and it seems like it ignores any of the computations I did and just formats the original Parquet file as a CSV and dumps it out.
I'm calling that like this:
df.write.format('com.databricks.spark.csv').save('results')
The spark-csv approach is a good one and should work. Looking at your code, it seems you are calling df.write on the original DataFrame df, and that's why it ignores your transformations. To make it work, you should do something like:
final_data = # Do your logic on df and return a new DataFrame
final_data.write.format('com.databricks.spark.csv').save('results')
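As a side note, on Spark 2.0 and later the CSV writer is built in, so the external spark-csv package is no longer needed; a sketch reusing the same final_data:
# Built-in CSV writer (Spark >= 2.0); header and overwrite mode are optional
final_data.write.mode('overwrite').option('header', True).csv('results')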
