I'm trying to export the following dataframe to a csv file
with the following line of code:
df.to_csv("Result.csv", encoding = "utf-8-sig")
but the output looks really weird:
I would like it to use the different columns of the csv to input the data, but what it's doing is putting everything in the same column of cells.
Any idea what I'm doing wrong? Thanks in advance.
The CSV looks good to me. Excel uses e.g. semicolon in some locales so it doesn't import CSV files in that locale. You could use to_csv(..., sep=';') and see if it helps. In the end, it will depend on the locale settings on the computer in which the CSV is opened, so you can never be sure that your code will work "correctly" with Excel.
Related
From Python i want to export to csv format a dataframe
The dataframe contains two columns like this
So when i write this :
df['NAME'] = df['NAME'].astype(str) # or .astype('string')
df.to_csv('output.csv',index=False,sep=';')
The excel output in csv format returns this :
and reads the value "MAY8218" as a date format "may-18" while i want it to be read as "MAY8218".
I've tried many ways but none of them is working. I don't want an alternative like putting quotation marks to the left and the right of the value.
Thanks.
If you want to export the dataframe to use it in excel just export it as xlsx. It works for me and maintains the value as string in the original format.
df.to_excel('output.xlsx',index=False)
The CSV format is a text format. The file contains no hint for the type of the field. The problem is that Excel has the worst possible support for CSV files: it assumes that CSV files always use its own conventions when you try to read one. In short, one Excel implementation can only read correctly what it has written...
That means that you cannot prevent Excel to interpret the csv data the way it wants, at least when you open a csv file. Fortunately you have other options:
import the csv file instead of opening it. This time you have options to configure the way the file should be processed.
use LibreOffice calc for processing CSV files. LibreOffice is a little behind Microsoft Office on most points except for csv file handling where it has an excellent support.
I am using the below code to rename one of the column name in a CSV file.
input_filepath ='/<path_to_file>/2500.csv'
df_csv = spark.read.option('header', True).csv(input_filepath)
df1 = df_csv.withColumnRenamed("xyz", "abc")
df1.printSchema()
So, the above code works fine. however, I wanted to also convert the CSV to parquet format. If I am correct, the above code will make the changes and put in the memory and not to the actual file. Please correct me if I am wrong.
If the changes are kept in memory, then how can I put the changes to parquet file?
For the file format conversion, I will be using below code snippet
df_csv.write.mode('overwrite').parquet()
But not able to figure out how to use it in this case. Please suggest
Note: I am suing Databricks notebook for all the above steps.
Hi I am not 100% sure if that will solve your issue but can you try to save your csv like this:
df.to_parquet(PATH_WHERE_TO_STORE)
Let me know if that helped.
Edit: My workflow usually goes like this
Export dataframe as csv
Check visually if everything is correct
Run this function:
import pandas as pd
def convert_csv_to_parquet(path_to_csv):
df = pd.read_csv(path_to_csv)
df.to_parquet(path_to_csv.replace(".csv", ".parq"), index=False)
Which goes into the direction of Rakesh Govindulas' comment.
So I have a csv file with a column called reference_id. The values in reference id are 15 characters long, so something like '162473985649957'. When I open the CSV file, excel has changed the datatype to General and the numbers are something like '1.62474E+14'. To fix this in excel, I change the column type to Number and remove the decimals and it displays the correct value. I should add, it only does this in CSV file, if I output to xlsx, it works fine. PRoblem is, the file has to be csv.
Is there a way to fix this using python? I'm trying to automate a process. I have tried using the following to convert it to a string. It works in the sense that is converts the column to a string, but it still shows up incorrectly in the csv file.
df['reference_id'] = df['reference_id'].astype(str)
df.to_csv(r'Prev Day Branch Transaction Mems.csv')
Thanks
When I open the CSV file, excel has changed the data
This is an Excel problem. You can't fix how Excel decides to interpret your CSV. (You can work around some issues by using the text import format, but that's cumbersome.)
Either use XLS/XLSX files when working with Excel, or use eg. Gnumeric our something other that doesn't wantonly mangle your data.
I just want to import this csv file. It can read it but somehow it doesn't create columns. Does anyone know why?
This is my code:
import pandas as pd
songs_data = pd.read_csv('../datasets/spotify-top50.csv', encoding='latin-1')
songs_data.head(n=10)
Result that I see in Jupyter:
P.S.: I'm kinda new to Jupyter and programming, but after all I found it should work properly. I don't know why it doesn't do it.
To properly load a csv file you should specify some parameters. for example in you case you need to specify quotechar:
df = pd.read_csv('../datasets/spotify-top50.csv',quotechar='"',sep=',', encoding='latin-1')
df.head(10)
If you still have a problem you should have a look at your CSV file again and also pandas documentation, so that you can set parameters to match with your CSV file structure.
I'm opening a data file csv1.txt (in CSV format) at a Windows OS. When you open it in Windows notepad. it's like:
john|mary|joe34|25|332|21|4321|42|25
I use these codes in Jupyter Notebook to do the read_csv and to_csv in pandas:
import pandas as pd
df = pd.read_csv('csv1.txt', header=None)
df.to_csv('csv2.txt', header=False, index=False)
Then when I open csv2.txt, it looks like this:
john|mary|joe
34|25|3
32|21|43
21|42|25
Is there a way to make csv2.txt identical to csv1.txt?
You should probably provide the relevant line_terminator to to_csv e.g. assuming you're reading a unix-style CSV on Windows the input probably gets handled by universal newlines, then on writing Pandas uses os.linesep which is a "full" windows-style linebreak rather than a unix-style '\n'. Something along those lines.
That aside, providing relevant contextual environment is generally helpful to get answers, "When you open it, it's like:" when the reader has no idea what your OS is, what you mean by "open it", what the actual (raw) contents of the file is, etc...