I have a dataframe in which one column contains strings, each prefixed with tab characters as formatting. Below is a snippet of what it looks like:
formatted_line_items[1:3]
Out[393]: ['\t<string1>', '\t\t<string2>']
However when I write the dataframe using to_csv the formatting is lost. How can I write this to a csv file or excel file and still retain the formatting?
EDIT: I got to know that csv doesn't retain formatting so I used the pandas to_excel function but still no luck with the formatting.
Just found XlsxWriter has a set_indent function where we can specify the indentation.
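A minimal sketch of how the tab prefixes could be converted into indent levels for that approach. The helper name split_indent is hypothetical, and it assumes that each leading tab should map to one indent level. The level would then be handed to XlsxWriter's set_indent on the format used for that cell.

```python
def split_indent(value):
    """Return (number of leading tabs, text with those tabs removed)."""
    stripped = value.lstrip("\t")
    return len(value) - len(stripped), stripped


# Sample values taken from the snippet in the question.
formatted_line_items = ["\t<string1>", "\t\t<string2>"]

# Each entry is (indent_level, cell_text).
levels = [split_indent(item) for item in formatted_line_items]
print(levels)
```

Keeping the level separate from the text means the cell holds a clean string and the indentation is carried by the cell format rather than by invisible characters.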
Related
From Python I want to export a dataframe to CSV format.
The dataframe contains two columns like this
So when I write this:
df['NAME'] = df['NAME'].astype(str) # or .astype('string')
df.to_csv('output.csv',index=False,sep=';')
Opening the resulting CSV file in Excel shows this:
Excel reads the value "MAY8218" as a date and displays it as "may-18", while I want it to be read as "MAY8218".
I've tried many approaches but none of them works. I don't want a workaround such as putting quotation marks to the left and right of the value.
Thanks.
If you want to export the dataframe to use it in excel just export it as xlsx. It works for me and maintains the value as string in the original format.
df.to_excel('output.xlsx',index=False)
The CSV format is a text format. The file contains no hint for the type of the field. The problem is that Excel has the worst possible support for CSV files: it assumes that CSV files always use its own conventions when you try to read one. In short, one Excel implementation can only read correctly what it has written...
That means that you cannot prevent Excel from interpreting the csv data the way it wants, at least when you open a csv file. Fortunately you have other options:
Import the csv file instead of opening it. This time you have options to configure the way the file should be processed.
Use LibreOffice Calc for processing CSV files. LibreOffice is a little behind Microsoft Office on most points except for csv file handling, where it has excellent support.
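The point that a CSV file carries no type information can be illustrated with Python's standard csv module. This is only a sketch: the extra values 42 and 3.5 are made up for the example, and the ';' separator is borrowed from the question.

```python
import csv
import io

# Write one row containing a string, an integer and a float.
buffer = io.StringIO()
csv.writer(buffer, delimiter=";").writerow(["MAY8218", 42, 3.5])
raw_text = buffer.getvalue()

# Read the same text back: every field comes back as plain text,
# because nothing in the file records what type each field had.
row = next(csv.reader(io.StringIO(raw_text), delimiter=";"))
print(repr(raw_text))
print(row)
```

Any program that opens the file, Excel included, has to guess the types again from the text alone, which is why the astype(str) call in the question has no effect on what Excel shows.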
I've been trying to use Python and Pandas to take a csv as input, clean the dataset, and assign the output to a new csv file. One of the columns in the original csv has trademark symbols. When I export the new csv, the columns sometimes have ™ instead of just the trademark symbol, or sometimes they're turned into ™. This is how I imported the original csv and exported the new csv:
import pandas as pd
df=pd.read_csv("original_df.csv", encoding='latin1',dtype='unicode')
This is how I exported a new dataframe to csv:
df_new.to_csv('new_test_df.csv', index = False)
How do I export the string without the extra symbols (i.e. how it was in the original)?
Thanks!
Just fixed this same problem. Answer and explanation can be found here
The quick answer is to use the encoding "utf-8-sig" when writing the csv.
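A small sketch of what "utf-8-sig" changes, using only the standard library. The sample value is hypothetical, and the sketch assumes the stray characters come from UTF-8 bytes being decoded with a single-byte Windows code page (cp1252 here).

```python
import codecs

text = "Widget™"  # hypothetical sample value

# Plain UTF-8 and UTF-8 with a signature (byte order mark) differ
# only in the three marker bytes at the start of the output.
utf8_bytes = text.encode("utf-8")
bom_bytes = text.encode("utf-8-sig")
print(bom_bytes == codecs.BOM_UTF8 + utf8_bytes)

# Decoding the UTF-8 bytes with the wrong code page produces the
# kind of extra symbols described in the question.
misread = utf8_bytes.decode("cp1252")
print(misread)

# Decoding with utf-8-sig strips the marker and restores the text.
print(bom_bytes.decode("utf-8-sig"))
```

With pandas this would presumably amount to passing encoding='utf-8-sig' to to_csv. If the original file is actually UTF-8, reading it with encoding='latin1' may already introduce the same mismatch on input, so the encoding passed to read_csv is worth checking too.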
So I have a csv file with a column called reference_id. The values in reference_id are 15 characters long, so something like '162473985649957'. When I open the CSV file, Excel has changed the datatype to General and the numbers look like '1.62474E+14'. To fix this in Excel, I change the column type to Number and remove the decimals, and it displays the correct value. I should add that it only does this with the CSV file; if I output to xlsx, it works fine. The problem is, the file has to be csv.
Is there a way to fix this using python? I'm trying to automate a process. I have tried using the following to convert it to a string. It works in the sense that it converts the column to a string, but it still shows up incorrectly in the csv file.
df['reference_id'] = df['reference_id'].astype(str)
df.to_csv(r'Prev Day Branch Transaction Mems.csv')
Thanks
When I open the CSV file, excel has changed the data
This is an Excel problem. You can't fix how Excel decides to interpret your CSV. (You can work around some issues by using the text import format, but that's cumbersome.)
Either use XLS/XLSX files when working with Excel, or use e.g. Gnumeric or something else that doesn't wantonly mangle your data.
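To see that the digits are still intact in the file and only the display differs, here is a rough sketch. It assumes that Excel's General format behaves like scientific notation with five decimals for a number this long, which matches the value quoted in the question.

```python
reference_id = "162473985649957"  # value from the question

# The text stored in the CSV keeps all 15 digits.
print(reference_id)

# Formatting the same number in scientific notation with five decimals
# gives the shortened form that the question reports seeing in Excel.
as_displayed = f"{int(reference_id):.5E}"
print(as_displayed)
```

Opening the csv in a plain text editor is a quick way to confirm that pandas wrote the full value.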
I'm importing an .xlsx file with pd.read_excel(). I received this file as a CSV file and used Excel to separate it by comma, so I get a proper .xlsx file with columns etc. Six of the dataframe columns have a number as header (e.g. 5030, 5031,...). When I want to change the column name with df = df.rename(columns={...}) this does not work. Also df["5030"] does not work; it throws an error: KeyError:'5030'. This code works for columns which have regular/non-integer names.
However, when I import the raw .csv file with pd.read_csv(), all the code above does work. I can just rename column names. The df's do look exactly the same when imported with both techniques, but apparently something is different.
It is not a serious issue, as I can change the column names to non-integers manually in Excel, but I'm very curious about what the underlying "problem" is here and how these two functions operate differently.
Thanks!
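One likely explanation, stated here as an assumption because the thread does not confirm it: read_excel keeps numeric header cells as integers, while read_csv reads the header row as text. The two frames then print identically, but their column labels have different types. The sketch below mimics this with plain lists; the label "name" is made up for the example.

```python
# Assumed column labels after each import (not taken from a real file).
excel_style_labels = [5030, 5031, "name"]      # numeric headers kept as int
csv_style_labels = ["5030", "5031", "name"]    # headers read as text

# Looking up the text "5030" fails for the integer labels,
# which would correspond to the KeyError in the question.
found_as_text = "5030" in excel_style_labels
found_as_int = 5030 in excel_style_labels
print(found_as_text, found_as_int)

# Converting every label to str makes both sets of labels identical.
normalized = [str(label) for label in excel_style_labels]
print(normalized == csv_style_labels)
```

If this is indeed the cause, df[5030] (without quotes) should work on the read_excel frame, and df.columns = df.columns.map(str) should make it behave like the read_csv one. Printing df.columns is a quick way to check the label types.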
import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works but while writing the csv, a value for an identifier like 0003418 (type = str) converts to 3418 (type = general) when the csv file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a python error. Excel autoformats numeric columns to remove leading 0's. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
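A short sketch of what QUOTE_ALL does to the file contents, using the standard csv module rather than pandas so it runs on its own. The identifier is the one from the question. How Excel treats the quoted value is the claim made above and is not verified here.

```python
import csv
import io

# Write a header and one identifier with every field quoted.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(["CUSIP"])
writer.writerow(["0003418"])
quoted_text = buffer.getvalue()
print(repr(quoted_text))

# A CSV-aware reader strips the quotes again and keeps the leading zeros.
rows = list(csv.reader(io.StringIO(quoted_text)))
print(rows)
```

Readers that split on commas without handling quotes would see the quote characters as part of the value, which is the kind of issue the note above warns about.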
Another solution is to write the CSV without quoting, and set the column format in Excel to "Text" instead of "General" when importing the file, so that the leading zeros are kept.