I have a CSV file that, when opened in Notepad, displays as follows:
A,B
C,
D,E,F,G,H
The status bar at the bottom right shows it as Unix (LF) and UTF-8. When I open the file in Excel, save it without making any changes, and close it, Excel converts it to Windows (CRLF) as expected, and it then displays as follows in Notepad:
A,B,,,
C,,,,
D,E,F,G,H
The header row is the third row (D,E,F,G,H). My understanding is that, before saving, Excel reads the entire CSV file, works out that the longest row has 4 commas, and pads every row to that width. The problem I'm running into is reading the original LF CSV file with pandas' read_csv. I think I've narrowed the solution down to 2 possible options (but please correct me if I'm wrong):
Option 1: In my main Python script, I start with a function that iterates through every CSV file in a folder, opening, saving, and closing each one to convert it to CRLF before working with the files in pandas.
Option 2: Format the CSV file as I read it into pandas. I feel this is the better option, especially since I know the number of columns and can use .read_csv(header = 3), but when I open the output file and run Excel formulas, calculation times are insane, even for relatively small files. I have a feeling it's a datatype issue, but I'm still new to all of this. Any clarification or resources are greatly appreciated!
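A minimal sketch of Option 2, assuming the header is the third line of the file and the two lines above it can be discarded. The two data rows below the header are made up for illustration. pandas reads LF and CRLF line endings alike, so no Excel round trip should be needed. Note that `header` is zero-based, so the third line would be `header=2`; the sketch uses `skiprows=2` instead, which skips the two short lines outright.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the file in the question (LF line endings),
# plus two made-up data rows below the header.
raw = "A,B\nC,\nD,E,F,G,H\n1,2,3,4,5\n6,7,8,9,10\n"

# Skip the two short lines above the header; the third line of the file
# then supplies the column names.
df = pd.read_csv(io.StringIO(raw), skiprows=2)

print(df.columns.tolist())
# Inspecting the inferred dtypes is a quick way to check for a datatype issue.
print(df.dtypes)
```

For a real file, the `io.StringIO(raw)` argument would be replaced by the file path.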
Related
From Python, I want to export a dataframe to CSV format.
The dataframe contains two columns like this:
So when I write this:
df['NAME'] = df['NAME'].astype(str) # or .astype('string')
df.to_csv('output.csv',index=False,sep=';')
Opening the CSV output in Excel shows this:
Excel reads the value "MAY8218" as a date and displays it as "may-18", while I want it to be read as "MAY8218".
I've tried many approaches but none of them works. I don't want a workaround such as putting quotation marks to the left and right of the value.
Thanks.
If you want to export the dataframe for use in Excel, just export it as xlsx. It works for me and keeps the value as a string in its original format.
df.to_excel('output.xlsx',index=False)
The CSV format is a text format. The file contains no hint about the type of each field. The problem is that Excel has the worst possible support for CSV files: when you open one, it assumes the file always follows its own conventions. In short, an Excel installation can only correctly read what it has written itself...
That means you cannot prevent Excel from interpreting the CSV data the way it wants, at least when you open a CSV file directly. Fortunately, you have other options:
Import the CSV file instead of opening it. This way you get options to configure how the file should be processed.
Use LibreOffice Calc for processing CSV files. LibreOffice is a little behind Microsoft Office on most points, except for CSV file handling, where its support is excellent.
I was wondering why I get funny behaviour when using a CSV file that has been "changed" in Excel.
I have a CSV file of around 211,029 rows and load it into pandas in a Jupyter notebook.
The simplest example I can give of a change is clicking the filter icon in Excel, saving the file, unclicking the filter icon, and saving again (making no actual changes to the data).
When I pass my CSV file through pandas, after a few filter operations, some rows go missing.
This does not happen if I do absolutely nothing with the CSV file. Leaving the file completely alone gives me the correct number of rows after filtering, whereas "making changes" to it does not.
Why is this? Is it because of the number of rows in the CSV file? Are we supposed to leave CSV files untouched if we are planning to filter them through pandas anyway?
(As a side note I'm using Excel on a MacBook.)
Excel does not leave any file "untouched". It applies formatting to every file it opens (e.g. float values like "5.06" may be interpreted as a date and changed to "05 Jun"). Depending on the datatype your code expects, these rows might be displayed wrongly or go missing in your notebook.
It is better to use sed or awk to manipulate CSV files (or a text editor for smaller files).
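To see what Excel actually changed, one option is to keep a copy of the original file and compare it line by line with the re-saved one. This is only a sketch: the function name `changed_lines` is made up, and it assumes both files are UTF-8 (with or without a BOM) and that no field contains an embedded line break.

```python
from pathlib import Path


def changed_lines(original_path, resaved_path):
    """Return (line_number, original, resaved) for every line that differs."""
    # splitlines() drops the line endings, so a pure LF -> CRLF change
    # is not reported as a difference. "utf-8-sig" tolerates a leading BOM.
    original = Path(original_path).read_text(encoding="utf-8-sig").splitlines()
    resaved = Path(resaved_path).read_text(encoding="utf-8-sig").splitlines()

    diffs = []
    for i in range(max(len(original), len(resaved))):
        # None marks a line that exists in only one of the two files.
        a = original[i] if i < len(original) else None
        b = resaved[i] if i < len(resaved) else None
        if a != b:
            diffs.append((i + 1, a, b))
    return diffs
```

Calling it as `changed_lines("original.csv", "resaved.csv")` (hypothetical file names) returns an empty list when only the line endings differ.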
I exported strings of numbers from Python into a CSV file:
When I open it in Notepad, the data looks like this, which is the real data:
However, if I open it in Excel, the data looks like this, which is wrong:
Can somebody please let me know how I can get the following strings to show in the CSV file:
Cell A1: 15
Cell A2: 15.0
Cell A3: 15.00
Cell A4: 15.000
The CSV file itself is not doing that; Excel is. When it opens the file, it treats the values as numbers and drops the trailing zeros from the display. If you read the file with another program, or with Python, the zeros are still there.
See this article for how to change that behaviour. If you are having a hard time saving the CSV file, you may look here.
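As a quick check that the zeros really are in the file, here is a small sketch using Python's standard `csv` module. It writes the four values from the question to an in-memory buffer (standing in for the file) and reads them back.

```python
import csv
import io

# Write the four values from the question as plain text, one per row.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for value in ["15", "15.0", "15.00", "15.000"]:
    writer.writerow([value])

# Read the same text back: the csv module returns every field as a string,
# so the trailing zeros are still there.
buffer.seek(0)
values = [row[0] for row in csv.reader(buffer)]
print(values)  # ['15', '15.0', '15.00', '15.000']
```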
Since the first step is exporting the data from Python to CSV, all leading and trailing zeroes can be stripped at that stage.
One such reference that can be used is shown here: Removing Trailing Zeros in Python
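One simple way to do this at export time, sketched below, is plain string handling. The helper name `strip_trailing_zeros` is made up for illustration, it assumes the values are already strings in ordinary decimal notation, and it handles trailing zeros only.

```python
def strip_trailing_zeros(text):
    """Drop trailing zeros after a decimal point, e.g. '15.000' -> '15'."""
    # Only touch values that contain a decimal point, so '100' stays '100'.
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


values = ["15", "15.0", "15.00", "15.000", "1.50", "100"]
cleaned = [strip_trailing_zeros(v) for v in values]
print(cleaned)  # ['15', '15', '15', '15', '1.5', '100']
```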
So I have a CSV file with a column called reference_id. The values in reference_id are 15 characters long, something like '162473985649957'. When I open the CSV file, Excel has changed the datatype to General and the numbers look something like '1.62474E+14'. To fix this in Excel, I change the column type to Number and remove the decimals, and it displays the correct value. I should add that it only does this with the CSV file; if I output to xlsx, it works fine. The problem is, the file has to be CSV.
Is there a way to fix this using Python? I'm trying to automate a process. I have tried using the following to convert it to a string. It works in the sense that it converts the column to a string, but it still shows up incorrectly in the CSV file.
df['reference_id'] = df['reference_id'].astype(str)
df.to_csv(r'Prev Day Branch Transaction Mems.csv')
Thanks
When I open the CSV file, excel has changed the data
This is an Excel problem. You can't fix how Excel decides to interpret your CSV. (You can work around some issues by using the text import format, but that's cumbersome.)
Either use XLS/XLSX files when working with Excel, or use e.g. Gnumeric or something else that doesn't wantonly mangle your data.
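To confirm that the file itself is fine and only Excel's display is off, you can inspect the raw CSV text. A minimal sketch, using a made-up one-row frame with the example value from the question and an in-memory buffer in place of the real file:

```python
import io

import pandas as pd

# Hypothetical one-row frame using the example value from the question.
df = pd.DataFrame({"reference_id": [162473985649957]})

# Write to an in-memory buffer instead of a file to inspect the raw CSV text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)  # the full 15-digit value appears in the text

# Reading back with dtype=str keeps the identifier as text inside pandas.
buffer.seek(0)
round_trip = pd.read_csv(buffer, dtype={"reference_id": str})
print(round_trip["reference_id"].iloc[0])
```

The scientific notation only appears once Excel opens the file and guesses a number format for the column.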
I am asking a follow up question from here (File downloaded is different from what is on server).
I have datetime values in a CSV file which are getting reformatted.
My CSV has data like this: 1-Jan-15,1-Feb-15,1-Mar-15.
But the reformatted CSV looks like Jan-15, Feb-15, Mar-15...
Is there any way to stop this automatic reformatting of the data?
Instead of opening the .csv file directly in Excel, open a new blank workbook in Excel and use Get Data from Text (under the Data tab of the Ribbon) to import the .csv file.
This will open the Text Import Wizard, which has 3 total screens.
On Step 1, choose Delimited.
On Step 2, choose Comma.
And on Step 3, highlight all columns with dates in them and choose Text.
Click Finish.
The General format (which is also what happens by default if you open a .csv file in Excel directly) will recognize the dates as being dates and reformat them according to your locale settings. By instructing Excel to interpret those columns as text, they will not be recognized as dates and therefore left as they are.