I need to read an Excel file into a DataFrame without changing any date, time, or float formats. This works fine if I convert the Excel file to CSV and read it with read_csv().
For example:
import pandas as pd
import numpy as np
#code for reading excel
df=pd.read_excel("605.xlsx",parse_dates=False,sheet_name="Group 1",keep_default_na=False,dtype=str)
print("df_excel:", df)
#code for reading csv
df1=pd.read_csv("Group 1.csv",parse_dates=False,dtype=str,na_filter = False)
print("df_csv:",df1)
output:
In the code above, parse_dates=False works as expected when reading the CSV file, but it has no effect in read_excel().
Expected output:
I need the exact Excel data in a DataFrame, without any change to the date and time formats.
From the Pandas docs on the parse_dates parameter for read_excel():
If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. If you don`t want to parse some cells as date just change their type in Excel to “Text”.
You could try this:
df = pd.read_excel("605.xlsx", parse_dates=False, sheet_name="Group 1", keep_default_na=False, dtype=str, converters={'as_at_date': str})
Explicitly converting the date column to string might help.
Related
Code:
import os
import openpyxl
from openpyxl.utils.dataframe import dataframe_to_rows

# These two names were not defined in the original post; the values are assumed.
excel_file_path = os.path.expanduser('~/Documents/test.xlsm')
include_index = False

def write_pandas_dataframe_to_excel(df):
    book = openpyxl.load_workbook(excel_file_path, read_only=False, keep_vba=True)
    sheet = book['Database']
    # Delete all existing rows (header included) so that we can replace them with the contents of our pandas dataframe
    sheet.delete_rows(1, sheet.max_row)
    # Write the header and values from the pandas dataframe to the sheet
    for r in dataframe_to_rows(df, index=include_index, header=True):
        sheet.append(r)
    for row in sheet.iter_rows(min_row=2):  # skip the header
        cell = row[0]  # column A is a Date field
        cell.number_format = 'YYYY-mm-dd'
    book.save(excel_file_path)
    book.close()
Expected Result: I open test.xlsm, and all dates in column A are already in the format YYYY-mm-dd.
Actual Result: The Python code runs without any issues, but the YYYY-mm-dd format only shows up after I open the Excel file, select each cell manually and hit 'Return' in the formula bar.
Is there a way for my specified date format to be applied through the Python code, rather than having to apply it manually by opening Excel, selecting each cell, going to the formula bar and hitting 'Return' every time?
Thanks in advance!
I've figured out the answer. Put simply, the date was being written to Excel as a string, and that was causing the issue.
In the pandas dataframe that holds my data, I had used strftime to format the date, which turned the column into strings (a generic 'object' dtype). I removed that call so the column keeps its datetime values, and now they are written to Excel as pandas Timestamp objects rather than strings.
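To illustrate the difference described in this answer, here is a minimal sketch, assuming openpyxl is installed. The helper write_dates, the temporary file name and the sample values are made up for the demonstration; it writes one real datetime and one date-like string under the same number format and then reads both back.

```python
import datetime as dt
import os
import tempfile

import openpyxl


def write_dates(path, values):
    """Write values to column A under a header and apply a date number format."""
    book = openpyxl.Workbook()
    sheet = book.active
    sheet.append(["Date"])
    for value in values:
        sheet.append([value])
    # Apply the same number format to every data cell in column A.
    for row in sheet.iter_rows(min_row=2, max_col=1):
        row[0].number_format = "yyyy-mm-dd"
    book.save(path)


# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "dates_demo.xlsx")
write_dates(path, [dt.datetime(2020, 1, 31), "2020-01-31"])

reloaded = openpyxl.load_workbook(path).active
real_date = reloaded["A2"].value
text_date = reloaded["A3"].value
print(type(real_date).__name__, type(text_date).__name__)
```

The first cell comes back as a datetime, so the number format can act on it. The second is still plain text, and a number format has nothing to apply to until the cell is re-entered as a date.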
The code below simply reads in an Excel file, stores it as a df and writes the df back out to an Excel file. When I open the output file in Excel, the columns (dates, numbers) are not the same as before: some are text, some are numbers, etc.
import pandas as pd
df = pd.read_csv("test.csv", encoding = "ISO-8859-1", dtype=object)
writer = pd.ExcelWriter('outputt.xlsx', engine='xlsxwriter')
df.to_excel(writer, index = False, sheet_name='Sheet1') #drop the index
writer.close()  # writer.save() was removed in pandas 2.0
Is there a way to have the column types (as defined in the initial file) be preserved or revert back to the datatypes when the file was read in?
You are reading in a CSV file, which is certainly not the same as an Excel file. You can open a CSV file with Excel on Windows, but the encoding is different when the file is saved. You can certainly format cells according to the XlsxWriter specifications.
However, it is important to note that xlsxwriter cannot format any cells that already have a format, such as the header or index, or dates and datetime objects. Multiple datatypes in a single column are also problematic, as pandas will then default that column to object. A column of type "object" has its output type inferred, so again it will be assigned dynamically as a "best guess".
When you read your CSV in, you should specify the types if you want them to be maintained. Otherwise you are leaving this to pandas, which infers each column's type from the values it reads.
Change the line where you read in to include dtypes and they will be preserved in output. I am going to assume your columns have headers "ColumnA", "ColumnB", "ColumnC":
import pandas as pd
from datetime import datetime
df = pd.read_csv("test.csv", encoding="ISO-8859-1", dtype={'ColumnA': int,
                                                           'ColumnB': float,
                                                           'ColumnC': str})
Let's use "ColumnC" as a column example of dates. I like to first read in dates as a string, then ensure the formatting I desire. So you could add this:
df['ColumnC'] = pd.to_datetime(df['ColumnC']).dt.strftime('%m/%d/%Y')
# the column now holds text that looks like 06/08/2016; other format codes are available for dt.strftime
This will ensure specific types in output. Further formatting can be applied such as the number of decimals in a float, including percents in output by following guides here.
My advice if you have columns with multiple data types: Don't. This is unorganized and makes use cases much more complex for downstream applications. Spend more time organizing data on the front end so you have less headache on the back end.
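As a rough sketch of the dtype approach from this answer: the snippet below reads a small in-memory CSV instead of test.csv, so the sample rows are invented, and the column names are the same assumed "ColumnA", "ColumnB", "ColumnC". It is a variation that parses the date column into real datetimes rather than formatted text, so that a writer can store it as a date.

```python
import io

import pandas as pd

# Invented sample data standing in for test.csv.
csv_text = (
    "ColumnA,ColumnB,ColumnC\n"
    "1,2.5,06/08/2016\n"
    "2,3.75,12/31/2017\n"
)

# Declare the type of every column up front instead of letting pandas infer it.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ColumnA": int, "ColumnB": float, "ColumnC": str},
)

# Parse the date strings with an explicit format so month and day cannot be swapped.
df["ColumnC"] = pd.to_datetime(df["ColumnC"], format="%m/%d/%Y")

print(df.dtypes)
```

Reading the dates as strings first and parsing them with an explicit format avoids any guessing about day/month order.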
When I run the following code
import glob,os
import pandas as pd
dirpath = os.getcwd()
inputdirectory = dirpath
for xls_file in glob.glob(os.path.join(inputdirectory, "*.xls*")):
    data_xls = pd.read_excel(xls_file, sheet_name=0, index_col=None)
    csv_file = os.path.splitext(xls_file)[0] + ".csv"
    data_xls.to_csv(csv_file, encoding='utf-8', index=False)
It will convert all xls files in the folder into CSV as I want.
HOWEVER, on doing so, any dates such as 20/12/2018 will be converted to 20/12/2018 00:00:00 which is causing major issues with later data processing.
What is going wrong with this?
Nothing is "going wrong" per se. You simply need to provide a custom date_format to df.to_csv:
date_format : string, default None
Format string for datetime objects
In your case that would be
data_xls.to_csv(csv_file, encoding='utf-8', index=False, date_format='%d/%m/%Y')
This will fix the way the raw data is saved to the file. If you open the file in Excel, you may still see the full format, because Excel guesses cell formats from their content. You will need to right-click the column and select another cell format; there is nothing that pandas or Python can do about that (as long as you are using to_csv and not to_excel).
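Here is a small self-contained sketch of date_format in action. The DataFrame is invented for the example, and the CSV is returned as a string rather than written to a file so the result can be inspected directly.

```python
import pandas as pd

# Invented sample data with a real datetime column.
df = pd.DataFrame(
    {
        "name": ["walla", "cool"],
        "date": pd.to_datetime(["2018-12-20", "2019-01-05"]),
    }
)

# With no path given, to_csv returns the CSV text instead of writing a file.
formatted_csv = df.to_csv(index=False, date_format="%d/%m/%Y")
print(formatted_csv)
```

Every datetime value in the output is written as day/month/year text, with no time component.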
If the answers above still don't work, try this:
xls_data['date'] = pd.to_datetime(xls_data['date'], format="%d/%m/%Y")
xls_data['date'] = xls_data['date'].dt.date
The original xls file actually stores these fields as datetimes.
When you open it in Excel, you see them formatted the way Excel thinks you want to see them, based on your settings, OS locale, etc.
When Python reads the file, the date cells become datetime objects.
CSV files are basically just text; they cannot hold datetime objects.
When Python needs to write a datetime object to a text file, it writes the full text representation.
So you have two options:
Change the date column in the original file to the text type.
or the better option:
Use Python to iterate over these fields and convert them to the text format you would like to see in the CSV.
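The second option could look roughly like the sketch below. The helper name datetimes_to_text and the sample data are made up for illustration; the idea is simply to turn every datetime column into text before calling to_csv.

```python
import pandas as pd


def datetimes_to_text(frame, fmt="%d/%m/%Y"):
    """Return a copy of frame with every datetime column rendered as text."""
    out = frame.copy()
    for column in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[column]):
            out[column] = out[column].dt.strftime(fmt)
    return out


# Invented sample data standing in for the result of pd.read_excel(...).
xls_data = pd.DataFrame(
    {
        "name": ["walla", "cool"],
        "date": pd.to_datetime(["1988-12-10", "1999-12-10"]),
    }
)

text_data = datetimes_to_text(xls_data)
print(text_data.to_csv(index=False))
```

Working on a copy keeps the original datetime column available for any date arithmetic that still has to happen before the export.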
I just tried to reproduce your issue with no success:
>>> import pandas as pd
>>> xls_data = pd.read_excel('test.xls', sheet_name=0, index_col=None)
>>> xls_data
    name       date
0  walla 1988-12-10
1   cool 1999-12-10
>>> xls_data.to_csv(encoding='utf-8', index=False)
'name,date\nwalla,1988-12-10\ncool,1999-12-10\n'
P.S. Any time you deal with datetime objects, you should test the result to see whether anything changes based on your PC's locale settings.
I have a CSV file with a column named DOB, but when I try to change its data type to a date type, it gives an error.
Here is the code:
b['DOB'] = pd.to_datetime(b['DOB'], format='%Y-%m-%d')
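When you read the CSV in pandas, ask it to parse that column: pd.read_csv(file_name, parse_dates=['DOB'])
parse_dates=['DOB'] converts the DOB column to datetimes while reading. Note that parse_dates=True on its own only tries to parse the index, not the columns.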
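The question does not show the actual error, so the following is only a guess at one common cause: a value in DOB that does not match the format. The sample rows are invented. The sketch shows both parsing at read time and converting afterwards with errors="coerce", which turns unparseable values into NaT instead of raising.

```python
import io

import pandas as pd

# Invented sample data; the second frame contains one value that is not a date.
clean_csv = "name,DOB\nAnn,1990-04-12\nCara,1985-11-30\n"
messy_csv = "name,DOB\nAnn,1990-04-12\nBob,not available\nCara,1985-11-30\n"

# Option 1: parse the column while reading.
a = pd.read_csv(io.StringIO(clean_csv), parse_dates=["DOB"])

# Option 2: read as text, then convert; bad values become NaT rather than an error.
b = pd.read_csv(io.StringIO(messy_csv))
b["DOB"] = pd.to_datetime(b["DOB"], format="%Y-%m-%d", errors="coerce")

print(a.dtypes)
print(b[b["DOB"].isna()])  # rows whose DOB could not be parsed
```

Listing the rows that ended up as NaT is a quick way to find the values that caused the original error.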
I have an xlsx file with multiple tabs, each tab has a Date column in the format of MM/DD/YYYY
I read each tab into a pandas dataframe, apply some operations to it, and then write the dataframe back out in two formats: xlsx and csv.
In the xlsx file, the Date column (the index) ends up in a format that has the time attached: 1/1/2013 12:00:00 AM, while the Date column in the csv file remains unchanged: MM/DD/YYYY.
How can I make sure the Date column in the xlsx file maintains the same format MM/DD/YYYY?
You are seeing the datetime in the default pandas Excel datetime format. However, you can easily set it to whatever you want:
# Set the default datetime and/or date formats.
writer = pd.ExcelWriter("pandas_datetime.xlsx",
                        engine='xlsxwriter',
                        date_format='mm/dd/yyyy',
                        datetime_format='mm/dd/yyyy')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Close the writer to save the file.
writer.close()
See a full example in the XlsxWriter docs: Example: Pandas Excel output with date times.
You can also convert the dates to strings. If the dates are in the index, you could do this (the output file name is just a placeholder):
df.set_index(df.index.map(lambda x: x.strftime('%m/%d/%Y'))).to_excel("output.xlsx")
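For what it's worth, here is a small sketch of the same idea using DatetimeIndex.strftime instead of a lambda. The sample frame is invented, and the Excel write is left out so the snippet runs without an Excel engine installed.

```python
import pandas as pd

# Invented sample data with dates in the index.
df = pd.DataFrame(
    {"value": [10, 20]},
    index=pd.to_datetime(["2013-01-01", "2013-02-15"]),
)
df.index.name = "Date"

# Replace the datetime index with its MM/DD/YYYY text representation.
text_index_df = df.copy()
text_index_df.index = df.index.strftime("%m/%d/%Y")
text_index_df.index.name = "Date"

print(text_index_df)
```

Keep in mind that the index is then plain text, so Excel will treat those cells as strings rather than dates.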