How to treat date as plain text with pandas?

I use pandas to read a .csv file and then save it as an .xls file. The code is as follows:
import pandas as pd
df = pd.read_csv('filename.csv', encoding='GB18030')
print(df)
df.to_excel('filename.xls')
One column contains dates like '2020/7/12'. It looks like pandas recognizes them as dates and outputs '2020-07-12' automatically. I don't want this column, or any other column like it, to be reformatted; I'd like all data to stay unchanged as plain text.
This conversion seems to happen in read_csv(), because print(df) already outputs YYYY-MM-DD before to_excel() is called.
I used df.info() to check the data type of that column; it is object. Then I added the argument dtype=pd.StringDtype() to read_csv(), and it didn't help.
The file contains Chinese characters, so I set the encoding to GB18030; I don't know if this matters.

My experience with pd.read_csv indicates that:
Only columns convertible to int or float are converted to the
respective types by default.
"Date-like" strings are still read as strings (the column type in
the resulting DataFrame is actually object).
If you want read_csv to convert such a column to datetime type, you
should pass the parse_dates parameter, specifying a list of columns to be
parsed as dates. Since you didn't do that, no source column should be
converted to datetime type.
To check this, after you read the file, run file.info() and check
the type of the column in question.
So if the respective Excel column is of Date type, then the
conversion is probably caused by to_excel.
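A minimal sketch of the behaviour described above, using an in-memory CSV instead of a file. The sample data and variable names are made up for illustration, and dtype=str is shown as one possible way to keep every column as plain text:

```python
import io

import pandas as pd

# Hypothetical sample data; the real file would be read from disk instead.
csv_text = "DateBought,Id,Value\n2020/7/12,1031,500.15\n2020/8/18,1032,700.40\n"

# Default behaviour: the date-like column is not converted to datetime.
plain = pd.read_csv(io.StringIO(csv_text))

# Explicit request: only now is the column parsed as datetime.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["DateBought"])

# dtype=str keeps every column, including the numeric ones, as text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(plain.dtypes)
print(parsed.dtypes)
print(as_text)
```

If the first print already shows a datetime column in your environment, the conversion is happening somewhere other than in the default read_csv call.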
One more remark concerning variable names:
What you have read using read_csv is a DataFrame, not a file.
The actual file is the source object from which you read the content,
but here you passed only the file name.
So don't use names like file for the resulting DataFrame, as this
is misleading. It is much better to use e.g. df.
Edit following a comment as of 05:58Z
To fully check what you wrote in your comment, I created
the following CSV file:
DateBought,Id,Value
2020/7/12,1031,500.15
2020/8/18,1032,700.40
2020/10/16,1033,452.17
I ran: df = pd.read_csv('Input.csv') and then print(df), getting:
   DateBought    Id   Value
0   2020/7/12  1031  500.15
1   2020/8/18  1032  700.40
2  2020/10/16  1033  452.17
So, at the Pandas level, no format conversion occurred in the DateBought
column. Both remaining columns contain numeric content, so they were
silently converted to int64 and float64, but DateBought remained as object.
Then I saved this df to an Excel file, running: df.to_excel('Output.xls')
and opened it with Excel. (The screenshot of the sheet is not reproduced here.)
So no data type conversion took place at the Excel level either.
To see the actual data type of cell B2 (the first DateBought value),
I clicked on this cell and pressed Ctrl-1, to display cell formatting.
The format is General (not Date), just as I expected.
Maybe you have an outdated version of the software?
I use Python v. 3.8.2 and Pandas v. 1.0.3.
Another detail to check: look at your code after pd.read_csv.
Maybe somewhere you put an instruction like df.DateBought = pd.to_datetime(df.DateBought) (explicit type conversion)?
Or at least a format conversion. Note that in my environment
there was absolutely no change in the format of the DateBought column.

Problem solved. I double-checked my .csv file by opening it with Notepad: the data is 2020-07-12, which is displayed as 2020/7/12 in Office. It turns out that Office reformatted the date to yyyy/m/d (based on the regional settings). I'm developing a tool to process and import data into a DB for my company; we used to do this work manually by copy and paste, so no one noticed this issue. Thanks to @Valdi_Bo for his investigation and patience.

Related

Python save CSV without changing ID to an integer

I have a df in Python with an ID column - those IDs can be a mix of numbers and letters, or solely numbers. Eg:
ID
00028D9D1
00027B98F
000275457
When I save this df with pandas to_csv and open the resulting csv file (or share it with others), the IDs that contain letters are kept as is and treated as text, but the IDs that consist solely of digits are treated as integers and automatically formatted that way. For example, I would see this in my csv file after saving:
ID
00028D9D1
00027B98F
275457
Is there any way to disable this automatic treatment of integers, which leads to different formatting? The dtype of this column is object, so I assumed all values would be saved in the same format.

According to RFC 4180, CSV files do not contain any type information, so it is solely the responsibility of the application to correctly interpret the contents of the file. From what I read in your question,
I have a df in Python with an ID column - those IDs can be a mix of numbers and letters, or solely numbers.
and as far as I understand your description, you'll have something like this:
input.csv
ID
00028D9D1
00027B98F
000275457
script
import pandas as pd
df = pd.read_csv('input.csv')
print(df)
print(df['ID'].dtype)
df.to_csv('output.csv', index=False)
console output
ID
0 00028D9D1
1 00027B98F
2 000275457
object
output.csv
ID
00028D9D1
00027B98F
000275457
In other words, use the right tool to "open up" the CSV file you create.
As I observe on Windows, spreadsheet applications like Excel or OpenOffice/LibreOffice register themselves for the .csv file extension, so just opening a CSV leads to a very generic interpretation of the data: cells that can be converted into a number without errors are treated as numeric cells, regardless of their column.
One application that lets you view the actual contents of a text file is Windows Notepad, for example, but as a programmer you probably know better alternatives.
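The point can also be checked from Python itself, without any spreadsheet application. The sketch below writes the IDs to an in-memory buffer (a stand-in for output.csv) and prints the raw text. It also shows, as a side note that is not part of the original question, that reading a digits-only column back with pandas loses the leading zeros unless dtype is set to str:

```python
import io

import pandas as pd

df = pd.DataFrame({"ID": ["00028D9D1", "00027B98F", "000275457"]})

# Write to an in-memory buffer and look at the raw CSV text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
raw_text = buffer.getvalue()
print(raw_text)  # the leading zeros are present in the text itself

# Hypothetical digits-only input: pandas infers integers here by default.
numeric_only = "ID\n000275457\n000123456\n"
inferred = pd.read_csv(io.StringIO(numeric_only))
preserved = pd.read_csv(io.StringIO(numeric_only), dtype={"ID": str})

print(inferred["ID"].tolist())
print(preserved["ID"].tolist())
```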

CSV cannot be interpreted by numeric values

(This is a mix between a code issue and a 'user' issue, but since I suspect the issue is code, I opted to post on StackOverflow instead of SuperUser.)
I generated a .csv file with the pandas.DataFrame.to_csv() method. The file consists of 2 columns: one is a label (text) and the other is a numeric value called accuracy (float). The delimiter used to separate columns is a comma (,) and all float values are stored with a dot as the decimal separator, like this: 0.9438245862
Even though this column is saved as float, Excel and Google Sheets infer its type as text. And when I try to format this column as a number, they ignore the "0." and return a very large value instead of decimals, like:
(text) 0.9438245862 => (number) 9438245862,00
I double-checked my .csv file by reimporting it with pandas.read_csv() and printing dataframe.dtypes; the column is imported as float successfully.
I'd be grateful for some guidance on what I am missing.
Thanks,

By itself, the csv file should be correct. Both you and Pandas know what the delimiter and the floating point format are. But Excel might not agree with you, depending on your locale. A simple way to make sure is to create a tiny Excel sheet whose first row contains one text value and one floating point value. You then export the sheet as csv and check which delimiter and floating point format it uses.
AFAIK, it is much easier to change your Python code to follow what your Excel expects than to try to explain to Excel that the format of CSV files can vary...
I know that you can change the delimiter and floating point format of the current locale on a Windows system. However, it is a global setting...

A short example of data would be most useful here. Otherwise we have no idea what you're actually writing/reading. But I'll hazard a guess based on the information you've provided.
The pandas dataframe will have column names. These column names will be text. Unless you tell Excel/Sheets to use the first row as the column name, it will have to treat the column as text. If this isn't the case, could you perhaps save the head of the dataframe to a csv, check it in a text editor, and see how Excel/Sheets imports it. Then include those five rows and two columns in your follow up.

The code is not necessarily the issue here; it is rather a combination of various factors. I am assuming that your computer is not using the dot character as a decimal separator, due to your language settings (for example French, Dutch, etc.). Instead, your computer (and thus also Excel) is likely using a comma as a decimal separator.
If you want to open the results of your analysis later in Excel with little to no changes, you can either change how Excel works or change how you store the data in the CSV file.
Choosing the latter, you can specify the decimal character for the df.to_csv method with the "decimal" keyword. Remember that you then also have to specify the decimal character when importing the data (if you want to read it again).
Continuing with the approach of adapting your Python code, you can use the following snippet to change how you write the dataframe to a csv file:
import pandas as pd
# ... some transformations here ...
df.to_csv('myfile.csv', decimal=',')
If you, then, want to read that output file back in with Python (using Pandas), you can use the following:
import pandas as pd
df = pd.read_csv('myfile.csv', decimal=',')
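A round-trip sketch of the same idea on an in-memory buffer. The sample values are made up, and the switch to a semicolon as the field separator is an assumption added here (it is not in the answer above), so that decimal commas cannot be mistaken for field separators:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real results.
df = pd.DataFrame(
    {"label": ["model_a", "model_b"], "accuracy": [0.9438245862, 0.8712]}
)

# Write with a comma as decimal mark and a semicolon between fields.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", decimal=",", index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back with the same settings to get floats again.
restored = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
print(restored.dtypes)
```

Whether Excel then interprets the file correctly still depends on the regional settings of the machine that opens it.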

Exporting Pandas DataFrame cells directly to excel/csv (python)

I have a Pandas DataFrame that has sports records in it. All of them look like this: "1-2-0", "17-12-1", etc., for wins, losses and ties. When I export this the records come up in different date formats within Excel. Some will come up as "12-May", others as "9/5/2001", and others will come up as I want them to.
The DataFrame that I want to export is named 'x' and this is the command I'm currently using. I tried it without the date_format part and it gave the same response in Excel.
x.to_csv(r'C:\Users\B\Desktop\nba.csv', date_format = '%s')
I also tried using to_excel, but I kept getting errors while trying to export. Any ideas? I was thinking I am doing the date_format part wrong, but I don't know how to transfer the string of text directly instead of it getting automatically switched to a date.
Thanks!

I don't think it's a Python issue; it looks like Excel is auto-detecting dates in your data.
But see below for how to convert your scores to strings.
Try this:
import pandas as pd
df = pd.DataFrame({"lakers" : ["10-0-1"],"celtics" : ["11-1-3"]})
print(df.head())
Here is the dataframe with made-up data:
lakers celtics
0 10-0-1 11-1-3
Convert the dataframe to string:
df = df.astype(str)
and save the csv:
df.to_csv('nba.csv')
Opening it in LibreOffice gives me two columns with the (made-up) scores.
You might have an Excel usage issue here. In line with my comment below, you can change any column in Excel to lots of different formats. In this case I believe Excel is auto-detecting a date format, incorrectly. Select your columns of data, right-click, select the format option and change it to anything else, like 'General'.

How to run parsing logic of Pandas read_csv on custom data?

read_csv contains a lot of parsing logic to detect and convert CSV strings to numerical and datetime Python values. My question is: is there a way to apply the same conversions to a DataFrame that contains columns with string data, where the DataFrame is not stored in a CSV file but comes from a different (unparsed) source? So only an in-memory DataFrame object is available.
Saving such a DataFrame to a CSV file and reading it back would perform the conversion, but this looks very inefficient to me.

If you have e.g. a column of string type that actually contains dates
(e.g. yyyy-mm-dd), you can use pd.to_datetime() to convert it to Timestamp.
Assuming that the column name is SomeDate, you can call:
df.SomeDate = pd.to_datetime(df.SomeDate)
Another option is to apply your own conversion function to any of your columns
(search the Web for a description of apply).
You didn't give any details, so I can give only this very general advice.
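A short sketch of such per-column conversions on an in-memory DataFrame. The column names and values are made up, and pd.to_numeric is added here as the numeric counterpart of pd.to_datetime; this does not reproduce the full inference logic of read_csv:

```python
import pandas as pd

# Hypothetical unparsed data: every column holds strings.
df = pd.DataFrame(
    {
        "SomeDate": ["2020-07-12", "2020-08-18"],
        "Count": ["3", "14"],
        "Price": ["500.15", "700.40"],
    }
)

# Convert each column explicitly to the type it is known to hold.
df["SomeDate"] = pd.to_datetime(df["SomeDate"])
df["Count"] = pd.to_numeric(df["Count"])
df["Price"] = pd.to_numeric(df["Price"])

print(df.dtypes)
```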

reading dat files with pandas by format string

Reading a fixed-width .dat file in pandas is not very complicated using pd.read_csv('file.dat', sep='\s+') or the pd.read_fwf('file.dat', widths=[7, ..]) method. But the file also contains a format string like this:
Format = (i7,1x,i7,1x,i2,1x,i2,1x,i2,1x,f5.1,1x,i4,1x,3i,1x,f4.1,1x,i1,1x,f4.1,1x,i3,1x,i4,1x,i4,1x,i3,1x,i4,2x,i1)
Looking at the content of the columns, I assume the character indicates the data type (i->int, f->float, x->separator) and the number is obviously the width of the column. Is this a standard notation? Is there a more pythonic way to read data files by just passing this format string, so that scripts are safe against format changes in the data file?
I noticed the colspecs argument of the read_fwf() function, but it takes a list of pairs (int, int), not the type of format string that is given.
First rows of the data file: (not reproduced here)

This is a pretty standard way to indicate a format; the notation looks like a Fortran FORMAT specification (iN for an integer of width N, fW.D for a float, Nx for N skipped characters) rather than the C printf convention. The format is only really important if you are trying to write the file in an identical manner. For the purpose of reading it all into pandas you don't really care. If you want control over the specific data type of each column as you read it in, use the dtype parameter. In the example below, column 'a' is read as a 64-bit float and 'b' as a 32-bit int.
import numpy as np
import pandas as pd
my_dtypes = {'a': np.float64, 'b': np.int32}
df = pd.read_csv('file.dat', sep=r'\s+', dtype=my_dtypes)
You don't have to specify every column, just the ones that you want. It's likely that pandas figured out most of this already though by default. After your call to read_csv() try
df = pd.read_csv(....)
print(df.dtypes)
this will show you the data type of each of your columns.
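If the column positions should nevertheless be derived from the format string, a small parser could translate it into the colspecs list that read_fwf accepts. The following is only a sketch under several assumptions: the string uses Fortran-style descriptors of the form iN, fW.D and Nx, the helper name format_to_colspecs is hypothetical, and the shortened format string and sample rows are made up (the real file's rows are not available here). Descriptors outside those three forms, such as the 3i in the question, are rejected rather than guessed at:

```python
import io
import re

import pandas as pd


def format_to_colspecs(format_string):
    """Translate descriptors like 'i7', 'f5.1', '1x' into (start, end) pairs.

    Hypothetical helper: only iN, fW.D and Nx descriptors are supported.
    """
    colspecs = []
    kinds = []
    position = 0
    for token in format_string.strip("() ").split(","):
        token = token.strip().lower()
        skip = re.fullmatch(r"(\d+)x", token)
        field = re.fullmatch(r"([if])(\d+)(?:\.\d+)?", token)
        if skip:
            # Nx: skip N characters without creating a column.
            position += int(skip.group(1))
        elif field:
            width = int(field.group(2))
            colspecs.append((position, position + width))
            kinds.append(field.group(1))
            position += width
        else:
            raise ValueError(f"unsupported descriptor: {token!r}")
    return colspecs, kinds


# Made-up, shortened format string and matching sample rows.
colspecs, kinds = format_to_colspecs("(i7,1x,i2,1x,f5.1)")
sample = "   1031 12 500.2\n   1032  8  70.4\n"
df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, header=None)

print(colspecs)
print(df)
```

The kinds list ('i' or 'f' per column) could be mapped to a dtype dictionary if explicit types are wanted; here pandas is left to infer them.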
