Context
I have a pandas dataframe which I need to save to disk and re-load later. Because the file saved on disk needs to be human-readable, I'm currently saving the dataframe as a CSV. The data includes values that are integers, booleans, null/None, timestamps, and strings.
Problem
Some of the string values are phone numbers, formatted as "+12025550140", but these are being converted to integers by the round trip (dataframe -> CSV -> dataframe). I need them to stay as strings.
I've changed the CSV writing portion to use quoting=csv.QUOTE_NONNUMERIC, which preserves the format of the phone numbers into the CSV, but when they are read back into a dataframe they are converted to integers. If I tell the CSV reading portion to also use quoting=csv.QUOTE_NONNUMERIC, then they are converted to floats.
How do I enforce that quoted fields are loaded as strings? Or, is there any other way to enforce that the full process is type-safe?
Constraints and non-constraints
The file saved to disk must be easy to manually edit, preferably with a plain text editor. I have full control and ownership over the code which generates the CSV file. A different file format can be used if it is easy to apply manual edits.
Code
Writing to disk:
import csv
df = get_df() # real function has been replaced
df.to_csv(query_file_path, index=False, quoting=csv.QUOTE_NONNUMERIC)
Reading from disk:
import pandas as pd
CSV_NA_VALS = pd._libs.parsers.STR_NA_VALUES
CSV_NA_VALS.remove("")
df = pd.read_csv(query_file_path, na_values=CSV_NA_VALS)
df = df.replace([""], [None])
Versions
Python 3.9.5
pandas==1.4.0
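One possible way to keep the phone numbers as strings is to pass an explicit dtype for that column to read_csv, so pandas skips type inference for it. The sketch below is only an illustration: the column names id and phone and the sample values are assumptions, since the question does not show the real dataframe.

```python
import csv
import io

import pandas as pd

# Made-up sample data; the column names are assumptions, not from the question.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "phone": ["+12025550140", "+12025550141"],
    }
)

# Write to an in-memory buffer instead of a real file to keep the sketch self-contained.
buf = io.StringIO()
df.to_csv(buf, index=False, quoting=csv.QUOTE_NONNUMERIC)
buf.seek(0)

# An explicit dtype for the phone column turns off type inference for that column only.
loaded = pd.read_csv(buf, dtype={"phone": str})
phones = loaded["phone"].tolist()
print(phones)
print(loaded.dtypes)
```

The other columns are still inferred as usual, so a dtype entry is only needed for columns whose text happens to look numeric.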
Related
From Python, I want to export a dataframe to CSV format.
The dataframe contains two columns like this:
So when I write this:
df['NAME'] = df['NAME'].astype(str) # or .astype('string')
df.to_csv('output.csv',index=False,sep=';')
The CSV output, opened in Excel, looks like this:
Excel reads the value "MAY8218" as a date ("may-18"), while I want it to be read as "MAY8218".
I've tried many ways but none of them is working. I don't want an alternative like putting quotation marks to the left and the right of the value.
Thanks.
If you want to export the dataframe to use it in excel just export it as xlsx. It works for me and maintains the value as string in the original format.
df.to_excel('output.xlsx',index=False)
The CSV format is a text format. The file contains no hint for the type of the field. The problem is that Excel has the worst possible support for CSV files: it assumes that CSV files always use its own conventions when you try to read one. In short, one Excel implementation can only read correctly what it has written...
That means you cannot prevent Excel from interpreting the CSV data the way it wants, at least when you open a CSV file. Fortunately you have other options:
Import the CSV file instead of opening it. This way you get options to configure how the file should be processed.
Use LibreOffice Calc for processing CSV files. LibreOffice is a little behind Microsoft Office on most points except for CSV file handling, where it has excellent support.
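The point that a CSV file carries no type information can be seen with a small sketch using only the standard library. The row values are made up for illustration; the ; delimiter matches the one used in the question.

```python
import csv
import io

# Write one row containing a string, an integer and a float.
buf = io.StringIO()
csv.writer(buf, delimiter=";").writerow(["MAY8218", 42, 3.5])
raw_text = buf.getvalue()

# Reading it back yields only strings: the original types are gone,
# and every reader (Excel, pandas, ...) has to guess them again.
row = next(csv.reader(io.StringIO(raw_text), delimiter=";"))
print(repr(raw_text))
print(row)
```

Whatever opens the file decides on its own how to interpret each field, which is why Excel is free to treat MAY8218 as a date.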
I am attempting to convert MapInfo .dat files into .csv files using Python. So far, I have found the easiest way to do this is through xlwings and pandas.
When I do this (below code) I get a mostly correct .csv file. The only issue is that some columns are appearing as symbols/gibberish instead of their real values. I know this because I also have the correct data on hand, exported from MapInfo.
import xlwings as xw
import pandas as pd
app = xw.App(visible=False)
tracker = app.books.open('./cable.dat')
last_row = xw.Range('A1').current_region.last_cell.row
data = xw.Range("A1:AE" + str(last_row))
test_dataframe = data.options(pd.DataFrame, header=True).value
test_dataframe.columns = list(schema)
test_dataframe.to_csv('./output.csv')
When I compare to the real data, I can see that the symbols do actually map to the correct numbers (meaning that 1 = Â?, 2 = #, 3 = #, etc.).
Below is the first part of the 'dictionary' as to how they map:
My question is this:
Is there an encoding that I can use to turn these series of symbols into their correct representation? The floats aren't the only column affected by this, but they are the most important to my data.
Any help is appreciated.
import pandas as pd
from simpledbf import Dbf5
dbf = Dbf5('path/filename.dat')
df = dbf.to_dataframe()
MapInfo .dat files are dBase files underneath (see https://www.loc.gov/preservation/digital/formats/fdd/fdd000324.shtml), so just use the method above.
Then just output the data:
df.to_csv('outpath/filename.csv')
EDIT
If I understand correctly, you are using xlwings to load the .dat file into Excel, and then reading it into a pandas dataframe to export it to a CSV file.
Somewhere along the way, it seems that some binary data is not interpreted, or is interpreted incorrectly, and is then written as text to your CSV file.
directly read dBase file
My first suggestion would be to try to read the input file directly into Python without the use of an excel instance.
According to Wikipedia, MapInfo .dat files are actually dBase III files. You can parse these in Python using a library like dbfread.
inspect data before writing to csv
Secondly, I would inspect the 'corrupted' columns in python instead of immediately writing them to disk.
Either something is going wrong in the Excel import and the data of these columns gets imported as text instead of some binary number format,
or this data is correctly read into memory as a byte array (instead of a float), and when you write it to CSV, it gets dumped byte-wise to disk instead of being interpreted as a number and converted to a text representation.
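A rough sketch of such an inspection, assuming the data is already in a pandas dataframe. The helper name columns_holding_bytes and the sample frame are hypothetical and only stand in for the real data:

```python
import pandas as pd


def columns_holding_bytes(df):
    """Return the names of columns in which at least one value is a bytes object."""
    return [
        name
        for name in df.columns
        if df[name].map(lambda v: isinstance(v, (bytes, bytearray))).any()
    ]


# Made-up frame standing in for the data read from the .dat file.
sample = pd.DataFrame(
    {
        "length": [b"\x00\x00\xf8?", b"\x00\x00\x00@"],
        "name": ["a", "b"],
    }
)

print(sample.dtypes)  # object dtype is a hint that a column is not numeric
bad = columns_holding_bytes(sample)
print(bad)
```

If a column shows up here, the problem is already present in memory and is not caused by to_csv.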
note
Small remark about your initial question regarding mapping text to numbers:
It will probably not be possible to create a straightforward map of characters to numbers:
These numbers could have any encoding and might not be stored as decimal text values, as you now seem to assume.
These text representations are just a decoding using some character encoding (UTF-8, UTF-16). For example, in UTF-8 several bytes might map to one character, and the question marks or squares you see might indicate that one or more characters could not be decoded.
In any case, you will lose information if you start from the text; you must start from the binary data to decode it.
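To illustrate why the text is a lossy starting point, here is a small sketch. It assumes, purely for the sake of the example, that a field holds an 8-byte little-endian double; the real layout of the file may differ.

```python
import struct

# Hypothetical raw field: the 8 bytes of the float 1.5 as a little-endian double.
raw = struct.pack("<d", 1.5)

# Decoding those bytes as text gives unreadable characters,
# with a replacement character where a byte is not valid UTF-8.
as_text = raw.decode("utf-8", errors="replace")

# Encoding the text again does not give back the original bytes.
round_tripped = as_text.encode("utf-8")

# Interpreting the same bytes with the right binary format recovers the number.
value = struct.unpack("<d", raw)[0]

print(repr(as_text))
print(round_tripped == raw)
print(value)
```

Once a byte has been replaced during decoding, no character-to-number mapping can bring it back.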
The code below simply reads in an Excel file, stores it as a df and writes the df back into an Excel file. When I open the output file in Excel, the columns (dates, numbers) are not the same: some are text, some are numbers, etc.
import pandas as pd
df = pd.read_csv("test.csv", encoding = "ISO-8859-1", dtype=object)
writer = pd.ExcelWriter('outputt.xlsx', engine='xlsxwriter')
df.to_excel(writer, index = False, sheet_name='Sheet1') #drop the index
writer.close()  # writer.save() was removed in pandas 2.0; close() also saves the file
Is there a way to have the column types (as defined in the initial file) be preserved or revert back to the datatypes when the file was read in?
You are reading in a CSV file, which is certainly not the same as an Excel file. You can read a CSV file with Excel in Windows, but the encoding is different when the file is saved. You can certainly format cells according to the xlsxwriter specifications.
However, it is important to note that xlsxwriter cannot format any cells that already have a format such as the header or index, or dates or datetime objects. If you have multiple datatypes in a single column, that will also be problematic, as pandas will then default that column to object. An item of type "object" will be inferred in output, so again it will be dynamically assigned as a "best guess".
When you read your CSV in, you should specify the types if you want them to be maintained. Right now you are having pandas do this dynamically (pandas infers each column's type from the values it reads).
Change the line where you read in to include dtypes and they will be preserved in output. I am going to assume your columns have headers "ColumnA", "ColumnB", "ColumnC":
import pandas as pd
from datetime import datetime
df = pd.read_csv("test.csv", encoding = "ISO-8859-1", dtype={'ColumnA': int,
'ColumnB': float,
'ColumnC': str})
Let's use "ColumnC" as a column example of dates. I like to first read in dates as a string, then ensure the formatting I desire. So you could add this:
df['ColumnC'] = pd.to_datetime(df['ColumnC']).dt.strftime('%m/%d/%Y')
# date would look like: 06/08/2016, but you can look at other formatting for dt.strftime
This will ensure specific types in output. Further formatting can be applied such as the number of decimals in a float, including percents in output by following guides here.
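As a small self-contained sketch of the date step, with made-up values standing in for "ColumnC": parse the text into datetimes first, then format them as strings in the layout you want.

```python
import pandas as pd

# Made-up date strings standing in for the values of "ColumnC".
dates_as_text = pd.Series(["2016-06-08", "2016-12-25"])

# Parse the text into real datetime values.
parsed = pd.to_datetime(dates_as_text)

# Format them back into strings with a fixed layout.
formatted = parsed.dt.strftime("%m/%d/%Y")
print(formatted.tolist())
```

Note that after strftime the column holds strings again, so keep the parsed datetime column if the downstream application should see real dates.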
My advice if you have columns with multiple data types: Don't. This is unorganized and makes use cases much more complex for downstream applications. Spend more time organizing data on the front end so you have less headache on the back end.
I have a list (fulllist) of 292 items, which I converted to a dataframe. Then I tried writing it to CSV in Python.
import pandas as pd
my_df = pd.DataFrame(fulllist)
my_df.to_csv('Desktop/pgm/111.csv', index=False,sep=',')
But some of the comma-separated values are spread across several columns of the CSV. I am trying to keep those values in a single column.
Portion of output is shown below.
I have tried with writerows, but it doesn't work.
import csv
with open('Desktop/pgm/111.csv', "wb") as f:
writer = csv.writer(fulllist)
writer.writerows(fulllist)
I also tried "".join on each item whenever the length of the list is greater than 1. That does not give the result either. How do I make a proper CSV so that each field fills its own column?
My expected output csv is
Please keep in mind that .csv files are in fact plain text files, and how a given piece of software understands a .csv file depends on its implementation. For example, some allow a newline character as part of a field when it is between " and ", while others treat every newline character as the start of the next row.
Do you have to use .csv format? If not consider other possibilities:
DSV (https://en.wikipedia.org/wiki/Delimiter-separated_values) is similar to CSV, but lets you use, for example, ; instead of , as the delimiter, which should help if you do not have ; in your data.
openpyxl allows writing and reading of .xlsx files.
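The effect of the delimiter choice can be sketched with the standard csv module. The row below is made up, and write_row is a hypothetical helper used only to keep the example short.

```python
import csv
import io

# Made-up data; the first field contains a comma.
row = ["Smith, John", "New York"]


def write_row(fields, delimiter):
    """Write a single row with the given delimiter and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter).writerow(fields)
    return buf.getvalue()


comma_line = write_row(row, ",")      # the field containing a comma gets quoted
semicolon_line = write_row(row, ";")  # no quoting needed with ; as the delimiter

# A CSV-aware reader puts the quoted field back into a single column.
parsed = next(csv.reader(io.StringIO(comma_line)))

print(repr(comma_line))
print(repr(semicolon_line))
print(parsed)
```

Either way the value stays in one field, as long as the program that reads the file honours the quoting or is told which delimiter was used.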
import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works but while writing the csv, a value for an identifier like 0003418 (type = str) converts to 3418 (type = general) when the csv file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a python error. Excel autoformats numeric columns to remove leading 0's. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting, and change the cell format in Excel to "General" instead of "Numeric".
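If the quoted file is later read back with pandas rather than Excel, the leading zeros are dropped again unless the column is read as text. A minimal sketch with made-up identifiers:

```python
import csv
import io

import pandas as pd

# Made-up identifiers with leading zeros.
nocheck = pd.DataFrame({"CUSIP": ["0003418", "0012345"]})

# Write to an in-memory buffer with every field quoted.
buf = io.StringIO()
nocheck.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)
csv_text = buf.getvalue()

# Default read: pandas strips the quotes and infers integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading the column as str keeps the identifiers intact.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"CUSIP": str})

print(inferred["CUSIP"].tolist())
print(as_text["CUSIP"].tolist())
```

The quotes in the file are removed before pandas infers types, so quoting alone does not protect the values on the pandas side.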