I'm trying to train a linear regression model in a Jupyter notebook and loaded a CSV file created via Google Sheets. All the data was saved as numbers in the sheet, but when I loaded the CSV into Jupyter it was turned into strings, and I can't convert it back; it gives the following error: could not convert string to float: '10.801.68'
I already changed the commas to dots and tried the following code:
df.columns=['Data', 'Price', 'Volume']
df['Price'] = df['Price'].str.replace(',','.')
df['Price'] = df['Price'].astype(float)
You need to remove the '.' thousands separator, as Yatu said, and you have to do it before swapping the comma for a dot (otherwise you can no longer tell the two apart in '10.801.68'): df['Price'] = df['Price'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False). Or try to change the thousands separator while reading your CSV file, for example:
pd.read_csv('your_file.csv', thousands='.', decimal=',')
The number format in your CSV file (saved on Windows, I would assume) is quite different from what Python expects, so I would try to adjust it directly in the reading process.
You can do as PV8 suggested within Python, or clean the file within Excel by replacing (Ctrl+H) '.' with ''.
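Putting both suggestions together, a minimal sketch (the ';' separator, the inline sample and the values are assumptions; adjust them to your file):

import pandas as pd
from io import StringIO

# Stand-in for your CSV, in the original format: '.' thousands, ',' decimal
csv_data = StringIO("Data;Price;Volume\n2020-01-01;10.801,68;5")

# Route 1: let pandas handle both separators while reading
df = pd.read_csv(csv_data, sep=';', thousands='.', decimal=',')
print(df['Price'].dtype)  # float64, no conversion needed afterwards

# Route 2: clean the strings first, then convert
# (strip the thousands dots BEFORE turning the comma into a dot)
prices = pd.Series(['10.801,68'])
prices = (prices.str.replace('.', '', regex=False)
                .str.replace(',', '.', regex=False)
                .astype(float))
print(prices.iloc[0])  # 10801.68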
Related
From Python I want to export a dataframe to CSV format.
The dataframe contains two columns like this:
So when I write this:
df['NAME'] = df['NAME'].astype(str) # or .astype('string')
df.to_csv('output.csv',index=False,sep=';')
When the CSV output is opened in Excel, it reads the value "MAY8218" as the date format "may-18", while I want it to be read as "MAY8218".
I've tried many ways but none of them works. I don't want a workaround like putting quotation marks to the left and the right of the value.
Thanks.
If you want to export the dataframe to use it in Excel, just export it as XLSX. It works for me and keeps the values as strings in their original format.
df.to_excel('output.xlsx',index=False)
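(Note that to_excel needs an Excel writer engine installed, e.g. pip install openpyxl.)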
The CSV format is a text format: the file contains no hint about the type of each field. The problem is that Excel has the worst possible support for CSV files: it assumes that CSV files always use its own conventions when you try to read one. In short, an Excel implementation can only read correctly what it has written itself...
That means that you cannot prevent Excel from interpreting the CSV data the way it wants, at least when you open a CSV file. Fortunately you have other options:
import the CSV file instead of opening it. This way you get options to configure how the file should be processed.
use LibreOffice Calc for processing CSV files. LibreOffice is a little behind Microsoft Office on most points, except for CSV handling, where it has excellent support.
So I have a CSV file with a column called reference_id. The values in reference_id are 15 characters long, something like '162473985649957'. When I open the CSV file, Excel changes the datatype to General and the numbers display as something like '1.62474E+14'. To fix this in Excel, I change the column type to Number and remove the decimals, and it displays the correct value. I should add, it only does this with the CSV file; if I output to XLSX, it works fine. Problem is, the file has to be CSV.
Is there a way to fix this using Python? I'm trying to automate a process. I have tried using the following to convert it to a string. It works in the sense that it converts the column to a string, but it still shows up incorrectly in the CSV file.
df['reference_id'] = df['reference_id'].astype(str)
df.to_csv(r'Prev Day Branch Transaction Mems.csv')
Thanks
When I open the CSV file, Excel changes the datatype to General
This is an Excel problem. You can't fix how Excel decides to interpret your CSV. (You can work around some issues by using the text import wizard, but that's cumbersome.)
Either use XLS/XLSX files when working with Excel, or use e.g. Gnumeric or something else that doesn't wantonly mangle your data.
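For completeness, on the pandas side you can at least make sure the digits survive the round trip by forcing the column to be read as text (a sketch reusing the names from the question; 'input.csv' is a placeholder):

import pandas as pd

# Read reference_id as text so pandas keeps the exact digits
# (and any leading zeros) instead of parsing the column as a number
df = pd.read_csv('input.csv', dtype={'reference_id': str})
df.to_csv(r'Prev Day Branch Transaction Mems.csv', index=False)

Excel will still show 1.62474E+14 when it opens the CSV by double-click; that display issue can only be fixed on the Excel side, e.g. via the text import.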
I am attempting to read MapInfo .dat files into .csv files using Python. So far, I have found the easiest way to do this is through xlwings and pandas.
When I do this (code below) I get a mostly correct .csv file. The only issue is that some columns appear as symbols/gibberish instead of their real values. I know this because I also have the correct data on hand, exported from MapInfo.
import xlwings as xw
import pandas as pd

# Open the .dat file in a hidden Excel instance
app = xw.App(visible=False)
tracker = app.books.open('./cable.dat')

# Find the extent of the data on the active sheet and grab it as a range
last_row = xw.Range('A1').current_region.last_cell.row
data = xw.Range("A1:AE" + str(last_row))

# Convert the range to a DataFrame ('schema' holds the column names, defined elsewhere)
test_dataframe = data.options(pd.DataFrame, header=True).value
test_dataframe.columns = list(schema)
test_dataframe.to_csv('./output.csv')
When I compare to the real data, I can see that the symbols do actually map to the correct numbers (meaning that 1 = Â?, 2 = #, 3 = #, etc.).
Below is the first part of the 'dictionary' showing how they map:
My question is this:
Is there an encoding that I can use to turn these series of symbols into their correct representation? The float columns aren't the only ones affected by this, but they are the most important to my data.
Any help is appreciated.
import pandas as pd
from simpledbf import Dbf5
dbf = Dbf5('path/filename.dat')
df = dbf.to_dataframe()
.dat files are dBase files underneath (https://www.loc.gov/preservation/digital/formats/fdd/fdd000324.shtml), so just use that method.
Then just output the data:
df.to_csv('outpath/filename.csv')
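simpledbf is a third-party package (pip install simpledbf). If text columns come out garbled, Dbf5 also takes a codec argument if I remember correctly, e.g. Dbf5('path/filename.dat', codec='cp1252') for files produced on Windows.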
EDIT
If I understand well, you are using xlwings to load the .dat file into Excel, and then reading it into a pandas dataframe to export it to a CSV file.
Somewhere along this path it seems that some binary data is not/incorrectly interpreted and is then written as text to your CSV file.
directly read the dBase file
My first suggestion would be to read the input file directly into Python without the use of an Excel instance.
According to Wikipedia, MapInfo .dat files are actually dBase III files. These you can parse in Python using a library like dbfread.
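A minimal sketch of that route (the path is taken from your code; the cp1252 encoding is an assumption, typical for Windows exports):

from dbfread import DBF
import pandas as pd

# Parse the dBase table directly; no Excel instance involved
table = DBF('./cable.dat', encoding='cp1252', char_decode_errors='replace')
df = pd.DataFrame(iter(table))  # each record behaves like an ordered dict
df.to_csv('./output.csv', index=False)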
inspect data before writing to csv
Secondly, I would inspect the 'corrupted' columns in Python instead of immediately writing them to disk.
Either something is going wrong in the Excel import and the data in these columns gets imported as text instead of some binary number format,
or this data is correctly read into memory as a byte array (instead of a float), and when you write it to CSV it just gets dumped to disk byte-wise instead of being interpreted as a number format and given a text representation.
note
A small remark about your initial question regarding mapping text to numbers:
it will probably not be possible to create a straightforward map of characters to numbers:
These numbers could have any encoding and might not be stored as decimal text values, as you now seem to assume.
The text representations are just a decoding of the raw bytes using some character encoding (UTF-8, UTF-16, ...). E.g. for UTF-8, several bytes might map to one character, and the question marks or squares you see might indicate that one or more characters could not be decoded.
In any case you will be losing information if you start from the text; you must start from the binary data to decode it.
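To illustrate with a made-up example: the same eight bytes look like gibberish when decoded as text, but come out fine when interpreted as the binary float they actually are:

import struct

raw = struct.pack('<d', 10801.68)    # a float stored as 8 raw bytes (little-endian double)
print(raw.decode('latin-1'))         # decoded as text: unreadable 'symbols'
print(struct.unpack('<d', raw)[0])   # interpreted as binary again: 10801.68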
Let's say my CSV file looks something like this:
acc_num,pincode
023213821,23120
002312727,03131
231238782,29389
008712372,00127
023827812,23371
When I open this file in Excel, it removes the leading zeros, but here I need to keep them. This is how it looks when I open it in Excel:
acc_num,pincode
23213821,23120
2312727,3131
231238782,29389
8712372,127
23827812,23371
I tried converting this into a string but it still shows it without the 0 (though it's a string now).
I tried using the astype() function from pandas, but it made no difference.
Any help would be appreciated
Did you format the cells in the Excel document as 'text', so that when you open it in Excel it displays the leading zeros? Then, when you bring it into Python, make sure you're reading it in and storing it as a string, as Python 3 does not allow leading zeros in int literals.
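On the Python side, a minimal sketch that keeps the zeros ('accounts.csv' is a placeholder; the column names are taken from the sample above):

import pandas as pd

# Force both columns to be read as text so pandas never parses them as ints
df = pd.read_csv('accounts.csv', dtype={'acc_num': str, 'pincode': str})
print(df['acc_num'].iloc[0])  # '023213821', leading zero intact

df.to_csv('accounts_out.csv', index=False)

Excel will still drop the zeros when it opens the CSV by double-click; format the cells as text or use the text import to display them.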
I want to read this CSV file into pandas as a DataFrame. Then I would like to split the resulting strings on colons.
I import using:
df_r = pd.read_csv("report.csv", sep=";|,", engine="python")
Then split using:
for c in df_r:
    if df_r[c].dtype == "object":
        df_r[c] = df_r[c].str.split(':')
But I get the following error:
ValueError: could not convert string to float: '"\x001\x000\x00'
Any idea what I am doing wrong?
Edit:
The error actually shows up when I try to convert one of the strings to a float:
print(float(df_r["Laptime"].iloc[0][2]))
I ran your code and everything works fine. You can try catching the error, printing the row with that strange behaviour, and inspecting it manually.
Is that the entire dump you are using? I saw that you assigned the CSV to the variable a but used df_r afterwards, so I think you are doing something else in between.
If the CSV file is complete, be aware that the last line is empty and creates a row full of NaNs. You want to read the CSV with skipfooter=1:
a = pd.read_csv('report.csv', sep=";|,", engine="python", skipfooter=1)
Edit:
You can convert it to float like this:
print(float(df_r["Laptime"].iloc[0][2].replace('\x00', '')))
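As an aside, interleaved \x00 bytes like these usually mean the file is UTF-16-encoded but was read as 8-bit text. If that is the case here (an assumption, since we can't see the file), declaring the encoding up front avoids the cleanup entirely:
df_r = pd.read_csv("report.csv", sep=";|,", engine="python", encoding="utf-16", skipfooter=1)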