Convert csv to csv, removing scientific notation from one column - python

I'm starting with a CSV exported from a system with 3 columns; the first column displays a number in scientific notation. I need to convert only that column to a plain number and save the result to another CSV. Note that there are thousands of lines, so converting with Excel is not an option.
I have found many articles close to this, using "float", using "round", but I haven't found anything that can handle a large file.
Example, file1.csv:
ID, Phone, Email
1.23E+15, 123-456-7890, johnsmith#test.com
Need the output to file2.csv:
ID, Phone, Email
1234680000000000, 123-456-7890, johnsmith#test.com
I know I'm way off, but this may give you an idea of what I'm trying to accomplish...
import pandas
import numpy as np
pandas.read_csv('file1.csv', dtype=np.float64)
df = df.apply(pd.to_numeric, errors='coerce')
df.round(0)
df.to_csv(float_format='file2.csv')
Here is the error I receive:
error

The text in your CSV file, "1.23E+15", means "one-point-two-three times ten raised to the 15th power"... that's all Python, Pandas, or anything else (but you) can know about that number.
I say "but you", because you seem to know that before "1.23E+15", there was the value 1234680000000000.
But, then some other program/process chopped off the "46800..." part and all it left was "1.23E+15"—something decreased the precision of the original value.
That's why @TimRoberts asked "How was this generated?" To get back 1234680000000000, you need to go to the program/process that last had that higher-precision value and try to change that program/process to not decrease the precision of the number.
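To see what can and cannot be recovered from the file itself, here is a minimal sketch using only the standard library. It assumes the ID is always the first column and that fields are separated by a comma plus an optional space, as in the sample; the function name `expand_first_column` is made up for illustration. It streams row by row, so file size should not matter, but note that it yields 1230000000000000 rather than 1234680000000000, because the dropped digits are simply not in the file.

```python
import csv
import io
from decimal import Decimal


def expand_first_column(src, dst):
    """Copy CSV rows from src to dst, writing column 0 without an exponent."""
    reader = csv.reader(src, skipinitialspace=True)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row unchanged
    for row in reader:
        if not row:  # skip blank lines
            continue
        # Decimal parses "1.23E+15" exactly; the "f" format prints it
        # in plain positional notation instead of scientific notation.
        row[0] = format(Decimal(row[0]), "f")
        writer.writerow(row)


# In-memory stand-ins for file1.csv / file2.csv (sample data from the question).
src = io.StringIO("ID, Phone, Email\n1.23E+15, 123-456-7890, johnsmith#test.com\n")
dst = io.StringIO()
expand_first_column(src, dst)
result = dst.getvalue()
print(result)
```

For real files, the two `StringIO` objects could be replaced by `open('file1.csv', newline='')` and `open('file2.csv', 'w', newline='')`. The output is written with plain commas, without the space after each delimiter.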

Related

CSV cannot be interpreted by numeric values

(This is a mix between a code issue and a 'user' issue, but since I suspect the issue is code, I opted to post on StackOverflow instead of SuperUser Exchange.)
I generated a .csv file with the pandas.DataFrame.to_csv() method. This file consists of 2 columns: one is a label (text) and the other is a numeric value called accuracy (float). The delimiter used to separate columns is a comma (,) and all float values are stored with a dot as the decimal separator, like this: 0.9438245862
Even though this column is saved as float, Excel and Google Sheets infer its type as text. And when I try to format this column as a number, they ignore the "0." and return a very high value instead of decimals, like:
(text) 0.9438245862 => (number) 9438245862,00
I double-checked my .csv file by reimporting it with pandas.read_csv() and printing dataframe.dtypes, and the column is imported as float successfully.
I'd appreciate some guidance on what I am missing.
Thanks,
By itself, the csv file should be correct. Both you and Pandas know what the delimiter and the floating point format are. But Excel might not agree with you, depending on your locale. A simple way to make sure is to write a tiny Excel sheet containing, on the first row, one text value and one floating point value. You then export the file as csv and check which delimiter and floating point format it uses.
AFAIK, it is much easier to change your Python code to follow what your Excel expects than to try to explain to Excel that the format of CSV files can vary...
I know that you can change the delimiter and floating point format in the current locale on a Windows system, but that is a global setting...
A short example of data would be most useful here. Otherwise we have no idea what you're actually writing/reading. But I'll hazard a guess based on the information you've provided.
The pandas dataframe will have column names. These column names will be text. Unless you tell Excel/Sheets to use the first row as the column name, it will have to treat the column as text. If this isn't the case, could you perhaps save the head of the dataframe to a csv, check it in a text editor, and see how Excel/Sheets imports it. Then include those five rows and two columns in your follow up.
The coding is not necessarily the issue here, but a combination of various factors. I am assuming that your computer is not using the dot character as a decimal separator, due to your language settings (for example, French, Dutch, etc). Instead your computer (and thus also Excel) is likely using a comma as a decimal separator.
If you want to open the data of your analysis / work later with Excel with little to no changes, you can either opt to change how Excel works or how you store the data to a CSV file.
Choosing the latter, you can specify the decimal character for the df.to_csv method through its "decimal" keyword. You should then also remember to specify the same decimal character when importing your data (if you want to read the data back in).
Continuing with the approach of adapting your Python code, you can use the following snippet to change how you write the dataframe to a csv:
import pandas as pd
# ... some transformations here ...
df.to_csv('myfile.csv', decimal=',')
If you, then, want to read that output file back in with Python (using Pandas), you can use the following:
import pandas as pd
df = pd.read_csv('myfile.csv', decimal=',')
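As a rough sketch of the round trip, the example below writes to an in-memory buffer instead of 'myfile.csv'. The sample labels and the choice of `sep=';'` are assumptions, not something from the question: locales that use a decimal comma often expect a semicolon as the column delimiter, and using it avoids having the comma play both roles in the same file.

```python
import io

import pandas as pd

# Hypothetical sample data shaped like the question's label/accuracy columns.
df = pd.DataFrame({"label": ["model_a", "model_b"],
                   "accuracy": [0.9438245862, 0.5]})

# Write with a decimal comma and a semicolon delimiter.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", decimal=",", index=False)
text = buffer.getvalue()
print(text)

# Read it back with the same settings so the column is parsed as float again.
restored = pd.read_csv(io.StringIO(text), sep=";", decimal=",")
print(restored.dtypes)
```

Whether a given Excel installation then recognises the column as numeric still depends on its regional settings, so it is worth checking one small file by hand first.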

How to Import a .pos file as a numpy array with mixed data types?

I am rather new to Python and I am working with .pos files. They are not that common, but I can explain their structure.
There is a header with general information and then 15 different columns containing data. The first two columns contain the GPS time (the date in the first column and the time in the second, in the standard format YYYY/MM/DD hh:mm:ss.ms), then there are 3 columns containing coordinates or distances in meters, and the remaining columns hold other measurements, always numbers. An example can be found here; just note that my GPST (GPS time) is as explained above.
As a matter of fact, there are three data types in this file, that are datetime, integer, and floating numbers.
I need to import this file in Python as an array. Apparently, Python can treat a .pos file as a text file, so I tried the NumPy loadtxt() function, specifying the different data types (datetime64, int, float). However, it gave me an error saying that the date format could not be recognized. Then I tried the genfromtxt() function, both specifying the data types and with dtype=None. In the first case I got empty columns for date and time, and in the latter case I got the date and time as strings.
I would like the date and the time to be recognized as such and not as a string, as I will need it later on for further analyses. Does someone have an idea on how I could import this file correctly?
Please, just try to be clear because I am a neophyte programmer!
Thank you for your help.
I am answering my own question; maybe it will be useful to someone.
A .pos file can be opened using the Pandas package as follows:
import pandas as pd
df = pd.read_table(filepath, sep=r'\s+', parse_dates={'Timestamp': [0, 1]})
In my data, the first two columns are the date and the time; the argument "parse_dates={'Timestamp': [0, 1]}" combines them into a single datetime column named Timestamp.
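Recent pandas releases deprecate combining columns through `parse_dates`, so here is a hedged alternative that reads the two columns as text and joins them explicitly. The two sample rows and the column count are invented for illustration and carry no header; a real .pos file would need its header lines skipped first (for example with `skiprows`).

```python
import io

import pandas as pd

# Invented sample: date, time, then a few numeric columns, whitespace-separated.
sample = io.StringIO(
    "2020/01/15 10:30:00.000 45.1234 7.5678 250.5 1\n"
    "2020/01/15 10:30:01.000 45.1235 7.5679 250.6 1\n"
)

# Keep the date and time columns as strings so they can be concatenated.
df = pd.read_csv(sample, sep=r"\s+", header=None, dtype={0: str, 1: str})

# Build one datetime column from "date time" and drop the two text columns.
df["Timestamp"] = pd.to_datetime(df[0] + " " + df[1],
                                 format="%Y/%m/%d %H:%M:%S.%f")
df = df.drop(columns=[0, 1])
print(df.dtypes)
```

The numeric columns keep their inferred integer and float types, and `df.to_numpy()` can turn the result into an array if one is needed later.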

not able to read currency symbol from the cell using pandas python

I am using pandas.read_excel(file) to read the file, but instead of getting the numbers with their currency symbol, I get only the numbers, without the currency symbol.
Any help will be appreciated.
Thanks
When the Excel file is read by Pandas, it reads the underlying value of the cell, which fundamentally is either a string or a number. Things like currency symbols are just applied as formatting by Excel: they affect what you see on screen but don't actually 'exist' in the cell value. For example, the number 1.1256 might appear as 1.1 if you select one decimal place, or £1.13 if you select currency; it could even appear as a date, but fundamentally it is just 1.1256, and this is the value Pandas reads. Generally this is useful, because with numbers you can perform arithmetic, whereas if pandas imported '£1.13' you would be unable to do any arithmetic (for example, add this to another amount of money) until you had removed the £ symbol and converted the text to a decimal, and in so doing you would have lost some precision (as the number is now 1.13, not 1.1256).
If you need to view the data with the currency symbol, I'd suggest just adding it on at the last moment, for example if you print the data to screen you could use print(f'£{your_number}')
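A small sketch of that idea, with made-up amounts: do the arithmetic on the raw numbers and attach the symbol only when building the display string. The two-decimal format is an assumption about how the currency should be shown.

```python
# Raw values as pandas would read them from the cells (invented for the example).
prices = [1.1256, 2.5]

# Arithmetic works on the unformatted numbers, at full precision.
total = sum(prices)

# Add the currency symbol and round only for display.
label = f"£{total:.2f}"
print(label)
```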

Pandas to_csv now not writing values correctly

I'm using to_csv to save a dataframe which looks like this:
PredictionIdx CustomerInterest
0 fe789a06f3 0.654059
1 6238f6b829 0.654269
2 b0e1883ce5 0.666289
3 85e07cdd04 0.664172
in which I have the value '0e15826235' in the first column. I'm writing this dataframe to csv using pandas to_csv(). But when I open this csv in Google Sheets, Excel, or LibreOffice, it shows 0E in Excel and 0 in LibreOffice. This is causing me problems during submission on Kaggle. One point to note here is that when I read the same csv using pandas read_csv, it shows the above value correctly in the dataframe.
As noted in the first comment, the error results from your choice of editor. Many editors use some version of scientific notation that reads an e (in specific places, like the second character) as an indicator of an exponent. Excel, for instance, will read it as "X times 10 raised to the power Y", where X is the number before the e and Y is the number after the e. This is a brief description of Excel's scientific notation.
This does not happen in the other cell entries because there appear to be other string-like characters. Excel, Libre, and possibly Google attempt to interpret what the entry is, rather than taking it literally.
In your question you write '0e15826235' with single quotes, indicating that it might be a string, but this might be something to make sure of when writing out the values to a file -- Excel and the rest might not know this is meant to be a string literal.
In general, check for the format of the value and consider what your eventual editor might "think" it is when it opens. For Excel specifically, a single quote character at the start of the string will force Excel to read it as a string. See this answer.
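One way to check the format before writing the file is a quick heuristic like the sketch below. The pattern and the name `SCI_LIKE` are illustrative assumptions; spreadsheet programs apply their own rules, so this only approximates what they might treat as scientific notation.

```python
import re

# Digits, an optional fraction, then e/E and an exponent: text that a
# spreadsheet is likely to interpret as a number in scientific notation.
SCI_LIKE = re.compile(r"^[+-]?\d+(\.\d+)?[eE][+-]?\d+$")

ids = ["fe789a06f3", "6238f6b829", "b0e1883ce5", "85e07cdd04", "0e15826235"]

# Collect the IDs that consist only of digits around a single 'e'.
risky = [value for value in ids if SCI_LIKE.match(value)]
print(risky)
```

Only '0e15826235' is flagged here; the other IDs contain extra letters, which is why the spreadsheet leaves them alone.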
For me, the code below works correctly with Google Sheets:
import pandas as pd
df = pd.DataFrame({
    'PredictionIdx': ['fe789a06f3', '6238f6b829', 'b0e1883ce5', '85e07cdd04'],
    'CustomerInterest': [0.654059, 0.654269, 0.666289, 0.664172],
})
df.to_csv('./test.csv', index=False)
Also, CSV is a very simple text format; it doesn't hold any information about data types.
So you could use df.to_excel() as Nihal suggested, or adjust column type settings in your favourite spreadsheets viewer.

Conforming dataframe dtypes between read_excel() and to_excel()

I am reading a dataframe from an excel file (specifically xlsx) that contains rows and columns about vendors, including zip_code and tax_id columns. When the numbers are read IN and then I cast the column astype(unicode), tax_id 123456789 becomes 123456789.0.
I don't want to cast to int and then mod / truncate (because, in the case of zip_code and theoretically tax_id too, '07443' will get converted to 7443 which isn't good). I just want to clip the '.0' and have to_excel() treat the whole column as strings (unicodes, more specifically).
Sometimes read_excel() correctly identifies a number as a string (07443 is a good example, actually). In the case of the tax_id, though, it's clearly coming in as a number of some kind (even though the '.0' doesn't show up until I astype(unicode) it).
One thing I've tried is df.astype(unicode).replace(".0",""), but this doesn't seem to be getting it done. The resulting df still shows 123456789.0.
I'm not sure how to illustrate this with code because you need an Excel file, which I can't attach. I'm open to suggestions as to how to clarify my question if necessary.
Thank you!
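Hmm, one thing that appears to be working (which I suppose speaks to the awesomeness of pandas):
# Escape the dot so only a literal trailing ".0" is removed, and keep the result.
df['tax_id'] = df['tax_id'].replace(r"\.0$", "", regex=True)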
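The earlier attempt with replace(".0", "") had no visible effect because, without regex=True, replace only swaps cells whose entire value equals ".0". Below is a minimal sketch of the regex approach on invented data, using the string accessor and an escaped dot; the column values are assumptions standing in for what read_excel() returned.

```python
import pandas as pd

# Invented frame: tax_id arrived as float, zip_code as text with a leading zero.
df = pd.DataFrame({"tax_id": [123456789.0, 100.0],
                   "zip_code": ["07443", "10001"]})

# Convert to text, then strip a literal trailing ".0" (note the escaped dot).
df["tax_id"] = df["tax_id"].astype(str).str.replace(r"\.0$", "", regex=True)

# With an unescaped dot, "." matches any character, so "100" would lose "00".
unescaped = pd.Series(["100"]).str.replace(".0$", "", regex=True)

print(df)
print(unescaped.tolist())
```

The zip_code column is untouched, so values such as '07443' keep their leading zero.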
