I'm processing a CSV file. The source file contains values like '20190801'. Pandas detects the column as int or float depending on the file. Before writing the output I convert all columns to string, and the dtypes then show every column as object, yet the output still contains .0 at the end, e.g. 20190801.0. Why is that?
for col in data.columns:
    data[col] = data[col].astype(str)
print(data.dtypes)  # prints all column dtypes as object
data.to_csv(neo_path, index=False)
I fixed it like this: I added the converters parameter, making sure the problematic columns stay as strings in my case.

data = pd.read_csv(filepath, converters={'SiteCode': str, 'Date': str, 'Tank ID': str, 'SIRA RECORD ID': str})
....
data.to_csv(neo_path, index=False)

This way I no longer need to convert all column types to string, as mentioned in my question:

for col in data.columns:
    data[col] = data[col].astype(str)

That didn't work when writing the output to CSV: the columns had already been read as float, so the ".0" was baked into the strings.
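The behaviour can be reproduced in a few lines. This is a minimal sketch using an in-memory CSV with an assumed two-column layout: a missing value forces the Date column to float64, so astype(str) stringifies the float including its ".0", while reading the column as str from the start preserves the original text.

```python
import io
import pandas as pd

# A CSV column with a missing value is read as float64,
# so '20190801' becomes the float 20190801.0.
csv = "Date,Site\n20190801,A\n,B\n"
data = pd.read_csv(io.StringIO(csv))
print(data["Date"].dtype)                 # float64
print(data["Date"].astype(str).iloc[0])   # '20190801.0' -- the .0 is already baked in

# Reading the column as str from the start preserves the original text.
data = pd.read_csv(io.StringIO(csv), converters={"Date": str})
print(data["Date"].iloc[0])               # '20190801'
```

The key point is that astype(str) after the fact cannot recover text that was lost during parsing; the type has to be fixed at read time.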
I'm reading an Excel file:

df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows=1)

The Excel file contains a column formatted as General and a value like "405788". After reading it with pandas the output looks like "405788.0", so it was converted to float. I need every value as a string without changing the values; can someone help me out with this?

[Edit]

If I copy the values into a new Excel file and load that, the integers do not get converted to float. But I need the correct values from the original file, so is there anything I can do? The dtype and converters options change the type to str as I need, but the result is still the floating-point text with .0.
You can try the dtype parameter of the read_excel method:

df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows=1,
                   dtype={'Name': str, 'Value': str})

More information in the pandas docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
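If dtype=str still yields "405788.0" (Excel stores General-formatted numbers as floats internally, so pandas may stringify the float), one workaround is to read the column numerically and route it through the nullable Int64 dtype before stringifying. A sketch, simulating a column as it might come back from read_excel:

```python
import pandas as pd

# Simulating a column that read_excel returned as float64
# (Excel stores General-formatted numbers as floats internally).
s = pd.Series([405788.0, 12345.0, None])

# The nullable Int64 dtype drops the ".0" while keeping the missing value.
as_text = s.astype("Int64").astype(str)
print(as_text.tolist())  # ['405788', '12345', '<NA>']
```

This only applies when the values really are whole numbers; genuine decimals would be truncated by the integer cast.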
I'm having a problem converting data in my DataFrame to percentage format while keeping it as a float.
Here is a simplified version of the code from my actual project:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 15, size=(10, 4)), columns=list('ABCD'))
print(df)

cols = df.columns
for col in cols:
    df[col] = df[col].astype(float).map(lambda n: '{:.4%}'.format(n))

print(df)
print(df.dtypes)
In my actual project I need to choose ONLY the columns whose names contain a specific string and do some calculations on their values. At the end I have to change the formatting to percentages with 4 decimal places. Even though I use astype(float), my values are still str.
Consequently, when I save the DataFrame to an Excel file, the values are pasted as text, not as numbers.
In addition, while creating a line chart from this DataFrame, I get the error 'unhashable type: numpy.ndarray'.
Please advise how to successfully convert the data to percentage format while keeping it as a float, so that it pastes correctly into the Excel file and can be plotted as a line chart with matplotlib.
Thanks a lot!
I believe this is because you are using .format after .astype(float). The code first converts the values to float, but since format is a string operation, the result is a string with 4 decimal places. Note that you cannot simply reorder the calls: a value like '12.5000%' will not parse back to float, so

df[col] = df[col].map(lambda n: '{:.4%}'.format(n)).astype(float)

raises a ValueError. A percent-formatted string and a float dtype are mutually exclusive in the same column. Keep the data as float and apply the percentage formatting only where it is displayed (an Excel number format, a matplotlib tick formatter, or a separate display copy), not to the data itself.
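One way to keep the dtypes numeric is to format a throwaway display copy and leave the real frame untouched. A minimal sketch of that split:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 15, size=(10, 4)), columns=list("ABCD"))
df = df.astype(float)            # the real data stays float64

# Build a separate display copy; the percent strings never touch df itself.
display_df = df.copy()
for col in display_df.columns:
    display_df[col] = display_df[col].map("{:.4%}".format)

print(display_df.head())         # e.g. '700.0000%' for the value 7.0
print(df.dtypes)                 # all float64, so Excel export and plotting still work
```

Exporting df (not display_df) to Excel pastes real numbers, and plotting df avoids the string-related chart error.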
I'm trying to plot data read into Pandas from an xlsx file. After some minor formatting and data quality checks, I try to plot using matplotlib but get the following error:
TypeError: Empty 'DataFrame': no numeric data to plot
This is not a new issue and I have followed many of the pages on this site dealing with this very problem. The posted suggestions, unfortunately, have not worked for me.
My data set includes strings (locations of sampling sites and limited to the first column), dates (which I have converted to the correct format using pd.to_datetime), many NaN entries (that cannot be converted to zeros due to the graphical analysis we are doing), and column headings representing various analytical parameters.
As per some of the suggestions I read on this site, I have tried the following code
df = df.astype(float), which gives the following error: ValueError: could not convert string to float: 'Site 1' ('Site 1' is a sampling location)
df = df.apply(pd.to_numeric, errors='ignore'), which gives the following: dtypes: float64(13), int64(1), object(65) and therefore does not appear to work, as most of the data remains object. The date entries are the int64 column, and I cannot figure out why some of the data columns are float64 while others remain objects.
df = df.apply(pd.to_numeric, errors='coerce'), which empties the entire DataFrame, possibly because this operation fills it with NaN?
I'm stuck and would appreciate any insight.
EDIT
I was able to solve my own question based on some of the feedback. Here is what worked for me:
path = "path"                      # path to the xlsx file
header = [0]                       # keep column headings as the first row of original data
skip = [1]                         # skip second row, which has units of measure
na_val = ['.', '-.', '-+0.01']     # convert spurious decimal points that have
                                   # no number associated with them to NaN
convert = {col: float for col in (4,...,80)}  # convert specific columns to
                                              # float from original text
parse_col = ("A", "C", "E:CC")     # apply to specific columns
df = pd.read_excel(path, header=header, skiprows=skip,
                   na_values=na_val, converters=convert, parse_columns=parse_col)
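Independently of how the file is read, the "no numeric data to plot" error can also be avoided by restricting the plot to numeric columns. A small sketch with made-up sampling data (the column names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "Site": ["Site 1", "Site 2", "Site 3"],   # hypothetical sampling locations
    "pH":   [7.1, 6.9, 7.4],
    "Temp": [12.0, 11.5, 13.2],
})

numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())  # ['pH', 'Temp']
# numeric.plot() would now succeed, since only numeric columns remain
```

NaN entries in the numeric columns are fine here; matplotlib simply leaves gaps in the lines, which suits a graphical analysis that must not fill NaN with zeros.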
Hard to answer without a data sample, but if you are sure that the numeric columns are 100% numeric, this will probably work:

for c in df.columns:
    try:
        df[c] = df[c].astype(int)
    except (ValueError, TypeError):
        pass  # leave non-numeric columns unchanged
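A per-column variant of the same idea uses pd.to_numeric and only commits the conversion when every value parses, so text columns like the 'Site 1' locations stay untouched. A sketch with an assumed two-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Site": ["Site 1", "Site 2"],   # location strings that must stay text
    "pH":   ["7.1", "6.9"],         # numbers that arrived as text
})

for c in df.columns:
    converted = pd.to_numeric(df[c], errors="coerce")
    if converted.notna().all():     # convert only if no value was lost
        df[c] = converted

print(df.dtypes)  # Site stays object, pH becomes float64
```

One caveat: a numeric column containing legitimate NaN entries would fail the notna() check and be skipped, so the guard may need loosening for data like this questioner's.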
I have a csv file with no headers. It has around 35 columns.
I am reading this file using pandas.
Currently, the issue is that when pandas reads the file, it automatically assigns a datatype to each column.
How can I avoid this automatic datatype assignment?
I have a column C which I want to store as string instead of int, but pandas automatically assigns it int.
I tried 2 things.
1)
my_df = pd.DataFrame()
my_df = pd.read_csv('my_csv_file.csv', names=['A','B','C'...'Z'], converters={'C': str}, engine='python')

The above code gives me an error:

ValueError: Expected 37 fields in line 1, saw 35

If I remove converters={'C': str}, engine='python', there is no error.
2)
old_df['C'] = old_df['C'].astype(str)

The issue with this approach is that if the value in the column is '00123', it has already been read as the integer 123, and this only converts it to the string '123'. The leading zeroes are lost because pandas thinks the column is an integer.
Use the dtype option or converters in read_csv (see the read_csv docs); both work regardless of whether the python engine is used:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['00123', '00125'], 'col2': [1, 2], 'col3': [1.0, 2.0]})
df.to_csv('test.csv', index=False)
new_df = pd.read_csv('test.csv', dtype={'col1': str, 'col2': np.int64, 'col3': np.float64})

If you simply use dtype=str, every column is read in as a string (object). You cannot do that with converters, as it expects a dictionary. You could substitute converters for dtype in the code above and get the same result.
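A quick round trip, using an in-memory CSV rather than a file, shows the leading zeros surviving while the other columns are still parsed numerically:

```python
import io
import pandas as pd

csv = "C,D\n00123,1\n00125,2\n"
df = pd.read_csv(io.StringIO(csv), dtype={"C": str})
print(df["C"].tolist())   # ['00123', '00125'] -- leading zeros preserved
print(df["D"].dtype)      # D is still parsed as an integer
```

Without the dtype entry, column C would come back as the integers 123 and 125 and the zeros would be unrecoverable.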
I want to select all values bigger than 8000 in a pandas DataFrame.

new_df = df.loc[df['GM'] > 8000]

However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float / int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the type stored in Excel. If the cell was numeric in Excel, the conversion should work correctly. If your column was stored as text, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can inspect df.dtypes to see the type of each column. Then, if the column type is not what you want, change it with df['GM'].astype(float), after which new_df = df.loc[df['GM'].astype(float) > 8000] should work as you intend.
You can convert the entire column to a numeric data type:

import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])

You can see the data type of your column with the dtypes attribute. To convert it to float, use astype as follows:

df['GM'].astype(float)
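Note that astype(float) and pd.to_numeric both fail on European-formatted strings like "1.111,52". If the column already landed in the DataFrame as such strings, the separators can be normalized first. A sketch with made-up 'GM' values in the question's format:

```python
import pandas as pd

df = pd.DataFrame({"GM": ["1.111,52", "9.250,00", "750,10"]})

df["GM"] = (
    df["GM"]
    .str.replace(".", "", regex=False)   # drop the thousands separator
    .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    .astype(float)
)

new_df = df.loc[df["GM"] > 8000]
print(new_df)  # only the 9250.0 row remains
```

This is the post-hoc equivalent of passing thousands='.' and decimal=',' at read time, which is the cleaner option when you control the read call.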