Pandas.read_sql_query reads int as float - python

I am exporting a SQL database to CSV using pandas.read_sql_query + df.to_csv, and the problem is that integer fields are represented as floats in the DataFrame.
My code:
conn = pymysql.connect(server, port)
chunk = pandas.read_sql_query('''select * from table''', conn)
df = pandas.DataFrame(chunk) # here int values are float
df.to_csv()
I am exporting a number of tables this way, and the problem is that int fields are exported as floats (with a decimal point). Also, I don't know upfront which column has which type (the code is supposed to be generic for all tables).
However, my goal is to export everything as is, as strings.
The things I've tried (without success):
df.applymap(str)
read_sql_query(coerce_float=False)
df.fillna('')
DataFrame(dtype=object) / DataFrame(dtype=str)
Of course, I can post-process the data to type-cast the integers, but it would be better to do it during the initial import.
UPD: My dataset has NULL values. They should be replaced with empty strings (as the purpose is to typecast all columns to strings)

Pandas infers each column's dtype from the data it reads. If an integer column has null values, pandas assigns it the float dtype, because NaN is a float and because pandas is built on NumPy, whose integer types cannot represent missing values. So make sure that you don't have any nulls, or that nulls are replaced with 0 (an integer) if that makes sense in your dataset.
Another way to go about this would be to specify the dtypes on data import, but you have to use a special kind of integer, such as Int64Dtype, as per the docs:
"If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas"

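To illustrate the explanation above, here is a minimal sketch of how a single missing value turns an integer column into float, and how the nullable "Int64" dtype keeps the integers. The Series is made-up sample data, not the asker's table.

```python
import pandas as pd

# An integer column with one NULL, as it might come back from a SQL query.
raw = pd.Series([1, 2, None])

# The missing value is stored as NaN, which forces the whole column to float.
float_dtype = str(raw.dtype)

# Casting to the nullable extension dtype restores integers and keeps the gap.
nullable = raw.astype("Int64")
int_dtype = str(nullable.dtype)

print(float_dtype, int_dtype)
```
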
So I found a solution by post-processing the data (force converting to strings) using applymap:
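def convert_to_int(x):
    # Note: int() truncates, so a real float such as 1.5 also becomes '1'
    try:
        return str(int(x))
    except (TypeError, ValueError, OverflowError):
        # NaN, None and non-numeric strings are returned unchanged
        return x

df = df.applymap(convert_to_int)  # deprecated since pandas 2.1 in favor of DataFrame.map
This step takes significant time to process, but it solves my problem.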
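A column-wise variant of the same idea may be faster than calling a Python function on every cell. The sketch below is only one possible approach, not the asker's code: the helper name frame_to_strings is hypothetical, and it assumes unique column names and integer values that fit in 64 bits. Float columns are converted to integers only when every non-null value is a whole number, and NULLs become empty strings as requested in the question.

```python
import numpy as np
import pandas as pd

def frame_to_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every value is a string and NULLs are ''."""
    out = {}
    for name, col in df.items():
        if pd.api.types.is_float_dtype(col):
            present = col.dropna()
            # Treat the column as integer only if all present values are whole.
            if np.isfinite(present).all() and (present == present.round()).all():
                col = col.astype("Int64")
        # Blank out missing values, then stringify what is left.
        as_text = col.astype(object).where(col.notna(), "")
        out[name] = as_text.map(str)
    return pd.DataFrame(out, index=df.index)

# Hypothetical sample: an int column with a NULL, a true float column, a text column.
sample = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [1.5, 2.0, None],
    "c": ["x", None, "z"],
})
result = frame_to_strings(sample)
print(result)
```
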

Related

Python Dataframe: Using astype to change column type fails

I have been using astype to change the type of the columns for a while. However, I run into an unexpected result today. I have a column named modularity_class, and I am trying to convert it from float to int and assign to a new column
communities_to_analyze['modularity'] = communities_to_analyze['modularity_class'].astype(int)
However, this gives me interesting result
>>print(communities_to_analyze['modularity'][0])
94
>>print(communities_to_analyze.iloc[0]['modularity'])
94.0
This looks so ridiculous. I am using pandas 1.1.1, and this has never happened to me before. I was wondering if anyone has run into the same problem before?
communities_to_analyze.iloc[0]['modularity']
Here you first access the first row of your data frame. It is converted to a pandas series. If there is a float number somewhere in the row every value is converted to float. Thus, if you then access the index 'modularity' of your pandas series it will return a float and not integer.
communities_to_analyze['modularity'][0]
Here you do it the other way round. First, select the column 'modularity'. The values remain integer because there is no float in this column.
Thus, if you then access the first value it will return an integer.
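The difference between the two access orders can be reproduced with a small made-up frame (the "weight" column and the values are assumptions for illustration). Selecting the row first builds a Series that must hold floats and integers together, so it is upcast to float64. Selecting the column first, or using .at, keeps the integer.

```python
import pandas as pd

df = pd.DataFrame({"modularity_class": [94.0, 12.0], "weight": [0.5, 1.5]})
df["modularity"] = df["modularity_class"].astype(int)

col_first = df["modularity"].iloc[0]   # column first: stays an integer
row_first = df.iloc[0]["modularity"]   # row first: the row Series is float64
cell = df.at[0, "modularity"]          # direct cell access: stays an integer
row_dtype = str(df.iloc[0].dtype)

print(col_first, row_first, cell, row_dtype)
```
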
df['modularity_class'][0] returns the content of a single cell of the DataFrame. If that column actually holds strings such as '94.0', astype(int) cannot parse them directly. You can convert via float first, like below:
communities_to_analyze['modularity'] = communities_to_analyze['modularity_class'].astype(float).astype(int)

Converting strings to ints in a DataFrame

How can I convert a DataFrame column containing strings and "-" values to floats?
I have tried pd.to_numeric and pd.Series.astype(int), but without success.
What do you recommend?
If I understand correctly, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
For the first part, it is directly possible with the thousands parameter, and it is even possible to process - as a NaN:
m = read_csv(..., thousands='.', na_values='-')
The real problem is the parens for negative values.
You could use a python function to convert the values. A possible alternative would be to post process the dataframe column wise:
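m = read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    # Columns still typed as object are the ones read_csv could not parse as numbers
    if m[col].dtype == np.dtype('object'):
        # regex=True is required for pattern replacement in pandas 2.0 and later
        m[col] = m[col].str.replace(r'\.', '', regex=True).str.replace(r'\((.*)\)', r'-\1', regex=True).astype('float64')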
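The same conversion can be sketched on an in-memory column, without read_csv. The helper name parse_accounting and the sample values are assumptions for illustration; it removes the thousands dots, rewrites "(x)" as "-x", and treats a lone "-" as missing.

```python
import pandas as pd

def parse_accounting(col: pd.Series) -> pd.Series:
    """Convert strings like '7.629.352', '(21.808.956)' and '-' to floats."""
    text = col.astype(str).str.strip()
    text = text.str.replace(".", "", regex=False)               # drop thousands dots
    text = text.str.replace(r"^\((.*)\)$", r"-\1", regex=True)  # (x) -> -x
    text = text.mask(text == "-")                               # lone dash -> missing
    return pd.to_numeric(text).astype("float64")

values = pd.Series(["7.629.352", "(21.808.956)", "-"])
parsed = parse_accounting(values)
print(parsed)
```
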

Receiving KeyError when converting a column from float to int

I created a pandas DataFrame using read_csv.
I then changed the column name of the 0th column from 'Unnamed' to 'Experiment_Number'.
The values in this column are floating point numbers and I've been trying to convert them to integers using:
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
I get this error:
KeyError: 'Experiment_Number'
I've been trying every way since yesterday, for example also
df['Experiment_Number'] = df.astype({'Experiment_Number': int})
and many other variations.
Can someone please help, I'm new using pandas and this close to giving up on this :(
Any help will be appreciated
I had used this for renaming the column before:
df.columns.values[0] = 'Experiment_Number'
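This is not a reliable way to rename a column: assigning into df.columns.values changes the underlying array in place, and the column index may keep a stale lookup table, so df['Experiment_Number'] can still raise KeyError. Special or unprintable characters in the original column name are another possible cause.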
I can offer another possible suggestion, using df.rename:
df = df.rename(columns={df.columns[0] : 'Experiment_Number'})
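A small end-to-end sketch of the rename-then-convert approach, using an in-memory CSV as a stand-in for the asker's file (the data and the "value" column are made up). A blank header cell is read by pandas as 'Unnamed: 0', which is why renaming by position is convenient here.

```python
import io
import pandas as pd

# Hypothetical CSV whose first header cell is empty.
csv_text = ",value\n1.0,a\n2.0,b\n"
df = pd.read_csv(io.StringIO(csv_text))
original_name = df.columns[0]  # name pandas generated for the blank header

# Rename by position, then convert the floats to integers.
df = df.rename(columns={df.columns[0]: "Experiment_Number"})
df["Experiment_Number"] = df["Experiment_Number"].astype(int)
print(df)
```
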
You can convert the type during your read_csv() call and rename the column afterward, as in
df = pandas.read_csv(filename,
                     dtype={'Unnamed': 'float'},  # inform read_csv this field is float
                     converters={'Unnamed': lambda s: int(float(s))})  # converters receive text: parse as float, then truncate to int
df.rename(columns={'Unnamed': 'Experiment_Number'}, inplace=True)
The dtype is not strictly necessary, because the converter overrides it in this case (pandas warns when both are given for the same column), but it is wise to get in the habit of providing a dtype for every field of your input. Pandas falls back to float, for example, whenever an integer column contains missing values. Also, if you specified dtype, you can later remove the converters option without worry.

Avoid converting data to int automatically while reading using pandas data frame

I have a csv file with no headers. It has around 35 columns.
I am reading this file using pandas.
Currently, the issue is that when it reads the file, it automatically assigns a datatype to each column.
How to avoid assigning automatic data types?
I have a column C, which I want to store as string instead of int. But pandas automatically assigns it to int
I tried 2 things.
1)
my_df = pd.DataFrame()
my_df = pd.read_csv('my_csv_file.csv',names=['A','B','C'...'Z'],converters={'C':str},engine = 'python')
Above code gives me error
ValueError: Expected 37 fields in line 1, saw 35
If I remove, converters={'C':str},engine = 'python' there is no error
2)
old_df['C'] = old_df['C'].astype(str)
The issue with this approach is that if the value in the column is '00123', it has already been read as 123, so the cast produces '123'. The leading zeroes are lost because pandas treated the value as an integer.
Use the dtype option or converters in read_csv (see the read_csv documentation); this works regardless of whether you use the python engine:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':['00123','00125'],'col2':[1,2],'col3':[1.0,2.0]})
df.to_csv('test.csv',index=False)
new_df = pd.read_csv('test.csv',dtype={'col1':str,'col2':np.int64,'col3':np.float64})
If you simply use dtype=str, every column is read in as a string (object). You cannot do the same with converters, because it expects a dictionary. You could substitute converters for dtype in the code above and get the same result.
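For a header-less file like the one in the question, the effect of dtype can be checked with an in-memory CSV. The three-column sample and the names A, B, C are assumptions standing in for the real 35-column file.

```python
import io
import pandas as pd

# Hypothetical header-less CSV with a zero-padded code in the first column.
csv_text = "00123,1,1.0\n00125,2,2.0\n"
names = ["A", "B", "C"]

# Default inference parses the first column as integers and drops the zeroes.
inferred = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

# Declaring the column as str keeps the text exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), header=None, names=names,
                      dtype={"A": str})

print(inferred["A"].tolist(), as_text["A"].tolist())
```
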

String problem / Select all values > 8000 in pandas dataframe

I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is, that the value comes from an Excel file and the number is interpreted as string e.g. "1.111,52". Do you know how I can convert such a string to float / int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the cell type stored in Excel. If the column was numeric in Excel, the conversion should work correctly. If the column was stored as text, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can inspect df.dtypes to see the type of each column. If a column's type is not what you want, you can change it with df['GM'].astype(float), and then new_df = df.loc[df['GM'].astype(float) > 8000] should work as expected, provided the strings are already in a form float() accepts (for example '1111.52' rather than '1.111,52').
You can convert the entire column to a numeric dtype:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
You can see the data type of your column through its dtype attribute. To convert it to float, use astype as follows:
df['GM'].astype(float)
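If the column is already loaded as strings in the "1.111,52" format, neither pd.to_numeric nor astype(float) can parse it directly. A minimal sketch of normalizing the separators first, using made-up values for the GM column:

```python
import pandas as pd

df = pd.DataFrame({"GM": ["1.111,52", "9.250,00", "8.000,01"]})

# Remove the thousands dots, then turn the decimal comma into a dot.
gm = (
    df["GM"]
    .str.replace(".", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)

# Filter on the numeric values while keeping the original rows.
new_df = df.loc[gm > 8000]
print(new_df)
```
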
