I created a pandas DataFrame using read_csv.
I then renamed the 0th column from 'Unnamed' to 'Experiment_Number'.
The values in this column are floating point numbers and I've been trying to convert them to integers using:
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
I get this error:
KeyError: 'Experiment_Number'
I've been trying everything I can think of since yesterday, for example also
df['Experiment_Number'] = df.astype({'Experiment_Number': int})
and many other variations.
Can someone please help? I'm new to pandas and this close to giving up on this :(
Any help will be appreciated.
I had used this for renaming the column before:
df.columns.values[0] = 'Experiment_Number'
Assigning through df.columns.values mutates the index's underlying array in place, which pandas does not support: the printed column name may change, but the index's cached lookup table can still hold the old name, so df['Experiment_Number'] raises a KeyError.
Instead, I can suggest df.rename, which rebuilds the index properly:
df = df.rename(columns={df.columns[0] : 'Experiment_Number'})
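For completeness, a minimal end-to-end sketch of the fix (the file name is a placeholder, and this assumes the column holds whole-number floats with no missing values):
import pandas as pd

df = pd.read_csv('data.csv')                                   # 'data.csv' is hypothetical
df = df.rename(columns={df.columns[0]: 'Experiment_Number'})   # rename by position, robust to odd characters
df['Experiment_Number'] = df['Experiment_Number'].astype(int)  # float -> int now succeeds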
You can convert the type during your read_csv() call and then rename the column afterward. For example:
df = pandas.read_csv(filename,
                     dtype={'Unnamed': 'float'},   # inform read_csv this field is float
                     converters={'Unnamed': int})  # apply the int() function to each value
df.rename(columns={'Unnamed': 'Experiment_Number'}, inplace=True)
The dtype is not strictly necessary here, because the converter will override it, but it is wise to get in the habit of providing a dtype for every field of your input. It is annoying, for example, how pandas silently treats integer columns as float by default (e.g. when missing values are present). Also, if you have specified dtype, you can later remove the converters option without worry.
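For instance, a sketch of that habit (the file and column names here are made up):
import pandas

df = pandas.read_csv('experiments.csv',                    # hypothetical file
                     dtype={'Experiment_Number': 'int64',  # hypothetical columns
                            'Measurement': 'float64',
                            'Label': 'object'})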
I am exporting an SQL database to CSV using pandas.read_sql_query + df.to_csv, and the problem is that integer fields end up represented as floats in the DataFrame.
My code:
import pymysql
import pandas

conn = pymysql.connect(host=server, port=port)  # server and port are defined elsewhere
chunk = pandas.read_sql_query('''select * from table''', conn)
df = pandas.DataFrame(chunk)  # here int values show up as float
df.to_csv()
I am exporting a number of tables this way, and the problem is that int fields are exported as floats (with a decimal point). Also, I don't know upfront which column has which type (the code is supposed to be generic for all tables).
However, my goal is to export everything as-is, as strings.
The things I've tried (without success):
df.applymap(str)
read_sql_query(coerce_float=False)
df.fillna('')
DataFrame(dtype=object) / DataFrame(dtype=str)
Of course, I can post-process the data to type-cast the integers, but it would be better to do it during the initial import.
UPD: my dataset has NULL values. They should be replaced with empty strings (since the purpose is to typecast all columns to strings).
Pandas infers the datatype from a sample of the data. If an integer column has null values, pandas assigns it the float datatype, because NaN is of float type and because pandas is built on numpy, which has no nullable integer type. So try to make sure that you don't have any nulls, or that nulls are replaced with 0s if that makes sense in your dataset, 0 being an integer.
Alternatively, you can specify the dtypes on data import, but you have to use a special kind of integer, the nullable Int64 extension dtype, as per the docs:
"If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas"
So I found a solution by post-processing the data (force-converting to strings) using applymap:
def convert_to_int(x):
    # cast whole-number floats like 42.0 to int, then to str; leave anything else untouched
    try:
        return str(int(x))
    except (ValueError, TypeError):
        return x

df = df.applymap(convert_to_int)
This step takes significant processing time, but it solves my problem.
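One caveat: on pandas 2.1+, DataFrame.applymap is deprecated in favor of DataFrame.map, so the equivalent call there would be:
df = df.map(convert_to_int)  # same element-wise application, newer API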
I have a data frame that looks like this:
[screenshot of the DataFrame]
I exported it to Excel to be able to see it more easily. Basically, I am trying to sort it by SeqNo ascending, and it isn't sorting numerically: instead of going 0,0,0,0,1,1,1,1,2,2,2,2 it goes 0,0,0,0,0,1,1,1,1,10,10,10,10. Please help out if possible. Here is the code I have to sort it; I have tried many other methods but it just isn't sorting correctly.
final_df = df.sort_values(by=['SeqNo'])
Based on your description, I think the column values are being treated as "string" instead of "int". You can confirm this by checking the datatype of the column (e.g. use df.info() to check the datatypes of all the columns in the dataframe).
One option to resolve this is to convert that particular column from string to "int" before sorting and exporting to Excel, by applying pandas' to_numeric() function. Please check the pandas documentation for to_numeric() (refer to https://www.linkedin.com/pulse/change-data-type-columns-pandas-mohit-sharma/ for a sample).
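Roughly, a sketch of that approach (this assumes the column parses cleanly; pass errors='coerce' to turn unparsable entries into NaN instead):
import pandas as pd

df['SeqNo'] = pd.to_numeric(df['SeqNo'])  # string -> numeric
final_df = df.sort_values(by=['SeqNo'])   # now 10 sorts after 2, not after 1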
First of all, try the command given below to verify the type of the data you were given, because it's important to understand your data first:
print(df.dtypes)
The command above will display the datatypes of all the columns. Then look up the SeqNo datatype. If the output for SeqNo shows something like:
SeqNo     object
dtype: object
then your data is in string format and you have to convert it to an integer or numeric format. There are two ways to convert it:
1. With the astype(int) method:
df['SeqNo'] = df['SeqNo'].astype(int)
2. With the to_numeric method:
df['SeqNo'] = pd.to_numeric(df['SeqNo'])
After this step, verify that the datatype has actually changed by running print(df.dtypes) again; it should now show output like:
SeqNo     int32
dtype: object
Now you can print the data sorted in ascending order:
final_df = df.sort_values(by=['SeqNo'], ascending=True)
This is a rather non-standard question. For educational purposes, I'm trying to create a mixed-type column in a CSV file, so that I get a warning message when importing the dataset into a pandas DataFrame, and can later deal with that column to show how it's done.
The problem is that I type 0s into a string column in Excel, save and close the file, but the clever pandas still imports that column as a string column, so it doesn't detect that there are in fact floats in it.
I also tried changing the format of only those 0s in pandas using astype('float'), then exporting and re-importing. It still doesn't work.
Does anyone have an idea how I can create a column that pandas will read as mixed type?
Thanks in advance!
I'm trying to create a mixed type column in a csv file, so that I get a warning message when importing the dataset into a pandas DataFrame
Pandas will always infer the type of a column (a Series object), and this is always going to be a single type. If every value in the column is a string, then pandas will load it as a column of string type.
If there are "mixed" values that can't reasonably be loaded as strings, integers, etc., then the inferred type will simply be dtype: object. That also means you will get no warning.
You can force the type when loading the dataframe via the dtype parameter.
pd.read_csv("test_file.csv", index_col=0, dtype=int)
Now pandas will try to convert everything to int, and if there are values that can't be converted to int, you will get an exception such as
ValueError: invalid literal for int() with base 10: 'a'
when trying to load a dataset that contains the string 'a'. But again, this will not produce a warning; the operation will simply fail.
Here is how you can create a mixed column.
df = pd.DataFrame()
df["mix"] = ["a", "b", 1, True]
df.to_csv("test_file.csv")
df_again = pd.read_csv("test_file.csv", index_col=0)
print(df_again["mix"])
The type of the mix column is object:
...
Name: mix, dtype: object
If you change the read_csv in the above code into
df_again = pd.read_csv("test_file.csv", index_col=0, dtype=int)
you will get the mentioned error.
You can create a DataFrame with mixed values.
pandas guesses the type of each column chunk by chunk; if different chunks produce different guessed types, a DtypeWarning is emitted, but each chunk's values keep their own type (so the column ends up mixed).
df = pd.DataFrame({'a': (['1'] * 100000 +
['X'] * 100000 +
['1'] * 100000),
'b': ['b'] * 300000})
df.to_csv('test.csv', index=False)
# when reading it pandas emits a DtypeWarning: Columns (0) have mixed types
df2 = pd.read_csv('test.csv')
>>> type(df2.iloc[262140, 0])
<class 'str'>
>>> type(df2.iloc[262150, 0])
<class 'int'>
It's probably not the best way to write production-ready code, but it can be useful in tests and when debugging your code.
See the documentation: https://pandas.pydata.org/docs/reference/api/pandas.errors.DtypeWarning.html
I have a csv file with no headers. It has around 35 columns.
I am reading this file using pandas.
Currently, the issue is that when pandas reads the file, it automatically assigns a datatype to each column.
How do I avoid this automatic assignment of data types?
I have a column C which I want to store as a string instead of an int, but pandas automatically assigns it to int.
I tried 2 things.
1)
my_df = pd.DataFrame()
my_df = pd.read_csv('my_csv_file.csv',names=['A','B','C'...'Z'],converters={'C':str},engine = 'python')
The code above gives me this error:
ValueError: Expected 37 fields in line 1, saw 35
If I remove converters={'C':str}, engine='python', there is no error.
2)
old_df['C'] = old_df['C'].astype(str)
The issue with this approach is that if the value in the column is '00123', it has already been read as the integer 123, and astype then converts that to '123'. It loses the initial zeroes because pandas thinks the column is an integer.
Use the dtype or converters option in read_csv (see the read_csv doc); both work regardless of whether you use the Python engine:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['00123', '00125'], 'col2': [1, 2], 'col3': [1.0, 2.0]})
df.to_csv('test.csv', index=False)
new_df = pd.read_csv('test.csv', dtype={'col1': str, 'col2': np.int64, 'col3': np.float64})
If you simply pass dtype=str, it will read every column in as a string (object). You cannot do that with converters, because converters expects a dictionary. You could substitute converters for dtype in the code above and get the same result.
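As a quick sketch of both variants against the test.csv written above:
all_str = pd.read_csv('test.csv', dtype=str)              # every column becomes string (object)
same = pd.read_csv('test.csv', converters={'col1': str})  # per-column converter, same effect for col1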
I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float/int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the type stored in Excel. If the column was numeric in Excel, the conversion will go correctly. If your column was stored as text, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can inspect df.dtypes to see the type of each column. Then, if the column type is not what you want, you can change it with df['GM'].astype(float), and new_df = df.loc[df['GM'].astype(float) > 8000] should work as you want.
You can also convert the entire column to a numeric type:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
You can check the data type of your column via df['GM'].dtype (plain type(df['GM']) would only tell you it is a Series). To convert it to float, use the astype function as follows:
df['GM'].astype(float)
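Note that astype returns a new Series rather than modifying the column in place, so assign the result back before filtering; a short sketch:
df['GM'] = df['GM'].astype(float)  # assign back; astype returns a copy
new_df = df.loc[df['GM'] > 8000]   # the comparison from the question now works numerically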