I have a dask dataframe with one problematic column that (I believe) is the source of an error thrown every time I try to do anything with the dataframe (be it head, or to_csv, or even subsetting by a different column). The error probably comes from a data type mismatch and shows up like this:
ValueError: invalid literal for int() with base 10: 'FIPS'
So I decided to drop that column ('FIPS') using
df = df.drop('FIPS', axis=1)
Now when I do df.columns, I no longer see 'FIPS', which I take to mean it has indeed been dropped. But when I try to write a different column to a file
df.column_a.to_csv('example.csv')
I keep getting the same error
ValueError: invalid literal for int() with base 10: 'FIPS'
I assume this has something to do with dask's lazy evaluation delaying the drop, but any workaround would be very helpful.
Basically, I just need to extract a single column (column_a) from df.
Try converting to a pandas dataframe after the drop:
df = df.compute()
and only then write to CSV.
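A minimal sketch of that workaround, assuming the data came from a CSV read with dask.dataframe.read_csv (the file name here is hypothetical):
import dask.dataframe as dd

df = dd.read_csv('data.csv')        # hypothetical source file
df = df.drop('FIPS', axis=1)        # lazy: nothing is parsed yet
pdf = df.compute()                  # materialise into a pandas DataFrame
pdf['column_a'].to_csv('example.csv')
# if the error still appears at compute time, forcing dtype={'FIPS': str} in read_csv is another option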
I am reading data from a PostgreSQL DB into a pandas dataframe. In one of the columns all values are integers, though some are missing. While reading, pandas is attaching a trailing .0 to every value in the column.
e.g. Original Data
SUBJID
1031456
1031457
1031458
What I am getting in the Dataframe column is this
df['SUBJID'].head()
1031456.0
1031457.0
1031458.0
I know I can remove it, but there are multiple columns and I never know which column will have this problem. So I want to ensure, while reading itself, that everything is read as a string and without those trailing zeros.
I have already tried df = pd.read_sql('q', dtype=str), but it's not giving the desired output.
Please let me know the solution.
Adding another answer because this approach is different from the other one.
This happens because your dataset contains empty cells, and since the int type doesn't support NA/NaN, the column gets cast to float.
One solution would be to fill the NA/NaN values with 0 and then set the type to int, like so:
columns = ['SUBJID'] # you can list the columns you want, or you can run it on the whole dataframe if you want to.
df[columns] = df[columns].fillna(0).astype(int)
# then you can convert to string after converting to int if you need to do so
Another option would be to have the SQL query do the filling for you (which is a bit tedious to write, if you ask me).
Note that pandas.read_sql didn't accept a dtype argument until fairly recent pandas versions anyway.
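If you do want the database to do the filling, a rough sketch of that idea (the table name and connection object are hypothetical):
import pandas as pd

# fill the NULLs and cast on the database side, so pandas never sees a float
query = "SELECT CAST(COALESCE(SUBJID, 0) AS TEXT) AS SUBJID FROM subjects"  # hypothetical table
df = pd.read_sql(query, conn)  # conn: your existing PostgreSQL connection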
Try setting the dtype of the column to int and then to str:
df['SUBJID'] = df['SUBJID'].astype('int32')
df['SUBJID'] = df['SUBJID'].astype(str)
If you want to manually fix the strings, you can do
df['SUBJID'] = df['SUBJID'].apply(lambda x: x.split(".")[0])
This should strip out the "." and everything after it, but make sure you don't use it on columns that contain a "." that you need.
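If the column is already of string dtype, a vectorised equivalent of that lambda (same caveat about other '.' characters) would be:
df['SUBJID'] = df['SUBJID'].astype(str).str.split('.').str[0]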
I am trying to calculate the mean of one column from an Excel file.
I delete all the null values and '-' entries in the column called 'TFD' and form a new dataframe by selecting three columns. I want to calculate the mean from the new dataframe with groupby, but I get the error "No numeric types to aggregate" and I don't know why I get it or how to fix it.
sheet = pd.read_excel(file)
sheet_copy = sheet
sheet_copy = sheet_copy[(~sheet_copy['TFD'].isin(['-'])) & (~sheet_copy['TFD'].isnull())]
sheet_copy = sheet_copy[['Participant ID', 'Paragraph', 'TFD']]
means = sheet_copy['TFD'].groupby([sheet_copy['Participant ID'], sheet_copy['Paragraph']]).mean()
From your spreadsheet snippet above it looks as though your Participant ID and Paragraph columns are formatted as Text, which leads me to believe they will be strings inside your dataframe, and that this is precisely where your issue lies, given the exception "No numeric types to aggregate".
Following this, here are some good examples of group by with the mean clause from the pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.mean.html
If you had the dataset to hand I would have tried it out myself and provided a snippet of the code used.
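That said, a minimal sketch of the usual fix, assuming TFD holds numbers stored as text once the '-' and null rows are removed:
import pandas as pd

sheet = pd.read_excel(file)
sheet_copy = sheet[(~sheet['TFD'].isin(['-'])) & (~sheet['TFD'].isnull())].copy()
sheet_copy['TFD'] = pd.to_numeric(sheet_copy['TFD'])   # force a numeric dtype before aggregating
means = sheet_copy.groupby(['Participant ID', 'Paragraph'])['TFD'].mean()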
I get the following error when passing a pandas dataframe to scikit-learn algorithms:
invalid literal for float(): 2.,3
How do I find the row or column with the problem in order to fix or eliminate it? Is there something like df[df.isnull().any(axis=1)] for a specific value (in my case I guess 2.,3)?
If you know which column it is, you can use
df[df.your_column == '2.,3']
and you'll get all rows where the specified column holds the value 2.,3. Note that the comparison has to be against the string '2.,3', since 2.,3 is not a valid numeric literal.
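If you don't know which column it is, something in the spirit of the df[df.isnull().any(axis=1)] pattern you mentioned should work, for example:
mask = df.isin(['2.,3'])                 # True wherever a cell equals the literal string '2.,3'
bad_rows = df[mask.any(axis=1)]          # the offending rows
bad_cols = df.columns[mask.any(axis=0)]  # the offending columns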
I have a dataframe whose columns I need to convert to floats and ints, but it has bad rows, i.e., values in a column that should be a float or an integer are instead strings.
If I use df.bad.astype(float), I get an error; this is expected.
If I use pd.to_numeric(df.bad, errors='coerce'), the bad values are replaced with np.NaN, which is also documented behaviour and reasonable.
There is also errors='ignore', which ignores the errors and leaves the offending values alone.
But actually, I want to not ignore the errors, but drop the rows with bad values. How can I do this?
I can ignore the errors and do some type checking, but that's not an ideal solution, and there might be a more idiomatic way to do this.
Example
test = pd.DataFrame(["3", "4", "problem"], columns=["bad"])
test.bad.astype(float) ## ValueError: could not convert string to float: 'problem'
I want something like this:
pd.to_numeric(df.bad, errors='drop')
And this would return a dataframe with only the 2 good rows.
Since the bad values are replaced with np.NaN, wouldn't a simple df.dropna() get rid of the bad rows now?
EDIT:
Since you need to keep the rows that were NaN to begin with, maybe you could use df.fillna() prior to pd.to_numeric, so that only the values that fail conversion end up as NaN.
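A minimal sketch of that idea, using 0 as the fill value (an assumption; pick a sentinel that cannot occur in your data):
import numpy as np
import pandas as pd

test = pd.DataFrame(["3", "4", "problem", np.nan], columns=["bad"])

filled = test["bad"].fillna(0)                    # protect the original NaNs from the check
coerced = pd.to_numeric(filled, errors="coerce")  # only genuinely bad values become NaN now
clean = test[coerced.notna()]                     # keeps "3", "4" and the original NaN row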
Every time I import this one CSV ('leads.csv') I get the following warning:
/usr/local/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (11,12,13,14,17,19,20,21) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I import many .csv's for this one analysis of which 'leads.csv' is only one. It's the only file with the problem. When I look at those columns in a spreadsheet application, the values are all consistent.
For example, Column 11 (which is Column K when using Excel), is a simple Boolean field and indeed, every row is populated and it's consistently populated with exactly 'FALSE' or exactly 'TRUE'. The other fields that this error message references have consistently-formatted string values with only letters and numbers. In most of these columns, there are at least some blanks.
Anyway, given all of this, I don't understand why this message keeps happening. It doesn't seem to matter much as I can use the data anyway. But here are my questions:
1) How would you go about identifying any rows/records that are causing this warning?
2) Using the low_memory=False option seems to be pretty unpopular in many of the posts I read. Do I need to declare the datatype of each field in this case, or should I just ignore the warning?
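For 1), one approach (just a sketch, using the column positions from the warning) is to read the file in a single pass and check which Python types actually ended up in those columns:
import pandas as pd

df = pd.read_csv('leads.csv', low_memory=False)   # one pass, so no chunk-by-chunk dtype guessing
for i in (11, 12, 13, 14, 17, 19, 20, 21):        # column positions named in the warning
    col = df.columns[i]
    print(col, df[col].apply(type).value_counts().to_dict())  # the Python types present in each column
From there you can decide for 2) whether declaring an explicit dtype for those fields on import is worth it, or whether low_memory=False is good enough for your analysis.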