How to deal with NaN values in pandas (from csv file)? - python

I have a fairly large csv file filled with data obtained from a machine for material testing (compression test). The headers of the data are Time, Force, Stroke and they are repeated 10 times because of the sample size, so the last set of headers is Time.10, Force.10, Stroke.10.
Because of the nature of the experiment not all columns are equally long (some are approx. 2000 rows longer than others). When I import the file into my IDE (Spyder or Jupyter) using pandas, all the cells within the rows that are empty in the csv file are labeled as NaN.
The problem is... I can't do any mathematical operations within or between columns that have NaN values, as they are treated as str. I have tried the most recommended solutions on pretty much all forums: .fillna(), dropna(), replace() and interpolate(). The mentioned methods work, but only visually; e.g. df.fillna(0) replaces all NaN values with 0, but when I then try to find the max value in the column, I still get an error saying that there are strings in my column (TypeError: '>' not supported between instances of 'float' and 'str'). The problem is 100% caused by the NaN values resulting from the empty cells in the csv file, as I have imported a csv file in which all columns were the same length (with no empty cells) and there were no problems. If anyone has any solution to this problem (it doesn't need to be within pandas, just within Python) that I've been stuck on for over 2 weeks, I would be grateful.

Try read_csv() with na_filter=False.
This should at least prevent the "empty" source cells from being set to NaN.
But note that:
- such "empty" cells will contain an empty string,
- any column containing at least one such cell will have dtype object (not a number),
- so, for the time being, those columns can't take part in any numeric operations.
So, after read_csv(), you should probably:
- replace the empty strings with e.g. 0 (or whatever numeric value makes sense),
- call to_numeric(...) to convert each such column from object to whatever numeric type is appropriate in each case.
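A minimal sketch of that sequence, assuming a file name like the one described in the question (the Force column name comes from the question's headers):

import pandas as pd

# Keep empty cells as empty strings instead of NaN
df = pd.read_csv("compression_test.csv", na_filter=False)

# Replace the empty strings with 0 (or another numeric placeholder),
# then convert every column from object to a numeric dtype
df = df.replace("", 0)
df = df.apply(pd.to_numeric)

print(df["Force"].max())  # numeric operations now work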

Related

Can I force Python to return only in String-format when I concatenate two series of strings?

I want to concatenate two columns in pandas containing mostly string values and some missing values. The result should be a new column which again contains string values and missing values. Mostly it just worked fine with this:
df['newcolumn']=df['column1']+df['column2']
Most of the values in column1 are numbers (interpreted as strings) like 82. But some of the values in column2 are a composition of letters and numbers starting with an E, like E52 or E83. Now when 82 and E83 are concatenated, the result I want is 82E83. Unfortunately the result is then 8,2E+84. I guess Python implicitly interpreted this as a number in scientific notation.
I already tried different ways of concatenating and forcing string format, but the result is always the same:
df['newcolumn']=(df['column1']+df['column2']).astype(str)
or
df['newcolumn']=(df['column1'].str.cat(df['column2'])).astype(str)
It seems Python first creates a float, producing this unwanted format, and then changes the type to string, keeping results like 8,2E+84. Is there a solution for strictly keeping the string format?
Edit: Thanks for your comments. As I tried to reproduce the problem myself with a very short dataframe, the problem didn't occur either. Finally I realized that it was only a problem with Excel automatically interpreting the cells as (wrong) numbers (in the CSV output). I didn't realize it before, because another dataframe coming from a CSV file, which I used for merging with this dataframe on these concatenated strings, had also already been "destroyed" the same way by Excel. So the merging didn't work properly and I thought the concatenating in Python was the problem. I used to view the dataframe with Excel because it is really big. In the future I will be more careful with this. My apologies for misplacing the problem!
Type conversion is not required in this case. You can simply use
df["newcolumn"] = df.apply(lambda x: f"{str(x[0])}{str(x[1])}", axis = 1)
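If both columns are forced to strings first, plain + concatenation keeps the string format as well; a small hedged variant using the column names from the question:

df["newcolumn"] = df["column1"].astype(str) + df["column2"].astype(str)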

How to ensure that a column in dataframe loaded from a csv file is formatted as an integer (without decimal characters)

I am using Python 3.7
I need to load data from two different sources (both csv) and determine which rows from the one source are not in the second source.
I have used pandas data-frames to load the data and do a comparison between the two sources of data.
I loaded the data from the csv file and a value like 2010392 is turned into 2010392.0 in the data-frame column.
I have read quite a number of articles about formatting data-frame columns; unfortunately, most of them are about date and time conversions.
I came across the article "Format integer column of Data-frame in Python pandas" at http://www.datasciencemadesimple.com/format-integer-column-of-dataframe-in-python-pandas/, which does not solve my problem.
Based on the above mentioned article I have tried the following:
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Out[63]:
0 2010392.0
1 111777967.0
2 2010392.0
3 2012554.0
4 2010392.0
5 2010392.0
6 2010392.0
7 1170126.0
and as you can see, the column values still have a decimal point with a zero.
I expect loading the dataframe from a csv file to keep a number such as 2010392 as 2010392 and not 2010392.0.
Here is the code that I have tried:
import pandas as pd
data = pd.read_csv("timetable_all_2019-2_groups.csv")
data02 = data.drop_duplicates()
print(f'Len data {len(data)}')
print(data.head(20))
print(f'Len data02 {len(data02)}')
print(data02.head(20))
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Here are a few lines of the content of the csv file; the data in the one source looks like this:
IDDCYR,IDDSUBJ,IDDOT,IDDGRPTYP,IDDCLASSGROUP,IDDLECT,IDDPRIMARY
019,AAACA1B,VF,C,A1,2010392,Y
2019,AAACA1B,VF,C,A1,111777967,N
2019,AAACA3B,VF,C,A1,2010392,Y
2019,AAACA3B,VF,C,A1,2012554,N
2019,AAACB2A,VF,C,B1,2010392,Y
2019,AAACB2A,VF,P,B2,2010392,Y
2019,AAACB2A,VF,C,B1,2010392,N
2019,AAACB2A,VF,P,B2,1170126,N
2019,AAACH1A,VF,C,A1,2010392,Y
Looks like you have data which is not of integer type. Once loaded, you should clean that data up and then convert the column to int.
From your error description, you have nans and/or inf values. You could impute the missing values with the mode, mean, median or a constant value. You can achieve that either with pandas or with sklearn imputer, which is dedicated to imputing missing values.
Note that if you use mean, you may end up with a float number, so make sure to get the mean as an int.
The imputation method you choose really depends on what you'll use this data for later. If you want to understand the data, filling nans with 0 may distort aggregations later (e.g. if you want to know the mean, it won't be accurate).
That being said, I see you're dealing with categorical data. One option here is to use dtype='category'. If you later fit a model with this and you leave the ids as numbers, the model can conclude weird things which are not correct (e.g. that the sum of two ids equals some third id, or that higher ids are more important than lower ones: things that a priori make no sense and should not be left to chance).
Hope this helps!
data02['IDDLECT'] = data02['IDDLECT'].fillna(0).astype('int')
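Alternatively, pandas' nullable integer dtype ("Int64") keeps whole numbers without the trailing .0 even when the column contains missing values; a small sketch using the file and column names from the question:

import pandas as pd

# "Int64" (capital I) is pandas' nullable integer dtype, so missing values
# do not force the column to float
data = pd.read_csv("timetable_all_2019-2_groups.csv", dtype={"IDDLECT": "Int64"})
print(data["IDDLECT"].head())  # prints 2010392, not 2010392.0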

Pandas, reading excel column values, but stop when no more values are present in that column

I want to import some values from an excel sheet with Pandas.
When I read values with Pandas, I would like to read column by column, but stop reading values when the rows of each column are empty.
Since in my excel file different columns have different numbers of rows, what I am getting now are arrays with some numbers, but then filled up with "nan" values until they reach the maximum number (i.e., the number of rows of the excel column having the greatest number of rows).
I hope the explanation was not too confusing.
The code snippet is not a great example and it is not reproducible, but hopefully it will help in understanding what I am trying to do.
In the second part of the snippet (below #Removing nan) I was trying to remove the "nan" after having already imported them, but that was not working either, I was getting this error:
ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The same happened with np.isfinite
import numpy as np
import pandas

df = pandas.read_excel(file_name)
for i in range(number_of_average_points):
    # Reading column values (includes nan)
    force_excel_col = df[df.columns[index_force]].values[13:]
    acceleration1_excel_col = df[df.columns[index_acceleration1]].values[13:]
    acceleration2_excel_col = df[df.columns[index_acceleration2]].values[13:]
    # Trying to remove nan
    force.append(force_excel_col[np.logical_not(np.isnan(force_excel_col))])
    acceleration1.append(acceleration1_excel_col[np.isfinite(acceleration1_excel_col)])
    acceleration2.append(acceleration2_excel_col[np.isfinite(acceleration2_excel_col)])
This might be doable, but it is inefficient and bad practice. Having NaN data in a dataframe is a regular part of any data analysis in Pandas (and in general).
I'd encourage you rather to read in the entire excel file. Then, to get rid of all NaNs, you can either replace them (with 0s, for example) using Pandas' built-in fillna() method, or drop all rows from your dataframe that contain NaN values with dropna(). Happy to expand on this if you are interested.
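A minimal sketch of that suggestion; the file and column names here are placeholders, not the asker's real ones:

import pandas as pd

df = pd.read_excel("measurements.xlsx")  # hypothetical file name

# dropna() on a single column keeps only that column's own values,
# so columns of different lengths are no longer padded with NaN
force = df["Force"].dropna().to_numpy()                  # hypothetical column names
acceleration1 = df["Acceleration1"].dropna().to_numpy()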

Pandas DataFrame: when using read_csv, rows with blanks are converting entire column to "object" data type

I have a CSV file that looks as follows:
name,id,weight
a,12345,196.5
b,83748,
,83748,200.0
c, ,155.5
Note, there are several missing values indicated by a single space.
When I load this CSV file into a Pandas DataFrame and check the data types using dtypes, it says that every column is of type "object". Even after I convert the spaces to NaN, it still says everything is an "object".
How do I get the data types to be read in correctly, despite the spaces? Could this possibly be an issue with the Anaconda platform?
When you call read_csv(), make sure you call it with na_values=' ' (or whatever your NaN values actually are).
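A minimal sketch, assuming the sample above is saved as data.csv:

import pandas as pd

df = pd.read_csv("data.csv", na_values=" ")
print(df.dtypes)  # id and weight are now numeric (float64), with NaN for the blanks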

Python Pandas Parser Error -- DtypeWarning

Every time I import this one csv ('leads.csv') I get the following error:
/usr/local/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (11,12,13,14,17,19,20,21) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I import many csv files for this one analysis, of which 'leads.csv' is only one. It's the only file with the problem. When I look at those columns in a spreadsheet application, the values are all consistent.
For example, Column 11 (which is Column K when using Excel), is a simple Boolean field and indeed, every row is populated and it's consistently populated with exactly 'FALSE' or exactly 'TRUE'. The other fields that this error message references have consistently-formatted string values with only letters and numbers. In most of these columns, there are at least some blanks.
Anyway, given all of this, I don't understand why this message keeps happening. It doesn't seem to matter much as I can use the data anyway. But here are my questions:
1) How would you go about identifying any rows/records that are causing this error?
2) Using the low_memory=False option seems to be pretty unpopular in many of the posts I read. Do I need to declare the datatype of each field in this case? Or should I just ignore the error?
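One way to investigate question 1 is to load the file in a single pass and then look at the Python types present in a flagged column; a hedged sketch, using the file name and column index 11 from the warning:

import pandas as pd

data = pd.read_csv("leads.csv", low_memory=False)

col = data.iloc[:, 11]
print(col.map(type).value_counts())        # e.g. how many str vs float values
print(data[col.map(type) == str].head())   # rows where the value was read as text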
