I have a large DataFrame with 85 columns. The missing data has been coded as NaN. My goal is to get the amount of missing data in each column, so I wrote a for loop to build a list of the amounts, but it does not work.
The following is my code:
headers = x.columns.values.tolist()
nans=[]
for head in headers:
    nans_col = x[x.head == 'NaN'].shape[0]
    nan.append(nans_col)
When I replace head in the loop with a specific column's name, the code works and gives me the amount of missing data in that column.
So I do not know how to correct the for loop. Would somebody be kind enough to help me with this? I highly appreciate your help.
To count missing values per column in pandas (the Python data analysis library) you can use:
In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
In [6]: df.isnull().sum()
Out[6]:
a    1
b    2
dtype: int64
For a single column or a Series you can count the missing values as shown below:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([1,2,3, np.nan, np.nan])
In [4]: s.isnull().sum()
Out[4]: 2
This gives you, for each column name, a count of the missing values (printed as the count next to True in the value_counts output):
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
Just use DataFrame.info; the non-null count is probably what you want, and more.
>>> pd.DataFrame({'a':[1,2], 'b':[None, None], 'c':[3, None]}) \
.info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       2 non-null      int64
 1   b       0 non-null      object
 2   c       1 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 176.0+ bytes
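Note that in newer pandas releases the null_counts keyword was renamed to show_counts; if the call above raises a TypeError, a minimal sketch assuming pandas 1.2 or later would be:
import pandas as pd

# Same call as above, but with show_counts (null_counts was deprecated in
# pandas 1.2 and removed in 2.0)
pd.DataFrame({'a':[1,2], 'b':[None, None], 'c':[3, None]}) \
  .info(verbose=True, show_counts=True)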
If there are multiple dataframes, below is a function to calculate the number of missing values in each column, together with the percentage.
Missing Data Analysis
def miss_data(df):
    x = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100
        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data)
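For a quick sense of the output, a hypothetical usage sketch (df_demo is a made-up frame, not from the question):
import pandas as pd

# Made-up example frame, just to show the shape of the output
df_demo = pd.DataFrame({'a': [1, None, 3], 'b': [None, None, 6]})
miss_data(df_demo)
# Prints something like:
#   column_name missing_data missing_in_percentage
# 0           a            1             33.333333
# 1           b            2             66.666667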
#function to show the total null values per column
colum_name = np.array(data.columns.values)
def iter_columns_name(colum_name):
    for k in colum_name:
        print("total nulls {}=".format(k), pd.isnull(data[k]).values.ravel().sum())
#call the function
iter_columns_name(colum_name)
#output
total nulls start_date= 0
total nulls end_date= 0
total nulls created_on= 0
total nulls lat= 9925
.
.
.
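For completeness, a minimal sketch of how the loop from the question could be fixed (assuming x is the original DataFrame): the missing values are real NaNs rather than the string 'NaN', so they have to be tested with isnull(), the column has to be selected by name rather than as an attribute, and the list is nans, not nan:
headers = x.columns.values.tolist()
nans = []
for head in headers:
    # isnull() catches real NaN values; x[head] selects the column by name
    nans.append(x[head].isnull().sum())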
I have a dataset df with two columns ID and Value. Both are of Dtype "object". However, I would like to convert the column Value to Dtype "double" with a dot as decimal separator. The problem is that the values of this column contain noise due to the presence of too many commas (e.g. 0,1,,) - or after replacement too many dots (e.g. 0.1..). As a result, when I try to convert the Dtype to double, I get the error message: could not convert string to float: '0.2.'
Example code:
#required packages
import pandas as pd
import numpy as np
# initialize list of lists
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Value'])
#replace comma with dot as separator
df = df.replace(',', '.', regex=True)
#examine dtype per column
df.info()
#convert dtype from object to double
df = df.astype({'Value': np.double}) #this is where the error message appears
The preferred outcome is to have the values within the column Value as 0.1, 0.2 and 0.01 respectively.
How can I get rid of the redundant commas or, after replacement, dots in the values of the column Values?
One option: use string functions to convert and strip the values. For example:
#required packages
import pandas as pd
import numpy as np
# initialize list of lists
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Value'])
#replace comma with dot as separator
df['Value'] = df['Value'].str.replace(',', '.', 1).str.rstrip(',')
#examine dtype per column
df.info()
#convert dtype from object to double
df = df.astype({'Value': np.double})
print("------ df:")
print(df)
prints:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ID      3 non-null      int64
 1   Value   3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
------ df:
ID Value
0 1 0.10
1 2 0.20
2 3 0.01
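The key choice above is replacing only the first comma (the third argument, n=1) and then stripping any trailing ones. A sketch of the same cleanup, starting again from the raw strings, using explicit keyword arguments and pd.to_numeric instead of astype (either conversion ends up as float64):
# Rebuild the frame from the raw data defined above, then clean and convert
df2 = pd.DataFrame(data, columns=['ID', 'Value'])
df2['Value'] = df2['Value'].str.rstrip(',').str.replace(',', '.', n=1, regex=False)
df2['Value'] = pd.to_numeric(df2['Value'])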
I have a DataFrame that I build row-by-row (by necessity). My issue is that at the end, all dtypes are object. This is not so if the DataFrame is created with all the data at once.
Let me explain what I mean.
import pandas as pd
from IPython.display import display
# Example data
cols = ['P/N','Date','x','y','z']
PN = ['10a1','10a2','10a3']
dates = pd.to_datetime(['2022-07-01','2022-07-03','2022-07-05'])
xd = [0,1,2]
yd = [1.1,1.2,1.3]
zd = [-0.8,0.,0.8]
# Canonical way to build DataFrame (if you have all the data ready)
dg = pd.DataFrame({'P/N':PN,'Date':dates,'x':xd,'y':yd,'z':zd})
display(dg)
dg.dtypes
Here's what I get. Note the correct dtypes:
OK, now I do the same thing row-by-row:
# Build empty DataFrame
cols = ['P/N','Date','x','y','z']
df = pd.DataFrame(columns=cols)
# Add rows in loop
for i in range(3):
    new_row = {'P/N':PN[i],'Date':pd.to_datetime(dates[i]),'x':xd[i],'y':yd[i],'z':zd[i]}
    # deprecated
    #df = df.append(new_row,ignore_index=True)
    df = pd.concat([df,pd.DataFrame([new_row])],ignore_index=True)
display(df)
df.dtypes
Note the [] around new_row, otherwise you get a stupid error. (I really don't understand the deprecation of append, BTW; it allows for much more readable code.)
But now I get this: it isn't the same as above, all the dtypes are object!
The only way I found to recover my dtypes is to use infer_objects:
# Recover dtypes by using infer_objects()
dh = df.infer_objects()
dh.dtypes
And dh is now the same as dg.
Note that even if I do
df = pd.concat([df,pd.DataFrame([new_row]).infer_objects()],ignore_index=True)
above, it still does not work. I believe this is due to a bug in concat: when an empty DataFrame is concat'ed to a non-empty DataFrame, the resulting DataFrame fails to take over the dtypes of the second DataFrame. We can verify this with:
pd.concat([pd.DataFrame(),df],ignore_index=True).dtypes
and all dtypes are still object.
Is there a better way to build a DataFrame row-by-row and have the correct dtypes inferred automatically?
In your initial dataframe, df = pd.DataFrame(columns=cols), the columns are all of type object because there is no data to infer dtypes from, so you need to set them with astype. As @mozway commented, it is recommended to collect the rows from the loop in a list.
I expect this to work for you:
import pandas as pd
cols = ['P/N','Date','x','y','z']
dtypes = ['object', 'datetime64[ns]', 'int64', 'float64', 'float64']
PN = ['10a1','10a2','10a3']
dates = pd.to_datetime(['2022-07-01','2022-07-03','2022-07-05'])
xd = [0,1,2]
yd = [1.1,1.2,1.3]
zd = [-0.8,0.,0.8]
df = pd.DataFrame(columns=cols).astype({c:d for c,d in zip(cols,dtypes)})
new_rows = []
for i in range(3):
    new_row = [PN[i], pd.to_datetime(dates[i]), xd[i], yd[i], zd[i]]
    new_rows.append(new_row)
df_new = pd.concat([df, pd.DataFrame(new_rows, columns=cols)], axis=0)
print(df_new)
print(df_new.info())
Output:
P/N Date x y z
0 10a1 2022-07-01 0 1.1 -0.8
1 10a2 2022-07-03 1 1.2 0.0
2 10a3 2022-07-05 2 1.3 0.8
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   P/N     3 non-null      object
 1   Date    3 non-null      datetime64[ns]
 2   x       3 non-null      int64
 3   y       3 non-null      float64
 4   z       3 non-null      float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 144.0+ bytes
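Following the recommendation to collect the rows in a list first, a minimal sketch (reusing PN, dates, xd, yd and zd from above; df_rows is just an illustrative name) that skips the empty frame entirely and lets pandas infer the dtypes itself:
rows = []
for i in range(3):
    rows.append({'P/N': PN[i], 'Date': dates[i], 'x': xd[i], 'y': yd[i], 'z': zd[i]})
# Building the DataFrame once from the list gives the same dtypes as dg above
df_rows = pd.DataFrame(rows)
print(df_rows.dtypes)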
I'm trying to read a huge CSV file (almost 5GB) into a pandas dataframe.
This CSV only has 3 columns like this:
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   STORE_ID  404944 non-null  int64
 1   SIZE      404944 non-null  int64
 2   DISTANCE  404944 non-null  object
The problem is the column DISTANCE should only have int64 numbers, but somehow it contains some "null" values in the form of \\N. These \\N are causing my code to fail. Unfortunately I have no control over building this CSV, so I have no way of correcting it beforehand.
This is a sample of the CSV:
STORE_ID,SIZE,DISTANCE
900072211,1,1000
900072212,1,1000
900072213,1,\\N
900072220,5,4500
I need to have this DISTANCE column with only int64 values.
Since the CSV is huge, I first tried to read it using the following code, assigning dtypes at the start:
df = pd.read_csv("polygons.csv", dtype={"STORE_ID": int, "SIZE": int, "DISTANCE": int})
But with this I got this error:
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
How would you go about efficiently reading this CSV into a dataframe? Is there a way to assign a dtype to the DISTANCE column while reading?
Use na_values as a parameter of pd.read_csv; it should solve your problem:
df = pd.read_csv(..., na_values=r'\\N')
Output:
>>> df
STORE_ID SIZE DISTANCE
0 900072211 1 1000.0
1 900072212 1 1000.0
2 900072213 1 NaN
3 900072220 5 4500.0
>>> df.dtypes
STORE_ID      int64
SIZE          int64
DISTANCE    float64
dtype: object
Update
You can also use converters:
convert_N = lambda x: int(x) if x != r'\\N' else 0
df = pd.read_csv(..., converters={'DISTANCE': convert_N})
Output:
>>> df
STORE_ID SIZE DISTANCE
0 900072211 1 1000
1 900072212 1 1000
2 900072213 1 0
3 900072220 5 4500
>>> df.dtypes
STORE_ID    int64
SIZE        int64
DISTANCE    int64
dtype: object
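If the missing distances should stay missing but the column still needs to hold whole numbers, a sketch assuming pandas 1.0+ (for the nullable Int64 extension dtype; polygons.csv and the \\N marker are taken from the question and the answer above):
df = pd.read_csv('polygons.csv', na_values=r'\\N')
# Nullable integer dtype: keeps whole numbers but allows <NA> for missing values
df['DISTANCE'] = df['DISTANCE'].astype('Int64')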
I have a dataframe df_data and a list l_ids. Here's what df_data.head() looks like:
And l_ids[:5] is [224960004, 60032008, 26677001, 162213003, 72405004]
I want to get rows that have l_id present in list l_ids.
So I do this: df_temp = df_data[df_data.isin(l_ids)]
However, df_temp has rows with NaN in it. In fact, the text field of all rows is NaN. Here's what df_temp.head() looks like:
Cross-check:
print(79823003 in l_ids, 224960004 in l_ids)
True True
As we can see, l_ids[0] is 224960004, which is present in df_temp, but it's now a float and the corresponding text is NaN. The same goes for 79823003 and the other ids.
Why is this happening? I had gotten the same error in the past too, but I got the rows some other way and ignored the error. Now that it has happened again in an unrelated project, I feel like I am making some kind of mistake here.
Extra info
df_data.info() returns:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3577942 entries, 0 to 6953898
Data columns (total 2 columns):
text object
l_id int64
dtypes: int64(1), object(1)
df_temp.info() returns:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3577942 entries, 0 to 6953898
Data columns (total 2 columns):
text object
l_id float64
dtypes: float64(1), object(1)
Thus datatype for l_id field changed from int64 to float64.
Your statement should be like this:
df_temp = df_data[df_data['l_id'].isin(l_ids)]
This will check, for each row, whether the value of the column l_id is present in the list l_ids and return the rows for which the condition is true. Your mistake was to call isin() on the whole dataframe df_data instead of just the column df_data['l_id']: on the whole frame, isin() returns a boolean mask of the same shape, and indexing with that mask keeps only the matching cells and turns everything else into NaN, which is also why the int64 l_id column gets upcast to float64.
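A small demo of the difference, with made-up values, just to show the masking behaviour:
import pandas as pd

demo = pd.DataFrame({'text': ['aa', 'bb'], 'l_id': [1, 2]})
# isin() on the whole frame: a same-shaped mask, so non-matching cells become NaN
print(demo[demo.isin([1])])
# isin() on the column: a row filter, dtypes stay intact
print(demo[demo['l_id'].isin([1])])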
One more way to solve the problem:
import pandas as pd
df = pd.DataFrame({
'text': ['aa', 'bb', 'cc', 'dd'],
'l_id': [1, 2, 3, 4],
})
ids = [2, 3]
df[df.apply(lambda x: x['l_id'] in ids, axis=1)]
I would like to calculate and subtract the average over a subset of columns. Here is one way to do it:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
def col_avg(df, col_ids):
    '''Calculate and subtract average over *col_ids*

    *df* is modified in-place.
    '''
    cols = [ df.columns[i] for i in col_ids ]
    acc = df[cols[0]].copy()
    for col in cols[1:]:
        acc += df[col]
    acc /= len(cols)
    for col in cols:
        df[col] -= acc
# Create example data
np.random.seed(42)
df = pd.DataFrame(data=np.random.random((433,80)) + np.arange(433)[:, np.newaxis],
columns=['col-%d' % x for x in range(80)])
#df = pd.DataFrame.from_csv('data.csv')
# Calculate average over columns 2, 3 and 6
df_old = df.copy()
col_avg(df, [ 1, 2, 5])
assert any(df_old.iloc[0] != df.iloc[0])
Now and I don't particularly like the two for loops, so I tried to express the same operation more concisely:
def col_avg(df, col_ids):
    dfT = df.T
    mean = dfT.iloc[col_ids].mean()
    dfT.iloc[col_ids] -= mean
This implementation looks a lot nicer (IMO), but it has one drawback: it only works for some datasets. With the example above, it works. But e.g. when loading this csv file it fails.
The only explanation that I have is that in some cases the dfT.iloc[col_ids] expression must be internally creating a copy of the value array instead of modifying it in-place.
Is this the right explanation?
If so, what is it about the DataFrame that makes pandas decide to copy the data in one case but not the other?
Is there another way to perform this task that always works and does not require explicit iteration?
EDIT: When suggesting alternative implementations, please state why you think your implementation will always work. After all, the above code seems to work for some inputs as well.
The transpose of the DataFrame, dfT = df.T, may return a new DataFrame, not a view.
In that case, modifying dfT does nothing to df.
In your toy example,
df = pd.DataFrame(data=np.random.random((433,80)) + np.arange(433)[:, np.newaxis],
columns=['col-%d' % x for x in range(80)])
all the columns have the same dtype:
In [83]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 433 entries, 0 to 432
Data columns (total 80 columns):
col-0 433 non-null float64
col-1 433 non-null float64
col-2 433 non-null float64
...
dtypes: float64(80)
memory usage: 274.0 KB
whereas in the DataFrame built from CSV, some columns have int64 dtype:
In [55]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 492 entries, 0 to 491
Data columns (total 72 columns):
sample no 492 non-null int64
index 492 non-null int64
plasma-r 492 non-null float64
plasma-z 492 non-null float64
...
Columns of a DataFrame always have a single dtype. So when you transpose this CSV-based df, the new DataFrame cannot be formed by simply transposing a single underlying NumPy array. The integers which were in columns by themselves are now spread across rows. Since each column of df.T must have a single dtype, the integers are upcast to floats, so all the columns of df.T have dtype float64. Data has to be copied when dtypes change.
The bottom line: when df has mixed types, df.T is a copy.
col_avg could be simplified to
def col_avg2(df, col_ids):
    means = df.iloc[:, col_ids].mean(axis=1)
    for i in col_ids:
        df.iloc[:, i] -= means
Note that the expression df.iloc[:, col_ids] will return a copy since col_ids is not a basic slice. But assignment to df.iloc[...] (or df.loc[...]) is guaranteed to modify df.
This is why assigning to df.iloc or df.loc is the recommended way to avoid the assignment-with-chained-indexing pitfall.
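A quick sanity check of col_avg2 (a sketch that mirrors the assert from the question, reusing its toy data):
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(data=np.random.random((433, 80)) + np.arange(433)[:, np.newaxis],
                  columns=['col-%d' % x for x in range(80)])
df_old = df.copy()
col_avg2(df, [1, 2, 5])
# Only the selected columns should have changed
assert any(df_old.iloc[0] != df.iloc[0])
assert all(df_old['col-0'] == df['col-0'])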
From my understanding of the question, this does what you are asking. I don't understand why you're transposing the dataframe. Note: I got rid of the string column names for simplicity, but you can replace those easily.
np.random.seed(42)
df = pd.DataFrame(data=np.random.random((6,8)) + np.arange(6)[:, np.newaxis])#,
#columns=['col-%d' % x for x in range(80)])
# Calculate average over columns 2, 3 and 6
df_old = df.copy()
col_ids=[1,2,5]
df[col_ids] = df[col_ids] - np.mean(df[col_ids].values)
df_old-df # to make sure average is calculated over all three columns
Out[139]:
0 1 2 3 4 5 6 7
0 0 2.950637 2.950637 0 0 2.950637 0 0
1 0 2.950637 2.950637 0 0 2.950637 0 0
2 0 2.950637 2.950637 0 0 2.950637 0 0
3 0 2.950637 2.950637 0 0 2.950637 0 0
4 0 2.950637 2.950637 0 0 2.950637 0 0
5 0 2.950637 2.950637 0 0 2.950637 0 0
OP, say you want the average computed over similar columns and subtracted (say the "Psi at ..." columns). The easiest way is
df = pd.read_csv('data.csv')
psi_cols = [c for c in df.columns if c.startswith('Psi')]
df[psi_cols] -= df[psi_cols].mean().mean()
This computes the total mean across all columns. If you want to subtract the column mean from each column, do
df[psi_cols] -= df[psi_cols].mean()
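To see the difference on a tiny frame (made-up numbers and column names, just for illustration):
import pandas as pd

tiny = pd.DataFrame({'Psi a': [1.0, 3.0], 'Psi b': [5.0, 7.0]})
print(tiny.mean())         # per-column means: Psi a -> 2.0, Psi b -> 6.0
print(tiny.mean().mean())  # single overall mean: 4.0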