How to create a function in Python for detecting missing values?

Name Sex Age Ticket_No Fare
0 Braund male 22 HN07681 2500
1 NaN female 42 HN05681 6895
2 peter male NaN KKSN55 800
3 NaN male 56 HN07681 2500
4 Daisy female 22 hf55s44 NaN
5 Manson NaN 48 HN07681 8564
6 Piston male NaN HN07681 5622
7 Racline female 42 Nh55146 NaN
8 NaN male 22 HN07681 4875
9 NaN NaN NaN NaN NaN
col_Name No_of_Missing Mean Median Mode
0 Name 3 NaN NaN NaN
1 Sex 1 NaN NaN NaN
2 Age 2 36 42 22
3 Fare 2 4536 4875 2500
Mean/median/mode should be computed only for numeric dtypes; otherwise they should be null.

Try this:
import numpy as np
import pandas as pd

# Your original df
print(df)
# First drop any rows which are completely NaN
df = df.dropna(how="all")
# Create a list of lists; each inner list will become
# one row of data in the new dataframe
new_data = []
# Iterate over the columns
for col in df.columns:
    # The first item is the column name,
    # corresponding to the new df's first column
    _list = [col]
    _list.append(df.dtypes[col])    # dtype of the column is the second item/second column
    missing = df[col].isna().sum()  # total number of NaN values in the column
    if missing > 30:
        print("Max total number of missing exceeded")
        continue  # Skip this column and continue on to the next one
    _list.append(missing)
    # Get the mean; this fails for non-numeric columns, so fall back to NaN
    try:
        mean = df[col].mean()
    except TypeError:
        mean = np.nan
    _list.append(mean)
    # Get the median; same fallback for non-numeric columns
    try:
        median = df[col].median()
    except TypeError:
        median = np.nan
    _list.append(median)
    # Get the mode; mode() returns a Series, so take its first entry,
    # falling back to NaN if the column has no mode
    try:
        mode = df[col].mode()[0]
    except (TypeError, KeyError):
        mode = np.nan
    _list.append(mode)
    new_data.append(_list)
columns = ["col_Name", "DType", "No_of_Missing", "Mean", "Median", "Mode"]
new_df = pd.DataFrame(new_data, columns=columns)
print("============================")
print(new_df)
OUTPUT:
Name Sex Age Ticket_No Fare
0 Braund male 22.0 HN07681 2500.0
1 NaN female 42.0 HN05681 6895.0
2 peter male NaN KKSN55 800.0
3 NaN male 56.0 HN07681 2500.0
4 Daisy female 22.0 hf55s44 NaN
5 Manson NaN 48.0 HN07681 8564.0
6 Piston male NaN HN07681 5622.0
7 Racline female 42.0 Nh55146 NaN
8 NaN male 22.0 HN07681 4875.0
9 NaN NaN NaN NaN NaN
============================
    col_Name    DType  No_of_Missing         Mean  Median     Mode
0       Name   object              3          NaN     NaN   Braund
1        Sex   object              1          NaN     NaN     male
2        Age  float64              2    36.285714    42.0     22.0
3  Ticket_No   object              0          NaN     NaN  HN07681
4       Fare  float64              2  4536.571429  4875.0   2500.0
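As an aside, a similar summary can be built without an explicit loop. A minimal vectorized sketch, assuming the same df as above and at least one numeric column, and restricting mean/median/mode to numeric dtypes as the question requests:

import pandas as pd

num = df.select_dtypes("number")  # numeric columns only

summary = pd.DataFrame({
    "DType": df.dtypes,
    "No_of_Missing": df.isna().sum(),
    "Mean": num.mean(),          # Series align on column names; non-numeric rows become NaN
    "Median": num.median(),
    "Mode": num.mode().iloc[0],  # first mode per numeric column
}).rename_axis("col_Name").reset_index()

print(summary)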

Related

Transform DataFrame in Pandas

I am struggling with the following issue.
My DF is:
df = pd.DataFrame(
    [
        ['7890-1', '12345N', 'John', 'Intermediate'],
        ['7890-4', '30909N', 'Greg', 'Intermediate'],
        ['3300-1', '88117N', 'Mark', 'Advanced'],
        ['2502-2', '90288N', 'Olivia', 'Elementary'],
        ['7890-2', '22345N', 'Joe', 'Intermediate'],
        ['7890-3', '72245N', 'Ana', 'Elementary']
    ],
    columns=['Id', 'Code', 'Person', 'Level'])
print(df)
I would like to get such a result:
     Id  Code 1 Person 1       Level 1  Code 2 Person 2       Level 2  Code 3 Person 3     Level 3  Code 4 Person 4       Level 4
0  7890  12345N     John  Intermediate  22345N      Joe  Intermediate  72245N      Ana  Elementary  30909N     Greg  Intermediate
1  3300  88117N     Mark      Advanced     NaN      NaN           NaN     NaN      NaN         NaN     NaN      NaN           NaN
2  2502     NaN      NaN           NaN  90288N   Olivia    Elementary     NaN      NaN         NaN     NaN      NaN           NaN
I'd start with the same approach as @Andrej Kesely, but then sort by index after unstacking and map ' '.join over the column names.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
df_wide = df.set_index(["Id", "No"]).unstack(level=1).sort_index(axis=1,level=1)
df_wide.columns = df_wide.columns.map(' '.join)
Output
Code 1 Level 1 Person 1 Code 2 Level 2 Person 2 Code 3 \
Id
2502 NaN NaN NaN 90288N Elementary Olivia NaN
3300 88117N Advanced Mark NaN NaN NaN NaN
7890 12345N Intermediate John 22345N Intermediate Joe 72245N
Level 3 Person 3 Code 4 Level 4 Person 4
Id
2502 NaN NaN NaN NaN NaN
3300 NaN NaN NaN NaN NaN
7890 Elementary Ana 30909N Intermediate Greg
Try:
df[["Id", "Id2"]] = df["Id"].str.split("-", expand=True)
x = df.set_index(["Id", "Id2"]).unstack(level=1)
x.columns = [f"{a} {b}" for a, b in x.columns]
print(
    x[sorted(x.columns, key=lambda k: int(k.split()[-1]))]
    .reset_index()
    .to_markdown()
)
Prints:
|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 2502 | nan    | nan      | nan          | 90288N | Olivia   | Elementary   | nan    | nan      | nan        | nan    | nan      | nan          |
|  1 | 3300 | 88117N | Mark     | Advanced     | nan    | nan      | nan          | nan    | nan      | nan        | nan    | nan      | nan          |
|  2 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |

Merge multiple rows to one row in a csv file using python pandas

I have a csv file with multiple rows as stated below
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
1 ABC 10 NAN NAN NAN NAN
2 BCD 15 NAN NAN NAN NAN
3 CDE 17 NAN NAN NAN NAN
1 ABC NAN 18 NAN 17 NAN
2 BCD NAN 10 NAN 15 NAN
1 ABC NAN NAN 16 NAN NAN
3 CDE NAN NAN 19 NAN NAN
I want to merge the rows having the same Id and Name into a single row using pandas in Python. The output should be:
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
1 ABC 10 18 16 17 NAN
2 BCD 15 10 NAN 15 NAN
3 CDE 17 NAN 19 NAN NAN
IIUC, use DataFrame.groupby with as_index=False and GroupBy.first to eliminate the NaN:
#df = df.replace('NAN',np.nan) #If necessary
df.groupby(['Id','Name'],as_index=False).first()
If a pair (Id, Name) could have more than one non-null value in some column, you could use GroupBy.apply with Series.ffill and Series.bfill plus DataFrame.drop_duplicates to keep all the information:
df.groupby(['Id','Name']).apply(lambda x: x.ffill().bfill()).drop_duplicates()
Output
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
0 1 ABC 10 18 16 17 NaN
1 2 BCD 15 10 NaN 15 NaN
2 3 CDE 17 NaN 19 NaN NaN
Hacky answer:
df.groupby("Name").mean().reset_index()
This will only work if each column contains at most one valid value per Name.
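To make the GroupBy.first approach concrete, here is a self-contained sketch; the frame is reconstructed from the question, with np.nan standing in for the NAN placeholders:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id":     [1, 2, 3, 1, 2, 1, 3],
    "Name":   ["ABC", "BCD", "CDE", "ABC", "BCD", "ABC", "CDE"],
    "Marks1": [10, 15, 17, np.nan, np.nan, np.nan, np.nan],
    "Marks2": [np.nan, np.nan, np.nan, 18, 10, np.nan, np.nan],
    "Marks3": [np.nan, np.nan, np.nan, np.nan, np.nan, 16, 19],
    "Marks4": [np.nan, np.nan, np.nan, 17, 15, np.nan, np.nan],
    "Marks5": [np.nan] * 7,
})

# first() takes the first non-null value per (Id, Name) group in each column
print(df.groupby(["Id", "Name"], as_index=False).first())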

How to check missing values in dataframe for specific datatype?

df1:
Name marks class Avg is_stud Date
0 Tom 91.55 classA 45.0 True 10/2/2011
1 Jack 98.66 classB 65.0 False 11/2/2011
2 nick NaN classC NaN False 12/2/2011
3 juli 90.60 classA 14.0 False 13/2/2016
4 NaN 79.60 classB 58.0 True 10/2/2011
5 ramy NaN classC 22.0 False 11/2/2011
6 suzane 85.00 classA 65.0 False 12/2/2015
7 nick NaN classB 96.0 False 13/2/2012
8 Tom 69.69 classC NaN NaN NaN
9 NaN 56.20 classD NaN NaN NaN
Hello all,
I want to find the missing values in each column and add other columns (mean, median, mode) to the output, but only for numeric (int, float) dtypes; otherwise they should be null.
If a column has all unique values, then mode = median.
If there are no missing values in the data frame, then return an empty output data frame.
Output:
col_name  no.of missing   mean  median  mode
Name                  2    NaN     NaN   NaN
marks                 3  81.61    85.0  85.0
Avg                   3  52.14    58.0  65.0
is_stud               2    NaN     NaN   NaN
Date                  2    NaN     NaN   NaN
thanks
As far as I can see, you want to perform the operation only on int and float columns, so I would suggest checking the column's dtype first and then performing the operation.
import numpy as np
import pandas as pd
from scipy import stats

for col in df.columns:
    if (df[col].dtype == 'int64') or (df[col].dtype == 'float64'):
        mean = np.mean(df[col])
        median = np.median(df[col])
        mode = stats.mode(df[col])
        print('Mean', mean, '\nMedian', median, '\nMode', mode, '\nfor column', col)
Once you get these values, you could collect them with pd.DataFrame and print them in the way shown in your question.
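For example, a minimal sketch of that assembly step, assuming df1 as shown and following the question's rule that mode falls back to the median when all values are unique:

import numpy as np
import pandas as pd

rows = []
for col in df1.columns:
    missing = df1[col].isna().sum()
    if df1[col].dtype in ('int64', 'float64'):
        s = df1[col].dropna()
        median = s.median()
        # per the question: if all values are unique, use the median as the mode
        mode = median if s.is_unique else s.mode()[0]
        rows.append([col, missing, round(s.mean(), 2), median, mode])
    else:
        rows.append([col, missing, np.nan, np.nan, np.nan])

out = pd.DataFrame(rows, columns=['col_name', 'no.of missing', 'mean', 'median', 'mode'])
# keep only columns that actually have missing values; empty frame if none do
out = out[out['no.of missing'] > 0].reset_index(drop=True)
print(out)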

Python Pandas: I need to transform a pandas dataframe

The data frame looks like this. I have tried pivot, stack, and unstack. Is there any method to achieve the output?
key attribute text_value numeric_value date_value
0 1 order NaN NaN 10/02/19
1 1 size NaN 43.0 NaN
2 1 weight NaN 22.0 NaN
3 1 price NaN 33.0 NaN
4 1 segment product NaN NaN
5 2 order NaN NaN 11/02/19
6 2 size NaN 34.0 NaN
7 2 weight NaN 32.0 NaN
8 2 price NaN 89.0 NaN
9 2 segment customer NaN NaN
I need the following output
key order size weight price segment
1 10/2/2019 43.0 22.0 33.0 product
2 11/2/2019 34.0 32.0 89.0 customer
Thanks in advance
I believe you don't want to change dtypes in the output data, so one possible solution is to process each column separately with DataFrame.dropna and DataFrame.pivot, and then join the pieces together with concat:
df['date_value'] = pd.to_datetime(df['date_value'])
df1 = df.dropna(subset=['text_value']).pivot(index='key', columns='attribute', values='text_value')
df2 = df.dropna(subset=['numeric_value']).pivot(index='key', columns='attribute', values='numeric_value')
df3 = df.dropna(subset=['date_value']).pivot(index='key', columns='attribute', values='date_value')
df = pd.concat([df1, df2, df3], axis=1).reindex(df['attribute'].unique(), axis=1)
print (df)
attribute order size weight price segment
key
1 2019-10-02 43.0 22.0 33.0 product
2 2019-11-02 34.0 32.0 89.0 customer
print (df.dtypes)
order datetime64[ns]
size float64
weight float64
price float64
segment object
dtype: object
Old answer, where all values are cast to strings:
df['date_value'] = pd.to_datetime(df['date_value'])
df['text_value'] = df['text_value'].fillna(df['numeric_value']).fillna(df['date_value'])
df = df.pivot(index='key', columns='attribute', values='text_value')
print (df)
attribute order price segment size weight
key
1 1569974400000000000 33 product 43 22
2 1572652800000000000 89 customer 34 32
print (df.dtypes)
order object
price object
segment object
size object
weight object
dtype: object
This is the solution I figured out:
attr_dict = {'order': 'date_value', 'size': 'numeric_value', 'weight': 'numeric_value',
             'price': 'numeric_value', 'segment': 'text_value'}
output_table = pd.DataFrame()
for attr in attr_dict.keys():
    temp = input_table[input_table['attribute'] == attr][['key', attr_dict[attr]]]
    temp.rename(columns={attr_dict[attr]: attr}, inplace=True)
    output_table[attr] = list(temp.values[:, 1])
output_table

Extract column value based on another column, reading multiple files

I would like to extract the values for Name, Grade, School, and Class based on another column.
For example, to find Name and Grade, I would go through column 0 and find the value in the next few columns, but the values to be extracted are scattered across those columns. The same goes for School and Class.
Refer to this: extract column value based on another column pandas dataframe
I have multiple files:
0 1 2 3 4 5 6 7 8
0 nan nan nan Student Registration nan nan
1 Name: nan nan John nan nan nan nan nan
2 Grade: nan 6 nan nan nan nan nan nan
3 nan nan nan School: C College nan Class: 1A
0 1 2 3 4 5 6 7 8 9
0 nan nan nan Student Registration nan nan nan
1 nan nan nan nan nan nan nan nan nan nan
2 Name: Mary nan nan nan nan nan nan nan nan
3 Grade: 7 nan nan nan nan nan nan nan nan
4 nan nan nan School: nan D College Class: nan 5A
This is my code (it errors):
for file in files:
    df = pd.read_csv(file, header=0)
    df['Name'] = df.loc[df[0].isin('Name')[1,2,3]
    df['Grade'] = df.loc[df[0].isin('Grade')[1,2,3]
    df['School'] = df.loc[df[3].isin('School')[4,5]
    df['Class'] = df.loc[df[7].isin('Class')[8,9]
    d.append(df)
df = pd.concat(d, ignore_index=True)
This is the outcome I want (melt function):
Name Grade School Class ... .... ... ...
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
I think it is possible to use:
d = []
for file in files:
    df = pd.read_csv(file, header=0)
    # skip the first row and reshape - remove NaNs, convert to a one-column df
    df = df.iloc[1:].stack().reset_index(drop=True).to_frame('data')
    # flag the label cells, which end with ':'
    m = df['data'].str.endswith(':', na=False)
    # shift the values into a new column
    df['val'] = df['data'].shift(-1)
    # keep only the label rows and transpose
    df = df[m].set_index('data').T.rename_axis(None, axis=1)
    d.append(df)
df = pd.concat(d, ignore_index=True)
EDIT:
You can use:
for file in files:
    # if the input files are Excel, use read_excel instead of read_csv
    df = pd.read_excel(file, header=None)
    df['Name'] = df.loc[df[0].eq('Name:'), [1,2,3]].dropna(axis=1).squeeze()
    df['Grade'] = df.loc[df[0].eq('Grade:'), [1,2,3]].dropna(axis=1).squeeze()
    df['School'] = df.loc[df[3].eq('School:'), [4,5]].dropna(axis=1).squeeze()
    df['Class'] = df.loc[df[6].eq('Class:'), [7,8]].dropna(axis=1).squeeze()
    print(df)
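If only the four extracted fields are needed in the repeated, melt-style shape shown above, a possible follow-up (assuming, as in the question, a list d that collects one frame per file):

    # inside the loop, keep just the extracted columns
    d.append(df[['Name', 'Grade', 'School', 'Class']])

# after the loop, stack all files together
result = pd.concat(d, ignore_index=True)
print(result)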
