I would like to extract the values for Name, Grade, School and Class by matching on their labels.
For example, to find Name and Grade I search column 0 for the label, but the value to be extracted is scattered somewhere in the next few columns (its position varies between files). The same goes for School and Class.
Refer to this: extract column value based on another column pandas dataframe
I have multiple files:
0 1 2 3 4 5 6 7 8
0 nan nan nan Student Registration nan nan
1 Name: nan nan John nan nan nan nan nan
2 Grade: nan 6 nan nan nan nan nan nan
3 nan nan nan School: C College nan Class: 1A
0 1 2 3 4 5 6 7 8 9
0 nan nan nan Student Registration nan nan nan
1 nan nan nan nan nan nan nan nan nan nan
2 Name: Mary nan nan nan nan nan nan nan nan
3 Grade: 7 nan nan nan nan nan nan nan nan
4 nan nan nan School: nan D College Class: nan 5A
This is my code (it raises an error):
d = []
for file in files:
    df = pd.read_csv(file, header=0)
    df['Name'] = df.loc[df[0].isin(['Name']), [1,2,3]]
    df['Grade'] = df.loc[df[0].isin(['Grade']), [1,2,3]]
    df['School'] = df.loc[df[3].isin(['School']), [4,5]]
    df['Class'] = df.loc[df[7].isin(['Class']), [8,9]]
    d.append(df)
df = pd.concat(d, ignore_index=True)
This is the outcome I want (a melt-like result):
Name Grade School Class
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
I think it is possible to use:
d = []
for file in files:
    df = pd.read_csv(file, header=0)
    # drop the title row and reshape - stack removes NaNs and gives a 1-column df
    df = df.iloc[1:].stack().reset_index(drop=True).to_frame('data')
    # labels are the cells ending with ':'
    m = df['data'].str.endswith(':', na=False)
    # after stacking, the value immediately follows its label
    df['val'] = df['data'].shift(-1)
    # keep only the label rows and transpose to one row per file
    df = df[m].set_index('data').T.rename_axis(None, axis=1)
    d.append(df)
df = pd.concat(d, ignore_index=True)
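As a self-contained check of this idea, here is a minimal sketch on made-up data shaped like one of the files (the labels and cell positions are assumptions, not the real files):

```python
import pandas as pd
import numpy as np

# One sheet's worth of made-up data: labels end with ':' and the value
# sits somewhere in the following cells (positions vary between files)
raw = pd.DataFrame([
    [np.nan, np.nan, 'Student Registration', np.nan],
    ['Name:', np.nan, 'John', np.nan],
    ['Grade:', '6', np.nan, np.nan],
    [np.nan, 'School:', 'C College', np.nan],
])

# Drop the title row, flatten to one column; stack() drops the NaNs
s = raw.iloc[1:].stack().reset_index(drop=True).to_frame('data')

# A label is any cell ending with ':'
m = s['data'].str.endswith(':', na=False)

# After stacking, shift(-1) pairs each label with the cell that follows it
s['val'] = s['data'].shift(-1)

# Keep only the label rows and pivot them into columns
out = s[m].set_index('data')['val'].to_frame().T
print(out)
```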
EDIT:
You can use:
for file in files:
    # if the input files are CSV, change read_excel to read_csv
    df = pd.read_excel(file, header=None)
    df['Name'] = df.loc[df[0].eq('Name:'), [1,2,3]].dropna(axis=1).squeeze()
    df['Grade'] = df.loc[df[0].eq('Grade:'), [1,2,3]].dropna(axis=1).squeeze()
    df['School'] = df.loc[df[3].eq('School:'), [4,5]].dropna(axis=1).squeeze()
    df['Class'] = df.loc[df[6].eq('Class:'), [7,8]].dropna(axis=1).squeeze()
    print(df)
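One way to then collapse those per-row values into a single record per file (a sketch on made-up data, assuming each label occurs exactly once per file):

```python
import pandas as pd
import numpy as np

# Made-up frame in the shape the loop above produces: each extracted
# value sits on the row where its label was found
df = pd.DataFrame({
    'Name':   ['John', np.nan, np.nan],
    'Grade':  [np.nan, 6, np.nan],
    'School': [np.nan, np.nan, 'C College'],
})

# ffill + bfill spreads each column's single value to every row,
# so any one row then holds the complete record
record = df.ffill().bfill().iloc[[0]]
print(record)
```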
Related
I am struggling with the following issue.
My DF is:
df = pd.DataFrame(
[
['7890-1', '12345N', 'John', 'Intermediate'],
['7890-4', '30909N', 'Greg', 'Intermediate'],
['3300-1', '88117N', 'Mark', 'Advanced'],
['2502-2', '90288N', 'Olivia', 'Elementary'],
['7890-2', '22345N', 'Joe', 'Intermediate'],
['7890-3', '72245N', 'Ana', 'Elementary']
],
columns=['Id', 'Code', 'Person', 'Level'])
print(df)
I would like to get such a result:
     Id  Code 1 Person 1       Level 1  Code 2 Person 2       Level 2  Code 3 Person 3     Level 3  Code 4 Person 4       Level 4
0  7890  12345N     John  Intermediate  22345N      Joe  Intermediate  72245N      Ana  Elementary  30909N     Greg  Intermediate
1  3300  88117N     Mark      Advanced     NaN      NaN           NaN     NaN      NaN         NaN     NaN      NaN           NaN
2  2502     NaN      NaN           NaN  90288N   Olivia    Elementary     NaN      NaN         NaN     NaN      NaN           NaN
I'd start with the same approach as @Andrej Kesely, but then sort by index after unstacking and map over the column names with ' '.join.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
df_wide = df.set_index(["Id", "No"]).unstack(level=1).sort_index(axis=1,level=1)
df_wide.columns = df_wide.columns.map(' '.join)
Output
Code 1 Level 1 Person 1 Code 2 Level 2 Person 2 Code 3 \
Id
2502 NaN NaN NaN 90288N Elementary Olivia NaN
3300 88117N Advanced Mark NaN NaN NaN NaN
7890 12345N Intermediate John 22345N Intermediate Joe 72245N
Level 3 Person 3 Code 4 Level 4 Person 4
Id
2502 NaN NaN NaN NaN NaN
3300 NaN NaN NaN NaN NaN
7890 Elementary Ana 30909N Intermediate Greg
Try:
df[["Id", "Id2"]] = df["Id"].str.split("-", expand=True)
x = df.set_index(["Id", "Id2"]).unstack(level=1)
x.columns = [f"{a} {b}" for a, b in x.columns]
print(
    x[sorted(x.columns, key=lambda k: int(k.split()[-1]))]
    .reset_index()
    .to_markdown()
)
Prints:
|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 2502 | nan    | nan      | nan          | 90288N | Olivia   | Elementary   | nan    | nan      | nan        | nan    | nan      | nan          |
|  1 | 3300 | 88117N | Mark     | Advanced     | nan    | nan      | nan          | nan    | nan      | nan        | nan    | nan      | nan          |
|  2 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |
I have a csv file with multiple rows, as shown below:
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
1 ABC 10 NAN NAN NAN NAN
2 BCD 15 NAN NAN NAN NAN
3 CDE 17 NAN NAN NAN NAN
1 ABC NAN 18 NAN 17 NAN
2 BCD NAN 10 NAN 15 NAN
1 ABC NAN NAN 16 NAN NAN
3 CDE NAN NAN 19 NAN NAN
I want to merge the rows having the same Id and Name into a single row using pandas in Python. The output should be:
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
1 ABC 10 18 16 17 NAN
2 BCD 15 10 NAN 15 NAN
3 CDE 17 NAN 19 NAN NAN
IIUC, use DataFrame.groupby with as_index=False and GroupBy.first to eliminate the NaNs.
#df = df.replace('NAN',np.nan) #If necessary
df.groupby(['Id','Name'],as_index=False).first()
If you think a pair of Id and Name could have non-null values in some column on more than one row, you could use GroupBy.apply with Series.ffill and Series.bfill plus DataFrame.drop_duplicates to keep all the information.
df.groupby(['Id','Name']).apply(lambda x: x.ffill().bfill()).drop_duplicates()
Output
Id Name Marks1 Marks2 Marks3 Marks4 Marks5
0 1 ABC 10 18 16 17 NaN
1 2 BCD 15 10 NaN 15 NaN
2 3 CDE 17 NaN 19 NaN NaN
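For reference, the first() approach can be reproduced end to end; this sketch rebuilds the sample with real NaN values (only Marks1-Marks4, to keep it short):

```python
import pandas as pd
import numpy as np

# The scattered rows from the question, with real NaNs (if the CSV holds
# the literal string 'NAN', replace it first as noted above)
df = pd.DataFrame({
    'Id':     [1, 2, 3, 1, 2, 1, 3],
    'Name':   ['ABC', 'BCD', 'CDE', 'ABC', 'BCD', 'ABC', 'CDE'],
    'Marks1': [10, 15, 17, np.nan, np.nan, np.nan, np.nan],
    'Marks2': [np.nan, np.nan, np.nan, 18, 10, np.nan, np.nan],
    'Marks3': [np.nan, np.nan, np.nan, np.nan, np.nan, 16, 19],
    'Marks4': [np.nan, np.nan, np.nan, 17, 15, np.nan, np.nan],
})

# first() takes the first non-NaN value per column within each group,
# collapsing the scattered rows into one row per (Id, Name)
out = df.groupby(['Id', 'Name'], as_index=False).first()
print(out)
```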
Hacky answer:
df.groupby("Name").mean().reset_index()
This will only work if, for each column, there is only one valid value for each Name.
Name Sex Age Ticket_No Fare
0 Braund male 22 HN07681 2500
1 NaN female 42 HN05681 6895
2 peter male NaN KKSN55 800
3 NaN male 56 HN07681 2500
4 Daisy female 22 hf55s44 NaN
5 Manson NaN 48 HN07681 8564
6 Piston male NaN HN07681 5622
7 Racline female 42 Nh55146 NaN
8 NaN male 22 HN07681 4875
9 NaN NaN NaN NaN NaN
col_Name No_of_Missing Mean Median Mode
0 Name 3 NaN NaN NaN
1 Sex 1 NaN NaN NaN
2 Age 2 36 42 22
3 Fare 2 4536 4875 2500
Mean/Median/Mode apply only to numerical datatypes; for other columns they should be null.
Try this:
# Your original df
print(df)

# First drop any rows which are completely NaN
df = df.dropna(how="all")

# Create a list to hold other lists.
# This will be used as the data for the new dataframe
new_data = []

# Parse through the columns
for col in df.columns:
    # Create a new list, which will be one row of data in the new dataframe.
    # The first item contains the column's name,
    # to correspond with the new df's first column
    _list = [col]
    _list.append(df.dtypes[col])  # DType for that column is the second item/column
    missing = df[col].isna().sum()  # Total number of NaN in the column
    if missing > 30:
        print("Max total number of missing exceeded")
        continue  # Skip this column and continue on to the next column
    _list.append(missing)
    # Get the mean. This will error and fall back to NaN if it's not possible
    try:
        mean = df[col].mean()
    except:
        mean = np.nan
    _list.append(mean)  # Append to the proper column position
    # Get the median. This will error and fall back to NaN if it's not possible
    try:
        median = df[col].median()
    except:
        median = np.nan
    _list.append(median)
    # Get the mode. [1] takes the second modal value; with a single clear
    # mode this raises and falls back to NaN
    try:
        mode = df[col].mode()[1]
    except:
        mode = np.nan
    _list.append(mode)
    new_data.append(_list)

columns = ["col_Name", "DType", "No_of_Missing", "Mean", "Median", "Mode"]
new_df = pd.DataFrame(new_data, columns=columns)
print("============================")
print(new_df)
OUTPUT:
Name Sex Age Ticket_No Fare
0 Braund male 22.0 HN07681 2500.0
1 NaN female 42.0 HN05681 6895.0
2 peter male NaN KKSN55 800.0
3 NaN male 56.0 HN07681 2500.0
4 Daisy female 22.0 hf55s44 NaN
5 Manson NaN 48.0 HN07681 8564.0
6 Piston male NaN HN07681 5622.0
7 Racline female 42.0 Nh55146 NaN
8 NaN male 22.0 HN07681 4875.0
9 NaN NaN NaN NaN NaN
============================
col_Name DType No_of_Missing Mean Median Mode
0 Name object 3 NaN NaN Daisy
1 Sex object 1 NaN NaN NaN
2 Age float64 2 36.285714 42.0 NaN
3 Ticket_No object 0 NaN NaN NaN
4 Fare float64 2 4536.571429 4875.0 NaN
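A shorter, vectorized sketch of the same kind of summary (the data here is a trimmed-down stand-in for the example; mean/median exist only for numeric columns, so object columns come out as NaN automatically):

```python
import pandas as pd
import numpy as np

# Trimmed-down stand-in for the example data
df = pd.DataFrame({
    'Name': ['Braund', np.nan, 'peter', np.nan],
    'Age':  [22.0, 42.0, np.nan, 56.0],
    'Fare': [2500.0, 6895.0, 800.0, 2500.0],
})

# Only numeric columns have a mean/median
num = df.select_dtypes('number')

summary = pd.DataFrame({
    'No_of_Missing': df.isna().sum(),
    # reindexing to all columns leaves NaN for the object ones
    'Mean': num.mean().reindex(df.columns),
    'Median': num.median().reindex(df.columns),
}).rename_axis('col_Name').reset_index()
print(summary)
```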
How can I fill a df with empty rows, or create a df with empty rows?
I have this df:
df = pd.DataFrame(columns=["naming","type"])
How do I fill this df with empty rows?
Specify index values:
df = pd.DataFrame(columns=["naming","type"], index=range(10))
print (df)
naming type
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
If need empty strings:
df = pd.DataFrame('',columns=["naming","type"], index=range(10))
print (df)
naming type
0
1
2
3
4
5
6
7
8
9
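If the frame already exists and you want to extend it with empty rows, reindex is one option (a sketch; the range length is arbitrary):

```python
import pandas as pd

df = pd.DataFrame(columns=["naming", "type"])

# reindex extends the frame to the given index, creating all-NaN rows
# for labels that do not exist yet
df = df.reindex(range(5))
print(df.shape)
```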
I need to match the lists by their appropriate indexes. There are 5 lists, and one of them is the main list. List1/List2 belong together, and likewise List3/List4. The List1/List3 values appear in main_list, and the List2/List4 values need to be matched to the appropriate index in main_list.
main_list = ['munnar', 'ooty', 'coonoor', 'nilgri', 'wayanad', 'coorg', 'chera', 'hima']
List1 = ['ooty', 'coonoor', 'chera']
List2 = ['hill', 'hill', 'hill']
List3 = ['nilgri', 'hima', 'ooty']
List4 = ['mount', 'mount', 'mount']
df = pd.DataFrame(dict(Area=main_list))
df1 = pd.DataFrame(
list(zip(List1, List2)),
columns=('Area', 'Content')
)
df2 = pd.DataFrame(
list(zip(List3, List4)),
columns=('Area', 'cont')
)
re = pd.concat([df, df1, df2], ignore_index=True, sort=False)
Output:
Area Content cont
0 munnar NaN NaN
1 ooty NaN NaN
2 coonoor NaN NaN
3 nilgri NaN NaN
4 wayanad NaN NaN
5 coorg NaN NaN
6 chera NaN NaN
7 hima NaN NaN
8 ooty hill NaN
9 coonoor hill NaN
10 chera hill NaN
11 nilgri NaN mount
12 hima NaN mount
13 ooty NaN mount
Expected Output:
Area Content cont
0 munnar NaN NaN
1 ooty hill mount
2 coonoor hill NaN
3 nilgri NaN mount
4 wayanad NaN NaN
5 coorg NaN NaN
6 chera hill NaN
7 hima NaN mount
IIUC, set_index before concat:
pd.concat([df.set_index('Area'), df1.set_index('Area'), df2.set_index('Area')], axis=1).reset_index()
Out[312]:
index Content cont
0 chera hill NaN
1 coonoor hill NaN
2 coorg NaN NaN
3 hima NaN mount
4 munnar NaN NaN
5 nilgri NaN mount
6 ooty hill mount
7 wayanad NaN NaN
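Note that concat along axis=1 sorts the combined index, so the rows come out alphabetically rather than in main_list order. If the original order matters (as in the expected output), a pair of left merges is one way to keep it; a sketch:

```python
import pandas as pd

main_list = ['munnar', 'ooty', 'coonoor', 'nilgri', 'wayanad', 'coorg', 'chera', 'hima']
List1 = ['ooty', 'coonoor', 'chera']
List2 = ['hill', 'hill', 'hill']
List3 = ['nilgri', 'hima', 'ooty']
List4 = ['mount', 'mount', 'mount']

df = pd.DataFrame(dict(Area=main_list))
df1 = pd.DataFrame({'Area': List1, 'Content': List2})
df2 = pd.DataFrame({'Area': List3, 'cont': List4})

# Left merges keep the row order of main_list, unlike set_index + concat,
# which sorts the index
out = df.merge(df1, on='Area', how='left').merge(df2, on='Area', how='left')
print(out)
```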