Dropping rows using multiple criteria - Python

I have the following data frame:
EID      CLEAN_NAME  Start_Date  End_Date    Rank_no
E000000  DEF         3/1/1973    2/28/1978   154
E000001  GHI         6/1/1983    3/31/1988   1296
E000001  ABC         1/1/2017                80292
E000002  JKL         10/1/1980   8/31/1981   751.5
E000003  MNO         5/1/1973    11/30/1977  157
E000003  ABC         5/1/1977    11/30/1987  200
E000003  PQR         5/1/1987    11/30/1997  300
E000003  ABC         5/1/1997                1000
What I am trying to do here is: for each EID, delete the ABC row that has the highest Rank_no among that EID's ABC records. If an ABC record does not have the highest rank for its EID, it should not be deleted. The rest of the data should remain as it is. The expected output is as follows:
EID      CLEAN_NAME  Start_Date  End_Date    Rank_no
E000000  DEF         3/1/1973    2/28/1978   154
E000001  GHI         6/1/1983    3/31/1988   1296
E000002  JKL         10/1/1980   8/31/1981   751.5
E000003  MNO         5/1/1973    11/30/1977  157
E000003  ABC         5/1/1977    11/30/1987  200
E000003  PQR         5/1/1987    11/30/1997  300
I tried to use the following code:
result_new = result.drop(result[(result['Rank_no'] == result.Rank_no.max()) & (result['CLEAN_NAME'] == 'ABC')].index)
But it's not working. I'm pretty sure my conditions are wrong, but I'm not sure what exactly I am missing or writing incorrectly. I have named my data frame result.
Any leads would be appreciated. Thanks!

Filter down to only the rows whose CLEAN_NAME is ABC, then use groupby and idxmax to find the index of the maximum Rank_no for each EID, and drop those rows:
df.drop(df.loc[df.CLEAN_NAME == "ABC"].groupby("EID").Rank_no.idxmax())
EID CLEAN_NAME Start_Date End_Date Rank_no
0 E000000 DEF 3/1/1973 2/28/1978 154.0
1 E000001 GHI 6/1/1983 3/31/1988 1296.0
3 E000002 JKL 10/1/1980 8/31/1981 751.5
4 E000003 MNO 5/1/1973 11/30/1977 157.0
5 E000003 ABC 5/1/1977 11/30/1987 200.0
6 E000003 PQR 5/1/1987 11/30/1997 300.0
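For reference, the attempt in the question fails because result.Rank_no.max() is the maximum over the entire frame, not the per-EID maximum among the ABC rows. A sketch along the lines of the original code (assuming the frame is named result, as in the question) compares each ABC row against its group maximum instead; note that ties would all be dropped here, unlike idxmax, which keeps to a single row:

abc_rows = result[result['CLEAN_NAME'] == 'ABC']
# Per-EID maximum Rank_no among the ABC rows, aligned to abc_rows' index.
grp_max = abc_rows.groupby('EID')['Rank_no'].transform('max')
# Drop every ABC row whose Rank_no equals its EID's ABC maximum.
result_new = result.drop(abc_rows[abc_rows['Rank_no'] == grp_max].index)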

import pandas as pd
datas = [
['E000000', 'DEF', '3/1/1973', '2/28/1978', 154],
['E000001', 'GHI', '6/1/1983', '3/31/1988', 1296],
['E000001', 'ABC', '1/1/2017', '', 80292],
['E000002', 'JKL', '10/1/1980', '8/31/1981', 751.5],
['E000003', 'MNO', '5/1/1973', '11/30/1977', 157],
['E000003', 'ABC', '5/1/1977', '11/30/1987', 200],
['E000003', 'PQR', '5/1/1987', '11/30/1997', 300],
['E000003', 'ABC', '5/1/1997', '', 1000],
]
result = pd.DataFrame(datas, columns=['EID', 'CLEAN_NAME', 'Start_Date', 'End_Date', 'Rank_no'])
new_result = result.sort_values(by='Rank_no') # sort by lowest Rank_no
new_result = new_result.drop_duplicates(subset=['CLEAN_NAME'], keep='first') # drop duplicates keeping the first
new_result = new_result.sort_values(by='EID') # sort by EID
print(new_result)
Output:
EID CLEAN_NAME Start_Date End_Date Rank_no
0 E000000 DEF 3/1/1973 2/28/1978 154.0
1 E000001 GHI 6/1/1983 3/31/1988 1296.0
3 E000002 JKL 10/1/1980 8/31/1981 751.5
4 E000003 MNO 5/1/1973 11/30/1977 157.0
5 E000003 ABC 5/1/1977 11/30/1987 200.0
6 E000003 PQR 5/1/1987 11/30/1997 300.0
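One caveat, not part of the original answer: drop_duplicates(subset=['CLEAN_NAME']) de-duplicates by name across the whole frame, which happens to match the expected output here because no name other than ABC repeats and only one ABC row should survive. If the real data can legitimately repeat a CLEAN_NAME under several EIDs, a sketch restricted to ABC rows per EID (reusing the result frame built above) would be:

abc = result[result['CLEAN_NAME'] == 'ABC']
# Highest-ranked ABC row per EID: sort ascending, then take the last row of each group.
to_drop = abc.sort_values('Rank_no').groupby('EID').tail(1).index
new_result = result.drop(to_drop)
print(new_result)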

Related

How to retain duplicate column names and melt dataframe using pandas?

I have a dataframe like as shown below
tdf = pd.DataFrame(
{'Unnamed: 0' : ['Region','Asean','Asean','Asean','Asean','Asean','Asean'],
'Unnamed: 1' : ['Name', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR','STU'],
'2017Q1' : ['target_achieved',2345,5678,7890,1234,6789,5454],
'2017Q1' : ['target_set', 3000,6000,8000,1500,7000,5500],
'2017Q1' : ['score', 86, 55, 90, 65, 90, 87],
'2017Q2' : ['target_achieved',245,578,790,123,689,454],
'2017Q2' : ['target_set', 300,600,800,150,700,500],
'2017Q2' : ['score', 76, 45, 70, 55, 60, 77]})
As you can see, my column names are duplicated: there are three 2017Q1 columns and three 2017Q2 columns.
A dataframe doesn't allow columns with duplicate names.
I tried the below to get my expected output
tdf.columns = tdf.iloc[0]  # but this still ignores the columns with duplicate names
Update: after reading the Excel file, based on jezrael's answer, I get the display below.
I expect my output to look as shown below.
First create a MultiIndex in both the columns and the index:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
If that is not possible, here is an alternative using your sample data: convert the columns together with the first row of data into a MultiIndex for the columns, and the first two columns into a MultiIndex for the index:
tdf = pd.read_excel(file)
tdf.columns = pd.MultiIndex.from_arrays([tdf.columns, tdf.iloc[0]])
df = (tdf.iloc[1:]
         .set_index(tdf.columns[:2].tolist())
         .rename_axis(index=['Region','Name'], columns=['Year',None]))
print (df.index)
MultiIndex([('Asean', 'DEF'),
            ('Asean', 'GHI'),
            ('Asean', 'JKL'),
            ('Asean', 'MNO'),
            ('Asean', 'PQR'),
            ('Asean', 'STU')],
           names=['Region', 'Name'])
print (df.columns)
MultiIndex([('2017Q1', 'target_achieved'),
            ('2017Q1', 'target_set'),
            ('2017Q1', 'score'),
            ('2017Q2', 'target_achieved'),
            ('2017Q2', 'target_set'),
            ('2017Q2', 'score')],
           names=['Year', None])
And then reshape:
df1 = df.stack(0).reset_index()
print (df1)
Region Name Year score target_achieved target_set
0 Asean DEF 2017Q1 86 2345 3000
1 Asean DEF 2017Q2 76 245 300
2 Asean GHI 2017Q1 55 5678 6000
3 Asean GHI 2017Q2 45 578 600
4 Asean JKL 2017Q1 90 7890 8000
5 Asean JKL 2017Q2 70 790 800
6 Asean MNO 2017Q1 65 1234 1500
7 Asean MNO 2017Q2 55 123 150
8 Asean PQR 2017Q1 90 6789 7000
9 Asean PQR 2017Q2 60 689 700
10 Asean STU 2017Q1 87 5454 5500
11 Asean STU 2017Q2 77 454 500
EDIT: The solution for the edited question is similar:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
df1 = df.rename_axis(index=['Region','Name'], columns=['Year',None]).stack(0).reset_index()
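Once the columns are a MultiIndex, individual levels can also be selected directly; a small usage sketch with the df built above (before stacking):

q1 = df['2017Q1']                         # all three metrics for 2017Q1
scores = df.xs('score', axis=1, level=1)  # the 'score' column for every quarter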

Setting the values of filtered rows of dataframe equal to the column of another dataframe

I have a dataframe df1:
df1 = pd.DataFrame([['a','Yes','abc def msg1'],
['b', 'No', 'ghi jkl msg2'],
['c','Yes','mno pqr msg3'],
['d', 'No', 'stu vwx msg4'],
['a', 'Yes', 'bcd efg msg5'],
['c','No','hij klm msg6'],
['a','No','nop qrs msg7'],
['b','No','tuv wxy msg8']],
columns=['unit_name','is_required','dummy_column'])
unit_name is_required dummy_column
a Yes abc def msg1
b No ghi jkl msg2
c Yes mno pqr msg3
d No stu vwx msg4
a Yes bcd efg msg5
c No hij klm msg6
a No nop qrs msg7
b No tuv wxy msg8
Its rows with unit_name='a' and is_required='Yes' are used to derive another dataframe df2:
dummy1 dummy2 msg_column value
abc def msg1 val1
bcd efg msg5 val2
Now I want to add the value column of df2 to df1. The rows that don't have the value must contain '-'. So the expected output I want is:
unit_name is_required dummy_column value
a Yes abc def msg1 val1
b No ghi jkl msg2 -
c Yes mno pqr msg3 -
d No stu vwx msg4 -
a Yes bcd efg msg5 val2
c No hij klm msg6 -
a No nop qrs msg7 -
b No tuv wxy msg8 -
In order to do this, I tried the below line of code:
df1.loc[(df1.unit_name=='a') & (df1.is_required=='Yes'),'value'] = df2['value']
df1.fillna('-')
But I'm getting the result:
unit_name is_required dummy_column value
a Yes abc def msg1 val1
b No ghi jkl msg2 -
c Yes mno pqr msg3 -
d No stu vwx msg4 -
a Yes bcd efg msg5 -
c No hij klm msg6 -
a No nop qrs msg7 -
b No tuv wxy msg8 -
Now I understand that this happens because, when assigning one column to another, the index labels of the LHS are used to align the values from the RHS.
How do I get the output I need? Any ideas are welcome. Thanks in advance!
The problem is that the two DataFrames have different indices.
A possible solution exists if the number of filtered rows is the same as the number of rows in df2; check first:
print (((df1.unit_name=='a') & (df1.is_required=='Yes')).sum(), len(df2.index))
Then it is possible to use:
df1.loc[(df1.unit_name=='a') & (df1.is_required=='Yes'),'value'] = df2['value'].to_numpy()
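For completeness, a minimal sketch that wires the check and the assignment together, using the df1 defined in the question; the df2 literal below is an assumption reconstructed from the two rows shown above:

import pandas as pd

df2 = pd.DataFrame([['abc', 'def', 'msg1', 'val1'],
                    ['bcd', 'efg', 'msg5', 'val2']],
                   columns=['dummy1', 'dummy2', 'msg_column', 'value'])

mask = (df1.unit_name == 'a') & (df1.is_required == 'Yes')
assert mask.sum() == len(df2.index)  # both are 2 for the sample data

# Bypass index alignment by assigning a plain array.
df1.loc[mask, 'value'] = df2['value'].to_numpy()
df1 = df1.fillna('-')
print(df1)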

Split Python DF Columns into 2 based off a predefined list of options

I have a dataframe that looks like the following:
print(df):
Text
John Smith abc def ghi jkl
Michael Smith abc def ghi jkl
Liz Jones abc def ghi jkl
I also have a predefined list of people whom I want to find, splitting the above contents into two columns.
names = ('John Smith','Michael Smith','Liz Jones')
I am hoping to get the following:
print(df):
Name | Information
John Smith | abc def ghi jkl
Michael Smith | abc def ghi jkl
Liz Jones | abc def ghi jkl
I have tried:
df['Name','Information'] = df['Text'].str.split(names)
but I think str.split needs a string and doesn't take a list of names. Is there any way to split columns based on a defined list?
Any help would be much appreciated. Thanks very much!
Use Series.str.extract with all the names joined by | (a regex 'or'), followed by a group capturing everything else:
names = ('John Smith','Michael Smith','Liz Jones')
df = df['Text'].str.extract(f'(?P<Name>{"|".join(names)})(?P<Information>.*)')
print (df)
Name Information
0 John Smith abc def ghi jkl
1 Michael Smith abc def ghi jkl
2 Liz Jones abc def ghi jkl
If you want to remove this column and keep all the other original columns, use DataFrame.pop to extract the column and DataFrame.join to attach the result:
df = df.join(df.pop('Text').str.extract(f'(?P<Name>{"|".join(names)})(?P<Information>.*)'))
Or:
df[['Name','Information']] = df.pop('Text').str.extract(f'(?P<letter>{"|".join(names)})(.*)')
print (df)
Name Information
0 John Smith abc def ghi jkl
1 Michael Smith abc def ghi jkl
2 Liz Jones abc def ghi jkl
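One caveat, not from the original answer: if any of the names can contain regex metacharacters, it is safer to escape them before joining (a sketch mirroring the join variant above):

import re

escaped = '|'.join(map(re.escape, names))
df = df.join(df.pop('Text').str.extract(rf'(?P<Name>{escaped})(?P<Information>.*)'))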

Inserting a Ratio field into a Pandas Series

I get a Pandas series:
countrypat = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0).head(3)
The output looks like this:
China     abc    1055
          def     778
          ghi     612
Malaysia  def     554
          abc     441
          ghi     178
[...]
How do I insert a new column (do I have to make this a dataframe?) containing the ratio of the numeric column to the sum of the numbers for that country? Thus for China I would want a new column whose first row contains 1055/(1055+778+612). I have tried unstack() and to_df() but was unsure of the next steps.
I created a dataframe on my side, but excluded the .head(3) from your assignment:
countrypat = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0)
The following will give you the proportions with a simple apply to your groupby object:
countrypat.apply(lambda x: x / float(x.sum()))
The only 'problem' is that doing so returns a Series, so I would store the intermediate results in two separate Series and combine them at the end:
series1 = asiaselect.groupby('Country')['Pattern'].value_counts()
series2 = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0).apply(lambda x: x / float(x.sum()))
pd.DataFrame([series1, series2]).T
China     abc    1055.0    0.431493
          def     778.0    0.318200
          ghi     612.0    0.250307
Malaysia  def     554.0    0.472293
          abc     441.0    0.375959
          ghi     178.0    0.151748
To get the top three rows, you can simply add .groupby(level=0).head(3) to both series1 and series2:
series1_top = series1.groupby(level=0).head(3)
series2_top = series2.groupby(level=0).head(3)
pd.DataFrame([series1_top, series2_top]).T
I tested with a dataframe containing more than 3 rows per country, and it seems to work. I started with the following df:
China     abc    1055
          def     778
          ghi     612
          yyy       5
          xxx       3
          zzz       3
Malaysia  def     554
          abc     441
          ghi     178
          yyy       5
          xxx       3
          zzz       3
and ends like this:
China     abc    1055.0    0.429560
          def     778.0    0.316775
          ghi     612.0    0.249186
Malaysia  def     554.0    0.467905
          abc     441.0    0.372466
          ghi     178.0    0.150338
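An equivalent, slightly shorter route (a sketch, assuming the same asiaselect frame) divides the counts by a per-country sum via transform and takes head(3) at the end:

import pandas as pd

counts = asiaselect.groupby('Country')['Pattern'].value_counts()
# Divide each count by the total for its country (level 0 of the MultiIndex).
ratios = counts / counts.groupby(level=0).transform('sum')
out = pd.DataFrame({'count': counts, 'ratio': ratios}).groupby(level=0).head(3)
print(out)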

Adding a new column, similar to a SQL operation, in Python pandas

The sql operation is as below :
UPDATE table_A s SET t.stat_fips=s.stat_fips
WHERE t.stat_code=s.stat_code;
If a similar operation needs to be done on CSV A, comparing some value from CSV B, how can this be achieved in Python?
Data:
Let's assume:
CSV A
col1 stat_code name
abc WY ABC
def NA DEF
ghi AZ GHI
CSV B
stat_fips stat_code
2234 WY
4344 NA
4588 AZ
Resulting CSV :
col1 stat_code name stat_fips
abc WY ABC 2234
def NA DEF 4344
ghi AZ GHI 4588
Adding the attempted code so far:
df = pd.read_csv('fin.csv',sep='\t', quotechar="'")
df = df.set_index('col1').stack(dropna=False).reset_index
df1['stat_fips'] = df1['stat_code']
print df1
(Not really sure about pandas; still learning the basics.)
Judging by your example data, this looks like a merge operation on your stat_code column:
import pandas as pd
df_a = pd.DataFrame([["abc", "WY", "ABC"], ["def", "NA", "DEF"]], columns= ["col1", "stat_code", "name"])
df_b = pd.DataFrame([[2234, "WY"], [4344, "NA"]], columns=["stat_fips", "stat_code"])
merged_df = pd.merge(df_a, df_b, on="stat_code", how="left")
print(merged_df)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NA DEF 4344
It seems you need to map using a dict d:
d = df2.set_index('stat_code')['stat_fips'].to_dict()
df1['stat_fips'] = df1['stat_code'].map(d)
print (df1)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NaN DEF 4344
2 ghi AZ GHI 4588
Or merge with left join:
df3 = pd.merge(df1, df2, on='stat_code', how='left')
print (df3)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NaN DEF 4344
2 ghi AZ GHI 4588
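Since the question starts from two CSV files, a minimal end-to-end sketch would look like the following (the file names are assumptions). Note that read_csv treats the literal string NA as missing by default, which is presumably why the output above shows NaN in stat_code; passing keep_default_na=False keeps it as text:

import pandas as pd

df1 = pd.read_csv('csv_a.csv', keep_default_na=False)  # col1, stat_code, name
df2 = pd.read_csv('csv_b.csv', keep_default_na=False)  # stat_fips, stat_code
df3 = pd.merge(df1, df2, on='stat_code', how='left')
df3.to_csv('csv_result.csv', index=False)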
