Adding a new column similar to a SQL operation in Python pandas - python

The SQL operation is as below:
UPDATE table_A t, table_B s
SET t.stat_fips = s.stat_fips
WHERE t.stat_code = s.stat_code;
If a similar operation needs to be done on CSV A, comparing some value from CSV B, how can this be achieved in Python?
Data:
Let's assume -
CSV A
col1 stat_code name
abc WY ABC
def NA DEF
ghi AZ GHI
CSV B
stat_fips stat_code
2234 WY
4344 NA
4588 AZ
Resulting CSV:
col1 stat_code name stat_fips
abc WY ABC 2234
def NA DEF 4344
ghi AZ GHI 4588
Adding the attempted code so far:
df1 = pd.read_csv('fin.csv', sep='\t', quotechar="'")
df1 = df1.set_index('col1').stack(dropna=False).reset_index()
df1['stat_fips'] = df1['stat_code']
print(df1)
(Not really sure on pandas; still learning the basics.)

Judging by your example data, this looks like a merge operation on your stat_code column:
import pandas as pd
df_a = pd.DataFrame([["abc", "WY", "ABC"], ["def", "NA", "DEF"], ["ghi", "AZ", "GHI"]], columns=["col1", "stat_code", "name"])
df_b = pd.DataFrame([[2234, "WY"], [4344, "NA"], [4588, "AZ"]], columns=["stat_fips", "stat_code"])
merged_df = pd.merge(df_a, df_b, on="stat_code", how="left")
print(merged_df)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NA DEF 4344
2 ghi AZ GHI 4588
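As a side note, if each stat_code is expected to appear only once in CSV B, merge can check that assumption for you and raise a MergeError on duplicates:

# validate="many_to_one" asserts stat_code is unique on the right-hand side
merged_df = pd.merge(df_a, df_b, on="stat_code", how="left", validate="many_to_one")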

It seems you need map by dict d (note: read_csv parses the literal string "NA" as NaN by default, which is why stat_code shows NaN below; pass keep_default_na=False to read_csv to keep it as text):
d = df2.set_index('stat_code')['stat_fips'].to_dict()
df1['stat_fips'] = df1['stat_code'].map(d)
print(df1)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NaN DEF 4344
2 ghi AZ GHI 4588
Or merge with left join:
df3 = pd.merge(df1, df2, on='stat_code', how='left')
print(df3)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NaN DEF 4344
2 ghi AZ GHI 4588
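Putting the pieces together for the original two-CSV question: a minimal end-to-end sketch, assuming the files are named a.csv and b.csv (hypothetical names) and are comma-separated. keep_default_na=False keeps the state code "NA" as a string instead of NaN:

import pandas as pd

# keep_default_na=False so the literal state code "NA" is not parsed as NaN
df1 = pd.read_csv('a.csv', keep_default_na=False)   # col1, stat_code, name
df2 = pd.read_csv('b.csv', keep_default_na=False)   # stat_fips, stat_code

# map stat_code -> stat_fips; a Series lookup works directly, no to_dict needed
df1['stat_fips'] = df1['stat_code'].map(df2.set_index('stat_code')['stat_fips'])

df1.to_csv('result.csv', index=False)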

Related

Return a List Based on Multiple Filters from Multiple Dataframes

How do I return a list based on multiple filters from multiple dataframes? I have three dataframes:
df1
Ticker Company Name COMBOACCRUAL
0 ABC ABC Co. 0.9
1 DEF DEF Co. 0.99
2 GHI GHI Co. 0.5
df2
Ticker Company Name PMAN
0 ABC ABC Co. 0.7
1 DEF DEF Co. 0.3
2 GHI GHI Co. 0.55
df3
Ticker Company Name PFD
0 ABC ABC Co. 0.25
1 DEF DEF Co. 0.35
2 GHI GHI Co. 0.9
and I want to apply the filters COMBOACCRUAL < 0.95, PMAN < 0.95, and PFD < 0.95 on the dataframes df1, df2, and df3 respectively, so I can work further on the culled data.
The expected result should look like this:
df4
Ticker Company Name COMBOACCRUAL PMAN PFD
0 ABC ABC Co. 0.9 0.7 0.25
2 GHI GHI Co. 0.5 0.55 0.9
Merge the three dataframes together first, then apply your filters:
columns = ['Ticker', 'Company Name']
df4 = df1.merge(df2, on=columns).merge(df3, on=columns)
df4 = df4[(df4['COMBOACCRUAL'] < 0.95) & (df4['PMAN'] < 0.95) & (df4['PFD'] < 0.95)]
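For reference, a self-contained sketch of this approach with the sample data from the question (assuming 'Company Name' is a single column, as the expected output suggests):

import pandas as pd

df1 = pd.DataFrame({'Ticker': ['ABC', 'DEF', 'GHI'],
                    'Company Name': ['ABC Co.', 'DEF Co.', 'GHI Co.'],
                    'COMBOACCRUAL': [0.9, 0.99, 0.5]})
df2 = pd.DataFrame({'Ticker': ['ABC', 'DEF', 'GHI'],
                    'Company Name': ['ABC Co.', 'DEF Co.', 'GHI Co.'],
                    'PMAN': [0.7, 0.3, 0.55]})
df3 = pd.DataFrame({'Ticker': ['ABC', 'DEF', 'GHI'],
                    'Company Name': ['ABC Co.', 'DEF Co.', 'GHI Co.'],
                    'PFD': [0.25, 0.35, 0.9]})

keys = ['Ticker', 'Company Name']
df4 = df1.merge(df2, on=keys).merge(df3, on=keys)
df4 = df4[(df4['COMBOACCRUAL'] < 0.95) & (df4['PMAN'] < 0.95) & (df4['PFD'] < 0.95)]
print(df4)  # keeps ABC and GHI, drops DEF (COMBOACCRUAL 0.99)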

Filter columns containing values and NaN using specific characters and create separate columns

I have a dataframe containing columns in the below format
df =
ID    Folder Name  Country
300   ABC 12345    CANADA
1000  NaN          USA
450   AML 2233     USA
111   ABC 2234     USA
550   AML 3312     AFRICA
Output needs to be in the below format
ID    Folder Name  Country  Folder Name - ABC  Folder Name - AML
300   ABC 12345    CANADA   ABC 12345          NaN
1000  NaN          USA      NaN                NaN
450   AML 2233     USA      NaN                AML 2233
111   ABC 2234     USA      ABC 2234           NaN
550   AML 3312     AFRICA   NaN                AML 3312
I tried using the below Python code:
df_['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x.str.startswith('ABC',na = False))
Can you please help me see where I am going wrong?
You should not use apply here, but boolean indexing:
df.loc[df['Folder Name'].str.startswith('ABC', na=False),
       'Folder Name - ABC'] = df['Folder Name']
However, a better approach, which would not require you to loop over all possible codes, is to extract the code, pivot_table, and merge:
out = df.merge(
    df.assign(col=df['Folder Name'].str.extract(r'(\w+)', expand=False))
      .pivot_table(index='ID', columns='col',
                   values='Folder Name', aggfunc='first')
      .add_prefix('Folder Name - '),
    on='ID', how='left'
)
output:
ID Folder Name Country Folder Name - ABC Folder Name - AML
0 300 ABC 12345 CANADA ABC 12345 NaN
1 1000 NaN USA NaN NaN
2 450 AML 2233 USA NaN AML 2233
3 111 ABC 2234 USA ABC 2234 NaN
4 550 AML 3312 AFRICA NaN AML 3312
If you have a list with the substrings to be matched at the start of each string in df['Folder Name'], you could also achieve the result as follows:
lst = ['ABC','AML']
pat = f'^({".*)|(".join(lst)}.*)'
# '^(ABC.*)|(AML.*)'
df[[f'Folder Name - {x}' for x in lst]] = \
df['Folder Name'].str.extract(pat, expand=True)
print(df)
ID Folder Name Country Folder Name - ABC Folder Name - AML
0 300 ABC 12345 CANADA ABC 12345 NaN
1 1000 NaN USA NaN NaN
2 450 AML 2233 USA NaN AML 2233
3 111 ABC 2234 USA ABC 2234 NaN
4 550 AML 3312 AFRICA NaN AML 3312
If you do not already have this list, you can simply create it first by doing:
lst = df['Folder Name'].dropna().str.extract('^([A-Z]{3})')[0].unique()
# this will be an array, not a list,
# but that doesn't affect the functionality here
N.B. If your list contains items that won't match, you'll end up with extra columns filled completely with NaN values. You can get rid of these at the end. E.g.:
lst = ['ABC','AML','NON']
# 'NON' won't match
pat = f'^({".*)|(".join(lst)}.*)'
df[[f'Folder Name - {x}' for x in lst]] = \
df['Folder Name'].str.extract(pat, expand=True)
df = df.dropna(axis=1, how='all')
# dropping column `Folder Name - NON` with only `NaN` values
The startswith method returns True or False, so your column would contain just boolean values. Instead you can try this:
df['Folder Name - ABC'] = df['Folder Name'].apply(
    lambda x: x if isinstance(x, str) and x.startswith('ABC') else None)
Does this code do the trick? (na=False guards against the NaN entries in Folder Name:)
df['Folder Name - ABC'] = df['Folder Name'].where(df['Folder Name'].str.startswith('ABC', na=False))
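If this needs repeating for several prefixes, a small self-contained sketch looping the same where pattern (the prefix list here is assumed from the question's sample data):

import pandas as pd

df = pd.DataFrame({'ID': [300, 1000, 450, 111, 550],
                   'Folder Name': ['ABC 12345', None, 'AML 2233',
                                   'ABC 2234', 'AML 3312'],
                   'Country': ['CANADA', 'USA', 'USA', 'USA', 'AFRICA']})

# one "Folder Name - <prefix>" column per prefix of interest
for prefix in ['ABC', 'AML']:
    df[f'Folder Name - {prefix}'] = df['Folder Name'].where(
        df['Folder Name'].str.startswith(prefix, na=False))
print(df)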

pandas fill in missing index from other dataframe

I wanted to know if there is a way for me to merge / re-join the missing rows simply by index.
My original approach was to cleanly separate df1 into df1_cleaned and df1_untouched, and then join them back together. But I thought there is probably an easier way to re-join the two dataframes, since I didn't change the index. I tried an outer merge with left_index and right_index, but was left with duplicated, suffixed columns to clean up.
df1
index  colA        colB  colC
0      California  123   abc
1      New York    456   def
2      Texas       789   ghi
df2 (subset of df1 and cleaned)
index  colA        colB  colC
0      California  321   abc
2      Texas       789   ihg
end-result
index  colA        colB  colC
0      California  321   abc
1      New York    456   def
2      Texas       789   ihg
You can use combine_first or update:
df_out = df2.combine_first(df1)
or pd.DataFrame.update (an in-place operation, which will overwrite df1):
df1.update(df2)
Output:
colA colB colC
index
0 California 321.0 abc
1 New York 456.0 def
2 Texas 789.0 ihg
You can get the difference of the indexes, and add the missing rows from df1 to df_result after reindexing df2:
df_result = df2.reindex(df1.index)
missing_index = df1.index.difference(df2.index)
df_result.loc[missing_index] = df1.loc[missing_index]
print(df_result)
colA colB colC
0 California 321.0 abc
1 New York 456.0 def
2 Texas 789.0 ihg
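Note that both approaches upcast colB from int to float (hence 321.0 in the outputs above), because the row alignment introduces NaN internally. A small self-contained sketch, casting back once the gaps are filled:

import pandas as pd

df1 = pd.DataFrame({'colA': ['California', 'New York', 'Texas'],
                    'colB': [123, 456, 789],
                    'colC': ['abc', 'def', 'ghi']})
df2 = pd.DataFrame({'colA': ['California', 'Texas'],
                    'colB': [321, 789],
                    'colC': ['abc', 'ihg']}, index=[0, 2])

df_out = df2.combine_first(df1)              # rows 0 and 2 from df2, row 1 from df1
df_out['colB'] = df_out['colB'].astype(int)  # undo the NaN-driven float upcast
print(df_out)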

Dropping rows using multiple criteria

I have the following data frame:
EID      CLEAN_NAME  Start_Date  End_Date    Rank_no
E000000  DEF         3/1/1973    2/28/1978   154
E000001  GHI         6/1/1983    3/31/1988   1296
E000001  ABC         1/1/2017                80292
E000002  JKL         10/1/1980   8/31/1981   751.5
E000003  MNO         5/1/1973    11/30/1977  157
E000003  ABC         5/1/1977    11/30/1987  200
E000003  PQR         5/1/1987    11/30/1997  300
E000003  ABC         5/1/1997                1000
What I am trying to do here is delete the company ABC record whose Rank_no is the highest among the ABC records within each EID. If an ABC record does not have the highest rank for an EID, it should not be deleted. The rest of the data should remain as is. The expected output is as follows:
EID      CLEAN_NAME  Start_Date  End_Date    Rank_no
E000000  DEF         3/1/1973    2/28/1978   154
E000001  GHI         6/1/1983    3/31/1988   1296
E000002  JKL         10/1/1980   8/31/1981   751.5
E000003  MNO         5/1/1973    11/30/1977  157
E000003  ABC         5/1/1977    11/30/1987  200
E000003  PQR         5/1/1987    11/30/1997  300
I tried to use the following code:
result_new = result.drop(result[(result['Rank_no'] == result.Rank_no.max()) & (result['CLEAN_NAME'] == 'ABC')].index)
But it's not working. I'm pretty sure I am writing the conditions incorrectly, but I'm not sure what exactly I am missing. I have named my data frame result.
Any leads would be appreciated. Thanks!
Use groupby and idxmax to find, for each EID, the index of the highest-ranked row after filtering down to only the rows whose CLEAN_NAME is ABC, then drop those indexes:
df.drop(df.loc[df.CLEAN_NAME == "ABC"].groupby("EID").Rank_no.idxmax())
EID CLEAN_NAME Start_Date End_Date Rank_no
0 E000000 DEF 3/1/1973 2/28/1978 154.0
1 E000001 GHI 6/1/1983 3/31/1988 1296.0
3 E000002 JKL 10/1/1980 8/31/1981 751.5
4 E000003 MNO 5/1/1973 11/30/1977 157.0
5 E000003 ABC 5/1/1977 11/30/1987 200.0
6 E000003 PQR 5/1/1987 11/30/1997 300.0
import pandas as pd
datas = [
['E000000', 'DEF', '3/1/1973', '2/28/1978', 154],
['E000001', 'GHI', '6/1/1983', '3/31/1988', 1296],
['E000001', 'ABC', '1/1/2017', '', 80292],
['E000002', 'JKL', '10/1/1980', '8/31/1981', 751.5],
['E000003', 'MNO', '5/1/1973', '11/30/1977', 157],
['E000003', 'ABC', '5/1/1977', '11/30/1987', 200],
['E000003', 'PQR', '5/1/1987', '11/30/1997', 300],
['E000003', 'ABC', '5/1/1997', '', 1000],
]
result = pd.DataFrame(datas, columns=['EID', 'CLEAN_NAME', 'Start_Date', 'End_Date', 'Rank_no'])
new_result = result.sort_values(by='Rank_no') # sort by lowest Rank_no
new_result = new_result.drop_duplicates(subset=['CLEAN_NAME'], keep='first') # drop duplicates keeping the first
new_result = new_result.sort_values(by='EID') # sort by EID
print(new_result)
Output :
EID CLEAN_NAME Start_Date End_Date Rank_no
0 E000000 DEF 3/1/1973 2/28/1978 154.0
1 E000001 GHI 6/1/1983 3/31/1988 1296.0
3 E000002 JKL 10/1/1980 8/31/1981 751.5
4 E000003 MNO 5/1/1973 11/30/1977 157.0
5 E000003 ABC 5/1/1977 11/30/1987 200.0
6 E000003 PQR 5/1/1987 11/30/1997 300.0
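Note that drop_duplicates(subset=['CLEAN_NAME']) deduplicates every company name across the whole frame, not just ABC; it happens to reproduce the expected output here, but it would also collapse a repeated non-ABC name. A sketch that follows the stated rule directly (drop only the highest-ranked ABC row per EID), reusing the result frame built above:

# index labels of the highest-ranked ABC row within each EID
drop_idx = (result[result['CLEAN_NAME'] == 'ABC']
            .groupby('EID')['Rank_no'].idxmax())
new_result = result.drop(drop_idx)
print(new_result)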

Pivot and concat values Pandas DataFrame

I have a dataframe that looks like this:
contactId ticker
0 ABC XYZ
1 ABC ZZZ
0 BCA YYY
Creating a pivot like so:
final_df = final_df.pivot_table(index='contactId', columns='ticker', aggfunc=len, fill_value=0)
Results in the following output:
ticker XYZ ZZZ YYY
contactId
ABC 1 1 0
BCA 0 0 1
As an intermediary step (see the request below), I am assuming we need to transform the pivot so that if the value is > 0 we show the ticker, else leave it blank, i.e.:
ticker XYZ ZZZ YYY
contactId
ABC XYZ ZZZ
BCA YYY
Because the output I am looking for is a list of space-separated tickers plus a text string per contactId:
contactId ticker description
ABC XYZ ZZZ The client is holding: XYZ ZZZ
BCA YYY The client is holding: YYY
For the intermediary step I tried the following (but it threw a ValueError: Grouper for 'ticker' not 1-dimensional):
final_df = final_df.pivot_table(index='contactId', columns='ticker', values='ticker', fill_value="")
Can you please assist? Thank you for the help in advance!
Inspired by #sharatpc 's suggestion, after adding the below to drop rows with a null contactId:
df = df[pd.notnull(df['contactId'])]
This worked for me:
df = df.set_index('contactId').groupby('contactId')['ticker'].transform(lambda x: ' '.join(x)).reset_index()
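A minimal sketch of an alternative that goes straight to the requested final shape (one row per contactId, space-joined tickers, plus the description string), using groupby/agg instead of a pivot:

import pandas as pd

df = pd.DataFrame({'contactId': ['ABC', 'ABC', 'BCA'],
                   'ticker': ['XYZ', 'ZZZ', 'YYY']})

# aggregate each contact's tickers into one space-separated string
out = df.groupby('contactId', as_index=False)['ticker'].agg(' '.join)
out['description'] = 'The client is holding: ' + out['ticker']
print(out)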
