Applying math to columns where rows hold the same value in pandas - python

I have 2 dataframes which look like this:
df1
A B
AAA 50
BBB 100
CCC 200
df2
C D
CCC 500
AAA 10
EEE 2100
I am trying to output a dataset where column E would be B - D wherever A = C. Since the A values are not aligned with the C values, I can't seem to find the appropriate method to apply the calculation and compare the right numbers.
There are also values which are not shared between the two datasets; in those places I want to add the text value 'Not found', so that the output would look like this:
output
A B C D E
AAA 50 AAA 10 B-D
BBB 100 Not found Not found Not found
CCC 200 CCC 500 B-D
Not found Not found EEE 2100 Not found
Thank you for your suggestions.

Use an outer join via DataFrame.merge with the left_on and right_on parameters, then subtract the columns. So that the subtraction stays numeric, it is better to leave the missing values in B and D as NaN:
#outer join on A == C keeps unmatched rows from both sides
df = (df1.merge(df2, left_on='A', right_on='C', how='outer')
         .fillna({'A': 'Not found', 'C': 'Not found'})
         .assign(E=lambda x: x.B - x.D))
print (df)
A B C D E
0 AAA 50.0 AAA 10.0 40.0
1 BBB 100.0 Not found NaN NaN
2 CCC 200.0 CCC 500.0 -300.0
3 Not found NaN EEE 2100.0 NaN
Lastly, it is possible to replace all missing values, but the numeric columns then mix strings with numbers, so further processing such as arithmetic operations becomes problematic:
#compute E while B and D are still numeric, then replace every NaN with text
df = (df1.merge(df2, left_on='A', right_on='C', how='outer')
         .assign(E=lambda x: x.B - x.D)
         .fillna('Not found'))
print (df)
A B C D E
0 AAA 50 AAA 10 40
1 BBB 100 Not found Not found Not found
2 CCC 200 CCC 500 -300
3 Not found Not found EEE 2100 Not found
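If the 'Not found' labels are only needed for display, the numeric dtypes can be recovered afterwards with pd.to_numeric and errors='coerce', which turns the text back into NaN. A minimal sketch, assuming df is the second result above and pandas is imported as pd:
#recover numeric dtypes after the string fill: non-numeric values become NaN
for col in ['B', 'D', 'E']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
print (df.dtypes)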

VLOOKUP in Python Pandas without using MERGE

I have two DataFrames with one common column as the key, and I want to perform a VLOOKUP-style operation to fetch values from the first DataFrame corresponding to the keys in the second DataFrame.
DataFrame 1
key value
0 aaa 111
1 bbb 222
2 ccc 333
DataFrame 2
key value
0 bbb None
1 ccc 333
2 aaa None
3 aaa 111
Desired Output
key value
0 bbb 222
1 ccc 333
2 aaa 111
3 aaa 111
I do not want to use merge, as both of my DFs might have NULL values in the key column, and since pandas merge behaves differently from a SQL join, all such rows might get joined with each other.
I tried the approach below
DF2['value'] = np.where(DF2['key'].isnull(), DF1.loc[DF2['key'].equals(DF1['key'])]['value'], DF2['value'])
but I have been getting a KeyError: False error.
You can use:
df2['value'] = df2['value'].fillna(df2['key'].map(df1.set_index('key')['value']))
print(df2)
# Output
key value
0 bbb 222
1 ccc 333
2 aaa 111
3 aaa 111
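For context, Series.map with a key-indexed Series acts like a VLOOKUP and, unlike merge, never pairs NaN keys with each other or multiplies rows. A self-contained sketch using the question's sample data (df1's keys must be unique, otherwise map raises an error):
import pandas as pd

df1 = pd.DataFrame({'key': ['aaa', 'bbb', 'ccc'],
                    'value': [111, 222, 333]})
df2 = pd.DataFrame({'key': ['bbb', 'ccc', 'aaa', 'aaa'],
                    'value': [None, 333, None, 111]})

#build a lookup Series indexed by key, map df2's keys through it,
#and use the looked-up values only where df2['value'] is missing
lookup = df1.set_index('key')['value']
df2['value'] = df2['value'].fillna(df2['key'].map(lookup))
print (df2)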

2 Different Data Frames + Percentage Calculation + Python

There are similar questions already, however I can't find the right answer. Most of them require a common denominator, which I don't have.
I want to have two outcomes from two data frames.
One outcome is the percentage each row in df2 represents of the total (df1). The other is a view of the accumulated percentage.
df1
a
1875
df2
b c
aaa 125
bbb 250
ccc 500
ddd 1000
Required outcome.
b c Outcome 1 Outcome 2
aaa 125 6.67% 100.00%
bbb 250 13.33% 93.33%
ccc 500 26.67% 80.00%
ddd 1000 53.33% 53.33%
I have tried df1.eq(df2.values).mean() and a couple of the merge functions, but again, I don't have a common denominator.
I hope you can help. Thanks.
Use:
#get the scalar total from the first DataFrame
a = df1.loc[0, 'a']
#divide by the scalar and multiply by 100
df2['Outcome 1'] = df2['c'].div(a).mul(100)
#create the cumulative sum over the rows in reversed order; the assignment
#aligns on the original index, so the result appears in the original row order
df2['Outcome 2'] = df2['Outcome 1'].iloc[::-1].cumsum()
print (df2)
b c Outcome 1 Outcome 2
0 aaa 125 6.666667 100.000000
1 bbb 250 13.333333 93.333333
2 ccc 500 26.666667 80.000000
3 ddd 1000 53.333333 53.333333
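If the percent signs from the required outcome are wanted literally, a possible final step is to format the columns as strings with Series.map; this is display-only, since the columns stop being numeric:
#format both outcome columns as percentage strings, e.g. 6.67% (display only)
for col in ['Outcome 1', 'Outcome 2']:
    df2[col] = df2[col].map('{:.2f}%'.format)
print (df2)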

Pandas Dataframe add columns based on existing data

I have a dataframe with 100s of columns and 1000s of rows but the basic structure is
Index 0 1 2
0 AAA NaN AAA
1 NaN BBB NaN
2 NaN NaN CCC
3 DDD DDD DDD
I would like to add two new columns: the first would be an id, equal to the first value in each row, and the second would be a count of the non-null values in each row. It would look like this. To be clear, all non-null values within a row will always be the same.
Index id count 0 1 2
0 AAA 2 AAA NaN AAA
1 BBB 1 NaN BBB NaN
2 CCC 1 NaN NaN CCC
3 DDD 3 DDD DDD DDD
Any help in figuring out a way to do this would be greatly appreciated. Thanks
This should work.
#back-fill along each row so the first column holds the first non-NaN value
df['id'] = df.bfill(axis=1).iloc[:, 0].fillna('All NANs')
#count the non-null values in each row, excluding the new id column
df['count'] = df.drop(columns=["id"]).notnull().sum(axis=1)
To maintain the order of columns:
df = df[list(df.columns[-2:]) + list(df.columns[:-2])]
Create the DataFrame:
test_df = pd.DataFrame([['AAA',np.nan,'AAA'], [np.nan,'BBB',np.nan], [np.nan,np.nan, 'CCC'], ['DDD','DDD','DDD']])
Count the non-NaN elements in each row as count
test_df['count'] = test_df.notna().sum(axis=1)
Option-1: Select the first element in the row as id (regardless of NaN value)
test_df['id'] = test_df[0]
Option-2: Select the first non-NaN element as id for each row
test_df['id'] = test_df.apply(lambda x: x[x.first_valid_index()], axis=1)
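One caveat for Option-2: first_valid_index() returns None for an all-NaN row, and indexing with None then raises a KeyError. A guarded variant restricted to the data columns (the column list and the 'All NaNs' fallback label are assumptions here):
#assumption: the row data lives in columns 0, 1 and 2; 'count' is excluded
#so an all-NaN row falls back to the label instead of picking up the count
data_cols = [0, 1, 2]

def first_value(row):
    idx = row.first_valid_index()  #None when every value in the row is NaN
    return row[idx] if idx is not None else 'All NaNs'

test_df['id'] = test_df[data_cols].apply(first_value, axis=1)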

Extracting existing and non existing values from 2 columns using pandas

I am new to pandas and I am trying to get a list of values that exist in both columns, values that only exist in column A, and values that only exist in column B.
My .csv file looks like this:
A B
AAA ZZZ
BBB BBB
CCC EEE
DDD FFF
EEE AAA
DDD
GGG HHH
JJJ
The columns have different lengths, and my desired outcome would be 3 lists, or one csv that I would output with 3 columns: one for items existing in both columns, one for items existing only in column A, and one for items existing only in column B.
IN BOTH IN COLUMN A IN COLUMN B
AAA CCC ZZZ
BBB GGG FFF
DDD JJJ HHH
EEE
(empty one)
I have tried using the .isin() method, but it returns True or False rather than the actual values.
existing_in_both = df_column_a.isin(df_column_b)
And I do not know how I should try to extract values that only exist in either column A or B.
Thank you for your suggestions.
My actual .csv has the following:
id clickout_id timestamp click_id click_type
1 123abc 2019-11-25 c51c56d1 1
1 123dce 2019-11-25 c51c5fs1 12
and other file is looking like this:
timestamp id gid type
2019-11-25 1 c51c56d1 2
2019-11-25 1 c51c5fs1 2
I am trying to compare click_id from the first file with gid from the second file.
When I print using your answer, I get the header names as results rather than the values from the columns.
Use sets with intersection and difference; for the new DataFrame wrap each list in a Series, because the outputs have different lengths:
a = set(df.A)
b = set(df.B)
df = pd.DataFrame({'IN BOTH': pd.Series(list(a & b)),
                   'IN COLUMN A': pd.Series(list(a - b)),
                   'IN COLUMN B': pd.Series(list(b - a))})
print (df)
IN BOTH IN COLUMN A IN COLUMN B
0 DDD CCC FFF
1 BBB GGG ZZZ
2 AAA JJJ HHH
3 NaN NaN
4 EEE NaN NaN
Or use numpy.intersect1d with numpy.setdiff1d:
df = pd.DataFrame({'IN BOTH': pd.Series(np.intersect1d(df.A, df.B)),
                   'IN COLUMN A': pd.Series(np.setdiff1d(df.A, df.B)),
                   'IN COLUMN B': pd.Series(np.setdiff1d(df.B, df.A))})
print (df)
IN BOTH IN COLUMN A IN COLUMN B
0 CCC FFF
1 AAA GGG HHH
2 BBB JJJ ZZZ
3 DDD NaN NaN
4 EEE NaN NaN
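Since the question also mentions writing one csv with the 3 columns, either resulting df above can be saved directly; a minimal sketch (the filename is an assumption):
#write the three comparison columns to a single CSV file
df.to_csv('comparison.csv', index=False)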

Merge Pandas Dataframe on a column with structured data

Scenario: Following up from a previous question on how to read an excel file from a server into a dataframe (How to read an excel file directly from a Server with Python), I am trying to merge the contents of multiple dataframes (which contain data from excel worksheets).
Issue: Even after searching for similar issues here on SO, I was still not able to solve the problem.
Format of data (each sheet is read into a dataframe):
Sheet 1 (db1)
Name CUSIP Date Price
A XXX 01/01/2001 100
B AAA 02/05/2005 90
C ZZZ 03/07/2006 95
Sheet2 (db2)
Ident CUSIP Value Class
123 XXX 0.5 AA
444 AAA 1.3 AB
555 ZZZ 2,8 AC
Wanted output (fnl):
Name CUSIP Date Price Ident Value Class
A XXX 01/01/2001 100 123 0.5 AA
B AAA 02/05/2005 90 444 1.3 AB
C ZZZ 03/07/2006 95 555 2.8 AC
What I already tried: I am trying to use the merge function to match each dataframe, but I am getting an error on the "how" part.
fnl = db1
fnl = fnl.merge(db2, how='outer', on=['CUSIP'])
fnl = fnl.merge(db3, how='outer', on=['CUSIP'])
fnl = fnl.merge(bte, how='outer', on=['CUSIP'])
I also tried concatenate, but I just get a list of dataframes instead of a single output.
wsframes = [db1 ,db2, db3]
fnl = pd.concat(wsframes, axis=1)
Question: What is the proper way to do this operation?
It seems you need:
from functools import reduce
#merge all dataframes in the list successively on CUSIP
dfs = [df1, df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
Name CUSIP Date Price Ident Value Class
0 A XXX 01/01/2001 100 123 0.5 AA
1 B AAA 02/05/2005 90 444 1.3 AB
2 C ZZZ 03/07/2006 95 555 2,8 AC
But the column names in each dataframe have to be different (apart from the merge key, CUSIP here), otherwise the duplicated columns get _x and _y suffixes:
dfs = [df1,df1, df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
Name_x CUSIP Date_x Price_x Name_y Date_y Price_y Ident Value \
0 A XXX 01/01/2001 100 A 01/01/2001 100 123 0.5
1 B AAA 02/05/2005 90 B 02/05/2005 90 444 1.3
2 C ZZZ 03/07/2006 95 C 03/07/2006 95 555 2,8
Class
0 AA
1 AB
2 AC
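If overlapping column names cannot be avoided, merge also accepts a suffixes parameter to control the labels instead of the default _x/_y; a hypothetical sketch for a single pairwise merge of the question's sheets:
#name the suffixes for overlapping columns explicitly
fnl = db1.merge(db2, on='CUSIP', how='outer', suffixes=('_sheet1', '_sheet2'))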
