Find a record across multiple Python pandas dataframes

Let's say, I have three dataframes as follows, and I would like to find in which dataframes a particular record exists.
this is dataframe1 (df1)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | rider | 223344 | Mexico
This is dataframe2 (df2)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | keith | 993344 | Brazil
This is dataframe3 (df3)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | hopper | 444444 | Canada
So, if I run the following code, I can find all the information about acct_no 112233 for a single dataframe.
p = df1.loc[df1['acct_no']==112233]
But I would like to know what code will help me find out whether acct_no 112233 exists in df1, df2, and df3.

One way to know whether the element is in the column 'acct_no' of a dataframe is:
>>> (df1['acct_no']==112233).any()
True
You can check whether it exists in all of them at once by doing:
>>> all([(df['acct_no']==112233).any() for df in [df1, df2, df3]])
True
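The all(...) check above only tells you whether the account exists in every dataframe. To find out which dataframes contain it, you can loop over them; a minimal sketch (the names dfs and hits are just for illustration):
dfs = {'df1': df1, 'df2': df2, 'df3': df3}
# keep the names of the dataframes whose 'acct_no' column contains the value
hits = [name for name, df in dfs.items() if (df['acct_no'] == 112233).any()]
print(hits)   # ['df1', 'df2', 'df3'] for the sample data above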

Related

Comparing two df to discover the missing rows

I have two pandas dataframes. One has 7000 lines, the other has 7003. Technically they should both have the same column (a column with names of cities), so one dataframe is missing 3 cities.
I need to discover which cities are missing from my df. I want to compare my two dataframes and find which rows are missing from the other one.
How could I do that? How could I write code that gives me the exact missing rows (the names of the cities) in my df, compared to the other?
df1
+-------+--------------+
| id | cities |
+-------+--------------+
| 1 | London |
| 2 | New York |
| 3 | Rio de Jan. |
| 4 | Roma |
| 5 | Berlin |
| 6 | Paris |
| 7 | Tokio |
+-------+--------------+
df2
+-------+--------------+
| id | cities |
+-------+--------------+
| 1 | London |
| 2 | New York |
| 3 | Rio de Jan. |
| 4 | Roma |
| 5 | Berlin |
| 6 | Paris |
+-------+--------------+
One approach using set:
missing_cities = set(df1["cities"]) - set(df2["cities"])
print(missing_cities)
Output
{'Tokio'}
As an alternative, use difference:
missing_cities = set(df1["cities"]).difference(df2["cities"])
The time complexity of both approaches is O(n + m), where n and m are the lengths of the two columns.
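If you want to test this end to end, here is a small self-contained sketch that rebuilds the two city columns from the question and applies the set difference:
import pandas as pd

df1 = pd.DataFrame({"id": range(1, 8),
                    "cities": ["London", "New York", "Rio de Jan.", "Roma", "Berlin", "Paris", "Tokio"]})
df2 = pd.DataFrame({"id": range(1, 7),
                    "cities": ["London", "New York", "Rio de Jan.", "Roma", "Berlin", "Paris"]})

missing_cities = set(df1["cities"]) - set(df2["cities"])
print(missing_cities)   # {'Tokio'}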
Another method is to use concat and .duplicated(keep=False) with a boolean filter.
When using pd.concat you can pass an optional keys argument, which lets you tell from the index which dataframe each row came from.
dfc = pd.concat([df1,df2],keys=[1,2])
dfc[~dfc.duplicated(subset='cities',keep=False)]
     id cities
1 6   7  Tokio
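If you also want to know which of the original frames an unmatched row came from, the keys passed to pd.concat end up as the first level of the index, so a small follow-up like this recovers them:
only_once = dfc[~dfc.duplicated(subset='cities', keep=False)]
# the first index level is the key given to pd.concat: 1 for df1, 2 for df2
print(only_once.index.get_level_values(0).tolist())   # [1], i.e. the row exists only in df1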

Merge 2 dataframes based on a column list, but match the rows with NULL values in a column

I have 2 dataframes which I have to join. I am trying to merge on the columns ID and STATUS. However, if STATUS is NULL for a row in df2, I want it to still match based on just ID and bring in the name.
In other words: if STATUS has a value, match on it; otherwise match on ID alone and bring in the name.
mer_col_list = ['ID','STATUS']
df_out = pd.merge(df1,df2, on=mer_col_list, how='left')
df1
|ID | STATUS     | NAME  |
|11 | ACTIVE     | John  |
|22 | DORMANT    | NICK  |
|33 | NOT_ACTIVE | HARRY |
df2
|ID | STATUS     | BRANCH |
|11 | DORMANT    | USA    |
|11 |            | USA    |
|22 |            | UK     |
|33 | NOT_ACTIVE | AUS    |
df_out:
|ID | NAME  | BRANCH |
|11 | JOHN  | USA    |
|22 | NICK  | UK     |
|33 | HARRY | AUS    |
You can create another left join on ID only for the rows where STATUS is missing, and then combine both DataFrames with DataFrame.fillna:
df_out1 = df1.merge(df2, on=['ID','STATUS'], how='left')
df_out2 = df1.merge(df2[df2['STATUS'].isna()].drop('STATUS',axis=1), on=['ID'],how='left')
df_out = df_out1.fillna(df_out2)
print (df_out)
ID STATUS NAME BRANCH
0 11 ACTIVE John USA
1 22 DORMANT NICK UK
2 33 NOT_ACTIVE HARRY AUS
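For reference, a minimal self-contained sketch that reconstructs the frames from the question and runs the two merges plus fillna (the values are copied from the tables above):
import pandas as pd

df1 = pd.DataFrame({'ID': [11, 22, 33],
                    'STATUS': ['ACTIVE', 'DORMANT', 'NOT_ACTIVE'],
                    'NAME': ['John', 'NICK', 'HARRY']})
df2 = pd.DataFrame({'ID': [11, 11, 22, 33],
                    'STATUS': ['DORMANT', None, None, 'NOT_ACTIVE'],
                    'BRANCH': ['USA', 'USA', 'UK', 'AUS']})

df_out1 = df1.merge(df2, on=['ID', 'STATUS'], how='left')
df_out2 = df1.merge(df2[df2['STATUS'].isna()].drop('STATUS', axis=1), on=['ID'], how='left')
df_out = df_out1.fillna(df_out2)   # fill the unmatched BRANCH values from the ID-only merge
print(df_out)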
You can also remove the missing values first, and drop duplicates in case df2 has duplicated ['ID','STATUS'] pairs or duplicated ID values among the rows where STATUS is missing:
df21 = df2.dropna(subset=['STATUS']).drop_duplicates(['ID','STATUS'])
df_out1 = df1.merge(df21, on=['ID','STATUS'], how='left')
df22 = df2[df2['STATUS'].isna()].drop('STATUS',axis=1).drop_duplicates(['ID'])
df_out2 = df1.merge(df22, on='ID',how='left')
df_out = df_out1.fillna(df_out2)
print (df_out)
Dynamic solution, assuming ID is always in the list mer_col_list:
mer_col_list = ['ID','STATUS']
df21 = df2.dropna(subset=mer_col_list).drop_duplicates(mer_col_list)
df_out1 = df1.merge(df21, on=mer_col_list, how='left')
no_id_cols = np.setdiff1d(mer_col_list, ['ID'])   # requires import numpy as np
print (no_id_cols)
['STATUS']
df22=df2[df2[no_id_cols].isna().any(axis=1)].drop(no_id_cols,axis=1).drop_duplicates(['ID'])
df_out2 = df1.merge(df22, on='ID',how='left')
df_out = df_out1.fillna(df_out2)
print (df_out)
ID STATUS NAME BRANCH
0 11 ACTIVE John USA
1 22 DORMANT NICK UK
2 33 NOT_ACTIVE HARRY AUS
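As an aside: because df_out1 and df_out2 are both left merges from df1 and therefore share its index, DataFrame.combine_first should give the same result as the fillna step here:
df_out = df_out1.combine_first(df_out2)   # fills the NaNs in df_out1 from the aligned df_out2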

Group by and choose the string value of a column based on a condition using pandas

I have a dataframe consisting of people: ('id','name','occupation').
| id | name | occupation |
|:--:|:--------:|:----------:|
| 1 | John | artist |
| 1 | John | painter |
| 2 | Mary | consultant |
| 3 | Benjamin | architect |
| 3 | Benjamin | manager |
| 4 | Alice | intern |
| 4 | Alice | architect |
Task:
Some people have multiple occupations; however, I need each person to have only one. For this I am trying to use the pandas groupby function.
Issue:
So far so good; however, I need to apply a condition based on their occupation, and this is where I got stuck.
The condition is simple:
if "architect" is in the 'occupation' of the group (person):
   keep the 'occupation' as "architect"
else:
   keep any/last/first (it doesn't matter) 'occupation'
The desired output would be:
| id | name | occupation |
|:--:|:--------:|:----------:|
| 1 | John | artist |
| 2 | Mary | consultant |
| 3 | Benjamin | architect |
| 4 | Alice | architect |
Attempt:
def one_occupation_per_person(occupation):
    if "architect" in occupation:
        return "architect"
    else:
        return ???

df.groupby(['id','name'])['occupation'].apply(lambda x: one_occupation_per_person(x['occupation']),axis=1)
I hope this describes the issue clearly enough. Any hints and ideas are appreciated!
Since "architect" comes out as the first item in a natural sort, you can simply sort on occupation and then groupby:
df.sort_values("occupation").groupby("id", as_index=False).first()
If you somehow had another occupation that sorts before architect, you can convert the column to pd.Categorical before sorting:
s = ["architect"] + df.loc[df["occupation"].ne("architect"),"occupation"].unique().tolist()
df["occupation"] = pd.Categorical(df["occupation"], ordered=True, categories=s)
print (df.sort_values("occupation").groupby("id", as_index=False).first())
Result:
id name occupation
0 1 John artist
1 2 Mary consultant
2 3 Benjamin architect
3 4 Alice architect
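If you prefer to finish the custom-function attempt from the question instead, here is a minimal sketch using groupby().agg (the function receives each person's occupations as a Series):
def one_occupation_per_person(occupations):
    # occupations is the Series of occupation values for one (id, name) group
    if "architect" in occupations.values:
        return "architect"
    return occupations.iloc[0]   # otherwise keep any one of them

out = df.groupby(['id', 'name'])['occupation'].agg(one_occupation_per_person).reset_index()
print(out)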

Is there any way to rearrange Excel data without copy-pasting?

I have an Excel file that contains a country column and dates as the remaining column names.
+---------+------------+------------+------------+
| country | 20/01/2020 | 21/01/2020 | 22/01/2020 |
+---------+------------+------------+------------+
| us      | 0          | 5          | 6          |
+---------+------------+------------+------------+
| Italy   | 20         | 23         | 33         |
+---------+------------+------------+------------+
| India   | 0          | 0          | 6          |
+---------+------------+------------+------------+
But I need the columns arranged as country, date, and count. Is there any way to rearrange the Excel data without copy-pasting?
The final Excel sheet needs to look like this:
+---------+------------+------------+
| country | date       | count      |
+---------+------------+------------+
| us      | 20/01/2020 | 0          |
+---------+------------+------------+
| us      | 21/01/2020 | 5          |
+---------+------------+------------+
| us      | 22/01/2020 | 6          |
+---------+------------+------------+
| Italy   | 20/01/2020 | 20         |
+---------+------------+------------+
| Italy   | 21/01/2020 | 23         |
+---------+------------+------------+
| Italy   | 22/01/2020 | 33         |
+---------+------------+------------+
| India   | 20/01/2020 | 0          |
+---------+------------+------------+
Unpivot using Power Query:
Data --> Get & Transform --> From Table/Range
Select the country column
Unpivot Other columns
Rename the resulting Attribute and Value columns to date and count
Because the dates in the header are turned into text, you may need to change the date column's type back to date, or, as I did, to date using locale.
M-Code
let
    Source = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"country", type text}, {"20/01/2020", Int64.Type}, {"21/01/2020", Int64.Type}, {"22/01/2020", Int64.Type}}),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"country"}, "date", "count"),
    #"Changed Type with Locale" = Table.TransformColumnTypes(#"Unpivoted Other Columns", {{"date", type date}}, "en-150")
in
    #"Changed Type with Locale"
Power Query is the best way, but if you want to use formulas:
In F1 enter:
=INDEX($A$2:$A$4,ROUNDUP(ROWS($1:1)/3,0))
and copy downward. In G1 enter:
=INDEX($B$1:$D$1,MOD(ROWS($1:1)-1,3)+1)
and copy downward. In H1 enter:
=INDEX($B$2:$D$4,ROUNDUP(ROWS($1:1)/3,0),MOD(ROWS($1:1)-1,3)+1)
and copy downward.
The 3 in these formulas is because we have 3 dates in the original table.
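Since the rest of this page works in pandas, here is a rough pandas equivalent of the same unpivot using DataFrame.melt, assuming the sheet has already been read into a DataFrame (the file names below are only placeholders):
import pandas as pd

df = pd.read_excel("countries.xlsx")                   # placeholder input file
long_df = df.melt(id_vars="country", var_name="date", value_name="count")
long_df.to_excel("countries_long.xlsx", index=False)   # placeholder output file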

groupby: how to show max(field1) and the value of field2 corresponding to max(field1)?

Let's say I have a table with 3 fields: client, city, sales, with sales being a float.
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | NY | 0 |
| a | LA | 1 |
| a | London | 2 |
| b | NY | 3 |
| b | LA | 4 |
| b | London | 5 |
+--------+--------+-------+
For each client, I would like to show which city has the greatest sales and what those sales are, i.e. I want this output:
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | London | 2 |
| b | London | 5 |
+--------+--------+-------+
Any suggestions?
This table can be generated with:
df=pd.DataFrame()
df['client']= np.repeat( ['a','b'],3 )
df['city'] = np.tile( ['NY','LA','London'],2)
df['sales']= np.arange(0,6)
This is wrong because it calculates the 'maximum' of the city column, and shows NY because, alphabetically, N > L:
max_by_id = df.groupby('client').max()
I can first create a dataframe with the highest sales, and then merge it with the initial dataframe to retrieve the city; it works, but I was wondering if there is a faster / more elegant way?
out = pd.merge( df, max_by_id, how='inner' ,on=['client','sales'] )
I remember doing something similar with cross apply statements in SQL but wouldn't know how to run a Pandas equivalent.
You need to sort by sales, then groupby client and pick the first row:
df.sort_values(['sales'], ascending=False).groupby('client').first().reset_index()
Or, as @user3483203 suggested:
df.loc[df.groupby('client')['sales'].idxmax()]
Output:
client city sales
0 a London 2
1 b London 5
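Putting the question's setup together with the idxmax answer gives a small runnable sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['client'] = np.repeat(['a', 'b'], 3)
df['city'] = np.tile(['NY', 'LA', 'London'], 2)
df['sales'] = np.arange(0, 6)

# idxmax returns the row label of the largest sales value within each client group,
# so the original index labels (2 and 5) are kept in the result
print(df.loc[df.groupby('client')['sales'].idxmax()])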
