How can we handle duplicated rows after merging two dataframes? - python

I have two dataframes that look like this.
import pandas as pd
# initialize list of lists
expenses = [[2020, 108, 76848.94], [2020, 108, 42488.51]]
# Create the pandas DataFrame
expenses = pd.DataFrame(expenses, columns = ['Year', 'ID', 'Expense'])
expenses['ID'] = expenses['ID'].astype(str)
# initialize list of lists
revenues = [[2020, 108, 'NRG CENTER', 380121.25], [2020, 108, 'NRG STADIUM Area1', 500655.83], [2020, 108, 'NRG STADIUM Area2', 500655.83], [2020, 108, 'NRG STADIUM Area3', 338153.03], [2020, 108, 'NRG CENTER BB', 70846.37]]
# Create the pandas DataFrame
revenues = pd.DataFrame(revenues, columns = ['Year', 'ID', 'Name', 'Revenue'])
revenues['ID'] = revenues['ID'].astype(str)
Now, merge.
df_final = pd.merge(expenses,
                    revenues,
                    left_on=['ID', 'Year'],
                    right_on=['ID', 'Year'],
                    how='inner')
df_final
When I merge the two together, I get duplicated rows: every expense row is repeated for each matching revenue row.
I tried to handle this a couple different ways. I tried this.
expenses_nodups = expenses.drop_duplicates()
df_final = pd.merge(revenues , expenses_nodups , on=['ID'])
df_final
That gives me dupes. So, I tried this.
df_final.drop_duplicates(keep=False, inplace=True)
df_final
That also gives me dupes.
I have a 'Name' field only in the revenue table, but nothing like this in the expense table. The expense table just has Year, ID, and Expense.
Why do we get these dupes?
How can I handle this dupe issue?
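For what it's worth, the duplication comes from a many-to-many merge: both expense rows and all five revenue rows share the same (2020, '108') key, so the merge produces every pairing (2 x 5 = 10 rows), and drop_duplicates cannot remove them because each pairing is a distinct row. Below is a minimal sketch of one common way to handle it, assuming that collapsing each side to one row per (Year, ID) by summing is the semantics you want:
# Collapse each table to one row per key so the merge is one-to-one.
expenses_agg = expenses.groupby(['Year', 'ID'], as_index=False)['Expense'].sum()
revenues_agg = revenues.groupby(['Year', 'ID'], as_index=False)['Revenue'].sum()
df_final = expenses_agg.merge(revenues_agg, on=['Year', 'ID'], how='inner')
df_final
If you need to keep the individual revenue 'Name' rows instead, another option is to aggregate only the expenses side and accept one merged row per revenue line.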

Related

About Pandas Dataframe

I have a question related to Pandas.
In df1 I have a dataframe with the id of each seller and their respective names.
In df2 I have the id of the salesmen and their respective sales.
I would like to add two new columns to df2 with the first and last names of the salesmen.
P.S. In df2, one of the sales is shared between two vendors.
import pandas as pd
vendors = {'first_name': ['Montgomery', 'Dagmar', 'Reeba', 'Shalom', 'Broddy', 'Aurelia'],
           'last_name': ['Humes', 'Elstow', 'Wattisham', 'Alen', 'Keningham', 'Brechin'],
           'id_vendor': [127, 241, 329, 333, 212, 233]}
sales = {'id_vendor': [['127'], ['241'], ['329, 333'], ['212'], ['233']],
         'sales': [1233, 25000, 8555, 4333, 3222]}
df1 = pd.DataFrame(vendors)
df2 = pd.DataFrame(sales)
I attach the code. Any suggestions?
Thank you in advance.
You can merge df1 with df2 on the exploded id_vendor column and use DataFrameGroupBy.agg when grouping by sales to obtain the columns you want:
transform_names = lambda x: ', '.join(list(x))
res = (df1.merge(df2.explode('id_vendor'))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
           first_name        last_name   id_vendor
sales
1233       Montgomery            Humes       [127]
3222          Aurelia          Brechin       [233]
4333           Broddy        Keningham       [212]
8555    Reeba, Shalom  Wattisham, Alen  [329, 333]
25000          Dagmar           Elstow       [241]
Note:
In your example, id_vendor in df2 is populated by lists of strings, but since id_vendor in df1 is of integer type, I assume that this was a typo. If id_vendor really does contain lists of strings, you also need to convert the strings to integers:
transform_names = lambda x: ', '.join(list(x))
# Notice the .astype(int) call.
res = (df1.merge(df2.explode('id_vendor').astype(int))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
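Not part of the original answer, but if the '329, 333' entry really is a single comma-separated string rather than a typo, one possible sketch is to split the ids into integers before exploding and then reuse the same pipeline:
# Split comma-separated id strings into lists of ints (assumes one string per list).
df2_fixed = df2.copy()
df2_fixed['id_vendor'] = df2_fixed['id_vendor'].apply(
    lambda ids: [int(i) for i in str(ids[0]).split(',')])
exploded = df2_fixed.explode('id_vendor').astype({'id_vendor': int})
res = (df1.merge(exploded)
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)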

Compare df's including detailed insight in data

I'm working on a Python project:
df_testR with columns={'Name', 'City', 'Licence', 'Amount'}
df_testF with columns={'Name', 'City', 'Licence', 'Amount'}
I want to compare both df's. The result should be a df where I see the Name, City, Licence and the Amount. Normally, df_testR and df_testF should be exactly the same.
In case they are not the same, I want to see the difference as Amount_R vs Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with only TRUE and FALSE:
Name  City  Licence  Amount
True  True  True     False
But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences like this:
Name  City  Licence  Amount_R  Amount_F
Paul  NY    YES      200       500
Here, both tables contain Paul, NY and Licence = YES, but table R contains 200 as the Amount while table F contains 500. I want my analysis to return a table that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns]
metadata_match['check'] = metadata_match.apply(all, 1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
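Not in the original answer, but since the question asks specifically for Amount_R and Amount_F columns: pd.merge accepts a suffixes argument, so the overlapping Amount columns can be named during the merge instead of defaulting to Amount_x/Amount_y (here I arbitrarily treat df2 as the "F" table and df1 as the "R" table; swap the suffixes if it is the other way around):
# Same merge as above, but give the overlapping columns explicit suffixes.
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns,
               suffixes=('_F', '_R'))
df_different_amounts = df3.loc[df3['Amount_F'] != df3['Amount_R'], :]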

Join two dataframes by the closest datetime

I have two dataframes df_A and df_B where each has date, time and a value. An example below:
import pandas as pd
df_A = pd.DataFrame({
    'date_A': ["2021-02-01", "2021-02-01", "2021-02-02"],
    'time_A': ["22:00:00", "23:00:00", "00:00:00"],
    'val_A': [100, 200, 300]})
df_B = pd.DataFrame({
    'date_B': ["2021-02-01", "2021-02-01", "2021-02-01", "2021-02-01", "2021-02-02"],
    'time_B': ["22:01:12", "22:59:34", "23:00:17", "23:59:57", "00:00:11"],
    'val_B': [104, 203, 195, 296, 294]})
I need to join these dataframes, but the dates and times never match exactly. So I want a left join that pairs each row of df_A with the closest datetime from df_B. The output should be:
df_out = pd.DataFrame({
    'date_A': ["2021-02-01", "2021-02-01", "2021-02-02"],
    'time_A': ["22:00:00", "23:00:00", "00:00:00"],
    'val_A': [100, 200, 300],
    'date_B': ["2021-02-01", "2021-02-01", "2021-02-01"],
    'time_B': ["22:01:12", "23:00:17", "23:59:57"],
    'val_B': [104, 195, 296]})
df_out
Pandas has a handy merge_asof() function for these types of problems (https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html)
It requires a single key to merge on, so you can create a single date-time column in each dataframe and perform the merge:
df_A['date_time'] = pd.to_datetime(df_A.date_A + " " + df_A.time_A)
df_B['date_time'] = pd.to_datetime(df_B.date_B + " " + df_B.time_B)
# Sort the two dataframes by the new key, as required by merge_asof function
df_A.sort_values(by="date_time", inplace=True, ignore_index=True)
df_B.sort_values(by="date_time", inplace=True, ignore_index=True)
result_df = pd.merge_asof(df_A, df_B, on="date_time", direction="nearest")
Note the direction argument's value is "nearest" as you requested. There are other values you can choose, like "backward" and "forward".
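As a possible refinement (not part of the original answer): merge_asof also accepts a tolerance argument, so rows farther apart than a chosen gap are left unmatched (NaN) instead of being forced onto their nearest neighbour, and the helper column can be dropped afterwards. The 5-minute threshold below is just an assumption for illustration:
result_df = pd.merge_asof(df_A, df_B, on="date_time",
                          direction="nearest",
                          tolerance=pd.Timedelta("5min"))
# Drop the helper key once the merge is done.
result_df = result_df.drop(columns="date_time")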

Check whether all unique values of column B are mapped to all unique values of column A

I need a little help. I know it's probably easy; I tried but didn't reach the goal.
# Import pandas library
import pandas as pd
data1 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600]]
df1 = pd.DataFrame(data1, columns = ['Country', 'Bottle_Weight'])
data2 = [['India', 350], ['India', 600],['India', 200], ['Bangladesh', 350],['Bangladesh', 600]]
df2 = pd.DataFrame(data2, columns = ['Country', 'Bottle_Weight'])
data3 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600],['Bangladesh', 200]]
df3 = pd.DataFrame(data3, columns = ['Country', 'Bottle_Weight'])
So basically I want to create a function that checks the mapping by comparing the unique bottle weights of every other country against those of the first country.
According to the 1st dataframe, it should report: all unique values of 'Bottle_Weight' are mapped to all unique countries.
According to the 2nd dataframe, it should report: 'Country_name' not mapped with 'Column name' 'value'.
In this case: 'Bangladesh' not mapped with 'Bottle_Weight' 200.
According to the 3rd dataframe, it should report: all unique values of 'Bottle_Weight' are mapped to all unique countries, and on a new line: 'Country_name' mapped with new value '200'.
It is not a particularly efficient algorithm, but I think this should get you the results you are looking for.
import numpy as np

def check_weights(df):
    success = True
    countries = df['Country'].unique()
    first_weights = df.loc[df['Country'] == countries[0]]['Bottle_Weight'].unique()
    for country in countries[1:]:
        weights = df.loc[df['Country'] == country]['Bottle_Weight'].unique()
        for weight in first_weights:
            if not np.any(weights[:] == weight):
                success = False
                print(f"{country} does not have bottle weight {weight}")
    if success:
        print("All bottle weights are shared with another country")

How to filter stocks in a large dataset - python

I have a large dataset of over 20,000 stocks from 1964-2018. (It's CRSP data I got from my university). I now want to apply the following filter technique according to Amihud (2002):
1. include all stocks that have a price greater than $5 at the end of year t-1
2. include all stocks that have data for at least 200 days at the end of year t-1
3. the stocks have information about market capitalization at the end of year t-1
I'm quite stuck on this since I've never worked with such a large dataset. Any suggestions on where I can find ideas on how to solve this problem? Many thanks.
I already tried to filter on a monthly basis: I created a new dataframe that included those stocks whose prices were above $5 in December, but then I got stuck. The graph shows the number of stocks over time before and after applying the first filter.
# number of stocks over time
df['month'] = pd.DatetimeIndex(df.index).month
df2 = df[(df.month == 12) & (df.prc >= 5)]
EDIT:
I created a sample dataframe that looks like my original dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'date': ['2010-05-12', '2010-05-13', '2010-05-13',
             '2011-11-13', '2011-11-14', '2011-03-30', '2011-12-01',
             '2011-12-02', '2011-12-01', '2011-12-02'],
    'stock': ['stock_1', 'stock_1', 'stock_2', 'stock_3',
              'stock_3', 'stock_3', 'stock_1', 'stock_1', 'stock_2',
              'stock_2'],
    'price': [100, 102, 300, 51, 49, 45, 101, 104, 301, 299],
    'volume': [1000, 1020, np.nan, 510, 490, 450, 1010, 1040,
               np.nan, 2990],
    'return': [0.01, 0.03, 0.02, np.nan, 0.02, -0.04, -0.08,
               -0.01, np.nan, -0.01]})
df1 = df1.set_index(pd.DatetimeIndex(df1['date']))
pivot_df = df1.pivot_table(index=[df1.index, 'stock'],
                           values=['price', 'volume', 'return'])
The resulting dataframe is basically panel data. I want to check whether each stock has return and volume data (not NaN) each day. Then I want to remove all stocks that have return and volume data for fewer than 200 days in a given year. Since the original dataframe contains nearly 20,000 stocks from 1964-2018, I want to do this in an efficient way.
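No answer is included for this one, so here is a minimal sketch of one possible way to do the 200-day filter, building on the sample df1 above (long format, one row per stock per trading day). Applying the count to year t-1, i.e. keeping a stock in year t based on its year t-1 count, would additionally require shifting the eligibility by one year, which I leave aside here:
# Work on a copy so the sample frame from the question is left untouched.
df = df1.copy()
df['year'] = df.index.year
df['valid'] = df['return'].notna() & df['volume'].notna()

# Days per (stock, year) on which both return and volume are present,
# broadcast back onto every row of that stock/year.
df['valid_days'] = df.groupby(['stock', 'year'])['valid'].transform('sum')

# Keep only stock/years with at least 200 valid days (the Amihud (2002) threshold
# from the question; the tiny sample above obviously never reaches it, the code is
# meant for the full CRSP panel).
df_filtered = df[df['valid_days'] >= 200]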
