I currently have 2 datasets
1 = Drugs prescribed per hospital
2 = Crimes committed
I have been able to assign the nearest hospital ID to each crime, so I can identify which hospital is closest.
What I would really like to do is assign the amount of drugs prescribed (using the value_counts method) to the hospital ID in the Crimes data, so that I can then plot a scatter matrix of where the crimes took place against the total quantity of drugs prescribed by the closest hospital.
I have tried using the following
df = Crimes.merge(hosp[['hosp no', 'Total Quantity']],
                  left_on='hosp_no', right_on='hosp no').drop('hosp no', axis=1)
df
However, when I use the above code, the Hosp ID associated with the crime changes, and I don't want it to!!
I am new to jupyter notebook so I would be most grateful for any help!!
Thank you in advance
Crimes df
ID Type Hosp No
0 Anti-Social 222
Hosp df
Hosp no Total Quantity Drug name
222 1000 Paracetamol
So basically Hosp 222 has prescribed 1000 Paracetamol. How can I assign the number 1000 to the Crimes df where Hosp No = 222, so it looks like this:
Crimes df
ID Type Hosp No Total Quantity
0 Anti-Social 222 1000
If the columns you are merging on share the same name, you don't need the on parameter. Since you need the column added to Crimes, we can use how='left':
Crimes = Crimes.merge(Hosp[['Hosp No', 'Total Quantity']], how='left')
ID Type Hosp No Total Quantity
0 0 Anti-Social 222 1000
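One thing to watch: in the sample data the Hosp df spells the key 'Hosp no' while Crimes uses 'Hosp No'. If that is the case in your real data, a quick rename first keeps the automatic key detection working (a minimal sketch):

# If the key is spelled 'Hosp no' in the Hosp df (as in the sample data above),
# rename it so both frames share the name 'Hosp No' and merge finds the key itself.
Crimes = Crimes.merge(
    Hosp.rename(columns={'Hosp no': 'Hosp No'})[['Hosp No', 'Total Quantity']],
    how='left'
)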
Let me know if this is the desired output or if you need anything else.
I have two dataframes: one with an account number, a purchase ID, a total cost, and a date, and another with an account number, money paid, and a date.
To be clear, there are only two accounts, 11111 and 33333, but there are some typos in the dataframes.
AccountNumber Purchase ID Total Cost Date
11111 100 10 1/1/2020
333333 100 10 1/1/2020
33333 200 20 2/2/2020
11111 300 30 4/2/2020
AccountNumber Date Money Paid
11111 1/1/2020 5
111111 1/2/2020 2
33333 1/2/2020 1
33333 2/3/2020 15
1111 4/2/2020 30
Each Purchase ID is an identifier for a single purchase, however multiple accounts may be involved within the purchase, such as account 11111 and 33333. Moreover, an account may be used for two different purchases such as account 11111 with Purchase ID 100 and 300. In the second dataframe, payments can be made in increments, so I need to use the date to make sure that the payment is associated with the correct Purchase ID. Moreover, there may be some slight errors in the account numbers so I need to use a fuzzy match. In the end, I want to get a dataframe that is grouped by Purchase ID and compares how much the accounts paid vs. the cost of the item:
Purchase ID Date Total Cost Amount Paid $Owed
100 1/1/2020 10 8 2
200 2/2/2020 20 15 5
300 4/2/2020 30 30 0
As you can see, this is a fairly complicated question. I first tried just joining the two dataframes on AccountNumber, but I ran into issues because of the slight typos in the account numbers, and because matching each AccountNumber transaction to the correct Purchase ID via the date is tricky: with a plain merge you might accidentally sum up money paid against the wrong purchase, since an account can be involved in several purchases.
I'm thinking about iterating through the rows and using if statements/regex, but I feel like that would take too long.
What's the simplest and most efficient solution to this problem? I'm a beginner with pandas and Python.
The library pandas-dedupe can help you link the two dataframes by using a combination of active learning and clustering. Have a look at the repo.
Here is the sample code (and step by step explanation):
import pandas as pd
import pandas_dedupe
#load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')
#initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
# At this point pandas_dedupe will ask you to label a sample of records according
# to whether they are distinct or the same observation.
# After that, pandas-dedupe uses its knowledge to cluster together similar records.
#send output to csv
df_final.to_csv('linkage_output.csv')
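pandas-dedupe only handles the record linkage; the per-purchase totals still need to be computed afterwards. Here is a rough sketch of that second step, assuming the linked output has been reshaped so each payment row already carries the Purchase ID, purchase Date and Total Cost it was matched to (the frame and column names below are hypothetical, not pandas-dedupe output):

import pandas as pd

# Hypothetical post-linkage frame: one row per payment, already matched to its purchase.
linked = pd.DataFrame({
    'Purchase ID': [100, 100, 100, 200, 300],
    'Date':        ['1/1/2020', '1/1/2020', '1/1/2020', '2/2/2020', '4/2/2020'],
    'Total Cost':  [10, 10, 10, 20, 30],
    'Money Paid':  [5, 2, 1, 15, 30],
})

# Sum the payments per purchase, then compare against the purchase cost.
summary = (linked
           .groupby(['Purchase ID', 'Date', 'Total Cost'], as_index=False)['Money Paid']
           .sum()
           .rename(columns={'Money Paid': 'Amount Paid'}))
summary['$Owed'] = summary['Total Cost'] - summary['Amount Paid']
print(summary)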
I have seen a number of similar questions but cannot find a straightforward solution to my issue.
I am working with a pandas dataframe containing contact information for constituent donors to a nonprofit. The data has Households and Individuals. Most Households have member Individuals, but not all Individuals are associated with a Household. There is no data that links a Household to its member Individuals, so I am attempting to match them up based on other data - Home Street Address, Phone Number, Email, etc.
A simplified version of the dataframe looks something like this:
Constituent Id Type Home Street
1234567 Household 123 Main St.
2345678 Individual 123 Main St.
3456789 Individual 123 Main St.
4567890 Individual 433 Elm Rd.
0123456 Household 433 Elm Rd.
1357924 Individual 500 Stack Ln.
1344444 Individual 500 Stack Ln.
I am using groupby in order to group the constituents. In this case, by Home Street. I'm trying to ensure that I only get groupings with more than one record (to exclude Individuals unassociated with a Household). I am using something like:
df1 = df.groupby('Home Street').filter(lambda x: len(x) > 1)
What I would like to do is somehow export the grouped dataframe to a new dataframe that includes the Household Constituent Id first, then any Individual Constituent Ids. And in the case that there is no Household in the grouping, place the Individual Constituents in the appropriate locations. The output for my data set above would look like:
Household Individual Individual
1234567 2345678 3456789
0123456 4567890
1357924 1344444
I have toyed with iterating through the groupby object, but I feel like I'm missing some easy way to accomplish my task.
This should do it
# number duplicates within each (Home Street, Type) group: Individual_0, Individual_1, ...
df['Type'] = df['Type'] + '_' + df.groupby(['Home Street', 'Type']).cumcount().astype(str)
# pivot so each numbered Type becomes a column holding the matching Constituent Id
df.pivot_table(index='Home Street', columns='Type', values='Constituent Id', aggfunc=lambda x: ' '.join(x)).reset_index(drop=True)
Output
Type Household_0 Individual_0 Individual_1
0 1234567 2345678 3456789
1 0123456 4567890 NaN
2 NaN 1357924 1344444
IIUC, we can use groupby with agg(list) and some reshaping using .join and explode:
s = df.groupby(["Home Street", "Type"]).agg(list).unstack(1).reset_index(
    drop=True
).droplevel(level=0, axis=1).explode("Household")
df1 = s.join(pd.DataFrame(s["Individual"].tolist()).add_prefix("Individual_")).drop(
    "Individual", axis=1
)
print(df1.fillna(' '))
Household Individual_0 Individual_1
0 1234567 2345678 3456789
1 0123456 4567890
2 1357924 1344444
Or we can ditch the join and set Household as the index:
df1 = pd.DataFrame(s["Individual"].tolist(), index=s["Household"])\
.add_prefix("Individual_")
print(df1)
Individual_0 Individual_1
Household
1234567 2345678 3456789
0123456 4567890 None
NaN 1357924 1344444
I have a dataframe containing a sample of pubs in London, UK (3337 pubs/rows), with the geometry at LSOA level. In some LSOAs there is more than 1 pub. I want my dataframe to summarise the number of pubs in every LSOA. I already have the information by using
psdf['lsoa11nm'].value_counts()
prints out:
City of London 001F 103
City of London 001G 40
Westminster 013B 36
Westminster 018A 36
Westminster 013E 30
...
Lambeth 005A 1
Croydon 043C 1
Hackney 002E 1
Merton 022D 1
Bexley 008B 1
Name: lsoa11nm, Length: 1630, dtype: int64
I can't use this as a new dataframe because it is a Series with the LSOA names as the index and only one column of counts, as opposed to two columns where one would be lsoa11nm and the other the pub count.
Does anyone know how to groupby the dataframe so that there is only one row for every LSOA, showing how many pubs are in it?
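In case it helps, here is a minimal sketch of turning that count into the two-column dataframe described, assuming the dataframe is called psdf as in the question:

# value_counts() gives a Series (LSOA names in the index); reset_index turns it
# into the two-column frame asked for
pub_counts = (psdf['lsoa11nm']
              .value_counts()
              .rename_axis('lsoa11nm')
              .reset_index(name='pub_count'))

# equivalent groupby route: one row per LSOA with its pub count
pub_counts = psdf.groupby('lsoa11nm').size().reset_index(name='pub_count')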
Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
import pandas as pd

df = {
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South',
               'China_South', 'India_North', 'India_North', 'India_North',
               'India_West', 'India_West', 'India_East', 'India_South'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50]
}
dff = pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find the country with Max(Sales)
2) In that country, find the region with Max(Sales)
3) In that region, find the city with Max(Sales)
I tried groupby(['Product', 'City']).apply(lambda x: x.nlargest(1)), but this doesn't work, because it would propose city "C". That is the city with the highest sales globally, but China is not the country with the highest sales.
I probably have to go through several loops of groupby. Based on the result, filter the original dataframe and do a groupby again on the next level.
To add to the complexity, you sell other products too (not just 'Chairs', but also other furniture). You would have to store the results of each iteration (like country with Max(sales) per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas, how I could implement this in pandas/python?
The idea is to aggregate the sum at each level and take the top value with Series.idxmax, which is then used to filter the next level by boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print (max_city)
G
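If the same logic is needed per product (the question mentions selling other furniture besides chairs), one possible extension is to repeat the same idxmax drill-down inside a loop over products. A sketch using the same dff as above:

best_city = {}
for product, grp in dff.groupby('Product'):
    # drill down country -> region -> city within this product's rows only
    country = grp.groupby('Country')['Sales'].sum().idxmax()
    in_country = grp[grp['Country'] == country]
    region = in_country.groupby('Region')['Sales'].sum().idxmax()
    city = in_country[in_country['Region'] == region].groupby('City')['Sales'].sum().idxmax()
    best_city[product] = city

print(best_city)   # {'Chair': 'G'} for the sample data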
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)
factors = ['Country', 'Region', 'City']
for factor in factors:
    df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')
res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
   City Country Product       Region  Sales  Country_Total  Region_Total  City_Total
6     G   India   Chair  India_North    500           2100          1200         500
7     H   India   Chair  India_North    350           2100          1200         350
8     I   India   Chair  India_North    350           2100          1200         350
10    K   India   Chair   India_West    700           2100           800         700
9     J   India   Chair   India_West    100           2100           800         100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.
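And to pull just the recommended city out of the sorted frame, a small usage example:

best = res.iloc[0]       # top row after sorting by the three totals
print(best['City'])      # G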
I have a rather "cross-platform" question. I hope it is not too general.
One of my tables, say customers, consists of my customer ids and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like to have the shops as columns and the summed amount per customer per shop in my dataframe.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
Such that the last columns is the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal because the values inside the [] are strings and not integers. Hence, it involves a lot of manipulation and looping in Python to get it into the format I want.
Is there any way I can aggregate the purchases in SQL so that it is easier for Python to read and pivot into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
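If you want the shop_2, shop_3, ... style column names from the desired output rather than the bare shop numbers, one option is to apply add_prefix to the pivoted frame before the merge (a small variation on the code above):

df2 = (df2.pivot_table(columns='shop', values='amount', index='customer_id',
                       aggfunc='sum', fill_value=0)
          .add_prefix('shop_')     # 2 -> shop_2, 3 -> shop_3, ...
          .reset_index())
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')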