Merging DataFrames on specific columns

Merging DataFrames on specific columns - python

I have a frame moviegoers that includes zip codes but not cities.
I then redefined moviegoers to be zipcodes and changed the data type of zip codes to be a data frame instead of a series.
zipcodes = pd.read_csv('NYC1-moviegoers.csv',dtype={'zip_code': object})
I know the dataset URL I need is this: https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv.
I defined a dataframe, zip_codes, to call the data from that dataset and change the dataset type from series to dataframe so its in the same format as the zipcodes dataframe.
I want to merge the dataframes so I can have the movie goer data. But, instead of zipcodes, I want to have the state abbreviation. This is where I am having issues.
The end goal is to count the number of movie goers per state. Example ideal output:
CA 116
MN 78
NY 60
TX 51
IL 50
Any ideas would be greatly appreciated.

I think need map by Series and then use value_counts for count:
print (zipcodes)
zip_code
0 85711
1 94043
2 32067
3 43537
4 15213
s = zip_codes.set_index('Zipcode')['State']
df = zipcodes['zip_code'].map(s).value_counts().rename_axis('state').reset_index(name='count')
print (df.head())
state count
0 OH 1
1 CA 1
2 FL 1
3 AZ 1
4 PA 1

Simply merge both datasets on Zipcode columns then run groupby for state counts.
# READ DATA FILES WITH RENAMING OF ZIP COLUMN IN FIRST
url = "https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv"
moviegoers = pd.read_csv('NYC1-moviegoers.csv', dtype={'zip_code': object}).rename(columns={'zip_code': 'Zipcode'})
zipcodes = pd.read_csv(url, dtype={'Zipcode': object})
# MERGE ON COMMON FIELD
merged_df = pd.merge(moviegoers, zipcodes, on='Zipcode')
# AGGREGATE BY INDICATOR (STATE)
merged_df.groupby('State').size()
# ALTERNATIVE GROUP BY COUNT
merged_df.groupby('State')['Zipcode'].agg('count')

Related

Python: Excel to data frame : removing top rows and columns that doesnt contain 'right' data

I got a rather very basic excel to Pandas issue which I am unable to get around. Any help in this regard will be highly appreciated.
Source file
I got some data in an excel like below(apologies for pasting a picture and not Table):
Columns A,B,C are not required. I need the highlighted data to be read/moved into a pandas dataframe.

df = pd.read_excel('Miscel.xlsx',sheet_name='Sheet2',skiprows=8, usecols=[3,4,5,6])
df
Date Customers Location Sales
0 2021-10-05 A NSW 12
1 2021-10-03 B NSW 10
2 2021-10-01 C NSW 33
If your data is small, you can also read in and then drop the Nan.
df = pd.read_excel('Miscel.xlsx',sheet_name='Sheet2',skiprows=8).dropna(how='all',axis=1)

Pandas - pivoting multiple columns into fewer columns with some level of detail kept

Say I have the following code that generates a dataframe:
df = pd.DataFrame({"customer_code": ['1234','3411','9303'],
"main_purchases": [3,10,5],
"main_revenue": [103.5,401.5,99.0],
"secondary_purchases": [1,2,4],
"secondary_revenue": [43.1,77.5,104.6]
})
df.head()
There's the customer_code column that's the unique ID for each client.
And then there are 2 columns to indicate the purchases that took place and revenue generated from main branches by those clients.
And another 2 columns to indicate the purchases/revenue from secondary branches by those clients.
I want to get the data into a format like this, where a pivot is done where there's a new column to differentiate between main vs secondary, but the revenue numbers and purchase columns are not mixed up:
The obvious solution is just to split this into 2 dataframes, and then simply do a concatenate, but I'm wondering whether there's a built-in way to do this in a line or two - this strikes me as the kind of thing someone might have thought to bake in a solution for.

With a little column renaming to get the "revenue" and "purchases" in the column names first using a regular expression and str.replace we can use pd.wide_to_long to convert these now stubnames from columns to rows:
# Reorder column names so stubnames are first
df.columns = [df.columns[0],
*df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
# Convert wide_to_long
df = (
pd.wide_to_long(
df,
i='customer_code',
stubnames=['purchases', 'revenue'],
j='type',
sep='_',
suffix='.*'
)
.sort_index() # Optional sort to match expected output
.reset_index() # retrieve customer_code from the index
)
df:
customer_code
type
purchases
revenue
0
1234
main
3
103.5
1
1234
secondary
1
43.1
2
3411
main
10
401.5
3
3411
secondary
2
77.5
4
9303
main
5
99
5
9303
secondary
4
104.6
What does reordering the column headers do?
df.columns = [df.columns[0],
*df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
Produces:
Index(['customer_code', 'purchases_main', 'revenue_main',
'purchases_secondary', 'revenue_secondary'],
dtype='object')
The "type" column is now the suffix of the column header which allows wide_to_long to process the table as expected.

You can abstract the reshaping process with pivot_longer from pyjanitor; they are just a bunch of wrapper functions in Pandas:
#pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index = 'customer_code',
names_to=('type', '.value'),
names_sep='_',
sort_by_appearance=True)
customer_code type purchases revenue
0 1234 main 3 103.5
1 1234 secondary 1 43.1
2 3411 main 10 401.5
3 3411 secondary 2 77.5
4 9303 main 5 99.0
5 9303 secondary 4 104.6
The .value in names_to signifies to the function that you want that part of the column to remain as a header; the other part goes under the type column. The split is determined in this case by names_sep (there is a names_pattern option, that allows regular expression split); if you do not care about the order of appearance, you can set sort_by_appearance as False.

You can also use melt() and concat() function to solve this problem.
import pandas as pd
df1 = df.melt(
id_vars='customer_code',
value_vars=['main_purchases', 'secondary_purchases'],
var_name='type',
value_name='purchases',
ignore_index=True)
df2 = df.melt(
id_vars='customer_code',
value_vars=['main_revenue', 'secondary_revenue'],
var_name='type',
value_name='revenue',
ignore_index=True)
Then we use concat() with the parameter axis=1 to join side by side and use sort_values(by='customer_code') to sort data by customer.
result= pd.concat([df1,df2['revenue']],
axis=1,
ignore_index=False).sort_values(by='customer_code')
Using replace() with regex to align type names:
result.type.replace(r'_.*$','', regex=True, inplace=True)
The above code will output the below dataframe:
customer_code
type
purchases
revenue
0
1234
main
3
103.5
3
1234
secondary
1
43.1
1
3411
main
10
401.5
4
3411
secondary
2
77.5
2
9303
main
5
99
5
9303
secondary
4
104.6

Read in a csv with a different separator based on the each column?

Column id and runtime are comma-separated. However, column genres is separated by Pipe(|).
df = pd.read_csv(path, sep=',') results in the table below. However, I can't conduct any queries on column genres, for instance finding the most popular genre by year? Is it possible to separate pipe into separate rows?
df.head()
id runtime genres Year
0 135397 124 Action|Adventure|Science Fiction|Thriller 2000
1 76341 120 Action|Adventure|Science Fiction|Thriller 2002
2 262500 119 Adventure|Science Fiction|Thriller 2001
3 140607 136 Action|Adventure|Science Fiction|Fantasy 2000
4 168259 137 Action|Crime|Thriller 1999

You're better reading the file as is, then split the genres into new rows with pandas explode:
df = df.assign(genres = df.genres.str.split('|')).explode('genres')
so that you can easily manipulate your data.
For example, to get the most frequent (i.e. mode) genres per year:
df.groupby('Year').genres.apply(lambda x: x.mode()).droplevel(1)
To identify the counts:
def get_all_max(grp):
counts = grp.value_counts()
return counts[counts==counts.max()]
df.groupby('Year').genres.apply(get_all_max)\
.rename_axis(index={None:'Genre'}).to_frame(name='Count')

Pandas: Filter 14000 rows from a data frame consisting 300000 rows

I have a data frame df1 consisting of 14000 person id. I have another data frame df2 consisting of 300000 data of ids and other attributes. I need to match the 14000 id's of df1 to the 300000 id's of df2 and extract the whole row of those 14000 ids.
df1 personUuid
0 99afae32-1486-47db-825e-6695f742eb86
1 bb22ca94-1f4b-435c-98ff-bd6f02a6b42b
2 ecfdc560-cc97-4525-8d1e-e3536793ef6e
3 8fbe1e4f-ae1e-4949-afd9-b120f6ae3762
4 d83dc0c4-26e6-4126-926d-7b84913bca13
... ...
14367 23592455-47a2-47ef-9d21-a283ae50988d
14368 1adecd7e-a0c2-4c35-bef1-75569f3b57fe
14369 e96f6eb4-d823-47b4-bd03-755e8f685e8f
14370 c87156e2-9610-40f4-a75a-17435d9fa91f
14371 70f08fd1-c595-4d01-886d-ed586a77c1d1
personUuid firstName middleName lastName emails urls locations currentTitles currentCompanies education ... count_currentTitles fullName li_clean gh_clean tw_clean fb_clean email_clean email_clean1 email_clean2 email_clean3
0 ab92fa98-2427-461d-87ac-31a440b6e1ae
1 658c57b9-457a-4e97-8b1c-10ab45655518
2 7da5a858-3c20-46c0-b728-23e64352094d
3 9c14f2b6-a81a-49af-85d4-d4cf76001f07
Similarly, I have the second data frame with 300K person ids and attributes like fullname, emails, location, etc.
need to match those 14K ids to 300 K and display all the attributes of the 14K only.

You need to do a merge with an inner join as given below:
df1['personUuid'] = df1['personUuid'].str.strip()
df2['personUuid'] = df2['personUuid'].str.strip()
df = pd.merge(left=df1, right=df2, how='inner', on=['personUuid'])

Use pandas to get county name using fips codes

I have fips codes here: http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt
And a dataset that looks like this:
fips_state fips_county value
1 1 10
1 3 34
1 5 37
1 7 88
1 9 93
How can I get the county name of each row using the data from the link above with pandas?

Simply load both data sets into DataFrames, then set the appropriate index:
df1.set_index(['fips_state', 'fips_county'], inplace=True)
This gives you a MultiIndex by state+county. Once you've done this for both datasets, you can trivially map them, for example:
df1['county_name'] = df2.county_name

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Merging DataFrames on specific columns - python

Related

Python: Excel to data frame : removing top rows and columns that doesnt contain 'right' data

Pandas - pivoting multiple columns into fewer columns with some level of detail kept

Read in a csv with a different separator based on the each column?

Pandas: Filter 14000 rows from a data frame consisting 300000 rows

Use pandas to get county name using fips codes

Categories

Resources