pandas join gives NaN values - python

I want to join 2 DataFrames
Zipcode Database (first 10 entries)
0 zip_code City State County Population
0 0 90001 Los Angeles California Los Angeles 54481
1 1 90002 Los Angeles California Los Angeles 44584
2 2 90003 Los Angeles California Los Angeles 58187
3 3 90004 Los Angeles California Los Angeles 67850
4 4 90005 Los Angeles California Los Angeles 43014
5 5 90006 Los Angeles California Los Angeles 62765
6 6 90007 Los Angeles California Los Angeles 45021
7 7 90008 Los Angeles California Los Angeles 30840
8 8 90009 Los Angeles California Los Angeles -
9 9 90010 Los Angeles California Los Angeles 1943
And data (first 10 entries)
buyer zip_code
0 SWEENEY,THOMAS R & MICHELLE H NaN
1 DOUGHERTY,HERBERT III & JENNIFER M NaN
2 WEST COAST RLTY SVCS INC NaN
3 LOVE,JULIE M NaN
4 SAHAR,DAVID NaN
5 SILBERSTERN,BRADLEY E TRUST 91199
6 LEE,SUSAN & JIMMY C 92025
7 FRAZZANO REAL ESTATE I NC NaN
8 RUV INVESTMENTS LLC 91730
9 KAOS KAPITAL LLC NaN
So the final table should have [buyer, zip_code, City, County]. I'm joining on zip_code.
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')
But the City and County columns are NaN even for the rows in data where the zip code is actually present.
buyer zip_code City County
10 LANDON AVE TRUST 37736 NaN NaN NaN
11 UMAR,AHMAD NaN NaN NaN
12 3 JPS INC 90717 NaN NaN
13 T & L HOLDINGS INC 95610 NaN NaN
14 CAHP HOLDINGS LLC 90808 NaN NaN
15 REBUILDING TOGETHER LONG BEACH 92344 NaN NaN
16 COLFIN AI-CA 4 LLC NaN NaN NaN
17 GUTIERREZ,HUGO 91381 NaN NaN
18 VALBRIDGE CAP GOLDEN GATE FUND NaN NaN NaN
19 SOLARES,OSCAR 92570 NaN NaN
Why is this the case? The zipcode database has all zip codes from 90001 to 99950.
My first thought was that the datatype of "zip_code" differs between the two:
print(zipcode_database['zip_code'].dtype)
print(data['zip_code'].dtype)
Output:
int64
object
I thought of typecasting with astype, but that does not work with NaN values. Any thoughts?

You can cast NaN values to float types, but not int. In your case I would cast the zip_code field in both DataFrames to a float and then join.
zipcode_database.zip_code = zipcode_database.zip_code.astype(float)
data.zip_code = data.zip_code.astype(float)
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')
I can't reproduce anything meaningful from your example data (no matching zip codes), but that should fix the issue.
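If you want to sanity-check the fix in isolation, here is a minimal, self-contained sketch with made-up rows based on the question's data. Option 2 is an alternative using pandas' nullable Int64 dtype (available since pandas 0.24), which keeps the zip codes as integers while still allowing missing values:
import pandas as pd

zipcode_database = pd.DataFrame({
    'zip_code': [90001, 90002],                    # int64
    'City': ['Los Angeles', 'Los Angeles'],
    'County': ['Los Angeles', 'Los Angeles'],
})
data = pd.DataFrame({
    'buyer': ['SWEENEY,THOMAS R & MICHELLE H', 'LEE,SUSAN & JIMMY C'],
    'zip_code': [None, '90001'],                   # object, with a missing value
})

# Option 1: cast both key columns to float (NaN is itself a float, so this is safe).
data['zip_code'] = data['zip_code'].astype(float)
zipcode_database['zip_code'] = zipcode_database['zip_code'].astype(float)

# Option 2: keep integer semantics with the nullable Int64 dtype instead:
# data['zip_code'] = pd.to_numeric(data['zip_code'], errors='coerce').astype('Int64')
# zipcode_database['zip_code'] = zipcode_database['zip_code'].astype('Int64')

data_2 = data.join(
    zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'),
    on='zip_code',
)
print(data_2)  # the row with zip 90001 now gets City and County filled in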

Python version of dplyr R code commands for calculations

I am trying to create a separate pandas DataFrame in Python using pandas' .groupby function. I am working with basketball data and want to create columns that display whether the home and away teams are on the tail end of a back-to-back.
A 0 in the yesterday_home_team or yesterday_away_team column indicates that the team did not play the previous night.
Given that there are multiple games each night, the .groupby function should be used.
Input Data:
date home_team away_team
9/22/22 LAL DET
9/23/22 LAC LAL
Desired output:
date home_team away_team yesterday_home_team yesterday_away_team
9/21/22 LAL MIN 0 MIN
9/22/22 LAL DET DET 0
9/23/22 LAC LAL LAL LAC
Appreciate your assistance.
Your output example doesn't make sense to me. Do you need the team names in 'yesterday_home_team' and 'yesterday_away_team'? Is it sufficient to simply have a 1 if the home team is on a back-to-back and 0 if it is not (and the same logic for the away team)? It's also tough when you don't provide a good sample dataset.
Anyway, here's my solution, which indicates with a 1 or 0 whether the given team is on the back end of a back-to-back:
import pandas as pd
import numpy as np

# Pull every month of the 2021-22 schedule from basketball-reference.
months = ['October', 'November', 'December', 'January', 'February', 'March', 'April', 'May', 'June']
dfs = []
for month in months:
    month = month.lower()
    url = f'https://www.basketball-reference.com/leagues/NBA_2022_games-{month}.html'
    df = pd.read_html(url)[0]
    df['Date'] = pd.to_datetime(df['Date'])
    dfs.append(df)
df = pd.concat(dfs)
df = df.rename(columns={'Visitor/Neutral': 'away_team', 'Home/Neutral': 'home_team'})

# Melt to long form: one row per (date, team), whether the team was home or away.
df_melt = pd.melt(df, id_vars=['Date'],
                  value_vars=['away_team', 'home_team'],
                  var_name='Home_Away',
                  value_name='Team')
df_melt = df_melt.sort_values('Date').reset_index(drop=True)

# Days since each team's previous game; a gap of exactly 1 day means a back-to-back.
df_melt['days_between'] = df_melt.groupby('Team')['Date'].diff().dt.days
df_melt['yesterday'] = np.where(df_melt['days_between'] == 1, 1, 0)
df_melt = df_melt.drop(['days_between', 'Home_Away'], axis=1)

# Merge the flag back onto the schedule, once for the home side, once for the away side.
df = df.merge(df_melt.rename(columns={'Team': 'home_team', 'yesterday': 'yesterday_home_team'}),
              how='left', on=['Date', 'home_team'])
df = df.merge(df_melt.rename(columns={'Team': 'away_team', 'yesterday': 'yesterday_away_team'}),
              how='left', on=['Date', 'away_team'])
df = df[['Date', 'home_team', 'away_team', 'yesterday_home_team', 'yesterday_away_team']]
Output:
print(df.head(30).to_string())
Date home_team away_team yesterday_home_team yesterday_away_team
0 2021-10-19 Milwaukee Bucks Brooklyn Nets 0 0
1 2021-10-19 Los Angeles Lakers Golden State Warriors 0 0
2 2021-10-20 Charlotte Hornets Indiana Pacers 0 0
3 2021-10-20 Detroit Pistons Chicago Bulls 0 0
4 2021-10-20 New York Knicks Boston Celtics 0 0
5 2021-10-20 Toronto Raptors Washington Wizards 0 0
6 2021-10-20 Memphis Grizzlies Cleveland Cavaliers 0 0
7 2021-10-20 Minnesota Timberwolves Houston Rockets 0 0
8 2021-10-20 New Orleans Pelicans Philadelphia 76ers 0 0
9 2021-10-20 San Antonio Spurs Orlando Magic 0 0
10 2021-10-20 Utah Jazz Oklahoma City Thunder 0 0
11 2021-10-20 Portland Trail Blazers Sacramento Kings 0 0
12 2021-10-20 Phoenix Suns Denver Nuggets 0 0
13 2021-10-21 Atlanta Hawks Dallas Mavericks 0 0
14 2021-10-21 Miami Heat Milwaukee Bucks 0 0
15 2021-10-21 Golden State Warriors Los Angeles Clippers 0 0
16 2021-10-22 Orlando Magic New York Knicks 0 0
17 2021-10-22 Washington Wizards Indiana Pacers 0 0
18 2021-10-22 Cleveland Cavaliers Charlotte Hornets 0 0
19 2021-10-22 Boston Celtics Toronto Raptors 0 0
20 2021-10-22 Philadelphia 76ers Brooklyn Nets 0 0
21 2021-10-22 Houston Rockets Oklahoma City Thunder 0 0
22 2021-10-22 Chicago Bulls New Orleans Pelicans 0 0
23 2021-10-22 Denver Nuggets San Antonio Spurs 0 0
24 2021-10-22 Los Angeles Lakers Phoenix Suns 0 0
25 2021-10-22 Sacramento Kings Utah Jazz 0 0
26 2021-10-23 Cleveland Cavaliers Atlanta Hawks 1 0
27 2021-10-23 Indiana Pacers Miami Heat 1 0
28 2021-10-23 Toronto Raptors Dallas Mavericks 1 0
29 2021-10-23 Chicago Bulls Detroit Pistons 1 0

How to transform combinations of values in columns into individual columns?

I have a dataset (df), that looks like this:
Date    ID     County Name  State  State Name  Product Name  Type of Transaction  QTY
202105  10001  Los Angeles  CA     California  Shoes         Entry                630
202012  10002  Houston      TX     Texas       Keyboard      Exit                 5493
202001  11684  Chicago      IL     Illionis    Phone         Disposal             220
202107  12005  New York     NY     New York    Phone         Entry                302
...     ...    ...          ...    ...         ...           ...                  ...
202111  14990  Orlando      FL     Florida     Shoes         Exit                 201
For every county there are multiple entries, for different products, transaction types, and dates, but not all counties have the same number of entries, and they don't cover the same dates.
I want to recreate this dataset, such that:
1 - All counties have the same start and end dates, and for those dates where the county does not record entries, I want this entry to be recorded as NaN.
2 - Each combination of product name and transaction type becomes its own column.
Essentially, this is how the dataset needs to look:
Date    ID     County Name  State  State Name  Shoes, Entry  Shoes, Exit  Shoes, Disposal  Phones, Entry  Phones, Exit  Phones, Disposal  Keyboard, Entry  Keyboard, Exit  Keyboard, Disposal
202105  10001  Los Angeles  CA     California  594           694          5660             33299          1110          5659              4559             3223            56889
202012  10002  Houston      TX     Texas       3420          4439         549              2110           5669          2245              39294            3345            556
202001  11684  Chicago      IL     Illionis    55432         4439         329              21190          4320          455               34059            44556           5677
202107  12005  New York     NY     New York    34556         2204         4329             11193          22345         43221             1544             3467            22450
...     ...    ...          ...    ...         ...           ...          ...              ...            ...           ...               ...              ...             ...
202111  14990  Orlando      FL     Florida     54543         23059        3290             21394          34335         59660             NaN              NaN             NaN
In the example above, you can see that Florida does not record certain transactions; I would like those to be filled with NaN so the dataframe looks like this. I appreciate all the help!
This is essentially a pivot, with flattening of the MultiIndex:
(df
 .pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
        columns=['Product Name', 'Type of Transaction'],
        values='QTY')
 # flatten the resulting MultiIndex columns to single 'Product,Transaction' labels
 .pipe(lambda d: d.set_axis(map(','.join, d.columns), axis=1))
 .reset_index()
)
Output:
Date ID County Name State State Name Shoes,Entry Keyboard,Exit \
0 202001 11684 Chicago IL Illionis NaN NaN
1 202012 10002 Houston TX Texas NaN 5493.0
2 202105 10001 Los Angeles CA California 630.0 NaN
3 202107 12005 New York NY New York NaN NaN
Phone,Disposal Phone,Entry
0 220.0 NaN
1 NaN NaN
2 NaN NaN
3 NaN 302.0
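The pivot covers requirement 2. For requirement 1 (every county sharing the same start and end dates, with NaN where nothing was recorded), one possible follow-up is to build the cartesian product of counties and dates and left-join the pivoted data onto it. A sketch, assuming the pivoted frame above is saved as out (how='cross' needs pandas >= 1.2):
# every distinct date and every distinct county in the data
all_dates = out[['Date']].drop_duplicates()
counties = out[['ID', 'County Name', 'State', 'State Name']].drop_duplicates()

# one row per (county, date); combinations without an entry come out as NaN
full = counties.merge(all_dates, how='cross')
out_full = full.merge(out, how='left',
                      on=['Date', 'ID', 'County Name', 'State', 'State Name'])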

Python Pivot: Can I get the count of columns per row (ID/index) and store it in a new column?

Hope you can help me with this.
The df looks like this.
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
For each row (ID), I want to get the count of metro/country columns that have data and store that count in a new column.
Regards,
RJ
You may want to try counting the non-NaN cells across each row:
df['new'] = df.notna().sum(axis=1)
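A small self-contained example of the idea (the two-level column header below is made up, standing in for the question's region/country/metro levels):
import pandas as pd
import numpy as np

cols = pd.MultiIndex.from_tuples(
    [('AMER', 'Brazil'), ('AMER', 'Canada'), ('AMER', 'Mexico')],
    names=['region', 'country'])
df = pd.DataFrame([[2, np.nan, 13], [np.nan, np.nan, 3]],
                  index=pd.Index([321321, 23213], name='ID'),
                  columns=cols)

# count the populated (non-NaN) columns in each row
df['new'] = df.notna().sum(axis=1)
print(df)
# ID 321321 has 2 populated columns, ID 23213 has 1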

Fillna with the group's most frequent value if the group has occurred, else fillna with the most frequent value of the entire column

I have a pandas DataFrame:
City State
0 Cambridge MA
1 NaN DC
2 Boston MA
3 Washignton DC
4 NaN MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 NaN FL
11 Washington DC
I want to fill the NaNs with the most frequent city for each state (when the state has appeared before), so I group by state and apply the following code:
df['City'] = df.groupby('State').transform(lambda x:x.fillna(x.value_counts().idxmax()))
The above code works if all states have occurred before; the output will be
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
However, I want to add a condition so that if a state has never occurred, its city will be filled with the most frequent value of the entire City column. I.e., if the dataframe is
City State
0 Cambridge MA
1 NaN DC
2 Boston MA
3 Washignton DC
4 NaN MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 NaN FL
11 Washington DC
12 NaN NY
NY has never occurred before, so I want the output to be
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
12 Cambridge NY
The code above gives a ValueError: ('attempt to get argmax of an empty sequence') because "NY" has never occurred before.
IIUC:
import numpy as np

def f(x):
    # all-NaN group: return NaN so the global fallback below can fill it
    if x.count() <= 0:
        return np.nan
    return x.value_counts().index[0]

df['City'] = df.groupby('State')['City'].transform(f)
df['City'] = df['City'].fillna(df['City'].value_counts().idxmax())
Output:
City State
0 Cambridge MA
1 Washignton DC
2 Cambridge MA
3 Washignton DC
4 Cambridge MA
5 Miami FL
6 Cambridge MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washignton DC
12 Cambridge NY
You can solve this by the following code
mode = df['City'].mode()[0]
df['City'] = df.groupby('State')['City'].apply(
    lambda x: x.fillna(x.value_counts().idxmax()
                       if x.value_counts().max() >= 1 else mode))
df['City'] = df['City'].fillna(df['City'].value_counts().idxmax())
Output:
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
12 Cambridge NY
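A compact variant (a sketch, not the only way) that fills only the missing cells, using transform so the per-state mode aligns with the original index:
overall_mode = df['City'].mode()[0]
df['City'] = df['City'].fillna(
    df.groupby('State')['City'].transform(
        lambda s: s.mode().iloc[0] if s.notna().any() else overall_mode))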

Splitting a Pandas DataFrame column into two columns

I'm working on a simple web-scraping/DataFrame project. I have a simple 8x1 DataFrame, and I'm trying to split it into an 8x2 DataFrame. So far this is what my DataFrame looks like:
dframe = DataFrame(data, columns=['Active NPGL Teams'], index=[1, 2, 3, 4, 5, 6, 7, 8])
Active NPGL Teams
1 Baltimore Anthem (2015–present)
2 Boston Iron (2014–present)
3 DC Brawlers (2014–present)
4 Los Angeles Reign (2014–present)
5 Miami Surge (2014–present)
6 New York Rhinos (2014–present)
7 Phoenix Rise (2014–present)
8 San Francisco Fire (2014–present)
I would like to add a column, "Years Active" and split the "(2014-present)", "(2015-present)" into the "Years Active" column. How do I split my data?
You can use
dframe['Active NPGL Teams'].str.split(r' (?=\()', expand=True)
                    0               1
1    Baltimore Anthem  (2015–present)
2         Boston Iron  (2014–present)
3         DC Brawlers  (2014–present)
4   Los Angeles Reign  (2014–present)
5         Miami Surge  (2014–present)
6     New York Rhinos  (2014–present)
7        Phoenix Rise  (2014–present)
8  San Francisco Fire  (2014–present)
The key is the regex r' (?=\()' which matches a space only if it is followed by an open parenthesis (lookahead assertion).
Another approach (which is about 5% slower but more flexible) is to use Series.str.extract.
dframe['Active NPGL Teams'].str.extract(r'^(?P<Team>.+) (?P<YearsActive>\(.+\))$',
                                        expand=True)
                 Team     YearsActive
1    Baltimore Anthem  (2015–present)
2         Boston Iron  (2014–present)
3         DC Brawlers  (2014–present)
4   Los Angeles Reign  (2014–present)
5         Miami Surge  (2014–present)
6     New York Rhinos  (2014–present)
7        Phoenix Rise  (2014–present)
8  San Francisco Fire  (2014–present)
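To actually end up with the 8x2 frame the question asks for, assign the split result back and drop the original column, e.g. (a sketch reusing the split from above; 'Team' and 'Years Active' are the question's column names):
dframe[['Team', 'Years Active']] = (
    dframe['Active NPGL Teams'].str.split(r' (?=\()', expand=True)
)
dframe = dframe.drop(columns='Active NPGL Teams')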
