Best way to replicate SQL "update case when..." with Pandas? - python

I have this sample data set
City
LAL
NYK
Dallas
Detroit
SF
Chicago
Denver
Phoenix
Toronto
And what I want to do is update certain values with specific values, and the rest of it I would leave as it is.
So, with SQL I would do something like this:
update table1
set city = case
when city='LAL' then 'Los Angeles'
when city='NYK' then 'New York'
Else city
end
What would be the best way to do this in Pandas?

Use replace on the City column:
df['City'] = df['City'].replace({"LAL": "Los Angeles", "NYK": "New York"})
output:
City
0 Los Angeles
1 New York
2 Dallas
3 Detroit
4 SF
5 Chicago
6 Denver
7 Phoenix
8 Toronto

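You can also replace the values directly, one key at a time. Use .loc rather than chained indexing so the assignment reliably hits the original DataFrame:
replacement_dict = {"LAL": "Los Angeles", "NYK": "New York"}
for key, value in replacement_dict.items():
    df.loc[df['City'] == key, 'City'] = value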

You can do this with replace(). One option is to define a dict. Note that replace() returns a new DataFrame, so assign the result back.
Example
df = pd.DataFrame({'City':["LAL","NYK","Dallas","Detroit","SF","Chicago","Denver","Phoenix","Toronto"]})
df = df.replace({"LAL": "Los Angeles", "NYK": "New York"})
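When the SQL CASE has conditions that are more than simple equality checks, a dict passed to replace() no longer fits. One possible approach, sketched here on the question's sample data, is numpy.select, which takes a list of conditions, a list of matching results, and a default that plays the role of ELSE. This is an illustrative alternative, not something the answers above proposed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"City": ["LAL", "NYK", "Dallas", "Detroit", "SF",
                            "Chicago", "Denver", "Phoenix", "Toronto"]})

# Each condition corresponds to one WHEN branch, in priority order.
conditions = [df["City"] == "LAL", df["City"] == "NYK"]
# Each choice corresponds to the THEN value of the same branch.
choices = ["Los Angeles", "New York"]

# default=df["City"] mirrors "ELSE city": unmatched rows keep their value.
df["City"] = np.select(conditions, choices, default=df["City"])
print(df)
```

The conditions can be any boolean Series (for example a string test or a numeric comparison on another column), which is where this form becomes more useful than replace().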

Related

Python: no change in a pandas dataframe column when using apply function [duplicate]

As a reproducible example, I created the following dataframe:
dictionary = {'Metropolitan area': ['New York City','New York City','Los Angeles', 'Los Angeles'],
'Population (2016 est.)[8]': [20153634, 20153634, 13310447, 13310447],
'NBA':['Knicks',' ',' ', 'Clippers']}
df = pd.DataFrame(dictionary)
To replace any blank entry in df['NBA'] with None, I created the following function:
def transform(x):
    if len(x)<2:
        return None
    else:
        return x
which I apply over df['NBA'] using .apply method:
df['NBA'].apply(transform)
After doing this, I get the following output, which seems to have been successful:
0 Knicks
1 None
2 None
3 Clippers
Name: NBA, dtype: object
But here is the problem: when I display df, df['NBA'] is not transformed. The column is as it was from the beginning, with the spaces still present and not replaced by None:
Metropolitan area Population (2016 est.)[8] NBA
0 New York City 20153634 Knicks
1 New York City 20153634
2 Los Angeles 13310447
3 Los Angeles 13310447 Clippers
What am I doing wrong? Am I misunderstanding the .apply method?
The command df['NBA'].apply(transform) on its own performs the operation and returns a new Series, but it does not save the result to the original DataFrame.
So you just have to assign the result back to the column:
df['NBA'] = df['NBA'].apply(transform)
and the resulting DataFrame should be:
Metropolitan area Population (2016 est.)[8] NBA
0 New York City 20153634 Knicks
1 New York City 20153634 None
2 Los Angeles 13310447 None
3 Los Angeles 13310447 Clippers
Assign the results of apply back to the column.
df['NBA'] = df['NBA'].apply(transform)
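The behaviour described in both answers can be checked with a small sketch. It uses only the NBA column from the question; the variable names result and unchanged are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"NBA": ["Knicks", " ", " ", "Clippers"]})

def transform(x):
    # Treat strings shorter than 2 characters as missing.
    return None if len(x) < 2 else x

# apply() builds and returns a new Series; df itself is not modified.
result = df["NBA"].apply(transform)
unchanged = df["NBA"].tolist()   # still the original values, blanks included

# Only the explicit assignment stores the transformed values in df.
df["NBA"] = result
print(df)
```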

How to choose random item from a dictionary to df and exclude one item?

I have a dictionary and a dataframe, for example:
data={'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df=pd.DataFrame(data)
print(df)
city={"New York": "123",
"LA":"456",
"Miami":"789"}
Output:
Name
0 Tom
1 Joseph
2 Krish
3 John
I've created a column called CITY by using the following:
df["CITY"]=np.random.choice(list(city), len(df))
df
Name CITY
0 Tom New York
1 Joseph LA
2 Krish Miami
3 John New York
Now I would like to generate a new column, CITY2, with a random item from the city dictionary, but CITY2 must differ from CITY in every row. So when generating CITY2, I need to exclude that row's CITY item.
It's worth mentioning that my real df is quite large, so I need this to be as efficient as possible.
Thanks in advance.
You can continue with the approach you have used:
pd.Series() is used as a convenience to drop the value that has already been used;
this is wrapped in apply() to pick a value for each row.
data={'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df=pd.DataFrame(data)
city={"New York": "123",
"LA":"456",
"Miami":"789"}
df["CITY"]=np.random.choice(list(city), len(df))
df["CITY2"] = df["CITY"].apply(lambda x: np.random.choice(pd.Series(city).drop(x).index))
Name CITY CITY2
0 Tom Miami New York
1 Joseph LA Miami
2 Krish New York Miami
3 John New York LA
You could also first group by "CITY", remove the current city per group from the city dict and then create the new random list of cities.
Maybe this is faster because you don't have to drop one city per row, but per group.
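parts = []  # Series.append was removed in pandas 2.0, so collect the pieces and concat
for key, group in df.groupby('CITY'):
    # all cities except the one this group already has
    cities_subset = np.delete(np.array(list(city)), list(city).index(key))
    parts.append(pd.Series(np.random.choice(cities_subset, len(group)), index=group.index))
df["CITY2"] = pd.concat(parts)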
This gives for example:
Name CITY CITY2
0 Tom New York LA
1 Joseph New York Miami
2 Krish LA New York
3 John New York LA
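Since the question stresses a large DataFrame, a fully vectorized variant may be worth considering. The sketch below is an assumption-laden alternative to the two answers: it converts each CITY to its position in the list of keys, adds a random offset between 1 and n-1, and wraps around with a modulo, so the second position can never equal the first. The names names, first, offset, and second are introduced here for illustration, and the seed is fixed only to make the run repeatable.

```python
import numpy as np
import pandas as pd

city = {"New York": "123", "LA": "456", "Miami": "789"}
df = pd.DataFrame({"Name": ["Tom", "Joseph", "Krish", "John"]})

rng = np.random.default_rng(0)
names = np.array(list(city))
n = len(names)

df["CITY"] = rng.choice(names, size=len(df))

# Position of each row's CITY within names (would be -1 for unknown values).
first = pd.Index(names).get_indexer(df["CITY"])
# A shift of 1..n-1 guarantees a different position after the modulo.
offset = rng.integers(1, n, size=len(df))
second = (first + offset) % n

df["CITY2"] = names[second]
print(df)
```

Each of the other cities is equally likely for CITY2, and no Python-level loop over rows or groups is needed.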

Merge is not working on two dataframes of multi level index

First DataFrame: housing. This DataFrame has a MultiIndex (State, RegionName) and some relevant values in 3 other columns.
State RegionName 2008q3 2009q2 Ratio
New York New York 499766.666667 465833.333333 1.072844
California Los Angeles 469500.000000 413900.000000 1.134332
Illinois Chicago 232000.000000 219700.000000 1.055985
Pennsylvania Philadelphia 116933.333333 116166.666667 1.006600
Arizona Phoenix 193766.666667 168233.333333 1.151773
Second DataFrame: list_of_university_towns. It contains the names of states and some regions, and has a default numeric index.
State RegionName
1 Alabama Auburn
2 Alabama Florence
3 Alabama Jacksonville
4 Arizona Phoenix
5 Illinois Chicago
Now the inner join of the two dataframes :
uniHousingData = pd.merge(list_of_university_towns,housing,how="inner",on=["State","RegionName"])
This gives no rows in the resulting uniHousingData dataframe, while it should have the bottom two rows (index 4 and 5 from list_of_university_towns).
What am I doing wrong?
I found the issue. There was a space at the end of each string in the RegionName column of the second dataframe. I used the strip() method (str.strip() on the column) to remove the space, and it worked like a charm.
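A minimal sketch of that fix follows. The data is a cut-down stand-in for the two DataFrames in the question, with trailing spaces added to RegionName as an assumption about what the real data looked like. housing is flattened with reset_index() here so both sides carry the join keys as ordinary columns.

```python
import pandas as pd

housing = pd.DataFrame({
    "State": ["Illinois", "Arizona"],
    "RegionName": ["Chicago", "Phoenix"],
    "Ratio": [1.055985, 1.151773],
}).set_index(["State", "RegionName"])

# Assumed for illustration: every RegionName carries a trailing space.
list_of_university_towns = pd.DataFrame({
    "State": ["Alabama", "Arizona", "Illinois"],
    "RegionName": ["Auburn ", "Phoenix ", "Chicago "],
})

# Before cleaning, "Phoenix " != "Phoenix", so nothing matches.
before = pd.merge(list_of_university_towns, housing.reset_index(),
                  how="inner", on=["State", "RegionName"])

# Strip surrounding whitespace from the key column, then merge again.
list_of_university_towns["RegionName"] = (
    list_of_university_towns["RegionName"].str.strip()
)
uniHousingData = pd.merge(list_of_university_towns, housing.reset_index(),
                          how="inner", on=["State", "RegionName"])
print(uniHousingData)
```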

Pandas - Create a new column (Branch name) based on another column (City name)

I have the following Python Pandas Dataframe (8 rows):
City Name
New York
Long Beach
Jamestown
Chicago
Forrest Park
Berwyn
Las Vegas
Miami
I would like to add a new Column (Branch Name) based on City Name as below:
City Name Branch Name
New York New York
Long Beach New York
Jamestown New York
Chicago Chicago
Forrest Park Chicago
Berwyn Chicago
Las Vegas Las Vegas
Miami Miami
How do I do that?
You can use .map(). City names not in the dictionary become NaN after the mapping, so the fillna() call afterwards restores them from the original column.
df["Branch Name"] = df["City Name"].map({"Long Beach":"New York",
"Jamestown":"New York",
"Forrest Park":"Chicago",
"Berwyn":"Chicago",}, na_action='ignore')
df["Branch Name"] = df["Branch Name"].fillna(df["City Name"])
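To make the two steps visible, here is a self-contained sketch on the question's eight cities. The intermediate variable mapped is introduced here only to show the state between the map and the fillna.

```python
import pandas as pd

df = pd.DataFrame({"City Name": ["New York", "Long Beach", "Jamestown", "Chicago",
                                 "Forrest Park", "Berwyn", "Las Vegas", "Miami"]})

branch_of = {"Long Beach": "New York", "Jamestown": "New York",
             "Forrest Park": "Chicago", "Berwyn": "Chicago"}

# Step 1: cities missing from the dict come back as NaN.
mapped = df["City Name"].map(branch_of)

# Step 2: fall back to the city itself wherever the mapping found nothing.
df["Branch Name"] = mapped.fillna(df["City Name"])
print(df)
```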

converting list like column values into multiple rows using Pandas DataFrame

CSV file: (sample1.csv)
Location_City, Location_State, Name, hobbies
Los Angeles, CA, John, "['Music', 'Running']"
Texas, TX, Jack, "['Swimming', 'Trekking']"
I want to convert hobbies column of CSV into following output
Location_City, Location_State, Name, hobbies
Los Angeles, CA, John, Music
Los Angeles, CA, John, Running
Texas, TX, Jack, Swimming
Texas, TX, Jack, Trekking
I have read the CSV into a dataframe, but I don't know how to convert it.
data = pd.read_csv("sample1.csv")
df=pd.DataFrame(data)
df
You can use findall or extractall to get lists from the hobbies column, then flatten them with chain.from_iterable and repeat the other columns:
a = df['hobbies'].str.findall("'(.*?)'").astype(object)  # np.object was removed from NumPy; use the builtin object
lens = a.str.len()
from itertools import chain
df1 = pd.DataFrame({
'Location_City' : df['Location_City'].values.repeat(lens),
'Location_State' : df['Location_State'].values.repeat(lens),
'Name' : df['Name'].values.repeat(lens),
'hobbies' : list(chain.from_iterable(a.tolist())),
})
Or create a Series, drop the match level of its index, and join it to the original DataFrame:
df1 = (df.join(df.pop('hobbies').str.extractall("'(.*?)'")[0]
.reset_index(level=1, drop=True)
.rename('hobbies'))
.reset_index(drop=True))
print (df1)
Location_City Location_State Name hobbies
0 Los Angeles CA John Music
1 Los Angeles CA John Running
2 Texas TX Jack Swimming
3 Texas TX Jack Trekking
We can solve this with pandas.DataFrame.explode, which was introduced in version 0.25.0. If you have that version or higher, you can use the code below.
explode function reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html
import pandas as pd
import ast
data = {
'Location_City': ['Los Angeles','Texas'],
'Location_State': ['CA','TX'],
'Name': ['John','Jack'],
'hobbies': ["['Music', 'Running']", "['Swimming', 'Trekking']"]
}
df = pd.DataFrame(data)
# Converting a string representation of a list into an actual list object
list_eval = lambda x: ast.literal_eval(x)
df['hobbies'] = df['hobbies'].apply(list_eval)
# Exploding the list
df = df.explode('hobbies')
print(df)
Location_City Location_State Name hobbies
0 Los Angeles CA John Music
0 Los Angeles CA John Running
1 Texas TX Jack Swimming
1 Texas TX Jack Trekking
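The same idea can be applied while reading the CSV from the question. This is a sketch under a few assumptions: the file content is inlined through io.StringIO instead of sample1.csv, skipinitialspace=True is used because the sample has a space after each comma (before the opening quote), and ignore_index=True requires pandas 1.1 or later.

```python
import ast
import io

import pandas as pd

# Inline copy of the sample file so the example runs without sample1.csv.
csv_text = '''Location_City, Location_State, Name, hobbies
Los Angeles, CA, John, "['Music', 'Running']"
Texas, TX, Jack, "['Swimming', 'Trekking']"
'''

df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,                    # drop the space after each comma
    converters={"hobbies": ast.literal_eval}, # parse the list-like string into a list
)

# One row per hobby; ignore_index=True renumbers the rows 0..n-1.
df = df.explode("hobbies", ignore_index=True)
print(df)
```

Without ignore_index=True, the exploded rows keep the index of the row they came from, which is why the output above shows 0, 0, 1, 1.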
