Python: Replace multiple old values with a new value in Pandas

I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code to replace values:
import pandas as pd
df = {'Name': ['al', 'el', 'naila', 'dori', 'jlo'],
      'living': ['Alvando', 'Georgia GG', 'Newyork NY', 'Indiana IN', 'Florida FL'],
      'sample2': ['malang', 'kaltim', 'ambon', 'jepara', 'sragen'],
      'output': ['KOTA', 'KAB', 'WILAYAH', 'KAB', 'DAERAH']
      }
df = pd.DataFrame(df)
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0)
df = df.replace('KAB', 1)
But I am expecting this output from simpler code that doesn't repeat replace:
    Name      living  sample2  output
0     al     Alvando   malang       0
1     el  Georgia GG   kaltim       1
2  naila  Newyork NY    ambon       0
3   dori  Indiana IN   jepara       1
4    jlo  Florida FL   sragen       0
I've tried using np.where, but it doesn't give the desired result: every row shows 0, even where the value should be 1.
df['output'] = pd.DataFrame({'output':np.where(df == "KAB", 1, 0).reshape(-1, )})

This code should work for you:
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0).replace('KAB', 1)
Output:
>>> df
    Name      living  sample2  output
0     al     Alvando   malang       0
1     el  Georgia GG   kaltim       1
2  naila  Newyork NY    ambon       0
3   dori  Indiana IN   jepara       1
4    jlo  Florida FL   sragen       0
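If you would rather avoid replace altogether, here is a minimal sketch of two alternatives that work on the 'output' column only. It assumes the four values shown in the question are the only ones that occur; the names codes, mapped and binary are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['al', 'el', 'naila', 'dori', 'jlo'],
    'output': ['KOTA', 'KAB', 'WILAYAH', 'KAB', 'DAERAH'],
})

# Option 1: a single dictionary lookup per value.
# Any value missing from the dictionary would become NaN.
codes = {'KOTA': 0, 'WILAYAH': 0, 'DAERAH': 0, 'KAB': 1}
mapped = df['output'].map(codes)

# Option 2: np.where on the single column, not on the whole DataFrame.
binary = np.where(df['output'] == 'KAB', 1, 0)

df['output'] = mapped
print(df)
```

The np.where attempt in the question compared the whole DataFrame and flattened the result, so the first five flattened values (all 0) ended up in the column. Comparing only df['output'] keeps one value per row.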

Related

How can I count the number of occurrences across more than one column (e.g. city & country)?

Given the following data ...
     city  country
0  London       UK
1   Paris       FR
2   Paris       US
3  London       UK
... I'd like a count of each city-country pair:
     city  country  n
0  London       UK  2
1   Paris       FR  1
2   Paris       US  1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
     city  country  n
0  London       UK  2
1   Paris       FR  1
2   Paris       US  1
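If DataFrame.value_counts is not available in your pandas version, a rough equivalent is groupby with size, which counts rows per group without needing a helper column. This is only a sketch on the question's sample data; the variable name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
    columns=['city', 'country'],
)

# size() counts the rows in each (city, country) group;
# reset_index(name='n') turns the result back into a flat DataFrame.
counts = df.groupby(['city', 'country']).size().reset_index(name='n')
print(counts)
```

Note that groupby sorts by the group keys, while value_counts sorts by count, so the row order can differ on other data.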

Update Pandas DataFrame for each string in a list

I am analyzing a dataset containing NFL game results over the past 20 years and am trying to create a column denoting for each team whether or not the game was a home game or away game (home game = 1, away game = 0).
The code I have so far is:
home_list = list(df.home_team.unique())
def home_or_away(team_name, dataf):
    dataf['home_or_away'] = np.where(dataf['home_team'] == team_name, 1, 0)
    return dataf
for i in home_list:
    home_update_all = home_or_away(i, df)
    df.update(home_update_all)
This doesn't seem to yield the correct results as each team is just overwritten when iterating over them. Any ideas on how to solve this?
Thanks!
I'm not really sure what your expected output is. Do you want one column per team? At the moment every iteration creates a column with the same name, so only the one from the last iteration is kept and the rest are overwritten. Or do you want multiple DataFrames?
If you want multiple columns, one per team:
import pandas as pd
df = pd.DataFrame({'game': [1, 2, 3, 4], 'home_team': ['a', 'b', 'c', 'a']})
   game  home_team
0     1          a
1     2          b
2     3          c
3     4          a
First collect unique teams as you did:
home_list = list(df.home_team.unique())
Create a column for each team:
for team in home_list:
    df[f'home_or_away_{team}'] = [int(ht == team) for ht in df['home_team']]
Which results in:
   game  home_team  home_or_away_a  home_or_away_b  home_or_away_c
0     1          a               1               0               0
1     2          b               0               1               0
2     3          c               0               0               1
3     4          a               1               0               0
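The same one-column-per-team layout can probably be built without an explicit loop using pd.get_dummies. This is a sketch on the toy data above, not something from the question; the names dummies and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'game': [1, 2, 3, 4], 'home_team': ['a', 'b', 'c', 'a']})

# One indicator column per distinct team, named home_or_away_<team>.
# dtype=int gives 1/0 instead of True/False.
dummies = pd.get_dummies(df['home_team'], prefix='home_or_away', dtype=int)

# Attach the indicator columns to the original frame (aligned on the index).
result = df.join(dummies)
print(result)
```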
You're overcomplicating it. You don't need to iterate with np.where(). Just apply np.where() to the two columns directly (no separate function needed).
It basically says "where home_team equals team_name, put a 1, else put a 0".
import pandas as pd
import numpy as np
df = pd.DataFrame([['Chicago Bears', 'Chicago Bears', 'Green Bay Packers'],
                   ['Chicago Bears', 'Green Bay Packers', 'Chicago Bears'],
                   ['Detroit Lions', 'Detroit Lions', 'Los Angeles Chargers'],
                   ['New England Patriots', 'New York Jets', 'New England Patriots'],
                   ['Houston Texans', 'Los Angeles Rams', 'Houston Texans']],
                  columns=['team_name', 'home_team', 'away_team'])
df['home_or_away'] = np.where(df['home_team'] == df['team_name'], 1, 0)
Output:
print(df)
              team_name          home_team             away_team  home_or_away
0         Chicago Bears      Chicago Bears     Green Bay Packers             1
1         Chicago Bears  Green Bay Packers         Chicago Bears             0
2         Detroit Lions      Detroit Lions  Los Angeles Chargers             1
3  New England Patriots      New York Jets  New England Patriots             0
4        Houston Texans   Los Angeles Rams        Houston Texans             0

Python DataFrame pivot_table not returning column headers

I have the following df that I am trying to pivot by '
import pandas as pd
data = {'Country': ['India', 'India', 'India', 'India', 'India',
                    'USA', 'USA'],
        'Personality': ['Sachin Tendulkar', 'Sachin Tendulkar', 'Sania Mirza', 'Sachin Tendulkar',
                        'Sania Mirza', 'Serena Williams', 'Venus Willians']}
# create a dataframe from the data
df = pd.DataFrame(data, columns=['Country', 'Personality'])
My issue is with the following line of code:
df.pivot_table(index=['Country'],
               columns=['Personality'], values='Personality', aggfunc='count', fill_value=0)
I expect the output to look like the following:
         Sachin Tendulkar  Sania Mirza  Serena Williams  Venus Williams
Country
India                   3            2                0               0
USA                     0            0                1               1
However, all I see is the Index column after running the above code.
If you put len in for aggfunc, it works:
df.pivot_table(index='Country', columns='Personality', values='Personality', aggfunc=len, fill_value=0)
Output:
Personality  Sachin Tendulkar  Sania Mirza  Serena Williams  Venus Willians
Country
India                       3            2                0               0
USA                         0            0                1               1
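Since the goal is simply a frequency table of Country against Personality, pd.crosstab may be a more direct fit than pivot_table. A minimal sketch using the question's data (spelling kept as posted); the variable name table is illustrative.

```python
import pandas as pd

data = {
    'Country': ['India', 'India', 'India', 'India', 'India', 'USA', 'USA'],
    'Personality': ['Sachin Tendulkar', 'Sachin Tendulkar', 'Sania Mirza',
                    'Sachin Tendulkar', 'Sania Mirza', 'Serena Williams',
                    'Venus Willians'],
}
df = pd.DataFrame(data)

# crosstab counts how often each (Country, Personality) pair occurs;
# pairs that never occur are filled with 0.
table = pd.crosstab(df['Country'], df['Personality'])
print(table)
```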

A query using pandas

Hello,
I need to create a query, using pandas, that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014. The function should return a 5x2 DataFrame with columns = ['STNAME', 'CTYNAME'] and the same index ID as census_df (sorted ascending by index).
You'll find a description of my data in the picture:
Consider the following demo:
In [19]: df
Out[19]:
   REGION      STNAME            CTYNAME  POPESTIMATE2014  POPESTIMATE2015
0       0  Washington         Washington               10               12
1       1  Washington  Washington County               11               13
2       2     Alabama     Alabama County               13               15
3       4      Alaska             Alaska               14               12
4       3     Montana            Montana               10               11
5       2  Washington         Washington               15               19
In [20]: qry = "REGION in [1,2] and POPESTIMATE2015 > POPESTIMATE2014 and CTYNAME.str.contains('^Washington')"
In [21]: df.query(qry, engine='python')[['STNAME', 'CTYNAME']]
Out[21]:
       STNAME            CTYNAME
1  Washington  Washington County
5  Washington         Washington
Use boolean indexing with a mask created by isin and startswith:
mask = (df['REGION'].isin([1, 2]) &
        df['CTYNAME'].str.startswith('Washington') &
        (df['POPESTIMATE2015'] > df['POPESTIMATE2014']))
df = df.loc[mask, ['STNAME', 'CTYNAME']]
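Since the question asks for a function, here is a sketch that wraps the mask in one and runs it on the small demo frame from the first answer. The function name washington_growth is made up for illustration, and the demo data is not the real census data, so it returns 2 rows rather than the 5 the question expects.

```python
import pandas as pd

# Demo data copied from the first answer; the real census_df is much larger.
census_df = pd.DataFrame({
    'REGION': [0, 1, 2, 4, 3, 2],
    'STNAME': ['Washington', 'Washington', 'Alabama', 'Alaska', 'Montana', 'Washington'],
    'CTYNAME': ['Washington', 'Washington County', 'Alabama County',
                'Alaska', 'Montana', 'Washington'],
    'POPESTIMATE2014': [10, 11, 13, 14, 10, 15],
    'POPESTIMATE2015': [12, 13, 15, 12, 11, 19],
})

def washington_growth(frame):
    """Hypothetical helper: counties in region 1 or 2, named Washington*, that grew."""
    mask = (frame['REGION'].isin([1, 2]) &
            frame['CTYNAME'].str.startswith('Washington') &
            (frame['POPESTIMATE2015'] > frame['POPESTIMATE2014']))
    # Keep the original index and return it in ascending order.
    return frame.loc[mask, ['STNAME', 'CTYNAME']].sort_index()

result = washington_growth(census_df)
print(result)
```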

Setting values when iterating through a DataFrame

I have a dictionary of states (for example, IA: Iowa). I have loaded the dictionary into a DataFrame, byState_df.
Then I import a CSV of deaths by state, which I want to add to byState_df as I read the lines:
byState_df = pd.DataFrame(list(states.items()))
byState_df['Deaths'] = 0
df['Deaths'] = pd.to_numeric(df['Deaths'], errors='coerce')
print(byState_df)
for index, row in df.iterrows():
    if row['Area'] in states:
        byState_df[(byState_df[0] == row['Area'])]['Deaths'] = row['Deaths']
print(byState_df)
But the Deaths column in byState_df is still 0 afterwards:
    0              1  Deaths
0  WA     Washington       0
1  WI      Wisconsin       0
2  WV  West Virginia       0
3  FL        Florida       0
4  WY        Wyoming       0
5  NH  New Hampshire       0
6  NJ     New Jersey       0
7  NM     New Mexico       0
8  NA       National       0
I checked row['Deaths'] while iterating and it produces the correct values; it just seems that the byState_df value is being set incorrectly.
Try the following code, which uses .loc instead of chained indexing ([][]). The chained form assigns to a temporary copy, so byState_df itself is never modified:
byState_df = pd.DataFrame(list(states.items()))
byState_df['Deaths'] = 0
df['Deaths'] = pd.to_numeric(df['Deaths'], errors='coerce')
print(byState_df)
for index, row in df.iterrows():
    if row['Area'] in states:
        # select rows and column in one .loc call so the assignment hits byState_df
        byState_df.loc[byState_df[0] == row['Area'], 'Deaths'] = row['Deaths']
print(byState_df)
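The loop can likely be dropped entirely by looking each state code up in a Series keyed by 'Area'. The sketch below uses small made-up stand-ins for states and the CSV data (neither is shown in full in the question), and assumes each Area appears at most once in df.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's dictionary and CSV contents.
states = {'WA': 'Washington', 'FL': 'Florida', 'NH': 'New Hampshire'}
df = pd.DataFrame({'Area': ['FL', 'WA', 'XX'], 'Deaths': ['10', '25', '7']})

byState_df = pd.DataFrame(list(states.items()))

# Convert the text column to numbers; unparseable entries become NaN.
deaths = pd.to_numeric(df['Deaths'], errors='coerce')

# Build an Area -> Deaths lookup, then map every state code through it.
lookup = pd.Series(deaths.values, index=df['Area'])
byState_df['Deaths'] = byState_df[0].map(lookup).fillna(0).astype(int)
print(byState_df)
```

States with no matching row in df (here 'NH') keep a value of 0, and areas that are not in states (here 'XX') are ignored, which matches what the loop does.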
