Replacing some values of a column conditional on another column in pandas - python

I have a dataframe df:
ID county state sales year
a king oh 10 2010
b 0 al 20 2011
c kent oh 10 2010
d 0 wa 30 2012
I want to replace the zero values in the county column with county names conditional on ID, such that if ID equals 'b', county becomes 'anchorage', and if ID equals 'd', county becomes 'whitman':
ID county state sales year
a king oh 10 2010
b anchorage al 20 2011
c kent oh 10 2010
d whitman wa 30 2012
I applied the following code:
import numpy as np

conditions = [(df['ID'] == 'b'), (df['ID'] == 'd')]
values = ['anchorage', 'whitman']
df['county'] = np.select(conditions, values)
The above code replaces the zero values with the new values, but at the same time it turns the existing nonzero county values to zero.

np.select fills every row where no condition matches with its default value, which is 0 unless specified; that is what overwrote your existing counties. Try setting the default to the 'county' column itself:
conditions = [(df['ID'] == 'b'), (df['ID'] == 'd')]
values = ['anchorage', 'whitman']
df['county'] = np.select(conditions, values, default=df['county'])
df:
ID county state sales year
0 a king oh 10 2010
1 b anchorage al 20 2011
2 c kent oh 10 2010
3 d whitman wa 30 2012

First create the criteria, then simply use loc:
# rows where county holds the 0 placeholder
condition = df['county'] == 0

df.loc[condition & (df['ID'] == 'b'), 'county'] = 'anchorage'
df.loc[condition & (df['ID'] == 'd'), 'county'] = 'whitman'
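A mapping-based variant (a minimal sketch; it assumes the placeholder is a literal 0 and that the fixes can be expressed as an ID-to-county dict):
import pandas as pd

# hypothetical lookup for the IDs whose county is missing
id_to_county = {'b': 'anchorage', 'd': 'whitman'}

# fill only the rows holding the 0 placeholder (use '0' if it is stored as a string)
df.loc[df['county'].eq(0), 'county'] = df['ID'].map(id_to_county)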


Getting maximum counts of a column in grouped dataframe

My dataframe df is:
Election Year Votes Party Region
0 2000 50 A a
1 2000 100 B a
2 2000 26 A b
3 2000 180 B b
4 2000 300 A c
5 2000 46 C c
6 2005 149 A a
7 2005 46 B a
8 2005 312 A b
9 2005 23 B b
10 2005 16 A c
11 2005 35 C c
I want to get the Party that wins the most regions each year. So the desired output is:
Election Year Party
2000 B
2005 A
I tried this code to get the above output, but it is giving an error:
winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
winner = winner.groupby('Election Year').first().reset_index()
winner = winner[['Election Year', 'Party']].to_string(index=False)
winner
how can I get the desired output?
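For reference, the sample frame can be rebuilt as follows (a sketch matching the table above):
import pandas as pd

df = pd.DataFrame({
    'Election Year': [2000] * 6 + [2005] * 6,
    'Votes': [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    'Party': list('ABABAC') * 2,
    'Region': list('aabbcc') * 2,
})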
Here is one approach with nested groupby. We first count per-party votes in each year-region pair, then use mode to find the party winning the most regions. The mode need not be unique (if two or more parties win the same number of regions).
df.groupby(["Year", "Region"])\
.apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
.unstack().mode(1).rename(columns={0: "Party"})
Party
Year
2000 B
2005 A
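Since the mode need not be unique, mode(axis=1) can return more than one column when parties tie on region count; a sketch for keeping just the first modal party (binding the unstacked frame from the snippet above to a name first):
winners = (df.groupby(["Election Year", "Region"])
             .apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())
             .unstack())
# extra tied columns are dropped; only the first mode per year is kept
result = winners.mode(axis=1).iloc[:, [0]].rename(columns={0: "Party"})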
To address the comment, you can replace idxmax above with nlargest and diff to find regions where the winning margin is below a given threshold.
margin = (df.groupby(["Election Year", "Region"])
            .apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff())) > -125
print(margin[margin].reset_index()[["Election Year", "Region"]])
#   Election Year Region
# 0          2000      a
# 1          2005      a
# 2          2005      c
You can use GroupBy.idxmax() to get the index of the max Votes for each Election Year group, then use .loc to locate those rows, followed by selection of the required columns, as follows:
df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]
Result:
Election Year Party
4 2000 A
8 2005 A
Edit
If we are to get the Party winning the most Regions, we can use the following code (without the slower .apply() with a lambda function):
(df.loc[df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
   [['Election Year', 'Party', 'Region']]
   .pivot(index='Election Year', columns='Region')
   .mode(axis=1)
   .rename({0: 'Party'}, axis=1)
   .reset_index())
Result:
Election Year Party
0 2000 B
1 2005 A
Try this:
winner = df.groupby(['Election Year', 'Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis=1, inplace=True)
winner
Another method (close to @hilberts_drinking_problem's, in fact):
>>> df.groupby(["Election Year", "Region"]) \
.apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
.unstack().mode(axis="columns") \
.rename(columns={0: "Party"}).reset_index()
Election Year Party
0 2000 B
1 2005 A
I believe the one-liner df.groupby(["Election Year"]).max().reset_index()[['Election Year', 'Party']] solves your problem

Replace the values in a column based on frequency

I have a dataframe (3.7 million rows) with a column of different country names:
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellaneous" because each occurs in less than 15% of the rows in the column.
My alternative solution is to replace the names with their frequency using:
df.column_name = df.column_name.map(df.column_name.value_counts())
Use:
df.loc[df.groupby('Country')['id']
         .transform('size')
         .div(len(df))
         .lt(0.15),
       'Country'] = 'Miscellaneous'
Or:
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
                                       .lt(0.15)),
       'Country'] = 'Miscellaneous'
If you want to put every country whose frequency is below a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'].map(mappings)
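For the sample column above, the resulting dict would be {'RUSSIA': 'RUSSIA', 'USA': 'USA', 'INDIA': 'Misc', 'ITALY': 'Misc'}, and the map can then be assigned back:
# identity mapping for frequent names, 'Misc' for the rest
df['Country'] = df['Country'].map(mappings)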
Here is another option:
s = df['Country'].value_counts()
s = s / s.sum()
s = s.loc[s < .15]
df = df.replace(s.index.tolist(), 'Miscellaneous')
You can use a dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x: x if d[x] > 0.15 else 'Miscellaneous')
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellaneous
6 USA
7 USA
8 Miscellaneous
9 USA
10 RUSSIA
Name: Country, dtype: object
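On 3.7 million rows it is worth avoiding the Python-level lambda above; a vectorized sketch using isin and where:
# countries clearing the 15% threshold
freq = df['Country'].value_counts(normalize=True)
frequent = freq.index[freq >= 0.15]

# single vectorized pass: keep frequent names, replace the rest
df['Country'] = df['Country'].where(df['Country'].isin(frequent), 'Miscellaneous')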

Listing unique value counts per groups in pandas dataframe

I am new to pandas and python.
I am trying to group items by one column and list the information from the data frame per group.
My dataframe:
B C D E F
1 Honda USA 2000 Washington New
2 Honda USA 2001 Salt Lake Used
3 Ford Canada 2005 Washington New
4 Toyota USA 2010 New York Used
5 Honda USA 2001 Salt Lake Used
6 Honda Canada 2011 Salt Lake Crashed
7 Ford Italy 2014 Rome New
I am trying to group my dataframe by column B and list how many of each C, D, E, F value occur within each group. For example, column B contains 4 Hondas, which I group together. Then I want to list the following information: USA(3), Canada(1), 2000(1), 2001(2), 2011(1), Washington(1), Salt Lake(3), New(1), Used(2), Crashed(1), and do the same for every group (car make) in column B:
Car Country Year City Condition
1 Honda(4) USA(3) 2000(1) Washington(1) New(1)
Canada(1) 2001(2) Salt Lake(3) Used(2)
2011(1) Crashed(1)
2 Ford(2) Canada(1) 2005(1) Washington(1) New(2)
Italy(1) 2014(1) Rome(1)
...
What I've tried so far:
df.groupby(['B'])
Which gives me back <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d559080>
At this point, I am not sure how to proceed to get the desired results after grouping column B.
Thank you for your suggestions.
You need a custom function that processes each column separately with Series.value_counts and then joins the index values to the counts:
def f(x):
    x = x.value_counts()
    y = x.index.astype(str) + '(' + x.astype(str) + ')'
    return y.reset_index(drop=True)

df1 = df.groupby(['B']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print(df1)
B C D E F
0 Ford(2) Italy(1) 2014(1) Washington(1) New(2)
1 NaN Canada(1) 2005(1) Rome(1) NaN
2 Honda(4) USA(3) 2001(2) Salt Lake(3) Used(2)
3 NaN Canada(1) 2011(1) Washington(1) Crashed(1)
4 NaN NaN 2000(1) NaN New(1)
5 Toyota(1) USA(1) 2010(1) New York(1) Used(1)
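If the padded wide layout is not strictly required, the per-group counts are available directly from SeriesGroupBy.value_counts, one column at a time (a minimal sketch):
for col in ['C', 'D', 'E', 'F']:
    print(df.groupby('B')[col].value_counts())
# e.g. for column C, roughly:
# B       C
# Ford    Canada    1
#         Italy     1
# Honda   USA       3
#         Canada    1
# Toyota  USA       1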

Populating new column in dataframe from existing two with different lengths

How would I go about adding a new column to an existing dataframe by comparing it to another that is shorter in length and has a different index?
For example, if I have:
df1 = country code year
0 Armenia a 2016
1 Brazil b 2017
2 Turkey c 2016
3 Armenia d 2017
df2 = geoCountry 2016_gdp 2017_gdp
0 Armenia 10.499 10.74
1 Brazil 1,798.62 2,140.94
2 Turkey 857.429 793.698
and I want to end up with:
df1 = country code year gdp
0 Armenia a 2016 10.499
1 Brazil b 2017 2,140.94
2 Turkey c 2016 857.429
3 Armenia d 2017 10.74
How would I go about this? I attempted to use answers outlined here and here to no avail. I also did the following, which takes too long on a 90,000-row dataframe:
for index, row in df1.iterrows():
    if row['country'] in list(df2.geoCountry):
        if row['year'] == 2016:
            df1['gdp'].append(df2[df2.geoCountry == str(row['country'])]['2016'])
        else:
            df1['gdp'].append(df2[df2.geoCountry == str(row['country'])]['2017'])
I guess this is what you're looking for:
df2 = df2.melt(id_vars='geoCountry', value_vars=['2016_gdp', '2017_gdp'], var_name='year')
df1['year'] = df1['year'].astype('int')
df2['year'] = df2['year'].str.slice(0, 4).astype('int')
df1.merge(df2, left_on=['country', 'year'], right_on=['geoCountry', 'year'])[['country', 'code', 'year', 'value']]
Output:
country code year value
0 Armenia a 2016 10.499
1 Brazil b 2017 2,140.94
2 Turkey c 2016 857.429
3 Armenia d 2017 10.74
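To match the desired gdp column name, a rename can be chained onto the same merge (a small variation on the line above):
result = (df1.merge(df2, left_on=['country', 'year'], right_on=['geoCountry', 'year'])
             [['country', 'code', 'year', 'value']]
             .rename(columns={'value': 'gdp'}))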
You mainly need the melt function:
df2.columns = df2.columns.str.split("_").str.get(0)
df2 = df2.rename(columns={"geoCountry": "country"})
df3 = pd.melt(df2, id_vars=['country'], value_vars=['2016', '2017'],
              var_name='year', value_name='gdp')
df3['year'] = df3['year'].astype(int)  # match df1's integer years
After this you simply merge df1 with the above df3:
result = pd.merge(df1, df3, on=['country', 'year'])
Output:
country code year gdp
0 Armenia a 2016 10.499
1 Brazil b 2017 2140.940
2 Turkey c 2016 857.429
3 Armenia d 2017 10.740
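A merge-free alternative (a sketch, assuming df1 and df2 exactly as in the question): build a Series indexed by (country, year) and align it onto df1's rows:
import pandas as pd

# long-form Series: (geoCountry, year-as-int) -> gdp
gdp = (df2.set_index('geoCountry')
          .rename(columns=lambda c: int(c[:4]))  # '2016_gdp' -> 2016
          .stack())

# align by (country, year) without an explicit merge
keys = pd.MultiIndex.from_frame(df1[['country', 'year']])
df1['gdp'] = gdp.reindex(keys).to_numpy()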

A query using pandas

Hello, I need to create a query that finds the counties belonging to region 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014, using pandas. The function should return a 5x2 DataFrame with the columns ['STNAME', 'CTYNAME'] and the same index ID as census_df (sorted ascending by index).
Consider the following demo:
In [19]: df
Out[19]:
REGION STNAME CTYNAME POPESTIMATE2014 POPESTIMATE2015
0 0 Washington Washington 10 12
1 1 Washington Washington County 11 13
2 2 Alabama Alabama County 13 15
3 4 Alaska Alaska 14 12
4 3 Montana Montana 10 11
5 2 Washington Washington 15 19
In [20]: qry = "REGION in [1,2] and POPESTIMATE2015 > POPESTIMATE2014 and CTYNAME.str.contains('^Washington')"
In [21]: df.query(qry, engine='python')[['STNAME', 'CTYNAME']]
Out[21]:
STNAME CTYNAME
1 Washington Washington County
5 Washington Washington
Use boolean indexing with a mask created by isin and startswith:
mask = (df['REGION'].isin([1, 2]) &
        df['CTYNAME'].str.startswith('Washington') &
        (df['POPESTIMATE2015'] > df['POPESTIMATE2014']))
df = df.loc[mask, ['STNAME', 'CTYNAME']]
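The question also asks for the result sorted ascending by index; appending sort_index covers that (a sketch reusing the mask above):
result = df.loc[mask, ['STNAME', 'CTYNAME']].sort_index()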
