Conditionally filling blank values in Pandas dataframes - python

I have a dataframe which looks as follows (additional columns have been dropped):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with the existing shipping_country value for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure what the most efficient way to do this is on a large-scale dataset. Perhaps a vectorized groupby method?

You can use GroupBy + ffill / bfill:
def filler(x):
    return x.ffill().bfill()

res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
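A self-contained sketch of the whole flow, assuming the sample data from the question (the blanks there are empty strings, so they are converted to NaN first; transform is used instead of apply so the result aligns back to the original index):
import numpy as np
import pandas as pd

df = pd.DataFrame({'memberID': [264991, 264991, 100, 5000, 5000],
                   'shipping_country': ['', 'Canada', 'USA', '', 'UK']})

# the blanks are empty strings, not NaN, so convert them before filling
df['shipping_country'] = df['shipping_country'].replace('', np.nan)

def filler(x):
    # forward-fill then back-fill within each memberID group
    return x.ffill().bfill()

df['shipping_country'] = df.groupby('memberID')['shipping_country'].transform(filler)
print(df)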

For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you, and also has the behavior that if a memberID group only has empty string values ('') in shipping_country, those will be retained in the output df:
import numpy as np

df['shipping_country'] = df.replace('', np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would rather have those empty strings end up as NaN in the output df, just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('', np.nan).groupby('memberID')['shipping_country'].transform('first')
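For reference, a self-contained version of the first variant, assuming the sample frame above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'memberID': [264991, 264991, 100, 5000, 5000, 54],
                   'shipping_country': ['', 'Canada', 'USA', '', 'UK', '']})

# transform('first') broadcasts the first non-NaN value of each group;
# the all-blank group 54 stays NaN and is turned back into '' by fillna
df['shipping_country'] = (df.replace('', np.nan)
                            .groupby('memberID')['shipping_country']
                            .transform('first')
                            .fillna(''))
print(df)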

You can use chained groupbys, one with forward fill and one with backfill:
# replace blank values with NaN first (requires: import numpy as np)
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN
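Note that on recent pandas versions GroupBy.ffill no longer includes the grouping column in its result, so the chained .groupby('memberID') above can fail with a KeyError. A sketch of the same idea that selects the column explicitly (sample data assumed from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'memberID': [264991, 264991, 100, 5000, 5000, 1, 1],
                   'shipping_country': ['', 'Canada', 'USA', '', 'UK', '', '']})
df['shipping_country'] = df['shipping_country'].replace('', np.nan)

# forward-fill, then back-fill, within each memberID group
df['shipping_country'] = df.groupby('memberID')['shipping_country'].ffill()
df['shipping_country'] = df.groupby('memberID')['shipping_country'].bfill()
print(df)  # memberID 1 stays NaN, as all its values were blank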

Related

How can I count the # of occurrences of more than one column (e.g. city & country)?

Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city & country, and then count()ing occurrences of this new all-1s column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
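Note that DataFrame.value_counts was only added in pandas 1.1; on older versions, an equivalent (and arguably just as clean) approach is a groupby size:
df.groupby(['city', 'country']).size().reset_index(name='n')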

Replace the values in a column based on frequency

I have a dataframe (3.7 million rows) with a column with different country names
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellanous" because they each occur in less than 15% of the rows in the column.
My alternative solution is to replace the names with their frequency using
df.column_name = df.column_name.map(df.column_name.value_counts())
Use:
df.loc[df.groupby('Country')['id']
         .transform('size')
         .div(len(df))
         .lt(0.15),
       'Country'] = 'Miscellanous'
Or:
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
                                      .lt(0.15)),
       'Country'] = 'Miscellanous'
If you want to put every country whose frequency is less than a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'] = df['Country'].map(mappings)
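A runnable sketch with the sample data from the question, showing what the resulting mapping looks like:
import pandas as pd

df = pd.DataFrame({'id': range(1, 11),
                   'Country': ['RUSSIA', 'USA', 'RUSSIA', 'RUSSIA', 'INDIA',
                               'USA', 'USA', 'ITALY', 'USA', 'RUSSIA']})

threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
# mappings == {'RUSSIA': 'RUSSIA', 'USA': 'USA', 'INDIA': 'Misc', 'ITALY': 'Misc'}
df['Country'] = df['Country'].map(mappings)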
Here is another option:
s = df['Country'].value_counts(normalize=True)  # normalize=True replaces the manual s/s.sum()
rare = s[s < 0.15].index.tolist()
df = df.replace(rare, 'Miscellanous')
You can use a dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x: x if d[x] > 0.15 else 'Miscellanous')
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellanous
6 USA
7 USA
8 Miscellanous
9 USA
10 RUSSIA
Name: Country, dtype: object

Fill missing values in a data frame without losing values already in the data frame

I have a data frame with missing values:
import pandas as pd
data = {'Brand': ['residential', 'unclassified', 'tertiary', 'residential', 'unclassified', 'primary', 'residential'],
        'Price': [22000, 25000, 27000, "NA", "NA", 10000, "NA"]}
df = pd.DataFrame(data, columns=['Brand', 'Price'])
print (df)
Resulting in this data frame:
Brand Price
0 residential 22000
1 unclassified 25000
2 tertiary 27000
3 residential NA
4 unclassified NA
5 primary 10000
6 residential NA
I would like to fill in the missing values for residential and unclassified in the Price column with fixed values (residential = 1000, unclassified = 2000). However, I don't want to lose any values that are already present in the Price column for residential or unclassified, so the output should look like this:
Brand Price
0 residential 22000
1 unclassified 25000
2 tertiary 27000
3 residential 1000
4 unclassified 2000
5 primary 10000
6 residential 1000
What's the easiest way to get this done?
We can use map with fillna. Note: you need to make sure the "NA" entries in your df are real NaN values, not the string "NA":
df.Price.fillna(df.Brand.map({'residential': 1000, 'unclassified': 2000}), inplace=True)
df
Brand Price
0 residential 22000.0
1 unclassified 25000.0
2 tertiary 27000.0
3 residential 1000.0
4 unclassified 2000.0
5 primary 10000.0
6 residential 1000.0
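Since the sample data stores "NA" as literal strings, a sketch of coercing the column to numeric first, so the fillna above has real NaN values to fill (pd.to_numeric turns anything non-numeric into NaN):
import pandas as pd

# "NA" here is a string, not a missing value; coerce it to real NaN
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
df['Price'] = df['Price'].fillna(df['Brand'].map({'residential': 1000, 'unclassified': 2000}))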

Listing unique value counts per group in a pandas dataframe

I am new to pandas and python.
I am trying to group items by one column and list the information from the data frame per group.
My dataframe:
B C D E F
1 Honda USA 2000 Washington New
2 Honda USA 2001 Salt Lake Used
3 Ford Canada 2005 Washington New
4 Toyota USA 2010 Ney York Used
5 Honda USA 2001 Salt Lake Used
6 Honda Canada 2011 Salt Lake Crashed
7 Ford Italy 2014 Rome New
I am trying to group my dataframe by column B and list how many of each C, D, E, F value occurs within each group. For example, column B contains 4 Hondas, which I group together. Then I want to list the following information: USA(3), Canada(1), 2000(1), 2001(2), 2011(1), Washington(1), Salt Lake(3), New(1), Used(2), Crashed(1), and do the same for every group (car make) in column B:
Car Country Year City Condition
1 Honda(4) USA(3) 2000(1) Washington(1) New(1)
Canada(1) 2001(2) Salt Lake(3) Used(2)
2011(1) Crashed(1)
2 Ford(2) Canada(1) 2005(1) Washington(1) New(2)
Italy(1) 2014(1) Rome(1)
...
What I've tried so far:
df.groupby(['B'])
Which gives me back <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d559080>
At this point, I am not sure how to move forward to get the desired results after grouping by column B.
Thank you for your suggestions.
You need a lambda with a custom function that processes each column separately with Series.value_counts and then joins the index values to the counts:
def f(x):
    x = x.value_counts()
    y = x.index.astype(str) + '(' + x.astype(str) + ')'
    return y.reset_index(drop=True)

df1 = df.groupby(['B']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print(df1)
B C D E F
0 Ford(2) Italy(1) 2014(1) Washington(1) New(2)
1 NaN Canada(1) 2005(1) Rome(1) NaN
2 Honda(4) USA(3) 2001(2) Salt Lake(3) Used(2)
3 NaN Canada(1) 2011(1) Washington(1) Crashed(1)
4 NaN NaN 2000(1) NaN New(1)
5 Toyota(1) USA(1) 2010(1) Ney York(1) Used(1)
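If a long format is acceptable, the same counts can be obtained with per-column value_counts inside the groupby. A sketch, with column names as in the question:
import pandas as pd

df = pd.DataFrame({'B': ['Honda', 'Honda', 'Ford', 'Toyota', 'Honda', 'Honda', 'Ford'],
                   'C': ['USA', 'USA', 'Canada', 'USA', 'USA', 'Canada', 'Italy'],
                   'D': [2000, 2001, 2005, 2010, 2001, 2011, 2014],
                   'E': ['Washington', 'Salt Lake', 'Washington', 'Ney York',
                         'Salt Lake', 'Salt Lake', 'Rome'],
                   'F': ['New', 'Used', 'New', 'Used', 'Used', 'Crashed', 'New']})

# one MultiIndex Series of (B, value) -> count per column
for col in ['C', 'D', 'E', 'F']:
    print(df.groupby('B')[col].value_counts(), '\n')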

How to output the top 5 of a specific column along with associated columns using python?

I've tried to use df2.nlargest(5, ['1960']), which gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it outputs all the columns. I just want it to include the columns "Country Name" and "1960", sorted by "1960".
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000
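No answer is shown for this one, but a minimal sketch of one way to get that output, assuming df2 holds the data above, is to select the two columns after taking the top 5 (nlargest already sorts by the given column):
df2.nlargest(5, '1960')[['Country Name', '1960']]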
