Replace the values in a column based on frequency - python

I have a dataframe (3.7 million rows) with a column with different country names
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellanous" because they each occur in less than 15% of the rows.
My alternative idea is to replace the names with their frequency using:
df.column_name = df.column_name.map(df.column_name.value_counts())
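For reference, a minimal frame matching the sample above (an assumption about the exact layout), which also shows what the value_counts mapping produces:

import pandas as pd

df = pd.DataFrame({
    'id': range(1, 11),
    'Country': ['RUSSIA', 'USA', 'RUSSIA', 'RUSSIA', 'INDIA',
                'USA', 'USA', 'ITALY', 'USA', 'RUSSIA'],
})

# maps every name to its raw count: RUSSIA/USA -> 4, INDIA/ITALY -> 1
df['Country'].map(df['Country'].value_counts())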

Use:
df.loc[df.groupby('Country')['id']
         .transform('size')
         .div(len(df))
         .lt(0.15),
       'Country'] = 'Miscellanous'
Or
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
                           .lt(0.15)),
       'Country'] = 'Miscellanous'

If you want to put every country whose frequency is below a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'] = df['Country'].map(mappings)
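With the 10-row sample above, mappings comes out roughly as (tie order may differ):
{'RUSSIA': 'RUSSIA', 'USA': 'USA', 'INDIA': 'Misc', 'ITALY': 'Misc'}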

Here is another option:
s = df['Country'].value_counts()
s = s / s.sum()
small = s.index[s < 0.15].tolist()
df['Country'] = df['Country'].replace(small, 'Miscellanous')

You can use a dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x: x if d[x] > 0.15 else 'Miscellanous')
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellanous
6 USA
7 USA
8 Miscellanous
9 USA
10 RUSSIA
Name: Country, dtype: object
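Note that d[x] raises a KeyError for any value missing from the dict (e.g. NaN, which value_counts drops by default); a safer sketch uses dict.get, at the cost of sending unseen values to 'Miscellanous':

df.Country.map(lambda x: x if d.get(x, 0) > 0.15 else 'Miscellanous')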

Related

How to keep the values with most frequent prefix in a groupby pandas dataframe?

Let's say I have this dataframe:
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 Spain m2_location
4 USA m1_name
5 USA m2_name
6 USA m3_size
7 USA m3_location
I want to group on the Country column and keep only the rows whose City prefix (the part before the underscore) is the most frequent within each country.
The expected result would be:
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
I already tried extracting the prefix, then taking the mode of the prefix per group and merging the rows matching that mode, but I feel that a more direct and more efficient solution exists.
Here is the working sample code for reproducible results:
import pandas as pd

df = pd.DataFrame({
    "Country": ["Spain","Spain","Spain","Spain","USA","USA","USA","USA"],
    "City": ["m1_name","m1_location","m1_size","m2_location","m1_name","m2_name","m3_size","m3_location"]
})
df['prefix'] = df['City'].str[1]  # second character, i.e. the market number ('1' in 'm1_name')
modes = df.groupby('Country')['prefix'].agg(pd.Series.mode).rename("modes")
df = df.merge(modes, how="right", left_on=['Country','prefix'], right_on=['Country','modes'])
df = df.drop(['modes','prefix'], axis=1)
print(df)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
You can try groupby and apply to filter the group rows:
out = (df.assign(prefix=df['City'].str.split('_').str[0])
         .groupby('Country')
         .apply(lambda g: g[g['prefix'].isin(g['prefix'].mode())])
         .reset_index(drop=True)
         .drop('prefix', axis=1))
print(out)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
Use:
df['Prefix_count'] = df.groupby(['Country', df.City.str.split('_').str[0]])['City'].transform('size')
idx = df.groupby('Country')['Prefix_count'].transform('max') == df['Prefix_count']
df[idx].drop('Prefix_count', axis=1)
Output:
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
An interesting fact about the proposed solutions is that Mayank's is by far the fastest. I ran them on 1,000 rows of my data and got:
Mayank's solution: 0.020 seconds
Ynjxsjmh's solution: 0.402 seconds
My (OP) solution: 0.122 seconds
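For reference, a rough sketch of how such a comparison can be reproduced (the tiled 1,000-row frame and the transform_size wrapper are assumptions for illustration, not the OP's actual data):

import timeit
import pandas as pd

df = pd.DataFrame({
    "Country": ["Spain"] * 4 + ["USA"] * 4,
    "City": ["m1_name", "m1_location", "m1_size", "m2_location",
             "m1_name", "m2_name", "m3_size", "m3_location"],
})
df = pd.concat([df] * 125, ignore_index=True)  # ~1000 rows

def transform_size(frame):
    # Mayank's approach: count rows per (Country, prefix), keep the per-Country max
    counts = frame.groupby(['Country', frame.City.str.split('_').str[0]])['City'].transform('size')
    return frame[counts == counts.groupby(frame['Country']).transform('max')]

print(timeit.timeit(lambda: transform_size(df), number=10))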

Splitting a csv into multiple csv of maximum 2000 rows while respecting grouping condition using Python

This is my very first question...
I'm trying to split a big CSV into multiple CSVs of at most 2000 rows each. If it were plain splitting it would be too easy; here I can't simply cut the file every 2000 rows, because some rows need to be grouped together. No output file may exceed 2000 rows (it can be smaller), and rows that belong together must land in the same file. Rows belong together when they share the same combination of values in two columns; that's how I know they need to be together.
Example with 10 records and split CSVs of at most 5 rows:
Country Category Product
Spain A 1
Spain A 2
Spain A 3
Spain B 4
Spain B 5
Spain B 6
Spain B 7
Italy B 8
Germany A 9
Germany A 10
Here all the rows having the same combination of Country and Category need to be together. If the maximum size of the split file is 5, we get the following:
File 1:
Country Category Product
Spain A 1
Spain A 2
Spain A 3

File 2:
Country Category Product
Spain B 4
Spain B 5
Spain B 6
Spain B 7
Italy B 8

File 3:
Country Category Product
Germany A 9
Germany A 10
Any idea how I could solve this?
Thanks!!
You can find the group sizes, then determine which group starts each new file (the point at which adding the next group would overflow the given maximum number of rows), convert that to file numbers, and group by file number to save the individual CSVs.
In code this would look like the following (please see comments for explanation):
# set max records per file
N = 5
# find counts per group
z = df.groupby(['Country', 'Category'], sort=False).size().reset_index()
# set `x` = 1 on the first group of each new file
# (i.e. where including it would overflow N)
i = 0
for j in range(len(z)):
    if z.loc[i:j, 0].sum() > N:
        z.loc[j, 'x'] = 1
        i = j
# calculate the file number `f` as the cumsum of `x`
z['f'] = z['x'].fillna(0).cumsum().astype(int) + 1
# merge df and z to get the file number for each record,
# then group by it and save to separate CSV files
for f, df_g in df.merge(z[['Country', 'Category', 'f']]).groupby('f'):
    df_g.drop(columns='f').to_csv(f'{f:03}.csv', index=False)
This would save your sample DataFrame into 3 files:
001.csv
Country Category Product
0 Spain A 1
1 Spain A 2
2 Spain A 3
002.csv
Country Category Product
0 Spain B 4
1 Spain B 5
2 Spain B 6
3 Spain B 7
4 Italy B 8
003.csv
Country Category Product
0 Germany A 9
1 Germany A 10
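A quick sanity check on the written files (a sketch; it assumes the three CSVs above sit in the working directory):

import pandas as pd

for name in ['001.csv', '002.csv', '003.csv']:
    part = pd.read_csv(name)
    assert len(part) <= 5  # no file exceeds the maximum
    print(name, part.groupby(['Country', 'Category']).size().to_dict())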
Example:
Country Category Product
0 Spain a 1
1 Belgium b 2
2 Spain a 2
3 Cuba c 3
4 Belgium c 4
5 Cuba a 5
new_df = df[df['Country']=='Spain']
Country Category Product
0 Spain a 1
2 Spain a 2
Then write the subset to its own CSV file (and do the same for each Country/Category combination):
new_df.to_csv(file)
This is not an optimal solution but it is an easy and tractable solution that is likely good enough.
import csv
from itertools import groupby

# fn is the path to your input CSV
with open(fn) as f:
    reader = csv.reader(f)
    header = next(reader)
    data = [row for row in reader]

def key_func(li):
    return (li[1], li[0], li[2])

data_dic = {}
# If you only want a single country in a file, change the following line to
# for k, v in groupby(sorted(data, key=key_func), key=lambda li: (li[0], li[1])):
for k, v in groupby(sorted(data, key=key_func), key=lambda li: li[1]):
    data_dic[k] = list(v)

chnk = 5  # change this for the max lines per file
cnt = 1
for k, v in data_dic.items():
    for chunk in (v[i:i+chnk] for i in range(0, len(v), chnk)):
        print(f'\n=== file {cnt}:')
        cnt += 1
        print('\n'.join([','.join(e) for e in [header] + chunk]))
With your example, prints:
=== file 1:
Country,Category,Product
Germany,A,10
Germany,A,9
Spain,A,1
Spain,A,2
Spain,A,3
=== file 2:
Country,Category,Product
Italy,B,8
Spain,B,4
Spain,B,5
Spain,B,6
Spain,B,7
With this input:
Country,Category,Product
Spain,A,1
Spain,A,2
Spain,A,3
Spain,A,4
Spain,B,4
Spain,B,5
Spain,B,6
Spain,B,7
Spain,B,8
Italy,B,8
Germany,A,9
Germany,A,10
Germany,A,11
Cuba,C,22
Prints:
=== file 1:
Country,Category,Product
Germany,A,10
Germany,A,11
Germany,A,9
Spain,A,1
Spain,A,2
=== file 2:
Country,Category,Product
Spain,A,3
Spain,A,4
=== file 3:
Country,Category,Product
Italy,B,8
Spain,B,4
Spain,B,5
Spain,B,6
Spain,B,7
=== file 4:
Country,Category,Product
Spain,B,8
=== file 5:
Country,Category,Product
Cuba,C,22
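If you want the chunks as files on disk rather than printed previews, the print block inside the loop can be swapped for a csv writer (a sketch reusing header, chunk, and cnt from above; the file naming is an assumption):

import csv

with open(f'file_{cnt:03}.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(chunk)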

Extract country name from text in column to create another column

I have tried different combinations to extract the country names from a column and create a new column with only the countries. I can do it for individual rows, e.g. df.address[9998], but not for the whole column.
import pycountry

Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)
Any ideas what is going wrong here?
edit:
address is an object column in the df, and
df.address[:10] looks like this:
Address
0 Turin, Italy
1 NaN
2 Zurich, Switzerland
3 NaN
4 Glyfada, Greece
5 Frosinone, Italy
6 Dublin, Ireland
7 NaN
8 Turin, Italy
9 Kristiansand, Norway
Name: address, Length: 10, dtype: object
Based on Petar's response, when I run individual queries I get the country correctly, but when I try to create a column with all the countries (or ranges like df.address[:5]) I get an empty Cntr.
import pycountry

Cntr = []
for country in pycountry.countries:
    if country.name in df['address'][1]:
        Cntr.append(country.name)
Cntr
Returns
[Italy]
and df.address[2] returns [ ]
etc.
I have also run
df['address'] = df['address'].astype('str')
to make sure that there are no floats or int in the column.
Sample dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})
df[['city', 'country']] = df['address'].str.split(',', expand=True, n=1)
address city country
0 Turin, Italy Turin Italy
1 NaN NaN NaN
2 Zurich, Switzerland Zurich Switzerland
3 NaN NaN NaN
4 Glyfada, greece Glyfada greece
You were really close, but you cannot loop like for country.name in df.address. Instead:
import pycountry

Cntr = []
for country in pycountry.countries:
    if country.name in df.address:
        Cntr.append(country.name)
If this does not work, please supply more information because I am unsure what df.address looks like.
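For the whole column at once, a vectorized sketch that pulls the first pycountry name appearing in each address (an alternative approach, not part of the answer above; it assumes country names appear verbatim in the text):

import re
import numpy as np
import pandas as pd
import pycountry

df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland']})

# longest names first, so longer official names win over shorter substrings
names = sorted((c.name for c in pycountry.countries), key=len, reverse=True)
pattern = '(' + '|'.join(map(re.escape, names)) + ')'
df['country'] = df['address'].str.extract(pattern, expand=False)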
You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.
import numpy as np
import pandas as pd
from dataprep.clean import clean_country

df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
address address_clean
0 Turin, Italy Italy
1 NaN NaN
2 Zurich, Switzerland Switzerland
3 NaN NaN
4 Glyfada, Greece Greece

Conditionally filling blank values in Pandas dataframes

I have a dataframe which looks as follows (additional columns have been dropped):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with the existing shipping_country value for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure what's the most efficient way to do this on a large-scale dataset. Perhaps a vectorized groupby method?
You can use GroupBy + ffill / bfill:
def filler(x):
    return x.ffill().bfill()

res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
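Applied to the sample (a sketch; it assumes the blanks are real NaN rather than empty strings, and uses transform so the result aligns back onto df):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'memberID': [264991, 264991, 100, 5000, 5000],
    'shipping_country': [np.nan, 'Canada', 'USA', np.nan, 'UK'],
})
df['shipping_country'] = (
    df.groupby('memberID')['shipping_country']
      .transform(lambda x: x.ffill().bfill())
)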
For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you; it also preserves the behavior that if a memberID group only has empty string values ('') in shipping_country, those are retained in the output df:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would like to leave the empty strings '' as NaN in the output df, then just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
You can use chained groupbys, one with forward fill and one with backfill:
import numpy as np

# replace blank values with NaN first:
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN

Pandas merge fail to extract common Index values

I'm trying to merge 2 DataFrames of different sizes, both indexed by 'Country'. The first dataframe, GDP_EN, contains every country in the world; the second, ScimEn, contains 15 countries.
When I try to merge these DataFrames, instead of merging the columns based on the index countries of ScimEn, I got back 'Country_x' and 'Country_y'. 'Country_x' came from GDP_EN and holds the first 15 countries in alphabetical order; 'Country_y' holds the 15 countries from ScimEn. I'm wondering why they didn't merge?
I used:
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
I think both DataFrames are not actually indexed by Country; Country is an ordinary column, so add the parameter on='Country':
GDP_EN = pd.DataFrame({'Country': ['USA','France','Slovakia','Russia'],
                       'a': [4,8,6,9]})
print (GDP_EN)
Country a
0 USA 4
1 France 8
2 Slovakia 6
3 Russia 9
ScimEn = pd.DataFrame({'Country': ['France','Slovakia'],
                       'b': [80,70]})
print (ScimEn)
Country b
0 France 80
1 Slovakia 70
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
print (DF)
Country_x a Country_y b
0 USA 4 France 80
1 France 8 Slovakia 70
DF=pd.merge(GDP_EN,ScimEn,on='Country',how='right')
print (DF)
Country a b
0 France 8 80
1 Slovakia 6 70
If Country are indexes it works perfectly:
GDP_EN = pd.DataFrame({'Country': ['USA','France','Slovakia','Russia'],
                       'a': [4,8,6,9]}).set_index('Country')
print (GDP_EN)
a
Country
USA 4
France 8
Slovakia 6
Russia 9
print (GDP_EN.index)
Index(['USA', 'France', 'Slovakia', 'Russia'], dtype='object', name='Country')
ScimEn = pd.DataFrame({'Country': ['France','Slovakia'],
                       'b': [80,70]}).set_index('Country')
print (ScimEn)
b
Country
France 80
Slovakia 70
print (ScimEn.index)
Index(['France', 'Slovakia'], dtype='object', name='Country')
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
print (DF)
a b
Country
France 8 80
Slovakia 6 70
