Inserting a Ratio field into a Pandas Series - python

I get a Pandas series:
countrypat = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0).head(3)
The output looks like this:
China abc 1055
def 778
ghi 612
Malaysia def 554
abc 441
ghi 178
[...]
How do I insert a new column (do I have to make this a dataframe?) containing the ratio of the numeric column to the sum of the numbers for that country? Thus for China I would want a new column whose first row contains 1055/(1055+778+612). I have tried unstack() and to_df() but was unsure of the next steps.

I created a dataframe on my side, but excluded the .head(3) of your assignment:
countrypat = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0)
The following will give you the proportions with a simple apply to your groupby object:
countrypat.apply(lambda x: x / float(x.sum()))
The only 'problem' is that doing so returns a Series, so I would store the intermediate results in two separate Series and combine them at the end:
series1 = asiaselect.groupby('Country')['Pattern'].value_counts()
series2 = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0).apply(lambda x: x / float(x.sum()))
pd.DataFrame([series1, series2]).T
China abc 1055.0 0.431493
def 778.0 0.318200
ghi 612.0 0.250307
Malaysia def 554.0 0.472293
abc 441.0 0.375959
ghi 178.0 0.151748
To get the top three rows, you can simply add a .groupby(level=0).head(3) to both series1 and series2:
series1_top = series1.groupby(level=0).head(3)
series2_top = series2.groupby(level=0).head(3)
pd.DataFrame([series1_top, series2_top]).T
I tested with a dataframe containing more than 3 rows per country, and it seems to work. I started with the following df:
China abc 1055
def 778
ghi 612
yyy 5
xxx 3
zzz 3
Malaysia def 554
abc 441
ghi 178
yyy 5
xxx 3
zzz 3
and it ends like this:
China abc 1055.0 0.429560
def 778.0 0.316775
ghi 612.0 0.249186
Malaysia def 554.0 0.467905
abc 441.0 0.372466
ghi 178.0 0.150338
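
As an aside (not part of the original answer), the same result can be sketched more compactly by keeping everything in one DataFrame and using transform('sum'); the sample data below is hypothetical:
import pandas as pd

# hypothetical stand-in for asiaselect
asiaselect = pd.DataFrame({
    'Country': ['China'] * 5 + ['Malaysia'] * 4,
    'Pattern': ['abc', 'abc', 'def', 'ghi', 'abc', 'def', 'abc', 'ghi', 'def'],
})

counts = (asiaselect.groupby('Country')['Pattern']
          .value_counts()
          .rename('count')          # avoid a name clash with the 'Pattern' index level
          .reset_index())

# ratio of each count to the per-country total
counts['ratio'] = counts['count'] / counts.groupby('Country')['count'].transform('sum')

# top three patterns per country (value_counts already sorts within each group)
top3 = counts.groupby('Country').head(3)
print(top3)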

Related

Filter columns containing values and NaN using specific characters and create separate columns

I have a dataframe containing columns in the below format
df =
ID Folder Name Country
300 ABC 12345 CANADA
1000 NaN USA
450 AML 2233 USA
111 ABC 2234 USA
550 AML 3312 AFRICA
Output needs to be in the below format
ID Folder Name Country Folder Name - ABC Folder Name - AML
300 ABC 12345 CANADA ABC 12345 NaN
1000 NaN USA NaN NaN
450 AML 2233 USA NaN AML 2233
111 ABC 2234 USA ABC 2234 NaN
550 AML 3312 AFRICA NaN AML 3312
I tried using the below Python code:
df_['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x.str.startswith('ABC',na = False))
Can you please help me with where I am going wrong?
You should not use apply here, but boolean indexing:
df.loc[df['Folder Name'].str.startswith('ABC', na=False),
       'Folder Name - ABC'] = df['Folder Name']
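If you did want to cover several codes with this approach, one hedged sketch (the list of prefixes lst is an assumption here) would be to repeat the assignment in a loop:
lst = ['ABC', 'AML']
for code in lst:
    mask = df['Folder Name'].str.startswith(code, na=False)
    df.loc[mask, f'Folder Name - {code}'] = df['Folder Name']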
However, a better approach that would not require you to loop over all possible codes would be to extract the code, pivot_table and merge:
out = df.merge(
    df.assign(col=df['Folder Name'].str.extract(r'(\w+)'))
      .pivot_table(index='ID', columns='col',
                   values='Folder Name', aggfunc='first')
      .add_prefix('Folder Name - '),
    on='ID', how='left'
)
output:
ID Folder Name Country Folder Name - ABC Folder Name - AML
0 300 ABC 12345 CANADA ABC 12345 NaN
1 1000 NaN USA NaN NaN
2 450 AML 2233 USA NaN AML 2233
3 111 ABC 2234 USA ABC 2234 NaN
4 550 AML 3312 AFRICA NaN AML 3312
If you have a list with the substrings to be matched at the start of each string in df['Folder Name'], you could also achieve the result as follows:
lst = ['ABC','AML']
pat = f'^({".*)|(".join(lst)}.*)'
# '^(ABC.*)|(AML.*)'
df[[f'Folder Name - {x}' for x in lst]] = \
    df['Folder Name'].str.extract(pat, expand=True)
print(df)
ID Folder Name Country Folder Name - ABC Folder Name - AML
0 300 ABC 12345 CANADA ABC 12345 NaN
1 1000 NaN USA NaN NaN
2 450 AML 2233 USA NaN AML 2233
3 111 ABC 2234 USA ABC 2234 NaN
4 550 AML 3312 AFRICA NaN AML 3312
If you do not already have this list, you can simply create it first by doing:
lst = df['Folder Name'].dropna().str.extract('^([A-Z]{3})')[0].unique()
# this will be an array, not a list,
# but that doesn't affect the functionality here
N.B. If your list contains items that won't match, you'll end up with extra columns filled completely with NaN values. You can get rid of these at the end. E.g.:
lst = ['ABC','AML','NON']
# 'NON' won't match
pat = f'^({".*)|(".join(lst)}.*)'
df[[f'Folder Name - {x}' for x in lst]] = \
    df['Folder Name'].str.extract(pat, expand=True)
df = df.dropna(axis=1, how='all')
# dropping column `Folder Name - NON` with only `NaN` values
The startswith method returns True or False, so your column would contain only boolean values. Instead you can try this:
df['Folder Name - ABC'] = df['Folder Name'].apply(
    lambda x: x if isinstance(x, str) and x.startswith('ABC') else None)
does this code do the trick?
df['Folder Name - ABC'] = df['Folder Name'].where(df['Folder Name'].str.startswith('ABC', na=False))

Pandas - comparing certain columns of two dataframes and updating rows of one if a condition is met

I have two dataframes that share some of the same columns, but one has more columns than the other. I would like to compare certain column values of the two dataframes and, for each row where both dataframes have the same values in those columns, update a column value in one of them. Ex:
df1:
State Organization Date       Tag Fine
MD    ABC          01/10/2021 901    0
MD    ABC          01/10/2021 801    0
NJ    DEF          02/10/2021 701    0
NJ    DEF          02/10/2021 601    0
NJ    DEF          02/10/2021 701    0

df2:
State Organization Date       Fine
MD    ABC          01/10/2021 1000
MD    ABC          01/15/2021 6000
NJ    DEF          02/10/2021  900
So in my particular case, if both dataframes share a row where state, organization, and date are the same, I would like to use df2's corresponding row's fine value to update df1's corresponding row's fine value. So:
df1:
State Organization Date       Tag Fine
MD    ABC          01/10/2021 901 1000
MD    ABC          01/15/2021 801 6000
NJ    DEF          02/10/2021 701  900
NJ    DEF          02/10/2021 601  900
NJ    DEF          02/10/2021 701  900
As you can see, the data frames do not have an equal number of columns, so I'm not sure if there's an easy way to do this without using iterrows. Any suggestions?
try this:
idx = ['State', 'Organization', 'Date']
res = df1.set_index(idx).copy()
print(res)
>>>
                               Tag  Fine
State Organization Date
MD    ABC          01/10/2021  901     0
                   01/10/2021  801     0
NJ    DEF          02/10/2021  701     0
                   02/10/2021  601     0
                   02/10/2021  701     0
df2 = df2.set_index(idx)
print(df2)
>>>
                               Fine
State Organization Date
MD    ABC          01/10/2021  1000
                   01/15/2021  6000
NJ    DEF          02/10/2021   900
res.update(df2)
print(res)
>>>
                               Tag    Fine
State Organization Date
MD    ABC          01/10/2021  901  1000.0
                   01/10/2021  801  1000.0
NJ    DEF          02/10/2021  701   900.0
                   02/10/2021  601   900.0
                   02/10/2021  701   900.0
pd.__version__
>>>
'1.4.1'
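As a hedged alternative sketch (my own variant, assuming df1 has a default RangeIndex and that the three key columns uniquely identify rows in df2), a left merge plus fillna gives the same result without touching the index:
key = ['State', 'Organization', 'Date']

# bring df2's Fine alongside df1's, then fall back to df1's original value
merged = df1.merge(df2, on=key, how='left', suffixes=('', '_df2'))
df1['Fine'] = merged['Fine_df2'].fillna(merged['Fine']).to_numpy()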

Dataframe replace with another row, based on condition

I have a dataframe like the following:
ean product_resource_id shop
----------------------------------------------------
123 abc xxl
245 bed xxl
456 dce xxl
123 0 conr
245 0 horec
I want to replace the 0 values in "product_resource_id" with the id from rows where the "ean" is the same.
I want to get a result like:
ean product_resource_id shop
----------------------------------------------------
123 abc xxl
245 bed xxl
456 dce xxl
123 abc conr
245 bed horec
Any help would be really helpful. Thanks in advance!
The idea is to filter out the rows with 0 in product_resource_id, remove duplicates by the ean column if any exist, and create a mapping Series with DataFrame.set_index. Values with no match become NaN, so they are replaced by the original values with Series.fillna:
# mask = df['product_resource_id'].ne('0')  # if 0 is stored as a string
mask = df['product_resource_id'].ne(0)       # if 0 is an integer
s = df[mask].drop_duplicates('ean').set_index('ean')['product_resource_id']
df['product_resource_id'] = df['ean'].map(s).fillna(df['product_resource_id'])
print (df)
ean product_resource_id shop
0 123 abc xxl
1 245 bed xxl
2 456 dce xxl
3 123 abc conr
4 245 bed horec
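A hedged alternative sketch (not from the original answer): treat 0 as missing and take the first non-missing id within each ean group with transform:
import numpy as np

df['product_resource_id'] = (df['product_resource_id']
                             .replace('0', np.nan)   # use 0 instead of '0' if the column is numeric
                             .groupby(df['ean'])
                             .transform('first'))    # 'first' skips NaN within each group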

Conditionally filling blank values in Pandas dataframes

I have a dataframe which looks as follows (more columns have been dropped off):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with the existing value of shipping_country for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure what's the most efficient way to do this on a large-scale dataset. Perhaps using a vectorized groupby method?
You can use GroupBy + ffill / bfill:
def filler(x):
    return x.ffill().bfill()
res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
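A minimal usage sketch (assuming the blanks are already NaN): using transform instead of apply keeps the original index, so the result can be assigned straight back:
df['shipping_country'] = (df.groupby('memberID')['shipping_country']
                            .transform(lambda x: x.ffill().bfill()))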
For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you, and it also has the behavior that if a memberID group only has empty string values ('') in shipping_country, those will be retained in the output df:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would like to leave the empty strings '' as NaN in the output df, then just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
You can use chained groupbys, one with forward fill and one with backfill:
# replace blank values with `NaN` first:
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN

Moving the category columns from the second row to the first row in a Pandas DataFrame?

I have this dataframe:
Unnamed=0 2001 2002 2003
General 456 567 543
Cleaning 234 234 344
After transposing the data, I got the variables in the second row in Jupyter Notebook:
df = df.T.rename_axis('Date').reset_index()
df
Date 1 2
1 General Cleaning
2 2001 456 234
3 2002 567 234
4 2003 543 344
How do I place them in the first row in the DataFrame so I can group and manipulate the values?
Date General Cleaning
1 2001 456 234
2 2002 567 234
3 2003 543 344
You were close with the attempt you showed above. Instead, reset the index to move the dates from the index to the first column, and then rename that date column from index to Date:
df = df.T.reset_index().rename(columns={'index':'Date'})
df
Output:
Date General Cleaning
0 2001 456 234
1 2002 567 234
2 2003 543 344
You can simply drop row 1 and rename the columns:
df.drop(1, axis=0, inplace=True)
df.columns = ['Date', 'General', 'Cleaning']
Assuming this df:
df = pd.DataFrame(data=[[2001,2002,2003],[456,567,543],[234,234,344]],index=[0,'General','Cleaning'])
Just do it:
df = df.T.copy().rename(columns={0:'Date'})
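One more hedged sketch (assuming the first column really is named 'Unnamed=0' as shown in the question): setting it as the index before transposing avoids the stray label row entirely:
df = (df.set_index('Unnamed=0')   # 'General' and 'Cleaning' become the column names after .T
        .T
        .rename_axis('Date')
        .reset_index())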
