I have an Excel file which, simplified, has the following structure, and which I read as a DataFrame:
import pandas as pd

df = pd.DataFrame({'ISIN': ['US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US00206R1023'],
                   'Name': ['ALPHABET INC.CL.A DL-,001', 'Alphabet Inc Class A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'Alphabet Inc. Class C', 'Alphabet Inc. Class A', 'AT&T Inc'],
                   'Country': ['United States', 'United States', 'United States', '', 'United States', 'United States', 'United States', 'United States', 'United States'],
                   'Category': ['', 'big', 'big', '', 'big', 'test', 'test', 'test', 'average'],
                   'Category2': ['important', '', 'important', '', '', '', '', '', 'irrelevant'],
                   'Value': [1000, 750, 60, 50, 160, 9, 10, 10, 1]})
I would like to group by ISIN and add up the values, like
df1 = df.groupby('ISIN')['Value'].sum()
The problem with this approach is that I don't get the other fields 'Name', 'Country', 'Category', 'Category2'.
My objective is to get the following aggregated DataFrame as a result:
df1 = pd.DataFrame({'ISIN':['US02079K3059', 'US00206R1023'],
'Name':['ALPHABET A', 'AT&T Inc'],
'Country':['United States', 'United States'],
'Category':['big', 'average'],
'Category2':['important', 'irrelevant'],
'Value':[2049, 1]})
If you compare df to df1, you will recognize some criteria/conditions I applied:
for every 'ISIN', the most commonly appearing field value should be used, e.g. 'United States' in column 'Country'
if several field values are equally common, the first appearing one among them should be used, e.g. 'big' over 'test' in column 'Category'
Exception: empty values don't count; e.g. in 'Category2', even though '' is the most common value, 'important' is used as the final value.
How can I achieve this goal? Can anyone help me out?
Try converting '' to NaN, then drop the 'Value' column, group by 'ISIN' and calculate the mode, and finally map the per-'ISIN' sums of 'Value' onto the 'ISIN' column to recreate the 'Value' column in the final result.
Basically, the idea is to convert the empty string '' to NaN so that it doesn't count in the mode. We also define a function to handle the case where the mode of a column grouped by 'ISIN' is empty because of dropna=True in the mode() method:
def f(x):
    try:
        return x.mode().iat[0]
    except IndexError:
        return float('NaN')
Finally:
out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False).agg(f))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc'] = out['Value'].div(out['Value'].sum()).round(5)
Or, by passing dropna=False to the mode() method and using an anonymous function (note: when NaN itself is the most frequent value in a group, this variant returns NaN rather than the most common non-empty value):
out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False).agg(lambda x: x.mode(dropna=False).iat[0]))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc'] = out['Value'].div(out['Value'].sum()).round(5)
Now, if you print out, you will get your desired output.
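As a quick self-contained check, running the first variant end to end on the sample df from the question reproduces the desired aggregation:

```python
import pandas as pd

df = pd.DataFrame({'ISIN': ['US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US00206R1023'],
                   'Name': ['ALPHABET INC.CL.A DL-,001', 'Alphabet Inc Class A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'Alphabet Inc. Class C', 'Alphabet Inc. Class A', 'AT&T Inc'],
                   'Country': ['United States', 'United States', 'United States', '', 'United States', 'United States', 'United States', 'United States', 'United States'],
                   'Category': ['', 'big', 'big', '', 'big', 'test', 'test', 'test', 'average'],
                   'Category2': ['important', '', 'important', '', '', '', '', '', 'irrelevant'],
                   'Value': [1000, 750, 60, 50, 160, 9, 10, 10, 1]})

def f(x):
    # mode() drops NaN by default; an all-NaN group yields an empty result
    try:
        return x.mode().iat[0]
    except IndexError:
        return float('NaN')

out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False).agg(f))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
print(out)
```

One subtlety: when values are tied for most common, mode() returns them sorted, so iat[0] picks the alphabetically first of the tie ('big' over 'test', 'ALPHABET A' over 'ALPHABET INC CLASS A'), which happens to match the desired output here but is not strictly "first appearing".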
Related
SyntaxError: invalid syntax when executing. I am trying to create groups by continent, but Python reports the = as invalid. What should I put instead?
def country_kl(country):
    if country = ['United States', 'Mexico', 'Canada', 'Bahamas', 'Chile', 'Brazil', 'Colombia', 'British Virgin Islands',
                  'Peru', 'Uruguay', 'Turks and Caicos Islands', 'Cambodia', 'Bermuda', 'Argentina']:
        return '1'
    elif country = ['France', 'Spain', 'Germany', 'Switzerland', 'Belgium', 'United Kingdom', 'Austria', 'Italy', 'Swaziland',
                    'Russia', 'Sweden', 'Czechia', 'Monaco', 'Denmark', 'Poland', 'Norway', 'Netherlands', 'Portugal', 'Turkey', 'Finland',
                    'Ukraine', 'Andorra', 'Hungary', 'Greece', 'Romania', 'Slovakia', 'Liechtenstein', 'Guernsey', 'Ireland']:
        return '2'
    elif country = ['India', 'China', 'Singapore', 'Hong Kong', 'Australia', 'Japan']:
        return '3'
    elif country = ['United Arab Emirates', 'Thailand', 'Malaysia', 'New Zealand', 'South Korea', 'Philippines', 'Taiwan', 'Israel',
                    'Vietnam', 'Cayman Islands', 'Kazakhstan', 'Georgia', 'Bahrain', 'Nepal', 'Qatar', 'Oman', 'Lebanon']:
        return '3'
    else:
        return '4'
One more error in your code is that you used a single "=", which actually means assignment.
To compare two values, use "==" (a double "=").
But of course, to check whether the value of a variable is contained in a list, you have to use the in operator, just as Ilya suggested in his comment.
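Putting both fixes together, a corrected sketch of the function using the in operator (the membership lists are shortened here for brevity):

```python
def country_kl(country):
    # Membership test with `in`, not assignment (`=`) or equality (`==`)
    if country in ['United States', 'Mexico', 'Canada', 'Brazil', 'Argentina']:
        return '1'
    elif country in ['France', 'Spain', 'Germany', 'United Kingdom', 'Italy']:
        return '2'
    elif country in ['India', 'China', 'Singapore', 'Hong Kong', 'Australia', 'Japan']:
        return '3'
    else:
        return '4'
```

Any country not found in one of the lists falls through to the else branch and gets '4'.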
Another, more readable and elegant solution is:
Create a dictionary, where the key is the country name and the value is your expected result for this country. Something like:
countries = {'United States': '1', 'Mexico': '1', 'France': '2', 'Spain': '2',
'India': '3', 'China': '3', 'Singapore': '3'}
(include other countries too).
Look up this dictionary with a default value of '4', which you used in your code (note that dict.get takes the default as a positional argument, not as a keyword):
result = countries.get(country, '4')
And by the way: your question and code have nothing to do with Pandas.
You use an ordinary Python list and (as I suppose) a string variable.
But since you also tagged your question with Pandas,
I came up with a pandasonic solution too:
Create a Series from the above dictionary:
ctr = pd.Series(countries)
Lookup this Series, also with a default value:
result = ctr.get(country, default='4')
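If the country names live in a DataFrame column rather than a single variable, the same Series works vectorized via map() (the column name 'Country' and the shortened dictionary below are assumptions for illustration):

```python
import pandas as pd

countries = {'United States': '1', 'Mexico': '1', 'France': '2', 'Spain': '2',
             'India': '3', 'China': '3', 'Singapore': '3'}
ctr = pd.Series(countries)

df = pd.DataFrame({'Country': ['United States', 'France', 'China', 'Atlantis']})
# Unknown countries map to NaN, so fill those with the default '4'
df['group'] = df['Country'].map(ctr).fillna('4')
```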
I have a pandas column with strings which don't share the same pattern, something like this:
{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'}
{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}
{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}
How do I only keep the name of the country for every row? I would only like to keep "France", "United States of America", "France".
I tried building the regex pattern: something like this
r"^\W+[a-z]+_[0-9]\W+"
But this turns out to be very specific, and if there is a slight change in the string the pattern won't work. How do we resolve this?
As you have dictionaries in the column, you can get the values of the name keys:
import pandas as pd
df = pd.DataFrame({'col':[{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'},
{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'},
{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}]})
df['col'] = df['col'].apply(lambda x: x["name"])
Output of df['col']:
0 France
1 United States of America
2 France
Name: col, dtype: object
If the column contains stringified dictionaries, you can use ast.literal_eval before accessing the name key value:
import pandas as pd
import ast
df = pd.DataFrame({'col':["{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'}",
"{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}",
"{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}"]})
df['col'] = df['col'].apply(lambda x: ast.literal_eval(x)["name"])
And in case your column is totally messed up, yes, you can resort to regex:
df['col'] = df['col'].str.extract(r"""['"]name['"]\s*:\s*['"]([^"']+)""")
# or to support escaped " and ':
df['col'] = df['col'].str.extract(r"""['"]name['"]\s*:\s*['"]([^"'\\]+(?:\\.[^'"\\]*)*)""")

>>> df['col']
0                      France
1    United States of America
2                      France
Name: col, dtype: object
Given the following df with street names:
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
And df2, which contains the matching streets and their county:
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
How can I create a column that tells me the county where each street of df is, through a pairing of df (street1) with df2 (street2)? The matching does not have to be perfect; it must match at least one word.
The following dataframe is an example of what I want to obtain:
desiredoutput = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown'], 'state': ["Utuado", "NA", "NA", "Bayamon"]})
Maybe a naive approach, but it works well:
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ['Utuado', 'Utuado', 'Bayamon']})
output = {'street1': [], 'county': []}
streets1 = df['street1']
streets2 = df2['street2']
county = df2['county']
for street in streets1:
    count = 0  # reset the match flag for every street
    for index, street2 in enumerate(streets2):
        if street2 in street:
            output['street1'].append(street)
            output['county'].append(county[index])
            count = 1
    if count == 0:
        output['street1'].append(street)
        output['county'].append('NA')
print(output)
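The same idea can be written back into the DataFrame with apply(), which keeps the result aligned with df and fills 'NA' for misses directly (a sketch along the lines of the loop above):

```python
import pandas as pd

df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'],
                    'county': ['Utuado', 'Utuado', 'Bayamon']})

lookup = list(zip(df2['street2'], df2['county']))

def find_county(street):
    # Return the county of the first street2 that appears as a substring
    for street2, county in lookup:
        if street2 in street:
            return county
    return 'NA'

df['county'] = df['street1'].apply(find_county)
```

Unlike the loop, this picks only the first match per street, which matches the desired output for the sample data.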
I'm trying to analyze StackOverflow Survey Data which can be found here.
I want to get all rows which contain "United States" in the "Country" column and then store them in a variable.
For example:
my_data = {'Age1stCode': [12, 13, 15, 16, 18],
           'Country': ['India', 'China', 'United States', 'England', 'United States'],
           'x': ['a', 'b', 'c', 'd', 'e']}
what I want:
my_result_data = {'Age1stCode': [15, 18],
                  'Country': ['United States', 'United States'],
                  'x': ['c', 'e']}
If you need to filter or search for a specific string in a specific column of a DataFrame, you just need to use
df_query = df.loc[df['Country'].str.contains('United States')]
The search is case sensitive by default; you can set case=False to make it case insensitive.
Check the doc for more info: pandas doc
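On a small frame shaped like the example above, the filter looks like this:

```python
import pandas as pd

df = pd.DataFrame({'Age1stCode': [12, 13, 15, 16, 18],
                   'Country': ['India', 'China', 'United States', 'England', 'United States'],
                   'x': ['a', 'b', 'c', 'd', 'e']})

# Keep only rows whose Country contains the substring 'United States'
my_result_data = df.loc[df['Country'].str.contains('United States')]
```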
I have a dataframe that looks like:
df = pd.DataFrame({'ISIN': ['A1kT23', '4523', 'B333', '49O33'], 'Name': ['Example A', 'Name Xy', 'Example B', 'Test123'], 'Sector': ['Energy', 'Industrials', 'Utilities', 'Real Estate'], 'Country': ['UK', 'USA', 'Germany', 'China']})
I would like to translate the column Sector into German by using the dataframe Sector_EN_DE
Sector_EN_DE = pd.DataFrame({'Sector_EN': ['Energy', 'Industrials', 'Utilities', 'Real Estate', 'Materials'], 'Sector_DE': ['Energie', 'Industrie', 'Versorger', 'Immobilien', 'Materialien']})
so that I get as result the dataframe
df = pd.DataFrame({'ISIN': ['A1kT23', '4523', 'B333', '49O33'], 'Name': ['Example A', 'Name Xy', 'Example B', 'Test123'], 'Sector': ['Energie', 'Industrie', 'Versorger', 'Immobilien'], 'Country': ['UK', 'USA', 'Germany', 'China']})
What would be the appropriate code line?
Another way, via map():
df['Sector']=df['Sector'].map(dict(Sector_EN_DE[['Sector_EN', 'Sector_DE']].values))
Or via replace():
df['Sector']=df['Sector'].replace(dict(Sector_EN_DE[['Sector_EN', 'Sector_DE']].values))
(Note: map() turns sectors missing from the dictionary into NaN, while replace() leaves them unchanged.)
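On the sample frames from the question, the dictionary built from the two columns and the map() call work out as follows:

```python
import pandas as pd

df = pd.DataFrame({'ISIN': ['A1kT23', '4523', 'B333', '49O33'],
                   'Name': ['Example A', 'Name Xy', 'Example B', 'Test123'],
                   'Sector': ['Energy', 'Industrials', 'Utilities', 'Real Estate'],
                   'Country': ['UK', 'USA', 'Germany', 'China']})
Sector_EN_DE = pd.DataFrame({'Sector_EN': ['Energy', 'Industrials', 'Utilities', 'Real Estate', 'Materials'],
                             'Sector_DE': ['Energie', 'Industrie', 'Versorger', 'Immobilien', 'Materialien']})

# The two-column slice becomes a dict of {English: German} pairs
translation = dict(Sector_EN_DE[['Sector_EN', 'Sector_DE']].values)
df['Sector'] = df['Sector'].map(translation)
```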
This line will do the merge and the DataFrame cleanup:
df.merge(Sector_EN_DE, left_on='Sector', right_on='Sector_EN').drop(['Sector', 'Sector_EN'], axis=1).rename(columns={'Sector_DE': 'Sector'})
Explanation:
The merge function does the join between both DataFrames. Note that the default is an inner join, so rows whose Sector has no translation would be dropped; pass how='left' to keep them.
The drop function drops the English version of Sector, with axis=1 because you're dropping columns (you can also use that function to drop rows).
The rename function renames the Sector_DE column back to Sector.
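Checked on the sample frames (note that this pipeline moves Sector to the last column position):

```python
import pandas as pd

df = pd.DataFrame({'ISIN': ['A1kT23', '4523', 'B333', '49O33'],
                   'Name': ['Example A', 'Name Xy', 'Example B', 'Test123'],
                   'Sector': ['Energy', 'Industrials', 'Utilities', 'Real Estate'],
                   'Country': ['UK', 'USA', 'Germany', 'China']})
Sector_EN_DE = pd.DataFrame({'Sector_EN': ['Energy', 'Industrials', 'Utilities', 'Real Estate', 'Materials'],
                             'Sector_DE': ['Energie', 'Industrie', 'Versorger', 'Immobilien', 'Materialien']})

# Join on the English sector name, drop both English columns, rename the German one
result = (df.merge(Sector_EN_DE, left_on='Sector', right_on='Sector_EN')
            .drop(['Sector', 'Sector_EN'], axis=1)
            .rename(columns={'Sector_DE': 'Sector'}))
```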