How to group this dataframe by the 'Continent' column? - python

A 'Continent' column is added to an existing data frame using a dictionary that matches the country names in the data frame.
I am trying to group the data frame by the 'Continent' column.
I have tried the following:
def answer_eleven():
    Top15 = answer_one()
    ContinentDict = {'China': 'Asia',
                     'United States': 'North America',
                     'Japan': 'Asia',
                     'United Kingdom': 'Europe',
                     'Russian Federation': 'Europe',
                     'Canada': 'North America',
                     'Germany': 'Europe',
                     'India': 'Asia',
                     'France': 'Europe',
                     'South Korea': 'Asia',
                     'Italy': 'Europe',
                     'Spain': 'Europe',
                     'Iran': 'Asia',
                     'Australia': 'Australia',
                     'Brazil': 'South America'}
    ContinentDict = pd.Series(ContinentDict)
    Top15 = Top15.assign(Continent=ContinentDict)
    Top15 = Top15.groupby('Continent')
    return Top15

answer_eleven()
However, the output I get is:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000021C9C3BC6D8>

One way to display a groupby object is to iterate over its groups:
data = answer_eleven()
for key, item in data:
    print(data.get_group(key), "\n")
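The printed repr above is expected: a DataFrameGroupBy is lazy and computes nothing until you apply an aggregation. A minimal sketch of common next steps (assuming Top15 contains at least one numeric column):

grouped = answer_eleven()
print(grouped.size())                   # number of countries per continent
print(grouped.mean(numeric_only=True))  # per-continent mean of each numeric column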

Related

Create multiple dataframes with a loop in Python

So I have this piece of code that I want to make shorter:
df_1 = investpy.stocks.get_stock_recent_data('Eco','Colombia',False)
df_2 = investpy.stocks.get_stock_recent_data('JPM','United States',False)
df_3 = investpy.stocks.get_stock_recent_data('TSM','United States',False)
df_5 = investpy.stocks.get_stock_recent_data('CSCO','United States',False)
df_8 = investpy.stocks.get_stock_recent_data('NVDA','United States',False)
df_9 = investpy.stocks.get_stock_recent_data('BLK','United States',False)
As I use the same code and only a few things change from one line to the next, I think I might solve this using a function. I created this one:
def _get_asset_data(ticker, country, state):
    investpy.stocks.get_stock_recent_data(ticker, country, state)
So I tried this:
_get_asset_data('TSLA', 'United States', False)
print(_get_asset_data)
<function _get_asset_data at 0x7f323c912560>
However, I do not know how to store each set of data this function returns in a data frame for each company. I tried a for loop but got nowhere.
Any ideas? Thank you in advance for your attention and cooperation!
Here is one approach based on the code given. You should refrain from using it in practice, as it contains redundant code, which makes it hard to maintain. You'll find a more flexible approach below.
Based on your solution
import investpy
import pandas as pd

def _get_asset_data(ticker, country, state=False):
    return investpy.stocks.get_stock_recent_data(ticker, country, state)

df_1 = _get_asset_data('Eco', 'Colombia')
df_2 = _get_asset_data('JPM', 'United States')
df_3 = _get_asset_data('TSM', 'United States')
df_5 = _get_asset_data('CSCO', 'United States')
df_8 = _get_asset_data('NVDA', 'United States')
df_9 = _get_asset_data('BLK', 'United States')

final = pd.concat([df_1, df_2, df_3, df_5, df_8, df_9], axis=1)
final
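A side note on the concat: axis=1 places the frames side by side, so the column names repeat. If that matters, pd.concat accepts a keys= argument to label each frame; a sketch using the tickers from the question:

final = pd.concat([df_1, df_2, df_3, df_5, df_8, df_9], axis=1,
                  keys=['Eco', 'JPM', 'TSM', 'CSCO', 'NVDA', 'BLK'])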
More versatile solution:
import investpy
import pandas as pd

def _get_asset_data(ticker, country, state=False):
    return investpy.stocks.get_stock_recent_data(ticker, country, state)

stocks = [
    ('Eco', 'Colombia'),
    ('JPM', 'United States'),
    ('TSM', 'United States'),
    ('CSCO', 'United States'),
    ('NVDA', 'United States'),
    ('BLK', 'United States'),
]

results = []
for stock in stocks:
    result = _get_asset_data(*stock)
    results.append(result)

final = pd.concat(results, axis=1)
final
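If you want to keep one frame per company instead of (or in addition to) the combined frame, a dictionary keyed by ticker avoids the numbered df_1, df_2, ... variables; a sketch using the same stocks list:

frames = {ticker: _get_asset_data(ticker, country) for ticker, country in stocks}
frames['NVDA'].head()  # look up one company's data by its ticker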

Aggregating and group by in Pandas considering some conditions

I have an Excel file which, simplified, has the following structure and which I read as a dataframe:
df = pd.DataFrame({'ISIN':['US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US00206R1023'],
'Name':['ALPHABET INC.CL.A DL-,001', 'Alphabet Inc Class A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'Alphabet Inc. Class C', 'Alphabet Inc. Class A', 'AT&T Inc'],
'Country':['United States', 'United States', 'United States', '', 'United States', 'United States', 'United States', 'United States', 'United States'],
'Category':[ '', 'big', 'big', '', 'big', 'test', 'test', 'test', 'average'],
'Category2':['important', '', 'important', '', '', '', '', '', 'irrelevant'],
'Value':[1000, 750, 60, 50, 160, 9, 10, 10, 1]})
I would love to group by ISIN and add up the values, calculating the sum like
df1 = df.groupby('ISIN').sum(['Value'])
The problem with this approach is that I don't get the other fields 'Name', 'Country', 'Category', 'Category2'.
My objective is to get as a result the following data aggregated dataframe:
df1 = pd.DataFrame({'ISIN':['US02079K3059', 'US00206R1023'],
'Name':['ALPHABET A', 'AT&T Inc'],
'Country':['United States', 'United States'],
'Category':['big', 'average'],
'Category2':['important', 'irrelevant'],
'Value':[2049, 1]})
If you compare df to df1, you will recognize some criteria/conditions I applied:
- For every 'ISIN', the most commonly appearing field value should be used, e.g. 'United States' in column 'Country'.
- If field values are equally common, the first of the most common values should be used, e.g. 'big' and 'test' in column 'Category'.
- Exception: empty values don't count; e.g. in 'Category2', even though '' is the most common value, 'important' is used as the final value.
How can I achieve this goal? Can anyone help me out?
Try converting '' to NaN, dropping the 'Value' column, grouping by 'ISIN' and taking the mode; then map the per-'ISIN' sum of 'Value' back onto the 'ISIN' column to recreate the 'Value' column in your final result.
The idea is to convert the empty string '' to NaN so that it doesn't count in the mode. We also define a function to handle the case where the mode of a column grouped by 'ISIN' is empty because of dropna=True in the mode() method:
def f(x):
    try:
        return x.mode().iat[0]
    except IndexError:
        return float('NaN')
Finally:
out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False)
         .agg(f))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc'] = out['Value'].div(out['Value'].sum()).round(5)
Or, pass dropna=False to the mode() method and use an anonymous function:
out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False)
         .agg(lambda x: x.mode(dropna=False).iat[0]))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc'] = out['Value'].div(out['Value'].sum()).round(5)
Note that with dropna=False, NaN itself can win the mode (as in 'Category2' here, where '' occurs more often than 'important'), so this variant only matches the first one when non-empty values dominate.
Now, if you print out, you will get your desired output.
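For the sample df above, the first (f-based) version should produce approximately the following; ties in the mode are returned sorted, so .iat[0] picks 'ALPHABET A' and 'big':

           ISIN        Name        Country Category   Category2  Value  Value_perc
0  US00206R1023    AT&T Inc  United States  average  irrelevant      1     0.00049
1  US02079K3059  ALPHABET A  United States      big   important   2049     0.99951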

Create a new column value based on another column content from a list

I am trying to create a new variable called "region" based on the names of countries in Africa in another variable. I have a list of all the African countries (two shown here as an example) but I am encountering errors.
def africa(x):
    if africalist in x:
        return 'African region'
    else:
        return 'Not African region'

df['region'] = ''
df.region = df.countries.apply(africa)
I'm getting:
TypeError: 'in <string>' requires string as left operand, not list
I recommend you see "When should I want to use apply".
You could use:
df['region'] = df['countries'].isin(africalist).map({True: 'African region',
                                                     False: 'Not African region'})
or, using numpy:
import numpy as np

df['region'] = np.where(df['countries'].isin(africalist),
                        'African region',
                        'Not African region')
Your condition is reversed.
The correct form is:
if element in list_of_elements:
So, changing the function africa results in:
def africa(x):
    if x in africalist:
        return 'African region'
    else:
        return 'Not African region'
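For concreteness, a minimal runnable sketch of the isin approach above, with a hypothetical africalist and df invented for illustration:

import numpy as np
import pandas as pd

africalist = ['Kenya', 'Nigeria']  # hypothetical subset of the full list
df = pd.DataFrame({'countries': ['Kenya', 'France', 'Nigeria']})
df['region'] = np.where(df['countries'].isin(africalist),
                        'African region', 'Not African region')
print(df)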

How to turn a list of a list of dictionaries into a dataframe via loop

I have a list of a list of dictionaries. I managed to access each list element within the outer list and convert the dictionary via pandas into a data frame. I then save the DFs and later concat them. That works perfectly. But I need a loop to do that for big data.
Here is my MWE which works fine in principle.
import pandas as pd
mwe = [
[{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
[{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
[{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]
df0 = pd.DataFrame.from_dict(mwe[0])
df1 = pd.DataFrame.from_dict(mwe[1])
df2 = pd.DataFrame.from_dict(mwe[2])
frames = [df0, df1, df2]
result = pd.concat(frames)
It creates a nice table.
Here is what I tried to create a list of data frames:
for i in range(len(mwe)):
    frame = pd.DataFrame()
    frame = pd.DataFrame.from_dict(mwe[i])
    frames = []          # bug: this re-creates the list on every iteration
    frames.append(frame)
Addendum: Thanks for all the answers; they work on my MWE. This made me notice that there are some strange entries in my dataset. No solution works for it, since I have an inner-list element that contains two dictionaries (due to non-unique data retrieval):
....
[{'name': 'United States Minor Outlying Islands', 'population': 300},
{'name': 'United States of America',
'population': 323947000,
'area': 9629091.0,
'gini': 48.0}],
...
How can I drop the entry for "United States Minor Outlying Islands"?
You could get each dict out of the containing list and just have a list of dicts:
import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
# use x.pop() so that you aren't carrying around copies of the data
# for a "big data" application
df = pd.DataFrame([x.pop() for x in mwe])
df.head()
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
By bringing the list comprehension into the dataframe declaration, that list is temporary, and you don't have to worry about the cleanup. pop will also consume the dictionaries out of mwe, minimizing the number of copies you are carrying around in memory.
As a note, when doing this, mwe will then look like:
mwe
[[], [], []]
because the contents of the sub-lists have been popped out.
EDIT: New Question Content
If your data contains duplicates, or at least entries you don't want, and the undesired entries don't have matching columns to the rest of the dataset (which appears to be the case), it becomes a bit trickier to avoid copying data as above:
mwe.append([{'name': 'United States Minor Outlying Islands', 'population': 300}, {'name': 'United States of America', 'population': 323947000, 'area': 9629091.0, 'gini': 48.0}])

key_check = {}.fromkeys(["name", "population", "area", "gini"])

# the easy way, but copies data
df = pd.DataFrame([item for data in mwe
                   for item in data
                   if item.keys() == key_check.keys()])
Since you'll still have the data hanging around in mwe, it might be better to use a generator:
def get_filtered_data(mwe):
    for data in mwe:
        while data:  # when data is empty, the while loop will end
            item = data.pop()  # still consumes data out of mwe
            if item.keys() == key_check.keys():
                yield item  # will minimize data copying through lazy evaluation

df = pd.DataFrame([x for x in get_filtered_data(mwe)])
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
3 9629091.0 48.0 United States of America 323947000
Again, this is under the assumption that non-desired entries have invalid columns, which appears to be the case here, specifically. Otherwise, this will at least flatten out the data structure so you can filter it with pandas later
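That flatten-then-filter route can also be written directly: build the flat list with a comprehension, let pandas fill missing fields with NaN, and drop incomplete rows afterwards. Unlike the pop-based versions this leaves mwe intact, at the cost of holding a copy (same assumption that unwanted entries are exactly the incomplete ones):

flat = [item for sub in mwe for item in sub]  # flatten the list of lists
df = pd.DataFrame(flat).dropna()              # rows with missing fields get NaNs and are dropped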
Create an empty DataFrame and loop over the list, calling df.append on each iteration (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the modern replacement):
>>> import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
>>> df = pd.DataFrame()
>>> for country in mwe:
...     df = df.append(country)
...
>>> df
area gini name population
0 323802.0 25.8 Norway 5223256
0 41284.0 33.7 Switzerland 8341600
0 7692024.0 30.5 Australia 24117360
Try this:
df = pd.DataFrame(columns=['name', 'population', 'area', 'gini'])
for i in range(len(mwe)):
    df.loc[i] = list(mwe[i][0].values())
Output:
          name  population       area  gini
0       Norway     5223256   323802.0  25.8
1  Switzerland     8341600    41284.0  33.7
2    Australia    24117360  7692024.0  30.5

Populate new column based on conditions in another column

I'm playing about with Python and pandas.
I have created a dataframe with a column (axis 1) called 'County', but I need to create a column called 'Region' and populate it like this (at least I think so):
If the County column == 'Suffolk' or 'Norfolk' or 'Essex', then insert 'East Anglia' in the Region column.
If the County column == 'Kent' or 'East Sussex' or 'West Sussex', then insert 'South East' in the Region column.
If the County column == 'Dorset' or 'Devon' or 'Cornwall', then insert 'South West' in the Region column.
and so on...
So far I have this:
myDataFrame['Region'] = np.where(myDataFrame['County']=='Suffolk', 'East Anglia', '')
But I suspect this won't work for any other counties.
As I'm sure is obvious, I am a beginner. I have tried googling and reading but could only find out about numpy's where, which got me this far.
You'll definitely need df.isin and loc-based indexing:
import numpy as np

df['Region'] = np.nan
df.loc[df.County.isin(['Suffolk', 'Norfolk', 'Essex']), 'Region'] = 'East Anglia'
df.loc[df.County.isin(['Kent', 'East Sussex', 'West Sussex']), 'Region'] = 'South East'
df.loc[df.County.isin(['Dorset', 'Devon', 'Cornwall']), 'Region'] = 'South West'
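If the number of regions grows, np.select keeps the conditions and choices side by side; this is equivalent to the chained .loc assignments above (the 'Other' default is my addition for counties not covered by any list):

conditions = [df.County.isin(['Suffolk', 'Norfolk', 'Essex']),
              df.County.isin(['Kent', 'East Sussex', 'West Sussex']),
              df.County.isin(['Dorset', 'Devon', 'Cornwall'])]
choices = ['East Anglia', 'South East', 'South West']
df['Region'] = np.select(conditions, choices, default='Other')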
You could also create a mapping of sorts and use df.map or df.replace:
mapping = { 'Suffolk' : 'East Anglia', 'Norfolk': 'East Anglia', ... 'Kent' :'South East', ..., ... }
df['Region'] = df.County.map(mapping)
I would prefer the map here because it converts non-matches to NaN, which is ideal.
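If the county lists get long, the mapping can be built from a region-to-counties dict rather than written out county by county (a sketch with the same three regions):

regions = {'East Anglia': ['Suffolk', 'Norfolk', 'Essex'],
           'South East': ['Kent', 'East Sussex', 'West Sussex'],
           'South West': ['Dorset', 'Devon', 'Cornwall']}
mapping = {county: region for region, counties in regions.items() for county in counties}
df['Region'] = df.County.map(mapping)  # counties not in the dict become NaN, as noted above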
