pandas operations inside a for-loop - python

Here is a sample of my data
threats =
binomial_name
continent threat_type
Africa Agriculture & Aquaculture 143
Biological Resource Use 102
Climate Change 3
Commercial Development 36
Energy Production & Mining 30
... ... ...
South America Human Intrusions 1
Invasive Species 3
Natural System Modifications 1
Transportation Corridor 2
Unknown 38
I want to use a for loop and obtain and append together the top 5 values of each continent into a data frame.
Here is my code -
continents = threats.continent.unique()
for i in continents:
continen = (threats
.query('continent == i')
.groupby(['continent','threat_type'])
.sort_values(by=('binomial_name'), ascending=False).
.head())
top5 = appended_data.append(continen)
I am however getting the error - KeyError: 'i'
Where am I going wrong?

So, the canonical way to do this:
df.groupby('continent', as_index=False).apply(
lambda grp: grp.nlargest(5, 'binomial_value'))
If you want to do this in a loop, replace this part:
for i in continents:
continen = threats[threats['continent'] == i].nlargest(2, 'binomial_name')
appended_data.append(continen)

Related

Multi-criteria pandas dataframe exceptions reporting

Given the following pandas df -
Holding Account
Entity ID
Holding Account Number
% Ownership
Entity ID %
Account # %
Ownership Audit Note
11 West Summit Drive LLC (80008660955)
3423435
54353453454
100
100
100
NaN
110 Goodwill LLC (91928475)
7653453
65464565
50
50
50
Partial Ownership [50.00%]
1110 Webbers St LLC (14219739)
1235734
12343535
100
100
100
NaN
120 Goodwill LLC (30271633)
9572953
96839592
55
55
55
Inactive Client [10.00%]
Objective - I am trying to create an Exceptions Report and only inc. those rows based on the following logic:
% Ownership =! 100% OR
(Ownership Audit Note == "-") & (Account # % OR Entity ID % ==100%)
Attempt - I am able to produce components, which make up my required logic, however can't seem to bring them together:
# This gets me rows which meet 1.
df = df[df['% Ownership'].eq(100)==False]
# Something 'like' this would get me 2.
df = df[df['Ownership Audit Note'] == "-"] & df[df['Account # %'|'Entity ID %'] == "None"]
I am looking for some hints/tips to help me bring all this together in the most pythonic way.
Use:
df = df[df['% Ownership'].ne(100) | (df['Ownership Audit Note'].eq("-") & (df['Account # %'].eq(100) | df['Entity ID %'].eq(100)))]

Create new column based on value of another column

I have a solution below to give me a new column as a universal identifier, but what if there is additional data in the NAME column, how can I tweak the below to account for a wildcard like search term?
I want to basically have so if German/german or Mexican/mexican is in that row value then to give me Euro or South American value in new col
df["Identifier"] = (df["NAME"].str.lower().replace(
to_replace = ['german', 'mexican'],
value = ['Euro', 'South American']
))
print(df)
NAME Identifier
0 German Euro
1 german Euro
2 Mexican South American
3 mexican South American
Desired output
NAME Identifier
0 1990 German Euro
1 german 1998 Euro
2 country Mexican South American
3 mexican city 2006 South American
Based on an answer in this post:
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
Another approach would be using np.where with those two conditions, but probably there is a more ellegant solution.
below code will work. i tried it using apply function but somehow can't able to get it. probably in sometime. meanwhile workable code below
df3['identifier']=''
js_ref=[{'german':'Euro'},{'mexican':'South American'}]
for i in range(len(df3)):
for l in js_ref:
for k,v in l.items():
if k.lower() in df3.name[i].lower():
df3.identifier[i]=v
break

Pandas Data science Get mean age for the billionaires in top 2 industries

I have a Forbes billionaire data set as below
and the first step is to find top 3 industries which has most billionaires. Which can easily be obtained from value_counts as below:
a = forbes_billionaire["Industries"].value_counts().head(3)
Output
Finance & Investments 296
Technology 237
Fashion & Retail 206
But the next question is somewhat tricky.
Compare the mean of net worth of billionaires from these top 3 industries
Can someone help?
df['Net Worth'].groupby(df['Industires']).mean().sort_values(ascending = False).head(3)

How to construct new DataFrame based on data from for loops?

I have a data set (datacomplete2), where I have data for each country for two different years. I want to calculate the difference between these years for each country (for values life, health, and lifegdp) and create a new data frame with the results.
The code:
for i in datacomplete2['Country'].unique():
life.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'life'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'life'])
health.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'health'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'health'])
lifegdp.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'lifegdp'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'lifegdp'])
newData = pd.DataFrame([life, health, lifegdp, datacomplete2['Country'].unique()], columns = ['life', 'health', 'lifegdp', 'country'])
newData
I think the for loop for calculating is correct, and the problem is in creating the new DataFrame. When I try to run the code, I get an error message: 4 columns passed, passed data had 210 columns.
I have 210 countries so I assume it somehow throws these values to the columns?
Here is also a link to a sneak peek of the data I'm using: https://i.imgur.com/jbGFPpk.png
The data as text would look like:
Country Code Year life health lifegdp
0 Algeria DZA 2000 70.292000 3.489033 20.146558
1 Algeria DZA 2016 76.078000 6.603844 11.520259
2 Angola AGO 2000 47.113000 1.908599 24.684593
3 Angola AGO 2016 61.547000 2.713149 22.684710
4 Antigua and Barbuda ATG 2000 73.541000 4.480701 16.412834
... ... ... ... ... ... ...
415 Vietnam VNM 2016 76.253000 5.659194 13.474181
416 World OWID_WRL 2000 67.684998 8.617628 7.854249
417 World OWID_WRL 2016 72.035337 9.978453 7.219088
418 Zambia ZMB 2000 44.702000 7.152371 6.249955
419 Zambia ZMB 2016 61.874000 4.477207 13.819775
Quick help required !!!
I started coding like two weeks ago so I'm very novice with this stuff.
Anurag Reddy's answer is a good concise solution if you know the dates in advance. To present an alternative and slightly more general answer - this problem is a good example use case for pandas.DataFrame.diff.
Note you don't actually need to sort the data in your example data but I've included a sort_values() line below to account for unsorted DataFrames.
import pandas as pd
# Read the raw datafile in
df = pd.read_csv("example.csv")
# Sort the data if required
df.sort_values(by=["Country"], inplace=True)
# Remove columns where you don't need the difference
new_df = df.drop(["Code", "Year"], axis=1)
# Group the data by country, take the difference between the rows, remove NaN rows, and reset the index to sequential integers
new_df = new_df.groupby(["Country"], as_index=False).diff().dropna().reset_index(drop=True)
# Add back the country names and codes as columns in the new DataFrame
new_df.insert(loc=0, column="Country", value=df["Country"].unique())
new_df.insert(loc=1, column="Code", value=df["Code"].unique())
You could do this instead
country_list = df.Country.unique().tolist()
df.drop(columns = ['Code'])
df_2016 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2016)].reset_index()
df_2000 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2000)].reset_index()
df_2016.drop(columns=['Year'])
df_2000.drop(columns=['Year'])
df_2016.set_index('Country').subtract(df_2000.set_index('Country'), fill_value=0)

Rank country by crop using pandas DataFrame

My DataFrame looks like this:
,Area,Item,Year,Unit,Value
524473,Ecuador,Sesame,2018,tonnes,16.0
524602,Ecuador,Sorghum,2018,tonnes,14988.0
524776,Ecuador,Soybeans,2018,tonnes,25504.0
524907,Ecuador,Spices nes,2018,tonnes,746.0
525021,Ecuador,Strawberries,2018,tonnes,1450.0
525195,Ecuador,Sugar beet,2018,tonnes,4636.0
525369,Ecuador,Sugar cane,2018,tonnes,7502251.0
...
1075710,Mexico,Tomatoes,2018,tonnes,4559375.0
1075865,Mexico,Triticale,2018,tonnes,25403.0
1076039,Mexico,Vanilla,2018,tonnes,495.0
1076213,Mexico,"Vegetables, fresh nes",2018,tonnes,901706.0
1076315,Mexico,"Vegetables, leguminous nes",2018,tonnes,75232.0
1076469,Mexico,Vetches,2018,tonnes,93966.0
1076643,Mexico,"Walnuts, with shell",2018,tonnes,159535.0
1076817,Mexico,Watermelons,2018,tonnes,1472459.0
1076991,Mexico,Wheat,2018,tonnes,2943445.0
1077134,Mexico,Yautia (cocoyam),2018,tonnes,38330.0
1077308,Mexico,Cereals (Rice Milled Eqv),2018,tonnes,35974485.0
In DataFrame there are all countries of the world and all agriculture products.
That's what i want to do:
Choose country, for example France.
Find the place of France in the world ranking for the production of a particular crop.
And so on all crops.
France ranks 1 in the world in oats production.
France ranks 2 in the world in cucumber production.
France ranks 2 in the world in rye production.
France ranks .... and so on on each product if France produces it.
I started with
df = df.loc[df.groupby('Item')['Value'].idxmax()]
but I need not only first place, but the second, third, fourth.... Help me please.
I am very new in pandas.
You can assign a rank column:
df['rank'] = df.groupby('Item')['Value'].rank(ascending=False)
and then extract information for a country with:
df[df['Area']=='France']
Check with rank
s = df.groupby('Item')['Value'].rank(ascending = False)
Then
d = { x : y for x , y in df.groupby(s)}
d[1] # output put rank one

Categories

Resources