How to apply iloc in a Dataframe depending on a column value - python

I have a DataFrame with the following columns:
"Country Name"
"Indicator Code"
"Porcentaje de Registros" (as shown in the image). For each country there are 32 indicator codes, each with its percentage value.
The values are sorted in descending order, and I need to keep the 15 highest values for each country. For example, for Ecuador I need to know which 15 indicators have the highest values. I was trying the following:
countries = gender['Country Name'].drop_duplicates().to_list()
for countries in countries:
    test = RelevantFeaturesByID[RelevantFeaturesByID['Country Name']==countries].set_index(["Country Name", "Indicator Code"]).iloc[0:15]
test
But it just returns the first 15 rows for one country.
What am I doing wrong?

There is a mistake in the loop statement for countries in countries: the loop variable reuses the name of the list, so that name is rebound to a single country on every iteration. That is certainly a problem. You also overwrite test on every iteration, so only the result for the last country survives.
I am not sure I fully understood your aim, but this should be a good starting point:
# sorting with respect to countries and their percentage
df = df.sort_values(by=[df.columns[0],df.columns[-1]],ascending=[True,False])
# choosing unique values of country names
countries = df[df.columns[0]].unique()
test = []
for country in countries:
    test.append( df.loc[df["Country Name"]==country].iloc[0:15] )
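The per-country loop above can also be written without an explicit loop. A minimal sketch, assuming the column names from the question and a small made-up sample (the data and the top_n value are illustrative, not from the original post): sort by percentage, then take the first rows of each country group with groupby(...).head().

```python
import pandas as pd

# Made-up sample data; the column names follow the question.
df = pd.DataFrame({
    "Country Name": ["Ecuador", "Ecuador", "Ecuador", "Peru", "Peru", "Peru"],
    "Indicator Code": ["A", "B", "C", "A", "B", "C"],
    "Porcentaje de Registros": [0.2, 0.9, 0.5, 0.7, 0.1, 0.4],
})

top_n = 2  # the question needs 15; 2 keeps the sample small

# Sort so the highest percentages come first within each country,
# then keep the first top_n rows of every country group.
top = (
    df.sort_values(["Country Name", "Porcentaje de Registros"],
                   ascending=[True, False])
      .groupby("Country Name")
      .head(top_n)
)
print(top)
```

The result is a single DataFrame rather than a list of per-country frames, which is usually easier to filter or export afterwards.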

Related

Summing each year for each grouped in grouped series

I have a dataset of protected areas that I have grouped by each country's ISO3 code. What I want to do is count the protected areas for each country by the year they were established (STATUS_YR). The goal is one DataFrame that gives, for each country and each year, the number of protected areas established in that year. My code doesn't work and I can't seem to get the syntax right.
I think I need a for loop that uses the ISO3 code as a key, takes the number of instances of each year and sums them. I've tried the len() function but it didn't work.
Code:
protected_areas = pd.DataFrame()
columns = ["STATUS_YR"]
for key, group in wdpa4_grouped:
    column = len(group[columns])
    column["ISO3"] = key
    row = column.to_frame().transpose()
    protected_areas = pd.concat([protected_areas, row], ignore_index=True)
Group by country and, within each country, by year, then count the rows in each group:
wdpa_grouped = wdpa4.groupby(['ISO3', 'STATUS_YR'])['STATUS_YR'].count()
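To make the grouping concrete, here is a small sketch on made-up data (the rows and the n_areas column name are assumptions for illustration). It counts rows per (ISO3, STATUS_YR) pair and then, optionally, reshapes the counts into one row per country and one column per year.

```python
import pandas as pd

# Made-up sample: one row per protected area.
wdpa4 = pd.DataFrame({
    "ISO3": ["ECU", "ECU", "ECU", "PER", "PER"],
    "STATUS_YR": [1990, 1990, 2000, 1990, 2005],
})

# Long format: one row per country and year with the number of areas.
counts = (
    wdpa4.groupby(["ISO3", "STATUS_YR"])
         .size()
         .reset_index(name="n_areas")  # "n_areas" is an illustrative name
)

# Wide format: countries as rows, years as columns, 0 where nothing was established.
wide = (
    counts.pivot(index="ISO3", columns="STATUS_YR", values="n_areas")
          .fillna(0)
          .astype(int)
)
print(counts)
print(wide)
```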

Search for variable name using iloc function in pandas dataframe

I have a pandas DataFrame that consists of 5000 rows with different countries and emission data, and looks like the following:
country  year  emissions
peru     2020  1000
         2019  900
         2018  800
The country label is an index.
E.g. df = emission.loc[['peru']]
would give me a new DataFrame containing only the emission data for peru.
My goal is to use a variable instead of the literal 'peru' and store the country-specific emission data in a new DataFrame.
What I am looking for is code that works the way I intend the following to work:
country = 'zanzibar'
df = emissions.loc[[{country}]]
From what I can tell, the problem arises with the iloc function, which does not accept variables as input. Is there a way to circumvent this problem?
In other words, I want to be able to create a new DataFrame with country-specific emission data, based on a variable that matches one of the countries in my emission.index(), all without having to change anything but the given variable.
One way might be to iterate, or maybe to write a function of some kind?
Thank you in advance for any help.
An alternative approach, where you don't use the country name as your index:
emissions = pd.DataFrame({'Country' : ['Peru', 'Peru', 'Peru', 'Chile', 'Chile', 'Chile'], "Year" : [2021,2020,2019,2021,2020,2019], 'Emissions' : [100,200,400,300,200,100]})
country = 'Peru'
Then to filter:
df = emissions[emissions.Country == country]
or
df = emissions.loc[emissions.Country == country]
Giving:
Country Year Emissions
0 Peru 2021 100
1 Peru 2020 200
2 Peru 2019 400
You should be able to select rows by a string stored in a variable when it matches your index. Note that {country} builds a Python set, which recent pandas versions reject as an indexer, so pass the variable in a list instead. For example:
df = pd.DataFrame({'a':[1,2,3,4]}, index=['Peru','Peru','zanzibar','zanzibar'])
country = 'zanzibar'
df.loc[[country]]
This will return:
          a
zanzibar  3
zanzibar  4
In your case, removing the curly braces should work:
country = 'zanzibar'
df = emissions.loc[[country]]
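As a small sketch of the same idea wrapped in a function (the helper name emissions_for and the sample data are made up for illustration), .loc accepts a variable just like a literal, and passing it in a list guarantees that a DataFrame comes back even when only one row matches:

```python
import pandas as pd

# Made-up sample with the country as the index, as in the question.
emissions = pd.DataFrame(
    {"year": [2020, 2019, 2020], "emissions": [1000, 900, 50]},
    index=pd.Index(["peru", "peru", "zanzibar"], name="country"),
)

def emissions_for(frame, country):
    """Return the rows of `frame` whose index label equals `country`."""
    # The list [country] keeps the result a DataFrame, even for a single row.
    return frame.loc[[country]]

peru = emissions_for(emissions, "peru")
zanzibar = emissions_for(emissions, "zanzibar")
print(peru)
print(zanzibar)
```

Only the value passed as country changes between calls; a label that is not in the index raises a KeyError.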
I am not sure this is exactly what you are asking for, but here is a way to turn each country name into a variable.
Because a variable name cannot contain a space (" "), you first have to replace spaces with underscores ("_").
(This only matters if some of your 'country' values consist of more than one word.)
Example:
United Kingdom becomes United_Kingdom
using this code:
df['country'] = df['country'].replace(' ', '_', regex=True)
Once the country names are in the new format, you can collect all of them from the DataFrame with .unique() and store them in a new variable:
country_name = df['country'].unique()
After running that line, all the unique values of the 'country' column are stored in an array called 'country_name'.
Next, use a for loop to create a new variable for each country name:
for i in country_name:
    locals()[i] = df[df['country'] == i]
Here, locals() returns the dictionary of the current namespace, so assigning to locals()[i] creates a variable whose name is the country string, and df[df['country'] == i] subsets the DataFrame to the rows for that country. Note that this only works reliably at module level; inside a function, assignments through locals() are not guaranteed to create local variables.
After the loop, there is a variable for each country name in the 'country' column.
Hopefully this can help to solve your problem.
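A commonly preferred alternative to creating variables through locals() is to keep the per-country frames in a dictionary. The sketch below uses made-up data; the dictionary name by_country is an assumption for illustration. Because the keys are plain strings, there is no need to replace spaces with underscores.

```python
import pandas as pd

# Made-up sample data with a multi-word country name.
df = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "Peru"],
    "emissions": [10, 20, 30],
})

# One DataFrame per country, keyed by the country name.
by_country = {name: group for name, group in df.groupby("country")}

uk = by_country["United Kingdom"]
print(uk)
```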

Python: How can I filter columns by strings without having to enter each string individually?

I need to analyze a dataset of enterprises from more than 80 industries, industry by industry. Specifically, I need a for loop or a function that performs the following step for all industries, so that the code stays short:
HighTech = data.loc[data['MacrIndustry'] == "High Technology", ['Value']]
Preferably, I would like to split the enterprises by industry, with each industry in a separate DataFrame holding its values.
Use DataFrame.groupby. The following will get you a dictionary whose keys are all the MacrIndustry unique values, and the values are the Value column (as a DataFrame) of the corresponding industry group.
groups = {industry: df[['Value']] for industry, df in data.groupby('MacrIndustry')}
# or just (less readable)
# groups = dict(iter(data.groupby('MacrIndustry')[['Value']]))
Following your example, HighTech = groups['High Technology'].

How to find the biggest value in the pandas dataset

I have a dataset that shows each disease's share of total diseases.
I want to find the countries in which the share of AIDS is larger than that of any other disease.
Try:
df.index[df.AIDS.eq(df.drop(columns='Total').max(axis=1))]
Have a look at the pandas max function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html).
With it you can get the maximum of each row:
most_frequent_disease = df.drop(columns=['Total', 'Other']).max(axis=1)
Then you can create a condition to check whether AIDS is the most frequent disease, and apply it to your DataFrame:
is_aids_most_frequent_disease = df.loc[:, 'AIDS'].eq(most_frequent_disease)
df[is_aids_most_frequent_disease]
You can also get the country names by appending .index to that expression.
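A variant of the same check uses idxmax, which returns the name of the column holding each row's maximum. This is a sketch on made-up data: the country labels and the Malaria column are assumptions, and the country is assumed to be the index, as in the answers above.

```python
import pandas as pd

# Made-up sample: shares per disease, countries as the index.
df = pd.DataFrame(
    {"AIDS": [60, 10], "Malaria": [40, 90], "Total": [100, 100]},
    index=["CountryA", "CountryB"],
)

# Exclude the total so it cannot win the comparison.
diseases = df.drop(columns="Total")

# For every row, the name of the disease with the largest share.
top_disease = diseases.idxmax(axis=1)

# Countries where that disease is AIDS.
aids_countries = df.loc[top_disease == "AIDS"].index
print(list(aids_countries))
```

On ties, idxmax reports the first column that reaches the maximum, so the result can differ from the eq(max) comparison when two diseases share the top value.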

DF analyse with Pandas in Python filter data

Hey Stack Overflow users,
I have the following problem. I have a table with information about the incidence values of individual countries.
I want to display the data in such a way that I can compare the incidence values of the USA with those of Germany, for example.
My problem is that the incidence values are accumulated. How can I filter out only the values of the USA and Germany from the column day = 14?
As a result I want to see only the 14-day values in the respective rows, so that I can compare the incidence values over time.
You can try:
# If the names are exactly 'USA' and 'Germany'
m=(df['day'].isin([7,14,21,28])) & (df['countriesAndTerritories'].isin(['USA','Germany']))
# OR, if the names use irregular case (some uppercase, some lowercase)
m=(df['day'].isin([7,14,21,28])) & (df['countriesAndTerritories'].str.contains('USA|Germany',case=False))
Finally:
df[m]
# OR
df.loc[m]
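Here is the same mask applied to a small made-up table, restricted to day 14 as the question asks. The incidence column name and all values are assumptions for illustration; only day and countriesAndTerritories come from the post.

```python
import pandas as pd

# Made-up sample; "incidence" is a hypothetical column name.
df = pd.DataFrame({
    "countriesAndTerritories": ["USA", "Germany", "France", "USA", "Germany"],
    "day": [14, 14, 14, 7, 21],
    "incidence": [120.0, 80.0, 60.0, 90.0, 100.0],
})

# Rows must match both conditions: the right day and one of the two countries.
mask = df["day"].eq(14) & df["countriesAndTerritories"].isin(["USA", "Germany"])
subset = df.loc[mask]
print(subset)
```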
