Summing each year for each group in a grouped series - python

I have a dataset of protected areas that I have grouped by each country's ISO3 code. What I want to do is sum the number of protected areas for each country by the year they were established (STATUS_YR). The goal is to get one dataframe that, for each country, gives the number of protected areas established in each year. My code doesn't work and I can't seem to get the syntax right.
I think I need a for loop that uses the ISO3 as a key, takes the number of instances of each year, and sums them; I've tried the len() function but it didn't work.
Code:
protected_areas = pd.DataFrame()
columns = ["STATUS_YR"]
for key, group in wdpa4_grouped:
    column = len(group[columns])
    column["ISO3"] = key
    row = column.to_frame().transpose()
    protected_areas = pd.concat([protected_areas, row], ignore_index=True)

wdpa_grouped = wdpa4.groupby(['ISO3', 'STATUS_YR'])['STATUS_YR'].count()
Group by country and, within each country, by year, then count the number of rows for each year.
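If you want the result as a single dataframe with one row per country and one column per year, here is a minimal sketch, assuming wdpa4 has the ISO3 and STATUS_YR columns described above:
counts = (
    wdpa4.groupby(['ISO3', 'STATUS_YR'])
    .size()                 # number of protected areas per (country, year)
    .unstack(fill_value=0)  # one row per ISO3, one column per STATUS_YR
    .reset_index()
)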

How to get the index name after regrouping for certain maximum value of another column

I have a dataframe containing election data from four different years. The "Votes" column contains the total votes a party got in different constituencies in each year. I need to find the winning party (the party that got the maximum total votes) for each year. I have grouped the data using "Election Year" and "Party". Now how can I get the Election Year and Party for the above case?
df1 = df.groupby(['Election Year', 'Party']).sum()
print(df1.loc[df1['Votes'].idxmax()])
The above code is not giving the expected result.
I have attached the dataframe after using groupby.
How can I get the expected result? Any suggestions are appreciated.
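For the record, df1['Votes'].idxmax() returns the index of the single largest value over the whole frame, not one per year. A minimal sketch of the per-year winner, assuming the column names from the question:
df1 = df.groupby(['Election Year', 'Party'])['Votes'].sum().reset_index()
# index of the max-Votes row within each election year
winners = df1.loc[df1.groupby('Election Year')['Votes'].idxmax()]
print(winners)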

Create a new column comparing two rows

I am working on a COVID-19 dataset with total cases and total deaths on the last day of each month for each city since March. I would like to create a column which tells me the number of new cases for every city in each of these months.
My logic is: if the value in the 'city_ibge_code' column at position p is the same as the value at position p-1, the new column should be the difference between the number of cases in the two months. If the values are different (which means they are different cities), just pass the value through to the new column.
casos_full is the dataframe with the cities and the number of cases and deaths in March, April, May, June, July, August and September.
city_ibge_code is the code for each city in the dataframe; each city has a unique code.
There is also a "date" column, which represents the last day of the month.
for rows in casos_full:
    if rows['city_ibge_code'] == rows['city_ibge_code'].shift(1):
        rows['New Cases'] = rows['last_available_confirmed'] - rows['last_available_confirmed'].shift(1)
    else:
        rows['New Cases'] = rows['last_available_confirmed']
rows here is a view of the row, not the dataframe itself. You need to update the actual dataframe. If I understood your problem correctly, something like this should work:
prev_code = None
prev_confirmed = None
for i, row in casos_full.iterrows():
    if row['city_ibge_code'] == prev_code:
        casos_full.loc[i, 'New Cases'] = row['last_available_confirmed'] - prev_confirmed
    else:
        casos_full.loc[i, 'New Cases'] = row['last_available_confirmed']
    prev_code = row['city_ibge_code']
    prev_confirmed = row['last_available_confirmed']
Please give more details about your problem so we can help.
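A vectorized sketch of the same logic, assuming the column names above: groupby().diff() yields NaN on the first row of each city, which fillna then replaces with the raw confirmed count, just like the else branch.
casos_full = casos_full.sort_values(['city_ibge_code', 'date'])
casos_full['New Cases'] = (
    casos_full.groupby('city_ibge_code')['last_available_confirmed']
    .diff()
    .fillna(casos_full['last_available_confirmed'])
)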

How to apply iloc in a Dataframe depending on a column value

I have a Dataframe with the follow columns:
"Country Name"
"Indicator Code"
"Porcentaje de Registros" (as it is show in the image) for each country there are 32 indicator codes with its percentage value.
The values are order in an descending way, and I need to keep the 15th highest values for each country, that means for example for Ecuador I need to know which ones are the 15th indicators with highest value. I was trying the following:
countries = gender['Country Name'].drop_duplicates().to_list()
for countries in countries:
    test = RelevantFeaturesByID[RelevantFeaturesByID['Country Name']==countries].set_index(["Country Name", "Indicator Code"]).iloc[0:15]
test
But it just returns the first 15 rows for one country.
What am I doing wrong?
There is a naming problem in the loop statement for countries in countries: the loop variable shadows the list you are iterating over, which is certainly a problem. You also overwrite test on every iteration, so only the last country's rows survive.
I am not sure whether I understood your aim correctly, but this seems like a good starting point:
# sorting with respect to countries and their percentage
df = df.sort_values(by=[df.columns[0],df.columns[-1]],ascending=[True,False])
# choosing unique values of country names
countries = df[df.columns[0]].unique()
test = []
for country in countries:
    test.append(df.loc[df["Country Name"] == country].iloc[0:15])
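The same idea fits in one chained expression with groupby().head(); a minimal sketch, assuming the column names from the question:
top15 = df.sort_values(['Country Name', 'Porcentaje de Registros'],
                       ascending=[True, False]).groupby('Country Name').head(15)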

Group column data into Week in Python

I have 4 columns: Date, Account #, Quantity and Sale. I have daily data, but I want to be able to show weekly Sales and Quantity per customer.
I have been able to group the column by week, but I also want to group it by OracleNumber and sum the Quantity and Sale columns. How would I get that to work without messing up the week format?
import pandas as pd
names = ['Date','OracleNumber','Quantity','Sale']
sales = pd.read_csv("CustomerSalesNVG.csv",names=names)
sales['Date'] = pd.to_datetime(sales['Date'])
grouped=sales.groupby(sales['Date'].map(lambda x:x.week))
print(grouped.head())
IIUC, you could group by both the week and the OracleNumber column by passing a list of keys to groupby, and then perform the sum:
sales.groupby([sales['Date'].dt.week, 'OracleNumber']).sum()
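Note that Series.dt.week is deprecated since pandas 1.1 and removed in 2.0; an equivalent sketch using isocalendar(), assuming the same frame:
weekly = sales.groupby([sales['Date'].dt.isocalendar().week, 'OracleNumber'])[['Quantity', 'Sale']].sum()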

Pandas - Finding Unique Entries in Daily Census Data

I have census data that looks like this for a full month and I want to find out how many unique inmates there were for the month. The information is taken daily so there are multiples.
_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33
My method now is to group them by day and then add the ones that are not yet accounted for into the DataFrame. My question is how to account for two people with the same info: wouldn't the second one be skipped because an identical row already exists? I'm trying to figure out how many people in total were in the prison during this time.
_id is incremental; for example, here is some data from the second day:
2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39
link to the dataset here: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census
You could use df.drop_duplicates(), which returns the DataFrame with only unique rows, and then count the entries.
Something like this should work:
import pandas as pd
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)
Result:
>> 11845
Pandas drop_duplicates Documentation
Inmates June 2016 CSV
The problem with this approach / data is that there could be many individual inmates with the same age / gender / race who would be filtered out.
I think the trick here is to groupby as much as possible and check the differences in those (small) groups through the month:
inmates = pd.read_csv('inmates.csv')
# group by everything except _id and count number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()
# pivot the dates out and transpose - this give us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)
# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()
# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]
# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)
# sum total column
diffed['total'].sum() # 3393
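The same computation can be condensed with pivot_table and clip; a sketch, assuming the same column names:
counts = inmates.pivot_table(index='Date',
                             columns=['Gender', 'Race', 'Age at Booking', 'Current Age'],
                             aggfunc='size', fill_value=0)
arrivals = counts.diff()
arrivals.iloc[0] = counts.iloc[0]           # day one: everyone counts as present
total = arrivals.clip(lower=0).sum().sum()  # sum only the positive changes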
