Python DataFrame: How to continue grouping by after several operations on a DataFrame - python

I have a dataframe with states, counties and population statistics with the below columns:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP
With the line below I am grouping the dataframe and sorting the county populations within each state:
sorted_df = temp_df.groupby(['STNAME']).apply(lambda x: x.sort_values(['CENSUS2010POP'], ascending = False))
After the sorting I want to only keep the 3 largest counties population-wise
largestcty = sorted_df.groupby(['STNAME'])["CENSUS2010POP"].nlargest(3)
As the next step I would like to sum those values with the command below:
top3sum = largestcty.groupby(['STNAME']).sum()
But the problem now is that the key 'STNAME' is not in the Series after the groupby. My question is: how do I preserve the keys of the original DataFrame in the Series?
So after applying the answer I have top3sum as a dataframe
top3sum = pd.DataFrame(largestcty.groupby(['STNAME'])['STNAME','CENSUS2010POP'].sum(), columns=['CENSUS2010POP'])
top3sum[:8]
>>>
STNAME CENSUS2010POP
Alabama 1406269
Alaska 478402
Arizona 5173150
Arkansas 807152
California 15924150
Colorado 1794424
Connecticut 2673320
Delaware 897934
This is what the top3sum data looks like, and then I run:
cnty = top3sum['CENSUS2010POP'].idxmax()
and cnty is 'California'.
But then, trying to use cnty with top3sum['STNAME'], I receive a KeyError.

Your issue is that after the second grouping you only select the CENSUS2010POP column and pick its three largest values.
Please note that you don't need to sort in advance before applying nlargest, so the first command is unnecessary. But if you do sort, you can simply pick the first 3 rows of each sorted group:
largestcty = temp_df.groupby(['STNAME']).apply(lambda x: x.sort_values(['CENSUS2010POP'], ascending=False).head(3))
Then you need to adapt the sum command to select the desired column:
top3sum = largestcty.groupby(['STNAME'])['CENSUS2010POP'].sum()
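For reference, a minimal end-to-end sketch of the whole pipeline; the toy data below is invented for illustration, and only the column names STNAME, CTYNAME and CENSUS2010POP come from the question:
import pandas as pd

temp_df = pd.DataFrame({
    'STNAME': ['Alabama'] * 4 + ['Alaska'] * 4,
    'CTYNAME': ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8'],
    'CENSUS2010POP': [100, 400, 300, 200, 50, 80, 20, 10],
})

# three largest county populations per state, summed; no pre-sorting needed
top3sum = temp_df.groupby('STNAME')['CENSUS2010POP'].apply(lambda s: s.nlargest(3).sum())

# STNAME is now the index of the result, not a column, which is why
# top3sum['STNAME'] raises a KeyError; idxmax() already returns the state name
print(top3sum.idxmax())   # 'Alabama' on this toy data (900 vs. 150)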

Related

Trouble creating a subset dataframe [duplicate]

I am trying to better understand Python and why I am receiving an error.
I have a dataframe with country names and I want to filter the dataset to only show those that have no duplicates. I entered:
df[df['Country Name'].value_counts() == 1]
However, I get an error:
Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
It seems that df['Country Name'].value_counts() == 1 creates a Series that also shows the country names along with the booleans, and not simply the booleans as I was expecting.
Also, I tried filtering only on one country, i.e., df[df['Country Name'] == 'United States'] and this works perfectly fine.
I'm just trying to understand why in one scenario it works and in the other it doesn't. I do notice in the latter that there is an index starting at 0, so perhaps that is why. I just don't understand why I don't receive this same index in the previous example.
Can somebody help with an explanation?
Your solution doesn't work because the boolean Series produced by value_counts() is shorter than the original dataframe; the indexer and the dataframe have to be the same length (and share the same index) so pandas can filter row by row on the boolean values.
Also, I'm pretty sure you're actually looking for pandas.DataFrame.drop_duplicates:
df.drop_duplicates(subset=['Country Name'], keep=False)
It literally drops duplicate values; in this case you drop by 'Country Name', and you want to keep neither the first nor the last occurrence (the other options for keep), hence keep=False.
See the pandas documentation for DataFrame.drop_duplicates for details.
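As a quick, self-contained illustration of that call (the toy data below is invented just to show the effect of keep=False):
import pandas as pd

df = pd.DataFrame({'Country Name': ['France', 'Spain', 'France', 'India']})

# keep=False drops every row whose 'Country Name' appears more than once
print(df.drop_duplicates(subset=['Country Name'], keep=False))
#   Country Name
# 1        Spain
# 3        India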
Here's an explanation.
For the country-name filter, you wrote:
df[df['Country Name'] == 'United States']
Let's split this,
df['Country Name'] == 'United States'
gives you a boolean Series with the same length as the original dataframe.
Now, when you do
df[df['Country Name'] == 'United States']
you'll get the dataframe containing only 'United States' rows, because pandas aligns the boolean Series with the dataframe and returns the rows where it is True.
Now for value_counts:
df[df['Country Name'].value_counts() == 1]
split this,
df['Country Name'].value_counts() == 1
will return a boolean Series indexed by the unique country names, True where a country's count is 1. If you check its length, it doesn't match the original length of the df.
Now, when you try to subset the dataframe, you get the error you're getting.
The solution: drop the countries appearing more than once, as Pablo mentioned in his answer (I haven't tried it; mind the keep=False). Or try the below.
If you want the rows with countries that appear only once, you can try the map way:
df[df['Country Name'].map(df['Country Name'].value_counts()) == 1]
This will return the dataframe with countries that appear exactly once; map broadcasts the per-country counts back onto every row, so the boolean Series has the same index as df.
Or
df[df['Country Name'].isin(df['Country Name'].value_counts()[df['Country Name'].value_counts()==1].index)]
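To make the alignment point concrete, here is a tiny sketch with invented data comparing the lengths of the two boolean masks:
import pandas as pd

df = pd.DataFrame({'Country Name': ['France', 'Spain', 'France', 'India']})

mask_short = df['Country Name'].value_counts() == 1                           # indexed by country name
mask_long = df['Country Name'].map(df['Country Name'].value_counts()) == 1    # indexed like df

print(len(mask_short), len(mask_long), len(df))   # 3 4 4
print(df[mask_long])                              # only the Spain and India rows remain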
Try this -
Sample data for dataframe
>>> df = pd.DataFrame ({"Country Name": ["United States", "Canada", "Spain", "France", "India", "Greece", "United States","Canada"], "Exports": ["beef","corn","cattle","cheese","cattle","cheese","pork","maple syrup"]})
Display dataframe
>>> df
Country Name Exports
0 United States beef
1 Canada corn
2 Spain cattle
3 France cheese
4 India cattle
5 Greece cheese
6 United States pork
7 Canada maple syrup
Use groupby() in addition to count() to return counts by "Country Name"
>>> df.groupby("Country Name")["Country Name"].count()
Country Name
Canada 2
France 1
Greece 1
India 1
Spain 1
United States 2
Name: Country Name, dtype: int64
Display only the rows where count() == 1
>>> df[df['Country Name'].map(df.groupby('Country Name')["Country Name"].count()) == 1]
Country Name Exports
2 Spain cattle
3 France cheese
4 India cattle
5 Greece cheese

How to apply iloc in a Dataframe depending on a column value

I have a Dataframe with the follow columns:
"Country Name"
"Indicator Code"
"Porcentaje de Registros" (as it is show in the image) for each country there are 32 indicator codes with its percentage value.
The values are order in an descending way, and I need to keep the 15th highest values for each country, that means for example for Ecuador I need to know which ones are the 15th indicators with highest value. I was trying the following:
countries = gender['Country Name'].drop_duplicates().to_list()
for countries in countries:
    test = RelevantFeaturesByID[RelevantFeaturesByID['Country Name']==countries].set_index(["Country Name", "Indicator Code"]).iloc[0:15]
test
But it just returns the first 15 rows for one country.
What am I doing wrong?
There is a naming clash in the loop statement for countries in countries: the loop variable shadows the list you are iterating over. That is certainly a problem. You also overwrite test on every iteration, so only the last country's rows survive.
I am not sure whether I understood your aim correctly, but this seems to be a good basis to start from:
# sorting with respect to countries and their percentage
df = df.sort_values(by=[df.columns[0], df.columns[-1]], ascending=[True, False])
# choosing unique values of country names
countries = df[df.columns[0]].unique()
test = []
for country in countries:
    test.append(df.loc[df["Country Name"] == country].iloc[0:15])
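If the goal is simply the 15 highest-value rows per country, a groupby-based sketch avoids the explicit loop; this is only an alternative outline, assuming the column names from the question ('Country Name', 'Porcentaje de Registros'):
# sort by country and by descending percentage, then keep the first 15 rows per country
top15 = (df.sort_values(['Country Name', 'Porcentaje de Registros'],
                        ascending=[True, False])
           .groupby('Country Name')
           .head(15))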

Creating a dictionary from a dataframe having duplicate column value

I have a dataframe df1 having more than 500k records:
state lat-long
Florida (12.34,34.12)
texas (13.45,56.0)
Ohio (-15,49)
Florida (12.04,34.22)
texas (13.35,56.40)
Ohio (-15.79,49.34)
Florida (12.8764,34.2312)
The lat-long value can differ for a particular state.
I need to get a dictionary like the one below; since the lat-long values differ within a state, I need to capture only the first occurrence, like this:
dict_state_lat_long = {"Florida":"(12.34,34.12)","texas":"(13.45,56.0)","Ohio":"(-15,49)"}
How can I get this in the most efficient way?
You can use DataFrame.groupby to group the dataframe by state and then apply the aggregation "first" to select the first occurring lat-long value in each group.
Then you can use DataFrame.to_dict() function to convert the dataframe to the python dict.
Use this:
grouped = df.groupby("state")["lat-long"].agg("first")
dict_state_lat_long = grouped.to_dict()
print(dict_state_lat_long)
Output:
{'Florida': '(12.34,34.12)', 'Ohio': '(-15,49)', 'texas': '(13.45,56.0)'}
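Since the question asks about efficiency on 500k+ rows, an equivalent sketch based on drop_duplicates may also be worth comparing; it is an alternative to the answer above, keeps the first row per state, and preserves the order of first appearance rather than sorting by state:
# keep the first row for each state, then turn the lat-long column into a dict
dict_state_lat_long = (df.drop_duplicates(subset='state')
                         .set_index('state')['lat-long']
                         .to_dict())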

How to sum up every 3 rows by column in Pandas Dataframe Python

I have a pandas dataframe top3 with data as in the image below.
Using the two columns STNAME and CENSUS2010POP, I need to find the sum for Wyoming (91738 + 75450 + 46133 = 213321), then the sum for Wisconsin (1825699), West Virginia and so on, summing up the 3 counties for each state (and I need to sort the results in ascending order after that).
I have tried this code to compute the answer:
topres=top3.groupby('STNAME').sum().sort_values(['CENSUS2010POP'], ascending=False)
Maybe you can suggest a more efficient way to do it? Maybe with a lambda expression?
You can use groupby:
df.groupby('STNAME').sum()
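Assuming top3 already holds only the three largest counties per state (as in the question), a sketch that also selects the population column and applies the ascending sort the question asks for could look like this:
topres = top3.groupby('STNAME')['CENSUS2010POP'].sum().sort_values(ascending=True)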
Note: I'm starting from the point in the problem before the top 3 counties per state have been selected, and jumping straight to their sum.
I found it helpful with this problem to use a list comprehension.
I created a data frame view of the counties with:
counties_df=census_df[census_df['SUMLEV'] == 50]
and a separate one of the states so I could get at their names.
states_df=census_df[census_df['SUMLEV'] == 40]
Then I was able to create that sum of the populations of the top 3 counties per state, by looping over all states and summing the largest 3.
res = [(x, counties_df[(counties_df['STNAME']==x)].nlargest(3,['CENSUS2010POP'])['CENSUS2010POP'].sum()) for x in states_df['STNAME']]
I converted that result to a data frame
dfObj = pd.DataFrame(res)
named its columns
dfObj.columns = ['STNAME','POP3']
sorted in place
dfObj.sort_values(by=['POP3'], inplace=True, ascending=False)
and returned the first 3
return dfObj['STNAME'].head(3).tolist()
Definitely groupby is a more compact way to do the above, but I found this way helped me break down the steps (and the associated course had not yet dealt with groupby).

Can I combine groupby data?

I have two columns, home and away. So one row will be England vs Brazil and the next row will be Brazil vs England. How can I count occurrences of Brazil vs England and England vs Brazil as a single count?
Based on previous solutions, I have tried
results.groupby(["home_team", "away_team"]).size()
results.groupby(["away_team", "home_team"]).size()
however this does not give me the outcome that I am looking for.
Undesired output:
home_team away_team
England Brazil 1
away_team home_team
Brazil England 1
I would like to see:
England Brazil 2
Maybe you need the below:
df = pd.DataFrame({
    'home': ['England', 'Brazil', 'Spain'],
    'away': ['Brazil', 'England', 'Germany']
})
pd.Series('-'.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
Output:
Brazil-England 2
Germany-Spain 1
dtype: int64
PS: If you do not like the - between team names, you can use:
pd.Series(' '.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
You can sort the values with numpy.sort, create a DataFrame from the result and use your original solution:
df1 = (pd.DataFrame(np.sort(df[['home','away']], axis=1), columns=['home','away'])
         .groupby(["home", "away"])
         .size())
Option 1
You can use numpy to sort the values of the dataframe row by row.
However, as ndarray.sort() works in place, it is better to create a copy of the dataframe first.
dfTeams = pd.DataFrame(data=df.values.copy(), columns=['team1','team2'])
dfTeams.values.sort()
(I changed the column names, because with the sorting you are changing their meaning)
After having done this, you can use your groupby.
dfTeams.groupby(['team1', 'team2']).size()
Option 2
Since a more general title for your question would be something like "how can I count combinations of values in multiple columns of a dataframe, independently of their order", you could use a set.
A set object is an unordered collection of distinct hashable objects.
More precisely, create a Series of frozen sets, and then count values.
pd.Series(map(lambda home, away: frozenset({home, away}),
              df['home'],
              df['away'])).value_counts()
Note: I use the dataframe in #Harv Ipan's answer.
