Can I combine groupby data? - python

I have two columns, home and away. One row will be England vs Brazil and the next row will be Brazil vs England. How can I count occurrences of Brazil vs England and England vs Brazil as a single count?
Based on previous solutions, I have tried
results.groupby(["home_team", "away_team"]).size()
results.groupby(["away_team", "home_team"]).size()
however this does not give me the outcome that I am looking for.
Undesired output:
home_team away_team
England Brazil 1
away_team home_team
Brazil England 1
I would like to see:
England Brazil 2

Maybe you need something like the below:
import pandas as pd

df = pd.DataFrame({
'home':['England', 'Brazil', 'Spain'],
'away':['Brazil', 'England', 'Germany']
})
pd.Series('-'.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
Output:
Brazil-England 2
Germany-Spain 1
dtype: int64
PS: If you do not like the - between team names, you can use:
pd.Series(' '.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()

You can sort the values with numpy.sort, create a new DataFrame and then use your original solution:
import numpy as np

df1 = (pd.DataFrame(np.sort(df[['home', 'away']], axis=1), columns=['home', 'away'])
         .groupby(['home', 'away'])
         .size())
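For reference, with the three-row sample df from the first answer this should give something like:
home     away
Brazil   England    2
Germany  Spain      1
dtype: int64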

Option 1
You can sort the values of the dataframe in place via its underlying NumPy array (.values).
However, since that modifies the data in place, it is better to first create a copy of the dataframe.
dfTeams = pd.DataFrame(data=df.values.copy(), columns=['team1','team2'])
dfTeams.values.sort()
(I changed the column names, because with the sorting you are changing their meaning)
After having done this, you can use your groupby.
dfTeams.groupby(['team1', 'team2']).size()
Option 2
Since a more general title for your question would be something like how can I count combinations of values in multiple columns of a dataframe, independently of their order, you could use a set.
A set object is an unordered collection of distinct hashable objects.
More precisely, create a Series of frozen sets, and then count values.
pd.Series(map(lambda home, away: frozenset({home, away}),
              df['home'],
              df['away'])).value_counts()
Note: I use the dataframe from Harv Ipan's answer.
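If you want readable labels afterwards (the index of the resulting Series consists of frozenset objects), a small follow-up sketch of my own, not part of the original answer, is to map the index back to sorted strings:
counts = pd.Series(map(lambda home, away: frozenset({home, away}),
                       df['home'], df['away'])).value_counts()
counts.index = counts.index.map(lambda s: ' vs '.join(sorted(s)))  # e.g. 'Brazil vs England'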

Related

Trouble creating a subset dataframe [duplicate]

This question already has answers here: Drop all duplicate rows across multiple columns in Python Pandas.
I am trying to better understand Python and why I am receiving an error.
I have a dataframe with country names and I want to filter the dataset to only show those that have no duplicates. I entered:
df[df['Country Name'].value_counts() == 1]
However, I get an error
Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
It seems that [df['Country Name'].value_counts() == 1] creates a list that also shows the country names along with the boolean, and not simply the boolean as I was expecting.
Also, I tried filtering only on one country, i.e., df[df['Country Name'] == 'United States'] and this works perfectly fine.
I'm just trying to understand why in one scenario it works and in the other it doesn't. I do notice in the latter that there is an index starting at 0, so perhaps that is why. I just don't understand why I don't receive this same index in the previous example.
Can somebody help with an explanation?
Your solution doesn't work because the resulting Series is shorter than the original dataframe; they have to be the same length so that pandas can filter row by row depending on the boolean values.
Also, I'm pretty sure you're actually looking for pandas.DataFrame.drop_duplicates:
df.drop_duplicates(subset=['Country Name'], keep = False)
It literally drops duplicate rows; in this case you drop by 'Country Name', and since you don't want to keep either the first or the last occurrence (the other options for keep), you pass keep=False.
Documentation here.
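As a quick sketch (the tiny frame here is made up purely for illustration):
import pandas as pd

df = pd.DataFrame({'Country Name': ['United States', 'Canada', 'Spain', 'United States'],
                   'Exports': ['beef', 'corn', 'cattle', 'pork']})
# keep=False drops every row whose 'Country Name' appears more than once
df.drop_duplicates(subset=['Country Name'], keep=False)
#   Country Name Exports
# 1       Canada    corn
# 2        Spain  cattle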
Here's an explanation.
For a single country name, you've provided:
df[df['Country Name'] == 'United States']
Let's split this,
df['Country Name'] == 'United States'
gives you a boolean Series with as many values as there are rows in the original dataframe.
Now, when you do
df[df['Country Name'] == 'United States']
you'll get the dataframe containing only 'United States', because pandas aligns the booleans with the rows and returns the rows where the value is True.
Now for value_counts:
df[df['Country Name'].value_counts() == 1]
split this,
df['Country Name'].value_counts() == 1
will return a boolean Series indexed by the unique country names, indicating whether each count == 1. If you check its length, it doesn't match the original length of the df.
Now, when you try to subset the dataframe with it, you get the error above.
The solution: drop the countries appearing more than once, as Pablo mentioned in his answer (I haven't tried it; mind the keep=False). Or try the below.
If you want the rows with countries that appear only once, you can try the map way:
df[df['Country Name'].map(df['Country Name'].value_counts()) == 1]
This will return the dataframe with countries that appear exactly once.
Or
df[df['Country Name'].isin(df['Country Name'].value_counts()[df['Country Name'].value_counts()==1].index)]
Try this -
Sample data for dataframe
>>> df = pd.DataFrame ({"Country Name": ["United States", "Canada", "Spain", "France", "India", "Greece", "United States","Canada"], "Exports": ["beef","corn","cattle","cheese","cattle","cheese","pork","maple syrup"]})
Display dataframe
>>> df
Country Name Exports
0 United States beef
1 Canada corn
2 Spain cattle
3 France cheese
4 India cattle
5 Greece cheese
6 United States pork
7 Canada maple syrup
Use groupby() in addition to count() to return counts by "Country Name"
>>> df.groupby("Country Name")["Country Name"].count()
Country Name
Canada 2
France 1
Greece 1
India 1
Spain 1
United States 2
Name: Country Name, dtype: int64
Display only the rows where count() == 1
>>> df[df['Country Name'].map(df.groupby('Country Name')["Country Name"].count()) == 1]
Country Name Exports
2 Spain cattle
3 France cheese
4 India cattle
5 Greece cheese
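Yet another option, a small sketch (not covered above) using the same sample data, is groupby().filter, which keeps only the groups whose size is 1:
>>> df.groupby('Country Name').filter(lambda g: len(g) == 1)
  Country Name Exports
2        Spain  cattle
3       France  cheese
4        India  cattle
5       Greece  cheese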

Creating a dictionary from a dataframe having duplicate column value

I have a dataframe df1 having more than 500k records:
state lat-long
Florida (12.34,34.12)
texas (13.45,56.0)
Ohio (-15,49)
Florida (12.04,34.22)
texas (13.35,56.40)
Ohio (-15.79,49.34)
Florida (12.8764,34.2312)
The lat-long value can differ for a particular state.
I need to get a dictionary like below, capturing the first occurrence for each state:
dict_state_lat_long = {"Florida":"(12.34,34.12)","texas":"(13.45,56.0)","Ohio":"(-15,49)"}
How can I get this in the most efficient way?
You can use DataFrame.groupby to group the dataframe by state and then apply the aggregation function first to select the first occurring lat-long value within each group.
Then you can use the to_dict() function to convert the resulting Series to a Python dict.
Use this:
grouped = df.groupby("state")["lat-long"].agg("first")
dict_state_lat_long = grouped.to_dict()
print(dict_state_lat_long)
Output:
{'Florida': '(12.34,34.12)', 'Ohio': '(-15,49)', 'texas': '(13.45,56.0)'}
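An equivalent sketch that avoids the groupby entirely (my own variant, not part of the answer above) keeps the first row per state and then converts:
dict_state_lat_long = (df.drop_duplicates(subset='state', keep='first')
                         .set_index('state')['lat-long']
                         .to_dict())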

Python DataFrame: How to continue grouping by after several operation on Dataframe

I have a dataframe with states, counties and population statistics with the below columns:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP
With the line below I am grouping the dataframe and, for each state, sorting the counties by population:
sorted_df = temp_df.groupby(['STNAME']).apply(lambda x: x.sort_values(['CENSUS2010POP'], ascending = False))
After the sorting I want to keep only the 3 largest counties population-wise:
largestcty = sorted_df.groupby(['STNAME'])["CENSUS2010POP"].nlargest(3)
As the next step I would like to sum those values with the command below:
top3sum = largestcty.groupby(['STNAME']).sum()
But the problem now is that the key 'STNAME' is not in the series after the group by. My question is how to preserve the keys of the original DataFrame in the series?
So after applying the answer I have top3sum as a dataframe
top3sum = pd.DataFrame(largestcty.groupby(['STNAME'])['CENSUS2010POP'].sum(), columns=['CENSUS2010POP'])
top3sum[:8]
>>>
STNAME CENSUS2010POP
Alabama 1406269
Alaska 478402
Arizona 5173150
Arkansas 807152
California 15924150
Colorado 1794424
Connecticut 2673320
Delaware 897934
This is what the top3sum data looks like. Then I am getting:
cnty = top3sum['CENSUS2010POP'].idxmax()
And cnty = California
But then, trying to use cnty with top3sum['STNAME'], I am receiving a KeyError.
Your issue is that after the second grouping you only select the CENSUS2010POP column and pick the three largest values.
Please note that you don't need to sort in advance before applying nlargest, so the first command is unnecessary. But if you do sort, you can easily pick the first 3 lines of each sorted group:
largestcty = temp_df.groupby(['STNAME']).apply(lambda x: x.sort_values(['CENSUS2010POP'], ascending=False).head(3))
Then you need to adapt the sum command in order to select your desired column:
top3sum = largestcty.groupby(['STNAME'])['CENSUS2010POP'].sum()
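A more compact variant (my own sketch, not from the answer above) skips the sorting entirely by using nlargest and then summing within each state:
top3 = temp_df.groupby('STNAME')['CENSUS2010POP'].nlargest(3)  # MultiIndex: (STNAME, original row index)
top3sum = top3.groupby(level='STNAME').sum()                   # Series indexed by STNAME
top3sum.idxmax()                                               # state whose top-3 counties sum to the largest value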

python pandas multi-indexed dataframe selection

Although I found multiple questions on the topic, I could not find a solution for this one in particular.
I am playing around with this CSV file, which contains a subselection of TBC data from the WHO:
http://dign.eu/temp/tbc.csv
import pandas as pd
df = pd.read_csv('tbc.csv', index_col=['country', 'year'])
This gives a nicely formatted DataFrame, sorted on country and year, showing one of the parameters.
Now, for this case I would like the mean value of "param" for each country over all available years. Using df.mean() gives me an overall value, and df.mean(axis=1) removes all indices, which makes the results useless.
Obviously I can do this using a loop, but I guess there is a smarter way. But how?
If I understand you correctly, you want to pass the level to the mean function:
In [182]:
df.mean(level='country')
Out[182]:
param
country
Afghanistan 8391.312500
Albania 183.888889
Algeria 8024.588235
American Samoa 1.500000
....
West Bank and Gaza Strip 12.538462
Yemen 4029.166667
Zambia 13759.266667
Zimbabwe 12889.666667
[219 rows x 1 columns]
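Note that in more recent pandas versions the level argument to mean has been deprecated (and later removed), so the equivalent groupby form is the safer long-term choice:
df.groupby(level='country').mean()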

Ordering Headers in a DataFrame using Python

How do I order the headers of a dataframe?
from pandas import *
import pandas
import numpy as np
df2 = DataFrame({'ISO':['DE','CH','AT','FR','US'],'Country':
['Germany','Switzerland','Austria','France','United States']})
print df2
The result I get on default is this:
Country ISO
0 Germany DE
1 Switzerland CH
2 Austria AT
3 France FR
4 United States US
But I thought ISO would come before Country, as that was the order in which I created the dataframe. It looks like it sorted the columns alphabetically?
How can I set up this simple table in memory, in my preferred column order, to be used later in relational queries? Every time I reference the dataframe I don't want to have to reorder the columns.
My first coding post ever, ever.
A dict has no ordering; you can use the columns argument to enforce one. If columns is not provided, the default ordering is indeed alphabetical.
In [2]: df2 = DataFrame({'ISO':['DE','CH','AT','FR','US'],
...: 'Country': ['Germany','Switzerland','Austria','France','United States']},
...: columns=['ISO', 'Country'])
In [3]: df2
Out[3]:
ISO Country
0 DE Germany
1 CH Switzerland
2 AT Austria
3 FR France
4 US United States
A Python dict is unordered. The keys are not stored in the order you declare them or append to it. The dict you give to the DataFrame as an argument has an arbitrary order, which the DataFrame takes as given.
You have several options to circumvent the issue:
Use an OrderedDict instead of a plain dict if you really need a dictionary as input:
from collections import OrderedDict
df2 = DataFrame(OrderedDict([('ISO',['DE','CH','AT','FR','US']),('Country',['Germany','Switzerland','Austria','France','United States'])]))
If you don't rely on a dictionary in the first place, then call the DataFrame with arguments declaring the columns:
df2 = DataFrame({'ISO':['DE','CH','AT','FR','US'],'Country':
['Germany','Switzerland','Austria','France','United States']}, columns=['ISO', 'Country'])
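Side note, concerning newer versions than those discussed here: on Python 3.7+ plain dicts preserve insertion order and current pandas respects it, so the original constructor would already keep ISO before Country. For an existing dataframe you can also simply reindex the columns:
df2 = df2[['ISO', 'Country']]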
