Trouble creating a subset dataframe [duplicate] - python

This question already has answers here:
Drop all duplicate rows across multiple columns in Python Pandas
I am trying to better understand Python and why I am receiving an error.
I have a dataframe with country names and I want to filter the dataset to only show those that have no duplicates. I entered:
df[df['Country Name'].value_counts() == 1]
However, I get an error
Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
It seems that df['Country Name'].value_counts() == 1 creates a series that shows the country names along with the booleans, rather than just the booleans I was expecting.
Also, I tried filtering only on one country, i.e., df[df['Country Name'] == 'United States'] and this works perfectly fine.
I'm just trying to understand why in one scenario it works and in the other it doesn't. I do notice in the latter that there is an index starting at 0, so perhaps that is why. I just don't understand why I don't receive this same index in the previous example.
Can somebody help with an explanation?

Your solution doesn't work because the resulting boolean Series is shorter than the original dataframe: the mask and the dataframe have to be the same length (and share the same index) so that pandas can filter row by row on the boolean values.
Also, I'm pretty sure you're actually looking for pandas.DataFrame.drop_duplicates:
df.drop_duplicates(subset=['Country Name'], keep = False)
It literally drops duplicate rows. In this case you drop by 'Country Name', and since you don't want to keep either the first or the last occurrence (the other options for keep), you pass keep=False.
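For illustration, a minimal sketch of what that call does (the sample rows here are made up):
df = pd.DataFrame({'Country Name': ['United States', 'Canada', 'Spain', 'United States'],
                   'Exports': ['beef', 'corn', 'cattle', 'pork']})
df.drop_duplicates(subset=['Country Name'], keep=False)
# 'United States' appears twice, so both of its rows are dropped;
# only the Canada and Spain rows remain.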
Documentation here.

Here's an explanation.
For the single-country filter, you provided:
df[df['Country Name'] == 'United States']
Let's split this,
df['Country Name'] == 'United States'
gives you a boolean series with as many values as there are rows in the original dataframe.
Now, when you do
df[df['Country Name'] == 'United States']
you get the dataframe containing only 'United States', because pandas aligns the boolean mask with the rows and returns every row where the mask is True.
Now for value counts..
df[df['Country Name'].value_counts() == 1]
split this,
df['Country Name'].value_counts() == 1
returns a boolean series indexed by the unique country names, indicating whether each count == 1. If you check its length, it doesn't match the length of the original df.
Now, when you try to subset the dataframe, you get the error you're getting.
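To see the mismatch concretely, a minimal sketch (assuming df has the usual default integer index):
mask = df['Country Name'].value_counts() == 1
len(df), len(mask)   # different lengths: one entry per row vs. one entry per unique country
mask.index           # country names, not the 0..n-1 row labels of df
df[mask]             # raises: Unalignable boolean Series provided as indexer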
The solution: drop the countries appearing more than once, as Pablo mentioned in his answer (I haven't tried it; mind the keep=False), or try the approaches below.
If you want the rows with countries that appear only once, you can try the map way:
df[df['Country Name'].map(df['Country Name'].value_counts()) == 1]
This will return the data frame with countries that are appearing exactly once.
Or
counts = df['Country Name'].value_counts()
df[df['Country Name'].isin(counts[counts == 1].index)]

Try this -
Sample data for dataframe
>>> df = pd.DataFrame({"Country Name": ["United States", "Canada", "Spain", "France", "India", "Greece", "United States", "Canada"],
...                    "Exports": ["beef", "corn", "cattle", "cheese", "cattle", "cheese", "pork", "maple syrup"]})
Display dataframe
>>> df
Country Name Exports
0 United States beef
1 Canada corn
2 Spain cattle
3 France cheese
4 India cattle
5 Greece cheese
6 United States pork
7 Canada maple syrup
Use groupby() together with count() to return counts by "Country Name":
>>> df.groupby("Country Name")["Country Name"].count()
Country Name
Canada 2
France 1
Greece 1
India 1
Spain 1
United States 2
Name: Country Name, dtype: int64
Display only the rows where count() == 1:
>>> df[df['Country Name'].map(df.groupby('Country Name')["Country Name"].count()) == 1]
Country Name Exports
2 Spain cattle
3 France cheese
4 India cattle
5 Greece cheese

Related

Count occurrence of elements in column of lists (with a twist)

I've got a column of lists called "author_background" which I would like to analyze. The actual column has 8,000 rows. My aim is to get an overview of how many different elements there are in total (across all lists in the column) and to count how many lists each element occurs in.
What my column looks like:
df.author_background
0 [Professor for Business Administration, Harvard Business School]
1 [Professor for Industrial Engineering, University of Oakland]
2 [Harvard Business School]
3 [CEO, SpaceX]
Desired output:
0 Harvard Business School 2
1 Professor for Business Administration 1
2 Professor for Industrial Engineering 1
3 CEO 1
4 University of Oakland 1
5 SpaceX 1
I would like to know how often "Professor of Business Administration", "Professor for Industrial Engineering", "Harvard Business School", etc. occurs in the column. There are way more titles I don't know about.
Basically, I would like to use pd.value_counts on the column. However, that's not possible because each entry is a list.
Is there another way to count the occurrence of each element?
If that's more helpful: I also have a flat (non-nested) list containing all elements of the lists.
Turn it all into a single series by list flattening:
pd.Series([bg for bgs in df.author_background for bg in bgs])
Now you can call value_counts() to get your result.
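Put together, it's a one-liner (assuming author_background really holds lists, not strings that merely look like lists):
pd.Series([bg for bgs in df.author_background for bg in bgs]).value_counts()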
You can try this:
el = pd.Series([item for sublist in df.author_background for item in sublist])
df = el.groupby(el).size().rename_axis('author_background').reset_index(name='counter')

Can I combine groupby data?

I have two columns, home and away. One row might be England vs Brazil and the next row Brazil vs England. How can I count all occurrences of Brazil facing England, regardless of order, as a single count?
Based on previous solutions, I have tried
results.groupby(["home_team", "away_team"]).size()
results.groupby(["away_team", "home_team"]).size()
however this does not give me the outcome that I am looking for.
Undesired output:
home_team away_team
England Brazil 1
away_team home_team
Brazil England 1
I would like to see:
England Brazil 2
Maybe you need something like the below:
df = pd.DataFrame({
    'home': ['England', 'Brazil', 'Spain'],
    'away': ['Brazil', 'England', 'Germany']
})
pd.Series('-'.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
Output:
Brazil-England 2
Germany-Spain 1
dtype: int64
PS: If you do not like the - between team names, you can use:
pd.Series(' '.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
You can sort values with numpy.sort, create a DataFrame and use your original solution:
df1 = (pd.DataFrame(np.sort(df[['home','away']], axis=1), columns=['home','away'])
         .groupby(["home", "away"])
         .size())
Option 1
You can use numpy.sort to sort the values of the dataframe
However, as that sorts in place, maybe it is better to create a copy of the dataframe.
dfTeams = pd.DataFrame(data=df.values.copy(), columns=['team1','team2'])
dfTeams.values.sort()
(I changed the column names, because with the sorting you are changing their meaning)
After having done this, you can use your groupby.
dfTeams.groupby(['team1', 'team2']).size()
Option 2
Since a more general title for your question would be something like how can I count combinations of values in multiple columns of a dataframe, independently of their order, you could use a set.
A set object is an unordered collection of distinct hashable objects.
More precisely, create a Series of frozen sets, and then count values.
pd.Series(map(lambda home, away: frozenset({home, away}),
              df['home'], df['away'])).value_counts()
Note: I use the dataframe in #Harv Ipan's answer.

Python DataFrame: How to continue grouping by after several operations on a DataFrame

I have a dataframe with states, counties and population statistics with the below columns:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP
With the line below I am grouping the dataframe and, for each state, sorting the county populations:
sorted_df = temp_df.groupby(['STNAME']).apply(lambda x: x.sort_values(['CENSUS2010POP'], ascending = False))
After the sorting I want to keep only the 3 largest counties population-wise:
largestcty = sorted_df.groupby(['STNAME'])["CENSUS2010POP"].nlargest(3)
As the next step I would like to sum those values with the command below:
top3sum = largestcty.groupby(['STNAME']).sum()
But the problem now is that the key 'STNAME' is not in the series after the group by. My question is how to preserve the keys of the original DataFrame in the series?
So after applying the answer I have top3sum as a dataframe
top3sum = pd.DataFrame(largestcty.groupby(['STNAME'])['CENSUS2010POP'].sum(), columns=['CENSUS2010POP'])
top3sum[:8]
>>>
STNAME CENSUS2010POP
Alabama 1406269
Alaska 478402
Arizona 5173150
Arkansas 807152
California 15924150
Colorado 1794424
Connecticut 2673320
Delaware 897934
This is what the top3sum data looks like, and then I run:
cnty = top3sum['CENSUS2010POP'].idxmax()
And cnty = California
But then, trying to use cnty with top3sum['STNAME'], I receive a KeyError.
Your issue is that after the second grouping you only select the CENSUS2010POP column and pick the three largest values, so the result is a Series and STNAME survives only as part of its index, not as a column.
Please note that you don't need to sort in advance before applying nlargest, so the first command is unnecessary. But if you sort, you can easily pick the first 3 lines of the sorted grouped dataframes:
largestcty = temp_df.groupby(['STNAME']).apply(lambda x: x.sort_values(['CENSUS2010POP'], ascending=False).head(3))
Then you need to adapt the sum command in order to select your desired column:
top3sum = largestcty.groupby(['STNAME'])['CENSUS2010POP'].sum()
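For completeness, a minimal sketch of the whole chain using your original nlargest route (column names taken from the question); because the result is a Series indexed by STNAME, idxmax already returns the state name:
top3sum = (temp_df.groupby('STNAME')['CENSUS2010POP']
                  .nlargest(3)
                  .groupby(level='STNAME')
                  .sum())
top3sum.idxmax()   # the STNAME whose three largest counties sum to the most people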

Basic Pandas data analysis: connecting data types

I loaded in a dataframe where there is a variable called natvty which is a frequency of numbers from 50 - 600. Each number represents a country and each country appears more than once. I did a count of the number of times each country appears in the list. Now I would like to replace the number of the country with the name of the country, for example (57 = United States). I tried all kinds of for loops to no avail. Here's my code thus far. In the value counts table, the country number is on the left and the number of times it appears in the data is on the right. I need to replace the number on the left with the country name. The numbers which correspond to country names are in an external excel sheet in two columns. Thanks.
I think there may be no need to REPLACE the country numbers with country names at all. You now have two tables: one with columns ["country_number", "natvty"] and the other (your Excel table, which can be exported as a .csv file and read by pandas) with columns ["country_number", "country_name"], so you can simply join them and keep everything. The resulting table would have 3 columns: ["country_number", "natvty", "country_name"].
import pandas as pd
df_nav = pd.read_csv("my_natvty.csv")
df_cnames = pd.read_csv("excel_country_names.csv") # or use pd.read_excel("country_names.xlsx") directly on excel files
df_nav_with_cnames = df_nav.join(df_cnames.set_index('country_number'), on='country_number')
Make sure they both have a "country_number" column. You can modify the headers in the data source files manually, or treat the column as an index to apply the join similarly (DataFrame.join matches the caller's on column against the other frame's index, hence the set_index above). The concept is a bit like SQL joins in relational databases.
Documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
For this sort of thing, I always prefer the map function, which eats a dictionary, or a function for that matter.
import pandas as pd
import numpy as np

# generate data
df = pd.DataFrame(data={'natvty': np.random.randint(low=20, high=500, size=10),
                        'country': pd.Series([1, 2, 3, 3, 3, 2, 1, 1, 2, 3])})
df
country natvty
0 1 24
1 2 310
2 3 88
3 3 459
4 3 38
5 2 63
6 1 194
7 1 384
8 2 281
9 3 360
Then, the dict. Here I just type it, but you could load it from a csv or excel file. Then you'd want to set the key as the index and turn the resulting series into a dict (to_dict()).
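For instance, a hedged sketch of loading it from Excel (the file name and the 'code' / 'name' column names are hypothetical):
countrymap = pd.read_excel('country_codes.xlsx').set_index('code')['name'].to_dict()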
countrymap = {1:'US',2:'Canada',3:'Mexico'}
Then you can simply map the value labels.
df.country.map(countrymap)
Out[10]:
0 US
1 Canada
2 Mexico
3 Mexico
4 Mexico
5 Canada
6 US
7 US
8 Canada
9 Mexico
Name: country, dtype: object
Note: The basic idea here is the same as Shellay's answer. I just wanted to demonstrate how to handle different column names in the two data frames, and how to retrieve the per-country frequencies you wanted.
You have one data frame containing country codes, and another data frame which maps country codes to country names. You simply need to join them on the country code columns. You can read more about merging in Pandas and SQL joins.
import pandas as pd
# this is your nativity frame
nt = pd.DataFrame([
    [123],
    [123],
    [456],
    [789],
    [456],
    [456]
], columns=('natvty',))
# this is your country code map
# in reality, use pd.read_excel
cc = pd.DataFrame([
    [123, 'USA'],
    [456, 'Mexico'],
    [789, 'Canada']
], columns=('country_code', 'country_name'))
# perform a join
# now each row has an associated country_name
df = nt.merge(cc, left_on='natvty', right_on='country_code')
# now you can get frequencies on country names instead of country codes
print df.country_name.value_counts(sort=False)
The output from the above is
Canada 1
USA 2
Mexico 3
Name: country_name, dtype: int64
I think a dictionary would be your best bet. If you had a dict of the countries and their codes e.g.
country_dict = {333: 'United States', 123: 'Canada', 456: 'Cuba', ...}
You presumably have a list of the countries and their codes, so you could make the dict really easily with a loop:
country_dict = {}
for i in country_list:
    country = i[0]  # if you have a list of (country, number) pairs
    number = i[1]
    country_dict[number] = country
Adding a column to your DataFrame once you have this should be straightforward:
import pandas as pd
df = pd.read_csv('my_data.csv', header=None)
df['country'] = [country_dict[df[0][i]] for i in df.index]
This should work if the country codes column has index 0
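Equivalently, and a bit more idiomatically, you can use Series.map as shown earlier (column 0 is assumed to hold the country codes):
df['country'] = df[0].map(country_dict)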

Ordering Headers in a DataFrame using Python

How do I order the headers of a dataframe?
from pandas import *
import pandas
import numpy as np
df2 = DataFrame({'ISO': ['DE','CH','AT','FR','US'],
                 'Country': ['Germany','Switzerland','Austria','France','United States']})
print df2
The result I get on default is this:
Country ISO
0 Germany DE
1 Switzerland CH
2 Austria AT
3 France FR
4 United States US
But I thought ISO would come before Country, since that was the order I created them in the dataframe. It looks like it sorted the columns alphabetically?
How can I set up this simple table in memory, in my preferred column order, to be used later in relational queries? Every time I reference the dataframe I don't want to have to reorder the columns.
My first coding post ever, ever.
A dict has no ordering; you can use the columns argument to enforce one. If columns is not provided, the default ordering is indeed alphabetical.
In [2]: df2 = DataFrame({'ISO':['DE','CH','AT','FR','US'],
...: 'Country': ['Germany','Switzerland','Austria','France','United States']},
...: columns=['ISO', 'Country'])
In [3]: df2
Out[3]:
ISO Country
0 DE Germany
1 CH Switzerland
2 AT Austria
3 FR France
4 US United States
A Python dict is unordered: the keys are not stored in the order you declare or append them. The dict you give to the DataFrame as an argument has an arbitrary order, which the DataFrame takes as given.
You have several options to circumvent the issue:
Use an OrderedDict object instead of a plain dict if you really need a dictionary as input:
from collections import OrderedDict
df2 = DataFrame(OrderedDict([('ISO', ['DE','CH','AT','FR','US']),
                             ('Country', ['Germany','Switzerland','Austria','France','United States'])]))
If you don't rely on a dictionary in the first place, then call the DataFrame with arguments declaring the columns:
df2 = DataFrame({'ISO': ['DE','CH','AT','FR','US'],
                 'Country': ['Germany','Switzerland','Austria','France','United States']},
                columns=['ISO', 'Country'])
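If the dataframe already exists with the wrong column order, you can also reorder it once by selecting the columns in the order you want, and keep that result:
df2 = df2[['ISO', 'Country']]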
