Creating new variable by aggregation in python - python

I'm pretty new to python and pandas, and know only the basics. Nowadays I'm conducting a research and I need your kind help.
Let’s say I have data on births, containing 2 variables: Date and Country.
Date Country
1.1.20 USA
1.1.20 USA
1.1.20 Italy
1.1.20 England
2.1.20 Italy
2.1.20 Italy
3.1.20 USA
3.1.20 USA
Now I want to create a third variable, let’s call him ‘Births’, which contains the number of births in country at a date. In other words, I want to stick to just one row for each date+country combination by aggregating the number of countries in each date, so I end up with something like this:
Date Country Births
1.1.20 USA 2
1.1.20 Italy 1
1.1.20 England 1
2.1.20 Italy 2
3.1.20 USA 2
I’ve tried many things, but nothing seemed to work. Any help will be much appreciated.
Thanks,
Eran

I guess you can use the groupby method of your DataFrame, then use the size method to count the number of individuals in each group :
df.groupby(by=['Date', 'Country']).size().reset_index(name='Births')
Output:
Date Country Births
0 1.1.20 England 1
1 1.1.20 Italy 1
2 1.1.20 USA 2
3 2.1.20 Italy 2
4 3.1.20 USA 2
Also, the pandas documentation has several examples related to group-by operations : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html.

Related

Flag if a row is duplicated an attach if it's the 1st, 2nd, etc duplicated row

I'd like to flag if a row is duplicated, and attach if it's the 1st, 2nd, 3rd, etc duplicated column in a Pandas DataFrame.
More visually, I'd like to go from:
id
Country
City
1
France
Paris
2
France
Paris
3
France
Lyon
4
France
Lyon
5
France
Lyon
to
id
Country
City
duplicated_flag
1
France
Paris
1
2
France
Paris
1
3
France
Lyon
2
4
France
Lyon
2
5
France
Lyon
2
Note that id is not taken into account to see if the row is duplicated.
Two options:
First, if you have lots of columns that you need to compare, you can use:
comparison_df = df.drop("id", axis=1)
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()
We drop the columns that aren't needed in the comparison. Then, we check whether each row is equivalent to the one above it using .shift() and .any(). Finally, we read off the value of duplicated_flag using .cumsum().
But, if you only have two columns to compare (or if for some reason you have lots of columns that you need to drop), you can find mismatched rows one at a time, and then use .cumsum() to get the value of duplicated_flag for each row. It's a bit more verbose so I'm not super happy with this option, but I'm leaving this here for completeness in case this suits your use case better:
country_comparison = df["Country"].ne(df["Country"].shift())
city_comparison = df["City"].ne(df["City"].shift())
df["duplicated_flag"] = (country_comparison | city_comparison).cumsum()
print(df)
These output:
id Country City duplicated_flag
0 1 France Paris 1
1 2 France Paris 1
2 3 France Lyon 2
3 4 France Lyon 2
4 5 France Lyon 2

how to groupby a dataframe if its row value is not present?

i have a dataframe of countrywise open and solved complaints
country complaints status
india network issue solved
usa internet speed issue open
uk network issue open
india internet speed issue solved
usa network issue open
uk voice issue solved
I wanted to group by countries where status is open
i tried
df = df[df.status=="open"]
then
df.groupby("countries",as_index=True).count
the output i got is
country complaints
usa 2
uk 1
but the output want is
country complaints
usa 2
uk 1
india 0
since india has no open complaints I am unable to get india after groupby. How take data is a way such that the groupby also brings india value as 0
You can do:
df['status'].eq('open').groupby(df['country']).sum()
Output:
country
india 0
uk 1
usa 2
Name: status, dtype: int64
If you want to only count that country values so for that instead of using group by you can use value_count() method, it's use to count the categorical value
so your code will be look like this
df = df[df.status=="open"]
df.country.value_counts()
then your output will become like this
india 0
uk 1
usa 2

How to output the top 5 of a specific column along with associated columns using python?

I've tried to use df2.nlargest(5, ['1960'] this gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the column name "Country Name" and "1960" only, but sort by the column "1960."
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000

Pandas - value_counts on multiple values in one cell

I have a dataframe which has a column with multiple values, separated by a comma like this:
Country
Australia, Cuba, Argentina
Australia
United States, Canada, United Kingdom, Argentina
I would like to count each unique value, similar to value_counts, like this:
Australia: 2
Cuba: 1
Argentina: 2
United States: 1
My simplest method is shown below, but I suspect that this can be done more efficiently and neatly.
from collections import Counter
Counter(pd.DataFrame(data['Country'].str.split(',', expand=True)).values.ravel())
Cheers
You can using get_dummies
df.Country.str.get_dummies(sep=', ').sum()
Out[354]:
Argentina 2
Australia 2
Canada 1
Cuba 1
United Kingdom 1
United States 1
dtype: int64
Another option is to split and then use value_counts
pd.Series(df.Country.str.split(', ').sum()).value_counts()
Argentina 2
Australia 2
United Kingdom 1
Canada 1
Cuba 1
United States 1
dtype: int64

Weird behaviour with pandas cut, groupby and multiindex in Python

I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where the % Renewableis a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
when I group by Continentand % Renewable to count the number of countries in each subset I do,
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing now, if I index for a value that the category value is > 0 this gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
shouldn't it give me a 0 as there are no entries in that category? I would assume so as that dataframe was created using the groupby.
If not, can anyone suggest a procedure so I can preserve the 0 nr of countries for a category of % Renewable?
You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.

Categories

Resources