Weird behaviour with pandas cut, groupby and multiindex in Python - python

I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where the % Renewable column was created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
When I group by Continent and % Renewable to count the number of countries in each subset, I do,
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing: if I index a combination whose count is > 0, I get the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
Shouldn't it give me a 0, as there are no entries in that category? I would assume so, since that Series was created using the groupby.
If not, can anyone suggest a procedure that preserves a count of 0 countries for an empty % Renewable category?

You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.
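If the zero counts should be stored in the result rather than looked up one at a time, one option is to reindex the grouped Series against every Continent/bin combination. The following is only a sketch on made-up data: the small top15 frame and the names full_index and full_counts are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Hypothetical miniature of the Top15 frame, with '% Renewable' already binned into 1..5.
top15 = pd.DataFrame(
    {
        "Continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
        "% Renewable": [1, 2, 1, 2, 2],
    },
    index=pd.Index(
        ["India", "China", "United Kingdom", "Germany", "France"], name="Country"
    ),
)

count_groups = top15.groupby(["Continent", "% Renewable"]).size()

# Single lookup with a default, as in the answer above.
missing = count_groups.get(("Asia", 3), 0)

# Build every (continent, bin) pair and fill the absent ones with 0.
full_index = pd.MultiIndex.from_product(
    [count_groups.index.levels[0], range(1, 6)],
    names=count_groups.index.names,
)
full_counts = count_groups.reindex(full_index, fill_value=0)

print(full_counts.loc[("Asia", 3)])
```

After the reindex, plain .loc lookups work for empty combinations as well, at the cost of a longer Series.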

Related

How to maintain the same index after sorting a Pandas series?

I have the following Pandas series from the dataframe 'Reducedset':
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
Which gives me:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660647e+12
Russian Federation 1.565459e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106715e+12
Iran 4.441558e+11
dtype: float64
I want to update the index, so that index of the dataframe Reducedset is in the same order as the series above.
How can I do this?
In other words, when I then look at the entire dataframe, the index order should be the same as in the series above and not like that below:
Reducedset
Rank Documents Citable documents Citations \
Country
China 1 127050 126767 597237
United States 2 96661 94747 792274
Japan 3 30504 30287 223024
United Kingdom 4 20944 20357 206091
Russian Federation 5 18534 18301 34266
Canada 6 17899 17620 215003
Germany 7 17027 16831 140566
India 8 15005 14841 128763
France 9 13153 12973 130632
South Korea 10 11983 11923 114675
Italy 11 10964 10794 111850
Spain 12 9428 9330 123336
Iran 13 8896 8819 57470
Australia 14 8831 8725 90765
Brazil 15 8668 8596 60702
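The answer:
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
This first step takes the mean of the ten columns at positions 10 through 19 for each row (axis=1) and sorts the means in descending order (ascending=False).
Top15 = Top15.reindex(Reducedset.index)
Here we reindex the original dataframe Top15 with the index of the sorted series, so its rows follow the same order as the series. Calling Reducedset.reindex(Reducedset.index) on the series itself would change nothing, and reindex returns a new object, so the result has to be assigned.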
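A minimal sketch of the idea on invented data: compute the row means, sort them, and reorder the dataframe by the sorted index. The column names "2006" and "2007" and the variable names avg_gdp and top15_sorted are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical three-row stand-in for Top15.
top15 = pd.DataFrame(
    {"Rank": [1, 2, 3], "2006": [3.0, 15.0, 5.0], "2007": [4.0, 16.0, 6.0]},
    index=pd.Index(["China", "United States", "Japan"], name="Country"),
)

# Row-wise mean of the yearly columns, largest first.
avg_gdp = top15[["2006", "2007"]].mean(axis=1).sort_values(ascending=False)

# Reorder the whole dataframe to follow the sorted series.
top15_sorted = top15.reindex(avg_gdp.index)

print(top15_sorted)
```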

Problem with New Column in Pandas Dataframe

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df

answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the link suggested and none of them worked. I've tried to convert the values to floats using a few different methods including .astype('float'), float(df['A']) and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations']. But none have worked so far.
Without the exact dataframe it is difficult to say, but it is most likely a casting problem.
Let's build an MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce a type, you can cast the dataframe using the astype method. It returns a new dataframe, so assign the result:
df = df.astype(float)
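One way to rule out a casting problem is to coerce both columns to float before dividing. The sketch below assumes, purely for illustration, that the columns arrived as strings; the two-row frame and the name numeric are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose numeric columns were read as strings.
df = pd.DataFrame(
    {"Self_cite": ["15606", "411683"], "Citations": ["90765", "597237"]},
    index=pd.Index(["Aus.", "China"], name="Country"),
)

# Coerce every column to a number (invalid entries become NaN), then to float.
numeric = df.apply(pd.to_numeric, errors="coerce").astype(float)
numeric["ratio"] = numeric["Self_cite"] / numeric["Citations"]

print(numeric.dtypes)
print(numeric)
```

If the ratio column still shows only 0 and 1 after this, the cause is more likely display formatting or a later cast than the division itself.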

Pandas - value_counts on multiple values in one cell

I have a dataframe which has a column with multiple values, separated by a comma like this:
Country
Australia, Cuba, Argentina
Australia
United States, Canada, United Kingdom, Argentina
I would like to count each unique value, similar to value_counts, like this:
Australia: 2
Cuba: 1
Argentina: 2
United States: 1
My simplest method is shown below, but I suspect that this can be done more efficiently and neatly.
from collections import Counter
Counter(pd.DataFrame(data['Country'].str.split(',', expand=True)).values.ravel())
Cheers
You can use get_dummies:
df.Country.str.get_dummies(sep=', ').sum()
Out[354]:
Argentina 2
Australia 2
Canada 1
Cuba 1
United Kingdom 1
United States 1
dtype: int64
Another option is to split and then use value_counts
pd.Series(df.Country.str.split(', ').sum()).value_counts()
Argentina 2
Australia 2
United Kingdom 1
Canada 1
Cuba 1
United States 1
dtype: int64
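A further variant, sketched here on the sample rows from the question, splits each cell into a list and uses Series.explode (available in pandas 0.25 and later) to put every country on its own row before counting. The name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country": [
            "Australia, Cuba, Argentina",
            "Australia",
            "United States, Canada, United Kingdom, Argentina",
        ]
    }
)

# Split on ", ", give each list element its own row, then count.
counts = df["Country"].str.split(", ").explode().value_counts()

print(counts)
```

This avoids concatenating the lists with sum, which can become slow on long columns.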

Repeating rows in a pandas dataframe

I have a dataframe that I would like to 'double' (or triple, or....). I am not trying to concatenate a dataframe with itself, i.e. have one full copy of the df stacked on top of another full copy of the df.
Starting with this:
import pandas as pd
from io import StringIO
from IPython.display import display
A_csv = """country
Afghanistan
Brazil
China"""
with StringIO(A_csv) as fp:
    A = pd.read_csv(fp)
display(A)
result
country
0 Afghanistan
1 Brazil
2 China
I want to get something like this; the index and indentation aren't so important.
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
Use NumPy's repeat on the underlying array:
df = pd.DataFrame(A.values.repeat(2), columns=A.columns)
df
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
For dataframes with more than one column, pass axis=0 to repeat so that whole rows are repeated instead of the flattened values:
df = pd.DataFrame(A.values.repeat(2, axis=0), columns=A.columns)
You can use np.repeat (after import numpy as np):
pd.DataFrame(np.repeat(A['country'], 2)).reset_index(drop=True)
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
By using pd.concat:
pd.concat([A]*2, axis=0).sort_index().reset_index(drop=True)
Out[56]:
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
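Another way to repeat rows, sketched below, stays entirely within pandas: repeat the index labels with Index.repeat and select them with .loc. This keeps column dtypes intact for dataframes with several columns. The name doubled is illustrative.

```python
import pandas as pd

A = pd.DataFrame({"country": ["Afghanistan", "Brazil", "China"]})

# Repeat each index label twice, select those rows, then renumber the index.
doubled = A.loc[A.index.repeat(2)].reset_index(drop=True)

print(doubled)
```

Changing the 2 to any other positive integer gives the tripled (or larger) variant.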

delete part of a row in pandas / shift up part of a row ? Align Column Headings

So I have a data frame where the headings I want do not currently line up:
In [1]: df = pd.read_excel('example.xlsx')
print (df.head(10))
Out [1]: Portfolio Asset Country Quantity
Unique Identifier Number of fund B24 B65 B35 B44
456 2 General Type A UNITED KINGDOM 1
123 3 General Type B US 2
789 2 General Type C UNITED KINGDOM 4
4852 4 General Type C UNITED KINGDOM 4
654 1 General Type A FRANCE 3
987 5 General Type B UNITED KINGDOM 2
321 1 General Type B GERMANY 1
951 3 General Type A UNITED KINGDOM 2
357 4 General Type C UNITED KINGDOM 3
As we can see, there are 2 blank cells above the first 2 column headings, and below the next 4 column headings there are "B" numbers which I don't care about.
So, 2 questions: how can I shift up the first 2 columns without having a column heading to identify them with (due to the blank cells above)?
And how can I delete just row 2 of the remaining columns and have the data below move up to take the place of the "B" numbers?
I found some similar questions already asked (for example "python: shift column in pandas dataframe up by one"), but nothing that solves the particular intricacies above, I think.
Also, I'm quite new to Python and Pandas, so if this is really basic I apologise!
IIUC you can use:
#create df from multiindex in columns
df1 = pd.DataFrame([x for x in df.columns.values])
print(df1)
0 1
0 Unique Identifier
1 Number of fund
2 Portfolio B24
3 Asset B65
4 Country B35
5 Quantity B44
#if len of string < 4, give value from column 0 to column 1
df1.loc[df1.iloc[:,1].str.len() < 4, 1] = df1.iloc[:,0]
print(df1)
0 1
0 Unique Identifier
1 Number of fund
2 Portfolio Portfolio
3 Asset Asset
4 Country Country
5 Quantity Quantity
#set columns from column 1 of df1
df.columns = df1.iloc[:,1]
print(df)
0 Unique Identifier Number of fund Portfolio Asset Country \
0 456 2 General Type A UNITED KINGDOM
1 123 3 General Type B US
2 789 2 General Type C UNITED KINGDOM
3 4852 4 General Type C UNITED KINGDOM
4 654 1 General Type A FRANCE
5 987 5 General Type B UNITED KINGDOM
6 321 1 General Type B GERMANY
7 951 3 General Type A UNITED KINGDOM
8 357 4 General Type C UNITED KINGDOM
0 Quantity
0 1
1 2
2 4
3 4
4 3
5 2
6 1
7 2
8 3
EDIT by comments:
print(df.columns)
Index([u'Portfolio', u'Asset', u'Country', u'Quantity'], dtype='object')
#set first row by columns names
df.iloc[0,:] = df.columns
#reset_index
df = df.reset_index()
#set columns from first row
df.columns = df.iloc[0,:]
df.columns.name= None
#remove first row
print(df.iloc[1:,:])
Unique Identifier Number of fund Portfolio Asset Country Quantity
1 456 2 General Type A UNITED KINGDOM 1
2 123 3 General Type B US 2
3 789 2 General Type C UNITED KINGDOM 4
4 4852 4 General Type C UNITED KINGDOM 4
5 654 1 General Type A FRANCE 3
6 987 5 General Type B UNITED KINGDOM 2
7 321 1 General Type B GERMANY 1
8 951 3 General Type A UNITED KINGDOM 2
9 357 4 General Type C UNITED KINGDOM 3
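An alternative sketch, under stated assumptions: if the sheet is read with a two-row header (for example header=[0, 1]), the columns form a MultiIndex, and pandas typically labels blank header cells with names starting with "Unnamed". The frame below is built by hand to mimic that situation; the helper pick_label and the exact "Unnamed: ..." labels are assumptions for illustration.

```python
import pandas as pd

# Hand-built stand-in for a sheet read with a two-row header.
columns = pd.MultiIndex.from_tuples(
    [
        ("Unnamed: 0_level_0", "Unique Identifier"),
        ("Unnamed: 1_level_0", "Number of fund"),
        ("Portfolio", "B24"),
        ("Asset", "B65"),
    ]
)
df = pd.DataFrame(
    [[456, 2, "General", "Type A"], [123, 3, "General", "Type B"]],
    columns=columns,
)


def pick_label(top, bottom):
    """Keep the top header unless it is a blank placeholder (hypothetical helper)."""
    return bottom if str(top).startswith("Unnamed") else top


# Flatten the two header rows into one row of meaningful names.
df.columns = [pick_label(top, bottom) for top, bottom in df.columns]

print(df)
```

This drops the "B" numbers and promotes the two unlabeled headings in one step, without touching the data rows.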
