Problem with New Column in Pandas Dataframe - python

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
df = answer_one()
columns_to_keep = ['Self-citations', 'Citations']
df = df[columns_to_keep]
df['ratio'] = df['Self-citations'] / df['Citations']
return df
answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the link suggested and none of them worked. I've tried to convert the values to floats using a few different methods including .astype('float'), float(df['A']) and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations']. But none have worked so far.

Without having the exact dataframe it is difficult to say. But it is most likely a casting problem.
Lets build a MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce type, you can cast dataframe using astype method:
df.astype(float)

Related

Delete value Pandas Column

I started to learn about pandas and try to analyze a data
So in my data there is a column country which contain a few country,I only want to take the first value and change it to a new column.
An example First index have Colombia,Mexico,United Stated and I only wanna to take the first one Colombia [0] and delete the other contry[1:x],is this possible?
I try a few like loc,iloc or drop() but I hit a dead end so I asked in here
You can use Series.str.split:
df['country'] = df['country'].str.split(',').str[0]
Consider below df for example:
In [1520]: df = pd.DataFrame({'country':['Colombia, Mexico, US', 'Croatia, Slovenia, Serbia', 'Denmark', 'Denmark, Brazil']})
In [1521]: df
Out[1521]:
country
0 Colombia, Mexico, US
1 Croatia, Slovenia, Serbia
2 Denmark
3 Denmark, Brazil
In [1523]: df['country'] = df['country'].str.split(',').str[0]
In [1524]: df
Out[1524]:
country
0 Colombia
1 Croatia
2 Denmark
3 Denmark
Use .str.split():
df['country'] = df['country'].str.split(',',expand=True)[0]

Assign binary value whether a column contains empty list

I would like to assign a binary value (1 or 0) whether a column contains not empty/empty lists.
For example:
Country Test
Germany []
Italy ['pizza']
United Kingdom ['queen', 'king','big']
France ['Eiffel']
Spain []
...
What I would expect is something like this:
Country Test Binary
Germany [] 0
Italy ['pizza'] 1
United Kingdom ['queen', 'king','big'] 1
France ['Eiffel'] 1
Spain [] 0
...
I do not know how to use np.where or another to get these results.
I think to check if a column contains an empty list I should do something like this: df[df['Test'] != '[]']
You can do a simple check for length and based on the value, you can convert it to 0 or 1.
df['Binary'] = (df['Test'].str.len() != 0).astype(int)
While this is good, the most efficient way to do it was provided by #Marat.
df['Binary'] = df['Test'].astype(bool).astype(int)
The full code is here:
import pandas as pd
c = ['Country','Test']
d = [['Germany',[]],
['Italy',['pizza']],
['United Kingdom', ['queen', 'king','big']],
['France',['Eiffel']],
['Spain',[]]]
df = pd.DataFrame(data=d,columns=c)
df['Binary'] = df['Test'].astype(bool).astype(int)
print (df)
The output of this will be:
Country Test Binary
0 Germany [] 0
1 Italy [pizza] 1
2 United Kingdom [queen, king, big] 1
3 France [Eiffel] 1
4 Spain [] 0
Use str.len:
np.clip(df.Test.str.len(), 0, 1)
#or
(df.Test.str.len()==0).astype(int)

Extract country name from text in column to create another column

I have tried different combinations to extract the country names from a column and create a new column with solely the countries. I can do it for selected rows i.e. df.address[9998] but not for the whole column.
import pycountry
Cntr = []
for country in pycountry.countries:
for country.name in df.address:
Cntr.append(country.name)
Any ideas what is going wrong here?
edit:
address is an object in the df and
df.address[:10] looks like this
Address
0 Turin, Italy
1 NaN
2 Zurich, Switzerland
3 NaN
4 Glyfada, Greece
5 Frosinone, Italy
6 Dublin, Ireland
7 NaN
8 Turin, Italy
1 NaN
2 Zurich, Switzerland
3 NaN
4 Glyfada, Greece
5 Frosinone, Italy
6 Dublin, Ireland
7 NaN
8 ...
9 Kristiansand, Norway
Name: address, Length: 10, dtype: object
Based on Petar's response when I run individual queries I get the country correctly, but when I try to create a column with all the countries (or ranges like df.address[:5] I get an empty Cntr)
import pycountry
Cntr = []
for country in pycountry.countries:
if country.name in df['address'][1]:
Cntr.append(country.name)
Cntr
Returns
[Italy]
and df.address[2] returns [ ]
etc.
I have also run
df['address'] = df['address'].astype('str')
to make sure that there are no floats or int in the column.
Sample dataframe
df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})
df[['city', 'country']] = df['address'].str.split(',', expand=True, n=2)
address city country
0 Turin, Italy Turin Italy
1 NaN NaN NaN
2 Zurich, Switzerland Zurich Switzerland
3 NaN NaN NaN
4 Glyfada, greece Glyfada greece
You were really close. We cannot loop like this for country.name in df.address. Instead:
import pycountry
Cntr = []
for country in pycountry.countries:
if country.name in df.address:
Cntr.append(country.name)
If this does not work, please supply more information because I am unsure what df.address looks like.
You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.
from dataprep.clean import clean_country
df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
address address_clean
0 Turin, Italy Italy
1 NaN NaN
2 Zurich, Switzerland Switzerland
3 NaN NaN
4 Glyfada, Greece Greece

Weird behaviour with pandas cut, groupby and multiindex in Python

I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where the % Renewableis a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
when I group by Continentand % Renewable to count the number of countries in each subset I do,
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing now, if I index for a value that the category value is > 0 this gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
shouldn't it give me a 0 as there are no entries in that category? I would assume so as that dataframe was created using the groupby.
If not, can anyone suggest a procedure so I can preserve the 0 nr of countries for a category of % Renewable?
You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.

Repeating rows in a pandas dataframe

I have a dataframe that I would like to 'double' (or triple, or....). I am not trying to concatenate a dataframe with itself, i.e. have one full copy of the df stacked on top of another full copy of the df.
Starting with this:
import pandas as pd
from io import StringIO
from IPython.display import display
A_csv = """country
Afghanistan
Brazil
China"""
with StringIO(A_csv) as fp:
A = pd.read_csv(fp)
display(A)
result
country
0 Afghanistan
1 Brazil
2 China
I want to get something like this; the index and indentation aren't so important.
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
Use np.repeat:
df = pd.DataFrame(A.values.repeat(2), columns=A.columns)
df
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
For N-D dataframes, the solution should be extended using an axis parameter in repeat:
df = pd.DataFrame(A.values.repeat(2, axis=0), columns=A.columns)
You can use np.repeat
pd.DataFrame(np.repeat(df['country'], 2)).reset_index(drop = True)
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
By using pd.concat
pd.concat([df]*2,axis=0).sort_index().reset_index(drop=True)
Out[56]:
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China

Categories

Resources