Removing duplicate columns in pandas dataframe - python

I am trying to parse data, but duplicate names under columns started appearing.
Code:
import pandas as pd

def parseData():
    countries = pd.read_csv('Int_Monthly_Visitor.csv')
    cols = [e.strip() for e in list(countries.columns)]
    regions = {
        'Others': cols[30:]
    }
    countries.rename(str.strip, axis='columns', inplace=True)
    regionlist = pd.DataFrame({'Columns': regions['Others'],
                               'Non-Null count': countries.loc[0:120, regions['Others']].count()})
    print(regionlist)

parseData()
Output:
Columns Non-Null count
USA USA 121
Canada Canada 121
Australia Australia 121
New Zealand New Zealand 121
Africa Africa 121
Expected output:
Columns Non-Null count
USA 121
Canada 121
Australia 121
New Zealand 121
Africa 121
Is there a solution to remove the duplicate names under columns?

Since you're reading your dataframe from a .csv file, you can use pandas.read_csv and define the usecols argument as shown below:
countries = pd.read_csv('Int_Monthly_Visitor.csv', usecols=lambda c: not c.startswith('Unnamed:'))
print(countries)
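As a self-contained illustration (a minimal sketch with a made-up CSV, since Int_Monthly_Visitor.csv isn't shown here), the callable form of usecols filters columns by header name at read time:
import io
import pandas as pd

# Hypothetical stand-in for Int_Monthly_Visitor.csv: the second field has
# no header, so pandas would normally name that column 'Unnamed: 1'.
csv_data = io.StringIO("year,,USA,Canada\n2019,x,121,121\n2020,y,119,118\n")

# The callable receives each header string and keeps a column only when
# it returns True, so 'Unnamed:' columns never enter the frame.
countries = pd.read_csv(csv_data, usecols=lambda c: not c.startswith('Unnamed:'))
print(countries.columns.tolist())  # ['year', 'USA', 'Canada']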

Related

How to add conditional row to pandas dataframe

I tried looking for a succinct answer and nothing helped. I am trying to add a row to a dataframe that takes a string for the first column and then, for each remaining column, grabs the sum. I ran into a scalar issue, so I tried to make the desired row into a series and then convert it to a dataframe, but apparently I was adding four rows with one column value instead of one row with the four column values.
My code:
import os
import pandas as pd

def country_csv():
    # loop through absolute paths of each file in source
    for filename in os.listdir(source):
        filepath = os.path.join(source, filename)
        if not os.path.isfile(filepath):
            continue
        df = pd.read_csv(filepath)
        df = df.groupby(['Country']).sum()
        df = df.reset_index()  # reset_index returns a new frame, so assign it back
        print(df)
        # df.to_csv(os.path.join(path1, filename))
Sample dataframe:
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
I would like to see this as the first row:
World 632 27 109
import pandas as pd
df
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
df.loc['World'] = [df['Confirmed'].sum(),df['Deaths'].sum(),df['Recovered'].sum()]
df.sort_values(by=['Confirmed'], ascending=False)
Confirmed Deaths Recovered
Country
World 632 27 109
Albania 333 20 99
Afghanistan 299 7 10
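Since all three columns are numeric, the per-column sums can likely be collapsed into a single df.sum() call; a minimal sketch of the same idea:
# df.sum() returns a Series indexed by column name, and .loc aligns it
# onto the new 'World' row, so this matches the hand-written list above.
df.loc['World'] = df.sum()
df = df.sort_values(by=['Confirmed'], ascending=False)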
IIUC, you can create a dict, then pass it back into a dataframe to concat:
data = df.sum(axis=0).to_dict()
data.update({'Country' : 'World'})
df2 = pd.concat([pd.DataFrame(data,index=[0]).set_index('Country'),df],axis=0)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99
Or a one-liner using assign and transpose:
df2 = pd.concat(
[df.sum(axis=0).to_frame().T.assign(Country="World").set_index("Country"), df],
axis=0,
)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99

Problem with New Column in Pandas Dataframe

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df

answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the suggested link and none of them worked. I've tried to convert the values to floats using a few different methods, including .astype('float'), float(df['A']) and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations'], but none have worked so far.
Without having the exact dataframe it is difficult to say, but it is most likely a casting problem.
Let's build an MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce the type, you can cast the dataframe using the astype method:
df.astype(float)
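One hedged guess worth adding, since the MCVE above divides cleanly: if the original code ran under Python 2, / between two integer columns floors the result, which would produce exactly the 0s and 1s reported. A minimal sketch of the usual fixes:
# Python 2 only: make '/' behave as true division, as it does in Python 3.
from __future__ import division

# Or force a float dtype on one operand before dividing:
df['ratio'] = df['Self_cite'].astype(float) / df['Citations']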

How to convert list to pandas DataFrame?

I use BeautifulSoup to get some data from a webpage:
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'html5lib')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df.head()
But df is a list, not the pandas DataFrame as I expected from using pd.read_html.
How can I get pandas DataFrame out of it?
You can use read_html with your URL:
df = pd.read_html("http://www.nationmaster.com/country-info/stats/Media/Internet-users")[0]
And then if necessary remove GRAPH and HISTORY columns and replace NaNs in column # by forward filling:
df = df.drop(['GRAPH','HISTORY'], axis=1)
df['#'] = df['#'].ffill()
print(df.head())
# COUNTRY AMOUNT DATE
0 1 China 389 million 2009
1 2 United States 245 million 2009
2 3 Japan 99.18 million 2009
3 3 Group of 7 countries (G7) average (profile) 80.32 million 2009
4 4 Brazil 75.98 million 2009
print(df.tail())
# COUNTRY AMOUNT DATE
244 214 Niue 1100 2009
245 =215 Saint Helena, Ascension, and Tristan da Cunha 900 2009
246 =215 Saint Helena 900 2009
247 217 Tokelau 800 2008
248 218 Christmas Island 464 2001
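If you'd rather keep the original BeautifulSoup route, note that pd.read_html always returns a list of DataFrames, one per parsed table, so indexing into it gives the expected frame (a minimal sketch reusing the variables above):
# read_html returns a list of DataFrames; take the first (and here only) table.
df = pd.read_html(str(table))[0]
print(df.head())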

Importing Excel into Panda Dataframe

The following is only the beginning of a Coursera assignment on Data Science. I hope this is not too trivial, but I am lost on this and could not find an answer.
I am asked to import an Excelfile into a panda dataframe and to manipulate it afterwards. The file can be found here: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
What makes it difficult for me is
a) there is an 'overhead' of 17 lines and a footer
b) the first two columns are empty
c) the index column has no header name
After hours of searching and reading I came up with this useless line:
energy = pd.read_excel('Energy Indicators.xls',
                       sheetname='Energy',
                       header=16,
                       skiprows=[17],
                       skipfooter=38,
                       skipcolumns=2
                       )
This seems to produce a multi-index dataframe, though the command energy.head() returns nothing.
I have two questions:
What did I do wrong? Up to this exercise I thought I understood dataframes, but now I am totally clueless and lost :-((
How do I tackle this? What do I have to do to get this Excel data into a dataframe with the index consisting of the countries?
Thanks.
I think you need to add these parameters:
index_col to convert a column to the index
usecols to parse columns by position
and change the header position to 15:
energy = pd.read_excel('Energy Indicators.xls',
                       sheet_name='Energy',
                       skiprows=[17],
                       skipfooter=38,
                       header=15,
                       index_col=[0],
                       usecols=[2,3,4,5]
                       )
print(energy.head())
Energy Supply Energy Supply per capita \
Afghanistan 321 10
Albania 102 35
Algeria 1959 51
American Samoa ... ...
Andorra 9 121
Renewable Electricity Production
Afghanistan 78.669280
Albania 100.000000
Algeria 0.551010
American Samoa 0.641026
Andorra 88.695650
I installed the xlrd package with pip install xlrd, and then loaded the file successfully as follows:
In [17]: df = pd.read_excel(r"http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls",
...: sheetname='Energy',
...: header=16,
...: skiprows=[17],
...: skipfooter=38,
...: skipcolumns=2)
In [18]: df.shape
Out[18]: (227, 3)
In [19]: df.head()
Out[19]:
Energy Supply Energy Supply per capita \
NaN Afghanistan Afghanistan 321 10
Albania Albania 102 35
Algeria Algeria 1959 51
American Samoa American Samoa ... ...
Andorra Andorra 9 121
Renewable Electricity Production
NaN Afghanistan Afghanistan 78.669280
Albania Albania 100.000000
Algeria Algeria 0.551010
American Samoa American Samoa 0.641026
Andorra Andorra 88.695650
In [20]: pd.__version__
Out[20]: u'0.20.3'
In [21]: df.columns
Out[21]:
Index([u'Energy Supply', u'Energy Supply per capita',
u'Renewable Electricity Production'],
dtype='object')
Notice that I am using the latest version of pandas, 0.20.3; make sure you have the latest version on your system.
I modified your code and was able to get the data into the dataframe. Instead of skipcolumns (which did not work), I used the argument usecols as follows:
energy = pd.read_excel('Energy_Indicators.xls',
                       sheetname='Energy',
                       header=16,
                       skiprows=[16],
                       skipfooter=38,
                       usecols=[2,3,4,5]
                       )
Unnamed: 2 Petajoules Gigajoules %
0 Afghanistan 321 10 78.669280
1 Albania 102 35 100.000000
2 Algeria 1959 51 0.551010
3 American Samoa ... ... 0.641026
4 Andorra 9 121 88.695650
In order to make the countries as the index, you can do the following
# Rename the column Unnamed: 2 to Country
energy = energy.rename(columns={'Unnamed: 2':'Country'})
# Change the index to country column
energy.index = energy['Country']
# Drop the extra country column
energy = energy.drop('Country', axis=1)
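Two hedged side notes for newer pandas (assumptions based on later releases, since the answers above target pandas 0.20): the sheetname keyword was renamed sheet_name in pandas 0.21, and the rename / set-index / drop steps can be collapsed into one chain. A minimal sketch:
# 'sheetname' became 'sheet_name' in pandas 0.21; older code needs updating.
energy = pd.read_excel('Energy Indicators.xls',
                       sheet_name='Energy',
                       header=16,
                       skiprows=[17],
                       skipfooter=38,
                       usecols=[2, 3, 4, 5])

# Rename the unnamed country column and promote it to the index in one chain.
energy = energy.rename(columns={'Unnamed: 2': 'Country'}).set_index('Country')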

How do I replace the values in the second dataframe based on the values in the first dataframe

I have two dataframes, i.e. df and df1.
df:
Product_name Name City
Rice Chetwynd Chetwynd, British Columbia, Canada
Wheat Yuma Yuma, AZ, United States
Sugar Dochra Singleton, New South Wales, Australia
Milk India Hyderabad, India
df1:
Product_ID Unique_ID Origin_From Deliver_To
231 125 Sugar Milk
598 125 Milk Wheat
786 125 Rice Sugar
568 125 Sugar Wheat
122 125 Wheat Rice
269 125 Milk Wheat
Final output (df2): take the "Origin_From" and "Deliver_To" values in df1, then search for each value in df; if found, replace the "Origin_From" and "Deliver_To" values in df1 with df['City'] plus the product name in parentheses. The output (df2) would look like below.
df2:
Product_ID unique_ID Origin_From Deliver_To
231 125 Singleton, New South Wales, Australia, (Sugar) Hyderabad, India, (Milk)
598 125 Hyderabad, India, (Milk) Yuma, AZ, United States, (Wheat)
786 125 Chetwynd, British Columbia, Canada, (Rice) Singleton, New South Wales, Australia, (Sugar)
568 125 Singleton, New South Wales, Australia, (Sugar) Yuma, AZ, United States, (Wheat)
122 125 Yuma, AZ, United States, (Wheat) Chetwynd, British Columbia, Canada, (Rice)
269 125 Hyderabad, India, (Milk) Yuma, AZ, United States, (Wheat)
I am struggling a bit with it so a couple of shoves in the right direction would really help.
Thanks in advance.
Setup
from io import StringIO
import pandas as pd
df_txt = """Product_name Name City
Rice Chetwynd Chetwynd, British Columbia, Canada
Wheat Yuma Yuma, AZ, United States
Sugar Dochra Singleton, New South Wales, Australia
Milk India Hyderabad, India"""
df1_txt = """Product_ID Unique_ID Origin_From Deliver_To
231 125 Sugar Milk
598 125 Milk Wheat
786 125 Rice Sugar
568 125 Sugar Wheat
122 125 Wheat Rice
269 125 Milk Wheat"""
df = pd.read_csv(StringIO(df_txt), sep=r'\s{2,}', engine='python')
df1 = pd.read_csv(StringIO(df1_txt), sep=r'\s{2,}', engine='python')
Solution
option 1
m = df.set_index('Product_name').City
df2 = df1.copy()
df2.Origin_From = df1.Origin_From.map(m) + ', (' + df1.Origin_From + ')'
df2.Deliver_To = df1.Deliver_To.map(m)+ ', (' + df1.Deliver_To + ')'
df2
option 2
m = df.set_index('Product_name').City
c = ['Origin_From', 'Deliver_To']
fnt = df1[c].stack()
df2 = df1.drop(columns=c).join(fnt.map(m).add(fnt.apply(', ({})'.format)).unstack())
option 3
using merge
c = ['Origin_From', 'Deliver_To']
ds = df1[c].stack().to_frame('Product_name')
ds['City'] = ds.merge(df)['City'].values
df2 = df1.drop(columns=c).join(ds.City.add(', (').add(ds.Product_name).add(')').unstack())
Deeper explanation of option 3
assign the target columns to variable c for convenience
use stack to convert the 2-column dataframe into a series object with a multi-index
anticipating that I'm going to merge, I use to_frame to convert the series object into a single-column dataframe; pd.merge only works on dataframes
more anticipation: I pass the name of the single column to the to_frame method. This will be a coincident column name that will be merged on.
add a column named 'City' that is the result of the merge. I add the values to the column with the values attribute in order to ignore the index of the resulting merge and focus on just the resulting values.
ds now has the index I want in its first level. I leave it stacked while I do some convenient string operations, then unstack. In this form, the indices are aligned and I can leverage join.
I hope that's clear.
