Renaming columns from a df in batch - python

I have a dataframe with GDP data. The first few columns contain important data about the countries (which I have already renamed the way I wanted), but then it goes into a long run of columns, one per year from 1960 to 2015, each holding that year's GDP. In addition, those columns' names got messed up and are named sequentially with the word 'Unnamed', i.e. 'Unnamed: 4', 'Unnamed: 5', etc.
My idea was to rename all the 'Unnamed' columns to each of the years (from 1960 to 2015), for example {'Unnamed: 4': 1960, 'Unnamed: 5': 1961, ...}. So I tried to write the code below:
GDP = pd.read_csv('world_bank.csv')
GDP = GDP.rename(columns={"Data Source": "Country", "World Development Indicators": "Country Code", "Unnamed: 2": "Indicator name", "Unnamed: 3": "Indicator Code"})
GDP = GDP.replace({'Data Source': {'Korea, Rep.': 'South Korea', 'Iran, Islamic Rep.': 'Iran', 'Hong Kong SAR, China': 'Hong Kong'}})
#Below is what I wrote to try to iterate through
GDP = GDP.rename(columns={["Unnamed: "+str(i)+": "+str(j) for i in range(4, 60) for j in range(1960, 2016)]})
But when I run that code it gives this error:
TypeError: unhashable type: 'list'
Any thoughts on how to do this?

You can directly use a dict comprehension in Python; the 1956 + i offset maps 'Unnamed: 4' to 1960, 'Unnamed: 5' to 1961, and so on:
GDP = GDP.rename(columns={"Unnamed: " + str(i): str(1956 + i) for i in range(4, 60)})
Note that rename returns a new DataFrame, so assign the result back as shown.

You should pass the rename function a dictionary with the existing column names as keys and the new names as values. You can see an example in the documentation.
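For completeness, here is a minimal runnable sketch of that dictionary approach, using zip to pair the 'Unnamed' columns with the years instead of computing the offset by hand (the toy frame below stands in for world_bank.csv):
import pandas as pd

# Toy frame with the same 'Unnamed: N' pattern as the question's CSV.
df = pd.DataFrame([[100.0, 110.0]], columns=['Unnamed: 4', 'Unnamed: 5'])

# zip pairs 'Unnamed: 4'..'Unnamed: 59' with 1960..2015 element by element;
# mapping keys that aren't actual columns are simply ignored by rename.
year_map = dict(zip(('Unnamed: ' + str(i) for i in range(4, 60)),
                    range(1960, 2016)))

df = df.rename(columns=year_map)  # assign back; rename returns a new frame
print(df.columns.tolist())        # [1960, 1961]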


pandas using apply method and sending column names?

I have a pandas dataframe where I want to convert each row to a json message to upload. I figured this would be a great use case for the apply method, but I'm having a slight problem: it's not sending the column names.
Here's my data:
Industry CountryName periodDate predicted
0 Advertising Agencies USA 1995.0 144565.060000
1 Advertising Agencies USA 1996.0 165903.120000
2 Advertising Agencies USA 2001.0 326320.740300
When I use apply I lose the column names (Industry, CountryName, periodDate, etc.):
def sendAggData(row):
    uploadDataJson = row.to_json(orient='records')
    print(json.loads(uploadDataJson))

aggValue.apply(sendAggData, axis=1)
I get this result:
['Advertising Agencies', 'USA', 1995.0, 144565.06]
['Advertising Agencies', 'USA', 1996.0, 165903.12]
['Advertising Agencies', 'USA', 2001.0, 326320.7403]
I want this as a json message, so I'd like the column names on it, something like {'Industry': 'Advertising Agencies', 'CountryName': 'USA', ...}. Previously I got this to work using a for loop over each row, but I was told that apply is the more pandas way :-) Any suggestions on how to use apply correctly?
You can just do:
df.apply(lambda x: x.to_json(), axis=1)
Which gives you:
0    {"Industry":"Advertising Agencies","CountryNa...
1    {"Industry":"Advertising Agencies","CountryNa...
2    {"Industry":"Advertising Agencies","CountryNa...
dtype: object
However, what about df.to_dict('records'), which gives:
[{'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 1995.0, 'predicted': 144565.06},
 {'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 1996.0, 'predicted': 165903.12},
 {'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 2001.0, 'predicted': 326320.7403}]
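For reference, a small self-contained sketch of the per-row to_json approach against the question's sample rows (for a row Series, to_json's default orient keeps the column names as keys):
import json
import pandas as pd

df = pd.DataFrame({
    'Industry': ['Advertising Agencies'] * 3,
    'CountryName': ['USA'] * 3,
    'periodDate': [1995.0, 1996.0, 2001.0],
    'predicted': [144565.06, 165903.12, 326320.7403],
})

# Each row arrives as a Series, so to_json() uses the column names as keys.
messages = df.apply(lambda row: row.to_json(), axis=1)
print(json.loads(messages.iloc[0]))
# {'Industry': 'Advertising Agencies', 'CountryName': 'USA',
#  'periodDate': 1995.0, 'predicted': 144565.06}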

Create a new column value based on another column content from a list

I am trying to create a new variable called "region" based on the names of countries in Africa in another variable. I have a list of all the African countries (two shown here as an example), but I am encountering errors.
def africa(x):
    if africalist in x:
        return 'African region'
    else:
        return 'Not African region'

df['region'] = ''
df.region = df.countries.apply(africa)
I'm getting:
TypeError: 'in <string>' requires string as left operand, not list
I recommend you see When should I want to use apply.
You could use:
df['region'] = df['countries'].isin(africalist).map({True: 'African region',
                                                     False: 'Not African region'})
or
df['region'] = np.where(df['countries'].isin(africalist),
                        'African region',
                        'Not African region')
Your condition is reversed.
The correct form is:
if element in list_of_elements:
So, changing the function africa results in:
def africa(x):
    if x in africalist:
        return 'African region'
    else:
        return 'Not African region'
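A quick runnable check of the corrected function, with a hypothetical two-country africalist standing in for the full list:
import pandas as pd

africalist = ['Nigeria', 'Kenya']  # hypothetical stand-in for the full list
df = pd.DataFrame({'countries': ['Nigeria', 'France', 'Kenya']})

def africa(x):
    if x in africalist:
        return 'African region'
    else:
        return 'Not African region'

df['region'] = df['countries'].apply(africa)
print(df['region'].tolist())
# ['African region', 'Not African region', 'African region']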

How to turn a list of a list of dictionaries into a dataframe via loop

I have a list of lists of dictionaries. I managed to access each list element within the outer list and convert its dictionary via pandas into a dataframe. I then save each DF and later concat them. That's a perfect result, but I need a loop to do that for big data.
Here is my MWE which works fine in principle.
import pandas as pd
mwe = [
[{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
[{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
[{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]
df0 = pd.DataFrame.from_dict(mwe[0])
df1 = pd.DataFrame.from_dict(mwe[1])
df2 = pd.DataFrame.from_dict(mwe[2])
frames = [df0, df1, df2]
result = pd.concat(frames)
It creates a nice table.
Here is what I tried to create a list of data frames:
for i in range(len(mwe)):
    frame = pd.DataFrame()
    frame = pd.DataFrame.from_dict(mwe[i])
    frames = []
    frames.append(frame)
Addendum: Thanks for all the answers. They work on my MWE, which made me notice that there are some strange entries in my dataset. No solution works for my full dataset, since I have an inner-list element that contains two dictionaries (due to non-unique data retrieval):
....
[{'name': 'United States Minor Outlying Islands', 'population': 300},
{'name': 'United States of America',
'population': 323947000,
'area': 9629091.0,
'gini': 48.0}],
...
How can I drop the entry for "United States Minor Outlying Islands"?
You could get each dict out of the containing list and just have a list of dict:
import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
       [{'name': 'Switzerland', 'population': 8341600, 'area': 41284.0, 'gini': 33.7}],
       [{'name': 'Australia', 'population': 24117360, 'area': 7692024.0, 'gini': 30.5}]]
# use x.pop() so that you aren't carrying around copies of the data
# for a "big data" application
df = pd.DataFrame([x.pop() for x in mwe])
df.head()
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
By bringing the list comprehension into the dataframe declaration, that list is temporary and you don't have to worry about the cleanup. pop also consumes the dictionaries out of mwe, minimizing the number of copies you are carrying around in memory.
As a note, when doing this, mwe will then look like:
mwe
[[], [], []]
because the contents of the sub-lists have been popped out.
EDIT: New Question Content
If your data contains duplicates, or at least entries you don't want, and the undesired entries don't have matching columns to the rest of the dataset (which appears to be the case), it becomes a bit trickier to avoid copying data as above:
mwe.append([{'name': 'United States Minor Outlying Islands', 'population': 300}, {'name': 'United States of America', 'population': 323947000, 'area': 9629091.0, 'gini': 48.0}])
key_check = {}.fromkeys(["name", "population", "area", "gini"])
# the easy way, but it copies the data
df = pd.DataFrame([item for data in mwe
                   for item in data
                   if item.keys() == key_check.keys()])
Since you'll still have the data hanging around in mwe, it might be better to use a generator:
def get_filtered_data(mwe):
    for data in mwe:
        while data:  # when data is empty, the while loop will end
            item = data.pop()  # still consumes data out of mwe
            if item.keys() == key_check.keys():
                yield item  # lazy evaluation minimizes data copying

df = pd.DataFrame([x for x in get_filtered_data(mwe)])
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
3 9629091.0 48.0 United States of America 323947000
Again, this is under the assumption that non-desired entries have invalid columns, which appears to be the case here, specifically. Otherwise, this will at least flatten out the data structure so you can filter it with pandas later, as sketched below.
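One way that later filtering could look, assuming the unwanted entries are exactly the ones missing 'area' and 'gini' (mwe as defined above, including the appended US entries):
import pandas as pd

# Flatten without mutating mwe, then let pandas drop the incomplete rows.
flat = pd.DataFrame([item for data in mwe for item in data])
df = flat.dropna(subset=['area', 'gini']).reset_index(drop=True)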
Create an empty DataFrame and loop over the list, calling df.append on each iteration:
>>> import pandas as pd
>>> mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
...        [{'name': 'Switzerland', 'population': 8341600, 'area': 41284.0, 'gini': 33.7}],
...        [{'name': 'Australia', 'population': 24117360, 'area': 7692024.0, 'gini': 30.5}]]
>>> df = pd.DataFrame()
>>> for country in mwe:
...     df = df.append(country)
...
>>> df
area gini name population
0 323802.0 25.8 Norway 5223256
0 41284.0 33.7 Switzerland 8341600
0 7692024.0 30.5 Australia 24117360
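One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the equivalent is a single pd.concat over the list (a sketch, reusing the same mwe):
import pandas as pd

# pandas >= 2.0: build the frame in one call instead of appending in a loop.
df = pd.concat([pd.DataFrame(country) for country in mwe], ignore_index=True)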
Try this:
df = pd.DataFrame(columns=['name', 'population', 'area', 'gini'])
for i in range(len(mwe)):
    df.loc[i] = list(mwe[i][0].values())
Output:
name population area gini
0 Norway 5223256 323802.0 25.8
1 Switzerland 8341600 41284.0 33.7
2 Australia 24117360 7692024.0 30.5

Inaccurate value in my Pandas dataframe column after reading an external Excel file

I have read the following file into a Pandas dataframe: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
I've viewed the file before in Excel, and cells contain the string '...' (exactly 3 dots) to represent missing values.
My problem is that after reading the file into a Pandas dataframe called 'energy', some of the missing values are no longer represented with '...' as defined in the Excel document, but rather a series of many more dots, for example: '.................................................'. This makes doing energy.replace('...', np.nan, inplace=True) inaccurate since not all missing values are being replaced.
Could anyone explain why this behavior is occurring, and what is the best way to go about correcting it with Pandas?
This is my code:
import pandas as pd
import numpy as np
import re
# Read excel file
energy = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
                       skiprows=17,
                       skipfooter=38)
# Drop the first 2 unnecessary columns
energy.drop(['Unnamed: 0', 'Unnamed: 1'], axis=1, inplace=True)
# Rename the remaining columns
col_names = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
energy.columns = col_names
# Convert energy supply to gigajoules
energy['Energy Supply'] = energy['Energy Supply'] * 1000000
# Replace missing values
energy.replace('...', np.nan, inplace=True)
# Replace country names according to provided to specifications
energy['Country'].replace({
    'Republic of Korea': 'South Korea',
    'China, Hong Kong Special Administrative Region': 'Hong Kong',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'United States of America': 'United States'
}, inplace=True)
energy.head()
The code above results in the following dataframe:
[Screenshot: DataFrame with unexpected value circled]
The first solution is to use the na_values parameter of read_excel:
energy = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
                       skiprows=17,
                       skipfooter=38,
                       na_values='...')
Another solution is replace with a regex. The pattern ^\.+$ replaces only values made up entirely of dots with NaN:
^ matches the start of the string
\. escapes the dot, because in a regex an unescaped dot matches any character
+ means one or more dots
$ matches the end of the string
energy.replace(r'^\.+$', np.nan, inplace=True, regex=True)
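A tiny demonstration of why the anchors matter (hypothetical values; a decimal like '0.5' contains characters other than dots, so it survives):
import numpy as np
import pandas as pd

s = pd.Series(['...', '.' * 50, '0.5'])
print(s.replace(r'^\.+$', np.nan, regex=True))
# 0    NaN
# 1    NaN
# 2    0.5
# dtype: object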
You should place
energy.replace('...', np.nan, inplace=True)
before
energy['Energy Supply'] = energy['Energy Supply'] * 1000000
since your column's datatype is object (string), and multiplying a string repeats it: '...' * 1000000 is a run of 3,000,000 dots, not a number.
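The effect is easy to reproduce in isolation (toy factor of 3 instead of 1000000):
print('...' * 3)  # ......... (string repetition, not arithmetic)
print(5 * 3)      # 15 (what you get once the column is numeric)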
You can use parameters within read_excel:
df = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
                   skiprows=17,
                   skipfooter=38,
                   na_values='...',
                   usecols='C:F',
                   names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'])
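Whichever variant you use, a quick sanity check (my assumption about the expected outcome, given a successful download) is that the column came out numeric and the dot runs became NaN:
print(df['Energy Supply'].dtype)         # float64 once the dots are NaN
print(df['Energy Supply'].isna().sum())  # number of missing entries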

Python/Pandas: creating new dataframe, gets error "unalignable boolean Series provided as indexer"

I am trying to compare two dataframes and return different result sets based on whether a value from one dataframe is present in the other.
Here is my sample code:
pmdf = pd.DataFrame(
    {
        'Journal': ['US Drug standards.', 'Acta veterinariae.', 'Bulletin of big toe science.', 'The UK journal of dermatology.', 'Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963', '8675-309J'],
    }
)
pmdf = pmdf[['Journal'] + [c for c in pmdf.columns if c != 'Journal']]  # put Journal first
jcrdf = pd.DataFrame(
    {
        'Full Journal Title': ['Drug standards.', 'Acta veterinaria.', 'Bulletin of marine science.', 'The British journal of dermatology.'],
        'Abbreviated Title': ['DStan', 'Avet', 'Marsci', 'BritSkin'],
        'Total Cites': ['223', '444', '324', '166'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
        'All_ISSNs': ['0096-0225,0096-0225', '0567-8315,1820-7448,0567-8315', '0007-4977,0007-4977', '0007-0963,0007-0963,0366-077X,1365-2133']
    })
jcrdf = jcrdf.set_index('Full Journal Title')
pmdf_issn = pmdf['ISSN'].values.tolist()
This line gets me the rows from dataframe jcrdf that contain an ISSN from dataframe pmdf:
pmjcrmatch = jcrdf[jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]
I wanted the following line to create a new dataframe of values from pmdf where the ISSN is not in jcrdf, so I negated the previous statement and indexed the first dataframe:
pmjcrnomatch = pmdf[~jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]
I get an error: "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)"
I don't find a lot about this specific error, at least nothing that is helping me toward a solution.
Is "str.contains" not the best way of sorting items that are and aren't in the second dataframe?
You are trying to apply the boolean index of one dataframe to another. This is only possible if the lengths of both dataframes match. In your case you should use isin.
# get all rows from jcrdf where `ALL_ISSNs` contains any of the `ISSN` in `pmdf`.
pmjcrmatch = jcrdf[jcrdf.All_ISSNs.str.contains('|'.join(pmdf.ISSN))]
# assign all remaining rows from `jcrdf` to a new dataframe.
pmjcrnomatch = jcrdf[~jcrdf.ISSN.isin(pmjcrmatch.ISSN)]
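To see why the original attempt failed: the boolean mask carries jcrdf's index (4 rows) but is used to index pmdf (5 rows), so pandas cannot align the two. A sketch, with pmdf and jcrdf as defined in the question:
mask = jcrdf['All_ISSNs'].str.contains('|'.join(pmdf['ISSN']))
print(len(mask), len(pmdf))  # 4 vs 5 -- the indexes don't line up
# pmdf[~mask]  # raises: Unalignable boolean Series provided as indexer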
EDIT
Let's try another approach:
First I'd create a lookup for all your ISSNs and then create the diff by isolating the matches:
import pandas as pd
pmdf = pd.DataFrame(
    {
        'Journal': ['US Drug standards.', 'Acta veterinariae.', 'Bulletin of big toe science.', 'The UK journal of dermatology.', 'Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963', '8675-309J'],
    }
)
pmdf = pmdf[['Journal'] + [c for c in pmdf.columns if c != 'Journal']]  # put Journal first
jcrdf = pd.DataFrame(
    {
        'Full Journal Title': ['Drug standards.', 'Acta veterinaria.', 'Bulletin of marine science.', 'The British journal of dermatology.'],
        'Abbreviated Title': ['DStan', 'Avet', 'Marsci', 'BritSkin'],
        'Total Cites': ['223', '444', '324', '166'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
        'All_ISSNs': ['0096-0225,0096-0225', '0567-8315,1820-7448,0567-8315', '0007-4977,0007-4977', '0007-0963,0007-0963,0366-077X,1365-2133']
    })
jcrdf = jcrdf.set_index('Full Journal Title')

# create a lookup from all ISSNs to avoid expensive string matching
jcrdf_lookup = pd.DataFrame(jcrdf['All_ISSNs'].str.split(',').tolist(),
                            index=jcrdf.ISSN).stack(level=0).reset_index(level=0)
# compare the ISSNs extracted from All_ISSNs with pmdf.ISSN
matches = jcrdf_lookup[jcrdf_lookup[0].isin(pmdf.ISSN)]
jcrdfmatch = jcrdf[jcrdf.ISSN.isin(matches.ISSN)]
jcrdfnomatch = pmdf[~pmdf.ISSN.isin(matches[0])]
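If I've read the sample data right, running this should leave exactly one unmatched journal in jcrdfnomatch:
print(jcrdfnomatch)
#                             Journal       ISSN
# 4  Journal of Hypothetical Journals  8675-309J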
