I'm playing about with Python and pandas.
I have created a DataFrame with a column (axis 1) called 'County', but I need to create a column called 'Region' and populate it like this (at least I think):
If the County column is 'Suffolk', 'Norfolk', or 'Essex', insert 'East Anglia' in the Region column.
If the County column is 'Kent', 'East Sussex', or 'West Sussex', insert 'South East' in the Region column.
If the County column is 'Dorset', 'Devon', or 'Cornwall', insert 'South West' in the Region column.
and so on...
So far I have this:
myDataFrame['Region'] = np.where(myDataFrame['County']=='Suffolk', 'East Anglia', '')
But I suspect this won't work for any other counties.
As I'm sure is obvious, I am a beginner. I have tried googling and reading, but could only find out about numpy's where, which got me this far.
You'll want df.isin and loc-based indexing here:
import numpy as np

df['Region'] = np.nan  # start with NaN so unmatched counties stay missing
df.loc[df.County.isin(['Suffolk', 'Norfolk', 'Essex']), 'Region'] = 'East Anglia'
df.loc[df.County.isin(['Kent', 'East Sussex', 'West Sussex']), 'Region'] = 'South East'
df.loc[df.County.isin(['Dorset', 'Devon', 'Cornwall']), 'Region'] = 'South West'
You could also create a mapping and use Series.map or df.replace:
mapping = {'Suffolk': 'East Anglia', 'Norfolk': 'East Anglia', ..., 'Kent': 'South East', ..., ...}
df['Region'] = df.County.map(mapping)
I would prefer map here because it converts non-matches to NaN, which is usually what you want.
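For illustration, a minimal end-to-end sketch of the map approach, with only a few counties filled in (extend the dict with your own):

import pandas as pd

# Illustrative data and a partial mapping -- not the asker's real DataFrame
myDataFrame = pd.DataFrame({'County': ['Suffolk', 'Kent', 'Devon', 'Cumbria']})
mapping = {'Suffolk': 'East Anglia', 'Norfolk': 'East Anglia', 'Essex': 'East Anglia',
           'Kent': 'South East', 'East Sussex': 'South East', 'West Sussex': 'South East',
           'Dorset': 'South West', 'Devon': 'South West', 'Cornwall': 'South West'}

myDataFrame['Region'] = myDataFrame['County'].map(mapping)
print(myDataFrame)  # 'Cumbria' has no entry in the dict, so its Region is NaN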
Given the following df with street names:
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
And df2, which contains street names and their corresponding county:
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
How can I create a column that tells me the state for each street in df, by pairing df['street1'] with df2['street2']? The match does not have to be perfect; it only needs to match at least one word.
The following dataframe is an example of what I want to obtain:
desiredoutput = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown'], 'state': ["Utuado", "NA", "NA", "Bayamon"]})
Maybe a naive approach, but it works well.
import pandas as pd

df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})

output = {'street1': [], 'county': []}
streets1 = df['street1']
streets2 = df2['street2']
county = df2['county']

for street in streets1:
    count = 0  # reset the match flag for each street
    for index, street2 in enumerate(streets2):
        if street2 in street:  # substring match
            output['street1'].append(street)
            output['county'].append(county[index])
            count = 1
    if count == 0:  # no match was found, so record 'NA'
        output['street1'].append(street)
        output['county'].append('NA')

print(output)
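For comparison, a vectorized sketch of the same substring matching using str.extract and map, assuming each street matches at most one entry in df2 and the street names contain no regex metacharacters:

import pandas as pd

df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})

# Build a regex of the known street names and pull out the first one that appears
pattern = '(' + '|'.join(df2['street2']) + ')'  # '(Angeles|Caguana|Levitown)'
matched = df['street1'].str.extract(pattern, expand=False)

# Map the matched name to its county; streets with no match stay NaN
df['county'] = matched.map(df2.set_index('street2')['county'])
print(df)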
I have a dataframe with GDP data. The first few columns contain important data about the countries (which I have renamed the way I wanted), but it then goes into a long list of columns, one per year from 1960 to 2015, with each year's GDP. In addition, the columns' names have got messed up and they are named sequentially with the word 'Unnamed', i.e. 'Unnamed: 4', 'Unnamed: 5', etc.
My idea was to rename all the 'Unnamed' columns to each of the years (from 1960 to 2015), for example {'Unnamed: 4': 1960, 'Unnamed: 5': 1961, etc.}. So I tried to write the code below:
GDP = pd.read_csv('world_bank.csv')
GDP = GDP.rename(columns={"Data Source": "Country", "World Development Indicators": "Country Code", "Unnamed: 2": "Indicator name", "Unnamed: 3": "Indicator Code"})
GDP = GDP.replace({'Data Source': {'Korea, Rep.': 'South Korea', 'Iran, Islamic Rep.': 'Iran', 'Hong Kong SAR, China': 'Hong Kong'}})
#Below is what I wrote to try to iterate through
GDP = GDP.rename(columns={["Unnamed: "+str(i)+": "+str(j) for i in range(4, 60) for j in range(1960, 2016)]})
But when I use that code it gives this error:
TypeError: unhashable type: 'list'
Any thoughts how to do this?
You can use a dict comprehension directly in Python:
GDP = GDP.rename(columns={"Unnamed: " + str(i): str(1956 + i) for i in range(4, 60)})  # i = 4 -> 1960, ..., i = 59 -> 2015
You should pass a dictionary to rename containing the existing column names as keys and the new names as values. You can see an example in the documentation.
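Equivalently, the two ranges from the question can be paired with zip instead of being nested in one comprehension (a sketch, assuming the year columns really do start at 'Unnamed: 4' and run through 2015):

# Pair each 'Unnamed: i' column with its year, then rename and assign back
year_map = {'Unnamed: ' + str(i): str(year)
            for i, year in zip(range(4, 60), range(1960, 2016))}
GDP = GDP.rename(columns=year_map)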
I am trying to create a new variable called "region" based on the names of countries in Africa in another variable. I have a list of all the African countries (two shown here as an example), but I am encountering errors.
def africa(x):
    if africalist in x:
        return 'African region'
    else:
        return 'Not African region'

df['region'] = ''
df.region = df.countries.apply(africa)
I'm getting:
'in <string>' requires string as left operand, not list
I recommend you see When should I want to use apply.
You could use:
df['region'] = df['countries'].isin(africalist).map({True:'African region',
False:'Not African region'})
or
df['region'] = np.where(df['countries'].isin(africalist),
'African region',
'Not African region')
Your condition is the wrong way round. The correct form is:
if element in list_of_elements:
So, changing the function africa results in:
def africa(x):
    if x in africalist:
        return 'African region'
    else:
        return 'Not African region'
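A quick usage sketch, with a hypothetical two-country africalist standing in for the full list:

import pandas as pd

africalist = ['Nigeria', 'Kenya']  # hypothetical stand-in for the real list
df = pd.DataFrame({'countries': ['Nigeria', 'France', 'Kenya']})

df['region'] = df['countries'].apply(africa)
# Nigeria and Kenya -> 'African region', France -> 'Not African region'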
A 'Continent' column is added to an existing DataFrame using a dictionary keyed by the country names in the DataFrame.
I am trying to group the data frame by the 'Continent' column.
I have tried the following:
def answer_eleven():
    Top15 = answer_one()
    ContinentDict = {'China': 'Asia',
                     'United States': 'North America',
                     'Japan': 'Asia',
                     'United Kingdom': 'Europe',
                     'Russian Federation': 'Europe',
                     'Canada': 'North America',
                     'Germany': 'Europe',
                     'India': 'Asia',
                     'France': 'Europe',
                     'South Korea': 'Asia',
                     'Italy': 'Europe',
                     'Spain': 'Europe',
                     'Iran': 'Asia',
                     'Australia': 'Australia',
                     'Brazil': 'South America'}
    ContinentDict = pd.Series(ContinentDict)
    Top15 = Top15.assign(Continent=ContinentDict)
    Top15 = Top15.groupby('Continent')
    return Top15

answer_eleven()
However, the output I get is:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000021C9C3BC6D8>
A way to display a groupby object is:
data = answer_eleven()
for key, item in data:
    print(data.get_group(key), "\n")
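A GroupBy object is lazy, so printing it only shows the object's address; it becomes a regular DataFrame or Series once you aggregate. A short sketch of that (the aggregations here are illustrative, not part of the assignment):

grouped = answer_eleven()
print(grouped.size())                   # number of countries per continent
print(grouped.mean(numeric_only=True))  # mean of each numeric column per continent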
I'm currently trying to build a model to predict which 'award' people in my subset will receive.
I'm getting a key error with 'award' but I'm not sure why.
Here is my code (the error is on line 2):
subset = pd.get_dummies(subset)        # one-hot encoding
labels = np.array(subset['award'])     # labels = the value to predict
subset = subset.drop('award', axis=1)  # remove labels from subset, axis 1 = columns
subset_list = list(subset.columns)     # save the column names for later use
subset = np.array(subset)              # convert to a numpy array
The 'award' column typically contains values like Best Director, Best Actor, etc.
An example of a row in subset is:
           birthplace         DOB         race   award
Id
670454353  Chisinau, Moldova  30/09/1895  White  Best Director
Before pd.get_dummies, the columns are:
Index(['birthplace', 'date_of_birth', 'race_ethnicity', 'year_of_award',
'award', 'ldob', 'year', 'award_age', 'country', 'bin'],
dtype='object')
After pd.get_dummies(subset), they are:
Index(['year_of_award', 'ldob', 'year', 'award_age',
'birthplace_Arlington, Va, US', 'birthplace_Astoria, Ny, US',
'birthplace_Athens, Ga, US', 'birthplace_Athens, Greece',
'birthplace_Atlanta, Ga, US', 'birthplace_Baldwin, Ny, US',
...
'country_ Turkey', 'country_ US', 'country_ Ukraine', 'country_ Wales',
'bin_0-25', 'bin_25-35', 'bin_35-45', 'bin_45-55', 'bin_55-65',
'bin_65-75'],
Input:
check_cols = [col for col in subset.columns if 'award' in col]
Output:
['year_of_award', 'award_age', 'award_Best Actor', 'award_Best Actress',
 'award_Best Director', 'award_Best Supporting Actor',
 'award_Best Supporting Actress']
If I try referencing any of the above in place of award, I get the same error.
The KeyError means the key 'award' does not exist in subset, so you will want to check how subset is structured before indexing it. Judging by the column listings above, pd.get_dummies has split the original 'award' column into one-hot columns such as 'award_Best Actor', so there is no longer a plain 'award' column to select.
If you provide a little more code on how subset is built I may be able to help further.
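In the meantime, one plausible fix is to pull the labels out before one-hot encoding, so 'award' still exists when you index it. A sketch under that assumption (features and feature_list are new names, not from the original code):

import numpy as np
import pandas as pd

labels = np.array(subset['award'])       # grab the target while 'award' still exists
features = subset.drop('award', axis=1)  # drop the target from the features
features = pd.get_dummies(features)      # one-hot encode only the features
feature_list = list(features.columns)    # save the column names for later use
features = np.array(features)            # convert to a numpy array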