Model building keyword not recognised - python

I'm currently trying to build a model to predict which 'award' people in my subset will receive.
I'm getting a KeyError with 'award', but I'm not sure why.
Here is my code (the error is raised on line 2):
subset = pd.get_dummies(subset)        # one-hot encoding
labels = np.array(subset['award'])     # labels = the value to predict
subset = subset.drop('award', axis=1)  # remove labels from subset; axis=1 means columns
subset_list = list(subset.columns)     # save column names for later use
subset = np.array(subset)              # convert to numpy array
'award' typically contains values such as Best Director, Best Actor, etc.
An example of a row in subset is:
           birthplace         DOB         race   award
Id
670454353  Chisinau, Moldova  30/09/1895  White  Best Director
Columns before pd.get_dummies:
Index(['birthplace', 'date_of_birth', 'race_ethnicity', 'year_of_award',
'award', 'ldob', 'year', 'award_age', 'country', 'bin'],
dtype='object')
Columns after pd.get_dummies(subset):
Index(['year_of_award', 'ldob', 'year', 'award_age',
'birthplace_Arlington, Va, US', 'birthplace_Astoria, Ny, US',
'birthplace_Athens, Ga, US', 'birthplace_Athens, Greece',
'birthplace_Atlanta, Ga, US', 'birthplace_Baldwin, Ny, US',
...
'country_ Turkey', 'country_ US', 'country_ Ukraine', 'country_ Wales',
'bin_0-25', 'bin_25-35', 'bin_35-45', 'bin_45-55', 'bin_55-65',
'bin_65-75'],
Input:
check_cols = [col for col in subset.columns if 'award' in col]
Output:
['year_of_award', 'award_age', 'award_Best Actor', 'award_Best Actress',
 'award_Best Director', 'award_Best Supporting Actor',
 'award_Best Supporting Actress']
If I try referencing any of the above in place of 'award', I get the same error.

The KeyError means the key 'award' does not exist in subset. You will want to check how subset is structured in order to access it correctly; right now there is no element 'award' in there.
If you provide a little more code on how subset is built, I may be able to help further.
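For what it's worth, the question itself shows the likely cause: pd.get_dummies replaces the 'award' column with one dummy column per category ('award_Best Actor', 'award_Best Director', ...), so the key 'award' no longer exists afterwards. A minimal sketch of the usual fix, assuming the goal is to keep 'award' as the label and one-hot encode only the features:
import numpy as np
import pandas as pd

# 'subset' is the dataframe from the question, before any encoding
labels = np.array(subset['award'])       # extract the labels *before* encoding
features = subset.drop('award', axis=1)  # everything except the label
features = pd.get_dummies(features)      # one-hot encode the features only
feature_list = list(features.columns)    # save column names for later use
features = np.array(features)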

Is there a way to index a list using a series without using a loop?

Result = pd.DataFrame({
    'File': filenames_,
    'Actual Classes': Actual_classes,
    'Predicted Classes': Predicted_classes
})
Result.sample(frac=0.02)
Actual Classes and Predicted Classes are integer values ranging from 1 to 8. I want to create a new column in the dataframe using this list of 9 strings:
['Black Sea Sprat', 'Gilt-Head Bream', 'Hourse Mackerel', 'Red Mullet',
'Red Sea Bream', 'Sea Bass', 'Shrimp', 'Striped Red Mullet', 'Trout']
That is, I want to index into the list with the values in the dataframe without using a loop, using a built-in pandas function instead.
I actually want a new column added to the dataframe, where each row's value is the list entry at that row's class index.
How about using apply?
classes = ['Black Sea Sprat', 'Gilt-Head Bream', 'Hourse Mackerel', 'Red Mullet',
           'Red Sea Bream', 'Sea Bass', 'Shrimp', 'Striped Red Mullet', 'Trout']
Result['class name'] = Result['Predicted Classes'].apply(lambda x: classes[x])
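If a fully vectorized version is preferred, numpy fancy indexing does the same thing without a per-row Python call (assuming, as above, that every value in 'Predicted Classes' is a valid index into classes):
import numpy as np

# index the class-name array with the whole column at once
Result['class name'] = np.array(classes)[Result['Predicted Classes']]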

Why are there no duplicates in pandas dataframe.index?

I just wrote a program for college using pandas to structure some unstructured data. I definitely made it harder than it needed to be, but I ended up finding something interesting.
Here is the data I parsed:
Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)
22.
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director.
Website: www.mariateresasbabies.com
Serving children 6 wks to 5yrs full-time.
National Science Foundation Child Development Center
23.
4201 Wilson Blvd., Suite 180 22203
703-292-4794
Website: www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years.
Here is the (aggressively commented for school) code, which is mostly irrelevant but included for completeness' sake:
import csv
import pandas as pd

lines = []
"""opening the raw data from a text file"""
with open('raw_data.txt') as f:
    lines = f.readlines()
"""removing newline characters"""
for i in range(len(lines)):
    lines[i] = lines[i].rstrip('\n')
df = pd.DataFrame(lines, columns=['info'],
                  index=['business type', 'address', 'location',
                         'phone number', 'unknown', 'owner', 'website', 'description',
                         'null', 'business type', 'unknown', 'address', 'phone number',
                         'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
"""here I am taking every column and adding corresponding values from the original dataframe.
extra data frames could be garbage collected but this serves for demonstration"""
df.index = df.index.astype('str')
df1 = df[df.index.str.contains('bus')]
df2 = df[df.index.str.contains('address')]
df3 = df[df.index.str.contains('location')]
df4 = df[df.index.str.contains('number')]
df5 = df[df.index.str.contains('know')]
df6 = df[df.index.str.contains('owner')]
df7 = df[df.index.str.contains('site')]
df8 = df[df.index.str.contains('descript')]
df9 = df[df.index.str.contains('null')]
for i in range(len(df1)):
    df['business type'][i] = df1['info'][i]
for i in range(len(df2)):
    df['address'][i] = df2['info'][i]
for i in range(len(df3)):
    df['location'][i] = df3['info'][i]
for i in range(len(df4)):
    df['phone number'][i] = df4['info'][i]
for i in range(len(df5)):
    df['unknown'][i] = df5['info'][i]
for i in range(len(df6)):
    df['owner'][i] = df6['info'][i]
for i in range(len(df7)):
    df['website'][i] = df7['info'][i]
for i in range(len(df8)):
    df['description'][i] = df8['info'][i]
for i in range(len(df9)):
    df['null'][i] = df9['info'][i]
"""dropping unnecessary columns"""
df.drop(columns='info', inplace=True)
df.drop(columns='null', inplace=True)
df.drop(columns='unknown', inplace=True)
"""changing the index values to int to make it easier to drop unused rows"""
idx = []
for i in range(0, len(df)):
    idx.append(i)
df.index = idx
"""dropping unused rows"""
for i in range(2, 15):
    df.drop([i], inplace=True)
"""writing to csv and printing to console"""
df.to_csv("new.csv", index=False)
print(df.to_string())
I'm just curious why, when I create more columns named after the items in the index here:
df = pd.DataFrame(lines, columns=['info'],
                  index=['business type', 'address', 'location',
                         'phone number', 'unknown', 'owner', 'website', 'description',
                         'null', 'business type', 'unknown', 'address', 'phone number',
                         'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
the resulting set of columns doesn't contain any duplicates.
When I add
print(df.columns)
I get the output
Index(['info', 'business type', 'address', 'location', 'phone number',
'unknown', 'owner', 'website', 'description', 'null'],
dtype='object')
I'm generally curious why there are no duplicates, since I'm sure duplicates could be problematic in certain situations. Pandas is also interesting; I hardly understand it and would like to know more. And if you feel extra enthusiastic, any info on a more efficient way to do this would be greatly appreciated, but if not, no worries, I'll eventually read the docs.
The pandas DataFrame is designed for tabular data in which all the entries in any one column have the same type (e.g. integer or string). One row usually represents one instance, sample, or individual. So the natural way to parse your data into a DataFrame is to have two rows, one for each institution, and to define the columns as what you have called the index (perhaps with the address split into several columns), e.g. business type, street, city, state, post code, phone number, etc.
There would then be one row per institution, and the index would assign a unique identifier to each of them. That's why it's desirable for the index to contain no duplicates.
As for why df.columns ends up with no duplicates: df[name] = '' is dict-style assignment, so when a column called name already exists it is overwritten rather than added a second time; only the first assignment for each repeated index label actually creates a new column.
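A minimal sketch of that layout, using the two institutions from the question (the field names are illustrative, not prescribed):
import pandas as pd

records = [
    {'business type': 'Center/Daycare',
     'address': '825 23rd Street South',
     'location': 'Arlington, VA 22202',
     'phone number': '703-979-BABY (2229)',
     'owner': 'Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director.',
     'website': 'www.mariateresasbabies.com',
     'description': 'Serving children 6 wks to 5yrs full-time.'},
    {'business type': 'National Science Foundation Child Development Center',
     'address': '4201 Wilson Blvd., Suite 180 22203',
     'phone number': '703-292-4794',
     'website': 'www.brighthorizons.com',
     'description': '112 children, ages 6 wks - 5 yrs.'},
]
df = pd.DataFrame(records)  # the default RangeIndex gives each row a unique id
print(df)  # fields missing for one institution simply become NaN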

Create categorical column based on string values

I have kind of a simple problem, but I'm having trouble achieving what I want. I have a district column, with 32 different values for all districts in a city. I want to create a column "sector" that says which sector that district belongs to.
I thought the obvious approach was through a dictionary and map, but couldn't make it work:
sectores = {
    'sector oriente': ['Vitacura', 'Las Condes', 'Lo Barnechea', 'La Reina', 'Ñuñoa', 'Providencia'],
    'sector suroriente': ['Peñalolén', 'La Florida', 'Macul'],
    'sector sur': ['La Granja', 'La Pintana', 'Lo Espejo', 'San Ramón', 'La Cisterna', 'El Bosque', 'Pedro Aguirre Cerda', 'San Joaquín', 'San Miguel'],
    'sector surponiente': ['Maipú', 'Estación Central', 'Cerrillos'],
    'sector norponiente': ['Cerro Navia', 'Lo Prado', 'Pudahuel', 'Quinta Normal', 'Renca'],
    'sector norte': ['Conchalí', 'Huechuraba', 'Independencia', 'Recoleta', 'Quilicura'],
    'sector centro': ['Santiago'],
}
I noticed I needed to switch keys and values:
sectores = dict((y, x) for x, y in sectores.items())
Then I tried to map it:
df['sectores'] = df['district'].map(sectores)
But I'm getting:
TypeError: unhashable type: 'list'
Is this the right approach? Should I try something else?
Thanks in advance!
Edit: This is what df['district'] looks like:
district
Maipú
Quilicura
Independencia
Conchalí
...
You are trying to use lists as the keys in your dict, which is not possible because lists are mutable and not hashable.
Instead, use the strings by iterating through the values:
sectores = {i: k for k, v in sectores.items() for i in v}
Then you can use pd.Series.map, and
df['sectores'] = df['district'].map(sectores)
should work.
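A quick end-to-end check, using a hypothetical dataframe built from the sample districts in the question and an abbreviated version of the dict:
import pandas as pd

df = pd.DataFrame({'district': ['Maipú', 'Quilicura', 'Independencia', 'Conchalí']})
sectores = {'sector surponiente': ['Maipú', 'Estación Central', 'Cerrillos'],
            'sector norte': ['Conchalí', 'Huechuraba', 'Independencia', 'Recoleta', 'Quilicura']}
lookup = {i: k for k, v in sectores.items() for i in v}  # invert: district -> sector
df['sectores'] = df['district'].map(lookup)
print(df)  # each district is now labelled with its sector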

Python/Pandas: creating new dataframe, gets error "unalignable boolean Series provided as indexer"

I am trying to compare two dataframes and return different result sets based on whether a value from one dataframe is present in the other.
Here is my sample code:
pmdf = pd.DataFrame(
    {
        'Journal': ['US Drug standards.', 'Acta veterinariae.', 'Bulletin of big toe science.',
                    'The UK journal of dermatology.', 'Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963', '8675-309J'],
    }
)
pmdf = pmdf[['Journal'] + pmdf.columns[:-1].tolist()]
jcrdf = pd.DataFrame(
    {
        'Full Journal Title': ['Drug standards.', 'Acta veterinaria.', 'Bulletin of marine science.',
                               'The British journal of dermatology.'],
        'Abbreviated Title': ['DStan', 'Avet', 'Marsci', 'BritSkin'],
        'Total Cites': ['223', '444', '324', '166'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
        'All_ISSNs': ['0096-0225,0096-0225', '0567-8315,1820-7448,0567-8315', '0007-4977,0007-4977',
                      '0007-0963,0007-0963,0366-077X,1365-2133'],
    }
)
jcrdf = jcrdf.set_index('Full Journal Title')
pmdf_issn = pmdf['ISSN'].values.tolist()
This line gets me the rows from dataframe jcrdf whose All_ISSNs contain an ISSN from dataframe pmdf:
pmjcrmatch = jcrdf[jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]
I wanted the following line to create a new dataframe of values from pmdf where the ISSN is not in jcrdf, so I negated the previous statement and selected from the first dataframe:
pmjcrnomatch = pmdf[~jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]
I get an error: "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)"
I don't find a lot about this specific error, at least nothing that is helping me toward a solution.
Is "str.contains" not the best way of sorting items that are and aren't in the second dataframe?
You are trying to apply the boolean index of one dataframe to another. This is only possible if the indexes (and hence lengths) of both dataframes match. In your case you should use isin instead.
# get all rows from jcrdf where `All_ISSNs` contains any of the `ISSN`s in `pmdf`
pmjcrmatch = jcrdf[jcrdf.All_ISSNs.str.contains('|'.join(pmdf.ISSN))]
# assign all remaining rows from `jcrdf` to a new dataframe
pmjcrnomatch = jcrdf[~jcrdf.ISSN.isin(pmjcrmatch.ISSN)]
EDIT
Let's try another approach:
First I'd create a lookup for all your ISSNs and then create the diff by isolating the matches:
import pandas as pd

pmdf = pd.DataFrame(
    {
        'Journal': ['US Drug standards.', 'Acta veterinariae.', 'Bulletin of big toe science.',
                    'The UK journal of dermatology.', 'Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963', '8675-309J'],
    }
)
pmdf = pmdf[['Journal'] + pmdf.columns[:-1].tolist()]
jcrdf = pd.DataFrame(
    {
        'Full Journal Title': ['Drug standards.', 'Acta veterinaria.', 'Bulletin of marine science.',
                               'The British journal of dermatology.'],
        'Abbreviated Title': ['DStan', 'Avet', 'Marsci', 'BritSkin'],
        'Total Cites': ['223', '444', '324', '166'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
        'All_ISSNs': ['0096-0225,0096-0225', '0567-8315,1820-7448,0567-8315', '0007-4977,0007-4977',
                      '0007-0963,0007-0963,0366-077X,1365-2133'],
    }
)
jcrdf = jcrdf.set_index('Full Journal Title')
# create a lookup of all ISSNs to avoid expensive string matching
jcrdf_lookup = pd.DataFrame(jcrdf['All_ISSNs'].str.split(',').tolist(),
                            index=jcrdf.ISSN).stack(level=0).reset_index(level=0)
# compare the ISSNs extracted from All_ISSNs with pmdf.ISSN
matches = jcrdf_lookup[jcrdf_lookup[0].isin(pmdf.ISSN)]
jcrdfmatch = jcrdf[jcrdf.ISSN.isin(matches.ISSN)]
jcrdfnomatch = pmdf[~pmdf.ISSN.isin(matches[0])]
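Not part of the original answer, but for what it's worth: on pandas >= 0.25 the same lookup can be built more directly with Series.explode. A sketch under that assumption:
# one ISSN per row, index = the journal title it came from
all_issns = jcrdf['All_ISSNs'].str.split(',').explode()
matched_titles = all_issns[all_issns.isin(pmdf['ISSN'])].index.unique()
jcrdfmatch = jcrdf.loc[matched_titles]              # jcrdf rows with at least one match
jcrdfnomatch = pmdf[~pmdf['ISSN'].isin(all_issns)]  # pmdf rows matching nothing in jcrdf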

Populate new column based on conditions in another column

I'm playing about with Python and pandas.
I have created a dataframe with a column (axis 1) called 'County', and I need to create a column called 'Region' and populate it like this (at least I think):
If County column == 'Suffolk' or 'Norfolk' or 'Essex' then in Region column insert 'East Anglia'
If County column == 'Kent' or 'East Sussex' or 'West Sussex' then in Region Column insert 'South East'
If County column == 'Dorset' or 'Devon' or 'Cornwall' then in Region Column insert 'South West'
and so on...
So far I have this:
myDataFrame['Region'] = np.where(myDataFrame['County'] == 'Suffolk', 'East Anglia', '')
But I suspect this won't work for any other counties.
As I'm sure is obvious, I am a beginner. I have tried googling and reading, but could only find out about numpy's where, which got me this far.
You'll definitely need df.isin and loc-based indexing:
df['Region'] = np.nan
df.loc[df.County.isin(['Suffolk','Norfolk', 'Essex']), 'Region'] = 'East Anglia'
df.loc[df.County.isin(['Kent', 'East Sussex', 'West Sussex']), 'Region'] = 'South East'
df.loc[df.County.isin(['Dorset', 'Devon', 'Cornwall']), 'Region'] = 'South West'
You could also create a mapping of sorts and use df.map or df.replace:
mapping = {'Suffolk': 'East Anglia', 'Norfolk': 'East Anglia', ..., 'Kent': 'South East', ...}
df['Region'] = df.County.map(mapping)
I would prefer map here because it converts non-matches to NaN, which is the ideal behaviour.
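To avoid writing the county-to-region mapping out by hand, it can also be built from lists grouped by region with a dict comprehension. A sketch using the groupings stated in the question:
regions = {
    'East Anglia': ['Suffolk', 'Norfolk', 'Essex'],
    'South East': ['Kent', 'East Sussex', 'West Sussex'],
    'South West': ['Dorset', 'Devon', 'Cornwall'],
}
mapping = {county: region for region, counties in regions.items() for county in counties}
myDataFrame['Region'] = myDataFrame['County'].map(mapping)  # non-matches become NaN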
