Create categorical column based on string values - python

I have a fairly simple problem, but I'm having trouble achieving what I want. I have a "district" column with 32 different values, one for each district in a city. I want to create a column "sector" that says which sector each district belongs to.
I thought the obvious approach was through a dictionary and map, but couldn't make it work:
sectores={'sector oriente':['Vitacura', 'Las Condes', 'Lo Barnechea', 'La Reina','Ñuñoa','Providencia'],
'sector suroriente':['Peñalolén','La Florida', 'Macul'],
'sector sur': ['La Granja','La Pintana','Lo Espejo','San Ramón','La Cisterna','El Bosque','Pedro Aguirre Cerda','San Joaquín','San Miguel'],
'sector surponiente':['Maipú','Estación Central','Cerrillos'],
'sector norponiente':['Cerro Navia','Lo Prado','Pudahuel','Quinta Normal','Renca'],
'sector norte':['Conchalí','Huechuraba','Independencia','Recoleta','Quilicura'],
'sector centro':['Santiago']}
Noticed I needed to switch keys and values:
sectores = dict((y,x) for x,y in sectores.items())
Then tried to map it:
df['sectores']=df['district'].map(sectores)
But I'm getting:
TypeError: unhashable type: 'list'
Is this the right approach? Should I try something else?
Thanks in advance!
Edit: This is what df['district'] looks like:
district
Maipú
Quilicura
Independencia
Conchalí
...

You are trying to use lists as the keys in your dict, which is not possible because lists are mutable and therefore not hashable.
Instead, use the individual strings as keys by iterating through the values:
sectores = {i: k for k, v in sectores.items() for i in v}
Then, you can use pd.Series.map and
df['sectores'] = df['district'].map(sectores)
should work.
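For reference, a runnable sketch of the whole fix, using a trimmed two-sector sample of the dictionary:

```python
import pandas as pd

# Trimmed sample of the {sector: [districts]} mapping
sectores = {'sector surponiente': ['Maipú', 'Estación Central', 'Cerrillos'],
            'sector norte': ['Conchalí', 'Huechuraba', 'Independencia', 'Recoleta', 'Quilicura']}

# Invert it into {district: sector}, so every key is a hashable string
district_to_sector = {d: sector for sector, districts in sectores.items() for d in districts}

df = pd.DataFrame({'district': ['Maipú', 'Quilicura', 'Independencia', 'Conchalí']})
df['sectores'] = df['district'].map(district_to_sector)
print(df)
```

Any district missing from the dictionary comes out as NaN, which is a quick way to spot typos such as several districts glued into one string.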

Related

Add keys from dicts (in column) to new column

I have a DataFrame with a 'budgetYearMap' column, which has 1-3 key-value pairs for each record. I'm a bit stuck as to how I'm supposed to make a new column containing only the keys of the "budgetYearMap" column.
Sample data below:
df_sample = pd.DataFrame({'identifier': ['BBI-2016-D02', 'BBI-2016-D03', 'BBI-2016-D04', 'BBI-2016-D05', 'BBI-2016-D06'],
'callIdentifier': ['H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016'],
'budgetYearMap': [{'0': 188650000}, {'2017': 188650000}, {'2015': 188650000}, {'2014': 188650000}, {'2020': 188650000, '2014': 188650000, '2012': 188650000}]
})
First I tried to extract the keys by position, then make a list out of them and add the list to the dataframe. As some records contained multiple keys (I then found out), this approach failed.
all_keys = [i for s in [list(d.keys()) for d in df_sample.budgetYearMap] for i in s]
df_TD_selected['budgetYear'] = all_keys
My problem is that extracting the keys by "name" wouldn't work either, given that the names of the keys are variable, and I do not know the set of years in advance. The data set will keep growing. It can be either 0 or a year within the 2000 range now, but in the future more years will be added.
My desired output would be:
df_output = pd.DataFrame({'identifier': ['BBI-2016-D02', 'BBI-2016-D03', 'BBI-2016-D04', 'BBI-2016-D05', 'BBI-2016-D06'],
'callIdentifier': ['H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016'],
'Year': ['0', '2017', '2015', '2014', '2020, 2014, 2012']
})
Any idea how I should approach this?
Perfect pipeline use-case.
df = (
    df_sample
    .assign(Year=df_sample['budgetYearMap'].apply(lambda s: list(s.keys())))
    .drop(columns=['budgetYearMap'])
)
.assign creates a new column, which takes the 'budgetYearMap' Series and applies the lambda function to it. This returns the dictionary's keys as a list. If you prefer a string (as in your desired output), simply replace the lambda function with
lambda s: ', '.join(list(s.keys()))
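Put together with a trimmed copy of the sample data, the string variant looks like this (on Python 3.7+, dict keys keep insertion order, so the multi-year row comes out as '2020, 2014, 2012'):

```python
import pandas as pd

# Trimmed sample: one single-year row, one multi-year row
df_sample = pd.DataFrame({
    'identifier': ['BBI-2016-D02', 'BBI-2016-D06'],
    'budgetYearMap': [{'0': 188650000},
                      {'2020': 188650000, '2014': 188650000, '2012': 188650000}],
})

df = (
    df_sample
    .assign(Year=df_sample['budgetYearMap'].apply(lambda s: ', '.join(s.keys())))
    .drop(columns=['budgetYearMap'])
)
print(df)
```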

How to create a Pandas DataFrame from a list of OrderedDicts?

I have the following list:
o_dict_list = [(OrderedDict([('StreetNamePreType', 'ROAD'), ('StreetName', 'Coffee')]), 'Ambiguous'),
(OrderedDict([('StreetNamePreType', 'AVENUE'), ('StreetName', 'Washington')]), 'Ambiguous'),
(OrderedDict([('StreetNamePreType', 'ROAD'), ('StreetName', 'Quartz')]), 'Ambiguous')]
And like the title says, I am trying to take this list and create a pandas dataframe where the columns are: 'StreetNamePreType' and 'StreetName' and the rows contain the corresponding values for each key in the OrderedDict.
I have done some searching on StackOverflow to get some guidance on how to create a dataframe, see here but I am getting an error when I run this code (I am trying to replicate what is going on in that response).
from collections import Counter, OrderedDict
import pandas as pd
col = Counter()
for k in o_dict_list:
    col.update(k)
df = pd.DataFrame([k.values() for k in o_dict_list], columns = col.keys())
When I run this code, the error I get is: TypeError: unhashable type: 'OrderedDict'
I looked up this error, here, I get that there is a problem with the datatypes, but I, unfortunately, I don't know enough about the inner workings of Python/Pandas to resolve this problem on my own.
I suspect that my list of OrderedDicts is not exactly the same as in here, which is why my code does not work. More specifically, I believe I have a list of tuples, where each tuple contains an OrderedDict and a string. The example that I have linked to here seems to be a true list of OrderedDicts.
Again, I don't know enough about the inner workings of Python/Pandas to resolve this problem on my own and am looking for help.
I would use a list comprehension to pull the OrderedDict out of each tuple:
pd.DataFrame([pair[0] for pair in o_dict_list])
See the output below.
StreetNamePreType StreetName
0 ROAD Coffee
1 AVENUE Washington
2 ROAD Quartz
Extracting the OrderedDict objects from your list and then using pd.DataFrame should work:
values = []
for i in range(len(o_dict_list)):
    values.append(o_dict_list[i][0])
pd.DataFrame(values)
StreetNamePreType StreetName
0 ROAD Coffee
1 AVENUE Washington
2 ROAD Quartz
pd.DataFrame also accepts a plain list of dicts directly, aligning columns by key and filling missing values with NaN:
d = [{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
pd.DataFrame(d)
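A slightly more idiomatic way to pull the OrderedDicts out of the (record, label) pairs is tuple unpacking inside the comprehension, as a self-contained sketch:

```python
from collections import OrderedDict
import pandas as pd

o_dict_list = [(OrderedDict([('StreetNamePreType', 'ROAD'), ('StreetName', 'Coffee')]), 'Ambiguous'),
               (OrderedDict([('StreetNamePreType', 'AVENUE'), ('StreetName', 'Washington')]), 'Ambiguous')]

# Unpack each (record, label) pair; pd.DataFrame accepts the list of mappings directly
df = pd.DataFrame([record for record, label in o_dict_list])
print(df)
```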

Python DataFrame column with list of strings does not flatten

I have a column in a DataFrame (production_company) which has a list of strings that are production companies for a movie. I want to search for all unique occurrence of a production company across all movies.
In the data below I have given a sample of the column values in production_company.
"['Universal Studios', 'Amblin Entertainment', 'Legendary Pictures', 'Fuji Television Network', 'Dentsu']"
"['Village Roadshow Pictures', 'Kennedy Miller Productions']"
"['Summit Entertainment', 'Mandeville Films', 'Red Wagon Entertainment', 'NeoReel']"
"['Lucasfilm', 'Truenorth Productions', 'Bad Robot']"
"['Universal Pictures', 'Original Film', 'Media Rights Capital', 'Dentsu', 'One Race Films']"
"['Regency Enterprises', 'Appian Way', 'CatchPlay', 'Anonymous Content', 'New Regency Pictures']"
I am trying to first flatten the column using a solution to flatten given in Pandas Series of lists to one series
But I get error 'TypeError: 'float' object is not iterable'
17 slist =[]
18 for company in production_companies:
---> 19 slist.extend(company )
20
21
TypeError: 'float' object is not iterable
production_companies holds the column df['production_company']
Company is a list so why is it taking it as float? Even list comprehension gives the same error: flattened_list = [y for x in production_companies for y in x]
The 'float' object is not iterable error usually means the column contains NaN entries (which are floats) mixed in with the values; note also that each cell here is a string that merely looks like a list. You can use collections.Counter to count items. I would split the task into 3 steps:
Convert series of strings into a series of lists via ast.literal_eval.
Use itertools.chain to form an iterable of companies and feed to Counter.
Use a set comprehension to filter for companies with a count of 1.
Here's a demo:
from ast import literal_eval
from itertools import chain
from collections import Counter
s = df['companies'].map(literal_eval)
c = Counter(chain.from_iterable(s))
c_filtered = {k for k, v in c.items() if v == 1}
Result:
print(c_filtered)
{'Village Roadshow Pictures', 'Kennedy Miller Productions',
...
'Truenorth Productions', 'Regency Enterprises'}
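As a self-contained check, the same three steps against a hypothetical two-row sample:

```python
from ast import literal_eval
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical sample: each cell is a *string* that looks like a list
df = pd.DataFrame({'companies': ["['Universal Studios', 'Dentsu']",
                                 "['Lucasfilm', 'Dentsu']"]})

s = df['companies'].map(literal_eval)   # strings -> real lists
c = Counter(chain.from_iterable(s))     # company -> number of movies
c_filtered = {k for k, v in c.items() if v == 1}
print(c_filtered)
```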

Updating a dictionary with values and predefined keys

I want to create a dictionary that has predefined keys, like this:
dict = {'state':'', 'county': ''}
and read through and get values from a spreadsheet, like this:
for row in range(rowNum):
    for col in range(colNum):
and update the values for the keys 'state' (sheet.cell_value(row, 1)) and 'county' (sheet.cell_value(row, 3)) like this:
dict[{}]
I am confused on how to get the state value with the state key and the county value with the county key. Any suggestions?
Desired outcome would look like this:
>>>print dict
[
{'state':'NC', 'county': 'Nash County'},
{'state':'VA', 'county': 'Albemarle County'},
{'state':'GA', 'county': 'Cook County'},....
]
I made a few assumptions regarding your question. You mentioned in the comments that State is at index 1 and County is at index 3; what is at index 2? I assumed that they occur sequentially. In addition to that, there needs to be a way in which you can map the headings to the data columns, hence I used a list to do that as it maintains order.
# A list containing the headings that you are interested in the order in which you expect them in your spreadsheet
list_of_headings = ['state', 'county']
# Simulating your spreadsheet
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]
list_of_dictionaries = []
for i in range(len(spreadsheet)):
    dictionary = {}
    for j in range(len(spreadsheet[i])):
        dictionary[list_of_headings[j]] = spreadsheet[i][j]
    list_of_dictionaries.append(dictionary)
print(list_of_dictionaries)
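The two nested loops can also be collapsed with zip, which pairs each heading with the matching cell (same simulated spreadsheet as above):

```python
list_of_headings = ['state', 'county']
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]

# zip pairs headings with row values; dict() turns the pairs into a record
list_of_dictionaries = [dict(zip(list_of_headings, row)) for row in spreadsheet]
print(list_of_dictionaries)
```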
Raqib's answer is partially correct but had to be modified for use with an actual spreadsheet with rows and columns and the xlrd module. What I did was first use xlrd methods to grab the cell values I wanted and put them into a list (similar to the spreadsheet variable raqib has shown above). Note that the parameters sI and cI are the column index values I picked out in a previous step: sI = state index and cI = county index.
rows = []
for row in range(rowNum):
    rows.append([str(sheet.cell_value(row, sI)), str(sheet.cell_value(row, cI))])
Now that I have a list of the states and counties, I can apply raqib's solution:
list_of_headings = ['state', 'county']
fipsDic = []
print(len(rows))
for i in range(len(rows)):
    temp = {}
    for j in range(len(rows[i])):
        temp[list_of_headings[j]] = rows[i][j]
    fipsDic.append(temp)
The result is a nice dictionary list that looks like this:
[{'county': 'Minnehaha County', 'state': 'SD'}, {'county': 'Minnehaha County', 'state': 'SD', ...}]

Keep some keys in my list with comprehension?

I have a big list that I pulled in from a .csv:
CSV_PATH = 'myfile.csv'
CSV_OBJ = csv.DictReader(open(CSV_PATH, 'r'))
CSV_LIST = list(CSV_OBJ)
And I only want to keep some of the columns in it:
KEEP_COLS = ['Name', 'Year', 'Total Allocations', 'Enrollment']
It seems from Removing multiple keys from a dictionary safely like this ought to work:
BETTER = {k: v for k, v in CSV_LIST if k not in KEEP_COLS}
But I get an error: ValueError: too many values to unpack. What am I missing here? I could write a loop that runs through CSV_LIST and produces BETTER by keeping only what I want, but I suspect that using a comprehension is more pythonic.
As requested, a chunk of CSV_LIST
{'EIN': '77-0000091',
'FR': '28.4',
'Name': 'Org A',
'Enrollment': '506',
'Total Allocations': '$34214',
'geo_latitude': '37.9381775755',
'geo_longitude': '-122.3146910612',
'Year': '2009'},
{'EIN': '77-0000091',
'FR': '28.4',
'Name': 'Org A',
'Enrollment': '506',
'Total Allocations': '$34214',
'geo_latitude': '37.9381775755',
'geo_longitude': '-122.3146910612',
'Year': '2010'}
At the commandline I can do csvcut -c 'Name','Year','Total Allocations','Enrollment' myfile.csv > better_myfile.csv but that's definitely not pythonic.
Your dictionary comprehension is fine, but since you have a list of dictionaries, you have to create a list comprehension using that dictionary comprehension for the individual list items. Also, since you want to keep those columns, I guess you should drop that not. Try this:
[{k: v for k, v in d.items() if k in KEEP_COLS} for d in CSV_LIST]
An alternative is to use
CSV_LIST = map(operator.itemgetter(*KEEP_COLS), CSV_OBJ)
This will create a list of tuples with the desired columns.
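A minimal sketch of the itemgetter variant, using a trimmed copy of one of the sample rows above:

```python
import operator

KEEP_COLS = ['Name', 'Year', 'Total Allocations', 'Enrollment']
CSV_LIST = [{'EIN': '77-0000091', 'Name': 'Org A', 'Enrollment': '506',
             'Total Allocations': '$34214', 'Year': '2009'}]

# itemgetter with several keys returns a tuple of the matching values, in key order
rows = list(map(operator.itemgetter(*KEEP_COLS), CSV_LIST))
print(rows)
```

Note the result is a list of tuples, so you lose the column names; use the dict-comprehension answer above if you want to keep them.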
The issue is that CSV_LIST is a list of dicts, not a single dict. @tobias explained how to unpack it correctly.
However, if you're worried about being Pythonic, why are you processing a DictReader into a list of dictionaries and then filtering out all but a few keys? Without knowing your use case I can't be sure, but it's likely that it would be cleaner and simpler to just use the DictReader row-by-row the way it was intended to be used:
with open(CSV_PATH, 'r') as f:
    for row in csv.DictReader(f):
        process(row['Name'], row['Year'], row['Total Allocations'], row['Enrollment'])
