Add keys from dicts (in column) to new column - python

I have a DataFrame with a 'budgetYearMap' column, which has 1-3 key-value pairs for each record. I'm a bit stuck as to how I'm supposed to make a new column containing only the keys of the "budgetYearMap" column.
Sample data below:
df_sample = pd.DataFrame({'identifier': ['BBI-2016-D02', 'BBI-2016-D03', 'BBI-2016-D04', 'BBI-2016-D05', 'BBI-2016-D06'],
'callIdentifier': ['H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016'],
'budgetYearMap': [{'0': 188650000}, {'2017': 188650000}, {'2015': 188650000}, {'2014': 188650000}, {'2020': 188650000, '2014': 188650000, '2012': 188650000}]
})
First I tried to extract the keys by position, then make a list out of them and add the list to the dataframe. As some records contained multiple keys (I then found out), this approach failed.
all_keys = [i for s in [list(d.keys()) for d in df_sample.budgetYearMap] for i in s]
df_TD_selected['budgetYear'] = all_keys
My problem is that extracting the keys by "name" wouldn't work either, given that the names of the keys are variable, and I do not know the set of years in advance. The data set will keep growing. It can be either 0 or a year within the 2000 range now, but in the future more years will be added.
My desired output would be:
df_output = pd.DataFrame({'identifier': ['BBI-2016-D02', 'BBI-2016-D03', 'BBI-2016-D04', 'BBI-2016-D05', 'BBI-2016-D06'],
'callIdentifier': ['H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016'],
'Year': ['0', '2017', '2015', '2014', '2020, 2014, 2012']
})
Any idea how I should approach this?

Perfect pipeline use-case.
df = (
    df_sample
    .assign(Year=df_sample['budgetYearMap'].apply(lambda s: list(s.keys())))
    .drop(columns=['budgetYearMap'])
)
.assign creates a new column which takes the 'budgetYearMap' Series and applies the lambda function to it. This returns the dictionary's keys in a list. If you prefer a string (as in your desired output), simply replace the lambda function with
lambda s: ', '.join(list(s.keys()))
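To see what that lambda does to each record, here is a plain-Python sketch on the dictionaries from the sample data, without pandas. The names budget_maps, keys_as_text and years_as_text are made up for illustration, and the sketch assumes the keys are strings, as in the sample.

```python
# The 'budgetYearMap' values from the question's sample data.
budget_maps = [
    {'0': 188650000},
    {'2017': 188650000},
    {'2015': 188650000},
    {'2014': 188650000},
    {'2020': 188650000, '2014': 188650000, '2012': 188650000},
]


def keys_as_text(d):
    """Join the keys of one dict into a single comma-separated string."""
    # Iterating over a dict yields its keys, so no explicit .keys() is needed.
    return ', '.join(d)


years_as_text = [keys_as_text(d) for d in budget_maps]
print(years_as_text)
```

The same function could be handed to apply in place of the lambda, e.g. df_sample['budgetYearMap'].apply(keys_as_text).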

Related

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still put the whole list into one column instead of making separate columns. record_path sounded right, but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
              if d.get('__rowType') == 'DATA' and 'data' in d],
             columns=['unit', 'classification'])
NB: this assumes test is the input list.
output:
  unit classification
0    A        Energie
1  bar
2  CCM        Volumen
3  CDM        Volumen
Instead of just giving you the code, I'll first explain in detail how this works, and then show you the exact steps to follow and the final code. That way you can handle similar situations in the future.
When you want to create a pandas dataframe with two columns, you can do this by creating a dictionary and passing it to the DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
   col1  col2
0     1     3
1     2     4
So if you want to have the dataframe you specified in your question the my_data dictionary should be like this:
my_data = {
    'unit': ['A', 'bar', 'CCM', 'CDM'],
    'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df)+1)
df
(Note the df.index = ... line: it is there because the index of the desired dataframe in your question starts at 1. np is numpy, imported with import numpy as np.)
So all you have to do is extract the values from the data you provided and convert them into exactly the dictionary shown above (the my_data dictionary).
You can do that like this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
import numpy as np
import pandas as pd

d = YOUR_DATA  # the list returned by your request
# Get the data values such as 'bar' and 'CCM'
values = [x['data'] for x in d if x['__rowType']=='DATA']
# Get the column names from the META row
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# Build the exact dictionary we need to pass to the DataFrame class
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df)+1)
df  # or print(df)
Note: Of course you can do all of this in one complex line of code, but to avoid confusion I decided to spread it over a couple of lines.
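A more compact version of the same extraction might look like the sketch below. It is plain Python, and it assumes the input contains exactly one META row and that every DATA row has as many entries as the META row has names; raw is a stand-in name for the result of the request.

```python
raw = [
    {'__rowType': 'META', '__type': 'units',
     'data': [{'name': 'units.unit', 'type': 'STRING'},
              {'name': 'units.classification', 'type': 'STRING'}]},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']},
]

# Column names come from the single META row: 'units.unit' -> 'unit'.
meta = next(row for row in raw if row['__rowType'] == 'META')
columns = [field['name'].split('.')[-1] for field in meta['data']]

# zip(*rows) transposes the DATA rows into one tuple per column.
rows = [row['data'] for row in raw if row['__rowType'] == 'DATA']
my_data = {name: list(col) for name, col in zip(columns, zip(*rows))}

print(my_data)
```

The resulting my_data has the same column-per-key shape as the dictionary built step by step above, so it can be passed to pd.DataFrame in the same way.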

Is there a way to sort a dictionary from the outside in

I'm trying to create an event manager in which a dictionary stores the events like this
my_dict = {'2020':
               {'9': {'8': ['School ']},
                '11': {'13': ['Doctors ']},
                '8': {'31': ['Interview']}
                },
           '2021': {}}
The outer key is the year, the middle key is the month, and the innermost key is the day, which leads to a list of events.
I'm trying to first sort it so that the months are in order, then sort it again so that the days are in order. Thanks in advance.
Use-case
DevOrangeCrush wishes to sort on keys in a nested dictionary where the nesting occurs on multiple levels
Solution
Normalize the data so that the dates match ISO8601 format, for easier sorting
In plain English, this means make sure you always use two digits for month and date, and always use four digits for year
Re-normalize the original dictionary data structure into a single list of dictionaries, where each dictionary represents a row, and the list represents an outer containing table
this is known as an Array of Hashes in perl-speak
this is known as a list of objects in JSON-speak
Once your data is restructured you are solving a much more well-known, well-documented, and more obvious problem, how to sort a simple list of dictionaries (which is already documented in the See also section of this answer).
Example
import pprint
## original data is formatted as a nested dictionary, which is clumsy
my_dict = {
    '2020': {
        '9': {'8': ['School ']},
        '11': {'13': ['Doctors ']},
        '8': {'31': ['Interview']},
    },
    '2021': {},
}
## we want the data formatted as a standard table (aka list of dictionary)
## this is the most common format for this kind of data as you would see in
## databases and spreadsheets
mydata_table = []
ddtemp = dict()
for year in my_dict:
    for month in my_dict[year].keys():
        ddtemp['month'] = '{0:02d}'.format(int(month))
        ddtemp['year'] = year
        for day in my_dict[year][month].keys():
            ddtemp['day'] = '{0:02d}'.format(int(day))
            mydata_row = dict()
            mydata_row['year'] = '{year}'.format(**ddtemp)
            mydata_row['month'] = '{month}'.format(**ddtemp)
            mydata_row['day'] = '{day}'.format(**ddtemp)
            mydata_row['task_list'] = my_dict[year][month][day]
            mydata_row['date'] = '{year}-{month}-{day}'.format(**ddtemp)
            mydata_table.append(mydata_row)
## output result is now easily sorted and there is no data loss
## you will have to modify this if you want to deal with years that
## do not have any associated task_list data
pprint.pprint(mydata_table)
'''
## now we have something that can be sorted using well-known python idioms
## and easily manipulated using data-table semantics
## (search, sort, filter-by, group-by, select, project ... etc)
[
{'date': '2020-09-08','day': '08',
'month': '09','task_list': ['School '],'year': '2020'},
{'date': '2020-11-13','day': '13',
'month': '11','task_list': ['Doctors '],'year': '2020'},
{'date': '2020-08-31','day': '31',
'month': '08','task_list': ['Interview'],'year': '2020'},
]
'''
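With the table in this shape, the sort itself is short. The following is a minimal sketch, assuming every row carries a zero-padded 'date' string as built above; the three rows are copied from the printed output, and sorted_table is a name introduced here for illustration.

```python
from operator import itemgetter

mydata_table = [
    {'date': '2020-09-08', 'day': '08', 'month': '09',
     'task_list': ['School '], 'year': '2020'},
    {'date': '2020-11-13', 'day': '13', 'month': '11',
     'task_list': ['Doctors '], 'year': '2020'},
    {'date': '2020-08-31', 'day': '31', 'month': '08',
     'task_list': ['Interview'], 'year': '2020'},
]

# Zero-padded ISO 8601 date strings sort chronologically as plain text.
sorted_table = sorted(mydata_table, key=itemgetter('date'))

for row in sorted_table:
    print(row['date'], row['task_list'])
```

Sorting on the single 'date' field orders by year, month and day at once, which is why the zero padding matters.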
See also
How to sort a python list-of-dictionary
How to sort objects by multiple keys
Why you should use ISO8601 date format
ISO8601 vs timestamp
To get sorted events data, you can do something like this:
from collections import OrderedDict

def sort_events(my_dict):
    new_events_data = dict()
    for year, month_data in my_dict.items():
        new_month_data = dict()
        for month, day_data in month_data.items():
            # sort the days of this month numerically
            sorted_day_data = sorted(day_data.items(), key=lambda kv: int(kv[0]))
            new_month_data[month] = OrderedDict(sorted_day_data)
        # sort the months of this year numerically
        sorted_months_data = sorted(new_month_data.items(), key=lambda kv: int(kv[0]))
        new_events_data[year] = OrderedDict(sorted_months_data)
    return new_events_data
Output:
{'2020': OrderedDict([('8', OrderedDict([('31', ['Interview'])])),
('9', OrderedDict([('8', ['School '])])),
('11', OrderedDict([('13', ['Doctors '])]))]),
'2021': OrderedDict()}
A plain dict has no sorted order of its own. You could build an OrderedDict (or, from Python 3.7 on, a regular dict, which keeps insertion order), but if you simply need to visit the entries in sorted order while iterating, do it like this:
for year in sorted(map(int, my_dict)):
    year_dict = my_dict[str(year)]
    for month in sorted(map(int, year_dict)):
        month_dict = year_dict[str(month)]
        for day in sorted(map(int, month_dict)):
            events = month_dict[str(day)]
            for event in events:
                print(year, month, day, event)
The conversion to int ensures the right ordering between the numbers; without it you'll get 1, 10, 11, ..., 2, 20, 21.
A dictionary in Python is not sorted, and before Python 3.7 it did not even guarantee insertion order. On those versions you might want to try the OrderedDict class from the collections module, which remembers the order of insertion.
Of course, you would have to sort and reinsert the elements whenever you insert a new element that should be placed before any of the existing elements.
If you care about order, maybe a different data structure works better, for example a list of lists.
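One way such a flat structure could look is sketched below. It uses a list of (date, event) tuples rather than a list of lists, so that the entries compare chronologically, and bisect.insort to keep the list sorted on later inserts. The 'Dentist' event and the name events are made up for illustration.

```python
import bisect
from datetime import date

my_dict = {'2020': {'9': {'8': ['School ']},
                    '11': {'13': ['Doctors ']},
                    '8': {'31': ['Interview']}},
           '2021': {}}

# Flatten to (date, event) pairs; date objects compare chronologically.
events = sorted(
    (date(int(y), int(m), int(d)), event)
    for y, months in my_dict.items()
    for m, days in months.items()
    for d, day_events in days.items()
    for event in day_events
)

# insort places a new entry at its sorted position, so no re-sort is needed.
bisect.insort(events, (date(2020, 10, 1), 'Dentist'))

for when, what in events:
    print(when.isoformat(), what)
```

Years without events (such as '2021' here) simply contribute no entries to the list.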

convert list to dataframe using dictionary

I am new to Pythonland and I have a question. I have a list as below and want to convert it into a dataframe.
I read on Stackoverflow that it is better to create a dictionary than a list, so I created one as follows.
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick", "nick","pick"]
data = ['100','50','A','107','62','B'] # The actual list has 1640 entries
dic = {key:[] for key in column_names}
dic['name'] = row_names
t = 0
while t < len(data):
    dic['height'].append(data[t])
    t = t+3
t = 1
while t < len(data):
    dic['weight'].append(data[t])
    t = t+3
And so on; I have 10 columns, so I wrote the above code 10 times to complete the full dictionary. Then I convert it to a dataframe. It works perfectly fine, but there has to be a shorter way to do this. I don't know how to refer to a key of a dictionary by number. Should it be wrapped in a function? Also, how can I automate adding one to the starting value of t before executing the next loop? Please help me.
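You can iterate through column_names like this (skipping 'name', which is filled from row_names, and stepping by the number of value columns instead of a hard-coded 3):
value_columns = column_names[1:]  # every column except 'name'
stride = len(value_columns)       # number of values per row in data
dic = {key: [] for key in column_names}
dic['name'] = row_names
for t, column_name in enumerate(value_columns):
    i = t
    while i < len(data):
        dic[column_name].append(data[i])
        i += stride
enumerate yields t from 0 to len(value_columns)-1 together with the matching column name, so each column starts at its own offset in data.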
d = {key: [] for key in column_names[1:]}  # 'name' is filled from row_names
i = 0
while True:
    try:
        for j in column_names[1:]:
            d[j].append(data[i])
            i += 1
    except IndexError as er:  # raised once i runs past the end of data, which ends the loop
        print(er, "################")
        break
The first issue is that you have the data of all columns concatenated into a single list. You should first investigate how to prevent that and get a list of lists, with each column's values in a separate list, like [['100', '107'], ['50', '62'], ['A', 'B']]. Either way, you need this data structure to proceed efficiently:
cl_count = len(column_names)  # column_names must list only the columns stored in data (here: without 'name')
d_count = len(data)
spl_data = [[data[j] for j in range(i, d_count, cl_count)] for i in range(cl_count)]
Then you should use a dict comprehension. This needs Python 2.7 or later.
df = pd.DataFrame({j: spl_data[i] for i, j in enumerate(column_names)})
First, we should understand what an ideal dictionary for a dataframe should look like.
A Dataframe can be thought of in two different ways:
One is a traditional collection of rows..
'row 0': ['jack', 100, 50, 'A'],
'row 1': ['mick', 107, 62, 'B']
However, there is a second representation that is more useful, though perhaps not as intuitive at first.
A collection of columns:
'name': ['jack', 'mick'],
'height': ['100', '107'],
'weight': ['50', '62'],
'grade': ['A', 'B']
Now, here is the key thing to realise: the second representation is more useful because it is the representation internally supported and used in dataframes.
It does not run into datatype conflicts within a single grouping (each column needs to have one fixed datatype).
Across a row, however, datatypes can vary.
Also, operations can be performed easily and consistently on an entire column because of this consistency, which can't be guaranteed in a row.
So, tl;dr: DataFrames are essentially collections of equal-length columns.
So, a dictionary in that representation can be easily converted into a DataFrame.
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick"]
data = [100, 50,'A', 107, 62,'B'] # The actual list has 1640 entries
So, with that in mind, the first thing to realize is that, in its current format, data is a very poor representation.
It is a collection of rows merged into a single list.
The first thing to do, if you're the one in control of how data is formed, is to not prepare it this way.
The goal is a list for each column, and ideally, prepare the list in that format.
Now, however, if it is given in this format, you need to iterate and collect the values accordingly. Here's a way to do it
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick"]
data = [100, 50,'A', 107, 62,'B'] # The actual list has 1640 entries
dic = {key:[] for key in column_names}
dic['name'] = row_names
print(dic)
Output so far:
{'height': [],
'weight': [],
'grade': [],
'name': ['jack', 'mick']} #so, now, names are a column representation with all correct values.
remaining_cols = column_names[1:]
# Explanations for the following part are given at the end
data_it = iter(data)
for row in zip(*([data_it] * len(remaining_cols))):
    for i, val in enumerate(row):
        dic[remaining_cols[i]].append(val)
print(dic)
Output:
{'name': ['jack', 'mick'],
'height': [100, 107],
'weight': [50, 62],
'grade': ['A', 'B']}
And we are done with the representation
Finally:
import pandas as pd
df = pd.DataFrame(dic, columns=column_names)
print(df)
   name  height  weight grade
0  jack     100      50     A
1  mick     107      62     B
Edit:
Some explanation for the zip part:
zip takes any number of iterables and lets us iterate through them together.
data_it = iter(data) #prepares an iterator.
[data_it] * len(remaining_cols) #creates references to the same iterator
Here, this is similar to [data_it, data_it, data_it]
The * in *[data_it, data_it, data_it] allows us to unpack the list into 3 arguments for the zip function instead
so, f(*[data_it, data_it, data_it]) is equivalent to f(data_it, data_it, data_it) for any function f.
the magic here is that traversing through an iterator/advancing an iterator will now reflect the change across all references
Putting it all together:
zip(*([data_it] * len(remaining_cols))) will actually allow us to take 3 items from data at a time, and assign it to row
So, row = (100, 50, 'A') in first iteration of zip
for i, val in enumerate(row): #just iterate through the row, keeping index too using enumerate
dic[remaining_cols[i]].append(val) #use indexes to access the correct list in the dictionary
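A tiny standalone run of the idiom explained above may make it easier to see. This sketch uses the sample data list, with n as an illustrative name for the number of values per row; the second call uses made-up numbers to show that leftover items are silently dropped when the list length is not a multiple of n.

```python
data = [100, 50, 'A', 107, 62, 'B']
n = 3  # number of values that belong to one row

data_it = iter(data)
# All n arguments are the same iterator, so each tuple consumes n items.
rows = list(zip(*[data_it] * n))
print(rows)

# zip stops at the first exhausted argument: the trailing 5 is discarded.
pairs = list(zip(*[iter([1, 2, 3, 4, 5])] * 2))
print(pairs)
```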
Hope that helps.
If you are using Python 3.x, as suggested by l159, you can use a dict comprehension and then create a pandas DataFrame out of it, using the names as row indexes:
data = ['100', '50', 'A', '107', '62', 'B', '103', '64', 'C', '105', '78', 'D']
column_names = ["height", "weight", "grade"]
row_names = ["jack", "mick", "nick", "pick"]
df = pd.DataFrame.from_dict(
    {
        row_label: {
            column_label: data[i * len(column_names) + j]
            for j, column_label in enumerate(column_names)
        } for i, row_label in enumerate(row_names)
    },
    orient='index'
)
Actually, the intermediate dictionary is a nested dictionary: the keys of the outer dictionary are the row labels (in this case the items of the row_names list); the value associated with each key is a dictionary whose keys are the column labels (i.e., the items in column_names) and whose values are the corresponding elements in the data list.
The function from_dict is used to create the DataFrame instance.
So, the previous code produces the following result:
     height weight grade
jack    100     50     A
mick    107     62     B
nick    103     64     C
pick    105     78     D

Updating a dictionary with values and predefined keys

I want to create a dictionary that has predefined keys, like this:
dict = {'state':'', 'county': ''}
and read through and get values from a spreadsheet, like this:
for row in range(rowNum):
for col in range(colNum):
and update the values for the keys 'state' (sheet.cell_value(row, 1)) and 'county' (sheet.cell_value(row, 1)) like this:
dict[{}]
I am confused on how to get the state value with the state key and the county value with the county key. Any suggestions?
Desired outcome would look like this:
>>>print dict
[
{'state':'NC', 'county': 'Nash County'},
{'state':'VA', 'county': 'Albemarle County'},
{'state':'GA', 'county': 'Cook County'},....
]
I made a few assumptions regarding your question. You mentioned in the comments that State is at index 1 and County is at index 3; what is at index 2? I assumed that they occur sequentially. In addition to that, there needs to be a way in which you can map the headings to the data columns, hence I used a list to do that as it maintains order.
# A list containing the headings that you are interested in the order in which you expect them in your spreadsheet
list_of_headings = ['state', 'county']
# Simulating your spreadsheet
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]
list_of_dictionaries = []
for i in range(len(spreadsheet)):
    dictionary = {}
    for j in range(len(spreadsheet[i])):
        dictionary[list_of_headings[j]] = spreadsheet[i][j]
    list_of_dictionaries.append(dictionary)
print(list_of_dictionaries)
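The same mapping of headings to row values can also be written with zip, which pairs each heading with the value at the same position. This is a sketch of an equivalent form on the simulated spreadsheet above, not a change to the approach.

```python
list_of_headings = ['state', 'county']
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]

# dict(zip(headings, row)) builds one {'state': ..., 'county': ...} per row.
list_of_dictionaries = [dict(zip(list_of_headings, row)) for row in spreadsheet]
print(list_of_dictionaries)
```

As with the loop version, this relies on the headings being listed in the same order as the columns of each row.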
Raqib's answer is partially correct but had to be modified for use with an actual spreadsheet with rows and columns and the xlrd module. What I did was first use xlrd methods to grab the cell values that I wanted and put them into a list (similar to the spreadsheet variable raqib has shown above). Note that the parameters sI and cI are the column index values I picked out in a previous step: sI = StateIndex and cI = CountyIndex.
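list = []  # note: this name shadows the built-in list
for row in range(rowNum):
    # one [state, county] pair per row; sI and cI already select the columns,
    # so no inner loop over the columns is needed
    list.append([str(sheet.cell_value(row, sI)), str(sheet.cell_value(row, cI))])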
Now that I have a list of the states and counties, I can apply raqib's solution:
list_of_headings = ['state', 'county']
fipsDic = []
print(len(list))
for i in range(len(list)):
    temp = {}
    for j in range(len(list[i])):
        temp[list_of_headings[j]] = list[i][j]
    fipsDic.append(temp)
The result is a nice list of dictionaries that looks like this:
[{'county': 'Minnehaha County', 'state': 'SD'}, {'county': 'Minnehaha County', 'state': 'SD', ...}]

Replace values from pandas dataset with dictionary

I am extracting a column from an Excel document with pandas. After that, for each row of the selected column, I want to replace every key found in a list of dictionaries with its value.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, the column is accessed as df['Q10']; it has more than 10k rows.
Traditionally, if I want to replace a value in df, I use:
df['Q10'].str.replace('val1', 'val2')
Now, I have a dictionary of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    },
    # ... plus tons of further key/value pairs
]
Currently, I have created a function that iterates over "mydic" and replaces all occurrences one by one.
def replaceContractions(df, mydic):
    for cont in contractions:
        df.str.replace(cont['key'], cont['value'])
Next I call this function passing mydic and my dataframe:
replaceContractions(df['Q10'], contractions)
First problem: this is very expensive, because mydic has a lot of items and the data set is scanned once for each of them.
Second: it doesn't seem to work :(
Any Ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over each key-replacement at a time.
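As for why the original function appeared to do nothing: str.replace returns a new Series rather than changing the column in place, and the loop never assigned that result to anything. The single-pass idea can also be sketched with the standard re module on plain strings. This is an illustration only, not how pandas implements replace; pattern and expand are names introduced here, and re.escape is used so that keys containing regex metacharacters are matched literally.

```python
import re

mydic = [
    {'key': "wasn't", 'value': 'was not'},
    {'key': "I'm", 'value': 'I am'},
]
m = {d['key']: d['value'] for d in mydic}

# One alternation pattern covering every key. Longer keys go first so that
# a short key cannot shadow a longer key that starts with the same text.
pattern = re.compile(
    '|'.join(re.escape(k) for k in sorted(m, key=len, reverse=True))
)


def expand(text):
    """Replace every key found in text with its value in a single pass."""
    return pattern.sub(lambda match: m[match.group(0)], text)


print(expand("I'm sure it wasn't me"))
```

Assuming the column holds only strings, a function like this could be applied per row with df['Q10'].map(expand).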
