data = {'gems': [{'name': 'garnet', 'colour': 'red', 'month': 'January'},
                 {'name': 'emerald', 'colour': 'green', 'month': 'May'},
                 {'name': "cat's eye", 'colour': 'yellow', 'month': 'June'},
                 {'name': 'sardonyx', 'colour': 'red', 'month': 'August'},
                 {'name': 'peridot', 'colour': 'green', 'month': 'September'},
                 {'name': 'ruby', 'colour': 'red', 'month': 'December'}]}
How do I create a list of colours and then just find the months with the colour red?
I've tried for and if, but I keep getting the error message
string indices must be integers
Because you have dictionaries within a list, you can use a list comprehension with an if condition to filter out those values you don't want:
[x['month'] for x in data['gems'] if x['colour'] == 'red']
Returns:
['January', 'August', 'December']
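For what it's worth, the comprehension is equivalent to an explicit for/if loop. The "string indices must be integers" error usually means the loop iterated over the outer dict (whose keys are strings) instead of the list under 'gems'. A minimal sketch, using a trimmed copy of the data:

```python
data = {'gems': [{'name': 'garnet', 'colour': 'red', 'month': 'January'},
                 {'name': 'ruby', 'colour': 'red', 'month': 'December'}]}

# Iterating over data itself yields the key 'gems' (a string), so
# x['month'] raises "string indices must be integers".
# Iterate over the list stored under 'gems' instead:
months = []
for x in data['gems']:
    if x['colour'] == 'red':
        months.append(x['month'])

print(months)  # ['January', 'December']
```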
Assuming that one wants the output as a dataframe, one can use pandas.json_normalize and pandas.DataFrame.query as follows:
import pandas as pd

df = pd.json_normalize(data['gems']).query('colour == "red"')['month']
[Out]:
0 January
3 August
5 December
If one wants the index to be reset, one can chain pandas.DataFrame.reset_index with drop=True:
df = pd.json_normalize(data['gems']).query('colour == "red"')['month'].reset_index(drop=True)
[Out]:
0 January
1 August
2 December
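If a plain Python list is preferred over a Series, pandas.Series.tolist converts the result. A small self-contained sketch, using a trimmed copy of the data:

```python
import pandas as pd

data = {'gems': [{'name': 'garnet', 'colour': 'red', 'month': 'January'},
                 {'name': 'sardonyx', 'colour': 'red', 'month': 'August'},
                 {'name': 'peridot', 'colour': 'green', 'month': 'September'}]}

months = (pd.json_normalize(data['gems'])
            .query('colour == "red"')['month']
            .reset_index(drop=True))

print(months.tolist())  # ['January', 'August']
```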
I would really appreciate any help on the below. I am looking to create a set of values with one name compiling all duplicates, with a second dict value totalling another value from a list of dicts. I have put together the code below as an example:
l = [{'id': 1, 'name': 'apple', 'price': '100', 'year': '2000', 'currency': 'eur'},
{'id': 2, 'name': 'apple', 'price': '150', 'year': '2071', 'currency': 'eur'},
{'id': 3, 'name': 'apple', 'price': '1220', 'year': '2076', 'currency': 'eur'},
{'id': 4, 'name': 'cucumber', 'price': '90000000', 'year': '2080', 'currency': 'eur'},
{'id': 5, 'name': 'pear', 'price': '1000', 'year': '2000', 'currency': 'eur'},
{'id': 6, 'name': 'apple', 'price': '150', 'year': '2022', 'currency': 'eur'},
{'id': 9, 'name': 'apple', 'price': '100', 'year': '2000', 'currency': 'eur'},
{'id': 10, 'name': 'grape', 'price': '150', 'year': '2022', 'currency': 'eur'},
]
new_list = []
for d in l:
    if d['name'] not in new_list:
        new_list.append(d['name'])
print(new_list)

price_list = []
for price in l:
    if price['price'] not in price_list:
        price_list.append(price['price'])
print(price_list)
The output I am hoping to achieve is:
[{'name': 'apple'}, {'price': <The total price for all apples>}]
Use a dictionary whose keys are the names and values are the list of prices. Then sum each list.
d = {}
for item in l:
    d.setdefault(item['name'], []).append(int(item['price']))
for name, prices in d.items():
    d[name] = sum(prices)
print(d)
Actually, I thought this was the same as yesterday's question, where you wanted the average. If you just want the total, you don't need the lists. Use a defaultdict containing integers, and just add the price to it.
from collections import defaultdict

d = defaultdict(int)
for item in l:
    d[item['name']] += int(item['price'])
print(d)
This method only requires one loop:
prices = {}
for item in l:
    prices[item['name']] = prices.get(item['name'], 0) + int(item['price'])
print(prices)
Just for fun I decided to also implement the functionality with the item and price dictionaries separated as asked in the question, which gave the following horrendous code:
prices = []
for item in l:
    # get indices of prices of corresponding items
    price_idx = [n + 1 for n, x in enumerate(prices)
                 if item['name'] == x.get('name') and n % 2 == 0]
    if not price_idx:
        prices.append({'name': item['name']})
        prices.append({'price': int(item['price'])})
    else:
        prices[price_idx[0]]['price'] += int(item['price'])
print(prices)
And requires the following function to retrieve prices:
def get_price(name):
    for n, x in enumerate(prices):
        if n % 2 == 0 and x['name'] == name:
            return prices[n + 1]['price']
Which honestly completely defeats the point of having a data structure. But if it answers your question, there you go.
This could be another one:
result = {}
for item in l:
    if item['name'] not in result:
        result[item['name']] = {'name': item['name'], 'price': 0}
    result[item['name']]['price'] += int(item['price'])
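Since the question asked for a list of dicts rather than a dict keyed by name, the accumulated result can be converted afterwards with dict.values. A self-contained sketch, using a trimmed copy of the input:

```python
l = [{'id': 1, 'name': 'apple', 'price': '100'},
     {'id': 2, 'name': 'apple', 'price': '150'},
     {'id': 5, 'name': 'pear', 'price': '1000'}]

result = {}
for item in l:
    if item['name'] not in result:
        result[item['name']] = {'name': item['name'], 'price': 0}
    result[item['name']]['price'] += int(item['price'])

# Flatten to a list of {'name': ..., 'price': ...} dicts:
as_list = list(result.values())
print(as_list)  # [{'name': 'apple', 'price': 250}, {'name': 'pear', 'price': 1000}]
```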
I'm trying to get the unique values in a particular column in a Pandas data frame based on multiple filtering criteria. Here is some toy code:
df = pd.DataFrame({'Manufacturer': ['<null>', 'Mercedes', 'BMW', 'Audi', 'Audi', 'Audi', 'Audi', 'Audi', 'Mercedes', 'BMW'],
                   'Color': ['Purple', '<null>', '<null>', 'Blue', 'Green', 'Green', 'Black', 'White', 'Gold', 'Tan']})
I'm trying to get a list of the unique values of the Color column assuming:
a) a non-null value in the Color column, and
b) a value of 'Audi' in the Manufacturer column
Is there a Pythonic way that doesn't require me to 'pre-process' the data by taking a subset of the data frame, as such:
df_1 = df[(df['Color'] != '<null>') & (df['Manufacturer'] == 'Audi')]
df_1['Color'].unique()
array(['Blue', 'Green', 'Black', 'White'], dtype=object)
Thanks in advance!
You have to subset the dataframe with required conditions. There's no escaping that.
You can always write your code in 1-line, like this:
df[(df['Color'] != '<null>') & (df['Manufacturer'].eq('Audi'))]['Color'].unique()
Also, it's better to represent null values in a dataframe with numpy.nan. Your df would then be this:
In [86]: import numpy as np
In [81]: df = pd.DataFrame({'Manufacturer':[np.nan, 'Mercedes', 'BMW', 'Audi', 'Audi', 'Audi', 'Audi', 'Audi', 'Mercedes', 'BMW'],
...: 'Color':['Purple', np.nan, np.nan, 'Blue', 'Green', 'Green', 'Black', 'White', 'Gold', 'Tan']})
Then you can use df.notna() and df.eq, which are a bit more Pythonic:
In [85]: df[df.Color.notna() & df.Manufacturer.eq('Audi')]['Color'].unique()
Out[85]: array(['Blue', 'Green', 'Black', 'White'], dtype=object)
After OP's comment: you can specify multiple values using isin:
df[(df['Color'] != '<null>') & (df['Manufacturer'].isin(['Audi', 'Mercedes']))]['Color'].unique()
I have a JSON response (sample below) that I'm trying to convert into a DataFrame. I've had several issues with the data being listed as columns (1 x 346), etc. I only need the 5 columns listed below:
area_name,
date,
month,
unemployment_rate,
year
Here's my code:
edd_ca_df = pd.DataFrame.from_dict(edd_ca, orient="index",
columns=["area_name", "month", "date", "year", "unemployment_rate"])
and here's a sample of the JSON response:
[[{'area_name': 'California',
'area_type': 'State',
'date': '1990-01-01T00:00:00.000',
'employment': '14099700',
'labor_force': '14953900',
'month': 'January',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '854200',
'unemployment_rate': '5.7',
'year': '1990'},
{'area_name': 'California',
'area_type': 'State',
'date': '1990-02-01T00:00:00.000',
'employment': '14206700',
'labor_force': '15049400',
'month': 'February',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '842800',
'unemployment_rate': '5.6',
'year': '1990'},
Any help would be greatly appreciated.
Since you have a list of dictionaries, this is as simple as passing all the data to a new DataFrame and specifying what columns you want to keep:
import pandas as pd
all_data = [{'area_name': 'California',
'area_type': 'State',
'date': '1990-01-01T00:00:00.000',
'employment': '14099700',
'labor_force': '14953900',
'month': 'January',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '854200',
'unemployment_rate': '5.7',
'year': '1990'},
{'area_name': 'California',
'area_type': 'State',
'date': '1990-02-01T00:00:00.000',
'employment': '14206700',
'labor_force': '15049400',
'month': 'February',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '842800',
'unemployment_rate': '5.6',
'year': '1990'}]
keep_columns = ['area_name','date','month','unemployment_rate','year']
df = pd.DataFrame(columns=keep_columns, data=all_data)
print(df)
Output
area_name date month unemployment_rate year
0 California 1990-01-01T00:00:00.000 January 5.7 1990
1 California 1990-02-01T00:00:00.000 February 5.6 1990
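Note that the sample response opens with [[, which suggests the records are nested one list deep (a list of lists). If so, flattening the outer level first avoids the 1 x 346 column problem; a sketch under that assumption, with a trimmed two-record copy of the data:

```python
from itertools import chain

import pandas as pd

# The sample response starts with "[[", i.e. records nested one list deep.
edd_ca = [[{'area_name': 'California', 'date': '1990-01-01T00:00:00.000',
            'month': 'January', 'unemployment_rate': '5.7', 'year': '1990',
            'employment': '14099700'}],
          [{'area_name': 'California', 'date': '1990-02-01T00:00:00.000',
            'month': 'February', 'unemployment_rate': '5.6', 'year': '1990',
            'employment': '14206700'}]]

keep_columns = ['area_name', 'date', 'month', 'unemployment_rate', 'year']
flat = list(chain.from_iterable(edd_ca))  # one flat list of record dicts
df = pd.DataFrame(flat, columns=keep_columns)
print(df)
```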
Considering '1', '2', '3', '4' are the indexes and everything else as the values of a dictionary in Python, I'm trying to exclude the repeating values and increment the quantity field when a duplicate is found. e.g.:
Turn this:
a = {'1': {'name': 'Blue', 'qty': '1', 'sub': ['sky', 'ethernet cable']},
'2': {'name': 'Blue', 'qty': '1', 'sub': ['sky', 'ethernet cable']},
'3': {'name': 'Green', 'qty': '1', 'sub': []},
'4': {'name': 'Blue', 'qty': '1', 'sub': ['sea']}}
into this:
b = {'1': {'name': 'Blue', 'qty': '2', 'sub': ['sky', 'ethernet cable']},
'2': {'name': 'Green', 'qty': '1', 'sub': []},
'3': {'name': 'Blue', 'qty': '1', 'sub': ['sea']}}
I was able to exclude the duplicates, but I'm having a hard time incrementing the 'qty' field:
b = {}
for k, v in a.items():
    if v not in b.values():
        b[k] = v
P.S.: I posted this question earlier, but forgot to add that the dictionary can have that 'sub' field which is a list. Also, don't mind the weird string indexes.
First, convert the original dict 'name' and 'sub' keys to a comma-delimited string, so we can use set():
data = [','.join([v['name']]+v['sub']) for v in a.values()]
This returns
['Blue,sky,ethernet cable', 'Green', 'Blue,sky,ethernet cable', 'Blue,sea']
Then use the nested dict and list comprehensions as below:
b = {str(i + 1): {'name': j.split(',')[0],
                  'qty': sum([int(qty['qty']) for qty in a.values()
                              if (qty['name'] == j.split(',')[0])
                              and (qty['sub'] == j.split(',')[1:])]),
                  'sub': j.split(',')[1:]}
     for i, j in enumerate(set(data))}
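Because set() ordering is arbitrary, the keys of b may come out in any order, and 'qty' ends up as an int rather than the string shown in the desired output. Here is a plain-loop alternative (my sketch, not from the original answer) that preserves first-seen order and keeps 'qty' a string, keyed on (name, tuple(sub)):

```python
a = {'1': {'name': 'Blue', 'qty': '1', 'sub': ['sky', 'ethernet cable']},
     '2': {'name': 'Blue', 'qty': '1', 'sub': ['sky', 'ethernet cable']},
     '3': {'name': 'Green', 'qty': '1', 'sub': []},
     '4': {'name': 'Blue', 'qty': '1', 'sub': ['sea']}}

totals = {}  # (name, sub-tuple) -> summed qty, in first-seen order
for v in a.values():
    key = (v['name'], tuple(v['sub']))
    totals[key] = totals.get(key, 0) + int(v['qty'])

b = {str(i + 1): {'name': name, 'qty': str(qty), 'sub': list(sub)}
     for i, ((name, sub), qty) in enumerate(totals.items())}
print(b)
```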
Maybe you can try to use a counter like this:
b = {}
count = 1
for v in a.values():
    if v not in b.values():
        b[str(count)] = v
        count += 1
print(b)
For example, this is my list of dictionaries:
[{'name': 'John', 'color': 'red' },
{'name': 'Bob', 'color': 'green'},
{'name': 'Tom', 'color': 'blue' }]
Based on the list ['blue', 'red', 'green'] I want to return the following:
[{'name': 'Tom', 'color': 'blue' },
{'name': 'John', 'color': 'red' },
{'name': 'Bob', 'color': 'green'}]
This might be a little naive, but it works:
data = [
    {'name': 'John', 'color': 'red'},
    {'name': 'Bob', 'color': 'green'},
    {'name': 'Tom', 'color': 'blue'}
]
colors = ['blue', 'red', 'green']
result = []
for c in colors:
    result.extend([d for d in data if d['color'] == c])
print(result)
Update:
>>> list_ = [{'c': 3}, {'c': 2}, {'c': 5}]
>>> mp = [3, 5, 2]
>>> sorted(list_, key=lambda x: mp.index(x.get('c')))
[{'c': 3}, {'c': 5}, {'c': 2}]
You can sort using any custom key function.
>>> people = [
...     {'name': 'John', 'color': 'red'},
...     {'name': 'Bob', 'color': 'green'},
...     {'name': 'Tom', 'color': 'blue'},
... ]
>>> colors = ['blue', 'red', 'green']
>>> sorted(people, key=lambda person: colors.index(person['color']))
[{'color': 'blue', 'name': 'Tom'}, {'color': 'red', 'name': 'John'}, {'color': 'green', 'name': 'Bob'}]
list.index takes linear time though, so if the number of colors can grow, then convert to a faster key lookup.
>>> colorkeys = dict((color, index) for index, color in enumerate(colors))
>>> sorted(people, key=lambda person: colorkeys[person['color']])
[{'color': 'blue', 'name': 'Tom'}, {'color': 'red', 'name': 'John'}, {'color': 'green', 'name': 'Bob'}]
Riffing on Harto's solution:
>>> from pprint import pprint
>>> data = [
... {'name':'John', 'color':'red'},
... {'name':'Bob', 'color':'green'},
... {'name':'Tom', 'color':'blue'}
... ]
>>> colors = ['blue', 'red', 'green']
>>> result = [d for d in data for c in colors if d['color'] == c]
>>> pprint(result)
[{'color': 'red', 'name': 'John'},
{'color': 'green', 'name': 'Bob'},
{'color': 'blue', 'name': 'Tom'}]
>>>
The main difference is in using a list comprehension to build result.
Edit: What was I thinking? This clearly calls out for the use of the any() expression:
>>> from pprint import pprint
>>> data = [{'name':'John', 'color':'red'}, {'name':'Bob', 'color':'green'}, {'name':'Tom', 'color':'blue'}]
>>> colors = ['blue', 'red', 'green']
>>> result = [d for d in data if any(d['color'] == c for c in colors)]
>>> pprint(result)
[{'color': 'red', 'name': 'John'},
{'color': 'green', 'name': 'Bob'},
{'color': 'blue', 'name': 'Tom'}]
>>>
Here is a simple loop function:
# Here's the people:
people = [{'name': 'John', 'color': 'red'},
          {'name': 'Bob', 'color': 'green'},
          {'name': 'Tom', 'color': 'blue'}]

# Now we can make a method to get people out in order by color:
def orderpeople(order):
    for color in order:
        for person in people:
            if person['color'] == color:
                yield person

order = ['blue', 'red', 'green']
print(list(orderpeople(order)))
Now that will be VERY slow if you have many people. Then you can loop through them only once, but build an index by color:
# Here's the people:
people = [{'name': 'John', 'color': 'red'},
          {'name': 'Bob', 'color': 'green'},
          {'name': 'Tom', 'color': 'blue'}]

# Now make an index:
colorindex = {}
for each in people:
    color = each['color']
    if color not in colorindex:
        # Note that we want a list here, if several people have the same color.
        colorindex[color] = []
    colorindex[color].append(each)

# Now we can make a method to get people out in order by color:
def orderpeople(order):
    for color in order:
        for each in colorindex[color]:
            yield each

order = ['blue', 'red', 'green']
print(list(orderpeople(order)))
This will be quite fast even for really big lists.
Given:
people = [{'name':'John', 'color':'red'}, {'name':'Bob', 'color':'green'}, {'name':'Tom', 'color':'blue'}]
colors = ['blue', 'red', 'green']
you can do something like this:
def people_by_color(people, colors):
    index = {}
    for person in people:
        if 'color' in person:
            index[person['color']] = person
    return [index.get(color) for color in colors]
If you're going to do this many times with the same list of dictionaries but different lists of colors you'll want to split the index building out and keep the index around so you don't need to rebuild it every time.
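That split could look something like the sketch below (the function names build_color_index and lookup_by_color are mine, not from the answer): build the index once, then reuse it for each new list of colors.

```python
def build_color_index(people):
    # Build the color -> person index once, up front.
    index = {}
    for person in people:
        if 'color' in person:
            index[person['color']] = person
    return index

def lookup_by_color(index, colors):
    # Reuse the prebuilt index for each new list of colors.
    return [index.get(color) for color in colors]

people = [{'name': 'John', 'color': 'red'},
          {'name': 'Bob', 'color': 'green'},
          {'name': 'Tom', 'color': 'blue'}]

index = build_color_index(people)
print(lookup_by_color(index, ['blue', 'red', 'green']))
print(lookup_by_color(index, ['green']))
```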