Getting key based on values in list of a dict - python

I'm trying to get the key whose list of values contains a given element, or return the element itself if it is not found in the dict.
headersDict = {'Number; SEX AND AGE - Total Population':['TPop'],
'Number; SEX AND AGE - Male Population':['MPop'],
'Number; SEX AND AGE - Female Population':['FPop'],
'Under 5 years': ['<5'],
'5 to 9 years': ['5_9'],
'10 to 14 years': ['10_14'],
'15 to 19 years': ['15_19'],
'20 to 24 years': ['20_24'],
'25 to 29 years': ['25_29'],
'30 to 34 years': ['30_34'],
'35 to 39 years': ['35_39'],
'40 to 44 years': ['40_44'],
'45 to 49 years': ['45_49'],
'50 to 54 years': ['50_54'],
'55 to 59 years': ['55_59'],
'60 to 64 years': ['60_64'],
'65 to 69 years': ['65_69'],
'70 to 74 years': ['70_74'],
'75 to 79 years': ['75_79'],
'80 to 84 years': ['80_84'],
'85 years and over': ['85+'],
'Median age(years)': ['Medage'],
'16 years and over': ['16+'],
'18 years and over': ['18+'],
'21 years and over': ['21+'],
'62 years and over': ['62+', 'sixty two+'],
'65 years and over': ['65+', 'sixty five+']}
headersList = [ '1+', '25_29', '85+',
'65+'
]
new_headersList = [k for k, v in headersDict.items() for elem in headersList for val in v if elem == val]
print(new_headersList)
If I try the above, I get the output as:
$ python 1.py
['25 to 29 years', '85 years and over', '65 years and over']
What I require is:
$ python 1.py
['1+', '25 to 29 years', '85 years and over', '65 years and over']
Thanks in advance for the help

These problems are typically easier if you invert your dict:
inverted_dict = {val: key for key, arr in headersDict.items() for val in arr}
Now you can simply look up your keys, falling back to the key itself when it is missing:
for key in ['1+', '25_29', '85+', '65+']:
    print(inverted_dict.get(key, key))

This code inverts the dictionary so that each value within a list becomes a new key. With the inverted dictionary it is easy to look up individual header keys, or to fall back to the header name when there is no match.
headersDict = {'Number; SEX AND AGE - Total Population': ['TPop'],
'Number; SEX AND AGE - Male Population': ['MPop'],
'Number; SEX AND AGE - Female Population': ['FPop'],
'Under 5 years': ['<5'],
'5 to 9 years': ['5_9'],
'10 to 14 years': ['10_14'],
'15 to 19 years': ['15_19'],
'20 to 24 years': ['20_24'],
'25 to 29 years': ['25_29'],
'30 to 34 years': ['30_34'],
'35 to 39 years': ['35_39'],
'40 to 44 years': ['40_44'],
'45 to 49 years': ['45_49'],
'50 to 54 years': ['50_54'],
'55 to 59 years': ['55_59'],
'60 to 64 years': ['60_64'],
'65 to 69 years': ['65_69'],
'70 to 74 years': ['70_74'],
'75 to 79 years': ['75_79'],
'80 to 84 years': ['80_84'],
'85 years and over': ['85+'],
'Median age(years)': ['Medage'],
'16 years and over': ['16+'],
'18 years and over': ['18+'],
'21 years and over': ['21+'],
'62 years and over': ['62+', 'sixty two+'],
'65 years and over': ['65+', 'sixty five+']}
headersDictReversed = {}
for k, v in headersDict.items():
    for new_k in v:
        headersDictReversed[new_k] = k
headersList = ['1+', '25_29', '85+', '65+']
results = []
for header in headersList:
    # Return the value for header and default to the header itself.
    results.append(headersDictReversed.get(header, header))
print(results)
['1+', '25 to 29 years', '85 years and over', '65 years and over']
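One caveat with the inversion, sketched here under the assumption that an alias could appear under more than one key (this does not happen in the headersDict above): the later key would silently overwrite the earlier one. The helper name invert_aliases is hypothetical and is only meant to make that case fail loudly.

```python
def invert_aliases(mapping):
    """Invert {key: [alias, ...]} into {alias: key}, refusing duplicate aliases."""
    inverted = {}
    for key, aliases in mapping.items():
        for alias in aliases:
            if alias in inverted:
                raise ValueError(
                    f"alias {alias!r} is listed under both "
                    f"{inverted[alias]!r} and {key!r}"
                )
            inverted[alias] = key
    return inverted


# A small subset of headersDict, for illustration only.
headers = {
    '25 to 29 years': ['25_29'],
    '85 years and over': ['85+'],
    '65 years and over': ['65+', 'sixty five+'],
}
lookup = invert_aliases(headers)

# Unknown elements such as '1+' fall back to themselves.
result = [lookup.get(h, h) for h in ['1+', '25_29', '85+', '65+']]
print(result)
# ['1+', '25 to 29 years', '85 years and over', '65 years and over']
```

If duplicates are legitimate in your data, mapping each alias to a list of keys would probably be a better fit than raising.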

If you can use pandas, you can use this solution:
import pandas as pd
df1 = pd.DataFrame(headersDict, index=[0,1]).T.reset_index()
df1 = pd.DataFrame(pd.concat([df1[0], df1[1]]).drop_duplicates()).join(df1, lsuffix='_1').drop(columns=['0',1]).rename(columns={'0_1':0})
a = pd.DataFrame(headersList).merge(df1, 'outer')[0:len(pd.DataFrame(headersList))].set_index(0)['index']
a.fillna(a.index.to_series()).values.tolist()
# ['1+', '25 to 29 years', '85 years and over', '65 years and over']

Related

How to strip only specific parts of a list element in Python

So lets say i have a list, where one element contains an 'hour-minute-second-tagnumber', like this:
mylist = ['10 30 12 TTH-312', '10 38 45 ZHE-968', '11 25 54 KAP-116']
How do I turn an element into this: '103012 TTH-312'?
Can I somehow .strip only specific parts of an element?
You can limit the number of replacements to 2.
[x.replace(' ','', 2) for x in mylist]
Output
['103012 TTH-312', '103845 ZHE-968', '112554 KAP-116']
If you are trying to do a loop:
adatok = ['10 30 12 TTH-312', '10 38 45 ZHE-968', '11 25 54 KAP-116']
for i in adatok:
    i = i.replace(' ', '', 2)
    print(i)
Using a list comprehension along with re.sub, we can try:
import re
mylist = ['10 30 12 TTH-312', '10 38 45 ZHE-968', '11 25 54 KAP-116']
mylist = [re.sub(r'(\d+) (\d+) (\d+) (.*)', r'\1\2\3 \4', x) for x in mylist]
print(mylist) # ['103012 TTH-312', '103845 ZHE-968', '112554 KAP-116']
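Both approaches above rely on the time part having exactly three space-separated fields. A sketch of an alternative, assuming only that the tag is always the last space-separated field: split it off with str.rsplit and remove the spaces from whatever precedes it. The function name compact_time is made up for this example.

```python
def compact_time(entry):
    """Remove the spaces in everything before the last space-separated field."""
    time_part, tag = entry.rsplit(' ', 1)  # split once, starting from the right
    return time_part.replace(' ', '') + ' ' + tag


mylist = ['10 30 12 TTH-312', '10 38 45 ZHE-968', '11 25 54 KAP-116']
compacted = [compact_time(x) for x in mylist]
print(compacted)
# ['103012 TTH-312', '103845 ZHE-968', '112554 KAP-116']
```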

how to make summary aggregated information from multiple columns in pandas dataframe as list of strings?

I have a dataframe like as:
df =
time_id gt_class num_missed_base num_missed_feature num_objects_base num_objects_feature
5G21A6P00L4100023:1566617404450336 CAR 11 4 27 30
5G21A6P00L4100023:1566617404450336 BICYCLE 4 6 27 30
5G21A6P00L4100023:1566617404450336 PERSON 2 3 27 30
5G21A6P00L4100023:1566617404450336 TRUCK 1 0 27 30
5G21A6P00L4100023:1566617428450689 CAR 25 14 60 67
5G21A6P00L4100023:1566617428450689 PERSON 7 6 60 67
5G21A6P00L4100023:1566617515950900 BICYCLE 1 1 59 65
5G21A6P00L4100023:1566617515950900 CAR 20 9 59 65
5G21A6P00L4100023:1566617515950900 PERSON 10 2 59 65
5G21A6P00L4100037:1567169649450046 CAR 8 0 29 32
5G21A6P00L4100037:1567169649450046 PERSON 1 0 29 32
5G21A6P00L4100037:1567169649450046 TRUCK 1 0 29 32
For each time_id, it shows how many objects the base model missed (num_missed_base), how many the feature model missed (num_missed_feature), and how many objects exist at that time in the base and feature models (num_objects_base, num_objects_feature).
I need to make the following dataframe:
time_id gt_class num_missed_base num_missed_feature hover_base hover_feature
0 5G21A6P00L4100023:1566617404450336 CAR,BICYCLE,PERSON,TRUCK 18 13 ['CAR: 11', 'BICYCLE: 4', 'PERSON: 2', 'TRUCK: 1'] ['CAR: 4', 'BICYCLE: 6', 'PERSON: 3', 'TRUCK: 0']
1 5G21A6P00L4100023:1566617428450689 CAR,PERSON 32 20 ['CAR: 25', 'PERSON: 7'] ['CAR: 14', 'PERSON: 6']
2 5G21A6P00L4100023:1566617515950900 BICYCLE,CAR,PERSON 31 12 ['BICYCLE: 1', 'CAR: 20', 'PERSON: 10'] ['BICYCLE: 1', 'CAR: 9', 'PERSON: 2']
3 5G21A6P00L4100037:1567169649450046 CAR,PERSON,TRUCK 10 0 ['CAR: 8', 'PERSON: 1', 'TRUCK: 1'] ['CAR: 0', 'PERSON: 0', 'TRUCK: 0']
You can group by time_id and then apply the relevant aggregation function to each column.
See: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
Note: this is a similar but simpler example.
import pandas as pd
df = pd.DataFrame(data={
'time_id': ['2020-01-01','2020-01-01','2020-01-01','2020-01-02','2020-01-02','2020-01-02'],
'val1': ['car', 'bicycle', 'person', 'truck', 'aeroplane', 'train'],
'val2': [0,1,2,9,8,7],
'val3': [9,2,3,4,5,6]
})
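def func(row):
    # Join all val1 strings of the group into one comma-separated string.
    return ','.join(row.tolist())
def multi_column1(row):
    # Build 'val1 : val3' strings, using the group's index to look up both columns in df.
    l = []
    for n in row.index:
        x = df.loc[n, 'val1']
        y = df.loc[n, 'val3']
        w = '{} : {}'.format(x, y)
        l.append(w)
    return l
ans = df.groupby('time_id').agg({'val1': func, 'val2': 'sum', 'val3': multi_column1})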
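Applied to the columns from the question, the same idea might look like the sketch below. It uses only the first two time_id groups from the question's data, and it builds the 'CLASS: count' strings before grouping rather than inside the aggregation function, which is a swapped-in variation and not the code from the answer above. It assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# First two time_id groups from the question's dataframe.
df = pd.DataFrame({
    'time_id': ['5G21A6P00L4100023:1566617404450336'] * 4
               + ['5G21A6P00L4100023:1566617428450689'] * 2,
    'gt_class': ['CAR', 'BICYCLE', 'PERSON', 'TRUCK', 'CAR', 'PERSON'],
    'num_missed_base': [11, 4, 2, 1, 25, 7],
    'num_missed_feature': [4, 6, 3, 0, 14, 6],
})

# Build the per-row 'CLASS: count' strings first, so the aggregation is just list().
df['hover_base'] = df['gt_class'] + ': ' + df['num_missed_base'].astype(str)
df['hover_feature'] = df['gt_class'] + ': ' + df['num_missed_feature'].astype(str)

result = (
    df.groupby('time_id', sort=False)
      .agg(
          gt_class=('gt_class', ','.join),
          num_missed_base=('num_missed_base', 'sum'),
          num_missed_feature=('num_missed_feature', 'sum'),
          hover_base=('hover_base', list),
          hover_feature=('hover_feature', list),
      )
      .reset_index()
)
print(result)
```

The sort=False argument keeps the groups in their order of first appearance; drop it if sorted time_id values are preferred.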

How to print a list as separate lists in Python

I have a list:
grades= ['doe john 100 90 80 90', 'miller sally 70 90 60 100 80', 'smith jakob 45 55 50 58', 'white jack 85 95 65 80 75']
I want to be able to break that list so the output would be:
['doe john 100 90 80 90']
['miller sally 70 90 60 100 80']
['smith jakob 45 55 50 58']
['white jack 85 95 65 80 75']
Additionally, I would like to split the elements in the list so it looks like:
['doe', 'john', '100', '90', '80', '90']
['miller', 'sally', '70', '90', '60', '100', '80']
['smith', 'jakob', '45', '55', '50', '58']
['white', 'jack', '85', '95', '65', '80', '75']
I'm not really sure how to go about doing this or if this is even possible as I'm just starting to learn python. Any ideas?
for l in grades:
    l = l.split()
    print(l)
Or:
final = [l.split() for l in grades]
See "Split string on whitespace in Python".
This can be done quickly with .split() in a list comprehension.
grades = ['doe john 100 90 80 90', 'miller sally 70 90 60 100 80', 'smith jakob 45 55 50 58', 'white jack 85 95 65 80 75']
grades = [grade.split() for grade in grades]
print(grades)
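If the scores are needed as numbers later on (an assumption, since the question only asks for the split), a sketch using star-unpacking could separate the two name fields from the scores and convert the scores to int. The variable names here are illustrative, and only the first two entries from the question are used.

```python
grades = ['doe john 100 90 80 90', 'miller sally 70 90 60 100 80']

records = []
for grade in grades:
    # The first two fields are the names; everything after them is a score.
    last, first, *scores = grade.split()
    records.append((last, first, [int(s) for s in scores]))

for record in records:
    print(record)
# ('doe', 'john', [100, 90, 80, 90])
# ('miller', 'sally', [70, 90, 60, 100, 80])
```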

What is the best way to solve text overlapping?

I have a pie chart, this is the code
import numpy as np
import matplotlib.pyplot as plt
title = "Population of Singapore by age(2018)"
titlelen = len(title)
print("{:*^{titlelen}}".format(title, titlelen=titlelen+6))
print()
data = np.genfromtxt("ca1_data/population.csv",
delimiter=',',skip_header=1,
dtype=[('Year','i4'),('Race','U50'),('Age','U50'),('Population','i4')],
missing_values=['na','-'],filling_values=[0])
age = data[(data['Year']==2018) & (data['Race']=='Total Citizen') & (data ['Age']!='65 Years & Over') & (data['Age']!='70 Years & Over') & (data['Age']!='75 Years & Over') & (data['Age']!='80 Years & Over')]['Age']
population = data[(data['Year']==2018) & (data['Race']=='Total Citizen') & (data['Age']!='65 Years & Over') & (data['Age']!='70 Years & Over') & (data['Age']!='75 Years & Over') & (data['Age']!='80 Years & Over')]['Population']
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)
ax1.pie(population, labels=age,autopct='%1.1f%%',
startangle=90)
plt.title(title)
ax1.axis('equal')
plt.show()
The code produces the correct output, but some of the labels overlap. I tried increasing figsize, but that makes all the text small, and some labels end up above the title. Is there a better way to solve this?
This is the value of age
['0 - 4 Years' '5 - 9 Years' '10 - 14 Years' '15 - 19 Years'
'20 - 24 Years' '25 - 29 Years' '30 - 34 Years' '35 - 39 Years'
'40 - 44 Years' '45 - 49 Years' '50 - 54 Years' '55 - 59 Years'
'60 - 64 Years' '65 - 69 Years' '70 - 74 Years' '75 - 79 Years'
'80 - 84 Years' '85 Years & Over']
This is the value of population
[172621 175796 180222 205723 237046 250958 217874 224625 227556 247329
267845 280396 257548 202905 130055 89536 55232 48669]
Thanks in advance!
Update:
import numpy as np
import matplotlib.pyplot as plt
title = "Population of Singapore by age(2018)"
titlelen = len(title)
print("{:*^{titlelen}}".format(title, titlelen=titlelen+6))
print()
data = np.genfromtxt("ca1_data/population.csv",
delimiter=',',skip_header=1,
dtype=[('Year','i4'),('Race','U50'),('Age','U50'),('Population','i4')],
missing_values=['na','-'],filling_values=[0])
age = ['0 - 4 Years', '5 - 9 Years', '10 - 14 Years', '15 - 19 Years',
'20 - 24 Years', '25 - 29 Years', '30 - 34 Years', '35 - 39 Years',
'40 - 44 Years', '45 - 49 Years', '50 - 54 Years', '55 - 59 Years',
'60 - 64 Years', '65 - 69 Years', '70 - 74 Years', '75 - 79 Years',
'80 - 84 Years', '85 Years & Over']
population = [172621, 175796, 180222, 205723, 237046, 250958, 217874, 224625, 227556, 247329,
267845,280396, 257548, 202905, 130055, 89536, 55232, 48669]
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)
ax1.pie(population, labels=age, autopct='%1.1f%%',
        startangle=90)
plt.title(title)
ax1.axis('equal')
plt.show()
This should be the working code

Regular Expression to extract numbers from age range DataFrame column with multiple formats

I am trying to extract the high and low numbers from a column that has multiple formats.
For instance,
if the value is: 'Age 34 - 35', I want to collect (34, 35)
if the value is: '35-44 years old', I want to collect (35, 44)
if the value is: '75+ years old', I am fine collecting (75, '')
I currently have a regex that works for some of the formats but not for others:
dataframe[['age_low', 'age_high']] = dataframe['age'].str.extract(r'(\d*)[\s-]*(\d*)')
Here are all the possible values in the original age column:
dataframe['age'].unique()
array([nan, 'Age 34 - 35 ', 'Age 78 - 79 ', 'Age 60 - 61 ',
'Age 50 - 51 ', 'Age 20 - 21 ', 'Age 70 - 71 ', 'Age 82 - 83 ',
'Age 88 - 89 ', 'Age 68 - 69 ', 'Age 86 - 87 ', 'Age 84 - 85 ',
'Age 46 - 47 ', 'Age 30 - 31', 'Age 94 - 95 ', 'Age 22 - 23 ',
'Age 44 - 45 ', 'Age 74 - 75 ', 'Age 40 - 41', 'Age 72 - 73 ',
'Age 52 - 53 ', 'Age 48 - 49 ', 'Age 66 - 67 ', 'Age 62 - 63 ',
'Age 56 - 57 ', 'Age 64 - 65 ', 'Age 38 - 39 ', 'Age 42 - 43 ',
'Age 54 - 55 ', 'Age 24 - 25 ', 'Age 90 - 91 ', 'Age 76 - 77 ',
'Age 58 - 59 ', 'Age 32 - 33', 'Age 26 - 27 ', 'Age 80 - 81 ',
'Age 28 - 29 ', 'Age 36 - 37', 'Age 96 - 97 ',
'Age greater than 99', 'Age 18 - 19', 'Age 92 - 93 ',
'Age 98 - 99 ','65-74 years old', '35-44 years old', '45-54 years old',
'75+ years old', '55-64 years old', '25-34 years old',
'18-24 years old'], dtype=object)
For the values in your question that contain only one age, that age always represents the low side of the range. As a result, you can capture the first run of one or more digits in the string, and then use an optional non-capturing group that matches a sequence of non-digits followed by a second captured run of one or more digits. If there is a second age in the string, it is captured as the high side of the range. If there is only one age, you get NaN for the high side of the range.
For example:
import pandas as pd
ages = ['Age 96 - 97', 'Age greater than 99', '65-74 years old', '75+ years old']
df = pd.DataFrame({'age': ages})
df[['age_low', 'age_high']] = df['age'].str.extract(r'(\d+)(?:\D+(\d+))?')
print(df)
# age age_low age_high
# 0 Age 96 - 97 96 97
# 1 Age greater than 99 99 NaN
# 2 65-74 years old 65 74
# 3 75+ years old 75 NaN
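To see what the pattern does outside pandas, here is a minimal sketch using only the standard re module on a few values from the question. The helper name parse_age_range is made up for illustration; a missing high value comes back as None here, where str.extract would give NaN.

```python
import re

# Same pattern as above: first run of digits, then optionally
# some non-digits followed by a second run of digits.
AGE_RANGE = re.compile(r'(\d+)(?:\D+(\d+))?')


def parse_age_range(text):
    """Return (low, high) as strings; high is None when only one number is present."""
    match = AGE_RANGE.search(text)
    if match is None:
        return (None, None)
    return match.groups()


for value in ['Age 34 - 35 ', 'Age greater than 99', '35-44 years old', '75+ years old']:
    print(repr(value), '->', parse_age_range(value))
# 'Age 34 - 35 ' -> ('34', '35')
# 'Age greater than 99' -> ('99', None)
# '35-44 years old' -> ('35', '44')
# '75+ years old' -> ('75', None)
```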
