What is the best way to solve text overlapping? - python

I have a pie chart; this is the code:
import numpy as np
import matplotlib.pyplot as plt
title = "Population of Singapore by age(2018)"
titlelen = len(title)
print("{:*^{titlelen}}".format(title, titlelen=titlelen+6))
print()
data = np.genfromtxt("ca1_data/population.csv",
delimiter=',',skip_header=1,
dtype=[('Year','i4'),('Race','U50'),('Age','U50'),('Population','i4')],
missing_values=['na','-'],filling_values=[0])
age = data[(data['Year']==2018) & (data['Race']=='Total Citizen') & (data ['Age']!='65 Years & Over') & (data['Age']!='70 Years & Over') & (data['Age']!='75 Years & Over') & (data['Age']!='80 Years & Over')]['Age']
population = data[(data['Year']==2018) & (data['Race']=='Total Citizen') & (data['Age']!='65 Years & Over') & (data['Age']!='70 Years & Over') & (data['Age']!='75 Years & Over') & (data['Age']!='80 Years & Over')]['Population']
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)
ax1.pie(population, labels=age,autopct='%1.1f%%',
startangle=90)
plt.title(title)
ax1.axis('equal')
plt.show()
The code produces the correct output, but some of the text overlaps. I tried increasing the figsize, but that makes all of the text small, and some labels end up above the title. Is there a better way to solve this?
This is the value of age
['0 - 4 Years' '5 - 9 Years' '10 - 14 Years' '15 - 19 Years'
'20 - 24 Years' '25 - 29 Years' '30 - 34 Years' '35 - 39 Years'
'40 - 44 Years' '45 - 49 Years' '50 - 54 Years' '55 - 59 Years'
'60 - 64 Years' '65 - 69 Years' '70 - 74 Years' '75 - 79 Years'
'80 - 84 Years' '85 Years & Over']
This is the value of population
[172621 175796 180222 205723 237046 250958 217874 224625 227556 247329
267845 280396 257548 202905 130055 89536 55232 48669]
Thanks in advance!
Update:
import numpy as np
import matplotlib.pyplot as plt
title = "Population of Singapore by age(2018)"
titlelen = len(title)
print("{:*^{titlelen}}".format(title, titlelen=titlelen+6))
print()
data = np.genfromtxt("ca1_data/population.csv",
delimiter=',',skip_header=1,
dtype=[('Year','i4'),('Race','U50'),('Age','U50'),('Population','i4')],
missing_values=['na','-'],filling_values=[0])
age = ['0 - 4 Years', '5 - 9 Years', '10 - 14 Years', '15 - 19 Years',
'20 - 24 Years', '25 - 29 Years', '30 - 34 Years', '35 - 39 Years',
'40 - 44 Years', '45 - 49 Years', '50 - 54 Years', '55 - 59 Years',
'60 - 64 Years', '65 - 69 Years', '70 - 74 Years', '75 - 79 Years',
'80 - 84 Years', '85 Years & Over']
population = [172621, 175796, 180222, 205723, 237046, 250958, 217874, 224625, 227556, 247329,
267845,280396, 257548, 202905, 130055, 89536, 55232, 48669]
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)
ax1.pie(population, labels=age, autopct='%1.1f%%',
        startangle=90)
plt.title(title)
ax1.axis('equal')
plt.show()
This should be the working code
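One common way to avoid overlapping wedge labels is to move them out of the pie and into a legend. The sketch below assumes that approach: it uses the `age` and `population` values from the question, while the helper name `pie_with_legend`, the figure size, the legend title, and the headless `Agg` backend are all assumptions made for illustration, not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

age = ['0 - 4 Years', '5 - 9 Years', '10 - 14 Years', '15 - 19 Years',
       '20 - 24 Years', '25 - 29 Years', '30 - 34 Years', '35 - 39 Years',
       '40 - 44 Years', '45 - 49 Years', '50 - 54 Years', '55 - 59 Years',
       '60 - 64 Years', '65 - 69 Years', '70 - 74 Years', '75 - 79 Years',
       '80 - 84 Years', '85 Years & Over']
population = [172621, 175796, 180222, 205723, 237046, 250958, 217874, 224625,
              227556, 247329, 267845, 280396, 257548, 202905, 130055, 89536,
              55232, 48669]


def pie_with_legend(labels, sizes, title):
    """Draw a pie with no wedge labels; labels and percentages go in a legend."""
    total = sum(sizes)
    # Put the percentage next to each label so nothing is drawn on the wedges.
    legend_labels = [f"{label} ({size / total:.1%})"
                     for label, size in zip(labels, sizes)]
    fig, ax = plt.subplots(figsize=(10, 6))
    wedges = ax.pie(sizes, startangle=90)[0]
    # Anchor the legend to the right of the axes, outside the pie.
    ax.legend(wedges, legend_labels, title="Age group",
              loc="center left", bbox_to_anchor=(1.0, 0.5))
    ax.set_title(title)
    ax.axis("equal")
    fig.tight_layout()
    return fig, ax


fig, ax = pie_with_legend(age, population, "Population of Singapore by age (2018)")
```

In an interactive session you would drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end. If you prefer to keep the percentages on the wedges, the `pctdistance` and `labeldistance` parameters of `ax.pie` are another thing to try.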

Related

Setting X-Tick Labels on Transposed Line Plot

I'm trying to properly label my line plot and set the x-tick labels but have been unsuccessful.
Here is what I've tried so far:
plt.xticks(ticks = ... ,labels =...)
AND
labels = ['8 pcw', '12 pcw', '13 pcw', '16 pcw', '17 pcw', '19 pcw', '21 pcw',
'24 pcw', '35 pcw', '37 pcw', '4 mos', '1 yrs', '2 yrs', '3 yrs',
'4 yrs', '8 yrs', '11 yrs', '13 yrs', '18 yrs', '19 yrs', '21 yrs',
'23 yrs', '30 yrs', '36 yrs', '37 yrs', '40 yrs']
ax.set_xticks(labels)
The code that I've used to transpose this dataframe into a line graph is this:
mean_df.transpose().plot.line(figsize = (25, 10))
plt.xlabel("Age")
plt.ylabel("Raw RPKM")
plt.title("BTRC Expression in V1C")
The dataframe I'm using (mean_df) contains columns that are already named with their respective label (8 pcw, 12 pcw, ... 36yrs, 40yrs) so I would have thought that it would have pulled them automatically from there. However, it looks like matplotlib automatically removes the x-ticks and displays only 5 values for the x-ticks. How can I get it to display all 24 values instead?
I keep getting the following two errors when I try the methods listed above:
Failed to convert value(s) to axis units:
OR
ValueError: The number of FixedLocator locations (n), usually from a
call to set_ticks, does not match the number of ticklabels (n)
Here is an image of my plot:
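A possible fix, sketched under assumptions: pandas draws a string index at integer positions 0, 1, 2, ..., so the tick locations have to be those integers, with the strings passed separately as tick labels. The `labels` list is taken from the question; the contents of `mean_df` below are hypothetical placeholder values (the real data is not shown), and the `Agg` backend is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

labels = ['8 pcw', '12 pcw', '13 pcw', '16 pcw', '17 pcw', '19 pcw', '21 pcw',
          '24 pcw', '35 pcw', '37 pcw', '4 mos', '1 yrs', '2 yrs', '3 yrs',
          '4 yrs', '8 yrs', '11 yrs', '13 yrs', '18 yrs', '19 yrs', '21 yrs',
          '23 yrs', '30 yrs', '36 yrs', '37 yrs', '40 yrs']

# Hypothetical placeholder data: one row whose columns are named after the labels.
mean_df = pd.DataFrame([list(range(len(labels)))], columns=labels,
                       index=["placeholder"])

ax = mean_df.transpose().plot.line(figsize=(25, 10))

# Tick locations are integer positions; the strings are supplied as labels.
ax.set_xticks(list(range(len(labels))))
ax.set_xticklabels(labels, rotation=45, ha="right")

ax.set_xlabel("Age")
ax.set_ylabel("Raw RPKM")
ax.set_title("BTRC Expression in V1C")

shown = [t.get_text() for t in ax.get_xticklabels()]
```

Passing the strings themselves to `set_xticks` is what triggers the "Failed to convert value(s) to axis units" error, because that method expects numeric locations.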

Given a list of days, months, and hours in python, how to find the minimum?

Suppose I have a list with the following elements:
times = ['1 day ago', '1 day ago', '1 day ago', '1 day ago', '1 day ago', '7 days ago', '5 months ago', '27 days ago', '7 days ago', '7 days ago', '1 month ago', '1 month ago', '7 days ago', '7 days ago', '7 days ago', '7 days ago', '27 days ago', '1 month ago', '6 hours ago', '22 hours ago', '20 hours ago', '15 hours ago', '1 day ago', '4 days ago', '10 days ago', '8 days ago', '6 days ago', '7 days ago', '8 hours ago', '14 days ago', '14 days ago', '22 days ago', '2 months ago', '2 months ago', '2 months ago']
I am wondering how I can find the entry corresponding to the shortest duration. My idea is to loop over days, months, etc., but this feels very inefficient. Does anyone have any ideas? Thanks!
You can convert to datetime.timedelta which can be used with min:
from datetime import timedelta
def convert(s):
    n, unit, __ = s.split()
    n = int(n)
    if unit.startswith('month'):  # assuming "1 month" means 30 days
        n *= 30
        unit = 'days'
    if not unit.endswith('s'):
        unit += 's'
    return timedelta(**{unit: n})
Then convert the strings and take the minimum:
deltas = [convert(s) for s in times]
min(deltas)
Or use this method as a key to min:
min(times, key=convert)
I would redefine the key parameter of the min built-in function:
def value(t):
    x = t.split()
    number = int(x[0])
    number *= (1 if x[1].startswith("hour") else
               24 if x[1].startswith("day") else
               24*30)
    return number
result = min(times, key=value)
Here I suppose that a month lasts 30 days (this is not always the case).
There you go:
times = ['1 day ago', '1 day ago', '1 day ago', '1 day ago', '1 day ago', '7 days ago', '5 months ago', '27 days ago', '7 days ago', '7 days ago', '1 month ago', '1 month ago', '7 days ago', '7 days ago', '7 days ago', '7 days ago', '27 days ago', '1 month ago', '6 hours ago', '22 hours ago', '20 hours ago', '15 hours ago', '1 day ago', '4 days ago', '10 days ago', '8 days ago', '6 days ago', '7 days ago', '8 hours ago', '14 days ago', '14 days ago', '22 days ago', '2 months ago', '2 months ago', '2 months ago']
def evaluate(time):
    if 'hour' in time:
        return int(time.split(' ')[0])
    if 'day' in time:
        return int(time.split(' ')[0]) * 24
    if 'month' in time:
        return int(time.split(' ')[0]) * 24 * 30
values = [evaluate(time) for time in times]
minValue = min(values)
minIndex = values.index(minValue)
print(minIndex)
print(times[minIndex])
First you have to parse your input and convert it to something readable by your software:
for t in times:
    num, unit, _ = t.split()
    num = int(num)  # here you have an integer
Then you can use a dictionary to convert your values to seconds (in this example) so that they all have the same unit of measure.
units = {"second": 1, "minute": 60, "hour": 3600, ... }
So you can extract your unit:
if unit.endswith("s"):  # remove the plural `s`
    unit = unit[:-1]
converted_unit = units[unit]
seconds_ago = converted_unit * num
And there you have it: a single number that you can compare with the others. Then it is just a matter of finding the minimum.
Enjoy!
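The steps above can be put together as a small runnable sketch. The names `SECONDS_PER_UNIT` and `seconds_ago` are made up for illustration, the `day` and `month` entries are not in the answer's dictionary, and treating a month as 30 days is an assumption. The `times` list here is a shortened subset of the one in the question.

```python
# Seconds per unit; "day" and "month" are assumptions added for this sketch
# (a month is treated as 30 days).
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 24 * 3600,
    "month": 30 * 24 * 3600,
}


def seconds_ago(text):
    """Convert a string such as '7 days ago' to a number of seconds."""
    num, unit, _ = text.split()
    if unit.endswith("s"):  # remove the plural `s`
        unit = unit[:-1]
    return int(num) * SECONDS_PER_UNIT[unit]


# Shortened subset of the list in the question.
times = ['1 day ago', '7 days ago', '5 months ago', '6 hours ago',
         '22 hours ago', '8 hours ago', '2 months ago']

shortest = min(times, key=seconds_ago)
print(shortest)  # 6 hours ago
```

Using the conversion as the `key` of `min` returns the original string directly, so no separate index lookup is needed.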

Pandas dataframe - sort by partial numeric count of years

I have a pandas DataFrame with 26 columns. I need to create a bar plot based on the unique values of one column, in a particular order. I have managed to extract the unique values of the column into an array. Now I want to sort that array in a particular order. Is there any way to do this?
NOTE:
I would prefer not to disturb the index of the dataframe, based on this column.
my code
e= df['emp_length'].dropna().unique()
e = np.sort(e)
sns.countplot(x='emp_length',order=e,data=df)
The array e is ordered as below
array(['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '< 1 year'],
dtype=object)
However, I want the array to be ordered as below
array(['< 1 year','1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '10+ years'],
dtype=object)
natsorted gets you close to what you need, but you then have to adjust the order by moving the last value to the front:
a = np.array(['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '< 1 year'])
from natsort import natsorted
b = natsorted(a)
print (b[-1:] + b[:-1])
['< 1 year', '1 year', '2 years', '3 years',
'4 years', '5 years', '6 years', '7 years',
'8 years', '9 years', '10+ years']
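If installing natsort is not an option, a similar result can probably be had with a custom sort key from the standard library. This is a sketch of a different technique from the answer above, not a replacement for it; the function name `emp_length_key` is made up, and it assumes every label contains exactly one number, optionally preceded by `<` or followed by `+`.

```python
import re


def emp_length_key(label):
    """Sort key: the number in the label, with '<' before and '+' after it."""
    years = int(re.search(r"\d+", label).group())
    if label.startswith("<"):
        offset = -1  # '< 1 year' sorts just before '1 year'
    elif "+" in label:
        offset = 1   # '10+ years' sorts just after a plain '10 years'
    else:
        offset = 0
    return (years, offset)


e = ['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
     '6 years', '7 years', '8 years', '9 years', '< 1 year']

ordered = sorted(e, key=emp_length_key)
print(ordered)
```

The resulting list can be passed as `order=ordered` to `sns.countplot`, which leaves the DataFrame index untouched.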

Getting key based on values in list of a dict

I'm trying to look up the key whose list of values contains a given element, or return the element itself if it is not found in the dict.
headersDict = {'Number; SEX AND AGE - Total Population':['TPop'],
'Number; SEX AND AGE - Male Population':['MPop'],
'Number; SEX AND AGE - Female Population':['FPop'],
'Under 5 years': ['<5'],
'5 to 9 years': ['5_9'],
'10 to 14 years': ['10_14'],
'15 to 19 years': ['15_19'],
'20 to 24 years': ['20_24'],
'25 to 29 years': ['25_29'],
'30 to 34 years': ['30_34'],
'35 to 39 years': ['35_39'],
'40 to 44 years': ['40_44'],
'45 to 49 years': ['45_49'],
'50 to 54 years': ['50_54'],
'55 to 59 years': ['55_59'],
'60 to 64 years': ['60_64'],
'65 to 69 years': ['65_69'],
'70 to 74 years': ['70_74'],
'75 to 79 years': ['75_79'],
'80 to 84 years': ['80_84'],
'85 years and over': ['85+'],
'Median age(years)': ['Medage'],
'16 years and over': ['16+'],
'18 years and over': ['18+'],
'21 years and over': ['21+'],
'62 years and over': ['62+', 'sixty two+'],
'65 years and over': ['65+', 'sixty five+']}
headersList = [ '1+', '25_29', '85+',
'65+'
]
new_headersList = [k for k, v in headersDict.items() for elem in headersList for val in v if elem == val]
print(new_headersList)
If I try the above, I get the output as:
$ python 1.py
['25 to 29 years', '85 years and over', '65 years and over']
What I require is:
$ python 1.py
['1+', '25 to 29 years', '85 years and over', '65 years and over']
Thanks in advance for the help
These problems are typically easier if you invert your dict:
inverted_dict = {val: key for key, arr in headersDict.items() for val in arr}
Now you can simply look up your keys:
for key in ['1+', '25_29', '85+', '65+']:
    print(inverted_dict.get(key, key))
This code inverts the dictionary so that each value within the list becomes a new key. With the inverted dictionary it is very easy to look up individual header keys or fall back to the header name.
headersDict = {'Number; SEX AND AGE - Total Population': ['TPop'],
'Number; SEX AND AGE - Male Population': ['MPop'],
'Number; SEX AND AGE - Female Population': ['FPop'],
'Under 5 years': ['<5'],
'5 to 9 years': ['5_9'],
'10 to 14 years': ['10_14'],
'15 to 19 years': ['15_19'],
'20 to 24 years': ['20_24'],
'25 to 29 years': ['25_29'],
'30 to 34 years': ['30_34'],
'35 to 39 years': ['35_39'],
'40 to 44 years': ['40_44'],
'45 to 49 years': ['45_49'],
'50 to 54 years': ['50_54'],
'55 to 59 years': ['55_59'],
'60 to 64 years': ['60_64'],
'65 to 69 years': ['65_69'],
'70 to 74 years': ['70_74'],
'75 to 79 years': ['75_79'],
'80 to 84 years': ['80_84'],
'85 years and over': ['85+'],
'Median age(years)': ['Medage'],
'16 years and over': ['16+'],
'18 years and over': ['18+'],
'21 years and over': ['21+'],
'62 years and over': ['62+', 'sixty two+'],
'65 years and over': ['65+', 'sixty five+']}
headersDictReversed = {}
for k, v in headersDict.items():
    for new_k in v:
        headersDictReversed[new_k] = k
headersList = ['1+', '25_29', '85+', '65+']
results = []
for header in headersList:
    # Return the value for header and default to the header itself.
    results.append(headersDictReversed.get(header, header))
print(results)
['1+', '25 to 29 years', '85 years and over', '65 years and over']
If you can use pandas, you can use this solution:
import pandas as pd
df1 = pd.DataFrame(headersDict, index=[0,1]).T.reset_index()
df1 = pd.DataFrame(pd.concat([df1[0], df1[1]]).drop_duplicates()).join(df1, lsuffix='_1').drop(columns=['0',1]).rename(columns={'0_1':0})
a = pd.DataFrame(headersList).merge(df1, 'outer')[0:len(pd.DataFrame(headersList))].set_index(0)['index']
a.fillna(a.index.to_series()).values.tolist()
# ['1+', '25 to 29 years', '85 years and over', '65 years and over']

Regular Expression to extract numbers from age range DataFrame column with multiple formats

I am trying to extract the high and low numbers from a column that has multiple formats.
For instance,
if the value is: 'Age 34 - 35', I want to collect (34, 35)
if the value is: '35-44 years old', I want to collect (35, 44)
if the value is: '75+ years old', I am fine collecting (75, '')
I currently have a regex written that works for some of the formats but not for others:
dataframe[['age_low', 'age_high']] = dataframe['age'].str.extract(r'(\d*)[\s-]*(\d*)')
Here are all the possible values in the original age column:
dataframe['age'].unique()
array([nan, 'Age 34 - 35 ', 'Age 78 - 79 ', 'Age 60 - 61 ',
'Age 50 - 51 ', 'Age 20 - 21 ', 'Age 70 - 71 ', 'Age 82 - 83 ',
'Age 88 - 89 ', 'Age 68 - 69 ', 'Age 86 - 87 ', 'Age 84 - 85 ',
'Age 46 - 47 ', 'Age 30 - 31', 'Age 94 - 95 ', 'Age 22 - 23 ',
'Age 44 - 45 ', 'Age 74 - 75 ', 'Age 40 - 41', 'Age 72 - 73 ',
'Age 52 - 53 ', 'Age 48 - 49 ', 'Age 66 - 67 ', 'Age 62 - 63 ',
'Age 56 - 57 ', 'Age 64 - 65 ', 'Age 38 - 39 ', 'Age 42 - 43 ',
'Age 54 - 55 ', 'Age 24 - 25 ', 'Age 90 - 91 ', 'Age 76 - 77 ',
'Age 58 - 59 ', 'Age 32 - 33', 'Age 26 - 27 ', 'Age 80 - 81 ',
'Age 28 - 29 ', 'Age 36 - 37', 'Age 96 - 97 ',
'Age greater than 99', 'Age 18 - 19', 'Age 92 - 93 ',
'Age 98 - 99 ','65-74 years old', '35-44 years old', '45-54 years old',
'75+ years old', '55-64 years old', '25-34 years old',
'18-24 years old'], dtype=object)
For the possible values in your question that only have one age value, that age always represents the low side of the range. As a result, you could just capture the first one or more digits in the string and then use a non-capture group to indicate a potential following sequence of non-digits followed by another group of one or more digits. If there is a second age in the string, it will get captured as the high side of the range. If there is only one age, you will just get a NaN value for the high side of the range.
For example:
import pandas as pd
ages = ['Age 96 - 97', 'Age greater than 99', '65-74 years old', '75+ years old']
df = pd.DataFrame({'age': ages})
df[['age_low', 'age_high']] = df['age'].str.extract(r'(\d+)(?:\D+(\d+))?')
print(df)
#                    age age_low age_high
# 0          Age 96 - 97      96       97
# 1  Age greater than 99      99      NaN
# 2      65-74 years old      65       74
# 3        75+ years old      75      NaN
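To see how the same pattern behaves on each format outside pandas, here is a small sketch using only the standard `re` module. The helper name `age_range` is made up, and returning `None` for a missing bound (or for a non-string value such as NaN) is an assumption about what you want.

```python
import re

# Same pattern as above: first run of digits, then optionally
# some non-digits followed by a second run of digits.
AGE_PATTERN = re.compile(r"(\d+)(?:\D+(\d+))?")


def age_range(value):
    """Return (low, high) as ints; a missing bound is returned as None."""
    if not isinstance(value, str):  # e.g. NaN in the original column
        return (None, None)
    match = AGE_PATTERN.search(value)
    if match is None:
        return (None, None)
    low, high = match.groups()
    return (int(low), int(high) if high is not None else None)


samples = ['Age 34 - 35 ', '35-44 years old', '75+ years old',
           'Age greater than 99', float('nan')]
ranges = [age_range(s) for s in samples]
print(ranges)
# [(34, 35), (35, 44), (75, None), (99, None), (None, None)]
```

Note that for 'Age greater than 99' the single number lands in the low position, which matches the reasoning above that a lone age is always the low side of the range.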
