Pandas dataframe - sort by partial numeric count of years - python

I have a pandas DataFrame with 26 columns. I need to create a bar plot based on the unique values of one column, in a particular order. I have managed to extract the unique values of the column into an array. Now I want to sort that array in a particular order. Is there any way to do this?
NOTE: I would prefer not to change the index of the dataframe based on this column.
My code:
e= df['emp_length'].dropna().unique()
e = np.sort(e)
sns.countplot(x='emp_length',order=e,data=df)
The array e is ordered as below
array(['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '< 1 year'],
dtype=object)
However, I want the array to be ordered as below
array(['< 1 year','1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '10+ years'],
dtype=object)

natsorted from the natsort package gets close to what you need, but you then have to change the order by moving the last value to the front:
a = np.array(['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '< 1 year'])
from natsort import natsorted
b = natsorted(a)
print (b[-1:] + b[:-1])
['< 1 year', '1 year', '2 years', '3 years',
'4 years', '5 years', '6 years', '7 years',
'8 years', '9 years', '10+ years']
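If installing natsort is not an option, a plain-Python sort key may be enough. The sketch below is only one possible approach: emp_length_key is a hypothetical helper, and it assumes every label contains exactly one integer, that a leading "<" means "less than that number", and that a "+" means "more than that number".

```python
import re

def emp_length_key(label):
    """Map a label such as '< 1 year' or '10+ years' to a sortable number."""
    number = int(re.search(r"\d+", label).group())
    if label.lstrip().startswith("<"):
        return number - 0.5  # assumption: '< 1 year' sorts just before '1 year'
    if "+" in label:
        return number + 0.5  # assumption: '10+ years' sorts just after '10 years'
    return number

labels = ['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
          '6 years', '7 years', '8 years', '9 years', '< 1 year']
ordered = sorted(labels, key=emp_length_key)
print(ordered)
```

The resulting list could then be passed as the order argument of sns.countplot, as in the question, without touching the dataframe's index.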

Related

Setting X-Tick Labels on Transposed Line Plot

I'm trying to properly label my line plot and set the x-tick labels but have been unsuccessful.
Here is what I've tried so far:
plt.xticks(ticks = ... ,labels =...)
AND
labels = ['8 pcw', '12 pcw', '13 pcw', '16 pcw', '17 pcw', '19 pcw', '21 pcw',
'24 pcw', '35 pcw', '37 pcw', '4 mos', '1 yrs', '2 yrs', '3 yrs',
'4 yrs', '8 yrs', '11 yrs', '13 yrs', '18 yrs', '19 yrs', '21 yrs',
'23 yrs', '30 yrs', '36 yrs', '37 yrs', '40 yrs']
ax.set_xticks(labels)
The code that I've used to transpose this dataframe into a line graph is this:
mean_df.transpose().plot().line(figsize = (25, 10))
plt.xlabel("Age")
plt.ylabel("Raw RPKM")
plt.title("BTRC Expression in V1C")
The dataframe I'm using (mean_df) contains columns that are already named with their respective label (8 pcw, 12 pcw, ... 36yrs, 40yrs) so I would have thought that it would have pulled them automatically from there. However, it looks like matplotlib automatically removes the x-ticks and displays only 5 values for the x-ticks. How can I get it to display all 24 values instead?
I keep getting the following two errors when I try the methods listed above:
Failed to convert value(s) to axis units:
OR
ValueError: The number of FixedLocator locations (n), usually from a
call to set_ticks, does not match the number of ticklabels (n)
Here is an image of my plot:

Given a list of days, months, and hours in python, how to find the minimum?

Suppose I have a list with the following elements:
times = ['1 day ago', '1 day ago', '1 day ago', '1 day ago', '1 day ago', '7 days ago', '5 months ago', '27 days ago', '7 days ago', '7 days ago', '1 month ago', '1 month ago', '7 days ago', '7 days ago', '7 days ago', '7 days ago', '27 days ago', '1 month ago', '6 hours ago', '22 hours ago', '20 hours ago', '15 hours ago', '1 day ago', '4 days ago', '10 days ago', '8 days ago', '6 days ago', '7 days ago', '8 hours ago', '14 days ago', '14 days ago', '22 days ago', '2 months ago', '2 months ago', '2 months ago']
I am wondering how I can find the entry corresponding to the shortest duration. My idea is to loop over days, months, etc., but this feels very inefficient. Does anyone have any ideas? Thanks!
You can convert to datetime.timedelta which can be used with min:
from datetime import timedelta
def convert(s):
    n, unit, __ = s.split()
    n = int(n)
    if unit.startswith('month'):  # assuming "1 month" means 30 days
        n *= 30
        unit = 'days'
    if not unit.endswith('s'):
        unit += 's'
    return timedelta(**{unit: n})
Then convert the strings and take the minimum:
deltas = [convert(s) for s in times]
min(deltas)
Or use this method as a key to min:
min(times, key=convert)
I would pass a custom key function to the built-in min, converting every entry to hours:
def value(t):
    x = t.split()
    number = int(x[0])
    number *= (1 if x[1].startswith("hour") else
               24 if x[1].startswith("day") else
               24*30)
    return number
result = min(times, key=value)
Here I suppose that a month lasts 30 days (this is not always the case).
There you go:
times = ['1 day ago', '1 day ago', '1 day ago', '1 day ago', '1 day ago', '7 days ago', '5 months ago', '27 days ago', '7 days ago', '7 days ago', '1 month ago', '1 month ago', '7 days ago', '7 days ago', '7 days ago', '7 days ago', '27 days ago', '1 month ago', '6 hours ago', '22 hours ago', '20 hours ago', '15 hours ago', '1 day ago', '4 days ago', '10 days ago', '8 days ago', '6 days ago', '7 days ago', '8 hours ago', '14 days ago', '14 days ago', '22 days ago', '2 months ago', '2 months ago', '2 months ago']
def evaluate(time):
    if 'hour' in time:
        return int(time.split(' ')[0])
    if 'day' in time:
        return int(time.split(' ')[0]) * 24
    if 'month' in time:
        return int(time.split(' ')[0]) * 24 * 30
values = [evaluate(time) for time in times]
minValue = min(values)
minIndex = values.index(minValue)
print(minIndex)
print(times[minIndex])
First you have to parse your input and convert it into something your program can work with:
for t in times:
    num, unit, _ = t.split()
    num = int(num)  # here you have an integer
Then you can use a dictionary to convert your values to seconds (in this example) so that they all share the same unit of measure.
units = {"second": 1, "minute": 60, "hour": 3600, ... }
Then you can look up your unit:
if unit.endswith("s"):  # remove the plural `s`
    unit = unit[:-1]
converted_unit = units[unit]
seconds_ago = converted_unit * num
And there you have it: a single number that you can compare with the others. Then it is just a matter of finding the minimum.
Enjoy!
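Put together, the dictionary approach might look like the sketch below. The names SECONDS_PER_UNIT and seconds_ago are made up for illustration, the "week" and "month" entries are assumptions (a month is taken as 30 days), and the list is a shortened sample of the one in the question.

```python
# Seconds per unit. "week" and "month" are assumptions (month = 30 days).
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 30 * 86400,
}

def seconds_ago(text):
    """Convert a string such as '7 days ago' to a number of seconds."""
    num, unit, _ = text.split()
    if unit.endswith("s"):  # remove the plural "s"
        unit = unit[:-1]
    return int(num) * SECONDS_PER_UNIT[unit]

# Shortened sample of the list from the question.
times = ['1 day ago', '7 days ago', '5 months ago', '27 days ago',
         '1 month ago', '6 hours ago', '22 hours ago', '2 months ago']

shortest = min(times, key=seconds_ago)
print(shortest)
```

Passing the function as key means min returns the original string rather than the converted number, so no index lookup is needed afterwards.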

What is the best way to solve overlapping text?

I have a pie chart; this is the code:
import numpy as np
import matplotlib.pyplot as plt
title = "Population of Singapore by age(2018)"
titlelen = len(title)
print("{:*^{titlelen}}".format(title, titlelen=titlelen+6))
print()
data = np.genfromtxt("ca1_data/population.csv",
delimiter=',',skip_header=1,
dtype=[('Year','i4'),('Race','U50'),('Age','U50'),('Population','i4')],
missing_values=['na','-'],filling_values=[0])
age = data[(data['Year']==2018) & (data['Race']=='Total Citizen') & (data ['Age']!='65 Years & Over') & (data['Age']!='70 Years & Over') & (data['Age']!='75 Years & Over') & (data['Age']!='80 Years & Over')]['Age']
population = data[(data['Year']==2018) & (data['Race']=='Total Citizen') & (data['Age']!='65 Years & Over') & (data['Age']!='70 Years & Over') & (data['Age']!='75 Years & Over') & (data['Age']!='80 Years & Over')]['Population']
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)
ax1.pie(population, labels=age,autopct='%1.1f%%',
startangle=90)
plt.title(title)
ax1.axis('equal')
plt.show()
The code produces the correct output, but some of the text labels overlap. I tried increasing the figsize, but that makes all the text small, and some labels end up above the title. Is there a better way to solve this?
This is the value of age
['0 - 4 Years' '5 - 9 Years' '10 - 14 Years' '15 - 19 Years'
'20 - 24 Years' '25 - 29 Years' '30 - 34 Years' '35 - 39 Years'
'40 - 44 Years' '45 - 49 Years' '50 - 54 Years' '55 - 59 Years'
'60 - 64 Years' '65 - 69 Years' '70 - 74 Years' '75 - 79 Years'
'80 - 84 Years' '85 Years & Over']
This is the value of population
[172621 175796 180222 205723 237046 250958 217874 224625 227556 247329
267845 280396 257548 202905 130055 89536 55232 48669]
Thanks in advance!
Update:
import numpy as np
import matplotlib.pyplot as plt
title = "Population of Singapore by age(2018)"
titlelen = len(title)
print("{:*^{titlelen}}".format(title, titlelen=titlelen+6))
print()
data = np.genfromtxt("ca1_data/population.csv",
delimiter=',',skip_header=1,
dtype=[('Year','i4'),('Race','U50'),('Age','U50'),('Population','i4')],
missing_values=['na','-'],filling_values=[0])
age = ['0 - 4 Years', '5 - 9 Years', '10 - 14 Years', '15 - 19 Years',
'20 - 24 Years', '25 - 29 Years', '30 - 34 Years', '35 - 39 Years',
'40 - 44 Years', '45 - 49 Years', '50 - 54 Years', '55 - 59 Years',
'60 - 64 Years', '65 - 69 Years', '70 - 74 Years', '75 - 79 Years',
'80 - 84 Years', '85 Years & Over']
population = [172621, 175796, 180222, 205723, 237046, 250958, 217874, 224625, 227556, 247329,
267845,280396, 257548, 202905, 130055, 89536, 55232, 48669]
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)
ax1.pie(population, labels=age,autopct='%1.1f%%',
startangle=90)
plt.title(title)
ax1.axis('equal')
plt.show()
This should be the working code

Match and append inside a list

So, I am working on a project, and I have the following list :
a = ['2 co', '2 tr', '2 pi', '2 ca', '3 co', '3 ca', '3 pi', '6 tr', '6 pi', '8 tr', '7 ca', '7 pi']
I want to write code that checks whether the first character of each string is also the first character of another string, and if so adds those strings to a new list.
I know how to do this, but only for two strings at a time. Here, I want to select all the strings that start with the same character and group them into sublists of a given size. For example, for a size of 3, I want every possible combination of 3 strings from the original list that start with the same character.
Also, each set of strings should appear only once: I do not want the same strings repeated in a different order.
The expected result in that case (i.e. for combinations of 3 strings and with a = ['2 co', '2 tr', '2 pi', '2 ca', '3 co', '3 ca', '3 pi', '6 tr', '6 pi', '8 tr', '7 ca', '7 pi']) is:
['2 co, 2 tr, 2 pi', '2 co, 2 tr, 2 ca', '2 pi, 2 ca, 2 tr', '2 pi, 2 ca, 2 co', '3 co, 3 ca, 3 pi']
You see that here I don't have '2 tr, 2 co, 2 pi', because I already have '2 co, 2 tr, 2 pi'.
And when I want to group by sublists of 4, the expected output is:
['2 co, 2 tr, 2 pi, 2 ca']
I managed to do it, but only when grouping in pairs, and it gives all the combinations, including those with the same strings in a different order. Here it is:
a = ['2 co', '2 tr', '2 pi', '2 ca', '3 co', '3 ca', '3 pi', '6 tr', '6 pi', '8 tr', '7 ca', '7 pi']
result = []
for i in range(len(a)):
    for j in a[:i]+a[i+1:]:
        if a[i][0] == j[0]:
            result.append(j)
print(result)
Thanks for your help !
You can use itertools.groupby and itertools.combinations for that task:
import itertools as it
import operator as op
groups = it.groupby(sorted(a), key=op.itemgetter(0))
result = [', '.join(c) for g in groups for c in it.combinations(g[1], 3)]
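Note that plain sorted(a) also orders the items within each group alphabetically (for example '2 ca' before '2 co'). If the order of elements should depend only on the first character, add another key=op.itemgetter(0) to the sorted function; because the sort is stable, the original order within each group is then kept, and the outputs shown under Details correspond to that variant. If the data is already presorted such that "similar" items (with the same first character) are next to each other, you can drop the sorted altogether.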
Details
it.groupby puts the data into groups, based on their first character (due to key=op.itemgetter(0), which selects the first item, i.e. the first character, from each string). Expanding groups, it looks like this:
[('2', ['2 co', '2 tr', '2 pi', '2 ca']),
('3', ['3 co', '3 ca', '3 pi']),
('6', ['6 tr', '6 pi']),
('7', ['7 ca', '7 pi']),
('8', ['8 tr'])]
Then, for each of the groups, it.combinations(..., 3) computes all possible combinations of length 3, and the list comprehension joins each combination into a single string (for groups with fewer than 3 members no combinations are possible):
['2 co, 2 tr, 2 pi',
'2 co, 2 tr, 2 ca',
'2 co, 2 pi, 2 ca',
'2 tr, 2 pi, 2 ca',
'3 co, 3 ca, 3 pi']
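To make the group size configurable, the same idea can be wrapped in a function. This is only a sketch: group_combinations is a hypothetical name, it strips stray whitespace from each item as a precaution, and it sorts with key=op.itemgetter(0) so that the original order inside each group is preserved.

```python
import itertools as it
import operator as op

def group_combinations(items, size):
    """Join every `size`-combination of items that share a first character."""
    cleaned = [s.strip() for s in items]  # assumption: stray spaces are typos
    # Stable sort on the first character only, so in-group order is preserved.
    ordered = sorted(cleaned, key=op.itemgetter(0))
    groups = it.groupby(ordered, key=op.itemgetter(0))
    return [", ".join(combo)
            for _, members in groups
            for combo in it.combinations(members, size)]

a = ['2 co', '2 tr', '2 pi', '2 ca', '3 co', '3 ca', '3 pi',
     '6 tr', '6 pi', '8 tr', '7 ca', '7 pi']

threes = group_combinations(a, 3)
fours = group_combinations(a, 4)
print(threes)
print(fours)
```

Because it.combinations never yields the same set of elements in two different orders, the "no duplicates in a different order" requirement is met automatically. Grouping on the first character only would put '10 co' in the same group as '1 co'; splitting on the space and grouping on the first token would avoid that.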

How to convert weeks,months into days in python pandas?

The dataset is about hotels, and it has a column named "calender updated". Below are the values in this column. Can anybody help me convert the weeks and months into days?
array(['2 months ago', '12 months ago', 'yesterday', 'today',
'5 weeks ago', 'a week ago', '3 days ago', '3 months ago',
'4 months ago', '4 days ago', '2 weeks ago', '6 months ago',
'7 months ago', '1 week ago', '3 weeks ago', '5 months ago',
'2 days ago', '18 months ago', '11 months ago', '14 months ago',
'4 weeks ago', '6 weeks ago', '8 months ago', 'never',
'15 months ago', '6 days ago', '10 months ago', '7 weeks ago',
'5 days ago', '16 months ago', '9 months ago', '13 months ago',
'20 months ago', '19 months ago', '17 months ago', '21 months ago',
'22 months ago', '29 months ago', '25 months ago', '24 months ago',
'54 months ago', '27 months ago', '39 months ago', '30 months ago',
'37 months ago', '44 months ago', '23 months ago', '28 months ago',
'35 months ago', '47 months ago', '42 months ago', '40 months ago',
'43 months ago', '52 months ago', '50 months ago', '32 months ago',
'46 months ago', '34 months ago'], dtype=object)
Welcome to Stack Overflow. When you have a chance, take a look at the help pages to see how to format your question to increase the chances that someone will be able to help you.
Take a look at the dateparser module. I think this will do precisely what you're looking for.
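If an extra dependency is not wanted, the values shown in the question are regular enough to convert by hand. The sketch below does not use dateparser; it is a plain-Python alternative in which to_days and DAYS_PER_UNIT are hypothetical names, a month is assumed to be 30 days, and 'never' is mapped to None.

```python
# Days per unit; treating a month as 30 days is an assumption.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}

def to_days(text):
    """Convert strings such as '5 weeks ago' or 'yesterday' to a number of days."""
    text = text.strip().lower()
    if text == "today":
        return 0
    if text == "yesterday":
        return 1
    if text == "never":
        return None  # assumption: no meaningful number of days
    count, unit, _ago = text.split()
    number = 1 if count in ("a", "an") else int(count)
    return number * DAYS_PER_UNIT[unit.rstrip("s")]

sample = ['2 months ago', 'yesterday', 'today', '5 weeks ago',
          'a week ago', '3 days ago', 'never']
converted = [to_days(value) for value in sample]
print(converted)
```

On a dataframe, the function could presumably be applied to the column with something like df['calender updated'].map(to_days), where df stands for the hotel dataset.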
