Convert Dates to number of Days using Pandas [duplicate] - python

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 2 years ago.
I want to get the number of days corresponding to the date for each country. I have a dataset like so:
Date Country
01/03/2020 USA
02/03/2020 USA
03/03/2020 USA
07/04/2020 UK
08/04/2020 UK
09/04/2020 UK
I want to number the days for each country, starting from the first date on which that country appears. So something like this:
Date Country Day_Number
01/03/2020 USA 1
02/03/2020 USA 2
03/03/2020 USA 3
07/04/2020 UK 1
08/04/2020 UK 2
09/04/2020 UK 3
Any help is appreciated. Thanks in advance.

Use the following piece of code. It keeps a cumulative count of the rows within each group after the groupby operation (cumcount starts at 0, hence the +1).
df['Day_Number'] = df.groupby('Country').cumcount()+1
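For reference, here is a self-contained sketch of the cumcount approach on the question's sample data. Note that cumcount numbers rows in their existing order within each group, so the result equals the day number only when each country has one row per consecutive day and the rows are already in date order (true for the sample, an assumption for real data).

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": ["01/03/2020", "02/03/2020", "03/03/2020",
             "07/04/2020", "08/04/2020", "09/04/2020"],
    "Country": ["USA", "USA", "USA", "UK", "UK", "UK"],
})

# cumcount() numbers the rows of each group 0, 1, 2, ... in their current order
df["Day_Number"] = df.groupby("Country").cumcount() + 1
print(df)
```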

Not a complete copy-paste solution, but you can get the number of days since January 1, 1970 this way:
import datetime
# Days from January 1, 1970 to now
days = (datetime.datetime.utcnow() - datetime.datetime(1970,1,1)).days
# Or, for a specific date (year, month and day are integers you supply)
days = (datetime.datetime(year, month, day) - datetime.datetime(1970,1,1)).days
So you can convert your dates to numbers (days since Jan 1, 1970) and then:
keep track of the minimum per country
subtract the corresponding minimum from each entry (and add 1 so that the first day is day 1)
Hope this helps
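The steps listed above can also be sketched directly in pandas, subtracting each country's first date instead of going through days since 1970. This is a minimal sketch, not a definitive implementation: the dd/mm/yyyy date format is an assumption inferred from the expected output in the question, and the column names are the ones the question uses.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": ["01/03/2020", "02/03/2020", "03/03/2020",
             "07/04/2020", "08/04/2020", "09/04/2020"],
    "Country": ["USA", "USA", "USA", "UK", "UK", "UK"],
})

# Assumption: dates are day-first (dd/mm/yyyy)
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Minimum (first) date per country, broadcast back to every row
first_date = df.groupby("Country")["Date"].transform("min")

# Calendar days elapsed since the first date, counting the first day as 1
df["Day_Number"] = (df["Date"] - first_date).dt.days + 1
print(df)
```

Unlike a row counter, this keeps counting calendar days correctly when a country has gaps between dates or rows that are out of order.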

how to calculate percentage variation between two values in same one column in pandas dataframe? [duplicate]

This question already has an answer here:
python pandas groupby calculate change
(1 answer)
Closed 9 months ago.
I have this dataframe with the total population number by year.
import pandas as pd
cases_df = pd.DataFrame(data=cases_list, columns=['Year', 'Population', 'Nation'])
cases_df.head(7)
Year Population Nation
0 2019 328239523 United States
1 2018 327167439 United States
2 2017 325719178 United States
3 2016 323127515 United States
4 2015 321418821 United States
5 2014 318857056 United States
6 2013 316128839 United States
I want to calculate how much the population has increased from the year 2013 to 2019 by calculating the percentage change between two values (2013 and 2019):
{[(328239523 - 316128839)/ 316128839] x 100 }
How can I do this? Thank you very much!!
P.S. Any advice on how to remove the index column (0, 1, 2, 3, 4, 5, 6)?
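This can be done using the pandas method pct_change, which computes the fractional change between each row and the previous one.
Syntax:
df.pct_change()
In your case, Population is a column rather than an index level, so group by Nation, select Population, and sort by Year first so that each year is compared with the one before it:
df1 = cases_df.sort_values('Year').groupby('Nation')['Population'].pct_change() * 100
print(df1)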

how to calculate percentage variation between two values?

I tried the following:
df1 = df.groupby(level='Population').pct_change()
print(df1)
but I get an error saying that "Population" is not the name of an index level.
I would do it the following way:
import pandas as pd
df = pd.DataFrame({"year":[2015,2014,2013],"population":[321418821,318857056,316128839],"nation":["United States","United States","United States"]})
df = df.set_index("year")
df["percentage"] = df["population"] * 100 / df["population"][2013]
print(df)
output
population nation percentage
year
2015 321418821 United States 101.673363
2014 318857056 United States 100.863008
2013 316128839 United States 100.000000
Note that I used a subset of the data for brevity. Using year as the index allows easy access to the population value for 2013. The percentage is computed as (population) * 100 / (population for 2013), so subtracting 100 from it gives the percentage change relative to 2013.
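To remove the numbered index mentioned in the question:
cases_df.set_index('Year',inplace=True)
Year will now replace the numbered index.
You can then use cases_df.describe(), or cases_df.attribute_name.describe() for a single column (attribute_name stands for the column's name), to get summary statistics.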
This is more of a math question than a programming question.
Let's call this a percentage difference between two values, since population can vary both ways (increase or decrease over time).
Now, let's say that in 2013 we had 316128839 people and in 2019 we had 328239523 people:
a = 316128839
b = 328239523
Before we calculate the percentage, we need to find the difference between b and a:
diff = b - a
Now that we have that, we need to express diff as a percentage of a:
perc = (diff / a) * 100
And there is your percentage variation between a and b.
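The same arithmetic can be applied to the DataFrame itself by looking the two populations up by year. This is a minimal sketch: the frame is rebuilt here from the rows shown in the question, and the variable names pop, start and end are illustrative.

```python
import pandas as pd

# Rows copied from the question's cases_df.head(7) output
cases_df = pd.DataFrame({
    "Year": [2019, 2018, 2017, 2016, 2015, 2014, 2013],
    "Population": [328239523, 327167439, 325719178, 323127515,
                   321418821, 318857056, 316128839],
    "Nation": ["United States"] * 7,
})

# Index the populations by year so they can be looked up by label
pop = cases_df.set_index("Year")["Population"]

start, end = pop.loc[2013], pop.loc[2019]
perc = (end - start) / start * 100
print(f"Change from 2013 to 2019: {perc:.2f}%")
```

Setting Year as the index also addresses the side question about the 0 to 6 index column, since the years then label the rows.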

How to group by column and take average of column weighted by another column?

I am trying to carry out what I thought would be a typical groupby-and-average problem on a DataFrame, but it has turned out to be more complex than I anticipated, since it involves ordinal (string-like) years and float values. I am using Python. I will explain below.
I have a data frame showing different model years for different models of refrigerators across several counties in a state. I want to find the average model year of refrigerator for each county.
I have this example dataframe (abbreviated since the full dataframe would be far too long to show):
County_ID Type Year Population
--------------------------------------------
1 A 2022 54355
1 A 2021 54645
1 A 2020 14554
...
1 B 2022 23454
1 B 2021 34657
1 B 2020 12343
...
1 C 2022 23454
1 C 2021 34537
1 C 2020 23323
...
2 A 2022 54355
2 A 2021 54645
2 A 2020 14554
...
2 B 2022 23454
2 B 2021 34657
2 B 2020 12343
...
2 C 2022 23454
2 C 2021 34537
2 C 2020 23323
...
3 A 2022 54355
3 A 2021 54645
3 A 2020 14554
...
3 B 2022 23454
3 B 2021 34657
3 B 2020 12343
...
3 C 2022 23454
3 C 2021 34537
3 C 2020 23323
...
I kept this abbreviated for space. The full data covers 50 counties, with county IDs running from 1 to 50. In this example there are 3 types of refrigerator, and for each type the Year column gives the model year (vintage), i.e. how old the refrigerator is. Population gives the number of physical units for each unique pair of type and year in each county. For each County ID, I want the average year.
So I want to produce the following DataFrame:
County_ID Average_vintage
--------------------------------
1 XXXX.XX
2 XXXX.XX
3 XXXX.XX
4 XXXX.XX
5 XXXX.XX
6 XXXX.XX
...
Here is what confuses me: I want the average year, but year is ordinal data rather than a float. I think what I want is to weight by population. The average vintage should be the average of the years, with vintages that have a larger population having more influence on that average. So I want to weight the vintages by population and treat the years as floats, so that the result can have a decimal part, for example an average refrigerator vintage of 2015.48 for County 22. I am trying this:
avg_vintage = df.groupby(['County_ID']).mean()
but I don't think this makes much sense, since it does not account for how many (population) of each refrigerator there actually are in each county. How can I find the average year/vintage for each county, taking into account how many of each refrigerator (population) are found in that county, using Python?
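One possible sketch of the population-weighted average described above: for each county, sum Year * Population and divide by the total Population. The column names come from the question. The frame below uses only a few of the question's rows (County 1 type A and County 2 type B) for illustration, and the pd.to_numeric step assumes Year might be stored as strings.

```python
import pandas as pd

# A small subset of the rows shown in the question, for illustration only
df = pd.DataFrame({
    "County_ID": [1, 1, 1, 2, 2, 2],
    "Type": ["A", "A", "A", "B", "B", "B"],
    "Year": [2022, 2021, 2020, 2022, 2021, 2020],
    "Population": [54355, 54645, 14554, 23454, 34657, 12343],
})

# Assumption: Year may arrive as strings, so make sure it is numeric
df["Year"] = pd.to_numeric(df["Year"])

# Numerator: sum of year * units per county
weighted_sum = (df["Year"] * df["Population"]).groupby(df["County_ID"]).sum()
# Denominator: total units per county
total_units = df.groupby("County_ID")["Population"].sum()

avg_vintage = (weighted_sum / total_units).rename("Average_vintage").reset_index()
print(avg_vintage)
```

Because every (type, year) row contributes in proportion to its Population, all refrigerator types in a county are pooled into a single average.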

Sorting values in pandas series [duplicate]

This question already has answers here:
changing sort in value_counts
(4 answers)
Closed 3 years ago.
I have a movies dataframe that looks like this...
title decade
movie name 1 2000
movie name 2 1990
movie name 3 1990
movie name 4 2000
movie name 5 2010
movie name 6 1980
movie name 7 1980
I want to plot number of movies per decade which I am doing this way
freq = movies['decade'].value_counts()
#freq returns me following
2000 56
1980 41
1990 37
1970 21
2010 9
# as you can see, value_counts() returns a Series sorted by frequency
freq = movies['decade'].value_counts(sort=False)
# now the frequencies are not sorted; I want the distribution in order of decade,
# not of frequency, so I do something like this...
movies = movies.sort_values(by='decade', ascending=True)
freq = movies['decade'].value_counts(sort=False)
Now the Series freq should be sorted by decade, but it is not, even though movies is sorted.
Can someone tell me what I am doing wrong? Thanks.
The expected output I am looking for is something like this...
1970 21
1980 41
1990 37
2000 56
2010 9
movies['decade'].value_counts()
returns a Series with the decade as the index, sorted in descending order by count. To sort by decade instead, just append .sort_index():
movies['decade'].value_counts().sort_index()
or, for descending decades,
movies['decade'].value_counts().sort_index(ascending=False)
That should do the trick.
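A runnable sketch of this on the seven sample rows from the question (the full dataset behind the counts 56, 41, 37, ... is not available, so the counts here differ):

```python
import pandas as pd

# The seven sample rows shown in the question
movies = pd.DataFrame({
    "title": [f"movie name {i}" for i in range(1, 8)],
    "decade": [2000, 1990, 1990, 2000, 2010, 1980, 1980],
})

# value_counts() puts the decades in the index; sort_index() orders by decade
freq = movies["decade"].value_counts().sort_index()
print(freq)
```

Since the decades live in the index of the result, sorting the movies DataFrame beforehand is not needed.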

Conditional copy of values from one column to another columns

I have a pandas dataframe that looks something like this:
name job jobchange_rank date
Thisguy Developer 1 2012
Thisguy Analyst 2 2014
Thisguy Data Scientist 3 2015
Anotherguy Developer 1 2018
The jobchange_rank represents each individual's (based on name) ranked change in position, where rank 1 represents his/her first position, rank 2 his/her second position, etc.
Now for the fun part. I want to create a new column where I can see a person's previous job, something like this:
name job jobchange_rank date previous_job
Thisguy Developer 1 2012 None
Thisguy Analyst 2 2014 Developer
Thisguy Data Scientist 3 2015 Analyst
Anotherguy Developer 1 2018 None
I've created the following code to get the "None" values where there was no job change:
df.loc[df['jobchange_rank'].sub(df['jobchange_rank'].min()) == 0, 'previous_job'] = 'None'
Sadly, I can't figure out how to get the values from the other column where the needed condition applies.
Any help is more than welcome!
Thanks in advance.
This answer assumes that your DataFrame is sorted by name and jobchange_rank. If that is not the case, sort first.
# df = df.sort_values(['name', 'jobchange_rank'])
m = df['name'].eq(df['name'].shift())
df['job'].shift().where(m)
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Or use a groupby + shift (assuming the data is at least sorted by jobchange_rank):
df.groupby('name')['job'].shift()
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Although the groupby + shift is more concise, on larger inputs, if your data is already sorted like your example, it may be faster to avoid the groupby and use the first solution.
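To put the pieces together, here is a self-contained sketch that rebuilds the question's sample frame and assigns the result to the previous_job column. Rows with no earlier job get NaN rather than the string 'None' used in the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["Thisguy", "Thisguy", "Thisguy", "Anotherguy"],
    "job": ["Developer", "Analyst", "Data Scientist", "Developer"],
    "jobchange_rank": [1, 2, 3, 1],
    "date": [2012, 2014, 2015, 2018],
})

# Sort so each person's jobs are in rank order (original index labels are kept)
df = df.sort_values(["name", "jobchange_rank"])

# Within each person, shift job down one row to get the previous job
df["previous_job"] = df.groupby("name")["job"].shift()

# Equivalent result without groupby, valid because the frame is sorted by name
same_person = df["name"].eq(df["name"].shift())
previous_job_alt = df["job"].shift().where(same_person)

print(df.sort_index())
```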
