Check how many IDs dropped out of multi-index dataframe over time - python

I have the following multi-index data frame, with ID and Year being part of the index. The Solvency column is based on whether or not there are NaNs in both Profit/Loss and Total Sales for that year.
ID Year Profit/Loss Total Sales Solvency
0 2008 300. 2000. 1
0 2009 NaN NaN 0
0 2010 500. 2000. 1
1 2008 300. 2000. 1
1 2009 NaN NaN 0
1 2010 NaN NaN 0
However, it is the case that sometimes a company has NaNs in one year, but not in the one after, so it is in fact not insolvent and did not disappear from the data set. For my analysis I need to know how many companies drop out over the time period. I am guessing that I need a function with groupby that checks if a 0 appears in the Solvency column and then checks if there ever is a 1 again in the next years for that specific company. The final output should tell how many companies dropped out in every year.
Year Count Dropouts
2008 0
2009 1
2010 1
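One possible reading of the expected output is that a company counts as dropped out in a given year if it has no solvent year from that year onward. A minimal sketch under that assumption, rebuilding only the Solvency column from the example (the names `alive_later` and `dropouts` are illustrative, not from the question):

```python
import pandas as pd

# Rebuild the example with ID and Year in the index.
df = pd.DataFrame(
    {
        "ID": [0, 0, 0, 1, 1, 1],
        "Year": [2008, 2009, 2010, 2008, 2009, 2010],
        "Solvency": [1, 0, 1, 1, 0, 0],
    }
).set_index(["ID", "Year"]).sort_index()

# Walk each company's years backwards: the running maximum stays 1 as long as
# a solvent year exists in that year or in any later one.
alive_later = df["Solvency"].iloc[::-1].groupby(level="ID").cummax().sort_index()

# Assumption: a company is "dropped out" in a year once no solvent year follows.
dropouts = alive_later.eq(0).groupby(level="Year").sum()
print(dropouts)
```

With this definition, a company with a gap year (ID 0 in 2009) is not counted, because a later solvent year exists.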

Related

How to group by column and take average of column weighted by another column?

I am trying to carry out what I thought would be a typical groupby and average problem on a DataFrame, but this has gotten a bit more complex than I had anticipated since the problem will deal with string/ordinal years and float values. I am using python. I will explain below.
I have a data frame showing different model years for different models of refrigerators across several counties in a state. I want to find the average model year of refrigerator for each county.
I have this example dataframe (abbreviated since the full dataframe would be far too long to show):
County_ID Type Year Population
--------------------------------------------
1 A 2022 54355
1 A 2021 54645
1 A 2020 14554
...
1 B 2022 23454
1 B 2021 34657
1 B 2020 12343
...
1 C 2022 23454
1 C 2021 34537
1 C 2020 23323
...
2 A 2022 54355
2 A 2021 54645
2 A 2020 14554
...
2 B 2022 23454
2 B 2021 34657
2 B 2020 12343
...
2 C 2022 23454
2 C 2021 34537
2 C 2020 23323
...
3 A 2022 54355
3 A 2021 54645
3 A 2020 14554
...
3 B 2022 23454
3 B 2021 34657
3 B 2020 12343
...
3 C 2022 23454
3 C 2021 34537
3 C 2020 23323
...
I kept this abbreviated for space, but the idea is that my data covers 50 counties, with county IDs running from 1 to 50. In this example, there are 3 types of refrigerators. For each type, the model year vintages are shown, i.e. how old the refrigerator is. Population shows how many physical units of each unique (type, year) pair are found in each county. For each County ID, I want the average year.
And so I want to produce the following DataFrame:
County_ID Average_vintage
--------------------------------
1 XXXX.XX
2 XXXX.XX
3 XXXX.XX
4 XXXX.XX
5 XXXX.XX
6 XXXX.XX
...
Here is what confuses me: I want the average year, but year is ordinal data rather than float. I think I want to weight by population: to find the average vintage of refrigerators, a vintage with a higher population should have more influence on the average. So I want to weight the vintages by population and treat the years as floats, so that the result could say, for example, that the average refrigerator vintage for County 22 is 2015.48. I am trying this:
avg_vintage = df.groupby(['County_ID']).mean()
but I don't think this is really going to make much sense, since I need to account for how many (population) of each refrigerator there actually are in each county. How can I find the average year/vintage for each County, considering how many of each refrigerator (population) are found in each County using python?
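A population-weighted mean is the sum of year times population divided by the total population, computed per county. A minimal sketch of that idea, using small made-up numbers rather than the data above, and assuming the years may arrive as strings:

```python
import pandas as pd

# Illustrative numbers only, not the data from the question.
df = pd.DataFrame(
    {
        "County_ID": [1, 1, 2, 2],
        "Type": ["A", "B", "A", "B"],
        "Year": ["2020", "2022", "2020", "2021"],
        "Population": [1, 3, 2, 2],
    }
)

# Treat the years as numbers so they can be averaged.
df["Year"] = pd.to_numeric(df["Year"])

# Weighted mean per county: sum(year * population) / sum(population).
weighted_years = (df["Year"] * df["Population"]).groupby(df["County_ID"]).sum()
total_population = df.groupby("County_ID")["Population"].sum()
avg_vintage = (
    (weighted_years / total_population).rename("Average_vintage").reset_index()
)
print(avg_vintage)
```

Here County 1 works out to (2020 * 1 + 2022 * 3) / 4 = 2021.5, so the more numerous 2022 units pull the average toward 2022.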

Transfer column of dates into cumulative counts for each zipcode

My dataframe contains houses (IDs) located in zipcodes that purchased a product on a certain date. I would like to add a column to my dataframe that, for every ID, adds up the number of purchases in the zipcode up until that point, minus 3 months. So for a row that contains a purchase on December 30th, I would add up all the purchases in that zip code up until September 30th.
I already converted the purchase date column to a datetime format.
My dataset looks like this below. As you can see, between row 2 and 3, there is a period of almost 2 years where nothing happened in that area.
Id: Zipcode: Purchase_Date:
1 9999 2017-August-24
2 9999 2017-December-30
3 9999 2019-July-14
4 2000 2017-March-11
5 2000 2018-May-14
etc.
Ideally, the end result would look like this:
Id: Zipcode: Purchase_Date: Cumulative_purchases:
1 9999 2017-August-24 0
2 9999 2017-December-30 1
3 9999 2019-July-14 1
4 2000 2017-March-11 0
5 2000 2017-May-14 0
etc.
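One way to read the rule is: for each row, count the purchases in the same zipcode dated on or before the row's purchase date minus 3 months. A minimal sketch under that reading, using the input rows as listed (the helper name `count_before_cutoff` is hypothetical). Note that under this literal reading row 3 comes out as 2 rather than the 1 shown in the table above, so the intended rule may differ.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4, 5],
        "Zipcode": [9999, 9999, 9999, 2000, 2000],
        "Purchase_Date": pd.to_datetime(
            ["2017-08-24", "2017-12-30", "2019-07-14", "2017-03-11", "2018-05-14"]
        ),
    }
)


def count_before_cutoff(dates: pd.Series) -> pd.Series:
    """For each date, count the dates in the group that fall on or before it minus 3 months."""
    sorted_dates = np.sort(dates.to_numpy())
    cutoffs = (dates - pd.DateOffset(months=3)).to_numpy()
    # side="right" counts every date <= cutoff.
    counts = np.searchsorted(sorted_dates, cutoffs, side="right")
    return pd.Series(counts, index=dates.index)


df["Cumulative_purchases"] = df.groupby("Zipcode")["Purchase_Date"].transform(
    count_before_cutoff
)
print(df)
```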

Sorting values in pandas series [duplicate]

I have a movies dataframe that looks like this...
title decade
movie name 1 2000
movie name 2 1990
movie name 3 1990
movie name 4 2000
movie name 5 2010
movie name 6 1980
movie name 7 1980
I want to plot number of movies per decade which I am doing this way
freq = movies['decade'].value_counts()
#freq returns me following
2000 56
1980 41
1990 37
1970 21
2010 9
# as you can see the value_counts() method returns a series sorted by the frequencies
freq = movies['decade'].value_counts(sort=False)
# now the frequencies are not sorted, because I want the distribution to be in sequence of decade year
# and not its frequency so I do something like this...
movies = movies.sort_values(by='decade', ascending=True)
freq = movies['decade'].value_counts(sort=False)
Now the Series freq should be sorted with respect to decade, but it is not, although movies is sorted.
Can someone tell me what I am doing wrong? Thanks.
The expected output I am looking for is something like this...
1970 21
1980 41
1990 37
2000 56
2010 9
movies['decade'].value_counts()
returns a series with the decade as index, sorted in descending order by count. To sort by decade instead, append sort_index():
movies['decade'].value_counts().sort_index()
or
movies['decade'].value_counts().sort_index(ascending=False)
should do the trick.
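As a quick check, here is a small runnable sketch of the sort_index() approach using the seven sample rows from the question (the full dataset with its larger counts is not available, so the counts here differ):

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "title": [f"movie name {i}" for i in range(1, 8)],
        "decade": [2000, 1990, 1990, 2000, 2010, 1980, 1980],
    }
)

# value_counts() puts the decades in the index, so sort_index() orders by decade.
freq = movies["decade"].value_counts().sort_index()
print(freq)
```

Sorting movies beforehand is not needed, since the ordering is applied to the counts themselves.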

Conditional copy of values from one column to another columns

I have a pandas dataframe that looks something like this:
name job jobchange_rank date
Thisguy Developer 1 2012
Thisguy Analyst 2 2014
Thisguy Data Scientist 3 2015
Anotherguy Developer 1 2018
The jobchange_rank represents each individual's (based on name) ranked change in position, where rank 1 represents his/her first position, rank 2 his/her second position, etc.
Now for the fun part. I want to create a new column where I can see a person's previous job, something like this:
name job jobchange_rank date previous_job
Thisguy Developer 1 2012 None
Thisguy Analyst 2 2014 Developer
Thisguy Data Scientist 3 2015 Analyst
Anotherguy Developer 1 2018 None
I've created the following code to get the "None" values where there was no job change:
df.loc[df['jobchange_rank'].sub(df['jobchange_rank'].min()) == 0, 'previous_job'] = 'None'
Sadly, I can't seem to figure out how to get the values from the other column where the needed condition applies.
Any help is more than welcome!
Thanks in advance.
This answer assumes that your DataFrame is sorted by name and jobchange_rank; if that is not the case, sort first.
# df = df.sort_values(['name', 'jobchange_rank'])
m = df['name'].eq(df['name'].shift())
df['job'].shift().where(m)
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Or using a groupby + shift (assuming at least sorted by jobchange_rank)
df.groupby('name')['job'].shift()
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Although the groupby + shift is more concise, on larger inputs, if your data is already sorted like your example, it may be faster to avoid the groupby and use the first solution.
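For completeness, a self-contained sketch that rebuilds the example frame and assigns the groupby + shift result to the new column (the explicit sort is included as a precaution, in case the data is not already ordered):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Thisguy", "Thisguy", "Thisguy", "Anotherguy"],
        "job": ["Developer", "Analyst", "Data Scientist", "Developer"],
        "jobchange_rank": [1, 2, 3, 1],
        "date": [2012, 2014, 2015, 2018],
    }
)

# Order each person's jobs by rank, then take the previous row within each name.
df = df.sort_values(["name", "jobchange_rank"])
df["previous_job"] = df.groupby("name")["job"].shift()
print(df)
```

Rows with no earlier job hold a missing value rather than the string 'None', which keeps the column easy to filter with isna().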

Comparing rows of pandas dataframe and find intersection?

I have a df :
year name_list
2009 [sam,maj,mak]
2010 [sam, mak, ali, mo, za]
2011 [mp,ki]
I would like to compare each row in terms of name_list and count how many new names are added/deleted each year.
Expected results:
year name_list added_count removed_count
2009 [sam,maj,mak] 0 0
2010 [sam, mak, ali, mo, za] 3 1
2011 [mp,ki] 2 5
Can anybody help?
The first two lines initialize the 2009 values to zero. This assumes that the years are in chronological order, consecutive (the loop looks up year i-1), and in the index rather than a separate column. It also assumes no duplicate values for the names in column 'name_list'.
df.loc[2009,'added_count'] = 0
df.loc[2009,'removed_count'] = 0
for i in df.index[1:]:
    df.loc[i,'added_count'] = len(list(set(df.loc[i,'name_list'])-set(df.loc[i-1,'name_list'])))
    df.loc[i,'removed_count'] = len(list(set(df.loc[i-1,'name_list'])-set(df.loc[i,'name_list'])))
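If year is a regular column, as in the question, or the years are not consecutive, the same set differences can be computed by pairing each row with the one before it. A minimal sketch of that variant, assuming the rows are already in chronological order (`sets` and `prev_sets` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "year": [2009, 2010, 2011],
        "name_list": [
            ["sam", "maj", "mak"],
            ["sam", "mak", "ali", "mo", "za"],
            ["mp", "ki"],
        ],
    }
)

# One set per row; the first row is compared with itself so its counts are 0.
sets = [set(names) for names in df["name_list"]]
prev_sets = [sets[0]] + sets[:-1]

df["added_count"] = [len(cur - prev) for cur, prev in zip(sets, prev_sets)]
df["removed_count"] = [len(prev - cur) for cur, prev in zip(sets, prev_sets)]
print(df)
```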
