This question already has answers here: changing sort in value_counts (4 answers). Closed 3 years ago.
I have a movies dataframe that looks like this...
title decade
movie name 1 2000
movie name 2 1990
movie name 3 1990
movie name 4 2000
movie name 5 2010
movie name 6 1980
movie name 7 1980
I want to plot the number of movies per decade, which I am doing this way:
freq = movies['decade'].value_counts()
# freq returns the following
2000 56
1980 41
1990 37
1970 21
2010 9
# as you can see the value_counts() method returns a series sorted by the frequencies
freq = movies['decade'].value_counts(sort=False)
# now the frequencies are not sorted, but I want the distribution in order of decade
# rather than frequency, so I do something like this...
movies = movies.sort_values(by='decade', ascending=True)
freq = movies['decade'].value_counts(sort=False)
Now the Series freq should be sorted by decade, but it is not, even though movies is sorted.
Can someone tell me what I am doing wrong? Thanks.
The expected output I am looking for is something like this...
1970 21
1980 41
1990 37
2000 56
2010 9
movies['decade'].value_counts()
returns a Series with the decade as the index, sorted by count in descending order. To sort by decade instead, append .sort_index():
movies['decade'].value_counts().sort_index()
or, for decades in descending order,
movies['decade'].value_counts().sort_index(ascending=False)
should do the trick.
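As a small runnable illustration of the answer above, here is a sketch using the seven sample rows from the question (the DataFrame is rebuilt by hand, so the counts are much smaller than in the real data):

```python
import pandas as pd

# Sample rows copied from the question; the real data has many more rows.
movies = pd.DataFrame({
    "title": [f"movie name {i}" for i in range(1, 8)],
    "decade": [2000, 1990, 1990, 2000, 2010, 1980, 1980],
})

# value_counts() orders by frequency; sort_index() reorders by decade.
freq = movies["decade"].value_counts().sort_index()
print(freq)
```

Sorting the result by its index works regardless of the row order of movies, so the earlier sort_values call is not needed.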
Related
I have the following multi-index data frame, with ID and Year being part of the index. The Solvency column is based on whether or not there are NaNs in both Profit/Loss and Total Sales for that year.
ID Year Profit/Loss Total Sales Solvency
0 2008 300. 2000. 1
0 2009 NaN NaN 0
0 2010 500. 2000. 1
1 2008 300. 2000. 1
1 2009 NaN NaN 0
1 2010 NaN NaN 0
However, it is the case that sometimes a company has NaNs in one year, but not in the one after, so it is in fact not insolvent and did not disappear from the data set. For my analysis I need to know how many companies drop out over the time period. I am guessing that I need a function with groupby that checks if a 0 appears in the Solvency column and then checks if there ever is a 1 again in the next years for that specific company. The final output should tell how many companies dropped out in every year.
Year Count Dropouts
2008 0
2009 1
2010 1
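One possible way to express the check described above, as a sketch only: a company counts as dropped out in a given year if its Solvency is 0 in that year and in every later year. The sample frame is rebuilt from the question, and the names alive_later, dropped and dropouts are made up for illustration.

```python
import pandas as pd

# Sample data from the question, indexed by ID and Year.
df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 1],
    "Year": [2008, 2009, 2010, 2008, 2009, 2010],
    "Solvency": [1, 0, 1, 1, 0, 0],
}).set_index(["ID", "Year"]).sort_index()

# Walk each company's years backwards: the running maximum is 1 as long as
# the company is solvent in this year or in any later year.
alive_later = df["Solvency"].iloc[::-1].groupby(level="ID").cummax().iloc[::-1]

# A company has dropped out once it never shows Solvency == 1 again.
dropped = alive_later.eq(0)

# Number of dropped-out companies per year.
dropouts = dropped.groupby(level="Year").sum()
print(dropouts)
```

With the sample data this reproduces the counts in the expected output. The count is cumulative: company 1 is counted in 2009 and again in 2010.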
I am trying to carry out what I thought would be a typical groupby-and-average problem on a DataFrame, but it has turned out to be more complex than I anticipated, because it involves string/ordinal years and float values. I am using Python. I will explain below.
I have a data frame showing different model years for different models of refrigerators across several counties in a state. I want to find the average model year of refrigerator for each county.
I have this example dataframe (abbreviated since the full dataframe would be far too long to show):
County_ID Type Year Population
--------------------------------------------
1 A 2022 54355
1 A 2021 54645
1 A 2020 14554
...
1 B 2022 23454
1 B 2021 34657
1 B 2020 12343
...
1 C 2022 23454
1 C 2021 34537
1 C 2020 23323
...
2 A 2022 54355
2 A 2021 54645
2 A 2020 14554
...
2 B 2022 23454
2 B 2021 34657
2 B 2020 12343
...
2 C 2022 23454
2 C 2021 34537
2 C 2020 23323
...
3 A 2022 54355
3 A 2021 54645
3 A 2020 14554
...
3 B 2022 23454
3 B 2021 34657
3 B 2020 12343
...
3 C 2022 23454
3 C 2021 34537
3 C 2020 23323
...
I have abbreviated the data for space. The full data set covers 50 counties, with county IDs running from 1 to 50. In this example there are 3 types of refrigerator, and for each type the Year column gives the model year (vintage), i.e. how old the refrigerator is. Population is the number of physical units for each unique pair of type and year in each county. For each County ID, I want the average year.
And so I want to produce the following DataFrame:
County_ID Average_vintage
--------------------------------
1 XXXX.XX
2 XXXX.XX
3 XXXX.XX
4 XXXX.XX
5 XXXX.XX
6 XXXX.XX
...
What confuses me is that year is ordinal data rather than a float, so I am not sure how to average it. I think I need to weight by population: a vintage with more units should have more influence on the average. So I want to treat the years as floats and weight them by population, so that the result could be, for example, an average refrigerator vintage of 2015.48 for County 22. I am trying this:
avg_vintage = df.groupby(['County_ID']).mean()
but I don't think this makes much sense, since I need to account for how many units (population) of each refrigerator there are in each county. How can I find the average year/vintage for each county, weighted by the population of each refrigerator, using Python?
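A population-weighted mean of the kind described above can be sketched as sum(Year × Population) / sum(Population) per county. The numbers below are invented for illustration and are not the question's data; the astype(int) step is an assumption that covers the case where Year is stored as a string.

```python
import pandas as pd

# Illustrative data only (not the values from the question).
df = pd.DataFrame({
    "County_ID": [1, 1, 2, 2],
    "Type": ["A", "B", "A", "B"],
    "Year": [2020, 2022, 2020, 2021],
    "Population": [1, 3, 3, 1],
})

# Treat the year as a number so it can be averaged.
years = df["Year"].astype(int)

# Per county: sum of year * units, divided by total units.
weighted_sum = (years * df["Population"]).groupby(df["County_ID"]).sum()
total_units = df.groupby("County_ID")["Population"].sum()

avg = (weighted_sum / total_units).rename("Average_vintage").reset_index()
print(avg)
```

A plain groupby(...).mean() would give every row equal weight, which is why the units are brought in explicitly here.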
I am having some difficulties with the following data (from a pandas dataframe):
Text
0 Selected moments from Fifa game t...
1 What I learned is that I am ...
3 Bill Gates kept telling us it was comi...
5 scenario created a month before the...
... ...
1899 Events for May 19 – October 7 - October CTOvision.com
1900 Office of Event Services and Campus Center Ope...
1901 How the CARES Act May Affect Gift Planning in ...
1902 City of Rohnert Park: Home
1903 iHeartMedia, Inc.
I would need to extract the count of unique words per row (after removing punctuation). So, for example:
Unique
0 6
1 6
3 8
5 6
... ...
1899 8
1900 8
1901 9
1902 5
1903 2
I tried to do it as follows:
df["Unique"]=df['Text'].str.lower()
df["Unique"]==Counter(word_tokenize('\n'.join( file["Unique"])))
but I have not got any count, only a list of words (without their frequency in that row).
Can you please tell me what is wrong?
First remove all punctuation if you don't want it counted, then leverage sets. str.split().map(set) turns each row into a set of its words, and because a set keeps only one copy of each element, its length is the number of unique words.
Chained
df['Text'].str.replace(r'[^\w\s]+', '', regex=True).str.split().map(set).str.len()
Stepwise
df['Text'] = df['Text'].str.replace(r'[^\w\s]+', '', regex=True)
df['New Text'] = df['Text'].str.split().map(set).str.len()
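The set-based approach above can be tried end to end with a small sketch. The two sample sentences are made up apart from "iHeartMedia, Inc.", and the lowercasing step is taken from the question's own attempt rather than from this answer.

```python
import pandas as pd

# Two short sample rows; the second one appears in the question's data.
df = pd.DataFrame({"Text": ["Bill, Gates started Microsoft. Bill!",
                            "iHeartMedia, Inc."]})

# Lowercase, then strip everything that is not a word character or whitespace.
cleaned = df["Text"].str.lower().str.replace(r"[^\w\s]+", "", regex=True)

# Split on whitespace and count the distinct words in each row.
df["Unique"] = cleaned.str.split().map(lambda words: len(set(words)))
print(df)
```

The repeated "Bill" in the first row is counted once, which is the difference between a unique-word count and a plain word count.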
So, I'm just updating this as per the comments. This solution strips punctuation as well, and counts each distinct word once.
import string
df['Unique'] = df['Text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)).strip()).str.split().apply(lambda words: len(set(words)))
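Try this:
from collections import Counter
import pandas as pd
# sample data; named `data` so that the built-in dict is not shadowed
data = {'A': {0: 'John', 1: 'Bob'},
        'Desc': {0: 'Bill ,Gates Started Microsoft at 18 Bill', 1: 'Bill Gates, Again .Bill Gates and Larry Ellison'}}
df = pd.DataFrame(data)
# strip punctuation (regex=True is required in pandas 2.0 and later)
df['Desc'] = df['Desc'].str.replace(r'[^\w\s]+', '', regex=True)
print(df.loc[:, "Desc"])
# word frequencies for the first row, then the number of unique words in it
print(Counter(" ".join(df.loc[0:0, "Desc"]).split(" ")).items())
print(len(Counter(" ".join(df.loc[0:0, "Desc"]).split(" ")).items()))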
This question already has answers here: Add a sequential counter column on groups to a pandas dataframe (4 answers). Closed 2 years ago.
I want to get the number of days corresponding to the date for each country. I have a dataset like so:
Date Country
01/03/2020 USA
02/03/2020 USA
03/03/2020 USA
07/04/2020 UK
08/04/2020 UK
09/04/2020 UK
And I want to get the day numbers counted from the first date on which each country is mentioned. So something like this:
Date Country Day_Number
01/03/2020 USA 1
02/03/2020 USA 2
03/03/2020 USA 3
07/04/2020 UK 1
08/04/2020 UK 2
09/04/2020 UK 3
Any help is appreciated. Thanks in advance.
Use the following code. cumcount numbers the rows within each group starting from 0, so adding 1 gives a running count per country.
df['Day_Number'] = df.groupby('Country').cumcount()+1
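Here is a minimal sketch of the cumcount approach on the question's sample data. Note the assumption it relies on: the rows are already in date order and there is exactly one row per day for each country, because cumcount counts rows, not calendar days.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Date": ["01/03/2020", "02/03/2020", "03/03/2020",
             "07/04/2020", "08/04/2020", "09/04/2020"],
    "Country": ["USA", "USA", "USA", "UK", "UK", "UK"],
})

# Row position within each country, shifted to start at 1.
df["Day_Number"] = df.groupby("Country").cumcount() + 1
print(df)
```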
Not a complete copy-paste solution but:
You can get the number of days since January 1 1970 this way:
import datetime
days = (datetime.datetime.utcnow() - datetime.datetime(1970,1,1)).days
# Or, for a specific date (year, month and day stand for your own values)
days = (datetime.datetime(year, month, day) - datetime.datetime(1970,1,1)).days
So you can convert your dates to numbers (days since Jan 1 1970) and then:
keep track of the minimum per country
subtract the corresponding minimum from each entry
Hope this helps
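The same idea (subtract each country's earliest date from every entry) can be sketched directly in pandas without converting to days since 1970. This assumes the dates are written day first (DD/MM/YYYY), which the question does not state explicitly; the third USA date is changed here to leave a gap, to show that calendar days rather than rows are counted.

```python
import pandas as pd

# Sample data modelled on the question, with a gap in the USA dates.
df = pd.DataFrame({
    "Date": ["01/03/2020", "02/03/2020", "05/03/2020",
             "07/04/2020", "08/04/2020"],
    "Country": ["USA", "USA", "USA", "UK", "UK"],
})

# Assumption: day-first date strings.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Earliest date per country, aligned to every row.
first_date = df.groupby("Country")["Date"].transform("min")

# Calendar days since that first date, starting at 1.
df["Day_Number"] = (df["Date"] - first_date).dt.days + 1
print(df)
```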
I have a df :
year name_list
2009 [sam,maj,mak]
2010 [sam, mak, ali, mo, za]
2011 [mp,ki]
I would like to compare each row in terms of name_list and count how many new names are added/deleted each year.
Expected results:
year name_list added_count removed_count
2009 [sam,maj,mak] 0 0
2010 [sam, mak, ali, mo, za] 3 1
2011 [mp,ki] 2 5
Can anybody help?
The first two lines initialize the 2009 counts to zero. This assumes that the years are consecutive and in chronological order, that they are in the index rather than in a separate column, and that 'name_list' contains no duplicate names.
df.loc[2009,'added_count'] = 0
df.loc[2009,'removed_count'] = 0
for i in df.index[1:]:
    # names present this year but not the year before
    df.loc[i,'added_count'] = len(set(df.loc[i,'name_list']) - set(df.loc[i-1,'name_list']))
    # names present the year before but not this year
    df.loc[i,'removed_count'] = len(set(df.loc[i-1,'name_list']) - set(df.loc[i,'name_list']))
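A variant of the same set-difference idea, sketched for the case where year is an ordinary column rather than the index. It still assumes the rows are in chronological order; the helper names sets and previous are made up for illustration.

```python
import pandas as pd

# Sample data from the question, with year as a regular column.
df = pd.DataFrame({
    "year": [2009, 2010, 2011],
    "name_list": [["sam", "maj", "mak"],
                  ["sam", "mak", "ali", "mo", "za"],
                  ["mp", "ki"]],
})

# One set of names per row.
sets = [set(names) for names in df["name_list"]]

# Pair each row with the row before it; the first row is paired with itself,
# so both of its counts come out as 0.
previous = [sets[0]] + sets[:-1]

df["added_count"] = [len(cur - prev) for cur, prev in zip(sets, previous)]
df["removed_count"] = [len(prev - cur) for cur, prev in zip(sets, previous)]
print(df)
```

Because the comparison is positional, this does not rely on the years being consecutive integers, unlike the i-1 lookup in the loop.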