I am very new to Python (and to Stack Overflow!) so hopefully this makes sense!
I have a dataframe which contains years and names (amongst other things; however, this is all I am interested in working with).
I have done df = df.groupby(['year', 'name']).size() to get the number of times each name appears in each year.
It returns something similar to this:
year  name
2001  nameone    2
2001  nametwo    3
2002  nameone    1
2002  nametwo    5
What I'm trying to do is put the size data into a new column called 'count'.
(Eventually I intend to plot this on graphs.)
Any help would be greatly appreciated!
Here is the raw code (I have condensed it a bit for convenience):
hso_df = pd.read_csv('HibernationSurveyObservationsCleaned.csv')
hso_df = hso_df[["startDate", "endDate", "commonName"]]
year_df = hso_df
year_df['startDate'] = pd.to_datetime(hso_df['startDate'])
year_df['year'] = year_df['startDate'].dt.year
year_df = year_df[["year", "commonName"]].sort_values('year')
year_df = year_df.groupby(['year', 'commonName']).size()
Here is an image of the first 3 rows of the data, displayed with .head().
The only columns that are of interest from this data are commonName and year (which I have taken from startDate).
IIUC you want transform to add the result of the groupby with its index aligned to the original df:
df['count'] = df.groupby(['year', 'name']).transform('size')
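For instance, a minimal sketch on made-up data shaped like yours (column names taken from the question; selecting a column first keeps the transform result a Series aligned with the original index):
import pandas as pd

df = pd.DataFrame({'year': [2001, 2001, 2001, 2002],
                   'name': ['nameone', 'nametwo', 'nametwo', 'nameone']})

# each row receives the size of its (year, name) group
df['count'] = df.groupby(['year', 'name'])['year'].transform('size')
print(df)
#    year     name  count
# 0  2001  nameone      1
# 1  2001  nametwo      2
# 2  2001  nametwo      2
# 3  2002  nameone      1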
EDIT
Looking at your requirements, I suggest calling reset_index on the groupby result and then merging this back to your main df:
year_df = year_df.reset_index()
hso_df.merge(year_df).rename(columns={0: 'count'})
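Since the groupby result is a Series, a small variant of the same idea is to name the column during the reset instead of renaming afterwards:
year_df = year_df.reset_index(name='count')
hso_df.merge(year_df, on=['year', 'commonName'])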
I have a dataframe which looks like this (see table). For simplicity's sake, "aapl" is the only ticker shown; however, the real dataframe has more tickers.
ticker  year  return
aapl    1999  1
aapl    2000  3
aapl    2000  2
What I'd like to do is first group the dataframe by ticker, then by year. Next, I'd like to remove any duplicate years. In the end the dataframe should look like this:
ticker  year  return
aapl    1999  1
aapl    2000  3
I have a working solution, but it's not very "Pandas-esque", and involves for loops. I'm semi-certain that if I come back to the solution in three months, it'll be completely foreign to me.
Right now, I've been working on the following, with little luck:
df = df.groupby('ticker').groupby('year').drop_duplicates(subset=['year'])
This however, produces the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
Any help here would be greatly appreciated, thanks.
@QuangHoang provided the simplest version in the comments:
df.drop_duplicates(['ticker', 'year'])
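As a quick sanity check, here is that one-liner run on the sample data from the question:
import pandas as pd

df = pd.DataFrame({'ticker': ['aapl', 'aapl', 'aapl'],
                   'year': [1999, 2000, 2000],
                   'return': [1, 3, 2]})

# keeps the first occurrence of each (ticker, year) pair
print(df.drop_duplicates(['ticker', 'year']))
#   ticker  year  return
# 0   aapl  1999       1
# 1   aapl  2000       3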
Alternatively, you can use .groupby twice, inside two .apply calls:
df.groupby("ticker", group_keys=False).apply(lambda x:
    x.groupby("year", group_keys=False).apply(lambda y: y.drop_duplicates(['year']))
)
Alternatively, you can use the .duplicated function:
df.groupby('ticker', group_keys=False).apply(lambda x:
    x[~x['year'].duplicated(keep='first')]
)
You can try sorting the values first and then using groupby.tail:
df.sort_values('return').groupby(['ticker','year']).tail(1)
ticker year return
0 aapl 1999 1
1 aapl 2000 3
I'm almost sure you want to do this:
df.drop_duplicates(subset=["ticker","year"])
How do I perform an arithmetic operation across rows and columns for a data frame like the one shown below?
For example I want to calculate gross margin (gross profit/Revenue) - this is basically dividing one row by another row. I want to do this across all columns.
I think you need to restructure your dataframe a little to do this most effectively. If you transposed your dataframe such that Revenue, etc. were columns and the years were the index, you could do:
df["gross_margin"] = df["Gross profit"] / df["Revenue"]
If you don't want to make so many changes, you should at least set the metric as the index.
df = df.set_index("Metric")
And then you could:
gross_margin = df.loc["Gross profit", :] / df.loc["Revenue", :]
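If you then want the margin stored back in the same frame as a new row (still assuming the metric labels above), one possible follow-up is:
# adds a 'Gross margin' row computed from the two existing rows
df.loc["Gross margin"] = df.loc["Gross profit"] / df.loc["Revenue"]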
Here is one way to do it:
# transpose so the metrics become columns (and the years become rows)
df2 = df.T
df2['3'] = df2.iloc[1:, 2] / df2.iloc[1:, 0]  # Gross_profit / Revenue for each year
df2 = df2.T
df2.iloc[3, 0] = 'Gross Margin'  # label the new row
df2
Metric 2012 2013 2014 2015 2016
0 Revenue 116707394.0 133084076.0 143328982.0 151271526.0 181910977.0
1 Cost_of_Sales -66538762.0 -76298147.0 -82099051.0 -83925957.0 -106583385.0
2 Gross_profit 501686320.0 56785929.0 612299310.0 67345569.0 75327592.0
3 Gross Margin 4.298668 0.426692 4.271985 0.445197 0.41409
I am looking to calculate the unique employee ID count for the last 3 months using pandas. I can calculate the unique employee ID count for the current month, but I am not sure how to do it for the last 3 months.
df['DateM'] = df['Date'].dt.to_period('M')
(df.groupby("DateM")["EmpId"].nunique()
   .reset_index()
   .rename(columns={"EmpId": "One Month Unique EMP count"})
   .sort_values("DateM", ascending=False)
   .reset_index(drop=True))
testdata.xlsx Google Drive link:
https://docs.google.com/spreadsheets/d/1Kaguf72YKIsY7rjYfctHop_OLIgOvIaS/edit?usp=sharing&ouid=117123134308310688832&rtpof=true&sd=true
After using the above groupby command I get output for 1-month groups based on the DateM column, which is correct.
Similarly, I'm looking for another column where a 3-month unique active user count based on EmpId is calculated.
Sample output:
I tried calculating the same using a rolling window, but it doesn't help. I even tried creating a period for the last 3 months, and I searched before asking this question. Thanks for your help in advance; otherwise I'll have to calculate it manually.
I don't know if you are looking for 3 consecutive months or something else, because your dates are discontinuous between 2022-09 and 2022-10.
I also don't know your purpose, so I give a general solution here; if you only want to count uniques for every 3 consecutive months, it is much easier. The solution here gives you the list of unique empid values for every 3 consecutive months. Note that this means for 2022-08, I count the 3 consecutive months as 2022-08, 2022-09, and 2022-10, and so on.
# Sort data:
df.sort_values(by='datem', inplace=True, ignore_index=True)
# Create `dfu` which is `df` with unique `empid` for each `datem` only:
dfu = df.groupby(['datem', 'empid']).count().reset_index()
dfu.rename(columns={'date':'count'}, inplace=True)
dfu.sort_values(by=['datem', 'empid'], inplace=True, ignore_index=True)
dfu
# Obtain the list of unique periods:
unique_period = dfu['datem'].unique()
# Create empty dataframe:
dfe = pd.DataFrame(columns=['datem', 'empid', 'start_period'])
for p in unique_period:
    # Create a 3-consecutive-month range:
    tem_range = pd.period_range(start=p, freq='M', periods=3)
    # Extract the rows of `dfu` whose period falls within that range:
    tem_dfu = dfu.loc[dfu['datem'].isin(tem_range), :].copy()
    # Some cleaning (note the reassignment; drop_duplicates does not work in place by default):
    tem_dfu = tem_dfu.drop_duplicates(subset='empid', keep='first')
    tem_dfu.drop(columns='count', inplace=True)
    tem_dfu['start_period'] = p
    # Concat to build the desired output:
    dfe = pd.concat([dfe, tem_dfu])
dfe
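If what you ultimately need is the count rather than the list of IDs, one way to finish (assuming the same column names as above) is:
# unique empid count per 3-month window
dfe.groupby('start_period')['empid'].nunique().reset_index(name='3-month unique EmpId count')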
Hope this is what you are looking for.
I am a beginner in Python and trying to learn it. We have a df called allyears that has years, gender, and names in it.
Something like this:
name  sex  number  year
John  M    1       2010
Jane  F    2       2011
I want to get the top 10 names for a given year with their respective counts. I tried this code, but it is not returning what I am looking for.
males = allyears[(allyears.year>2009)&(allyears.sex=='M')]
maleNameCounts = pd.DataFrame(males.groupby(['year', 'name']).count())
maleNameCounts.sort_values('number', ascending=True)
How should I be approaching this problem?
Hope this helps:
Add a column with counts
df["name_count"] = df["name"].map(df.name.value_counts())
Optional to remove duplicates
df = df.drop_duplicates(["name"])
Sort (by counts)
df = df.sort_values("name_count")
Note that this can all be tweaked where necessary.
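Putting those steps together on a toy frame shaped like yours (hypothetical data; note ascending=False so the largest counts come first, and head(10) then gives the top 10):
import pandas as pd

df = pd.DataFrame({'name': ['John', 'John', 'Jane'],
                   'sex': ['M', 'M', 'F'],
                   'year': [2010, 2010, 2011]})

df["name_count"] = df["name"].map(df.name.value_counts())
df = df.drop_duplicates(["name"])
df = df.sort_values("name_count", ascending=False)
print(df.head(10))  # top 10 names by count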
You can try the following:
males = allyears[(allyears.year > 2009) & (allyears.sex == 'M')]
maleNameCounts = (males.groupby(['year', 'name']).size()
                  .nlargest(10)
                  .reset_index()
                  .rename(columns={0: 'count'}))
maleNameCounts
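For instance, on a small made-up frame (hypothetical data, since the full dataset isn't shown):
import pandas as pd

allyears = pd.DataFrame({'year': [2010, 2010, 2010, 2011],
                         'name': ['John', 'John', 'Mike', 'John'],
                         'sex': ['M', 'M', 'M', 'M']})

males = allyears[(allyears.year > 2009) & (allyears.sex == 'M')]
maleNameCounts = (males.groupby(['year', 'name']).size()
                  .nlargest(10)
                  .reset_index()
                  .rename(columns={0: 'count'}))
print(maleNameCounts)
#    year  name  count
# 0  2010  John      2
# 1  2010  Mike      1
# 2  2011  John      1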
I am getting familiar with pandas and I want to learn the logic with a few simple examples.
Let us say I have the following pandas DataFrame object:
import pandas as pd
d = {'year':pd.Series([2014,2014,2014,2014], index=['a','b','c','d']),
'dico':pd.Series(['A','A','A','B'], index=['a','b','c','d']),
'mybool':pd.Series([True,False,True,True], index=['a','b','c','d']),
'values':pd.Series([10.1,1.2,9.5,4.2], index=['a','b','c','d'])}
df = pd.DataFrame(d)
Basic Question.
How do I take a column as a list?
I.e., df['year']
would return
[2013, 2014, 2014, 2014]
Question 0
How do I take rows 'a' and 'b' and columns 'year' and 'values' as a new DataFrame?
If I try:
d[['a','b'],['year','values']]
it doesn't work.
Question 1.
How would I aggregate (sum/average) the values column by the year and dico columns? I.e., different year/dico combinations would not be added together, and mybool would effectively be dropped.
I.e., after aggregation (in this case averaging) I should get:
dico  values       year
A     10.1         2013
A     (9.5+1.2)/2  2014
B     4.2          2014
If I try the groupby function it seems to output some odd new DataFrame structure with bool in it, and all possible year/dico combinations; my objective is rather the simpler, smaller dataframe shown above.
Question 2. How do I filter by a condition?
I.e., I want to filter out all rows where mybool is False.
It'd return:
dico  values  year  mybool
A     10.1    2013  True
A     9.5     2014  True
B     4.2     2014  True
I've tried the panda tutorial but I still get some odd behavior so asking directly seems to be a better idea.
Thanks!
Values from a Series:
df['year'].values    # returns a NumPy array
df['year'].tolist()  # returns a plain Python list
loc lets you subset a dataframe by index labels:
df.loc[['a','b'],['year','values']]
groupby lets you aggregate within groups:
df.groupby(['year', 'dico'], as_index=False).mean()  # note: there is no 2013 in your df
Filtering by a column value:
df[df['mybool']==True]  # or, more idiomatically: df[df['mybool']]
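For completeness, here is a quick run of all four answers against the df defined in the question (selecting ['values'] in the groupby is a small tweak so mybool stays out of the mean, matching the desired output):
df['year'].values                       # array([2014, 2014, 2014, 2014])
df.loc[['a', 'b'], ['year', 'values']]  # rows a/b, columns year/values
df.groupby(['year', 'dico'], as_index=False)['values'].mean()
#    year dico    values
# 0  2014    A  6.933333
# 1  2014    B  4.200000
df[df['mybool']]                        # keeps rows a, c, d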