I am trying to add a column that numbers duplicate rows, ordered by another column.
Here's the example dataset:
df = pd.DataFrame({'Name': ['A','A','A','B','B','B','B'], 'Score': [9,10,10,8,7,8,8], 'Year': [2019,2018,2017,2019,2018,2017,2016]})
I want to use ['Name', 'Score'] to identify duplicates, then number the duplicates ordered by Year, to get the following result:
  Name  Score  Year  Index
0    A      9  2019      1
1    A     10  2018      1
2    A     10  2017      2
3    B      8  2019      1
4    B      7  2018      1
5    B      8  2017      2
6    B      8  2016      3
Here rows 2 and 3 are duplicates because they have the same name and score, so I order them by year and number them.
Does anyone have a good idea how to do this in Python? Thank you so much!
You are looking for cumcount:
df['Index'] = (df.sort_values('Year', ascending=False)
.groupby(['Name','Score'])
.cumcount() + 1
)
Output:
Name Score Year Index
0 A 9 2019 1
1 A 10 2018 1
2 A 10 2017 2
3 B 8 2019 1
4 B 7 2018 1
5 B 8 2017 2
6 B 8 2016 3
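Note that cumcount returns a Series aligned on the original index, so the assignment lands each counter on the right row even though the sorting happens only inside the chain; df itself keeps its original row order.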
I'm currently dealing with a World Bank dataset, and I would like to pivot it to analyse it better.
Currently, it's in a form like this:
df = pd.DataFrame({'Country Name' : ['A','A','A','B','B','B','C','C','C'],'Indicator' : ['X','Y','Z','X','Y','Z','X','Y','Z'],'2010' : [1,2,3,4,5,6,7,8,9],'2011' : [9,8,7,6,5,4,3,2,1]})
print(df)
Country Name Indicator 2010 2011
0 A X 1 9
1 A Y 2 8
2 A Z 3 7
3 B X 4 6
4 B Y 5 5
5 B Z 6 4
6 C X 7 3
7 C Y 8 2
8 C Z 9 1
I simplified it; there are more columns with more yearly values, but that's enough to explain. Now I would like to pivot this table so that the indicators become columns, with the countries as the index and the yearly values as the values. For that, I would like a df like this:
                     indicator1  indicator2  indicator3
           year
Country A  2010      values      .           .
           2011      .           .           .
Country B  2010      .           .           .
           2011      .           .           .
Country C  2010      .           .           .
           2011      .           .           .
But since the values are stored in the year columns, I don't know how to transform the frame into this layout.
I tried doing it like this:
indicators = df.pivot_table(index='Country Name', columns='Indicator', values='2011', dropna=False)
This works, of course, but I only get the values for the year 2011, and I don't want to create a separate dataframe for each year.
I can't add the year columns to the index, as that creates an enormous dataframe for some reason; and if I just add them to the values, it only creates more columns, with a second level on the columns separating the years.
I don't know if there's a method to do what I'm looking for but I appreciate any help !
Thanks
melt before pivot_table:
(df.melt(['Country Name', 'Indicator'], var_name='Year')
.pivot_table(index=['Country Name', 'Year'],
columns='Indicator', values='value')
)
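Note that pivot_table aggregates any duplicate (Country Name, Year, Indicator) combinations, taking the mean by default; pass aggfunc= if you want a different aggregation.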
Or reshape with stack/unstack (requires unique values, unlike pivot_table):
(df.set_index(['Country Name', 'Indicator'])
.rename_axis(columns='Year').stack().unstack('Indicator')
)
Output:
Indicator X Y Z
Country Name Year
A 2010 1 2 3
2011 9 8 7
B 2010 4 5 6
2011 6 5 4
C 2010 7 8 9
2011 3 2 1
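One detail worth knowing: after melt, the Year level holds strings ('2010', '2011'), because the values come from column labels. If you want integer years, a small tweak (assuming the same df as above) is to cast before pivoting:
(df.melt(['Country Name', 'Indicator'], var_name='Year')
   .astype({'Year': int})
   .pivot_table(index=['Country Name', 'Year'],
                columns='Indicator', values='value')
)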
I have the following dataset as input in my jupyter notebook:
Product Year Variable
A 2018 2
A 2019 4
B 2018 2
B 2019 3
I'm wondering what would be the quickest way, a loop or something of that sort, to get the following output:
Product Year Variable Row_Num
A 2018 2 1
A 2018 2 2
A 2019 4 1
A 2019 4 2
A 2019 4 3
A 2019 4 4
B 2018 2 1
B 2018 2 2
and so on...
TL;DR - Based on a variable in a particular column, I would like to create rows. E.g., if the variable is 3, I would like to create 3 copies of that row, with a column that holds the values 1, 2, 3 against them.
One way I found is to first create duplicates based on my variable and then use a function similar to rank() or row_number() to create my "row_num" column. It would be helpful if anyone could share other ways to do the same. 😄
If I understand correctly, you'd like to create n duplicates of each row, where the value of n is given in one of the columns. Here's a way to do that:
df["new_id"] = df.Variable.apply(lambda x: list(range(x)))
df = df.explode("new_id")
Output:
Product Year Variable new_id
0 A 2018 2 0
0 A 2018 2 1
1 A 2019 4 0
1 A 2019 4 1
1 A 2019 4 2
1 A 2019 4 3
2 B 2018 2 0
2 B 2018 2 1
3 B 2019 3 0
3 B 2019 3 1
3 B 2019 3 2
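If you want the counter to start at 1, as in your expected output, a fully vectorized alternative is to repeat the index and number the copies with cumcount. A sketch, assuming the original df from the question (Row_Num is just the column name from your example):
out = df.loc[df.index.repeat(df['Variable'])].copy()
out['Row_Num'] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)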
Solution for Pandas <= 0.24
If for whatever reason, explode is not available because you're using an older version of pandas, you can do the following:
cols = df.columns

def make_df(r):
    # copy the row's values, then attach a counter column
    d = {k: r[k] for k in cols}
    d["new_id"] = range(r["Variable"])
    return pd.DataFrame(d)

dfs = []
for row in df.iterrows():
    dfs.append(make_df(row[1]))
pd.concat(dfs)
The output is identical.
Currently I have the following Python code:
forumposts = pd.DataFrame({'UserId': [1,1,2,3,2,1,3], 'FirstPostDate': [2018,2018,2017,2019,2017,2018,2019], 'PostDate': [201801,201802,201701,201901,201801,201803,201902]})
data = forumposts.groupby(['UserId', 'PostDate','FirstPostDate']).size().reset_index()
rankedUserIdByFirstPostDate = (data.groupby(['UserId', 'FirstPostDate'])
                               .size()
                               .reset_index()
                               .sort_values('FirstPostDate')
                               .reset_index(drop=True)
                               .reset_index())
data.loc[:,'Rank'] = data.merge(rankedUserIdByFirstPostDate , how='left', on='UserId')['index'].values
The code works as intended, but it's complicated. Is there a more pandas-like way of doing this? The intent is the following:
Create a dense rank over the UserId column, sorted by FirstPostDate, such that the user with the earliest first post gets rank 0, the user with the second earliest first post gets rank 1, and so on.
Using forumposts.UserId.rank(method='dense') gives me a ranking, but it's sorted by the order of the UserId.
Use map with a dictionary created by sort_values and drop_duplicates, zipping the ordered users with np.arange:
import numpy as np

data = (forumposts.groupby(['UserId', 'PostDate', 'FirstPostDate'])
.size()
.reset_index(name='count'))
users = data.sort_values('FirstPostDate').drop_duplicates('UserId')['UserId']
d = dict(zip(users, np.arange(len(users))))
data['Rank'] = data['UserId'].map(d)
print (data)
UserId PostDate FirstPostDate count Rank
0 1 201801 2018 1 1
1 1 201802 2018 1 1
2 1 201803 2018 1 1
3 2 201701 2017 1 0
4 2 201801 2017 1 0
5 3 201901 2019 1 2
6 3 201902 2019 1 2
Another solution: broadcast each user's earliest FirstPostDate with transform('min'), dense-rank those dates, and subtract 1 so the ranks start at 0:
data['Rank'] = (data.groupby('UserId')['FirstPostDate']
.transform('min')
.rank(method='dense')
.sub(1)
.astype(int))
Here is the test data:
import numpy as np
import pandas as pd

# multi-indexed dataframe via cartesian join
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame(pd.date_range(start='2016', end='2018', freq='M'))
df1['key'] = 0
df2['key'] = 0
df = df1.merge(df2, how='outer', on='key')
del df1, df2
del df['key']
df.columns = ['id', 'date']
df['value'] = np.random.randn(len(df))
df.set_index(['date', 'id'], inplace=True)
df.sort_index(inplace=True)
df.head()
Output:
value
date id
2016-01-31 1 0.245029
2 -2.141292
3 1.521566
2016-02-29 1 0.870639
2 1.407977
There is probably a better way to generate the cartesian join, but I'm new and that is the best I could find to generate panel data that looks like mine. Anyway, my goal is to create a quick table looking at the pattern of observations to see if any are missing as it relates to time.
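(As an aside on the cartesian join: one simpler way to build panel data like this, a sketch for reference, is pd.MultiIndex.from_product:
idx = pd.MultiIndex.from_product(
    [pd.date_range(start='2016', end='2018', freq='M'), [1, 2, 3]],
    names=['date', 'id'])
df = pd.DataFrame({'value': np.random.randn(len(idx))}, index=idx)
This produces the same sorted, multi-indexed frame directly.)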
My goal is to create a year by month table of frequency observations. This is close to what I want:
df.groupby(pd.Grouper(level='date',freq='M')).count()
But it gives a vertical list. My data is much bigger than this small MWE, so I'd like to fit it more compactly, as well as see whether there are seasonal patterns (e.g. lots of observations in December or June).
It seems to me that this should work but it doesn't:
df.groupby([df.index.levels[0].month, df.index.levels[0].year]).count()
I get a ValueError: Grouper and axis must be same length error.
This gives what I'm looking for but it seems to me that it should be easier with the time index:
df.reset_index(inplace=True)
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df.groupby(['month', 'year'])['value'].count().unstack().T
Output:
month 1 2 3 4 5 6 7 8 9 10 11 12
year
2016 3 3 3 3 3 3 3 3 3 3 3 3
2017 3 3 3 3 3 3 3 3 3 3 3 3
Also, since this is just a quick validation, I'd rather not reset the index, then re-establish the index (and delete month and year) each time just to see this table.
I think you need Index.get_level_values to select the first level of the MultiIndex:
idx = df.index.get_level_values(0)
df1 = df.groupby([idx.year, idx.month])['value'].count().unstack()
Or:
df1 = df.groupby([idx.year, idx.month]).size().unstack()
The difference between count and size is that count omits NaNs and size does not.
print (df1)
date 1 2 3 4 5 6 7 8 9 10 11 12
date
2016 3 3 3 3 3 3 3 3 3 3 3 3
2017 3 3 3 3 3 3 3 3 3 3 3 3
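Note that both the rows and the columns of the result are labeled date, since both grouping keys derive from the same index level. If you want clearer headers, a cosmetic tweak is to rename the axes:
df1 = df1.rename_axis(index='year', columns='month')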
Hello, I'm having trouble with Pandas. I'm trying to sum duplicated rows in a multi-index DataFrame.
I tried df.groupby(level=[0,1]).sum(), and also df.stack().reset_index().groupby(['year', 'product']).sum(), among others, but I cannot get it to work.
I'd also like to add every unique product for each given year, giving them a value of 0 if they weren't listed.
Example: dataframe with multi-index and 3 different products (A,B,C):
volume1 volume2
year product
2010 A 10 12
A 7 3
B 7 7
2011 A 10 10
B 7 6
C 5 5
Expected output: if there are duplicated products for a given year, we sum them.
If one of the products isn't listed for a year, we create a new row full of 0s.
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
Any ideas? Thanks!
You can make the second level of the index a CategoricalIndex and when you use groupby it will include all of the categories.
df.index = df.index.set_levels(pd.CategoricalIndex(df.index.levels[1]), level=1)
df.groupby(level=[0, 1]).sum().fillna(0, downcast='infer')
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
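A caveat if you are on a recent pandas version: whether groupby includes unused categories is controlled by its observed parameter, whose default has changed across versions, so it is safer to pass it explicitly; a sketch:
df.groupby(level=[0, 1], observed=False).sum()
(Recent versions also return 0 rather than NaN for empty groups when summing, which can make the fillna step unnecessary there.)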
Use sum with unstack and stack (note that sum(level=...) is deprecated in newer pandas versions, so prefer the groupby equivalent shown in the comments):
df = df.sum(level=[0,1]).unstack(fill_value=0).stack()
#same as
#df = df.groupby(level=[0,1]).sum().unstack(fill_value=0).stack()
Alternative with reindex:
df = df.sum(level=[0,1])
#same as
#df = df.groupby(level=[0,1]).sum()
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0)
Another alternative, thanks @Wen:
df = df.sum(level=[0,1]).unstack().stack(dropna=False)
print (df)
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5