I have the following dataset as input in my Jupyter notebook:
Product Year Variable
A 2018 2
A 2019 4
B 2018 2
B 2019 3
I'm wondering what would be the quickest way to create a loop or something of that sort in my dataset, such that I get the following output:
Product Year Variable Row_Num
A 2018 2 1
A 2018 2 2
A 2019 4 1
A 2019 4 2
A 2019 4 3
A 2019 4 4
B 2018 2 1
B 2018 2 2
and so on...
TL;DR: based on a variable in a particular column, I would like to create rows. For example, if the variable is 3, I would like to create 3 copies of that row with a column that has the values 1, 2, 3 against them.
One way I found is to first create duplicates based on my variable and then use a function similar to rank() or row_number() to create my "row_num" column. It would be helpful if anyone could share other possible ways to do the same. 😄
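For reference, here is a rough sketch of that duplicate-then-number idea using index.repeat and cumcount (both standard pandas); I haven't checked whether it's the fastest option:
import pandas as pd

df = pd.DataFrame({"Product": ["A", "A", "B", "B"],
                   "Year": [2018, 2019, 2018, 2019],
                   "Variable": [2, 4, 2, 3]})

# Repeat each row Variable times; copies keep their original index label
out = df.loc[df.index.repeat(df["Variable"])].copy()
# Number the copies within each original row, starting at 1
out["Row_Num"] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)
print(out)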
If I understand correctly, you'd like to create n duplicates of each row, where the value of n is given in one of the columns. Here's a way to do that:
df["new_id"] = df.Variable.apply(lambda x: list(range(x)))
df = df.explode("new_id")
Output:
Product Year Variable new_id
0 A 2018 2 0
0 A 2018 2 1
1 A 2019 4 0
1 A 2019 4 1
1 A 2019 4 2
1 A 2019 4 3
2 B 2018 2 0
2 B 2018 2 1
3 B 2019 3 0
3 B 2019 3 1
3 B 2019 3 2
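Note that new_id starts at 0 rather than 1. If you want it to match the Row_Num column from the question, the same approach applied to the original df with a shifted range should work:
df["new_id"] = df.Variable.apply(lambda x: list(range(1, x + 1)))
df = df.explode("new_id")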
Solution for Pandas <= 0.24
If, for whatever reason, explode is not available because you're using an older version of pandas, you can do the following:
cols = df.columns

def make_df(r):
    # copy the scalar column values, then expand Variable into a counter column
    d = {k: r[k] for k in cols}
    d["new_var"] = range(r["Variable"])
    return pd.DataFrame(d)

dfs = []
for row in df.iterrows():
    dfs.append(make_df(row[1]))
pd.concat(dfs)
The output is essentially identical (the counter column is named new_var here).
Related
I have a dataframe that consists of the following entries:
The dataset contains information about each country for about 10 years, for 5 indicators, as represented above.
I am trying to convert the Indicator column into rows for all 5 indicators and, similarly, convert each year column into one merged column.
Ideally the final output should look like:
So, the country column should have extra entries according to the number of years, and the values should transpose according to each indicator.
I tried using pandas built-in functions such as melt and pivot but am not getting anywhere.
Any guidance on this would be appreciated. Thanks!
You can use stack and unstack:
out = (df.set_index(['Country Name', 'Indicator'])
.rename_axis(columns='Year').stack('Year')
.unstack('Indicator').rename_axis(columns=None)
.reset_index())
print(out)
# Output
Country Name Year I J K L
0 A 2008 0.002535 0.966967 0.033397 0.487713
1 A 2009 0.797714 0.642878 0.752803 0.527796
2 A 2010 0.773789 0.288100 0.013059 0.918957
3 B 2008 0.979707 0.968245 0.820731 0.309862
4 B 2009 0.086082 0.608920 0.629591 0.156926
5 B 2010 0.983092 0.536192 0.380157 0.091473
6 C 2008 0.834870 0.145200 0.225985 0.686520
7 C 2009 0.771646 0.834432 0.519951 0.651756
8 C 2010 0.003791 0.292212 0.257748 0.473694
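In short: both identifier columns go into the index, the year columns are stacked into rows, and Indicator is unstacked back out so that each indicator becomes its own column.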
Alternative with melt and pivot:
out = (df.melt(['Country Name', 'Indicator'], var_name='Year')
         .pivot(index=['Country Name', 'Year'], columns='Indicator', values='value')
         .rename_axis(columns=None).reset_index())
Input data:
>>> df
Country Name Indicator 2008 2009 2010
0 A I 0.002535 0.797714 0.773789
1 A J 0.966967 0.642878 0.288100
2 A K 0.033397 0.752803 0.013059
3 A L 0.487713 0.527796 0.918957
4 B I 0.979707 0.086082 0.983092
5 B J 0.968245 0.608920 0.536192
6 B K 0.820731 0.629591 0.380157
7 B L 0.309862 0.156926 0.091473
8 C I 0.834870 0.771646 0.003791
9 C J 0.145200 0.834432 0.292212
10 C K 0.225985 0.519951 0.257748
11 C L 0.686520 0.651756 0.473694
I have a dataset that looks like this:
ID date
1 01-01-2012
1 05-02-2012
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 19-05-2012
2 07-08-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 15-04-2013
3 17-05-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I am working with Python and I would like to select the last 3 dates for each ID. Here is the dataset I would like to have:
ID date
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example the last 3 dates, or the last 4 dates, etc.)?
You might use groupby and tail in the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
ID value
1 1 B
2 1 C
4 2 E
5 2 F
7 3 H
8 3 I
Note that for simplicity's sake I used different (already sorted) data when building df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
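One caveat: in the sample, date is a DD-MM-YYYY string, so sorting it lexicographically will mis-order rows across years. A small sketch that converts to datetime first (assuming the format shown above):
import pandas as pd

df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')  # parse day-first strings
df.sort_values('date').groupby('ID').tail(3).sort_values(['ID', 'date'])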
I tried this, but with a non-datetime data type:
a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
import pandas as pd
import numpy as np
a = np.array([a,b])
df=pd.DataFrame(a.T,columns=['ID','Date'])
# tail gives you the last n rows of each group
df_ = df.groupby('ID').tail(3)
df_
output:
ID Date
2 1 c
3 1 d
4 1 e
7 2 h
8 2 i
9 2 j
12 3 m
13 3 n
14 3 o
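Keep in mind that tail simply takes the last n rows of each group in their current order, so sort the frame the way you want before grouping.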
I am trying to add a column to index duplicate rows and order by another column.
Here's the example dataset:
df = pd.DataFrame({'Name': ['A','A','A','B','B','B','B'], 'Score': [9,10,10,8,7,8,8], 'Year': [2019,2018,2017,2019,2018,2017,2016]})
I want to use ['Name', 'Score'] to identify duplicates, then index the duplicates, ordered by Year, to get the following result:
Here rows 2 and 3 are duplicate rows because they have the same name and score, so I order them by year and assign an index.
Does anyone have a good idea of how to achieve this in Python? Thank you so much!
You are looking for cumcount:
df['Index'] = (df.sort_values('Year', ascending=False)
.groupby(['Name','Score'])
.cumcount() + 1
)
Output:
Name Score Year Index
0 A 9 2019 1
1 A 10 2018 1
2 A 10 2017 2
3 B 8 2019 1
4 B 7 2018 1
5 B 8 2017 2
6 B 8 2016 3
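This works even though cumcount runs on a sorted copy: the result keeps the original index labels, so assigning it back to df['Index'] aligns the values with the right rows.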
Currently I have the following python code
forumposts = pd.DataFrame({'UserId': [1,1,2,3,2,1,3], 'FirstPostDate': [2018,2018,2017,2019,2017,2018,2019], 'PostDate': [201801,201802,201701,201901,201801,201803,201902]})
data = forumposts.groupby(['UserId', 'PostDate','FirstPostDate']).size().reset_index()
rankedUserIdByFirstPostDate = data.groupby(['UserId', 'FirstPostDate']).size().reset_index().sort_values('FirstPostDate').reset_index(drop=True).reset_index()
data.loc[:,'Rank'] = data.merge(rankedUserIdByFirstPostDate , how='left', on='UserId')['index'].values
The code works as intended, but it's complicated. Is there a more pandas-like way of doing this? The intent is the following:
Create a dense rank over the UserId column sorted by the FirstPostDate such that the user with the earliest posting gets rank 0 and the user with the second earliest first post gets rank 1 and so on.
Using forumposts.UserId.rank(method='dense') gives me a ranking, but it's sorted by the order of the UserId.
Use map with a dictionary created by sort_values and drop_duplicates (to fix the order), zipped with np.arange:
import numpy as np

data = (forumposts.groupby(['UserId', 'PostDate', 'FirstPostDate'])
                  .size()
                  .reset_index(name='count'))
users = data.sort_values('FirstPostDate').drop_duplicates('UserId')['UserId']
d = dict(zip(users, np.arange(len(users))))
data['Rank'] = data['UserId'].map(d)
print(data)
UserId PostDate FirstPostDate count Rank
0 1 201801 2018 1 1
1 1 201802 2018 1 1
2 1 201803 2018 1 1
3 2 201701 2017 1 0
4 2 201801 2017 1 0
5 3 201901 2019 1 2
6 3 201902 2019 1 2
Another solution:
data['Rank'] = (data.groupby('UserId')['FirstPostDate']
.transform('min')
.rank(method='dense')
.sub(1)
.astype(int))
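Here transform('min') broadcasts each user's earliest FirstPostDate back to every row; dense-ranking that series and subtracting 1 then gives the same zero-based ordering.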
Hello, I'm having trouble dealing with pandas. I'm trying to sum duplicated rows in a multi-index DataFrame.
I tried with df.groupby(level=[0,1]).sum(), and also with df.stack().reset_index().groupby(['year', 'product']).sum(), and some others, but I cannot get it to work.
I'd also like to add every unique product for each given year and give them a 0 value if they weren't listed.
Example: dataframe with multi-index and 3 different products (A,B,C):
volume1 volume2
year product
2010 A 10 12
A 7 3
B 7 7
2011 A 10 10
B 7 6
C 5 5
Expected output: if there are duplicated products for a given year, then we sum them.
If one of the products isn't listed for a year, we create a new row full of 0s.
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
Any ideas? Thanks!
You can make the second level of the index a CategoricalIndex and when you use groupby it will include all of the categories.
df.index = df.index.set_levels(pd.CategoricalIndex(df.index.levels[1]), level=1)
df.groupby(level=[0, 1]).sum().fillna(0, downcast='infer')
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
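A note for recent pandas versions: the default of the observed argument for categorical groupers is changing, so you may want to pass it explicitly to keep the unobserved categories; there, the sum of an empty group is 0, so the fillna step should no longer be needed:
df.groupby(level=[0, 1], observed=False).sum()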
Use sum with unstack and stack:
df = df.sum(level=[0,1]).unstack(fill_value=0).stack()
# same as (and required on modern pandas, where sum(level=...) was removed):
# df = df.groupby(level=[0,1]).sum().unstack(fill_value=0).stack()
Alternative with reindex:
df = df.sum(level=[0,1])
# same as (required on modern pandas):
# df = df.groupby(level=[0,1]).sum()
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0)
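Here pd.MultiIndex.from_product builds the full year × product grid from the existing index levels, and reindex inserts any missing combination with 0.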
Another alternative, thanks @Wen:
df = df.sum(level=[0,1]).unstack().stack(dropna=False)
print (df)
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5