What is the workaround (or the tidier way) to insert a column into a pandas DataFrame where some index values are duplicated?
For example, given the following dataframe:
df1 = pd.DataFrame({0: (1, 2, 3, 4, 1, 2, 3, 4),
                    1: (51, 51, 74, 29, 39, 3, 14, 16),
                    2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F'])})
df1 = df1.set_index([0])
df1
1 2
0
1 51 R
2 51 R
3 74 R
4 29 R
1 39 F
2 3 F
3 14 F
4 16 F
How can I insert the column foo from df2 (below) into df1?
df2 = pd.DataFrame({0: (1, 2, 3, 4, 1, 3, 4),
                    'foo': (5, 5, 7, 2, 3, 1, 1),
                    2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F'])})
df2 = df2.set_index([0])
df2
foo 2
0
1 5 R
2 5 R
3 7 R
4 2 R
1 3 F
3 1 F
4 1 F
Note that the index 2 is missing from category F.
I would like the result to be something like:
1 foo 2
0
1 51 5 R
2 51 5 R
3 74 7 R
4 29 2 R
1 39 3 F
2 3 NaN F
3 14 1 F
4 16 1 F
I tried the DataFrame.insert method but I am getting:
df1.insert(2, 'FOO', df2['foo'])
ValueError: cannot reindex from a duplicate axis
Since the index and column 2 together uniquely identify a row in both data frames, you can do a join on those two columns (after resetting the index):
df1.reset_index().merge(df2.reset_index(), how='left', on=[0,2]).set_index([0])
# 1 2 foo
#0
#1 51 R 5.0
#2 51 R 5.0
#3 74 R 7.0
#4 29 R 2.0
#1 39 F 3.0
#2 3 F NaN
#3 14 F 1.0
#4 16 F 1.0
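If the original column order (1, foo, 2) matters, a final column selection restores it. A small follow-up sketch, assuming the merged result above is bound to a name:
res = df1.reset_index().merge(df2.reset_index(), how='left', on=[0, 2]).set_index([0])
res = res[[1, 'foo', 2]]  # reorder so foo sits between columns 1 and 2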
Alternatively, make (0, 2) a MultiIndex on both frames and join on it:
df1 = pd.DataFrame({0: (1, 2, 3, 4, 1, 2, 3, 4),
                    1: (51, 51, 74, 29, 39, 3, 14, 16),
                    2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F'])})
df2 = pd.DataFrame({0: (1, 2, 3, 4, 1, 3, 4),
                    'foo': (5, 5, 7, 2, 3, 1, 1),
                    2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F'])})
df1 = df1.set_index([0, 2])
df2 = df2.set_index([0, 2])
df1.join(df2, how='left').reset_index(level=2)
2 1 foo
0
1 R 51 5.0
2 R 51 5.0
3 R 74 7.0
4 R 29 2.0
1 F 39 3.0
2 F 3 NaN
3 F 14 1.0
4 F 16 1.0
You're very close...
As the error states, you can't do this because you have a repeated index. If you must have column 0 as the index, don't set it as the index before your merge; set it after:
df1 = pd.DataFrame({0: (1, 2, 3, 4, 1, 2, 3, 4),
                    1: (51, 51, 74, 29, 39, 3, 14, 16),
                    2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F'])})
df2 = pd.DataFrame({0: (1, 2, 3, 4, 1, 3, 4),
                    'foo': (5, 5, 7, 2, 3, 1, 1),
                    2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F'])})
df = df1.merge(df2, how='left')  # with no `on`, merge joins on all shared columns (0 and 2)
df = df.set_index([0])           # assign back; set_index does not modify in place
Related
I have a sales table with columns item, week, and sales. I want to create a week-to-date sales column (wtd sales) that is a weekly running total of sales per item.
I have no idea how to create this in Python.
I'm stuck at groupby(), which probably is not the answer. Can anyone help?
output_df['wtd sales'] = input_df.groupby(['item'])['sales'].transform(wtd)
As I stated in my comment, you are looking for cumsum():
import pandas as pd
df = pd.DataFrame({
'items': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'weeks': [1, 2, 3, 4, 1, 2, 3, 4],
'sales': [100, 101, 102, 130, 10, 11, 12, 13]
})
df.groupby(['items'])['sales'].cumsum()
Which results in:
0 100
1 201
2 303
3 433
4 10
5 21
6 33
7 46
Name: sales, dtype: int64
I'm using:
pd.__version__
'1.5.1'
Putting it all together:
import pandas as pd
df = pd.DataFrame({
'items': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'weeks': [1, 2, 3, 4, 1, 2, 3, 4],
'sales': [100, 101, 102, 130, 10, 11, 12, 13]
})
df['wtds'] = df.groupby(['items'])['sales'].cumsum()
Resulting in:
items weeks sales wtds
0 A 1 100 100
1 A 2 101 201
2 A 3 102 303
3 A 4 130 433
4 B 1 10 10
5 B 2 11 21
6 B 3 12 33
7 B 4 13 46
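One caveat: cumsum() accumulates in whatever order the rows appear, so if the input is not already sorted by week within each item, sort first. A minimal sketch on the same df:
df = df.sort_values(['items', 'weeks'])  # ensure chronological order within each item
df['wtds'] = df.groupby(['items'])['sales'].cumsum()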
I have one dataset with 7 variables and 5 indicators (df1):
A B C D E F G R_1 R_2 R_3 R_4 R_5
0 4 16 5 7 1 12 9 B C D F A
1 8 4 10 14 4 5 9 B E A NaN NaN
A second key-value dataset shows the cut-off value for each variable (df2):
Variable Value
0 A 11
1 B 15
2 C 22
3 D 25
4 E 3
5 F 14
6 G 15
I want to add another 5 columns, R_new_1 to R_new_5, under this condition:
if R_1 = B and the value of B in df1 (16) exceeds its cut-off from df2 (15):
df1['R_new_1'] = "C" (from R_2)
df1['R_new_2'] = "D" (from R_3)
df1['R_new_3'] = "F" (from R_4)
df1['R_new_4'] = "A" (from R_5)
df1['R_new_5'] = np.nan
The check is then repeated for the old R_2 value, which is now stored in R_new_1, and so on.
R_new_1 R_new_2 R_new_3 R_new_4 R_new_5
0 C D F A NaN
1 B A NaN NaN NaN
I have tried the below to automate the above:
var_list = {'A', 'B', 'C', 'D', 'E', 'F', 'G'}
for col in var_list:
    df1[str(col) + "_val"] = df2[df2['Variable'] == str(col)].iloc[0][1]
for col in var_list:
    if (df1[str(col) + "_val"] > df1[str(col)]):
        df1[str(col) + "_ind"] = "OK"
    else:
        df1[str(col) + "_ind"] = "NOK"
The first run handles R_1. I am unable to recursively replace B with C in R_new_1, C with D in R_new_2, D with F in R_new_3, and F with A in R_new_4, and then continue checking for R_new_2 and so on. (A sketch implementing this rule follows the reproducible example below.)
import numpy as np
import pandas as pd
## dataframe constructors
#input
data1 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9, 'R_1':'B', 'R_2':'C', 'R_3':'D', 'R_4':'F', 'R_5':'A'},
{'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9, 'R_1':'B', 'R_2':'E', 'R_3':'A', 'R_4':np.nan, 'R_5':np.nan}]
df1 = pd.DataFrame(data1)
#input
data2 = [['A', 11], ['B', 15], ['C', 22], ['D', 25], ['E', 3], ['F', 14], ['G', 15]]
df2 = pd.DataFrame(data2, columns=['Variable', 'Value'])
#desired output
data3 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9, 'R_1':'B', 'R_2':'C', 'R_3':'D', 'R_4':'F', 'R_5':'A', 'R_new_1':'C', 'R_new_2':'D', 'R_new_3':'F', 'R_new_4':'A', 'R_new_5':np.nan},
{'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9, 'R_1':'B', 'R_2':'E', 'R_3':'A', 'R_4':np.nan, 'R_5':np.nan, 'R_new_1':'B', 'R_new_2':'A', 'R_new_3':np.nan, 'R_new_4':np.nan, 'R_new_5':np.nan}]
df3 = pd.DataFrame(data3)
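One way to get from df1 and df2 to df3, as a minimal sketch: the rule is read as "an indicator survives only if its value in that row does not exceed its cut-off from df2", with the survivors shifted left and the tail padded with NaN (filter_row is a hypothetical helper name, not from the question):
import numpy as np
import pandas as pd

# cut-off lookup: variable name -> threshold
cutoff = dict(zip(df2['Variable'], df2['Value']))

r_cols = ['R_1', 'R_2', 'R_3', 'R_4', 'R_5']
new_cols = ['R_new_1', 'R_new_2', 'R_new_3', 'R_new_4', 'R_new_5']

def filter_row(row):
    # keep an indicator only if the row's value for that variable is within its cut-off
    kept = [v for v in row[r_cols].dropna() if row[v] <= cutoff[v]]
    # shift the survivors left and pad the tail with NaN
    kept += [np.nan] * (len(new_cols) - len(kept))
    return pd.Series(kept, index=new_cols)

df3 = df1.join(df1.apply(filter_row, axis=1))
On the sample data this reproduces the desired output above, including the NaN padding in the second row.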
I have a pandas dataframe with unique values in the ID column.
df = pd.DataFrame({'ID': ['A', 'B', 'C'],
'STAT': ['X', 'X', 'X'],
'IN1': [1, 3, 7],
'IN2': [2, 5, 8],
'IN3': [3, 6, 9]})
I need to create a new dataframe where I have a row for each value in IN1, IN2 and IN3 with corresponding ID and STAT:
df_new = pd.DataFrame({'IN': [1, 2, 3, 3, 5, 6, 7, 8, 9],
'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'STAT': ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X']})
You can use pandas.wide_to_long:
(pd.wide_to_long(df, ['IN'], j='to_drop', i='ID')
   .droplevel('to_drop')
   .sort_index()
   .reset_index()
)
output:
ID STAT IN
0 A X 1
1 A X 2
2 A X 3
3 B X 3
4 B X 5
5 B X 6
6 C X 7
7 C X 8
8 C X 9
You can use melt:
df.melt(id_vars=['ID','STAT'], value_name='IN')
Gives:
ID STAT variable IN
0 A X IN1 1
1 B X IN1 3
2 C X IN1 7
3 A X IN2 2
4 B X IN2 5
5 C X IN2 8
6 A X IN3 3
7 B X IN3 6
8 C X IN3 9
To match the desired df_new, sort by ID with a stable sort (so IN1, IN2, IN3 keep their order within each ID) and drop the variable column:
(df.melt(id_vars=['ID', 'STAT'], value_name='IN')
   .sort_values(by='ID', kind='stable')
   .drop('variable', axis=1)
   .reset_index(drop=True)
)
which matches df_new (up to column order).
Let's assume I have the following DataFrame:
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df
Out[10]:
a b c d
0 1 1 f 10
1 1 1 f 20
2 2 1 f 30
3 2 1 e 40
4 2 2 f 50
5 2 2 f 60
6 3 1 f 70
7 3 1 e 80
8 3 2 f 90
9 3 2 f 100
In the following I want to take the values of columns a and b where c == 'e' and use those pairs to select the corresponding rows of df (rows 2, 3, 6, 7). The idea is to create a list of tuples and index df by that list:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new
Out[13]:
c d
a b
1 1 f 10
1 f 20
2 1 f 30
1 e 40
2 f 50
2 f 60
3 1 f 70
1 e 80
2 f 90
2 f 100
list_tup
Out[14]: [(2, 1), (3, 1)]
df.loc[list_tup]
This results in TypeError: unhashable type: 'writeable void-scalar', which I don't understand. Any suggestions? I'm pretty new to python and pandas, hence I assume I'm missing something fundamental.
I believe it's better to use groupby().transform() with boolean indexing in this use case:
valids = (df['c'].eq('e')                # check if `c` is 'e'
          .groupby([df['a'], df['b']])   # group by `a` and `b`
          .transform('any')              # check if True occurs in the group;
          )                              # the same label is used for all rows in a group
# filter with boolean indexing
df[valids]
Output:
a b c d
2 2 1 f 30
3 2 1 e 40
6 3 1 f 70
7 3 1 e 80
A similar idea with groupby().filter(), which is more readable but can be slightly slower:
df.groupby(['a','b']).filter(lambda x: x['c'].eq('e').any())
You could try an inner join (merge's default), which keeps only the (a, b) pairs that also appear in the filtered frame.
import pandas as pd
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df.merge(df.loc[df['c']=='e', ['a','b']], on=['a','b'])
Output
a b c d
0 2 1 f 30
1 2 1 e 40
2 3 1 f 70
3 3 1 e 80
Maybe try a MultiIndex:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Full code:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Outputs:
c d
a b
2 1 f 30
1 e 40
3 1 f 70
1 e 80
Let us try building a boolean mask from plain tuples; note that .tolist() converts the unhashable record scalars into ordinary tuples, which is why this works where df.loc[list_tup] failed:
# compare each row's (a, b) pair against the pairs where c == 'e'
m = df[['a', 'b']].apply(tuple, axis=1).isin(
        df.loc[df['c'] == 'e', ['a', 'b']].to_records(index=False).tolist())
out = df[m].copy()
I would like to reorder the columns in a dataframe, and keep the underlying values in the right columns.
For example, this is the DataFrame I have:
import numpy as np
import pandas as pd

cols = [['Three', 'Two'], ['A', 'D', 'C', 'B']]
header = pd.MultiIndex.from_product(cols)
df = pd.DataFrame([[1, 4, 3, 2, 5, 8, 7, 6]]*4, index=np.arange(1, 5), columns=header)
df.loc[:,('One','E')] = 9
df.loc[:,('One','F')] = 10
>>> df
And I would like to change it as follows:
header2 = pd.MultiIndex(levels=[['One', 'Two', 'Three'], ['E', 'F', 'A', 'B', 'C', 'D']],
                        codes=[[0, 0, 1, 1, 1, 1, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]])
df2 = pd.DataFrame([[9,10,1,2,3,4,5,6,7,8]]*4,index=np.arange(1,5), columns=header2)
>>> df2
First, define a categorical ordering on both column levels. Then call sort_index on the column axis (axis=1) with both levels.
v = pd.Categorical(df.columns.get_level_values(0),
                   categories=['One', 'Two', 'Three'],
                   ordered=True)
v2 = pd.Categorical(df.columns.get_level_values(1),
                    categories=['E', 'F', 'C', 'B', 'A', 'D'],
                    ordered=True)
df.columns = pd.MultiIndex.from_arrays([v, v2])
df = df.sort_index(axis=1, level=[0, 1])
df
One Two Three
E F C B A D C B A D
1 9 10 7 6 5 8 3 2 1 4
2 9 10 7 6 5 8 3 2 1 4
3 9 10 7 6 5 8 3 2 1 4
4 9 10 7 6 5 8 3 2 1 4
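If only the top level needs reordering and each group should keep its existing inner column order, a simpler sketch (applied to the original df, before its columns are overwritten above) is to reindex on level 0:
df.reindex(columns=['One', 'Two', 'Three'], level=0)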