Dataframe groupby a certain column and repeat the rows n times - python

I would like to get df_output from df_input in the code below. It basically repeats each group of rows 2 times, grouped by the date column. A repeat tag should also be included.
import pandas as pd

df_input = pd.DataFrame([
    ['01/01', '1', '10'],
    ['01/01', '2', '5'],
    ['01/02', '1', '9'],
    ['01/02', '2', '7'],
], columns=['date', 'type', 'value'])
df_output = pd.DataFrame([
    ['01/01', '1', '10', '1'],
    ['01/01', '2', '5', '1'],
    ['01/01', '1', '10', '2'],
    ['01/01', '2', '5', '2'],
    ['01/02', '1', '9', '1'],
    ['01/02', '2', '7', '1'],
    ['01/02', '1', '9', '2'],
    ['01/02', '2', '7', '2'],
], columns=['date', 'type', 'value', 'repeat'])
print(df_output)
I thought about grouping by the date column and repeating the rows n times, but could not work out the code.

You can use GroupBy.apply per date, and pandas.concat:
N = 2
out = (df_input
       .groupby(['date'], group_keys=False)
       .apply(lambda d: pd.concat([d] * N))
       )
output:
    date type value
0  01/01    1    10
1  01/01    2     5
0  01/01    1    10
1  01/01    2     5
2  01/02    1     9
3  01/02    2     7
2  01/02    1     9
3  01/02    2     7
With "repeat" column:
N = 2
out = (df_input
       .groupby(['date'], group_keys=False)
       .apply(lambda d: pd.concat([d.assign(repeat=n + 1) for n in range(N)]))
       )
output:
    date type value repeat
0  01/01    1    10      1
1  01/01    2     5      1
0  01/01    1    10      2
1  01/01    2     5      2
2  01/02    1     9      1
3  01/02    2     7      1
2  01/02    1     9      2
3  01/02    2     7      2
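As an aside (not from the original answer), newer pandas versions warn when GroupBy.apply operates on the grouping columns, so here is an apply-free sketch of the same idea: tag N copies, concatenate, then stable-sort by date. kind='mergesort' guarantees a stable sort, which preserves row order within each date.
import pandas as pd

N = 2
out = (pd.concat([df_input.assign(repeat=n + 1) for n in range(N)])
         .sort_values('date', kind='mergesort')  # mergesort is stable
       )
print(out)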

Related

Merge rows if cells are equal - pandas

I have this df:
import pandas as pd

df = pd.DataFrame({'Time': ['s_1234', 's_1234', 's_1234', 's_5678', 's_8998', 's_8998'],
                   'Control': ['A', '', '', 'B', 'C', ''],
                   'tot_1': ['1', '1', '1', '1', '1', '1'],
                   'tot_2': ['2', '2', '2', '2', '2', '2']})
--------
   Time Control tot_1 tot_2
0  1234       A     1     2
1  1234       A     1     2
2  1234             1     2
3  5678       B     1     2
4  8998       C     1     2
5  8998             1     2
I would like rows with an equal Time value to be merged into one row. I would also like the "tot_1" and "tot_2" columns to be summed. And finally I would like to keep the Control value if one is present. Like:
   Time Control tot_1 tot_2
0  1234       A     3     6
1  5678       B     1     2
2  8998       C     2     4
Your data is different from the example df.
Construct the df:
import pandas as pd

df = pd.DataFrame({'Time': ['s_1234', 's_1234', 's_1234', 's_5678', 's_8998', 's_8998'],
                   'Control': ['A', '', '', 'B', 'C', ''],
                   'tot_1': ['1', '1', '1', '1', '1', '1'],
                   'tot_2': ['2', '2', '2', '2', '2', '2']})
df.Time = df.Time.str.split("_").str[1]
df = df.astype({"tot_1": int, "tot_2": int})
Group by Time and aggregate the values.
df.groupby('Time').agg({"Control": "first", "tot_1": "sum", "tot_2": "sum"}).reset_index()
   Time Control tot_1 tot_2
0  1234       A     3     6
1  5678       B     1     2
2  8998       C     2     4
EDIT for comment: Not sure if that's the best way to do it, but you could construct your agg information like this:
n = 2
agg_ = {"Control": "first"} | {f"tot_{i+1}": "sum" for i in range(n)}
df.groupby('Time').agg(agg_).reset_index()
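Note that the | dict-merge operator requires Python 3.9+; on older versions an unpacking merge builds the same dict:
n = 2
agg_ = {"Control": "first", **{f"tot_{i+1}": "sum" for i in range(n)}}
df.groupby('Time').agg(agg_).reset_index()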

Create a column which is the mean of multiple columns in a data frame in pandas

So I've looked at multiple potential solutions but none seem to work.
Basically, I want to create a new column in my data frame which is the mean of multiple other columns. I want this mean to exclude NaN values but still calculate the mean even if there are NaN values in the row.
I have a data frame which looks something like this (but actually Q222-229):
ID  Q1   Q2   Q3   Q4   Q5
1    4  NaN  NaN  NaN  NaN
2    5    7    8  NaN  NaN
3    7    1    2  NaN  NaN
4    2    2    3    4    1
5    1    3  NaN  NaN  NaN
And I want to create a column which is the mean of Q1, Q2, Q3, Q4, Q5, i.e.:
ID  Q1   Q2   Q3   Q4   Q5  avg_age
1    4  NaN  NaN  NaN  NaN      4
2    5    7    8  NaN  NaN      5.5
3    7    1    2  NaN  NaN      3.5
4    2    2    3    4    1      2
5    1    3  NaN  NaN  NaN      2
(ignore values)
However, every method I have tried returns NaN values in the avg_age column, which makes me think that when ignoring the NaN values, pandas is ignoring the whole row. I don't want this to happen; instead, I want the mean returned with the NaN values ignored.
Here is what I have tried so far:
1.
avg_age = s.loc[:, "Q222":"Q229"]
avg_age = avg_age.mean(axis=1)
s = pd.concat([s, avg_age], axis=1)
2.
s['avg_age'] = s[['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']].mean(axis=1)
3.
avg_age = ['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']
s.loc[:, 'avg_age'] = s[avg_age].mean(axis=1)
I am not sure if there is something wrong with the way I coded the values initially, so here is my code for reference:
# Changing age variable inputs
s['Q222'] = s['Q222'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
s['Q223'] = s['Q223'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
s['Q224'] = s['Q224'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
s['Q225'] = s['Q225'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
s['Q226'] = s['Q226'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
s['Q227'] = s['Q227'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
s['Q228'] = s['Q228'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
s['Q229'] = s['Q229'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
s['Q222'] = s['Q222'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q223'] = s['Q223'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q224'] = s['Q224'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q225'] = s['Q225'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q226'] = s['Q226'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q227'] = s['Q227'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q228'] = s['Q228'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q229'] = s['Q229'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
Thanks in advance to anyone who is able to help!
The default behavior of DataFrame.mean() should do what you want.
Here's an example showing taking a mean over a subset of the columns and placing it in a newly created column:
In [19]: tmp
Out[19]:
   a  b    c
0  1  2  5.0
1  2  3  6.0
2  3  4  NaN

In [24]: tmp['mean'] = tmp[['b', 'c']].mean(axis=1)

In [25]: tmp
Out[25]:
   a  b    c  mean
0  1  2  5.0   3.5
1  2  3  6.0   4.5
2  3  4  NaN   4.0
As for what's going wrong in your code:
s['Q222'] = s['Q222'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              ['2', '3', '4', '5', '6', '7', '8', np.NaN])
You don't have numerical values (i.e. 2, 3, 4) in your data frame; you have strings ('2', '3', and '4'). The DataFrame.mean() function treats these strings as NaN, so you get NaN as the result of all your mean calculations.
Try filling your frame with numbers, like so:
s['Q222'] = s['Q222'].replace(['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75 or older', "Don't know"],
                              [2, 3, 4, 5, 6, 7, 8, np.NaN])
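As a side note (a sketch, not from the original answers), the eight near-identical replace calls could be collapsed into one mapping dict and a loop, assuming Q222-Q229 all share the same categories:
import numpy as np

age_map = {'0-4': 1, '05-11': 1, '12-15': 1, '16-17': 1,
           '18-24': 2, '25-34': 3, '35-44': 4, '45-54': 5,
           '55-64': 6, '65-74': 7, '75 or older': 8, "Don't know": np.nan}

for col in ['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']:
    s[col] = s[col].replace(age_map)  # numeric values, so mean() works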
Alternatively, you can get it with a list comprehension to select the columns to average, and mean() with skipna=True:
df['avg_age'] = df[[col for col in df.columns if 'Q' in col]].mean(axis=1, skipna=True)

Find duplicated groups in a dataframe

I have a dataframe as described below, and I need to find the duplicate groups based on the columns value1, value2 and value3 (rows should be grouped by id). I need to fill the column 'duplicated' with True if the group appears elsewhere in the table; if the group is unique, fill it with False. Note: each group has a different id.
df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
                   'value1': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value2': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value3': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'duplicated': [False] * 13  # placeholder to be filled; an empty list here would raise a length error
                   })
The expected result marks groups A and C as True and groups B and D as False.
I tried this, but it compares individual rows; I need to compare groups (grouped by id):
import pandas as pd

data = pd.read_excel('C:/Users/path/Desktop/example.xlsx')
# keep=False: mark all duplicates as True
data['duplicates'] = data.duplicated(subset=["value1", "value2", "value3"], keep=False)
data.to_excel('C:/Users/path/Desktop/example_result.xlsx', index=False)
and I got row-level duplicates instead. Note: the order of the records in the two groups doesn't matter.
This may not be very efficient but it works if duplicated groups have the same "order".
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
                   'value1': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value2': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value3': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'duplicated': [False] * 13
                   })

def check_dup(df, id1, id2):
    # Checks if two groups are duplicates.
    # First checks the sizes; only if they are equal does it check the actual values.
    df1 = df[df['id'] == id1][['value1', 'value2', 'value3']]
    df2 = df[df['id'] == id2][['value1', 'value2', 'value3']]
    if df1.size != df2.size:
        return False
    return (df1.values == df2.values).all()

id_unique = set(df['id'].values)          # set of unique ids
id_dic = dict.fromkeys(id_unique, False)  # dict holding the "duplicated" value for each id

for id1 in id_unique:
    for id2 in id_unique - {id1}:
        if check_dup(df, id1, id2):
            id_dic[id1] = True
            break

# Update the 'duplicated' column on df
for id_ in id_dic:
    df.loc[df['id'] == id_, 'duplicated'] = id_dic[id_]

print(df)
   id value1 value2 value3  duplicated
0   A      1      1      1        True
1   A      2      2      2        True
2   A      3      3      3        True
3   A      4      4      4        True
4   B      1      1      1       False
5   B      2      2      2       False
6   C      1      1      1        True
7   C      2      2      2        True
8   C      3      3      3        True
9   C      4      4      4        True
10  D      1      1      1       False
11  D      2      2      2       False
12  D      3      3      3       False
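If the row order within groups may differ, one tweak (a hypothetical variant, not from the original answer) is to sort each group's rows before comparing:
def check_dup_unordered(df, id1, id2):
    # Compare the two groups as sorted blocks so row order doesn't matter.
    cols = ['value1', 'value2', 'value3']
    df1 = df[df['id'] == id1][cols].sort_values(cols).reset_index(drop=True)
    df2 = df[df['id'] == id2][cols].sort_values(cols).reset_index(drop=True)
    return df1.equals(df2)  # equals returns False when the shapes differ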
You can do it like this. First, sort_values just in case, set_index on id, and stack to reshape your data, getting a single column with to_frame:
df_ = (df.sort_values(by=["value1", "value2", "value3"])
         .set_index('id')[["value1", "value2", "value3"]]
         .stack()
         .to_frame()
       )
Second, append a set_index with a cumcount per id, drop the index level that held the original column names (value1, ...), unstack to get one row per id, fillna with a placeholder value, and use duplicated.
s_dup = df_.set_index([df_.groupby('id').cumcount()], append=True)\
           .reset_index(level=1, drop=True)[0]\
           .unstack()\
           .fillna(0)\
           .duplicated(keep=False)
print(s_dup)
id
A True
B False
C True
D False
dtype: bool
Now you can just map to the original dataframe:
df['dup'] = df['id'].map(s_dup)
print(df)
   id value1 value2 value3    dup
0   A      1      1      1   True
1   A      2      2      2   True
2   A      3      3      3   True
3   A      4      4      4   True
4   B      1      1      1  False
5   B      2      2      2  False
6   C      1      1      1   True
7   C      2      2      2   True
8   C      3      3      3   True
9   C      4      4      4   True
10  D      1      1      1  False
11  D      2      2      2  False
12  D      3      3      3  False
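A more compact route (a sketch along the same lines, not from the original answers) reduces each id's rows to a canonical sorted tuple and runs duplicated on those keys:
cols = ['value1', 'value2', 'value3']
keys = (df.sort_values(cols)
          .groupby('id')[cols]
          .apply(lambda g: tuple(map(tuple, g.values))))  # one hashable key per id
df['dup'] = df['id'].map(keys.duplicated(keep=False))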

How to paste a list into a multi-index dataframe?

Could you let me know how to paste a list into a multi-index dataframe? I want to paste list1 into the column ([func1 - In - Name1, Name2]['Val6']) and list2 into the column ([func1 - Out - Name3, Name4]['Val6']) of the multi-index dataframe.
Below is the dataframe I used:
import numpy as np
from pandas import Series, DataFrame

raw_data = {'Function': ['env', 'env', 'env', 'func1', 'func1', 'func1'],
            'Type': ['In', 'In', 'In', 'In', 'In', 'out'],
            'Name': ['Volt', 'Temp', 'BD#', 'Name1', 'Name2', 'Name3'],
            'Val1': ['Max', 'High', '1', '3', '5', '6'],
            'Val2': ['Typ', 'Mid', '2', '4', '7', '6'],
            'Val3': ['Min', 'Low', '3', '3', '6', '3'],
            'Val4': ['Max', 'High', '4', '3', '9', '4'],
            'Val5': ['Max', 'Low', '5', '3', '4', '5']}
df = DataFrame(raw_data)
df = df.set_index(["Function", "Type", "Name"])
df['Val6'] = np.NaN
list1 = [1, 2]
list2 = [3, 4]
print(df)
Below is the printed dataframe:
                     Val1 Val2 Val3 Val4 Val5  Val6
Function Type  Name
env      In    Volt   Max  Typ  Min  Max  Max   NaN
               Temp  High  Mid  Low High  Low   NaN
               BD#      1    2    3    4    5   NaN
func1    In    Name1    4    2    3    4    5   NaN
               Name2    6    7    6    9    4   NaN
         out   Name3    6    6    3    4    5   NaN
               Name4    3    3    4    5    6   NaN
Below is the expected result. I'd like to sequentially put list1 and list2 into the dataframe in place of the NaN values, like below:
                     Val1 Val2 Val3 Val4 Val5  Val6
Function Type  Name
env      In    Volt   Max  Typ  Min  Max  Max   NaN
               Temp  High  Mid  Low High  Low   NaN
               BD#      1    2    3    4    5   NaN
func1    In    Name1    4    2    3    4    5     1
               Name2    6    7    6    9    4     2
         out   Name3    6    6    3    4    5     3
               Name4    3    3    4    5    6     4
I have tried to use the concat and replace functions to do it but failed. For a more complex dataframe, I think it is better to use masks on the multi-index.
list1 = [1, 2]
list2 = [3, 4]
m1 = df.index.get_level_values(0) == 'func1'
m2 = df.index.get_level_values(1) == 'In'
list1 = [float(i) for i in list1]
df_list1 = pd.DataFrame(list1)
df.replace(df[m1 & m2]['Val6'], df_list1)
Unfortunately, I don't have any idea how to solve the problem. T_T
Please give me some advice.
IIUC, add an extra line at the end; simply modify it like a non-multi-index dataframe:
df['Val6'] = df['Val6'].tolist()[:-4] + list1 + list2
So your code would be:
import numpy as np
from pandas import Series, DataFrame

raw_data = {'Function': ['env', 'env', 'env', 'func1', 'func1', 'func1'],
            'Type': ['In', 'In', 'In', 'In', 'In', 'out'],
            'Name': ['Volt', 'Temp', 'BD#', 'Name1', 'Name2', 'Name3'],
            'Val1': ['Max', 'High', '1', '3', '5', '6'],
            'Val2': ['Typ', 'Mid', '2', '4', '7', '6'],
            'Val3': ['Min', 'Low', '3', '3', '6', '3'],
            'Val4': ['Max', 'High', '4', '3', '9', '4'],
            'Val5': ['Max', 'Low', '5', '3', '4', '5']}
df = DataFrame(raw_data)
df = df.set_index(["Function", "Type", "Name"])
df['Val6'] = np.NaN
list1 = [1, 2]
list2 = [3, 4]
df['Val6'] = df['Val6'].tolist()[:-4] + list1 + list2
print(df)
Output:
                     Val1 Val2 Val3 Val4 Val5  Val6
Function Type  Name
env      In    Volt   Max  Typ  Min  Max  Max   NaN
               Temp  High  Mid  Low High  Low   NaN
               BD#      1    2    3    4    5   1.0
func1    In    Name1    3    4    3    3    3   2.0
               Name2    5    7    6    9    4   3.0
         out   Name3    6    6    3    4    5   4.0
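For what it's worth, the mask approach from the question also works with .loc assignment (a sketch, assuming the dataframe really has the Name4 row shown in the question's printout, so each mask selects two rows):
m_func = df.index.get_level_values('Function') == 'func1'
m_in = df.index.get_level_values('Type') == 'In'

df.loc[m_func & m_in, 'Val6'] = list1    # Name1, Name2
df.loc[m_func & ~m_in, 'Val6'] = list2   # Name3, Name4; list length must match the selected rows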

How to count occurrences of a set of values for a range of columns in a pandas dataframe?

I have a pandas dataframe that looks something like this (the construction code is at the bottom). The dataframe is populated with 4 distinct strings: '00', '01', '10', and '11'. I'm hoping to count each occurrence of the values in each column, returning a resulting dataframe that looks something like this:
     A  B  C  D  E
00   2  1  3  0  3
01   2  2  0  2  1
10   0  0  1  2  0
11   1  2  1  1  1
The original dataframe can be created with this code:
import pandas as pd

dft = pd.DataFrame({'A': ['11', '01', '01', '00', '00'],
                    'B': ['00', '01', '11', '01', '11'],
                    'C': ['00', '00', '10', '00', '11'],
                    'D': ['10', '01', '11', '10', '01'],
                    'E': ['00', '01', '00', '11', '00']})
dft
You can use value_counts together with a dictionary comprehension to generate the values, and then use the data to create a DataFrame.
>>> pd.DataFrame({col: dft[col].value_counts() for col in dft}).fillna(0)
     A  B  C  D  E
00   2  1  3  0  3
01   2  2  0  2  1
10   0  0  1  2  0
11   1  2  1  1  1
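An equivalent one-liner (an alternative sketch, not from the original answer) applies value_counts column-wise and keeps integer counts:
out = dft.apply(pd.Series.value_counts).fillna(0).astype(int)
print(out)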
