I'm trying to create a new column based on a groupby function, but I'm running into an error. In the sample dataframe below, I want to create a new column that has an integer only in the rows corresponding to the max seq value per user. So, for instance, user122 would only have a number in the 3rd row, where seq is 3 (this user's highest seq number).
import numpy as np
import pandas as pd
import random

df = pd.DataFrame({
'user':
{0: 'user122',
1: 'user122',
2: 'user122',
3: 'user124',
4: 'user125',
5: 'user125',
6: 'user126',
7: 'user126',
8: 'user126'},
'baseline':
{0: 4.0,
1: 4.0,
2: 4.0,
3: 2,
4: 4,
5: 4,
6: 5,
7: 5,
8: 5},
'score':
{0: np.nan,
1: 3,
2: 2,
3: 5,
4: np.nan,
5: 6,
6: 3,
7: 2,
8: 1},
'binary':
{0: 1,
1: 1,
2: 0,
3: 0,
4: 0,
5: 0,
6: 1,
7: 0,
8: 1},
'var1':
{0: 3,
1: 5,
2: 5,
3: 1,
4: 1,
5: 1,
6: 1,
7: 3,
8: 5},
'seq':
{0: 1,
1: 2,
2: 3,
3: 1,
4: 1,
5: 2,
6: 1,
7: 2,
8: 3},
})
The code I used is below:
df['newnum'] = np.where(df.groupby('user')['seq'].max(), random.randint(4, 9), 'NA')
The new column and the old column aren't the same length, so I run into the error below. I thought np.where would put "NA" in all of the places where seq wasn't the max value, but that didn't happen.
Length of values (4) does not match length of index (9)
Anyone else have a better idea?
And, if possible, I'd ideally like for the newnum variable to be a multiple of the baseline (but that was too complicated, so I just created a random digit).
Thanks for any help!
The groupby result has fewer rows than your dataframe (one per user), so it doesn't align 1:1 with the original index, hence the error.
Here is how you can accomplish it:
# using transform with the groupby to return the max against each of the items
# in the group
df['newnum'] = np.where(df.groupby('user')['seq'].transform('max').eq(df['seq']),
                        np.random.randint(4, 9),
                        np.nan)
df
user baseline score binary var1 seq newnum
0 user122 4.0 NaN 1 3 1 NaN
1 user122 4.0 3.0 1 5 2 NaN
2 user122 4.0 2.0 0 5 3 6.0
3 user124 2.0 5.0 0 1 1 6.0
4 user125 4.0 NaN 0 1 1 NaN
5 user125 4.0 6.0 0 1 2 6.0
6 user126 5.0 3.0 1 1 1 NaN
7 user126 5.0 2.0 0 3 2 NaN
8 user126 5.0 1.0 1 5 3 6.0
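If you'd also like newnum to be a multiple of baseline rather than a plain random digit, here is a minimal sketch along the same lines (the 1-3 multiplier range is just an assumption; note that np.where with a scalar reuses one random draw for every matched row, so this draws per row instead):
mask = df.groupby('user')['seq'].transform('max').eq(df['seq'])
df['newnum'] = np.where(mask, df['baseline'] * np.random.randint(1, 4, size=len(df)), np.nan)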
idxmax = df.groupby('user')['seq'].idxmax()
df.loc[idxmax, 'newnum'] = ...
Notes:
In the first line of the code above, we get the indexes of df where the maximum seq is reached for each user.
In the second line, we create the new column newnum and assign values to it at the idxmax positions in one step. The other values are NaN by default.
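For example, sticking with the random values from your attempt (the randint(4, 9) range is taken straight from the question):
df.loc[idxmax, 'newnum'] = np.random.randint(4, 9, size=len(idxmax))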
Update
When we assign a numpy.ndarray vector to a new column of a pandas.DataFrame, all data frame indexes are used by default to populate the column with values from the vector. If the number of indexes differs from the vector's length, you get a ValueError about the size mismatch, as in your case. To avoid it, we have to restrict the data frame indexes to those used in the assignment. That's the meaning of df.loc[idxmax, 'newnum'], where we address df cells in the new column 'newnum' with indexes from idxmax.
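A minimal sketch of the difference on toy data:
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': range(9)})
vec = np.array([10, 20, 30, 40])
# toy['newnum'] = vec                    # ValueError: Length of values (4) does not match length of index (9)
toy.loc[[2, 3, 5, 8], 'newnum'] = vec    # OK: only the addressed cells are set, the rest stay NaN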
There's a restriction: only the first occurrence of the maximum per user will be used. If there are two or more occurrences of the max value for some user, all the others will be ignored. You've provided test values with a growing seq for each user. If this assumption (about growing seq) is right, then we can use idxmax. Taking the last value in each group will do as well in this case:
idxmax = df.groupby('user').tail(1).index
If more than one occurrence of the max is expected, then we should use some other method, e.g.:
df.groupby('user')['seq'].transform('max').eq(df['seq'])
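If duplicates of the max are possible, that mask pairs naturally with .loc; a minimal sketch, again with random stand-in values:
mask = df.groupby('user')['seq'].transform('max').eq(df['seq'])
df.loc[mask, 'newnum'] = np.random.randint(4, 9, size=mask.sum())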
I have a dataframe that is very similar to this dataframe:
index  date       month
0      2019-12-1  12
1      2020-03-1  3
2      2020-07-1  7
3      2021-02-1  2
4      2021-09-1  9
And I want to combine all dates that are closest to a set of months. The months need to be normalized like this:
Months            Normalized month
3, 4, 5           4
6, 7, 8, 9        8
1, 2, 10, 11, 12  12
So the output will be:
index  date       month
0      2019-12-1  12
1      2020-04-1  4
2      2020-08-1  8
3      2020-12-1  12
4      2021-08-1  8
You can iterate through the DataFrame with iterrows and rewrite the month and the date string for each row.
import pandas as pd
df = pd.DataFrame(data={'date': ["2019-12-1", "2020-03-1", "2020-07-1", "2021-02-1", "2021-09-1"],
'month': [12,3,7,2,9]})
for index, row in df.iterrows():
    if row['month'] in [3, 4, 5]:
        df.loc[index, 'month'] = 4
        df.loc[index, 'date'] = row['date'][:5] + "04" + row['date'][7:]
    elif row['month'] in [6, 7, 8, 9]:
        df.loc[index, 'month'] = 8
        df.loc[index, 'date'] = row['date'][:5] + "08" + row['date'][7:]
    else:
        df.loc[index, 'month'] = 12
        df.loc[index, 'date'] = row['date'][:5] + "12" + row['date'][7:]
You can try creating a dictionary of months:
norm_month_dict = {3: 4, 4: 4, 5: 4, 6: 8, 7: 8, 8: 8, 9: 8, 1: 12, 2: 12, 10: 12, 11: 12, 12: 12}
then use this dictionary to map month values to their respective normalized month values.
df['normalized_months'] = df['month'].map(norm_month_dict)
You need to construct a dictionary from the second dataframe (assuming df1 and df2):
d = (
df2.assign(Months=df2['Months'].str.split(', '))
.explode('Months').astype(int)
.set_index('Months')['Normalized month'].to_dict()
)
# {3: 4, 4: 4, 5: 4, 6: 8, 7: 8, 8: 8, 9: 8, 1: 12, 2: 12, 10: 12, 11: 12, 12: 12}
Then map the values:
df1['month'] = df1['month'].map(d)
output:
index date month
0 0 2019-12-1 12
1 1 2020-03-1 4
2 2 2020-07-1 8
3 3 2021-02-1 12
4 4 2021-09-1 8
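If the date strings should reflect the normalized month as well, a minimal sketch, assuming the dates keep the question's 'YYYY-MM-D' string layout (and, as with the loop answer above, keeping the year as-is):
df1['date'] = df1['date'].str[:5] + df1['month'].astype(str).str.zfill(2) + df1['date'].str[7:]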
I can't figure out a problem I am trying to solve.
I have a pandas data frame coming from this:
date, id, measure, result
2016-07-11, 31, "[2, 5, 3, 3]", 1
2016-07-12, 32, "[3, 5, 3, 3]", 1
2016-07-13, 33, "[2, 1, 2, 2]", 1
2016-07-14, 34, "[2, 6, 3, 3]", 1
2016-07-15, 35, "[39, 31, 73, 34]", 0
2016-07-16, 36, "[3, 2, 3, 3]", 1
2016-07-17, 37, "[3, 8, 3, 3]", 1
The measure column consists of arrays in string format.
I want a new moving-average-array column computed from the past 3 measurement records, excluding records where the result is 0. "Past 3 records" means that for id 34, the arrays of ids 31, 32 and 33 are to be used.
It is about taking the average of every 1st point, 2nd point, 3rd and 4th point to have this moving-average-array.
It is not about getting the average of 1st array, 2nd array ... and then averaging the average, no.
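Concretely, for id 34 that element-wise average over the three previous arrays is:
import numpy as np
np.mean([[2, 5, 3, 3], [3, 5, 3, 3], [2, 1, 2, 2]], axis=0)
# array([2.33333333, 3.66666667, 2.66666667, 2.66666667])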
For the first 3 rows, because there is not enough history, I just want to use their own measurement. So the solution should look like this:
date, id, measure, result . Solution
2016-07-11, 31, "[2, 5, 3, 3]", 1, "[2, 5, 3, 3]"
2016-07-12, 32, "[3, 5, 3, 3]", 1, "[3, 5, 3, 3]"
2016-07-13, 33, "[2, 1, 2, 2]", 1, "[2, 1, 2, 2]"
2016-07-14, 34, "[2, 6, 3, 3]", 1, "[2.3, 3.6, 2.6, 2.6]"
2016-07-15, 35, "[39, 31, 73, 34]", 0, "[2.3, 4, 2.6, 2.6]"
2016-07-16, 36, "[3, 2, 3, 3]", 1, "[2.3, 4, 2.6, 2.6]"
2016-07-17, 37, "[3, 8, 3, 3]", 1, "[2.3, 3, 2.6, 2.6]"
The real data is bigger; result 0 may repeat two or more times in a row. I think the key is keeping proper track of the previous OK results and taking their averages, but I spent time on it and couldn't get it working.
I am posting the dataframe here:
mydict = {'date': {0: '2016-07-11',
1: '2016-07-12',
2: '2016-07-13',
3: '2016-07-14',
4: '2016-07-15',
5: '2016-07-16',
6: '2016-07-17'},
'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
'measure': {0: '[2, 5, 3, 3]',
1: '[3, 5, 3, 3]',
2: '[2, 1, 2, 2]',
3: '[2, 6, 3, 3]',
4: '[39, 31, 73, 34]',
5: '[3, 2, 3, 3]',
6: '[3, 8, 3, 3]'},
'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)
Thank you for any directions or pointers on how to do this.
Solution using only 1 for loop:
Considering the data:
mydict = {'date': {0: '2016-07-11',
1: '2016-07-12',
2: '2016-07-13',
3: '2016-07-14',
4: '2016-07-15',
5: '2016-07-16',
6: '2016-07-17'},
'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
'measure': {0: '[2, 5, 3, 3]',
1: '[3, 5, 3, 3]',
2: '[2, 1, 2, 2]',
3: '[2, 6, 3, 3]',
4: '[39, 31, 73, 34]',
5: '[3, 2, 3, 3]',
6: '[3, 8, 3, 3]'},
'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)
I defined a simple function to calculate the means and return a list. Then, loop over the dataframe applying the rules:
def calc_mean(in_list):
    p0 = round((in_list[0][0] + in_list[1][0] + in_list[2][0])/3, 1)
    p1 = round((in_list[0][1] + in_list[1][1] + in_list[2][1])/3, 1)
    p2 = round((in_list[0][2] + in_list[1][2] + in_list[2][2])/3, 1)
    p3 = round((in_list[0][3] + in_list[1][3] + in_list[2][3])/3, 1)
    return [p0, p1, p2, p3]
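# note: the four lines above are just an element-wise mean; an equivalent
# numpy sketch (assuming numpy is imported as np) would be
#     return [round(x, 1) for x in np.mean(in_list, axis=0)]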
Solution = []
aux_list = []
for index, row in df.iterrows():
    if index in [0,1,2]:
        Solution.append(row.measure)
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
    else:
        Solution.append('[' + ', '.join(map(str, calc_mean(aux_list))) + ']')
        if row.result > 0:
            aux_list.pop(0)
            aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
df['Solution'] = Solution
The output is:
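         date  id           measure  result              Solution
0  2016-07-11  31      [2, 5, 3, 3]       1          [2, 5, 3, 3]
1  2016-07-12  32      [3, 5, 3, 3]       1          [3, 5, 3, 3]
2  2016-07-13  33      [2, 1, 2, 2]       1          [2, 1, 2, 2]
3  2016-07-14  34      [2, 6, 3, 3]       1  [2.3, 3.7, 2.7, 2.7]
4  2016-07-15  35  [39, 31, 73, 34]       0  [2.3, 4.0, 2.7, 2.7]
5  2016-07-16  36      [3, 2, 3, 3]       1  [2.3, 4.0, 2.7, 2.7]
6  2016-07-17  37      [3, 8, 3, 3]       1  [2.3, 3.0, 2.7, 2.7]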
Please note that the result is rounded to 1 decimal place, a bit different from your desired output. That made more sense to me.
EDIT:
As suggested in the comments by #Frenchy, to deal with result == 0 within the first 3 rows, we need to change the first if clause a bit:
if index in [0,1,2] or len(aux_list) < 3:
    Solution.append(row.measure)
    if row.result > 0:
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
You can use pd.eval to turn the str lists into proper lists, using only the rows of measure where result is not 0. Use rolling with mean, then shift, so each row gets the rolling average of the previous 3 valid rows. Then map to str once the result is converted to a list of lists with values and tolist. Finally, replace the first three rows with their own measure and ffill the remaining gaps:
df.loc[df.result.shift() != 0,'solution'] = list(map(str,
pd.DataFrame(pd.eval(df[df.result != 0].measure))
.rolling(3).mean().shift().values.tolist()))
df.loc[:2,'solution'] = df.loc[:2,'measure']
df.solution = df.solution.ffill()
Here's another solution:
# get data to reproduce the example
import pandas as pd
from io import StringIO
data = StringIO("""
date;id;measure;result
2016-07-11;31;"[2,5,3,3]";1
2016-07-12;32;"[3,5,3,3]";1
2016-07-13;33;"[2,1,2,2]";1
2016-07-14;34;"[2,6,3,3]";1
2016-07-15;35;"[39,31,73,34]";0
2016-07-16;36;"[3,2,3,3]";1
2016-07-17;37;"[3,8,3,3]";1
""")
df = pd.read_csv(data, sep=";")
df
# Out:
# date id measure result
# 0 2016-07-11 31 [2,5,3,3] 1
# 1 2016-07-12 32 [3,5,3,3] 1
# 2 2016-07-13 33 [2,1,2,2] 1
# 3 2016-07-14 34 [2,6,3,3] 1
# 4 2016-07-15 35 [39,31,73,34] 0
# 5 2016-07-16 36 [3,2,3,3] 1
# 6 2016-07-17 37 [3,8,3,3] 1
# convert values in measure column to lists
from ast import literal_eval
dm = df['measure'].apply(literal_eval)
# apply rolling mean with window 2 and recollect the values into lists in the means column
df["means"] = dm.apply(pd.Series).rolling(2, min_periods=0).mean().values.tolist()
df
# Out:
# date id measure result means
# 0 2016-07-11 31 [2,5,3,3] 1 [2.0, 5.0, 3.0, 3.0]
# 1 2016-07-12 32 [3,5,3,3] 1 [2.5, 5.0, 3.0, 3.0]
# 2 2016-07-13 33 [2,1,2,2] 1 [2.5, 3.0, 2.5, 2.5]
# 3 2016-07-14 34 [2,6,3,3] 1 [2.0, 3.5, 2.5, 2.5]
# 4 2016-07-15 35 [39,31,73,34] 0 [20.5, 18.5, 38.0, 18.5]
# 5 2016-07-16 36 [3,2,3,3] 1 [21.0, 16.5, 38.0, 18.5]
# 6 2016-07-17 37 [3,8,3,3] 1 [3.0, 5.0, 3.0, 3.0]
# moving window of size 3
df["means"] = dm.apply(pd.Series).rolling(3, min_periods=0).mean().round(2).values.tolist()
df
# Out:
# date id measure result means
# 0 2016-07-11 31 [2,5,3,3] 1 [2.0, 5.0, 3.0, 3.0]
# 1 2016-07-12 32 [3,5,3,3] 1 [2.5, 5.0, 3.0, 3.0]
# 2 2016-07-13 33 [2,1,2,2] 1 [2.33, 3.67, 2.67, 2.67]
# 3 2016-07-14 34 [2,6,3,3] 1 [2.33, 4.0, 2.67, 2.67]
# 4 2016-07-15 35 [39,31,73,34] 0 [14.33, 12.67, 26.0, 13.0]
# 5 2016-07-16 36 [3,2,3,3] 1 [14.67, 13.0, 26.33, 13.33]
# 6 2016-07-17 37 [3,8,3,3] 1 [15.0, 13.67, 26.33, 13.33]
I have a dataset like the one reproduced in the last answer below.
I want to drop rows like 4, 5 & 7, where the majority of the columns are 0 (but not all of them). At the same time, I don't want to drop rows like 0 and 1, as they have very few 0 entries.
First, create a column counting the zeros in each row:
df['no_of_zeros'] = (df == 0).astype(int).sum(axis=1)
Define how many zeros are acceptable in a row and filter the dataframe accordingly:
df = df[df['no_of_zeros'] < 3].drop(['no_of_zeros'], axis=1)
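Alternatively, the helper column can be skipped and the threshold tied to the frame's width; a one-step variant:
df = df[(df == 0).sum(axis=1) < len(df.columns) / 2]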
Here is one way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 4],
[0, 0, 0, 1, 2]],
columns=['A', 'B', 'C', 'D', 'E'])
df = df[~((df == 0).astype(int).sum(axis=1) > len(df.columns) / 2)]
# A B C D E
# 0 0 1 2 3 4
Assuming "majority" means "more than half of the columns", this works:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'c2': {0: 76, 1: 45, 2: 47, 3: 92, 4: 0, 5: 0, 6: 26, 7: 0, 8: 71},
...: 'c3': {0: 0, 1: 3, 2: 6, 3: 9, 4: 0, 5: 0, 6: 12, 7: 0, 8: 15},
...: 'c4': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
...: 'c5': {0: 23, 1: 0, 2: 23, 3: 23, 4: 0, 5: 0, 6: 23, 7: 0, 8: 23},
...: 'c6': {0: 65, 1: 25, 2: 62, 3: 26, 4: 52, 5: 22, 6: 65, 7: 0, 8: 69},
...: 'c7': {0: 12, 1: 12, 2: 12, 3: 12, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12},
...: 'c8': {0: 0, 1: 0, 2: 8, 3: 9, 4: 0, 5: 0, 6: 4, 7: 0, 8: 4},
...: 'cl': {0: 5, 1: 7, 2: 8, 3: 15, 4: 0, 5: 0, 6: 2, 7: 0, 8: 5},
...: 'sr': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}})
...:
In [3]: df
Out[3]:
c2 c3 c4 c5 c6 c7 c8 cl sr
0 76 0 1 23 65 12 0 5 0
1 45 3 1 0 25 12 0 7 1
2 47 6 1 23 62 12 8 8 2
3 92 9 1 23 26 12 9 15 3
4 0 0 1 0 52 12 0 0 4
5 0 0 1 0 22 12 0 0 5
6 26 12 1 23 65 12 4 2 6
7 0 0 1 0 0 12 0 0 7
8 71 15 1 23 69 12 4 5 8
In [4]: df[((df == 0).sum(axis=1) <= len(df.columns) / 2)]
Out[4]:
c2 c3 c4 c5 c6 c7 c8 cl sr
0 76 0 1 23 65 12 0 5 0
1 45 3 1 0 25 12 0 7 1
2 47 6 1 23 62 12 8 8 2
3 92 9 1 23 26 12 9 15 3
6 26 12 1 23 65 12 4 2 6
8 71 15 1 23 69 12 4 5 8