Pandas group by selected dates - python

I have a dataframe that is very similar to this dataframe:
index  date       month
0      2019-12-1  12
1      2020-03-1  3
2      2020-07-1  7
3      2021-02-1  2
4      2021-09-1  9
And I want to snap every date to the closest month from a fixed set. The months need to be normalized like this:
Months            Normalized month
3, 4, 5           4
6, 7, 8, 9        8
1, 2, 10, 11, 12  12
So the output will be:
index  date       month
0      2019-12-1  12
1      2020-04-1  4
2      2020-08-1  8
3      2020-12-1  12
4      2021-08-1  8

You can iterate through the DataFrame with iterrows and rewrite the month and the date for each row:
import pandas as pd

df = pd.DataFrame(data={'date': ["2019-12-1", "2020-03-1", "2020-07-1", "2021-02-1", "2021-09-1"],
                        'month': [12, 3, 7, 2, 9]})

for index, row in df.iterrows():
    if row['month'] in [3, 4, 5]:
        df.loc[index, 'month'] = 4
        df.loc[index, 'date'] = row['date'][:5] + "04" + row['date'][7:]
    elif row['month'] in [6, 7, 8, 9]:
        df.loc[index, 'month'] = 8
        df.loc[index, 'date'] = row['date'][:5] + "08" + row['date'][7:]
    else:
        df.loc[index, 'month'] = 12
        df.loc[index, 'date'] = row['date'][:5] + "12" + row['date'][7:]

You can try creating a dictionary of months:
norm_month_dict = {3: 4, 4: 4, 5: 4, 6: 8, 7: 8, 8: 8, 9: 8, 1: 12, 2: 12, 10: 12, 11: 12, 12: 12}
then use this dictionary to map month values to their respective normalized month values (the column in the question is named month):
df['normalized_month'] = df['month'].map(norm_month_dict)
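If the date column also needs to be snapped to the normalized month (which the mapping above does not touch), here is a minimal sketch building on the same dictionary, assuming the YYYY-MM-D date strings shown in the question:
import pandas as pd

df = pd.DataFrame({'date': ["2019-12-1", "2020-03-1", "2020-07-1", "2021-02-1", "2021-09-1"],
                   'month': [12, 3, 7, 2, 9]})
norm_month_dict = {3: 4, 4: 4, 5: 4, 6: 8, 7: 8, 8: 8, 9: 8,
                   1: 12, 2: 12, 10: 12, 11: 12, 12: 12}

df['month'] = df['month'].map(norm_month_dict)
# splice the zero-padded normalized month back into the date string (positions 5-6)
df['date'] = df['date'].str[:5] + df['month'].astype(str).str.zfill(2) + df['date'].str[7:]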

You need to construct a dictionary from the second dataframe (assuming df1 and df2):
d = (
    df2.assign(Months=df2['Months'].str.split(', '))
       .explode('Months').astype(int)
       .set_index('Months')['Normalized month'].to_dict()
)
# {3: 4, 4: 4, 5: 4, 6: 8, 7: 8, 8: 8, 9: 8, 1: 12, 2: 12, 10: 12, 11: 12, 12: 12}
Then map the values:
df1['month'] = df1['month'].map(d)
output:
index date month
0 0 2019-12-1 12
1 1 2020-03-1 4
2 2 2020-07-1 8
3 3 2021-02-1 12
4 4 2021-09-1 8

Related

How to find percent change by row within groups python pandas

I have a sample of my dataframe as follows:
import pandas as pd

data = {'retailer': [2, 2, 2, 2, 2, 5, 5, 5, 5, 5],
        'store': [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
        'week': [2021110701, 2021101301, 2021100601, 2021092901, 2021092201, 2021110701, 2021101301, 2021100601, 2021092901, 2021092201],
        'dollars': [353136.2, 379263.8, 507892.1, 491528.2, 503602.8, 435025.2, 406698.5, 338383.5, 360845.1, 372385.2]
        }
data = pd.DataFrame(data)
I have sorted my columns by doing
data = data.sort_values(['retailer', 'store', 'week'], ascending=(True, True, False))
I would like to find the percent difference in dollars between consecutive rows WITHIN each group: essentially group by retailer, then store, find the percent difference in 'dollars' between each week and the week below it, and save that value in a column next to dollars.
Basically have the output look like:
data = {'retailer': [2, 2, 2, 2, 2, 5, 5, 5, 5, 5],
        'store': [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
        'week': [2021110701, 2021101301, 2021100601, 2021092901, 2021092201, 2021110701, 2021101301, 2021100601, 2021092901, 2021092201],
        'dollars': [353136.2, 379263.8, 507892.1, 491528.2, 503602.8, 435025.2, 406698.5, 338383.5, 360845.1, 372385.2],
        'pc_diff': [-0.06889030801252315, -0.2532591075939161, 0.03329188437204613, -0.02397643539710259, 'NaN', 0.06965036753270545, 0.20188632128930636, -0.062247208012523876, -0.030989684874694362, 'NaN']
        }
data = pd.DataFrame(data)
So for retailer 2, store 1, I am trying to find the percent difference between week 2021110701 and 2021101301, which would be (353136.2 - 379263.8) / 379263.8.
The NAs exist because there is no row below that one so there is nothing to find the percent change between (if that makes sense). Is there a way I can do this/is there a pandas equivalent of using a lag function?
You can use groupby+pct_change:
data.groupby(['retailer', 'store'])['dollars'].pct_change(-1)
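To keep the result next to the dollars column (as in the output below), assign it back; a minimal sketch, assuming data is the sorted frame built in the question. The explicit shift(-1) version is the pandas equivalent of a lag/lead function:
data['pc_diff'] = data.groupby(['retailer', 'store'])['dollars'].pct_change(-1)

# equivalent, spelled out with shift(-1) (the value of the week below, within each group)
nxt = data.groupby(['retailer', 'store'])['dollars'].shift(-1)
data['pc_diff'] = (data['dollars'] - nxt) / nxt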
output:
retailer store week dollars pc_diff
0 2 1 2021110701 353136.2 -0.068890
1 2 1 2021101301 379263.8 -0.253259
2 2 1 2021100601 507892.1 0.033292
3 2 1 2021092901 491528.2 -0.023976
4 2 1 2021092201 503602.8 NaN
5 5 7 2021110701 435025.2 0.069650
6 5 7 2021101301 406698.5 0.201886
7 5 7 2021100601 338383.5 -0.062247
8 5 7 2021092901 360845.1 -0.030990
9 5 7 2021092201 372385.2 NaN

how to convert pandas dataframe to dictionary like specified [duplicate]

This question already has answers here:
GroupBy results to dictionary of lists
(2 answers)
Closed 1 year ago.
This is the data frame:
course_id weight
0 1 10
1 1 40
2 1 50
3 2 40
4 2 60
5 3 90
6 3 10
I want to convert it to a dictionary like:
{1 : [10,40,50] , 2:[40,60] , 3 :[90,10]}
df = pd.read_csv(tests)
df = df[['course_id','weight']]
print(df)
You need to group by the course_id column, keep weight, collect the values into lists, and use to_dict() to get the final structure:
import pandas as pd
df = pd.DataFrame({'course_id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3},
                   'weight': {0: 10, 1: 40, 2: 50, 3: 40, 4: 60, 5: 90, 6: 10}})
result = df.groupby("course_id")['weight'].apply(list).to_dict()
print(result) # {1: [10, 40, 50], 2: [40, 60], 3: [90, 10]}
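If you prefer agg over apply, an equivalent spelling (same result on this frame):
result = df.groupby("course_id")['weight'].agg(list).to_dict()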

Iterate pandas data frame for rows consists of arrays and compute a moving average based on condition

I can't figure out a problem I am trying to solve.
I have a pandas data frame coming from this:
date, id, measure, result
2016-07-11, 31, "[2, 5, 3, 3]", 1
2016-07-12, 32, "[3, 5, 3, 3]", 1
2016-07-13, 33, "[2, 1, 2, 2]", 1
2016-07-14, 34, "[2, 6, 3, 3]", 1
2016-07-15, 35, "[39, 31, 73, 34]", 0
2016-07-16, 36, "[3, 2, 3, 3]", 1
2016-07-17, 37, "[3, 8, 3, 3]", 1
The measure column consists of arrays in string format.
I want a new moving-average-array column computed from the past 3 measurement records, excluding records where the result is 0. "Past 3 records" means that for id 34, the arrays of ids 31, 32, 33 are used.
The average is taken element-wise: every 1st point, 2nd point, 3rd point and 4th point across those arrays forms the moving-average-array.
It is not about averaging each array and then averaging those averages.
For the first 3 rows, because there is not enough history, I just want to use their own measurement. So the solution should look like this:
date, id, measure, result . Solution
2016-07-11, 31, "[2, 5, 3, 3]", 1, "[2, 5, 3, 3]"
2016-07-12, 32, "[3, 5, 3, 3]", 1, "[3, 5, 3, 3]"
2016-07-13, 33, "[2, 1, 2, 2]", 1, "[2, 1, 2, 2]"
2016-07-14, 34, "[2, 6, 3, 3]", 1, "[2.3, 3.6, 2.6, 2.6]"
2016-07-15, 35, "[39, 31, 73, 34]", 0, "[2.3, 4, 2.6, 2.6]"
2016-07-16, 36, "[3, 2, 3, 3]", 1, "[2.3, 4, 2.6, 2.6]"
2016-07-17, 37, "[3, 8, 3, 3]", 1, "[2.3, 3, 2.6, 2.6]"
The real data is bigger, and result 0 may repeat 2 or more times in a row. I think the key is keeping track of the previous OK results and taking their averages properly, but I spent time on it and could not get it to work.
I am posting the dataframe here:
mydict = {'date': {0: '2016-07-11',
                   1: '2016-07-12',
                   2: '2016-07-13',
                   3: '2016-07-14',
                   4: '2016-07-15',
                   5: '2016-07-16',
                   6: '2016-07-17'},
          'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
          'measure': {0: '[2, 5, 3, 3]',
                      1: '[3, 5, 3, 3]',
                      2: '[2, 1, 2, 2]',
                      3: '[2, 6, 3, 3]',
                      4: '[39, 31, 73, 34]',
                      5: '[3, 2, 3, 3]',
                      6: '[3, 8, 3, 3]'},
          'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)
Thank you for giving directions or pointing out how to approach this.
Solution using only 1 for loop:
This uses the data exactly as posted in the question (mydict and df = pd.DataFrame(mydict)).
I defined a simple function to calculate the element-wise means and return a list. Then, loop over the dataframe applying the rules:
def calc_mean(in_list):
    # element-wise mean of the three arrays currently held in in_list
    p0 = round((in_list[0][0] + in_list[1][0] + in_list[2][0]) / 3, 1)
    p1 = round((in_list[0][1] + in_list[1][1] + in_list[2][1]) / 3, 1)
    p2 = round((in_list[0][2] + in_list[1][2] + in_list[2][2]) / 3, 1)
    p3 = round((in_list[0][3] + in_list[1][3] + in_list[2][3]) / 3, 1)
    return [p0, p1, p2, p3]

Solution = []
aux_list = []   # holds the last 3 valid (result != 0) measure arrays
for index, row in df.iterrows():
    if index in [0, 1, 2]:
        # not enough history yet: keep the row's own measure
        Solution.append(row.measure)
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
    else:
        Solution.append('[' + ', '.join(map(str, calc_mean(aux_list))) + ']')
        if row.result > 0:
            # only valid rows enter the history window
            aux_list.pop(0)
            aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])

df['Solution'] = Solution
The output matches the desired solution above, except that the result is rounded to 1 decimal place, a bit different from your desired output; that made more sense to me.
EDIT:
As suggested in the comments by @Frenchy, to also handle result == 0 within the first 3 rows, the first if clause needs a small change:
if index in [0, 1, 2] or len(aux_list) < 3:
    Solution.append(row.measure)
    if row.result > 0:
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
You can use pd.eval to turn the str-of-list values into proper lists, restricted to the rows of measure where result is not 0. Use rolling with mean and then shift so that each row gets the rolling average of the previous 3 valid rows. Then map to str once those values are converted to a list of lists with values and tolist. Finally, replace the first three rows with their own measure and ffill the remaining missing data:
# rolling mean over the valid (result != 0) rows only, shifted one step so each row
# only sees strictly earlier valid rows; assigned to the rows whose previous result is not 0
df.loc[df.result.shift() != 0, 'solution'] = list(map(str,
    pd.DataFrame(pd.eval(df[df.result != 0].measure))
      .rolling(3).mean().shift().values.tolist()))
# the first three rows keep their own measure; the remaining gaps are forward-filled
df.loc[:2, 'solution'] = df.loc[:2, 'measure']
df.solution = df.solution.ffill()
Here's another solution:
# get data to reproduce the example
import pandas as pd
from io import StringIO
data = StringIO("""
date;id;measure;result
2016-07-11;31;"[2,5,3,3]";1
2016-07-12;32;"[3,5,3,3]";1
2016-07-13;33;"[2,1,2,2]";1
2016-07-14;34;"[2,6,3,3]";1
2016-07-15;35;"[39,31,73,34]";0
2016-07-16;36;"[3,2,3,3]";1
2016-07-17;37;"[3,8,3,3]";1
""")
df = pd.read_csv(data, sep=";")
df
# Out:
# date id measure result
# 0 2016-07-11 31 [2,5,3,3] 1
# 1 2016-07-12 32 [3,5,3,3] 1
# 2 2016-07-13 33 [2,1,2,2] 1
# 3 2016-07-14 34 [2,6,3,3] 1
# 4 2016-07-15 35 [39,31,73,34] 0
# 5 2016-07-16 36 [3,2,3,3] 1
# 6 2016-07-17 37 [3,8,3,3] 1
# convert values in measure column to lists
from ast import literal_eval
dm = df['measure'].apply(literal_eval)
# apply rolling mean with window size 2 and recollect the values into lists in the means column
df["means"] = dm.apply(pd.Series).rolling(2, min_periods=0).mean().values.tolist()
df
# Out:
# date id measure result means
# 0 2016-07-11 31 [2,5,3,3] 1 [2.0, 5.0, 3.0, 3.0]
# 1 2016-07-12 32 [3,5,3,3] 1 [2.5, 5.0, 3.0, 3.0]
# 2 2016-07-13 33 [2,1,2,2] 1 [2.5, 3.0, 2.5, 2.5]
# 3 2016-07-14 34 [2,6,3,3] 1 [2.0, 3.5, 2.5, 2.5]
# 4 2016-07-15 35 [39,31,73,34] 0 [20.5, 18.5, 38.0, 18.5]
# 5 2016-07-16 36 [3,2,3,3] 1 [21.0, 16.5, 38.0, 18.5]
# 6 2016-07-17 37 [3,8,3,3] 1 [3.0, 5.0, 3.0, 3.0]
# moving window of size 3
df["means"] = dm.apply(pd.Series).rolling(3, min_periods=0).mean().round(2).values.tolist()
df
# Out:
# date id measure result means
# 0 2016-07-11 31 [2,5,3,3] 1 [2.0, 5.0, 3.0, 3.0]
# 1 2016-07-12 32 [3,5,3,3] 1 [2.5, 5.0, 3.0, 3.0]
# 2 2016-07-13 33 [2,1,2,2] 1 [2.33, 3.67, 2.67, 2.67]
# 3 2016-07-14 34 [2,6,3,3] 1 [2.33, 4.0, 2.67, 2.67]
# 4 2016-07-15 35 [39,31,73,34] 0 [14.33, 12.67, 26.0, 13.0]
# 5 2016-07-16 36 [3,2,3,3] 1 [14.67, 13.0, 26.33, 13.33]
# 6 2016-07-17 37 [3,8,3,3] 1 [15.0, 13.67, 26.33, 13.33]

Drop pandas DF rows having majority of 0's

I have a dataset like the one reproduced in the last answer below, and I want to drop rows like 4, 5 & 7, where the majority of the columns are 0 (but not all of them). At the same time, I don't want to drop rows like 0 and 1, as they have only a few 0 entries.
First create a column that counts the zeros in each row:
df['no_of_zeros'] = (df == 0).astype(int).sum(axis=1)
Then define how many zeros are acceptable per row and filter the dataframe accordingly:
df = df[df['no_of_zeros'] < 3].drop(['no_of_zeros'], axis=1)
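If "majority" should be derived from the frame rather than hard-coded as 3, here is a minimal sketch computing the threshold from the number of columns (the column names and values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 5], 'B': [1, 0, 6], 'C': [2, 0, 0], 'D': [3, 1, 0], 'E': [4, 2, 0]})

threshold = len(df.columns) / 2        # "majority" = more than half of the columns
zero_counts = (df == 0).sum(axis=1)    # number of zeros per row
df = df[zero_counts <= threshold]      # keep rows that are not mostly zeros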
Here is one way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 4],
                   [0, 0, 0, 1, 2]],
                  columns=['A', 'B', 'C', 'D', 'E'])
df = df[~((df == 0).astype(int).sum(axis=1) > len(df.columns) / 2)]
# A B C D E
# 0 0 1 2 3 4
Assuming "majority" means "more than half of the columns", this works:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'c2': {0: 76, 1: 45, 2: 47, 3: 92, 4: 0, 5: 0, 6: 26, 7: 0, 8: 71},
...: 'c3': {0: 0, 1: 3, 2: 6, 3: 9, 4: 0, 5: 0, 6: 12, 7: 0, 8: 15},
...: 'c4': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
...: 'c5': {0: 23, 1: 0, 2: 23, 3: 23, 4: 0, 5: 0, 6: 23, 7: 0, 8: 23},
...: 'c6': {0: 65, 1: 25, 2: 62, 3: 26, 4: 52, 5: 22, 6: 65, 7: 0, 8: 69},
...: 'c7': {0: 12, 1: 12, 2: 12, 3: 12, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12},
...: 'c8': {0: 0, 1: 0, 2: 8, 3: 9, 4: 0, 5: 0, 6: 4, 7: 0, 8: 4},
...: 'cl': {0: 5, 1: 7, 2: 8, 3: 15, 4: 0, 5: 0, 6: 2, 7: 0, 8: 5},
...: 'sr': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}})
...:
In [3]: df
Out[3]:
c2 c3 c4 c5 c6 c7 c8 cl sr
0 76 0 1 23 65 12 0 5 0
1 45 3 1 0 25 12 0 7 1
2 47 6 1 23 62 12 8 8 2
3 92 9 1 23 26 12 9 15 3
4 0 0 1 0 52 12 0 0 4
5 0 0 1 0 22 12 0 0 5
6 26 12 1 23 65 12 4 2 6
7 0 0 1 0 0 12 0 0 7
8 71 15 1 23 69 12 4 5 8
In [4]: df[((df == 0).sum(axis=1) <= len(df.columns) / 2)]
Out[4]:
c2 c3 c4 c5 c6 c7 c8 cl sr
0 76 0 1 23 65 12 0 5 0
1 45 3 1 0 25 12 0 7 1
2 47 6 1 23 62 12 8 8 2
3 92 9 1 23 26 12 9 15 3
6 26 12 1 23 65 12 4 2 6
8 71 15 1 23 69 12 4 5 8

Find max values in a dict containing lists

The dict has years as keys, and for each year a list of the temperatures for all 12 months of that year. My goal is to print out a table starting with the year, then a new line for each month with that month's temperature.
The main thing is to mark the highest temp of all years with (ATH) and the highest temp in each year with (YearHighest).
My current code:
temp_dict = {
    "2010": [2, 3, 4, 5, 7, 3, 20, 29, 34, 2, 10, 1],
    "2011": [2, 7, 4, 5, 9, 3, 20, 9, 34, 2, 10, 10]
}
for key in temp_dict:
    print("Year", key, ":")
    x = 0
    for temps in temp_dict[key]:
        x = x + 1
        print("Month " + str(x) + ":%3d" % temps)
    print()
I'm not sure how to do the max part; I was thinking of something like this, but I can't get it to work:
for key in temp_dict:
    ATH = temp_dict[key]
    YearHigh = temp_dict[key][0]
    for temps in temp_dict[key]:
        if temps >= temp_dict[key][0]:
            YearHigh = temps
        if YearHigh >= ATH:
            ATH = YearHigh
This is how I want my output to look:
Year 2011 :
Month1: 2
Month2: 7
Month3: 4
Month4: 5
Month5: 9
Month6: 3
Month7: 20
Month8: 9
Month9: 34 (YearHighest)(ATH)
Month10: 2
Month11: 10
Month12: 10
Year 2010 :
Month1: 2
Month2: 3
Month3: 4
Month4: 5
Month5: 7
Month6: 3
Month7: 20
Month8: 29
Month9: 34 (YearHighest)(ATH)
Month10: 2
Month11: 10
Month12: 1
Python has a built-in function max; it's considered good practice to use it.
Max in year:
max(temp_dict["2010"])
Max all time:
max(sum(temp_dict.values(), []))
sum(lists, []) does list flattening, equivalent to
[] + lists[0] + lists[1]...
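A minimal sketch putting both together on the question's data; itertools.chain is an equivalent alternative to sum(..., []) for the flattening:
from itertools import chain

temp_dict = {
    "2010": [2, 3, 4, 5, 7, 3, 20, 29, 34, 2, 10, 1],
    "2011": [2, 7, 4, 5, 9, 3, 20, 9, 34, 2, 10, 10],
}

ath = max(chain.from_iterable(temp_dict.values()))  # all-time high over every year
for year, temps in temp_dict.items():
    year_high = max(temps)                           # highest temp within this year
    print(year, "high:", year_high, "(ATH)" if year_high == ath else "")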
Python has a built-in function max you can utilize:
for key in temp_dict:
    print("Year", key, ":")
    temps = temp_dict[key]
    max_temp = max(temps)
    max_index = temps.index(max_temp)
    for index, temps in enumerate(temps):
        r = "Month " + str(index + 1) + ":%3d" % temps
        if index == max_index:
            r += "(YearHighest)(ATH)"
        print(r)
You can try something like this:
temp_dict = {
    "2010": [2, 3, 4, 5, 7, 3, 20, 29, 34, 2, 10, 1],
    "2011": [2, 7, 4, 5, 9, 3, 20, 9, 34, 2, 10, 10]
}
# defines the max of all years with a list comprehension
global_max_temp = max([max(year_temps) for year_temps in temp_dict.values()])
# iterates through each year
for year, temps in temp_dict.items():
    print("Year {}".format(year))
    for i, temp in enumerate(temps):
        # prepares the output
        temp_string = ["Month{}: {}".format(i + 1, temp)]
        # builds a list of flags to be displayed
        flags = []
        if temp == max(temps):
            # max in year flag
            flags.append("YearHighest")
        if temp == global_max_temp:
            # absolute max flag
            flags.append("ATH")
        # joins temp_string and flags in a single line and prints it
        print(" ".join(temp_string + ["({})".format(flag) for flag in flags]))
Useful links from Python's documentation: enumerate, list comprehensions, max
This is my code.
temp_dict = {
    "2010": [2, 3, 4, 5, 7, 3, 20, 29, 34, 2, 10, 1],
    "2011": [2, 7, 4, 5, 9, 3, 20, 9, 34, 2, 10, 10]
}
# Find the highest temp of all years
ath = max([max(v) for v in temp_dict.values()])
for key in temp_dict:
    # Output Year
    print("Year{k}:".format(k=key))
    x = 0
    # Find max
    max_value = max(temp_dict[key])
    for temps in temp_dict[key]:
        # Output Month
        x = x + 1
        s = "Month {x}:{v:3d}".format(x=str(x), v=temps)
        # Tag the max value
        if max_value == temps:
            s += "(YearHighest)"
        if ath == temps:
            s += "(ATH)"
        print(s)
    print()
And this is my output.
Year2010:
Month 1: 2
Month 2: 3
Month 3: 4
Month 4: 5
Month 5: 7
Month 6: 3
Month 7: 20
Month 8: 29
Month 9: 34(YearHighest)(ATH)
Month 10: 2
Month 11: 10
Month 12: 1
Year2011:
Month 1: 2
Month 2: 7
Month 3: 4
Month 4: 5
Month 5: 9
Month 6: 3
Month 7: 20
Month 8: 9
Month 9: 34(YearHighest)(ATH)
Month 10: 2
Month 11: 10
Month 12: 10
You need to use the max function here; it quickly finds the maximum value among the numbers in a list.
