Avoiding itertuples and predefined numpy arrays when iterating pandas dataframe quickly - python

I am trying to extend the first dataframe based on the information in the second without using itertuples() as it takes too long.
The first dataframe contains data on special events, let's say an avalanche going down in a ski resort. In the end I want to correlate the mass of the avalanche to the available status data from the city, like temperature or amount of snow. All status data from all cities is in one DataFrame. All information on the avalanches is in the other DataFrame, which I would like to extend by the status data IF the status data was collected on the same day in the same city at an earlier time.
Here is the setup, I hope it is easy to understand/paste:
import pandas as pd
import numpy as np
import datetime
status = pd.DataFrame({'location': {0: 'Silvretta', 1: 'Landeck', 2: 'Silvretta', 3: 'Landeck', 4: 'Silvretta', 5: 'Landeck', 6: 'Silvretta', 7: 'Landeck'}, 'date': {0: datetime.date(2022, 3, 30), 1: datetime.date(2022, 3, 30), 2: datetime.date(2022, 3, 30), 3: datetime.date(2022, 3, 30), 4: datetime.date(2022, 3, 31), 5: datetime.date(2022, 3, 31), 6: datetime.date(2022, 3, 31), 7: datetime.date(2022, 3, 31)}, 'time': {0: datetime.time(8, 0), 1: datetime.time(8, 0), 2: datetime.time(20, 0), 3: datetime.time(20, 0), 4: datetime.time(8, 0), 5: datetime.time(8, 0), 6: datetime.time(20, 0), 7: datetime.time(20, 0)}, 'wind_lvl': {0: 8, 1: 5, 2: 10, 3: 7, 4: 8, 5: 10, 6: 2, 7: 1}, 'snow_lvl': {0: 10, 1: 11, 2: 7, 3: 9, 4: 4, 5: 0, 6: 4, 7: 4}, 'number_skiers': {0: 26, 1: 87, 2: 8, 3: 25, 4: 90, 5: 86, 6: 55, 7: 31}})
avalanches = pd.DataFrame({'location': {0: 'Silvretta', 1: 'Landeck', 2: 'Silvretta', 3: 'Silvretta'}, 'date': {0: datetime.date(2022, 3, 30), 1: datetime.date(2022, 3, 30), 2: datetime.date(2022, 3, 31), 3: datetime.date(2022, 3, 31)}, 'time': {0: datetime.time(7, 35), 1: datetime.time(12, 37), 2: datetime.time(12, 42), 3: datetime.time(23, 12)}, 'mass': {0: 10, 1: 15, 2: 8, 3: 7}})
Now with itertuples I can do the following non-pythonic loop:
wind_lvls = np.full(len(avalanches), np.nan)
snow_lvls = np.full(len(avalanches), np.nan)
number_ski = np.full(len(avalanches), np.nan)
for idx, ava in enumerate(avalanches.itertuples()):
    relevant_status = status[(status.date == ava.date) & (status.location == ava.location) & (status.time < ava.time)]
    if len(relevant_status) > 0:
        wind_lvls[idx] = relevant_status.wind_lvl.iloc[-1]
        snow_lvls[idx] = relevant_status.snow_lvl.iloc[-1]
        number_ski[idx] = relevant_status.number_skiers.iloc[-1]
avalanches['wind_lvl'] = wind_lvls
avalanches['snow_lvls'] = snow_lvls
avalanches['number_ski'] = number_ski
This gives me the correct avalanches table, and I can call avalanches.dropna().corr()['mass'], which is the information I was after. However, the process takes way too long on a non-toy dataset.
To avoid the loop and all the manual setup of numpy arrays, I tried the following:
avalanches['info'] = avalanches.apply(lambda row: status[(status.location == row.location) & (status.date == row.date) & (status.time < row.time)], axis=1)
This at least gives me the relevant information in a cell of the avalanches df, but I still need to get the latest row of the info and unpack it. I cannot use avalanches['info'] = avalanches.apply(lambda row: status[(status.location == row.location) & (status.date == row.date) & (status.time < row.time)][-1], axis=1) because there might be no match at all (which should result in NaNs). Should I just pack the whole function into a lambda function, and would that actually speed up the process?
Could you help me specify a function to apply that would be faster than my itertuples() approach? I feel like I am missing something.
Thanks in advance!
Edit: I updated the status DataFrame below with an additional row measured at 12:00:
status = pd.DataFrame({'location': {0: 'Silvretta', 1: 'Landeck', 2: 'Silvretta', 3: 'Landeck', 4: 'Silvretta', 5: 'Landeck', 6: 'Silvretta', 7: 'Silvretta', 8: 'Landeck'}, 'date': {0: datetime.date(2022, 3, 30), 1: datetime.date(2022, 3, 30), 2: datetime.date(2022, 3, 30), 3: datetime.date(2022, 3, 30), 4: datetime.date(2022, 3, 31), 5: datetime.date(2022, 3, 31), 6: datetime.date(2022, 3, 31), 7: datetime.date(2022, 3, 31), 8: datetime.date(2022, 3, 31)}, 'time': {0: datetime.time(8, 0), 1: datetime.time(8, 0), 2: datetime.time(20, 0), 3: datetime.time(20, 0), 4: datetime.time(8, 0), 5: datetime.time(8, 0), 6: datetime.time(12, 0), 7: datetime.time(20, 0), 8: datetime.time(20, 0)}, 'wind_lvl': {0: 8, 1: 5, 2: 10, 3: 7, 4: 8, 5: 10, 6: 5, 7: 2, 8: 1}, 'snow_lvl': {0: 10, 1: 11, 2: 7, 3: 9, 4: 4, 5: 0, 6: 2, 7: 4, 8: 4}, 'number_skiers': {0: 26, 1: 87, 2: 8, 3: 25, 4: 90, 5: 86, 6: 62, 7: 55, 8: 31}})
Now row 2 of avalanches after itertuples() is
Silvretta 2022-03-31 12:42:00 8 5.0 2.0 62.0
The answer from @Laurent leads to
Silvretta 2022-03-31 12:42:00 8 8.0 4.0 90.0

Given your edits, here is one way to do it:
avalanches = pd.concat([avalanches, status]).sort_values(
    by=["location", "date", "time"]
)
dfs = [
    avalanches.loc[avalanches["location"] == location, :]
    for location in avalanches["location"].unique()
]
dfs = [
    pd.concat(
        [
            df[["location", "date", "time", "mass"]],
            df[["wind_lvl", "snow_lvl", "number_skiers"]].shift(1),
        ],
        axis=1,
    )
    for df in dfs
]
avalanches = pd.concat(dfs).dropna(subset="mass").sort_values(by=["time"])
So that:
print(avalanches)
# Output
    location        date      time  mass  wind_lvl  snow_lvl  number_skiers
0  Silvretta  2022-03-30  07:35:00  10.0       NaN       NaN            NaN
1    Landeck  2022-03-30  12:37:00  15.0       5.0      11.0           87.0
2  Silvretta  2022-03-31  12:42:00   8.0       5.0       2.0           62.0
3  Silvretta  2022-03-31  23:12:00   7.0       2.0       4.0           55.0

Thanks a lot to @Laurent, who showed me how to concat the DataFrames into one and work on a joint DataFrame. In the end I used the following:
avalanches = pd.concat([avalanches, status]).sort_values('time').reset_index(drop=True)
avalanches[["wind_lvl", "snow_lvl", "number_skiers"]] = avalanches.groupby(by=['location', 'date'])[["wind_lvl", "snow_lvl", "number_skiers"]].fillna(method='ffill')
avalanches.dropna(inplace=True)
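For completeness, pd.merge_asof can express the same "latest status strictly before the event" lookup in one vectorized call. A minimal sketch, assuming you start again from the original two frames and combine the date/time pairs into a single timestamp column (the 'ts' helper column name is mine, not from the question):
import pandas as pd

# Build one sortable timestamp per row from the separate date and time columns
status['ts'] = pd.to_datetime(status['date'].astype(str) + ' ' + status['time'].astype(str))
avalanches['ts'] = pd.to_datetime(avalanches['date'].astype(str) + ' ' + avalanches['time'].astype(str))

# For each avalanche, pick the latest status row with a strictly earlier time,
# matching on the same city and the same day
merged = pd.merge_asof(
    avalanches.sort_values('ts'),
    status.sort_values('ts')[['location', 'date', 'ts', 'wind_lvl', 'snow_lvl', 'number_skiers']],
    on='ts',
    by=['location', 'date'],    # only match status from the same city and day
    allow_exact_matches=False,  # strictly earlier time, as in the loop
)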

Related

Calculating the proportion of total 'yes' values in a group

I have a dataframe that looks like this:
chr  start  end    plus  minus  total  in_control  sites_in_cluster  mean  cluster
1    1000   1005   6     7      13     Y           3                 6     36346
1    1007   10012  3     1      4      N           3                 6     36346
1    10014  10020  0     1      1      Y           3                 6     36346
2    33532  33554  1     1      2      N           1                 2     22123
cluster is an ID assigned to each row; in this case, we have 3 "sites" in cluster 36346.
In this cluster, two of these sites are in the control (in_control==Y)
I want to create an additional column, which tells me what proportion of the sites are in the control. i.e. (sum(in_control==Y) for a cluster)/sites_in_cluster
In this example, we have two rows with in_control==Y and 3 sites_in_cluster in cluster 36346. Therefore, cluster_sites_in_control would be 2/3 = 0.66, whereas cluster 22123 only has one site, which isn't in the control, so it would be 0/1 = 0:
chr  start  end    plus  minus  total  in_control  sites_in_cluster  mean  cluster  cluster_sites_in_control
1    1000   1005   6     7      13     Y           3                 6     36346    0.66
1    1007   10012  3     1      4      N           3                 6     36346    0.66
1    10014  10020  0     1      1      Y           3                 6     36346    0.66
2    33532  33554  1     1      2      N           1                 2     22123    0.00
I have created code which seemingly accomplishes this; however, it seems extremely roundabout and I'm certain there's a better solution out there:
import pandas as pd
#get the number of sites in a control that are 'Y'
number_in_control = pd.DataFrame(intersect_in_control.groupby(['cluster']).in_control.value_counts().unstack(fill_value=0).loc[:,'Y'])
#get the number of breaksites for that cluster
number_of_breaksites = pd.DataFrame(intersect_in_control.groupby(['cluster'])['sites_in_cluster'].count())
#combine these two dataframes
combined_dataframe = pd.concat([number_in_control.reset_index(drop=False), number_of_breaksites.reset_index(drop=True)], axis=1)
#calculate the desired column
combined_dataframe["proportion_in_control"] = combined_dataframe["Y"]/combined_dataframe["sites_in_cluster"]
#left join this new dataframe to the original whilst dropping undesired columns.
cluster_in_control = intersect_in_control.merge((combined_dataframe.drop(["Y","sites_in_cluster"], axis = 1)), on='cluster', how='left')
Some rows of the df as example data:
{'chr': {0: 'chr14',
1: 'chr2',
2: 'chr1',
3: 'chr10',
4: 'chr17',
5: 'chr17',
6: 'chr2',
7: 'chr2',
8: 'chr2',
9: 'chr1',
10: 'chr1'},
'start': {0: 23016497,
1: 133031338,
2: 64081726,
3: 28671025,
4: 45219225,
5: 45219225,
6: 133026750,
7: 133026761,
8: 133026769,
9: 1510391,
10: 15853061},
'end': {0: 23016501,
1: 133031342,
2: 64081732,
3: 28671030,
4: 45219234,
5: 45219234,
6: 133026755,
7: 133026763,
8: 133026770,
9: 1510395,
10: 15853067},
'plus_count': {0: 2,
1: 0,
2: 5,
3: 1,
4: 6,
5: 6,
6: 14,
7: 2,
8: 0,
9: 2,
10: 4},
'minus_count': {0: 6,
1: 7,
2: 1,
3: 5,
4: 0,
5: 0,
6: 0,
7: 0,
8: 2,
9: 3,
10: 1},
'count': {0: 8, 1: 7, 2: 6, 3: 6, 4: 6, 5: 6, 6: 14, 7: 2, 8: 2, 9: 5, 10: 5},
'in_control': {0: 'N',
1: 'N',
2: 'Y',
3: 'N',
4: 'Y',
5: 'Y',
6: 'N',
7: 'Y',
8: 'N',
9: 'Y',
10: 'Y'},
'total_breaks': {0: 8,
1: 7,
2: 6,
3: 6,
4: 6,
5: 6,
6: 18,
7: 18,
8: 18,
9: 5,
10: 5},
'sites_in_cluster': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 3,
7: 3,
8: 3,
9: 1,
10: 1},
'mean_breaks_per_site': {0: 8.0,
1: 7.0,
2: 6.0,
3: 6.0,
4: 6.0,
5: 6.0,
6: 6.0,
7: 6.0,
8: 6.0,
9: 5.0,
10: 5.0},
'cluster': {0: 22665,
1: 24664,
2: 3484,
3: 13818,
4: 23640,
5: 23640,
6: 24652,
7: 24652,
8: 24652,
9: 48,
10: 769}}
Thanks in advance for any help :)
For the percentage it is possible to simplify the solution with a mean per boolean column, and for creating the new column use GroupBy.transform. It works well because True values are processed like 1:
df['cluster_sites_in_control'] = (df['in_control'].eq('Y')
                                    .groupby(df['cluster'])
                                    .transform('mean'))
print (df)
  chr  start    end  plus  minus  total in_control  sites_in_cluster  mean  \
0   1   1000   1005     6      7     13          Y                 3     6
1   1   1007  10012     3      1      4          N                 3     6
2   1  10014  10020     0      1      1          Y                 3     6
3   2  33532  33554     1      1      2          N                 1     2

   cluster  cluster_sites_in_control
0    36346                  0.666667
1    36346                  0.666667
2    36346                  0.666667
3    22123                  0.000000
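An equivalent, more explicit form (a sketch with the same column names) separates the 'Y' count from the group size, which can help if you later need the raw counts too:
# Count the 'Y' rows per cluster and divide by the cluster size
in_control_count = df['in_control'].eq('Y').groupby(df['cluster']).transform('sum')
cluster_size = df.groupby('cluster')['cluster'].transform('size')
df['cluster_sites_in_control'] = in_control_count / cluster_size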

Python - speed up iteration over a nested dictionary

I want to ask you for some advice to speed up my code. I know that you can see many mistakes, but I need your knowledge and help to find where the problem lies and how I can improve this code.
Background - what the application does:
Use OpenPyXL
Open an Excel file, read the data and put it into a nested dictionary:
2a. 1st level for rows
2b. 2nd level for items
Example:
{1: '#5C\Qopen#', 2: '20386239', 3: '3000133215', 4: 'RA', 5: None, 6: 'Vendor2', 7: 'IM45', 8: '#FR\QNot due#', 9: None, 10: None, 11: 'E1', 12: 'DNS', 13: datetime.datetime(2019, 12, 27, 0, 0), 14: datetime.datetime(2019, 12, 26, 0, 0), 15: -21501, 16: 'GBP', 17: -21501, 18: 'GBP', 19: datetime.datetime(2019, 12, 26, 0, 0), 20: datetime.datetime(2020, 2, 9, 0, 0)}
{2: '#5C\Qopen#', 2: '20386239', 3: '3000133215', 4: 'RA', 5: None, 6: 'Vendor1', 7: 'IM45', 8: '#FR\QNot due#', 9: None, 10: None, 11: 'E1', 12: 'DNS', 13: datetime.datetime(2019, 12, 27, 0, 0), 14: datetime.datetime(2019, 12, 26, 0, 0), 15: -21501, 16: 'GBP', 17: -21501, 18: 'GBP', 19: datetime.datetime(2019, 12, 26, 0, 0), 20: datetime.datetime(2020, 2, 9, 0, 0)}
{3: '#5C\Qopen#', 2: '20386239', 3: '3000133215', 4: 'RA', 5: None, 6: 'Vendor1', 7: 'IM45', 8: '#FR\QNot due#', 9: None, 10: None, 11: 'E1', 12: 'DNS', 13: datetime.datetime(2019, 12, 27, 0, 0), 14: datetime.datetime(2019, 12, 26, 0, 0), 15: -21501, 16: 'EUR', 17: -21501, 18: 'GBP', 19: datetime.datetime(2019, 12, 26, 0, 0), 20: datetime.datetime(2020, 2, 9, 0, 0)}
The script should look through the whole data, comparing vendor and currency, and check whether a particular vendor has different currencies (I mean e.g. when Vendor1 doesn't have 100% one specific currency [GBP or other]).
When this happens, put a text like "something something", for example, in column 17 in the same row where the different currency is.
My code mainly works properly, but it is very slow, e.g. when I need to compare a file with 30,000 rows.
Do you know how I can improve it?
Thank you
next_row2 = 1
numerkolumny = 1
nastepny = 1
numer_vendora = 1
ilosc_gbp = 0
ilosc_inne = 0
linijkadanych = {}
lista_vendorow = {}
for zmienna2 in progressbar.progressbar(assets2, redirect_stdout=True):
    for iteracja in assets2:
        if assets2[zmienna2][6] not in lista_vendorow.values():
            if nastepny < len(assets2):
                if assets2[zmienna2][6] == assets2[nastepny+1][6]:
                    if assets2[nastepny+1][16] == "GBP":  # if GBP was found, count it towards GBP
                        ilosc_gbp = ilosc_gbp + 1
                        nastepny = nastepny + 1
                    else:  # if another currency was found, count it towards other currencies
                        ilosc_inne = ilosc_inne + 1
                        nastepny = nastepny + 1
                else:
                    nastepny = nastepny + 1
            if nastepny >= len(assets2):  # if all rows have been iterated over, compute the result
                suma_walut = ilosc_gbp + ilosc_inne  # sum all currencies
                # if a deviation is found - report it!
                if (suma_walut != ilosc_gbp) and (suma_walut != ilosc_inne):
                    for waluty in assets2:  # row number
                        for waluty2 in assets2[waluty]:  # column number
                            if assets2[waluty][6] == assets2[zmienna2][6]:
                                if ilosc_gbp > ilosc_inne:
                                    result_tab.cell(column=17, row=waluty+1, value="Waluta other than GBP. Check!").font = style_blad_bold
                                else:
                                    result_tab.cell(column=17, row=waluty+1, value="Other currencies between GBP!. Check!").font = style_blad_bold
                lista_vendorow[numer_vendora] = assets2[zmienna2][6]
                ilosc_gbp = 0  # reset the counters, we are counting a new vendor
                ilosc_inne = 0  # reset the counters, we are counting a new vendor
                nastepny = 1  # reset the counters, we are counting a new vendor
                numer_vendora = numer_vendora + 1
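For comparison, here is a vectorized sketch with pandas. It assumes assets2 is shaped as in the example (a dict mapping row numbers to {column number: value} dicts, with the vendor in column 6 and the currency in column 16); result_tab and style_blad_bold are the objects from the question, and the flag message is illustrative:
import pandas as pd

# assets2: {row_number: {column_number: value, ...}, ...}
df = pd.DataFrame.from_dict(assets2, orient='index')

# A vendor is "mixed" if more than one distinct currency appears in its rows
mixed = df.groupby(6)[16].transform('nunique') > 1

# Flag every row that belongs to a mixed vendor
for row_number in df.index[mixed]:
    result_tab.cell(column=17, row=row_number + 1,
                    value="Mixed currencies for this vendor. Check!").font = style_blad_bold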

Iterate a pandas data frame whose rows consist of arrays and compute a moving average based on a condition

I can't figure out how to solve the following problem.
I have a pandas data frame coming from this:
date, id, measure, result
2016-07-11, 31, "[2, 5, 3, 3]", 1
2016-07-12, 32, "[3, 5, 3, 3]", 1
2016-07-13, 33, "[2, 1, 2, 2]", 1
2016-07-14, 34, "[2, 6, 3, 3]", 1
2016-07-15, 35, "[39, 31, 73, 34]", 0
2016-07-16, 36, "[3, 2, 3, 3]", 1
2016-07-17, 37, "[3, 8, 3, 3]", 1
The measure column consists of arrays in string format.
I want a new moving-average-array column computed from the past 3 measurement records, excluding records where the result is 0. "Past 3 records" means that for id 34, the arrays of ids 31, 32, 33 are used.
It is about taking the average of every 1st point, 2nd point, 3rd point and 4th point to get this moving-average-array.
It is not about getting the average of the 1st array, the 2nd array ... and then averaging those averages, no.
For the first 3 rows, because there is not enough history, I just want to use their own measurement. So the solution should look like this:
date, id, measure, result, solution
2016-07-11, 31, "[2, 5, 3, 3]", 1, "[2, 5, 3, 3]"
2016-07-12, 32, "[3, 5, 3, 3]", 1, "[3, 5, 3, 3]"
2016-07-13, 33, "[2, 1, 2, 2]", 1, "[2, 1, 2, 2]"
2016-07-14, 34, "[2, 6, 3, 3]", 1, "[2.3, 3.6, 2.6, 2.6]"
2016-07-15, 35, "[39, 31, 73, 34]", 0, "[2.3, 4, 2.6, 2.6]"
2016-07-16, 36, "[3, 2, 3, 3]", 1, "[2.3, 4, 2.6, 2.6]"
2016-07-17, 37, "[3, 8, 3, 3]", 1, "[2.3, 3, 2.6, 2.6]"
The real data is bigger; result 0 may repeat 2 or more times in a row as well. I think it is about keeping track of the previous OK results and properly getting those averages. I spent time on it but could not manage.
I am posting the dataframe here:
mydict = {'date': {0: '2016-07-11',
1: '2016-07-12',
2: '2016-07-13',
3: '2016-07-14',
4: '2016-07-15',
5: '2016-07-16',
6: '2016-07-17'},
'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
'measure': {0: '[2, 5, 3, 3]',
1: '[3, 5, 3, 3]',
2: '[2, 1, 2, 2]',
3: '[2, 6, 3, 3]',
4: '[39, 31, 73, 34]',
5: '[3, 2, 3, 3]',
6: '[3, 8, 3, 3]'},
'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)
Thank you for giving directions or pointing out how to.
Solution using only 1 for loop:
Considering the data:
mydict = {'date': {0: '2016-07-11',
1: '2016-07-12',
2: '2016-07-13',
3: '2016-07-14',
4: '2016-07-15',
5: '2016-07-16',
6: '2016-07-17'},
'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
'measure': {0: '[2, 5, 3, 3]',
1: '[3, 5, 3, 3]',
2: '[2, 1, 2, 2]',
3: '[2, 6, 3, 3]',
4: '[39, 31, 73, 34]',
5: '[3, 2, 3, 3]',
6: '[3, 8, 3, 3]'},
'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)
I defined a simple function to calculate the means and return a list. Then I loop over the dataframe applying the rules:
def calc_mean(in_list):
    p0 = round((in_list[0][0] + in_list[1][0] + in_list[2][0])/3, 1)
    p1 = round((in_list[0][1] + in_list[1][1] + in_list[2][1])/3, 1)
    p2 = round((in_list[0][2] + in_list[1][2] + in_list[2][2])/3, 1)
    p3 = round((in_list[0][3] + in_list[1][3] + in_list[2][3])/3, 1)
    return [p0, p1, p2, p3]

Solution = []
aux_list = []
for index, row in df.iterrows():
    if index in [0, 1, 2]:
        Solution.append(row.measure)
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
    else:
        Solution.append('[' + ', '.join(map(str, calc_mean(aux_list))) + ']')
        if row.result > 0:
            aux_list.pop(0)
            aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])

df['Solution'] = Solution
The output matches the desired solution, except that the result is rounded to 1 decimal place, a bit different from your desired output. That made more sense to me.
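As a side note, calc_mean hard-codes four points per array. A sketch of a more general version (assuming numpy is acceptable) computes the column-wise mean directly:
import numpy as np

def calc_mean(in_list):
    # Column-wise mean over the stored measurement lists,
    # rounded to 1 decimal place like the original helper
    return [round(float(v), 1) for v in np.mean(in_list, axis=0)]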
EDIT:
As suggested in the comments by @Frenchy, to deal with result == 0 in the first 3 rows, we need to change the first if clause a bit:
if index in [0, 1, 2] or len(aux_list) < 3:
    Solution.append(row.measure)
    if row.result > 0:
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
You can use pd.eval to change the str of a list into a proper list, only for the rows of measure where result is not 0. Use rolling with mean, and then shift so that the rolling average over the last 3 rows is available at the next row. Then map to str once the data is converted to a list of lists with values and tolist. Finally, you just need to replace the first three rows and ffill the missing data:
df.loc[df.result.shift() != 0, 'solution'] = list(map(str,
    pd.DataFrame(pd.eval(df[df.result != 0].measure))
        .rolling(3).mean().shift().values.tolist()))
df.loc[:2,'solution'] = df.loc[:2,'measure']
df.solution = df.solution.ffill()
Here's another solution:
# get data to reproduce example
from io import StringIO
data = StringIO("""
date;id;measure;result
2016-07-11;31;"[2,5,3,3]";1
2016-07-12;32;"[3,5,3,3]";1
2016-07-13;33;"[2,1,2,2]";1
2016-07-14;34;"[2,6,3,3]";1
2016-07-15;35;"[39,31,73,34]";0
2016-07-16;36;"[3,2,3,3]";1
2016-07-17;37;"[3,8,3,3]";1
""")
df = pd.read_csv(data, sep=";")
df
# Out:
# date id measure result
# 0 2016-07-11 31 [2,5,3,3] 1
# 1 2016-07-12 32 [3,5,3,3] 1
# 2 2016-07-13 33 [2,1,2,2] 1
# 3 2016-07-14 34 [2,6,3,3] 1
# 4 2016-07-15 35 [39,31,73,34] 0
# 5 2016-07-16 36 [3,2,3,3] 1
# 6 2016-07-17 37 [3,8,3,3] 1
# convert values in measure column to lists
from ast import literal_eval
dm = df['measure'].apply(literal_eval)
# apply rolling mean with period 2 and recollect values into list in column means
df["means"] = dm.apply(pd.Series).rolling(2, min_periods=0).mean().values.tolist()
df
# Out:
# date id measure result means
# 0 2016-07-11 31 [2,5,3,3] 1 [2.0, 5.0, 3.0, 3.0]
# 1 2016-07-12 32 [3,5,3,3] 1 [2.5, 5.0, 3.0, 3.0]
# 2 2016-07-13 33 [2,1,2,2] 1 [2.5, 3.0, 2.5, 2.5]
# 3 2016-07-14 34 [2,6,3,3] 1 [2.0, 3.5, 2.5, 2.5]
# 4 2016-07-15 35 [39,31,73,34] 0 [20.5, 18.5, 38.0, 18.5]
# 5 2016-07-16 36 [3,2,3,3] 1 [21.0, 16.5, 38.0, 18.5]
# 6 2016-07-17 37 [3,8,3,3] 1 [3.0, 5.0, 3.0, 3.0]
# moving window of size 3
df["means"] = dm.apply(pd.Series).rolling(3, min_periods=0).mean().round(2).values.tolist()
df
# Out:
# date id measure result means
# 0 2016-07-11 31 [2,5,3,3] 1 [2.0, 5.0, 3.0, 3.0]
# 1 2016-07-12 32 [3,5,3,3] 1 [2.5, 5.0, 3.0, 3.0]
# 2 2016-07-13 33 [2,1,2,2] 1 [2.33, 3.67, 2.67, 2.67]
# 3 2016-07-14 34 [2,6,3,3] 1 [2.33, 4.0, 2.67, 2.67]
# 4 2016-07-15 35 [39,31,73,34] 0 [14.33, 12.67, 26.0, 13.0]
# 5 2016-07-16 36 [3,2,3,3] 1 [14.67, 13.0, 26.33, 13.33]
# 6 2016-07-17 37 [3,8,3,3] 1 [15.0, 13.67, 26.33, 13.33]

Taking the mean by one column and then by another in pandas

I have the following dataset:
data = {'VALVE_SCORE': {0: 34.1,1: 41.0,2: 49.7,3: 53.8,4: 35.8,5: 49.2,6: 38.6,7: 51.2,8: 44.8,9: 51.5,10: 41.9,11: 46.0,12: 41.9,13: 51.4,14: 35.0,15: 49.7,16: 41.5,17: 51.5,18: 45.2,19: 53.4,20: 38.1,21: 50.2,22: 25.4,23: 30.0,24: 28.1,25: 49.9,26: 27.5,27: 37.2,28: 27.7,29: 45.7,30: 27.2,31: 30.0,32: 27.9,33: 34.3,34: 29.5,35: 34.5,36: 28.0,37: 33.6,38: 26.8,39: 31.8},
'DAY': {0: 6, 1: 6, 2: 6, 3: 6, 4: 13, 5: 13, 6: 13, 7: 13, 8: 20, 9: 20, 10: 20, 11: 20, 12: 27, 13: 27, 14: 27, 15: 27, 16: 3, 17: 3, 18: 3, 19: 3, 20: 10, 21: 10, 22: 10, 23: 10, 24: 17, 25: 17, 26: 17, 27: 17, 28: 24, 29: 24, 30: 24, 31: 24, 32: 3, 33: 3, 34: 3, 35: 3, 36: 10, 37: 10, 38: 10, 39: 10},
'MONTH': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 2, 32: 3, 33: 3, 34: 3, 35: 3, 36: 3, 37: 3, 38: 3, 39: 3}}
df = pd.DataFrame(data)
First, I would like to take the mean by day and then by month. However, taking the mean by grouping on the days results in decimal months. I would like to preserve the months before I do a groupby('MONTH').mean().
In [401]: df.groupby("DAY").mean()
Out[401]:
     VALVE_SCORE  MONTH
DAY
3        39.7250    2.5
6        44.6500    1.0
10       32.9875    2.5
13       43.7000    1.0
17       35.6750    2.0
20       46.0500    1.0
24       32.6500    2.0
27       44.5000    1.0
I would like the end result to be:
MONTH VALVE_SCORE
1 value
2 value
3 value
Consider that, with the data you have, you would like the daily mean and then the monthly mean. Putting the same data in an Excel pivot table produces the same result.
Doing the same in pandas, grouping by month is enough to get that result:
df.groupby(['MONTH']).mean()
        DAY  VALVE_SCORE
MONTH
1      16.5      44.7250
2      13.5      38.0375
3       6.5      30.8000
Since the month and day values are numeric, pandas processes them as well. If the 'DAY' and 'MONTH' values were not numeric but strings, you would get this result:
       VALVE_SCORE
MONTH
1          44.7250
2          38.0375
3          30.8000
So grouping by month directly gives the monthly means; here every day has the same number of rows (four), so this coincides with computing the daily means first and then averaging them per month.
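If you explicitly want the mean of the daily means (which differs from the direct monthly mean when days have unequal row counts), a sketch using the question's column names is to chain two groupbys:
# Daily means first, then the mean of those daily means per month
daily = df.groupby(['MONTH', 'DAY'])['VALVE_SCORE'].mean()
monthly = daily.groupby('MONTH').mean().reset_index()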
Here's a possible solution. Do let me know if there is a more efficient way of doing it.
df = pd.DataFrame(data)
months = list(df['MONTH'].unique())
frames = []
for p in months:
    df_part = df[df['MONTH'] == p]
    df_part_avg = df_part.groupby("DAY", as_index=False).mean()
    df_part_avg = df_part_avg.drop('DAY', axis=1)
    frames.append(df_part_avg)
df_months = pd.concat(frames)
df_final = df_months.groupby("MONTH", as_index=False).mean()
And the result is:
In [430]: df_final
Out[430]:
   MONTH  VALVE_SCORE
0      1      44.7250
1      2      38.0375
2      3      30.8000

Chaining generators within a comprehension

Is it possible to do something like the following as a one-liner in Python, where the resulting syntax is readable?
d = dict((i,i+1) for i in range(10))
d.update((i,i+2) for i in range(20,25))
>>> from itertools import chain
>>> dict(chain(((i,i+1) for i in range(10)),
               ((i,i+2) for i in range(20,25))))
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 20: 22, 21: 23, 22: 24, 23: 25, 24: 26}
How about this:
d = dict(dict((i,i+1) for i in range(10)), **dict(((i,i+2) for i in range(20,25))))
result:
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 20: 22, 21: 23, 22: 24, 23: 25, 24: 26}
@jamylak's answer is great and should do. Anyway, for this specific problem, I would probably do this:
d = dict((i, i+1) if i < 10 else (i, i+2) for i in range(25) if i < 10 or i >= 20)
This gives the same output:
d = dict((i,i+x) for x,y in [(1, range(10)), (2, range(20,25))] for i in y)
You could also write it with enumerate, so:
d = dict((i,i+x) for x,y in enumerate([range(10), range(20,25)], 1) for i in y)
But it's slightly longer and it assumes your intention is to use a smooth incrementation, which might not be the case later (?). The problem is not knowing whether you plan to extend this into an even longer expression, which would alter the requirements and affect which answer is most convenient.
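For what it's worth, on newer Python versions the same result reads naturally with dict comprehensions and the dict merge operators (a sketch; the | operator needs Python 3.9+, the ** unpacking 3.5+):
# Python 3.9+: merge two dict comprehensions with |
d = {i: i + 1 for i in range(10)} | {i: i + 2 for i in range(20, 25)}

# Python 3.5+: merge with ** unpacking
d = {**{i: i + 1 for i in range(10)}, **{i: i + 2 for i in range(20, 25)}}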
