Sum groups of flagged items and then find max values - python

I'd like to sum the values grouped by positive and negative flows and then compare them to figure out the largest negative and largest positive flows.
I think itertools is probably the way to do this but can't figure it out.
import numpy as np
import pandas as pd

# create a data frame that shows week and value
n_rows = 30
dftest = pd.DataFrame({'week': pd.date_range('1/4/2019', periods=n_rows, freq='W'),
                       'value': np.random.randint(-100, 100, size=(n_rows))})
# flag positives and negatives
def flowFinder(row):
    if row['value'] > 0:
        return "Positive"
    else:
        return "Negative"

dftest['flag'] = dftest.apply(flowFinder, axis=1)
dftest
In this example df, you'd determine that rows 15-19 add up to 249, which is the max value of all the positive flows. The max negative flow is row 5 with -98.
Edit by Scott Boston
It is best if you add code that generates your dataframe instead of linking to a picture.
df = pd.DataFrame({'week': pd.date_range('2019-01-06', periods=21, freq='W'),
                   'value': [64, 43, 94, -19, 3, -98, 1, 80, -7, -43, 45, 58, 27, 29,
                             -4, 20, 97, 30, 22, 80, -95],
                   'flag': ['Positive']*3 + ['Negative'] + ['Positive'] + ['Negative'] +
                           ['Positive']*2 + ['Negative']*2 + ['Positive']*4 +
                           ['Negative'] + ['Positive']*5 + ['Negative']})

You can try this:
df.groupby((df['flag'] != df['flag'].shift()).cumsum())['value'].sum().agg(['min','max'])
Output:
min -98
max 249
Name: value, dtype: int64
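The trick in the groupby key is that (df['flag'] != df['flag'].shift()).cumsum() gives every consecutive run of identical flags its own integer, so the sums are taken per run of positives or negatives rather than per flag overall. A quick sketch of the intermediate labels, using the df constructed above (the run_id name is only for illustration):

# Each time the flag changes, the comparison is True, so the cumulative sum starts a
# new label; rows belonging to one consecutive run share the same label.
run_id = (df['flag'] != df['flag'].shift()).cumsum()
print(df.assign(run_id=run_id).head(7))
print(df.groupby(run_id)['value'].sum())  # per-run flow totals; min/max pick the extremes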
Using rename:
df.groupby((df['flag'] != df['flag'].shift()).cumsum())['value'].sum().agg(['min','max'])\
.rename(index={'min':'Negative','max':'Positive'})
Output:
Negative -98
Positive 249
Name: value, dtype: int64
Update per answer comment:
# Per-run sums together with the last week of each run, then pick the rows
# holding the smallest and largest sums.
df_out = df.groupby((df['flag'] != df['flag'].shift()).cumsum())[['value', 'week']]\
    .agg({'value': 'sum', 'week': 'last'})
df_out.loc[df_out.agg({'value': ['idxmin', 'idxmax']}).squeeze().tolist()]
Output:
value week
flag
4 -98 2019-02-10
9 249 2019-05-19

Related

Sum of Maximum Positive and Negative Consecutive rows in pandas

I have a dataframe df as below:
# Import pandas library
import pandas as pd
# initialize list elements
data = [10,-20,30,40,-50,60,12,-12,11,1,90,-20,-10,-5,-4]
# Create the pandas DataFrame with the column name provided explicitly
df = pd.DataFrame(data, columns=['Numbers'])
# print dataframe.
df
I want the sum of the count of max consecutive positive and negative numbers.
I am able to get the count of max consecutive positive and negative numbers, but I am unable to get the sum using the code below.
my code:
streak = df['Numbers'].to_list()
from collections import defaultdict
from itertools import groupby

counter = defaultdict(list)
for key, val in groupby(streak, lambda ele: "plus" if ele >= 0 else "minus"):
    counter[key].append(len(list(val)))

lst = []
for key in ('plus', 'minus'):
    lst.append(counter[key])

print("Max Pos Count " + str(max(lst[0])))
print("Max Neg Count : " + str(max(lst[1])))
Current Output:
Max Pos Count 3
Max Neg Count : 4
I am struggling to get the sum of the max consecutive positive and negative runs.
Expected Output:
Sum Pos Max Consecutive: 102
Sum Neg Max Consecutive: -39
The logic is unclear, the way I understand it is:
group by successive negative/positive values
get the longest stretch per group
compute the sum
You can use:
# label each value as positive or negative
m = df['Numbers'].gt(0).map({True: 'positive', False: 'negative'})
# group by sign and by consecutive run, keeping each run's length and sum
df2 = df.groupby([m, m.ne(m.shift()).cumsum()])['Numbers'].agg(['count', 'sum'])
# for each sign, pick the run with the highest count and take its sum
out = df2.loc[df2.groupby(level=0)['count'].idxmax(), 'sum'].droplevel(1)
Output:
Numbers
negative -39
positive 102
Name: sum, dtype: int64
Intermediate df2:
count sum
Numbers Numbers
negative 2 1 -20
4 1 -50
6 1 -12
8 4 -39 # longest negative stretch
positive 1 1 10
3 2 70
5 2 72
7 3 102 # longest positive stretch
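Since the question started from itertools.groupby, here is a plain-Python sketch of the same logic, offered only as an illustration of the grouping idea and not as part of the pandas answer above: track both the length and the sum of each run, then keep the sum of the longest run per sign.

from itertools import groupby

data = [10, -20, 30, 40, -50, 60, 12, -12, 11, 1, 90, -20, -10, -5, -4]

best = {}  # sign -> (length, sum) of the longest run seen so far
for sign, run in groupby(data, key=lambda x: 'plus' if x >= 0 else 'minus'):
    run = list(run)
    if sign not in best or len(run) > best[sign][0]:
        best[sign] = (len(run), sum(run))

print("Sum Pos Max Consecutive:", best['plus'][1])    # 102
print("Sum Neg Max Consecutive:", best['minus'][1])   # -39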

Merge_asof but only let the nearest merge on key

I am currently trying to merge two data frames using the merge_asof method. However, when using this method I stumbled upon the issue that if I have an empty gap in any of my data, there will be duplicate cells in the merged dataframe. For clarification, I have two dataframes that look like this:
1.
index Meter_Indication (km) Fuel1 (l)
0 35493 245
1 35975 267
2 36000 200
3 36303 160
4 36567 300
5 38653 234
2.
index Meter_Indication (km) Fuel2 (l)
0 35494 300
1 35980 203
2 36573 323
3 38656 233
These two dataframes contain data about refueling vehicles, where the fuel column is the refueled amount in liters and Meter_Indication indicates how many km the car has driven in total (something that cannot decrease over time, which is why it is a great key to merge on). However, as you can see, there are fewer rows in df2 than in df1, which (in my case) makes the values merge on the nearest value, like this:
(merged df)
index Meter_Indication (km) Fuel1 (l) Fuel2(l)
0 35493 245 300
1 35975 267 203
2 36000 200 203
3 36303 160 323
4 36567 300 323
5 38653 234 233
As you can see, there are duplicates of the values 203 and 323. My goal is for the dataframe to exclude the rows that don't have a "nearest" match instead of keeping all of them; I want only the actual nearest row to merge with each value. In other words, my desired dataframe is:
index Meter_Indication (km) Fuel1 (l) Fuel2(l)
0 35493 245 300
1 35975 267 203
3 36567 300 323
4 38653 234 233
You can see here that the values that were not a "closest" match with another value were dropped.
I have tried looking for this everywhere but can't find anything matching my desired outcome.
My current code is:
#READS PROVIDED DOCUMENTS.
df1 = pd.read_excel(
filepathname1, "CWA107 Event", na_values=["NA"], skiprows=1, usecols="A, B, C, D, E, F")
df2 = pd.read_excel(
filepathname2,
na_values=["NA"],
skiprows=1,
usecols=["Fuel2 (l)", "Unnamed: 3", "Meter_Indication"],)
# Drop NaN rows.
df2.dropna(inplace=True)
df1.dropna(inplace=True)
#Filters out rows with the keywords listed in 'blacklist'.
df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
df1 = df1[~df1.Bränsle.isin(blacklist)]
df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)
#Creates new column for the difference in fuellevel column.
df1["Difference (%)"] = df1["Bränslenivå (%)"]
df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()
# Renames time-column so that they match.
df2.rename(columns={"Unnamed: 3": "Tid"}, inplace=True)
# Drops rows where the difference is equal to 0.
df1filt = df1[(df1["Difference (%)"] != 0)]
# Converts time-column to only year, month and date.
df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)
df1filt.reset_index(level=0, inplace=True)
#Renames the index column to "row" in order to later use the "row" column
df1filt.rename(columns={"index": "row"}, inplace=True)
# Creates a new column for the difference in total driven kilometers (used for matching)
df1filt["Match"] = df1filt["Vägmätare (km)"]
df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()
#Merges refuels that were previously separated because of the time intervals, for example when a refuel takes a lot of time and gets split into two different refuels.
ROWRANGE = len(df1filt)+1
thevalue = 0
for currentrow in range(ROWRANGE-1):
    if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
        thevalue = 0
        thevalue += df1filt.loc[currentrow, 'Difference (%)']
        df1filt.loc[currentrow, 'Match'] = "SUMMED"
    if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
        thevalue += df1filt.loc[currentrow, 'Difference (%)']
    if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
        df1filt.loc[currentrow-1, 'Difference (%)'] = thevalue
        df1filt.loc[currentrow-1, 'Match'] = "OFFICIAL"
        thevalue = 0
#Removes single "refuels" that are lower than 5
df1filt = df1filt[(df1filt['Difference (%)'] > 5)]
#Creates a new dataframe for the summed values
df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]
#Creates a estimated refueled amount column for the automatic
df1filt2["Fuel1 (l)"] = df1filt2["Difference (%)"]
df1filt2["Fuel1 (l)"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity
#Renames total kilometer column so that the two documents can match
df1filt2.rename(columns={"Vägmätare (km)": "Meter_Indication"}, inplace=True)
#Filters out rows where refuel and kilometer = NaN (Manual)
df2filt = df2[df2['Fuel2 (l)'].notna() & df2['Meter_Indication'].notna()]
#Drops first row
df2filt.drop(df2filt.index[0], inplace=True)
#Adds prefix for the time column so that they match (not used anymore because km is used to match)
df2filt['Tid'] = '20' + df2filt['Tid'].astype(str)
#Rounds numeric columns
decimals = 0
df2filt['Meter_Indication'] = pd.to_numeric(df2filt['Meter_Indication'],errors='coerce')
df2filt['Fuel2 (l)'] = pd.to_numeric(df2filt['Fuel2 (l)'],errors='coerce')
df2filt['Meter_Indication'] = df2filt['Meter_Indication'].apply(lambda x: round(x, decimals))
df2filt['Fuel2 (l)'] = df2filt['Fuel2 (l)'].apply(lambda x: round(x, decimals))
#Removes last number (makes the two excels matchable)
df2filt['Meter_Indication'] //= 10
df1filt2['Meter_Indication'] //= 10
#Creates merged dataframe with the two
merged_df = df1filt2.merge(df2filt, on='Meter_Indication')
Hopefully this was enough information! Thank you in advance.
Try this:
# Assign new column to keep meter indication from df2
df = pd.merge_asof(df1, df2.assign(meter_indication_2=df2['Meter_Indication (km)']), on='Meter_Indication (km)', direction='nearest')
# Calculate absolute difference
df['meter_indication_diff'] = df['Meter_Indication (km)'].sub(df['meter_indication_2']).abs()
# Sort values, drop duplicates (keep the ones with the smallest diff) and do some clean up
df = (df.sort_values(by=['meter_indication_2', 'meter_indication_diff'])
        .drop_duplicates(subset=['meter_indication_2'])
        .sort_index()
        .drop(['meter_indication_2', 'meter_indication_diff'], axis=1))
# Output
Meter_Indication (km) Fuel1 (l) Fuel2 (l)
0 35493 245 300
1 35975 267 203
4 36567 300 323
5 38653 234 233
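For a self-contained check, the two frames can be rebuilt from the tables in the question and pushed through the same steps (this construction is only a sketch; the column names follow the question's tables):

import pandas as pd

df1 = pd.DataFrame({'Meter_Indication (km)': [35493, 35975, 36000, 36303, 36567, 38653],
                    'Fuel1 (l)': [245, 267, 200, 160, 300, 234]})
df2 = pd.DataFrame({'Meter_Indication (km)': [35494, 35980, 36573, 38656],
                    'Fuel2 (l)': [300, 203, 323, 233]})

# Keep df2's key so the distance to the matched row can be measured after the asof merge.
df = pd.merge_asof(df1, df2.assign(meter_indication_2=df2['Meter_Indication (km)']),
                   on='Meter_Indication (km)', direction='nearest')
df['meter_indication_diff'] = df['Meter_Indication (km)'].sub(df['meter_indication_2']).abs()
# For every df2 row keep only the df1 row that sits closest to it, then tidy up.
df = (df.sort_values(by=['meter_indication_2', 'meter_indication_diff'])
        .drop_duplicates(subset=['meter_indication_2'])
        .sort_index()
        .drop(['meter_indication_2', 'meter_indication_diff'], axis=1))
print(df)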

pandas python how to make a groupby on more columns

This is my question:
I have a CSV file like this:
SELL,NUMBER,TYPE,MONTH
-1484829.72,25782,E,3
-1337196.63,26688,E,3
-1110271.83,15750,E,3
-1079426.55,16117,E,3
-964656.26,11344,D,1
-883818.81,10285,D,2
-836068.57,14668,E,3
-818612.27,13806,E,3
-765820.92,14973,E,3
-737911.62,8685,D,2
-728828.93,8975,D,1
-632200.31,12384,E
41831481.50,18425,E,2
1835587.70,33516,E,1
1910671.45,20342,E,6
1916569.50,24088,E,6
1922369.40,25101,E,1
2011347.65,23814,E,3
2087659.35,18108,D,3
2126371.86,34803,E,2
2165531.50,35389,E,3
2231818.85,37515,E,3
2282611.90,32422,E,6
2284141.50,21199,A,1
2288121.05,32497,E,6
I want to group by TYPE and sum the columns SELLS and NUMBERS, separating negative and positive numbers.
I use this command:
end_result= info.groupby(['TEXTOCANAL']).agg({
'SELLS': (('negative', lambda x : x[x < 0].sum()), ('positiv', lambda x : x[x > 0].sum())),
'NUMBERS': (('negative', lambda x : x[info['SELLS'] <0].sum()), ('positive', lambda x : x[info['SELLS'] > 0].sum())),
})
And the result is the following:
SELLS NUMBERS
negative positive negative positive
TYPE
A -1710.60 5145.25 17 9
B -95.40 3391.10 1 29
C -3802.25 36428.40 191 1063
D 0.00 30.80 0 7
E -19143.30 102175.05 687 1532
But I want to make this groupby also include the column MONTH
Something like that:
1 2
SELLS NUMBERS
negative positive negative positive negative positive negative positive
TYPE
A -1710.60 5145.25 17 9 -xxx.xx xx.xx xx xx
B -95.40 3391.10 1 29
C -3802.25 36428.40 191 1063
D 0.00 30.80 0 7
E -19143.30 102175.05 687 1532
Any idea?
Thanks in advance for your help
This should work:
end_result = (
info.groupby(['TYPE', 'MONTH', np.sign(info.SELL)]) # groupby negative and positive SELL
[['SELL', 'NUMBER']].sum() # select columns to be aggregated
# in this case is redundant to select columns
# since those are the only two columns left
# groupby moves TYPE and MONTH as index
.unstack([1, 2]) # reshape as you need it
.reorder_levels([1, 0, 2], axis=1) # month first, with pos/neg as last level in the column MultiIndex
.rename({-1: 'negative', 1: 'positive'}, axis=1, level=-1)
)
Similar to RichieV's answer. I was unaware of np.sign, which is a neat trick.
Another way to do this is to .assign a flag column with np.where to identify positives and negatives. Then, group by all non-numerical columns and move the second and third fields to the columns with .unstack([1,2]).
info = (info.assign(flag=np.where(info['SELL'] > 0, 'positive', 'negative'))
            .groupby(['TYPE', 'MONTH', 'flag'])[['SELL', 'NUMBER']].sum()
            .unstack([1, 2]))
output (image since multi-indexes are messy).
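As a self-contained sketch combining the ideas above: the 'sales.csv' path below is hypothetical, and the header names SELL/NUMBER/TYPE/MONTH follow the posted file rather than the SELLS/NUMBERS/TEXTOCANAL names used in the question's code.

import numpy as np
import pandas as pd

info = pd.read_csv('sales.csv')  # hypothetical path holding the CSV posted above

sign = np.sign(info['SELL']).rename('sign')          # -1.0 for negative rows, 1.0 for positive
end_result = (info.groupby(['TYPE', 'MONTH', sign])[['SELL', 'NUMBER']]
                  .sum()
                  .unstack([1, 2])                   # MONTH and sign become column levels
                  .rename(columns={-1.0: 'negative', 1.0: 'positive'}, level=-1))
print(end_result)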

Identifying statistical outliers with pandas: groupby and individual columns

I'm trying to understand how to identify statistical outliers which I will be sending to a spreadsheet. I will need to group the rows by the index and then find the stdev for specific columns and anything that exceeds the stdev would be used to populate a spreadsheet.
df = pandas.DataFrame({'Sex': ['M','M','M','F','F','F','F'], 'Age': [33,42,19,64,12,30,32], 'Height': ['163','167','184','164','162','158','160'],})
Using a dataset like this I would like to group by sex, and then find entries that exceed either the stdev of age or height. Most examples I've seen are addressing the stdev of the entire dataset as opposed to broken down by columns. There will be additional columns such as state, so I don't need the stdev of every column just particular ones out of the set.
Looking for the output to just contain the data for the rows that are identified as statistical outliers in either of the columns. For instance:
0 M 64 164
1 M 19 184
Assuming that 64 years old exceeds the stdev set for men's age and 184 cm tall exceeds the stdev for men's height.
First, convert your height from strings to values.
df['Height'] = df['Height'].astype(float)
You then need to group on Sex using transform to create a boolean indicator marking if any of Age or Height is a statistical outlier within the group.
stds = 1.0 # Number of standard deviations that defines an 'outlier'.
z = df[['Sex', 'Age', 'Height']].groupby('Sex').transform(
lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
>>> outliers
Age Height
0 False False
1 False False
2 True True
3 True True
4 True False
5 False True
6 False False
Now filter for rows that contain any outliers:
>>> df[outliers.any(axis=1)]
Age Height Sex
2 19 184 M
3 64 164 F
4 12 162 F
5 30 158 F
If you only care about the upside of the distribution (i.e. values > mean + 2 SDs), then just drop the .abs(), i.e. lambda group: (group - group.mean()).div(group.std()) > stds
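Put together, a minimal sketch of that one-sided variant, reusing df, z and stds from above:

# Flag only values that sit more than `stds` standard deviations ABOVE their group mean.
upper_outliers = z > stds
print(df[upper_outliers.any(axis=1)])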

Set pandas dataframe winner column value based on majority value from three other columns

I have pandas df as this
id Vote1 Vote2 Vote3
123 Positive Negative Positive
223 Positive Negative Neutral
323 Positive Negative Negative
423 Positive Positive
I want to add another column with name winner
which will be set to whatever the majority of the votes is; if there is a tie, then the first vote will be used, as shown for id=223
So the result df should be
id Vote1 Vote2 Vote3 Winner
123 Positive Negative Positive Positive
223 Positive Negative Neutral Positive
323 Positive Negative Negative Negative
423 Positive Positive Positive
This might be related to
Update Pandas Cells based on Column Values and Other Columns
You could do this row-by-row, like this:
import pandas as pd
import numpy as np
# Create the dataframe
df = pd.DataFrame()
df['id']=[123,223,323,423]
df['Vote1']=['Positive']*4
df['Vote2']=['Negative']*3+['Positive']
df['Vote3']=['Positive','Neutral','Negative','']
mostCommonVote=[]
for row in df[['Vote1', 'Vote2', 'Vote3']].values:
    votes, values = np.unique(row, return_counts=True)
    if np.all(values <= 1):
        mostCommonVote.append(row[0])
    else:
        mostCommonVote.append(votes[np.argmax(values)])
df['Winner'] = mostCommonVote
Result:
df:
id Vote1 Vote2 Vote3 Winner
0 123 Positive Negative Positive Positive
1 223 Positive Negative Neutral Positive
2 323 Positive Negative Negative Negative
3 423 Positive Positive Positive
It may not be the most elegant solution, but it is quite simple. It uses the numpy function unique which can return the counts for each unique string for the rows.
Another, Pandas solution without looping:
df = df.set_index('id')
rep = {'Positive':1,'Negative':-1,'Neutral':0}
df1 = df.replace(rep)
df = df.assign(Winner=np.where(df1.sum(axis=1) > 0,'Positive',np.where(df1.sum(axis=1) < 0, 'Negative', df.iloc[:,0])))
print(df)
Output:
Vote1 Vote2 Vote3 Winner
id
123 Positive Negative Positive Positive
223 Positive Negative Neutral Positive
323 Positive Negative Negative Negative
423 Positive Positive NaN Positive
Explanation
df.assign is a way to create a column in a copy of the original dataframe, therefore you have to reassign back to df. The name of the column is Winner, hence Winner=.
Next, you have nested if statements using np.where ... np.where(cond,result,else)
np.where(df1.sum(axis=1) > 0,      # this sums the numeric dataframe by row
    'Positive',                    # if true
    np.where(df1.sum(axis=1) < 0,  # nested if: the first condition returned false
        'Negative',                # sum of the row is less than 0
        df.iloc[:, 0]              # sum == 0: take the first vote from that row
    )
)
I wrote a function and apply it to the df. It is usually a bit faster than normal looping.
import pandas as pd
import numpy as np
def vote(row):
    pos = np.sum(row.values == 'Positive')
    neg = np.sum(row.values == 'Negative')
    if pos > neg:
        return 'Positive'
    elif pos < neg:
        return 'Negative'
    else:
        return row['Vote1']
# Create the dataframe
df = pd.DataFrame()
df['id']=[123,223,323,423]
df['Vote1']=['Positive']*4
df['Vote2']=['Negative']*3+['Positive']
df['Vote3']=['Positive','Neutral','Negative','']
df = df.set_index('id')
df['Winner'] = df.apply(vote,axis=1)
Result
Out[41]:
Vote1 Vote2 Vote3 Winner
id
123 Positive Negative Positive Positive
223 Positive Negative Neutral Positive
323 Positive Negative Negative Negative
423 Positive Positive Positive
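Another vectorized variant, not from the original answers but a sketch of the same tie-breaking rule: count the Positive and Negative votes per row and fall back to Vote1 on a tie.

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [123, 223, 323, 423],
                   'Vote1': ['Positive']*4,
                   'Vote2': ['Negative']*3 + ['Positive'],
                   'Vote3': ['Positive', 'Neutral', 'Negative', '']}).set_index('id')

votes = df[['Vote1', 'Vote2', 'Vote3']]
pos = votes.eq('Positive').sum(axis=1)   # number of Positive votes per row
neg = votes.eq('Negative').sum(axis=1)   # number of Negative votes per row
df['Winner'] = np.where(pos > neg, 'Positive',
                        np.where(neg > pos, 'Negative', df['Vote1']))
print(df)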
