Assign a value to a sub group if condition is met - python

I would like to create column and assign a number to each team that won and lost in a given 'Rally' (0 for a Loss, 1 for a Win). The last row of each rally will display who won in the 'Points' column.
The image shows how the data is formatted and the desired result is in the 'Outcome' column:
My current code is;
def winLoss(x):
if 'A' in x['Points']:
if x.TeamAB == 'A':
return 1
else:
return 0
elif 'B' in x['Points']:
if x.TeamAB == 'B':
return 1
else:
return 0
df['Outcome'] = df.groupby('Rally').apply(winLoss).any()

Grab the winners for each rally by grouping and taking the last row of Points for each group, then use multiindex to loc filter an assign the Outcome:
winners = pd.MultiIndex.from_frame(
df.groupby(['Rally'])['Points']
.last().str.slice(-1).reset_index()
)
df.set_index(['Rally', 'TeamAB'], inplace=True)
df['Outcome'] = 0
df.loc[df.index.isin(winners), 'Outcome'] = 1
df.reset_index(inplace=True)

Related

Sum of a numeric columns into specific ranges and counting its occurrences

I am quite new to programming field of Python.
I have a dataset which needs to be modified. I tried few methods for sum part but I dont get the exact results.
Dataset : My data table
Requirements:
To categorize the debit and credit values into the following ranges/bins :
a) 2000-4000
b) 5000-8000
c) 9000-20000
The sum of debit should be for 20 days period like
if the transaction happened on 2020-01-01 then
the sum of credit should be from 2020-01-01 to 2020-01-20
I also want the record of occurrences i.e
the number of times the value from the bins lies in the category
Required Result : Result]2
The code I tried for credit values:
EndDate = BM['transaction_date']+ pd.to_timedelta(20, unit='D')
StartDate= BM['transaction_date']
dfx=BM
dfx['EndDate'] = EndDate
dfx['StartDate'] = StartDate
dfx['Debit'] = dfx.apply(lambda x: BM.loc[(df['transaction_date'] >= x.StartDate) &
(BM['transaction_date']
<=x.EndDate),'Debit'].sum(), axis=1)
Code1-
Code2-
error :
I have created a lot of functions and broke the problem into smaller tasks. Hope the comments make this understandable.
def sum20Days(df, debitORCredit):
"""
Calculates the sum of all amount in the debitORCredit column of df looking 20 days into the future within df
df: pandas DataFrame. Should already do groupby on name
debitORCredit : String. Takes either debit or credit. Column names in the dataframe
Returns:
df: Creates a column sum_debit_20days, adds the sum amount and returns the final dataframe
"""
df = df.copy()
temp_df = df[df[debitORCredit]>0]
dates = sorted(temp_df["transaction_date"].unique())
curr_date = dates[0]
date_20days = curr_date + pd.Timedelta(20, unit="D")
i = 0
while i < len(dates):
date = dates[i]
if date > date_20days:
curr_date = date
date_20days = curr_date + pd.Timedelta(20, unit="D")
series = temp_df.loc[(df["transaction_date"]>=date)&(df["transaction_date"]<=date_20days), :]
df.loc[max(df.loc[df["transaction_date"] == series["transaction_date"].max()].index), f"sum_{debitORCredit}_20days"] = sum(series[debitORCredit])
new_i = series["transaction_date"].nunique()
if new_i > 1:
i = new_i+1
else:
i += 1
return df
def groupListUsingList(inp, groupby):
"""
Groups inp by list groupby
inp: List
groupby: List
Example: inp = [0, 1, 2, 3, 4, 5, 6, 7], groupby=[3, 6] then output = [[0, 1, 2, 3], [4, 5, 6], [7]]
"""
groupby = sorted(groupby)
inp = sorted(inp)
lst = []
arr = []
for i in inp:
if len(groupby) > 0:
if i <= groupby[0]:
arr.append(i)
else:
if len(arr)>0:
lst.append(arr)
arr = [i]
groupby.pop(0)
else:
arr += inp[i:]
if len(arr) > 0:
lst.append(arr)
return lst
def count_amounts_in_category(df, debitORCredit, category_info):
"""
Based on the category assigned, finds the number of amounts belonging to that category
Inputs-
df: Pandas Dataframe. Grouped by name and only contains the transactions belonging to a single category calculation
debitORCredit: String. Takes either credit/debit. Used to get column in df
category_info: Dict. Contains the rules of categorization.
Output-
count: Float. Returns count
"""
if debitORCredit.lower() == "debit":
temp_df = df.loc[(df["debitorcredit"]=="D")]
elif debitORCredit.lower() == "credit":
temp_df = df.loc[(df["debitorcredit"]=="C")]
if temp_df.shape[0] == 0:
return np.nan
category = temp_df.iloc[-1].loc[f"category_{debitORCredit}"]
amount_range = category_info.get(category)
count = temp_df[debitORCredit].apply(lambda x: 1 if x<=amount_range[1] and x>=amount_range[0] else 0).sum()
return count
def assign_category(amount, category_info):
"""
Assigns category based on amount and categorization rules
Input -
amount: Float/Int. The amount
category_info: Dict. Contains the rules of categorization.
Ouptut -
Returns the String category based on the categorization rules
"""
if pd.isna(amount):
return np.nan
for k, v in category_info.items():
if v[0]<=amount<=v[1]:
return k
return np.nan
category_info = {"A": (2000, 4000),
"B": (5000, 8000),
"C":(9000, 20000)}
debitORCredit = "debit"
new_df = pd.DataFrame()
#Groupby name, then for each date in a group, calculate the sum of debitORCredit amounts over the next 20 days
for group in df.groupby("name"):
temp_df = sum20Days(group[1], debitORCredit=debitORCredit)
new_df = pd.concat([new_df, temp_df])
new_df = new_df.reset_index(drop=True)
#Based on the 20 days sum, use the categorization rules to assign a category
new_df[f"category_{debitORCredit}"] = new_df[f"sum_{debitORCredit}_20days"].apply(lambda x: assign_category(x, category_info))
#After assigning a category, groupby name and later groupby each 20 day transaction to find the count of transaction that belong to category assigned to that group of transactions
for group in new_df.groupby("name"):
#to groupby every 20 day transaction, we identified the last row of every 20 day transaction (ones which have a sum_debit_20days value) and split the group(a group from name groupby) on the last value in the index
indices = groupListUsingList(inp=group[1].index, groupby=group[1][group[1][f"sum_{debitORCredit}_20days"].notna()].index)
for index in indices:
count = count_amounts_in_category(df=new_df.loc[index], debitORCredit=debitORCredit, category_info=category_info)
new_df.loc[index[-1], f"count_{debitORCredit}"] = count
new_df

How to create a new column as a result of comparing two nested consecutive rows in the Panda dataframe?

I need to write a code in Panda Dataframe. So: The values in the ID column will be checked sequentially whether they are the same or not. Three situations arise here. Case 1: If the ID is not the same as the next line, write it as "unique" in the Comment column. Case 2: If the ID is the same as the next column and different from the next one, write it as "ring" in the Comment column. Case 3: If the ID is the same as the next multiple columns, write it as "multi" in the Comment column. Case 4: do this until the rows in the ID column are complete.
import pandas as pd
df = pd.read_csv('History-s.csv')
a = len(df['ID'])
c = 0
while a != 0:
c += 1
while df['ID'][i] == df['ID'][i + 1]:
if c == 2:
if df['Nod 1'][i] == df['Nod 2'][i + 1]:
df['Comment'][i] = "Ring"
df['Comment'][i + 1] = "Ring"
else:
df['Comment'][i] = "Multi"
df['Comment'][i + 1] = "Multi"
elif c > 2:
df['Comment'][i] = "Multi"
df['Comment'][i + 1] = "Multi"
i += 1
else:
df['Comment'][i] = "Unique"
a = a -1
print(df, '\n')
Data is like this:
Data
After coding data frame should be like this:
Result
From the input dataframe you have provided, my first impression was that as you are checking next line in a while loop, so you are strictly considering just the next comin line, for ex.
ID
value
comment
1
2
MULTI
1
3
RING
3
4
UNIQUE
But if that is not the case, you can simply use pandas groupby function.
def func(df):
if len(df)>2:
df['comment'] = 'MULTI'
elif len(df)==2:
df['comment'] = 'RING'
else:
df['comment'] = 'UNIQUE'
return df
df = df.groupby(['ID']).apply(func)
Output:
ID value comment
0 1 2 RING
1 1 3 RING
2 3 4 UNIQUE

Creating new column based on multiple different values

I have current code below that creates a new column based on multiple different values of a column that has different values representing similar things such as Car, Van or Ship, Boat, Submarine that I want all to be classified under the same value in the new column such as Vehicle or Boat.
Code with Simplified Dataset example:
def f(row):
if row['A'] == 'Car':
val = 'Vehicle'
elif row['A'] == 'Van':
val = 'Vehicle'
elif row['Type'] == 'Ship'
val = 'Boat'
elif row['Type'] == 'Scooter'
val = 'Bike'
elif row['Type'] == 'Segway'
val = 'Bike'
return val
What is best method similar to using wildcards rather than type each value out if there are multiple values (30 plus values ) that I want to bucket into the same new values under the new column?
Thanks
One way is to use np.select with isin:
df = pd.DataFrame({"Type":["Car","Van","Ship","Scooter","Segway"]})
df["new"] = np.select([df["Type"].isin(["Car","Van"]),
df["Type"].isin(["Scooter","Segway"])],
["Vehicle","Bike"],"Boat")
print (df)
Type new
0 Car Vehicle
1 Van Vehicle
2 Ship Boat
3 Scooter Bike
4 Segway Bike

Need assistance in analyzing data (mean, median, mode, etc) from csv file

I am having difficulty with finding the mean, median, mode, counting occurrences of a value within a csv file.
This section of the file is a column of letters 'M' or 'F'
This specific excerpt of code displays a problem I am facing:
I am not sure why the counting variables are not being incremented.
Any assistance would be greatly appreciated
citations2 = open('Non Traffic Citations.csv')
data2 = csv.reader(citations2)
gender = []
for row in data2:
gender.append(row[2])
del gender [0]
male_count = 0
female_count = 0
for item in gender:
# print(item) - shows that the list has values within it
if 'M' == item:
male_count = + 1
if 'F' == item:
female_count = + 1
print(male_count)
print(female_count)
If you are trying to increment the gender counts, you have the syntax incorrect in your loop.
for item in gender:
if 'F' == item:
female_count += 1
elif 'M' == item:
male_count += 1
print(male_count)
print(female_count)
You can use pandas:
import pandas as pd
df=pd.read_csv('Non Traffic Citations.csv')
df.describe()

Removing value from a DataFrame column which repeats over 15 times

I'm working on forex data like this:
0 1 2 3
1 AUD/JPY 20040101 00:01:00.000 80.598 80.598
2 AUD/JPY 20040101 00:02:00.000 80.595 80.595
3 AUD/JPY 20040101 00:03:00.000 80.562 80.562
4 AUD/JPY 20040101 00:04:00.000 80.585 80.585
5 AUD/JPY 20040101 00:05:00.000 80.585 80.585
I want to go through column 2 and 3 and remove the rows in which the value is repeated for more than 15 times in a row. So far I managed to produce this piece of code:
price = 0
drop_start = 0
counter = 0
df_new = df
for i, r in df.iterrows():
if r.iloc[2] != price:
if counter >= 15:
df_new = df_new.drop(df_new.index[drop_start:i])
price = r.iloc[2]
counter = 1
drop_start = i
if r.iloc[2] == price:
counter = counter + 1
price = 0
drop_start = 0
counter = 0
df = df_new
for i, r in df.iterrows():
if r.iloc[3] != price:
if counter >= 15:
df_new = df_new.drop(df_new.index[drop_start:i])
price = r.iloc[3]
counter = 1
drop_start = i
if r.iloc[3] == price:
counter = counter + 1
print(df_new.info())
df_new.to_csv('df_new.csv', index=False, header=None)
Unfortunately when I check the output file there are some mistakes, there are some weekends which have not been removed by the program. How should I build my algorithm, so it removes the duplicated values correctly?
First 250k rows of my initial dataset is available here: https://ufile.io/omg5h
The output of this program for that sample data is available here:
https://ufile.io/2gc3d
You can see that in the output file the rows 6931+ were not succesfully removed:
The problem with your algorithm is that, you are not holding specific counter values for the row values, but rather increment the counter through the loop. This causes the result to be false I believe. Also, the comparison r.iloc[2] != price also does not make sense because you are changing the value of price every iteration, so if there are elements between the duplicates, this check do not serve a proper function. I wrote a small code to copy the behavior you asked for.
df = pd.DataFrame([[0,0.5, 2.5],[0,1, 2],[0,1.5,2.5 ],[0,2, 3],[0,2, 3],[0,3, 4],
[0,4, 5]],columns = ['A','B','C'])
df_new = df
dict = {}
print('Initial DF')
print(df)
print()
for i, r in df.iterrows():
counter = dict.get(r.iloc[1])
if counter == None:
counter = 0
dict[r.iloc[1]] = counter + 1
if dict[r.iloc[1]] >= 2:
df_new = df_new[df_new.B != r.iloc[1]]
print('2nd col. deleted DF')
print(df_new)
print()
df_fin = df_new
dict2 = {}
for i, r in df_new.iterrows():
counter = dict2.get(r.iloc[2])
if counter == None:
counter = 0
dict2[r.iloc[2]] = counter + 1
if dict2[r.iloc[2]] >= 2:
df_fin = df_fin[df_fin.C != r.iloc[2]]
print('3rd col. deleted DF')
print(df_fin)
Here, I hold the counter value for each unique value in the rows of column 2 and 3. Then, according to the threshold(which is 2 in this case) I remove the rows which are exceeding the threshold. I first eliminate values according to the 2nd column, then forward this modified array to the next loop and eliminate values according to the 3rd column and finish the process.

Categories

Resources