Iterating through rows in a dataframe - python

I have a dataframe of 12 different teams with their own statistics. My objective is to repeat an entire series of steps for one team, and so on, until the last team has been processed. My code currently correctly calculates statistics for only the first row of the dataframe. I want to repeat these lines of code for each row of the dataframe. I figured that a for loop would be the way to do so, but I'm struggling with the arguments to pass through. Any help would be appreciated, thank you.
import pandas as pd
stats = pd.read_csv('question2_data .csv')
print(stats)
team_count = 0
Output:
Team ID Wins Losses Ties
0 9867 4 2 3
1 1234 7 5 2
2 6213 9 7 0
3 1231 12 2 2
4 8821 2 7 7
5 1131 8 0 8
6 7761 10 3 3
7 6831 0 16 0
8 3131 16 0 0
9 3131 0 0 16
10 8424 0 0 0
11 4211 4 4 4
team_id = stats.iloc[0]['Team ID']
win_count = stats.iloc[0]['Wins']
loss_count = stats.iloc[0]['Losses']
tie_count = stats.iloc[0]['Ties']
print('Team', team_id)
print(win_count, 'Wins', loss_count, 'Losses', tie_count, 'Ties')
game_count = win_count + loss_count + tie_count
remaining_games_count = 16 - game_count
if (game_count == 16):
    print('Games played:', game_count, 'The teams season is finished')
elif (game_count < 16):
    print('Games played:', game_count, 'Games remaining:', remaining_games_count)
win_avg = round((win_count/game_count), 4)
print('Winning average:', win_avg)
if (tie_count >= win_count):
    print('Games tied are greater than or equal to games won')
else:
    print('Games tied are not greater than or equal to games won')
if (tie_count > loss_count):
    print('Games tied are greater than games lost')
else:
    print('Games tied are not greater than games lost')
wip_tot = win_count + tie_count - (loss_count*3)
if (wip_tot%2==0):
    wip_tot = 0
print('WIP total:', wip_tot)

Your code calculates the statistics for only the first row because it uses stats.iloc[0]. Replace the 0 with the loop variable of a for loop:
for i in range(12):
    team_id = stats.iloc[i]['Team ID']
    win_count = stats.iloc[i]['Wins']
    loss_count = stats.iloc[i]['Losses']
    tie_count = stats.iloc[i]['Ties']
    # etc...
You can use stats.shape[0] (or len(stats)) to get the number of rows instead of hard-coding 12.
Bonus: There's a pd.DataFrame.apply() function if you want to get each statistic in a new column.
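As a rough sketch of the same idea, the per-team steps can be wrapped in a function and called once per row. The inline DataFrame below is a stand-in for the CSV (first three teams only), and the helper name summarize_team is hypothetical; the guard for teams with no games played is an assumption, added because the winning average would otherwise divide by zero.

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question (first three teams only).
stats = pd.DataFrame({
    'Team ID': [9867, 1234, 6213],
    'Wins': [4, 7, 9],
    'Losses': [2, 5, 7],
    'Ties': [3, 2, 0],
})

def summarize_team(row):
    """Return (team_id, games_played, winning_average) for one row."""
    game_count = int(row['Wins'] + row['Losses'] + row['Ties'])
    # Assumed guard: teams that have not played yet get no winning average.
    win_avg = round(float(row['Wins']) / game_count, 4) if game_count else None
    return int(row['Team ID']), game_count, win_avg

# iterrows() yields (index, row) pairs, one per team.
summaries = [summarize_team(row) for _, row in stats.iterrows()]
for team_id, game_count, win_avg in summaries:
    print('Team', team_id, 'Games played:', game_count, 'Winning average:', win_avg)
```

The remaining checks from the question (ties versus wins, WIP total) could go inside the same function in the same way.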

Related

Python if statement not working correctly and no idea why

In my code I iterate through dataframes of each year to calculate the number of wins (increase between numbers) and losses (decrease between numbers), and the ratio of wins to losses. The loop I run correctly displays the right number of wins and losses in the dataframe they are eventually pushed to. However, when calculating the win/loss ratio, the if statement isn't working for no real reason. Here is the loop:
trades = []
wins = []
losses = []
winloss = []
for df in df_years_grouped:
    total_trades = len(df)
    trades.append(total_trades)
    win = 0
    loss = 0
    for i, row in df.iterrows():
        if i == 0:
            continue
        elif (df[1][i] > df[1][i-1]):
            win += 1
        elif (df[1][i] < df[1][i-1]):
            loss += 1
    wins.append(win)
    losses.append(loss)
    if win == 0 & loss == 0:
        winloss.append(0)
    elif win > 0 & loss == 0:
        winloss.append('All Wins')
    elif win == 0 & loss > 0:
        winloss.append('All Losses')
    else:
        winloss.append(win/loss)
Here is the outcome in the Dataframe:
Trades Win Lose W/L
11 5 5 All Wins
42 21 20 All Wins
35 16 18 All Wins
14 9 4 All Wins
23 13 9 All Wins
12 7 4 All Wins
4 2 1 All Wins
4 2 1 All Wins
11 5 5 All Wins
6 3 2 All Wins
0 0 0 0
9 6 2 All Wins
2 0 1 0
16 6 9 All Wins
3 0 2 0
14 7 6 All Wins
206 106 99 1.070707
As you can see, it works for one or two rows but fails for most of them, labelling them 'All Wins'. Why?
I believe you should be using "and" instead of "&" in the if statements. You can read up on the differences here:
https://www.geeksforgeeks.org/difference-between-and-and-in-python/
I think the issue is between 'and' and '&'
Give the following a read:
https://www.geeksforgeeks.org/difference-between-and-and-in-python/
'and' is the logical AND, whereas '&' is the bitwise AND:
a = 14
b = 4
print(b and a)  # 14
print(b & a)    # 4
See here too:
what is the difference between "&" and "and" in Python?
I would suggest changing your code to use the logical AND,
or maybe the following, if you do not mind more if statements:
if win == 0:
    if loss == 0:
        winloss.append(0)
    elif loss > 0:
        winloss.append('All Losses')
elif win > 0:
    if loss == 0:
        winloss.append('All Wins')
    else:
        winloss.append(win/loss)
I personally would also find this easier to read as an external user.
You should use and instead of &; they have different precedence.
win == 0 & loss == 0 is equivalent to win == (0 & loss) == 0:
import ast
assert ast.dump(ast.parse("win == 0 & loss == 0")) == ast.dump(ast.parse("win == (0 & loss) == 0"))
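To make the precedence difference concrete, here is a small sketch using the 5 wins / 5 losses row from the question's output. The names bitwise_result and logical_result are made up for illustration.

```python
win, loss = 5, 5

# `&` binds tighter than the comparisons, so this is the chained
# comparison win > (0 & loss) == 0, i.e. (5 > 0) and (0 == 0).
bitwise_result = (win > 0 & loss == 0)

# `and` binds more loosely than the comparisons, so each comparison is
# evaluated first and the two results are combined logically.
logical_result = (win > 0 and loss == 0)

print(bitwise_result)  # True, which is why the row was labelled 'All Wins'
print(logical_result)  # False, because loss is not 0
```

Since 0 & loss is always 0, the second branch of the original if chain fires for every row with at least one win, which matches the table in the question.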

Calculate a np.arange within a Panda dataframe from other columns

I want to create a new column with all the coordinates the car needs to pass on its way to a certain goal. Each entry should be a list stored in a pandas DataFrame.
To start with I have this:
import numpy as np
import pandas as pd
cars = pd.DataFrame({'x_now': np.repeat(1,5),
                     'y_now': np.arange(5,0,-1),
                     'x_1_goal': np.repeat(1,5),
                     'y_1_goal': np.repeat(10,5)})
output would be:
x_now y_now x_1_goal y_1_goal
0 1 5 1 10
1 1 4 1 10
2 1 3 1 10
3 1 2 1 10
4 1 1 1 10
I have tried to add new columns like this, and it does not work:
for xy_index in range(len(cars)):
    if cars.at[xy_index, 'x_now'] == cars.at[xy_index,'x_1_goal']:
        cars.at[xy_index, 'x_car_move_route'] = np.repeat(cars.at[xy_index, 'x_now'].astype(int),(
            abs(cars.at[xy_index, 'y_now'].astype(int)-cars.at[xy_index, 'y_1_goal'].astype(int))))
    else:
        cars.at[xy_index, 'x_car_move_route'] = \
            np.arange(cars.at[xy_index,'x_now'], cars.at[xy_index,'x_1_goal'],
                      (cars.at[xy_index,'x_1_goal'] - cars.at[xy_index,'x_now']) / (
                          abs(cars.at[xy_index,'x_1_goal'] - cars.at[xy_index,'x_now'])))
In the end I want the columns x_car_move_route and y_car_move_route, so that I can loop over the coordinates the cars need to pass. I will display them with tkinter. I will also add more goals later, since this is only the first turn they need to make.
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]
You can apply() something like this route() function along axis=1, which means route() will receive rows from cars. It generates either x or y coordinates depending on what's passed into var (from args).
You can tweak/fix as needed, but it should get you started:
def route(row, var):
    var2 = 'y' if var == 'x' else 'x'
    now, now2 = row[f'{var}_now'], row[f'{var2}_now']
    goal, goal2 = row[f'{var}_1_goal'], row[f'{var2}_1_goal']
    diff, diff2 = goal - now, goal2 - now2
    if diff == 0:
        result = np.array([now] * abs(diff2)).astype(int)
    else:
        result = 1 + np.arange(now, goal, diff / abs(diff)).astype(int)
    return result
cars['x_car_move_route'] = cars.apply(route, args=('x',), axis=1)
cars['y_car_move_route'] = cars.apply(route, args=('y',), axis=1)
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]
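An alternative sketch, in case plain Python lists are preferred over arrays: build the routes row by row and attach them as object columns at the end. The helper name steps_towards is hypothetical, and the sketch assumes integer coordinates with moves of one unit at a time.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({'x_now': np.repeat(1, 5),
                     'y_now': np.arange(5, 0, -1),
                     'x_1_goal': np.repeat(1, 5),
                     'y_1_goal': np.repeat(10, 5)})

def steps_towards(now, goal, length):
    """Hypothetical helper: coordinates visited going from now to goal.

    If this axis does not change, the coordinate is repeated `length` times
    so that both routes end up with the same number of entries.
    """
    now, goal = int(now), int(goal)
    if now == goal:
        return [now] * length
    step = 1 if goal > now else -1
    return list(range(now + step, goal + step, step))

x_routes, y_routes = [], []
for row in cars.itertuples(index=False):
    # Number of moves is the larger of the two axis distances.
    length = int(max(abs(row.x_1_goal - row.x_now), abs(row.y_1_goal - row.y_now)))
    x_routes.append(steps_towards(row.x_now, row.x_1_goal, length))
    y_routes.append(steps_towards(row.y_now, row.y_1_goal, length))

# Wrapping the lists in a Series keeps each route as one cell of an object column.
cars['x_car_move_route'] = pd.Series(x_routes, index=cars.index)
cars['y_car_move_route'] = pd.Series(y_routes, index=cars.index)
print(cars)
```

Collecting the lists first and assigning the whole column once avoids writing a sequence into a single cell with .at, which is where the original attempt runs into trouble.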

Index and save last N points from a list that meets conditions from dataframe Python

I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
    if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
        test = (i-numPoints):i;
        ind_noaaHigh_end(test) = 1;
        n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
        co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
        co_noaaHigh = [co_noaaHigh mean(co_c(test))];
        h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
    end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = []
co2_noaaHigh = []
co_noaaHigh = []
h2o_noaaHigh = []
t_c_High = []  # time
for i in range(len(valveW_c)):
    # NOAA HIGH
    if (valveW_c[i] == 1):
        t_c_High.append(t_c[i])
        n2o_noaaHigh.append(n2o_c[i])
        co2_noaaHigh.append(co2_c[i])
        co_noaaHigh.append(co_c[i])
        h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use the groupby method to get the average of the last n points of each cycle:
n = 3 #we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle 0
1 246.9768
2 246.6579
3 246.7269
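A self-contained sketch of the same approach on the snippet from the question, restricted at the end to the cycles where the valve is 1. It uses groupby(...).tail(n) rather than apply; the names last_n, cycle_means and valve1_means are illustrative, and n = 3 is an assumed value.

```python
import pandas as pd

# The data snippet from the question.
df = pd.DataFrame({
    'gas1': [246.9438, 247.5367, 246.7167, 246.6770, 245.9197, 245.9518,
             246.9207, 246.1517, 246.9015, 246.3712, 247.0826],
    'valveW': [2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2],
    'time': list(range(1, 12)),
})

n = 3  # number of trailing points per cycle (assumed value)

# Label each uninterrupted run of the same valve value as one cycle.
df['cycle'] = (df['valveW'] != df['valveW'].shift()).cumsum()

# Keep the last n rows of every cycle, then average them per cycle.
last_n = df.groupby('cycle').tail(n)
cycle_means = last_n.groupby(['cycle', 'valveW'])['gas1'].mean().reset_index()

# Only the cycles where valve 1 was open.
valve1_means = cycle_means.loc[cycle_means['valveW'] == 1, 'gas1'].tolist()
print(cycle_means)
print(valve1_means)
```

Note that tail(n) returns all rows of a cycle that is shorter than n, so a short final cycle is still averaged over whatever points it has.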
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
    results[k] = v
    print(f'[group {k}]')
    print(v)
shift(), as its name suggests, shifts the valve column by one row, so comparing it with the original column flags every row where the valve value changes. cumsum() then turns those flags into a running counter, giving each uninterrupted run of the same valve value its own number. We can then do a groupby() on that counter (grouping on valveW directly would only produce two groups, the ones and the twos).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then to get the mean for each cycle; you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
    if len(v) < n:
        mean_n_last.append(np.nan)
    else:
        mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3.
If your dataframe is sorted by time, you can get the last N records for each valve like this:
N = 2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted, sort it first. sort_values() returns a new DataFrame, so assign the result back:
df = df.sort_values(by=['time'])

Setting batch number for set of records in python

I have following data in csv
id,date,records
1,2019-03-28 01:22:12,5
2,2019-03-29 01:23:23,5
3,2019-03-30 01:28:54,5
4,2019-03-28 01:12:21,2
5,2019-03-12 01:08:11,1
6,2019-03-28 01:01:21,12
7,2019-03-12 01:02:11,1
What I am trying to achieve is a batch number that increments once the moving sum reaches 15, at which point the moving sum should reset as well. In other words, I am trying to create batches of records whose total is 15.
For example, once the moving sum reaches 15, the batch number should increment for the following row, which gives me groups of rows whose records total 15.
So the output I am looking for is:
id,date,records,moving_sum,batch_number
1,2019-03-28 01:22:12,5,5,1
2,2019-03-29 01:23:23,5,10,1
3,2019-03-30 01:28:54,5,15,1
4,2019-03-28 01:12:21,2,2,2
5,2019-03-12 01:08:11,1,3,2
6,2019-03-28 01:01:21,12,15,2
7,2019-03-12 01:02:11,1,1,3
I am using pandas for this but not able to reset the moving_sum and carry forward the previous set batch_number.
You could do something like this using df.iterrows().
moving = []
batch = []
cntr = 1
for idx,row in df.iterrows():
    if len(moving) == 0:
        moving.append(row['records'])
        batch.append(cntr)
    elif moving[-1] < 15:
        moving.append(row['records']+moving[-1])
        batch.append(cntr)
    elif moving[-1] >= 15:
        moving.append(row['records'])
        cntr += 1
        batch.append(cntr)
df['moving_sum'] = moving
df['batch_number'] = batch
id records moving_sum batch_number
0 1 5 5 1
1 2 5 10 1
2 3 5 15 1
3 4 2 2 2
4 5 1 3 2
5 6 12 15 2
6 7 1 1 3
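The same reset rule can be sketched without pandas by keeping a running total and a batch counter. The list below is the records column from the question; the names limit, running and batch are illustrative.

```python
records = [5, 5, 5, 2, 1, 12, 1]  # the `records` column from the question
limit = 15

moving_sum, batch_number = [], []
running, batch = 0, 1
for value in records:
    # Once the previous running total has reached the limit,
    # reset it and start a new batch with the current row.
    if running >= limit:
        running = 0
        batch += 1
    running += value
    moving_sum.append(running)
    batch_number.append(batch)

print(moving_sum)    # [5, 10, 15, 2, 3, 15, 1]
print(batch_number)  # [1, 1, 1, 2, 2, 2, 3]
```

The two lists can then be assigned to df['moving_sum'] and df['batch_number'] as in the answer above. With this rule a batch can end above 15 if a single row pushes the total past the limit.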

Randomise round-robin league schedule given certain constraints

I'm trying to write a script to randomise a round-robin schedule for a tournament.
The constraints are:
8 Teams
Teams face each other twice, once at home and once away
14 weeks, one game for each team per week
My code works fine in theory, but when it's generated it sometimes freezes on certain weeks when there are only two teams left for that week, and both possible games have already been played. I use a numpy array to check which matchups have been played.
At the moment my code looks like this:
import random
import numpy
regular_season_games = 14
regular_season_week = 0
checker = numpy.full((8,8), 0)
for x in range (0,8):
    checker[x][x] = 1
teams_left = list(range(8))
print ("Week " + str(regular_season_week+1))
while (regular_season_week < regular_season_games):
    game_set = False
    get_away_team = False
    while get_away_team == False:
        Team_A = random.choice(teams_left)
        if 0 in checker[:,Team_A]:
            for x in range (0,8):
                if checker[x][Team_A] == 0 and x in teams_left:
                    teams_left.remove(Team_A)
                    get_away_team = True
                    break
    while game_set == False:
        Team_B = random.choice(teams_left)
        if checker[Team_B][Team_A] == 0:
            teams_left.remove(Team_B)
            print(str(Team_A) + " vs " + str(Team_B))
            checker[Team_B][Team_A] = 1
            game_set = True
    if not teams_left:
        print ("Week " + str(regular_season_week+2))
        teams_left = list(range(8))
        regular_season_week = regular_season_week + 1
I've used an adaptation of the scheduling algorithm from here to achieve this. Basically, we generate a list of the teams - list(range(8)) - and choose as our initial matchup 0 vs 4, 1 vs 5, 2 vs 6, 3 vs 7. We then rotate the list, excluding the first element, and choose as our next matchup 0 vs 3, 7 vs 4, 1 vs 5, 2 vs 6. We continue on in the following way until we have every pairing.
I've added a handler for home & away matches - if a pairing has already been played, we play the opposite home/away pairing. Below is the code, including a function to check if a list of games is valid, and a sample output.
Code:
import random
# Generator function for list of matchups from a team_list
def games_from_list(team_list):
    for i in range(4):
        yield team_list[i], team_list[i+4]
# Function to apply rotation to list of teams as described in article
def rotate_list(team_list):
    team_list = [team_list[4]] + team_list[0:3] + team_list[5:8] + [team_list[3]]
    team_list[0], team_list[1] = team_list[1], team_list[0]
    return team_list
# Function to check if a list of games is valid
def checkValid(game_list):
    if len(set(game_list)) != len(game_list):
        return False
    for week in range(14):
        teams = set()
        this_week_games = game_list[week*4:week*4 + 4]
        for game in this_week_games:
            teams.add(game[0])
            teams.add(game[1])
        if len(teams) < 8:
            return False
    else:
        return True
# Generate list of teams & empty list of games played
teams = list(range(8))
games_played = []
# Optionally shuffle teams before generating schedule
random.shuffle(teams)
# For each week -
for week in range(14):
    print(f"Week {week + 1}")
    # Get all the pairs of games from the list of teams.
    for pair in games_from_list(teams):
        # If the matchup has already been played:
        if pair in games_played:
            # Play the opposite match
            pair = pair[::-1]
        # Print the matchup and append to list of games.
        print(f"{pair[0]} vs {pair[1]}")
        games_played.append(pair)
    # Rotate the list of teams
    teams = rotate_list(teams)
# Checks that the list of games is valid
print(checkValid(games_played))
Sample Output:
Week 1
0 vs 7
4 vs 3
6 vs 1
5 vs 2
Week 2
0 vs 3
7 vs 1
4 vs 2
6 vs 5
Week 3
0 vs 1
3 vs 2
7 vs 5
4 vs 6
Week 4
0 vs 2
1 vs 5
3 vs 6
7 vs 4
Week 5
0 vs 5
2 vs 6
1 vs 4
3 vs 7
Week 6
0 vs 6
5 vs 4
2 vs 7
1 vs 3
Week 7
0 vs 4
6 vs 7
5 vs 3
2 vs 1
Week 8
7 vs 0
3 vs 4
1 vs 6
2 vs 5
Week 9
3 vs 0
1 vs 7
2 vs 4
5 vs 6
Week 10
1 vs 0
2 vs 3
5 vs 7
6 vs 4
Week 11
2 vs 0
5 vs 1
6 vs 3
4 vs 7
Week 12
5 vs 0
6 vs 2
4 vs 1
7 vs 3
Week 13
6 vs 0
4 vs 5
7 vs 2
3 vs 1
Week 14
4 vs 0
7 vs 6
3 vs 5
1 vs 2
True
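For comparison, here is a compact sketch of the textbook circle method for an even number of teams, with the second half of the season built by swapping home and away. This is a generic variant rather than the exact rotation used in the code above; the function name single_round_robin is made up, and shuffling the week order is an optional extra.

```python
import random

def single_round_robin(teams):
    """Circle method: keep the first team fixed and rotate the others each week."""
    teams = list(teams)
    half = len(teams) // 2
    weeks = []
    for _ in range(len(teams) - 1):
        # Pair the i-th team from the front with the i-th team from the back.
        weeks.append([(teams[i], teams[-1 - i]) for i in range(half)])
        # Rotate everything except the first team by one position.
        teams = [teams[0], teams[-1]] + teams[1:-1]
    return weeks

teams = list(range(8))
random.shuffle(teams)  # randomise which team sits in which slot

first_half = single_round_robin(teams)
# Second half: the same fixtures with home and away swapped.
second_half = [[(away, home) for home, away in week] for week in first_half]
schedule = first_half + second_half
random.shuffle(schedule)  # optionally randomise the order of the weeks

for number, week in enumerate(schedule, start=1):
    print(f"Week {number}")
    for home, away in week:
        print(f"{home} vs {away}")
```

Because every week is generated as a complete set of four games, the loop cannot get stuck the way the random pick-and-check approach in the question can.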
