Nested loop to replace rows in dataframe - python

I'm trying to write a for loop that takes each row in a dataframe and compares it to the rows in a second dataframe.
If the row in the second dataframe:

- isn't in the first dataframe already,
- has a higher value in the total_points column, and
- has a lower cost than the available budget (row_budget),

then I want to remove the row from the first dataframe and add the row from the second dataframe in its place.
Example data:
df
       code team_name  total_points  now_cost
78    93284       BHA            38        50
395  173514       WAT            42        50
342   20452       SOU            66        50
92    17761       BUR            97        50
427   18073       WHU            99        50
69    61933       BHA           115        50
130  116594       CHE           116        50
pos_pool
       code team_name  total_points  now_cost
438   90585       WOL           120        50
281   67089       NEW           131        50
419   37096       WHU           143        50
200   97032       LIV           208        65
209  110979       LIV           231       115
My expected output after the first three iterations should be:
df
       code team_name  total_points  now_cost
92    17761       BUR            97        50
427   18073       WHU            99        50
69    61933       BHA           115        50
130  116594       CHE           116        50
438   90585       WOL           120        50
281   67089       NEW           131        50
419   37096       WHU           143        50
Here is the nested for loop that I've tried:
for index, row in df.iterrows():
    budget = squad['budget']
    team_limits = squad['team_limits']
    pos_pool = players_1920.loc[players_1920['position'] == row['position']].sort_values('total_points', ascending=False)
    row_budget = row.now_cost + 1000 - budget
    for index2, row2 in pos_pool.iterrows():
        if (row2 not in df) and (row2.total_points > row.total_points) and (row2.now_cost <= row_budget):
            team_limits[row.team_name] += 1
            team_limits[row2.team_name] -= 1
            budget += row.now_cost - row2.now_cost
            df = df.append(row2)
            df = df.drop(row)
        else:
            pass
return df
At the moment it iterates through the first dataframe, but nothing seems to happen with the second.
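A minimal sketch of the intended swap, with the team-limit and budget bookkeeping omitted for brevity; df, pos_pool and squad are assumed to exist as in the question:
import pandas as pd

# Three likely problems in the loop above: `row2 not in df` tests against
# df's column names, not its rows; `df.drop(row)` needs the index label
# (i.e. `df.drop(index)`); and appending/dropping while iterating mutates
# the frame mid-loop. One way around all three (a sketch, not tested
# against the full ruleset):
budget = squad['budget']  # assumed, as in the question
swaps = []
for index, row in df.iterrows():
    row_budget = row.now_cost + 1000 - budget
    candidates = pos_pool[
        ~pos_pool['code'].isin(df['code'])               # not already in df
        & (pos_pool['total_points'] > row.total_points)  # strictly more points
        & (pos_pool['now_cost'] <= row_budget)           # affordable
    ]
    if not candidates.empty:
        swaps.append((index, candidates.index[0]))

# Apply the swaps only after iterating, so df is never mutated mid-loop.
# DataFrame.append is deprecated, hence pd.concat.
for old_idx, new_idx in swaps:
    df = pd.concat([df.drop(old_idx), pos_pool.loc[[new_idx]]])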

Related

data.dropna() doesn't work for my data.csv file and I still get data with NaN elements

I'm studying pandas in Python.
I'm trying to remove NaN elements from my data.csv file with data.dropna(), but it isn't removing them.
import pandas as pd
data = pd.read_csv('data.csv')
new_data = data.dropna()
print(new_data)
This is the content of data.csv:
Duration Date Pulse Maxpulse Calories
60 '2020/12/01' 110 130 409.1
60 '2020/12/02' 117 145 479.0
60 '2020/12/03' 103 135 340.0
45 '2020/12/04' 109 175 282.4
45 '2020/12/05' 117 148 406.0
60 '2020/12/06' 102 127 300.0
60 '2020/12/07' 110 136 374.0
450 '2020/12/08' 104 134 253.3
30 '2020/12/09' 109 133 195.1
60 '2020/12/10' 98 124 269.0
60 '2020/12/11' 103 147 329.3
60 '2020/12/12' 100 120 250.7
60 '2020/12/12' 100 120 250.7
60 '2020/12/13' 106 128 345.3
60 '2020/12/14' 104 132 379.3
60 '2020/12/15' 98 123 275.0
60 '2020/12/16' 98 120 215.2
60 '2020/12/17' 100 120 300.0
45 '2020/12/18' 90 112 NaN
60 '2020/12/19' 103 123 323.0
45 '2020/12/20' 97 125 243.0
60 '2020/12/21' 108 131 364.2
45 NaN 100 119 282.0
60 '2020/12/23' 130 101 300.0
45 '2020/12/24' 105 132 246.0
60 '2020/12/25' 102 126 334.5
60 2020/12/26 100 120 250.0
60 '2020/12/27' 92 118 241.0
60 '2020/12/28' 103 132 NaN
60 '2020/12/29' 100 132 280.0
60 '2020/12/30' 102 129 380.3
60 '2020/12/31' 92 115 243.0
My guess is that data.csv is written incorrectly?
The data.csv file is indeed written incorrectly; to fix it, you need to add commas.
Corrected format: data.csv
Duration,Date,Pulse,Maxpulse,Calories
60,'2020/12/01',110,130,409.1
60,'2020/12/02',117,145,479.0
60,'2020/12/03',103,135,340.0
45,'2020/12/04',109,175,282.4
45,'2020/12/05',117,148,406.0
60,'2020/12/06',102,127,300.0
60,'2020/12/07',110,136,374.0
450,'2020/12/08',104,134,253.3
30,'2020/12/09',109,133,195.1
60,'2020/12/10',98,124,269.0
60,'2020/12/11',103,147,329.3
60,'2020/12/12',100,120,250.7
60,'2020/12/12',100,120,250.7
60,'2020/12/13',106,128,345.3
60,'2020/12/14',104,132,379.3
60,'2020/12/15',98,123,275.0
60,'2020/12/16',98,120,215.2
60,'2020/12/17',100,120,300.0
45,'2020/12/18',90,112,
60,'2020/12/19',103,123,323.0
45,'2020/12/20',97,125,243.0
60,'2020/12/21',108,131,364.2
45,,100,119,282.0
60,'2020/12/23',130,101,300.0
45,'2020/12/24',105,132,246.0
60,'2020/12/25',102,126,334.5
60,2020/12/26,100,120,250.0
60,'2020/12/27',92,118,241.0
60,'2020/12/28',103,132,
60,'2020/12/29',100,132,280.0
60,'2020/12/30',102,129,380.3
60,'2020/12/31',92,115,243.0
TL;DR:
Try this:
new_data = df.fillna(pd.NA).dropna()
or:
import numpy as np
new_data = df.fillna(np.NaN).dropna()
Is that the real csv file? I don't think so.
There isn't any specification for missing values in the CSV format. In my experience, missing values in a csv are represented by nothing between two separators (if the separator is a comma, it looks like ,,).
From the pandas docs, pandas.read_csv has an na_values argument:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
If your csv file contains 'NaN', pandas is able to infer it and read it as NaN, but you can pass the parameter as you need.
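For example, a minimal sketch of passing extra NA markers (the '-' token here is purely illustrative):
import pandas as pd

# Treat '-' as missing, in addition to the default NA strings
data = pd.read_csv('data.csv', na_values=['-'])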
Also, you can check the type of a cell (where i is the row number and j the column number):
type(df.iloc[i,j])
Compare with:
type(np.NaN) # numpy NaN
float
type(pd.NA) # pandas NaN
pandas._libs.missing.NAType

Calculating max, mean and min of a column in a dataframe

I calculated the max, min and mean of a column in a dataframe as follows:
g['MAX range'] = g['Current_range'].max()
g['min range'] = g['Current_range'].min()
g['mean'] = g['Current_range'].mean()
The output was as follows:
current_speed  current_range  maxrange  minrange  mean
10             25             190       25        74
20             40             190       25        74
20             41             190       25        74
80             190            190       25        74
I don't want repeated values in the maxrange, minrange and mean columns, only a single value in each.
Expected output:
current_speed  current_range  maxrange  minrange  mean
10             25             190       25        74
20             40
20             41
80             190
How can I modify it?
You can add it with .loc. Example for mean:
g.loc[g.index[0], 'mean'] = g['Current_range'].mean()
It will create the mean column with the mean value in the first row and NaN in the other rows.
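The same pattern extends to the other two columns; a sketch using the column names from the question:
first = g.index[0]
g.loc[first, 'MAX range'] = g['Current_range'].max()
g.loc[first, 'min range'] = g['Current_range'].min()
g.loc[first, 'mean'] = g['Current_range'].mean()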

DataFrame: add a Total column that is each row's sum

I am new to Python and I have a question. I have an exported .csv with values and I want to sum each row's values and put the total in a new Total column.
I've tried this, but it doesn't work:
import pandas as pd
wine = pd.read_csv('testelek.csv', 'rb', delimiter=';')
wine['Total'] = [wine[row].sum(axis=1) for row in wine]
I want to make my DataFrame like this.
     101  102  103  104  ....  Total
0    80   84   86   78   ....  328
1    78   76   77   79   ....  310
2    79   81   88   83   ....  331
3    70   85   89   84   ....  328
4    78   84   88   85   ....  335
You can bypass the need for the list comprehension and just use the axis=1 parameter to get what you want.
wine['Total'] = wine.sum(axis=1)
A nice way to do this is by using .apply().
Suppose you want to create a new column named Total by adding the values per row for the columns named 101, 102, and 103; you can try the following:
wine['Total'] = wine.apply(lambda row: sum([row['101'], row['102'], row['103']]), axis=1)
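If only those three columns should be summed, the same result can be had without .apply by selecting the subset first; a sketch assuming the labels are strings, as in the .apply example:
wine['Total'] = wine[['101', '102', '103']].sum(axis=1)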

Can't set column name from index to str(index) + string (Pandas, Python)

I need to change the names of a subset of columns in a dataframe from whatever number they are to that number plus a string suffix. I know there is a function to add a suffix, but it doesn't seem to work on just indices.
I create a list with all the column indices in it, then run a loop that, for each item in that list, renames the dataframe column matching the list item to the same number plus the suffix string.
if scalename == "CDR":
    print(scaledf.columns.tolist())
    oldCols = scaledf.columns[7:].tolist()
    for f in range(len(oldCols)):
        changeCol = int(oldCols[f])
        print(changeCol)
        scaledf.rename(columns = {changeCol: scalename + str(changeCol)})
    print(scaledf.columns)
This doesn't work.
The code prints out the column names and every item, but it does not rename the columns. It doesn't throw errors; it just doesn't work. I've tried variation after variation and gotten all kinds of other errors, but this error-free code does nothing. It just runs, and doesn't rename anything.
Any help would be seriously appreciated! Thank you.
Adding a sample of the list:
45
52
54
55
59
60
61
66
67
68
69
73
74
75
80
81
82
94
101
103
104
108
110
115
116
117
129
136
138
139
143
144
145
150
151
157
158
159
171
178
180
181
185
186
187
192
193
199
200
201
213
220
222
223
227
228
229
234
235
236
Try this:
scaledf = scaledf.rename(columns=lambda c: scalename + str(c) if c in oldCols else c)
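For context on why the original loop does nothing: DataFrame.rename returns a new frame unless the result is assigned back (or inplace=True is passed), so the renamed frame was being discarded on every iteration. An equivalent dict-based sketch:
scaledf = scaledf.rename(columns={c: scalename + str(c) for c in oldCols})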

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with distance values, shown as matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# being iterated over in inp_df.
inp_df
A    B    cell
100  200  1
115  270  1
145  255  2
115  266  1
matrix_df (cell_1.csv)
B    100  115   199   avg_distance
200  7.5  80.7  67.8  52
270  6.8  53    92    50
266  58   84    31    57
matrix_df (cell_2.csv)
B    145   121    166  avg_distance
255  74.9  77.53  8    53.47
out_df dataframe
A    B    cell  distance  avg_distance
100  200  1     7.5       52
115  270  1     53        50
145  255  2     74.9      53.47
115  266  1     84        57
My current thought process for each cell's data is:

- use an apply function to go row by row,
- then use a join on column B between inp_df and matrix_df, where matrix_df is somehow translated into a tuple of column name, distance & average distance.

But I am looking for a pandonic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since the number of columns in matrix_df varies per cell.
If it's any help, the matrix files are the distance-based outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the values of column B are unique, and the values of column A may or may not be unique.
Also, matrix_df's first column was empty and I renamed it with the following code for ease of understanding, since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
     A    B  cell  distance  avg_distance
0  100  200     1       7.5         52.00
1  115  270     1      53.0         50.00
2  115  266     1      84.0         57.00
3  145  255     2      74.9         53.47
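For the millions-of-rows case, a vectorised alternative (a sketch, assuming the matrix column labels parse as integers) is to melt each matrix into long form and merge once, avoiding the row-wise apply:
import pandas as pd

# Reshape each cell's matrix from wide (one column per A value) to long form
long_df = pd.concat(
    m.melt(id_vars=['B', 'avg_distance'], var_name='A', value_name='distance')
    for m in (matrix_df1, matrix_df2)
)
long_df['A'] = long_df['A'].astype(int)  # melted column labels are strings

# A single merge on (A, B) replaces the per-row column lookup
out_df = inp_df.merge(long_df, on=['A', 'B'])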
