Subtraction/Addition from separate rows/columns - python

I have a dataframe like this:
Day  Diff
137     0
185    48
249    64
139  -110
In the column Diff, whenever a negative value is encountered, I want to subtract the previous row's Day value from 365 and then add the current row's Day value (the row with the negative number). For example, when -110 is encountered I want to do 365 - 249 (249 is Day in the previous row) and then add 139. So 365 - 249 = 116 and 116 + 139 = 255, and therefore -110 is replaced with 255.
My desired output then is:
Day  Diff
137     0
185    48
249    64
139   255

You can do it this way:
In [32]: df.loc[df.Diff < 0, 'Diff'] = 365 + df.Day - df.shift().loc[df.Diff < 0, 'Day']
In [33]: df
Out[33]:
Day Diff
0 137 0.0
1 185 48.0
2 249 64.0
3 139 255.0
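For a plain script rather than the IPython session above, the same approach can be written with an explicit mask; a minimal sketch using the question's example data:
import pandas as pd

df = pd.DataFrame({'Day': [137, 185, 249, 139],
                   'Diff': [0, 48, 64, -110]})

# 365 minus the previous row's Day, plus the current row's Day,
# applied only where Diff is negative
mask = df['Diff'] < 0
df.loc[mask, 'Diff'] = 365 - df['Day'].shift()[mask] + df['Day'][mask]
print(df)
As in Out[33], the Diff column may come back as float, since the shifted intermediate contains a NaN.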

Using the number of days in the month inside a function for a pandas dataframe - python

I have a dataframe and need to implement a calculation on it.
Every month I will run this script, so it should automatically pick the number of days based on extracted_date.
Input Dataframe
client_id expo_value value cal_value extracted_date
1 126 30 27.06 08/2022
2 135 60 36.18 08/2022
3 144 120 45 08/2022
4 162 30 54.09 08/2022
5 153 90 63.63 08/2022
6 181 120 72.9 08/2022
Desired output dataframe
client_id expo_value value cal_value extracted_date Output_Value
1 126 30 27.06 08/2022 126+26.18 = 152.18
2 135 60 36.18 08/2022 261.29+70.02 = 331.31
3 144 120 45 08/2022 557.4+174.19 = 731.59
4 162 30 54.09 08/2022 156.7+ 52.34 = 209.04
5 153 90 63.63 08/2022 444.19+ 182.9 =627.09
6 181 120 72.9 08/2022 700.64+282.19=982.83
I want to use 31/30/28 days inside the function below. I tried manually entering the number 31 (days) for the calculation, but it should automatically pick the right number based on how many days the month has.
def month_data(data):
    if (data['value'] <= 30).any():
        return data['expo_value'] * 30 / 31 + data['cal_value'] * 45 / 31   # 31 = days, hard-coded
    elif (data['value'] <= 60).any():
        return data['expo_value'] * 60 / 31 + data['cal_value'] * 90 / 31
    elif (data['value'] <= 90).any():
        return data['expo_value'] * 100 / 31 + data['cal_value'] * 120 / 31
    else:  # value <= 120
        return np.nan
Let me see if I understood you correctly. I tried to reproduce a small subset of your dataframe (you should do this next time you post something). The answer is as follows:
import pandas as pd
from datetime import datetime
import calendar
# I'll make a subset dataframe based on your example
data = [[30, '02/2022'], [60, '08/2022']]
df = pd.DataFrame(data, columns=['value', 'extracted_date'])
# First, turn the extracted_date column into a correct date format
date_correct_format = [datetime.strptime(i, '%m/%Y') for i in df['extracted_date']]
# Second, calculate the number of days per month
num_days = [calendar.monthrange(i.year, i.month)[1] for i in date_correct_format]
num_days
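To plug those month lengths into the function, one option is to attach a days column and divide by it row by row. A sketch under that assumption (note it checks each row's value directly, rather than the .any() calls in the question, which test the whole column at once):
import calendar
import numpy as np
import pandas as pd

df = pd.DataFrame({'expo_value': [126, 135, 144],
                   'value': [30, 60, 120],
                   'cal_value': [27.06, 36.18, 45.0],
                   'extracted_date': ['08/2022', '08/2022', '02/2022']})

# Days in each row's month, derived from extracted_date
dates = pd.to_datetime(df['extracted_date'], format='%m/%Y')
df['days'] = [calendar.monthrange(d.year, d.month)[1] for d in dates]

def month_data(row):
    days = row['days']
    if row['value'] <= 30:
        return row['expo_value'] * 30 / days + row['cal_value'] * 45 / days
    elif row['value'] <= 60:
        return row['expo_value'] * 60 / days + row['cal_value'] * 90 / days
    elif row['value'] <= 90:
        return row['expo_value'] * 100 / days + row['cal_value'] * 120 / days
    return np.nan  # value <= 120

df['Output_Value'] = df.apply(month_data, axis=1)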

Iterate over specific rows, sum results and store in new row

I have a DataFrame in which I have already defined which rows are to be summed up, and I want to store the results in a new row.
For example in Year 1990:
Category    A    B    C    D  Year
E         147   78  476  531  1990
F         914  356  337  781  1990
G         117  874   15   69  1990
H          45  682  247   65  1990
I          20  255  465   19  1990
Here, rows G and H should be summed up and the results stored in a new row. The same categories repeat every year from 1990 to 2019.
I have already tried it with .iloc, e.g. [4:8], [50:54], [96:100] and so on, but with iloc I cannot specify multiple indices, and I can't manage to write a loop over the individual years.
Is there a way to sum the values in categories G and H for each year (1990-2019)?
I'm not sure what you mean by multiple index; that usually appears after a group-and-aggregate operation, and your table looks like it simply has multiple columns.
So, if I understand correctly, here is complete code showing how to select with multiple conditions on a DataFrame:
import io
import pandas as pd

data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""
table = pd.read_csv(io.StringIO(data), sep=r"\s+")

years = table["Year"].unique()
for year in years:
    # Select the G and H rows of the current year
    row = table[((table["Category"] == "G") | (table["Category"] == "H")) & (table["Year"] == year)]
    row = row[["A", "B", "C", "D"]].sum()
    row["Category"], row["Year"] = "sum", year
    # DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
    table = pd.concat([table, row.to_frame().T], ignore_index=True)
If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:
df[df['Category'].isin(['G', 'H'])].sum()
output:
Category GH
A 162
B 1556
C 262
D 134
Year 3980
dtype: object
NB: note the side effect of sum here, which concatenates the two "G"/"H" strings into one "GH".
Or, better, set Category as index and slice with loc:
df.set_index('Category').loc[['G', 'H']].sum()
output:
A 162
B 1556
C 262
D 134
Year 3980
dtype: int64
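Since the same categories repeat for every year, the per-year totals the question asks for can also come from a single groupby after the isin filter (a sketch, assuming df contains all years 1990-2019):
# One G+H sum per year, without an explicit loop over years
sums = (df[df['Category'].isin(['G', 'H'])]
        .groupby('Year')[['A', 'B', 'C', 'D']]
        .sum()
        .reset_index())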

Nested loop to replace rows in dataframe

I'm trying to write a for loop that takes each row in a dataframe and compares it to the rows in a second dataframe.
If the row in the second dataframe:
isn't in the first dataframe already
has a higher value in the total points column
has a lower cost than the available budget (row_budget)
then I want to remove the row from the first dataframe and add the row from the second dataframe in its place.
Example data:
df
code team_name total_points now_cost
78 93284 BHA 38 50
395 173514 WAT 42 50
342 20452 SOU 66 50
92 17761 BUR 97 50
427 18073 WHU 99 50
69 61933 BHA 115 50
130 116594 CHE 116 50
pos_pool
code team_name total_points now_cost
438 90585 WOL 120 50
281 67089 NEW 131 50
419 37096 WHU 143 50
200 97032 LIV 208 65
209 110979 LIV 231 115
My expected output for the first three loops should be:
df
code team_name total_points now_cost
92 17761 BUR 97 50
427 18073 WHU 99 50
69 61933 BHA 115 50
130 116594 CHE 116 50
438 90585 WOL 120 50
281 67089 NEW 131 50
419 37096 WHU 143 50
Here is the nested for loop that I've tried:
for index, row in df.iterrows():
    budget = squad['budget']
    team_limits = squad['team_limits']
    pos_pool = players_1920.loc[players_1920['position'] == row['position']].sort_values('total_points', ascending=False)
    row_budget = row.now_cost + 1000 - budget
    for index2, row2 in pos_pool.iterrows():
        if (row2 not in df) and (row2.total_points > row.total_points) and (row2.now_cost <= row_budget):
            team_limits[row.team_name] += 1
            team_limits[row2.team_name] -= 1
            budget += row.now_cost - row2.now_cost
            df = df.append(row2)
            df = df.drop(row)
        else:
            pass
return df
At the moment I am iterating through the first dataframe, but nothing seems to happen with the second.
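One thing worth checking (an observation on the code above, not a verified fix): `row2 not in df` tests membership against df's column labels, not its rows, so it never actually compares the two rows. A sketch of a membership test on the code key column instead:
# `in` on a DataFrame checks column labels, so compare on a key column instead
if (row2['code'] not in df['code'].values
        and row2.total_points > row.total_points
        and row2.now_cost <= row_budget):
    ...  # swap the rows as before
Similarly, df.drop(row) expects an index label rather than a Series, so df = df.drop(index) is likely closer to the intent.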

Handling Zeros or NaNs in a Pandas DataFrame operations

I have a DataFrame (df), shown below, where each column is sorted from largest to smallest for frequency analysis. That leaves some values as either zeros or NaNs, since each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column had a different length or number of records (ignoring the zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns :
shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as it seems from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
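Plugged into the loop from the question, that filter would look like this (a sketch; stats is assumed to be scipy.stats, as in the question):
from scipy import stats

shape_list1, location_list1, scale_list1 = [], [], []
for column in df.columns:
    col = df[column]
    # Fit only on the positive entries of this column
    shape1, location1, scale1 = stats.genpareto.fit(col[col > 0])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)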
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a numpy array holding the integer positions where df[column] is nonzero (with a default RangeIndex these coincide with the index labels). To index df[column] by these positions, you can use df[column][df[column].nonzero()[0]].
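Note that Series.nonzero() has since been deprecated and removed from pandas, so on current versions this trick is written with a boolean mask, which reduces to the same filtering shown above:
# Modern equivalent of df[column][df[column].nonzero()[0]]
stats.genpareto.fit(df[column][df[column] != 0])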

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with distance values, shown as matrix_df.
The program iterates by cell and fetches data from the appropriate file, so on each iteration matrix_df holds the data for all rows of the current cell# from inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell#'s data is:
use an apply function to go row by row
then use a join on column B between inp_df and matrix_df, where matrix_df is somehow translated into a tuple of column name, distance and average distance.
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down with millions of rows in the input. I am specifically looking for the core logic to fetch the matches inside an iteration, since the number of columns in matrix_df varies per cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the values of column B are unique; the values of column A may or may not be unique.
Also, matrix_df's first column was unnamed, and I renamed it with the following code for readability, since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath, index_col=False)
dist_df.rename(columns={'Unnamed: 0': 'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
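Put together as a runnable sketch (the frames are rebuilt from the tables above; the matrix column headers are strings, which is what the str(int(x['A'])) lookup assumes):
import pandas as pd

inp_df = pd.DataFrame({'A': [100, 115, 145, 115],
                       'B': [200, 270, 255, 266],
                       'cell': [1, 1, 2, 1]})
matrix_df1 = pd.DataFrame({'B': [200, 270, 266],
                           '100': [7.5, 6.8, 58.0],
                           '115': [80.7, 53.0, 84.0],
                           '199': [67.8, 92.0, 31.0],
                           'avg_distance': [52.0, 50.0, 57.0]})
matrix_df2 = pd.DataFrame({'B': [255],
                           '145': [74.9], '121': [77.53], '166': [8.0],
                           'avg_distance': [53.47]})

# Step 1: stack the per-cell matrices and merge on the shared column B
out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df, on='B')

# Step 2: each row's distance lives in the column named after its A value
out_df['distance'] = out_df.apply(lambda x: x[str(int(x['A']))], axis=1)
print(out_df[['A', 'B', 'cell', 'distance', 'avg_distance']])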
