I have a Flask app where I receive data and transform it into a pandas DataFrame.
if request.method == 'PUT':
    content = request.get_json(silent=True)
    df = pd.DataFrame.from_dict(content)
    for index, row in df.iterrows():
        if row["label"] == True:
            row['A'] = row['B'] / row['C']
        elif row["label"] == False:
            row['A'] = row["B"]
        if row['D'] == 0:
            row['C'] = 0
        else:
            ...
What I am trying to do here is simple arithmetic: addition, subtraction and division.
I used iterrows() mainly because I needed to iterate over multiple values and perform calculations on specific row values; df['..'].item() didn't work in my use case.
Addition and subtraction work fine, but division somehow goes wrong and always returns values like 0, -1 or 1.
Example calculation
row['A'] = row['B'] / row['C']
Most of the time the value of row['B'] is smaller than row['C']. Example values:
row['A'] = 1232455 / 26719856
The only calculation involved in the app are addition, subtraction & division.
You can try this (here is an example):
import pandas as pd
import numpy as np
data = {'label': [True, False, True, True, False],
'A': [2012, 2012, 2013, 2014, 2014],
'B': [4, 24, 31, 21, 3],
'C': [25, 94, 57, 62, 70],
'D': [3645, 0, 27, 24, 96]}
df = pd.DataFrame(data)
You can apply your changes directly to your main DataFrame, without having to iterate over each row every time, like this:
# select only rows with label == True and apply the division
df.loc[df.label == True, 'A'] = df['B'] / df['C']
# rows with label == False just take the value of column B
df.loc[df.label == False, 'A'] = df['B']
# rows with label == False and D == 0 get C set to 0
df.loc[np.logical_and(df.label == False, df.D == 0), 'C'] = 0
Each time, you can select the rows you want to change and apply the changes directly to them, just like I did above.
Another point: after applying the division in my example, the integer columns are converted to float64. In your example you can try series.astype('float64'): for row['A'] = 1232455 / 26719856 you will then get 0.046125 and not just the integer part 0. That may save you from getting zeros every time you do a division.
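For instance, here is a minimal sketch of that dtype fix on the example df above (same column names as my example):

# make A a float column first, so the division result keeps its fractional part
df['A'] = df['A'].astype('float64')
df.loc[df.label == True, 'A'] = df['B'] / df['C']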
*Edited: added a random DataFrame generator.
I have 2 dfs, one used as a mask for the other.
rndm = pd.DataFrame(np.random.randint(0,15,size=(100, 4)), columns=list('ABCD'))
rndm_mask = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
I want to use 2 conditions to change the values in rndm:
Is the value the mode of the column?
rndm_mask == 1
What works so far:
def colorBoolean(val):
    return f'background-color: {"red" if val else ""}'

rndm.style.apply(lambda _: rndm_mask.applymap(colorBoolean), axis=None)
# helper function to find the mode
def highlightMode(s):
    # get the mode of the column
    mode_ = s.mode().values
    # apply the style if the current value is in the mode_ array
    return ['background-color: yellow' if v in mode_ else '' for v in s]
Issue:
I'm unsure how to chain both functions so that values in rndm are highlighted only if they match both criteria (i.e. the value must be the most frequent value in its column and also be True in rndm_mask).
I appreciate any advice! Thanks
Try this: since your df_bool dataframe is a mask (identically indexed), you can refer to the df_bool object inside the style function, where x.name is the name of the column passed in via df.apply:
df = pd.DataFrame({'A':[5.5, 3, 0, 3, 1],
'B':[2, 1, 0.2, 4, 5],
'C':[3, 1, 3.5, 6, 0]})
df_bool = pd.DataFrame({'A':[0, 1, 0, 0, 1],
'B':[0, 0, 1, 0, 0],
'C':[1, 1, 1, 0, 0]})
# I want to use 2 conditions to change the values in df:
# Is the value the mode of the column?
# df_bool == 1
# What works so far:
def colorBoolean(x):
    return ['background-color: red' if v else '' for v in df_bool[x.name]]
# helper function to find the mode
def highlightMode(s):
    # get the mode of the column
    mode_ = s.mode().values
    # apply the style if the current value is in the mode_ array
    return ['background-color: yellow' if v in mode_ else '' for v in s]
df.style.apply(colorBoolean).apply(highlightMode)
Output: (styled table screenshot omitted)
Or the other way:
df.style.apply(highlightMode).apply(colorBoolean)
Output: (styled table screenshot omitted)
Update
Highlight where both are true:
def highlightMode(s):
    # get the mode of the column
    mode_ = s.mode().values
    # style only where the value is in mode_ AND the mask is truthy
    return ['background-color: yellow' if (v in mode_) and b else ''
            for v, b in zip(s, df_bool[s.name])]
df.style.apply(highlightMode)
Output: (styled table screenshot omitted)
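For completeness, here is a sketch of an alternative (assuming the same df and df_bool as above) that builds the whole style DataFrame at once with axis=None instead of going column by column:

import numpy as np
import pandas as pd

def highlightBoth(d):
    # True where a value equals one of its column's modes
    is_mode = d.apply(lambda s: s.isin(s.mode()))
    # require the boolean mask to be set as well
    both = is_mode & df_bool.astype(bool)
    # axis=None expects a full DataFrame of CSS strings back
    return pd.DataFrame(np.where(both, 'background-color: yellow', ''),
                        index=d.index, columns=d.columns)

df.style.apply(highlightBoth, axis=None)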
Given an input dataframe and string:
df = pd.DataFrame({"A" : [4, 2, 10], "B" : [1, 4, 3]})
colour = "green" #or "red", "blue" etc.
I want to add a new column df["C"] conditional on the values in df["A"], df["B"] and colour so it looks like:
df = pd.DataFrame({"A" : [4, 2, 10], "B" : [1, 4, 3], "C" : [True, True, False]})
So far, I have a function that works for just the input values alone:
def check_passing(colour, A, B):
    if colour == "red":
        if B < 5:
            return True
        else:
            return False
    if colour == "blue":
        if B < 10:
            return True
        else:
            return False
    if colour == "green":
        if B < 5:
            if A < 5:
                return True
            else:
                return False
        else:
            return False
How would you go about using this function in df.assign() so that it calculates this for each row? Specifically, how do you pass each column to check_passing()?
df.assign() allows you to refer to the columns directly or in a lambda, but doesn't work within a function as you're passing in the entire column:
df = df.assign(C = check_passing(colour, df["A"], df["B"]))
Is there a way to avoid a long and incomprehensible lambda? Open to any other approaches or suggestions!
Applying a function like that can be inefficient, especially when dealing with dataframes with many rows. Here is a one-liner:
colour = "green" #or "red", "blue" etc.
df['C'] = (((colour == 'red') & df['B'].lt(5))
           | ((colour == 'blue') & df['B'].lt(10))
           | ((colour == 'green') & df['B'].lt(5) & df['A'].lt(5)))
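If you would rather reuse check_passing unchanged, a row-wise apply inside assign also works; it is slower on large frames, but it directly answers the question of how to pass each column to the function:

# call check_passing once per row, passing the scalar values of A and B
df = df.assign(C=df.apply(lambda row: check_passing(colour, row['A'], row['B']), axis=1))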
I have 2 data frames, df_ts and df_cmexport. I am trying to get the index of placement id in df_cmexport for the placements in df_ts
For an illustration, refer to the linked Excel file.
Once I have the indices of those placement ids as a list, I iterate through them using for j in list_pe_ts_1: to look up values at index 'j', such as df_cmexport['p_start_year'][j].
My code below returns an empty list for some reason: print(list_pe_ts_1) prints [].
I think something is wrong with list_pe_ts_1 = df_cmexport.index[df_cmexport['Placement ID'] == pid_1].tolist(), as this returns an empty list of length 0.
I even tried list_pe_ts_1 = df_cmexport.loc[df_cmexport.isin([pid_1]).any(axis=1)].index, but that still gives an empty list.
Help is always appreciated :) Cheers to you all #stackoverflow
pids_p_1 = set()  # pids whose start dates match
pids_f_1 = set()  # pids whose start dates mismatch
for i in range(0, len(df_ts)):
    pid_1 = df_ts['PLACEMENT ID'][i]
    print('for pid ', pid_1)
    list_pe_ts_1 = df_cmexport.index[df_cmexport['Placement ID'] == pid_1].tolist()
    print('len of list', len(list_pe_ts_1))
    ts_p_start_year_for_pid = df_ts['p_start_year'][i]
    ts_p_start_month_for_pid = df_ts['p_start_month'][i]
    ts_p_start_day_for_pid = df_ts['p_start_date'][i]
    print('\np_start_full_date_ts for :', pid_1, 'y:', ts_p_start_year_for_pid,
          'm:', ts_p_start_month_for_pid, 'd:', ts_p_start_day_for_pid)
    print(list_pe_ts_1)
    for j in list_pe_ts_1:
        export_p_start_year_for_pid = df_cmexport['p_start_year'][j]
        export_p_start_month_for_pid = df_cmexport['p_start_month'][j]
        export_p_start_day_for_pid = df_cmexport['p_start_date'][j]
        print('\np_start_full_date_export for ', pid_1, "at row(", j, ") :",
              export_p_start_year_for_pid, export_p_start_month_for_pid,
              export_p_start_day_for_pid)
        if (ts_p_start_year_for_pid == export_p_start_year_for_pid) and (
                ts_p_start_month_for_pid == export_p_start_month_for_pid) and (
                ts_p_start_day_for_pid == export_p_start_day_for_pid):
            pids_p_1.add(pid_1)
        else:
            pids_f_1.add(pid_1)
With the below snippet you can get a list of the matching index fields from the second dataframe.
import pandas as pd

df_ts = pd.DataFrame(data={'index in df': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                           'pid': [1, 1, 2, 2, 3, 3, 3, 4, 6, 8, 8, 9, 9]})
df_cmexport = pd.DataFrame(data={'index in df': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                                                 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
                                 'pid': [1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4,
                                         4, 5, 5, 6, 7, 8, 8, 9, 9, 9]})
Create a new dataframe by merging the two:
result = pd.merge(df_ts, df_cmexport, on=["pid"], how='left', indicator=True, sort=True)
Then identify the unique values in the "index in df_y" column:
index_list = result["index in df_y"].unique()
The result you get:
index_list
Out[9]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 17, 18, 19,
20], dtype=int64)
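As a side note (not part of the approach above, just a sketch on the same toy frames): if you need the matching indices grouped per pid rather than as one flat array, a groupby gives you a lookup dict:

# map each pid to the list of matching 'index in df' values in df_cmexport
idx_by_pid = df_cmexport.groupby('pid')['index in df'].apply(list).to_dict()
idx_by_pid[3]  # -> [4, 5, 6, 7, 8]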
I want to fill missing values like this:
data = pd.read_csv("E:\\SPEED.csv")
Case 1: if fclass is "motorway", "motorway_link", "trunk" or "trunk_link", I want to replace the "nan" values with 110.
Case 2: if fclass is "primary", "primary_link", "secondary" or "secondary_link", I want to replace the "nan" values with 70.
Case 3: if fclass is any other value, I want to replace them with 40.
I would be grateful for any help.
Two ways in pandas:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4],
        "B": [1, 4, 9, np.nan],
        "C": [1, 2, 3, 5],
        "D": list("abcd"),
    }
)
fillna lets you fill NAs (or NaNs) with, for example, a fixed value:
df['B'].fillna(12)
[1,4,9,12]
interpolate fills NaNs using interpolation -- linear by default (scipy is needed for the fancier methods):
df = df.interpolate()
df['A']
[1, 2, 3, 4]
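One more option worth knowing (a sketch, not something used above): fillna also accepts a dict of per-column fill values, so each column can get its own default in a single call:

# fill column A's NaNs with 0 and column B's with 12, in one call
df = pd.DataFrame({"A": [1, 2, np.nan, 4], "B": [1, 4, 9, np.nan]})
df.fillna({'A': 0, 'B': 12})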
Thank you all for your answers. However, as there are 6812 rows and 16 columns (containing nan values) in the data, it seems that different solutions are required.
You can try this
import pandas as pd
import math

def valuesMapper(data, valuesDict, columns_to_update):
    # look up the fill value from the row's fclass (40 for any other class),
    # and only touch cells whose current value is NaN
    for col in columns_to_update:
        data[col] = data.apply(
            lambda row: valuesDict.get(row['fclass'], 40) if math.isnan(row[col]) else row[col],
            axis=1)
    return data
data = pd.read_csv("E:\\SPEED.csv")
valuesDict = {"motorway": 110, "motorway_link": 110, "trunk": 110, "trunk_link": 110,
              "primary": 70, "primary_link": 70, "secondary": 70, "secondary_link": 70}
# columns_to_update is the list of speed columns to be updated; you can build it
# from your data (not done here since I don't have your file)
columns_to_update = ['AGU_PZR_07_10']
print(valuesMapper(data, valuesDict, columns_to_update))
With the below example:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'fclass': ['a', 'b', 'c', 'a'],
    'AGU': [float('nan'), float('nan'), float('nan'), 9]
})
You can update it using numpy conditionals, iterating over your columns starting from the 2nd ([1:]) here (in your data, from the 5th, i.e. [4:]):
for column in data.columns[1:]:
    data[column] = np.where((data['fclass'] == 'b') & (data[column].isna()), 110, data[column])
Or pandas apply:
data['AGU'] = data.apply(
    lambda row: 110 if np.isnan(row['AGU']) and row['fclass'] in ("b", "a") else row['AGU'],
    axis=1,
)
where you can replace ("b", "a") with, e.g., ("motorway", "motorway_link", "trunk", "trunk_link").
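For the actual cases in the question, a vectorized np.select sketch may also help (using the toy data above; swap 'AGU' for your real speed columns and the letters for your real fclass values):

motorway = data['fclass'].isin(['motorway', 'motorway_link', 'trunk', 'trunk_link'])
primary = data['fclass'].isin(['primary', 'primary_link', 'secondary', 'secondary_link'])
# choose 110 / 70 / 40 depending on the road class
fill = pd.Series(np.select([motorway, primary], [110, 70], default=40), index=data.index)
# fill only the missing speed values
data['AGU'] = data['AGU'].fillna(fill)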
I have a multiindexed pandas dataframe sort of like this:
import numpy as np
import pandas as pd

data = np.random.random((1800, 9))
col = pd.MultiIndex.from_product([('A', 'B', 'C'), ('a', 'b', 'c')])
year = range(2006, 2011)
month = range(1, 13)
day = range(1, 31)
idx = pd.MultiIndex.from_product([year, month, day], names=['Year', 'Month', 'Day'])
df1 = pd.DataFrame(data, idx, col)
Which has multiindexed rows of Year, Month, Day. I want to be able to select rows from this Dataframe as if it were one that has a DatetimeIndex.
The equivalent DataFrame with a DatetimeIndex would be:
idx = pd.date_range(start='2006-01-01', end='2010-12-31', freq='d')
timeidx = [ix for ix in idx if ix.day < 29]
df2 = pd.DataFrame(data, timeidx, col)
What I would like is this:
all(df2.loc['2006-06-06':'2008-10-11'] == df1.<insert expression here>)
to equal True
I know I can select cross-sections via df1.xs('2006', level='Year'), but I basically need an easy way to replicate what was done for df2 as I am forced to use this index as opposed to the DatetimeIndex.
One issue you'll immediately have by storing these as strings is '2' > '10', which is almost certainly not what you want, so I recommend using ints. That is:
year = range(2006,2011)
month = range(1,13)
day = range(1,31)
I thought that you ought to be able to use pd.IndexSlice here; my first thought was to use it as follows:
In [11]: idx = pd.IndexSlice
In [12]: df1.loc[idx[2006:2008, 6:10, 6:11], :]
...
but this selects the cross product of years 2006-2008, months June-October and days 6th-11th (i.e. 3*5*6 = 90 days), not the contiguous date range.
So here's a non-vectorized way, just compare the tuples:
In [21]: df1.index.map(lambda x: (2006, 6, 6) < x < (2008, 10, 11))
Out[21]: array([False, False, False, ..., False, False, False], dtype=bool)
In [22]: df1[df1.index.map(lambda x: (2006, 6, 6) < x < (2008, 10, 11))]
# just the (844) rows you want
If this was unbearably slow, a trick (to vectorize) would be to use some float representation, for example:
In [31]: df1.index.get_level_values(0).values + df1.index.get_level_values(1).values * 1e-3 + df1.index.get_level_values(2).values * 1e-6
Out[31]:
array([ 2006.001001, 2006.001002, 2006.001003, ..., 2010.012028,
2010.012029, 2010.01203 ])
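Following that idea, here is a minimal sketch of the full vectorized selection (same df1 as above; the boundary constants encode (year, month, day) the same way):

# build a sortable scalar key: year + month/1e3 + day/1e6
key = (df1.index.get_level_values(0).values
       + df1.index.get_level_values(1).values * 1e-3
       + df1.index.get_level_values(2).values * 1e-6)
# half-step (5e-7) guards make the strict bounds robust to float rounding
mask = (key > 2006.006006 + 5e-7) & (key < 2008.010011 - 5e-7)
subset = df1[mask]  # the same 844 rows as the tuple comparison above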