Defining a variable between two points with a datetime pandas series - python

I have a pandas dataframe, and I want to calculate a variable based on certain hours of the day. I already pulled the hours as integers out of the datetime series. When I write my conditional statements between two hours and execute my script, I get the warning "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
When I put in any() or all() in my script, the script runs but it doesn't calculate the value between the two hours. I just get back a value that is not in the conditions. Can anyone help me out?
Here is my code so far:
import pandas as pd
import numpy as np

METdata = pd.read_csv(r'C:\Schoolwork\GEOL 701s_HW1\MET_station\MET_Data_3.26_hourly.csv',
                      infer_datetime_format=True, na_values='',
                      header=[1], skiprows=[2, 3], index_col=[0])
hour = METdata.index.hour
NET_rad_Wm2 = np.array(METdata['NR_Wm2_Avg'])
Nr = NET_rad_Wm2 * 0.0036
g_day = Nr * 0.1
g_night = Nr * 0.5

def func(hour):
    if (hour > 8) and (hour < 17):
        return g_night
    else:
        return g_day

g = func(hour)

If you want a Series back, call apply instead of calling the function directly. Note that METdata.index.hour is an array-like index, not a Series, so wrap it first:
pd.Series(hour).apply(func)
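Alternatively, since g_day and g_night are arrays aligned element-wise with hour, you can vectorize the whole calculation with np.where and skip the Python-level function entirely (a sketch mirroring the logic of func above):
g = np.where((hour > 8) & (hour < 17), g_night, g_day)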

Related

I want to make a function if the common key is found in both dataframes

I have two dataframes, df1 and df2. Each has a column containing a product code and a product price. I want to check the difference between the prices in the two dataframes and store the result of the function I created in a new dataframe "df3" containing the product code and the final price. Here is my attempt:
Function to calculate the difference in the way I want:
def range_calc(z, y):
    final_price = pd.DataFrame(columns=["Market_price"])
    res = z - y
    abs_res = abs(res)
    if abs_res == 0:
        return z
    if z > y:
        final_price = (abs_res / z) * 100
    else:
        final_price = (abs_res / y) * 100
    return final_price
For loop I created to check the two df and use the function:
Last_df = pd.DataFrame(columns=["Product_number", "Market_Price"])
for i in df1["product_ID"]:
    for x in df2["product_code"]:
        if i == x:
            Last_df["Product_number"] = i
            Last_df["Market_Price"] = range_calc(df1["full_price"], df2["tot_price"])
The problem is that I am getting this error every time:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Why you got the error message "The truth value of a Series is ambiguous"
You got the error message "The truth value of a Series is ambiguous" because you passed a pandas.Series into an if-clause:
nums = pd.Series([1.11, 2.22, 3.33])
if nums == 0:
    print("nums == zero (nums is equal to zero)")
else:
    print("nums != zero (nums is not equal to zero)")
# AN EXCEPTION IS RAISED!
The Error Message is something like the following:
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
Somehow, a Series ended up inside your if-clause. I know how it happened, but it will take a moment to explain.
Suppose that you want the value in row 3, column 4 of a pandas dataframe.
If you extract a single value from a pandas table at a specific row and column, that value is sometimes a Series object, not a number.
Consider the following example:
# DATA:
# Name Age Location
# 0 Nik 31 Toronto
# 1 Kate 30 London
# 2 Evan 40 Kingston
# 3 Kyra 33 Hamilton
To create the dataframe above, we can write:
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
    'Age': [31, 30, 40, 33],
    'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
Now, let us try to get a specific row of data:
evans_row = df.loc[df['Name'] == 'Evan']
and we try to get a specific value out of that row of data:
evans_age = evans_row['Age']
You might think that evans_age is the integer 40, but you would be wrong.
Let us see what evans_age really is:
print(80*"*", "EVAN\'s AGE", type(Evans_age), sep="\n")
print(Evans_age)
We have:
EVAN's AGE
<class 'pandas.core.series.Series'>
2 40
Name: Age, dtype: int64
Evan's age is not a number: evans_age is an instance of pandas.Series.
After extracting a single cell from a pandas dataframe, you can write .tolist()[0] to pull the number out of that cell.
evans_real_age = evans_age.tolist()[0]
print(80*"*", "EVAN\'s REAL AGE", type(evans_real_age), sep="\n")
print(evans_real_age)
EVAN's REAL AGE
<class 'numpy.int64'>
40
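As an aside, pandas also provides Series.item() for this, which raises an error unless the Series holds exactly one element (a one-liner, reusing the evans_age Series from above):
evans_real_age = evans_age.item()   # 40, or ValueError if the Series has more than one element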
The exception in your original code was probably raised by if abs_res == 0.
If abs_res is a pandas.Series, then abs_res == 0 returns another Series.
There is no single truth value for an entire list of numbers compared to zero.
Normally, an if-clause receives exactly one value:
if 912:
    print("912 is True")
else:
    print("912 is False")
When an if-statement receives more than one value, then the python interpreter does not know what to do.
For example, what should the following do?
import pandas as pd

data = pd.Series([1, 565, 120, 12, 901])
if data:
    print("data is true")
else:
    print("data is false")
You should only pass one value into an if-condition; instead, you passed a pandas.Series object.
In your case, the pandas.Series only had one number in it. However, in general, a pandas.Series contains many values.
The authors of the pandas library assume that a Series contains many numbers, even if it only has one.
The computer thought that you tried to put many different numbers inside one single if-clause.
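If you genuinely need one truth value out of a whole Series, you must state how to reduce it, for example with any() or all() (a minimal sketch reusing the nums Series from above):
if (nums == 0).any():    # True if at least one element equals zero
    print("nums contains a zero")
if (nums == 0).all():    # True only if every element equals zero
    print("every element of nums is zero")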
The difference between a "function definition" and a "function call"
Your original question was,
"I want to make a function if the common key is found"
Your use of the phrase "make a function" is incorrect. You probably meant, "I want to call a function if a common key is found."
The following are all examples of function "calls":
import pandas as pd
import numpy as np
z = foo(1, 91)
result = funky_function(811, 22, "green eggs and ham")
output = do_stuff()
data = np.random.randn(6, 4)
df = pd.DataFrame(data, index=dates, columns=list("ABCD"))
Suppose that you have two containers.
If you truly want to "make" a function if a common key is found, then you would have code like the following:
dict1 = {'age': 26, 'phone':"303-873-9811"}
dict2 = {'name': "Bob", 'phone':"303-873-9811"}
def foo(dict1, dict2):
    shared = set(dict2.keys()).intersection(set(dict1.keys()))
    # if there is a shared key...
    if len(shared) > 0:
        # make (create) a new function
        def bar(*args, **kwargs):
            pass
        return bar

new_function = foo(dict1, dict2)
print(new_function)
In python, you "make" a function ("define" a function) with the def keyword.
Running an existing function, without using the def keyword, is known as a function call.
I think that your question should be re-titled.
You could write, "How do I call a function if two pandas dataframes have a common key?"
A second good question would be something like,
"What went wrong if we see the error message ValueError: The truth value of a Series is ambiguous?"
Your question was worded strangely, but I think I can answer it.
Generating Test Data
Your question did not include test data. If you ask a question on stack overflow again, please provide a small example of some test data.
The following is an example of data we can use:
product_ID full_price
0 prod_id 1-1-1-1 11.11
1 prod_id 2-2-2-2 22.22
2 prod_id 3-3-3-3 33.33
3 prod_id 4-4-4-4 44.44
4 prod_id 5-5-5-5 55.55
5 prod_id 6-6-6-6 66.66
6 prod_id 7-7-7-7 77.77
------------------------------------------------------------
product_code tot_price
0 prod_id 3-3-3-3 34.08
1 prod_id 4-4-4-4 45.19
2 prod_id 5-5-5-5 56.30
3 prod_id 6-6-6-6 67.41
4 prod_id 7-7-7-7 78.52
5 prod_id 8-8-8-8 89.63
6 prod_id 9-9-9-9 100.74
Products 1 and 2 are unique to data-frame 1
Products 8 and 9 are unique to data-frame 2
Both data-frames contain data for products 3, 4, 5, ..., 7.
The prices are slightly different between data-frames.
The test data above is generated by the following code:
import pandas as pd
from copy import copy

raw_data = [
    [
        "prod_id {}-{}-{}-{}".format(k, k, k, k),
        int("{}{}{}{}".format(k, k, k, k)) / 100
    ] for k in range(1, 10)
]
df1 = pd.DataFrame(data=copy(raw_data[:-2]), columns=["product_ID", "full_price"])
df2 = pd.DataFrame(data=copy(raw_data[2:]), columns=["product_code", "tot_price"])
for rowid in range(0, len(df2.index)):
    df2.at[rowid, "tot_price"] += 0.75
print(df1)
print(60*"-")
print(df2)
Add some error checking
It is considered best practice to make sure that your function inputs are in the correct format.
You wrote a function named range_calc(z, y). I recommend making sure that z and y are numbers, and not something else (such as a pandas Series object).
import inspect
import io

def range_calc(z, y):
    try:
        z = float(z)
        y = float(y)
    # float(a_string) raises ValueError; float(a_multi_element_series) raises TypeError
    except (TypeError, ValueError):
        function_name = inspect.stack()[0][3]
        with io.StringIO() as string_stream:
            print(
                "Error: In " + function_name + "(). Inputs should be like decimal numbers.",
                "Instead, we have: " + str(type(y)) + " '" + repr(str(y))[1:-1] + "'",
                file=string_stream,
                sep="\n"
            )
            err_msg = string_stream.getvalue()
        raise ValueError(err_msg)
    # DO STUFF
    return
Now we get error messages:
import pandas as pd

data = pd.Series([1, 565, 120, 12, 901])
range_calc(data, "I am supposed to be an integer")
# ValueError: Error: In range_calc(). Inputs should be like decimal numbers.
# Instead, we have: <class 'str'> 'I am supposed to be an integer'
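A lighter-weight alternative (my sketch, not part of the original answer) is an explicit type check up front:
import numbers

def range_calc(z, y):
    # numbers.Number matches ints, floats and NumPy scalar types
    if not isinstance(z, numbers.Number) or not isinstance(y, numbers.Number):
        raise TypeError("range_calc() expects two numbers, got {} and {}".format(type(z), type(y)))
    # DO STUFF
    return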
Code which Accomplishes what you Wanted.
The following is some rather ugly code which computes what you wanted:
# You can continue to use your original `range_calc()` function unmodified
# Use the test data I provided earlier in this answer.
def foo(df1, df2):
    last_df = pd.DataFrame(columns=["Product_number", "Market_Price"])
    df1_ids = set(df1["product_ID"].tolist())
    df2_ids = set(df2["product_code"].tolist())
    pids = df1_ids.intersection(df2_ids)  # common product ids
    for pid in pids:
        row1 = df1.loc[df1["product_ID"] == pid]
        row2 = df2.loc[df2["product_code"] == pid]
        price1 = row1["full_price"].tolist()[0]
        price2 = row2["tot_price"].tolist()[0]
        price3 = range_calc(price1, price2)
        row3 = pd.DataFrame([[pid, price3]], columns=["Product_number", "Market_Price"])
        last_df = pd.concat([last_df, row3])
    return last_df

# ---------------------------------------
last_df = foo(df1, df2)
The result is:
Product_number Market_Price
0 prod_id 6-6-6-6 1.112595
0 prod_id 7-7-7-7 0.955171
0 prod_id 4-4-4-4 1.659659
0 prod_id 5-5-5-5 1.332149
0 prod_id 3-3-3-3 2.200704
Note that one of the many reasons my solution is ugly is the following line of code:
last_df = pd.concat([last_df, row3])
If last_df is large (thousands of rows), the code will run very slowly, because instead of inserting a new row of data, we:
copy the original dataframe,
append a new row of data to the copy, and
delete/destroy the original dataframe.
It is really silly to copy 10,000 rows of data only to add one new value, and then delete the old 10,000 rows.
However, my solution has fewer bugs than your original code, relatively speaking.
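One common fix (a sketch reusing range_calc() and the column names from above; the name foo_faster is mine) is to accumulate plain Python rows in a list and build the dataframe once at the end:
def foo_faster(df1, df2):
    pids = set(df1["product_ID"]) & set(df2["product_code"])
    rows = []  # appending to a plain list is cheap
    for pid in pids:
        price1 = df1.loc[df1["product_ID"] == pid, "full_price"].iloc[0]
        price2 = df2.loc[df2["product_code"] == pid, "tot_price"].iloc[0]
        rows.append([pid, range_calc(price1, price2)])
    # one DataFrame construction instead of one concat per row
    return pd.DataFrame(rows, columns=["Product_number", "Market_Price"])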
Sometimes when you check a condition on a Series or DataFrame, your output is itself a Series of booleans rather than a single True/False.
In this case you must use any(), all(), item(), and so on.
Use the print function on your condition to see the Series.
I must also point out that your code is very slow: the nested loops are O(n**2). You can instead build df3 by joining df1 and df2, then use the apply method for a fast calculation.
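For example (a sketch, assuming the column names from the question):
df3 = df1.merge(df2, left_on="product_ID", right_on="product_code")
df3["Market_Price"] = df3.apply(lambda r: range_calc(r["full_price"], r["tot_price"]), axis=1)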

Apply row wise conditional function on dataframe python

I have a dataframe in which I want to execute a function that checks if the current value is a relative maximum, i.e. whether the previous 'n' values are lower than the current value.
Having a dataframe 'df_data':
temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99, 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df_data = pd.DataFrame(temp_list, columns=['high'])
First I create a function that will check the previous conditions:
def get_max(high, rolling_max, prev, post):
    if (high > prev) & (high > post) & (high > rolling_max):
        return 1
    else:
        return 0

df_data['rolling_max'] = df_data.high.rolling(n).max().shift()
Then I apply previous condition row wise:
df_data['ismax'] = df_data.apply(lambda x: get_max(df_data['high'], df_data['rolling_max'],df_data['high'].shift(1),df_data['high'].shift(-1)),axis = 1)
The problem is that I have always get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This comes from applying the boolean condition in the get_max function to a Series.
I would love to have a vectorized solution, not one using loops.
Try:
df_data['ismax'] = ((df_data['high'].gt(df_data.high.rolling(n).max().shift())) & (df_data['high'].gt(df_data['high'].shift(1))) & (df_data['high'].gt(df_data['high'].shift(-1)))).astype(int)
The error is occurring because you are sending the entire series (entire column) to your get_max function rather than applying it row-wise. Creating new columns for the shifted "prev" and "post" values and then using df.apply(func, axis=1) works fine here.
As you have hinted at, this solution is quite inefficient and looping through every row will become much slower as your dataframe increases in size.
On my computer, the code below reports:
LIST_MULTIPLIER = 1, Vectorised code: 0.29s, Row-wise code: 0.38s
LIST_MULTIPLIER = 100, Vectorised code: 0.31s, Row-wise code = 13.27s
In general therefore it is best to avoid using df.apply(..., axis = 1) as you can almost always get a better solution using logical operators.
import pandas as pd
from datetime import datetime
LIST_MULTIPLIER = 100
ITERATIONS = 100
def get_dataframe():
    temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63,
                 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99,
                 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1] * LIST_MULTIPLIER
    df = pd.DataFrame(temp_list)
    df.columns = ['high']
    return df
df_original = get_dataframe()
t1 = datetime.now()
for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    mask_prev = df['high'] > df['high_prev']
    mask_post = df['high'] > df['high_post']
    mask_rolling = df['high'] > df['rolling_max']
    mask_max = mask_prev & mask_post & mask_rolling
    df['ismax'] = 0
    df.loc[mask_max, 'ismax'] = 1
t2 = datetime.now()
print(f"{t2 - t1}")
df_first_method = df.copy()
t3 = datetime.now()
def get_max_rowwise(row):
    if ((row.high > row.high_prev) &
            (row.high > row.high_post) &
            (row.high > row.rolling_max)):
        return 1
    else:
        return 0

for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    df['ismax'] = df.apply(get_max_rowwise, axis=1)
t4 = datetime.now()
print(f"{t4 - t3}")
df_second_method = df.copy()

Creating an array of timestamps between two timestamps in pyspark

I have two timestamp columns in my pyspark dataframe. I want to create a third column which has the array of timestamp hours between the two timestamps.
This is the code I wrote for that..
# Creating udf function
from datetime import timedelta
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DateType

def getBetweenStamps(st_date, dc_date):
    hr = 0
    date_list = []
    running_date = st_date
    while dc_date > running_date:
        running_date = st_date + timedelta(hours=hr)
        date_list.append(running_date)
        hr += 1
    dates = np.array(date_list)
    return dates

udf_betweens = F.udf(getBetweenStamps, ArrayType(DateType()))

# Using udf function
orders.withColumn('date_array', udf_betweens(F.col('start_date'), F.col('ICUDischargeDate'))).show()
However this is showing the error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I think the inputs to the function are going in as two arrays, not as two datetimes, which causes the error. Is there any way around this? Any other way of solving this problem?
Thank you very much.
You are getting the error because you return a numpy array from your udf. You can simply return date_list and it will work.
def getBetweenStamps(st_date, dc_date):
    hr = 0
    date_list = []
    running_date = st_date
    while dc_date > running_date:
        running_date = st_date + timedelta(hours=hr)
        date_list.append(running_date)
        hr += 1
    return date_list

udf_betweens = F.udf(getBetweenStamps, ArrayType(DateType()))
To test the above function:
from pyspark.sql.functions import col, expr
df = spark.sql("select current_timestamp() as t1").withColumn("t2", col("t1") + expr("INTERVAL 1 DAYS"))
df.withColumn('date_array', udf_betweens(F.col('t1'), F.col('t2'))).show()
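If you are on Spark 2.4 or later, you can likely avoid the Python udf entirely with the built-in sequence function (a sketch using the t1/t2 columns from the test above):
from pyspark.sql import functions as F
df.withColumn("date_array", F.expr("sequence(t1, t2, interval 1 hour)")).show()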

Trouble obtaining counts using multiple datetime columns as conditionals

I am attempting to collect counts of occurrences of an id between two time periods in a dataframe. I have a moderately sized dataframe (about 400 unique ids and just short of 1m rows) containing a time of occurrence and an id for the account which caused the occurrence. I am attempting to get a count of occurrences for multiple time periods (1 hour, 6 hours, 1 day, etc.) prior to a specific occurrence and have run into lots of difficulties.
I am using Python 3.7, and for this instance I only have the pandas package loaded. I have tried using for loops and while it likely would have worked (eventually), I am looking for something a bit more efficient time-wise. I have also tried using list comprehension and have run into some errors that I did not anticipate when dealing with datetimes columns. Examples of both are below.
## Sample data
data = {'id':[ 'EAED813857474821E1A61F588FABA345', 'D528C270B80F11E284931A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '7B9C7C02F19711E38C670EDFB82A24A9', '80B409D1EC3D4CC483239D15AAE39F2E', '314EB192F25F11E3B68A0EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', '156097CF030E4519DBDF84419B855E10', 'EE80E4C0B82B11E28C561A7D66640965', 'CA9F2DF6B82011E28C561A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '314EB192F25F11E3B68A0EDFB82A24A9', 'D528C270B80F11E284931A7D66640965', '3A024345C1E94CED8C7E0DA3A96BBDCA', '314EB192F25F11E3B68A0EDFB82A24A9', '47C18B6B38E540508561A9DD52FD0B79', 'B72F6EA5565B49BBEDE0E66B737A8E6B', '47C18B6B38E540508561A9DD52FD0B79', 'B92CB51EFA2611E2AEEF1A7D66640965', '136EDF0536F644E0ADE6F25BB293DD17', '7B9C7C02F19711E38C670EDFB82A24A9', 'C5FAF9ACB88D4B55AB8196DBFFE5B3C0', '1557D4ECEFA74B40C718A4E5425F3ACB', '68D30EE473FE11E49C060EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', 'CAF9D8CD627B422DFE1D587D25FC4035', 'C620D865AEE1412E9F3CA64CB86DC484', '47C18B6B38E540508561A9DD52FD0B79', 'CA9F2DF6B82011E28C561A7D66640965', '06E2501CB81811E290EF1A7D66640965', '68EEE17873FE11E4B5B90AFEF9534BE1', '47C18B6B38E540508561A9DD52FD0B79', '1BFE9CB25AD84B64CC2D04EF94237749', '7B20C2BEB82811E28C561A7D66640965', '261692EA8EE447AEF3804836E4404620', '74D7C3901F234993B4788EFA9E6BEE9E', 'CAF9D8CD627B422DFE1D587D25FC4035', '76AAF82EB8C511E2A76C1A7D66640965', '4BD38D6D44084681AFE13C146542A565', 'B8D27E80B82911E28C561A7D66640965' ], 'datetime':[ "24/06/2018 19:56", "24/05/2018 03:45", "12/01/2019 14:36", "18/08/2018 22:42", "19/11/2018 15:43", "08/07/2017 21:32", "15/05/2017 14:00", "25/03/2019 22:12", "27/02/2018 01:59", "26/05/2019 21:50", "11/02/2017 01:33", "19/11/2017 19:17", "04/04/2019 13:46", "08/05/2019 14:12", "11/02/2018 02:00", "07/04/2018 16:15", "29/10/2016 20:17", "17/11/2018 21:58", "12/05/2017 16:39", "28/01/2016 19:00", "24/02/2019 19:55", "13/06/2019 19:24", "30/09/2016 18:02", "14/07/2018 17:59", "06/04/2018 22:19", "25/08/2017 17:51", "07/04/2019 02:24", "26/05/2018 17:41", "27/08/2014 06:45", "15/07/2016 19:30", "30/10/2016 20:08", "15/09/2018 18:45", "29/01/2018 02:13", "10/09/2014 23:10", "11/05/2017 22:00", "31/05/2019 23:58", "19/02/2019 02:34", "02/02/2019 01:02", "27/04/2018 04:00", "29/11/2017 20:35"]}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True)  # parse the DD/MM/YYYY strings
df = df.sort_values(['id', 'datetime'], ascending=True)
# for loop attempt
from datetime import timedelta

totalAccounts = df['id'].unique()
for account in totalAccounts:
    oneHourCount = 0
    subset = df[df['id'] == account]
    for i in range(len(subset)):
        onehour = subset['datetime'].iloc[i] - timedelta(hours=1)
        for j in range(len(subset)):
            # 'sub' in the original was presumably the current occurrence time
            if (subset['datetime'].iloc[j] >= onehour) and (subset['datetime'].iloc[j] < subset['datetime'].iloc[i]):
                oneHourCount += 1
#list comprehension attempt
df['onehour'] = df['datetime'] - timedelta(hours=1)
for account in totalAccounts:
    onehour = sum([1 for x in subset['datetime'] if x >= subset['onehour'] and x < subset['datetime']])
I am getting either 1) an incredibly long runtime with the for loop or 2) a ValueError regarding the truth of a series being ambiguous. I know the issue is dealing with the datetimes, and perhaps it is just going to be slow-going, but I want to check here first just to make sure.
So I was able to figure this out using bisection. If you have a similar question please PM me and I'd be more than happy to help.
Solution:
from bisect import bisect_left, bisect_right
# keys is the sorted list of occurrence times for the account
left = bisect_left(keys, subset['start_time'].iloc[i])    ## calculated time
right = bisect_right(keys, subset['datetime'].iloc[i])    ## actual time of occurrence
count = len(subset['datetime'][left:right])
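For completeness, here is a self-contained sketch of the same idea using numpy.searchsorted on the test data from the question; the column name one_hour_count is illustrative, not from the question:
import numpy as np
import pandas as pd

df["datetime"] = pd.to_datetime(df["datetime"], dayfirst=True)
df = df.sort_values(["id", "datetime"])

def count_prior(group, window=np.timedelta64(1, "h")):
    times = group["datetime"].to_numpy()
    # for each occurrence at time t, count events in [t - window, t)
    left = np.searchsorted(times, times - window, side="left")
    right = np.searchsorted(times, times, side="left")
    return pd.Series(right - left, index=group.index)

df["one_hour_count"] = df.groupby("id", group_keys=False).apply(count_prior)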

How to apply a function in Pandas to a cell in every row where a different cell in that same row meets a condition?

I am trying to use the pandas string method "str.zfill" to add leading zeros to a cell in the same column for every row in the dataframe where another cell in that row meets a certain condition. So for any given row in my DataFrame "excodes", when the value in column "LOB_SVC_CD" is "MTG", apply the str.zfill(5) method to the cell in column "PS_CD". When the value in "LOB_SVC_CD" is not "MTG" leave the value in "PS_CD" as is.
I've tried a few custom functions, "np.where" and a few apply/map lambdas. I'm getting errors on all of them.
#Custom Function
def add_zero(column):
    if excodes.loc[excodes.LOB_SVC_CD == 'MTG']:
        excodes.PS_CD.str.zfill(5)
    else:
        return excodes.PS_CD

excodes['code'] = excodes.apply(add_zero)
#Custom Function with For Loop
def add_zero2(column):
    code = []
    for row(i) in column:
        if excodes.LOB_SVC_CD == 'MTG':
            code.append(excodes.PS_CD.str.zfill(5))
        else:
            code.append(excodes.PS_CD)
    excodes['Code'] = code

excodes['code'] = excodes.apply(add_zero)
#np.Where
mask = excodes[excodes.LOB_SVC_CD == 'MTG']
excodes['code'] = pd.DataFrame[np.where(mask, excodes.PS_CD.str.zfill(5), excodes.PS_CD)]
#Lambda
excodes['code'] = excodes['LOB_SVC_CD'].map(lambda x: excodes.PS_CD.str.zfill(5)) if x[excodes.LOB_SVC_CD == 'MTG'] else excodes.PS_CD)
#Assign with a "Where"
excodes.assign((excodes.PS_CD.str.zfill(5)).where(excodes.LOB_SVC_CD == 'MTG'))
Expected results will be either:
a new column called "code" in which all values from "PS_CD" are given leading zeroes in rows where excodes.LOB_SVC_CD == 'MTG', or
leading zeroes added to the values in excodes["PS_CD"] when the row's excodes['LOB_SVC_CD'] == 'MTG'.
Error Messages I'm getting are - on each of the approaches I've tried:
#Custom Function:
"ValueError: ('The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index PS_CD')"
# Custom Function with For Loop:
"SyntaxError: can't assign to function call"
#np.Where:
"ValueError: operands could not be broadcast together with shapes (152,7) (720,) (720,)"
#Apply Lambda:
"string indices must be integers"
#Assign with a "Where":
"TypeError: assign() takes 1 positional argument but 2 were given"
This seems to work :)
# Ensure the data in the PS_CD are strings
data["PS_CD"] = data["PS_CD"].astype(str)

# Iterate over all rows
for index in data.index:
    # If the LOB_SVC_CD is "MTG"
    if data.loc[index, "LOB_SVC_CD"] == "MTG":
        # Apply the zfill(5) in the PS_CD on the same row (index)
        data.loc[index, "PS_CD"] = data.loc[index, "PS_CD"].zfill(5)

# Print the result
print(data)
Alternative way (maybe a bit more Python-ish) :)
# Ensure the data in the PS_CD are strings
data["PS_CD"] = data["PS_CD"].astype(str)
# Custom function for applying the zfill
def my_zfill(x, y):
    return y.zfill(5) if x == "MTG" else y

# Iterate over the data applying the custom function on each row
data["PS_CD"] = pd.Series([my_zfill(x, y) for x, y in zip(data["LOB_SVC_CD"], data["PS_CD"])])
My take:
>>> import pandas
>>> df = pandas.DataFrame(data = [['123', 'MTG'],['321', 'CLOC']], columns = ['PS_CD', 'LOB_SVC_CD'])
>>> df
PS_CD LOB_SVC_CD
0 123 MTG
1 321 CLOC
>>>
>>> df['PS_CD'] = df.apply(lambda row: row['PS_CD'].zfill(5) if row['LOB_SVC_CD'] == 'MTG' else row['PS_CD'], axis='columns')
>>> df
PS_CD LOB_SVC_CD
0 00123 MTG
1 321 CLOC
Using a lambda this way returns a value for every row: the zfilled PS_CD if LOB_SVC_CD was MTG, else the original PS_CD.
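A fully vectorized alternative (a sketch, assuming PS_CD already holds strings) avoids apply altogether:
mask = excodes["LOB_SVC_CD"] == "MTG"
excodes["code"] = excodes["PS_CD"].str.zfill(5).where(mask, excodes["PS_CD"])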
