Pandas - Evaluate a condition for each row of a series - python

I have a dataset like this:
Policy | Customer | Employee | CoverageDate | LapseDate
123    | 1234     | 1234     | 2011-06-01   | 2015-12-31
124    | 1234     | 1234     | 2016-01-01   | ?
125    | 1234     | 1234     | 2011-06-01   | 2012-01-01
124    | 5678     | 5555     | 2014-01-01   | ?
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
Policy | Customer | Employee
123 | 1234 | 1234
because policy 123's lapse date was within 5 days of policy 124's covered date.
So far, I've used this code:
import pandas
import datetime

# Pull in data from query
wd = pandas.read_csv('DATA')
wd = wd.set_index('Policy#')
wd = wd.rename(columns={'Policy#': 'Policy'})

Resultlist = []
for EMPID in wd.groupby(['EMPID', 'Customer']):
    for Policy in wd.groupby(['EMPID', 'Customer']):
        EffDate = pandas.to_datetime(wd['CoverageEffDate'])
        for Policy in wd.groupby(['EMPID', 'Customer']):
            check = wd['LapseDate'].astype(str)
            if check.any() == '?':  # here lies the problem - it's evaluating if ANY of the items == '?'
                print(check)
                continue
            else:
                LapseDate = pandas.to_datetime(wd['LapseDate']) + datetime.timedelta(days=5)
                if EffDate < LapseDate:
                    Resultlist.append(wd['Policy', 'Customer'])
print(Resultlist)
I'm trying to use the pandas .any() function to evaluate if the current row is a '?' (which means null data, i.e. the policy hasn't lapsed). However, it appears that this statement just evaluates if there is a '?' row in the entire column, not the current row. I need to determine this because if I compare the '?' value against a date I get an error.
Is there a way to reference just the row I'm iterating on for a conditional check? To my knowledge, I can't use the pandas apply function technique because I need each employee's policy data compared against any other policies they hold.
Thank you!

check.str.contains('?', regex=False) would return a boolean array showing which entries contain a '?' (regex=False matters here, since ? is a regex metacharacter). Otherwise you might consider just iterating through, i.e.

check = wd['LapseDate'].astype(str)
for row in check:
    if row == '?':
        print(row)

but there's really no difference between checking whether any entry matches and iterating through every row looking for a match.
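If you want to avoid the per-row check entirely, here is a sketch that coerces the '?' placeholder to NaT instead (column names such as Policy, EMPID, Customer, CoverageEffDate and LapseDate are taken from your snippets; adjust to your file). NaT never satisfies a date comparison, so unlapsed policies simply never match, and the 5-day window test runs per employee/customer group:

import pandas as pd

wd = pd.read_csv('DATA')
# errors='coerce' turns the '?' placeholders into NaT
wd['LapseDate'] = pd.to_datetime(wd['LapseDate'], errors='coerce')
wd['CoverageEffDate'] = pd.to_datetime(wd['CoverageEffDate'])

Resultlist = []
for (emp, cust), grp in wd.groupby(['EMPID', 'Customer']):
    for _, row in grp.iterrows():
        # days from this policy's lapse to every other policy's coverage date;
        # NaT lapse dates produce NaN here and never satisfy between(0, 5)
        gap = (grp['CoverageEffDate'] - row['LapseDate']).dt.days
        hit = gap.between(0, 5) & (grp['Policy'] != row['Policy'])
        if hit.any():
            Resultlist.append((row['Policy'], cust, emp))
print(Resultlist)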

Related

Subsetting data with a column condition

I have a dataframe which contains Dates, Visitor_ID and Pages columns. The Pages column holds a different row-wise entry for each date. Please refer to the table below to understand the data.
| Dates     | Visitor_ID | Pages      |
|:--------- |:----------:| ----------:|
| 10/1/2021 | 1          | xy         |
| 10/1/2021 | 1          | step2      |
| 10/1/2021 | 1          | xx         |
| 10/1/2021 | 1          | NetBanking |
| 10/1/2021 | 2          | step1      |
| 10/1/2021 | 2          | xy         |
| 10/1/2021 | 3          | step1      |
| 10/1/2021 | 3          | NetBanking |
| 11/1/2021 | 4          | step1      |
| 12/1/2021 | 4          | NetBanking |
Desired output:
| Dates     | Visitor_ID |
|:----------|:----------:|
| 10/1/2021 | 1          |
| 10/1/2021 | 3          |
The output should be a subset of the actual data, where the condition is: for the same Visitor_ID, if a page containing the string "step" occurs before a page containing "NetBanking" on the same date, return that Visitor_ID.
To initialise your dataframe you could do:
import pandas as pd

columns = ["Dates", "Visitor_ID", "Pages"]
records = [
    ["10/1/2021", 1, "xy"],
    ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"],
    ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"],
    ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"],
    ["12/1/2021", 4, "NetBanking"],
]
data = pd.DataFrame.from_records(records, columns=columns)
data["Dates"] = pd.DatetimeIndex(data["Dates"])
index_names = columns[:2]
data.set_index(index_names, drop=True, inplace=True)
Note that I have left out your third line in the records, otherwise I cannot reproduce your desired output. I have made this a multi-index data frame in order to easily loop over the groups 'date/visitor'. The structure of the dataframe looks like:
print(data)
                        Pages
Dates      Visitor_ID
2021-10-01 1                xy
           1             step2
           1        NetBanking
           2             step1
           2                xy
           3             step1
           3        NetBanking
2021-11-01 4             step1
2021-12-01 4        NetBanking
Now to select the customers from the same date and from the same group, I am going to loop over these groups and use 2 masks to select the required records:
for date_time, data_per_date in data.groupby(level=0):
    for visitor, data_per_visitor in data_per_date.groupby(level=1):
        # select the column with the Pages
        pages = data_per_visitor["Pages"].str
        # make 2 boolean masks, for the records with step and netbanking
        has_step = pages.contains("step")
        has_netbanking = pages.contains("NetBanking")
        # to get the records after each 'step' record, apply a diff on 'has_step'.
        # Convert to int first for the correct result: each diff with outcome -1
        # fulfills this requirement. Make a mask based on this requirement
        diff_step = has_step.astype(int).diff()
        records_after_step = diff_step == -1
        # combine the 2 masks to create your final mask to make a selection
        mask = records_after_step & has_netbanking
        # select the records and print to screen
        selection = data_per_visitor[mask]
        if not selection.empty:
            print(selection.reset_index()[index_names])
This gives the following output:
Dates Visitor_ID
0 2021-10-01 1
1 2021-10-01 3
EDIT:
I was reading your question again. The solution above assumed that only records with 'NetBanking' directly following a record with 'step' are valid. That is why I thought your example input did not correspond with your desired output. However, in case you allow rows in between an occurrence of 'step' and the first 'NetBanking', the solution does not work. In that case, it is better to explicitly iterate over the rows of your dataframe per date and client id. An example then would be:
for date_time, data_per_date in data.groupby(level=0):
    for visitor, data_per_visitor in data_per_date.groupby(level=1):
        after_step = False
        index_selection = list()
        data_per_visitor.reset_index(inplace=True)
        for index, records in data_per_visitor.iterrows():
            page = records["Pages"]
            if "step" in page and not after_step:
                after_step = True
            if "NetBanking" in page and after_step:
                index_selection.append(index)
                after_step = False
        selection = data_per_visitor.reindex(index_selection)
        if not selection.empty:
            print(selection.reset_index()[index_names])
Normally I would not recommend using 'iterrows' as it is really slow, but in this case I don't see an easy alternative. The output of the second algorithm is the same as the first for my data. In case you do include the third line from your example data, the second algorithm still gives the same output.
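If the iterrows speed does become a problem, here is a vectorized sketch of my own (assuming the multi-indexed data frame built above, and allowing rows in between 'step' and the first 'NetBanking'): a group-wise cummax marks every row from the first 'step' onwards, and a second mask then picks the 'NetBanking' rows among those.

flat = data.reset_index()
keys = [flat['Dates'], flat['Visitor_ID']]
# True from the first 'step' row onwards within each date/visitor group
step_seen = (flat['Pages'].str.contains('step').astype(int)
             .groupby(keys).cummax().astype(bool))
mask = flat['Pages'].str.contains('NetBanking') & step_seen
print(flat.loc[mask, index_names].drop_duplicates())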

Is there a difference between pd.Series.dt.weekday vs pd.Series.dt.dayofweek?

I am working with time data in Pandas and have seen that there are two ways of extracting a day-of-week integer from a timestamp. These are pd.Series.dt.weekday and pd.Series.dt.dayofweek.
The documentation says that both return a "Series or Index containing integers indicating the day number", with the day of the week given as Monday=0, Sunday=6.
Am I missing something or are these two functions effectively the same?
In the "See Also" section, it does describe the other function as an Alias. Does this answer my question?
You answered your own question:
| dayofweek
| The day of the week with Monday=0, Sunday=6.
|
| Return the day of the week. It is assumed the week starts on
| Monday, which is denoted by 0 and ends on Sunday which is denoted
| by 6. This method is available on both Series with datetime
| values (using the `dt` accessor) or DatetimeIndex.
|
| Returns
| -------
| Series or Index
| Containing integers indicating the day number.
|
| See Also
| --------
| Series.dt.dayofweek : Alias.
| Series.dt.weekday : Alias. # <-- YES IT'S AN ALIAS
| Series.dt.day_name : Returns the name of the day of the week.
Source code:
dayofweek = day_of_week
weekday = dayofweek
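A quick sanity check confirms the two accessors are interchangeable:

import pandas as pd

s = pd.Series(pd.date_range('2024-01-01', periods=7))  # 2024-01-01 is a Monday
# both accessors return identical integers, Monday=0 ... Sunday=6
assert (s.dt.weekday == s.dt.dayofweek).all()
print(s.dt.weekday.tolist())  # [0, 1, 2, 3, 4, 5, 6]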

Converting VBA script to Python Script

I'm trying to convert some VBA scripts into a Python script, and I have been having trouble figuring a few things out, as the results seem different from what the Excel file gives.
So I have an example dataframe like this:
|Name     | A_Date    |
|---------|-----------|
|RAHEAL   | 04/30/2020|
|GIFTY    | 05/31/2020|
|ERIC     | 03/16/2020|
|PETER    | 05/01/2020|
|EMMANUEL | 12/15/2019|
|BABA     | 05/23/2020|
and I would want to achieve this result (the VBA script result):
|Name     | A_Date    | Sold |
|---------|-----------|------|
|RAHEAL   | 04/30/2020| No   |
|GIFTY    | 05/31/2020| Yes  |
|ERIC     | 03/16/2020| No   |
|PETER    | 05/01/2020| Yes  |
|EMMANUEL | 12/15/2019| No   |
|BABA     | 05/23/2020| Yes  |
By converting this VBA script :
Range("C2").Select
Selection.Value = _
"=IF(RC[-1]>=(INT(R2C2)-DAY(INT(R2C2))+1),""Yes"",""No"")"
Selection.AutoFill Destination:=Range("C2", "C" & Cells(Rows.Count, 1).End(xlUp).Row)
Range("C1").Value = "Sold"
ActiveSheet.Columns("C").Copy
ActiveSheet.Columns("C").PasteSpecial xlPasteValues
Simply: =IF(B2>=(INT($B$2)-DAY(INT($B$2))+1),"Yes","No")
To this Python script:
sales['Sold']=np.where(sales['A_Date']>=(sales['A_Date'] - pd.to_timedelta(sales.A_Date.dt.day, unit='d'))+ timedelta(days=1),'Yes','No')
But I keep getting a "Yes" throughout. Could anyone help me spot where I've made a mistake?
import pandas as pd

df = pd.DataFrame({'Name': ['RAHEAL', 'GIFTY', 'ERIC', 'PETER', 'EMMANUEL', 'BABA'],
                   'A_Date': ['04/30/2020', '05/31/2020', '03/16/2020',
                              '05/01/2020', '12/15/2019', '05/23/2020']})
df['A_Date'] = pd.to_datetime(df['A_Date'])
print(df)
df['Sold'] = df['A_Date'] >= df['A_Date'].iloc[0].replace(day=1)
df['Sold'] = df['Sold'].map({True: 'Yes', False: 'No'})
print(df)
output:
Name A_Date
0 RAHEAL 2020-04-30
1 GIFTY 2020-05-31
2 ERIC 2020-03-16
3 PETER 2020-05-01
4 EMMANUEL 2019-12-15
5 BABA 2020-05-23
Name A_Date Sold
0 RAHEAL 2020-04-30 Yes
1 GIFTY 2020-05-31 Yes
2 ERIC 2020-03-16 No
3 PETER 2020-05-01 Yes
4 EMMANUEL 2019-12-15 No
5 BABA 2020-05-23 Yes
If I read the formula right: if the A_Date value >= 04/01/2020 (i.e. the first day of the month for the date in B2), then RAHEAL should be Yes too.
I don't know if you noticed (and whether this is intended), but if an A_Date value has a fractional part (i.e. a time), there is room for error when you calculate the value for the 1st of the month. If the time in B2 is, let's say, 10:00 AM, the cutoff value will be 04/01/2020 10:00. Then another value of, say, 04/01/2020 09:00 will be evaluated as False/No. This is how it works in your Excel formula as well.
EDIT (12 Jan 2021): Note, values in column A_Date are of type datetime.datetime or datetime.date. Presumably they are converted when reading the Excel file or explicitly afterwards.
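If that time-of-day edge case matters, one way to neutralise it (a sketch, assuming df['A_Date'] is a datetime64 column as in the snippet above) is to strip the time component on both sides before comparing:

# normalize() drops the time-of-day, so 04/01/2020 09:00 still counts as Yes
cutoff = df['A_Date'].iloc[0].normalize().replace(day=1)
df['Sold'] = (df['A_Date'].dt.normalize() >= cutoff).map({True: 'Yes', False: 'No'})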
Very much embarrassed I didn't see the simple, elegant solution that buran gave (+1). I did more of a literal translation.
first_date.toordinal() - 693594 is the integer date value for your initial date, and current_date.toordinal() - 693594 is the integer date value for the current iteration of the dates column (subtracting 693594 converts a Python ordinal into an Excel-style serial date number). I apply your cell formula's logic to each A_Date row value and output the result as the corresponding Sold column value.
import pandas as pd
from datetime import datetime

def is_sold(current_date: datetime, first_date: datetime, day_no: int) -> str:
    # use of toordinal idea from @rjha94 https://stackoverflow.com/a/47478659
    if current_date.toordinal() - 693594 >= first_date.toordinal() - 693594 - day_no + 1:
        return "Yes"
    else:
        return "No"

sales = pd.DataFrame({'Name': ['RAHEAL', 'GIFTY', 'ERIC', 'PETER', 'EMMANUEL', 'BABA'],
                      'A_Date': ['2020-04-30', '2020-05-31', '2020-03-16',
                                 '2020-05-01', '2019-12-15', '2020-05-23']})
sales['A_Date'] = pd.to_datetime(sales['A_Date'], errors='coerce')
# DAY(INT($B$2)) in the formula is the day of the *first* date, so pass that, not x.day
sales['Sold'] = sales['A_Date'].apply(lambda x: is_sold(x, sales['A_Date'][0], sales['A_Date'][0].day))
print(sales)

Pandas: select row where one specific column stays constant

I have a DataFrame that keeps track of each item_id and its price over a time period:
item_id | item_price | day
1 | 10 | 09-02-2000 # DD-MM-YYYY format
2 | 24 | 10-02-2000
1 | 10 | 20-02-2000
...
As you can see, the price of item 1 does not change over time. How do I select all of the items whose item_price does not change over time? I tried groupby(), but it does not seem to work right.
EDIT: the desired output is all the item_ids whose price does not change over time, for example: item_id_list = [1, ...]. It can be a list or a DataFrame Series.
here it is little bro
df.groupby('item_id').item_price.nunique()
And you keep the ones with one unique price.
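Spelled out, that filtering step might look like this (a small sketch using the frame from the question):

# items whose price never changes have exactly one unique price
counts = df.groupby('item_id')['item_price'].nunique()
item_id_list = counts[counts == 1].index.tolist()
print(item_id_list)  # e.g. [1]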

How to add new row in pandas dataframe? [duplicate]

I have an existing dataframe to which I need to add an additional column that will contain the same value for every row.
Existing df:
Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450
New df:
Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450
I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the 'Name' column and set every row to the same value, in this case 'abc'.
df['Name']='abc' will add the new column and set all rows to that value:
In [79]: df
Out[79]:
          Date, Open, High, Low, Close
0  01-01-2015,  565,  600,  400,   450

In [80]: df['Name'] = 'abc'
         df
Out[80]:
          Date, Open, High, Low, Close  Name
0  01-01-2015,  565,  600,  400,   450   abc
You can use insert to specify where you want the new column to be. In this case, I use 0 to place the new column at the left.
df.insert(0, 'Name', 'abc')
Name Date Open High Low Close
0 abc 01-01-2015 565 600 400 450
Summing up what the others have suggested, and adding a third way
You can:
assign(**kwargs):
df.assign(Name='abc')
access the new column series (it will be created) and set it:
df['Name'] = 'abc'
insert(loc, column, value, allow_duplicates=False)
df.insert(0, 'Name', 'abc')
where the argument loc ( 0 <= loc <= len(columns) ) allows you to insert the column where you want.
'loc' gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0).
All these methods allow you to add a new column from a Series as well (just substitute the 'abc' default argument above with the series).
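For instance (a small sketch; the frame and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'Date': ['01-01-2015'], 'Close': [450]})
s = pd.Series(['abc'], index=df.index)  # values aligned on df's index

new_df = df.assign(Name=s)  # returns a new frame, df untouched
df.insert(0, 'Name', s)     # inserts in place at position 0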
A one-liner works:
df['Name'] = 'abc'
This creates a Name column and sets all rows to the value abc.
I want to draw more attention to a portion of #michele-piccolini's answer.
I strongly believe that .assign is the best solution here. In the real world, these operations are not in isolation, but in a chain of operations. And if you want to support a chain of operations, you should probably use the .assign method.
Here is an example using snowfall data at a ski resort (but the same principles would apply to say ... financial data).
This code reads like a recipe of steps. Both assignment (with =) and .insert make this much harder:
raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])

def clean_alta(df):
    return (df
            .loc[:, ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE',
                     'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
            .groupby(pd.Grouper(key='DATE', freq='W'))
            .agg({'PRCP': 'sum', 'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum', 'SNWD': 'mean'})
            .assign(LOCATION='Alta',
                    T_RANGE=lambda w_df: w_df.TMAX - w_df.TMIN)
            )

clean_alta(raw)
Notice the line .assign(LOCATION='Alta', ...), which creates a column with a single value in the middle of the rest of the operations.
One line did the job for me:
df['New Column'] = 'Constant Value'
df['New Column'] = 123
You can simply do the following:
df['New Col'] = pd.Series(["abc" for x in range(len(df.index))])
This single line will work.
df['name'] = 'abc'
The append method has been deprecated since pandas 1.4.0.
So instead, use the method above only when working on an actual pandas DataFrame object:
df["column"] = "value"
Or, if setting a value on a view or copy of a DataFrame, use concat() or assign().
This way the new Series created has the same index as the original DataFrame, and so will match on exact rows:
# adds a new column named `client` with value `display_name`
# in the view `where_there_is_one`; `df` remains unchanged
df = pd.DataFrame({"number": [1]*5 + [0]*5})
where_there_is_one = df[df["number"] == 1]
where_there_is_one = pd.concat([
        where_there_is_one,
        pd.Series(["display_name"]*df.shape[0],
                  index=df.index,
                  name="client")
    ],
    join="inner", axis=1)
# Or use assign
where_there_is_one = where_there_is_one.assign(client="display_name")
Output:
where_there_is_one:
|   | number | client       |
|---|--------|--------------|
| 0 | 1      | display_name |
| 1 | 1      | display_name |
| 2 | 1      | display_name |
| 3 | 1      | display_name |
| 4 | 1      | display_name |

df:
|   | number |
|---|--------|
| 0 | 1      |
| 1 | 1      |
| 2 | 1      |
| 3 | 1      |
| 4 | 1      |
| 5 | 0      |
| 6 | 0      |
| 7 | 0      |
| 8 | 0      |
| 9 | 0      |
Ok all, I have a similar situation here, but if I use this code: df['Name']='abc'
then instead of 'abc' I want to take the name for the new column from somewhere else in the CSV file.
As you can see from the picture, df is not cleaned yet, but I want to create two columns with the name "ADI dms rivoli", which will repeat for every row, and the same for "December 2019". I hope it is clear enough to understand; it was hard to explain, sorry.
