Pandas: Date comparison with column addition based on values in rows - python

I have a number of Excel files that follow a similar format:
| name  | email | cat1      | cat2      | cat3      |
| smith | email | 01JAN2016 | 01JAN2014 | 01JAN2015 |
The first two columns contain strings (name and email address), while each of the following columns contains the date on which each person completed the item in cat(x).
I would like to compare each date to current_date and add a new column 'status' whose value is 'compliant' or 'delinquent', based on whether any date in the row is prior to the current date, then write the new dataframe out to an Excel spreadsheet.
My initial attempts let me filter 'older' dates rather easily; however, when I tried to add a column using a conditional, everything started to break:
import pandas as pd
import numpy as np
import datetime
current_date = datetime.datetime.now()
writer = pd.ExcelWriter('pd_output.xlsx', engine='xlsxwriter', datetime_format='mmm d yyyy')
df = pd.read_excel(tracker, 'Sheet1')
print(df.values)  # displays dates as Timestamp('2016-01-01 00:00:00')
print(df < current_date)  # True where a date is older than current_date, else False; a row with no old dates still shows up, just with False across columns 3+
# a couple of versions of what I have been trying - unsuccessfully
df['Status'] = np.where(df[df < current_date], 'delinquent', 'compliant')  # error: wrong number of items passed
df['Status'] = np.where(df == 'True', 'delinquent', 'compliant')  # error: 'str' object has no attribute 'view'
df['Status'] = df.Set.map(lambda x: 'delinquent' if 'True' in df else 'compliant')  # from another post - error: no attribute 'Set'
# send to output excel
df.to_excel(writer,sheet_name='Sheet1')
I would like the output to display rows with the added 'Status' column showing whether there was an 'offending date' within the row, denoted with 'compliant' or 'delinquent'. I feel like I am making my comparisons incorrectly (using True instead of another .where) but can't seem to get it right.

When you want to create a new column based on the values of one or more other columns, you usually use one of the apply functions. When the function depends on multiple columns, as is the case here, you use DataFrame.apply. Here is an approximation of what I think you are trying to do:
df['Status'] = df.apply(
    lambda row: (
        'delinquent'
        if any(row[c] < current_date for c in ("cat1", "cat2", "cat3"))
        else 'compliant'
    ),
    axis=1,
)
(FYI, I took from your logic that "delinquent" means a date is before the current date; if I have that backwards, reverse the < to > in what I have above.)
Let's unpack this a little. DataFrame.apply applies a function along an axis of the dataframe (note that it is not truly vectorized; it calls the function once per row or column). We apply it to the entire dataframe because we are looking at more than one column; the lambda specifies which ones. The axis=1 argument tells apply to pass each row to the lambda (this is not the default; the default is axis=0, which passes each column - not what we want). The lambda itself looks at all 3 of your date columns by name, returning "delinquent" if any one of them is before the current date. I use any() with a generator expression inside just to avoid the drudgery of writing something like if row["cat1"] < current_date or row["cat2"] < current_date or row["cat3"] < current_date and so forth.
Note that all of this depends on your 3 date columns being of type datetime - I am assuming that they are.
If you had only one date column, say, "cat1", you could use the slightly simpler Series.apply on that one column.
df['Status'] = df['cat1'].apply(
    lambda x: 'delinquent' if x < current_date else 'compliant'
)
The rationale for doing this is the simpler function and the lack of the axis argument. So generally, people use Series.apply when the function depends on only one column, and DataFrame.apply when it depends on more than one.
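As an aside, because apply calls the lambda once per row, a fully vectorized alternative is often faster on large files: compare all three date columns at once and reduce with any(axis=1). A minimal sketch, assuming cat1 through cat3 are already datetime columns:
import numpy as np
# True for any row where at least one cat date is before the current date
late = df[["cat1", "cat2", "cat3"]].lt(current_date).any(axis=1)
df["Status"] = np.where(late, "delinquent", "compliant")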

Related

calculate sum of rows in pandas dataframe grouped by date

I have a CSV that I loaded into a pandas DataFrame.
I then select only the rows with duplicate dates in the DF:
df_dups = df[df.duplicated(['Date'])].copy()
I'm trying to get the sum of all the rows with the exact same date for 4 columns (all float values), like this:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].sum()
However, this does not give the desired result. When I examine the groups of the groupby object, I've noticed that it did not include the first occurrence of each date in the indices. So for two rows with the same date, there is only one index in the groups object.
pprint(df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].groups)
I have no idea how to get the sum of all duplicates.
I've also tried:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].apply(lambda x : x.sum())
This gives the same result, which makes sense I guess, as the indices in the groupby object are not complete. What am I missing here?
Check the documentation for the method duplicated. By default, duplicates are marked with True except for the first occurrence, which is why the first row of each date is not included in your sums.
You only need to pass keep=False to duplicated for your desired behaviour:
df_dups = df[df.duplicated(['Date'], keep=False)].copy()
After that, the sum can be calculated properly with the expression you wrote (selecting the columns as a list, which newer pandas versions require):
df_sum = df_dups.groupby('Date')[["Received Quantity", "Sent Quantity", "Fee Amount", "Market Value"]].sum()
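To see concretely what keep=False changes, here is a tiny sketch with a toy Series (values made up for illustration):
import pandas as pd
s = pd.Series(['2020-01-01', '2020-01-01', '2020-02-01'])
print(s.duplicated().tolist())            # [False, True, False] - first occurrence not marked
print(s.duplicated(keep=False).tolist())  # [True, True, False] - every duplicate marked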

Making a new column based on 2 other columns

I am trying to calculate a new column, labeled in the code as "Sulphide-S(calc)-C_%S". It can be calculated from one of two options (see the code below), and the two source columns won't both be filled at the same time, so I want it calculated from whichever column has data present. Presently I have this, but the second equation overwrites the first:
df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"]
df.head()
df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"]- df["Sulphate-S_%S"]
df.head()
You can use the apply function in pandas to create a new column based on other columns, resulting in a Series that you can add to your original dataframe. Without knowing exactly what your dataframe looks like, the following sketch assumes the unfilled spots are NaN, so pd.notna can detect which column has data:
def create_sulfide_col(row):
    # use the HCL-leachable sulphate when it has data, otherwise fall back to plain sulphate
    if pd.notna(row["Sulphate-S(HCL Leachable)_%S"]):
        val = row["Total-S_%S"] - row["Sulphate-S(HCL Leachable)_%S"]
    else:
        val = row["Total-S_%S"] - row["Sulphate-S_%S"]
    return val

df["Sulphide-S(calc)-C_%S"] = df.apply(create_sulfide_col, axis='columns')
If I'm understanding you correctly, the second equation overwrites the first because they assign to the same column name. Try changing the column name in one or both assignments to something else, like "Sulphide-S(calc)-C_%S_A" and "Sulphide-S(calc)-C_%S_B":
df["Sulphide-S(calc)-C_%S_A"] = df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"]
df.head()
df["Sulphide-S(calc)-C_%S_B"] = df["Total-S_%S"]- df["Sulphate-S_%S"]
df.head()
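As an aside, the same "use whichever column has data" logic can be written without apply by letting fillna choose the fallback. A short sketch, again assuming the empty spots are NaN:
# subtract the HCL-leachable sulphate where present, else the plain sulphate
df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"] - (
    df["Sulphate-S(HCL Leachable)_%S"].fillna(df["Sulphate-S_%S"])
)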

Python Dataframe: how do I perform vlookup equivalent

This is the df (which is a subset of a much larger df):
df = pd.DataFrame({
    'Date': ['04/03/2020', '06/04/2020', '08/06/2020', '12/12/2020'],
    'Tval': [0.01, 0.015, -0.023, -0.0005]
})
If I need the Tval for, say, '06/04/2020' (just a single date I need the value for), how do I get it? I know merge and join can be used to replicate a VLOOKUP in Python, but what if it's a single value you're looking for? What's the best way to perform the task?
Pandas docs recommend using loc or iloc for most lookups:
df = pd.DataFrame({
    'Date': ['04/03/2020', '06/04/2020', '08/06/2020', '12/12/2020'],
    'Tval': [0.01, 0.015, -0.023, -0.0005]
})
df.loc[df.Date == '06/04/2020', 'Tval']
Here the first part of the expression in the brackets, df.Date == '06/04/2020', selects the row(s) you want to see, and the second part specifies which column(s) you want displayed.
If instead you wanted to see the data for the entire row, you could re-write it as df.loc[df.Date == '06/04/2020', : ].
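Another common VLOOKUP-style pattern, if you will do many lookups against the same key column, is to make Date the index once and use the scalar accessor .at - a small sketch:
# index by Date once, then look values up directly by label
lookup = df.set_index('Date')
print(lookup.at['06/04/2020', 'Tval'])  # 0.015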
Selecting in a dataframe works like this:
df.loc[df.Date == '08/06/2020', 'Tval']
The way to make sense of this is:
df.Date == '08/06/2020'
Out:
0    False
1    False
2     True
3    False
Name: Date, dtype: bool
This produces a Series of True/False values showing which rows in the Date column match the equality. If you give a DataFrame such a series, it will select out only the rows where the series is True. To see that, look at what you get from:
df.loc[df.Date == '08/06/2020']
Out[]:
         Date   Tval
2  08/06/2020 -0.023
Finally, to see the values of 'Tval' we just do:
df.loc[df.Date == '08/06/2020', 'Tval']
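One follow-up worth noting: loc returns a Series even when only one row matches, so to unwrap the bare number you can take its first element - a quick sketch:
# .iloc[0] pulls the scalar out of the single-element Series
tval = df.loc[df.Date == '06/04/2020', 'Tval'].iloc[0]
print(tval)  # 0.015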

Python Pandas DataFrame - How to sum values in 1 column based on partial match in another column (date type)?

I have encountered some issues while processing my dataset using Pandas DataFrame.
Here is my dataset and its data types (shown as screenshots in the original post, omitted here); the relevant columns are "DATE" and "NUMBER OF PEOPLE".
My dataset is derived from:
MY_DATASET = pd.read_excel(EXCEL_FILE_PATH, index_col = None, na_values = ['NA'], usecols = "A, D")
I would like to sum all values in the "NUMBER OF PEOPLE" column for each month in the "DATE" column. For example, all values in "NUMBER OF PEOPLE" column would be added as long as the value in the "DATE" column was "2020-01", "2020-02" ...
However, I am stuck since I am unsure how to use the .groupby on partial match.
After the above is completed, I am also trying to convert the values in the "DATE" column from YYYY-MM-DD to YYYY-MMM, like 2020-Jan; however, I am unsure if there is such a format.
Does anyone know how to resolve these issues?
Many thanks!
Check:
s = df['NUMBER OF PEOPLE'].groupby(pd.to_datetime(df['DATE']).dt.strftime('%Y-%b')).sum()
You can get an abbreviated month name using strftime('%b'), though depending on your locale the month name may come out all in lowercase:
df['group_date'] = df.date.apply(lambda x: x.strftime('%Y-%b'))
If you need the first letter of the month in uppercase, you could do something like this:
df['group_date'] = df['group_date'].apply(lambda x: f'{x[0:5]}{x[5].upper()}{x[6:]}')
# or in one step:
df['group_date'] = df.date.apply(lambda x: x.strftime('%Y-%b')).apply(lambda x: f'{x[0:5]}{x[5].upper()}{x[6:]}')
Now you just need to .groupby and .sum():
result = df['NUMBER OF PEOPLE'].groupby(df.group_date).sum()
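Putting it together end to end, a minimal sketch with made-up numbers (column names as in the question):
import pandas as pd
df = pd.DataFrame({
    'DATE': pd.to_datetime(['2020-01-05', '2020-01-20', '2020-02-11']),
    'NUMBER OF PEOPLE': [10, 5, 7],
})
# label each row with its month, e.g. '2020-Jan', then sum within each label
df['group_date'] = df['DATE'].dt.strftime('%Y-%b')
result = df.groupby('group_date')['NUMBER OF PEOPLE'].sum()
print(result)
# group_date
# 2020-Feb     7
# 2020-Jan    15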

Row wise operations in pandas dataframe based on dates (sorting issue)

This question has two parts:
1) Is there a better way to do this?
2) If NO to #1, how can I fix my date issue?
I have a dataframe as follows
GROUP  DATE        VALUE  DELTA
A      12/20/2015  2.5    ??
A      11/30/2015  25
A      1/31/2016   8.3
B      etc         etc
B      etc         etc
C      etc         etc
C      etc         etc
This is a representation; there are close to 100 rows for each group (each row representing a unique date).
For each letter in GROUP, I want to find the change in value between successive dates. So, for example, for GROUP A I want the change between 11/30/2015 and 12/20/2015, which is -22.5. Currently I am doing the following:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df.sort_values('DATE', ascending=True)
df_out = []
for GROUP in df.GROUP.unique():
    x = df[df.GROUP == GROUP]
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)
df_out = pd.concat(df_out)
The challenge I am running into is the dates are not sorted correctly. So when the shift takes place and I calculate the delta it is not really the delta between successive dates.
Is this the right approach to the problem? If so, how can I fix my date issue? I have reviewed/tried the following to no avail:
Applying datetime format in pandas for sorting
how to make a pandas dataframe column into a datetime object showing just the date to correctly sort
doing calculations in pandas dataframe based on trailing row
Pandas - Split dataframe into multiple dataframes based on dates?
Answering my own question. This works:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df_out = []
for ID in df.GROUP.unique():
    x = df[df.GROUP == ID]
    x.sort_values('DATE', ascending=True, inplace=True)
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)
df_out = pd.concat(df_out)
1) Added inplace=True to the sort.
2) Moved the sort inside the for loop.
3) Changed my loop variable from GROUP to ID, since GROUP is also a column name, which I imagine is considered sloppy.
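As an aside, the whole loop can be replaced with one sort plus a grouped diff, which also sidesteps the SettingWithCopyWarning that sort_values(inplace=True) on a slice tends to raise - a sketch assuming the same column names:
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values(['GROUP', 'DATE'])
# diff() computes the change between successive rows within each group
df['DELTA'] = df.groupby('GROUP')['VALUE'].diff()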
