Python DataFrame: how do I perform a VLOOKUP equivalent?

This is the df (a subset of a much larger df):
import pandas as pd

df = pd.DataFrame({
    'Date': ['04/03/2020', '06/04/2020', '08/06/2020', '12/12/2020'],
    'Tval': [0.01, 0.015, -0.023, -0.0005]
})
If I need the Tval for, say, '06/04/2020' (just a single date), how do I get it? I know merge and join can be used to replicate the VLOOKUP function in Python, but what if it's a single value you're looking for? What's the best way to perform the task?

The pandas docs recommend using loc or iloc for most lookups:
df = pd.DataFrame({
    'Date': ['04/03/2020', '06/04/2020', '08/06/2020', '12/12/2020'],
    'Tval': [0.01, 0.015, -0.023, -0.0005]
})
df.loc[df.Date == '06/04/2020', 'Tval']
Here the first part of the expression in the brackets, df.Date == '06/04/2020', selects the row(s) you want to see, and the second part specifies which column(s) to display.
If instead you wanted to see the data for the entire row, you could rewrite it as df.loc[df.Date == '06/04/2020', :].
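If the goal is a single scalar rather than a one-element Series, a minimal sketch (assuming the date appears at most once in the column) is to pull the first match out of the result:
val = df.loc[df.Date == '06/04/2020', 'Tval'].iloc[0]   # -> 0.015
# .squeeze() also works: it collapses a one-element Series to a scalar
val = df.loc[df.Date == '06/04/2020', 'Tval'].squeeze()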

Selecting in a dataframe works like this:
df.loc[df.Date == '08/06/2020', 'Tval']
The way to make sense of this is:
df.Date == '08/06/2020'
Out:
0    False
1    False
2     True
3    False
Name: Date, dtype: bool
This produces a Series of True/False values showing which rows in the Date column match the equality. If you give a DataFrame such a Series, it will select only the rows where the Series is True. To see that, look at what you get from:
df.loc[df.Date == '08/06/2020']
Out[]:
         Date   Tval
2  08/06/2020 -0.023
Finally, to see the values of 'Tval' we just do:
df.loc[df.Date == '08/06/2020', 'Tval']
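If you plan to do many such lookups, a sketch closer in spirit to a spreadsheet VLOOKUP (assuming the dates are unique) is to index by the lookup column once and then use plain label access:
lookup = df.set_index('Date')['Tval']
lookup['08/06/2020']   # -> -0.023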

Related

Figuring out if an entire column in a Pandas dataframe is the same value or not

I have a pandas dataframe that works just fine. I am trying to figure out how to tell whether a column, whose label I know is correct, contains all the same values or not. The code below errors out for some reason when I want to see if the column contains -1 in each cell:
# column = "TheColumnLabelThatIsCorrect"
# df = "my correct dataframe"
# I get a "() takes 1 or 2 arguments but 3 were passed" error
if (not df.loc(column, estimate.eq(-1).all())):
I just learned about .eq() and .all() and hopefully I am using them correctly.
It's a syntax issue - see the docs for .loc/indexing. Specifically, you want to be using [] instead of ().
You can do something like
if not df[column].eq(-1).all():
    ...
If you want to use .loc specifically, you'd do something similar:
if not df.loc[:, column].eq(-1).all():
    ...
Also, note you don't need to use .eq(); you can just do (df[column] == -1).all() if you prefer.
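As a quick, self-contained sanity check of both spellings (the column name 'estimate' is taken from the question):
import pandas as pd

df = pd.DataFrame({'estimate': [-1, -1, -1]})
print(df['estimate'].eq(-1).all())    # True
print((df['estimate'] == -1).all())   # True - equivalent spelling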
You could drop duplicates, and if you get only one record back it means all records are the same.
import pandas as pd
df = pd.DataFrame({'col': [1, 1, 1, 1]})
len(df['col'].drop_duplicates()) == 1
> True
The question isn't entirely clear, but let's try the following:
Contains only -1 in each cell:
df['estimate'].eq(-1).all()
Contains -1 in any cell:
df['estimate'].eq(-1).any()
Select the rows where 'estimate' equals -1, keeping all columns:
df.loc[df['estimate'].eq(-1), :]
df['column'].value_counts() gives you a list of all unique values and their counts in a column. As for checking whether all the values are a specific number, you can collapse the column to its unique values and check that there is only one:
len(set(df['column'])) == 1
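Another standard idiom worth mentioning (not used above, but built into pandas) is nunique, which counts distinct values directly:
# True when every value in the column is the same
df['column'].nunique(dropna=False) == 1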

Only return rows from a dataframe which match criteria in two columns

I have a dataframe from which I want to return only the rows in which the values in column '1' match a specific string and the value in column '2' is an integer.
I have the following code, in which I attempt to generate a set of indexes matching the criteria and then pull only those rows from the dataframe:
Ok_index = df[(df['1'] == "string") & (df['2'] % 1 == 0)].index
new_df = df.iloc[Ok_index]
I understand the issue is with the second conditional statement, but I don't know how to apply the same logic from the string check to the integer check.
The following dataframe:
| 1            | 2   |
| 'String'     | 1.5 |
| 'String'     | 10  |
| 'Not string' | 10  |
Should return this dataframe:
| 1        | 2  |
| 'String' | 10 |
Check with is_integer:
df['2'].apply(lambda x: x.is_integer())
0    False
1     True
2     True
Name: 2, dtype: bool
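Combining that mask with the string check gives one possible complete filter - a sketch, assuming column '2' holds floats (is_integer is a float method) and that 'String' is the value you are matching:
mask = (df['1'] == 'String') & df['2'].apply(lambda x: float(x).is_integer())
new_df = df[mask]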
Actually, your error is in your second line. You are retrieving the index from the dataframe, so you need to use .loc in order to filter by it. Essentially:
new_df = df.loc[Ok_index]
But if you want to use all of pandas' power, you can do all of this in a single line:
new_df = df[(df['1'] == "string") & (df['2'] % 1 == 0)]
You don't need to get the index of the desirable rows first and then filter the dataframe; you can do it all at once.

Comparing 2 datetime64[ns] dataframe columns

I have two date columns, namely date1 and date2.
I am trying to select rows which have date1 later than date2.
I tried
print df[df.loc[df['date1']>df['date2']]]
but I received an error:
ValueError: Boolean array expected for the condition, not float64
In either case, the idea is to retrieve a boolean mask. This boolean mask will then be used to index into the dataframe and retrieve the corresponding rows. First, generate the mask:
mask = df['date1'] > df['date2']
Now, use this mask to index df:
df = df.loc[mask]
This can be written in a single line:
df = df.loc[df['date1'] > df['date2']]
You do not need to perform another level of indexing after this; df now has your final result. I recommend loc if you are planning to perform operations and reassignment on this filtered dataframe, because assigning through a single .loc call avoids the SettingWithCopyWarning you can get when chaining plain indexing operations.
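A minimal sketch of why that matters for assignment (the 'flag' column here is hypothetical):
# single .loc step: pandas knows you mean the original df
df.loc[df['date1'] > df['date2'], 'flag'] = True
# chained indexing may raise SettingWithCopyWarning and modify a temporary copy:
# df[df['date1'] > df['date2']]['flag'] = True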
Below are some more methods of doing the same thing:
Option 1: df.query
df.query('date1 > date2')
Option 2: df.eval
df[df.eval('date1 > date2')]
If your columns are not datetimes yet, you might as well cast them now. Use pd.to_datetime:
df.date1 = pd.to_datetime(df.date1)
df.date2 = pd.to_datetime(df.date2)
Or, when loading your CSV, make sure to set the parse_dates switch:
df = pd.read_csv(..., parse_dates=['date1', 'date2'])
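Putting it together, a small self-contained sketch with made-up dates:
import pandas as pd

df = pd.DataFrame({
    'date1': ['2020-03-04', '2020-04-06'],
    'date2': ['2020-02-01', '2020-05-01'],
})
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
print(df.loc[df['date1'] > df['date2']])   # keeps only the first row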

Pandas: Date comparison with column addition based on values in rows

I have a number of Excel files that follow a similar format:
| name  | email | cat1      | cat2      | cat3      |
| smith | email | 01JAN2016 | 01JAN2014 | 01JAN2015 |
The first two columns contain strings (name and email address), while each of the following columns contains the date when each person completed the item in cat(x).
I would like to run a comparison against current_date, adding a new column 'Status' with a value of 'compliant' or 'delinquent' based on whether any date in a row is prior to the current date, then output the new dataframe to an Excel spreadsheet.
My initial attempts let me filter 'older' dates rather easily; however, when I tried to add a column using a conditional, everything started to break:
import pandas as pd
import numpy as np
import datetime

current_date = datetime.datetime.now()
writer = pd.ExcelWriter('pd_output.xlsx', engine='xlsxwriter', datetime_format='mmm d yyyy')
df = pd.read_excel(tracker, 'Sheet1')

print(df.values)          # displays dates as Timestamp('2016-01-01 00:00:00')
print(df < current_date)  # any value < current_date displays as True, else False; someone
                          # with no old dates still shows up, with columns 3+ all False

# a couple of versions of what I have been trying - unsuccessfully
df['Status'] = np.where(df[df < current_date], 'delinquent', 'compliant')  # error: wrong number of items passed
df['Status'] = np.where(df == 'True', 'delinquent', 'compliant')           # error: 'str' obj has no attr 'view'
df['Status'] = df.Set.map(lambda x: 'delinquent' if 'True' in df else 'compliant')  # from another post - error: no attr 'Set' or 'map'

# send to output excel
df.to_excel(writer, sheet_name='Sheet1')
I would like the output to display rows with an added 'Status' column showing whether there was an 'offending date' within the row, denoted with 'compliant' or 'delinquent'. I feel like I am making my comparisons incorrectly (using True instead of another .where) but can't seem to get it right.
When you want to create a new column based on the values of one or more other columns, you usually use one of the apply functions. When the function depends on multiple columns, as is the case here, you use DataFrame.apply. Here is an approximation of what I think you are trying to do:
df['Status'] = df.apply(
    lambda row: (
        'delinquent'
        if any(row[c] < current_date for c in ('cat1', 'cat2', 'cat3'))
        else 'compliant'
    ),
    axis=1
)
(FYI, I assumed from your logic that "delinquent" means the date is before the current date; if I got that backwards, please reverse the < symbol to > in what I have above.)
Let's unpack this a little. apply applies a function along an axis of the dataframe. We need DataFrame.apply because we are looking at more than one column; shortly, we will specify which ones. The function is the lambda we've defined. The axis=1 argument tells apply to apply the lambda to each row (this is not the default; the default is axis=0, which applies to each column - not what we want). The lambda itself looks at all three of your date columns by name, returning "delinquent" if any one of them is before the current date. I use any() with a generator expression inside just to avoid the drudgery of writing something like if row["cat1"] < current_date or row["cat2"] < current_date or row["cat3"] < current_date and so forth.
Note that all of this depends on your 3 date columns being of type datetime - I am assuming that they are.
If you had only one date column, say, "cat1", you could use the slightly simpler Series.apply on that one column.
df['Status'] = df['cat1'].apply(
    lambda x: 'delinquent' if x < current_date else 'compliant'
)
The rationale for doing this is the simpler function and the lack of the axis argument. So generally, people use Series.apply when they are applying a function of only one column, and DataFrame.apply if the function is of more than one column.
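As an alternative to row-wise apply, the same logic can be vectorized, which is usually much faster on large frames - a sketch, assuming the three cat columns are already datetimes:
import numpy as np

date_cols = ['cat1', 'cat2', 'cat3']
mask = df[date_cols].lt(current_date).any(axis=1)   # True if any date in the row is before now
df['Status'] = np.where(mask, 'delinquent', 'compliant')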

Return multiple objects from an apply function in Pandas

I'm practicing with using apply with pandas dataframes.
So I have cooked up a simple dataframe with dates and values:
import numpy as np
import pandas as pd
from pandas import DataFrame

dates = pd.date_range('2013', periods=10)
values = list(np.arange(1, 11, 1))
DF = DataFrame({'date': dates, 'value': values})
I have a second dataframe, which is made up of 3 rows of the original dataframe:
DFa = DF.iloc[[1, 2, 4]]
So, I'd like to use the 2nd dataframe, DFa, and get the date from each row (using apply), and then find and sum up the values in the original dataframe whose dates came earlier:
def foo(DFa, DF=DF):
    cutoff_date = DFa['date']
    ans = DF[DF['date'] < cutoff_date]

DFa.apply(foo, axis=1)
Things work fine. My question is: since I've created 3 ans values, how do I access them?
Obviously I'm new to apply and I'm eager to get away from loops. I just don't understand how to return values from apply.
Your function needs to return a value. E.g.,
def foo(df1, df2):
    cutoff_date = df1.date
    ans = df2[df2.date < cutoff_date].value.sum()
    return ans

DFa.apply(lambda x: foo(x, DF), axis=1)
Also, note that as written your function builds a DataFrame (ans) for each row of DFa, so if you returned it unchanged you would end up with a collection of DataFrames rather than the sums you want.
There's a bit of a mixup in the way you're using apply. With axis=1, foo will be applied to each row (see the docs), yet your code implies (by the parameter name) that its first parameter is a DataFrame.
Additionally, you state that you want to sum up the original DataFrame's values for dates less than the cutoff date. So foo needs to do this and return the result.
So the code needs to look something like this:
def foo(row, DF=DF):
cutoff_date = row['date']
return DF[DF['date'] < cutoff_date].value.sum()
Once you make the changes, since foo returns a scalar, apply will return a Series:
>>> DFa.apply(foo, axis=1)
1     1
2     3
4    10
dtype: int64
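If you want to keep those sums alongside the rows, you can assign the resulting Series back to DFa - a sketch; the column name 'sum_before' is made up:
DFa = DFa.copy()   # work on a copy, since DFa is a slice of DF
DFa['sum_before'] = DFa.apply(foo, axis=1)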
