I have two date columns, date1 and date2.
I am trying to select the rows where date1 is later than date2.
I tried
print df[df.loc[df['date1']>df['date2']]]
but I received an error:
ValueError: Boolean array expected for the condition, not float64
The idea is to build a Boolean mask, and then use that mask to index into the dataframe and retrieve the corresponding rows. First, generate the mask:
mask = df['date1'] > df['date2']
Now, use this mask to index df:
df = df.loc[mask]
This can be written in a single line.
df = df.loc[df['date1'] > df['date2']]
You do not need another level of indexing after this; df now holds your final result. Prefer loc if you plan to perform assignment on this filtered dataframe, because chained indexing (e.g. df[mask]['col'] = ...) can trigger SettingWithCopyWarning and may not modify the original frame, whereas df.loc[mask, 'col'] = ... assigns reliably.
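For example, here is a minimal sketch of why this matters when assigning back (the flag column is hypothetical, only for illustration):
mask = df['date1'] > df['date2']
df.loc[mask, 'flag'] = True      # assigns directly on df, no warning
# df[mask]['flag'] = True        # chained indexing: may raise SettingWithCopyWarning and leave df unchanged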
Below are some more methods of doing the same thing:
Option 1
df.query
df.query('date1 > date2')
Option 2
df.eval
df[df.eval('date1 > date2')]
If your columns are not already of datetime dtype, you might as well cast them now. Use pd.to_datetime:
df.date1 = pd.to_datetime(df.date1)
df.date2 = pd.to_datetime(df.date2)
Or, when loading your CSV, make sure to set the parse_dates switch on:
df = pd.read_csv(..., parse_dates=['date1', 'date2'])
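Putting it all together, here is a small self-contained sketch with made-up dates (the column values below are just for illustration):
import pandas as pd

df = pd.DataFrame({'date1': ['2021-03-01', '2021-01-15'],
                   'date2': ['2021-02-01', '2021-02-15']})
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
print(df.loc[df['date1'] > df['date2']])   # only the first row, where date1 is later than date2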
Related
I have a dataframe which is the result of some manipulation of two other dataframes.
Note: PKEY_ID is the user-defined index.
Now I want another resulting dataframe that contains only the columns having at least one non-null value. Below is my code:
diff_col_lst = [col for col in df if ~df[col].isna().all()]
err_df = df[diff_col_lst]
Output
Now the dataframe is actually empty, since there are no columns at all, but checking the shape:
err_df.shape
Q-1: Even though the dataframe is empty, it says there are 10 rows. How do I get the shape as (0, 0) when the df is empty, without explicitly checking df.empty and manipulating the shape?
Q-2: Can we make the code below even more concise?
diff_col_lst = [col for col in df if ~df[col].isna().all()]
err_df = df[diff_col_lst]
First, for Q2:
You could use:
err_df = df.loc[:, df.any()]
If you look at the documentation of any, it says "Return whether any element is True, potentially over an axis." The default axis is 0 (the index), so it looks down each column and returns True for that column if it finds any truthy element; we don't need to pass axis=0, because it is the default. NaN counts as False, so columns that are entirely null come back False. Note that any() also treats zeros and empty strings as False, so if you have such columns that you want to keep, use df.notna().any() instead, which checks strictly for non-null values.
Now you use .loc to select the columns for which df.any() returned True, and the : part says that you want all rows.
Q1: IIUC, you are filtering to the columns that have at least one non-null value. If you then need to check whether any columns remain, and you want to check only via a shape, you can take the shape of the columns:
err_df.columns.shape
which should give (0,) in this case.
Or you could use size, which gives the number of elements; in this case it will return 0 for
err_df.size
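As a small illustration with a made-up frame (the column names are arbitrary, and column b is entirely null):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [np.nan, np.nan], 'c': [3.0, np.nan]})
err_df = df.loc[:, df.any()]     # drops the all-null column 'b'
print(err_df.columns.shape)      # (2,)
print(err_df.size)               # 4
# df.loc[:, df.notna().any()] gives the same result here, and also keeps columns of zeros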
This is the df (which is a subset of a much larger df):
df = pd.DataFrame({
'Date': ['04/03/2020', '06/04/2020','08/06/2020','12/12/2020'],
'Tval' : [0.01,0.015,-0.023,-0.0005]
})
and if I need the Tval for, say, '06/04/2020' (just a single date I need the value for), how do I get it? I know merge and join can be used to replicate the VLOOKUP function in Python, but what if it's a single value you're looking for? What's the best way to perform the task?
Pandas docs recommend using loc or iloc for most lookups:
df = pd.DataFrame({
'Date': ['04/03/2020', '06/04/2020','08/06/2020','12/12/2020'],
'Tval' : [0.01,0.015,-0.023,-0.0005]
})
df.loc[df.Date == '06/04/2020', 'Tval']
Here the first part of the expression in the brackets, df.Date == '06/04/2020', selects the row(s) you want to see, and the second part specifies which column(s) you want displayed.
If instead you wanted to see the data for the entire row, you could rewrite it as df.loc[df.Date == '06/04/2020', :].
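If you want the value itself rather than a one-element Series, you can, for example, reduce the result to a scalar (either line below is a common idiom):
df.loc[df.Date == '06/04/2020', 'Tval'].iloc[0]    # 0.015; raises IndexError if the date is absent
df.loc[df.Date == '06/04/2020', 'Tval'].squeeze()  # also 0.015 when exactly one row matches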
Selecting in a dataframe works like this:
df.loc[df.Date == '06/04/2020', "Tval"]
The way to make sense of this is:
df.Date == '06/04/2020'
Out:
0    False
1     True
2    False
3    False
Name: Date, dtype: bool
produces a Series of True/False values showing which rows of the Date column match the equality. If you give a DataFrame such a series, it selects only the rows where the series is True. To see that, look at what you get from:
df.loc[df.Date == '06/04/2020']
Out[]:
         Date   Tval
1  06/04/2020  0.015
Finally, to see just the values of Tval, we do:
df.loc[df.Date == '06/04/2020', "Tval"]
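If you will be doing many such lookups, another option (a sketch; lookup is just a variable name here) is to set Date as the index once and then read single values with at:
lookup = df.set_index('Date')
lookup.at['06/04/2020', 'Tval']   # 0.015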
This is my dataframe so far, and I am trying to convert cols, a list of all the columns from 0 to 188 (cols = list(hdata.columns[range(0, 188)])) which are in the format yyyy-mm, to a DatetimeIndex. There are a few other columns as well which are string names and can't be converted to datetime, so I tried doing this:
hdata[cols].columns = pd.to_datetime(hdata[cols].columns)  # convert the columns to a DatetimeIndex
But this is not working.
Can you please figure out what is wrong here?
Edit:
A better way to work on this type of data is to use the split-apply-combine approach.
Step 1: Split off the data on which you want to perform the specific operation.
nonReqdf = hdata.iloc[:, 188:].sort_index()          # the part that is not being converted
reqdf = hdata.iloc[:, :188]                          # the yyyy-mm columns to work on
reqdf = reqdf.drop(['CountyName', 'Metro', 'RegionID', 'SizeRank'],
                   axis=1, errors='ignore')          # drop the string columns if present in this slice
Step 2: Do the operations. In my case this was converting the column labels (year-month strings) to a DatetimeIndex and resampling quarterly.
reqdf.columns = pd.to_datetime(reqdf.columns)
reqdf = reqdf.resample('Q',axis=1).mean()
reqdf = reqdf.rename(columns=lambda x: str(x.to_period('Q')).lower()).sort_index()  # rename so each label is a string like 2012q1, 2012q2, ...
Step 3: Combine the two split dataframes (merge can also be used, depending on what you want).
reqdf = pd.concat([reqdf,nonReqdf],axis=1)
In order to modify some of the labels of an Index (be it for rows or columns), you need to use df.rename, as in:
for i in range(188):
    df.rename({df.columns[i]: pd.to_datetime(df.columns[i])},
              axis=1, inplace=True)
Or you can avoid looping by building a full-sized index that covers all the columns:
df.columns = (
    pd.to_datetime(cols)               # convert the yyyy-mm strings to a DatetimeIndex
    .append(df.columns[len(cols):])    # then append the remaining labels in their original order
)
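As a small self-contained illustration of that idea (a toy frame with two yyyy-mm columns and one name column; the names and values are made up):
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 'NY']], columns=['2000-01', '2000-02', 'RegionName'])
cols = list(df.columns[:2])            # the yyyy-mm columns
df.columns = (
    pd.to_datetime(cols)               # DatetimeIndex for the date-like labels
    .append(df.columns[len(cols):])    # keep the remaining labels in order
)
print(df.columns)                      # the two dates followed by 'RegionName'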
I want to select rows based on a mask, idx. I can think of two possibilities: using iloc, or just using brackets. I have shown both (on a dataframe df) below. Are they equally viable?
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
idx = idx.values
hr = df["hr"].iloc[idx]
timestamps = df["timestamp"].iloc[idx]
or the following one:
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
hr = df["hr"][idx]
timestamps = df["timestamp"][idx]
No, they are not the same. One uses direct syntax while the other relies on chained indexing.
The crucial points are:
pd.DataFrame.iloc is used primarily for integer position-based indexing.
pd.DataFrame.loc is most often used with labels or Boolean arrays.
Chained indexing, i.e. via df[x][y], is explicitly discouraged and is never necessary.
idx.values returns the NumPy array representation of the idx Series. The Boolean Series itself cannot be fed to .iloc (its NumPy array can), and no conversion is necessary for .loc, which can take idx directly.
Below are two examples which would work. In either example, you can use similar syntax to mask a dataframe or series. For example, df['hr'].loc[mask] would work as well as df.loc[mask].
iloc
Here we use numpy.where to extract integer indices of True elements in a Boolean series. iloc does accept Boolean arrays but, in my opinion, this is less clear; "i" stands for integer.
idx = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
mask = np.where(idx)[0]
df = df.iloc[mask]
loc
Using loc is more natural when we are already querying by specific series.
mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
df = df.loc[mask]
When masking only rows, you can omit the loc accessor altogether and use df[mask].
If masking by rows and filtering for a column, you can use df.loc[mask, 'col_name']
Indexing and Selecting Data is fundamental to pandas: there is no substitute for reading the official documentation.
Don't mix __getitem__ based indexing and (i)loc based. Use one or the other. I prefer (i)loc when you're accessing by index, and __getitem__ when you're accessing by column or using boolean indexing.
Here are some commonly seen bad methods of indexing:
df.loc[idx].loc[:, col]
df.loc[idx][col]
df[column][idx]
df[column].loc[idx]
The correct method for all of the above is df.loc[idx, col]. If idx is an integer position rather than a label, use df.loc[df.index[idx], col].
Most of these versions will cause issues down the pipeline (mainly in the form of a SettingWithCopyWarning) when you try assigning to them, because they create intermediate objects tied to the original DataFrame they're viewing into.
The correct solution for these is df.iloc[idx, df.columns.get_loc(column)], where idx is an array of integer positions and column is a string label. Similarly for loc.
If you have an array of booleans, use loc instead, like this: df.loc[boolean_idx, column]
Furthermore, these are fine: df[column], and df[boolean_mask]
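As a small sketch of the difference when assigning (a toy frame; the column names and values are made up):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.loc[df['a'] > 1, 'b'] = 0                      # fine: a single loc call with a Boolean mask and a label
df.iloc[[0, 2], df.columns.get_loc('b')] = -1     # fine: integer positions on both axes
# df[df['a'] > 1]['b'] = 0                        # chained indexing: warns and may not modify df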
There are rules for indexing a single row or a single column. Depending on how it is done, you will get either a Series or a DataFrame. So, if you want the 100th row of a DataFrame df as a DataFrame slice, you need to do:
df.iloc[[100], :] # `:` selects every column
And not
df.iloc[100, :]   # this returns a Series instead
And similarly for column-based indexing.
Lastly, if you want to index a single scalar, use at or iat.
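For example (assuming df has at least 101 rows; the positions here are just for illustration):
df.iloc[[100], :]            # DataFrame with a single row
df.iloc[100, :]              # Series holding that row's values, indexed by column name
df.iat[100, 0]               # scalar: the value at row position 100, column position 0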
OTOH, for your requirement, I would suggest a third alternative:
ts = df.loc[df.timestamp.between(5, 10), 'timestamp']
Or if you're subsetting the entire thing,
df = df[df.timestamp.between(5, 10)]
I'm practicing using apply with Pandas dataframes.
So I have cooked up a simple dataframe with dates and values:
dates = pd.date_range('2013',periods=10)
values = list(np.arange(1,11,1))
DF = pd.DataFrame({'date': dates, 'value': values})
I have a second dataframe, which is made up of 3 rows of the original dataframe:
DFa = DF.iloc[[1,2,4]]
So, I'd like to use the second dataframe, DFa, get the date from each of its rows (using apply), and then find and sum up the values in the original dataframe for all dates that came earlier:
def foo(DFa, DF=DF):
    cutoff_date = DFa['date']
    ans = DF[DF['date'] < cutoff_date]

DFa.apply(foo, axis=1)
Things work fine. My question is: since I've created three ans values, how do I access them?
Obviously I'm new to apply and I'm eager to get away from loops. I just don't understand how to return values from apply.
Your function needs to return a value. E.g.,
def foo(df1, df2):
    cutoff_date = df1.date
    ans = df2[df2.date < cutoff_date].value.sum()
    return ans

DFa.apply(lambda x: foo(x, DF), axis=1)
Also, note that your current function builds a DataFrame (ans) for each row of DFa; if you returned those instead of a scalar, you would end up with a collection of DataFrames rather than the summed values you want.
There's a bit of a mix-up in the way you're using apply. With axis=1, foo will be applied to each row (see the docs), yet your code implies (by the parameter name) that its first parameter is a DataFrame, when it is actually a row passed in as a Series.
Additionally, you state that you want to sum up the original DataFrame's values for dates earlier than the cutoff, so foo needs to do this and return the result.
So the code needs to look something like this:
def foo(row, DF=DF):
    cutoff_date = row['date']
    return DF[DF['date'] < cutoff_date].value.sum()
Once you make these changes, since foo now returns a scalar, apply will return a Series:
>> DFa.apply(foo, axis=1)
1 1
2 3
4 10
dtype: int64
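If you want to keep these sums alongside DFa, you could, for example, store them in a new column (sum_before is a hypothetical name, just for illustration):
DFa = DFa.assign(sum_before=DFa.apply(foo, axis=1))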