Pandas: Find closest date - without set_index - multiple conditions - python

We have the following pandas DataFrame:
import pandas as pd

data = {'category': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'date': ['2000-01-01', '2000-01-01', '2000-01-01',
                 '2000-01-02', '2000-01-02', '2000-01-02',
                 '2000-01-03', '2000-01-03', '2000-01-03']}
df = pd.DataFrame(data=data)
df['date'] = pd.to_datetime(df['date'])
df
   category       date
0         1 2000-01-01
1         2 2000-01-01
2         3 2000-01-01
3         1 2000-01-02
4         2 2000-01-02
5         3 2000-01-02
6         1 2000-01-03
7         2 2000-01-03
8         3 2000-01-03
How can we query this dataframe to find the row whose date is closest to 2000-01-02 within category 3? We are looking for the row with index 5.
This should be accomplished without set_index('date'): setting the index on the actual data (as opposed to this example data) raises the following error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Take a subset with the relevant category, subtract the target date, and get the idxmin:
tmp = df.loc[df.category.eq(3)]
(tmp.date - pd.to_datetime("2000-01-02")).abs().idxmin()
# 5
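To pull the matching row once you have its label, plain .loc works; a small usage sketch:
idx = (tmp.date - pd.to_datetime("2000-01-02")).abs().idxmin()
df.loc[idx]  # category 3, date 2000-01-02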

To get the (first) index of the closest date with category 3 you could use:
m = df['category'].eq(3)
d = df['date'].sub(pd.Timestamp('2000-01-02')).abs()
d.loc[m].idxmin()
output: 5
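If you need the nearest row for many target dates at once, pd.merge_asof with direction='nearest' may be worth a look; a minimal sketch, assuming df from the question (merge_asof needs both frames sorted by the key):
targets = pd.DataFrame({'date': pd.to_datetime(['2000-01-02'])})
pd.merge_asof(
    targets.sort_values('date'),
    df[df['category'].eq(3)].sort_values('date').reset_index(),
    on='date',
    direction='nearest',
)  # the original row label comes back in the 'index' column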

If an exact date match exists, plain boolean indexing also works:
df[(df['category'] == 3) & (df['date'] == pd.Timestamp(2000, 1, 2))]
To get the list of all matching indices:
df.index[(df['category'] == 3) & (df['date'] == pd.Timestamp(2000, 1, 2))].tolist()

Related

How to retrieve unnamed columns after a groupby and unstack?

I have a dataset of events with a date column which I need to display in a weekly plot and do some more data processing afterwards. After some googling I found pd.Grouper(freq="W"), so I am using that to group the events by week and display them. My problem is that after doing the groupby and unstack I end up with a data frame containing an unnamed column that I am unable to refer to except via iloc. This is an issue because in later plots I am grouping by other columns, so I need a way to refer to this column by name, not iloc.
Here's a reproducible example of my dataset:
import pandas as pd
from datetime import datetime
from faker import Faker

fake = Faker()
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)
# Generate a data frame of 30 random dates in January 2023
df = pd.DataFrame(
    {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
     "dummy": [1 for i in range(30)]})  # There's probably a better way of counting than this
grouper = df.set_index("date").groupby([pd.Grouper(freq="W"), 'dummy'])
result = grouper['dummy'].count().unstack('dummy').fillna(0)
The result data frame that I get has weird indexes/columns that I am unable to navigate:
>>> print(result)
dummy       1
date
2023-01-01  1
2023-01-08  3
2023-01-15  4
2023-01-22  9
2023-01-29  8
2023-02-05  5
>>> print(result.columns)
Int64Index([1], dtype='int64', name='dummy')
The only column here is dummy, but even result.dummy raises an AttributeError.
I've also tried result.reset_index():
dummy       date  1
0     2023-01-01  1
1     2023-01-08  3
2     2023-01-15  4
3     2023-01-22  9
4     2023-01-29  8
5     2023-02-05  5
But for this data frame I can only get the date column - the counts column named "1" cannot be accessed using result.reset_index()["1"] as I get an AttributeError
I am completely perplexed by what is going on here; pandas is really powerful, but sometimes I find it incredibly unintuitive. I've checked several pages of the docs and checked if there's another index level (there isn't). Can someone who's better at pandas help me out here?
I just want a way to convert the grouped data frame into something like this:
        date  counts
0 2023-01-01       1
1 2023-01-08       3
2 2023-01-15       4
3 2023-01-22       9
4 2023-01-29       8
5 2023-02-05       5
Where date and counts are columns and there is an unnamed index
You can solve this by simply doing:
import pandas as pd
from datetime import datetime
from faker import Faker

fake = Faker()
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)
# Generate a data frame of 30 random dates in January 2023
df = pd.DataFrame(
    {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
     "dummy": [1 for i in range(30)]})

result = df.groupby([pd.Grouper(freq="W", key='date'), 'dummy'])['dummy'].count()
result = result.reset_index(name='counts')
result = result.drop(['dummy'], axis=1)
which gives
        date  counts
0 2023-01-01       3
1 2023-01-08       7
2 2023-01-15       5
3 2023-01-22       5
4 2023-01-29       8
5 2023-02-05       2
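Incidentally, the "unnamed" column in the original unstack attempt is reachable after all: its label is the integer 1 (a value of dummy), not the string "1". A small sketch, reusing grouper from the question's code:
unstacked = grouper['dummy'].count().unstack('dummy').fillna(0)  # as in the question
counts = unstacked[1]  # works; unstacked['1'] raises a KeyError
tidy = unstacked.rename(columns={1: 'counts'}).rename_axis(columns=None).reset_index()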

Getting the max value from a list of columns by their index in Pandas

I have a dataframe with a variety of columns, but the key data I am looking to extract is in columns whose names are datetime values and which hold a floating-point number for currency.
I am basically just looking to find the max value, per row, of any column with a date-valued name (i.e. 2021-01-15 00:00:00). I originally used list() to try to find any column with '-' in its name, but I'm guessing that due to the format I can't directly reference the datetime values?
Example
df:
index, ID, Cost, 2021-01-01 00:00:00, 2021-01-08 00:00:00, 2021-01-15 00:00:00
0, 1, 4000, 40.50, 50.55, 60.99
0, 1, 500, 20.50, 80.55, 160.99
0, 1, 4000, 40.50, 530.55, 1660.99
0, 1, 5000, 40.50, 90.55, 18860.99
0, 1, 9000, 40.50, 590.55, 73760.99
You can find the 'date' columns using a list comprehension that keeps the column names containing the date separator, then use max(axis=1) to create a column holding the highest value per row of your date-like columns. (Here the separator is '/' because the actual data, as the output below shows, has dd/mm/yyyy column names; use '-' for the ISO-style names in the example.)
date_cols = [c for c in list(df) if '/' in c]
df['max_per_row'] = df[date_cols].max(axis=1)
prints:
   index  ID  Cost  ...  08/01/2021 00:00  15/01/2021 00:00  max_per_row
0      0   1  4000  ...             50.55             60.99        60.99
1      0   1   500  ...             80.55            160.99       160.99
2      0   1  4000  ...            530.55           1660.99      1660.99
3      0   1  5000  ...             90.55          18860.99     18860.99
4      0   1  9000  ...            590.55          73760.99     73760.99
Use DataFrame.iloc to select all columns except the first two:
df['new'] = df.iloc[:, 2:].max(axis=1)
If need select float columns use DataFrame.select_dtypes:
df['new'] = df.select_dtypes('float').max(axis=1)
For columns whose names contain '-' use DataFrame.filter:
df['new'] = df.filter(like='-').max(axis=1)
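If the separator is not known in advance, a more robust (if slower) check is to ask pandas whether each column name parses as a date. A small sketch, assuming string column names:
import pandas as pd

def is_date_like(name):
    # True when pandas can parse the column name as a datetime
    try:
        pd.to_datetime(name)
        return True
    except (ValueError, TypeError):
        return False

date_cols = [c for c in df.columns if is_date_like(str(c))]
df['max_per_row'] = df[date_cols].max(axis=1)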

Setting values with pandas.DataFrame

Having this DataFrame:
import pandas
dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)
df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)
df
I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).
When I try:
df['foo'][0] = df['foo'][0].resample('D').sum().dropna()
That seems to work well. However, I get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The question is, how should I do this instead?
Notes
Things I have tried but do not work (resampling or not, the assignment raises an exception):
df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']
A bit more information about the data (in case it is relevant):
The real DataFrame has more columns in the multi-index. Not all of them necessarily integers, but more generally numerical and categorical. The index is unique (i.e.: there is only one row with a given index value).
The real DataFrame has, of course, many more rows in it (thousands).
There are not necessarily only two columns in the DataFrame, and there may be more than one column containing a Series type. Columns usually contain series, categorical data and numerical data as well. Any single column is always single-typed (either numerical, or categorical, or series).
The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).
Using Python 3.5.1 and Pandas 0.18.1.
This should work:
df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()
Pandas is complaining about chained indexing, but when you don't do it that way it has problems assigning a whole Series to a single cell. With iat you can force an assignment like that. I don't think it would be the preferable thing to do, but it seems like a working solution.
Simply set df.is_copy = False before assigning the new value. (Note: is_copy was deprecated in later pandas versions, so this only silences the warning on older releases.)
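Another way to sidestep the warning entirely is to rebuild the whole object column instead of writing into a single cell; a minimal sketch, assuming every cell of 'foo' holds a Series as in the question:
# Whole-column assignment never triggers SettingWithCopyWarning
df['foo'] = [s.resample('D').sum().dropna() for s in df['foo']]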
Hierarchical data in pandas
It really seems like you should consider restructuring your data to take advantage of pandas features such as MultiIndexing and DatetimeIndex. This will allow you to still operate on an index in the typical way while being able to select on multiple columns across the hierarchical data (a, b, and bar).
Restructured Data
import pandas as pd
# Define Index
dates = pd.date_range('2016-01-01', periods=5, freq='H')
# Define Series
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Place Series in Hierarchical DataFrame
heirIndex = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame(s.values, index=s.index, columns=heirIndex)
print(df)
a                    1
b                    2
bar                  8
2016-01-01 00:00:00  0
2016-01-01 01:00:00  1
2016-01-01 02:00:00  2
2016-01-01 03:00:00  3
2016-01-01 04:00:00  4
Resampling
With the data in this format, resampling becomes very simple.
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna()
print(df_resampled)
a            1
b            2
bar          8
2016-01-01  10
Update (from data description)
If the data has variable-length Series, each with a different index, and non-numeric categories, that is fine. Let's make an example:
# Define first series
dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Define second series
dates2 = pd.date_range('2016-01-14', periods=6, freq='H')
s2 = pd.Series([-200, 10, 24, 30, 40, 100], index=dates2)
# Define DataFrames
df1 = pd.DataFrame(s.values, index=s.index,
                   columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=['a', 'b', 'bar', 'c']))
df2 = pd.DataFrame(s2.values, index=s2.index,
                   columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=['a', 'b', 'bar', 'c']))
df = pd.concat([df1, df2])
print(df)
a                      1      2
b                      2      5
bar                    8      5
c                   cat1   cat3
2016-01-01 00:00:00  0.0    NaN
2016-01-01 01:00:00  1.0    NaN
2016-01-01 02:00:00  2.0    NaN
2016-01-01 03:00:00  3.0    NaN
2016-01-01 04:00:00  4.0    NaN
2016-01-14 00:00:00  NaN -200.0
2016-01-14 01:00:00  NaN   10.0
2016-01-14 02:00:00  NaN   24.0
2016-01-14 03:00:00  NaN   30.0
2016-01-14 04:00:00  NaN   40.0
2016-01-14 05:00:00  NaN  100.0
The only issue is that, after resampling, you will want to use how='all' when dropping NaN rows, like this:
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna(how='all')
print(df_resampled)
a              1    2
b              2    5
bar            8    5
c           cat1 cat3
2016-01-01  10.0  NaN
2016-01-14   NaN  4.0
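One caveat on newer pandas versions (this example was written against 0.18): since pandas 0.22, sum() returns 0.0 for all-NaN groups instead of NaN, so dropna(how='all') no longer removes them. Passing min_count=1 restores the old behaviour:
df_resampled = df.resample('D').sum(min_count=1).dropna(how='all')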

reindex to add missing dates to pandas dataframe

I am trying to parse a CSV file which looks like this:
dd.mm.yyyy value
01.01.2000 1
02.01.2000 2
01.02.2000 3
I need to add the missing dates and fill the corresponding values with NaN. I used Series.reindex as in this question:
import pandas as pd
ts = pd.read_csv(file, sep=';', parse_dates=True, index_col=0)
idx = pd.date_range('01.01.2000', '02.01.2000')
ts.index = pd.DatetimeIndex(ts.index)
ts = ts.reindex(idx, fill_value='NaN')
But in the result, values for certain dates are swapped due to the date format (i.e. mm/dd instead of dd/mm):
01.01.2000 1
02.01.2000 3
03.01.2000 NaN
...
...
31.01.2000 NaN
01.02.2000 2
I tried several ways (e.g. adding dayfirst=True to read_csv) to do it right but still can't figure it out. Please, help.
Set parse_dates to the first column with parse_dates=[0]:
ts = pd.read_csv(file, sep=';', parse_dates=[0], index_col=0, dayfirst=True)
idx = pd.date_range('2000-01-01', '2000-02-01')
ts = ts.reindex(idx)  # missing dates are filled with NaN by default
print(ts)
prints:
            value
2000-01-01      1
2000-01-02      2
2000-01-03    NaN
...
2000-01-31    NaN
2000-02-01      3
parse_dates=[0] tells pandas to explicitly parse the first column as dates. From the docs:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
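As a related shortcut, once the index is a proper DatetimeIndex, asfreq('D') also inserts the missing dates as NaN rows; a small sketch (note that unlike reindex it only spans the range between the existing first and last dates):
ts = ts.asfreq('D')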

Pandas Return Only Repeated Results

I have a Pandas DataFrame with the columns:
UserID, Date, (other columns that we can ignore here)
I'm trying to select out only users that have visited on multiple dates. I'm currently doing it with groupby(['UserID', 'Date']) and a for loop, where I drop users with only one result, but I feel like there is a much better way to do this.
Thanks
It depends on the exact format of output you want, but you can count distinct Dates inside each UserID and keep those where this count > 1 (like having count(distinct Date) > 1 in SQL):
>>> df
                 Date  UserID
0 2013-01-01 00:00:00       1
1 2013-01-02 00:00:00       2
2 2013-01-02 00:00:00       2
3 2013-01-02 00:00:00       1
4 2013-01-02 00:00:00       3
>>> g = df.groupby('UserID').Date.nunique()
>>> g
UserID
1    2
2    1
3    1
>>> g > 1
UserID
1     True
2    False
3    False
dtype: bool
>>> g[g > 1]
UserID
1    2
You can see that you get UserID = 1 as a result; it's the only user who visited on multiple dates.
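To go from that Series back to the original rows, a minimal sketch using the same df:
repeat_users = g[g > 1].index
df[df.UserID.isin(repeat_users)]
or, in one step: df[df.groupby('UserID').Date.transform('nunique') > 1].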
To count unique dates for every UserID:
df.groupby("UserID").Date.agg(lambda s: len(s.unique()))
Then you can drop users with only one count.
For the sake of adding another answer, you can also use indexing with a list comprehension:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'UserID': [1, 1, 2, 3, 4, 4, 5], 'Data': np.random.rand(7)})
DF.loc[[row for row in DF.index if list(DF.UserID).count(DF.UserID[row]) > 1]]
This might be as much work as your for loop, but it's just another option for you to consider....
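On recent pandas the same row-repetition idea is a one-liner with duplicated; keep=False marks every row whose UserID occurs more than once:
DF[DF.duplicated('UserID', keep=False)]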
