I have a data frame CaliSimNG
CaliSimNG
Date Sim1 Sim2 Sim3 Sim4 Sim5
0 2018-01-01 4.410628 5.181019 3.283512 2.289767 6.930455
1 2018-01-02 3.919023 5.572350 4.899945 1.858528 7.724655
2 2018-01-03 4.804969 4.477524 7.339943 1.963685 8.186425
3 2018-01-04 4.226408 4.208243 18.850381 1.967792 27.341537
4 2018-01-05 4.441108 3.731662 14.349406 2.000143 7.804742
I want to select the rows for certain dates. The dates are given by the datetime array DesiredDates:
DesiredDates
array(['2018-01-01T19:00:00.000000000-0500',
'2018-01-04T19:00:00.000000000-0500',
'2018-01-05T19:00:00.000000000-0500'],
dtype='datetime64[ns]')
How can I get a subset of CaliSimNG using the datetime values in DesiredDates?
Thanks
You can do an inner join using the pandas merge function.
For example:
import pandas as pd
left = pd.DataFrame({'Date': ['date1', 'date2', 'date3'], 'v': [1, 2, 3]})
right = pd.DataFrame({'Date': ['date2']})
joined = pd.merge(left, right, on='Date')
Produces:
joined
Date v
0 date2 2
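Applied to your data, a rough sketch of the same idea (assuming CaliSimNG['Date'] is, or can be parsed as, datetime64, and that each timestamp in DesiredDates should be matched on its calendar date):
import pandas as pd
CaliSimNG['Date'] = pd.to_datetime(CaliSimNG['Date'])
# drop the 19:00 time-of-day component so the timestamps line up with the daily dates
wanted = pd.DataFrame({'Date': pd.to_datetime(DesiredDates).normalize()})
subset = pd.merge(CaliSimNG, wanted, on='Date')
An isin-based filter, CaliSimNG[CaliSimNG['Date'].isin(wanted['Date'])], should select the same rows without a merge.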
I have some data I want to count by month. The column I want to count has three possible values, each representing a different type of car sold. Here is an example of my dataframe:
Date Type_Car_Sold
2015-01-01 00:00:00 2
2015-01-01 00:00:00 1
2015-01-01 00:00:00 1
2015-01-01 00:00:00 3
... ...
I want to make it so I have a dataframe that counts each specific car type sold by month separately, so looking like this:
Month Car_Type_1 Car_Type_2 Car_Type_3 Total_Cars_Sold
1 15 12 17 44
2 9 18 20 47
... ... ... ... ...
How exactly would I go about doing this? I've tried doing:
cars_sold = car_data['Type_Car_Sold'].groupby(car_data.Date.dt.month).agg('count')
but that just gives the total number of cars sold in each month, rather than breaking it down by how many of each type were sold. Any thoughts?
Maybe not the cleanest solution, but this should get you pretty close.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
"Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
"Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df['Value'] = 1
print(pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'], aggfunc='count'))
Type 1 2
Date
2022-01 1.0 1.0
2022-02 2.0 NaN
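If you also want the Total_Cars_Sold column from your desired output, one possible extension of the pivot above is to fill the missing counts with zero and sum across the type columns:
counts = pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'], aggfunc='count').fillna(0).astype(int)
counts['Total_Cars_Sold'] = counts.sum(axis=1)
print(counts)
Type     1  2  Total_Cars_Sold
Date
2022-01  1  1                2
2022-02  2  0                2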
Alternatively you can also pass multiple columns to groupby:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
"Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
"Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df.groupby(['Date', 'Type']).size()
Date Type
2022-01 1 1
2 1
2022-02 1 2
dtype: int64
This has the unfortunate side effect of excluding (Date, Type) combinations with zero count, and the result is a Series with a MultiIndex rather than dates as rows and types as columns.
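One way to address both points is to unstack the Type level and fill the missing combinations with zero, e.g. with the same df as above:
df.groupby(['Date', 'Type']).size().unstack(fill_value=0)
Type     1  2
Date
2022-01  1  1
2022-02  2  0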
For more information on this approach, check this question.
Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates date_list and a data frame df which, for now, contains one column named Event Date holding the date that an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
pre_df_list = []
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] <= date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge it with the previous result. The ffill() fills gaps in the middle of the data, while the final fillna(0) handles gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.ffill().fillna(0)
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> from datetime import date
>>> import pandas as pd
>>> df = pd.DataFrame({'event_date': [date(2020,9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03
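If you then want the count of events up to each date in date_list, summing the boolean mask avoids the row-wise apply; the remaining loop is only over the dates, not over the rows of df. A sketch with a toy date_list, assuming the column and the list hold comparable date values:
>>> date_list = [date(2020, 9, 1), date(2020, 9, 2), date(2020, 9, 3)]
>>> pd.DataFrame({'Date': date_list,
...               'Events': [(df.event_date <= d).sum() for d in date_list]})
         Date  Events
0  2020-09-01       1
1  2020-09-02       2
2  2020-09-03       3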
Looking to return a dataframe which contains the last row (the row with the most recent date index) of each group, where the second level of the multi-index is filtered by a logical indexing condition.
Here is a toy example included to explain better:
import numpy as np
import pandas as pd
from datetime import datetime
dates = pd.date_range(start='1/1/2018', end='1/4/2018').to_pydatetime().tolist() * 2
ids = ['z7321', 'z7321', 'z7321', 'z7321', 'b2134', 'b2134', 'b2134', 'b2134']
arrays = [ids, dates]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['key', 'date'])
df = pd.DataFrame(data=np.random.randn(len(index)), index=index, columns=['change'])
print(df)
change
key date
z7321 2018-01-01 -0.701605
2018-01-02 -0.934580
2018-01-03 0.186554
2018-01-04 0.417024
b2134 2018-01-01 0.682699
2018-01-02 -0.913633
2018-01-03 0.330347
2018-01-04 -0.706429
The condition would be to return the last row of each group after filtering with df[df.index.get_level_values(1) <= datetime(2018, 1, 2)].
The desired output looks like this:
change
key date
z7321 2018-01-02 -0.934580
b2134 2018-01-02 -0.913633
Additional Considerations:
Directly selecting the rows using df[df.index.get_level_values(1) == datetime(2018, 1, 2)] isn't an option since the second index level (date level) may not contain an exact date match for the specified value of datetime(2018, 1, 2)
The date index may not contain the same values across the key groups/index. i.e. 'z7321' could have different dates in the second level index than 'b2134'
As I wrote my toy example, I ended up finding a way to get the desired output. Hopefully this solution is helpful to someone else or perhaps can be improved upon.
The following provides the desired output:
df1 = df[df.index.get_level_values(1) <= datetime(2018, 1, 2)].groupby(level='key', as_index=False).nth(-1)
print(df1)
change
key date
z7321 2018-01-02 -0.934580
b2134 2018-01-02 -0.913633
Which also works for cases where the second index level is inconsistent across the first level groups:
import numpy as np
import pandas as pd
from datetime import datetime
dates = pd.date_range(start='1/1/2018', end='1/4/2018').to_pydatetime().tolist()
dates += pd.date_range(start='12/29/2017', end='1/1/2018').to_pydatetime().tolist()
ids = ['z7321', 'z7321', 'z7321', 'z7321', 'b2134', 'b2134', 'b2134', 'b2134']
arrays = [ids, dates]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['key', 'date'])
df = pd.DataFrame(data=np.random.randn(len(index)), index=index, columns=['change'])
print(df)
change
key date
z7321 2018-01-01 -1.420757
2018-01-02 -0.297835
2018-01-03 0.693520
2018-01-04 0.909420
b2134 2017-12-29 -1.577685
2017-12-30 0.632395
2017-12-31 1.158273
2018-01-01 -0.242314
df1 = df[df.index.get_level_values(1) <= datetime(2018, 1, 2)].groupby(level='key', as_index=False).nth(-1)
print(df1)
change
key date
z7321 2018-01-02 -0.297835
b2134 2018-01-01 -0.242314
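For what it's worth, tail(1) on the filtered frame should give the same result a little more tersely, since the dates are sorted within each key and groupby keeps the original MultiIndex:
df1 = df[df.index.get_level_values(1) <= datetime(2018, 1, 2)].groupby(level='key').tail(1)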
I have the following dictionary and dataframe
cust_dict = {'ABC': Empty DataFrame
Columns: [Date, Particulars, Vch No., Outwards, Amount]
Index: [], 'BCD': Empty DataFrame
Columns: [Date, Particulars, Vch No., Outwards, Amount]
Index: []}
df
Date Particulars Vch Type
0 2017-04-01 00:00:00 ABC Sales
1 2017-04-06 00:00:00 BCD Sales
1 2017-04-05 00:00:00 ABC Sales
I am trying to take 'ABC' from df as the key, pull up the corresponding dataframe from the dictionary, and add the date to the Date column of that nested dataframe. I have tried .at, append and assign.
for index, row in df.iterrows():
    print(row['Particulars'])
    cust_name = row['Particulars']
    cust_dict[cust_name] = cust_dict[cust_name]['Date'].append(date)
    cust_dict[cust_name].at['Date'] = row['Date']
    # A lot of variations of .at
    if cust_name == 'ABC':
        code = 4
        cust_dict[cust_name]['Particulars'] = code
    elif cust_name == 'BCD':
        code = 5
        cust_dict[cust_name]['Particulars'] = code
I am not sure how to go about this, or whether it is simply not possible.
The df will have multiple rows, and the Particulars column may contain a given company, say ABC, 4-5 times or more.
Expected output:
cust_dict['ABC']
Date Particulars Vch Type
0 2017-04-01 00:00:00 4 Sales
1 2017-04-05 00:00:00 4 Sales
This is one way via a dictionary comprehension.
As shown below, I advise using a dictionary to map Particulars instead of an if / elif construct.
import pandas as pd
df = pd.DataFrame([['2017-04-01 00:00:00', 'ABC', 'Sales'],
['2017-04-06 00:00:00', 'BCD', 'Sales'],
['2017-04-05 00:00:00', 'ABC', 'Sales']],
index=[0, 1, 1],
columns=['Date', 'Particulars', 'Vch Type'])
part_map = {'ABC': 4, 'BCD': 5}
result = {k: df[df['Particulars'] == k].assign(Particulars=part_map[k]) \
for k in df['Particulars'].unique()}
print(result['ABC'])
# Date Particulars Vch Type
# 0 2017-04-01 00:00:00 4 Sales
# 1 2017-04-05 00:00:00 4 Sales
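As a side note, if you ever need the numeric codes in the original frame as well, the same mapping can be applied in one step rather than via if / elif:
df['Particulars'] = df['Particulars'].map(part_map)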
Assume that I have the following data set
import pandas as pd, numpy, datetime
start, end = datetime.datetime(2015, 1, 1), datetime.datetime(2015, 12, 31)
date_list = pd.date_range(start, end, freq='B')
numdays = len(date_list)
value = numpy.random.normal(loc=1e3, scale=50, size=numdays)
ids = numpy.repeat([1], numdays)
test_df = pd.DataFrame({'Id': ids,
'Date': date_list,
'Value': value})
I would now like to calculate the maximum within each business quarter for test_df. One possibility is to use resample with rule='BQ', how='max'. However, I'd like to keep the structure of the DataFrame and just generate another column with the maximum for each business quarter. Have you got any suggestions on how to do this?
I think the following should work for you: this groups on the quarter, calls transform on the 'Value' column, and returns the maximum value as a Series with its index aligned to the original df:
In [26]:
test_df['max'] = test_df.groupby(test_df['Date'].dt.quarter)['Value'].transform('max')
test_df
Out[26]:
Date Id Value max
0 2015-01-01 1 1005.498555 1100.197059
1 2015-01-02 1 1032.235987 1100.197059
2 2015-01-05 1 986.906171 1100.197059
3 2015-01-06 1 984.473338 1100.197059
........
256 2015-12-25 1 997.965285 1145.215837
257 2015-12-28 1 929.652812 1145.215837
258 2015-12-29 1 1086.128017 1145.215837
259 2015-12-30 1 921.663949 1145.215837
260 2015-12-31 1 938.189566 1145.215837
[261 rows x 4 columns]
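If you specifically need business-quarter boundaries (as in your resample idea), or if the data ever spans more than one year (dt.quarter would group e.g. Q1 2015 and Q1 2016 together), a variant using pd.Grouper should keep the same row structure. A sketch, assuming the 'BQ' frequency alias is accepted by your pandas version:
test_df['max'] = test_df.groupby(pd.Grouper(key='Date', freq='BQ'))['Value'].transform('max')
For this single year of data it gives the same values as grouping on dt.quarter.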