I need to query a pandas DataFrame that was handed to me and whose dates are in a quarterly period format.
Data
import pandas as pd
import datetime
table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 27), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 4, 1), 1, 2],
         [datetime.datetime(2015, 4, 3), 1, 2],
         [datetime.datetime(2015, 4, 15), 1, 2],
         [datetime.datetime(2015, 5, 28), 1, 2],
         [datetime.datetime(2015, 5, 1), 1, 3],
         [datetime.datetime(2015, 5, 17), 1, 3],
         [datetime.datetime(2015, 8, 31), 1, 3]]
df = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])
df = df.assign(Date = lambda x: x.Date.dt.to_period('Q'))
Code
df.query("Date == '2015Q2'")
results in an empty dataframe.
It works for me if I compare against a quarter Period instead of a string:
df = df.query("Date == @pd.Period('2015Q2', 'Q')")
print (df)
Date Id Value
3 2015Q2 1 2.0
4 2015Q2 1 2.0
5 2015Q2 1 2.0
6 2015Q2 1 2.0
7 2015Q2 1 3.0
8 2015Q2 1 3.0
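The same comparison also works if you first bind the Period to a local variable and reference it in query with the @ prefix (a minimal equivalent sketch; the variable name q is arbitrary):
q = pd.Period('2015Q2', freq='Q')
print (df.query("Date == @q"))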
Boolean indexing also works correctly:
df = df[df["Date"] == '2015Q2']
print (df)
Date Id Value
3 2015Q2 1 2.0
4 2015Q2 1 2.0
5 2015Q2 1 2.0
6 2015Q2 1 2.0
7 2015Q2 1 3.0
8 2015Q2 1 3.0
Related
I am trying to calculate a sum for each Date; however, I only want to sum the Values of IDs that are present in both the current and the previous Date, so a rolling comparison of IDs and then a groupby sum. Currently I have to loop over the dataframe, which is very slow.
For example my df:
df = pd.DataFrame({
    'Date': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'ID': [1, 2, 3, 4, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
})
Ideally I want to group the dataframe by Date and sum only the Values of IDs that are common between consecutive dates, as in the example below. However, this is very slow.
tmpL = df.groupby('Date')['ID'].apply(list)
tmpV = df.groupby('Date')['Value'].sum()
for i in range(1, tmpL.shape[0]):
    res = list(set(tmpL.iloc[i]) - set(tmpL.iloc[i - 1]))
    v = df.loc[df.ID.isin(res) & (df.Date == tmpL.index[i]), 'Value'].sum()
    tmpV.iloc[i] = tmpV.iloc[i] - v
tmpV
Date
1 10
2 18
3 27
4 42
Name: Value, dtype: int64
Is there a way to do this in pandas without looping over the dataframe?
Use DataFrame.pivot_table with an aggregate sum, detect presence changes with DataFrame.diff on the not-null mask, and finally pass that mask to DataFrame.mask before summing:
df1 = df.pivot_table(index='Date', columns='ID', values='Value', aggfunc='sum')
s = df1.mask(df1.notna().diff().fillna(False)).sum(axis=1)
print (s)
Date
1 10.0
2 18.0
3 27.0
4 42.0
dtype: float64
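To see why the masking works, here is a small sketch (assuming a pandas version where .diff() on a boolean frame flags changes against the previous row) printing the intermediate wide table and the presence-change mask:
# df1 is the pivot table built above: one column per ID, NaN where that ID is absent on a date
print (df1)
# True where an ID just appeared or disappeared compared with the previous date;
# DataFrame.mask excludes exactly those cells from the row sum
print (df1.notna().diff().fillna(False))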
A first solution, which I think is slower:
You can get, for each date, the IDs that were not present on the previous date by converting each group to a set and applying Series.diff (on object dtype the subtraction is a set difference), then Series.explode; map those IDs back to their Values with DataFrame.merge, aggregate the sum, and finally subtract it from the per-date total:
tmpL = (df.groupby('Date')['ID'].apply(set)
          .diff()
          .explode()
          .reset_index()
          .merge(df)
          .groupby('Date')['Value']
          .sum())
tmpV = df.groupby('Date')['Value'].sum()
out = tmpV.sub(tmpL, fill_value=0)
print (out)
Date
1 10.0
2 18.0
3 27.0
4 42.0
Try (the first row of condition is set to True so the first date keeps its full sum):
df = df.pivot_table(index='Date', columns='ID', values='Value')
condition = df.notna() & df.notna().shift(1)
condition.iloc[0, :] = True
print(df[condition].sum(axis=1))
Output:
Date
1 10.0
2 18.0
3 27.0
4 42.0
Suppose I have the following data frame
from pandas import DataFrame
Cars = {'value': [10, 31, 661, 1, 51, 61, 551],
        'action1': [1, 1, 1, 1, 1, 1, 1],
        'price1': [12, 0, 15, 3, 0, 12, 0],
        'action2': [2, 2, 2, 2, 2, 2, 2],
        'price2': [0, 16, 19, 0, 1, 10, 0],
        'action3': [3, 3, 3, 3, 3, 3, 3],
        'price3': [14, 36, 9, 0, 0, 0, 0]
        }
df = DataFrame(Cars, columns=['value', 'action1', 'price1', 'action2', 'price2', 'action3', 'price3'])
print (df)
How can I randomly select a value (an action and its price) from among the 3 column pairs for each row? As a result I want a dataframe that looks something like this one:
RandCars = {'value': [10, 31, 661, 1, 51, 61, 551],
            'action': [1, 3, 1, 3, 1, 2, 2],
            'price': [12, 36, 15, 0, 3, 10, 0]
            }
df2 = DataFrame(RandCars, columns = ['value','action', 'price'])
print(df2)
Use:
import numpy as np
import pandas as pd

# get column names not starting with 'action' or 'price'
cols = df.columns[~df.columns.str.startswith(('action','price'))]
print (cols)
Index(['value'], dtype='object')
#convert filtered columns to 2 numpy arrays
arr1 = df.filter(regex='^action').values
arr2 = df.filter(regex='^price').values
#pandas 0.24+
#arr1 = df.filter(regex='^action').to_numpy()
#arr2 = df.filter(regex='^price').to_numpy()
i, c = arr1.shape
# pick one random column position per row for building the new df
idx = np.random.choice(np.arange(c), i)
df3 = pd.DataFrame({'action': arr1[np.arange(len(df)), idx],
'price': arr2[np.arange(len(df)), idx]},
index=df.index)
print (df3)
action price
0 2 0
1 3 36
2 3 9
3 1 3
4 3 0
5 1 12
6 1 0
# add back all the other columns with a join
df4 = df[cols].join(df3)
print (df4)
value action price
0 10 2 0
1 31 3 36
2 661 3 9
3 1 1 3
4 51 3 0
5 61 1 12
6 551 1 0
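If the random pick needs to be reproducible, you can seed NumPy before drawing the column positions (a small usage sketch; the seed value is arbitrary):
np.random.seed(2019)  # arbitrary seed so the same columns are picked on every run
idx = np.random.choice(np.arange(c), i)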
Let's assume that I have the following data-frame:
import numpy as np
import pandas as pd

df_raw = pd.DataFrame({"id": [102, 102, 103, 103, 103],
                       "val1": [9, 2, 4, 7, 6],
                       "val2": [np.nan, 3, np.nan, 4, 5],
                       "val3": [4, np.nan, np.nan, 5, 1],
                       "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3),
                                pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9),
                                pd.Timestamp(2005, 2, 3)]})
I want to access the rows containing the first occurrence of each id. Those rows would be:
df_first = pd.DataFrame({"id": [102, 103],
                         "val1": [9, 4],
                         "val2": [np.nan, np.nan],
                         "val3": [4, np.nan],
                         "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2003, 4, 4)]})
Basically, what I would like to achieve in the end is to fill the NaNs that appear in the first occurrence of each id, so the final data frame would be:
df_processed = pd.DataFrame({"id": [102, 102, 103, 103, 103],
                             "val1": [9, 2, 4, 7, 6],
                             "val2": [-1, 3, -1, 4, 5],
                             "val3": [4, np.nan, -1, 5, 1],
                             "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3),
                                      pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9),
                                      pd.Timestamp(2005, 2, 3)]})
An important note is that the rows are already grouped by id and date and sorted in ascending order, so they appear exactly as in the provided example.
IIUC, use drop_duplicates, then concat:
df1 = df_raw.drop_duplicates('id').fillna(-1)
target = pd.concat([df1, df_raw.loc[~df_raw.index.isin(df1.index)]]).sort_index()
target
date id val1 val2 val3
0 2002-01-01 102 9 -1.0 4.0
1 2002-03-03 102 2 3.0 NaN
2 2003-04-04 103 4 -1.0 -1.0
3 2003-08-09 103 7 4.0 5.0
4 2005-02-03 103 6 5.0 1.0
You can use pd.Series.duplicated with Boolean row indexing:
mask = ~df_raw['id'].duplicated()
val_cols = ['val2', 'val3']
df_raw.loc[mask, val_cols] = df_raw.loc[mask, val_cols].fillna(-1)
print(df_raw)
id val1 val2 val3 date
0 102 9 -1.0 4.0 2002-01-01
1 102 2 3.0 NaN 2002-03-03
2 103 4 -1.0 -1.0 2003-04-04
3 103 7 4.0 5.0 2003-08-09
4 103 6 5.0 1.0 2005-02-03
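An equivalent mask, if you prefer a groupby spelling (a small sketch; it marks the same first-per-id rows as ~duplicated()):
# cumcount() is 0 on the first row of each id, so eq(0) flags first occurrences
mask = df_raw.groupby('id').cumcount().eq(0)
df_raw.loc[mask, ['val2', 'val3']] = df_raw.loc[mask, ['val2', 'val3']].fillna(-1)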
I have NOAA weather data. In its raw state it has year and month as rows and the days of the month as columns. I want to expand the rows so that each row has a year, month, and day with the appropriate data in it.
There is also an attribute column where each row represents a different weather variable collected that month. The number of weather variables collected in a month can change. (In January there are two (tmax, tmin), in February there are three (tmax, tmin, prcp), and in March there is one (tmax).)
Here is an example df.
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'station': ['USC1', 'USC1', 'USC1', 'USC1', 'USC1', 'USC1'],
                           'year': [1993, 1993, 1993, 1993, 1993, 1993],
                           'month': [1, 1, 2, 2, 2, 3],
                           'attribute': ['tmax', 'tmin', 'tmax', 'tmin', 'prcp', 'tmax'],
                           'day1': range(1, 7, 1),
                           'day2': range(1, 7, 1),
                           'day3': range(1, 7, 1),
                           'day4': range(1, 7, 1),
                           })
example_df = example_df[['station', 'year', 'month', 'attribute', 'day1', 'day2', 'day3', 'day4']]
This is the solution I want,
solution_df = pd.DataFrame({'station': ['USC1', 'USC1', 'USC1', 'USC1', 'USC1', 'USC1',
                                        'USC1', 'USC1', 'USC1', 'USC1', 'USC1', 'USC1'],
                            'year': [1993, 1993, 1993, 1993, 1993, 1993,
                                     1993, 1993, 1993, 1993, 1993, 1993],
                            'month': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                            'day': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                            'tmax': [1, 1, 1, 1, 3, 3, 3, 3, 6, 6, 6, 6],
                            'tmin': [2, 2, 2, 2, 4, 4, 4, 4, np.nan, np.nan, np.nan, np.nan],
                            'prcp': [np.nan, np.nan, np.nan, np.nan, 5, 5, 5, 5, np.nan, np.nan, np.nan, np.nan]
                            })
solution_df = solution_df[['station', 'year', 'month', 'day', 'tmax', 'tmin', 'prcp']]
I have tried .T, pivot, melt, stack, and unstack to get the day columns to be rows with the correct months.
This is as close as I have gotten to success with the example dataset.
record_arr = example_df.to_records()
new_df = pd.DataFrame({'station': np.nan,
                       'year': np.nan,
                       'month': np.nan,
                       'day': np.nan,
                       'tmax': np.nan,
                       'tmin': np.nan,
                       'prcp': np.nan},
                      index=[1]
                      )
new_df.append ({'station': record_arr[0][1], 'year': record_arr[0][2], 'month':record_arr[0][3], 'tmax':record_arr[0][5], 'tmin':record_arr[1][5] }, ignore_index = True)
This requires a pivot as well as a melt (or an unstack and a stack). This is how I got it in two steps:
df1 = example_df.set_index(['station', 'year', 'month', 'attribute']).stack().reset_index()
df1.set_index(['station', 'year', 'month', 'level_4','attribute'])[0].unstack().reset_index()
attribute station year month level_4 prcp tmax tmin
0 USC1 1993 1 day1 NaN 1.0 2.0
1 USC1 1993 1 day2 NaN 1.0 2.0
2 USC1 1993 1 day3 NaN 1.0 2.0
3 USC1 1993 1 day4 NaN 1.0 2.0
4 USC1 1993 2 day1 5.0 3.0 4.0
5 USC1 1993 2 day2 5.0 3.0 4.0
6 USC1 1993 2 day3 5.0 3.0 4.0
7 USC1 1993 2 day4 5.0 3.0 4.0
8 USC1 1993 3 day1 NaN 6.0 NaN
9 USC1 1993 3 day2 NaN 6.0 NaN
10 USC1 1993 3 day3 NaN 6.0 NaN
11 USC1 1993 3 day4 NaN 6.0 NaN
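If you also want the stacked level as an integer column named day, matching solution_df, a possible finishing step is the sketch below (not part of the two-step answer above; the intermediate name out is arbitrary):
out = (example_df.set_index(['station', 'year', 'month', 'attribute'])
       .stack()
       .reset_index()
       .rename(columns={'level_4': 'day', 0: 'value'}))
# turn 'day1'..'day4' into the integers 1..4
out['day'] = out['day'].str.replace('day', '', regex=False).astype(int)
out = (out.set_index(['station', 'year', 'month', 'day', 'attribute'])['value']
       .unstack()
       .reset_index())
print (out)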
Question: Does the get_group method work on a DataFrame with a DatetimeIndexResamplerGroupby index? If so, what is the appropriate syntax?
Sample data:
data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
        [2, 4, 2, datetime.datetime(2017, 1, 5)],
        [3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])
gb3 = df1.set_index('dates').groupby('a').resample('D')
DatetimeIndexResamplerGroupby [freq=<Day>, axis=0, closed=left, label=left, convention=e, base=0]
gb3.sum()
a b c
a dates
2 2017-01-01 2.0 4.0 1.0
2017-01-02 NaN NaN NaN
2017-01-03 NaN NaN NaN
2017-01-04 NaN NaN NaN
2017-01-05 2.0 4.0 2.0
3 2017-01-07 3.0 4.0 1.0
The get_group method does work for me on a plain pandas.core.groupby.DataFrameGroupBy object.
I've tried various approaches here; the typical error is TypeError: Cannot convert input [(0, 1)] of type <class 'tuple'> to Timestamp.
The below should be what you're looking for (if I understand the question correctly):
import pandas as pd
import datetime
data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
        [2, 4, 2, datetime.datetime(2017, 1, 5)],
        [3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])
gb3 = df1.groupby(['a', pd.Grouper('dates')])
gb3.get_group((2, '2017-01-01'))
Out[14]:
a b c dates
0 2 4 1 2017-01-01
I believe resample/pd.Grouper can be used interchangeably in this case (someone correct me if I'm wrong). Let me know if this works for you.
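If you want the grouping to line up with the daily resample from the question, you can give the Grouper a frequency (a hedged sketch; the key and freq arguments are my assumption of what matches gb3):
gb4 = df1.groupby(['a', pd.Grouper(key='dates', freq='D')])
gb4.get_group((2, pd.Timestamp('2017-01-05')))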
Yes, it does. The following code returns the sum of monthly values for the year 2015:
df.resample('MS').sum().resample('Y').get_group('2015-12-31')