I want to convert an hourly Pandas Series into a DataFrame indexed by date only, with each hour as a column.
For example, let's say I have this Series:
YEAR = 2017
serie = pd.Series(pd.date_range(
    start=f'{YEAR}-01-01', end=f'{YEAR}-12-31 23:00:00', freq='H'))
But I want it like:
h01 h02 h03 h04 h05 ...
Date
2017-01-01 data data data data data ...
I believe your Series has a DatetimeIndex and is filled with some data.
Then you need DataFrame.pivot, together with DataFrame.assign to create new columns from DatetimeIndex.date and DatetimeIndex.strftime, and Series.to_frame for a one-column DataFrame:
import numpy as np
import pandas as pd

YEAR = 2017
# 8760 hourly values: 365 days * 24 hours (2017 is not a leap year)
serie = pd.Series(np.arange(8760), pd.date_range(
    start=f'{YEAR}-01-01', end=f'{YEAR}-12-31 23:00:00', freq='H'))

df = serie.to_frame('vals').assign(date = lambda x: x.index.date,
                                   hour = lambda x: x.index.strftime('h%H'))
#print (df)

df1 = df.pivot(index='date', columns='hour', values='vals')
#print (df1)
Another solution:
serie.index = [serie.index.date, serie.index.strftime('h%H')]
df1 = serie.unstack()
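Either way, the result has one row per day and one column per hour. A quick sanity check, assuming the sample data above:

print(df1.shape)                  # (365, 24)
print(df1.columns[:3].tolist())   # ['h00', 'h01', 'h02']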
My dataframe has values for how many red cars are sold in a specific month. I have to build a predictive model to predict monthly sales.
I want the current data frame to be converted into the format below for time series modeling.
How can I read the column and row headers to create a date column? I am hoping for a new data frame.
You can use melt() to transform the dataframe from the wide to the long format. Then we combine the Year and month information to make an actual date:
import pandas as pd

df = pd.DataFrame({'YEAR': [2021, 2022],
                   'JAN': [5, 232],
                   'FEB': [545, 48]})

df2 = df.melt(id_vars=['YEAR'], var_name='month', value_name='sales')
df2['date'] = df2.apply(lambda row: pd.to_datetime(str(row['YEAR']) + row['month'], format='%Y%b'), axis=1)
df2.sort_values('date')[['date', 'sales']]
This gives the output:
date sales
0 2021-01-01 5
2 2021-02-01 545
1 2022-01-01 232
3 2022-02-01 48
(For time series analysis you would probably want to set the date column as the index.)
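As a side note, the row-wise apply can also be replaced with a single vectorized call. A minimal sketch of both steps, assuming the column names from above:

df2['date'] = pd.to_datetime(df2['YEAR'].astype(str) + df2['month'], format='%Y%b')
ts = df2.sort_values('date').set_index('date')['sales']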
I have a pandas DataFrame with 3 columns of weather data: temperature, time, and the name of the weather station.
It looks like this:
Time                       Station_name  Temperature
2022-05-12 22:09:35+00:00  station_a     18.3
2022-05-12 22:09:42+00:00  station_b     18.0
I would like to calculate, in a new column, the temperature difference of station_a from station_b at every same minute (the time stamps are not exactly equal, but they are precise at the minute level and there is only one measurement every 10 minutes).
Is there a way to do this?
You can use a merge_asof on the two sub-dataframes:
df['Time'] = pd.to_datetime(df['Time'])
out = (pd
   .merge_asof(df[df['Station_name'].eq('station_a')],
               df[df['Station_name'].eq('station_b')],
               on='Time', direction='nearest',
               tolerance=pd.Timedelta('1min'),
               suffixes=('_a', '_b')
               )
   .set_index('Time')
   .eval('diff = Temperature_b - Temperature_a')
   ['diff']
)
output:
Time
2022-05-12 22:09:35+00:00 -0.3
Name: diff, dtype: float64
You can also try to round the times, but it is more risky if one time gets rounded up and the other down:
df['Time'] = pd.to_datetime(df['Time'])
(df
 .assign(Time=df['Time'].dt.round('10min'))
 .pivot(index='Time', columns='Station_name', values='Temperature')
 .eval('diff = station_b - station_a')
)
output:
Station_name station_a station_b diff
Time
2022-05-12 22:10:00+00:00 18.3 18.0 -0.3
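If the measurements reliably fall shortly after the 10-minute mark (as in the sample), flooring the times avoids the risk of the two stamps landing in different buckets. A minimal sketch of that variant:

(df
 .assign(Time=df['Time'].dt.floor('10min'))
 .pivot(index='Time', columns='Station_name', values='Temperature')
 .eval('diff = station_b - station_a')
)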
If you have this pandas DataFrame:
from datetime import datetime
import pandas as pd

data = [{"Time": datetime(2022, 5, 12, 22, 9, 35), "Station_name": "station_a", "Temperature": 18.3},
        {"Time": datetime(2022, 5, 12, 22, 9, 42), "Station_name": "station_b", "Temperature": 18.0},
        {"Time": datetime(2022, 5, 12, 22, 10, 35), "Station_name": "station_a", "Temperature": 17.3},
        {"Time": datetime(2022, 5, 12, 22, 10, 42), "Station_name": "station_b", "Temperature": 18.0}]
df = pd.DataFrame(data)
Useful references:
- truncate to minutes: Truncate `TimeStamp` column to hour precision in pandas `DataFrame`
- pivot tables / reshape: https://pandas.pydata.org/docs/user_guide/reshaping.html
#truncate to minutes
df["Time_trunc"] = df["Time"].values.astype('<M8[m]')

#set index (in order to pivot) and pivot (unstack)
df = df.set_index(["Time_trunc", "Station_name"])
df_pivoted = df.unstack()

#flatten the multi-level columns; after unstack the column order is
#the Time columns per station first, then the Temperature columns per station
df_new = pd.DataFrame(df_pivoted.to_records())
df_new.columns = ["Time_trunc", "Time_station_a", "Time_station_b", "Temp_station_a", "Temp_station_b"]

#add the absolute difference of temperatures
df_new["DiffAbs"] = abs(df_new["Temp_station_a"] - df_new["Temp_station_b"])
Resulting DataFrame:

           Time_trunc      Time_station_a      Time_station_b  Temp_station_a  Temp_station_b  DiffAbs
0 2022-05-12 22:09:00 2022-05-12 22:09:35 2022-05-12 22:09:42            18.3            18.0      0.3
1 2022-05-12 22:10:00 2022-05-12 22:10:35 2022-05-12 22:10:42            17.3            18.0      0.7
You can use pandas.Series.diff
For example:
df['Temperature_diff'] = df['Temperature'].diff()
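Note that diff operates on consecutive rows, so for the two-station layout above you would first sort by time so the rows alternate station_a / station_b. A minimal sketch, assuming Time has already been converted with pd.to_datetime and there is exactly one reading per station per minute:

df = df.sort_values('Time')
df['Temperature_diff'] = df['Temperature'].diff()
# each station_b row now holds station_b - station_a for that minute
result = df.loc[df['Station_name'].eq('station_b'), ['Time', 'Temperature_diff']]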
I have a dataset with 800 rows and I want to create a new column with dates that increase by one day in each row.
import datetime

date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(800):
    df['Date'] = date + datetime.timedelta(days=x)
In each row the date is equal to '2014-01-12'; as I understand it, the column fills as if x were always equal to 799.
Each time through the loop you are updating the ENTIRE Date column. You see the results of the 800th update at the end.
You could use a date range:
dr = pd.date_range('5/11/2011', periods=800, freq='D')
df = pd.DataFrame({'Date': dr})
print(df)
Date
0 2011-05-11
1 2011-05-12
2 2011-05-13
3 2011-05-14
4 2011-05-15
.. ...
795 2013-07-14
796 2013-07-15
797 2013-07-16
798 2013-07-17
799 2013-07-18
Or, if df already exists with 800 rows, assign the range directly:
df['Date'] = dr
pandas is a nice tool that can repeat some calculations without a for-loop.
When you use df['Date'] = ... you assign the same value to all cells in the column.
You have to use df.loc[x, 'Date'] = ... to assign to a single cell.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(10):
    df.loc[x, 'Date'] = date + datetime.timedelta(days=x)
print(df)
But you could also use pd.date_range() for this.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
df['Date'] = pd.date_range(date, periods=10)
print(df)
I have a dataframe with the following fields. For each Id, I have two records that represent different latitudes and longitudes. I'm trying to build a resultant dataframe that groups the current dataframe by Id and puts the latitudes and longitudes into separate fields.
I tried the groupby function but did not get the intended results. Any help would be greatly appreciated.
Id StartTime StopTime Latitude Longitude
101 14:42:28 14:47:56 53.51 118.12
101 22:10:01 22:12:49 33.32 333.11
Result:
Id StartLat StartLong DestLat DestLong
101 53.51 118.12 33.32 333.11
You can use groupby with an apply function that flattens each group's values (in row order) into a Series:
df = df.groupby('Id')[['Latitude', 'Longitude']].apply(lambda x: pd.Series(x.values.ravel()))
df.columns = ['StartLat', 'StartLong', 'DestLat', 'DestLong']
df = df.reset_index()
print (df)
Id StartLat StartLong DestLat DestLong
0 101 53.51 118.12 33.32 333.11
If you get this error:
TypeError: Series.name must be a hashable type
try changing the Series to a DataFrame, but then you need unstack with droplevel:
df = (df.groupby('Id')[['Latitude', 'Longitude']]
        .apply(lambda x: pd.DataFrame(x.values.ravel()))
        .unstack())
df.columns = df.columns.droplevel(0)
df.columns = ['StartLat', 'StartLong', 'DestLat', 'DestLong']
df = df.reset_index()
print (df)
Id StartLat StartLong DestLat DestLong
0 101 53.51 118.12 33.32 333.11
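A simpler alternative, assuming there are always exactly two rows per Id in start/destination order, is to aggregate with first and last:

out = df.groupby('Id')[['Latitude', 'Longitude']].agg(['first', 'last'])
# agg produces (Latitude, first), (Latitude, last), (Longitude, first), (Longitude, last)
out.columns = ['StartLat', 'DestLat', 'StartLong', 'DestLong']
out = out[['StartLat', 'StartLong', 'DestLat', 'DestLong']].reset_index()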
New to multiindexing in Pandas. I have data that looks like this
Date Time value
2014-01-14 12:00:04 .424
12:01:12 .342
12:01:19 .341
...
12:05:49 .23
2014-05-12 ...
1:02:42 .23
....
For now, I want to access the last time for every single date and store the value in some array. I've made a multiindex like this
df = pd.read_csv("df.csv", index_col=0)
df.index = pd.to_datetime(df.index, infer_datetime_format=True)
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.time], names=['Date', 'Time'])
df = df[~df.index.duplicated(keep='first')]
dates = df.index.get_level_values(0)
So I have the dates saved as an array. I want to iterate through the dates, but I either can't get the syntax right or am accessing the values incorrectly. I've tried a for loop (for date in dates) but can't get it to run, and direct access (df.loc[dates[i]] or something like that) doesn't work either. Also, the number of time values in each date varies. Is there any way to fix this?
This sounds like a groupby/max operation. More specifically, you want to group by the Date and aggregate the Times by taking the max. Since aggregation can only be done over column values, we'll need to change the Time index level into a column (by using reset_index):
import pandas as pd
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'],
                   'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index('Time', drop=False)
max_times = df.groupby(level=0)['Time'].max()
print(max_times)
yields
Date
2014-01-14 12:05:49
2014-05-12 01:02:42
Name: Time, dtype: object
If you wish to select the entire row, then you could use idxmax -- but there is a caveat. idxmax returns index labels. Therefore, the index must be unique for the labels to signify unique rows. Since the Date level is not by itself unique, to use idxmax we'll need to reset_index completely (to make an index of unique integers):
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'],
                   'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_timedelta(df['Time'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index()
idx = df.groupby(['Date'])['Time'].idxmax()
print(df.loc[idx])
yields
Date Time value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
I don't see a good way to do this while keeping the MultiIndex.
It is easier to perform the groupby operation before setting the MultiIndex.
Moreover, it is probably preferable to preserve the datetimes as one value instead of splitting it into two parts. Note that given a datetime/period-like Series, the .dt accessor gives you easy access to the date and the time as needed. Thus you can group by the Date without making a Date column:
df = pd.DataFrame({'DateTime': ['2014-01-14 12:00:04', '2014-01-14 12:01:12', '2014-01-14 12:01:19',
                                '2014-01-14 12:05:49', '2014-05-12 01:01:59', '2014-05-12 01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['DateTime'] = pd.to_datetime(df['DateTime'])
# df = pd.read_csv('df.csv', parse_dates=[0])
idx = df.groupby(df['DateTime'].dt.date)['DateTime'].idxmax()
result = df.loc[idx]
print(result)
yields
DateTime value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
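If, as the question mentions, you just want the values stored in an array, one last step might be:

values = result['value'].to_numpy()   # array([0.23, 0.23])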