Extract and transform Data from Dataset - python

I have a Dataset that is in following format:
Time-Stamp (yyyy-mm-dd) Temperature
I need to extract the day and month from the time-stamp of each observation in the series.
Current Dataset format:
0 1981-01-01 20.7
1 1981-01-02 17.9
2 1981-01-03 18.8
Desired Dataset format:
month day temperature
1 1 20.7
1 2 17.9
1 3 18.8
1 4 14.6

Import the data into a pandas DataFrame with the date column as datetime dtype, then split the date column:
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
where df is your DataFrame; month and day will then be separate columns alongside temperature.
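For example, a minimal self-contained sketch, assuming the data sits in a CSV file with header columns date and temperature (the file name and headers are assumptions for illustration):
import pandas as pd

# parse the date column while reading (file name is an assumption)
df = pd.read_csv('temps.csv', parse_dates=['date'])

df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

# keep only the desired columns
df = df[['month', 'day', 'temperature']]
print(df.head())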

Related

How to find the row with the highest value for the day in pandas and get the categorical percentages?

Here is my dataset:
date CAT_A CAT_B CAT_C
2018-01-01 5:00 12 223 155
2018-01-01 6:00 199 68 72
...
2018-12-31 23:00 56 92 237
The data contains one row per hour for every day of the year. I want to know how, in pandas, I can find the highest-value row for each day and then get the categorical percentages at that hour. For example, if the highest hour on 01-01 was 5:00, then CAT_A: 3.07%, CAT_B: 57.2%, CAT_C: 39.7%.
We sum the three columns:
df["sum_categories"] = df.sum(axis=1)
We resample on a daily basis (this assumes date is the DatetimeIndex) and obtain the index of the daily maximum row:
idx = df.resample("D")["sum_categories"].idxmax()
We select the rows with this index and calculate proportion:
df.loc[idx,["CAT_A", "CAT_B", "CAT_C"]].div(df.loc[idx,"sum_categories"].values)
Use SeriesGroupBy.idxmax on the Series created by sum, filter the rows with DataFrame.loc, divide with DataFrame.div, multiply by 100 and round:
#if necessary DatetimeIndex
#df = df.set_index('date')
s = df.sum(axis=1)
idx = s.groupby(pd.Grouper(freq="D")).idxmax()
df = df.loc[idx].div(s.loc[idx], axis=0).mul(100).round(2)
print (df)
CAT_A CAT_B CAT_C
date
2018-01-01 05:00:00 3.08 57.18 39.74
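For reference, a minimal self-contained sketch of the second approach, built from the two sample rows given in the question:
import pandas as pd

# the 5:00 and 6:00 rows from the question
df = pd.DataFrame(
    {"CAT_A": [12, 199], "CAT_B": [223, 68], "CAT_C": [155, 72]},
    index=pd.to_datetime(["2018-01-01 05:00", "2018-01-01 06:00"]),
)
df.index.name = "date"

s = df.sum(axis=1)                              # hourly totals
idx = s.groupby(pd.Grouper(freq="D")).idxmax()  # timestamp of each day's maximum
out = df.loc[idx].div(s.loc[idx], axis=0).mul(100).round(2)
print(out)
# 2018-01-01 05:00:00  3.08  57.18  39.74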

How to subtract dataframe with month index from dataframe with datetime index

I have two dataframes, one is called Clim and one is called O3_mda8_3135. Clim is a dataframe including monthly average meteorological parameters for one year of data; here is a sample of the dataframe:
Clim.head(12)
Out[7]:
avgT_2551 avgT_5330 ... avgNOx_3135(ppb) avgCO_3135(ppm)
Month ...
1 14.924181 13.545691 ... 48.216128 0.778939
2 16.352172 15.415385 ... 36.110385 0.605629
3 20.530879 19.684720 ... 20.974544 0.460571
4 23.738576 22.919158 ... 14.270995 0.432855
5 26.961927 25.779007 ... 11.087005 0.334505
6 32.208322 31.225072 ... 12.801409 0.384325
7 35.280124 34.265880 ... 10.732970 0.321284
8 35.428857 34.433351 ... 11.916420 0.326389
9 32.008317 30.856782 ... 15.236616 0.343405
10 25.691444 24.139874 ... 24.829518 0.467317
11 19.310550 17.827946 ... 36.339847 0.621938
12 14.186050 12.860077 ... 49.173287 0.720708
[12 rows x 20 columns]
I also have the dataframe O3_mda8_3135, which was created by first calculating the rolling 8 hour average of each component, then finding the maximum daily value of ozone, which is why all of the timestamps and indices are different. There is one value for each meteorological parameter every day of the year. Here's a sample of this dataframe:
O3_mda8_3135
Out[9]:
date Temp_C_2551 ... CO_3135(ppm) O3_mda8_3135
12 2018-01-01 12:00:00 24.1 ... 0.294 10.4000
36 2018-01-02 12:00:00 26.3 ... 0.202 9.4375
60 2018-01-03 12:00:00 22.8 ... 0.184 7.1625
84 2018-01-04 12:00:00 25.6 ... 0.078 8.2500
109 2018-01-05 13:00:00 27.3 ... NaN 9.4500
... ... ... ... ...
8653 2018-12-27 13:00:00 19.6 ... 0.115 35.1125
8676 2018-12-28 12:00:00 14.9 ... 0.097 39.4500
8700 2018-12-29 12:00:00 13.9 ... 0.092 38.1250
8724 2018-12-30 12:00:00 17.4 ... 0.186 35.1375
8753 2018-12-31 17:00:00 8.3 ... 0.110 30.8875
[365 rows x 24 columns]
I am wondering how to subtract the average values in Clim from the corresponding columns and rows in O3_mda8_3135. For example, I would like to subtract the average value for temperature at site 2551 in January (avgT_2551 Month 1 in the Clim dataframe) from every day in January in the other dataframe O3_mda8_3135, column name Temp_C_2551.
avgT_2551 corresponds to Temp_C_2551 in the other dataframe
Is there a simple way to do this? Should I extract the month from the datetime and put it into another column for the O3_mda8_3135 dataframe? I am still a beginner and would appreciate any advice or tips.
I saw this post How to subtract the mean of a month from each day in that month? but there was not enough information given for me to understand what actions were being performed.
I figured it out on my own, thanks to Stack Overflow posts :)
I created new columns in both dataframes corresponding to the month. I had originally set the index in Clim to the month using Clim = Clim.set_index('Month'), so I removed that line. Then I created a Month column in the O3_mda8_3135 dataframe. After that, I merged the two dataframes on the 'Month' column and used the Series.sub method to subtract the columns I wanted.
Here's some example code; sorry the variable names are so long, but this dataframe is huge.
O3_mda8_3135['Month'] = O3_mda8_3135['date'].dt.month
O3_mda8_3135_anom = pd.merge(O3_mda8_3135, Clim, how='left', on=('Month'))
O3_mda8_3135_anom['O3_mda8_3135_anom'] = O3_mda8_3135_anom['O3_mda8_3135'].sub(O3_mda8_3135_anom['MDA8_3135'])
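For the temperature example mentioned in the question, a small self-contained sketch of the same merge-and-subtract pattern (only the column names come from the question; the toy values are made up):
import pandas as pd

# toy stand-ins for the real dataframes
Clim = pd.DataFrame({'Month': [1, 2], 'avgT_2551': [14.9, 16.4]})
O3_mda8_3135 = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01 12:00', '2018-02-01 12:00']),
    'Temp_C_2551': [24.1, 26.3],
})

# add a Month key to the daily data and merge the monthly climatology onto it
O3_mda8_3135['Month'] = O3_mda8_3135['date'].dt.month
merged = pd.merge(O3_mda8_3135, Clim, how='left', on='Month')

# daily anomaly = daily value minus that month's climatological mean
merged['T_2551_anom'] = merged['Temp_C_2551'].sub(merged['avgT_2551'])
print(merged)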
These posts helped me answer my question:
python pandas extract year from datetime: df['year'] = df['date'].year is not working
How to calculate the monthly mean of a time series and subtract the monthly mean from the values of that month in each year?
Find difference between 2 columns with Nulls using pandas

Append values in pandas where value equals other value

I have two data frames:
dfi = pd.read_csv('C:/Users/Mauricio/Desktop/inflation.csv')
dfm = pd.read_csv('C:/Users/Mauricio/Desktop/maturity.csv')
# equals the following
observation_date CPIAUCSL
0 1947-01-01 21.48
1 1947-02-01 21.62
2 1947-03-01 22.00
3 1947-04-01 22.00
4 1947-05-01 21.95
observation_date DGS10
0 1962-01-02 4.06
1 1962-01-03 4.03
2 1962-01-04 3.99
3 1962-01-05 4.02
4 1962-01-08 4.03
I created a copy as df doing the following:
df = dfi.copy(deep=True)
which returns an exact copy of dfi. The dfi dates go by month and the dfm dates go by day. I want to create a new column in df so that every time a date in dfi equals a date in dfm, the corresponding DGS10 value is appended to it.
I have this so far:
for date in df.observation_date:
    for date2 in dfm.observation_date:
        if date == date2:
            df['mat_rate'] = dfm['DGS10']
# this is what I get but dates do not match values
observation_date CPIAUCSL mat_rate
0 1947-01-01 21.48 4.06
1 1947-02-01 21.62 4.03
2 1947-03-01 22.00 3.99
3 1947-04-01 22.00 4.02
4 1947-05-01 21.95 4.03
It runs, but it does not line the values up with the dates where date == date2. What can I do so it appends the values only where date equals date2?
Thank you!
If the date formats are inconsistent, convert them first:
dfi.observation_date = pd.to_datetime(dfi.observation_date, format='%Y-%m-%d')
dfm.observation_date = pd.to_datetime(dfm.observation_date, format='%Y-%m-%d')
Now, getting your result should be easy with a merge:
df = dfi.merge(dfm, on='observation_date')
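Note that the default merge is an inner join, so df only keeps dates present in both frames. If you would rather keep every dfi row and get NaN where no daily rate matches, one option is a left join (a sketch, not part of the original answer):
# keep all CPI rows; mat_rate is NaN on dates with no matching DGS10 observation
df = dfi.merge(dfm.rename(columns={'DGS10': 'mat_rate'}), on='observation_date', how='left')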

multi-monthly mean with pandas' Series

I have a sequence of datetime objects and a series of data which spans several years. I can create a Series object and resample it to group it by month:
df=pd.Series(varv,index=dates)
multiMmean=df.resample("M", how='mean')
print multiMmean
This, however, outputs
2005-10-31 172.4
2005-11-30 69.3
2005-12-31 187.6
2006-01-31 126.4
2006-02-28 187.0
2006-03-31 108.3
...
2014-01-31 94.6
2014-02-28 82.3
2014-03-31 130.1
2014-04-30 59.2
2014-05-31 55.6
2014-06-30 1.2
which is a list of the mean value for each month of the series. This is not what I want. I want 12 values, one for every month of the year with a mean for each month through the years. How do I get that for multiMmean?
I have tried using resample("M",how='mean') on multiMmean and list comprehensions but I cannot get it to work. What am I missing?
Thank you.
The following worked for me:
import datetime as dt
import numpy as np
import pandas as pd

# create some random daily data with a datetime index spanning about 17 months
s = pd.Series(index=pd.date_range(start=dt.datetime(2014, 1, 1), end=dt.datetime(2015, 6, 1)), data=np.random.randn(517))
In [25]:
# now calc the mean for each month
s.groupby(s.index.month).mean()
Out[25]:
1 0.021974
2 -0.192685
3 0.095229
4 -0.353050
5 0.239336
6 -0.079959
7 0.022612
8 -0.254383
9 0.212334
10 0.063525
11 -0.043072
12 -0.172243
dtype: float64
So we can group by the month attribute of the DatetimeIndex and call mean; this calculates, for each calendar month, the mean over all years.
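Applied to the series from the question, the same idea would look like this (a sketch; it reuses the question's varv and dates and assumes dates holds datetime objects):
df = pd.Series(varv, index=dates)
# one value per calendar month (1-12): the mean over all years for that month
multiMmean = df.groupby(df.index.month).mean()
print(multiMmean)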

Reshaping tables in pandas

Below is an extract of a dataframe which I have created by merging multiple query log dataframes:
keyword hits date average time
1 the cat sat on 10 10-Jan 10
2 who is the sea 5 10-Jan 1.2
3 under the earth 30 1-Dec 2.5
4 what is this 100 1-Feb 9
Is there a way I can pivot the data using pandas so that the rows are daily dates (e.g. 1-Jan, 2-Jan, etc.) and the single column for each date holds that day's sum of hits divided by the monthly sum of hits for the month the day falls in (i.e. the month-normalised daily hit percentage for each day)?
Parse the dates so we can extract the month later.
In [99]: df.date = df.date.apply(pd.Timestamp)
In [100]: df
Out[100]:
keyword hits date average time
1 the cat sat on 10 2013-01-10 00:00:00 10.0
2 who is the sea 5 2013-01-10 00:00:00 1.2
3 under the earth 30 2013-12-01 00:00:00 2.5
4 what is this 100 2013-02-01 00:00:00 9.0
Group by day and sum the hits.
In [101]: daily_totals = df.groupby('date').hits.sum()
In [102]: daily_totals
Out[102]:
date
2013-01-10 15
2013-02-01 100
2013-12-01 30
Name: hits, dtype: int64
Group by month, and divide each row (each daily total) by the sum of all the daily totals in that month.
In [103]: normalized_totals = daily_totals.groupby(lambda d: d.month).transform(lambda x: x / x.sum())
In [104]: normalized_totals
Out[104]:
date
2013-01-10 1.0
2013-02-01 1.0
2013-12-01 1.0
Name: hits, dtype: float64
Your simple example only gave one day in each month, so all these are 1.
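To see the normalisation do something non-trivial, here is a small self-contained sketch with two days in the same month (the numbers are made up):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-10', '2013-01-20', '2013-02-01']),
    'hits': [10, 30, 100],
})

daily_totals = df.groupby('date').hits.sum()
# divide each daily total by the total hits of its month
normalized = daily_totals.groupby(daily_totals.index.month).transform(lambda x: x / x.sum())
print(normalized)
# 2013-01-10    0.25
# 2013-01-20    0.75
# 2013-02-01    1.00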
