Turn off outliers in box plot in Python - python

Any ideas on how to turn off the outliers for the box plot?
Or send a link to the documentation for kind="box", because I've not been able to find it.
Code is:
df9=df.loc[Month+"-2015":Month+"-2015"]
df9=df9.rename(columns={'hour_mean': "2015"})
x=pd.concat([df1['2004'],df2['2005'],df3['2006'],df4['2007'],df5['2008'],df6['2012'],df7['2013'],df8['2014'],df9['2015']],axis=1)
ax=x.plot(kind="box")
Because it's plotting out a series of Series (that seemed to be the only way of doing a box plot), it plots 2004, then 2005, etc.
x looks like:
Date 2004 2005 2006 2007 2008
01/12/2004 00:00 9.8
01/12/2004 01:00 4.5
01/12/2004 02:00 2.7
01/12/2004 03:00 5.7
01/12/2004 04:00 10.7
01/12/2004 05:00 10.2
01/12/2004 06:00 11.3
01/12/2004 07:00 7.3
01/12/2004 08:00 7.2
01/12/2004 09:00 6.6
01/12/2004 10:00 9.7
01/12/2004 11:00 1.7
01/12/2004 12:00 3.3
01/12/2004 13:00 8.8
01/12/2004 14:00 5.4
01/12/2004 15:00 8.5
01/12/2004 16:00 1.9
01/12/2004 17:00 3.1
01/12/2004 18:00 6.1
01/12/2004 19:00 -4.1
01/12/2004 20:00 4.8
01/12/2004 21:00 2.1
01/12/2004 22:00 2.6

Have you tried
ax=x.plot(kind="box", showfliers=False)
Outliers in matplotlib are called fliers; see the matplotlib boxplot documentation for the full set of options.
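For illustration, a minimal self-contained sketch (with made-up data standing in for the yearly columns above):
import pandas as pd
import numpy as np

np.random.seed(0)
x = pd.DataFrame(np.random.randn(100, 3), columns=["2004", "2005", "2006"])

# showfliers=False is forwarded to matplotlib's boxplot and hides the outlier points
ax = x.plot(kind="box", showfliers=False)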

Related

Calculation of the daily average

How can I transform this data so that the Pm 2.5 and Pm 10 columns become the average of the whole day? The data I collected (sample below) has a reading every 15 minutes.
Pm 2.5 Pm 10 Created At
0 6.00 19.20 2021-06-21 19:00
1 4.70 17.00 2021-06-21 19:15
2 4.80 16.70 2021-06-21 19:30
3 5.10 12.10 2021-06-21 19:45
4 7.90 19.10 2021-06-21 20:00
Let's resample the dataframe:
df['Created At'] = pd.to_datetime(df['Created At'])
df.resample('D', on='Created At').mean()
Pm 2.5 Pm 10
Created At
2021-06-21 5.7 16.82
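As a self-contained check, the same idea on the sample rows above:
import pandas as pd

df = pd.DataFrame({
    "Pm 2.5": [6.00, 4.70, 4.80, 5.10, 7.90],
    "Pm 10": [19.20, 17.00, 16.70, 12.10, 19.10],
    "Created At": ["2021-06-21 19:00", "2021-06-21 19:15",
                   "2021-06-21 19:30", "2021-06-21 19:45",
                   "2021-06-21 20:00"],
})
df["Created At"] = pd.to_datetime(df["Created At"])

# one row per calendar day, averaging the numeric columns
print(df.resample("D", on="Created At").mean())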
You can use pd.Grouper and then transform if you want to preserve the dataframe shape:
df['Created At'] = pd.to_datetime(df['Created At'])
df[['Pm 2.5', 'Pm 10']] = df.groupby(pd.Grouper(key='Created At', freq='D'))\
[['Pm 2.5', 'Pm 10']].transform('mean')
Output:
Pm 2.5 Pm 10 Created At
0 5.7 16.82 2021-06-21 19:00:00
1 5.7 16.82 2021-06-21 19:15:00
2 5.7 16.82 2021-06-21 19:30:00
3 5.7 16.82 2021-06-21 19:45:00
4 5.7 16.82 2021-06-21 20:00:00
Here is one way to do it:
convert the column with to_datetime, take the date part, and compute the mean (numeric_only avoids trying to average the non-numeric Created At column on recent pandas versions):
df.groupby(pd.to_datetime(df['Created At']).dt.date).mean(numeric_only=True)
            Pm 2.5  Pm 10
Created At
2021-06-21     5.7  16.82

Transpose hourly data in rows to new columns in Pandas

I have a table of recorded hourly weather data like this:
Year Month Day Time Temp
Date/Time
2005-01-01 00:00:00 2005 1 1 0:00 6.0
2005-01-01 01:00:00 2005 1 1 1:00 6.1
2005-01-01 02:00:00 2005 1 1 2:00 6.7
2005-01-01 03:00:00 2005 1 1 3:00 6.8
2005-01-01 04:00:00 2005 1 1 4:00 6.3
2005-01-01 05:00:00 2005 1 1 5:00 6.6
2005-01-01 06:00:00 2005 1 1 6:00 6.9
2005-01-01 07:00:00 2005 1 1 7:00 7.1
2005-01-01 08:00:00 2005 1 1 8:00 6.9
2005-01-01 09:00:00 2005 1 1 9:00 6.7
2005-01-01 10:00:00 2005 1 1 10:00 7.1
2005-01-01 11:00:00 2005 1 1 11:00 7.1
2005-01-01 12:00:00 2005 1 1 12:00 7.2
2005-01-01 13:00:00 2005 1 1 13:00 7.7
2005-01-01 14:00:00 2005 1 1 14:00 8.8
2005-01-01 15:00:00 2005 1 1 15:00 8.6
2005-01-01 16:00:00 2005 1 1 16:00 7.4
2005-01-01 17:00:00 2005 1 1 17:00 6.8
2005-01-01 18:00:00 2005 1 1 18:00 6.3
2005-01-01 19:00:00 2005 1 1 19:00 5.9
2005-01-01 20:00:00 2005 1 1 20:00 5.6
2005-01-01 21:00:00 2005 1 1 21:00 3.6
2005-01-01 22:00:00 2005 1 1 22:00 2.6
2005-01-01 23:00:00 2005 1 1 23:00 1.7
I wanted to save the dataframe in a wide format: one row per date, with a column for each hour.
How can I transpose the dataframe and create new columns for each record?
1) Copy the data below:
Date/Time Year Month Day Time Temp
0 1/1/2005 0:00 2005 1 1 0:00 6.0
1 1/1/2005 1:00 2005 1 1 1:00 6.1
2 1/1/2005 2:00 2005 1 1 2:00 6.7
3 1/1/2005 3:00 2005 1 1 3:00 6.8
4 1/1/2005 4:00 2005 1 1 4:00 6.3
5 1/1/2005 5:00 2005 1 1 5:00 6.6
6 1/1/2005 6:00 2005 1 1 6:00 6.9
7 1/1/2005 7:00 2005 1 1 7:00 7.1
8 1/1/2005 8:00 2005 1 1 8:00 6.9
9 1/1/2005 9:00 2005 1 1 9:00 6.7
10 1/1/2005 10:00 2005 1 1 10:00 7.1
11 1/1/2005 11:00 2005 1 1 11:00 7.1
12 1/1/2005 12:00 2005 1 1 12:00 7.2
2) Use pd.read_clipboard with a separator of two or more spaces, since the Date/Time column itself contains a space:
import pandas as pd
df = pd.read_clipboard(r'\s\s+')
df
3) Format the date/time columns, create a pivot table, and reset/rename the axis:
df['Date/Time']=pd.to_datetime(df['Date/Time'], format='%m/%d/%Y %H:%M').dt.strftime('%m/%d/%Y')
df['Time']=pd.to_datetime(df['Time']).dt.time
df=pd.pivot_table(df, index='Date/Time', columns='Time', values='Temp').reset_index().rename_axis(index=None, columns=None)
df['Date/Time']=df['Date/Time'].apply(lambda x:(x + ' 0:00'))
df
Output:
Date/Time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00 10:00:00 11:00:00 12:00:00
01/01/2005 6.0 6.1 6.7 6.8 6.3 6.6 6.9 7.1 6.9 6.7 7.1 7.1 7.2
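Note that pivot_table aggregates duplicates (its default aggfunc is 'mean'); if each Date/Time and Time pair is guaranteed unique, a plain pivot does the same reshape without aggregating:
df = df.pivot(index='Date/Time', columns='Time', values='Temp').reset_index()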
This does the trick similarly to the previous answer, but via separate date and hour columns, so that only one row per day is created:
import pandas as pd
import numpy as np

datetimes = pd.date_range("2005-01-01 00:00:00", "2005-01-02 23:00:00", freq="1h")
df = pd.DataFrame({"Date/Time": datetimes, "temp": np.random.rand(len(datetimes))})
df["Date"] = df["Date/Time"].dt.date
df["Hour"] = df["Date/Time"].dt.hour
reshaped= df.pivot(index='Date', columns='Hour', values='temp')
reshaped.columns = ['HR'+str(hour) for hour in reshaped.columns]

Filling in missing hours in a Pandas dataframe

Date_NZST Time_NZST Radiation_Amount_MJ/m2
5/08/2011 0:00 0
5/08/2011 1:00 0
5/08/2011 2:00 0
5/08/2011 3:00 0
5/08/2011 4:00 0
5/08/2011 5:00 0
5/08/2011 6:00 0
5/08/2011 7:00 0
5/08/2011 8:00 0
5/08/2011 9:00 0.37
5/08/2011 10:00 0.41
5/08/2011 11:00 1.34
5/08/2011 12:00 0.87
5/08/2011 14:00 1.69
5/08/2011 15:00 1.53
5/08/2011 16:00 1.14
5/08/2011 17:00 0.63
5/08/2011 18:00 0.08
5/08/2011 19:00 0
5/08/2011 20:00 0
5/08/2011 21:00 0
5/08/2011 22:00 0
5/08/2011 23:00 0
I have an Excel spreadsheet that contains hourly measurements of solar irradiance everyday for a year. It has 3 columns, Date_NZST, Time_NZST and Radiation_Amount_MJ/m2.
I'm trying to find a way to automatically find all missing hours, generate a row for each missing hour, and fill it with a - symbol in the Radiation_Amount_MJ/m2 column. For example, hour 13:00 is missing, so I'd like to make a row between the 12:00 and 14:00 rows with the correct date and fill the Radiation_Amount_MJ/m2 column with a -. All dates are present; just some hours are missing.
5/08/2011 11:00 1.34
5/08/2011 12:00 0.87
5/08/2011 13:00 -
5/08/2011 14:00 1.69
5/08/2011 15:00 1.53
I've been doing this in Excel, but it is a very tedious process and there could be hundreds of missing points. I've resorted to trying to do it with the Pandas library in Python. I saw this thread (Fill in missing hours in a pandas dataframe) and tried to alter the answer's code to fit my data, but I got really confused by the line
df.groupby('area').\
apply(lambda x : x.set_index('Datetime').resample('H').mean().fillna(0)).\
reset_index()
and how to repurpose it to my data. Anyone have any ideas?
First we create a datetime index which contains the date and time, using pd.to_datetime.
Then we use resample to get hourly data, and fillna to fill the missing values with a -:
df.set_index(pd.to_datetime(df['Date_NZST'] + ' ' + df['Time_NZST']), inplace=True)
df = df.drop(columns=['Date_NZST', 'Time_NZST'])
df = df.resample('H').first().fillna('-')
Output
Radiation_Amount_MJ/m2
2011-05-08 00:00:00 0
2011-05-08 01:00:00 0
2011-05-08 02:00:00 0
2011-05-08 03:00:00 0
2011-05-08 04:00:00 0
2011-05-08 05:00:00 0
2011-05-08 06:00:00 0
2011-05-08 07:00:00 0
2011-05-08 08:00:00 0
2011-05-08 09:00:00 0.37
2011-05-08 10:00:00 0.41
2011-05-08 11:00:00 1.34
2011-05-08 12:00:00 0.87
2011-05-08 13:00:00 -
2011-05-08 14:00:00 1.69
2011-05-08 15:00:00 1.53
2011-05-08 16:00:00 1.14
2011-05-08 17:00:00 0.63
2011-05-08 18:00:00 0.08
2011-05-08 19:00:00 0
2011-05-08 20:00:00 0
2011-05-08 21:00:00 0
2011-05-08 22:00:00 0
2011-05-08 23:00:00 0
If you want the datetime out of your index, use df.reset_index().
Note: filling a numeric column with - converts it to object dtype.
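If you need the column to stay numeric for later calculations, a variant is to skip the fillna and keep the NaN markers that resample produces by default:
df = df.resample('H').first()  # missing hours stay NaN and the dtype stays float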

Pandas Dataframe Pivot and reindex n timeseries

I have a pandas dataframe containing n time series in the same Datetime column, each one associated with a different Id and a corresponding value. I would like to pivot the table and reindex to the nearest timestamp. Notice that there can be cases where a timestamp is missing, as with Id 3; in those cases the value should become NaN.
Datetime Id Value
5-26-17 8:00 1 2.3
5-26-17 8:30 1 4.5
5-26-17 9:00 1 7
5-26-17 9:30 1 8.1
5-26-17 10:00 1 7.9
5-26-17 10:30 1 3.4
5-26-17 11:00 1 2.1
5-26-17 11:30 1 1.8
5-26-17 12:00 1 0.4
5-26-17 8:02 2 2.6
5-26-17 8:32 2 4.8
5-26-17 9:02 2 7.3
5-26-17 9:32 2 8.4
5-26-17 10:02 2 8.2
5-26-17 10:32 2 3.7
5-26-17 11:02 2 2.4
5-26-17 11:32 2 2.1
5-26-17 12:02 2 0.7
5-26-17 8:30 3 4.5
5-26-17 9:00 3 7
5-26-17 9:30 3 8.1
5-26-17 10:00 3 7.9
5-26-17 10:30 3 3.4
5-26-17 11:00 3 2.1
5-26-17 11:30 3 1.8
5-26-17 12:00 3 0.4
Expected results:
Datetime Id-1 Id-2 Id-3
5-26-17 8:00 2.3 2.6 NaN
5-26-17 8:30 4.5 4.8 4.5
5-26-17 9:00 7 7.3 7
5-26-17 9:30 8.1 8.4 8.1
5-26-17 10:00 7.9 8.2 7.9
5-26-17 10:30 3.4 3.7 3.4
5-26-17 11:00 2.1 2.4 2.1
5-26-17 11:30 1.8 2.1 1.8
5-26-17 12:00 0.4 0.7 0.4
How would you do this?
I believe you need to convert the column to datetimes, floor them to 30 minutes with floor, and finally pivot and add_prefix:
df['Datetime'] = pd.to_datetime(df['Datetime']).dt.floor('30T')
df = df.pivot(index='Datetime', columns='Id', values='Value').add_prefix('Id-')
print (df)
Id Id-1 Id-2 Id-3
Datetime
2017-05-26 08:00:00 2.3 2.6 NaN
2017-05-26 08:30:00 4.5 4.8 4.5
2017-05-26 09:00:00 7.0 7.3 7.0
2017-05-26 09:30:00 8.1 8.4 8.1
2017-05-26 10:00:00 7.9 8.2 7.9
2017-05-26 10:30:00 3.4 3.7 3.4
2017-05-26 11:00:00 2.1 2.4 2.1
2017-05-26 11:30:00 1.8 2.1 1.8
2017-05-26 12:00:00 0.4 0.7 0.4
Another solution is to use resample with mean:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = (df.set_index('Datetime')
.groupby('Id')
.resample('30T')['Value']
.mean().unstack(0)
.add_prefix('Id-'))
print (df)
Id Id-1 Id-2 Id-3
Datetime
2017-05-26 08:00:00 2.3 2.6 NaN
2017-05-26 08:30:00 4.5 4.8 4.5
2017-05-26 09:00:00 7.0 7.3 7.0
2017-05-26 09:30:00 8.1 8.4 8.1
2017-05-26 10:00:00 7.9 8.2 7.9
2017-05-26 10:30:00 3.4 3.7 3.4
2017-05-26 11:00:00 2.1 2.4 2.1
2017-05-26 11:30:00 1.8 2.1 1.8
2017-05-26 12:00:00 0.4 0.7 0.4
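A note on the difference between the two: the floor approach assumes at most one reading per Id in each half-hour slot (a plain pivot raises on duplicate index/column pairs), while the resample/mean version averages any duplicates that land in the same slot.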

Making matching algorithm between two data frames more efficient

I have two data frames eg.
Shorter time frame (4 hourly)
Time Data_4h
1/1/01 00:00 1.1
1/1/01 06:00 1.2
1/1/01 12:00 1.3
1/1/01 18:00 1.1
2/1/01 00:00 1.1
2/1/01 06:00 1.2
2/1/01 12:00 1.3
2/1/01 18:00 1.1
3/1/01 00:00 1.1
3/1/01 06:00 1.2
3/1/01 12:00 1.3
3/1/01 18:00 1.1
Longer time frame (1 day)
Time Data_1d
1/1/01 00:00 1.1
2/1/01 00:00 1.6
3/1/01 00:00 1.0
I want to label the shorter time frame data with the longer time frame's value from the previous day (n-1), leaving NaN where day n-1 doesn't exist.
For example,
Final merged data combining 4h and 1d
Time Data_4h Data_1d
1/1/01 00:00 1.1 NaN
1/1/01 06:00 1.2 NaN
1/1/01 12:00 1.3 NaN
1/1/01 18:00 1.1 NaN
2/1/01 00:00 1.1 1.1
2/1/01 06:00 1.2 1.1
2/1/01 12:00 1.3 1.1
2/1/01 18:00 1.1 1.1
3/1/01 00:00 1.1 1.6
3/1/01 06:00 1.2 1.6
3/1/01 12:00 1.3 1.6
3/1/01 18:00 1.1 1.6
So for 1/1, it tried to find 31/12 but couldn't, so it was labelled NaN. For 2/1, it searched for 1/1 and labelled those entries with 1.1, the value for 1/1. For 3/1, it searched for 2/1 and labelled those entries with 1.6, the value for 2/1.
It is important to note that the time frame data may have large gaps, so I can't access the rows in the larger time frame directly.
What is the best way to do this?
Currently I am iterating through all the rows of the smaller timeframe and then searching for the larger time frame date using a filter like:
large_tf_data[(large_tf_data.index <= target_timestamp)][0]
Where target_timestamp is calculated on each row in the smaller time frame data frame.
This is extremely slow! Any suggestions on how to speed it up?
First, take care of the dates:
dayfirstme = lambda d: pd.to_datetime(d.Time, dayfirst=True)
df = df.assign(Time=dayfirstme)
df2 = df2.assign(Time=dayfirstme)
Then convert df2 to something useful:
d2 = df2.assign(Time=lambda d: d.Time + pd.Timedelta(1, 'D')).set_index('Time').Data_1d
Apply the magic: map each row's date onto the shifted daily series and join it back:
df.join(df.Time.dt.date.map(d2).rename(d2.name))
Time Data_4h Data_1d
0 2001-01-01 00:00:00 1.1 NaN
1 2001-01-01 06:00:00 1.2 NaN
2 2001-01-01 12:00:00 1.3 NaN
3 2001-01-01 18:00:00 1.1 NaN
4 2001-01-02 00:00:00 1.1 1.1
5 2001-01-02 06:00:00 1.2 1.1
6 2001-01-02 12:00:00 1.3 1.1
7 2001-01-02 18:00:00 1.1 1.1
8 2001-01-03 00:00:00 1.1 1.6
9 2001-01-03 06:00:00 1.2 1.6
10 2001-01-03 12:00:00 1.3 1.6
11 2001-01-03 18:00:00 1.1 1.6
I'm sure there are other ways but I didn't want to think about this anymore.
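Another option worth sketching, assuming both frames have Time parsed as datetimes (as in the first step above) and are sorted by Time, is pd.merge_asof, which does the backward-looking match in one vectorized pass; the tolerance keeps a row unmatched when the previous day is genuinely missing:
import pandas as pd

# shift the daily data forward one day so each 4h row looks up "yesterday"
daily = df2.assign(Time=df2.Time + pd.Timedelta(1, 'D')).sort_values('Time')

# backward match: each 4h timestamp takes the most recent shifted daily row,
# but only if it is less than a day old, so genuine gaps stay NaN
out = pd.merge_asof(df.sort_values('Time'), daily, on='Time',
                    direction='backward', tolerance=pd.Timedelta(hours=23))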
