Could someone please guide me on how to group an hourly-based index by month, to find how many hours of null values there are in a specific month? I am therefore thinking of ending up with a dataframe that has a monthly based index.
Below is the dataframe, which has a timestamp as the index and another column that occasionally has null values.
timestamp              rel_humidity
1999-09-27 05:00:00    82.875
1999-09-27 06:00:00    83.5
1999-09-27 07:00:00    83.0
1999-09-27 08:00:00    80.6
1999-09-27 09:00:00    nan
1999-09-27 10:00:00    nan
1999-09-27 11:00:00    nan
1999-09-27 12:00:00    nan
I tried this, but the resulting dataframe is not what I expected.
gap_in_month = OG_1998_2022_gaps.groupby([OG_1998_2022_gaps.index.month, OG_1998_2022_gaps.index.year]).count()
I always struggle with the groupby function, so I would highly appreciate any help. Thanks in advance!
If you need 0 for months with no missing values, create a mask with Series.isna, convert the DatetimeIndex to month periods with DatetimeIndex.to_period, and aggregate with sum - the True values of the mask are counted as 1. An alternative is to use Grouper:
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                   .groupby(OG_1998_2022_gaps.index.to_period('m')).sum())

gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                   .groupby(pd.Grouper(freq='m')).sum())
If you need only the months that actually contain missing values, the solution is similar, but first filter with boolean indexing and then aggregate the counts with GroupBy.size:
filtered = OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
gap_in_month = filtered.groupby(filtered.index.to_period('m')).size()

gap_in_month = (OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
                   .groupby(pd.Grouper(freq='m')).size())
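For reference, here is a minimal, self-contained sketch of the first variant, rebuilt by hand from the small sample in the question (the frame name and values are only illustrative, and the printed output is approximately what you should see):

import numpy as np
import pandas as pd

# rebuild the small hourly sample shown in the question
idx = pd.date_range('1999-09-27 05:00', periods=8, freq='H')
OG_1998_2022_gaps = pd.DataFrame(
    {'rel_humidity': [82.875, 83.5, 83.0, 80.6, np.nan, np.nan, np.nan, np.nan]},
    index=idx)

gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
                   .groupby(OG_1998_2022_gaps.index.to_period('m')).sum())
print(gap_in_month)
# 1999-09    4
# Freq: M, Name: rel_humidity, dtype: int64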
An alternative to groupby, and (in my opinion) a much nicer one, is to use pd.Series.resample:
import numpy as np
import pandas as pd

# Some sample data with a DatetimeIndex:
series = pd.Series(
    np.random.choice([1.0, 2.0, 3.0, np.nan], size=2185),
    index=pd.date_range(start="1999-09-26", end="1999-12-26", freq="H")
)

# Solution:
series.isna().resample("M").sum()
# Note that GroupBy.count and Resampler.count count the number of non-null values,
# whereas you seem to be looking for the opposite :)
In your case:
OG_1998_2022_gaps['rel_humidity'].isna().resample("M").sum()
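To make that distinction concrete on the sample series above:

series.resample("M").count()        # number of non-null values per month
series.isna().resample("M").sum()   # number of null values per month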
I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot, I get a bunch of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the functions you want to use to work. I suggest adding a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random

import numpy as np
import pandas as pd

cat = ["NumericIndex", "OriginMovementID", "DestinationMovementID", "MeanTravelTimeSeconds",
       "RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
    [{"Date": d, "Observation": cat[random.randint(0, len(cat) - 1)],
      "Value": random.randint(1000, 10000)}
     for i in range(random.randint(5, 20))
     for d in pd.date_range(dt.datetime(2016, 1, 2), dt.datetime(2016, 3, 31), freq="14D")])
# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])
# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s=0
p=""
for i, v in enumerate(df.index):
    if i == 0 or p != v:
        s = 0
    else:
        s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
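As a side note, the explicit loop that builds SeqNo can be replaced by GroupBy.cumcount, which numbers the rows within each (Date, Observation) group; a minimal sketch assuming the same df as above:

# equivalent to the loop above: sequential number within each (Date, Observation) group
df["SeqNo"] = df.groupby(level=["Date", "Observation"]).cumcount()
dfdd = df.set_index("SeqNo", append=True)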
I have a large csv file with millions of rows. The data looks like this: 2 columns (date, score) and millions of rows. I need the missing dates (for example 1/1/16, 2/1/16, 4/1/16) to have '0' values in the 'score' column while keeping my existing 'date' and 'score' values intact, all in the same csv. But I also have multiple (hundreds, probably) scores on many dates, so I am really having trouble coding it. I have looked up quite a few examples on stackoverflow but none of them seemed to work yet.
date score
3/1/16 0.6369
5/1/16 -0.2023
6/1/16 0.25
7/1/16 0.0772
9/1/16 -0.4215
12/1/16 0.296
15/1/16 0.25
15/1/16 0.7684
15/1/16 0.8537
...
...
31/12/18 0.5646
This is what I have done so far, but all I am getting is an index column filled with the 3 years of dates, while my 'date' and 'score' columns are filled with '0'. I will really appreciate your answers and suggestions. Thank you very much.
import csv
import pandas as pd
import datetime as dt

df = pd.read_csv('myfile.csv')
dtr = pd.date_range('01.01.2016', '31.12.2018')
df.index = pd.DatetimeIndex(df.index)
df = df.reindex(dtr, fill_value=0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)
Note: I know I set index to True; that's why the index is appearing, but I don't know why the 'date' column is not being filled. If I put parse_dates=['date'] in my pd.read_csv, I get the 'date' column filled with dates from 1970, with the same results as before.
You can do it like this:
(I did it with a smaller timeframe, so change the date range so that it fits your data.)
import pandas as pd

x = {"date": ["3/1/16", "5/1/16", "5/1/16"],
     "score": [4, 5, 6]}
df = pd.DataFrame.from_dict(x)
df["date"] = pd.to_datetime(df["date"], format='%d/%m/%y')
df.set_index("date", inplace=True)

# full daily date range to fill against
dtr = pd.date_range('01.01.2016', '01.10.2016', freq='D')
# empty series on the full range; keep only the dates not already present
s = pd.Series(index=dtr, dtype='float64')
df = pd.concat([df, s[~s.index.isin(df.index)]]).sort_index()
# drop the helper column created by the concat and fill the gaps with 0
df = df.drop([0], axis=1).fillna(0)
print(df)
Output
score
2016-01-01 0.0
2016-01-02 0.0
2016-01-03 4.0
2016-01-04 0.0
2016-01-05 5.0
2016-01-05 6.0
2016-01-06 0.0
2016-01-07 0.0
2016-01-08 0.0
2016-01-09 0.0
2016-01-10 0.0
With a file
Because you asked in the comments, here is an example reading from a file:
df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')

dtr = pd.date_range('01.01.2016', '01.10.2016', freq='D')
s = pd.Series(index=dtr, dtype='float64')
df = pd.concat([df, s[~s.index.isin(df.index)]]).sort_index()
df = df.drop([0], axis=1).fillna(0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)
Just an idea: try resampling with a frequency of 1 day and filling the gaps,
like: nd = df.resample('D').pad()
Not very efficient but will work.
import pandas as pd

df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
dtr = pd.date_range('01.01.2016', '31.12.2018')

# Create an empty DataFrame covering the full date range
empty = pd.DataFrame(index=dtr, columns=['score'])
# Concatenate the dates missing from the CSV and fill their scores with 0
df = pd.concat([df, empty[~empty.index.isin(df.index)]]).sort_index().fillna(0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)
I have a pandas dataset like this:
Date WaterTemp Discharge AirTemp Precip
0 2012-10-05 00:00 10.9 414.0 39.2 0.0
1 2012-10-05 00:15 10.1 406.0 39.2 0.0
2 2012-10-05 00:45 10.4 406.0 37.4 0.0
...
63661 2016-10-12 14:30 10.5 329.0 15.8 0.0
63662 2016-10-12 14:45 10.6 323.0 19.4 0.0
63663 2016-10-12 15:15 10.8 329.0 23 0.0
I want to extend each row so that I get a dataset that looks like:
Date WaterTemp 00:00 WaterTemp 00:15 .... Discharge 00:00 ...
0 2012-10-05 10.9 10.1 414.0
There will be at most 72 readings for each date, so I should have 288 columns in addition to the date and index columns, and at most 1460 rows (4 years * 365 days per year, minus possibly some missing dates). Eventually, I will use the 288-column dataset in a classification task (I'll be adding the label later), so I need to convert this dataframe to a 2d array (sans datetime) to feed into the classifier, which means I can't simply group by date and then access each group. I did try grouping based on date, but I was uncertain how to turn each group into a single row. I also looked at joining. It looks like joining could suit my needs (for example a join based on (day, month, year)), but I was uncertain how to split things into different pandas dataframes so that the join would work. What is a way to do this?
PS. I do already know how to change the datetimes in my Date column to dates without the time.
I figured it out. I group the readings by the time of day of the reading. Each group is a dataframe in and of itself, so I then just need to concatenate the dataframes based on date. My code for the whole function is as follows.
import pandas

def readInData(filename):
    # read in the file and remove missing values
    ds = pandas.read_csv(filename)
    ds = ds[ds.AirTemp != 'M']
    # set the index to the date
    ds['Date'] = pandas.to_datetime(ds.Date, yearfirst=True, errors='coerce')
    ds.Date = pandas.DatetimeIndex(ds.Date)
    ds.index = ds.Date
    # group the readings by time of day (i.e. all readings taken at midnight form one group)
    dg = ds.groupby(ds.index.time)
    # initialize the final dataframe
    df = pandas.DataFrame()
    for name, group in dg:  # each group is a dataframe
        try:
            # set unique column names, except for Date
            group.columns = ['Date', 'WaterTemp' + str(name), 'Discharge' + str(name),
                             'AirTemp' + str(name), 'Precip' + str(name)]
            # ensure the date is the index
            group.index = group.Date
            # remove the time from the index
            group.index = group.index.normalize()
            # join based on date
            df = pandas.concat([df, group], axis=1)
        except Exception:  # without the try/except this throws errors (three for my dataset?)
            pass
    # remove duplicate Date columns
    df = df.loc[:, ~df.columns.duplicated()]
    # since the date is the index, drop the Date column
    df = df.drop('Date', axis=1)
    # return the dataset
    return df
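For what it's worth, a similar wide layout can also be built in one step with pivot_table. The sketch below is my own variation using the column names above; the aggfunc and the column flattening are assumptions, not part of the solution I used:

import pandas

def readInDataPivot(filename):
    # read the file, drop rows with missing values, and parse the dates
    ds = pandas.read_csv(filename)
    ds = ds[ds.AirTemp != 'M']
    ds['AirTemp'] = pandas.to_numeric(ds.AirTemp, errors='coerce')
    ds['Date'] = pandas.to_datetime(ds.Date, yearfirst=True, errors='coerce')
    # one row per calendar date, one column per (variable, time-of-day) pair
    wide = ds.pivot_table(index=ds['Date'].dt.date,
                          columns=ds['Date'].dt.time,
                          values=['WaterTemp', 'Discharge', 'AirTemp', 'Precip'],
                          aggfunc='mean')
    # flatten the (variable, time) MultiIndex columns into single strings
    wide.columns = ['{} {}'.format(var, t) for var, t in wide.columns]
    return wide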
I'm reading in a csv file with a date time column that has randomly interspersed blocks of non-datetime text (5 lines in a block at a time, and sometimes multiple blocks in a row). See below for an example snippet of the data file:
Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
12/22/2015,05:26.0,39617.0,0.0,6.42
12/22/2015,05:27.0,39618.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
12/22/2015,19:58.0,39620.0,0.0,6.42
12/22/2015,19:59.0,39621.0,0.0,6.42
12/22/2015,20:00.0,39622.0,0.0,6.42
12/22/2015,20:01.0,39623.0,0.0,6.42
12/22/2015,20:02.0,39624.0,0.0,6.42
I can read the data from the clipboard and into a dataframe as follows:
df = pd.read_clipboard(sep=',')
I am looking for a way to clean the 'Date' column of non date formatted strings prior to converting to a datetime index. I have tried converting the column to an index and then to a list and filtering like this:
df.index = df['Date']
df = df[~df.index.get_loc('RMR')]
df = df[~df.index.get_loc('Default Site')]
df = df[~df.index.get_loc('X2CMBasicOpticsBurst')]
df = df[~df.index.get_loc('Sonde STSO3275')]
df = df.dropna()
I can then parse the dates and times together and get a proper datetime index using date parse tools.
However, the contents of the text fields can change and this approach seems very limited and non-pythonic.
Therefore, I'm looking for a better, more flexible and dynamic method to automatically skip these non-date fields in the index, hopefully without having to know the details of their contents (e.g. skipping a 4-row block when a blank line is encountered).
Thanks in advance.
Well, you can use to_datetime
df.loc[:, 'Date'] = pd.to_datetime(df.Date, errors='coerce')
Any element that is not a datetime will be transformed to NaT,
and then you can drop those rows:
df = df.dropna()
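Putting it together on the sample above (a minimal sketch; read_clipboard and the column names come from the question, and the date format is an assumption based on the sample data):

import pandas as pd

# read the pasted sample from the clipboard, as in the question
df = pd.read_clipboard(sep=',')
# strings such as 'RMR' or 'Sonde STSO3275' cannot be parsed and become NaT
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='coerce')
# drop the rows whose Date could not be parsed
df = df.dropna(subset=['Date'])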
I think you can use read_csv with dropna and to_datetime:
import pandas as pd
import io
temp=u"""Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
12/22/2015,05:26.0,39617.0,0.0,6.42
12/22/2015,05:27.0,39618.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
12/22/2015,19:58.0,39620.0,0.0,6.42
12/22/2015,19:59.0,39621.0,0.0,6.42
12/22/2015,20:00.0,39622.0,0.0,6.42
12/22/2015,20:01.0,39623.0,0.0,6.42
12/22/2015,20:02.0,39624.0,0.0,6.42"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['Date', 'Time']])
df = df.dropna()
df['Date_Time'] = pd.to_datetime(df.Date_Time, format="%m/%d/%Y %H:%M.%S")
print(df)
Date_Time Count Fault Battery
0 2015-12-22 05:24:00 39615.0 0.0 6.42
1 2015-12-22 05:25:00 39616.0 0.0 6.42
2 2015-12-22 05:26:00 39617.0 0.0 6.42
3 2015-12-22 05:27:00 39618.0 0.0 6.42
14 2015-12-22 19:57:00 39619.0 0.0 6.42
15 2015-12-22 19:58:00 39620.0 0.0 6.42
16 2015-12-22 19:59:00 39621.0 0.0 6.42
17 2015-12-22 20:00:00 39622.0 0.0 6.42
18 2015-12-22 20:01:00 39623.0 0.0 6.42
19 2015-12-22 20:02:00 39624.0 0.0 6.42