Pandas select columns and data dependent on header - python

I have a large .csv file. I want to select only the column with the time/date and 20 other columns which I know by header.
As a test, I try to take only the column with the header 'TIMESTAMP'. I know this column is 4207823 rows long in the .csv and that it only contains dates and times. The code below selects the TIMESTAMP column but also carries on to take values from other columns, as shown below:
import pandas

# Raw string avoids backslash-escape problems in the Windows path
f = pandas.read_csv(r'C:\Users\mmso2\Google Drive\MABL Wind\_Semester 2 2016\Wind Farm Info\DataB\DataB - NaN2.csv',
                    dtype=object, low_memory=False)  # read the file so it can be edited
time = f[['TIMESTAMP']]
time = time[0:4207823]  # test to see if this stops time taking other data
print(time)
Output:
TIMESTAMP
0 2007-08-15 21:10:00
1 2007-08-15 21:20:00
2 2007-08-15 21:30:00
3 2007-08-15 21:40:00
4 2007-08-15 21:50:00
5 2007-08-15 22:00:00
6 2007-08-15 22:10:00
7 2007-08-15 22:20:00
8 2007-08-15 22:30:00
9 2007-08-15 22:40:00
10 2007-08-15 22:50:00
11 2007-08-15 23:00:00
12 2007-08-15 23:10:00
13 2007-08-15 23:20:00
14 2007-08-15 23:30:00
15 2007-08-15 23:40:00
16 2007-08-15 23:50:00
17 2007-08-16 00:00:00
18 2007-08-16 00:10:00
19 2007-08-16 00:20:00
20 2007-08-16 00:30:00
21 2007-08-16 00:40:00
22 2007-08-16 00:50:00
23 2007-08-16 01:00:00
24 2007-08-16 01:10:00
25 2007-08-16 01:20:00
26 2007-08-16 01:30:00
27 2007-08-16 01:40:00
28 2007-08-16 01:50:00
29 2007-08-16 02:00:00 #these are from the TIMESTAMP column
... ...
679302 221.484 #This is from another column
679303 NaN
679304 2015-09-23 06:40:00
679305 NaN
679306 NaN
679307 2015-09-23 06:50:00
679308 NaN
679309 NaN
679310 2015-09-23 07:00:00

The problem was due to an error in the input file, so a simple use of usecols in pandas.read_csv worked.
The code below demonstrates the selection of a few columns of data:
import pandas

# Read only the selected columns
df = pandas.read_csv('DataB - Copy - Copy.csv', delimiter=',', dtype=object,
                     low_memory=False,
                     usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m'])
print(df)  # see what the data looks like
df.to_csv('DataB_GreaterGabbardOnly.csv')  # save the selection to a new .csv
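If the timestamps should come back as real datetimes rather than strings, read_csv can also parse them on the way in. A minimal sketch, assuming the same file and column names as above (index=False simply drops the row counter from the output file):
import pandas as pd

# parse_dates converts the TIMESTAMP column to datetime64 while reading
df = pd.read_csv('DataB - Copy - Copy.csv',
                 usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m'],
                 parse_dates=['TIMESTAMP'])
df.to_csv('DataB_GreaterGabbardOnly.csv', index=False)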

Related

Using Pandas to filter 2 specific days of the year

I have a big CSV dataset and I wish to filter it with Pandas and save the result to a new CSV file.
The aim is to find all the records for days 1 and 15.
When I use the following code, it works:
print(df[(df['data___date_time'].dt.day == 1)])
and the result appears as follows:
data___date_time NO2 SO2 PM10
26 2020-07-01 00:00:00 1.591616 0.287604 NaN
27 2020-07-01 01:00:00 1.486401 NaN NaN
28 2020-07-01 02:00:00 1.362056 NaN NaN
29 2020-07-01 03:00:00 1.295101 0.194399 NaN
30 2020-07-01 04:00:00 1.260667 0.362168 NaN
... ... ... ...
17054 2022-07-01 19:00:00 2.894369 2.077140 19.34
17055 2022-07-01 20:00:00 3.644265 1.656386 23.09
17056 2022-07-01 21:00:00 2.907760 1.291555 23.67
17057 2022-07-01 22:00:00 2.974715 1.318185 27.68
17058 2022-07-01 23:00:00 2.858022 1.169057 25.18
However, when I use the following code, nothing comes out:
print(df[(df['data___date_time'].dt.day == 1) & (df['data___date_time'].dt.day == 15)])
This just gives me:
Empty DataFrame
Columns: [data___date_time, NO2, SO2, PM10]
Index: []
Any idea what the problem could be?
There is a logic problem: the same row cannot have day 1 and day 15 at the same time, so you need | for bitwise OR. If you need to test multiple values, it is simpler to use Series.isin:
df = pd.DataFrame({'data___date_time': pd.date_range('2000-01-01', periods=20)})
print(df[df['data___date_time'].dt.day.isin([1, 15])])
data___date_time
0 2000-01-01
14 2000-01-15
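For completeness, the corrected bitwise-OR version of the original filter looks like this (a sketch reusing the column name from the question):
# each row matches if its day is 1 OR 15; with & both conditions can never hold at once
print(df[(df['data___date_time'].dt.day == 1) | (df['data___date_time'].dt.day == 15)])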

Pandas series add index and value at the beginning of the series object

I have a pandas Series object with an index running from '2020-01-01 01:00:00+00:00' until 2020-12-19. I would like to add the index '2020-01-01 00:00:00+00:00' with the value np.nan.
data_dict[i]
date_time
2020-01-01 01:00:00+00:00 13
2020-01-01 02:00:00+00:00 13
2020-01-01 03:00:00+00:00 13
2020-01-01 04:00:00+00:00 13
2020-01-01 05:00:00+00:00 13
...
2020-12-18 20:00:00+00:00 25
2020-12-18 21:00:00+00:00 25
2020-12-18 22:00:00+00:00 25
2020-12-18 23:00:00+00:00 25
2020-12-19 00:00:00+00:00 20
When I use:
nan = pd.Series([np.nan], index=['2020-01-01 00:00:00+00:00'])
data_dict[i].append(nan)
data_dict[i].sort_index()
it seems like nothing happens:
data_dict[i]
date_time
2020-01-01 01:00:00+00:00 13
2020-01-01 02:00:00+00:00 13
2020-01-01 03:00:00+00:00 13
2020-01-01 04:00:00+00:00 13
2020-01-01 05:00:00+00:00 13
...
2020-12-18 21:00:00+00:00 25
2020-12-18 22:00:00+00:00 25
2020-12-18 23:00:00+00:00 25
2020-12-19 00:00:00+00:00 20
How would I add it at the right place (i.e. at the beginning of the Series object)?
If you use .append, new data is added to the end of the Series; for the correct order, sort the values of the DatetimeIndex:
s = s1.append(s2).sort_index()
If you need to add data at the start of the Series, swap the order and append the first Series to the second one:
s = s2.append(s1)
EDIT: Here it is necessary to assign the result of Series.append back, and to create a DatetimeIndex in the added Series:
nan = pd.Series([np.nan], index=pd.to_datetime(['2020-01-01 00:00:00+00:00']))
data_dict[i] = data_dict[i].append(nan)
data_dict[i] = data_dict[i].sort_index()
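Note that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea is spelled with pd.concat. A minimal sketch under that assumption:
import numpy as np
import pandas as pd

nan = pd.Series([np.nan], index=pd.to_datetime(['2020-01-01 00:00:00+00:00']))
# pd.concat replaces the removed Series.append; sort_index restores chronological order
data_dict[i] = pd.concat([nan, data_dict[i]]).sort_index()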

Loop through rows of dataframe using re.compile().split()

I have a dataframe that consists of 1 column and several rows. Each of these rows is constructed in the same way: -timestamp- value1 value2 value3 -timestamp- value4 value5 value6 ...
The timestamps have this format: YYYY-MM-DD HH:MM:SS and the values are numbers with 2 decimals.
I would like to make a new dataframe that has the individual timestamps in one row and the related values in the next row.
I managed to get the expected result line-wise with regex, but not for the entire dataframe.
My code so far:
#input dataframe
data.head()
values
0 2020-05-12 10:00:00 12.07 13 11.56 ... 2020-05-12 10:00:01 11.49 17 5.67...
1 2020-05-12 10:01:00 11.49 17 5.67 ... 2020-05-12 10:01:01 12.07 13 11.56...
2 2020-05-12 10:02:00 14.29 18 11.28 ... 2020-05-12 10:02:01 13.77 18 7.43...
test = data['values'].iloc[0] #first row of data
row1 = re.compile(r"(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(test)
df_row1 = pd.DataFrame(row1)
df_row1.head()
values
0 2020-05-12 10:00:00
1 12.07 13.79 15.45 17.17 18.91 14.91 12.35 14....
2 2020-05-12 10:00:01
3 12.48 13.96 13.88 15.57 18.46 15.0 13.65 14.6...
#trying the same for the entire dataframe
for row in data:
df_new = re.compile(r"(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(row)
print(df_new)
['values']
My question now is how can I loop through the rows of my dataframe and get the expected result?
In case you want to first split the lines and then extract the values into columns, you can use str.extract. With named groups in your regular expression, it will automatically assign the column names of your dataframe:
import pandas as pd

split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
extract_values = r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<time>\d{2}:\d{2}:\d{2})\s(?P<value_one>.*?)\s(?P<value_two>.*?)\s(?P<value_three>.*?)$"

df = pd.DataFrame([{
    "value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67",
}, {
    "value": "2020-05-13 10:00:00 14.07 13 15.56 2020-05-16 10:00:02 11.51 18 5.69",
}])
df = df["value"].str.split(split_line).explode().str.extract(extract_values, expand=True)
print(df)
# date time value_one value_two value_three
# 0 2020-05-12 10:00:00 12.07 13 11.56
# 0 2020-06-12 11:00:00 13.07 16 11.16
# 0 2020-05-12 10:00:01 11.49 17 5.67
# 1 2020-05-13 10:00:00 14.07 13 15.56
# 1 2020-05-16 10:00:02 11.51 18 5.69
In case you do not know the number of values after the date and time, use split rather than extraction with a regular expression. I would suggest something like this:
split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
df = pd.DataFrame([{
    "value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67",
}, {
    "value": "2020-05-13 10:00:00 14.07 13 14 15 15.56 2020-05-16 10:00:02 11.51 18 5.69",
}])
df = df["value"].str.split(split_line).explode().reset_index()
df = df['value'].str.split(" ").apply(pd.Series)
df.columns = [f"col_{col}" for col in df.columns]
print(df)
# col_0 col_1 col_2 col_3 col_4 col_5 col_6
# 0 2020-05-12 10:00:00 12.07 13 11.56 NaN NaN
# 1 2020-06-12 11:00:00 13.07 16 11.16 NaN NaN
# 2 2020-05-12 10:00:01 11.49 17 5.67 NaN NaN
# 3 2020-05-13 10:00:00 14.07 13 14 15 15.56
# 4 2020-05-16 10:00:02 11.51 18 5.69 NaN NaN
You don't need to loop through the rows to get the result. Instead, you can use Series.str.split to split the given series around the delimiter; the delimiter in this case is a regular expression. Then you can use DataFrame.explode to transform each element of a list-like into a separate row.
Use:
data["values"] = data["values"].str.split(r'\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})')
data = data.explode("values")
data["values"] = data["values"].str.split(r'(?<=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s+')
data = data.explode("values").reset_index(drop=True)
print(data)
This resulting dataframe data should look like:
values
0 2020-05-12 10:00:00
1 12.07 13 11.56
2 2020-05-12 10:00:01
3 11.49 17 5.67
4 2020-05-12 10:01:00
5 11.49 17 5.67
6 2020-05-12 10:01:01
7 12.07 13 11.56
8 2020-05-12 10:02:00
9 14.29 18 11.28
10 2020-05-12 10:02:01
11 13.77 18 7.43
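If you later want each timestamp next to its values instead of on alternating rows, the even and odd positions can be zipped back together. A sketch, assuming the alternating layout shown above:
import pandas as pd

# even positions hold timestamps, odd positions hold the value strings
paired = pd.DataFrame({
    'timestamp': data['values'].iloc[::2].to_numpy(),
    'values': data['values'].iloc[1::2].to_numpy(),
})
print(paired)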

filter pandas dataframe by time

I have a pandas dataframe which I want to subset on time greater or less than 12pm. First I convert my string datetime to a datetime64[ns] object in pandas.
segments_data['time'] = pd.to_datetime((segments_data['time']))
Then I separate time, date, month, year & dayofweek as below.
import datetime as dt
segments_data['date'] = segments_data.time.dt.date
segments_data['year'] = segments_data.time.dt.year
segments_data['month'] = segments_data.time.dt.month
segments_data['dayofweek'] = segments_data.time.dt.dayofweek
segments_data['time'] = segments_data.time.dt.time
My time column looks like following.
segments_data['time']
Out[1906]:
07:43:00
07:52:00
08:00:00
08:42:00
09:18:00
09:18:00
09:18:00
09:23:00
12:32:00
12:43:00
12:55:00
Name: time, dtype: object
Now I want to subset the dataframe by time greater than 12pm and time less than 12pm.
segments_data.time[segments_data['time'] < 12:00:00]
This doesn't work: 12:00:00 is not valid Python, and the time column is now a plain object column.
Update
From pandas docs at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html. Thanks to Frederick in the comments.
Create dataframe with datetimes in it:
i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
ts
A
2018-04-09 00:00:00 1
2018-04-10 00:20:00 2
2018-04-11 00:40:00 3
2018-04-12 01:00:00 4
Use between_time:
ts.between_time('0:15', '0:45')
A
2018-04-10 00:20:00 2
2018-04-11 00:40:00 3
You get the times that are not between two times by setting start_time later than end_time:
ts.between_time('0:45', '0:15')
A
2018-04-09 00:00:00 1
2018-04-12 01:00:00 4
Old Answer
Leave a column as the raw datetime, call it ts:
segments_data['ts'] = pd.to_datetime((segments_data['time']))
Next, you can cast the datetime to an H:M:S string and use between(start, end), which seems to work:
In [227]:
segments_data=pd.DataFrame(x,columns=['ts'])
segments_data.ts = pd.to_datetime(segments_data.ts)
segments_data
Out[227]:
ts
0 2016-01-28 07:43:00
1 2016-01-28 07:52:00
2 2016-01-28 08:00:00
3 2016-01-28 08:42:00
4 2016-01-28 09:18:00
5 2016-01-28 09:18:00
6 2016-01-28 09:18:00
7 2016-01-28 09:23:00
8 2016-01-28 12:32:00
9 2016-01-28 12:43:00
10 2016-01-28 12:55:00
In [228]:
segments_data[segments_data.ts.dt.strftime('%H:%M:%S').between('00:00:00','12:00:00')]
Out[228]:
ts
0 2016-01-28 07:43:00
1 2016-01-28 07:52:00
2 2016-01-28 08:00:00
3 2016-01-28 08:42:00
4 2016-01-28 09:18:00
5 2016-01-28 09:18:00
6 2016-01-28 09:18:00
7 2016-01-28 09:23:00
Even though this post is 5 years old, I just ran into this same problem and decided to post what I was able to get to work. I tried the between_time function, but it did not work for me because the dataframe index had to be a DatetimeIndex, and I wanted to filter on one of the dataframe's time columns.
# Import datetime libraries
from datetime import datetime, date, time
avail_df['Start'].dt.time
1 08:36:44
2 08:49:14
3 09:26:00
5 08:34:22
7 08:34:19
8 09:09:05
9 12:27:43
10 12:29:14
12 09:05:55
13 09:14:11
14 09:21:41
15 11:28:26
16 12:25:10
17 16:02:52
18 08:53:51
# Use "time()" function to create start/end parameter I used 9:00am for this example
avail_df.loc[avail_df['Start'].dt.time > time(9,00)]
3 09:26:00
8 09:09:05
9 12:27:43
10 12:29:14
12 09:05:55
13 09:14:11
14 09:21:41
15 11:28:26
16 12:25:10
17 16:02:52
20 09:04:50
21 09:21:35
22 09:22:05
23 09:47:05
24 09:55:05
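If you prefer between_time but your datetimes live in a column rather than the index, a temporary set_index works as well. A minimal sketch, assuming the 'Start' column above holds full datetimes:
# between_time needs a DatetimeIndex, so promote the column temporarily
filtered = avail_df.set_index('Start').between_time('09:00', '12:00').reset_index()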

How do I group hourly data by day and count only values greater than a set amount in Pandas?

I am new to Pandas but have been working with Python for a few years now.
I have a large data set of hourly data with multiple columns. I need to group the data by day, then count how many times the value is above 85 for each day in each column.
example data:
date KMRY KSNS PCEC1 KFAT
2014-06-06 13:00:00 56.000000 63.0 17 11
2014-06-06 14:00:00 58.000000 61.0 17 11
2014-06-06 15:00:00 63.000000 63.0 16 10
2014-06-06 16:00:00 67.000000 65.0 12 11
2014-06-06 17:00:00 67.000000 67.0 10 13
2014-06-06 18:00:00 72.000000 75.0 9 14
2014-06-06 19:00:00 77.000000 79.0 9 15
2014-06-06 20:00:00 84.000000 81.0 9 23
2014-06-06 21:00:00 81.000000 86.0 12 31
2014-06-06 22:00:00 84.000000 84.0 13 28
2014-06-06 23:00:00 83.000000 86.0 15 34
2014-06-07 00:00:00 84.000000 86.0 16 36
2014-06-07 01:00:00 86.000000 89.0 17 43
2014-06-07 02:00:00 86.000000 89.0 20 44
2014-06-07 03:00:00 89.000000 89.0 22 49
2014-06-07 04:00:00 86.000000 86.0 22 51
2014-06-07 05:00:00 86.000000 89.0 21 53
From the sample above my results should look like the following:
date KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
Any help would be greatly appreciated.
(D_RH > 85).sum()
The above code gets me close, but I also need a daily breakdown, not just the overall column counts.
One way would be to make date a DatetimeIndex and then group the result of the comparison to 85 by the index's date. For example:
>>> df["date"] = pd.to_datetime(df["date"]) # only if it isn't already
>>> df = df.set_index("date")
>>> (df > 85).groupby(df.index.date).sum()
KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
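An equivalent spelling, assuming the same DatetimeIndex, is to resample the boolean frame by calendar day; a short sketch:
# 'D' buckets the comparison result by day; sum counts the True values
print((df > 85).resample('D').sum())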
