I'm reading in a csv file with a datetime column that has randomly interspersed blocks of non-datetime text (5 lines in a block at a time, and sometimes multiple blocks in a row). See below for an example snippet of the data file:
Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
12/22/2015,05:26.0,39617.0,0.0,6.42
12/22/2015,05:27.0,39618.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
12/22/2015,19:58.0,39620.0,0.0,6.42
12/22/2015,19:59.0,39621.0,0.0,6.42
12/22/2015,20:00.0,39622.0,0.0,6.42
12/22/2015,20:01.0,39623.0,0.0,6.42
12/22/2015,20:02.0,39624.0,0.0,6.42
I can read the data from the clipboard and into a dataframe as follows:
df = pd.read_clipboard(sep=',')
I am looking for a way to clean the 'Date' column of non-date-formatted strings prior to converting to a datetime index. I have tried converting the column to an index and then filtering like this:
df.index=df['Date']
df = df[~df.index.get_loc('RMR')]
df = df[~df.index.get_loc('Default Site')]
df = df[~df.index.get_loc('X2CMBasicOpticsBurst')]
df = df[~df.index.get_loc('Sonde STSO3275')]
df = df.dropna()
I can then parse the dates and times together and get a proper datetime index using date parse tools.
However, the contents of the text fields can change, so this approach seems very limited and non-pythonic.
Therefore, I'm looking for a better, more flexible and dynamic method to automatically skip these non-date fields in the index, hopefully without having to know the details of their contents (e.g. skipping a 4-row block when a blank line is encountered).
Thanks in advance.
Well, you can use to_datetime:
df.loc[:, 'Date'] = pd.to_datetime(df.Date, errors='coerce')
Any element that is not a datetime will be transformed to NaT, and then you can drop those rows:
df = df.dropna()
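Putting it together, a minimal end-to-end sketch (the filename 'data.csv' is an assumption for illustration):

import pandas as pd

df = pd.read_csv('data.csv')  # 'data.csv' stands in for your actual file
# anything that doesn't parse as a date becomes NaT
df['Date'] = pd.to_datetime(df.Date, errors='coerce')
# drop the NaT rows and build the datetime index
df = df.dropna(subset=['Date']).set_index('Date')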
I think you can use read_csv with dropna and to_datetime:
import pandas as pd
import io
temp=u"""Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
12/22/2015,05:26.0,39617.0,0.0,6.42
12/22/2015,05:27.0,39618.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
12/22/2015,19:58.0,39620.0,0.0,6.42
12/22/2015,19:59.0,39621.0,0.0,6.42
12/22/2015,20:00.0,39622.0,0.0,6.42
12/22/2015,20:01.0,39623.0,0.0,6.42
12/22/2015,20:02.0,39624.0,0.0,6.42"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['Date','Time']])
df = df.dropna()
df['Date_Time'] = pd.to_datetime(df.Date_Time, format="%m/%d/%Y %H:%M.%S")
print(df)
Date_Time Count Fault Battery
0 2015-12-22 05:24:00 39615.0 0.0 6.42
1 2015-12-22 05:25:00 39616.0 0.0 6.42
2 2015-12-22 05:26:00 39617.0 0.0 6.42
3 2015-12-22 05:27:00 39618.0 0.0 6.42
14 2015-12-22 19:57:00 39619.0 0.0 6.42
15 2015-12-22 19:58:00 39620.0 0.0 6.42
16 2015-12-22 19:59:00 39621.0 0.0 6.42
17 2015-12-22 20:00:00 39622.0 0.0 6.42
18 2015-12-22 20:01:00 39623.0 0.0 6.42
19 2015-12-22 20:02:00 39624.0 0.0 6.42
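As a small follow-up, if you then want the datetime index the question asks for, set the combined column as the index:

df = df.set_index('Date_Time')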
I have a large csv file with millions of rows. The data looks like this: 2 columns (date, score) and a million rows. I need the missing dates (for example 1/1/16, 2/1/16, 4/1/16) to have '0' values in the 'score' column while keeping my existing 'date' and 'score' values intact, all in the same csv. But I also have multiple (hundreds, probably) scores on many dates, so I'm really having trouble coding it. I've looked up quite a few examples on Stack Overflow but none of them seemed to work yet.
date score
3/1/16 0.6369
5/1/16 -0.2023
6/1/16 0.25
7/1/16 0.0772
9/1/16 -0.4215
12/1/16 0.296
15/1/16 0.25
15/1/16 0.7684
15/1/16 0.8537
...
...
31/12/18 0.5646
This is what I have done so far, but all I am getting is an index column filled with 3 years of dates, and my 'date' and 'score' columns filled with '0'. I will really appreciate your answers and suggestions. Thank you very much.
import csv
import pandas as pd
import datetime as dt
df = pd.read_csv('myfile.csv')
dtr = pd.date_range('01.01.2016', '31.12.2018')
df.index = pd.DatetimeIndex(df.index)
df = df.reindex(dtr, fill_value=0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)
Note: I know I put index as True, which is why the index is appearing, but I don't know why the 'date' column is not filling. If I put parse_dates=['date'] in my pd.read_csv, I get the 'date' column filled with dates from 1970 with the same results as before.
You can do it like this (I did it with a smaller timeframe, so change the dates so that they fit your data):
import pandas as pd
x = {"date":["3/1/16","5/1/16","5/1/16"],
"score":[4,5,6]}
df = pd.DataFrame.from_dict(x)
df["date"] = pd.to_datetime(df["date"], format='%d/%m/%y')
df.set_index("date",inplace=True)
dtr =pd.date_range('01.01.2016', '01.10.2016', freq='D')
s = pd.Series(index=dtr)
df = pd.concat([df,s[~s.index.isin(df.index)]]).sort_index()
df = df.drop([0],axis=1).fillna(0)
print(df)
Output
score
2016-01-01 0.0
2016-01-02 0.0
2016-01-03 4.0
2016-01-04 0.0
2016-01-05 5.0
2016-01-05 6.0
2016-01-06 0.0
2016-01-07 0.0
2016-01-08 0.0
2016-01-09 0.0
2016-01-10 0.0
With file
Because you asked in the comments, here is an example with a file:
df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
dtr = pd.date_range('01.01.2016', '01.10.2016', freq='D')
s = pd.Series(index=dtr)
df = pd.concat([df, s[~s.index.isin(df.index)]]).sort_index()
df = df.drop([0], axis=1).fillna(0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)
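Incidentally, this is also why the reindex approach from the question cannot work here: dates like 15/1/16 appear more than once, and reindexing an axis that contains duplicate labels raises a ValueError ("cannot reindex from a duplicate axis"), so the concat-based merge above sidesteps that.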
Just an idea: try resampling with 1 day and filling zeros, like:
nd = df.resample('D').pad()
Not very efficient, but it will work.
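A minimal sketch of that idea (note that .pad() forward-fills rather than writing zeros, so the zero-filling variant below aggregates duplicate dates with sum(), which is an assumption; pick whatever aggregation fits the data):

import pandas as pd

df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
# resample to daily frequency, summing scores that share a date;
# days with no readings come out as 0
nd = df.resample('D').sum().fillna(0)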
import pandas as pd
df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
dtr = pd.date_range('01.01.2016', '31.12.2018')
# Create an empty DataFrame from selected date range
empty = pd.DataFrame(index=dtr, columns=['score'])
# Append your CSV file
df = pd.concat([df, empty[~empty.index.isin(df.index)]]).sort_index().fillna(0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)
I want to subtract the previous value from the present value in each line, and whenever there is an N/A, copy the previous available value so that the subtraction is done against the previous available value.
When I run the code below, I get the following message: 'DataFrame' object has no attribute 'value'. Could anyone please help fix it?
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('ccy_test.xlsx')
X = df.iloc[3:, 1:]
df.fillna(method='pad')
count_row = df.shape[0]
count_col = df.shape[1]
z = df.value[:count_row,1:count_col] - df.value[:count_row,:count_col-1]
dz = pd.DataFrame(z)
Sample File
There are some issues with the code you posted. The example file is a csv file, so you need to refer to "ccy_test.csv". The column Value only contains 0's, so for this example I use the column Open.
Furthermore, I added the following to your read_csv:
index_col=0 -> makes the first column (the dates) the index
parse_dates=[0] -> parses the dates as dates (instead of strings)
skiprows=3 -> skips the first three rows, as they are not part of the table
header=0 -> reads the first row (after the skip) as column names
So:
import pandas as pd
df = pd.read_csv('ccy_test.csv', index_col=0, parse_dates=[0], skiprows=3, header=0)
df = df.fillna(method='pad')
df['Difference'] = df.Open.diff()
print(df)
The output:
Open High Low Value Volume Difference
Dates
2018-03-01 09:30:00 0.83064 0.83121 0.83064 0.0 0.0 NaN
2018-03-01 09:31:00 0.83121 0.83128 0.83114 0.0 0.0 0.00057
2018-03-01 09:32:00 0.83128 0.83161 0.83126 0.0 0.0 0.00007
2018-03-01 09:33:00 0.83161 0.83169 0.83161 0.0 0.0 0.00033
2018-03-01 09:34:00 0.83169 0.83169 0.83145 0.0 0.0 0.00008
df.fillna(method='pad') by default won't change your dataframe; you need to reassign it with df = df.fillna(method='pad').
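In other words, either of these works:

df = df.fillna(method='pad')           # reassign the returned copy
df.fillna(method='pad', inplace=True)  # or modify the frame in place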
I have 2 dataframes I have created using pandas and stored as .csv. Each row of both dataframes has columns with dates and times, but the timestamps aren't necessarily the same. So, I want to create a combined pandas dataframe such that the 2 are joined on the basis of the CLOSEST times.
This is my first dataframe. This is my second dataframe. I want to get kp and f107 values for each filename which are closest in date and time to the Avg_time column for each row in the first dataframe. How do I do this? Is there something like a merge with method='nearest' to do this with pandas?
You can use pd.merge_asof (here with pandas 0.20.2) with direction='nearest':
pd.merge_asof(df1.sort_values(by='file_date'), df2.sort_values(by='AST'),
              left_on='file_date', right_on='AST', direction='nearest')
Output:
Filename file_date Avg_time AST f107 kp
0 Na1998319 1998-11-16 2:14 1998-11-15 23:00:00 121.8 2.3
1 Na1998320 1998-11-17 2:01 1998-11-16 23:00:00 118.0 2.3
2 Na1998321 1998-11-18 0:38 1998-11-17 23:00:00 112.2 2.3
3 Na1998322 1998-11-18 20:51 1998-11-17 23:00:00 112.2 2.3
4 Na1999020 1999-01-20 22:53 1999-01-19 23:00:00 231.3 0.7
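For reference, a self-contained sketch with made-up values shaped like the two frames (the numbers are illustrative, not the asker's real data):

import pandas as pd

df1 = pd.DataFrame({'Filename': ['Na1998319', 'Na1998320'],
                    'file_date': pd.to_datetime(['1998-11-16 02:14',
                                                 '1998-11-17 02:01'])})
df2 = pd.DataFrame({'AST': pd.to_datetime(['1998-11-15 23:00',
                                           '1998-11-16 23:00']),
                    'f107': [121.8, 118.0],
                    'kp': [2.3, 2.3]})

# both frames must be sorted on their merge keys before merge_asof
merged = pd.merge_asof(df1.sort_values(by='file_date'),
                       df2.sort_values(by='AST'),
                       left_on='file_date', right_on='AST',
                       direction='nearest')
print(merged)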
I am new to python and I have a list of five climate data replicates that I would like to separate into individual replicates. Each replicate has a length of 42734 rows, and the total length of the data frame (df) is 213,674.
Each replicate is separated by a line whose first entry is "Replicate". I have shown the titles of each column of data above the separating line.
Index year Month Day Rain Evap Max_Temp
42734 Replicate # 2 nan nan nan
I have tried the following code, which is extremely clunky and as I have to generate 100 climate replicates, is not practical. I know there is an easier way to do this, but I do not have enough experience with python yet to figure it out.
Here is the code I wrote:
# Import replicate .txt file into a dataframe
df = pd.read_table('5_replicates.txt', sep=r"\s*",
                   skiprows=12, engine='python', header=None,
                   names=['year', 'Month', 'Day', 'Rain', 'Evap', 'Max_T'])
len(df)
i = 42734
num_replicates = 5
# Replicate 1
replicate_1 = df[0:i]
print("length of replicate_1:", len(replicate_1))
# Replicate 2
replicate_2 = df[i+1 : 2*i+1]
print("length of replicate_2:", len(replicate_2))
# Replicate 3
replicate_3 = df[2*i+2 : 3*i+2]
print("length of replicate_3:", len(replicate_3))
# Replicate 4
replicate_4 = df[3*i+3 : 4*i+3]
print("length of replicate_4:", len(replicate_4))
# Replicate 5
replicate_5 = df[4*i+4 : 5*i+4]
print("length of replicate_5:", len(replicate_5))
Any help would be much appreciated!
## create the example data frame
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': pd.date_range(start='2016-01-01', end='2017-01-01', freq='H'),
                   'rain': np.random.randn(8785),
                   'max_temp': np.random.randn(8785)})
df.year = df.year.astype(str)  # make the year column of str type
## add "Replicate" marker rows at evenly spaced indexes
# (.loc with integer positions works here because the index is a RangeIndex)
df.loc[np.floor(np.linspace(0, df.shape[0]-1, 5)).astype(int), 'year'] = "Replicate"
In [7]: df.head()
Out[7]:
max_temp rain year
0 -1.068354 0.959108 Replicate
1 -0.219425 0.777235 2016-01-01 01:00:00
2 -0.262994 0.472665 2016-01-01 02:00:00
3 -1.761527 -0.515135 2016-01-01 03:00:00
4 -2.038738 -1.452385 2016-01-01 04:00:00
Here, I just do the following. 1) I find the indexes at which the word "Replicate" is featured and record those indexes in the dictionary idx_dict. 2) I create a Python range for each block that indexes which rows are in which replicate. 3) Finally, I assign the replicate number to each block, though once you have the range objects you don't really need to do this.
#1) find where the word "Replicate" is featured
indexes = df[df.year == 'Replicate'].index
#2) create the range objects
idx_dict = {}
for i in range(0, indexes.shape[0]-1):
    idx_dict[i] = range(indexes[i], indexes[i+1]-1)
#3) set the replicate number in some column
df.loc[:, 'rep_num'] = np.nan  # preset a value for the 'rep_num' column
for i in range(0, indexes.shape[0]-1):
    print(i)
    df.loc[idx_dict[i], 'rep_num'] = i
# fill in the NAs because my indexing algorithm isn't splendid
df.rep_num.fillna(method='ffill', inplace=True)
Now, you can just subset the df as you please by the replicate number or store portions elsewhere.
#get the number of rows in each replicate:
In [26]: df.groupby("rep_num").count()
Out[26]:
max_temp rain year
rep_num
0.0 2196 2196 2196
1.0 2196 2196 2196
2.0 2196 2196 2196
3.0 2197 2197 2197
#get the portion with the first replicate
In [27]: df.loc[df.rep_num==0,:].head()
Out[27]:
max_temp rain year rep_num
0 0.976052 0.896358 Replicate 0.0
1 -0.875221 -1.110111 2016-01-01 01:00:00 0.0
2 -0.305727 0.495230 2016-01-01 02:00:00 0.0
3 0.694737 -0.356541 2016-01-01 03:00:00 0.0
4 0.325071 0.669536 2016-01-01 04:00:00 0.0
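As a side note, the same block labeling can be done in one line with a cumulative sum over the marker rows (a sketch under the same example setup; the first block comes out numbered 1 rather than 0):

# every "Replicate" marker row bumps the counter for all rows below it
df['rep_num'] = (df.year == 'Replicate').cumsum()
# optionally drop the marker rows themselves
df = df[df.year != 'Replicate']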
I have a pandas dataset like this:
Date WaterTemp Discharge AirTemp Precip
0 2012-10-05 00:00 10.9 414.0 39.2 0.0
1 2012-10-05 00:15 10.1 406.0 39.2 0.0
2 2012-10-05 00:45 10.4 406.0 37.4 0.0
...
63661 2016-10-12 14:30 10.5 329.0 15.8 0.0
63662 2016-10-12 14:45 10.6 323.0 19.4 0.0
63663 2016-10-12 15:15 10.8 329.0 23 0.0
I want to extend each row so that I get a dataset that looks like:
Date WaterTemp 00:00 WaterTemp 00:15 .... Discharge 00:00 ...
0 2012-10-05 10.9 10.1 414.0
There will be at most 72 readings for each date, so I should have 288 columns in addition to the date and index columns, and I should have at most 1460 rows (4 years * 365 days per year, possibly minus some missing dates). Eventually, I will use the 288-column dataset in a classification task (I'll be adding the label later), so I need to convert this dataframe to a 2d array (sans datetime) to feed into the classifier, which means I can't simply group by date and then access each group. I did try grouping based on date, but I was uncertain how to change each group into a single row. I also looked at joining. It looks like joining could suit my needs (for example, a join based on (day, month, year)), but I was uncertain how to split things into different pandas dataframes so that the join would work. What is a way to do this?
PS. I already know how to change the datetimes in my Date column to dates without the time.
I figured it out. I group the readings by the time of day of the reading. Each group is a dataframe in and of itself, so I then just need to concatenate the dataframes based on date. My code for the whole function is as follows.
import pandas
def readInData(filename):
    # read in the file and remove missing values
    ds = pandas.read_csv(filename)
    ds = ds[ds.AirTemp != 'M']
    # set the index to the date
    ds['Date'] = pandas.to_datetime(ds.Date, yearfirst=True, errors='coerce')
    ds.Date = pandas.DatetimeIndex(ds.Date)
    ds.index = ds.Date
    # group readings by time of day, i.e. all readings at midnight together
    dg = ds.groupby(ds.index.time)
    # initialize the final dataframe
    df = pandas.DataFrame()
    for name, group in dg:
        # each group is a dataframe
        try:
            # set unique column names, except for Date
            group.columns = ['Date', 'WaterTemp'+str(name), 'Discharge'+str(name), 'AirTemp'+str(name), 'Precip'+str(name)]
            # ensure date is the index
            group.index = group.Date
            # remove the time from the index
            group.index = group.index.normalize()
            # join based on date
            df = pandas.concat([df, group], axis=1)
        except Exception:
            # without this try/except, it throws errors (three for my dataset?)
            pass
    # remove duplicate date columns
    df = df.loc[:, ~df.columns.duplicated()]
    # since the date is the index, drop the leftover Date column
    df = df.drop('Date', axis=1)
    # return the dataset
    return df
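For what it's worth, pivot_table can do the group-by-time-of-day and spread-to-columns in one call. A sketch under the same assumptions about the input file (the columns come out as a (variable, time) MultiIndex rather than the concatenated names above):

import pandas

def readInDataPivot(filename):
    # hypothetical alternative, not the method used above
    ds = pandas.read_csv(filename)
    ds = ds[ds.AirTemp != 'M']
    ds['AirTemp'] = pandas.to_numeric(ds.AirTemp)  # column was read as str because of 'M'
    ds['Date'] = pandas.to_datetime(ds.Date, yearfirst=True, errors='coerce')
    # one row per calendar date, one column per (variable, time-of-day)
    return ds.pivot_table(index=ds.Date.dt.date,
                          columns=ds.Date.dt.time,
                          values=['WaterTemp', 'Discharge', 'AirTemp', 'Precip'])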