pandas: read_csv excluding only certain rows - python

I'm trying to import a csv file that looks like this
Irrelevant row
"TIMESTAMP","RECORD","Site","Logger","Avg_70mSE_Avg","Avg_60mS_Avg",
"TS","RN","","","metres/second","metres/second",
"","","Smp","Smp","Avg","Avg",
"2010-05-18 12:30:00",0,"Sisters",5068,5.162,4.996
"2010-05-18 12:40:00",1,"Sisters",5068,5.683,5.571
The second row is the header but rows 0, 2, 3 are irrelevant. My code at the moment is:
parse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv('data.csv', header=1, index_col=['TIMESTAMP'],
parse_dates=['TIMESTAMP'], date_parser = parse)
The problem is that since rows 2 and 3 don't have correct dates I get an error (or at least I think this is the error).
Would it be possible to exclude these rows, using something like skiprows, but for rows that are not at the beginning of the file? Or do you have any other suggestions?

You can use the skiprows keyword to ignore the rows:
pd.read_csv('data.csv', skiprows=[0, 2, 3],
index_col=['TIMESTAMP'], parse_dates=['TIMESTAMP'])
Which for your sample data gives:
RECORD Site Logger Avg_70mSE_Avg Avg_60mS_Avg
TIMESTAMP
2010-05-18 12:30:00 0 Sisters 5068 5.162 4.996
2010-05-18 12:40:00 1 Sisters 5068 5.683 5.571
The first parsed row (1) becomes the header and read_csv's default parser correctly parses the timestamp column.
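If the rows to drop are easier to describe with a rule than with a list, skiprows also accepts a callable that receives the row number. Below is a minimal sketch, assuming the sample from the question is held in an in-memory string (the names raw and unwanted are only for illustration). The header line in the sample ends with a trailing comma, which may produce an extra empty column, so all-empty columns are dropped afterwards:

```python
import io
import pandas as pd

# Sample from the question, kept in memory instead of 'data.csv'.
raw = '''Irrelevant row
"TIMESTAMP","RECORD","Site","Logger","Avg_70mSE_Avg","Avg_60mS_Avg",
"TS","RN","","","metres/second","metres/second",
"","","Smp","Smp","Avg","Avg",
"2010-05-18 12:30:00",0,"Sisters",5068,5.162,4.996
"2010-05-18 12:40:00",1,"Sisters",5068,5.683,5.571
'''

unwanted = {0, 2, 3}  # zero-based file rows to skip

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=lambda i: i in unwanted,  # callable form of skiprows
    index_col='TIMESTAMP',
    parse_dates=['TIMESTAMP'],
)

# Drop any column that is entirely empty (e.g. one created by the
# trailing comma in the header line).
df = df.dropna(axis=1, how='all')
print(df)
```

The callable form is mostly useful when the unwanted rows follow a pattern; for three fixed rows the plain list in the answer above is just as good.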

Related

How to match values of a dataframe with another dataframe in Python? [duplicate]

I am merging two csv(data frame) using below code:
import pandas as pd
a = pd.read_csv(file1,dtype={'student_id': str})
df = pd.read_csv(file2)
c=pd.merge(a,df,on='test_id',how='left')
c.to_csv('test1.csv', index=False)
I have the following CSV files
file1:
test_id, student_id
1, 01990
2, 02300
3, 05555
file2:
test_id, result
1, pass
3, fail
after merge
test_id, student_id , result
1, 1990, pass
2, 2300,
3, 5555, fail
Notice that student_id has a leading 0 and is supposed to be treated as text, but after merging and calling to_csv it is converted to a number and the leading 0 is removed.
How can I keep the column as "text" even after to_csv?
I think it's the to_csv function that saves it back as numeric.
I added dtype={'student_id': str} when reading the CSV, but when saving with to_csv it is converted to numeric again.
Caveat: please use merge or join. This answer is provided to give perspective on the flexibility pandas gives you and how many different ways there are to answer the same question.
a = pd.read_csv('file1.csv', converters=dict(student_id=str), skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = pd.concat(
[d.set_index('test_id') for d in [a, df]],
axis=1, join='outer'
).reset_index()
It's not dropping the leading zero on the merge, it's dropping it on the read_csv. You can fix this by specifying that column is a string at import time:
a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
The important part is the dtype parameter. You are telling pandas to import this column as a string. The skipinitialspace parameter is set to True, because the column headers are defined with spaces, so we strip it:
test_id, student_id
^ The student_id starts here, at the space
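A quick way to check this is to look at the column names pandas actually sees. The sketch below assumes the file1 contents from the question are held in a string (text, raw_cols and clean are illustrative names). Without skipinitialspace the second column is named ' student_id' with a leading space, which is probably why a dtype mapping keyed on 'student_id' had no effect in the question:

```python
import io
import pandas as pd

# file1 from the question, as an in-memory string.
text = "test_id, student_id\n1, 01990\n2, 02300\n3, 05555\n"

# Default parsing keeps the space that follows each comma in the header.
raw_cols = list(pd.read_csv(io.StringIO(text)).columns)

# With skipinitialspace=True the name is 'student_id', so the dtype key matches.
clean = pd.read_csv(
    io.StringIO(text),
    dtype={'student_id': str},
    skipinitialspace=True,
)

print(raw_cols)
print(clean['student_id'].tolist())
```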
The final code looks like this:
a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = a.merge(df, how='left', on='test_id')
With the results dataframe looking like this:
test_id student_id result
0 1 01990 pass
1 2 02300 NaN
2 3 05555 fail
Then when you run to_csv your result should be:
test_id,student_id, result
1,01990, pass
2,02300,
3,05555, fail
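To convince yourself that to_csv is not the culprit, you can run the whole round trip in memory. This is only a sketch under the assumption that the two files contain exactly the sample rows from the question; file1, file2, merged and out are illustrative names, and skipinitialspace is applied to both files here so the result column is clean as well:

```python
import io
import pandas as pd

# In-memory stand-ins for the two CSV files from the question.
file1 = io.StringIO("test_id, student_id\n1, 01990\n2, 02300\n3, 05555\n")
file2 = io.StringIO("test_id, result\n1, pass\n3, fail\n")

a = pd.read_csv(file1, dtype={'student_id': str}, skipinitialspace=True)
b = pd.read_csv(file2, skipinitialspace=True)

merged = a.merge(b, how='left', on='test_id')

# Write to a buffer instead of a file so the output can be inspected.
out = io.StringIO()
merged.to_csv(out, index=False)
csv_lines = out.getvalue().splitlines()
print(csv_lines)
```

Since student_id is already a string column when it reaches to_csv, the leading zeros are written out unchanged.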
Solution with join: first call read_csv with the dtype parameter to read student_id as a string, and with skipinitialspace to remove the whitespace:
df1 = pd.read_csv(file1, dtype={'student_id': str}, skipinitialspace=True)
df2 = pd.read_csv(file2, skipinitialspace=True)
df = df1.join(df2.set_index('test_id'), on='test_id')
print (df)
test_id student_id result
0 1 01990 pass
1 2 02300 NaN
2 3 05555 fail
a = pd.read_csv(file1, dtype={'test_id': object})
df = pd.read_csv(file2, dtype={'test_id': object})
==============================================================
In[28]: pd.merge(a, df, on='test_id', how='left')
Out[28]:
test_id student_id result
0 01 1990 pass
1 02 2300 NaN
2 003 5555 fail

Upsample data and interpolate

I have the following dataframe:
Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426
...
I need to resample this to weekly resolution and interpolate between the points. The interpolation part is straightforward. The reindexing part, on the other hand, is a bit tricky, at least for me.
If I use the DataFrame.reindex() method, it just erases all the entries from the dataframe. I have tried to do it manually, using .loc to create new 'NaN' entries between consecutive months, but this overwrites the entries I already have.
Any clue how to do it? Thanks!
I have to assume a start date, I chose 2009-12-31.
To get resample to work, you need a pd.DatetimeIndex.
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
Replicable code
from io import StringIO
import pandas as pd
text = """Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426"""
df = pd.read_csv(StringIO(text), decimal=',', sep=r'\s+')
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
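One thing to watch for: in the pandas versions I am aware of, resample('W').interpolate() first snaps the data to the weekly grid, so monthly points that do not fall exactly on a weekly label (here 2010-03-31 and 2010-04-30) may not take part in the interpolation. A different technique that keeps every original point is to reindex onto the union of the monthly and weekly dates, interpolate by time, and then select the weekly dates. This is only a sketch, using the same assumed start date as above; weekly_index, combined and weekly are illustrative names:

```python
import io
import pandas as pd

text = """Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426"""

df = pd.read_csv(io.StringIO(text), decimal=',', sep=r'\s+')

# Same assumption as the answer: month 1 ends on 2010-01-31.
start_date = pd.Timestamp('2009-12-31')
df['Month'] = [start_date + pd.offsets.MonthEnd(int(m)) for m in df['Month']]
monthly = df.set_index('Month')

# Weekly (Sunday) dates spanning the monthly data.
weekly_index = pd.date_range(monthly.index.min(), monthly.index.max(), freq='W')

# Interpolate on the union so every monthly value is used as a knot,
# then keep only the weekly dates.
combined = monthly.reindex(monthly.index.union(weekly_index))
combined = combined.interpolate(method='time')
weekly = combined.loc[weekly_index]
print(weekly)
```

method='time' weights by the actual gap between timestamps, which matters here because the monthly and weekly dates are not evenly spaced once combined.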

Pandas: How to read ill-formatted time data?

The time in my dataframe consists of 2 columns, date and HrMn, with values such as 19900125 and 1200.
How can I read them into time, and plot a time series plot? (There are other value columns, for example, speed).
I think I can get away with time.strptime('19900125'+'1200','%Y%m%d%H%M')
But the problem is that, when read from the csv, HrMn at 0000 would be parsed as 0, so
time.strptime('19900125'+'0','%Y%m%d%H%M') will fail.
UPDATE:
My current approach:
# When reading the data, parse HrMn as string
df = pd.read_csv(uipath,header=0, skipinitialspace=True, dtype={'HrMn': str})
df['time']=df.apply(lambda x:datetime.strptime("{0} {1}".format(x['date'],x['HrMn']), "%Y%m%d %H%M"),axis=1)# df.temp_date
df.index= df['time']
# Then parse it again as int
df['HrMn'] = df['HrMn'].astype(int)
You can use pd.to_datetime after you've transformed it into a string that looks like a date:
def to_date_str(r):
    d = r.date[: 4] + '-' + r.date[4: 6] + '-' + r.date[6: 8]
    d += ' '+ r.HrMn[: 2] + ':' + r.HrMn[2: 4]
    return d
>>> pd.to_datetime(df[['date', 'HrMn']].apply(to_date_str, axis=1))
0 1990-01-25 12:00:00
dtype: datetime64[ns]
Edit
As @EdChum comments, you can do this even more simply as
pd.to_datetime(df.date.astype(str) + df.HrMn)
which string-concatenates the columns.
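If HrMn has already been read as an integer (so 0000 became 0), the same concatenation still works once the value is padded back to four digits. A minimal sketch, assuming a tiny made-up sample with a speed column like the one mentioned in the question (csv_text and stamp are illustrative names):

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real file has more columns.
csv_text = "date,HrMn,speed\n19900125,0000,3.1\n19900125,1200,4.2\n"

# Default parsing turns HrMn into the integers 0 and 1200.
df = pd.read_csv(io.StringIO(csv_text))

# Pad HrMn back to HHMM, concatenate with the date, then parse.
stamp = df['date'].astype(str) + df['HrMn'].astype(str).str.zfill(4)
df.index = pd.to_datetime(stamp, format='%Y%m%d%H%M')
print(df)
```

This keeps HrMn as an integer column, so there is no need to read it as a string and convert it back afterwards.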
You may parse the dates directly while reading the CSV, where HrMn is zero padded as HHMM, i.e. a value of 0 will represent 00:00:
from datetime import datetime  # pd.datetime is not available in recent pandas
df = pd.read_csv(
    uipath,
    header=0,
    skipinitialspace=True,
    dtype={'HrMn': str},
    parse_dates={'datetime': ['date', 'HrMn']},
    date_parser=lambda x, y: datetime.strptime('{0}{1:04.0f}'.format(x, int(y)),
                                               '%Y%m%d%H%M'),
    index_col='datetime'
)
I don't get why you call it "ill formatted". That format is actually quite common and pandas can parse it as is; just specify which columns you want to parse as timestamps.
df = pd.read_csv(uipath, skipinitialspace=True,
parse_dates=[['date', 'HrMn']])

Replace text with numbers using dictionary in pandas

I'm trying to replace months represented as a character (e.g. 'NOV') for their numerical counterparts ('-11-'). I can get the following piece of code to work properly.
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('NOV','-11-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('DEC','-12-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('JAN','-01-')
However, to avoid redundancy, I'd like to use a dictionary and .replace to replace the character variable for all months.
r_month1 = {'JAN':'-01-','FEB':'-02-','MAR':'-03-','APR':'-04-','MAY':'-05-','JUN':'-06-','JUL':'-07-','AUG':'-08-','SEP':'-09-','OCT':'-10-','NOV':'-11-','DEC':'-12-'}
df_cohorts.replace({'conversion_datetime': r_month1,'ltouch_datetime': r_month1})
When I enter the code above, my output dataset is unchanged. For reference, please see my sample data below.
User_ID ltouch_datetime conversion_datetime
001 11NOV14:13:12:56 11NOV14:16:12:00
002 07NOV14:17:46:14 08NOV14:13:10:00
003 04DEC14:17:46:14 04DEC15:13:12:00
Thanks!
Let me suggest a different approach: You could parse the date strings into a column of pandas Timestamps like this:
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = pd.to_datetime(df[col], format='%d%b%y:%H:%M:%S')
print(df)
# User_ID ltouch_datetime conversion_datetime
# 0 1 2014-11-11 13:12:56 2014-11-11 16:12:00
# 1 2 2014-11-07 17:46:14 2014-11-08 13:10:00
# 2 3 2014-12-04 17:46:14 2015-12-04 13:12:00
I would stop right here, since representing dates as Timestamps is the ideal form for the data in pandas.
However, if you need/want date strings with 3-letter months like 'NOV' converted to -11-, then you can convert the Timestamps with strftime and apply:
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = df[col].apply(lambda x: x.strftime('%d-%m-%y:%H:%M:%S'))
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
To answer your question literally: replace with a dictionary matches whole cell values by default, which is one reason your output was unchanged. To use Series.replace with your dictionary you need a column that contains the month abbreviations all by themselves. You can arrange for that by first calling Series.str.extract. Then you can join the columns back into one using apply:
import pandas as pd
import calendar
month_map = {calendar.month_abbr[m].upper():'-{:02d}-'.format(m)
             for m in range(1,13)}
df = pd.read_table('data', sep=r'\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    tmp = df[col].str.extract(r'(.*?)(\D+)(.*)')
    tmp[1] = tmp[1].replace(month_map)
    df[col] = tmp.apply(''.join, axis=1)
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
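As a shorter alternative to the extract/join route, Series.replace can also do substring replacement when called with regex=True, in which case the dictionary keys are treated as patterns. A minimal sketch, assuming the sample rows from the question are built directly as a DataFrame and using only the three months that appear in them:

```python
import pandas as pd

# Sample rows from the question; User_ID kept as text.
df = pd.DataFrame({
    'User_ID': ['001', '002', '003'],
    'ltouch_datetime': ['11NOV14:13:12:56', '07NOV14:17:46:14', '04DEC14:17:46:14'],
    'conversion_datetime': ['11NOV14:16:12:00', '08NOV14:13:10:00', '04DEC15:13:12:00'],
})

month_map = {'NOV': '-11-', 'DEC': '-12-', 'JAN': '-01-'}

for col in ('ltouch_datetime', 'conversion_datetime'):
    # regex=True makes the keys match inside the string, not only whole cells.
    # The result must be assigned back; replace does not modify in place.
    df[col] = df[col].replace(month_map, regex=True)

print(df)
```

Note that the code in the question also discarded the return value of replace, so even a matching replacement would not have shown up in df_cohorts.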
Finally, although you haven't asked for this directly, it's good to be aware that if your data is in a file, you can parse the datestring columns into Timestamps directly using
import pandas as pd
import datetime as DT
df = pd.read_table(
    'data', sep=r'\s+', parse_dates=[1,2],
    date_parser=lambda x: DT.datetime.strptime(x, '%d%b%y:%H:%M:%S'))
This might be the most convenient method of all (assuming you want Timestamps).

Time Series using numpy or pandas

I'm a beginner in the Python ecosystem and I have a problem working with time series data.
Below is my 1-minute OHLC data.
2011-11-01,9:00:00,248.50,248.95,248.20,248.70
2011-11-01,9:01:00,248.70,249.00,248.65,248.85
2011-11-01,9:02:00,248.90,249.25,248.70,249.15
...
2011-11-01,15:03:00,250.25,250.30,250.05,250.15
2011-11-01,15:04:00,250.15,250.60,250.10,250.60
2011-11-01,15:15:00,250.55,250.55,250.55,250.55
2011-11-02,9:00:00,245.55,246.25,245.40,245.80
2011-11-02,9:01:00,245.85,246.40,245.75,246.35
2011-11-02,9:02:00,246.30,246.45,245.75,245.80
2011-11-02,9:03:00,245.75,245.85,245.30,245.35
...
I'd like to extract the "CLOSE" value (the last field) from each row and convert the data to the following format:
2011-11-01, 248.70, 248.85, 249.15, ... 250.15, 250.60, 250.55
2011-11-02, 245.80, 246.35, 245.80, ...
...
I'd also like to calculate the highest Close value and its time (minute) for EACH DAY, like the following:
2011-11-01, 10:23:03, 250.55
2011-11-02, 11:02:36, 251.00
....
Any help would be much appreciated.
Thank you in advance,
You can use the pandas library. In the case of your data you can get the max as:
import pandas as pd
# Read in the data and parse the first two columns as a
# date-time and set it as index
df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None)
# get only the fifth column (close)
df = df[[5]]
# Resample to date frequency and get the max value for each day.
df.resample('D').max()
If you want to show also the times, keep them in your DataFrame as a column and pass a function that will determine the max close value and return that row:
>>> df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None,
usecols=[0, 1, 5], names=['d', 't', 'close'])
>>> df['time'] = df.index
>>> df.groupby(pd.Grouper(freq='D')).apply(lambda group: group.iloc[group['close'].argmax()])
close time
d_t
2011-11-01 250.60 2011-11-01 15:04:00
2011-11-02 246.35 2011-11-02 09:01:00
And if you want a list of the prices per day, then just do a groupby per day and return the list of all the prices from every group using apply on the grouped object:
>>> df.groupby(lambda dt: dt.date()).apply(lambda group: list(group['close']))
2011-11-01 [248.7, 248.85, 249.15, 250.15, 250.6, 250.55]
2011-11-02 [245.8, 246.35, 245.8, 245.35]
For more information take a look at the docs: Time Series
Update for the concrete data set:
The problem with your data set is that you have some days without any data, so the function applied to each daily group should handle those cases:
def func(group):
    if len(group) == 0:
        return None
    return group.iloc[group['close'].argmax()]
df.groupby(pd.Grouper(freq='D')).apply(func).dropna()
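Another way to sidestep the empty-day problem is to group on the calendar date itself, since a groupby only creates groups for dates that actually occur in the data. The sketch below uses a few of the rows from the question; the column names (date, time, open, high, low, close) are assumptions, because the file has no header row:

```python
import io
import pandas as pd

# A subset of the rows from the question.
raw = """2011-11-01,9:00:00,248.50,248.95,248.20,248.70
2011-11-01,9:01:00,248.70,249.00,248.65,248.85
2011-11-01,15:04:00,250.15,250.60,250.10,250.60
2011-11-01,15:15:00,250.55,250.55,250.55,250.55
2011-11-02,9:00:00,245.55,246.25,245.40,245.80
2011-11-02,9:01:00,245.85,246.40,245.75,246.35
2011-11-02,9:02:00,246.30,246.45,245.75,245.80
"""

names = ['date', 'time', 'open', 'high', 'low', 'close']  # assumed names
df = pd.read_csv(io.StringIO(raw), header=None, names=names)

# Pad '9:00:00' to '09:00:00' so every timestamp has the same layout.
stamps = df['date'] + ' ' + df['time'].str.zfill(8)
df.index = pd.DatetimeIndex(pd.to_datetime(stamps))

close = df['close']
by_day = close.groupby(close.index.date)

# idxmax gives the timestamp of the daily maximum, max gives its value.
best = pd.DataFrame({'time': by_day.idxmax(), 'close': by_day.max()})
print(best)
```

If two minutes share the same maximum close, idxmax reports the first one.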
