Pandas DateTimeIndex - python

I need a DateTimeIndex for my dataframe. The problem is my source file: the date header is Date(dd-mm-yy), but the actual date data has the format dd:mm:yyyy (e.g. 24:06:1970). I have lots of source files, so manually changing the header would be tedious and not good programming practice. How would one go about addressing this from within Python?
Perhaps by creating a copy of the source file, opening it, searching for the date header, changing it, and then closing it? I'm new to Python, so I'm not exactly sure if this is the best way to go about it, and if it is, how do I implement such code?
Currently I have this:
df = pd.read_csv('test.csv',
                 skiprows=4,
                 parse_dates={'stamp': [0, 1]},
                 na_values='NaN',
                 index_col='stamp')
Where column 0 is the date column in question and column 1 is the time column.
I don't get any error messages, just erroneous data.
Sorry, I should have added a snippet of the CSV file in question. I've now provided it below:
some stuff I dont want
some stuff I dont want
some stuff I dont want
some stuff I dont want
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day
01:07:2013,05:40:41,182.236586,659,1638.400000
01:07:2013,05:44:03,182.238924,659,1638.400000
01:07:2013,05:47:48,182.241528,659,1638.400000
01:07:2013,05:52:21,182.244687,659,1638.400000

I think the main problem is that the header line Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day only specifies some of the column names, so pandas cannot infer what to do with the rest of the data.
Try skipping the file's column-name line, passing pandas a list of column names instead, and defining your own date_parser:
def my_parser(date, time):
    import datetime
    DATE_FORMAT = '%d:%m:%Y'
    TIME_FORMAT = '%H:%M:%S'
    date = datetime.datetime.strptime(date, DATE_FORMAT)
    time_weird_date = datetime.datetime.strptime(time, TIME_FORMAT)
    return datetime.datetime.combine(date, time_weird_date.time())
import pandas as pd
from cStringIO import StringIO

data = """\
some stuff I dont want
some stuff I dont want
some stuff I dont want
some stuff I dont want
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day
01:07:2013,05:40:41,182.236586,659,1638.400000
01:07:2013,05:44:03,182.238924,659,1638.400000
01:07:2013,05:47:48,182.241528,659,1638.400000
01:07:2013,05:52:21,182.244687,659,1638.400000
"""

pd.read_csv(StringIO(data), skiprows=5, index_col=0,
            parse_dates={'datetime': ['date', 'time']},
            names=['date', 'time', 'Julian_Day', 'col_2', 'col_3'],
            date_parser=my_parser)
This should give you what you want.
As you said you are new to Python, I should add that the from cStringIO import StringIO, data = """...""", and StringIO(data) parts are only there so I could include the data directly in this answer in a runnable form. In your own code you just need pd.read_csv(my_data_filename, ...).
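For reference, a minimal sketch of the same call against the file on disk (using the 'test.csv' name from the question and the column list from above; adjust both for your real files):

df = pd.read_csv('test.csv', skiprows=5, index_col=0,
                 parse_dates={'datetime': ['date', 'time']},
                 names=['date', 'time', 'Julian_Day', 'col_2', 'col_3'],
                 date_parser=my_parser)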

Your dates are really weird; you should just fix them all. If you really can't fix them on disk for some reason, I guess you can do it inline:
import re
from StringIO import StringIO

s = open('test.csv').read()

def rep(m):
    return '%s-%s-%sT' % (m.group('YY'), m.group('mm'), m.group('dd'))

s = re.sub(r'^(?P<dd>\d\d):(?P<mm>\d\d):(?P<YY>\d{4}),', rep, s, flags=re.M)
df = pd.read_csv(StringIO(s), skiprows=5, index_col=0,
                 names=['time', 'Julian_Day', 'col_2', 'col_3'])
This just takes the weird dates like 01:07:2013,05:40:41 and formats them ISO style like 2013-07-01T05:40:41. Then pandas can treat them normally. Bear in mind that these are going to be in UTC.
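If you then want an actual DatetimeIndex (the goal in the original question), a small follow-up sketch on the dataframe built above would be:

# Convert the ISO-formatted index strings into a proper DatetimeIndex
df.index = pd.to_datetime(df.index)
print(df.index.dtype)  # should now be datetime64[ns]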

Related

Why does Python Pandas read the string of an Excel file as datetime

I have the following question.
I have Excel files as follows:
When I read the file using
df = pd.read_excel(file, dtype=str)
the first row is turned into 2003-02-14 00:00:00 while the rest are displayed as they are.
How do I prevent pd.read_excel() from converting its values into datetime or something else?
Thanks!
As @ddejohn correctly said in the comments, the behavior you are facing actually comes from Excel automatically converting the data to dates. Pandas therefore receives that data as dates and has to treat it afterwards to get back the format you expect, since, as you say, you cannot modify the input Excel file.
Here is a short script to make it work as you expect:
import pandas as pd

def rev(x: str) -> str:
    '''
    Converts '2003-02-14 00:00:00' to '14.02.03'.
    '''
    hours = '00:00:00'
    if hours not in x:
        return x
    y = x.split()[0]
    y = y.split('-')
    return '.'.join([i[-2:] for i in y[::-1]])

file = r'C:\your\folder\path\Classeur1.xlsx'
df = pd.read_excel(file, dtype=str)
df['column'] = df['column'].apply(rev)
Replace df['column'] with your actual column name.
You then get the desired format in your dataframe.
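As an alternative sketch (not the author's approach, and assuming the column is called 'column' as above), you could let pandas re-parse the Excel-mangled values and format them back, leaving untouched rows as they are:

import pandas as pd

df = pd.read_excel(file, dtype=str)
# Rows Excel turned into '2003-02-14 00:00:00' parse; all other rows become NaT
parsed = pd.to_datetime(df['column'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
# Reformat the parsed rows as 'dd.mm.yy' and keep the original text elsewhere
df['column'] = parsed.dt.strftime('%d.%m.%y').fillna(df['column'])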

Turning a Pandas column into a text file separated by line breaks

I would like to create a txt file where every line is a so-called "ticker symbol" (= the symbol for a stock). As a first step, I downloaded all the tickers I want via a Wikipedia API:
import pandas as pd
import wikipedia as wp

html1 = wp.page("List of S&P 500 companies").html().encode("UTF-8")
df = pd.read_html(html1, header=0)[0]
df = df.drop(['SEC filings', 'CIK', 'Headquarters Location', 'Date first added', 'Founded'], axis=1)
df.columns = df.columns.str.replace('Symbol', 'Ticker')
Secondly, I would like to create a txt file as mentioned above with all the ticker names from the "Ticker" column of df. To do so, I probably have to do something similar to:
f = open("tickertest.txt","w+")
f.write("MMM\nABT\n...etc.")
f.close()
Now my problem: does anybody know how to bring the Ticker column of df into one big string with a \n between every ticker, i.e. with every ticker on a new line?
You can use to_csv for this.
df.to_csv("test.txt", columns=["Ticker"], header=False, index=False)
This provides flexibility to include other columns, column names, and index values at some future point (should you need to do some sleuthing, or in case your boss asks for more information). You can even change the separator. This would be a simple modification (obvious changes, e.g.):
df.to_csv("test.txt", columns=["Ticker", "Symbol",], header=True, index=True, sep="\t")
I think the benefit of this method over jfaccioni's answer is flexibility and ease of adaptability. It also gets you away from explicitly opening a file. However, if you still want to explicitly open a file, you should consider using "with", which automatically closes the file when execution leaves the block, e.g.:
with open("test.txt", "w") as fid:
fid.write("MMM\nABT\n...etc.")
This should do the trick:
'\n'.join(df['Ticker'].astype(str).values)
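If you want to go straight from the column to the text file, a small sketch combining this one-liner with a with block (the filename is just an example):

tickers = '\n'.join(df['Ticker'].astype(str))
with open("tickertest.txt", "w") as fid:
    fid.write(tickers + '\n')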

How to correctly import and plot time index format HH:MM:SS.fs into Pandas dataframe

I'm new to Python and I'm trying to correctly parse a .txt data file into pandas, using the time column (format HH:MM:SS.fs) as the index for the dataframe. An example line of the .txt input file looks like this:
00:07:01.250 10.7
I've tried the following code using the datetime function; however, this adds today's date in addition to importing the timestamp, which I don't want to be displayed. I've also read about the timestamp and timedelta functions but can't see how these would work for this use case.
df = pd.read_csv(f, engine='python', delimiter='\t+', skiprows=23, header=None, usecols=[0,3], index_col=0, names=['Time (HH:MM:SS.fs)', 'NL (%)'], decimal=',')
df.index = pd.to_datetime(df.index)
An example line of the output looks like this:
2019-09-26 00:07:01.250 10.7
But what I want is this (without the date):
00:07:01.250 10.7
Any ideas on what I'm doing wrong?
You could read the column as a string, using:
df.some_column = df.some_column.astype('str')
And then use the "format" argument of "to_datetime". It uses the python "strptime" method to convert a string to a datetime, and that method let you specify the exact format of the converted object, as the following link you show:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
I hope this helps. I wish I had more time to test, but unfortunately I don't.
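For what it's worth, a rough sketch of what that could look like for values such as '00:07:01.250' (the series here is made up): parse with an explicit format and keep only the time-of-day, or use timedeltas instead.

import pandas as pd

s = pd.Series(['00:07:01.250', '00:07:02.500'])
parsed = pd.to_datetime(s, format='%H:%M:%S.%f')  # still carries a placeholder date
times_only = parsed.dt.time                       # datetime.time values, no date shown
deltas = pd.to_timedelta(s)                       # alternative: timedelta values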

Parse date-time while reading a CSV file with pandas

I am trying to parse dates while I am reading my data from a CSV file. The command that I use is
df = pd.read_csv('/Users/n....', names=names, parse_dates=['date'])
and it generally works on my files.
But I have a couple of data sets with a variety of date formats. I mean, one date format looks like this (09/20/15 09:59) while other lines in the same file look like this (2015-09-20 10:22:01.013), and the command I wrote above doesn't work on these files. It works when I delete parse_dates=['date'], but then I can't use the date column in datetime format; it is read as an integer instead. I would appreciate it if anyone could answer this!
Pandas read_csv accepts a date_parser argument with which you can define your own date-parsing function. So, for example, since in your case you have two different datetime formats, you can simply do:
import datetime

def date_parser(d):
    try:
        # first format from the question, e.g. 09/20/15 09:59
        d = datetime.datetime.strptime(d, '%m/%d/%y %H:%M')
    except ValueError:
        try:
            # second format, e.g. 2015-09-20 10:22:01.013
            d = datetime.datetime.strptime(d, '%Y-%m-%d %H:%M:%S.%f')
        except ValueError:
            pass  # neither format matches, do something about it
    return d

df = pd.read_csv('/Users/n....',
                 names=names,
                 parse_dates=['date1', 'date2'],
                 date_parser=date_parser)
You can then parse those dates in different formats in those columns.
Like this:
df = pd.read_csv(file, names=names)
df['date'] = pd.to_datetime(df['date'])
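As a hedged illustration with the two sample values from the question (note that newer pandas versions may require format='mixed' for a column mixing several datetime formats):

import pandas as pd

s = pd.Series(['09/20/15 09:59', '2015-09-20 10:22:01.013'])
try:
    parsed = pd.to_datetime(s, format='mixed')  # pandas >= 2.0
except ValueError:
    parsed = pd.to_datetime(s)                  # older pandas infers per element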

Read specific columns with pandas or other python module

I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
59 which in the header is star_name
60 which in the header is ra.
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are moved around again in the future, as long as they keep the correct names.
Until now I have tried various ways using the csv module and, recently, the pandas module, both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
An easy way to do this is to use the pandas library like this:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The problem here was the leading spaces in the header, which skipinitialspace removes, so ' star_name' becomes 'star_name'.
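A quick illustration of that header issue with a tiny made-up CSV (the real file's header starts with '# name', as shown in the question):

from io import StringIO
import pandas as pd

csv = "# name, star_name, ra\n11 Com b, 11 Com, 185.1791667\n"
# Without skipinitialspace the header fields keep their leading spaces
# (' star_name', ' ra'), so usecols=['star_name', 'ra'] would not match them.
df = pd.read_csv(StringIO(csv), skipinitialspace=True, usecols=['star_name', 'ra'])
print(df)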
According to the latest pandas documentation, you can read a csv file selecting only the columns that you want to read:
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)
Here we use usecols, which reads only the selected columns into the dataframe.
We use low_memory=True so that the file is processed in chunks internally.
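If memory is the actual concern, a related sketch (file name and column names are the same placeholders as above) is to read the file explicitly in chunks and keep only the wanted columns:

import pandas as pd

chunks = pd.read_csv('some_data.csv', usecols=['col1', 'col2'], chunksize=10000)
df = pd.concat(chunks, ignore_index=True)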
The answers above are for Python 2, so for Python 3 users I am giving this answer. You can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)
I got a solution to the above problem in a different way: although I still read the entire csv file, I tweak the display part to show only the desired content.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print df[['star_name', 'ra']]
This can help in some scenarios when learning the basics and filtering data by column in a dataframe.
I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])
