I am trying to do some simple analyses on the Kenneth French industry portfolios (my first time with Pandas/Python); the data is in txt format (see the link in the code). Before I can do any computations, I first want to load it into a Pandas dataframe properly, but I've been struggling with this for hours:
import urllib.request
import os.path
import zipfile
import pandas as pd
import numpy as np
# paths
url = 'http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/48_Industry_Portfolios_CSV.zip'
csv_name = '48_Industry_Portfolios.CSV'
local_zipfile = '{0}/data.zip'.format(os.getcwd())
local_file = '{0}/{1}'.format(os.getcwd(), csv_name)
# download data
if not os.path.isfile(local_file):
    print('Downloading and unzipping file!')
    urllib.request.urlretrieve(url, local_zipfile)
    zipfile.ZipFile(local_zipfile).extract(csv_name, os.path.dirname(local_file))
# read from file
df = pd.read_csv(local_file,skiprows=11)
df.rename(columns={'Unnamed: 0' : 'dates'}, inplace=True)
# build new dataframe
first_stop = df['dates'][df['dates']=='201412'].index[0]
df2 = df[:first_stop]
# convert date to datetime object
pd.to_datetime(df2['dates'], format = '%Y%m')
df2.index = df2.dates
All the columns except dates represent financial returns. However, due to the file formatting, these are now strings. According to the Pandas docs, this should do the trick:
df2.convert_objects(convert_numeric=True)
But the columns remain strings. Other suggestions are to loop over the columns (see for example "pandas convert strings to float for multiple columns in dataframe"):
for d in df2.columns:
    if d != 'dates':
        df2[d] = df2[d].map(lambda x: float(x)/100)
But this gives me the following warning:
/home/<xxxx>/Downloads/pycharm-community-4.5/helpers/pydev/pydevconsole.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
try:
I have read the documentation on views vs. copies, but I'm having difficulty understanding why it is a problem in my case but not in the code snippets in the question I linked to. Thanks
Edit:
df2=df2.convert_objects(convert_numeric=True)
This does the trick, although I receive a deprecation warning (strangely enough, it is not mentioned in the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html)
Some of df2:
dates Agric Food Soda Beer Smoke Toys Fun \
dates
192607 192607 2.37 0.12 -99.99 -5.19 1.29 8.65 2.50
192608 192608 2.23 2.68 -99.99 27.03 6.50 16.81 -0.76
192609 192609 -0.57 1.58 -99.99 4.02 1.26 8.33 6.42
192610 192610 -0.46 -3.68 -99.99 -3.31 1.06 -1.40 -5.09
192611 192611 6.75 6.26 -99.99 7.29 4.55 0.00 1.82
Edit2: the solution is actually simpler than I thought:
df2.index = pd.to_datetime(df2['dates'], format = '%Y%m')
df2 = df2.astype(float)/100
I would try the following to force convert everything into floats:
df2=df2.astype(float)
You can convert a specific column to float (or any numeric type, for that matter) with:
df["column_name"] = pd.to_numeric(df["column_name"])
Posting this because pandas.convert_objects is deprecated in pandas 0.20.1
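If all the return columns should be numeric, here is a minimal sketch (my own, not from the answer above) that applies pd.to_numeric across the whole frame; the toy df2 stands in for the frame from the question:
import pandas as pd

df2 = pd.DataFrame({'Agric': ['2.37', '2.23'], 'Food': ['0.12', '2.68']})  # stand-in data
# Convert every column; errors='coerce' turns unparseable strings into NaN instead of raising.
df2 = df2.apply(pd.to_numeric, errors='coerce')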
You need to assign the result of convert_objects, as there is no inplace param:
df2=df2.convert_objects(convert_numeric=True)
You refer to the rename method, but that one has an inplace param, which you set to True.
Most operations in pandas return a copy, and some have an inplace param; convert_objects is one that does not. This is probably because, if the conversion fails, you don't want to blat over your data with NaNs.
Also, the deprecation warning is there to split out the different conversion routines, presumably so you can specialise the params, e.g. a format string for datetime etc.
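On the SettingWithCopyWarning itself: df2 was created by slicing df, so pandas cannot tell whether assigning to df2 writes into df or into an independent copy. A minimal sketch of the usual fix, taking an explicit .copy() when slicing (reusing first_stop from the question):
df2 = df[:first_stop].copy()  # an independent frame, not a view of df
for d in df2.columns:
    if d != 'dates':
        df2[d] = df2[d].astype(float) / 100  # no warning now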
Related
Date Daily minimum temperatures in Melbourne, Australia
1/1/1981 20.7
1/2/1981 17.9
1/3/1981 18.8
1/4/1981 14.6
1/5/1981 15.8
1/6/1981 15.8
1/7/1981 15.8
My code:
from pandas import read_csv
filename = 'daily-minimum-temperatures-in-me.csv'
series = read_csv(filename, header=0, index_col=0, parse_dates=True, squeeze=True)
After execution, I get an error.
Any help is appreciated
I have no problem. Do you have the right indentation?
from pandas import read_csv
filename = 'file.csv'
series = read_csv(filename, header=0, index_col=0, parse_dates=True, squeeze=True)
The csv document is quite suspicious.
I assume that the first element is the date and the second is the temperature? Then there must be a separator like ',' between these values. If the first line (Date Daily minimum temperatures in Melbourne, Australia) is inside the csv file, you should avoid the ',' between Melbourne and Australia, and maybe change the complete header to: date, temp
I just removed the last line of the file and it works.
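If you would rather not edit the file by hand, here is a minimal sketch that drops a trailing junk line at read time (assuming the offending line really is the last one in the file):
from pandas import read_csv

# skipfooter requires the python engine; it discards the last line of the file.
series = read_csv('daily-minimum-temperatures-in-me.csv', header=0,
                  index_col=0, parse_dates=True, squeeze=True,
                  engine='python', skipfooter=1)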
I have the following data set in a csv file:
vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---44:13.0---18.13533401---19.10000038---316---389.1700134
I am trying to write a function launch_time() with two inputs (dataframe, vehicle name) that returns the first time the gspd is reported above 10.0 m/s.
The output time must be converted from a string (HH:MM:SS.SS) to a minutes-after-12:00 format.
It should look something like this:
>>> launch_time(df, 'veh_1')
30.0
I will use this function to iterate through each vehicle and then need to record the results into a list of tuples with the format (v_name, launch time) in launch sequence order.
It should look something like this:
[('veh_1', 30.0), ('veh_2', 15.0)]
Disclosure: my python/pandas knowledge is very entry-level.
You can use read_csv with the separator -{3,}, a regex that matches three or more - characters:
import pandas as pd
from io import StringIO
temp=u"""vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---45:13.0---18.13533401---19.10000038---316---389.1700134"""
#after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), sep="-{3,}", engine='python')
print (df)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
0 veh_1 17:19.5 0.163472 0.14 213 273.890015
1 veh_2 17:19.5 0.505787 0.17 214 273.910004
2 veh_3 17:19.8 0.173485 0.11 213 273.980011
3 veh_4 44:12.4 18.646734 19.23 316 388.929993
4 veh_5 45:13.0 18.135334 19.10 316 389.170013
Then convert the time column with to_timedelta, filter the rows above 10 m/s by boolean indexing, sort with sort_values, group on vehicle using groupby and take the first row of each group, and finally zip the vehicle and time columns together and convert to a list:
df.time = pd.to_timedelta('00:' + df.time).\
          astype('timedelta64[m]').astype(int)
req = df[df['gspd[m/s]'] > 10].\
sort_values('time', ascending=True).\
groupby('vehicle', as_index=False).head(1)
print(req)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
4 veh_5 45 18.135334 19.10 316 389.170013
3 veh_4 44 18.646734 19.23 316 388.929993
L = list(zip(req['vehicle'],req['time']))
print (L)
[('veh_5', 45), ('veh_4', 44)]
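To get the launch_time(df, v_name) helper the question asks for, here is a minimal sketch built on the converted frame above (the helper name and the None fallback are my own):
def launch_time(df, v_name):
    # Rows for this vehicle where ground speed exceeds 10 m/s;
    # df.time is already integer minutes after the conversion above.
    hits = df[(df['vehicle'] == v_name) & (df['gspd[m/s]'] > 10)]
    return hits['time'].min() if not hits.empty else None

print(launch_time(df, 'veh_4'))   # 44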
I'm working with a large data frame and I'm struggling to find an efficient way to eliminate specific dates. Note that I'm trying to eliminate any measurements from a specific date.
Pandas has this great function, where you can call:
df.ix['2016-04-22']
and pull all rows from that day. But what if I want to eliminate all rows from '2016-04-22'?
I want a function like this:
df.ix[~'2016-04-22']
(but that doesn't work)
Also, what if I want to eliminate a list of dates?
Right now, I have the following solution:
import numpy as np
import pandas as pd
from numpy import random
###Create a sample data frame
dates = [pd.Timestamp('2016-04-25 06:48:33'), pd.Timestamp('2016-04-27 15:33:23'), pd.Timestamp('2016-04-23 11:23:41'), pd.Timestamp('2016-04-28 12:08:20'), pd.Timestamp('2016-04-21 15:03:49'), pd.Timestamp('2016-04-23 08:13:42'), pd.Timestamp('2016-04-27 21:18:22'), pd.Timestamp('2016-04-27 18:08:23'), pd.Timestamp('2016-04-27 20:48:22'), pd.Timestamp('2016-04-23 14:08:41'), pd.Timestamp('2016-04-27 02:53:26'), pd.Timestamp('2016-04-25 21:48:31'), pd.Timestamp('2016-04-22 12:13:47'), pd.Timestamp('2016-04-27 01:58:26'), pd.Timestamp('2016-04-24 11:48:37'), pd.Timestamp('2016-04-22 08:38:46'), pd.Timestamp('2016-04-26 13:58:28'), pd.Timestamp('2016-04-24 15:23:36'), pd.Timestamp('2016-04-22 07:53:46'), pd.Timestamp('2016-04-27 23:13:22')]
values = random.normal(20, 20, 20)
df = pd.DataFrame(index=dates, data=values, columns=['values']).sort_index()
### This is the list of dates I want to remove
removelist = ['2016-04-22', '2016-04-24']
This for loop basically grabs the index for the dates I want to remove, eliminates those from the index of the main dataframe, and then positively selects the remaining dates (i.e., the good dates) from the dataframe.
for r in removelist:
    elimlist = df.ix[r].index.tolist()
    ind = df.index.tolist()
    culind = [i for i in ind if i not in elimlist]
    df = df.ix[culind]
Is there anything better out there?
I've also tried indexing by the rounded date+1 day, so something like this:
df[~((df['Timestamp'] < r+pd.Timedelta("1 day")) & (df['Timestamp'] > r))]
But this gets really cumbersome and (at the end of the day) I'll still be using a for loop when I need to eliminate n specific dates.
There's got to be a better way! Right? Maybe?
You can create a boolean mask using a list comprehension.
>>> df[[d.date() not in pd.to_datetime(removelist) for d in df.index]]
values
2016-04-21 15:03:49 28.059520
2016-04-23 08:13:42 -22.376577
2016-04-23 11:23:41 40.350252
2016-04-23 14:08:41 14.557856
2016-04-25 06:48:33 -0.271976
2016-04-25 21:48:31 20.156240
2016-04-26 13:58:28 -3.225795
2016-04-27 01:58:26 51.991293
2016-04-27 02:53:26 -0.867753
2016-04-27 15:33:23 31.585201
2016-04-27 18:08:23 11.639641
2016-04-27 20:48:22 42.968156
2016-04-27 21:18:22 27.335995
2016-04-27 23:13:22 13.120088
2016-04-28 12:08:20 53.730511
Same idea as @Alexander's, but using properties of the DatetimeIndex and numpy.in1d:
mask = ~np.in1d(df.index.date, pd.to_datetime(removelist).date)
df = df.loc[mask, :]
Timings:
%timeit df.loc[~np.in1d(df.index.date, pd.to_datetime(removelist).date), :]
1000 loops, best of 3: 1.42 ms per loop
%timeit df[[d.date() not in pd.to_datetime(removelist) for d in df.index]]
100 loops, best of 3: 3.25 ms per loop
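A variant on the same idea, assuming the index is a DatetimeIndex: normalize every timestamp to midnight and use isin, which stays vectorized without materializing Python date objects:
# Keep only rows whose calendar day is not in removelist.
mask = ~df.index.normalize().isin(pd.to_datetime(removelist))
df = df[mask]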
I'm importing text files into Pandas data frames. The number of columns can vary, and the names vary too.
However, the header line always starts with ~A, and read_csv interprets this as the name of the first column; subsequently, all the column names are shifted one step to the right.
Earlier I used np.genfromtxt() with the argument deletechars = 'A__', but I haven't found any equivalent function for pandas. Is there a way to exclude the name when reading or, as a second option, to delete the first name but keep the columns intact?
I'm reading the file like this:
in_file = pd.read_csv(file_name, header=header_row,delim_whitespace=True)
Now I get this (just as the text file looks):
~A DEPTH TIME TX1 TX2 TX3 OUT6
11705 2.94 10525.38 126.14 169.71 353.86 4.59 NaN
11706 2.93 10525.38 NaN 168.29 368.00 4.75 NaN
11707 2.92 10525.38 126.14 166.71 369.86 4.93 NaN
but I want to get this:
DEPTH TIME TX1 TX2 TX3 OUT6
11705 2.94 10525.38 126.14 169.71 353.86 4.59
11706 2.93 10525.38 NaN 168.29 368.00 4.75
11707 2.92 10525.38 126.14 166.71 369.86 4.93
Why not just post-process?
df = ...
df_modified = df[df.columns[:-1]]
df_modified.columns = df.columns[1:]
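Wrapped up as a reusable helper, a minimal sketch (the function name is mine); it assumes the spilled last column holds nothing but NaN:
def fix_shifted_header(df):
    fixed = df.iloc[:, :-1].copy()   # drop the trailing all-NaN column
    fixed.columns = df.columns[1:]   # shift every name one step to the left
    return fixed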
How about reading the file twice? First, use pd.read_csv() but skip the header row. Second, use readline() on the open file to parse the header and drop its first item. The rest can then be assigned as your dataframe's columns.
in_file = pd.read_csv(file_name, delim_whitespace=True, header=None, skiprows=[0])
with open(file_name, 'rt') as h:
    hdrs = h.readline().split()  # the file is whitespace-delimited, so split on whitespace
in_file.columns = hdrs[1:]
Choose which columns to import:
in_file = pd.read_csv(file_name, header=header_row,
                      delim_whitespace=True,
                      usecols=['DEPTH','TIME','TX1','TX2','TX3','OUT6'])
Ok, so if the number of columns varies,
and you want to remove the first column (whose name varies),
AND you do not want to do this in a post-read_csv phase...
then
.... (Drum Roll)
import pandas as pd
#Tim.csv is
#1,2,3
#2,3,4
#3,4,5
headers = ['BADCOL', 'Happy', 'Sad']
data = pd.read_csv('tim.csv', names=headers).iloc[:, 1:]
Data will now look like
   Happy  Sad
0      2    3
1      3    4
2      4    5
Not sure if this counts as Post-CSV processing or not...
I have two .csv files with the same initial column-header:
NAME RA DEC Mean_I1 Mean_I2 alpha_K24 class alpha_K8 class.1 Av avgAv
Mon-000101 100.27242 9.608597 11.082 10.034 0.39 I 0.39 I 31.1 31.1
Mon-000171 100.29230 9.522860 14.834 14.385 0.45 I 0.45 I 33.7 33.7
and
NAME Sdev_I1 Sdev_I2
Mon-000002, 0.023, 0.028000001,
Mon-000003, 0.016000001, 0.016000001,
I want to merge the two together so that the 'NAME' columns match up, basically just adding the two Sdev_I1/Sdev_I2 columns to the end of the first sample. I've tried...
import pandas as pd
df1 = pd.read_csv('h7.csv',sep=r'\s+')
df2 = pd.read_csv('NEW.csv',sep=r'\s+')
df = pd.merge(df1,df2)
df.to_csv('Newh7.csv',index=False)
but it's printing 'NAME' twice, and everything seems to be out of order, with a lot of added zeroes as well. I thought I had solved this one a while back, but I've totally lost it. Help would be appreciated. Thanks.
Here's the output file:
NAME,RA,DEC,Mean_I1,Mean_I2,alpha_K24,class,alpha_K8,class.1,Av,avgAv,Sdev_I1,Sdev_I2
Seems you didn't strip the comma symbols in the second csv; you might use converters to strip them:
In [81]: converters = {
    'NAME': lambda x: x[:-1],
    'Sdev_I1': lambda x: float(x[:-1]),
    'Sdev_I2': lambda x: float(x[:-1])
}
In [82]: pd.read_csv('NEW.csv',sep=r'\s+', converters=converters)
Out[82]:
NAME Sdev_I1 Sdev_I2
0 Mon-000002 0.023 0.028
1 Mon-000003 0.016 0.016
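With the commas stripped, merging on an explicit key should stop NAME from showing up twice. A minimal sketch, reusing the converters above (how='left' keeps every row of h7.csv even when a NAME has no Sdev entry, which is an assumption about the desired behaviour):
df1 = pd.read_csv('h7.csv', sep=r'\s+')
df2 = pd.read_csv('NEW.csv', sep=r'\s+', converters=converters)
# Merge on the shared key so the Sdev columns line up with the right NAME.
df = pd.merge(df1, df2, on='NAME', how='left')
df.to_csv('Newh7.csv', index=False)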