Python newbie here who's switching from R to Python for statistical modeling and analysis.
I am working with a Pandas data structure and am trying to restructure a column that contains 'date' values. In the data below, you'll notice that some values use the 'Mar-10' format while others use the '12/1/13' format. How can I restructure a column that contains 'dates' (technically not a date structure) so that they are uniform (all in the same format)? I'd prefer that they all follow the 'Mar-10' format. Can anyone help?
In [34]: dat["Date"].unique()
Out[34]:
array(['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10',
'Jan-11', 'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11',
'Jul-11', 'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11',
'Jan-12', 'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12',
'Jul-12', 'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12',
'Jan-13', 'Feb-13', 'Mar-13', 'Apr-13', 'May-13', '6/1/13',
'7/1/13', '8/1/13', '9/1/13', '10/1/13', '11/1/13', '12/1/13',
'1/1/14', '2/1/14', '3/1/14', '4/1/14', '5/1/14', '6/1/14',
'7/1/14', '8/1/14'], dtype=object)
In [35]: isinstance(dat["Date"], basestring) # not a string?
Out[35]: False
In [36]: type(dat["Date"]).__name__
Out[36]: 'Series'
I think your dates are already strings, try:
import numpy as np
import pandas as pd
date = pd.Series(np.array(['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10',
'Jan-11', 'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11',
'Jul-11', 'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11',
'Jan-12', 'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12',
'Jul-12', 'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12',
'Jan-13', 'Feb-13', 'Mar-13', 'Apr-13', 'May-13', '6/1/13',
'7/1/13', '8/1/13', '9/1/13', '10/1/13', '11/1/13', '12/1/13',
'1/1/14', '2/1/14', '3/1/14', '4/1/14', '5/1/14', '6/1/14',
'7/1/14', '8/1/14'], dtype=object))
date.map(type).value_counts()
# date contains 56 strings
# <type 'str'> 56
# dtype: int64
This shows the type of each individual element, rather than the type of the column they're contained in.
Your best bet for dealing sensibly with them is to convert them into pandas datetime objects:
pd.to_datetime(date)
Out[18]:
0 2014-01-10
1 2014-02-10
2 2014-03-10
3 2014-04-10
4 2014-05-10
5 2014-06-10
6 2014-07-10
7 2014-08-10
8 2014-09-10
...
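Note that the default parse above reads 'Jan-10' as 10 January of the current year, not as January 2010. You may have to experiment with the formats somewhat, e.g. parsing each format separately
and then merging the results back together:
# Convert the Aug-10 style strings (values that don't match become NaT)
pd.to_datetime(date, format='%b-%y', errors='coerce')
# Convert the 9/1/13 style strings (values that don't match become NaT)
pd.to_datetime(date, format='%m/%d/%y', errors='coerce')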
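One way to merge the two parses back together and get the 'Mar-10' style the question asks for is sketched below. The shortened sample list and the variable names (`mon_yr`, `mdy`, `parsed`, `uniform`) are made up for illustration, and the month abbreviations produced by `%b` assume an English locale.

```python
import pandas as pd

# Shortened sample with both formats (illustrative subset of the question's data)
date = pd.Series(['Jan-10', 'May-13', '6/1/13', '12/1/13', '8/1/14'])

# Parse each format separately; values that do not match become NaT
mon_yr = pd.to_datetime(date, format='%b-%y', errors='coerce')
mdy = pd.to_datetime(date, format='%m/%d/%y', errors='coerce')

# Fill the gaps of the first parse with the results of the second
parsed = mon_yr.fillna(mdy)

# Render everything back as 'Mon-yy' strings
uniform = parsed.dt.strftime('%b-%y')
print(uniform.tolist())
```

Keeping `parsed` as a datetime column is usually more useful for analysis; the string form is only needed for display.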
I can never remember these time formatting codes off the top of my head but there's a good rundown of them here.
Using pd.read_csv I am importing a dataframe. One of the columns contains lists of strings. For example:
>>> df['topic'].head(5)
0 ['ECONOMIC PERFORMANCE', 'ECONOMICS', 'EQUITY ...
1 ['CAPACITY/FACILITIES', 'CORPORATE/INDUSTRIAL']
2 ['PERFORMANCE', 'ACCOUNTS/EARNINGS', 'CORPORAT...
3 ['PERFORMANCE', 'ACCOUNTS/EARNINGS', 'CORPORAT...
4 ['STRATEGY/PLANS', 'NEW PRODUCTS/SERVICES', 'C...
Name: topic, dtype: object
Though this column should be full of lists, pandas is importing it as strings. How can I get pandas to import this as a column of lists?
You can convert the column with strings to Python lists with ast.literal_eval. For example:
from ast import literal_eval
df["topic"] = df["topic"].apply(literal_eval)
print(df)
Prints:
topic
0 [ECONOMIC PERFORMANCE, ECONOMICS, EQUITY]
1 [CAPACITY/FACILITIES, CORPORATE/INDUSTRIAL]
2 [PERFORMANCE, ACCOUNTS/EARNINGS, CORPORATE]
3 [PERFORMANCE, ACCOUNTS/EARNINGS, CORPORATE]
4 [STRATEGY/PLANS, NEW PRODUCTS/SERVICES]
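If you would rather do the conversion at import time, `read_csv` accepts a `converters` mapping. The following is a minimal sketch with made-up inline CSV text (the `id` column and the sample rows are assumptions, not the asker's file):

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# Hypothetical CSV content standing in for the real file
csv_text = '''id,topic
1,"['ECONOMICS', 'EQUITY']"
2,"['PERFORMANCE', 'ACCOUNTS/EARNINGS']"
'''

# Apply literal_eval to every value of the 'topic' column while reading
df = pd.read_csv(StringIO(csv_text), converters={'topic': literal_eval})
print(type(df['topic'].iloc[0]))
```

Note that `literal_eval` raises on values that are not valid Python literals, so empty or malformed cells would need separate handling.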
I have a data set represented in a Pandas object, see below:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
1/1/2011 0:00 1 0 0 1 9.84 14.395 81 0 3 13 16
1/1/2011 1:00 1 0 0 2 9.02 13.635 80 0 8 32 40
1/1/2011 2:00 1 0 0 3 9.02 13.635 80 0 5 27 32
p_type_1 = pd.read_csv("Bike Share Demand.csv")
p_type_1 = (p_type_1 >>
rename(date = X.datetime))
p_type_1.date.str.split(expand=True,)
p_type_1[['Date','Hour']] = p_type_1.date.str.split(" ",expand=True,)
p_type_1['date'] = pd.to_datetime(p_type_1['date'])
p_hour = p_type_1["Hour"]
p_hour
Now I am trying to take the sum of my column Hour that I created (p_hour)
p_hours = p_type_1["Hour"].sum()
p_hours
and get this error:
TypeError: must be str, not int
so I then tried:
p_hours = p_type_1(str["Hour"].sum())
p_hours
and get this error:
TypeError: 'type' object is not subscriptable
I just want the sum. What gives?
The data types in your dataframe are the problem.
Take a closer look at this question:
Convert DataFrame column type from string to datetime, dd/mm/yyyy format
Here is sample code that should solve your problem (I simplified the CSV):
'''
CSV
datetime,season
1/1/2011 0:00,1
1/1/2011 1:00,1
1/1/2011 2:00,1
'''
import pandas as pd
p_type_1 = pd.read_csv("Bike Share Demand.csv")
# Parse the strings into real datetimes
p_type_1['datetime'] = pd.to_datetime(p_type_1['datetime'])
# Extract the hour via the .dt accessor (Series.iteritems() was removed in pandas 2.0)
p_type_1['hour'] = p_type_1['datetime'].dt.hour
print(p_type_1['hour'].sum())
There's quite a bit going on in here that's not correct. So I'll try to break down the issues and offer alternatives.
Here:
p_hours = p_type_1(str["Hour"].sum())
p_hours
The issue is that what you are actually trying to do is this:
p_hours = p_type_1([str("Hour")].sum())
p_hours
Instead of doing that, your code subscripts the built-in str type with 'Hour', which is why Python reports that a 'type' object is not subscriptable. That's not what you are trying to do. This crash is unrelated to your core problem; it is just a mistake in how the expression is written.
The actual problem is probably that your dataframe column mixes string and integer values. The sum operation concatenates strings or adds numeric types; on a column with mixed types it fails.
To verify that this is the issue, however, we would need to see your actual dataframe, as I have a feeling the one you gave may not be the correct one.
As a proof of concept, I created the following example:
import pandas as pd
dta = [str(x) for x in range(20)]
dta.append(12)
frame = pd.DataFrame.from_dict({
"data": dta})
print(frame["data"].sum())
>>> TypeError: can only concatenate str (not "int") to str
Note that the wording differs from the error in your question because newer Python versions (3.7 and later) produce a clearer message for the same failure.
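If the column really does mix strings and numbers, one possible way to get a sum is to convert it to a numeric dtype first. This is only a sketch on made-up data; `numeric` and `total` are illustrative names, and whether unparseable values should be dropped is an assumption you would need to check against your data.

```python
import pandas as pd

# Made-up column mixing numeric strings, a real int, and a junk value
frame = pd.DataFrame({"data": ["1", "2", 3, "x"]})

# Convert to numbers; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(frame["data"], errors="coerce")

# sum() skips NaN by default
total = numeric.sum()
print(total)
```

Checking `numeric.isna().sum()` before summing shows how many values were silently discarded.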
I'm quite new to Python and I'm encountering a problem.
I have a dataframe where one of the columns is the departure time of flights. These hours are given in the following format : 1100.0, 525.0, 1640.0, etc.
This is a pandas series which I want to transform into a datetime series such as : S = [11.00, 5.25, 16.40,...]
What I have tried already :
Transforming my objects into string :
S = [str(x) for x in S]
Using datetime.strptime :
S = [datetime.strptime(x,'%H%M.%S') for x in S]
But since they are not all the same format it doesn't work
Using parser from dateutil :
S = [parser.parse(x) for x in S]
I got the error :
'Unknown string format'
Using the panda datetime :
S= pd.to_datetime(S)
Doesn't give me the expected result
Thanks for your answers !
Since it's a column within a dataframe (a Series), keeping it that way while transforming it should work just fine.
S = [1100.0, 525.0, 1640.0]
se = pd.Series(S) # Your column
# se:
0 1100.0
1 525.0
2 1640.0
dtype: float64
setime = se.astype(int).astype(str).apply(lambda x: x[:-2] + ":" + x[-2:])
This transforms the floats into correctly formatted strings:
0 11:00
1 5:25
2 16:40
dtype: object
And then you can simply do:
df["your_new_col"] = pd.to_datetime(setime)
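The slicing above assumes every value has at least three digits; a time such as 5.0 (00:05) would come out as ":5". A hedged variant that pads to four digits and passes an explicit format might look like the sketch below (the extra value 5.0 and the names `padded` and `times` are added for illustration):

```python
import pandas as pd

# Sample values, plus 5.0 (i.e. 00:05) to exercise the short case
se = pd.Series([1100.0, 525.0, 1640.0, 5.0])

# Drop the decimal part and left-pad with zeros to exactly HHMM
padded = se.astype(int).astype(str).str.zfill(4)

# Parse with an explicit format, then keep only the time of day
times = pd.to_datetime(padded, format='%H%M').dt.time
print(times.tolist())
```

The result is a column of `datetime.time` objects; drop the `.dt.time` step if you prefer full timestamps (they will carry the placeholder date 1900-01-01).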
How about this?
(Added an if statement since some entries have 4 digits before decimal and some have 3. Added the use case of 125.0 to account for this)
from datetime import datetime
S = [1100.0, 525.0, 1640.0, 125.0]
for x in S:
    if str(x).find(".")==3:
        x="0"+str(x)
    print(datetime.strftime(datetime.strptime(str(x),"%H%M.%S"),"%H:%M:%S"))
You might give it a go as follows:
# Just initialising a state in line with your requirements
st = ["1100.0", "525.0", "1640.0"]
dfObj = pd.DataFrame(st)
# Casting the string column to float
dfObj_num = dfObj[0].astype(float)
# Getting the hour representation out of the number
df1 = dfObj_num.floordiv(100)
# Getting the minutes
df2 = dfObj_num.mod(100)
# Moving the minutes on the right-hand side of the decimal point
df3 = df2.mul(0.01)
# Combining the two Series
df4 = df1.add(df3)
# At this point can cast to other types
Result:
0 11.00
1 5.25
2 16.40
You can run this example to verify the steps for yourself, and you can also turn it into a function. Make slight variations as needed to fit your precise requirements.
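Wrapped into a function, the steps above could look roughly like this. The function name `hhmm_to_decimal_style` is hypothetical, and the final `round(2)` is an added assumption to keep floating-point noise out of the result.

```python
import pandas as pd

def hhmm_to_decimal_style(values):
    """Turn HHMM-style numbers such as 1640.0 into HH.MM-style floats such as 16.40."""
    nums = pd.Series(values).astype(float)
    hours = nums.floordiv(100)   # 1640.0 -> 16.0
    minutes = nums.mod(100)      # 1640.0 -> 40.0
    # Move the minutes to the right of the decimal point and combine
    return (hours + minutes / 100).round(2)

result = hhmm_to_decimal_style(["1100.0", "525.0", "1640.0"])
print(result)
```

Keep in mind that the output is a display convention (16.40 means 16:40), not a true decimal number of hours.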
Might be useful to go through this article about Pandas Series.
https://www.geeksforgeeks.org/python-pandas-series/
There must be a better way to do this, but this works for me.
df=pd.DataFrame([1100.0, 525.0, 1640.0], columns=['hour'])
df['hour_dt']=((df['hour']/100).apply(str).str.split('.').str[0]+'.'+
df['hour'].apply((lambda x: '{:.2f}'.format(x/100).split('.')[1])).apply(str))
print(df)
hour hour_dt
0 1100.0 11.00
1 525.0 5.25
2 1640.0 16.40
I have a CSV file which looks like this:
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231
up to one billion rows.
Forget about the Numbers column. My concern is converting the date-time format in my CSV file to pandas timestamps, so I can plot my dataset and visualize it over time. As I am new to data science, here is my approach:
step 1: take the whole time column from my CSV file into an array,
step 2: split the data in the middle, where the : (colon) occurs, to make two new arrays of date and time,
step 3: remove "[" from the date array,
step 4: replace all forward slashes with dashes in the date array,
step 5: and then append the date and time arrays to make a single pandas format,
which will look like this: 2017-03-22 15:16:45. I am new, and I know my approach is naive and probably wrong. If someone can help by providing a code snippet, I will be really happy. Thanks.
You can pass a format to pd.to_datetime(), in this case: [%d/%b/%Y:%H:%M:%S.
Be careful with erroneous data though, as seen in the third row of the sample data below ([30/Apr/1998:21:32:3l8,671). To avoid an error you can pass errors='coerce', which returns NaT (Not a Time) for values that cannot be parsed.
The other way would be to fix those rows manually or to write some sort of regex/replace function first.
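from io import StringIO
import pandas as pd
data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231'''
# pd.compat.StringIO no longer exists in current pandas; use io.StringIO instead
fileobj = StringIO(data)
# skipinitialspace=True strips the space after the comma in the header ("time, Numbers")
df = pd.read_csv(fileobj, sep=',', na_values=['-'], skipinitialspace=True)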
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)
Returns:
time Numbers
0 1998-04-30 21:30:17 24736.0
1 1998-04-30 21:30:53 24736.0
2 NaT 671.0
3 1998-04-30 21:32:38 1512.0
4 1998-04-30 21:32:38 1136.0
5 1998-04-30 21:32:58 NaN
6 1998-04-30 21:32:59 231.0
Note that na_values=['-'] was used here to help pandas understand that the Numbers column contains numbers and not strings.
And now we can perform actions like grouping (on minute for instance):
print(df.groupby(df.time.dt.minute)['Numbers'].mean())
#time
#30.0 24736.000000
#32.0 959.666667
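The regex pre-check mentioned earlier in this answer could be sketched as follows. The pattern and the names `is_valid`, `bad_rows` and `parsed` are illustrative assumptions rather than part of the original answer; the idea is simply to flag malformed timestamps for inspection before parsing.

```python
import pandas as pd

# Three sample values; the middle one contains a stray letter 'l'
times = pd.Series([
    '[30/Apr/1998:21:30:17',
    '[30/Apr/1998:21:32:3l8',
    '[30/Apr/1998:21:32:38',
])

# Expected shape: [dd/Mon/yyyy:HH:MM:SS
pattern = r'^\[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}$'
is_valid = times.str.match(pattern)

# Rows to inspect or repair by hand
bad_rows = times[~is_valid]
print(bad_rows)

# Parse only the well-formed values; the rest stay NaT
parsed = pd.to_datetime(times.where(is_valid),
                        format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(parsed)
```

Logging `bad_rows` makes it easier to decide whether the malformed entries can be repaired or should be dropped.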
How do I get the Units column to numeric?
I have a Google spreadsheet that I am reading in. The date column gets converted fine, but I'm not having much luck getting the Unit Sales column to convert to numeric. I'm including all the code, which uses requests to get the data:
from StringIO import StringIO
import requests
#act = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak_wF7ZGeMmHdFZtQjI1a1hhUWR2UExCa2E4MFhiWWc&output=csv&gid=1')
dataact = act.content
actdf = pd.read_csv(StringIO(dataact),index_col=0,parse_dates=['date'])
actdf.rename(columns={'Unit Sales': 'Units'}, inplace=True) #incase the space in the name is messing me up
The different methods I have tried to get Units to get to numeric
actdf=actdf['Units'].convert_objects(convert_numeric=True)
#actdf=actdf['Units'].astype('float32')
Then I want to resample, and I'm getting strange string concatenations since the numbers are still strings
#actdfq=actdf.resample('Q',sum)
#actdfq.head()
actdf.head()
#actdf
so the df looks like this with just units and the date index
date
2013-09-01 3,533
2013-08-01 4,226
2013-07-01 4,281
Name: Units, Length: 161, dtype: object
You have to specify the thousands separator:
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'], thousands=',')
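A self-contained way to try this out, without the spreadsheet download, is sketched below. The inline CSV text is made up to mirror the three rows shown in the question, and `io.StringIO` is used in place of the Python 2 `StringIO` module.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text mimicking the spreadsheet export (numbers quoted, with commas)
csv_text = '''date,Unit Sales
2013-09-01,"3,533"
2013-08-01,"4,226"
2013-07-01,"4,281"
'''

# thousands=',' tells the parser to treat the commas as digit separators
actdf = pd.read_csv(StringIO(csv_text), index_col=0,
                    parse_dates=['date'], thousands=',')
actdf = actdf.rename(columns={'Unit Sales': 'Units'})

print(actdf['Units'].dtype)
print(actdf['Units'].sum())
```

With a numeric dtype in place, resampling sums the values instead of concatenating strings.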
This will also work (on current pandas, use pd.to_numeric, since convert_objects has been removed):
In [13]: s
Out[13]:
0 4,223
1 3,123
dtype: object
In [14]: pd.to_numeric(s.str.replace(',', ''))
Out[14]:
0 4223
1 3123
dtype: int64