Convert Quarter + Year (Datetime) into String in Pandas - python

I have some datetime info extracted into columns in Pandas. For example, I got the quarters like this:
df['quarter'] = pd.to_datetime(df['ddate'], format='%Y%m%d', errors='coerce').dt.quarter
I need to take the 'quarter' and 'year' columns and combine them into something like "Q3_2017". I can get this to work fine with a single data point like this:
'Q' + str(df['quarter'].iloc[0]) + '_' + str(df['year'].iloc[0])
But when I try to apply "str()" to a whole column I get bizarre results. For instance:
df['period'] = str(df['quarter'])
Instead of getting the quarter (e.g. "1"), I get something like this:
7222 1\n185579 4\n185580 1\n2129..
What exactly is going on and what's an easy fix?
I found a few previous solutions, but none seem to work specifically with quarters; can only find out how to do this with month or year, for example.

Try:
df['period'] = 'Q' + df['quarter'].astype(str) + '_' + df['year'].astype(str)

With Periods you can access %q for strftime.
import pandas as pd
df = pd.DataFrame({'ddate': pd.date_range('2010-01-01', freq='57D', periods=4)})
df.ddate.dt.to_period('Q').dt.strftime('Q%q_%Y')
0 Q1_2010
1 Q1_2010
2 Q2_2010
3 Q2_2010
Name: ddate, dtype: object
Or just keep the format of to_period (convert to string if you want)
df.ddate.dt.to_period("Q")
0 2010Q1
1 2010Q1
2 2010Q2
3 2010Q2
Name: ddate, dtype: period[Q-DEC]

Related

add specific part of one column values to another column

I have the following dataframe
import pandas as pd
data = {'existing_indiv': ['stac.Altered', 'MASO.MHD'], 'queries': ['modify', 'change']}
df = pd.DataFrame(data)
existing_indiv queries
0 stac.Altered modify
1 MASO.MHD change
I want to add the period and the word before the period to the beginning of the values of the queries column
Expected outcome:
existing_indiv queries
0 stac.Altered stac.modify
1 MASO.MHD MASO.change
Any ideas?
You can use .str.extract and regex ^([^.]+\.) to extract everything before the first .:
df.queries = df.existing_indiv.str.extract('^([^.]+\.)', expand=False) + df.queries
df
existing_indiv queries
0 stac.Altered stac.modify
1 MASO.MHD MASO.change
If you prefer .str.split:
df.existing_indiv.str.split('.').str[0] + '.' + df.queries
0 stac.modify
1 MASO.change
dtype: object

How to convert string into datetime?

I'm quite new to Python and I'm encountering a problem.
I have a dataframe where one of the columns is the departure time of flights. These hours are given in the following format : 1100.0, 525.0, 1640.0, etc.
This is a pandas series which I want to transform into a datetime series such as : S = [11.00, 5.25, 16.40,...]
What I have tried already :
Transforming my objects into string :
S = [str(x) for x in S]
Using datetime.strptime :
S = [datetime.strptime(x,'%H%M.%S') for x in S]
But since they are not all the same format it doesn't work
Using parser from dateutil :
S = [parser.parse(x) for x in S]
I got the error :
'Unknown string format'
Using the panda datetime :
S= pd.to_datetime(S)
Doesn't give me the expected result
Thanks for your answers !
Since it's a columns within a dataframe (A series), keep it that way while transforming should work just fine.
S = [1100.0, 525.0, 1640.0]
se = pd.Series(S) # Your column
# se:
0 1100.0
1 525.0
2 1640.0
dtype: float64
setime = se.astype(int).astype(str).apply(lambda x: x[:-2] + ":" + x[-2:])
This transform the floats to correctly formatted strings:
0 11:00
1 5:25
2 16:40
dtype: object
And then you can simply do:
df["your_new_col"] = pd.to_datetime(setime)
How about this?
(Added an if statement since some entries have 4 digits before decimal and some have 3. Added the use case of 125.0 to account for this)
from datetime import datetime
S = [1100.0, 525.0, 1640.0, 125.0]
for x in S:
if str(x).find(".")==3:
x="0"+str(x)
print(datetime.strftime(datetime.strptime(str(x),"%H%M.%S"),"%H:%M:%S"))
You might give it a go as follows:
# Just initialising a state in line with your requirements
st = ["1100.0", "525.0", "1640.0"]
dfObj = pd.DataFrame(st)
# Casting the string column to float
dfObj_num = dfObj[0].astype(float)
# Getting the hour representation out of the number
df1 = dfObj_num.floordiv(100)
# Getting the minutes
df2 = dfObj_num.mod(100)
# Moving the minutes on the right-hand side of the decimal point
df3 = df2.mul(0.01)
# Combining the two dataframes
df4 = df1.add(df3)
# At this point can cast to other types
Result:
0 11.00
1 5.25
2 16.40
You can run this example to verify the steps for yourself, also you can make it into a function. Make slight variations if needed in order to tweak it according to your precise requirements.
Might be useful to go through this article about Pandas Series.
https://www.geeksforgeeks.org/python-pandas-series/
There must be a better way to do this, but this works for me.
df=pd.DataFrame([1100.0, 525.0, 1640.0], columns=['hour'])
df['hour_dt']=((df['hour']/100).apply(str).str.split('.').str[0]+'.'+
df['hour'].apply((lambda x: '{:.2f}'.format(x/100).split('.')[1])).apply(str))
print(df)
hour hour_dt
0 1100.0 11.00
1 525.0 5.25
2 1640.0 16.40

Calculating customer lifetime using Pandas

I'm performing a Cohort analysis using python, and I am having trouble creating a new column that sums up the total months a user has stayed with us.
I know the math behind the answer, all I have to do is:
subtract the year when they canceled our service from when they started it
Multiply that by 12.
Subtract the month when they canceled our service from when they started it.
Add those two numbers together.
So in Excel, it looks like this:
=(YEAR(C2)-YEAR(B2))*12+(MONTH(C2)-MONTH(B2))
C is when the customer canceled the date, and B is when they started
The problem is that I am very new to Python and Pandas, and I am having trouble translating that function in Python
What I have tried so far:
df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
It returns with an error 'Series' is not callable, and I have a general understanding of what that means.
I then tried:
def LTVCalc (Plan_Start_Date, Plan_Cancel_Date):
df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
But that didn't add the Column 'Lifetime' to the DataFrame.
Anyone able to help a rookie?
I think need first convert to_datetime and then use dt.year and
dt.month:
df = pd.DataFrame({
'Plan_Cancel_Date': ['2018-07-07','2019-03-05','2020-10-08'],
'Plan_Start_Date': ['2016-02-07','2017-01-05','2017-08-08']
})
#print (df)
#if necessary convert to datetimes
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year)*12 +
df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month)
print (df)
Plan_Cancel_Date Plan_Start_Date Lifetime
0 2018-07-07 2016-02-07 29
1 2019-03-05 2017-01-05 26
2 2020-10-08 2017-08-08 38

Output of column in Pandas dataframe from float to currency (negative values)

I have the following data frame (consisting of both negative and positive numbers):
df.head()
Out[39]:
Prices
0 -445.0
1 -2058.0
2 -954.0
3 -520.0
4 -730.0
I am trying to change the 'Prices' column to display as currency when I export it to an Excel spreadsheet. The following command I use works well:
df['Prices'] = df['Prices'].map("${:,.0f}".format)
df.head()
Out[42]:
Prices
0 $-445
1 $-2,058
2 $-954
3 $-520
4 $-730
Now my question here is what would I do if I wanted the output to have the negative signs BEFORE the dollar sign. In the output above, the dollar signs are before the negative signs. I am looking for something like this:
-$445
-$2,058
-$954
-$520
-$730
Please note there are also positive numbers as well.
You can use np.where and test whether the values are negative and if so prepend a negative sign in front of the dollar and cast the series to a string using astype:
In [153]:
df['Prices'] = np.where( df['Prices'] < 0, '-$' + df['Prices'].astype(str).str[1:], '$' + df['Prices'].astype(str))
df['Prices']
Out[153]:
0 -$445.0
1 -$2058.0
2 -$954.0
3 -$520.0
4 -$730.0
Name: Prices, dtype: object
You can use the locale module and the _override_localeconv dict. It's not well documented, but it's a trick I found in another answer that has helped me before.
import pandas as pd
import locale
locale.setlocale( locale.LC_ALL, 'English_United States.1252')
# Made an assumption with that locale. Adjust as appropriate.
locale._override_localeconv = {'n_sign_posn':1}
# Load dataframe into df
df['Prices'] = df['Prices'].map(locale.currency)
This creates a dataframe that looks like this:
Prices
0 -$445.00
1 -$2058.00
2 -$954.00
3 -$520.00
4 -$730.00

Converting a column of strings to numbers in Pandas

How do I get the Units column to numeric?
I have a Google spreadsheet that I am reading in the date column gets converted fine.. but I'm not having much luck getting the Unit Sales column to convert to numeric I'm including all the code which uses requests to get the data:
from StringIO import StringIO
import requests
#act = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak_wF7ZGeMmHdFZtQjI1a1hhUWR2UExCa2E4MFhiWWc&output=csv&gid=1')
dataact = act.content
actdf = pd.read_csv(StringIO(dataact),index_col=0,parse_dates=['date'])
actdf.rename(columns={'Unit Sales': 'Units'}, inplace=True) #incase the space in the name is messing me up
The different methods I have tried to get Units to get to numeric
actdf=actdf['Units'].convert_objects(convert_numeric=True)
#actdf=actdf['Units'].astype('float32')
Then I want to resample and I'm getting strange string concatenations since the numbers are still string
#actdfq=actdf.resample('Q',sum)
#actdfq.head()
actdf.head()
#actdf
so the df looks like this with just units and the date index
date
2013-09-01 3,533
2013-08-01 4,226
2013-07-01 4,281
Name: Units, Length: 161, dtype: object
You have to specify the thousands separator:
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'], thousands=',')
This will work
In [13]: s
Out[13]:
0 4,223
1 3,123
dtype: object
In [14]: s.str.replace(',','').convert_objects(convert_numeric=True)
Out[14]:
0 4223
1 3123
dtype: int64

Categories

Resources