How to replace a string at a specific index in a pandas dataframe - python

I have the following dataframe in pandas:
code bucket
0 08:30:00-9:00:00
1 10:00:00-11:00:00
2 12:00:00-13:00:00
I want to replace the 0 at character index 7 with 1; my desired dataframe is
code bucket
0 08:30:01-9:00:00
1 10:00:01-11:00:00
2 12:00:01-13:00:00
How to do it in pandas?

Use indexing with str:
df['bucket'] = df['bucket'].str[:7] + '1' + df['bucket'].str[8:]
Or list comprehension:
df['bucket'] = [x[:7] + '1' + x[8:] for x in df['bucket']]
print (df)
code bucket
0 0 08:30:01-9:00:00
1 1 10:00:01-11:00:00
2 2 12:00:01-13:00:00
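An alternative, if you prefer a single vectorized call, is pandas' str.slice_replace; a minimal sketch of the same replacement (assuming the example frame above):
# replace the character at position 7 (stop index 8 is exclusive) with '1'
df['bucket'] = df['bucket'].str.slice_replace(7, 8, '1')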

Avoid string operations where possible
You lose a considerable amount of functionality by working with strings only. While this may be a one-off operation, you will find that repeated string manipulations will quickly become expensive in terms of time and memory efficiency.
Use pd.to_datetime instead
You can add additional series to your dataframe with datetime objects. Below is an example which, in addition, creates an object dtype series in the format you desire.
# split by '-' into 2 series
dfs = df.pop('bucket').str.split('-', expand=True)
# convert to datetime
dfs = dfs.apply(pd.to_datetime, axis=1)
# add 1s to first series
dfs[0] = dfs[0] + pd.Timedelta(seconds=1)
# create object series from 2 times
form = '%H:%M:%S'
dfs[2] = dfs[0].dt.strftime(form) + '-' + dfs[1].dt.strftime(form)
# join to original dataframe
res = df.join(dfs)
print(res)
code 0 1 2
0 0 2018-10-02 08:30:01 2018-10-02 09:00:00 08:30:01-09:00:00
1 1 2018-10-02 10:00:01 2018-10-02 11:00:00 10:00:01-11:00:00
2 2 2018-10-02 12:00:01 2018-10-02 13:00:00 12:00:01-13:00:00
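One benefit of keeping the datetime columns around is that arithmetic like the bucket duration becomes a one-liner instead of more string parsing; a small sketch, assuming the res and dfs frames built above:
# timedelta64 column, e.g. 0 days 00:29:59 for the first row
res['duration'] = dfs[1] - dfs[0]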

Related

How can I replace the FOR loop by something faster

I'm trying to transform my dataframe based on certain conditions. Following is my input dataframe
In [11]: df
Out[11]:
DocumentNumber I_Date N_Date P_Date Amount
0 1234 2016-01-01 2017-01-01 2017-10-23 38.38
1 2345 2016-01-02 2017-01-02 2018-03-26 41.00
2 1324 2016-01-12 2017-01-03 2018-03-26 30.37
3 5421 2016-01-13 2017-01-02 2018-03-06 269.00
4 5532 2016-01-15 2017-01-04 2018-06-30 271.00
Desired solution:
Each row is a unique document. My aim is to find, for each date and delta combination, the number of documents and their total amount that meet the condition shown in the code below.
I can get to my desired result with a for-loop, but I know it is not ideal and it gets slower as my data grows. Since I am new to Python, I need help replacing the loop with a list comprehension or any other faster option.
Code:
d1 = datetime.date(2017, 1, 1)
d2 = datetime.date(2017, 1, 15)
mydates = pd.date_range(d1, d2).tolist()
Delta = pd.Series(range(0,5)).tolist()
df_A =[]
for i in mydates:
    for j in Delta:
        A = df[(df["I_Date"] < i) & (df["N_Date"] > i+j) & (df["P_Date"] > i)]
        A["DateCutoff"] = i
        A["Delta"] = j
        A = A.groupby(['DateCutoff','Delta'], as_index=False).agg({'Amount':'sum','DocumentNumber':'count'})
        A.columns = ['DateCutoff','Delta','A_PaymentAmount','A_DocumentNumber']
        df_A.append(A)
df_A = pd.concat(df_A, sort = False)
Output:
In [14]: df_A
Out[14]:
DateCutoff Delta A_PaymentAmount A_DocumentNumber
0 2017-01-01 0 611.37 4
0 2017-01-01 1 301.37 2
0 2017-01-01 2 271.00 1
0 2017-01-02 0 301.37 2
0 2017-01-02 1 271.00 1
0 2017-01-03 0 271.00 1
I don't see a way to remove the loop from your code, because the loop is creating individual dataframes based on the contents of mydates and Delta.
In this example you are creating 75 different dataframes
On each dataframe you .groupby, then .agg the sum of payments and the count of document numbers.
Each dataframe is appended to a list.
pd.concat the complete list into a dataframe.
One significant improvement
Check the Boolean condition before creating the dataframe and performing the remaining operations. In this example, operations were performed on 69 empty dataframes. By checking the condition first, operations will only be performed on the 6 dataframes containing data.
condition.any() returns True as long as at least one element is True
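A minimal illustration of the .any() check (a standalone example, not the question's data):
import pandas as pd
mask = pd.Series([False, False, False])
print(mask.any())   # False -> skip the groupby/agg for this (date, delta) pair
mask = pd.Series([False, True, False])
print(mask.any())   # True  -> worth building the filtered dataframe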
Minor changes
datetime + int is deprecated, so change that to datetime + timedelta(days=x)
pd.Series(range(0,5)).tolist() is overkill for making a list. Now timedelta objects are needed, so use [timedelta(days=x) for x in range(5)]
Instead of iterating with two for-loops, use itertools.product on mydates and Delta. This creates a generator of tuples in the form (Timestamp('2017-01-01 00:00:00', freq='D'), datetime.timedelta(0))
Use .copy() when creating dataframe A, to prevent SettingWithCopyWarning
Note:
A list comprehension was mentioned in the question. It is just a Pythonic way of writing a for-loop and doesn't necessarily improve performance.
All of the calculations are using pandas methods, not for-loops. The for-loop only creates the dataframe from the condition.
Updated Code:
from itertools import product
import pandas as pd
from datetime import date, timedelta
d1 = date(2017, 1, 1)
d2 = date(2017, 1, 15)
mydates = pd.date_range(d1, d2)
Delta = [timedelta(days=x) for x in range(5)]
df_list = list()
for t in product(mydates, Delta):
    condition = (df["I_Date"] < t[0]) & (df["N_Date"] > t[0]+t[1]) & (df["P_Date"] > t[0])
    if condition.any():
        A = df[condition].copy()
        A["DateCutoff"] = t[0]
        A["Delta"] = t[1]
        A = A.groupby(['DateCutoff','Delta'], as_index=False).agg({'Amount':'sum','DocumentNumber':'count'})
        A.columns = ['DateCutoff','Delta','A_PaymentAmount','A_DocumentNumber']
        df_list.append(A)
df_CutOff = pd.concat(df_list, sort = False)
Output
The same as the original
DateCutoff Delta A_PaymentAmount A_DocumentNumber
0 2017-01-01 0 611.37 4
0 2017-01-01 1 301.37 2
0 2017-01-01 2 271.00 1
0 2017-01-02 0 301.37 2
0 2017-01-02 1 271.00 1
0 2017-01-03 0 271.00 1

How to assign random values from a list to a column in a pandas dataframe?

I am working with Python in BigQuery and have a large dataframe df (circa 7m rows). I also have a list lst that holds some dates (say all days in a given month).
I am trying to create an additional column "random_day" in df with a random value from lst in each row.
I tried running a loop and an apply function, but with such a large dataset it is proving challenging.
My attempts include the loop solution:
df["rand_day"] = ""
for i in a["row_nr"]:
    rand_day = sample(day_list,1)[0]
    df.loc[i,"rand_day"] = rand_day
And the apply solution, first defining my function and then calling it:
def random_day():
    rand_day = sample(day_list,1)[0]
    return rand_day

df["rand_day"] = df.apply(lambda row: random_day())
Any tips on this?
Thank you
Use numpy.random.choice and, if necessary, convert the dates with to_datetime:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4,5,4,5,5,4],
})
day_list = pd.to_datetime(['2015-01-02','2016-05-05','2015-08-09'])
#alternative
#day_list = pd.DatetimeIndex(['2015-01-02','2016-05-05','2015-08-09'])
df["rand_day"] = np.random.choice(day_list, size=len(df))
print (df)
A B rand_day
0 a 4 2016-05-05
1 b 5 2016-05-05
2 c 4 2015-08-09
3 d 5 2015-01-02
4 e 5 2015-08-09
5 f 4 2015-08-09
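If you need the random assignment to be reproducible, a seed can be set first; a minimal sketch (the seed value is arbitrary):
np.random.seed(42)
df["rand_day"] = np.random.choice(day_list, size=len(df))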

Pandas Time Series shows NaN after converting entries to float

I'm trying to get a time series going from a dataframe. My dataframe contains two desired columns - Timestamp and Speed. This is my code so far:
Step 1: I replaced all the spaces in the desired Speed column with 0
bus1354['Speed'].replace(' ',0,inplace=True)
Step 2: I then check to see if there are any NaN values in the Speed column after this
assert not bus1354['Speed'].isnull().any()
Step 3: I then check the first few entries of Timestamp and Speed columns together in the dataframe
bus1354[['Timestamp','Speed']].head()
This is the result I get (so far so good):
Step 4: I then truncate the Timestamp so as to only show hh:mm:ss and remove the milliseconds. I also convert to datetime format.
bus1354['Timestamp'] = pd.to_datetime(bus1354['Timestamp'].apply(lambda x : x[:7]))
Step 5: I check the result of the truncation
bus1354['Timestamp'].head()
Here's what that looks like:
Step 6: I then convert the speed to float64 from non-null object
bus1354['Speed'] = bus1354['Speed'].apply(float)
Step 7: I create a timerange and Time Series
bstimeRng = bus1354['Timestamp']
bs1354Ser = pd.Series(bus1354['Speed'], index=bstimeRng)
Step 8: Once I output my Time Series however, I get a bunch of NaN's for my Speed column.
bs1354Ser
I'm still learning the ins and outs of pandas so bear with me if this sounds like a basic question. Why is it that even after I changed the Speed column into float64, the Time Series still shows my desired Speed values as "NaN"?
Here it is better to use set_index:
s1354Ser = bus1354.set_index('Timestamp')['Speed']
Sample:
bus1354 = pd.DataFrame(
    {'Timestamp': ['08:38:00:009','08:38:00:013','08:38:00:019'],
     'Speed': [42,42,43]})
print (bus1354)
Timestamp Speed
0 08:38:00:009 42
1 08:38:00:013 42
2 08:38:00:019 43
bus1354['Timestamp'] = pd.to_datetime(bus1354['Timestamp'].str[:7])
bus1354['Speed'] = bus1354['Speed'].astype(float)
s1354Ser = bus1354.set_index('Timestamp')['Speed']
print (s1354Ser)
Timestamp
2019-01-19 08:38:00 42.0
2019-01-19 08:38:00 42.0
2019-01-19 08:38:00 43.0
Name: Speed, dtype: float64
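Once the Series has a DatetimeIndex like this, time-series operations become available; for example (a sketch, assuming the s1354Ser built above):
print(s1354Ser.resample('1min').mean())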
The missing values in your solution are caused by data alignment:
#sample data
df = pd.DataFrame(
    {'a': [0,2,3],
     'b': [41,42,43]})
print (df)
a b
0 0 41
1 2 42
2 3 43
If you check the index of the original data:
print (df.index.tolist())
[0, 1, 2]
And the values of column a that are used for the new index:
print (df['a'].tolist())
[0, 2, 3]
The Series constructor then aligns the data where possible - the old index from the original against the new index from column a - and NaN is created where a value does not exist:
s = pd.Series(df['b'], index=df['a'])
print (s)
a
0 41.0 <- aligned by index 0 from the original
2 43.0 <- aligned by index 2 from the original
3 NaN <- index 3 does not exist, so NaN
Name: b, dtype: float64
But if you convert the values of Speed to a 1d NumPy array with .values, the array has no index, unlike a Series:
s1354Ser = pd.Series(bus1354['Speed'].values, index=bstimeRng)
s = pd.Series(df['b'].values, index=df['a'])
print (s)
a
0 41
2 42
3 43
dtype: int64
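An equivalent way to drop the index before alignment, in more recent pandas, is .to_numpy(); a sketch on the same toy data:
s = pd.Series(df['b'].to_numpy(), index=df['a'])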

Append values to Pandas series

I am selecting values from an SQL database through pandas, but when I want to add new values to the existing pandas series, I receive a "cannot concatenate a non-NDFrame object" error. So I am not really sure how I should proceed.
sql = "select * from table"
df = pd.read_sql(sql, conn)
datovalue = df['Datovalue']
datovalue.append(35)
This is how the datovalues look when I print them out:
0 736722.0
1 736722.0
2 736723.0
3 736723.0
4 736725.0
How do I add an extra (5th index in this case) value?
There are several equivalent ways to add data to a series by index:
s = pd.Series([736722.0, 736722.0, 736723.0, 736723.0, 736725.0])
# direct indexing
s[5] = 35
# loc indexing
s.loc[5] = 35
# loc indexing with unknown index
s.loc[s.index.max()+1] = 35
# append with series
s = s.append(pd.Series([35], index=[5]))
# concat with series
s = pd.concat([s, pd.Series([35], index=[5])])
print(s)
0 736722.0
1 736722.0
2 736723.0
3 736723.0
4 736725.0
5 35.0
dtype: float64
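Note that Series.append was deprecated and later removed in pandas 2.0, so on newer versions the pd.concat form is the one to use; if you don't care about the index value, ignore_index re-numbers it for you:
s = pd.concat([s, pd.Series([35])], ignore_index=True)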
just use this
datovalue.append([35])
df = pd.DataFrame([[5],[3]])
df.append([1])
df
0
0 5
1 3
0 1

Iteratively Subset DataFrame based on Unique Times

Given the following example DataFrame:
>>> df
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
3 05/11/2017 08:25:20 4
4 05/11/2017 08:30:14 5
5 05/11/2017 08:30:35 6
I want to subset this DataFrame by the 'Times' column, by matching a partial string up to the hour. For example, I want to subset using partial strings which contain "05/10/2017 01:" and "05/11/2017 08:", which break the data up into two new data frames:
>>> df1
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
and
>>> df2
0 05/11/2017 08:25:20 4
1 05/11/2017 08:30:14 5
2 05/11/2017 08:30:35 6
Is it possible to make this subset iterative in Pandas, for multiple dates/times that similarly have the date/hour as the common identifier?
First, cast your Times column into a datetime format, and set it as the index:
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace = True)
Then use the groupby method, with a TimeGrouper:
g = df.groupby(pd.TimeGrouper('h'))
g is an iterator that yields tuple pairs of times and sub-dataframes of those times. If you just want the sub-dfs, you can do list(zip(*g))[1] (the original zip(*g)[1] only works on Python 2).
A caveat: the sub-dfs are indexed by the timestamp, and pd.TimeGrouper only works when the times are the index. If you want to have the timestamp as a column, you could instead do:
df['Times'] = pd.to_datetime(df['Times'])
df['time_hour'] = df['Times'].dt.floor('1h')
g = df.groupby('time_hour')
Alternatively, you could just call .reset_index() on each of the dfs from the former method, but this will probably be much slower.
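pd.TimeGrouper was later deprecated in favor of pd.Grouper, which takes the frequency as a keyword; a roughly equivalent sketch (note that hourly bins with no rows show up as empty groups, so you may want to skip them):
g = df.groupby(pd.Grouper(freq='h'))
sub_dfs = [sub for _, sub in g if not sub.empty]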
Convert Times to an hourly period, groupby, and then extract each group as a DF.
df1, df2 = [g.drop('hour', 1) for n, g in
            df.assign(hour=pd.DatetimeIndex(df.Times).to_period('h')).groupby('hour')]
df1
Out[874]:
Times Values
0 2017-05-10 01:01:03 1
1 2017-05-10 01:05:00 2
2 2017-05-10 01:06:10 3
df2
Out[875]:
Times Values
3 2017-05-11 08:25:20 4
4 2017-05-11 08:30:14 5
5 2017-05-11 08:30:35 6
First, make sure that the Times column is of type datetime.
Second, set the Times column as the index.
Third, use the between_time method.
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace=True)
df1 = df.between_time('1:00:00', '1:59:59')
df2 = df.between_time('8:00:00', '8:59:59')
If you use the datetime type you can extract things like hours and days.
times = pd.to_datetime(df['Times'])
hours = times.apply(lambda x: x.hour)
df1 = df[hours == 1]
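The same idea without the Python-level lambda is the .dt accessor, which is vectorized; a sketch on the same times series:
hours = times.dt.hour
df1 = df[hours == 1]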
You can use the str accessor to truncate the string representation of your date (you may have to cast with astype(str) if your column is a datetime), and then use groupby.groups to access the groups as a dictionary where the keys are your truncated date values:
>>> df.groupby(df.Times.astype(str).str[0:13]).groups
{'2017-05-10 01': DatetimeIndex(['2017-05-10 01:01:03', '2017-05-10 01:05:00',
'2017-05-10 01:06:10'],
dtype='datetime64[ns]', name='time', freq=None),
'2017-05-11 08': DatetimeIndex(['2017-05-11 08:25:20', '2017-05-11 08:30:14',
'2017-05-11 08:30:35'],
dtype='datetime64[ns]', name='time', freq=None)}
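If you want one of the sub-dataframes back rather than just its index, get_group works on the same groupby object; a sketch, with the key spelled as in the keys above:
grouped = df.groupby(df.Times.astype(str).str[0:13])
df1 = grouped.get_group('2017-05-10 01')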
