I have a dataframe (df), originally from an Excel file, whose first 9 rows look like this:
   Control    Recd_Date/Due_Date   Action                    Signature/Requester
0  2000-1703  2000-01-31 00:00:00  OC/OER/OPA/PMS/           M WEBB
1  NaN        2000-02-29 00:00:00  NaN                       DATA CORP
2  2000-1776  2000-01-02 00:00:00  OC/ORA/OE/DCP/            G KAN
3  NaN        2000-01-03 00:00:00  OC/ORA/ORO/PNC/           PALM POST
4  NaN        NaN                  FDA/OGROP/ORA/SE-FO/FLA-  NaN
5  NaN        NaN                  DO/FLA-CB/                NaN
6  2000-1983  2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-  M EGAN
7  NaN        2000-02-03 00:00:00  DO/CHI-CB/                BERNSTEIN LIEBHARD &
8  NaN        NaN                  NaN                       LONDON LLP
type(df['Control'][1]) -> float
type(df['Recd_Date/Due_Date'][1]) -> datetime.datetime
type(df['Action_Office'][1]) -> float
type(df['Signature/Requester'][1]) -> unicode
I want to transform this dataframe (e.g. first 9 rows) to this:
   Control    Recd_Date/Due_Date                       Action                                                            Signature/Requester
0  2000-1703  2000-01-31 00:00:00,2000-02-29 00:00:00  OC/OER/OPA/PMS/                                                   M WEBB,DATA CORP
1  2000-1776  2000-01-02 00:00:00,2000-01-03 00:00:00  OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/   G KAN,PALM POST
2  2000-1983  2000-02-02 00:00:00,2000-02-03 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/                                M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
So basically:
Every time pd.isnull(row['Control']) is true (this should be the only if condition), merge that row into the previous row whose 'Control' value is not null.
For 'Recd_Date/Due_Date' and 'Signature/Requester', add a ',' (or '/') between each pair of merged values (e.g. '2000-01-31 00:00:00,2000-02-29 00:00:00' and 'G KAN,PALM POST').
For 'Action', simply concatenate the values without adding any punctuation (e.g. FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/).
Can anyone help me out, please? This is the code I'm trying to get to work:
for i, row in df.iterrows():
    if pd.isnull(df.ix[i]['Control_#']):
        df.ix[i-1]['Recd_Date/Due_Date'] = str(df.ix[i-1]['Recd_Date/Due_Date'])+'/'+str(df.ix[i]['Recd_Date/Due_Date'])
        df.ix[i-1]['Subject'] = str(df.ix[i-1]['Subject'])+' '+str(df.ix[i]['Subject'])
        if str(df.ix[i-1]['Action_Office'])[-1] == '-':
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+str(df.ix[i]['Action_Office'])
        else:
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+','+str(df.ix[i]['Action_Office'])
        if pd.isnull(df.ix[i-1]['Signature/Requester']):
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+str(df.ix[i]['Signature/Requester'])
        elif str(df.ix[i-1]['Signature/Requester'])[-1] == '&':
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+' '+str(df.ix[i]['Signature/Requester'])
        else:
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+','+str(df.ix[i]['Signature/Requester'])
        df.drop(df.index[i])
Why doesn't the drop() call work? I am trying to drop the current row (when its 'Control_#' is null) so that each following row whose 'Control_#' is null can be merged into the previous row (whose 'Control_#' is NOT null) iteratively.
Much appreciated!!
I think you need to group the rows together and then join up the column values. (On the drop() question: df.drop returns a new DataFrame rather than modifying df in place, so unless you reassign the result or pass inplace=True the row is never removed; the chained df.ix[i-1][...] = ... assignments can likewise silently write to a copy.) The tricky part is finding a way to group together the rows in the way you want. Here is my solution.
1) Grouping Together the Rows: Static variables
Since your groups depend on a sequence in your rows, I used a static variable on a function to label every row with a specific group:
def rolling_group(val):
    if pd.notnull(val):
        rolling_group.group += 1  # pd.notnull is the signal to switch groups
    return rolling_group.group
rolling_group.group = 0  # static variable
This function is applied along the Control series to assign every index to a group, which is then used to split up the dataframe so the rows can be merged:
groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
That is really the only tricky part; after that you can just merge the rows by applying a function to each group that gives your desired output.
Full Solution Code
import re
from io import StringIO

import pandas as pd

def rolling_group(val):
    if pd.notnull(val):
        rolling_group.group += 1  # pd.notnull is the signal to switch groups
    return rolling_group.group
rolling_group.group = 0  # static variable

def joinFunc(g, column):
    col = g[column]
    joiner = "/" if column == "Action" else ","
    # convert each value to str and join the non-null ones
    s = joiner.join([str(each) for each in col if pd.notnull(each)])
    s = re.sub("(?<=&)" + joiner, " ", s)  # after '&' the joiner becomes a space
    s = re.sub("(?<=-)" + joiner, "", s)   # after '-' the joiner is dropped
    s = re.sub(joiner * 2, joiner, s)      # collapse a doubled joiner
    return s

if __name__ == "__main__":
    df = """Control    Recd_Date/Due_Date   Action                    Signature/Requester
0  2000-1703  2000-01-31 00:00:00  OC/OER/OPA/PMS/           M WEBB
1  NaN        2000-02-29 00:00:00  NaN                       DATA CORP
2  2000-1776  2000-01-02 00:00:00  OC/ORA/OE/DCP/            G KAN
3  NaN        2000-01-03 00:00:00  OC/ORA/ORO/PNC/           PALM POST
4  NaN        NaN                  FDA/OGROP/ORA/SE-FO/FLA-  NaN
5  NaN        NaN                  DO/FLA-CB/                NaN
6  2000-1983  2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-  M EGAN
7  NaN        2000-02-03 00:00:00  DO/CHI-CB/                BERNSTEIN LIEBHARD &
8  NaN        NaN                  NaN                       LONDON LLP"""
    df = pd.read_csv(StringIO(df), sep=r"\s\s+", engine='python')
    groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
    groupFunct = lambda g: pd.Series([joinFunc(g, col) for col in g.columns], index=g.columns)
    print(groups.apply(groupFunct))
Output:
Control Recd_Date/Due_Date \
0 2000-1703 2000-01-31 00:00:00,2000-02-29 00:00:00
1 2000-1776 2000-01-02 00:00:00,2000-01-03 00:00:00
2 2000-1983 2000-02-02 00:00:00,2000-02-03 00:00:00
Action \
0 OC/OER/OPA/PMS/
1 OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-...
2 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/
Signature/Requester
0 M WEBB,DATA CORP
1 G KAN,PALM POST
2 M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
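A more idiomatic way to build the same group key, assuming Control is non-null only on the first row of each record, is notnull().cumsum() instead of a static variable. A minimal sketch (mine, not part of the answer above):

import pandas as pd

def join_rows(g, column):
    # plain concatenation for Action, comma-separated elsewhere
    joiner = "" if column == "Action" else ","
    return joiner.join(str(v) for v in g[column] if pd.notnull(v))

# notnull().cumsum() bumps the key at every filled-in Control, so each
# record and its continuation rows share one group id
key = df['Control'].notnull().cumsum()
merged = df.groupby(key).apply(
    lambda g: pd.Series({c: join_rows(g, c) for c in g.columns}))

Note this skips the '&' and '-' cleanup that the regexes above handle, so 'BERNSTEIN LIEBHARD &' would come out as '&,LONDON LLP' without those substitutions.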
How can I extract the strings that belong to certain words?
Here is my text:
   ID  Event Name                         Event Type
0   1  Taltz Seminar for Dermathologists  Out of office
2   3  Experiment Results for Taltz       In Ofice
3   4  Use of Taltz in Rheumathology      OUTOFOFFICE
5   6  RHeums Experiences with Taltz      IO
How can I get the strings belonging to Dermathologists and Rheumathology using regex? I have tried this:
import re

pattern = r'(derma\w+),\s(RHeums\w+).*'
df_named = df['Event Name'].str.extract(
    pattern,
    flags=re.I)
df_clean = df_named.reindex(
    columns=['dermatological',
             'rheumatological'])
df_clean.head()
Your pattern requires both words to appear in the same string, separated by a comma, so it never matches. Instead, extract each pattern separately and convert the Series to a new DataFrame:
df_named = pd.DataFrame({'dermatological': df['Event Name'].str.extract(r'(derma\w+)', flags=re.I, expand=False),
                         'rheumatological': df['Event Name'].str.extract(r'(RHeuma\w+)', flags=re.I, expand=False)})
print (df_named)
dermatological rheumatological
0 Dermathologists NaN
1 NaN NaN
2 NaN Rheumathology
3 NaN NaN
If you need to append the new columns to the original DataFrame, use:
df_named = df.assign(dermatological = df['Event Name'].str.extract(r'(derma\w+)', flags=re.I, expand=False),
                     rheumatological = df['Event Name'].str.extract(r'(RHeum\w+)', flags=re.I, expand=False))
print (df_named)
ID Event Name Event Type dermatological \
0 1 Taltz Seminar for Dermathologists Out of office Dermathologists
1 3 Experiment Results for Taltz In Ofice NaN
2 4 Use of Taltz in Rheumathology OUTOFOFFICE NaN
3 6 RHeums Experiences with Taltz IO NaN
rheumatological
0 NaN
1 NaN
2 Rheumathology
3 RHeums
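As a side note, both extractions can also be done in one pass with a single alternation pattern and named groups; a small sketch (the pattern here is my own, not from the answer above):

import re
import pandas as pd

# one alternation with named groups: whichever branch matches fills its column
pattern = r'(?P<dermatological>derma\w+)|(?P<rheumatological>RHeum\w+)'
df_named = df['Event Name'].str.extract(pattern, flags=re.I)
print(df_named)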
I have a dataframe with daily values between a couple of months like this:
London 2000-01-01 5
London 2000-01-02 nan
London 2000-01-03 nan
..
London 2000-01-31 nan
London 2000-02-01 3
London 2000-02-02 nan
London 2000-02-03 nan
...
London 2000-02-29 nan
London 2000-03-01 nan
London 2000-03-02 nan
..
so for the first two months there is a value on the first of the month. I want to forward fill that first-of-month value across the whole month, but if I just use fillna with method='ffill', the third month will also be filled with the second month's value. So I want it to be like this:
London 2000-01-01 5
London 2000-01-02 5
London 2000-01-03 5
..
London 2000-01-31 5
London 2000-02-01 3
London 2000-02-02 3
London 2000-02-03 3
...
London 2000-02-29 3
London 2000-03-01 nan
London 2000-03-02 nan
..
Is there a way to forward fill only one month ahead? My start date and end date will be variable: for example, I may have first-of-the-month data from 2000-01 all the way up to 2000-10, while my overall dataframe spans 2000-01 to 2000-12, so two months should contain only NaNs. I am having trouble because each month has a different end day, so I am unsure how to set the right condition for it. The dates are in datetime format.
Option 1:
import pandas as pd

df = pd.DataFrame(index=pd.date_range('2000-01-01', '2005-01-01', freq='D'))
values_to_set = [{'value': 3, 'from': '2000-01', 'to': '2000-05'},
                 {'value': 5, 'from': '2000-06', 'to': '2000-09'}]
for v in values_to_set:
    df.loc[v['from']:v['to'], 'value'] = v['value']
df.loc['2000-09-28':'2000-10-02']
Option 2:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range('2000-01-01', '2005-01-01', freq='D'))
df.loc['2000-02-01', 'value'] = 5
df.loc['2000-05-01', 'value'] = 6
df.loc['2000-10-01', 'value'] = -1 # set stop value
df.ffill(inplace=True)
df.replace(-1, np.nan, inplace=True)
df.loc['2000-09-28':'2000-10-02']
Option 3:
This is a tricky solution; probably someone else will come up with a better one.
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range('2000-01-01', '2005-01-01', freq='D'))
df.loc['2000-02-01', 'value'] = 5
df.loc['2000-05-01', 'value'] = 6
_mask_1 = ~df['value'].isna() # filters non empty values
_mask_2 = (df.index.day==1) # filters 1st of each month
df.loc[_mask_1,'tmp'] = -1 # marks non empty values on a temporal column
df.loc[_mask_1|_mask_2, 'tmp'] = df['tmp'][_mask_1|_mask_2].shift(1) # moves temp values one month ahead
_mask_3 = df['tmp'] == -1 # filters next month non empty
df.loc[_mask_3, 'value'] = -1 # set stop value on 'value' column
df.drop(columns='tmp', inplace=True) # drops temporal column
# shows the stop mark for march
print(df['2000-02-28':'2000-03-02'])
# perform the forward filling
df.ffill(inplace=True)
df.replace(-1, np.nan, inplace=True)
print(df.loc['2000-02-28':'2000-03-02'])
print(df.loc['2000-05-28':'2000-06-02'])
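A more direct route, if the values only ever appear on the 1st, is to forward fill within each calendar month, so the fill can never spill into the following month. A minimal sketch assuming a DatetimeIndex (this is my variant, not one of the options above):

import pandas as pd

df = pd.DataFrame(index=pd.date_range('2000-01-01', '2000-12-31', freq='D'))
df.loc['2000-01-01', 'value'] = 5
df.loc['2000-02-01', 'value'] = 3

# ffill within each month only: a month with no value on the 1st stays NaN
df['value'] = df.groupby(df.index.to_period('M'))['value'].ffill()
print(df.loc['2000-02-27':'2000-03-02'])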
I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add another column, DELTA, that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between column EVENT2TIME and the EVENT1TIME value in the first row
You want to store the results into DELTA
You can do this as follows:
import pandas as pd
df = pd.read_csv('abc.txt')
print (df)
df['DELTA'] = df.iloc[:,2] - df.iloc[0,1]
print (df)
The output of this will be:
   USERID  EVENT1TIME  EVENT2TIME  MISC1  MISC2  DELTA
0     123        45.0         NaN    NaN    NaN    NaN
1     123         NaN        46.0    NaN    NaN    1.0
2     123         NaN        47.0    NaN    NaN    2.0
3     123         NaN        48.0    NaN    NaN    3.0
4     123         NaN        49.0    NaN    NaN    4.0
5     123         NaN        51.0    NaN    NaN    6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If you have multiple values every so often in EVENT1TIME, use some logic to back or forward fill all the empty rows for EVENT1TIME. This fill is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
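Since the sample actually contains more than one USERID, a per-user variant may be safer; a short sketch (my own, assuming each user's EVENT1TIME appears on their first row):

import pandas as pd

# broadcast each user's first non-null EVENT1TIME across all of that user's rows
first_event1 = df.groupby('USERID')['EVENT1TIME'].transform('first')
df['DELTA'] = df['EVENT2TIME'] - first_event1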
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np

locations = list(df[~np.isnan(df.EVENT1TIME)].index)
vals = df.EVENT1TIME.loc[locations]  # all non-null EVENT1TIME values
locations.append(df.EVENT1TIME.index[-1] + 1)  # last row index + 1 as a sentinel
last_loc = locations[0]
for next_loc in locations[1:]:
    # subtract this block's EVENT1TIME from its EVENT2TIME values
    df.loc[last_loc:next_loc - 1, 'DELTA'] = df.loc[last_loc:next_loc - 1, 'EVENT2TIME'] - vals[last_loc]
    last_loc = next_loc
I have a dataframe and I am struggling to create a column based on other columns. Here is the problem with sample data:
Date Target1 Close
0 2019-04-17 209.2440 203.130005
1 2019-04-17 212.2155 203.130005
2 2019-04-17 213.6330 203.130005
3 2019-04-17 213.0555 203.130005
4 2019-04-17 212.6250 203.130005
5 2019-04-17 212.9820 203.130005
6 2019-04-17 213.1395 203.130005
7 2019-04-16 209.2860 199.250000
8 2019-04-16 209.9055 199.250000
9 2019-04-16 210.3045 199.250000
I want to create another column (call it days_to_hit_target, for example) for each observation: when Close first hits (or comes very close to) a row's Target1 on a later day, the column should hold the difference in days between that day and the row's date.
This should work:
daysAboveTarget = []
for i in range(len(df.Date)):
    try:
        dayAboveTarget = df.iloc[i:].loc[(df.Close > df.Target1[i])]['Date'].iloc[0]
    except IndexError:
        dayAboveTarget = None
    daysAboveTarget.append(dayAboveTarget)
daysAboveTarget = pd.Series(daysAboveTarget)
df['days_to_hit_target'] = daysAboveTarget - df.Date
I sort of overused iloc and loc here, so let me explain.
The variable dayAboveTarget gets the date when the price closes above the target. The first iloc subsets the dataframe to only future dates, the first loc finds the actual results, the second iloc gets only the first result. We need the exception for days where the price never goes above target.
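One caveat, assuming the intent is to search forward in time: the sample frame is sorted newest-first, while df.iloc[i:] walks down the frame, so it is worth sorting ascending (and resetting the index) before the loop; for example:

import pandas as pd

# sort oldest-first so iloc[i:] really means this day and later
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)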
NOTE: I use Python 3.7.1 and pandas 0.23.4. I came up with something very dirty; I am sure there is a neater and more efficient way of doing this.
import numpy as np
import pandas as pd

### Create sample data
date_range = pd.date_range(start="1/1/2018", end="20/1/2018", freq="6H", closed="right")
target1 = np.random.uniform(10, 30, len(date_range))
close = [[i]*4 for i in np.random.uniform(10, 30, len(date_range)//4)]
close_flat = np.array([item for sublist in close for item in sublist])
df = pd.DataFrame(np.array([np.array(date_range.date), target1,
                            close_flat]).transpose(), columns=["date", "target", "close"])

### Create the column you need
# iterating over the days and finding days when the difference between
# "close" of the current day and any "target" is lower than 0.25 OR the
# "target" value is greater than the "close" value
thresh = 0.25
date_diff_arr = np.zeros(len(df))
for i in range(0, len(df), 4):
    diff_lt_thresh = df[(abs(df.target - df.close.iloc[i]) < thresh) | (df.target > df.close.iloc[i])]
    # only keep the findings from the next day onwards
    diff_lt_thresh = diff_lt_thresh.loc[i+4:]
    if not diff_lt_thresh.empty:
        # find day difference only if something under thresh is found
        days_diff = (diff_lt_thresh.iloc[0].date - df.iloc[i].date).days
    else:
        # otherwise write it as nan
        days_diff = np.nan
    # fill in the np.array which will be used to write to the df
    date_diff_arr[i:i+4] = days_diff

df["date_diff"] = date_diff_arr
Sample output:
        date   target    close  date_diff
0   2018-01-01  21.64    26.7319  2.0
1   2018-01-01  22.9047  26.7319  2.0
2   2018-01-01  26.0945  26.7319  2.0
3   2018-01-02  10.2155  26.7319  2.0
4   2018-01-02  17.5602  11.0507  1.0
5   2018-01-02  12.0368  11.0507  1.0
6   2018-01-02  19.5923  11.0507  1.0
7   2018-01-03  21.8168  11.0507  1.0
8   2018-01-03  11.5433  16.8862  1.0
9   2018-01-03  27.3739  16.8862  1.0
10  2018-01-03  26.9073  16.8862  1.0
11  2018-01-04  19.6677  16.8862  1.0
12  2018-01-04  25.3599  27.3373  1.0
13  2018-01-04  22.7479  27.3373  1.0
14  2018-01-04  18.7246  27.3373  1.0
15  2018-01-05  25.4122  27.3373  1.0
16  2018-01-05  28.3294  23.8469  1.0
Maybe a slightly faster solution:
import pandas as pd

# df is your DataFrame
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Date").reset_index(drop=True)  # the positional lookups below need a clean 0..n-1 index

def days_to_hit(x, no_hit_default=None):
    return next(
        ((df["Date"].iloc[j + x.name] - x["Date"]).days
         for j in range(len(df) - x.name)
         if df["Close"].iloc[j + x.name] >= x["Target1"]), no_hit_default)

df["days_to_hit_target"] = df.apply(days_to_hit, axis=1)
I have a csv file which is something like below
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select the values that are > 20 and < 500 (i.e. in the range 20 to 500) and put those values, along with the date, into another dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to get the date and value from the csv and add them to the new dataframe in the appropriate columns, something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
Now I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the max and min, but I am not sure how to select values based on a range. How can this be achieved?
Given dataframes df1 and df2, you can achieve this via aligning column names, cleaning numeric data, and then using pd.DataFrame.append.
import numpy as np
import pandas as pd

df_app = df1.loc[:, ['date', 'mean', 'min', 'std']]\
            .rename(columns={'date': 'Date'})\
            .replace(np.inf, 0)\
            .fillna(0)
print(df_app)

df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
print(df_app)

df_app = df_app[df_app['percentage_change'].between(20, 500)]
res = df2.append(df_app.loc[:, ['Date', 'percentage_change']])
print(res)
# Date location percentage_change
# 0 2018-02-14 BOM 23.440000
# 0 2018-03-15 NaN 100.000000
# 1 2018-03-16 NaN 90.000000
# 2 2018-03-17 NaN 143.219177
# 3 2018-03-18 NaN 100.000000
# 4 2018-03-20 NaN 100.000000
# 5 2018-03-22 NaN 119.053837
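One caveat worth flagging: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the last step would be written with pd.concat instead; an equivalent sketch:

import pandas as pd

# same result as df2.append(...) on modern pandas
res = pd.concat([df2, df_app.loc[:, ['Date', 'percentage_change']]])
print(res)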