Multiple dataframe transformations based on a single column - python

I looked for a similar question but did not find a solution for what I want to do. Any help is welcome.
Here is the code to build an example of my DataFrame:
import pandas as pd
L = [[0.1998,'IN TIME,IN TIME','19708,19708','MR SD#5 W/Z SD#6 X/Y',20.5],
[0.3983,'LATE,IN TIME','11206,18054','MR SD#4 A/B SD#1 C/D',19.97]]
df = pd.DataFrame(L,columns=['Time','status','F_nom','info','Delta'])
Output:
Time status F_nom info Delta
0 0.1998 IN TIME,IN TIME 19708,19708 MR SD#5 W/Z SD#6 X/Y 20.50
1 0.3983 LATE,IN TIME 11206,18054 MR SD#4 A/B SD#1 C/D 19.97
Based on the 'info' column, I would like to create two new rows for each row of my main DataFrame.
As you can see, each row of the 'info' column contains two different SD# entries, and I would like to have only one SD# per row.
I would also like to keep the corresponding values of the columns Time, status, F_nom and Delta.
Finally, I would like to create a new column 'type_info' that contains the specific string attached to each SD# (W/Z, A/B, etc.), all while keeping the index of my main DataFrame.
Here is the desired result:
I hope I was clear enough. Thank you in advance for your answers.

Use:
#split values by comma or whitespace
df['status'] = df['status'].str.split(',')
df['F_nom'] = df['F_nom'].str.split(',')
info = df.pop('info').str.split()
#select values by indexing
df['info'] = info.str[1::2]
df['type_info'] = info.str[2::2]
#reshape to Series
s = df.set_index(['Time','Delta']).stack()
#create new DataFrame and reshape to expected output
df1 = (pd.DataFrame(s.values.tolist(), index=s.index)
.stack()
.unstack(2)
.reset_index(level=2, drop=True)
.reset_index())
print (df1)
Time Delta status F_nom info type_info
0 0.1998 20.50 IN TIME 19708 SD#5 W/Z
1 0.1998 20.50 IN TIME 19708 SD#6 X/Y
2 0.3983 19.97 LATE 11206 SD#4 A/B
3 0.3983 19.97 IN TIME 18054 SD#1 C/D
Another solution:
df['status'] = df['status'].str.split(',')
df['F_nom'] = df['F_nom'].str.split(',')
info = df.pop('info').str.split()
df['info'] = info.str[1::2]
df['type_info'] = info.str[2::2]
from itertools import chain
lens = df['status'].str.len()
df = pd.DataFrame({
'Time' : df['Time'].values.repeat(lens),
'status' : list(chain.from_iterable(df['status'].tolist())),
'F_nom' : list(chain.from_iterable(df['F_nom'].tolist())),
'info' : list(chain.from_iterable(df['info'].tolist())),
'Delta' : df['Delta'].values.repeat(lens),
'type_info' : list(chain.from_iterable(df['type_info'].tolist())),
})
print (df)
Time status F_nom info Delta type_info
0 0.1998 IN TIME 19708 SD#5 20.50 W/Z
1 0.1998 IN TIME 19708 SD#6 20.50 X/Y
2 0.3983 LATE 11206 SD#4 19.97 A/B
3 0.3983 IN TIME 18054 SD#1 19.97 C/D
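On newer pandas versions, DataFrame.explode offers a shorter route to the same result and also keeps the original index (duplicated), which is what the question asks for. A minimal sketch, starting again from the original df and assuming pandas 1.3+ (where explode accepts a list of columns):
#split the comma-separated columns into lists, as above
df['status'] = df['status'].str.split(',')
df['F_nom'] = df['F_nom'].str.split(',')
info = df.pop('info').str.split()
df['info'] = info.str[1::2]
df['type_info'] = info.str[2::2]
#explode all list columns together, keeping the original index
out = df.explode(['status', 'F_nom', 'info', 'type_info'])
print (out)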

Related

Trouble with Finding Date Range in Pandas

I have a dataset which has a list of subjects, a start date, and an end date. I'm trying to write a loop so that for each subject I get a list of dates between the start date and the end date. I've tried many ways to do this based on previous posts, but I'm still having issues.
an example of the dataframe:
Participant # Start_Date End_Date
1 23-04-19 25-04-19
An example of the output I want:
Participant # Range
1 23-04-19
1 24-04-19
1 25-04-19
Right now my code looks like this:
subjs_490 = tracksheet_490['Participant #']
for subj_490 in subjs_490:
    temp_a = tracksheet_490[tracksheet_490['Participant #'].isin([subj_490])]
    start = temp_a['Start_Date']
    end = temp_a['End_Date']
    start_dates = pd.to_datetime(pd.Series(start), format='%d-%m-%y')
    end_dates = pd.to_datetime(pd.Series(end), format='%d-%m-%y')
    date_range = pd.date_range(start_dates, end_dates).tolist()
With this method I'm getting the following error:
Cannot convert input [1 2016-05-03 Name: Start_Date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
Expanding ranges tends to be a slow process. You can create the date_range and then explode it to get what you want. Moving 'Participant #' to the index makes sure it's repeated for all rows that are exploded.
# parse the date columns first, then build one date_range per row
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%d-%m-%y')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%d-%m-%y')

df = (df.set_index('Participant #')
        .apply(lambda x: pd.date_range(x.Start_Date, x.End_Date), axis=1) # :( slow
        .rename('Range')
        .explode()
        .reset_index())
Participant # Range
0 1 2019-04-23
1 1 2019-04-24
2 1 2019-04-25
If you can't use explode, another option is to create a separate DataFrame for each row and then concat them all together.
pd.concat([pd.DataFrame({'Participant #': par, 'Range': pd.date_range(start, end)})
           for par, start, end in zip(df['Participant #'], df['Start_Date'], df['End_Date'])],
          ignore_index=True)
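As an aside on the error in the question: pd.date_range expects scalar start and end values, but temp_a['Start_Date'] is a one-row Series, which is what triggers the "Cannot convert input ... to Timestamp" message. A minimal fix sketch for the original loop, keeping the variable names from the question:
for subj_490 in subjs_490:
    temp_a = tracksheet_490[tracksheet_490['Participant #'].isin([subj_490])]
    # .iloc[0] pulls the scalar out of the one-row Series before building the range
    start = pd.to_datetime(temp_a['Start_Date'], format='%d-%m-%y').iloc[0]
    end = pd.to_datetime(temp_a['End_Date'], format='%d-%m-%y').iloc[0]
    date_range = pd.date_range(start, end).tolist()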

How to split a pandas data frame column to multiple columns based on the text value contained

Let's say there is a column like the one below.
df = pd.DataFrame(['A-line B-station 9-min C-station 3-min',
'D-line E-station 8-min F-line G-station 5-min',
'G-line H-station 1-min I-station 6-min J-station 8-min'],
columns=['station'])
A, B, C are just arbitrary characters, and there are a whole bunch of rows like this.
station
0 A-line B-station 9-min C-station 3-min
1 D-line E-station 8-min F-line G-station 5-min
2 G-line H-station 1-min I-station 6-min J-stati...
How can we make columns like below?
Line1 Station1-1 Station1-2 Station1-3 Line2 Station2-1
0 A-line B-station C-station null null null
1 D-line E-station null null F-line G-station
2 G-line H-station I-station J-station null null
StationX-Y means Station (line number)-(order of station):
Station1-1 means the first station of the first line (Line1)
Station1-2 means the second station of the first line (Line1)
Station2-1 means the first station of the second line (Line2)
I tried to split by delimiter; however, it doesn't work since every row has a different number of lines and stations.
What I maybe need is to split the column based on the text each token contains. For example, I could store the first '-line' token in Line1 and the first '-station' token in Station1-1.
Does anybody have any ideas how to do this?
Any small thoughts help me!
Thank you!
First create a Series with Series.str.split and DataFrame.stack:
s = df['station'].str.split(expand=True).stack()
Then remove the values ending with 'min' by boolean indexing with Series.str.endswith:
df1 = s[~s.str.endswith('min')].to_frame('data').rename_axis(('a','b'))
Then create counters for the line tokens and for the station tokens with filtering and GroupBy.cumcount:
df1['Line'] = (df1[df1['data'].str.endswith('line')]
                   .groupby(level=0)
                   .cumcount()
                   .add(1)
                   .astype(str))
df1['Line'] = df1['Line'].ffill()
df1['station'] = (df1[df1['data'].str.endswith('station')]
                      .groupby(['a','Line'])
                      .cumcount()
                      .add(1)
                      .astype(str))
Join the two counters with '-' and replace the missing values (the line rows) with df1['Line'] using Series.fillna:
df1['station'] = (df1['Line'] + '-' + df1['station']).fillna(df1['Line'])
Reshape with DataFrame.set_index and DataFrame.unstack:
df1 = df1.set_index('station', append=True)['data'].reset_index(level=1, drop=True).unstack()
Rename the column names - not earlier, to avoid the columns ending up wrongly sorted:
df1 = df1.rename(columns = lambda x: 'Station' + x if '-' in x else 'Line' + x)
Remove the columns and index names:
df1.columns.name = None
df1.index.name = None
print (df1)
Line1 Station1-1 Station1-2 Station1-3 Line2 Station2-1
0 A-line B-station C-station NaN NaN NaN
1 D-line E-station NaN NaN F-line G-station
2 G-line H-station I-station J-station NaN NaN
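For convenience, here are the steps above assembled into one runnable block (same logic, just gathered together):
import pandas as pd

df = pd.DataFrame(['A-line B-station 9-min C-station 3-min',
                   'D-line E-station 8-min F-line G-station 5-min',
                   'G-line H-station 1-min I-station 6-min J-station 8-min'],
                  columns=['station'])

# one token per row, then drop the '-min' tokens
s = df['station'].str.split(expand=True).stack()
df1 = s[~s.str.endswith('min')].to_frame('data').rename_axis(('a','b'))
# number the lines per original row, forward-fill onto the stations
df1['Line'] = (df1[df1['data'].str.endswith('line')]
                   .groupby(level=0).cumcount().add(1).astype(str))
df1['Line'] = df1['Line'].ffill()
# number the stations per (row, line) group
df1['station'] = (df1[df1['data'].str.endswith('station')]
                      .groupby(['a','Line']).cumcount().add(1).astype(str))
df1['station'] = (df1['Line'] + '-' + df1['station']).fillna(df1['Line'])
df1 = df1.set_index('station', append=True)['data'].reset_index(level=1, drop=True).unstack()
df1 = df1.rename(columns=lambda x: 'Station' + x if '-' in x else 'Line' + x)
df1.columns.name = None
df1.index.name = None
print (df1)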

check for date and time between two columns in pandas data frame

I have two data frames:
The first data frame is:
import pandas as pd
df1 = pd.DataFrame({'serialNo':['aaaa','bbbb','cccc','ffff','aaaa','bbbb','aaaa'],
'Name':['Sayonti','Ruchi','Tony','Gowtam','Toffee','Tom','Sayonti'],
'testName': [4402, 3747 ,5555,8754,1234,9876,3602],
'moduleName': ['singing', 'dance','booze', 'vocals','drama','paint','singing'],
'endResult': ['WARNING', 'FAILED', 'WARNING', 'FAILED','WARNING','FAILED','WARNING'],
'Date':['2018-10-5','2018-10-6','2018-10-7','2018-10-8','2018-10-9','2018-10-10','2018-10-8'],
'Time_df1':['23:26:39','22:50:31','22:15:28','21:40:19','21:04:15','20:29:11','19:54:03']})
The second data frame is:
df2 = pd.DataFrame({'serialNo':['aaaa','bbbb','aaaa','ffff','xyzy','aaaa'],
'Food':['Strawberry','Coke','Pepsi','Nuts','Apple','Candy'],
'Work': ['AP', 'TC','OD', 'PU','NO','PM'],
'Date':['2018-10-1','2018-10-6','2018-10-2','2018-10-3','2018-10-5','2018-10-10'],
'Time_df2':['09:00:00','10:00:00','11:00:00','12:00:00','13:00:00','14:00:00']
})
I am joining the two based on serial number:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
result = pd.merge(df1,df2,on=['serialNo'],how='inner')
Now I want Date_y to lie within 3 days of Date_x, starting from Date_x, which means Date_x + (1, 2, 3 days) should cover Date_y. I can get that as below, but I also want to check a time range, which I do not know how to achieve:
result = result[result.Date_x.sub(result.Date_y).dt.days.between(0,3)]
I want to check the time such that Time_df2 is within 6 hours of the start time, Time_df1. Please help?
You could have a column within your dataframe that combines the date and the time. Here's an example of combining a single row in the dataframe:
import datetime

# Combining Date_x and Time_df1
value_1_x = datetime.datetime.combine(result['Date_x'][0].date(),
                                      datetime.datetime.strptime(result['Time_df1'][0], '%H:%M:%S').time())
# Combining Date_y and Time_df2
value_2_y = datetime.datetime.combine(result['Date_y'][0].date(),
                                      datetime.datetime.strptime(result['Time_df2'][0], '%H:%M:%S').time())
Then given two datetime objects, you can simply subtract to find the difference you are looking for:
difference = value_1_x - value_2_y
print(difference)
Which gives the output:
4 days, 14:26:39
My understanding is that you are looking to see if something is within 3 days and 6 hours (or a total of 78 hours). You can convert this to hours easily, and then make the desired comparison:
hours_difference = abs(value_1_x - value_2_y).total_seconds() / 3600.0
print(hours_difference)
Which gives the output:
110.44416666666666
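To apply the same idea to the whole merged frame rather than one row at a time, here is a vectorized sketch; it assumes the column names produced by the merge above and reads the requirement as the 78-hour window described:
# add the time of day to the date columns to get full timestamps
start_x = result['Date_x'] + pd.to_timedelta(result['Time_df1'])
start_y = result['Date_y'] + pd.to_timedelta(result['Time_df2'])
# hours between the two timestamps, then keep rows within 78 hours
hours_diff = (start_x - start_y).abs().dt.total_seconds() / 3600.0
within_window = result[hours_diff <= 78]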
Hope that helps!

iterations over list in dataframe

I have the following issue:
I have a dataframe with 3 columns:
The first is UserID, the second is InvoiceType and the third is the creation time of the invoice.
df = pd.read_csv('invoice.csv')
Output:
UserID InvoiceType CreateTime
1 a 2018-01-01 12:31:00
2 b 2018-01-01 12:34:12
3 a 2018-01-01 12:40:13
1 c 2018-01-09 14:12:25
2 a 2018-01-12 14:12:29
1 b 2018-02-08 11:15:00
2 c 2018-02-12 10:12:12
I am trying to plot the invoice cycle for each user. I need to create 2 new columns, time_diff and time_diff_wrt_first_invoice. time_diff will represent the time difference between consecutive invoices for each user, and time_diff_wrt_first_invoice will represent the time difference between each invoice and the first one, which is what I need for plotting purposes. This is my code:
"""
********** Exploding a variable that is a list in each dataframe cell
"""
def explode_list(df,x):
return (df[x].apply(pd.Series)
.stack()
.reset_index(level = 1, drop=True)
.to_frame(x))
"""
****** applying explode_list to all the columns ******
"""
def explode_listDF(df):
exploaded_df = pd.DataFrame()
for x in df.columns.tolist():
exploaded_df = pd.concat([exploaded_df, explode_list(df,x)],
axis = 1)
return exploaded_df
"""
******** Getting the time difference column in pivot table format
"""
def pivoted_diffTime(df1, _freq=60):
# _ freq is 1 for minutes frequency
# _freq is 60 for hour frequency
# _ freq is 60*24 for daily frequency
# _freq is 60*24*30 for monthly frequency
df = df.sort_values(['UserID', 'CreateTime'])
df_pivot = df.pivot_table(index = 'UserID',
aggfunc= lambda x : list(v for v in x)
)
df_pivot['time_diff'] = [[0]]*len(df_pivot)
for user in df_pivot.index:
try:
_list = [0]+[math.floor((x - y).total_seconds()/(60*_freq))
for x,y in zip(df_pivot.loc[user, 'CreateTime'][1:],
df_pivot.loc[user, 'CreateTime'][:-1])]
df_pivot.loc[user, 'time_diff'] = _list
except:
print('There is a prob here :', user)
return df_pivot
"""
***** Pipelining the two functions to obtain an exploaded dataframe
with time difference ******
"""
def get_timeDiff(df, _frequency):
df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))
return df
And once I have time_diff, I am creating time_diff_wrt_first_invoice this way:
# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] = [[0]] * len(df_with_timeDiff)
# Then we loop over users and apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
    df_with_timeDiff.loc[df_with_timeDiff.UserID == user, 'time_diff_wrt_first_invoice'] = \
        np.cumsum(df_with_timeDiff.loc[df_with_timeDiff.UserID == user, 'time_diff'])
The problem is that I have a dataframe with hundreds of thousands of users, and this is very time consuming. I am wondering if there is a solution that better fits my need.
Check out .loc[] for pandas.
df_1 = pd.DataFrame(some_stuff)
df_2 = df_1.loc[df_1['some_column'] >= some_condition, 'specific_column']
You can access specific columns, check for certain kinds of conditions, and if you add a comma after the condition followed by a specific column name, it will only return that column.
I'm not 100% sure if that answers whatever question you're asking, because I didn't actually see one, but it seemed like you were running a lot of for loops and stuff to isolate columns, which is what .loc[] is for.
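For instance, a minimal self-contained sketch of that pattern (the column names here are invented for illustration):
import pandas as pd

df_1 = pd.DataFrame({'UserID': [1, 2, 3], 'amount': [10, 25, 40]})
# boolean condition on one column, returning only another column
df_2 = df_1.loc[df_1['amount'] >= 20, 'UserID']
print (df_2)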
I have found a better solution. Here's my code :
def next_diff(x):
    return ([0] + [(b - a).total_seconds() / 3600 for b, a in zip(x[1:], x[:-1])])

def create_timediff(df):
    df.sort_values(['UserID', 'CreateTime'], inplace=True)
    a = df.groupby('UserID').agg({'CreateTime': lambda x: list(v for v in x)}).CreateTime.apply(next_diff)
    b = a.apply(np.cumsum)
    a = a.reset_index()
    b = b.reset_index()
    # Here I explode the lists inside the cells
    rows1 = []
    _ = a.apply(lambda row: [rows1.append([row['UserID'], nn])
                             for nn in row.CreateTime], axis=1)
    rows2 = []
    __ = b.apply(lambda row: [rows2.append([row['UserID'], nn])
                              for nn in row.CreateTime], axis=1)
    df1_new = pd.DataFrame(rows1, columns=a.columns).set_index(['UserID'])
    df2_new = pd.DataFrame(rows2, columns=b.columns).set_index(['UserID'])
    df = df.set_index('UserID')
    df['time_diff'] = df1_new['CreateTime']
    df['time_diff_wrt_first_invoice'] = df2_new['CreateTime']
    df.reset_index(inplace=True)
    return df
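For reference, a fully vectorized sketch of the same two columns using groupby with diff and cumsum, assuming CreateTime has already been parsed with pd.to_datetime; this avoids building the intermediate lists entirely:
df = df.sort_values(['UserID', 'CreateTime'])
# hours since the previous invoice of the same user (0 for the first one)
df['time_diff'] = (df.groupby('UserID')['CreateTime']
                     .diff()
                     .dt.total_seconds()
                     .div(3600)
                     .fillna(0))
# cumulative hours since each user's first invoice
df['time_diff_wrt_first_invoice'] = df.groupby('UserID')['time_diff'].cumsum()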

Python pandas dataframe sort_values does not work

I have the following pandas data frame which I want to sort by 'test_type'
test_type tps mtt mem cpu 90th
0 sso_1000 205.263559 4139.031090 24.175933 34.817701 4897.4766
1 sso_1500 201.127133 5740.741266 24.599400 34.634209 6864.9820
2 sso_2000 203.204082 6610.437558 24.466267 34.831947 8005.9054
3 sso_500 189.566836 2431.867002 23.559557 35.787484 2869.7670
My code to load the dataframe and sort it is below; the first print line prints the data frame above.
df = pd.read_csv(file) #reads from a csv file
print df
df = df.sort_values(by=['test_type'], ascending=True)
print '\nAfter sort...'
print df
After doing the sort and printing the dataframe content, the data frame still looks like below.
Program output:
After sort...
test_type tps mtt mem cpu 90th
0 sso_1000 205.263559 4139.031090 24.175933 34.817701 4897.4766
1 sso_1500 201.127133 5740.741266 24.599400 34.634209 6864.9820
2 sso_2000 203.204082 6610.437558 24.466267 34.831947 8005.9054
3 sso_500 189.566836 2431.867002 23.559557 35.787484 2869.7670
I expect row 3 (test_type sso_500) to be at the top after sorting. Can someone help me figure out why it's not working as it should?
The sort does work, but it is a lexicographic string sort, so 'sso_1000' comes before 'sso_500'. Presumably, what you're trying to do is sort by the numerical value after sso_. You can do this as follows:
import numpy as np
df.ix[np.argsort(df.test_type.str.split('_').str[-1].astype(int).values)]
This splits the strings at '_', converts what comes after that character to a numerical value, finds the indices that sort those numerical values, and reorders the DataFrame according to these indices.
Example
In [15]: df = pd.DataFrame({'test_type': ['sso_1000', 'sso_500']})
In [16]: df.sort_values(by=['test_type'], ascending=True)
Out[16]:
test_type
0 sso_1000
1 sso_500
In [17]: df.ix[np.argsort(df.test_type.str.split('_').str[-1].astype(int).values)]
Out[17]:
test_type
1 sso_500
0 sso_1000
Alternatively, you could extract the numbers from test_type and sort them, then reindex the DataFrame according to those indices.
df.reindex(df['test_type'].str.extract('(\d+)', expand=False) \
.astype(int).sort_values().index).reset_index(drop=True)
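On pandas 1.1+, sort_values also accepts a key argument, so a minimal sketch of the same natural sort without helper indices:
df.sort_values(by='test_type',
               key=lambda s: s.str.extract(r'(\d+)', expand=False).astype(int)
              ).reset_index(drop=True)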
