How can I extract the strings that match certain words?
Here is my text:
ID Event Name Event Type
0 1 Taltz Seminar for Dermathologists Out of office
2 3 Experiment Results for Taltz In Ofice
3 4 Use of Taltz in Rheumathology OUTOFOFFICE
5 6 RHeums Experiences with Taltz IO
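For reference, the same data as a DataFrame might be built like this (a minimal sketch so the code below is reproducible; the pasted index values are approximated with a default RangeIndex):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 3, 4, 6],
    'Event Name': ['Taltz Seminar for Dermathologists',
                   'Experiment Results for Taltz',
                   'Use of Taltz in Rheumathology',
                   'RHeums Experiences with Taltz'],
    'Event Type': ['Out of office', 'In Ofice', 'OUTOFOFFICE', 'IO'],
})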
How can I get the strings Dermathologists and Rheumathology using regex?
I have tried this:
import re

pattern = r'(derma\w+),\s(RHeums\w+).*'
df_named = df['Event Name'].str.extract(pattern, flags=re.I)
df_clean = df_named.reindex(columns=['dermatological', 'rheumatological'])
df_clean.head()
Your single pattern expects both words, separated by a comma and a space, in the same string, which never occurs in the data. Instead, extract each pattern separately and convert the Series to a new DataFrame:
df_named = pd.DataFrame({
    'dermatological': df['Event Name'].str.extract(r'(derma\w+)', flags=re.I, expand=False),
    'rheumatological': df['Event Name'].str.extract(r'(RHeuma\w+)', flags=re.I, expand=False),
})
print(df_named)
dermatological rheumatological
0 Dermathologists NaN
1 NaN NaN
2 NaN Rheumathology
3 NaN NaN
If you need to append the new columns to the original DataFrame, use DataFrame.assign:
df_named = df.assign(
    dermatological=df['Event Name'].str.extract(r'(derma\w+)', flags=re.I, expand=False),
    rheumatological=df['Event Name'].str.extract(r'(RHeum\w+)', flags=re.I, expand=False),
)
print(df_named)
ID Event Name Event Type dermatological \
0 1 Taltz Seminar for Dermathologists Out of office Dermathologists
1 3 Experiment Results for Taltz In Ofice NaN
2 4 Use of Taltz in Rheumathology OUTOFOFFICE NaN
3 6 RHeums Experiences with Taltz IO NaN
rheumatological
0 NaN
1 NaN
2 Rheumathology
3 RHeums
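A variant worth noting: str.extract creates one column per named group, so a single pattern with alternation can produce both columns in one call. A minimal sketch, assuming the same df (with the rheuma stem, 'RHeums' is not matched, mirroring the first output above; the RHeum stem would include it):

import re

df_named = df['Event Name'].str.extract(
    r'(?P<dermatological>derma\w+)|(?P<rheumatological>rheuma\w+)',
    flags=re.I)
print(df_named)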
I have a dataframe df_corp:
ID arrival_date leaving_date
1 01/02/20 05/02/20
2 01/03/20 07/03/20
1 12/02/20 20/02/20
1 07/03/20 10/03/20
2 10/03/20 15/03/20
I would like to find the difference between leaving_date of a row and arrival date of the next entry with respect to ID. Basically I want to know how long before they book again.
So it'll look something like this:
ID arrival_date leaving_date time_between
1 01/02/20 05/02/20 NaN
2 01/03/20 07/03/20 NaN
1 12/02/20 20/02/20 7
1 07/03/20 10/03/20 15
2 10/03/20 15/03/20 3
I've tried grouping by ID to do the sum, but I'm seriously lost on how to get the value from the next row and a different column in one step.
You need to convert the date columns with to_datetime and perform a GroupBy.shift to get the previous departure date within each ID:
# arrival
a = pd.to_datetime(df_corp['arrival_date'], dayfirst=True)
# previous departure per ID
l = pd.to_datetime(df_corp['leaving_date'], dayfirst=True).groupby(df_corp['ID']).shift()
# difference in days
df_corp['time_between'] = (a-l).dt.days
Output:
ID arrival_date leaving_date time_between
0 1 01/02/20 05/02/20 NaN
1 2 01/03/20 07/03/20 NaN
2 1 12/02/20 20/02/20 7.0
3 1 07/03/20 10/03/20 16.0
4 2 10/03/20 15/03/20 3.0
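One caveat worth adding (an assumption, not something from the question): GroupBy.shift takes the previous row in the current row order, so if the bookings are not already chronological within each ID, sort first. A minimal sketch:

df_corp['arrival_date'] = pd.to_datetime(df_corp['arrival_date'], dayfirst=True)
df_corp['leaving_date'] = pd.to_datetime(df_corp['leaving_date'], dayfirst=True)
# keep bookings in chronological order within each ID before shifting
df_corp = df_corp.sort_values(['ID', 'arrival_date'])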
In a pandas data frame I would like to find the mean values of a column, grouped by a 'customized' year.
An example would be to compute the mean values of school marks for a school year (e.g. Sep/YYYY to Aug/YYYY+1).
The pandas docs give some information on offsets, business years, etc., but I can't really make sense of it to get a working example.
Here is a minimal example where mean values of school marks are computed per year (Jan-Dec), which is what I do not want.
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.randint(low=1, high=5, size=36),
                  index=pd.date_range('2001-09-01', freq='M', periods=36),
                  columns=['marks'])
df_yearly = df.groupby(pd.Grouper(freq="A")).mean()
This could yield e.g.:
print(df):
marks
2001-09-30 1
2001-10-31 4
2001-11-30 2
2001-12-31 1
2002-01-31 4
2002-02-28 1
2002-03-31 2
2002-04-30 1
2002-05-31 3
2002-06-30 3
2002-07-31 3
2002-08-31 3
2002-09-30 4
2002-10-31 1
...
2003-11-30 4
2003-12-31 2
2004-01-31 1
2004-02-29 2
2004-03-31 1
2004-04-30 3
2004-05-31 4
2004-06-30 2
2004-07-31 2
2004-08-31 4
print(df_yearly):
marks
2001-12-31 2.000000
2002-12-31 2.583333
2003-12-31 2.666667
2004-12-31 2.375000
My desired output would correspond to something like:
2001-09/2002-08 mean_value
2002-09/2003-08 mean_value
2003-09/2004-08 mean_value
Many thanks!
We can manually compute the school years:
# months >= 9 (Sep-Dec) are moved to the next (school) year
school_years = df.index.year + (df.index.month > 8).astype(int)
Another option is to use a fiscal year starting in September:
school_years = df.index.to_period('Q-AUG').qyear
And then we can group by it:
df.groupby(school_years).mean()
Output:
marks
2002 2.333333
2003 2.500000
2004 2.500000
One more approach:
a = (df.index.month == 9).cumsum()
val = df.groupby(a, sort=False)['marks'].mean().reset_index()
dates = df.index.to_series().groupby(a, sort=False).agg(['first', 'last']).reset_index()
dates.merge(val, on='index')
Output:
index first last marks
0 1 2001-09-30 2002-08-31 2.750000
1 2 2002-09-30 2003-08-31 2.333333
2 3 2003-09-30 2004-08-31 2.083333
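Yet another variant, closest to the asker's original pd.Grouper attempt: anchored annual frequencies such as 'A-AUG' end each yearly bin in August. A sketch (bins are labeled by their end dates, e.g. 2002-08-31, rather than by a Sep/Aug range):

# each annual bin anchored to August covers Sep/YYYY through Aug/YYYY+1
df_yearly = df.groupby(pd.Grouper(freq='A-AUG')).mean()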
I extracted some tables from an HTML file using pandas and BeautifulSoup. However, the output is really messy. For some reason, pandas adds columns and leaves the original columns misaligned: the header and the rest of the values no longer line up. Similarly, it splits a cell into separate columns when ')' or '%' appear in it.
This is the code I used to extract the table I want:
import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get('https://www.sec.gov/Archives/edgar/data/1961/0001264931-18-000031.txt')
soup = BeautifulSoup(r.content, "lxml")
tables = soup.find_all('table')
for i, table in enumerate(tables):
    if (('income tax' in str(table)) or ('Income tax' in str(table))) and \
       (('Statutory' in str(table)) or ('statutory' in str(table))):
        df = pd.read_html(str(table))
        print(df[0])
Not sure why df[0] prints the whole list, but that is no problem for me. The table I am interested in looks like:
[ 0 1 2 3 4 \
0 NaN NaN 2017.0 NaN 2016
1 Income tax computed at the federal statutory rate NaN NaN 34 %
2 Income tax computed at the state statutory rate NaN NaN 5 %
3 Valuation allowance NaN NaN (39 )%
4 Total deferred tax asset NaN NaN 0 %
5 6 7 8
0 NaN NaN NaN NaN
1 NaN NaN 34 %
2 NaN NaN 5 %
3 NaN NaN (39 )%
4 NaN NaN 0 % ]
If I display the HTML code this table is based on, it looks clean, with the columns aligned. However, through pandas it is messy. Does anyone know how to solve this?
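One common first cleanup, offered as a guess rather than a confirmed fix: filings like this often use empty spacer cells for layout, which read_html turns into all-NaN columns, and those can simply be dropped. The split '(39' / ')%' cells would still need to be stitched back together separately:

cleaned = df[0].dropna(axis=1, how='all')  # drop columns that are entirely NaN
print(cleaned)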
Each row in this dataframe represents an order, and the executionStatus.x columns hold some info about the order status.
Those executionStatus.x columns are created automatically by amirziai's flatten_json, depending on how many statuses there are. So if there are 3 statuses for one order, as in row 0, there will be columns up to executionStatus.2. Since rows 1 and 2 only have one status each, they only have values in executionStatus.0.
My problem is that I cannot match "ORDER_FULFILLED" directly, because I don't know how many executionStatus columns there will be, and I would need to write the exact column name, like so: df[df['executionStatus.0'].str.match('ORDER_FULFILLED')].
executionStatus.0 executionStatus.1 executionStatus.2 \
0 REQUESTED_AMOUNT_ROUNDED MEOW ORDER_FULFILLED
1 ORDER_FULFILLED NaN NaN
2 NOT_AN_INFUNDING_LOAN NaN NaN
investedAmount loanId requestedAmount OrderInstructId
0 50.0 22222 55.0 55555
1 25.0 33333 25.0 55555
2 0.0 44444 25.0 55555
Is there a way to get the entire row, or the index, wherever the "ORDER_FULFILLED" element appears in the dataframe?
Ideally, the matched dataframe should look like this, because rows 0 and 1 have ORDER_FULFILLED among their execution statuses and row 2 does not, so it should be excluded. Thanks!
investedAmount loanId requestedAmount OrderInstructId
0 50.0 22222 55.0 55555
1 25.0 33333 25.0 55555
Use df.filter() to get all the similar columns containing executionStatus, then reduce them to a boolean mask:
df[df.filter(like='executionStatus').eq('ORDER_FULFILLED').any(axis=1)]
executionStatus.0 executionStatus.1 executionStatus.2 \
0 REQUESTED_AMOUNT_ROUNDED MEOW ORDER_FULFILLED
1 ORDER_FULFILLED NaN NaN
investedAmount loanId requestedAmount OrderInstructId
0 50 22222 55 55555
1 25 33333 25 55555
If you want to drop the executionStatus columns from the output, use:
df.loc[df.filter(like='executionStatus').eq('ORDER_FULFILLED').any(axis=1),
       df.columns.difference(df.filter(like='executionStatus').columns)]
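And if only the matching index is needed rather than the full rows, the same mask can be reused (a small sketch):

mask = df.filter(like='executionStatus').eq('ORDER_FULFILLED').any(axis=1)
matched_idx = df.index[mask]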
I have a dataframe (df), originally from an Excel file, and the first 9 rows look like this:
Control Recd_Date/Due_Date Action Signature/Requester
0 2000-1703 2000-01-31 00:00:00 OC/OER/OPA/PMS/ M WEBB
1 NaN 2000-02-29 00:00:00 NaN DATA CORP
2 2000-1776 2000-01-02 00:00:00 OC/ORA/OE/DCP/ G KAN
3 NaN 2000-01-03 00:00:00 OC/ORA/ORO/PNC/ PALM POST
4 NaN NaN FDA/OGROP/ORA/SE-FO/FLA- NaN
5 NaN NaN DO/FLA-CB/ NaN
6 2000-1983 2000-02-02 00:00:00 FDA/OGROP/ORA/CE-FO/CHI- M EGAN
7 NaN 2000-02-03 00:00:00 DO/CHI-CB/ BERNSTEIN LIEBHARD &
8 NaN NaN NaN LONDON LLP
type(df['Control'][1]) is float
type(df['Recd_Date/Due_Date'][1]) is datetime.datetime
type(df['Action_Office'][1]) is float
type(df['Signature/Requester'][1]) is unicode
I want to transform this dataframe (e.g. the first 9 rows) into this:
Control Recd_Date/Due_Date Action Signature/Requester
0 2000-1703 2000-01-31 00:00:00,2000-02-29 00:00:00 OC/OER/OPA/PMS/ M WEBB,DATA CORP
1 2000-1776 2000-01-02 00:00:00,2000-01-03 00:00:00 OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/ G KAN,PALM POST
2 2000-1983 2000-02-02 00:00:00,2000-02-03 00:00:00 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/ M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
So basically:
Every time pd.isnull(row['Control']) is true (this should be the only if condition), merge that row into the previous row whose 'Control' value is not null.
For 'Recd_Date/Due_Date' and 'Signature/Requester', add ',' (or '/') between each two merged values (e.g. '2000-01-31 00:00:00,2000-02-29 00:00:00' and 'G KAN,PALM POST').
For 'Action', simply concatenate the values without any punctuation added (e.g. FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/).
Can anyone help me out, please? This is the code I'm trying to get to work:
for i, row in df.iterrows():
    if pd.isnull(df.ix[i]['Control_#']):
        df.ix[i-1]['Recd_Date/Due_Date'] = str(df.ix[i-1]['Recd_Date/Due_Date']) + '/' + str(df.ix[i]['Recd_Date/Due_Date'])
        df.ix[i-1]['Subject'] = str(df.ix[i-1]['Subject']) + ' ' + str(df.ix[i]['Subject'])
        if str(df.ix[i-1]['Action_Office'])[-1] == '-':
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office']) + str(df.ix[i]['Action_Office'])
        else:
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office']) + ',' + str(df.ix[i]['Action_Office'])
        if pd.isnull(df.ix[i-1]['Signature/Requester']):
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester']) + str(df.ix[i]['Signature/Requester'])
        elif str(df.ix[i-1]['Signature/Requester'])[-1] == '&':
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester']) + ' ' + str(df.ix[i]['Signature/Requester'])
        else:
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester']) + ',' + str(df.ix[i]['Signature/Requester'])
        df.drop(df.index[i])
How come the drop() doesn't work? I am trying to drop the current row (if its ['Control_#'] is null) so that each next row whose ['Control_#'] is null can be added to the previous row (whose ['Control_#'] is NOT null) iteratively.
Much appreciated!!
I think you need to group the rows together and then join up the column values. The tricky part is finding a way to group together the rows in the way you want. Here is my solution...
1) Grouping Together the Rows: Static variables
Since your groups depend on a sequence in your rows, I used a static variable in a method to label every row with a specific group:
def rolling_group(val):
    if pd.notnull(val):
        rolling_group.group += 1  # a non-null value signals the start of a new group
    return rolling_group.group
rolling_group.group = 0  # static variable
This method is applied along the Control series to sort the indexes into groups, which are then used to split up the dataframe so you can merge the rows:
groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
That is really the only tricky part; after that, you can just merge the rows by applying a function to each group that produces your desired output.
Full Solution Code
import re
import pandas as pd
from io import StringIO

def rolling_group(val):
    if pd.notnull(val):
        rolling_group.group += 1  # a non-null value signals the start of a new group
    return rolling_group.group
rolling_group.group = 0  # static variable

def joinFunc(g, column):
    col = g[column]
    joiner = "/" if column == "Action" else ","
    # join the non-null values, converting each to a string first
    s = joiner.join([str(each) for each in col if pd.notnull(each)])
    s = re.sub("(?<=&)" + joiner, " ", s)  # after '&', join with a space instead
    s = re.sub("(?<=-)" + joiner, "", s)   # after '-', join with nothing
    s = re.sub(joiner * 2, joiner, s)      # collapse an accidental double joiner
    return s

if __name__ == "__main__":
    df = """Control    Recd_Date/Due_Date   Action                    Signature/Requester
0  2000-1703  2000-01-31 00:00:00  OC/OER/OPA/PMS/           M WEBB
1  NaN        2000-02-29 00:00:00  NaN                       DATA CORP
2  2000-1776  2000-01-02 00:00:00  OC/ORA/OE/DCP/            G KAN
3  NaN        2000-01-03 00:00:00  OC/ORA/ORO/PNC/           PALM POST
4  NaN        NaN                  FDA/OGROP/ORA/SE-FO/FLA-  NaN
5  NaN        NaN                  DO/FLA-CB/                NaN
6  2000-1983  2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-  M EGAN
7  NaN        2000-02-03 00:00:00  DO/CHI-CB/                BERNSTEIN LIEBHARD &
8  NaN        NaN                  NaN                       LONDON LLP"""
    df = pd.read_csv(StringIO(df), sep=r"\s\s+", engine='python')
    groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
    groupFunct = lambda g: pd.Series([joinFunc(g, col) for col in g.columns],
                                     index=g.columns)
    print(groups.apply(groupFunct))
Output:
Control Recd_Date/Due_Date \
0 2000-1703 2000-01-31 00:00:00,2000-02-29 00:00:00
1 2000-1776 2000-01-02 00:00:00,2000-01-03 00:00:00
2 2000-1983 2000-02-02 00:00:00,2000-02-03 00:00:00
Action \
0 OC/OER/OPA/PMS/
1 OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-...
2 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/
Signature/Requester
0 M WEBB,DATA CORP
1 G KAN,PALM POST
2 M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
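As an aside on the original drop() question: DataFrame.drop returns a new DataFrame, so without reassignment (or inplace=True) the call df.drop(df.index[i]) leaves df unchanged. Also, the static-variable trick can be replaced by a cumulative sum over the non-null flags, which should produce the same group labels (a sketch):

# every non-null 'Control' starts a new group; the NaN rows that follow inherit its label
group_key = df['Control'].notnull().cumsum()
groups = df.groupby(group_key, as_index=False)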