I have another problem with pandas. I can do the task below using loops, but that would be very inefficient given the size of the input. Please let me know if there is a pandas solution.
I would like to create a new DF_C based on DF A. I need to create multiple rows based on the columns COL_A and B_COL (their values are separated by commas). State will always have a single value in it.
The order of the rows does not matter.
I have a DF A:
State  COL_A   B_COL
01     01      03, 01
02     01, 03  01, 04
02     07      03
04     01      05
I would like a resulting df_c:
State  COL_A  B_COL
01     01     03
01     01     01
02     01     01
02     01     04
02     03     01
02     03     04
02     07     03
04     01     05
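For reference, a minimal sketch constructing DF A as used by the answers below; State is built as an integer (it prints as 1, 2, 4 in the answers' output) and the other columns as strings, but the exact dtypes are an assumption:
import pandas as pd

# Hypothetical reconstruction of DF A from the question
df = pd.DataFrame({'State': [1, 2, 2, 4],
                   'COL_A': ['01', '01, 03', '07', '01'],
                   'B_COL': ['03, 01', '01, 04', '03', '05']})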
You can do this by first using str.split on both COL_A and B_COL, then chaining one explode per column, like:
df_ = (df.assign(COL_A=lambda x: x['COL_A'].str.split(', '),
                 B_COL=lambda x: x['B_COL'].str.split(', '))
         .explode('COL_A')
         .explode('B_COL')
       )
print (df_)
State COL_A B_COL
0 1 01 03
0 1 01 01
1 2 01 01
1 2 01 04
1 2 03 01
1 2 03 04
2 2 07 03
3 4 01 05
EDIT: if you are after efficiency, maybe consider doing
df_ = pd.DataFrame(
    [(s, a, b)
     for s, cola, colb in zip(df['State'], df['COL_A'], df['B_COL'])
     for a in cola.split(', ') for b in colb.split(', ')],
    columns=df.columns)
An alternative to Ben.T's second solution, using itertools:
from itertools import product, chain

flatten = chain.from_iterable
result = flatten(product([state], col_a.split(', '), b_col.split(', '))
                 for state, col_a, b_col in df.to_numpy())
pd.DataFrame(result, columns=df.columns)
State COL_A B_COL
0 1 01 03
1 1 01 01
2 2 01 01
3 2 01 04
4 2 03 01
5 2 03 04
6 2 07 03
7 4 01 05
I have one table:
Index  Month_1  Month_2  Paid
01     12       10
02     09       03
03     02       04
04     01       08
The output should be:
Index  Month_1  Month_2  Paid
01     12       10       Yes
02     09       03
03     02       04       Yes
04     01       08
Logic: set the Paid field to 'Yes' for rows whose Month_1 and Month_2 are close to each other.
You can subtract the columns, take the absolute value, and check whether it is less than or equal to a threshold, e.g. 2, then set the values with numpy.where:
import numpy as np

df['Paid'] = np.where(df['Month_1'].sub(df['Month_2']).abs().le(2), 'Yes', '')
print (df)
Index Month_1 Month_2 Paid
0 01 12 10 Yes
1 02 9 3
2 03 2 4 Yes
3 04 1 8
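For reproducibility, a minimal sketch that rebuilds the question's table and applies the same one-liner; the integer dtype of the month columns and the threshold of 2 are assumptions taken from the example above:
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's table
df = pd.DataFrame({'Index': ['01', '02', '03', '04'],
                   'Month_1': [12, 9, 2, 1],
                   'Month_2': [10, 3, 4, 8]})
# 'Yes' where |Month_1 - Month_2| <= 2, empty string otherwise
df['Paid'] = np.where(df['Month_1'].sub(df['Month_2']).abs().le(2), 'Yes', '')
print(df)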
import re

text = '01-01-2020 01/01/2020 01 Oct 2020 01 October 2020'
matches = re.findall(r"[\d]{1,2}-[\d]{1,2}-[\d]{2,4}", text)
for s in matches:
    print(s)
So far I have tried this, but I am only getting the first format (01-01-2020). I want to extract dates in all of these formats; any help would be highly appreciated.
Initial note: I'm not an expert in Python. As @TimBiegeleisen stated in a comment on the question, using Python's date-parsing libraries is probably a better way to achieve this.
So I can only help you write a proper regex pattern, which is:
(\d{2}[-\/]\d{2}[-\/]\d{4}|\d{2}.\w{3,}.\d{4})
For further details, please see an online regex tester.
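A hedged sketch of how that pattern could be combined with pandas to normalise the matches into real dates; dayfirst=True is an assumption based on the samples in the question:
import re
import pandas as pd

text = '01-01-2020 01/01/2020 01 Oct 2020 01 October 2020'
pattern = r'(\d{2}[-\/]\d{2}[-\/]\d{4}|\d{2}.\w{3,}.\d{4})'
# Extract the raw strings, then let pandas parse the mixed formats
matches = re.findall(pattern, text)
parsed = [pd.to_datetime(m, dayfirst=True) for m in matches]
print(parsed)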
Another regex that tries to match month names as well (regex101):
import re
txt = '''
01-01-2020
01/01/2020
01 Oct 2020
01 October 2020
01 Jan 2020
01 Feb 2020
01 Mar 2020
01 May 2020
01 Jun 2020
01 Jul 2020
01 Aug 2020
01 Sep 2020
01 Oct 2020
01 Nov 2020
01 Dec 2020
12 December 1991'''
r = re.compile(r'\d{2}[ /-](?:\d{2}|Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[ /-]\d{4}')
for d in r.findall(txt):
    print(d)
Prints:
01-01-2020
01/01/2020
01 Oct 2020
01 October 2020
01 Jan 2020
01 Feb 2020
01 Mar 2020
01 May 2020
01 Jun 2020
01 Jul 2020
01 Aug 2020
01 Sep 2020
01 Oct 2020
01 Nov 2020
01 Dec 2020
12 December 1991
import pandas as pd

df2 = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
                    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017',
                                   '04/03/2014', '08/04/2016', '03/05/2014', '02/07/2011',
                                   '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014',
                                   '07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'])
What I would like to do is
a) Exclude/filter out records from the data frame if a subject has both Dec 31st and Jan 1st in its records. Please note that the year doesn't matter.
If a subject has only Dec 31st or only Jan 1st, we leave them as is.
But if they have both Dec 31st and Jan 1st, we remove one of them (either Dec 31st or Jan 1st). Note that they could also have multiple entries with the same date, like person_id = 11.
I could only come up with the below, but it excludes a subject even if it only has Dec 31st of 2017. How can I match on the month and day and ignore the year?
df2_new = df2['dates'] != '2017-12-31'
df2[df2_new]
My expected output is described below: for person_id = 11 we drop 12-31 because both 12-31 and 01-01 appear in their records, whereas for person_id = 14 we keep 12-31 because only 12-31 appears. In other words, we drop 12-31 only when both 12-31 and 01-01 appear in a person's records.
Use:
import numpy as np

s = df2['dates'].dt.strftime('%m-%d')
# does this person have any Jan 1st record?
m1 = s.eq('01-01').groupby(df2['person_id']).transform('any')
# does this person have any Dec 31st record?
m2 = s.eq('12-31').groupby(df2['person_id']).transform('any')
# if both are present, keep everything except the 12-31 rows; otherwise keep the row as is
m3 = np.select([m1 & m2, m1 | m2], [s.ne('12-31'), True], default=True)
df3 = df2[m3]
Result:
# print(df3)
person_id variable dates
0 11 admit_date 2011-01-01
1 11 admit_date 2009-01-01
4 11 admit_date 2014-04-03
5 12 admit_date 2016-08-04
6 12 admit_date 2014-03-05
7 13 admit_date 2011-02-07
8 13 admit_date 2016-08-08
9 14 admit_date 2017-12-31
10 14 admit_date 2011-05-01
11 14 admit_date 2014-05-21
12 14 admit_date 2016-07-12
Another way:
Coerce the dates to a day-month string.
Create a temp column in which 31st Dec is converted to 1st Jan.
Drop duplicates by person_id and the temp column, keeping the first occurrence.
df2['dates'] = df2['dates'].dt.strftime('%d %b')
df2 = (df2.assign(check=np.where(df2.dates == '31 Dec', '01 Jan', df2.dates))
          .drop_duplicates(['person_id', 'variable', 'check'], keep='first')
          .drop(columns=['check']))
Intermediate result (shown here before the check helper column is dropped):
person_id variable dates check
0 11 admit_date 01 Jan 01 Jan
4 11 admit_date 03 Apr 03 Apr
5 12 admit_date 04 Aug 04 Aug
6 12 admit_date 05 Mar 05 Mar
7 13 admit_date 07 Feb 07 Feb
8 13 admit_date 08 Aug 08 Aug
9 14 admit_date 31 Dec 01 Jan
10 14 admit_date 01 May 01 May
11 14 admit_date 21 May 21 May
12 14 admit_date 12 Jul 12 Jul
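A hedged variant of the same idea, assuming it is run on the original df2 (with dates still as datetimes) instead of the snippet above; it builds the check key on the fly so the full dates are kept in the result:
# month-day key, with 31 Dec mapped onto 01 Jan so the two collide as duplicates
key = df2['dates'].dt.strftime('%d %b').replace('31 Dec', '01 Jan')
df3 = df2[~df2.assign(check=key).duplicated(['person_id', 'variable', 'check'])]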
I have a grouped dataframe that looks as follows:
player_id  shot_type  count
01         03         3
02         01         3
           03         2
03         01         4
I want to add an additional column which is the mean of the counts for each shot_type, which would look as follows:
player_id  shot_type  count  mean_shot_type_count_player
01         03         3      (3+2)/2
02         01         3      (3+4)/2
           03         2      (3+2)/2
03         01         4      (3+4)/2
Use GroupBy.transform:
df['mean_shot_type_count_player']=df.groupby('shot_type')['count'].transform('mean')
print(df)
Output:
player_id shot_type count mean_shot_type_count_player
0 01 03 3 2.5
1 02 01 3 3.5
2 03 2 2.5
3 03 01 4 3.5
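For reference, a minimal sketch reconstructing the grouped frame shown above; the blank player_id on the third row is assumed to belong to player 02, as in the grouped display:
import pandas as pd

df = pd.DataFrame({'player_id': ['01', '02', '02', '03'],
                   'shot_type': ['03', '01', '03', '01'],
                   'count': [3, 3, 2, 4]})
# mean count per shot_type, broadcast back to every row
df['mean_shot_type_count_player'] = df.groupby('shot_type')['count'].transform('mean')
print(df)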
I have the below dataframe:
ID1 ID2 mon price
10 2 06 500
20 3 07 200
20 3 08 300
20 3 09 400
21 2 07 100
21 2 08 200
21 2 09 300
Required output:
ID1 ID2 mon price ID1_shift ID2_shift mon_shift price_shift
10 2 06 500 10 2 06 500
20 3 07 200 20 3 07 200
20 3 08 300 20 3 07 200
20 3 09 400 20 3 08 300
21 2 07 100 21 2 07 100
21 2 08 200 21 2 07 100
21 2 09 300 21 2 08 200
I tried using df.shift() in different ways but was not successful.
Your valuable comments would be helpful.
I want to shift the dataframe grouped by (ID1, ID2) and, where the shift produces NaN, fill with the current row's values.
I tried the below, but it only works with a single column.
df["price_shift"] = df.groupby(["ID1", "ID2"]).price.shift().fillna(df["price"])
Thanks
I came up with the below, but this is only feasible for a small number of columns. Is there any way the complete row can be shifted with a groupby, as above?
df1 = pd.DataFrame()
df1['price_shift'] = df.groupby(['ID1', 'ID2']).price.shift(1).fillna(df['price'])
df1['mon_shift'] = df.groupby(['ID1', 'ID2']).mon.shift(1).fillna(df['mon'])
df1[['ID1_shift', 'ID2_shift']] = df[['ID1', 'ID2']]
df2 = pd.concat([df, df1], axis=1)
df2
Try the below:
g = df.groupby(['ID1', 'ID2'])
for column_name in list(df.columns):
    # grouped shift per column; the first row of each group falls back to its current value
    df[column_name + "_shift"] = g[column_name].shift().fillna(df[column_name])
cheers
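A hedged alternative to the loop above, shifting all non-key columns at once with a single grouped shift; the frame construction is a reconstruction of the question's data:
import pandas as pd

# Reconstruction of the question's input frame (mon kept as strings)
df = pd.DataFrame({'ID1': [10, 20, 20, 20, 21, 21, 21],
                   'ID2': [2, 3, 3, 3, 2, 2, 2],
                   'mon': ['06', '07', '08', '09', '07', '08', '09'],
                   'price': [500, 200, 300, 400, 100, 200, 300]})

# Shift the non-key columns within each (ID1, ID2) group; the first row of each
# group becomes NaN and is filled back from the current values.
shifted = df.groupby(['ID1', 'ID2']).shift().fillna(df).add_suffix('_shift')

# The key columns never change within a group, so their *_shift versions are plain copies.
out = df.join(df[['ID1', 'ID2']].add_suffix('_shift')).join(shifted)
print(out)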