How to extract rows with mixed values - Python

I need to extract the rows of a pandas dataframe whose 'Date of birth' column contains a value that occurs in a list of dates.
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
                   'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']
print(df)

Name Date of birth
0 Jack 1973
1 Mary 1999
2 David 1995
3 Bruce 1992/1991
4 Nick 2000
5 Mark 1969
6 Carl 1994
7 Sofie 1989/1990

new_df = df.loc[df['Date of birth'].isin(dates)]
print(new_df)
Eventually I get the table below. As you can see, Bruce's and Sofie's rows are missing, since their values are followed by a / and another value. How can I split these values up so those rows are not filtered out?
Name Date of birth
0 Jack 1973
5 Mark 1969

You could use str.contains:
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
                   'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']
new_df = df.loc[df['Date of birth'].str.contains(rf"\b(?:{'|'.join(dates)})\b")]
print(new_df)
Output
Name Date of birth
0 Jack 1973
3 Bruce 1992/1991
5 Mark 1969
7 Sofie 1989/1990
The string rf"\b(?:{'|'.join(dates)})\b" builds a regex pattern that matches any string containing one of the dates as a whole word. The non-capturing group (?:...) is needed so that the \b word boundaries apply to every alternative, not just the first and last.
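For illustration, this is roughly what the constructed pattern looks like and how it behaves (a quick check with the re module, using the same dates list):
import re
dates = ['1973', '1992', '1969', '1989']
pattern = rf"\b(?:{'|'.join(dates)})\b"
print(pattern)                                # \b(?:1973|1992|1969|1989)\b
print(bool(re.search(pattern, '1992/1991')))  # True: 1992 appears as a whole word
print(bool(re.search(pattern, '1999')))       # False: no listed date appears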

I like @DaniMesejo's way better, but here is a way that splits the values up and stacks them:
df[df['Date of birth'].str.split('/', expand=True).stack().isin(dates).groupby(level=0).any()]
Output:
Name Date of birth
0 Jack 1973
3 Bruce 1992/1991
5 Mark 1969
7 Sofie 1989/1990
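To see what the split/stack version is doing, it can help to print the intermediate Series (a sketch using the same df and dates as above):
# Each year becomes its own row; level 0 of the index is the original row label
s = df['Date of birth'].str.split('/', expand=True).stack()
print(s)
# A row matches if any of its years is in dates
mask = s.isin(dates).groupby(level=0).any()
print(df[mask])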

Speeding up for-loops using pandas for feature engineering

I have a dataframe with the following headings:
payer
recipient_country
date of payment
Each row shows a transaction; for example, the row (Bob, UK, 1st January 2023) shows that the payer Bob sent a payment to the UK on 1st January 2023.
For each row in this table I need to find the number of times that the payer for that row has sent a payment to the country for that row in the past. So for the row above I would want to find the number of times that Bob has sent money to the UK prior to 1st January 2023.
This is for feature engineering purposes.
I have done this using a for loop in which I iterate through the rows and make a pandas loc call for each row to find earlier rows with the same payer and country, but this is far too slow for the number of rows I have to process.
Can anyone think of a way to speed up this process using some fast pandas functions?
Thanks!
Testing on this toy data frame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    [{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
)
Just group by and cumulatively count:
>>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
name country date trns_bf
0 Bob UK 2023-01-01 0
1 Bob UK 2023-01-02 1
2 Bob UK 2023-01-03 2
3 Cob UK 2023-01-04 0
4 Cob UK 2023-01-05 1
5 Cob UK 2023-01-06 2
6 Cob UK 2023-01-07 3
You need to sort first, to ensure that earlier rows are not confused with later ones. I interpreted "prior" in your question literally: e.g. there are no transactions before Bob's transaction to the UK on 1 Jan 2023, so its count is 0.
Each row gets its own count of transactions by that name to that country before that date. If there can be multiple transactions on one day, decide how you want to handle them. I would probably group again and take the maximum value for each day, df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max(), and then merge the result back, since the indexing makes it difficult to attach directly as above; a sketch follows.
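A minimal sketch of that same-day handling, assuming every row from the same day should carry that day's maximum count (names follow the toy frame above):
# Collapse to one count per (name, country, date)...
daily = df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max()
# ...then merge it back onto the original rows
df = df.drop(columns='trns_bf').merge(daily, on=['name', 'country', 'date'], how='left')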

Create new fake data with new primary keys from an existing dataframe - Python

I have a dataframe as following:
df1 = pd.DataFrame({'id': ['1a', '2b', '3c'], 'name': ['Anna', 'Peter', 'John'], 'year': [1999, 2001, 1993]})
I want to create new data by randomly re-arranging the values in each column, but for the id column I also need to add a random letter at the end of each value, and then append the new data to the existing df1, as follows:
df1 = pd.DataFrame({'id': ['1a', '2b', '3c', '2by', '1ao', '1az', '3cc'], 'name': ['Anna', 'Peter', 'John', 'John', 'Peter', 'Anna', 'Anna'], 'year': [1999, 2001, 1993, 1999, 1999, 2001, 2001]})
Could anyone help me, please? Thank you very much.
Use DataFrame.sample and append a random letter with numpy.random.choice:
import string
import numpy as np

N = 5
df2 = (df1.sample(n=N, replace=True)
          .assign(id=lambda x: x['id'] + np.random.choice(list(string.ascii_letters), size=N)))
df1 = pd.concat([df1, df2], ignore_index=True)
print (df1)
id name year
0 1a Anna 1999
1 2b Peter 2001
2 3c John 1993
3 1aY Anna 1999
4 3cp John 1993
5 3cE John 1993
6 2bz Peter 2001
7 3cu John 1993
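If the generated data needs to be reproducible, both sources of randomness can be seeded (a sketch continuing the snippet above; the seed values are arbitrary):
np.random.seed(0)                                     # fixes np.random.choice
df2 = (df1.sample(n=N, replace=True, random_state=0)  # fixes the row sampling
          .assign(id=lambda x: x['id'] + np.random.choice(list(string.ascii_letters), size=N)))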

How to target specific rows for replacement when the order is different between dataframes?

Say I have 2 dataframes and I want to fill the NaNs in one with values from the other, but the rows are not in the same order. Is it possible to target specific columns as part of the mapping?
Here's an example:
import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})
print(data)
print(dataNew)
print(dataNew.fillna(data))
My output is not right because, as you can see, the new dataframe's country data is not in the same order (uk/usa are swapped, and so are fr/uk at the end). Is there a way to match rows on year, country and sales before replacing the NaN values in the rep column?
The output I'm looking for is below, ordered like the first dataframe. I'm trying to understand how I could fill in the NaNs by selecting matching cells in another df. To make the question easier I gave the first dataframe fewer columns, so the question is purely about mapping/searching.
year country sales rep
0 2016 uk 10 john
1 2016 usa 21 john
2 2015 fr 20 claire
3 2014 fr 10 kyle
4 2013 uk 12 kyle
import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'sales': [10, 21, 20, 10, 12],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})
# join on these three columns and get the rep column from other dataframe
cols = ['year', 'country', 'sales']
dataNew = dataNew.drop(columns='rep').join(data.set_index(cols), on=cols)
print(dataNew)
output:
year country sales rep
0 2016 usa 21 john
1 2016 uk 10 john
2 2015 fr 20 claire
3 2013 uk 12 kyle
4 2014 fr 10 kyle
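An equivalent way to express the same key-based lookup is DataFrame.merge (a sketch, starting from the original dataNew with the NaN rep column):
# how='left' keeps dataNew's row order; rep is pulled in from data by the three keys
dataNew = dataNew.drop(columns='rep').merge(data, on=cols, how='left')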
You can also use a MultiIndex along with fillna():
import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})
(dataNew.set_index(['year', 'country']).sort_index()
        .fillna(data.set_index(['year', 'country']).sort_index())
).reset_index()
year country sales rep
0 2013 uk 12 kyle
1 2014 fr 10 kyle
2 2015 fr 20 claire
3 2016 uk 10 john
4 2016 usa 21 john
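The reason this works is that fillna aligns on the index; you can verify that the two indexed frames line up (a quick check, same data as above):
left = dataNew.set_index(['year', 'country']).sort_index()
right = data.set_index(['year', 'country']).sort_index()
print(left.index.equals(right.index))  # True: fillna can match rows by (year, country)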

Creating a new dataframe by applying a function to every cell of a dataframe

I have a data frame, df, like this:
import pandas as pd

data = {'A': ['Jason (121439)', 'Molly (194439)', 'Tina (114439)', 'Jake (127859)', 'Amy (122579)'],
        'B': ['Bob (127439)', 'Mark (136489)', 'Tyler (121443)', 'John (126259)', 'Anna(174439)'],
        'C': ['Jay (121596)', 'Ben (12589)', 'Toom (123586)', 'Josh (174859)', 'Al(121659)'],
        'D': ['Paul (123839)', 'Aaron (124159)', 'Steve (161899)', 'Vince (179839)', 'Ron (128379)']}
df = pd.DataFrame(data)
And I want to create a new data frame with one column for the name and another column for the number between parentheses, which would look like this:
data2 = {'Name': ['Jason ', 'Molly ', 'Tina ', 'Jake ', 'Amy '],
         'ID#': ['121439', '194439', '114439', '127859', '122579']}
result = pd.DataFrame(data2)
I tried different things, but none of them worked:
1)
List_name = pd.DataFrame()
List_id = pd.DataFrame()
List_both = pd.DataFrame(columns=["Name", "ID"])
for i in df.columns:
    left = df[i].str.split("(", 1).str[0]
    right = df[i].str.split("(", 1).str[1]
    List_name = List_name.append(left)
    List_id = List_id.append(right)
List_both = pd.concat([List_name, List_id], axis=1)
List_both
2) applying a function to every cell:
Names = lambda x: x.str.split("(", 1).str[0]
IDs = lambda x: x.str.split("(", 1).str[1]
But I was wondering how to do this so that the result is stored in a data frame that looks like result...
You can use stack followed by str.extract.
(df.stack()
   .str.strip()
   .str.extract(r'(?P<Name>.*?)\s*\((?P<ID>.*?)\)$')
   .reset_index(drop=True))
Name ID
0 Jason 121439
1 Bob 127439
2 Jay 121596
3 Paul 123839
4 Molly 194439
5 Mark 136489
6 Ben 12589
7 Aaron 124159
8 Tina 114439
9 Tyler 121443
10 Toom 123586
11 Steve 161899
12 Jake 127859
13 John 126259
14 Josh 174859
15 Vince 179839
16 Amy 122579
17 Anna 174439
18 Al 121659
19 Ron 128379
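If, as in the expected result, only column A is needed, the same extract can be applied to that single column (a sketch using the pattern above):
result = df['A'].str.strip().str.extract(r'(?P<Name>.*?)\s*\((?P<ID>.*?)\)$')
print(result)
# Name ID
# 0 Jason 121439
# 1 Molly 194439
# 2 Tina 114439
# 3 Jake 127859
# 4 Amy 122579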

Replacing datetime values using for loop pandas

I have a df that looks like the one below, but much bigger. Some of the dates in the lastDate column are incorrect, and they are incorrect only when there is a value in the fixedDate column right next to them.
dff = pd.DataFrame(
    {"lastDate": ['2016-3-27', '2016-4-11', '2016-3-27', '2016-3-27', '2016-5-25', '2016-5-31'],
     "fixedDate": ['2016-1-3', '', '2016-1-18', '2016-4-5', '2016-2-27', ''],
     "analyst": ['John Doe', 'Brad', 'John', 'Frank', 'Claud', 'John Doe']})
The dataframe above is what I have; what I'd like after the loop is for lastDate to be replaced by fixedDate wherever fixedDate is filled in.
First convert these columns to datetime dtypes:
for col in ['fixedDate', 'lastDate']:
    df[col] = pd.to_datetime(df[col])
Then you could use
mask = pd.notnull(df['fixedDate'])
df.loc[mask, 'lastDate'] = df['fixedDate']
For example,
import pandas as pd

df = pd.DataFrame(
    {"lastDate": ['2016-3-27', '2016-4-11', '2016-3-27', '2016-3-27', '2016-5-25', '2016-5-31'],
     "fixedDate": ['2016-1-3', '', '2016-1-18', '2016-4-5', '2016-2-27', ''],
     "analyst": ['John Doe', 'Brad', 'John', 'Frank', 'Claud', 'John Doe']})

for col in ['fixedDate', 'lastDate']:
    df[col] = pd.to_datetime(df[col])

mask = pd.notnull(df['fixedDate'])
df.loc[mask, 'lastDate'] = df['fixedDate']
print(df)
yields
analyst fixedDate lastDate
0 John Doe 2016-01-03 2016-01-03
1 Brad NaT 2016-04-11
2 John 2016-01-18 2016-01-18
3 Frank 2016-04-05 2016-04-05
4 Claud 2016-02-27 2016-02-27
5 John Doe NaT 2016-05-31
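For what it's worth, once both columns are datetimes, the same replacement can be written without an explicit mask, since fillna treats NaT as missing (a sketch):
# Take fixedDate where present, otherwise keep the existing lastDate
df['lastDate'] = df['fixedDate'].fillna(df['lastDate'])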
