Extract ID numeric value from a column - python

I have below data set, I need to extract 9 numeric ID from the "Notes" column.
below some of the code I tried, but sometimes I don't get the correct output. sometimes there are few spaces between numbers, or a symbol, or sometimes there are numeric values that are not part of the ID etc.. any idea how to do this more efficient?
DF['Output'] = DF['Notes'].str.replace(' '+' '+' '+'-', '')
DF['Output'] = DF['Notes'].str.replace(' '+' '+'-', '')
DF['Output'] = DF['Notes'].str.replace(' '+'-', '')
DF['Output'] = DF['Notes'].str.replace('-', '')
DF['Output'] = DF['Notes'].str.replace('\D', ' ')
DF['Output'] = DF['Notes'].str.findall(r'(\d{9,})').apply(', '.join)
Notes
Expected Output
ab. 325% xyz
0
GHY12345678 9
123456789
FTY 234567 891
234567891
BNM 567 891 524; 123 Ltd
567891524
2.5%mnkl, 3234 56 78 9; TGH 1235 z
323456789
RTF 956 327-12 8 TYP
956327128
X Y Z 1.59% 2345 567 81; one 35 in
234556781
VTO 126%, 12345 67
0
2.6% 1234 ABC 3456 1 2 4 91
345612491

# replace known character in b/w the numbers with null
# extract the 9 digits
df['output']=(df['Notes'].str.replace(r'[\s|\-]','',regex=True)
.str.extract(r'(\d{9})').fillna(0))
df
Notes Expected Output output
0 ab. 325% xyz 0 0
1 GHY12345678 9 123456789 123456789
2 FTY 234567 891 234567891 234567891
3 BNM 567 891 524; 123 Ltd 567891524 567891524
4 2.5%mnkl, 3234 56 78 9; TGH 1235 z 323456789 323456789
5 RTF 956 327-12 8 TYP 956327128 956327128
6 X Y Z 1.59% 2345 567 81; one 35 in 234556781 234556781
7 VTO 126%, 12345 67 0 0
8 2.6% 1234 ABC 3456 1 2 4 91 345612491 345612491

Using str.replace to first strip off spaces and dashes, followed by str.extract to find 9 digit numbers, we can try:
DF["Output"] = DF["Notes"].str.replace('[ -]+', '', regex=True)
.str.extract(r'(?<!\d)(\d{9})(?!\d)')
For an explanation of the regex pattern, we place non digit boundary markers around \d{9} to ensure that we only match 9 digit numbers. Here is how the regex works:
(?<!\d) ensure that what precedes is a non digit OR the start of the column
(\d{9}) match and capture exactly 9 digits
(?!\d) ensure that what follows is a non digit OR the end of the column

Related

Extracting features from dataframe

I have pandas dataframe like this
ID Phone ex
0 1 5333371000 533
1 2 5354321938 535
2 3 3840812 384
3 4 5451215 545
4 5 2125121278 212
For example if "ex" start to 533,535,545 new variable should be :
Sample output :
ID Phone ex iswhat
0 1 5333371000 533 personal
1 2 5354321938 535 personal
2 3 3840812 384 notpersonal
3 4 5451215 545 personal
4 5 2125121278 212 notpersonal
How can i do that ?
You can use np.where:
df['iswhat'] = np.where(df['ex'].isin([533, 535, 545]), 'personal', 'not personal')
print(df)
# Output
ID Phone ex iswhat
0 1 5333371000 533 personal
1 2 5354321938 535 personal
2 3 3840812 384 not personal
3 4 5451215 545 personal
4 5 2125121278 212 not personal
Update
You can also use your Phone column directly:
df['iswhat'] = np.where(df['Phone'].astype(str).str.match('533|535|545'),
'personal', 'not personal')
Note: If Phone column contains strings you can safely remove .astype(str).
We can use np.where along with str.contains:
df["iswhat"] = np.where(df["ex"].str.contains(r'^(?:533|535|545)$'),
'personal', 'notpersonal')

Remove rows from pandas dataframe with condition

I have a dataframe that looks like this:
import pandas as pd
### create toy data set
data = [[1111,'10/1/2021',21,123],
[1111,'10/1/2021',-21,123],
[1111,'10/1/2021',21,123],
[2222,'10/2/2021',15,234],
[2222,'10/2/2021',15,234],
[3333,'10/3/2021',15,234],
[3333,'10/3/2021',15,234]]
df = pd.DataFrame(data,columns = ['Individual','date','number','cc'])
What I want to do is remove rows where Individual, date, and cc are the same, but number is a negative value in one case and a positive in the other case. For example, in the first three rows, I would remove rows 1 and 2 (because 21 and -21 values are equal in absolute terms), but I don't want to remove row 3 (because I have already accounted for the negative value in row 2 by eliminating row 1). Also, I don't want to remove duplicated values if the corresponding number values are positive. I have tried a variety of duplicated() approaches, but just can't get it right.
Expected results would be:
Individual date number cc
0 1111 10/1/2021 21 123
1 2222 10/2/2021 15 234
2 2222 10/2/2021 15 234
3 3333 10/3/2021 15 234
4 3333 10/3/2021 15 234
Thus, the first two rows are removed, but not the third row, since the negative value is already accounted for.
Any assistance would be appreciated. I am trying to do this without a loop, but it may be unavoidable. It seems similar to this question, but I can't figure out how to make it work in my case, as I am trying to avoid loops.
I can't be sure since you did not post your expected output, but you could try the below. Create a separate df called n that contains the rows with -ve 'number' and join it to the original with indicator=True.
n = df.loc[df.number.le(0)].drop('number',axis=1)
df = pd.merge(df,n,'left',indicator=True)
>>> df
Individual date number cc _merge
0 1111 10/1/2021 21 123 both
1 1111 10/1/2021 -21 123 both
2 1111 10/1/2021 21 123 both
3 2222 10/2/2021 15 234 left_only
4 2222 10/2/2021 15 234 left_only
5 3333 10/3/2021 15 234 left_only
6 3333 10/3/2021 15 234 left_only
This will allow us to identify the Individual/date/cc groups that have a -ve 'number' row.
Then you can locate the rows with 'both' in _merge, and only use those to perform a groupby.head(2), concatenating that with the rest of the df:
out = pd.concat([df.loc[df._merge.eq('both')].groupby(['Individual','date','cc']).head(2),
df.loc[df._merge.ne('both')]]).drop('_merge',axis=1)
Which prints:
Individual date number cc
0 1111 10/1/2021 21 123
1 1111 10/1/2021 -21 123
3 2222 10/2/2021 15 234
4 2222 10/2/2021 15 234
5 3333 10/3/2021 15 234
6 3333 10/3/2021 15 234

Filter or selecting data between two rows in pandas by multiple labels

So I have this df or table coming from a pdf tranformation on this way example:
ElementRow
ElementColumn
ElementPage
ElementText
X1
Y1
X2
Y2
1
50
0
1
Emergency Contacts
917
8793
2191
8878
2
51
0
1
Contact
1093
1320
1451
1388
3
51
2
1
Relationship
2444
1320
3026
1388
4
51
7
1
Work Phone
3329
1320
3898
1388
5
51
9
1
Home Phone
4260
1320
4857
1388
6
51
10
1
Cell Phone
5176
1320
5684
1388
7
51
12
1
Priority Phone
6143
1320
6495
1388
8
51
14
1
Contact Address
6542
1320
7300
1388
9
51
17
1
City
7939
1320
7300
1388
10
51
18
1
State
8808
1320
8137
1388
11
51
21
1
Zip
9134
1320
9294
1388
12
52
0
1
Silvia Smith
1093
1458
1973
1526
13
52
2
1
Mother
2444
1458
2783
1526
13
52
7
1
(123) 456-78910
5176
1458
4979
1526
14
52
10
1
Austin
7939
1458
8406
1526
15
52
15
1
Texas
8808
1458
8961
1526
16
52
20
1
76063
9134
1458
9421
1526
17
52
2
1
1234 Parkside Ct
6542
1458
9421
1526
18
53
0
1
Naomi Smith
1093
2350
1973
1526
19
53
2
1
Aunt
2444
2350
2783
1526
20
53
7
1
(123) 456-78910
5176
2350
4979
1526
21
53
10
1
Austin
7939
2350
8406
1526
22
53
15
1
Texas
8808
2350
8961
1526
23
53
20
1
76063
9134
2350
9421
1526
24
53
2
1
3456 Parkside Ct
6542
2350
9421
1526
25
54
40
1
End Employee Line
6542
2350
9421
1526
25
55
0
1
Emergency Contacts
917
8793
2350
8878
I'm trying to separate each register by rows taking as a reference ElementRow column and keep the headers from the first rows and then iterate through the other rows after. The column X1 has a reference on which header should be the values. I would like to have the data like this way.
Contact
Relationship
Work Phone
Cell Phone
Priority
ContactAddress
City
State
Zip
1
Silvia Smith
Mother
(123) 456-78910
1234 Parkside Ct
Austin
Texas
76063
2
Naomi Smith
Aunt
(123) 456-78910
3456 Parkside Ct
Austin
Texas
76063
Things I tried:
To take rows between iterating through the columns. tried to slice taking the first index and the last index but showed this error:
emergStartIndex = df.index[df['ElementText'] == 'Emergency Contacts']
emergLastIndex = df.index[df['ElementText'] == 'End Employee Line']
emerRows_between = df.iloc[emergStartIndex:emergLastIndex]
TypeError: cannot do positional indexing on RangeIndex with these indexers [Int64Index([...
That way is working with this numpy trick.
emerRows_between = df.iloc[np.r_[1:54,55:107]]
emerRows_between
but when trying to replace the index showed this:
emerRows_between = df.iloc[np.r_[emergStartIndex:emergLastIndex]]
emerRows_between
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I tried iterating row by row like this but in some point the df reach the end and I'm receiving index out of bound.
emergencyContactRow1 = df['ElementText','X1'].iloc[emergStartIndex+1].reset_index(drop=True)
emergencyContactRow2 = df['ElementText','X1'].iloc[emergStartIndex+2].reset_index(drop=True)
emergencyContactRow3 = df['ElementText','X1'].iloc[emergStartIndex+3].reset_index(drop=True)
emergencyContactRow4 = df['ElementText','X1'].iloc[emergStartIndex+4].reset_index(drop=True)
emergencyContactRow5 = df['ElementText','X1'].iloc[emergStartIndex+5].reset_index(drop=True)
emergencyContactRow6 = df['ElementText','X1'].iloc[emergStartIndex+6].reset_index(drop=True)
emergencyContactRow7 = df['ElementText','X1'].iloc[emergStartIndex+7].reset_index(drop=True)
emergencyContactRow8 = df['ElementText','X1'].iloc[emergStartIndex+8].reset_index(drop=True)
emergencyContactRow9 = df['ElementText','X1'].iloc[emergStartIndex+9].reset_index(drop=True)
emergencyContactRow10 = df['ElementText','X1'].iloc[emergStartIndex+10].reset_index(drop=True)
frameEmergContact1 = [emergencyContactRow1 , emergencyContactRow2 , emergencyContactRow3, emergencyContactRow4, emergencyContactRow5, emergencyContactRow6, emergencyContactRow7, , emergencyContactRow8,, emergencyContactRow9, , emergencyContactRow10]
df_emergContact1= pd.concat(frameEmergContact1 , axis=1)
df_emergContact1.columns = range(df_emergContact1.shape[1])
So how to make this code dynamic or how to avoid the index out of bound errors and keep my headers taking as a reference only the first row after the Emergency Contact row?. I know I didn't try to use the X1 column yet, but I have to resolve first how to iterate through those multiple indexes.
Each iteration from Emergency Contact index to End Employee line belongs to one person or one employee from the whole dataframe, so the idea after capture all those values is to keep also a counter variable to see how many times the data is captured between those two indexes.
It's a bit ugly, but this should do it. Basically you don't need the first or last two rows, so if you get rid of those, then pivot the X1 and ElemenTex columns you will be pretty close. Then it's a matter of getting rid of null values and promoting the first row to header.
df = df.iloc[1:-2][['ElementTex','X1','ElementRow']].pivot(columns='X1',values='ElementTex')
df = pd.DataFrame([x[~pd.isnull(x)] for x in df.values.T]).T
df.columns = df.iloc[0]
df = df[1:]
Split the dataframe into chunks whenever "Emergency Contacts" appears in column "ElementText"
Parse each chunk into the required format
Append to the output
import numpy as np
list_of_df = np.array_split(data, data[data["ElementText"]=="Emergency Contacts"].index)
output = pd.DataFrame()
for frame in list_of_df:
df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
if df.shape[0]>0:
temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
temp.columns = temp.iloc[0]
temp = temp.drop(0)
output = output.append(temp, ignore_index=True)
>>> output
0 Contact Relationship Work Phone ... City State Zip
0 Silvia Smith Mother None ... Austin Texas 76063
1 Naomi Smith Aunt None ... Austin Texas 76063

How to check the number of characters in a string in a dataframe then remove the first character if it has 3 characters and startswith a number?

I have a dataframe in the form:
OCCUPATION
AGE
AREA_CODE
Employed
26
011
Employed
45
012
Student
812
021
Self-Employed
926
011
It is understood that an error occurred when entering the AGE data into the table (8 and 9 were made prefixes to the ages). I do not want to drop the rows, so is there an effective way to check that AGE has three characters, & startswith 8 or 9, then remove the 8 or 9 resulting in the dataframe below:
OCCUPATION
AGE
AREA_CODE
Employed
26
011
Employed
45
012
Student
12
021
Self-Employed
26
011
Note: the Age column is currently in integer format.
It's a simple math operation:
df['AGE'] = df['AGE'] % 100 + 100 * (df['AGE'] // 100 == 1)
Which means you take the last 2 digits of the age, and add the hundreds only if it's 1.

Extract integers after double space with regex

I have a dataframe where I want to extract stuff after double space. For all rows in column NAME there is a double white space after the company names before the integers.
NAME INVESTMENT PERCENT
0 APPLE COMPANY A 57 638 232 stocks OIL LTD 0.12322
1 BANANA 1 COMPANY B 12 946 201 stocks GOLD LTD 0.02768
2 ORANGE COMPANY C 8 354 229 stocks GAS LTD 0.01786
df = pd.DataFrame({
'NAME': ['APPLE COMPANY A 57 638 232 stocks', 'BANANA 1 COMPANY B 12 946 201 stocks', 'ORANGE COMPANY C 8 354 229 stocks'],
'PERCENT': [0.12322, 0.02768 , 0.01786]
})
I have this earlier, but it also includes integers in the company name:
df['STOCKS']=df['NAME'].str.findall(r'\b\d+\b').apply(lambda x: ''.join(x))
Instead I tried to extract after double spaces
df['NAME'].str.split('(\s{2})')
which gives output:
0 [APPLE COMPANY A, , 57 638 232 stocks]
1 [BANANA 1 COMPANY B, , 12 946 201 stocks]
2 [ORANGE COMPANY C, , 8 354 229 stocks]
However, I want the integers that occur after double spaces to be joined/merged and put into a new column.
NAME PERCENT STOCKS
0 APPLE COMPANY A 0.12322 57638232
1 BANANA 1 COMPANY B 0.02768 12946201
2 ORANGE COMPANY C 0.01786 12946201
How can I modify my second function to do what I want?
Following the original logic you may use
df['STOCKS'] = df['NAME'].str.extract(r'\s{2,}(\d+(?:\s\d+)*)', expand=False).str.replace(r'\s+', '')
df['NAME'] = df['NAME'].str.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '')
Output:
NAME PERCENT STOCKS
0 APPLE COMPANY A 0.12322 57638232
1 BANANA 1 COMPANY B 0.02768 12946201
2 ORANGE COMPANY C 0.01786 8354229
Details
\s{2,}(\d+(?:\s\d+)*) is used to extract the first occurrence of whitespace-separated consecutive digit chunks after 2 or more whitespaces and .replace(r'\s+', '') removes any whitespaces in that extracted text afterwards
.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks' updates the text in the NAME column, it removes 2 or more whitespaces, consecutive whitespace-separated digit chunks and then 1+ whitespaces and stocks. Actually, the last \s+stocks may be replaced with .* if there are other words.
Another pandas approach, which will cast STOCKS to numeric type:
df_split = (df['NAME'].str.extractall('^(?P<NAME>.+)\s{2}(?P<STOCKS>[\d\s]+)')
.reset_index(level=1, drop=True))
df_split['STOCKS'] = pd.to_numeric(df_split.STOCKS.str.replace('\D', ''))
Assign these columns back into your original DataFrame:
df[['NAME', 'STOCKS']] = df_split[['NAME', 'STOCKS']]
COMPANY_NAME STOCKS PERCENT
0 APPLE COMPANY A 57638232 0.12322
1 BANANA 1 COMPANY B 12946201 0.02768
2 ORANGE COMPANY C 8354229 0.01786
You can use look behind and look ahead operators.
''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)',string)).replace(' ','')
This catches all characters between two spaces and the word stocks and replace all the spaces with null.
Another Solution using Split
df["NAME"].apply(lambda x:x[x.find(' ')+2:x.find('stocks')-1].replace(' ',''))
Reference:-
Look_behind
You can try
df['STOCKS'] = df['NAME'].str.split(',')[2].replace(' ', '')
df['NAME'] = df['NAME'].str.split(',')[0]
This can be done without using regex by using split.
df['STOCKS'] = df['NAME'].apply(lambda x: ''.join(x.split(' ')[1].split(' ')[:-1]))
df['NAME'] = df['NAME'].str.replace(r'\s?\d+(?:\s\d+).*', '')

Categories

Resources