Extracting features from dataframe - python

I have a pandas dataframe like this:
   ID       Phone   ex
0   1  5333371000  533
1   2  5354321938  535
2   3     3840812  384
3   4     5451215  545
4   5  2125121278  212
For example, if "ex" starts with 533, 535, or 545, a new variable should be created.
Sample output:
   ID       Phone   ex       iswhat
0   1  5333371000  533     personal
1   2  5354321938  535     personal
2   3     3840812  384  notpersonal
3   4     5451215  545     personal
4   5  2125121278  212  notpersonal
How can I do that?

You can use np.where:
import numpy as np

df['iswhat'] = np.where(df['ex'].isin([533, 535, 545]), 'personal', 'not personal')
print(df)
# Output
   ID       Phone   ex        iswhat
0   1  5333371000  533      personal
1   2  5354321938  535      personal
2   3     3840812  384  not personal
3   4     5451215  545      personal
4   5  2125121278  212  not personal
Update
You can also use your Phone column directly:
df['iswhat'] = np.where(df['Phone'].astype(str).str.match('533|535|545'),
                        'personal', 'not personal')
Note: If the Phone column already contains strings, you can safely remove .astype(str).
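Put together, a minimal self-contained sketch of the isin approach (rebuilding the small frame from the question) looks like this:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Phone': [5333371000, 5354321938, 3840812, 5451215, 2125121278],
                   'ex': [533, 535, 384, 545, 212]})

# 'personal' wherever the prefix is in the whitelist, otherwise 'not personal'.
df['iswhat'] = np.where(df['ex'].isin([533, 535, 545]), 'personal', 'not personal')
```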

We can use np.where along with str.contains (note that "ex" must hold strings for the .str accessor; cast with .astype(str) first if it is numeric):
df["iswhat"] = np.where(df["ex"].astype(str).str.contains(r'^(?:533|535|545)$'),
                        'personal', 'notpersonal')


Extract ID numeric value from a column

I have the data set below; I need to extract the 9-digit numeric ID from the "Notes" column.
Below is some of the code I tried, but I don't always get the correct output: sometimes there are a few spaces between the numbers, or a symbol, or numeric values that are not part of the ID, etc. Any idea how to do this more efficiently?
DF['Output'] = DF['Notes'].str.replace(' '+' '+' '+'-', '')
DF['Output'] = DF['Notes'].str.replace(' '+' '+'-', '')
DF['Output'] = DF['Notes'].str.replace(' '+'-', '')
DF['Output'] = DF['Notes'].str.replace('-', '')
DF['Output'] = DF['Notes'].str.replace('\D', ' ')
DF['Output'] = DF['Notes'].str.findall(r'(\d{9,})').apply(', '.join)
Notes                                 Expected Output
ab. 325% xyz                                        0
GHY12345678 9                               123456789
FTY 234567 891                              234567891
BNM 567 891 524; 123 Ltd                    567891524
2.5%mnkl, 3234 56 78 9; TGH 1235 z          323456789
RTF 956 327-12 8 TYP                        956327128
X Y Z 1.59% 2345 567 81; one 35 in          234556781
VTO 126%, 12345 67                                  0
2.6% 1234 ABC 3456 1 2 4 91                 345612491
# replace known characters in between the numbers with the empty string,
# then extract the 9 digits
df['output'] = (df['Notes'].str.replace(r'[\s\-]', '', regex=True)
                           .str.extract(r'(\d{9})', expand=False).fillna(0))
df
Notes Expected Output output
0 ab. 325% xyz 0 0
1 GHY12345678 9 123456789 123456789
2 FTY 234567 891 234567891 234567891
3 BNM 567 891 524; 123 Ltd 567891524 567891524
4 2.5%mnkl, 3234 56 78 9; TGH 1235 z 323456789 323456789
5 RTF 956 327-12 8 TYP 956327128 956327128
6 X Y Z 1.59% 2345 567 81; one 35 in 234556781 234556781
7 VTO 126%, 12345 67 0 0
8 2.6% 1234 ABC 3456 1 2 4 91 345612491 345612491
Using str.replace to first strip off spaces and dashes, followed by str.extract to find 9-digit numbers, we can try:
DF["Output"] = (DF["Notes"].str.replace('[ -]+', '', regex=True)
                           .str.extract(r'(?<!\d)(\d{9})(?!\d)', expand=False))
For an explanation of the regex pattern, we place non-digit boundary markers around \d{9} to ensure that we only match 9-digit numbers. Here is how the regex works:
(?<!\d)  ensure that what precedes is a non-digit OR the start of the string
(\d{9})  match and capture exactly 9 digits
(?!\d)   ensure that what follows is a non-digit OR the end of the string
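A minimal runnable sketch of this replace-then-extract approach, on three of the sample notes from the question (with fillna('0') to match the expected 0 for rows that hold no ID):

```python
import pandas as pd

# A few of the sample rows from the question.
DF = pd.DataFrame({'Notes': ['GHY12345678 9',
                             'RTF 956 327-12 8 TYP',
                             'ab. 325% xyz']})

# Strip spaces and dashes first, then capture exactly nine digits that are
# not embedded in a longer digit run.
DF['Output'] = (DF['Notes'].str.replace('[ -]+', '', regex=True)
                           .str.extract(r'(?<!\d)(\d{9})(?!\d)', expand=False)
                           .fillna('0'))
```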

Go through every row in a dataframe, search for these values in a second dataframe; if they match, get a value from df1 and another value from df2

I have two dataframes:
Researchers: a list of all researchers and their id_number
Samples: a list of samples and all researchers related to each; there may be several researchers in the same cell.
I want to go through every row in the researchers table and check whether each name occurs in each row of the samples table. If it does, I want to get the id from the researchers table and the sample number from the samples table.
Table researcher
id_researcher full_name
0 1 Jack Sparrow
1 2 Demi moore
2 3 Bickman
3 4 Charles Darwin
4 5 H. Haffer
Table samples
sample_number collector
230 INPA A 231 Haffer
231 INPA A 232 Charles Darwin
232 INPA A 233 NaN
233 INPA A 234 NaN
234 INPA A 235 Jack Sparrow; Demi Moore ; Bickman
Output I want:
id_researcher num_samples
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
I was able to do it with a loop in regular Python with the following code, but it is extremely slow and quite long. Does anyone know a faster and simpler way, perhaps with pandas apply?
id_researcher = []
id_num_sample = []
for c in range(len(data_researcher)):
    for a in range(len(data_samples)):
        if pd.isna(data_samples['collector'].iloc[a]) == False and data_researcher['full_name'].iloc[c] in data_samples['collector'].iloc[a]:
            id_researcher.append(data_researcher['id_researcher'].iloc[c])
            id_num_sample.append(data_samples['sample_number'].iloc[a])
data_researcher_sample = pd.DataFrame({'id_researcher': id_researcher, 'num_sample': id_num_sample}).sort_values(by='num_sample')
You have a few data cleaning jobs to do, such as 'moore' in lowercase, 'Haffer' with a first-name initial in one table and none in the other, etc. After normalizing your two dataframes, you can split and explode the collector column and use merge:
samples['collector'] = samples['collector'].str.split(';')
samples = samples.explode('collector')
samples['collector'] = samples['collector'].str.strip()
out = (researchers.merge(samples, right_on='collector', left_on='full_name', how='left')
                  [['id_researcher', 'sample_number']]
                  .sort_values(by='sample_number')
                  .reset_index(drop=True))
Output:
id_researcher sample_number
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
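Setting the name-normalization caveat aside, here is a minimal runnable sketch of the split/explode/merge recipe on a hypothetical one-sample subset of the data (names already normalized):

```python
import pandas as pd

# Hypothetical miniature versions of the two tables in the question.
researchers = pd.DataFrame({'id_researcher': [1, 2, 3],
                            'full_name': ['Jack Sparrow', 'Demi Moore', 'Bickman']})
samples = pd.DataFrame({'sample_number': ['INPA A 235'],
                        'collector': ['Jack Sparrow; Demi Moore ; Bickman']})

# Split multi-collector cells, give each collector its own row, trim spaces.
samples['collector'] = samples['collector'].str.split(';')
samples = samples.explode('collector')
samples['collector'] = samples['collector'].str.strip()

# Match collector names to researcher names.
out = (researchers.merge(samples, left_on='full_name', right_on='collector')
                  [['id_researcher', 'sample_number']]
                  .sort_values('id_researcher')
                  .reset_index(drop=True))
```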

Filter or selecting data between two rows in pandas by multiple labels

So I have this df (table) coming from a PDF transformation, for example:
    ElementRow  ElementColumn  ElementPage         ElementText    X1    Y1    X2    Y2
1           50              0            1  Emergency Contacts   917  8793  2191  8878
2           51              0            1             Contact  1093  1320  1451  1388
3           51              2            1        Relationship  2444  1320  3026  1388
4           51              7            1          Work Phone  3329  1320  3898  1388
5           51              9            1          Home Phone  4260  1320  4857  1388
6           51             10            1          Cell Phone  5176  1320  5684  1388
7           51             12            1      Priority Phone  6143  1320  6495  1388
8           51             14            1     Contact Address  6542  1320  7300  1388
9           51             17            1                City  7939  1320  7300  1388
10          51             18            1               State  8808  1320  8137  1388
11          51             21            1                 Zip  9134  1320  9294  1388
12          52              0            1        Silvia Smith  1093  1458  1973  1526
13          52              2            1              Mother  2444  1458  2783  1526
13          52              7            1     (123) 456-78910  5176  1458  4979  1526
14          52             10            1              Austin  7939  1458  8406  1526
15          52             15            1               Texas  8808  1458  8961  1526
16          52             20            1               76063  9134  1458  9421  1526
17          52              2            1    1234 Parkside Ct  6542  1458  9421  1526
18          53              0            1         Naomi Smith  1093  2350  1973  1526
19          53              2            1                Aunt  2444  2350  2783  1526
20          53              7            1     (123) 456-78910  5176  2350  4979  1526
21          53             10            1              Austin  7939  2350  8406  1526
22          53             15            1               Texas  8808  2350  8961  1526
23          53             20            1               76063  9134  2350  9421  1526
24          53              2            1    3456 Parkside Ct  6542  2350  9421  1526
25          54             40            1   End Employee Line  6542  2350  9421  1526
25          55              0            1  Emergency Contacts   917  8793  2350  8878
I'm trying to separate each record by rows, taking the ElementRow column as a reference, keeping the headers from the first rows and then iterating through the rows after them. The X1 column indicates which header each value belongs under. I would like to have the data laid out this way:
   Contact       Relationship  Work Phone  Cell Phone       Priority  ContactAddress    City    State  Zip
1  Silvia Smith  Mother                    (123) 456-78910            1234 Parkside Ct  Austin  Texas  76063
2  Naomi Smith   Aunt                      (123) 456-78910            3456 Parkside Ct  Austin  Texas  76063
Things I tried:
To take the rows in between by iterating through the columns: I tried to slice using the first index and the last index, but it showed this error:
emergStartIndex = df.index[df['ElementText'] == 'Emergency Contacts']
emergLastIndex = df.index[df['ElementText'] == 'End Employee Line']
emerRows_between = df.iloc[emergStartIndex:emergLastIndex]
TypeError: cannot do positional indexing on RangeIndex with these indexers [Int64Index([...
That slicing does work with this numpy trick:
emerRows_between = df.iloc[np.r_[1:54,55:107]]
emerRows_between
but when I tried to substitute the indexes, it showed this:
emerRows_between = df.iloc[np.r_[emergStartIndex:emergLastIndex]]
emerRows_between
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I tried iterating row by row like this, but at some point the df reaches the end and I receive an index-out-of-bounds error.
emergencyContactRow1 = df[['ElementText','X1']].iloc[emergStartIndex+1].reset_index(drop=True)
emergencyContactRow2 = df[['ElementText','X1']].iloc[emergStartIndex+2].reset_index(drop=True)
emergencyContactRow3 = df[['ElementText','X1']].iloc[emergStartIndex+3].reset_index(drop=True)
emergencyContactRow4 = df[['ElementText','X1']].iloc[emergStartIndex+4].reset_index(drop=True)
emergencyContactRow5 = df[['ElementText','X1']].iloc[emergStartIndex+5].reset_index(drop=True)
emergencyContactRow6 = df[['ElementText','X1']].iloc[emergStartIndex+6].reset_index(drop=True)
emergencyContactRow7 = df[['ElementText','X1']].iloc[emergStartIndex+7].reset_index(drop=True)
emergencyContactRow8 = df[['ElementText','X1']].iloc[emergStartIndex+8].reset_index(drop=True)
emergencyContactRow9 = df[['ElementText','X1']].iloc[emergStartIndex+9].reset_index(drop=True)
emergencyContactRow10 = df[['ElementText','X1']].iloc[emergStartIndex+10].reset_index(drop=True)
frameEmergContact1 = [emergencyContactRow1, emergencyContactRow2, emergencyContactRow3, emergencyContactRow4, emergencyContactRow5, emergencyContactRow6, emergencyContactRow7, emergencyContactRow8, emergencyContactRow9, emergencyContactRow10]
df_emergContact1 = pd.concat(frameEmergContact1, axis=1)
df_emergContact1.columns = range(df_emergContact1.shape[1])
So how can I make this code dynamic, or avoid the index-out-of-bounds errors, while keeping my headers and taking as a reference only the first row after the Emergency Contacts row? I know I haven't tried to use the X1 column yet, but I first have to resolve how to iterate through those multiple indexes.
Each span from an Emergency Contacts index to an End Employee Line index belongs to one person (one employee) in the whole dataframe, so the idea, after capturing all those values, is to also keep a counter variable to see how many times data is captured between those two indexes.
It's a bit ugly, but this should do it. Basically you don't need the first or last two rows, so if you get rid of those and then pivot the X1 and ElementText columns you will be pretty close. Then it's a matter of getting rid of null values and promoting the first row to header.
df = df.iloc[1:-2][['ElementText','X1','ElementRow']].pivot(columns='X1',values='ElementText')
df = pd.DataFrame([x[~pd.isnull(x)] for x in df.values.T]).T
df.columns = df.iloc[0]
df = df[1:]
1. Split the dataframe into chunks whenever "Emergency Contacts" appears in column "ElementText"
2. Parse each chunk into the required format
3. Append to the output
import numpy as np
import pandas as pd

list_of_df = np.array_split(data, data[data["ElementText"]=="Emergency Contacts"].index)
output = pd.DataFrame()
for frame in list_of_df:
    df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
    if df.shape[0] > 0:
        temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        temp = temp.drop(0)
        output = output.append(temp, ignore_index=True)  # pandas < 2.0; use pd.concat on newer versions
>>> output
0 Contact Relationship Work Phone ... City State Zip
0 Silvia Smith Mother None ... Austin Texas 76063
1 Naomi Smith Aunt None ... Austin Texas 76063
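For reference, here is a pandas-native variant of the same chunk-and-pivot idea on a hypothetical miniature record: instead of np.array_split it groups on a running count of the marker rows, and it uses pd.concat (DataFrame.append was removed in pandas 2.0):

```python
import pandas as pd

# A hypothetical miniature of the extracted PDF frame: one header pair and
# one data pair between the two marker rows.
data = pd.DataFrame({
    'ElementText': ['Emergency Contacts', 'Contact', 'Relationship',
                    'Silvia Smith', 'Mother', 'End Employee Line'],
    'X1': [917, 1093, 2444, 1093, 2444, 6542],
})

# A running count of the marker rows gives every chunk its own group id,
# which replaces the np.array_split step.
chunk_id = (data['ElementText'] == 'Emergency Contacts').cumsum()

frames = []
for _, frame in data.groupby(chunk_id):
    rows = frame[~frame['ElementText'].isin(['Emergency Contacts',
                                             'End Employee Line'])].dropna()
    if len(rows):
        # One column per X1 position: header value on top, data value below.
        temp = pd.DataFrame(rows.groupby('X1')['ElementText'].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        frames.append(temp.drop(0))

output = pd.concat(frames, ignore_index=True)
```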

Create A New DataFrame Based on Conditions of Multiple DataFrames

I have two datasets: one with cancer positive patients (df_pos), and the other with the cancer negative patients (df_neg).
df_pos
id
0 123
1 124
2 125
df_neg
id
0 234
1 235
2 236
I want to compile these datasets into one, with an extra column indicating whether the patient has cancer (yes or no).
Here is my desired outcome:
id outcome
0 123 yes
1 124 yes
2 125 yes
3 234 no
4 235 no
5 236 no
What would be a smarter approach to compile these?
Any suggestions would be appreciated. Thanks!
Use pandas.DataFrame.assign with pandas.DataFrame.append (note: DataFrame.append was removed in pandas 2.0, where pd.concat does the same job):
>>> df_pos.assign(outcome='Yes').append(df_neg.assign(outcome='No'), ignore_index=True)
id outcome
0 123 Yes
1 124 Yes
2 125 Yes
3 234 No
4 235 No
5 236 No
df_pos['outcome'] = True
df_neg['outcome'] = False
df = pd.concat([df_pos, df_neg]).reset_index(drop=True)
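Combining the two answers into one self-contained sketch that matches the asked-for yes/no labels (using pd.concat, which also works on pandas 2.0 and later):

```python
import pandas as pd

# Rebuild the two frames from the question.
df_pos = pd.DataFrame({'id': [123, 124, 125]})
df_neg = pd.DataFrame({'id': [234, 235, 236]})

# Tag each frame with its outcome, then stack them.
df = pd.concat([df_pos.assign(outcome='yes'),
                df_neg.assign(outcome='no')],
               ignore_index=True)
```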

Grouping Hierarchical Parent-Child data using Pandas Dataframe - python

I have a data frame which I want to group based on the value of another column in the same data frame.
For example:
The Parent_Id and ID columns are linked and define who is related to whom in a hierarchical tree.
The dataframe looks like (input from a csv file)
No Name ID Parent_Id
1 Tom 211 111
2 Galie 209 111
3 Remo 200 101
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111
7 Armin 234 101
8 Boris 454 109
9 Katya 109 323
I would like to group this data frame based on ID and Parent_Id into the grouping below, and generate CSV files from it based on the top-level parent, i.e. Alfred.csv, Carmen.csv (which will have only its own entry, i.e. line #4), and Katya.csv, using the to_csv() function.
Alfred
|_ Galie
|_ Tom
|_ Marvela
   |_ Remo
   |_ Armin
Carmen
Katya
|_ Boris
And, I want to create a new column in the same data frame, that will have a tag indicating the hierarchy. Like:
No Name ID Parent_Id Tag
1 Tom 211 111 Alfred
2 Galie 209 111 Alfred
3 Remo 200 101 Marvela, Alfred
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111 Alfred
7 Armin 234 101 Marvela, Alfred
8 Boris 454 109 Katya
9 Katya 109 323
Note that the names can repeat, but the ID will be unique.
Kindly let me know how to achieve this using pandas. I tried out groupby(), but it seems a little complicated and I am not getting what I intend. There should be one file for each parent, with the child records in the parent's file.
If a child has children of its own (like Marvela), it qualifies to have its own csv file.
And the final output would be
Alfred.csv - All records matching Galie, Tom, Marvela
Marvela.csv - All records matching Remo, Armin
Carmen.csv - Only record matching carmen (row)
Katya.csv - all records matching katya, boris
I would write a recursive function to do this.
First, create dictionaries of {ID: Name} and {ID: Parent_Id}, and the recursive function.
id_name_dict = dict(zip(df.ID, df.Name))
parent_dict = dict(zip(df.ID, df.Parent_Id))
def find_parent(x):
    value = parent_dict.get(x, None)
    if value is None:
        return ""
    else:
        # In case there is an id without a name.
        if id_name_dict.get(value, None) is None:
            return "" + find_parent(value)
        return str(id_name_dict.get(value)) + ", " + find_parent(value)
Then create the new column with Series.apply and remove the trailing ', ' with Series.str.rstrip:
df['Tag'] = df.ID.apply(lambda x: find_parent(x)).str.rstrip(', ')
df
No Name ID Parent_Id Tag
0 1 Tom 211 111 Alfred
1 2 Galie 209 111 Alfred
2 3 Remo 200 101 Marvela, Alfred
3 4 Carmen 212 121
4 5 Alfred 111 191
5 6 Marvela 101 111 Alfred
6 7 Armin 234 101 Marvela, Alfred
7 8 Boris 454 109 Katya
8 9 Katya 109 323
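Put together, a runnable sketch of this approach on a four-row subset of the question's data:

```python
import pandas as pd

# A four-row subset of the question's frame (IDs are unique; 191 has no row).
df = pd.DataFrame({'Name': ['Tom', 'Remo', 'Alfred', 'Marvela'],
                   'ID': [211, 200, 111, 101],
                   'Parent_Id': [111, 101, 191, 111]})

id_name_dict = dict(zip(df.ID, df.Name))      # ID -> Name
parent_dict = dict(zip(df.ID, df.Parent_Id))  # ID -> Parent_Id

def find_parent(x):
    """Collect the names of all ancestors of id x, comma-separated."""
    value = parent_dict.get(x, None)
    if value is None:
        return ""
    if id_name_dict.get(value, None) is None:
        # Parent id exists but has no row of its own (e.g. Alfred's 191).
        return "" + find_parent(value)
    return str(id_name_dict.get(value)) + ", " + find_parent(value)

df['Tag'] = df.ID.apply(find_parent).str.rstrip(', ')
```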
