I have a data frame which I want to group based on the value of another column in the same data frame.
For example:
The ID and Parent_Id columns are linked and define who is related to whom in a hierarchical tree.
The dataframe looks like this (input from a CSV file):
No Name ID Parent_Id
1 Tom 211 111
2 Galie 209 111
3 Remo 200 101
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111
7 Armin 234 101
8 Boris 454 109
9 Katya 109 323
I would like to group this data frame based on ID and Parent_Id into the grouping below, and generate CSV files from it based on the top-level parent, i.e. Alfred.csv, Carmen.csv (which will have only its own entry, i.e. line #4), Katya.csv, using the to_csv() function.
Alfred
|_ Galie
|_ Tom
|_ Marvela
   |_ Remo
   |_ Armin
Carmen
Katya
|_ Boris
And I want to create a new column in the same data frame that will have a tag indicating the hierarchy, like:
No Name ID Parent_Id Tag
1 Tom 211 111 Alfred
2 Galie 209 111 Alfred
3 Remo 200 101 Marvela, Alfred
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111 Alfred
7 Armin 234 101 Marvela, Alfred
8 Boris 454 109 Katya
9 Katya 109 323
Note that the names can repeat, but the ID will be unique.
Kindly let me know how to achieve this using pandas. I tried groupby(), but it seems a little complicated and doesn't give what I intend. There should be one file for each parent, with the child records in the parent's file.
If a child has children of its own (like Marvela), it qualifies to have its own CSV file.
And the final output would be:
Alfred.csv - all records matching Galie, Tom, Marvela
Marvela.csv - all records matching Remo, Armin
Carmen.csv - only the record matching Carmen (row 4)
Katya.csv - all records matching Katya, Boris
I would write a recursive function to do this.
First, create dictionaries of {id: name} and {id: parent_id}, plus the recursive function:
id_name_dict = dict(zip(df.ID, df.Name))
parent_dict = dict(zip(df.ID, df.Parent_Id))

def find_parent(x):
    value = parent_dict.get(x, None)
    if value is None:
        return ""
    else:
        # In case there is an ID without a name.
        if id_name_dict.get(value, None) is None:
            return "" + find_parent(value)
        return str(id_name_dict.get(value)) + ", " + find_parent(value)
Then create the new column with Series.apply and strip the trailing ', ' with Series.str.rstrip:
df['Tag'] = df.ID.apply(lambda x: find_parent(x)).str.rstrip(', ')
df
No Name ID Parent_Id Tag
0 1 Tom 211 111 Alfred
1 2 Galie 209 111 Alfred
2 3 Remo 200 101 Marvela, Alfred
3 4 Carmen 212 121
4 5 Alfred 111 191
5 6 Marvela 101 111 Alfred
6 7 Armin 234 101 Marvela, Alfred
7 8 Boris 454 109 Katya
8 9 Katya 109 323
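For the per-parent CSV files, here is a hedged sketch building on the Tag column just created. The membership rules are assumptions read from the expected output: every name that appears in some row's Tag gets a CSV of all rows tagged with it, and untagged rows whose name is nobody's ancestor (e.g. Carmen) get a file containing just their own row.

# Sketch, assuming the Tag column built above.
ancestors = set(df['Tag'].str.split(', ').explode()) - {''}

for name in ancestors:
    # All rows whose ancestry chain mentions this name.
    mask = df['Tag'].str.split(', ').apply(lambda chain: name in chain)
    df[mask].to_csv(f'{name}.csv', index=False)

# Top-level entries that are nobody's ancestor get their own single-row file.
standalone = df[(df['Tag'] == '') & (~df['Name'].isin(ancestors))]
for _, row in standalone.iterrows():
    row.to_frame().T.to_csv(f"{row['Name']}.csv", index=False)

With the sample data this writes Alfred.csv, Marvela.csv, Katya.csv and Carmen.csv.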
Related
I have two dataframes:
Researchers: a list of all researchers and their id numbers.
Samples: a list of samples and all researchers related to each one; there may be several researchers in the same cell.
I want to go through every row in the researchers table and check whether that researcher occurs in each row of the samples table. If they do, I want to get their id from the researchers table and the sample number from the samples table.
Table researcher
id_researcher full_name
0 1 Jack Sparrow
1 2 Demi moore
2 3 Bickman
3 4 Charles Darwin
4 5 H. Haffer
Table samples
sample_number collector
230 INPA A 231 Haffer
231 INPA A 232 Charles Darwin
232 INPA A 233 NaN
233 INPA A 234 NaN
234 INPA A 235 Jack Sparrow; Demi Moore ; Bickman
Output I want:
id_researcher num_samples
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
I was able to do it with a loop in regular Python with the following code, but it is extremely slow and quite long. Does anyone know a faster and simpler way, perhaps with pandas apply?
id_researcher = []
id_num_sample = []
for c in range(len(data_researcher)):
    for a in range(len(data_samples)):
        if (not pd.isna(data_samples['collector'].iloc[a])
                and data_researcher['full_name'].iloc[c] in data_samples['collector'].iloc[a]):
            id_researcher.append(data_researcher['id_researcher'].iloc[c])
            id_num_sample.append(data_samples['sample_number'].iloc[a])

data_researcher_sample = (pd.DataFrame.from_dict({'id_researcher': id_researcher,
                                                  'num_sample': id_num_sample})
                          .sort_values(by='num_sample'))
You have a few data cleaning jobs to do first, such as 'Moore' being lowercase in one table, 'Haffer' having a first-name initial in one table and none in the other, etc. After normalizing your two dataframes, you can split and explode the collector column and use merge:
samples['collector'] = samples['collector'].str.split(';')
samples = samples.explode('collector')
samples['collector'] = samples['collector'].str.strip()
out = (researchers.merge(samples, right_on='collector', left_on='full_name', how='left')
       [['id_researcher', 'sample_number']]
       .sort_values(by='sample_number')
       .reset_index(drop=True))
Output:
id_researcher sample_number
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
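The normalization itself is the fiddly part; one possible sketch, under the assumption that the lowercased last word of each name is a reliable match key (which holds for this sample data but not necessarily in general):

# Hypothetical normalization: key both tables on the lowercased last
# word of the name, so 'H. Haffer' and 'Haffer' both map to 'haffer'.
researchers['key'] = researchers['full_name'].str.split().str[-1].str.lower()
samples['key'] = samples['collector'].str.split().str[-1].str.lower()

out = (researchers.merge(samples, on='key', how='inner')
       [['id_researcher', 'sample_number']]
       .sort_values(by='sample_number')
       .reset_index(drop=True))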
I have a dataframe that looks like this:
import pandas as pd
### create toy data set
data = [[1111, '10/1/2021',  21, 123],
        [1111, '10/1/2021', -21, 123],
        [1111, '10/1/2021',  21, 123],
        [2222, '10/2/2021',  15, 234],
        [2222, '10/2/2021',  15, 234],
        [3333, '10/3/2021',  15, 234],
        [3333, '10/3/2021',  15, 234]]
df = pd.DataFrame(data, columns=['Individual', 'date', 'number', 'cc'])
What I want to do is remove rows where Individual, date, and cc are the same, but number is a negative value in one case and a positive in the other case. For example, in the first three rows, I would remove rows 1 and 2 (because 21 and -21 values are equal in absolute terms), but I don't want to remove row 3 (because I have already accounted for the negative value in row 2 by eliminating row 1). Also, I don't want to remove duplicated values if the corresponding number values are positive. I have tried a variety of duplicated() approaches, but just can't get it right.
Expected results would be:
Individual date number cc
0 1111 10/1/2021 21 123
1 2222 10/2/2021 15 234
2 2222 10/2/2021 15 234
3 3333 10/3/2021 15 234
4 3333 10/3/2021 15 234
Thus, the first two rows are removed, but not the third row, since the negative value is already accounted for.
Any assistance would be appreciated. I am trying to do this without a loop, but it may be unavoidable. It seems similar to this question, but I can't figure out how to make it work in my case, as I am trying to avoid loops.
You could try the below. Create a separate df called n that contains the rows with a negative 'number', and join it to the original with indicator=True:
n = df.loc[df.number.le(0)].drop('number',axis=1)
df = pd.merge(df,n,'left',indicator=True)
>>> df
Individual date number cc _merge
0 1111 10/1/2021 21 123 both
1 1111 10/1/2021 -21 123 both
2 1111 10/1/2021 21 123 both
3 2222 10/2/2021 15 234 left_only
4 2222 10/2/2021 15 234 left_only
5 3333 10/3/2021 15 234 left_only
6 3333 10/3/2021 15 234 left_only
This will allow us to identify the Individual/date/cc groups that have a negative 'number' row.
Then you can locate the rows with 'both' in _merge, use only those to perform a groupby().head(2), and concatenate the result with the rest of the df:
out = pd.concat([df.loc[df._merge.eq('both')].groupby(['Individual','date','cc']).head(2),
df.loc[df._merge.ne('both')]]).drop('_merge',axis=1)
Which prints:
Individual date number cc
0 1111 10/1/2021 21 123
1 1111 10/1/2021 -21 123
3 2222 10/2/2021 15 234
4 2222 10/2/2021 15 234
5 3333 10/3/2021 15 234
6 3333 10/3/2021 15 234
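Note that this keeps each matched +/- pair via head(2) rather than dropping it. If the goal is to drop every matched pair, as in the expected results above, a hedged alternative sketch (starting again from the original df, before the indicator merge) is to pair the k-th negative with the k-th positive of the same magnitude and discard both:

# Sketch, assuming every negative has a matching positive in its group.
df['abs'] = df['number'].abs()
df['neg'] = df['number'].lt(0)
# Occurrence counter so the k-th positive pairs with the k-th negative.
df['occ'] = df.groupby(['Individual', 'date', 'cc', 'abs', 'neg']).cumcount()

# Keys of the negative rows; any row sharing a key is part of a matched pair.
neg_keys = df.loc[df['neg'], ['Individual', 'date', 'cc', 'abs', 'occ']]
merged = df.merge(neg_keys, on=['Individual', 'date', 'cc', 'abs', 'occ'],
                  how='left', indicator=True)
out = (merged[merged['_merge'] == 'left_only']
       .drop(columns=['abs', 'neg', 'occ', '_merge'])
       .reset_index(drop=True))

On the toy data this drops the first 21/-21 pair, keeps the third row, and leaves the all-positive duplicates untouched, matching the expected results.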
So I have this df (table) coming from a PDF transformation, for example:
    ElementRow  ElementColumn  ElementPage  ElementText         X1    Y1    X2    Y2
1           50              0            1  Emergency Contacts   917  8793  2191  8878
2           51              0            1  Contact             1093  1320  1451  1388
3           51              2            1  Relationship        2444  1320  3026  1388
4           51              7            1  Work Phone          3329  1320  3898  1388
5           51              9            1  Home Phone          4260  1320  4857  1388
6           51             10            1  Cell Phone          5176  1320  5684  1388
7           51             12            1  Priority Phone      6143  1320  6495  1388
8           51             14            1  Contact Address     6542  1320  7300  1388
9           51             17            1  City                7939  1320  7300  1388
10          51             18            1  State               8808  1320  8137  1388
11          51             21            1  Zip                 9134  1320  9294  1388
12          52              0            1  Silvia Smith        1093  1458  1973  1526
13          52              2            1  Mother              2444  1458  2783  1526
13          52              7            1  (123) 456-78910     5176  1458  4979  1526
14          52             10            1  Austin              7939  1458  8406  1526
15          52             15            1  Texas               8808  1458  8961  1526
16          52             20            1  76063               9134  1458  9421  1526
17          52              2            1  1234 Parkside Ct    6542  1458  9421  1526
18          53              0            1  Naomi Smith         1093  2350  1973  1526
19          53              2            1  Aunt                2444  2350  2783  1526
20          53              7            1  (123) 456-78910     5176  2350  4979  1526
21          53             10            1  Austin              7939  2350  8406  1526
22          53             15            1  Texas               8808  2350  8961  1526
23          53             20            1  76063               9134  2350  9421  1526
24          53              2            1  3456 Parkside Ct    6542  2350  9421  1526
25          54             40            1  End Employee Line   6542  2350  9421  1526
25          55              0            1  Emergency Contacts   917  8793  2350  8878
I'm trying to split each record into its own row, taking the ElementRow column as a reference, keeping the headers from the first rows and then iterating through the rows after them. The X1 column indicates which header each value belongs under. I would like to have the data like this:
   Contact       Relationship  Work Phone  Cell Phone       Priority  ContactAddress    City    State  Zip
1  Silvia Smith  Mother                    (123) 456-78910            1234 Parkside Ct  Austin  Texas  76063
2  Naomi Smith   Aunt                      (123) 456-78910            3456 Parkside Ct  Austin  Texas  76063
Things I tried:
Taking the rows in between by iterating through the columns: I tried to slice using the first and the last index, but it showed this error:
emergStartIndex = df.index[df['ElementText'] == 'Emergency Contacts']
emergLastIndex = df.index[df['ElementText'] == 'End Employee Line']
emerRows_between = df.iloc[emergStartIndex:emergLastIndex]
TypeError: cannot do positional indexing on RangeIndex with these indexers [Int64Index([...
Slicing does work with this numpy trick:
emerRows_between = df.iloc[np.r_[1:54,55:107]]
emerRows_between
But when I try to plug in the found indexes instead, it shows this:
emerRows_between = df.iloc[np.r_[emergStartIndex:emergLastIndex]]
emerRows_between
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I tried iterating row by row like this, but at some point the df reaches the end and I receive an index-out-of-bounds error.
emergencyContactRow1 = df[['ElementText', 'X1']].iloc[emergStartIndex + 1].reset_index(drop=True)
emergencyContactRow2 = df[['ElementText', 'X1']].iloc[emergStartIndex + 2].reset_index(drop=True)
emergencyContactRow3 = df[['ElementText', 'X1']].iloc[emergStartIndex + 3].reset_index(drop=True)
emergencyContactRow4 = df[['ElementText', 'X1']].iloc[emergStartIndex + 4].reset_index(drop=True)
emergencyContactRow5 = df[['ElementText', 'X1']].iloc[emergStartIndex + 5].reset_index(drop=True)
emergencyContactRow6 = df[['ElementText', 'X1']].iloc[emergStartIndex + 6].reset_index(drop=True)
emergencyContactRow7 = df[['ElementText', 'X1']].iloc[emergStartIndex + 7].reset_index(drop=True)
emergencyContactRow8 = df[['ElementText', 'X1']].iloc[emergStartIndex + 8].reset_index(drop=True)
emergencyContactRow9 = df[['ElementText', 'X1']].iloc[emergStartIndex + 9].reset_index(drop=True)
emergencyContactRow10 = df[['ElementText', 'X1']].iloc[emergStartIndex + 10].reset_index(drop=True)

frameEmergContact1 = [emergencyContactRow1, emergencyContactRow2, emergencyContactRow3,
                      emergencyContactRow4, emergencyContactRow5, emergencyContactRow6,
                      emergencyContactRow7, emergencyContactRow8, emergencyContactRow9,
                      emergencyContactRow10]
df_emergContact1 = pd.concat(frameEmergContact1, axis=1)
df_emergContact1.columns = range(df_emergContact1.shape[1])
So how can I make this code dynamic, or how can I avoid the index-out-of-bounds errors while keeping my headers, taking as a reference only the first row after the 'Emergency Contacts' row? I know I haven't tried to use the X1 column yet, but I first have to resolve how to iterate through those multiple index ranges.
Each stretch from an 'Emergency Contacts' index to an 'End Employee Line' index belongs to one person (one employee) in the whole dataframe, so the idea, after capturing all those values, is to also keep a counter of how many times data is captured between those two markers.
It's a bit ugly, but this should do it. Basically, you don't need the first or the last two rows, so if you get rid of those and then pivot the X1 and ElementText columns, you will be pretty close. Then it's a matter of getting rid of null values and promoting the first row to the header.
df = df.iloc[1:-2][['ElementText', 'X1', 'ElementRow']].pivot(columns='X1', values='ElementText')
df = pd.DataFrame([x[~pd.isnull(x)] for x in df.values.T]).T
df.columns = df.iloc[0]
df = df[1:]
1. Split the dataframe into chunks whenever "Emergency Contacts" appears in the "ElementText" column.
2. Parse each chunk into the required format.
3. Append each parsed chunk to the output.
import numpy as np

list_of_df = np.array_split(data, data[data["ElementText"] == "Emergency Contacts"].index)
output = pd.DataFrame()
for frame in list_of_df:
    df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
    if df.shape[0] > 0:
        temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        temp = temp.drop(0)
        output = output.append(temp, ignore_index=True)
>>> output
0 Contact Relationship Work Phone ... City State Zip
0 Silvia Smith Mother None ... Austin Texas 76063
1 Naomi Smith Aunt None ... Austin Texas 76063
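One caveat: DataFrame.append was removed in pandas 2.0. On current versions, the same loop can collect the chunks in a list and concatenate once at the end, for example:

# pandas >= 2.0 variant of the same loop: gather frames, concat once.
frames = []
for frame in list_of_df:
    df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
    if df.shape[0] > 0:
        temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        frames.append(temp.drop(0))
output = pd.concat(frames, ignore_index=True)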
I have two datasets: one with cancer positive patients (df_pos), and the other with the cancer negative patients (df_neg).
df_pos
id
0 123
1 124
2 125
df_neg
id
0 234
1 235
2 236
I want to compile these datasets into one with an extra column indicating whether the patient has cancer (yes or no).
Here is my desired outcome:
id outcome
0 123 yes
1 124 yes
2 125 yes
3 234 no
4 235 no
5 236 no
What would be a smarter approach to compile these?
Any suggestions would be appreciated. Thanks!
Use pandas.DataFrame.append and pandas.DataFrame.assign:
>>> df_pos.assign(outcome='Yes').append(df_neg.assign(outcome='No'), ignore_index=True)
id outcome
0 123 Yes
1 124 Yes
2 125 Yes
3 234 No
4 235 No
5 236 No
# Alternative: concatenate with a boolean flag instead of string labels.
df_pos['outcome'] = True
df_neg['outcome'] = False
df = pd.concat([df_pos, df_neg]).reset_index(drop=True)
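If the literal yes/no strings from the desired outcome are needed, the boolean flag can be mapped afterwards:

# Map the boolean flag to the 'yes'/'no' labels from the question.
df['outcome'] = df['outcome'].map({True: 'yes', False: 'no'})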
This question already has an answer here: Convert pandas.groupby to dict (1 answer). Closed 4 years ago.
Given a table (/dataFrame) x:
name day earnings revenue
Oliver 1 100 44
Oliver 2 200 69
John 1 144 11
John 2 415 54
John 3 33 10
John 4 82 82
Is it possible to split the table into two tables based on the name column (which acts as an index), and nest the two tables under the same object (not sure about the exact term to use)? So in the example above, tables[0] will be:
name day earnings revenue
Oliver 1 100 44
Oliver 2 200 69
and tables[1] will be:
name day earnings revenue
John 1 144 11
John 2 415 54
John 3 33 10
John 4 82 82
Note that the number of rows in each 'sub-table' may vary.
Cheers,
Create a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('name')))
Then select by key, i.e. the value of the name column:
print (dfs['Oliver'])
print (dfs['John'])
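If positional access like tables[0] is specifically wanted, a list works as well; a small sketch (sort=False keeps the groups in order of first appearance):

# List of sub-tables, one per distinct name.
tables = [group for _, group in df.groupby('name', sort=False)]
print(tables[0])  # Oliver's rows
print(tables[1])  # John's rows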