How to join concatenated values to new values in Python

Hi, I'm new to Python and trying to understand joining.
I have two dataframes:
df1
OutputValues
12-99
22-99
264-99
12-323,138-431
4-21
12-123
df2
OldId NewId
99 191
84 84
323 84
59 59
431 59
208 59
60 59
58 59
325 59
390 59
324 59
564 564
123 564
21 21
I want to join these based on the second half of the values in df1, i.e. the values after the hyphen; for example, 12-99 joins to OldId 99 in df2, and 4-21 to OldId 21.
The final output dataframe should carry the new values from df2 and look like:
df3
OutputValues OutputValues2
12-99 12-191
22-99 22-191
264-99 264-191
12-323,138-431 12-84,138-59
4-21 4-21
12-123 12-564
As you can see, the first part of each pair is kept and the second part is joined to its NewId in my desired final output dataframe df3: where there is 99 it is replaced with 191, 323 with 84, 431 with 59, and 123 with 564.
How can I do this?

Let's extract both parts, map the last part to its new id, then concatenate back:
s = df1.OutputValues.str.extractall(r'(\d+-)(\d+)')
df1['OutputValues2'] = (s[0] + s[1].map(df2.astype(str).set_index('OldId')['NewId'])
                        ).groupby(level=0).agg(','.join)
Output:
OutputValues OutputValues2
0 12-99 12-191
1 22-99 22-191
2 264-99 264-191
3 12-323,138-431 12-84,138-59
4 4-21 4-21
5 12-123 12-564
Update: it looks like a simple replace would also work, though it might fail in some edge cases:
df1['OutputValues2'] = df1.OutputValues.replace(('-' + df2.astype(str))
                                                .set_index('OldId')['NewId'],
                                                regex=True)
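One such edge case: with regex=True a shorter id could match inside a longer one (e.g. 21 inside 210). If that matters, a hedged sketch is to build the mapping explicitly and anchor each pattern with a negative lookahead (assumes the ids are plain integers):
# (?!\d) stops a short id from matching inside a longer one
mapping = {r'-{}(?!\d)'.format(o): '-{}'.format(n)
           for o, n in zip(df2.OldId.astype(str), df2.NewId.astype(str))}
df1['OutputValues2'] = df1.OutputValues.replace(mapping, regex=True)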

# Split OutputValues on ',' and explode, then split each pair on '-' and join back to df1
g = df1['OutputValues'].str.split(',').explode().str.split('-', expand=True).join(df1)
# Merge df2 with the exploded frame on the second half of each pair
df3 = df2.astype(str).merge(g, left_on='OldId', right_on=1)
# Build OutputValues2 from the first half plus the new id, then drop the helper columns
df3 = df3.assign(OutputValues2=df3[0].str.cat(df3.NewId, sep='-')).drop(columns=['OldId', 'NewId', 0, 1])
df3 = df3.groupby('OutputValues')['OutputValues2'].agg(','.join).reset_index()
df3
OutputValues OutputValues2
0 12-123 12-564
1 12-323,138-431 12-84,138-59
2 12-99 12-191
3 22-99 22-191
4 264-99 264-191
5 4-21 4-21
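Note that the final groupby sorts the rows by OutputValues, so they come back in a different order than in df1. If the original order matters, one small sketch is to merge the result back onto df1:
# Restore df1's original row order after the sorting groupby
df3 = df1[['OutputValues']].merge(df3, on='OutputValues', how='left')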

Related

Find occurrences where a column from one dataframe equals another, based on a condition

I have the following two dataframes, which have different sizes: df1 (966 rows x 2 cols) and df2 (36 rows x 2 cols), where
df1:
Video_# Selected Joint.1
484 1 Left_shoulder
778 1 Left_shoulder
418 1 Right_shoulder
964 1 Right_shoulder
193 1 Right_shoulder
... ... ...
285 36 Right_elbow
267 36 Left_hand
216 36 Shoulder_centre
139 36 Right_shoulder
df2:
Video_# Ann.1
0 1 Shoulder_center
1 2 Head
2 3 Right_hip
... ... ...
33 34 Left_knee
34 35 Right_knee
35 36 Right_shoulder
Video_# goes from 1-36. In df2 the Video_# column has just a single occurrence of each value, so 1-36 appear once each, whereas in df1 there are multiple occurrences of each of 1-36, and not the same number for each (I hope that makes sense).
What I want to count is the number of occurrences where df1['Selected Joint.1'] == df2['Ann.1'], per Video_#. So the expected output is (e.g.):
Video_# Equality Occurrences
1 3
2 5
... ...
36 6
Is that possible?
Use DataFrame.merge with GroupBy.size:
df = (df1.merge(df2,
                left_on=['Video_#', 'Selected Joint.1'],
                right_on=['Video_#', 'Ann.1'])
         .groupby('Video_#')
         .size()
         .reset_index(name='Equality Occurrences'))
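One caveat: the inner merge drops any Video_# with zero matching rows. If those should appear with a 0 count, a small follow-up sketch (assuming Video_# runs 1-36):
# Re-include videos that had no matches, with a count of 0
df = (df.set_index('Video_#')
        .reindex(range(1, 37), fill_value=0)
        .rename_axis('Video_#')
        .reset_index())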

Can I read a range of rows from a pandas dataframe based on some column value?

This is my data,
prakash 101
Ram 107
akash 103
sakshi 115
vidushi 110
aman 106
lakshay 99
I want to select all rows from akash to vidushi, or all rows from Ram to aman. In real scenarios there will be thousands of rows and multiple columns, and I will get multiple queries to select a range of rows on the basis of some column value. How can I do that?
Here's one way to do it:
start = 'akash'
end = 'vidushi'
l = list(df['names'])  # ordered list of names
subl = l[l.index(start):l.index(end) + 1]  # names between start and end, inclusive
df[df['names'].isin(subl)]  # filter the dataset to that list of names
2 akash 103
3 sakshi 115
4 vidushi 110
Create some variables (which you can adjust), then use .loc and .index[0]. Note: df[0] can be replaced with the name of your header; if your header is called Names, change all instances of df[0] to df['Names']:
var1 = 'Ram'
var2 = 'aman'
a = df.loc[df[0]==var1].index[0]
b = df.loc[df[0]==var2].index[0]
c = df.iloc[a:b+1,:]
c
output:
0 1
1 Ram 107
2 akash 103
3 sakshi 115
4 vidushi 110
5 aman 106
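A caveat worth hedging: .index[0] returns a label, while .iloc slices by position, so the above only lines up on a default RangeIndex. A small sketch that converts labels to positions explicitly with get_loc:
# Works with any index, not just the default 0..n-1 RangeIndex
a = df.index.get_loc(df.loc[df[0] == var1].index[0])
b = df.index.get_loc(df.loc[df[0] == var2].index[0])
c = df.iloc[a:b + 1, :]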
Try set_index, then use loc:
df = pd.DataFrame({"name": ["prakash", "Ram", "akash", "sakshi", "vidushi", "aman", "lakshay"],
                   "val": [101, 107, 103, 115, 110, 106, 99]})
df.set_index('name').loc["akash":"vidushi"].reset_index()
output:
name val
0 akash 103
1 sakshi 115
2 vidushi 110
You can also slice rows by position:
print(x[2:6])
# output
akash 103
sakshi 115
vidushi 110
aman 106
If you want to fill values based on a condition over a specific column, you can use np.where, for example:
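A minimal, self-contained sketch (the in_range column and the yes/no labels are just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['prakash', 'Ram', 'akash', 'sakshi',
                             'vidushi', 'aman', 'lakshay'],
                   'val': [101, 107, 103, 115, 110, 106, 99]})
start = df.index[df['names'] == 'akash'][0]
end = df.index[df['names'] == 'vidushi'][0]
# Label rows whose position falls inside the start..end range
df['in_range'] = np.where(df.index.to_series().between(start, end), 'yes', 'no')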

How to manipulate the index in one dataframe and filter for indices in another

I started learning pandas and am stuck on the issue below:
I have two large DataFrames:
df1=
ID KRAS ATM
TCGA-3C-AAAU-01A-11R-A41B-07 101 32
TCGA-3C-AALI-01A-11R-A41B-07 101 75
TCGA-3C-AALJ-01A-31R-A41B-07 102 65
TCGA-3C-ARLJ-01A-61R-A41B-07 87 54
df2=
ID BRCA1 ATM
TCGA-A1-A0SP 54 65
TCGA-3C-AALI 191 8
TCGA-3C-AALJ 37 68
The ID is the index in both dataframes. First, I want to cut the ID name down to only the first 10 digits (convert TCGA-3C-AAAU-01A-11R-A41B-07 to TCGA-3C-AAAU) in df1. Then I want to produce a new df from df1 which has only the IDs that exist in df2.
df3 should look like:
ID KRAS ATM
TCGA-3C-AALI 101 75
TCGA-3C-AALJ 102 65
I tried different ways but failed. Any suggestions, please?
Here is one way, using vectorised string methods:
# truncate to the first 10 characters, or 12 including the '-'s
df1['ID'] = df1['ID'].str[:12]
# filter for IDs in df2
df3 = df1[df1['ID'].isin(df2['ID'])]
Result:
ID KRAS ATM
1 TCGA-3C-AALI 101 75
2 TCGA-3C-AALJ 102 65
Explanation
Use the .str accessor to limit df1['ID'] to its first 12 characters.
Mask df1 to include only IDs found in df2.
IIUC, TCGA-3C-AAAU contains 12 characters :-)
df3 = df1.assign(ID=df1.ID.str[:12]).loc[lambda x: x.ID.isin(df2.ID), :]
df3
Out[218]:
ID KRAS ATM
1 TCGA-3C-AALI 101 75
2 TCGA-3C-AALJ 102 65
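Since the question says ID is the index in both frames, the same idea also works as an index-based sketch:
# If ID is the index rather than a column, operate on the index directly
df1.index = df1.index.str[:12]
df3 = df1[df1.index.isin(df2.index)]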

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with distance values, shown as matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# that we are iterating over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell's data is to:
use an apply function to go row by row
then use a join on column B in inp_df with matrix_df, where matrix_df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since the number of columns in matrix_df varies per cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the value of column B is unique, while the values of column A may or may not be unique.
Also, matrix_df's first column name was empty, and I renamed it with the following code for ease of understanding, since it was a header-less matrix output file:
dist_df = pd.read_csv(mypath, index_col=False)
dist_df.rename(columns={'Unnamed: 0': 'B'}, inplace=True)
Step 1: Concatenate the matrix inputs with pd.concat and merge with inp_df using df.merge:
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply, using each row's A value to index into the matching column:
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
                [['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
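As a hedged alternative that avoids the row-wise apply, each matrix can first be melted to long form and merged once (a sketch using the toy frames above; matrix_df1 and matrix_df2 are assumed names):
# Melt the matrices into long (B, A, distance, avg_distance) rows, then merge
long_df = (pd.concat([matrix_df1, matrix_df2])
             .melt(id_vars=['B', 'avg_distance'], var_name='A', value_name='distance'))
long_df['A'] = long_df['A'].astype(int)  # column headers arrive as strings from CSV
out_df = inp_df.merge(long_df, on=['A', 'B'])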

Merging 2 CSV data sets with Python on a common ID column, where one CSV has multiple records per unique ID

I'm very new to Python. Any support is much appreciated.
I have two CSV files that I'm trying to merge using a Student_ID column, creating a new CSV file.
csv1: every entry has a unique Student_ID
Student_ID Age Course startYear
119 24 Bsc 2014
csv2: has multiple records per Student_ID, as there is a new entry for every subject the student is taking
Student_ID sub_name marks Sub_year_level
119 Botany1 60 2
119 Anatomy 70 2
119 cell bio 75 3
129 Physics1 78 2
129 Math1 60 1
I want to merge the two CSV files so that I have all records and columns from csv1, plus newly created columns that hold, from csv2, the average mark (to be calculated) for each Sub_year_level per student. The final CSV file will then have unique Student_IDs in all records.
What I want my new output CSV file to look like:
Student_ID Age Course startYear level1_avg_mark level2_avg_mark level3_avg_mark
119 24 Bsc 2014 60 65 70
You can use pivot_table with join.
Notice: the parameter fill_value replaces NaN with 0; if that is not necessary, remove it. The default aggregate function is mean.
df2 = (df2.pivot_table(index='Student_ID',
                       columns='Sub_year_level',
                       values='marks',
                       fill_value=0)
          .rename(columns='level{}_avg_mark'.format))
print (df2)
Sub_year_level level1_avg_mark level2_avg_mark level3_avg_mark
Student_ID
119 0 65 75
129 60 78 0
df = df1.join(df2, on='Student_ID')
print (df)
   Student_ID  Age Course  startYear  level1_avg_mark  level2_avg_mark  level3_avg_mark
0         119   24    Bsc       2014                0               65               75
EDIT: if zero marks are placeholders that should not count toward the average, you need a custom function:
print (df2)
Student_ID sub_name marks Sub_year_level
0 119 Botany1 0 2
1 119 Botany1 0 2
2 119 Anatomy 72 2
3 119 cell bio 75 3
4 129 Physics1 78 2
5 129 Math1 60 1
f = lambda x: x[x != 0].mean()
df2 = (df2.pivot_table(index='Student_ID', columns='Sub_year_level',
                       values='marks', aggfunc=f)
          .rename(columns='level{}_avg_mark'.format).reset_index())
print (df2)
Sub_year_level Student_ID level1_avg_mark level2_avg_mark level3_avg_mark
0 119 NaN 72.0 75.0
1 129 60.0 78.0 NaN
You can use groupby to calculate the average marks per level, then unstack to get all levels in one row, and rename the columns. Once that is done, groupby + unstack has conveniently left Student_ID in the index, which allows for an easy join; all that is left is to do the join and specify the on parameter.
df1.join(
    df2.groupby(
        ['Student_ID', 'Sub_year_level']
    ).marks.mean().unstack().rename(columns='level{}_avg_mark'.format),
    on='Student_ID'
)
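If missing levels should read as 0 rather than NaN in the joined result, one small follow-up (a sketch on the same frames):
result = df1.join(
    df2.groupby(['Student_ID', 'Sub_year_level'])
       .marks.mean()
       .unstack()
       .rename(columns='level{}_avg_mark'.format),
    on='Student_ID'
).fillna(0)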
