Merging three dataframes with three similar indexes

I have three dataframes
df1:
Date ID Number ID2 info_df1
2021-12-11 1 34 36 60
2021-12-10 2 33 35 57
2021-12-09 3 32 34 80
2021-12-08 4 31 33 55
df2:
Date ID Number ID2 info_df2
2021-12-10 2 18 20 50
2021-12-11 1 34 36 89
2021-12-10 2 33 35 40
2021-12-09 3 32 34 92
df3:
Date ID Number ID2 info_df3
2021-12-10 2 18 20 57
2021-12-10 2 18 20 63
2021-12-11 1 34 36 52
2021-12-10 2 33 35 33
I need a dataframe with the info columns from df1, df2 and df3, and Date, ID, Number, ID2 as the index.
The merged dataframe should have these columns:
Date ID Number ID2 info_df1 info_df2 info_df3

If you are trying to merge the dataframes based on Date, I think what you need is the merge function:
mergedDf = df1.merge(df2, on="Date").merge(df3, on="Date")
mergedDf.set_index("ID2", inplace=True)
But if you are trying to merge the dataframes based on multiple columns, you can pass a list of column names to the on argument:
mergedDf = df1.merge(df2, on=["Date", "ID", "ID2"]).merge(df3, on=["Date", "ID", "ID2"])
mergedDf.set_index("ID2", inplace=True)

Two steps:
first, pandas.concat(<DFs-list>) all those DFs into a df;
then, define a multi-index with df.set_index(<col-names-list>).
That will do it. Sure, you have to read some docs (here below), but those two steps should be about it.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.set_levels.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_frame.html
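Putting those two steps together, a minimal sketch (assuming df1, df2 and df3 are the frames from the question; note that concat stacks the rows, so matching keys stay on separate lines until you aggregate or unstack):
import pandas as pd

# assuming df1, df2, df3 are the frames from the question
combined = pd.concat([df1, df2, df3])                            # one frame, all rows, union of the info_df* columns
combined = combined.set_index(["Date", "ID", "Number", "ID2"])   # MultiIndex on the key columns
print(combined)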

As others have mentioned, you need to merge the dataframes together. Using the built-in function functools.reduce, we can do this easily and dynamically (for any number of dataframes):
import functools as ft

def func(acc, nxt):
    # outer-merge the next frame with the accumulated result on the shared key columns;
    # each frame's info_df* column keeps its own name
    return nxt.merge(acc, on=['Date', 'ID', 'Number', 'ID2'], how='outer')

dfs = [df1, df2, df3]
new_df = ft.reduce(func, dfs)
Output:
>>> new_df
Date ID Number ID2 info_df3 info_df2 info_df1
0 2021-12-10 2 18 20 57.0 50.0 NaN
1 2021-12-10 2 18 20 63.0 50.0 NaN
2 2021-12-11 1 34 36 52.0 89.0 60.0
3 2021-12-10 2 33 35 33.0 40.0 57.0
4 2021-12-09 3 32 34 NaN 92.0 80.0
5 2021-12-08 4 31 33 NaN NaN 55.0


Dividing one dataframe by another in python using pandas with float values

I have two separate data frames named df1 and df2, as shown below:
df1:
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 51 58 0.879310
1 1 16 20 95 115 0.826087
2 2 9 9 33 42 0.785714
3 2 12 86 51 137 0.372263
4 2 67 41 98 139 0.705036
5 3 8 0 0 0 0.000000
6 4 99 32 26 58 0.448276
7 4 101 100 24 124 0.193548
8 4 115 69 26 95 0.273684
9 5 6 40 57 97 0.587629
10 5 19 53 87 140 0.621429
df2:
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 64 71 0.901408
1 1 16 10 90 100 0.900000
2 2 9 79 86 165 0.521212
3 2 12 12 73 85 0.858824
4 2 67 54 96 150 0.640000
5 3 8 0 0 0 0.000000
6 4 99 86 28 114 0.245614
7 4 101 32 25 57 0.438596
8 4 115 97 16 113 0.141593
9 5 6 86 43 129 0.333333
10 5 19 59 27 86 0.313953
I have already found the sum values of df1 and df2's Alt_Allele_Count and Coverage_Depth columns, but I need to divide the resulting Alt_Allele_Count by the resulting Coverage_Depth to find the total allele frequency (AF). I tried dividing the two variables and got the error message:
TypeError: float() argument must be a string or a number, not 'DataFrame'
when I tried to convert them to floats, and this table when I left it as a df:
Alt_Allele_Count Coverage_Depth
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
My code so far:
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']])
print(Alt_Allele_Count)
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
The error stems from the difference between a pandas Series and a DataFrame. Series are 1-dimensional structures like a single column, while DataFrames are 2-dimensional objects like tables. Series added together make a new Series of values, but arithmetic between DataFrames aligns on column labels, so two single-column DataFrames with different column names (like Alt_Allele_Count and Coverage_Depth) produce a frame full of NaN, which is the table you saw.
Taking slices of a dataframe can either result in a series or dataframe object depending on how you do it:
df['column_name'] -> Series
df[['column_name', 'column_2']] -> DataFrame
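A quick way to see the difference (using a throwaway frame; the column names here are just placeholders):
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(type(demo["a"]))    # <class 'pandas.core.series.Series'>
print(type(demo[["a"]]))  # <class 'pandas.core.frame.DataFrame'>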
So in the line:
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
df1[['Ref_Allele_Count']] becomes a singular column dataframe rather than a series.
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
Should return the correct result here. Same goes for the rest of the columns you're adding together.
This can be fixed by using one set of brackets '[]' when referring to a column in a pandas df, rather than two.
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
# note that I changed your double brackets ([["col_name"]]) to single (["col_name"])
# this results in pd.Series objects instead of pd.DataFrame objects
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1['Alt_Allele_Count'] + df2['Alt_Allele_Count'])
print(Alt_Allele_Count)
Coverage_Depth = (df1['Coverage_Depth'] + df2['Coverage_Depth']).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)

How to Count/Sum Values Based on Multiple Conditions in Multiple Columns

I have a shipping records table with approx. 100K rows, and
I want to calculate, for each row and each material, how many qtys were shipped in the last 30 days.
As you can see in the example below, the calculated qty depends on the material and the shipping date.
I've tried to write very basic code and couldn't find a way to apply it to all rows.
df[(df['malzeme']==material) & (df['cikistarihi'] < shippingDate) & (df['cikistarihi'] >= (shippingDate-30))]['qty'].sum()
material  shippingDate  qty  shipped qtys in last 30 days
A         23.01.2019      8  0
A         28.01.2019     41  8
A         31.01.2019     66  49 (8+41)
A         20.03.2019     67  0
B         17.02.2019     53  0
B         26.02.2019     35  53
B         11.03.2019      4  88 (53+35)
B         20.03.2019     67  106 (35+4+67)
You can use .groupby with .rolling:
# convert the shippingDate to datetime:
df["shippingDate"] = pd.to_datetime(df["shippingDate"], dayfirst=True)
# sort the values (if they aren't already)
df = df.sort_values(["material", "shippingDate"])
df["shipped qtys in last 30 days"] = (
    df.groupby("material")
    .rolling("30D", on="shippingDate", closed="left")["qty"]
    .sum()
    .fillna(0)
    .values
)
print(df)
Prints:
material shippingDate qty shipped qtys in last 30 days
0 A 2019-01-23 8 0.0
1 A 2019-01-28 41 8.0
2 A 2019-01-31 66 49.0
3 A 2019-03-20 67 0.0
4 B 2019-02-17 53 0.0
5 B 2019-02-26 35 53.0
6 B 2019-03-11 4 88.0
7 B 2019-03-20 67 39.0
EDIT: Add .sort_values() before groupby

Join two Pandas Dataframes keeping specific string in specific column

I have two Dataframes that I'm trying to merge/join/concatenate (I'm not sure the right term).
I don't care about the index. ID is the unique identifier for each row.
There are a LOT of columns of data (simplified here to A-D) but for each unique ID, the columns of data will be the same between Dataframes EXCEPT for the final QC columns.
I want to join the two Dataframes such that, when there are duplicate entries (as determined by a duplicate in the ID column), either instance of most of the columns is kept (first or last), but the value kept for QC_1 and QC_2 is the one where there is actually a value (in this case I've used the string 'Fail', but I could switch to a Bool and keep True if that makes this any easier).
I've tried iterations of .merge and .join. The closest I've gotten is with .concat or .append but then I can't figure out how to have the duplicated rows combined into one. Essentially, I'm at a loss.
df1
Index ID A B C D QC_1
3 13 10 15 17 100 Fail
4 17 20 25 27 110 Fail
7 42 30 35 37 120 Fail
12 115 40 45 47 130 Fail
df2
Index ID A B C D QC_2
2 6 11 16 18 101 Fail
4 17 20 25 27 110 Fail
7 42 30 35 37 120 Fail
13 152 41 46 48 131 Fail
goal
Index ID A B C D QC_1 QC_2
3 13 10 15 17 100 Fail NaN
4 17 20 25 27 110 Fail Fail
7 42 30 35 37 120 Fail Fail
12 115 40 45 47 130 Fail NaN
2 6 11 16 18 101 NaN Fail
13 152 41 46 48 131 NaN Fail
Use combine_first:
print (df1.set_index("ID").combine_first(df2.set_index("ID")).reset_index())
ID A B C D Index QC_1 QC_2
0 6 11.0 16.0 18.0 101.0 2.0 NaN Fail
1 13 10.0 15.0 17.0 100.0 3.0 Fail NaN
2 17 20.0 25.0 27.0 110.0 4.0 Fail Fail
3 42 30.0 35.0 37.0 120.0 7.0 Fail Fail
4 115 40.0 45.0 47.0 130.0 12.0 Fail NaN
5 152 41.0 46.0 48.0 131.0 13.0 NaN Fail
This does the job.
import pandas as pd
df1 = pd.DataFrame({'ID': [13, 17, 42, 115],
                    'A': [10, 20, 30, 40],
                    'B': [15, 25, 35, 45],
                    'C': list(range(17, 48, 10)),
                    'D': list(range(100, 131, 10)),
                    'QC_1': 4 * ['Fail']})
df2 = pd.DataFrame({'ID': [6, 17, 42, 152],
                    'A': [11, 20, 30, 41],
                    'B': [16, 25, 35, 46],
                    'C': [18, 27, 37, 48],
                    'D': [101, 110, 120, 131],
                    'QC_2': 4 * ['Fail']})
result = df1.merge(
    df2,
    how='outer',
    left_on=['ID', 'A', 'B', 'C', 'D'],
    right_on=['ID', 'A', 'B', 'C', 'D']
)
Try this:
cols = list(df1.columns[:-1])
pd.merge(df1, df2, on=cols, how='outer')
Here I assume that you want to compare using every column of df1 except for the last (QC_1). Adapt the cols variable to your need.

Cumulative sum sorted descending within a group. Pandas

I've faced a problem while applying sort_values() and cumsum() within a group.
I have a dataset (constructed in the code below).
Basically, I need to sort values within a group, get the cumulative sales, and then select the rows that make up 90% of sales within each region.
I have tried the following, but the last line doesn't work. It returns an error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method.
I've tried apply also.
import pandas as pd
df = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8',
                          'id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8'],
                   'region': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
                   'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54]})
df['%'] = df['sales'] / df.groupby(df['region'])['sales'].transform('sum')
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()
Thank you for any suggestions
You can definitely sort the dataframe first, then do groupby():
df.sort_values(['region','sales'], ascending=[True,False],inplace=True)
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cummul'] = df.groupby('region')['%'].cumsum()
# filter
df[df['cummul'].le(0.9)]
Output:
id region sales % cummul
5 id_6 1 98 0.216336 0.216336
4 id_5 1 78 0.172185 0.388521
6 id_7 1 76 0.167770 0.556291
3 id_4 1 56 0.123620 0.679912
0 id_1 1 54 0.119205 0.799117
1 id_2 1 34 0.075055 0.874172
9 id_2 2 89 0.204598 0.204598
10 id_3 2 76 0.174713 0.379310
14 id_7 2 56 0.128736 0.508046
11 id_4 2 54 0.124138 0.632184
15 id_8 2 54 0.124138 0.756322
13 id_6 2 45 0.103448 0.859770
First we use your logic to create the % column, but we multiply by 100 and round to whole numbers.
Then we sort by region and %, no need for groupby.
After we sort, we create the cumul column.
And finally we select those within the 90% range with query:
df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()
df.query('cumul.le(90)')
Output:
id region sales % cumul
5 id_6 1 98 22.0 22.0
4 id_5 1 78 17.0 39.0
6 id_7 1 76 17.0 56.0
0 id_1 1 54 12.0 68.0
3 id_4 1 56 12.0 80.0
1 id_2 1 34 8.0 88.0
9 id_2 2 89 20.0 20.0
10 id_3 2 76 17.0 37.0
14 id_7 2 56 13.0 50.0
11 id_4 2 54 12.0 62.0
15 id_8 2 54 12.0 74.0
13 id_6 2 45 10.0 84.0
If you just need the sales data without the percentage, this can easily be done with method chaining:
(
    df
    .sort_values(by='sales', ascending=False)
    .groupby('region')
    .apply(lambda x: x[x.sales > x.sales.quantile(.1)])
    .reset_index(level=0, drop=True)
)
Output
id region sales
5 id_6 1 98
4 id_5 1 78
6 id_7 1 76
3 id_4 1 56
0 id_1 1 54
1 id_2 1 34
7 id_8 1 34
9 id_2 2 89
10 id_3 2 76
14 id_7 2 56
11 id_4 2 54
15 id_8 2 54
13 id_6 2 45
12 id_5 2 34
This works because keeping all values above the 10% quantile is essentially the same as keeping the top 90%.

Create new column filled with random elements based on a categorical column

I have a pandas dataframe that looks like this:
ID Cat
87 A
56 A
67 A
76 D
36 D
Column ID has unique integers, while Cat contains categorical variables.
Now I would like to add two new columns with conditions about Cat.
The desirable result should look like this:
ID Cat New1 New2
87 A 67 36
56 A 67 76
67 A 56 36
76 D 36 56
36 D 76 67
Column New1: for each row, pick a random ID with the SAME category as the current row's ID, with replacement. The randomly picked ID should not be the same as the current row's ID.
Column New2: for each row, pick a random ID with a DIFFERENT category than the current row's ID, with replacement.
How can I do this efficiently?
I tried to find a vectorized solution but was unable to. This solution iterates through the index and calculates new values for New1 and New2.
It will achieve the result I believe you are looking for.
for i in df.index:
    # Grab the category variable for each row.
    cat = df.loc[i, 'Cat']
    # Set column New1
    mask1 = df['Cat'] == cat
    mask2 = df.index != i
    df.at[i, 'New1'] = df[mask1 & mask2]["ID"].sample().iloc[0]
    # Set column New2
    mask3 = df['Cat'] != cat
    df.at[i, 'New2'] = df[mask3]["ID"].sample().iloc[0]
print(df), first run:
ID Cat New1 New2
0 87 A 56.0 76.0
1 56 A 87.0 36.0
2 67 A 56.0 76.0
3 76 D 36.0 87.0
4 36 D 76.0 87.0
print(df), second run:
ID Cat New1 New2
0 87 A 67.0 36.0
1 56 A 87.0 36.0
2 67 A 87.0 76.0
3 76 D 36.0 67.0
4 36 D 76.0 67.0
You can see from these results that you are getting random picks through the use of sample().
My previous answer did not correctly generate the column "New1". Understanding that a valid solution has been posted and accepted, I am posting this to offer an alternative.
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'ID': (87, 56, 67, 76, 36), 'CAT': ('A', 'A', 'A', 'D', 'D')})
df['New1'] = [np.random.choice(df[(df['CAT'] == cat) & (df['ID'] != iden)]['ID']) for cat, iden in zip(df['CAT'], df['ID'])]
df['New2'] = [np.random.choice(df[df['CAT'] != cat]['ID']) for cat in df['CAT']]
In [11]: df
Out[11]:
CAT ID New1 New2
0 A 87 67 76
1 A 56 67 76
2 A 67 56 36
3 D 76 36 87
4 D 36 76 67
