I have two DataFrames that I'm trying to merge/join/concatenate (I'm not sure of the right term).
I don't care about the index. ID is the unique identifier for each row.
There are a LOT of columns of data (simplified here to A-D), but for each unique ID the columns of data will be the same between DataFrames EXCEPT for the final QC columns.
I want to join the two DataFrames so that when there are duplicate entries (as determined by a duplicate in the ID column), one instance of the row is kept (first or last, it doesn't matter), but the values kept for QC_1 and QC_2 are the ones that actually hold a value (in this case I've used the string 'Fail', but I could switch to a bool and keep True if that makes this any easier).
I've tried iterations of .merge and .join. The closest I've gotten is with .concat or .append, but then I can't figure out how to combine the duplicated rows into one. Essentially, I'm at a loss.
df1
Index ID A B C D QC_1
3 13 10 15 17 100 Fail
4 17 20 25 27 110 Fail
7 42 30 35 37 120 Fail
12 115 40 45 47 130 Fail
df2
Index ID A B C D QC_2
2 6 11 16 18 101 Fail
4 17 20 25 27 110 Fail
7 42 30 35 37 120 Fail
13 152 41 46 48 131 Fail
goal
Index ID A B C D QC_1 QC_2
3 13 10 15 17 100 Fail NaN
4 17 20 25 27 110 Fail Fail
7 42 30 35 37 120 Fail Fail
12 115 40 45 47 130 Fail NaN
2 6 11 16 18 101 NaN Fail
13 152 41 46 48 131 NaN Fail
Use combine_first:
print (df1.set_index("ID").combine_first(df2.set_index("ID")).reset_index())
ID A B C D Index QC_1 QC_2
0 6 11.0 16.0 18.0 101.0 2.0 NaN Fail
1 13 10.0 15.0 17.0 100.0 3.0 Fail NaN
2 17 20.0 25.0 27.0 110.0 4.0 Fail Fail
3 42 30.0 35.0 37.0 120.0 7.0 Fail Fail
4 115 40.0 45.0 47.0 130.0 12.0 Fail NaN
5 152 41.0 46.0 48.0 131.0 13.0 NaN Fail
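Since the question says the index doesn't matter, the leftover Index column can be dropped afterwards (a minimal sketch, assuming that helper column is literally named Index):
result = (df1.set_index("ID")
             .combine_first(df2.set_index("ID"))
             .reset_index()
             .drop(columns="Index"))   # the old positional index isn't needed
print(result)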
This does the job.
import pandas as pd

df1 = pd.DataFrame({'ID': [13, 17, 42, 115],
                    'A': [10, 20, 30, 40],
                    'B': [15, 25, 35, 45],
                    'C': list(range(17, 48, 10)),
                    'D': list(range(100, 131, 10)),
                    'QC_1': 4 * ['Fail']})

df2 = pd.DataFrame({'ID': [6, 17, 42, 152],
                    'A': [11, 20, 30, 41],
                    'B': [16, 25, 35, 46],
                    'C': [18, 27, 37, 48],
                    'D': [101, 110, 120, 131],
                    'QC_2': 4 * ['Fail']})

result = df1.merge(
    df2,
    how='outer',
    left_on=['ID', 'A', 'B', 'C', 'D'],
    right_on=['ID', 'A', 'B', 'C', 'D'],
)
Try this:
cols = list(df1.columns[:-1])
pd.merge(df1, df2, on=cols, how='outer')
Here I assume you want to compare on every column of df1 except the last (QC_1). Adapt the cols variable to your needs.
Related
I have traffic data that looks like this. Each column holds data in the format meters:seconds; for example, in row 1, column 2, 57:9 represents 57 meters and 9 seconds.
       0        1        2        3        4        5        6        7        8        9
     0:0     57:9   166:34   178:37   203:44   328:63   344:65   436:77  737:108     None
     0:0   166:34   178:37   203:43   328:61   436:74   596:51  737:106     None     None
     0:0     57:6   166:30   178:33   203:40   328:62   344:64   436:74   596:91     None
     0:0   203:43   328:61     None     None     None     None     None     None     None
     0:0     57:7   166:20   178:43   203:10   328:61     None     None     None     None
I want to extract the meters values from the dataframe and store them in a list in ascending order. Then I want to create a new dataframe whose column headers are those meters values; for each row of the parent dataframe, the seconds value is placed under the matching meters column. Missing meters:seconds pairs become NaN, so the remaining values shift into their correct columns within the same row.
The desired outcome is:
list = [0,57,166,178,203,328,344,436,596,737]
dataframe:
    0    57   166   178   203   328   344   436   596   737
    0     9    34    37    44    63    65    77   NaN   108
    0   NaN    34    37    43    61   NaN    74    51   106
    0     6    30    33    40    62    64    74    91  None
    0   NaN   NaN   NaN    43    61  None  None  None  None
    0     7    20    43    10    61  None  None  None  None
I know I must use a loop to iterate over the whole dataframe. I am new to Python, so I haven't been able to solve this. I tried using str.split(), but it works only on one column, and I have 98 columns and 290 rows. This is just one month of data; I will eventually have 12 months of data, so I need suggestions and help.
Try:
tmp = df1.apply(
    lambda x: dict(
        map(int, val.split(":"))                # each "meters:seconds" string becomes a (meters, seconds) pair
        for val in x
        if isinstance(val, str) and ":" in val  # skip None and anything that isn't a meters:seconds string
    ),
    axis=1,
).to_list()

out = pd.DataFrame(tmp)            # dict keys (meters) become columns, values (seconds) become cells
print(out[sorted(out.columns)])    # show the columns in ascending meters order
Prints:
0 57 166 178 203 328 344 436 596 737
0 0 9.0 34.0 37.0 44 63 65.0 77.0 NaN 108.0
1 0 NaN 34.0 37.0 43 61 NaN 74.0 51.0 106.0
2 0 6.0 30.0 33.0 40 62 64.0 74.0 91.0 NaN
3 0 NaN NaN NaN 43 61 NaN NaN NaN NaN
4 0 7.0 20.0 43.0 10 61 NaN NaN NaN NaN
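The sorted list of meters values the question also asks for then just falls out of the column labels (a small follow-up sketch):
meters = sorted(out.columns)   # [0, 57, 166, 178, 203, 328, 344, 436, 596, 737]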
I have three dataframes
df1 :
Date ID Number ID2 info_df1
2021-12-11 1 34 36 60
2021-12-10 2 33 35 57
2021-12-09 3 32 34 80
2021-12-08 4 31 33 55
df2:
Date ID Number ID2 info_df2
2021-12-10 2 18 20 50
2021-12-11 1 34 36 89
2021-12-10 2 33 35 40
2021-12-09 3 32 34 92
df3:
Date ID Number ID2 info_df3
2021-12-10 2 18 20 57
2021-12-10 2 18 20 63
2021-12-11 1 34 36 52
2021-12-10 2 33 35 33
I need a dataframe with the info columns from df1, df2 and df3, and with Date, ID, Number, ID2 as the index.
The merged dataframe should contain these columns:
Date ID Number ID2 info_df1 info_df2 info_df3
If you are trying to merge the dataframes based on Date, I think what you need is the merge function:
mergedDf = df1.merge(df2, on="Date").merge(df3, on="Date")
mergedDf.set_index("ID2", inplace=True)
But if you are trying to merge dataframes based on multiple columns, you can use a list of column names on the on argument:
mergedDf = df1.merge(df2, on=["Date", "ID", "ID2"]).merge(df3, on=["Date", "ID", "ID2"])
mergedDf.set_index("ID2", inplace=True)
Two steps:
first, pandas.concat(<DFs-list>) all those DFs into a df;
then, define a multi-index with df.set_index(<col-names-list>).
That will do it (see the sketch after the links below). Sure, you have to read some docs (linked here), but those two steps should be about it.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.set_levels.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_frame.html
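A minimal sketch of those two steps, assuming the three frames all share the Date, ID, Number and ID2 columns:
import pandas as pd

combined = pd.concat([df1, df2, df3])                              # step 1: stack the frames
combined = combined.set_index(['Date', 'ID', 'Number', 'ID2'])     # step 2: build the multi-index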
As others have mentioned, you need to merge the dataframes together. Using the built-in function functools.reduce, we can do this dynamically (for any number of dataframes) and easily:
import functools as ft

i = 0
def func(x, y):
    global i
    i += 1
    return y.merge(x.rename({'info': f'info_df{i + 1}'}, axis=1),
                   on=['Date', 'ID', 'Number', 'ID2'], how='outer')

dfs = [df1, df2, df3]
new_df = ft.reduce(func, dfs).rename({'info': 'info_df1'}, axis=1)
Output:
>>> new_df
Date ID Number ID2 info_df1 info_df7 info_df6
0 2021-12-10 2 18 20 57.0 50.0 NaN
1 2021-12-10 2 18 20 63.0 50.0 NaN
2 2021-12-11 1 34 36 52.0 89.0 60.0
3 2021-12-10 2 33 35 33.0 40.0 57.0
4 2021-12-09 3 32 34 NaN 92.0 80.0
5 2021-12-08 4 31 33 NaN NaN 55.0
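Since the frames in the question already carry distinct column names (info_df1, info_df2, info_df3), a plain reduce over merge works without any renaming; a minimal sketch under that assumption:
import functools as ft

dfs = [df1, df2, df3]
merged = ft.reduce(
    lambda left, right: left.merge(right, on=['Date', 'ID', 'Number', 'ID2'], how='outer'),
    dfs,
)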
I've faced a problem while applying sort_values() and cumsum() within a group.
I have a dataset:
Basically, I need to sort values within each group, get cumulative sales, and then select the rows that make up 90% of sales within each region.
I have tried the following, but the last line doesn't work. It returns the error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method.
I've also tried apply.
import pandas as pd

df = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8',
                          'id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8'],
                   'region': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
                   'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54]})

df['%'] = df['sales'] / df.groupby(df['region'])['sales'].transform('sum')
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()
Thank you for any suggestions
You can definitely sort the dataframe first, then do groupby():
df.sort_values(['region','sales'], ascending=[True,False],inplace=True)
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cummul'] = df.groupby('region')['%'].cumsum()
# filter
df[df['cummul'].le(0.9)]
Output:
id region sales % cummul
5 id_6 1 98 0.216336 0.216336
4 id_5 1 78 0.172185 0.388521
6 id_7 1 76 0.167770 0.556291
3 id_4 1 56 0.123620 0.679912
0 id_1 1 54 0.119205 0.799117
1 id_2 1 34 0.075055 0.874172
9 id_2 2 89 0.204598 0.204598
10 id_3 2 76 0.174713 0.379310
14 id_7 2 56 0.128736 0.508046
11 id_4 2 54 0.124138 0.632184
15 id_8 2 54 0.124138 0.756322
13 id_6 2 45 0.103448 0.859770
First we use your logic to create the % column, but we multiply by 100 and round to whole numbers.
Then we sort by region and %, no need for groupby.
After we sort, we create the cumul column.
And finally we select those within the 90% range with query:
df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()
df.query('cumul.le(90)')
output
id region sales % cumul
5 id_6 1 98 22.0 22.0
4 id_5 1 78 17.0 39.0
6 id_7 1 76 17.0 56.0
0 id_1 1 54 12.0 68.0
3 id_4 1 56 12.0 80.0
1 id_2 1 34 8.0 88.0
9 id_2 2 89 20.0 20.0
10 id_3 2 76 17.0 37.0
14 id_7 2 56 13.0 50.0
11 id_4 2 54 12.0 62.0
15 id_8 2 54 12.0 74.0
13 id_6 2 45 10.0 84.0
If you just need the sales data without the percentage, this can easily be done with method chaining:
(
    df
    .sort_values(by='sales', ascending=False)
    .groupby('region')
    .apply(lambda x: x[x.sales > x.sales.quantile(.1)])
    .reset_index(level=0, drop=True)
)
Output
id region sales
5 id_6 1 98
4 id_5 1 78
6 id_7 1 76
3 id_4 1 56
0 id_1 1 54
1 id_2 1 34
7 id_8 1 34
9 id_2 2 89
10 id_3 2 76
14 id_7 2 56
11 id_4 2 54
15 id_8 2 54
13 id_6 2 45
12 id_5 2 34
This works because keeping the values above the 10th percentile is essentially the same as keeping the top 90%.
The Scenario:
I have 2 dataframes, fc0 and yc0, where fc0 is a cluster and yc0 is another dataframe that needs to be merged into fc0.
The Nature of data is as follows:
fc0
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
yc0
iid uid 1 2 5 6 9 15
0 944 5.0 3.0 4.0 3.0 3.0 5.0
The Twist
I have 1682 columns in fc0 and a few hundred values in yc0. Now I need yc0 to go into fc0.
In my haste to resolve it, I even tried yc0.reset_index(inplace=True), but it wasn't really helpful.
Expected Output
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
944 5.0 3.0 NaN NaN 4.0 3.0 3.0
References
Link1 Tried this, but it ended up inserting NaN values for the first 16 columns and shifting the rest of the data by that many columns.
Link2 Couldn't match column keys; besides, I tried it by row.
Link3 Merging doesn't match the columns in it.
Link4 Concatenation doesn't work that way.
Link5 Same issues with join.
EDIT 1
fc0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 234 to 468
Columns: 1683 entries, uid to 1682
dtypes: float64(1682), int64(1)
memory usage: 3.0 MB
and
yc0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 336 entries, uid to 1007
dtypes: float64(335), int64(1)
memory usage: 2.7 KB
Here's an MCVE. Does this small sample data show the functionality that you are expecting?
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, (5, 4)), columns=list('ABCE'))
A B C E
0 81 57 54 88
1 63 63 74 10
2 13 89 88 66
3 90 81 3 31
4 66 93 55 4
df2 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('BCDE'))
B C D E
0 93 48 62 25
1 24 97 52 88
2 53 50 21 13
3 81 27 7 81
4 10 21 77 19
df_out = pd.concat([df1,df2])
print(df_out)
Output:
A B C D E
0 81.0 57 54 NaN 88
1 63.0 63 74 NaN 10
2 13.0 89 88 NaN 66
3 90.0 81 3 NaN 31
4 66.0 93 55 NaN 4
0 NaN 93 48 62.0 25
1 NaN 24 97 52.0 88
2 NaN 53 50 21.0 13
3 NaN 81 27 7.0 81
4 NaN 10 21 77.0 19
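Applied to the frames in the question, the same idea would look roughly like this (a sketch; it assumes uid is a regular column in both fc0 and yc0, so the rows line up purely on matching column labels):
fc0_plus = pd.concat([fc0, yc0], ignore_index=True, sort=False)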
I am trying to fill NaN values in a dataframe with values coming from a standard normal distribution.
This is currently my code:
sqlStatement = "select * from sn.clustering_normalized_dataset"
df = psql.frame_query(sqlStatement, cnx)
data=df.pivot("user","phrase","tfw")
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
data[np.isnan(data)] = dfrand[np.isnan(data)]
After pivoting the dataframe 'data' it looks like that:
phrase aaron abbas abdul abe able abroad abu abuse \
user
14233664 NaN NaN NaN NaN NaN NaN NaN NaN
52602716 NaN NaN NaN NaN NaN NaN NaN NaN
123456789 NaN NaN NaN NaN NaN NaN NaN NaN
500158258 NaN NaN NaN NaN NaN NaN NaN NaN
517187571 0.4 NaN NaN 0.142857 1 0.4 0.181818 NaN
However, I need each NaN value to be replaced with a new random value. So I created a new df consisting of only random values (dfrand) and then tried to swap the missing numbers (NaN) with the values from dfrand at the corresponding indices of the NaNs. Unfortunately, it doesn't work:
Although the expression
np.isnan(data)
returns a dataframe of True and False values, the expression
dfrand[np.isnan(data)]
returns only NaN values, so the overall trick doesn't work.
Any ideas what the issue is?
Three thousand columns is not so many. How many rows do you have? You could always make a random dataframe of the same size and do a logical replacement (the size of your dataframe will dictate whether this is feasible or not).
if you know the size of your dataframe:
import pandas as pd
import numpy as np
# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(rows,cols))
# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in
# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
if you do not know the size of your dataframe, just shuffle things around
import pandas as pd
import numpy as np
# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in
# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
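One thing to watch out for: dfrand built this way has default integer row and column labels, so it only lines up with data if data also uses the default labels. A more label-safe variant (a sketch) builds the random frame with matching index and columns and fills with fillna:
dfrand = pd.DataFrame(np.random.randn(*data.shape),
                      index=data.index, columns=data.columns)
data = data.fillna(dfrand)   # fill each NaN with the value at the same row/column label in dfrand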
EDIT
Per "users" last comment:
"dfrand[np.isnan(data)] returns NaN only."
Right! And that is exactly what you wanted. In my solution I have: data[np.isnan(data)] = dfrand[np.isnan(data)]. Translated, this means: take the randomly-generated value from dfrand that corresponds to the NaN-location within "data" and insert it in "data" where "data" is NaN. An example will help:
a = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
a[0][5] = np.nan
In [32]: a
Out[33]:
0 1 2
0 2 26 28
1 14 79 82
2 89 32 59
3 65 47 31
4 29 59 15
5 NaN 58 90
6 15 66 60
7 10 19 96
8 90 26 92
9 0 19 23
# define randomly-generated dataframe, much like what you are doing, and replace NaN's
b = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
In [39]: b
Out[39]:
0 1 2
0 92 21 55
1 65 53 89
2 54 98 97
3 48 87 79
4 98 38 62
5 46 16 30
6 95 39 70
7 90 59 9
8 14 85 37
9 48 29 46
a[np.isnan(a)] = b[np.isnan(a)]
In [38]: a
Out[38]:
0 1 2
0 2 26 28
1 14 79 82
2 89 32 59
3 65 47 31
4 29 59 15
5 46 58 90
6 15 66 60
7 10 19 96
8 90 26 92
9 0 19 23
As you can see, all NaN's in a have been replaced with the randomly-generated values in b, based on a's NaN-value indices.
You could try something like this, assuming you are dealing with one series:
ser = data['column_with_nulls_to_replace']
index = ser[ser.isnull()].index
df = pd.DataFrame(np.random.randn(len(index)), index=index, columns=['column_with_nulls_to_replace'])
ser.update(df['column_with_nulls_to_replace'])
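Note that update modifies ser in place; if the filled values need to flow back into the original frame, assign the column back afterwards (a small follow-up sketch):
data['column_with_nulls_to_replace'] = ser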