Find commonalities between dataframes - python

I want to compare two dataframes with different dimensions:
df = pd.DataFrame({'Age': ['20', '14', '56', '28'],
                   'Weight': [59, 29, 73, 56],
                   'Height': [185, 160, 175, 180]})
df1 = pd.DataFrame({'Age': ['20', '14', '56', '12', '10', '30', '28'],
                    'Weight': [59, 29, 73, 56, 68, 48, 50],
                    'Height': [185, 155, 170, 160, 155, 177, 172]})
I want to find the shared data points between df and df1 (and vice versa) and create separate dataframes, so I'm looking for output like this:
df_result = pd.DataFrame({'Age': ['20', '14', '56', '28'],
                          'Weight': [59, 29, 73, 56],
                          'Height': [185, 160, 175, 180]})
df1_result = pd.DataFrame({'Age': ['20', '14', '56', '28'],
                           'Weight': [59, 29, 73, 56],
                           'Height': [185, 160, 170, 172]})
I'm trying to use Age and Weight as the criteria to look for shared data points.

Use an inner join, then rename the suffixed columns:
merged = df.merge(df1, on=['Age','Weight'])
df_result = merged.rename(columns={'Height_x':'Height'})[df.columns]
print (df_result)
Age Weight Height
0 20 59 185
1 14 29 160
2 56 73 175
df1_result = merged.rename(columns={'Height_y':'Height'})[df.columns]
print (df1_result)
Age Weight Height
0 20 59 185
1 14 29 155
2 56 73 170
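As a side note (a sketch, not part of the original answer), merge's suffixes argument lets you name the overlapping columns up front instead of relying on the default _x/_y:
merged = df.merge(df1, on=['Age', 'Weight'], suffixes=('_left', '_right'))
df_result = merged.rename(columns={'Height_left': 'Height'})[df.columns]
df1_result = merged.rename(columns={'Height_right': 'Height'})[df.columns]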
Or use boolean indexing with Index.isin over both columns:
ind1 = df.set_index(['Age','Weight']).index
ind2 = df1.set_index(['Age','Weight']).index
df_result = df[ind1.isin(ind2)]
print (df_result)
Age Weight Height
0 20 59 185
1 14 29 160
2 56 73 175
df1_result = df1[ind2.isin(ind1)]
print (df1_result)
Age Weight Height
0 20 59 185
1 14 29 155
2 56 73 170
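The isin filter also generalizes into a small helper; a minimal sketch (the shared_rows name and keys parameter are illustrative, not from the answer):
def shared_rows(a, b, keys):
    # keep the rows of `a` whose key combination also appears in `b`
    return a[a.set_index(keys).index.isin(b.set_index(keys).index)]
df_result = shared_rows(df, df1, ['Age', 'Weight'])
df1_result = shared_rows(df1, df, ['Age', 'Weight'])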

Related

Pandas fillna multiple columns with values from corresponding columns without repeating for each

Let's say I have a DataFrame like this:
x = pd.DataFrame({'col1_x': [15, np.nan, 136, 93, 743, np.nan, np.nan, 91],
                  'col2_x': [np.nan, np.nan, 51, 22, 38, np.nan, 72, np.nan],
                  'col1_y': [10, 20, 30, 40, 50, 60, 70, 80],
                  'col2_y': [93, 24, 52, 246, 142, 53, 94, 2]})
And I want to fill the NaN values in each col_x with the values from the corresponding col_y. I can do this:
x['col1_x'] = x['col1_x'].fillna(x['col1_y'])
x['col2_x'] = x['col2_x'].fillna(x['col2_y'])
print(x)
Which will yield:
col1_x col2_x col1_y col2_y
0 15.0 93.0 10 93
1 20.0 24.0 20 24
2 136.0 51.0 30 52
3 93.0 22.0 40 246
4 743.0 38.0 50 142
5 60.0 53.0 60 53
6 70.0 72.0 70 94
7 91.0 2.0 80 2
But this requires repeating the same call with different column names. Now assume a bigger DataFrame with many more columns: is it possible to do this without the repetition?
You can use **kwargs with assign():
build up a dict with a comprehension and unpack it as **kwargs.
import pandas as pd
import numpy as np
x = pd.DataFrame({'col1_x': [15, np.nan, 136, 93, 743, np.nan, np.nan, 91],
                  'col2_x': [np.nan, np.nan, 51, 22, 38, np.nan, 72, np.nan],
                  'col1_y': [10, 20, 30, 40, 50, 60, 70, 80],
                  'col2_y': [93, 24, 52, 246, 142, 53, 94, 2]})
x.assign(**{c:x[c].fillna(x[c.replace("_x","_y")]) for c in x.columns if "_x" in c})
   col1_x  col2_x  col1_y  col2_y
0      15      93      10      93
1      20      24      20      24
2     136      51      30      52
3      93      22      40     246
4     743      38      50     142
5      60      53      60      53
6      70      72      70      94
7      91       2      80       2
How does it work?
# core - loop through the columns that contain "_x" and generate each one's "_y" pair
{c: c.replace("_x", "_y")
 for c in x.columns if "_x" in c}
# now that we have all the column pairs, do what we want - fillna()
{c: x[c].fillna(x[c.replace("_x", "_y")]) for c in x.columns if "_x" in c}
# this dictionary matches the signature of https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
# so the final step is to call the function with **kwargs
x.assign(**{c: x[c].fillna(x[c.replace("_x", "_y")])
            for c in x.columns if "_x" in c})
You can use the following notation:
x.fillna({"col1_x": x["col1_y"], "col2_x": x["col2_y"]})
Assuming you can extract all the index numbers, you can do:
replace_dict = {f"col{item}_x": x[f"col{item}_y"] for item in indices}
x = x.fillna(replace_dict)
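If the suffix pairing is consistent, the replacement dict can also be built straight from the column names, with no index numbers needed; a minimal sketch, assuming every _x column has a matching _y partner:
pairs = {c: x[c.replace('_x', '_y')] for c in x.columns if c.endswith('_x')}
x = x.fillna(pairs)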
Are you trying to make this type of function?
def fil(fill, fromm):
    fill.fillna(fromm, inplace=True)
fil(x['col1_x'], x['col1_y'])
Or, if you are sure about the dataframe (x), then this:
def fil(fill, fromm):
    x[fill].fillna(x[fromm], inplace=True)
fil('col1_x', 'col1_y')
For your code:
import pandas as pd
import numpy as np
x = pd.DataFrame({'col1_x': [15, np.nan, 136, 93, 743, np.nan, np.nan, 91],
                  'col2_x': [np.nan, np.nan, 51, 22, 38, np.nan, 72, np.nan],
                  'col1_y': [10, 20, 30, 40, 50, 60, 70, 80],
                  'col2_y': [93, 24, 52, 246, 142, 53, 94, 2]})
def fil(fill, fromm):
    x[fill].fillna(x[fromm], inplace=True)
fil('col1_x', 'col1_y')
fil('col2_x', 'col2_y')
print(x)
"""
col1_x col2_x col1_y col2_y
0 15.0 93.0 10 93
1 20.0 24.0 20 24
2 136.0 51.0 30 52
3 93.0 22.0 40 246
4 743.0 38.0 50 142
5 60.0 53.0 60 53
6 70.0 72.0 70 94
7 91.0 2.0 80 2
"""
Additionally, if you have column names like col1_x, col2_x, col3_x, ... (and the same for _y), then you can automate it like this:
for i in range(1, 3):
    fil(f'col{i}_x', f'col{i}_y')
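One caveat (my note, not from the answer): x[fill].fillna(x[fromm], inplace=True) mutates through a column selection, which newer pandas versions warn about and copy-on-write mode disallows. Assigning back avoids that:
def fil(fill, fromm):
    # assign the filled column back instead of mutating a possibly-temporary selection
    x[fill] = x[fill].fillna(x[fromm])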

How to build a histogram from a pandas dataframe where each observation is a list?

I have a dataframe as follows; each cell in the "Values" column holds a list of elements. I want to visualize the distribution of the values from the "Values" column using histograms, either stacked in rows or separated by colours (Area_code).
How can I get the values and construct the histograms in plotly? Any other ideas are also welcome. Thank you.
Area_code Values
0 New_York [999, 54, 231, 43, 177, 313, 212, 279, 199, 267]
1 Dallas [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316]
2 XXX [560]
3 YYY [884, 13]
4 ZZZ [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]
If you reshape your data, this is a perfect case for px.histogram. And from there you can opt between several aggregations like sum, average, and count through the histfunc argument:
fig = px.histogram(df, x = 'Area_code', y = 'Values', histfunc='sum')
fig.show()
You haven't specified what kind of output you're aiming for, but I'll leave it up to you to change the argument for histfunc and see which option suits your needs best.
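For instance, once the data is in the long format built below, swapping the argument gives per-area averages instead of sums (a sketch; histfunc also accepts 'count', 'min', and 'max'):
fig = px.histogram(df, x='Area_code', y='Values', histfunc='avg')
fig.show()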
I'm often inclined to urge users to rethink their entire data process, but I'll just assume there are good reasons why you're stuck with what seems like a pretty odd setup in your dataframe. The snippet below contains a complete data munging process to reshape your data from your setup to a so-called long format:
Area_code Values
0 New_York 999
1 New_York 54
2 New_York 231
3 New_York 43
4 New_York 177
5 New_York 313
6 New_York 212
7 New_York 279
8 New_York 199
9 New_York 267
10 Dallas 915
11 Dallas 183
12 Dallas 2326
13 Dallas 316
14 Dallas 206
15 Dallas 31
16 Dallas 317
17 Dallas 26
18 Dallas 31
19 Dallas 56
20 Dallas 316
21 XXX 560
22 YYY 884
23 YYY 13
24 ZZZ 203
And this is a perfect format for many of the great functionalites of plotly.express.
Complete code:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
# data input
df = pd.DataFrame({'Area_code': {0: 'New_York', 1: 'Dallas', 2: 'XXX', 3: 'YYY', 4: 'ZZZ'},
                   'Values': {0: [999, 54, 231, 43, 177, 313, 212, 279, 199, 267],
                              1: [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316],
                              2: [560],
                              3: [884, 13],
                              4: [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]}})
# data munging
areas = []
value = []
for i, row in df.iterrows():
    # print(row['Values'])
    for j, val in enumerate(row['Values']):
        areas.append(row['Area_code'])
        value.append(val)
df = pd.DataFrame({'Area_code': areas,
                   'Values': value})
# plotly
fig = px.histogram(df, x = 'Area_code', y = 'Values', histfunc='sum')
fig.show()
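As an aside, the munging loop can likely be replaced with DataFrame.explode (available since pandas 0.25); a sketch:
df = df.explode('Values').reset_index(drop=True)
df['Values'] = df['Values'].astype(int)  # explode leaves the values as object dtype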

Pandas: Iterate and insert column with conditions within groups complex question

I have a quite complex question about how to add a new column with conditions for each group. Here is the example dataframe,
df = pd.DataFrame({
    'id': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 'BB',
           'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC'],
    'From_num': [80, 68, 751, 'Issued', 32, 68, 126, 'Issued', 'Missed', 105, 68, 114, 76, 68, 99, 'Missed'],
    'To_num': [99, 80, 68, 751, 105, 32, 68, 126, 49, 324, 105, 68, 114, 76, 68, 99],
})
id From_num To_num
0 AA 80 99
1 AA 68 80
2 AA 751 68
3 AA Issued 751
4 BB 32 105
5 BB 68 32
6 BB 126 68
7 BB Issued 126
8 BB Missed 49
9 CC 105 324
10 CC 68 105
11 CC 114 68
12 CC 76 114
13 CC 68 76
14 CC 99 68
15 CC Missed 99
I have a 'flag' number, 68. In each group, any row at or above the row where this flag appears in the 'From_num' column should be tagged "Forward" in a new column, and any row at or below the row where it appears in the 'To_num' column should be labelled 'Back' in the same column. The hardest situation: if the flag number appears more than once in each column, the rows between those 'From_num' and 'To_num' occurrences should be labelled "Forward&Back". See the df and the expected result below.
Expected result
id From_num To_num Direction
0 AA 80 99 Forward
1 AA 68 80 Forward
2 AA 751 68 Back
3 AA Issued 751 Back
4 BB 32 105 Forward
5 BB 68 32 Forward
6 BB 126 68 Back
7 BB Issued 126 Back
8 BB Missed 49 Back
9 CC 105 324 Forward
10 CC 68 105 Forward
11 CC 114 68 Forward&Back # From line 11 to 13, flag # 68 appears more than once
12 CC 76 114 Forward&Back # so the line 11, 12 and 13 labelled "Forward&Back"
13 CC 68 76 Forward&Back
14 CC 99 68 Back
15 CC Missed 99 Back
I tried writing many loops, but they all failed to produce the expected result. So if anyone has ideas, please help. Hopefully the question is clear. Many thanks!
I've done this without "real" looping:
preserve the row numbers (reset_index())
construct a new data frame holding only the records that contain the flag (68)
the basic logic for "Forward" and "Back" is based on a row being before or after the first sighting of 68
"Forward&Back" occurs when there are multiple sightings and the row falls between the 2nd and (n-1)th sighting
def direction(r):
    flagrow = df2[df2["id"] == r["id"]]["index"].values
    if r["index"] <= flagrow[0]:
        val = "Forward"
    elif r["index"] > flagrow[0]:
        val = "Back"
    if len(flagrow) > 2 and r["index"] >= flagrow[1] and r["index"] < flagrow[-1]:
        val = "Forward&Back"
    return val
df = pd.DataFrame({
    'id': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 'BB',
           'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC'],
    'From_num': [80, 68, 751, 'Issued', 32, 68, 126, 'Issued', 'Missed', 105, 68, 114, 76, 68, 99, 'Missed'],
    'To_num': [99, 80, 68, 751, 105, 32, 68, 126, 49, 324, 105, 68, 114, 76, 68, 99],
})
df = df.reset_index()
df2 = df[(df.From_num==68) | (df.To_num==68)].copy()
df["Direction"] = df.apply(lambda r: direction(r), axis=1)
df
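A variant sketch of the same logic that looks up each group's flag rows once instead of re-filtering df2 for every row (the flag_idx name is illustrative):
flag_idx = df2.groupby('id')['index'].apply(list).to_dict()
def direction_fast(r):
    flags = flag_idx[r['id']]  # row numbers where 68 appears for this id
    val = 'Forward' if r['index'] <= flags[0] else 'Back'
    if len(flags) > 2 and flags[1] <= r['index'] < flags[-1]:
        val = 'Forward&Back'
    return val
df['Direction'] = df.apply(direction_fast, axis=1)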

Python Generate unique ranges of a specific length and categorize them

I have a dataframe column which specifies how many times a user has performed an activity.
eg.
>>> df['ActivityCount']
Users ActivityCount
User0 220
User1 190
User2 105
User3 109
User4 271
User5 265
...
User95 64
User96 15
User97 168
User98 251
User99 278
Name: ActivityCount, Length: 100, dtype: int32
>>> activities = sorted(df['ActivityCount'].unique())
[9, 15, 16, 17, 20, 23, 25, 26, 28, 31, 33, 34, 36, 38, 39, 43, 49, 57, 59, 64, 65, 71, 76, 77, 78,
83, 88, 94, 95, 100, 105, 109, 110, 111, 115, 116, 117, 120, 132, 137, 138, 139, 140, 141, 144, 145, 148, 153, 155, 157, 162, 168, 177, 180, 182, 186, 190, 192, 194, 197, 203, 212, 213, 220, 223, 231, 232, 238, 240, 244, 247, 251, 255, 258, 260, 265, 268, 269, 271, 272, 276, 278, 282, 283, 285, 290]
According to their ActivityCount, I have to divide users into 5 different categories, e.g. A, B, C, D and E.
The ActivityCount range varies from time to time. In the above example it's roughly 9-290 (the lowest and highest of the series); it could be 5-500, or 5 to 30.
In the above example, I could take the maximum count, divide it by 5, and categorize each user into ranges of width 58 (from 290/5): Range A: 0-58, Range B: 59-116, Range C: 117-174, etc.
Is there another way to achieve this using pandas or numpy, so that I can directly categorize the column into the given categories?
Expected output: -
>>> df
Users ActivityCount Category/Range
User0 220 D
User1 190 D
User2 105 B
User3 109 B
User4 271 E
User5 265 E
...
User95 64 B
User96 15 A
User97 168 C
User98 251 E
User99 278 E
The natural way to do this would be to split the full value range into 5 equal-width intervals and assign each observation to its bin. Luckily, pandas lets you do exactly that:
df["category"] = pd.cut(df.Activity, 5, labels=["a", "b", "c", "d", "e"])
The output is something like:
Activity Category
34 115 b
15 43 a
57 192 d
78 271 e
26 88 b
6 25 a
55 186 d
63 220 d
1 15 a
76 268 e
An alternative view - clustering
In the above method, we've split the data into 5 bins whose value ranges are of equal width. An alternative, more sophisticated approach would be to split the data into 5 clusters, aiming to have the data points in each cluster as similar to each other as possible. In machine learning, this is known as a clustering problem.
One classic clustering algorithm is k-means. It's typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.), so this is a very simplistic case of clustering.
In this case, k-means clustering can be done in the following way:
import numpy as np
import pandas as pd
from scipy.cluster.vq import vq, kmeans, whiten
df = pd.DataFrame({"Activity": l})  # l is the list of activity counts from above
features = np.array([[x] for x in df.Activity])
whitened = whiten(features)                 # rescale to unit variance
codebook, distortion = kmeans(whitened, 5)  # find 5 cluster centroids
code, dist = vq(whitened, codebook)         # assign each point to its nearest centroid
df["Category"] = code
And the output looks like:
Activity Category
40 138 1
79 272 0
72 255 0
13 38 3
41 139 1
65 231 0
26 88 2
59 197 4
76 268 0
45 145 1
A couple of notes:
The labels of the categories are arbitrary. In this case label '2' refers to higher activity than label '1'.
I didn't map the labels from 0-4 to A-E. This can easily be done using pandas' map, as sketched below.
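A sketch of that mapping, assuming we first order the cluster codes by mean activity so that 'A' ends up as the least-active group:
order = df.groupby('Category')['Activity'].mean().sort_values().index
df['Category'] = df['Category'].map(dict(zip(order, 'ABCDE')))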
Try the below solution:
df['Categ'] = pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'))
This creates the Categ column as the result of dividing ActivityCount
into 5 bins, labelled A through E.
The bin borders are set by dividing the full range into 5 subranges of
equal size.
You can also see the borders of each bin by calling:
pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'), retbins=True)[1]
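If you would rather have bins holding (roughly) equal numbers of users than equal-width value ranges, pd.qcut is the quantile-based counterpart; a sketch:
df['Categ'] = pd.qcut(df.ActivityCount, q=5, labels=list('ABCDE'))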

How to create a dictionary of items from a dataframe?

I have a Pandas dataframe df which is of the form:
pk id_column date_column sales_column
0 111 03/10/19 23
1 111 04/10/19 24
2 111 05/10/19 25
3 111 06/10/19 26
4 112 07/10/19 27
5 112 08/10/19 28
6 112 09/10/19 29
7 112 10/10/19 30
8 113 11/10/19 31
9 113 12/10/19 32
10 113 13/10/19 33
11 113 14/10/19 34
12 114 15/10/19 35
13 114 16/10/19 36
14 114 17/10/19 37
15 114 18/10/19 38
How do I get a new dictionary, keyed by id_column, whose values are lists from sales_column in the order of date_column, like below?
{
111: [23, 24, 25, 26],
112: [27, 28, 29, 30],
113: ...,
114: ...
}
First create a Series of lists via groupby with list, then convert it to a dictionary with Series.to_dict.
If you need sorting by id_column and date_column, first convert the values to datetimes and then use DataFrame.sort_values:
df['date_column'] = pd.to_datetime(df['date_column'], dayfirst=True)
df = df.sort_values(['id_column','date_column'])
d = df.groupby('id_column')['sales_column'].apply(list).to_dict()
print (d)
{111: [23, 24, 25, 26], 112: [27, 28, 29, 30], 113: [31, 32, 33, 34], 114: [35, 36, 37, 38]}
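An equivalent sketch with a dict comprehension over the grouped object, if the intermediate Series isn't wanted:
d = {k: g['sales_column'].tolist() for k, g in df.groupby('id_column')}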
