I have two dataframes, shown below.
Dataframe1:

country  type  start_week  end_week
1        a     12          13
2        b     13          14
Dataframe2:

country  type  week  value
1        a     12    1000
1        a     13    900
1        a     14    800
2        b     12    1000
2        b     13    900
2        b     14    800
I want to add a column to the first dataframe containing the mean value from the second dataframe for each key (country + type), taken over the weeks between start_week and end_week (inclusive).
I want the desired output to look like the below:
country  type  start_week  end_week  avg
1        a     12          13        950
2        b     13          14        850
Here is one way:

combined = df1.merge(df2, on=['country', 'type'])
# keep only the weeks that fall inside each [start_week, end_week] range
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country', 'type', 'start_week', 'end_week'])['value'].mean().reset_index()
output:
>>
country type start_week end_week value
0 1 a 12 13 950.0
1 2 b 13 14 850.0
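The mean column keeps the name value; if you want it called avg as in the desired output, a small follow-up rename (just a sketch) does it:

output = output.rename(columns={'value': 'avg'})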
You can use pd.melt and a comparison of NumPy arrays.
# melt df1 so each (country, type) pair has one row per boundary week
melted_df1 = df1.melt(id_vars=['country', 'type'], value_name='week')[['country', 'type', 'week']]

# nested loop to compare the two dataframes' arrays
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break

# compute the mean of the matched rows per type
result_df = pd.DataFrame(result, columns=df2.columns).groupby('type')['value'].mean().reset_index()['value']

# assign result_df to df1 (index alignment: both use 0, 1)
df1['avg'] = result_df
country type start_week end_week avg
0 1 a 12 13 950.0
1 2 b 13 14 850.0
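Two caveats, as observations on the loop above: it scales as O(len(df1) * len(df2)), and melting df1 produces only the boundary weeks (start_week and end_week), so weeks strictly in between would be missed for ranges longer than the two-week spans in this sample. Under those same assumptions, a vectorized sketch of the matching step:

# keep the df2 rows whose (country, type, week) triple appears in melted_df1
keys = pd.MultiIndex.from_frame(melted_df1[['country', 'type', 'week']])
mask = pd.MultiIndex.from_frame(df2[['country', 'type', 'week']]).isin(keys)
# assumes df1 is sorted by (country, type), as in the sample, so the
# group means line up with df1's rows
df1['avg'] = df2[mask].groupby(['country', 'type'])['value'].mean().values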
I have a DataFrame that looks like the following:
import pandas as pd
import numpy as np
# Create data set.
present = 12
died = 20
dataSet = {'id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
           'id_2': [1, 2, 3, 1, 1, 2, 3, 1],
           'start': [9, 13, 12, 11, 9, 20, 22, 13],
           'end': [14, 22, 21, 19, 10, 30, 24, 18]}
# Create dataframe with data set and named columns.
df = pd.DataFrame(dataSet, columns= ['id', 'id_2', 'start','end'])
id id_2 start end
0 A 1 9 14
1 A 2 13 22
2 A 3 12 21
3 A 1 11 19
4 B 1 9 10
5 B 2 20 30
6 B 3 22 24
7 C 1 13 18
We have present = 12 and died = 20, and I want to filter the dataframe as in the following diagram, where the pink box represents df_start, the yellow box df_between, and the purple box df_end.
I want to combine these, but I had to do it separately as follows (present and died inclusive):
df_start = df.loc[(df['start'] <= present) & (df['end'] >= present)]
df_between = df.loc[(df['start'] >= present) & (df['end'] <= died)]
df_end = df.loc[(df['start'] <= died) & (df['end'] >= died)]
Concatenating these three dataframes and dropping duplicates gives me the three colored boxes combined, which is what I want, but is there a way to do this more simply/better/fancier? (Imagine this dataframe has more than 1 million rows; performance also matters.)
Hence, the desired output would be:
id id_2 start end
0 A 1 9 14
1 A 2 13 22
2 A 3 12 21
3 A 1 11 19
4 B 2 20 30
5 C 1 13 18
Thank you!
IIUC, you can do it with two conditions and a negation with OR.
m1 = (df['end'] < present)   # find all ranges that end before present
m2 = (df['start'] > died)    # find all ranges that start after died
df[~(m1 | m2)]               # negate to find all ranges that intersect/overlap present to died
Output:
id id_2 start end
0 A 1 9 14
1 A 2 13 22
2 A 3 12 21
3 A 1 11 19
5 B 2 20 30
7 C 1 13 18
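Equivalently, applying De Morgan's laws turns the negation into a direct interval-overlap test, a one-line sketch of the same logic:

# ~((end < present) | (start > died))  ==  (end >= present) & (start <= died)
df[(df['end'] >= present) & (df['start'] <= died)]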
If I understand this correctly, you are looking to have three separate dataframes according to the logic specified, and at the same time be able to concat them into a single deduplicated dataframe quickly.
You can save your conditions as masks:
df_start_mask = (df['start'] <= present) & (df['end'] >= present)
df_between_mask = (df['start'] >= present) & (df['end'] <= died)
df_end_mask = (df['start'] <= died) & (df['end'] >= died)
Creating the three separate dataframes is similar to before:
df_start = df.loc[df_start_mask]
df_between = df.loc[df_between_mask]
df_end = df.loc[df_end_mask]
However, creating the combined dataframe is much faster, because instead of having to concat, you can index directly from your original dataframe:

combined_df = df.loc[df_start_mask | df_between_mask | df_end_mask]
Which returns the intended result:
>>> print(combined_df)
id id_2 start end
0 A 1 9 14
1 A 2 13 22
2 A 3 12 21
3 A 1 11 19
5 B 2 20 30
7 C 1 13 18
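For reference, the concat-and-dedupe version the asker described would be the sketch below; indexing with one combined mask avoids materializing three intermediate frames and scanning for duplicates:

# the concat-based approach from the question, for comparison only
combined_slow = pd.concat([df_start, df_between, df_end]).drop_duplicates()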
I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply a condition for duplicated names: if a duplicated row also has a duplicated value in a column (for example, the duplicate rows for A have the duplicate value 180 in the rent column), I keep only one (without taking the sum). Otherwise I take the sum (for example, duplicate row A has the different values 2 and 5 in the sale column, and duplicate row M has different values in both the rent and sale columns).
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code but it's not working as I want:
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})
df2 = df.drop_duplicates().groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', 'sum'),
    sale=('sale', 'sum'))
print(df2)
I got this output:
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    return s.unique().sum()

df2 = df.groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', sum_unique),
    sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
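To see why this works, here is the intermediate frame after the first groupby, reconstructed from the sample data; the first pass collapses duplicate (Name, rent) pairs while summing sale, and the second pass sums what is left per Name (note, as a caveat, that this deduplicates on rent only):

step1 = df.groupby(['Name', 'rent'], as_index=False).sum()
#   Name  rent  sale
# 0    A   180     7   <- the two A/180 rows collapsed, sales 2 + 5 summed
# 1    B     1     4
# 2    M     2    19
# 3    M    12     1
# 4    O    10     1
step1.groupby('Name', as_index=False).sum()  # sums M's rents: 2 + 12 = 14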
I have a data frame like this:
ID Value
111 10
111 5
112 11
112 11
And I want to create a third column, "Check", that will be a binary 1 or 0 (True or False) for the following condition: are all the values for the same ID number the same, within a 5% margin of error?
For example, for ID number 111 the column Check would be 1 if the values 10 and 5 were the same, or if the percentage difference between them were at most 5%, the percentage difference being calculated as abs(5 - 10) / ((5 + 10) / 2).
The output would then be:
ID Value Check
111 10 0
111 5 0
112 11 1
112 11 1
I am using the following code:
# flag IDs where every value is identical (exactly one unique value per group)
a = df.groupby([df['ID']])['Value'].nunique().eq(1)
index_list = a[a].index.tolist()
df['Check'] = 0
df.loc[df['ID'].isin(index_list), 'Check'] = 1
but it only checks whether the values are exactly the same, and I am not sure how to incorporate the 5 pct. difference into the check.
I would also like to do this only when we have more than one observation per ID number, and return NaN in the column Check when there is only one observation.
Thank you!!
Try transform:
import numpy as np

g = df.groupby('ID')['Value']
# peak-to-peak (max - min) relative to the mean, per group
df['new'] = ((g.transform(np.ptp) / g.transform('mean')) < 0.05).astype(int)
df
Out[40]:
ID Value new
0 111 10 0
1 111 5 0
2 112 11 1
3 112 11 1
Run:
df['Check'] = df.groupby([df['ID']]).Value.transform(
    lambda grp: (grp.max() - grp.min()) / grp.mean() < 0.05)
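Neither snippet returns NaN for IDs with a single observation, which the question also asked about; a sketch handling that case, assuming the same (max - min) / mean < 5% rule:

import numpy as np

g = df.groupby('ID')['Value']
df['Check'] = ((g.transform('max') - g.transform('min')) / g.transform('mean') < 0.05).astype(int)
# single-observation groups get NaN, per the question (column becomes float)
df.loc[g.transform('size') == 1, 'Check'] = np.nan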
Assuming that I have a dataframe with the following values:
name start end description
0 ag 20 30 None
1 bgb 21 111 'a'
2 cdd 31 101 None
3 bgb 17 19 'Bla'
4 ag 20 22 None
I want to group by name and then get the average of the (end - start) values.
I can use mean (df.groupby(['name'], as_index=False).mean()), but how can I give the mean function the subtraction of the two columns (end - start)?
You can subtract the columns and then group by the column df['name']:
df1 = df['end'].sub(df['start']).groupby(df['name']).mean().reset_index(name='diff')
print (df1)
name diff
0 ag 6
1 bgb 46
2 cdd 70
Another idea, with a new column diff:
df1 = (df.assign(diff=df['end'].sub(df['start']))
         .groupby('name', as_index=False)['diff']
         .mean())
print (df1)
name diff
0 ag 6
1 bgb 46
2 cdd 70
I have a small sample data set:
import pandas as pd

d = {
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1]
}
df = pd.DataFrame(d)
# reindex_axis was removed from pandas; reindex(columns=...) does the same
df = df.reindex(columns=[
    'measure1_x', 'measure2_x', 'measure3_x', 'measure1_y', 'measure2_y', 'measure3_y'
])
it looks like:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y
10 11 10 1 1 1
12 12 0 2 1 0
20 10 12 2 1 2
30 3 1 3 3 1
21 3 1 1 3 1
I created the column names to be almost the same except for the '_x' and '_y' suffixes, to help identify which pairs should be multiplied: I want to multiply the pairs that have the same column name when '_x' and '_y' are disregarded, then sum the products to get a total number. Keep in mind that my actual data set is huge and the columns are not in this perfect order, so this naming is a way of identifying the correct pairs to multiply:
total = measure1_x * measure1_y + measure2_x * measure2_y + measure3_x * measure3_y
so desired output:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y total
10 11 10 1 1 1 31
12 12 0 2 1 0 36
20 10 12 2 1 2 74
30 3 1 3 3 1 100
21 3 1 1 3 1 31
My attempt and thought process is below, but I cannot proceed any further syntax-wise:
# first identify the column names that have '_x' and '_y', then check whether
# the column names are the same after removing '_x' and '_y'; if a pair has
# the same name then multiply them, do that for all pairs and sum the results
# up to get the total number
for colname in df.columns:
    if "_x".lower() in colname.lower() or "_y".lower() in colname.lower():
        if "_x".lower() in colname.lower():
            colnamex = colname
        if "_y".lower() in colname.lower():
            colnamey = colname
        # if colnamex[:-2] is the same for colnamex and colnamey then multiply and sum
filter + np.einsum
Thought I'd try something a little different this time:
- get your _x and _y columns separately
- do a product-sum; this is very easy to specify with einsum (and fast)
import numpy as np

df = df.sort_index(axis=1)  # optional, do this if your columns aren't sorted
i = df.filter(like='_x')
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j)  # equivalent to (i.values * j).sum(axis=1)
df
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
A slightly more robust version, which filters out non-numeric columns and performs an assertion beforehand:
df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x')
j = df.filter(regex='.*_y')
assert i.shape == j.shape
df['Total'] = np.einsum('ij,ij->i', i, j)
If the assertion fails, then the assumptions that 1) your columns are numeric, and 2) the number of x and y columns is equal, as your question suggests, do not hold for your actual dataset.
- Use df.columns.str.split to generate a new MultiIndex
- Use prod with its axis and level arguments
- Use sum with its axis argument
- Use assign to create a new column
df.assign(
    Total=df.set_axis(
        df.columns.str.split('_', expand=True),
        axis=1, inplace=False
    ).prod(axis=1, level=0).sum(1)
)
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
Restrict the dataframe to just the columns that look like 'measure[i]_[j]':
df.assign(
    Total=df.filter(regex=r'^measure\d+_\w+$').pipe(
        lambda d: d.set_axis(
            d.columns.str.split('_', expand=True),
            axis=1, inplace=False
        )
    ).prod(axis=1, level=0).sum(1)
)
Debugging
See if this gets you the correct Totals
d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)
d_.prod(axis=1, level=0).sum(1)
0 31
1 36
2 74
3 100
4 31
dtype: int64
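As a side note, prod(axis=1, level=0) and the inplace argument of set_axis were removed in pandas 2.0; on newer versions the same product-sum can be sketched as:

# sketch for pandas >= 2.0: group the transposed frame by the first
# level of the split columns, take the product, then sum per row
d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)
total = d_.T.groupby(level=0).prod().T.sum(axis=1)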