I have two dataframes, just like below.
Dataframe1:
country  type  start_week  end_week
1        a     12          13
2        b     13          14
Dataframe2:
country  type  week  value
1        a     12    1000
1        a     13    900
1        a     14    800
2        b     12    1000
2        b     13    900
2        b     14    800
I want to add a column to the first dataframe that holds the mean value from the second dataframe for each key (country + type), using only the weeks between start_week and end_week (inclusive).
The desired output looks like this:
country  type  start_week  end_week  avg
1        a     12          13        950
2        b     13          14        850
Here is one way:
combined = df1.merge(df2 , on =['country','type'])
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country','type','start_week','end_week'])['value'].mean().reset_index()
Output:
   country type  start_week  end_week  value
0        1    a          12        13  950.0
1        2    b          13        14  850.0
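For reference, the merge-then-filter idea above can be sketched end to end as follows. The two frames are rebuilt from the question's tables; the `in_window` and `avg` names, the rename of `value` to `avg`, and the final left merge back onto `df1` are my own additions, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "country": [1, 2],
    "type": ["a", "b"],
    "start_week": [12, 13],
    "end_week": [13, 14],
})
df2 = pd.DataFrame({
    "country": [1, 1, 1, 2, 2, 2],
    "type": ["a", "a", "a", "b", "b", "b"],
    "week": [12, 13, 14, 12, 13, 14],
    "value": [1000, 900, 800, 1000, 900, 800],
})

# Pair every df1 row with every df2 row sharing the same (country, type)
combined = df1.merge(df2, on=["country", "type"])

# Keep only the weeks inside each row's [start_week, end_week] window (inclusive)
in_window = combined["week"].between(combined["start_week"], combined["end_week"])

# Average the surviving values per df1 row and name the result "avg"
avg = (combined[in_window]
       .groupby(["country", "type", "start_week", "end_week"], as_index=False)["value"]
       .mean()
       .rename(columns={"value": "avg"}))

# Left-merge so df1 rows with no matching weeks are kept (their avg is NaN)
output = df1.merge(avg, on=["country", "type", "start_week", "end_week"], how="left")
print(output)
```

Note that the intermediate merge grows with the number of matching rows per key, so this may need more memory on large inputs.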
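You can use pd.melt and a comparison of numpy arrays. Note that this matches only rows whose week equals start_week or end_week exactly, so it works here because each range spans two consecutive weeks.
# melt df1 so that start_week and end_week end up in a single 'week' column
melted_df1 = df1.melt(id_vars=['country','type'],value_name='week')[['country','type','week']]
# for loop to compare two dataframe arrays
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break
# Computing the mean per (country, type); infer_objects() restores the numeric dtypes
result_df = (pd.DataFrame(result, columns=df2.columns)
             .infer_objects()
             .groupby(['country','type'], as_index=False)['value'].mean()
             .rename(columns={'value': 'avg'}))
# Attaching the averages to df1 by key rather than by row position
df1 = df1.merge(result_df, on=['country','type'], how='left')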
country type start_week end_week avg
0 1 a 12 13 950.0
1 2 b 13 14 850.0
I have a column with faulty values. It is supposed to count cycles, but the device the data comes from resets the count after 50, so I was left with, for example, [1,1,1,1,2,2,2,3,3,3,3,...,50,50,50,1,1,1,2,2,2,2,3,3,3,...,50,50,...,50]
My attempted solution is below, and I cannot even make it work (for simplicity, I made the data reset after 10 cycles):
data = {'Cyc-Count':[1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
                     4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)
x=0
count=0
old_value=df.at[x,'Cyc-Count']
for x in range(x,len(df)-1):
    if df.at[x,'Cyc-Count']==df.at[x+1,'Cyc-Count']:
        old_value=df.at[x+1,'Cyc-Count']
        df.at[x+1,'Cyc-Count']=count
    else:
        old_value=df.at[x+1,'Cyc-Count']
        count+=1
        df.at[x+1,'Cyc-Count']=count
I need to fix this, preferably without using if statements at all.
The desired output for the example above should be:
data = {'Cyc-Count':[1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,11,11,11,12,13,13,13,13,
                     14,14,15,16,16,16,17,18,18,18,18,19,20]}
Hint: my method has a big issue. The last indexed value will be hard to change, because the element at index+1 that it would be compared with does not exist.
IIUC, you want to continue the count whenever the counter decreases.
You can use vectorized code:
s = df['Cyc-Count'].shift()
# keep the previous value only where the counter dropped, then accumulate it
df['Cyc-Count2'] = (df['Cyc-Count']
                    + s.where(s.gt(df['Cyc-Count']))
                       .fillna(0).astype(int)
                       .cumsum()
                   )
Or, to modify the column in place:
s = df['Cyc-Count'].shift()
df['Cyc-Count'] += (s.where(s.gt(df['Cyc-Count']))
                     .fillna(0).astype(int).cumsum()
                   )
output:
Cyc-Count Cyc-Count2
0 1 1
1 1 1
2 1 1
3 1 1
4 2 2
5 2 2
6 2 2
7 3 3
8 3 3
9 3 3
10 3 3
11 4 4
12 5 5
13 5 5
14 5 5
15 1 6
16 1 6
17 1 6
18 2 7
19 2 7
20 2 7
21 2 7
22 3 8
23 3 8
24 3 8
25 4 9
26 5 10
27 5 10
28 1 11
29 2 12
30 2 12
31 3 13
32 4 14
33 5 15
34 5 15
used input:
l = [1,1,1,1,2,2,2,3,3,3,3,4,5,5,5,1,1,1,2,2,2,2,3,3,3,4,5,5,1,2,2,3,4,5,5]
df = pd.DataFrame({'Cyc-Count': l})
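As a cross-check, here is a minimal sketch of the same idea applied to the 10-cycle data from the question. It detects resets with `diff()` instead of comparing against the shifted column; the `is_reset`, `offset` and `Cyc-Count-fixed` names are hypothetical and chosen only for this illustration.

```python
import pandas as pd

data = {"Cyc-Count": [1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 7, 7, 8, 9, 10,
                      1, 1, 1, 2, 3, 3, 3, 3,
                      4, 4, 5, 6, 6, 6, 7, 8, 8, 8, 8, 9, 10]}
df = pd.DataFrame(data)

counts = df["Cyc-Count"]

# A reset is any row whose value is lower than the previous row's value
is_reset = counts.diff().lt(0)

# At each reset, take the last value seen before it (0 elsewhere) and accumulate
offset = counts.shift().where(is_reset, 0).cumsum().astype(int)

# The original column is left untouched; the corrected count goes to a new column
df["Cyc-Count-fixed"] = counts + offset
print(df["Cyc-Count-fixed"].tolist())
```

Because the offset is taken from the data itself, the sketch does not need to know the reset period (10 here, 50 on the real device). It does assume the counter never decreases except at a reset.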
You can use df.loc to access a group of rows and columns by label(s) or a boolean array.
Syntax: df.loc[df['column name'] condition, 'column name or the new one'] = 'value if condition is met'
For example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
print (df)
df.loc[df['set_of_numbers'] == 0, 'set_of_numbers'] = 999
df.loc[df['set_of_numbers'] == 5, 'set_of_numbers'] = 555
print (df)
Before: 'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]
After: 'set_of_numbers': [1,2,3,4,555,6,7,8,9,10,999,999]
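The same df.loc pattern can also write into a separate column and take combined conditions. The following is a small sketch under my own assumptions: the `cleaned` column name, the sample numbers and the replacement values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"set_of_numbers": [1, 2, 3, 0, 5, 0]})

# Start the new column as a copy, then overwrite only the rows matching the mask
df["cleaned"] = df["set_of_numbers"]
df.loc[df["set_of_numbers"] == 0, "cleaned"] = 999

# Masks combine with & (and), | (or) and ~ (not); each comparison needs parentheses
mask = (df["set_of_numbers"] > 1) & (df["set_of_numbers"] < 5)
df.loc[mask, "cleaned"] = -1

print(df)
```

Copying the column first avoids the NaN values that would otherwise appear in rows the mask never touches.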
I have a df like this:
time units cost
0 4 10
1 2 10
3 4 20
4 1 20
5 3 10
6 1 20
9 2 10
As you can see, df.time is not consecutive. If there is a missing value, I want to add a new row, populating df.time with the consecutive time value, df.units with 2 and df.cost with 20.
Expected output:
time units cost
0 4 10
1 2 10
2 2 20
3 4 20
4 1 20
5 3 10
6 1 20
7 2 20
8 2 20
9 2 10
How do I do this? I understand how to do this by deconstructing all series into lists, looping through them and appending values when time is not equal to the previous time + 1, but this seems inefficient.
You can use the reindex method with a call to fillna to do this:
# Build a new index that ranges from time min to time max with a step of 1
new_index = range(df["time"].min(), df["time"].max() + 1)
out = (
    df.set_index("time")                  # index the dataframe by the original time column
      .reindex(new_index)                 # rows for the missing times appear as NaN
      .fillna({"units": 2, "cost": 20})   # fill the NaNs in units and cost with 2 and 20
      .astype(int)                        # reindexing made the columns float; cast back to int
)                                         # (not necessary, but produces cleaner output)
print(out)
units cost
time
0 4 10
1 2 10
2 2 20
3 4 20
4 1 20
5 3 10
6 1 20
7 2 20
8 2 20
9 2 10
You can use df.reindex, then pd.Series.fillna.
idx = pd.RangeIndex(df['time'].min(), df['time'].max()+1)
# If `df.time` is always sorted then,
# idx = pd.RangeIndex(df['time'].iat[0], df['time'].iat[-1]+1)
df = df.set_index('time')
df = df.reindex(idx)
df['units'] = df['units'].fillna(2).astype(int)
df['cost'] = df['cost'].fillna(20).astype(int)
# if you prefer not to hard-code the names of the columns, replace the
# last two lines with:
# defaults = [2,20]
# for (name, default) in zip(df.columns, defaults):
#     df[name] = df[name].fillna(default).astype(type(default))
units cost
time
0 4 10
1 2 10
2 2 20
3 4 20
4 1 20
5 3 10
6 1 20
7 2 20
8 2 20
9 2 10
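The reindex-and-fill steps can also be wrapped into a small reusable helper. This is only a sketch: `fill_missing_times` and its parameters are hypothetical names, and it assumes the time column holds integers and that `defaults` supplies a value for every other column.

```python
import pandas as pd

def fill_missing_times(df, defaults, time_col="time"):
    """Insert a row for every missing integer in time_col, filled from defaults."""
    full_range = pd.RangeIndex(df[time_col].min(), df[time_col].max() + 1, name=time_col)
    out = df.set_index(time_col).reindex(full_range)
    # fillna accepts a {column: value} mapping, so no column name is hard-coded here
    out = out.fillna(defaults)
    # Reindexing introduced NaN (float); restore each column's original dtype
    original_dtypes = df.dtypes.drop(time_col).to_dict()
    return out.astype(original_dtypes).reset_index()

df = pd.DataFrame({
    "time": [0, 1, 3, 4, 5, 6, 9],
    "units": [4, 2, 4, 1, 3, 1, 2],
    "cost": [10, 10, 20, 20, 10, 20, 10],
})
result = fill_missing_times(df, {"units": 2, "cost": 20})
print(result)
```

Returning `time` as a regular column (via `reset_index`) matches the shape of the expected output in the question; drop that call if an index is preferred.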
You can construct a new DataFrame with a complete "time" column and then call .fillna() with the original dataframe (df is your original dataframe):
import numpy as np
import pandas as pd
r = range(df['time'].min(), df['time'].max()+1)
df_out = pd.DataFrame({'time': r, 'units': [np.nan]*len(r), 'cost': [np.nan]*len(r)}).set_index('time')
df_out = df_out.fillna(df.set_index('time'))
df_out['units'] = df_out['units'].fillna(2).astype(int)
df_out['cost'] = df_out['cost'].fillna(20).astype(int)
print(df_out)
Prints:
units cost
time
0 4 10
1 2 10
2 2 20
3 4 20
4 1 20
5 3 10
6 1 20
7 2 20
8 2 20
9 2 10
I have a wide table in a format as follows (for up to 10 people):
person1_status | person2_status | person3_status | person1_type | person2_type | person3_type
0 | 1 | 0 | 7 | 4 | 6
Where status can be a 0 or a 1 (first 3 cols).
Where type can be a # ranging from 4-7. The value here corresponds to another table that specifies a value based on type. So...
Type | Value
4 | 10
5 | 20
6 | 30
7 | 40
I need to calculate two columns, 'A', and 'B', where:
A is the sum of values of each person's type (in that row) where status = 0.
B is the sum of values of each person's type (in that row) where status = 1.
For example, the resulting columns 'A', and 'B' would be as follows:
A | B
70 | 10
An explanation of this:
'A' has value 70 because person1 and person3 have "status" 0 and have corresponding types of 7 and 6 (which correspond to values 40 and 30).
Similarly, there should be another column 'B' that has the value "10" because only person2 has status "1" and their type is "4" (which has corresponding value of 10).
This is probably a stupid question, but how do I do this in a vectorized way? I don't want to use a for loop or anything since it'll be less efficient...
I hope that made sense... could anyone help me? I think I'm brain dead trying to figure this out.
For simpler calculated columns I was getting away with just np.where but I'm a little stuck here since I need to calculate the sum of values from multiple columns given certain conditions while pulling in those values from a separate table...
hope that made sense
Use the filter method to select the columns whose names contain a given string.
Make a lookup Series, other_table, for the values, indexed by type.
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
# map every type code to its value, column by column
df_type_lookup = df_type.apply(lambda col: col.map(other_table)).values
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Full example below:
Create fake data
df = pd.DataFrame({'person_1_status':np.random.randint(0, 2,1000) ,
'person_2_status':np.random.randint(0, 2,1000),
'person_3_status':np.random.randint(0, 2,1000),
'person_1_type':np.random.randint(4, 8,1000),
'person_2_type':np.random.randint(4, 8,1000),
'person_3_type':np.random.randint(4, 8,1000)},
columns= ['person_1_status', 'person_2_status', 'person_3_status',
'person_1_type', 'person_2_type', 'person_3_type'])
person_1_status person_2_status person_3_status person_1_type \
0 1 0 0 7
1 0 1 0 6
2 1 0 1 7
3 0 0 0 7
4 0 0 1 4
person_2_type person_3_type
0 5 5
1 7 7
2 7 7
3 7 7
4 7 7
Make other_table
other_table = pd.Series({4:10, 5:20, 6:30, 7:40})
4 10
5 20
6 30
7 40
dtype: int64
Filter out status and type columns to their own dataframes
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
Make lookup table
df_type_lookup = df_type.apply(lambda col: col.map(other_table)).values
Apply element-wise multiplication and sum across rows.
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Output
person_1_status person_2_status person_3_status person_1_type \
0 0 0 1 7
1 0 1 0 4
2 0 1 1 7
3 0 1 0 6
4 0 0 1 5
person_2_type person_3_type A B
0 7 5 80 20
1 6 4 20 30
2 5 5 40 40
3 6 4 40 30
4 7 5 60 20
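To tie this back to the question, here is a minimal sketch that runs the same filter, lookup and masked-sum steps on the single example row. The column names are normalized to `personN_status` / `personN_type`, and the sketch assumes the status and type columns appear in the same person order; `np.where` is swapped in for the boolean multiplication, which is my choice rather than part of the answer above.

```python
import numpy as np
import pandas as pd

# The single example row from the question
df = pd.DataFrame({
    "person1_status": [0], "person2_status": [1], "person3_status": [0],
    "person1_type": [7], "person2_type": [4], "person3_type": [6],
})
other_table = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})

status = df.filter(like="status").to_numpy()
# Map each type code to its value, one column at a time
values = df.filter(like="type").apply(lambda col: col.map(other_table)).to_numpy()

# Zero out the people with the other status, then sum across each row
df["A"] = np.where(status == 0, values, 0).sum(axis=1)
df["B"] = np.where(status == 1, values, 0).sum(axis=1)
print(df[["A", "B"]])
```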
Consider the dataframe df:
mux = pd.MultiIndex.from_product([['status', 'type'], ['p%i' % i for i in range(1, 6)]])
data = np.concatenate([np.random.choice((0, 1), (10, 5)), np.random.rand(10, 5)], axis=1)
df = pd.DataFrame(data, columns=mux)
df
The way this is structured, we can do this for status == 1:
df.status.mul(df.type).sum(1)
0 0.935290
1 1.252478
2 1.354461
3 1.399357
4 2.102277
5 1.589710
6 0.434147
7 2.553792
8 1.205599
9 1.022305
dtype: float64
and for status == 0:
df.status.rsub(1).mul(df.type).sum(1)
0 1.867986
1 1.068045
2 0.653943
3 2.239459
4 0.214523
5 0.734449
6 1.291228
7 0.614539
8 0.849644
9 1.109086
dtype: float64
You can get your columns into this format using the following code:
df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, axis=1)
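Putting the reshaping and the weighted sums together on the question's example row might look like the sketch below. It assumes column names of the form `person1_status` (exactly one underscore), and the `wide`, `status` and `values` names, as well as the type-to-value lookup step, are my additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "person1_status": [0], "person2_status": [1], "person3_status": [0],
    "person1_type": [7], "person2_type": [4], "person3_type": [6],
})
other_table = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})

# "person1_status" -> ("person1", "status"), then swap to ("status", "person1")
wide = df.copy()
wide.columns = wide.columns.str.split("_", expand=True)
wide = wide.swaplevel(0, 1, axis=1)

status = wide["status"]  # columns: person1, person2, person3
values = wide["type"].apply(lambda col: col.map(other_table))

# status is 0/1, so it doubles as a weight; rsub(1) computes 1 - status
B = status.mul(values).sum(axis=1)
A = status.rsub(1).mul(values).sum(axis=1)
print(A.tolist(), B.tolist())
```

Because `mul` aligns on the person labels, the result does not depend on the order of the original columns.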
I want to apply a custom operation to one column, grouped by the values of another column: group by a column to get the count, then divide the other column's value by this count for all records in the group.
My Data Frame:
emp opp amount
0 a 1 10
1 b 1 10
2 c 2 30
3 b 2 30
4 d 2 30
My scenario:
For opp=1, two emps worked (a, b), so the amount should be shared as
10/2 = 5
For opp=2, three emps worked (b, c, d), so the amount should be shared as
30/3 = 10
Final Output DataFrame:
emp opp amount
0 a 1 5
1 b 1 5
2 c 2 10
3 b 2 10
4 d 2 10
What is the best way to do this?
df['amount'] = df.groupby('opp')['amount'].transform(lambda g: g/g.size)
df
# emp opp amount
# 0 a 1 5
# 1 b 1 5
# 2 c 2 10
# 3 b 2 10
# 4 d 2 10
Or:
df['amount'] = df.groupby('opp', group_keys=False)['amount'].apply(lambda g: g/g.size)
does a similar thing.
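A lambda-free variant is also possible. The sketch below, run on the question's data, uses the built-in "size" aggregation inside transform; the `group_size` name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "emp": ["a", "b", "c", "b", "d"],
    "opp": [1, 1, 2, 2, 2],
    "amount": [10, 10, 30, 30, 30],
})

# transform("size") returns each row's group size, aligned to the original index
group_size = df.groupby("opp")["amount"].transform("size")

# Dividing two aligned Series is fully vectorized; the result is float
df["amount"] = df["amount"] / group_size
print(df)
```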
You could try something like this:
df2 = df.groupby('opp').amount.count()
df.loc[:, 'calculated'] = df.apply(
    lambda row: row.amount / df2.loc[row.opp], axis=1)
df
Yields:
emp opp amount calculated
0 a 1 10 5
1 b 1 10 5
2 c 2 30 10
3 b 2 30 10
4 d 2 30 10
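If the row-wise apply turns out to be slow on larger frames, the per-group counts can be looked up with map instead. This is a sketch of that alternative on the same data; the `counts` name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "emp": ["a", "b", "c", "b", "d"],
    "opp": [1, 1, 2, 2, 2],
    "amount": [10, 10, 30, 30, 30],
})

# Count the rows per opp once, then look that count up for every row
counts = df["opp"].map(df["opp"].value_counts())

df["calculated"] = df["amount"] / counts
print(df)
```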