I have a dataframe with an observation number, an id, and a value:
Obs# Id Value
--------------------
1 1 5.643
2 1 7.345
3 2 0.567
4 2 1.456
I want to calculate a new column that is the running mean of the values seen so far for each Id.
I am trying to use something like this but it only acquires the previous value:
df.groupby('Id')['Value'].apply(lambda x: x.shift(1) ...
My question is: how do I get the range of previous values filtered by Id so that I can calculate the mean?
So the new column based on this example should be
5.643
6.494
0.567
1.0115
It seems that you want expanding, then mean
df.groupby('Id').Value.expanding().mean()
Id
1.0 1 5.6430
2 6.4940
2.0 3 0.5670
4 1.0115
Name: Value, dtype: float64
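To attach this as a new column, one possible sketch (the column name Running_mean is just illustrative, assuming the frame is named df as in the question) is to drop the group level from the resulting MultiIndex so it aligns with the original rows:
df['Running_mean'] = (
    df.groupby('Id')['Value']
      .expanding().mean()
      .reset_index(level=0, drop=True)  # drop the 'Id' level so the index lines up with df
)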
You can also do it like:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Obs': [1, 2, 3, 4], 'Id': [1, 1, 2, 2], 'Value': [5.643, 7.345, 0.567, 1.456]})
df.groupby('Id')['Value'].apply(lambda x: x.cumsum() / np.arange(1, len(x) + 1))
It gives output as :
5.643
6.494
0.567
1.0115
I have a pandas dataframe with some group_ids, values and sizes such as this:
group_id  value  size
0         10     1
0         10     3
1         5      2
2         6      4
Rows with the same group_id also have the same value.
I would like to "distribute" the value of entries within the same group according to size. So for example the first row should be updated to have value = 10 * 1 / (1 + 3) = 2.5, while the second row should be updated to have value = 10 * 3 / (1+3) = 7.5, and the rest of the entries should not change (since there are no other rows in its group).
I tried iterating over the groups with the same group_id using a groupby construct, but from there I am a bit lost. I guess that if I could get the index of the rows within a group I could slice the original dataframe and manipulate each group in turn. But I don't know how to do that, nor whether it is the most pythonic way of doing it.
Multiply value by size, then divide by the per-group sum of size obtained with groupby.transform on the size column:
df['value'].mul(df['size']).div(df.groupby("group_id")['size'].transform('sum'))
0 2.5
1 7.5
2 5.0
3 6.0
dtype: float64
Assign this to a new column or replace an existing one as per your requirement
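For example, a minimal sketch (the column name distributed_value is just illustrative):
df['distributed_value'] = (  # hypothetical column name
    df['value'].mul(df['size'])
               .div(df.groupby('group_id')['size'].transform('sum'))
)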
I have two data frames, df (with 15000 rows) and df1 (with 20000 rows), where df looks like:
Number Color Code Quantity
1 Red 12380 2
2 Bleu 14440 3
3 Red 15601 1
and df1 has two columns, Code and Quantity, where I want to fill the Quantity column under certain conditions using Python, in order to obtain this:
Code Quantity
12380 2
15601 1
15640 1
14400 0
The conditions that I want to take into consideration are:
If the last two characters of the Code column of df1 are both zero, I want to have 0 in the Quantity column of df1.
If I don't find the Code in df, I put 1 in the Quantity column of df1.
Otherwise I take the quantity value from df.
Let us try:
mask = df1['Code'].astype(str).str[-2:].eq('00')
mapped = df1['Code'].map(df.set_index('Code')['Quantity'])
df1['Quantity'] = mapped.mask(mask, 0).fillna(1)
Details:
Create a boolean mask specifying the condition where the last two characters of Code are both 0:
>>> mask
0 False
1 False
2 False
3 True
Name: Code, dtype: bool
Using Series.map, map the values in the Code column of df1 to the Quantity column in df based on the matching Code:
>>> mapped
0 2.0
1 1.0
2 NaN
3 NaN
Name: Code, dtype: float64
Mask the values in the mapped column where the boolean mask is True (setting them to 0), and lastly fill the remaining NaN values with 1:
>>> df1
Code Quantity
0 12380 2.0
1 15601 1.0
2 15640 1.0
3 14400 0.0
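Putting it together, a self-contained sketch using the sample data assumed from the question above:
import pandas as pd

df = pd.DataFrame({'Number': [1, 2, 3],
                   'Color': ['Red', 'Bleu', 'Red'],
                   'Code': [12380, 14440, 15601],
                   'Quantity': [2, 3, 1]})
df1 = pd.DataFrame({'Code': [12380, 15601, 15640, 14400]})

mask = df1['Code'].astype(str).str[-2:].eq('00')            # last two digits are both 0
mapped = df1['Code'].map(df.set_index('Code')['Quantity'])  # look up Quantity by Code
df1['Quantity'] = mapped.mask(mask, 0).fillna(1)            # 0 where masked, 1 where no match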
I have a DataFrame A in Jupyter that looks like the following:
Index Var1.A.1 Var1.B.1 Var1.CA.1 Var2.A.1 Var2.B.1 Var2.CA.1
0 1 21 3 3 4 4
1 3 5 4 9 5 1
....
100 9 75 2 4 8 2
I'd like to compute the mean value based on the extension of the column name, i.e.
Mean value of .A.1
Mean Value of .B.1
Mean value of .CA.1
For example, to compute the mean value of the variables with extension .A.1, I've tried the following, which doesn't return what I'm looking for:
List=['.A.1', '.B.1', '.CA.1']
A[List[List.str.contains('.A.1')]].mean()
However, in this way I get the mean values of the different variables, also including CA.1, which is not what I'm looking for.
Any advice?
thanks
If you want the mean per row, grouping the columns by everything after the first '.', use groupby with a lambda function and mean:
df = df.groupby(lambda x: x.split('.', 1)[-1], axis=1).mean()
print (df)
A.1 B.1 CA.1
0 2.0 12.5 3.5
1 6.0 5.0 2.5
100 6.5 41.5 2.0
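Note that newer pandas versions deprecate groupby(..., axis=1); a roughly equivalent sketch (assuming the frame is named df as above) groups the transposed frame and transposes back:
df_mean = df.T.groupby(lambda x: x.split('.', 1)[-1]).mean().T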
Here is a third option:
columns = A.columns
A[[s for s in columns if ".A.1" in s]].stack().mean()  # overall mean of the '.A.1' columns
A.filter(like='.A.1') gives you the columns containing the '.A.1' substring.
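As a small follow-up sketch under the same assumptions, the per-row or per-column means of just those columns would then be:
A.filter(like='.A.1').mean(axis=1)  # per-row mean across the '.A.1' columns
A.filter(like='.A.1').mean()        # per-column means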
I have a pandas dataframe df with (at least) two columns: id, value and possibly more. id's are not unique. I need to filter the dataframe so that only one row per id remains. The row I want to select is the row where value is not NaN. It is guaranteed that there is at most one such row. For those id's with all NaN's in the value column I don't care which row is selected. What is the best way to achieve this?
Example: if the dataframe is
id other value
0 0 3.14
0 1 NaN
1 2 NaN
1 3 NaN
the result may be either
id other value
0 0 3.14
1 2 NaN
or
id other value
0 0 3.14
1 3 NaN
You could use sort_values; its na_position parameter defaults to 'last', meaning that it will push all NaN values for that column to the bottom. Therefore, you can use the following to get a single record for each 'id':
df.sort_values(by='value').groupby('id').head(1)
Output:
id other value
0 0 0 3.14
2 1 2 NaN
Timing:
Abdou Solution:
f = lambda x: x.head(1) if x.value.isnull().all() else x[~x.value.isnull()].head(1)
df.groupby('id').apply(f)
100 loops, best of 3: 5.62 ms per loop
This solution
df.sort_values(by='value').groupby('id').head(1)
1000 loops, best of 3: 1.44 ms per loop
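A roughly equivalent variant (not from the answers above, just a possible alternative) is to drop duplicates after the same sort, since drop_duplicates also keeps the first row per id:
df.sort_values(by='value').drop_duplicates('id')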
Assuming that your dataframe is called dff, the following should do:
f = lambda x: x.head(1) if x.value.isnull().all() else x[~x.value.isnull()].head(1)
dff.groupby('id').apply(f)
Output:
# id other value
# id
# 0 0 0 0 3.14
# 1 2 1 2 NaN
It groups the dataframe by the id column first. If all elements in the value column are null, it takes the first row. Otherwise, it filters out null values and takes the first row of the output.
I hope this helps.
What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_100 columns there will need to be. I just need to gather all associated atc_codes into one row for each visit_id.
This seems like a groupby and then a pivot, but I have tried many times and failed. I also tried to self-join à la SQL using pandas' merge(), but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
Use cumcount first to count values per group; these counts become the columns created by pivot. Then add the missing columns with reindex_axis, change the column names with add_prefix, and finally reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (pd.pivot(index=df['visit_id'], columns=g, values=df['atc_code'])
        .reindex_axis(range(1, 8), 1)
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
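On newer pandas versions reindex_axis has been removed; a roughly equivalent sketch (assuming df has columns visit_id and atc_code) uses set_index with the cumcount key and unstack:
g = df.groupby('visit_id').cumcount() + 1
out = (df.set_index(['visit_id', g])['atc_code']
         .unstack()                     # pivot the per-group counter into columns
         .reindex(columns=range(1, 8))  # ensure columns 1..7 all exist
         .add_prefix('atc_')
         .reset_index())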