I have a data frame:
import pandas as pd

a = pd.DataFrame([[1, 1, 9], [2, 1, 9], [3, 2, 9], [4, 2, 9]], columns=['a', 'b', 'c'])
a b c
0 1 1 9
1 2 1 9
2 3 2 9
3 4 2 9
If I run
a['c'].iloc[0]=100
it works and I get:
a b c
0 1 1 100
1 2 1 9
2 3 2 9
3 4 2 9
But if I want to update the first observation of group b==2 by running
a['c'][a['b']==2].iloc[0]=100
It doesn't do what I want it to do; I still get the same dataframe:
a b c
0 1 1 100
1 2 1 9
2 3 2 9
3 4 2 9
I wonder why, and what's a possible solution for this?
Thank you for your help.
You should use .loc like this; chaining .iloc with .loc can sometimes cause this issue. As the pandas documentation warns:
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided.
a.loc[a.index[a.b==2][0],'c']=10000
a
Out[761]:
a b c
0 1 1 9
1 2 1 9
2 3 2 10000
3 4 2 9
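An equivalent one-liner uses idxmax to grab the label of the first row where b == 2 (a sketch assuming at least one such row exists, since idxmax would otherwise just return the first label):
a.loc[(a['b'] == 2).idxmax(), 'c'] = 10000
Either way the update goes through a single .loc call, so there is no chained assignment.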
Overview
When creating a conditional count_cumsum column in Pandas, I created a temporary Count column and then deleted it after the desired column was created.
Code
import numpy as np
import pandas as pd

df = pd.DataFrame({"Level": [1, 2, 3, 4, 5, 6, 7, 8],
                   "Price": [2, 3, 4, 5, 6, 7, 1, 10]})
df["Count"] = np.where(df.Price > df.Level, 1, np.nan)
df['count_cumsum'] = df.Count.groupby(df.Count.isna().cumsum()).cumsum()
del df["Count"]
Level Price count_cumsum
0 1 2 1.0
1 2 3 2.0
2 3 4 3.0
3 4 5 4.0
4 5 6 5.0
5 6 7 6.0
6 7 1 NaN
7 8 10 1.0
Question
How can I use zero instead of NaN for the df["Count"] column so that count_cumsum stays an int column, and is there a simpler way to produce this output?
Desired output
Level Price count_cumsum
0 1 2 1
1 2 3 2
2 3 4 3
3 4 5 4
4 5 6 5
5 6 7 6
6 7 1 0
7 8 10 1
To use zero instead of NaN, you can replace np.nan with 0 and replace isna() with eq(0) in your code. That change is simple enough to do yourself based on this hint, so I will go straight to simplifying the code below.
You can simplify the processing logic as follows:
# Replace np.where on the boolean condition (mapping True/False to 1/0) with astype(int)
m = (df.Price > df.Level).astype(int)
# Use the Series m for grouping and for the cumulative sum
df['count_cumsum'] = m.groupby(m.eq(0).cumsum()).cumsum()
In this way, you simplify the code by:
avoiding the temporary column df["Count"] and the need to delete it afterwards, and
replacing np.where((df.Price > df.Level), 1, 0) with a plain conversion of the boolean condition (df.Price > df.Level) to integer (0 for False and 1 for True).
A self-contained version of the same idea is sketched below.
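Putting the pieces together, a minimal self-contained sketch (assuming the DataFrame from the question):
import pandas as pd

df = pd.DataFrame({"Level": [1, 2, 3, 4, 5, 6, 7, 8],
                   "Price": [2, 3, 4, 5, 6, 7, 1, 10]})

m = (df.Price > df.Level).astype(int)                      # 1 where Price > Level, else 0
df['count_cumsum'] = m.groupby(m.eq(0).cumsum()).cumsum()  # running count restarts after each 0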
Result:
print(df)
Level Price count_cumsum
0 1 2 1
1 2 3 2
2 3 4 3
3 4 5 4
4 5 6 5
5 6 7 6
6 7 1 0
7 8 10 1
You can avoid the NaNs altogether with the clean and readable solution below:
df = pd.DataFrame({"Level":[1,2,3,4,5,6,7,8],
"Price":[2,3,4,5,6,7,1,10]})
df["Count"] = np.where((df.Price > df.Level),1,0)
df['count_cumsum'] = df.Count.groupby((df.Count == 0).cumsum()).cumsum()
del df["Count"]
Level Price count_cumsum
0 1 2 1
1 2 3 2
2 3 4 3
3 4 5 4
4 5 6 5
5 6 7 6
6 7 1 0
7 8 10 1
This leaves everything as an int type as well, which seems to be what you're after.
I am new to data science and I am currently using the Pandas library in a Jupyter notebook. Sorry for my poor English.
A,1,5,9
B,2,6,3
A,3,7,2
B,4,8,1
After creating the DataFrame from the CSV values above, how can I group by the first column and sum the contents?
I want the output to be something like this:
A B
0 4 6
1 12 14
2 11 4
Thanks in advance :)
You can use df.pivot_table
df
0 1 2 3
0 A 1 5 9
1 B 2 6 3
2 A 3 7 2
3 B 4 8 1
df.pivot_table(columns=0,aggfunc='sum').rename_axis(columns=None)
A B
1 4 6
2 12 14
3 11 4
This is groupby + sum with transpose:
print(df.groupby(0).sum().T.rename_axis(columns=None))
A B
1 4 6
2 12 14
3 11 4
Note: replace the 0 in df.groupby(0) with the actual name of your first column.
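For completeness, a sketch of the full flow starting from the raw CSV values (assuming they live in a headerless file; 'data.csv' is only a placeholder name):
import pandas as pd

df = pd.read_csv('data.csv', header=None)   # 'data.csv' is a placeholder; column 0 holds the A/B labels
out = df.groupby(0).sum().T.rename_axis(columns=None).reset_index(drop=True)
print(out)
reset_index(drop=True) just renumbers the rows 0, 1, 2 to match the desired output.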
I am trying to implement a permutation test on a large Pandas dataframe. The dataframe looks like the following:
group some_value label
0 1 8 1
1 1 7 0
2 1 6 2
3 1 5 2
4 2 1 0
5 2 2 0
6 2 3 1
7 2 4 2
8 3 2 1
9 3 4 1
10 3 2 1
11 3 4 2
I want to group by the group column, shuffle the label column within each group, and write it back to the data frame, preferably in place. The some_value column should remain intact. The result should look something like the following:
group some_value label
0 1 8 1
1 1 7 2
2 1 6 2
3 1 5 0
4 2 1 1
5 2 2 0
6 2 3 0
7 2 4 2
8 3 2 1
9 3 4 2
10 3 2 1
11 3 4 1
I used np.random.permutation but found it was very slow.
df["label"] = df.groupby("group")["label"].transform(np.random.permutation
It seems that df.sample is much faster. How can I solve this problem using df.sample() instead of np.random.permutation, and in place?
We can use sample. Note that this assumes df = df.sort_values('group'):
df['New']=df.groupby('group').label.apply(lambda x : x.sample(len(x))).values
Or we can do it by:
df['New']=df.sample(len(df)).sort_values('group').New.values
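If you want to overwrite label in place rather than add a New column, a sketch under the same assumption (df already sorted by 'group', so the positional assignment lines up) would be:
df['label'] = df.groupby('group')['label'].apply(lambda x: x.sample(frac=1)).values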
What about providing a custom transform function?
def sample(x):
    return x.sample(n=x.shape[0])

df.groupby("group")["label"].transform(sample)
This SO explanation of printing out what is passed into the custom function via the transform function is helpful.
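Along those lines, a small sketch to see what transform hands to the custom function (each call receives one group's label values as a Series):
def peek(x):
    print(type(x), list(x))   # one group's 'label' values, passed in as a pandas Series
    return x

df.groupby("group")["label"].transform(peek)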
Looking for advice on how to solve the following problem.
I have a Pandas dataframe, let's say 1,000,000 rows by 10 columns (A, B, C, ..., J). The data type is float.
The task is to remove every row (i) in the dataframe for which there exists another row (j) whose values are all equal to or greater than the values in the original row (i):
(Ai<=Aj) and (Bi<=Bj) and (Ci<=Cj) ... and (Ji<=Jj)
I wonder whether any tools exist in the pandas toolkit, or in any other analytics Python module, to solve this problem efficiently.
I have a very inefficient solution with multiple iterations over a plain array, and I am hoping to find something more promising.
Simplified example, original data:
0 1 5 4 4 2
2 5 6 4 3 7
-2 5 6 5 3 7
0 0 0 0 0 1
0 0 0 0 0 8
Result to be:
0 1 5 4 4 2
2 5 6 4 3 7
-2 5 6 5 3 7
0 0 0 0 0 8
There is a way using NumPy:
import numpy as np
df[np.any(np.sum(df.values >= df.values[:, None], 1) == 1, 1)]
Out[40]:
A B C D E F
0 0 1 5 4 4 2
1 2 5 6 4 3 7
2 -2 5 6 5 3 7
4 0 0 0 0 0 8
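For comparison, a brute-force sketch that applies the stated rule literally via pairwise comparisons; it is O(n^2) in the number of rows, so likely too slow for 1,000,000 rows, but handy for checking results on small samples (note that under this literal reading two identical rows would remove each other):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 5, 4, 4, 2],
                   [2, 5, 6, 4, 3, 7],
                   [-2, 5, 6, 5, 3, 7],
                   [0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 8]],
                  columns=list('ABCDEF'))

vals = df.to_numpy()
# geq[i, j] is True when row j is >= row i in every column
geq = (vals[None, :, :] >= vals[:, None, :]).all(axis=2)
np.fill_diagonal(geq, False)          # a row should not count as dominating itself
dominated = geq.any(axis=1)           # row i is dominated if some other row j covers it
print(df[~dominated])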
Say I have a dataframe like this, where filename is the index:
filename a b c
1 1 2 3
1 1 3 4
2 2 2 2
2 3 2 5
2 8 9 9
3 4 8 6
3 1 1 1
I want to divide this dataframe into three dataframes and then process them one by one in a loop. Each dataframe contains the rows that share the same filename, like this:
dataframe1:
filename a b c
1 1 2 3
1 1 3 4
dataframe2:
filename a b c
2 2 2 2
2 3 2 5
2 8 9 9
dataframe3:
filename a b c
3 4 8 6
3 1 1 1
Also, in my situation, I actually don't know in advance how many sub-dataframes I will get, so I want the program to figure this out too; then I can use a loop to process each sub-dataframe.
How can I do this in python pandas? Thanks!
You can simply do this if you want to get the number of groups:
group = df.groupby('filename')
group.ngroups
And if you want to apply your own processing, you can use apply, which takes your custom function as a parameter and passes each group to it:
group.apply()
You can try this simple function to understand what the input to your custom function looks like:
def print_group(df):
    print(df)
    print('-------------------')
group.apply(print_group)
a b c
0 1 2 3
1 1 3 4
-------------------
a b c
2 2 2 2
3 3 2 5
4 8 9 9
-------------------
a b c
5 4 8 6
6 1 1 1
-------------------
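If you would rather drive the loop yourself, iterating over the groupby object gives you each (filename, sub-dataframe) pair directly, and you never need to know the number of groups in advance. A sketch:
for filename, sub_df in df.groupby('filename'):
    # sub_df holds all rows for this filename
    print(filename)
    print(sub_df)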