I have a Pandas dataframe that contains some columns. Each column has some different values. See the image.
In col1 the value 1 is more frequent than the others, so I need to transform this column to have the values 1 and "more than 1".
How can I do that?
My goal here is to transform this column into a categorical column, but I have no idea how to do that.
The output expected is something like the next image:
Try the clip function on the column:
df["col1"].clip(upper=2)
0 1
1 2
2 2
3 2
4 1
5 2
6 2
7 1
8 1
9 1
10 1
11 2
12 1
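If the end goal is a labelled categorical column, you can map the clipped values onto labels and cast the result; a minimal sketch, assuming df["col1"] holds the integers shown above (the "1"/">1" label strings are just an illustration):
import pandas as pd

df["col1"] = (df["col1"].clip(upper=2)           # cap everything above 1 at 2
                        .map({1: "1", 2: ">1"})  # 2 stands for "more than 1"
                        .astype("category"))     # make it a categorical dtype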
I am having trouble with Pandas.
I am trying to compare each value of a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If the value is greater, I want to turn it into 1, or into 0 if it is lower.
This should return a dataframe filled only with 1 or 0 so I can then summarize by columns.
I have tried the apply method but this doesn't work.
It returns a pandas Series (attached below).
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Can someone please help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:
import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me as a result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
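A vectorized alternative, sketched on the same df (note that, unlike the loop above, this drops the CAC 40 column from the result instead of leaving it unchanged):
result = (df.drop(columns="CAC 40")    # everything except the reference column
            .gt(df["CAC 40"], axis=0)  # row-wise comparison against CAC 40
            .astype(int))              # booleans -> 1/0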
I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first 2 columns ('a' and 'b') are IDs while the last one ('c') is a validation (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first 2 columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated data validated both as positive and negative. So, for example, the first 2 rows are duplicated but inconsistent, hence I should remove the entire record, while the last 2 rows are both duplicated and consistent, so I'd keep one of the records. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group, and
as you can see, the index has also been changed. Thanks.
First filter the rows with GroupBy.transform and SeriesGroupBy.nunique to keep, via boolean indexing, only the groups with a single unique value, and then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
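An equivalent alternative, sketched with GroupBy.filter (generally slower than the transform above on large frames, but arguably more readable):
df = (df.groupby(['a', 'b'])
        .filter(lambda g: g['c'].nunique() == 1)  # keep only consistent groups
        .drop_duplicates(['a', 'b'])
        .reset_index(drop=True))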
I am trying to check whether the values in each row of Dataframe "Actual" match values in the same row of Dataframe "estimate". Column position is not important. The value just needs to exist on the same row level between the two dataframes. The Dataframes can be concatenated/merged, if need be.
I present my code below:
Actual=pd.DataFrame([[4,7,2,8,1],[1,5,7,9,8]], columns=['Actual1','Actual2','Actual3','Actual4','Actual5'])
estimate=pd.DataFrame([[1,2,7,9,3],[0,8,2,5,9]], columns=['estimate1','estimate2','estimate3','estimate4','estimate5'])
Actual
Actual1 Actual2 Actual3 Actual4 Actual5
0 4 7 2 8 1
1 1 5 7 9 8
estimate
estimate1 estimate2 estimate3 estimate4 estimate5
0 1 2 7 9 3
1 0 8 2 5 9
My attempt using Pandas:
for loop1 in range(1,6,1):
    for loop2 in range(1,6,1):
        Actual['want'+str(loop1)]=np.where(Actual['Actual'+ str(loop1)] == estimate['estimate' + str(loop2)],1,0)
and finally, the output that I would like:
want=pd.DataFrame([[0,1,1,0,1],[0,1,0,1,1]], columns=['want1','want2','want3','want4','want5'])
want
want1 want2 want3 want4 want5
0 0 1 1 0 1
1 0 1 0 1 1
So, as I was mentioning earlier, since the value 4 from Dataframe "Actual" does not exist anywhere in the first row of dataframe "estimate", column "want1" has been assigned the value 0. Likewise, for column 5 of the first row of Dataframe "Actual", where the value is 1: since this value exists in the first row of dataframe "estimate" (column location does not matter), column 'want5' has been assigned the value 1.
Thanks.
Assuming that the indices in your Actual and estimate DataFrames are the same, one approach would be to just apply a check along the columns with isin.
Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Here we use the name attribute as the glue between the two DataFrames.
Demo
>>> Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Actual1 Actual2 Actual3 Actual4 Actual5
0 0 1 1 0 1
1 0 1 0 1 1
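A NumPy broadcasting alternative, as a sketch under the same assumption of matching row order (it compares every Actual value against every estimate value in the same row in one shot):
import numpy as np
import pandas as pd

mask = (Actual.values[:, :, None] ==
        estimate.values[:, None, :]).any(axis=2)  # (rows, 5, 5) -> (rows, 5)
want = pd.DataFrame(mask.astype(int),
                    columns=['want' + str(i) for i in range(1, 6)])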
I am working with groupby on a pandas DataFrame. The point of the code is to group the data so that no matter how many times I query and write the data out to MySQL, it won't mess with my raw data.
df1=pd.DataFrame(df) #this is a DataFrame with multiple different lines of 'Open' for one 'Symbol'
df2=pd.read_sql('select * from 6openposition',con=conn)
df2=df2.append(df1)
df2=df2.groupby(['Symbol']).agg({'Open':'first'})
df2.to_sql(name='6openposition', con=conn, if_exists='replace', index= False, flavor = 'mysql')
#Example Raw Data:
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
#After I query the data multiple times (appending each time):
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
0 AA 30
1 AAA 40
2 AAA 50
3 AAA 50
4 AAA 60
#What my code ended up with:
Symbol Open
0 A 10
1 AA 20
2 AAA 40
#What I want:
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
6 AAA 60
My raw data can have multiple values in column 'Open' for the same 'Symbol'.
When I try to eliminate the effect of writing the input to MySQL multiple times, the raw data gets altered as well.
My thought on solving this problem is to group by the initial index and 'Symbol' at the same time, because after appending, the initial indices could serve as another 'group by' column. The initial indices are [0,1,2,...]. If the 'Symbol' and the initial index are the same, I could take the first value of 'Open' in that group. To group by the initial index I could:
df2=df2.groupby(level=0).agg({'Open':'first'})
#this code will combine the lines with the same index and take the first value of the 'Open' column
But I have no idea how to combine level=0 with 'Symbol'. Could you teach me how to group by two keys at once, the initial index and another column? Or tell me another way to keep multiple rounds of input from messing with my raw data.
Starting with df, including your index which seems to indicate whether data are repeated:
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
Use
df.reset_index().drop_duplicates().drop('index', axis=1)
(keeps first occurrence by default) to get:
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
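As for the two-key groupby you asked about, you can mix the index with a column name; a sketch on the same df (note this keeps only the first Open per (index, Symbol) pair, so a genuinely new value arriving at an already-used index, like the AAA 60 row in your example, would be lost, which is why drop_duplicates is the safer choice here):
out = (df.groupby([df.index, 'Symbol'])  # group by initial index AND Symbol
         .agg({'Open': 'first'})         # first Open per (index, Symbol) pair
         .reset_index(level=1))          # move Symbol back to a column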