I am having trouble with Pandas.
I try to compare each value of a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If the value is greater I want to turn it into a Boolean 1 or 0 if lower.
This should return a dataframe filled only with 1 or 0 so I can then summarize by columns.
I have tried the apply method but this doesn't work.
It returns a Pandas.Serie ( attached below )
def compare_to_cac(row):
for i in row:
if row[i] >= row['CAC 40']:
return 1
else:
return 0
data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out ?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop :
for column in df.columns:
if column == "CAC 40":
continue
condition = [df[column] > df["CAC 40"]]
value = [1]
df[column] = np.select(condition, value, default=0)
Which gives me as a result :
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
Related
I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If possible multiple minimal values per groups and want all min rows use boolean indexing with transform for minimal values per groups:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer worked great if there is / you want one min. In my case there could be multiple mins and I wanted all rows equal to min which .idxmin() doesn't give you. This worked
def filter_group(dfg, col):
return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record you can sort, then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If possible multiple minimal values per groups and want all min rows use boolean indexing with transform for minimal values per groups:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer worked great if there is / you want one min. In my case there could be multiple mins and I wanted all rows equal to min which .idxmin() doesn't give you. This worked
def filter_group(dfg, col):
return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record you can sort, then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
I have two dataframe, and i am able to merge it. but I want to merge it in specific format ( column wise), Below are the further details
>df1
id A B C
0 1 20 0 1
1 2 23 1 2
>df2
id A B C
0 1 10 1 1
1 2 20 1 1
Below is my code and output
df = pd.merge(df1,df2,on='id',suffixes=('_Pre', '_Post'))
The output of this is :
id A_Pre B_Pre C_Pre A_Post B_Post C_Post
0 1 20 0 1 10 1 1
1 2 23 1 2 20 1 1
But the EXPECTED output should be, Can someone help or guide me for this :
id A_Pre A_Post B_Pre B_Post C_Pre C_Post
0 1 20 10 0 1 1 1
1 2 23 20 1 1 2 1
When subsequently manipulation is possible you can do domething like:
df[np.array([[x+"_Pre", x+"_Post"] for x in df1.columns.drop("id")]).flatten()]
If you just want to modify the order of your columns you can use reindex :
df = df.reindex(columns=['A_Pre','A_Post','B_Pre','B_Post','C_Pre','C_Post'])
You can order the columns in the new dataset using sorted and just add the column "id" in a second statement
order_col = sorted(df.columns[1:], key=lambda x:x[:3])
df_final = pd.concat([df['id'],df[order_col]], axis=1)
Learning Python. I have a dataframe like this
cand1 cand2 cand3
0 40.0900 39.6700 36.3700
1 44.2800 44.2800 35.4200
2 43.0900 51.2200 46.3500
3 35.7200 55.2700 36.4700
and I want to rank each row according to the value of the columns, so that I get
cand1 cand2 cand3
0 1 2 3
1 1 1 3
2 1 3 2
3 3 1 2
I have now
for index, row in df.iterrows():
df.loc['Rank'] = df.loc[index].rank(ascending=False).astype(int)
print (df)
However, this keeps on repeating the whole dataframe. Note also the special case in row 2, where two values are the same.
Suggestion appreciated
Use df.rank instead of series rank
df_rank = df.rank(axis=1, ascending=False, method='min').astype(int)
Out[165]:
cand1 cand2 cand3
0 1 2 3
1 1 1 3
2 3 1 2
3 3 1 2
In a pandas dataframe, how can I drop a random subset of rows that obey a condition?
In other words, if I have a Pandas dataframe with a Label column, I'd like to drop 50% (or some other percentage) of rows where Label == 1, but keep all of the rest:
Label A -> Label A
0 1 0 1
0 2 0 2
0 3 0 3
1 10 1 11
1 11 1 12
1 12
1 13
I'd love to know the simplest and most pythonic/panda-ish way of doing this!
Edit: This question provides part of an answer, but it only talks about dropping rows by index, disregarding the row values. I'd still like to know how to drop only from rows that are labeled a certain way.
Use the frac argument
df.sample(frac=.5)
If you define the amount you want to drop in a variable n
n = .5
df.sample(frac=1 - n)
To include the condition, use drop
df.drop(df.query('Label == 1').sample(frac=.5).index)
Label A
0 0 1
1 0 2
2 0 3
4 1 11
6 1 13
Using drop with sample
df.drop(df[df.Label.eq(1)].sample(2).index)
Label A
0 0 1
1 0 2
2 0 3
3 1 10
5 1 12