Replacing values of a DataFrame column - Python

I have the following dataframe:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, -3, -4, 5],
                   "B": [1, -2, 3, -4, 5]})
I want to replace, just in column A,
all positive values with 1 and all negative values with 0.
I tried to do it this way:
df[df["A"] > 0]["A"] = 1
df[df["A"] < 0]["A"] = 0
but that didn't work (the dataframe didn't change at all).
However, the code below did work:
df["A"][df["A"] > 0] = 1
df["A"][df["A"] < 0] = 0
Can anyone tell me what the difference between the two is?
Why didn't the first one work, while the second one did?
Thanks!

To put it simply:
df[df["A"] > 0]["A"] gives you a copy of the data,
while df["A"][df["A"] > 0] gives you a view of it.
The copy isn't linked to the original dataframe, so changing it does nothing to the original; the view is linked, which is why the second version updates df.
See this link for more info:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
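For what it's worth, the same user guide recommends doing this kind of assignment with a single .loc call, which sidesteps the copy-versus-view question entirely. A minimal sketch:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, -3, -4, 5],
                   "B": [1, -2, 3, -4, 5]})

# One indexing step per assignment, so pandas writes straight into df
# instead of into a temporary copy produced by chained indexing.
df.loc[df["A"] > 0, "A"] = 1
df.loc[df["A"] < 0, "A"] = 0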

As an alternative to the above, you can use np.where:
import numpy as np
df['A'] = np.where(df['A'] > 0, 1, 0)  # zeros end up as 0 here
which essentially replaces all positive values with 1 and everything else with 0.
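If you'd rather not import numpy, an equivalent for the same positive-to-1, everything-else-to-0 mapping (my own sketch, not part of the answer above) is to cast the boolean comparison directly:
df['A'] = (df['A'] > 0).astype(int)  # True -> 1, False -> 0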

Related

Issue w/ pandas.index.get_loc() when no match is found, TypeError: ("'>' not supported between instances of 'NoneType' and 'str'", 'occurred at index 1')

Below is the example to reproduce the error:
import pandas as pd

testx1df = pd.DataFrame()
testx1df['A'] = [100, 200, 300, 400]
testx1df['B'] = [15, 60, 35, 11]
testx1df['C'] = [11, 45, 22, 9]
testx1df['D'] = [5, 15, 11, 3]
testx1df['E'] = [1, 6, 4, 0]

(testx1df[testx1df < 6]
    .apply(lambda x: x.index.get_loc(x.first_valid_index(), method='ffill'), axis=1))
The desired output should be a list or array with the values [3, NaN, 4, 3]; the NaN is there because the second row does not satisfy the criterion.
I checked the pandas reference and it says that when you do not have an exact match you can set "method" to 'ffill', 'bfill', or 'nearest' to pick the previous, next, or closest index. Based on this, I expected that indicating the method as 'ffill' would give me an index of 4 instead of NaN. However, when I do so it does not work and I get the error shown in the question title. For criteria higher than 6 it works fine, but for less than 6 it doesn't, because the second row of the dataframe has no value satisfying it.
Is there a way around this issue? Shouldn't it work for my example (and return a previous index of 3 or 4)?
One solution I thought of is to add a dummy column populated by zeros so that there is always a place to "find" an index that satisfies the criterion, but that feels a bit crude and I think there is a more efficient solution out there.
Please try this:
import numpy as np

# testx1df[testx1df < 6] keeps values < 6 and turns the rest into NaN;
# counting the NaNs per row tells how many values failed the criterion.
ls = list(testx1df[testx1df < 6].T.isna().sum())
# A row where every value failed (count == number of columns) becomes NaN.
ls = [np.nan if x == testx1df.shape[1] else x for x in ls]
print(ls)  # [3, nan, 4, 3]
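For comparison, here is a purely positional sketch that avoids get_loc altogether (my own alternative, not taken from the answer above): argmax over the boolean mask returns the position of the first True in each row, and rows with no match at all are reset to NaN.
import numpy as np

mask = testx1df < 6                                   # True where the criterion holds
first_pos = mask.values.argmax(axis=1).astype(float)  # position of the first True per row
first_pos[~mask.any(axis=1).values] = np.nan          # rows with no True at all -> NaN
print(first_pos)                                      # [ 3. nan  4.  3.]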

KeyError: True for conditional indexing in pandas DataFrame

I have a dataframe containing two columns: one filled with a string (irrelevant here), and the other one holding (a reference to) a dataframe.
Now I want to keep only the rows where the dataframes in the second column have entries, i.e. len(df.index) > 0 (there should be rows left; I don't care about columns).
I know that sorting out rows like this works perfectly fine for me if I use it in a list comprehension and can check every entry on its own, like in the following example:
[do_x for a, inner_df
in zip(outer_df.index, outer_df["inner"])
if len(inner_df.index) > 0]
But if I try using it for conditional indexing to create a shorter version of the dataframe, it produces the error KeyError: True.
I thought that putting len() around it could be the problem, so I also tried different approaches to check for zero rows. Below are four of the ways I tried it:
# a) with the length of the index
outer_df = outer_df.loc[len(outer_df["inner"].index) > 0, :]
# b) same, but with a lambda, just like in the pandas docs user guide
# (I used it on the other versions too, with no change in result)
outer_df = outer_df.loc[lambda df: len(df["inner"]) > 0, :]
# c) switching
outer_df = outer_df.loc[outer_df["inner"].index.size > 0, :]
# d) even "shorter" version
outer_df = outer_df.loc[not outer_df["inner"].empty, :]
So... where is my error? Can I even do this with conditional indexing, or do I need to find another way?
Edit: Changed and added some sentences above for more clarity plus added all below.
I know that the filtering here works by comparing, which creates a Series of True/False values of the same length as the dataframe, and then only the rows that line up with a True are kept.
However, I do not see a fundamental difference between my attempt to create such a Series and the following examples (source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/):
# 1st difference: the resulting Series is *not* altered further,
# the column just gets compared directly with the value 80
# -> I thought this might be the problem, but then there is also #2
df = df[df['Percentage'] > 80]
# or
df = df.loc[df['Percentage'] > 80]
# 2. Here the entry is checked in a similar way to my c and d
options = ['x', 'y']
df = df[df['Stream'].isin(options)]
# or
df = df.loc[df['Stream'].isin(options)]
In both number 2 here and my versions c & d, the entry in the cell (a string // a dataframe) is checked for something (is part of a list // is empty).
Not sure if I understand your question or where you are stuck; however, I will just write my comment as this answer so that I can easily edit the post.
First, try typing myvar = df['Percentage'] > 80 and look at what myvar is. See if its content makes sense to you.
There is really only one true rule of .loc[]: it takes a truth table (a boolean mask).
Because the df[stuff] expression always appears inside .loc[df[stuff] expression], you might get the impression that df[stuff] expression has some special meaning. For example, df[df['Percentage'] > 80] asks for any Percentage greater than 80, which looks quite intuitive! So... df['Percentage'] > 80 must be a "special syntax"? In reality, df['Percentage'] > 80 isn't anything special; it is just another truth table. What goes inside the brackets will always be a truth table, that's it.
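Applied to the question above, the fix is to build that truth table element-wise, because len(...) or .empty evaluated on the whole column only produces a single scalar (which is exactly what caused KeyError: True). A minimal sketch, reusing the 'inner' column name from the question:
# One boolean per row: does the nested dataframe in this cell have any rows?
keep = outer_df["inner"].map(lambda inner_df: len(inner_df.index) > 0)
outer_df = outer_df.loc[keep]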

Check each value in one DataFrame; if it is less than a variable, replace the value at the same place in another DataFrame (same size) with 0

So I have 2 dataframes: one with values, and the second one with test statistic (TS) values. I need to check each cell in the TS dataframe, and if its value is smaller than 1, change the value at the same position in the other dataframe to 0.
I've tried to map them but could not find the right way:
yearly_flux = yearly_flux.map(lambda x : 0 ts_yearly_flux else x, ts_yearly_flux)
I have no idea whether it can be solved like this, but I've tried.
It's my second question, so sorry if something is missing.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.normal(size=(5, 10)))
df2 = pd.DataFrame(np.random.normal(size=(5, 10)))

# Wherever df1 is below 1, write 0 into df2; otherwise keep df2's value.
df2[:] = np.where(df1 < 1, 0, df2)
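A pandas-only alternative (my sketch, equivalent under the same assumptions) is DataFrame.mask, which replaces values wherever the condition is True:
df2 = df2.mask(df1 < 1, 0)  # 0 where df1 < 1, df2's own value elsewhere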

Creating a new column, but it creates a copy of the dataframe

I would like to check the value of the row above and see if it is the same as the current row. I found a great answer here: df['match'] = df.col1.eq(df.col1.shift()), where col1 is the column you are comparing.
However, when I tried it, I received a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. My col1 is a string. I know you can suppress warnings, but how would I check against the row above while making sure I am not creating a copy of the dataframe? Even with the warning I do get my desired output, but I was curious whether there is a better way.
import pandas as pd

data = {'col1': ['a','a','a','b','b','c','c','c','d','d'],
        'week': [1,1,1,1,1,2,2,2,2,2]}
df = pd.DataFrame(data, columns=['col1','week'])
df['check_condition'] = 1

while sum(df.check_condition) != 0:
    for week in df.week:
        wk = df.loc[df.week == week]
        wk['match'] = wk.col1.eq(wk.col1.shift())  # <-- where the warning occurs
        # fix the repetitive value...which I have not done yet
        # for now just exit out of the while loop
        df.loc[df.week == week, 'check_condition'] = 0
You can't ignore a pandas SettingWithCopyWarning!
It's telling you that your code is not going to work as intended, if at all. Stop, investigate and fix it. (It's not an ignorable thing you can filter out, like a pandas FutureWarning nagging about deprecation.)
There are multiple issues with your code:
You're trying to iterate over a dataframe (but not with groupby()) and take slices of it (the sub-dataframe wk, which yes, is a copy of a slice)...
...and then assign to the (nonexistent) new column wk['match']. This is bad, you shouldn't do it. (You could initialize df['match'] = np.nan, but it would still be wrong to assign to the copy in wk.)
SettingWithCopyWarning is triggered when you try to assign to wk['match']. It's telling you that wk is a copy of a slice from the dataframe df, not df itself, hence the message: A value is trying to be set on a copy of a slice from a DataFrame. That assignment would just get thrown away every time wk is overwritten by your loop, so even if you could force it to work on wk it would be wrong. That's why SettingWithCopyWarning is a code smell: you shouldn't be making a copy of a slice of df in the first place.
Later on, you also assign to the column df['check_condition'] while iterating over df, which is also bad.
Solution:
df['check_condition'] = df['col1'].eq(df['col1'].shift()).astype(int)
df
col1 week check_condition
0 a 1 0
1 a 1 1
2 a 1 1
3 b 1 0
4 b 1 1
5 c 2 0
6 c 2 1
7 c 2 1
8 d 2 0
9 d 2 1
More generally, for more complicated code where you want to iterate over each group of a dataframe according to some grouping criterion, you'd use groupby() and split-apply-combine instead. Here:
you're grouping on whether col1 is equal to the preceding row's value, i.e. df['col1'].eq(df['col1'].shift()),
and you want check_condition to be 1 on rows where col1 did not change from the preceding row,
and 0 on rows where it did change.
But in this simpler case you can skip groupby() and do the direct assignment above; a per-week groupby() version is sketched below.
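For illustration, that groupby() version might look like this (a sketch; it produces the same check_condition column here, and would only differ if the first row of a new week had to be treated as a fresh start):
# Restart the row-to-row comparison inside each week group.
df['check_condition'] = (df.groupby('week')['col1']
                           .transform(lambda s: s.eq(s.shift()).astype(int)))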

How to replace loops (cycles) in pandas

I have a dataframe and I need to change the 3rd column according to the rule:
1) if the difference between row i+1 and row i of the 2nd column is > 1, then the 3rd column increases by 1.
I wrote code using a loop, but it runs forever.
It is written in pure Python, but there must be a better way to do this in pandas.
So, how can I rewrite my code in pandas to reduce the running time?
old_store_id = -1
for i in range(0, df_sort.shape[0]):
    if old_store_id != df_sort.iloc[i, 0]:
        old_store_id = df_sort.iloc[i, 0]
        continue
    if (df_sort.iloc[i, 1] - df_sort.iloc[i-1, 1]) > 1:
        df_sort.iloc[i, 2] = df_sort.iloc[i-1, 2] + 1
    else:
        df_sort.iloc[i, 2] = df_sort.iloc[i-1, 2]
(The original post showed before-and-after screenshots of the dataframe here.)
df['value'] = df.groupby('store_id')['period_id'].transform(lambda x: (x.diff() > 1).cumsum() + 1)
So we group by store_id, check where the difference between consecutive periods is greater than 1, then take the cumulative sum of that boolean. We add 1 to make the counter start at 1 instead of 0.
Make sure that period_id is sorted correctly before using the above code, otherwise it will not work.
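Putting it together with the sorting step made explicit (df_sort, store_id, period_id and value are the names assumed here, matching the question and this answer):
# Sort so that diff() compares consecutive periods within each store,
# then cumulatively count the jumps larger than 1 per store.
df_sort = df_sort.sort_values(['store_id', 'period_id'])
df_sort['value'] = (df_sort.groupby('store_id')['period_id']
                           .transform(lambda x: (x.diff() > 1).cumsum() + 1))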
