I am trying to select the rows where T-stage = 3 AND N-stage = 0 AND Radiation = 1 from three columns (T-stage, N-stage, and Radiation) of the table below with Python. I used the following, but the results were not what I expected:
df = pd.read_csv('Mydata.csv')  # load my data
#I tried the two approaches below, but the results were not what I expected.
A = ((df['T-stage'] == 3) | (df['N-stage'] == 0 | (df['Radiation'] == 1)))
or
B = ((df['T-stage'] == 3) & (df['N-stage'] == 0 & (df['Radiation'] == 1)))
It looks like a parenthesis mismatch; wrap each condition in its own parentheses so the & operators apply to complete comparisons:
B = df[(df['T-stage'] == 3) & (df['N-stage'] == 0) & (df['Radiation'] == 1)]
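The parentheses matter because & and | bind more tightly than ==, so df['N-stage'] == 0 & (df['Radiation'] == 1) is parsed as df['N-stage'] == (0 & (df['Radiation'] == 1)). As a side note, the same filter can be written with DataFrame.query, which sidesteps the precedence pitfall entirely (a minimal sketch, assuming the column names above; the backticks quote names containing hyphens, supported in recent pandas versions):
# Equivalent selection with query(); backticks quote column names
# that contain hyphens.
B = df.query("`T-stage` == 3 and `N-stage` == 0 and Radiation == 1")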
I have the following DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        'date': ['2020-12-05', '2020-12-06', '2020-12-07'],
        'day': ['Saturday', 'Sunday', 'Monday'],
        'score': [2, 3, 0]
    }
)
df
In the DataFrame above, I want to update the score on Monday if the scores on the weekend were non-zero. For the DataFrame above, Monday's score would become (2 + 3)/2 = 2.5, the average of the non-zero weekend scores. But it should work for other, longer DataFrames as well.
I know I can use the following:
df.score.loc[(df.day == 'Monday') & (df.score != 0) & (df.score.shift(1) != 0) & (df.score.shift(2) != 0)] = (df.score + df.score.shift(1)+df.score.shift(2))/3
df.score.loc[(df.day == 'Monday') & (df.score != 0) & (df.score.shift(1) != 0) & (df.score.shift(2) == 0)] = (df.score + df.score.shift(1))/2
df.score.loc[(df.day == 'Monday') & (df.score != 0) & (df.score.shift(1) == 0) & (df.score.shift(2) != 0)] = (df.score + df.score.shift(2))/2
df.score.loc[(df.day == 'Monday') & (df.score == 0) & (df.score.shift(1) != 0) & (df.score.shift(2) != 0)] = (df.score.shift(1) + df.score.shift(2))/2
df.score.loc[(df.day == 'Monday') & (df.score == 0) & (df.score.shift(1) != 0) & (df.score.shift(2) == 0)] = df.score.shift(1)
df.score.loc[(df.day == 'Monday') & (df.score == 0) & (df.score.shift(1) == 0) & (df.score.shift(2) != 0)] = df.score.shift(2)
but this is too lengthy. I think I need to iterate through the DataFrame, something like this:
for index, row in df.iterrows():
    if row.day == 'Monday':
        non_zeros = []
        if row.score != 0:
            non_zeros.append(row.score)
        if row.score.shift(1) != 0:
            non_zeros.append(row.score.shift(1))
        if row.score.shift(2) != 0:
            non_zeros.append(row.score.shift(2))
        mon_score = sum(non_zeros)/len(non_zeros)
        df.at[index, 'score'] = mon_score
The code above doesn't work because I get an error:
AttributeError: 'float' object has no attribute 'shift'
So, it seems that shift() isn't correct.
How would I access the previous row and how would I access the score in the previous row? Is there a better way than manually listing the combinations of conditions, like I've done above?
How would I access the previous row
Keep the previous rows in variables; you actually want to see the previous two rows.
rows = df.iterrows()
index, minus2 = next(rows)
index, minus1 = next(rows)
for index, current in rows:
    if current.day == 'Monday':
        print(f'Saturday:{minus2.date}, Sunday:{minus1.date}, Monday:{current.date}')
        print(f'Sat score:{minus2.score}, Sun score:{minus1.score}, Mon score:{current.score}')
        print('*********')
    minus2, minus1 = minus1, current
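To go from printing to actually updating, the same two-variable pattern can compute the new Monday score and write it back with df.at (a sketch, assuming the rule from the question: average the non-zero scores among Saturday, Sunday, and Monday):
rows = df.iterrows()
_, minus2 = next(rows)
_, minus1 = next(rows)
for index, current in rows:
    if current.day == 'Monday':
        # Collect the non-zero scores from Saturday, Sunday and Monday
        non_zeros = [s for s in (minus2.score, minus1.score, current.score) if s != 0]
        if non_zeros:
            df.at[index, 'score'] = sum(non_zeros) / len(non_zeros)
    minus2, minus1 = minus1, current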
Here is another way to do it.
Setup
import pandas as pd
from numpy.random import default_rng
rng = default_rng()
dates = pd.date_range("2020-12-04", periods=60, freq="D")
days = dates.day_name()
score = rng.choice([0, 0, 0, 0, 0, 0, 0.1, 0.2, 0.3], size=60)
df = pd.DataFrame({"date": dates, "day": days, "score": score})
print(df.head(10))
Group the Saturday/Sunday/Monday rows into blocks that start on Saturday; if there is a non-zero score on the weekend, calculate the new Monday score, then use the group's Monday index to assign the new value to the DataFrame.
weekends = df.day.str.contains(r"Saturday|Sunday|Monday")
days_of_interest = df[weekends]
gb = df.groupby((days_of_interest.day == "Saturday").cumsum())
for k, g in gb:
    if (g.iloc[:2].score != 0).any():
        monday = g.iloc[-1].name
        new_score = g.score.mean()
        # print(new_score)
        df.loc[monday, "score"] = new_score
    # print(g)
print(df)
It assumes the data does not break/stop/start on a Saturday/Sunday/Monday boundary. That assumption is necessary because the code uses .iloc slicing/indexing and groups starting on 'Saturday'; extra logic or alternate selection methods would be needed to accommodate the edge cases if the assumption does not hold.
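If the data might start or end mid-block, one possible guard (a sketch, not part of the original answer) is to skip any group that is not a complete Saturday/Sunday/Monday triple:
for k, g in gb:
    # Skip incomplete groups at the start or end of the data
    if len(g) != 3 or g.iloc[-1].day != "Monday":
        continue
    if (g.iloc[:2].score != 0).any():
        df.loc[g.iloc[-1].name, "score"] = g.score.mean()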
I wasn't too sure how the Monday score was to be updated; it looked like you were averaging over the non-zero scores (sum(non_zeros)/len(non_zeros)), including Monday's(?). Maybe it should have been:
# add average of weekend scores to monday score
weekend_score = g.iloc[:2].score.mean()
df.loc[monday,'score'] += weekend_score
# or just the sum of all the scores?
df.loc[monday,'score'] = g.score.sum()
If I understand correctly, this might be the solution you have been looking for. I have assumed that your real data looks like your dummy data (no nulls or other irregular rows). Try the following if you want to use shift().
import pandas as pd
import math
df = pd.DataFrame(
    {
        'date': ['2020-11-30', '2020-12-05', '2020-12-06', '2020-12-07'],
        'day': ['Monday', 'Saturday', 'Sunday', 'Monday'],
        'score': [4, 2, 3, 0]
    }
)
def update(row, df1, df2):
    score_Mon = row['score']
    score_Sun = df1.iloc[row.name]['score']
    score_Sat = df2.iloc[row.name]['score']
    # Base condition: only update if the previous Saturday and Sunday
    # values are available, else keep the value as-is
    if (row['day'] == 'Monday' and not math.isnan(score_Sun)
            and not math.isnan(score_Sat)):
        # Non-zero condition
        if score_Mon != 0 and score_Sun > 0 and score_Sat > 0:
            row['score'] = (score_Mon + score_Sun + score_Sat) / 3
        # Zero condition
        else:
            row['score'] = (score_Sun + score_Sat) / 2
    return row

df = df.apply(update, axis=1, args=(df.shift(1), df.shift(2)))
df
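For the sample frame above, this leaves the first Monday's score at 4 (there are no preceding weekend rows, so the shifted values are NaN and the base condition fails) and updates the last Monday's score to (3 + 2)/2 = 2.5.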
df = pd.DataFrame(
    {
        'date': ['2020-12-05', '2020-12-06', '2020-12-07', '2020-12-05', '2020-12-06', '2020-12-07'],
        'day': ['Saturday', 'Sunday', 'Monday', 'Saturday', 'Sunday', 'Monday'],
        'score': [-0.2, 0, 0.0, -0.3, 0, 0.0]
    }
)
Based on the question, if we have to access previous rows relative to the row in question, we can use the following:
mondays = df['day'] == 'Monday'
sundays = df.score.shift(1)[mondays]
saturdays = df.score.shift(2)[mondays]
# row_mask_for_upd = mondays & (sundays.astype(bool) | saturdays.astype(bool))  # if either Sunday or Saturday has to be non-zero
row_mask_for_upd = mondays & sundays.astype(bool) & saturdays.astype(bool)  # if both Sunday and Saturday have to be non-zero
if row_mask_for_upd.any():
    df.loc[row_mask_for_upd, "score"] = (sundays + saturdays) / 2
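One caveat (my addition, not from the original answer): shift() produces NaN at the start of the frame, and NaN is truthy, so it survives astype(bool) as True. Filling the missing values first makes the masks safer when the frame can begin with a Monday:
sundays = df.score.shift(1).fillna(0)[mondays]
saturdays = df.score.shift(2).fillna(0)[mondays]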
I'm making a program that handles over 20 million rows and over 50 columns of data. I'm trying to check if the numbers in one of the columns are even or odd.
If even, insert 'E' into a different column; if odd, insert 'O' into the column.
DF_FILE_IN = pd.read_csv('3MB_2.txt', chunksize=1000, sep='\t', dtype=str,
                         engine='c', header=0, encoding='latin-1')
out_fields = ['HSNBR', 'OEFLAG']
for DF_FILE in DF_FILE_IN:
    df_out1 = pd.DataFrame(dtype='str', columns=out_fields)
    df_out1['HSNBR'] = DF_FILE['ANumber'].map(lambda x: f'{x:0>6}')
    df_out1.loc[pd.to_numeric(df_out1['HSNBR']).map(lambda x: (x % 2 == 0) & (x != 0)), 'OEFLAG'] = 'E'
    df_out1.loc[pd.to_numeric(df_out1['HSNBR']).map(lambda x: (x % 2 != 0) & (x != 0)), 'OEFLAG'] = 'O'
But some data has letters, symbols, spaces, etc.
When I run it, an error pops up from this line of code:
df_out1.loc[pd.to_numeric(df_out1['HSNBR']).map(lambda x: (x % 2 == 0) & (x != 0)), 'OEFLAG'] = 'E'
and says (example):
ValueError: Unable to parse string "111 1/2g" at position 10
I'm using chunking to pull in the data (e.g., 1 million rows at a time). I want to put the data that causes the errors into a separate file, but when I use try/except, it doesn't process the column of data in that chunk.
How do I get the data and errors into a file, while letting the program keep processing the column?
What @BernardL means is to write a function like:
def even_odd(x):
    x = str(x)
    if x.isnumeric():
        x = int(x)
        if (x % 2 == 0) and (x != 0):
            return 'E'
        if (x % 2 != 0) and (x != 0):
            return 'O'
    return 'error'
And then apply it with:
df_out1['OEFLAG'] = df_out1['HSNBR'].map(even_odd)
And then you can take out the errors with:
df_out1[df_out1['OEFLAG'] == 'error'].to_csv('errors_file.csv')
df_out1 = df_out1[df_out1['OEFLAG'] != 'error']
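One caveat: this runs once per chunk, so each to_csv call would overwrite errors_file.csv; opening the file in append mode keeps the errors from every chunk (a sketch, my addition):
# Append errors from each chunk (header omitted so chunks concatenate cleanly)
df_out1[df_out1['OEFLAG'] == 'error'].to_csv('errors_file.csv', mode='a', header=False)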
I'm excluding rows from my df that fulfill certain conditions:
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20))]
I would also like to exclude the numbers that start with 0 in the column 'Serial':
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'] == range(0) == 0))]
I tried the above, but it gave no result.
You will probably want to use Series.str.startswith to check the first character:
df[ ~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'].str.startswith('0')))]
The current expression df['Serial'] == range(0) == 0 is meaningless. It is equivalent to df['Serial'] == range(0) and range(0) == 0. Clearly, neither of those is related to comparing the first character of the string to '0' (as opposed to 0).
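If the 'Serial' column is stored as a numeric dtype rather than strings, the .str accessor will raise an AttributeError; casting to str first sidesteps that (a sketch, my assumption about the dtype). Note, though, that if the column really is numeric, any leading zeros were already lost at load time, so you would typically read that column with dtype=str instead:
# Cast Serial to string before the startswith check
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20)
     & (df['Serial'].astype(str).str.startswith('0')))]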
I have data represented using pandas DataFrame, which for example looks as follows:
| id | entity | name | value | location
where id is an integer, entity is an integer, name is a string, value is an integer, and location is a string (for example US, CA, UK, etc.).
Now, I want to add a new column to this data frame, column "flag", where values are assigned as follows:
for _, d in df.iterrows():
    if d.entity == 10 and d.value != 1000 and d.location == "CA":
        d.flag = "A"
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == "US":
        d.flag = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == "US":
        d.flag = "B"
    else:
        print("Different case")
Is there a way to speed this up and use some built in functions instead of the for loop?
Use np.select, to which you pass a list of conditions; based on those conditions you give it choices, and you can specify a default value for when none of the conditions is met.
import numpy as np

conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
choices = ["A", "C", "B"]
df['flag'] = np.select(conditions, choices, default="Different case")
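A quick self-contained check (the small frame below is my own illustration, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "entity": [10, 5, 0],
    "value": [500, 1000, 1000],
    "location": ["CA", "US", "US"],
})
conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
df["flag"] = np.select(conditions, ["A", "C", "B"], default="Different case")
print(df["flag"].tolist())  # ['A', 'C', 'B']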
Add () around each condition and use the bitwise and (&) for working with numpy.select:
m = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")
You wrote "find all columns which fulfill a set of conditions", but your code shows you're actually trying to add a new column whose value for each row is computed from the values of other columns of the same row.
If that's indeed the case, you can use df.apply, giving it a function that computes the value for a specific row:
def flag_value(row):
    if row.entity == 10 and row.value != 1000 and row.location == "CA":
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == "US":
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == "US":
        return "B"
    else:
        return "Different case"
df['flag'] = df.apply(flag_value, axis=1)
Take a look at this related question for more information.
If you truly want to find all columns which specify some condition, the usual way to do this with a Pandas dataframe is to use df.loc and indexing:
only_a_cases = df.loc[(df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
# or:
only_a_cases = df.loc[lambda df: (df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
Note the parentheses around each comparison: without them, & binds more tightly than ==, and the expression fails.