I am trying to compare two columns (key.response and corr_answer) in a CSV file using pandas and create a new column "Correct_or_Not" that contains a 1 if key.response and corr_answer are equal and a 0 if they are not. When I evaluate the comparisons on their own, outside of the loop, they return the truth values I expect. The first part of the code just formats the data to remove some brackets and apostrophes.
I tried using a for loop, but for some reason it puts a 0 in every row of "Correct_or_Not".
import pandas as pd
df= pd.read_csv('exptest.csv')
df['key.response'] = df['key.response'].str.replace(']','')
df['key.response'] = df['key.response'].str.replace('[','')
df['key.response'] = df['key.response'].str.replace("'",'')
df['corr_answer'] = df['corr_answer'].str.replace(']','')
df['corr_answer'] = df['corr_answer'].str.replace('[','')
df['corr_answer'] = df['corr_answer'].str.replace("'",'')
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        df['Correct_or_Not']=1
    else:
        df['Correct_or_Not']=0
df.head()
  key.response  corr_answer  Correct_or_Not
0             1            1               0
1             2            2               0
2             1            2               0
You can generate the Correct_or_Not column all at once without the loop:
df['Correct_or_Not'] = df['key.response'] == df['corr_answer']
and then df['Correct_or_Not'] = df['Correct_or_Not'].astype(int) if you need the results as integers.
In your loop you forgot the index [i] when assigning the result. As written, each iteration overwrites the entire column, so the last row's result is applied everywhere.
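For reference, here is a minimal self-contained sketch of the vectorized approach. The sample values are made up for illustration and assume the raw cells look like "['1']"; regex=False is passed so that the brackets are treated as literal characters.

```python
import pandas as pd

# Hypothetical sample data standing in for exptest.csv
df = pd.DataFrame({
    "key.response": ["['1']", "['2']", "['1']"],
    "corr_answer": ["['1']", "['2']", "['2']"],
})

# Strip brackets and apostrophes as literal characters
for col in ["key.response", "corr_answer"]:
    for ch in ["[", "]", "'"]:
        df[col] = df[col].str.replace(ch, "", regex=False)

# Compare the two columns element-wise and store the result as 1/0
df["Correct_or_Not"] = (df["key.response"] == df["corr_answer"]).astype(int)
print(df)
```

The comparison yields a boolean Series with one entry per row, so no explicit loop or row index is needed.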
You can also keep the loop and assign one row at a time with .loc:
df['Correct_or_Not'] = 0
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        df.loc[i, 'Correct_or_Not'] = 1
I read in a file and created a DataFrame from it. The problem is that not all of the information I read was separated properly, and the rows are not all the same length. I have a df with 1600 columns, but I do not need them all. I specifically need the value that is 3 columns to the left of a particular string that appears in one of the columns. For example:
In the 1st row, column number 1000 has a value of ['HFOBR'], and I need the column value that is 3 to the left.
In the 2nd row, the column with ['PQOBR'] might be number 799, but I still need the value that is 3 to the left.
In the 3rd row, the column might be number 400 with ['BBSOBR'], but I still need the value 3 to the left.
And so on. I am really trying to search each row for the partial string OBR, take the value 3 columns to the left of it, and put that value in a new df with a column of its own.
Here you will find a snapshot of the dataframe
Here you will see the code I used to create the dataframe in the first place, where I read in an HL7 file and tried to convert it to a DataFrame. The HL7 messages are not all the same length, which is causing part of the problem I am having:
message = []
parsed_msg = []
with open(filename) as msgs:
    start = False
    for line in msgs.readlines():
        if line[:3] == 'MSH':
            if start:
                parsed_msg = hl7.parse_batch(msg)
                #print(parsed_msg)
                start = False
                message += parsed_msg
            msg = line
            start = True
        else:
            msg += line
df = pd.DataFrame(message)
Sample data:
df = pd.DataFrame([["a", "b", "c", "HFOBR", "foo"], ["foo", "x", "y", "z", "PQOBR"]])
df
     0  1  2      3      4
0    a  b  c  HFOBR    foo
1  foo  x  y      z  PQOBR
Define a function to find the value three columns to the left of the first column containing a string with "OBR":
import numpy as np
def find_left_value(row):
    # positions of the cells in this row that contain "OBR"
    obr_col_idx = np.where(row.str.contains("OBR", na=False))[0]
    # step three columns to the left of the first match
    left_col_idx = obr_col_idx[0] - 3
    return row.iloc[left_col_idx]
Apply this function to your dataframe:
df['result'] = df.apply(find_left_value, axis=1)
Resulting dataframe:
     0  1  2      3      4  result
0    a  b  c  HFOBR    foo       a
1  foo  x  y      z  PQOBR       x
FYI: making sample data like this that people can test answers on will help you 1) define your problem more clearly, and 2) get answers.
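Since the real rows are ragged, some of them may contain no "OBR" cell at all, or have one too close to the left edge. The following is only a sketch of how such rows could be handled; the function name value_left_of_obr, the offset parameter, and the sample rows are hypothetical and not part of the original data.

```python
import pandas as pd

def value_left_of_obr(row, offset=3):
    """Return the value `offset` columns left of the first cell containing 'OBR'."""
    # astype(str) and na=False keep missing or non-string cells from breaking the search
    mask = row.astype(str).str.contains("OBR", regex=False, na=False).to_numpy()
    hits = mask.nonzero()[0]
    if len(hits) == 0 or hits[0] < offset:
        return None  # no OBR cell, or not enough columns to its left
    return row.iloc[hits[0] - offset]

# Hypothetical ragged sample; the last row has OBR in the first column
sample = pd.DataFrame([
    ["a", "b", "c", "HFOBR", "foo"],
    ["foo", "x", "y", "z", "PQOBR"],
    ["BBSOBR", "p", "q", None, None],
])
result = sample.apply(value_left_of_obr, axis=1)
print(result)
```

Returning None for rows without a usable match avoids an IndexError and keeps a negative position from silently wrapping around to the end of the row.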
I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack. I want to use a for loop to create values in a pandas dataframe. I want the values to be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, it only uses the last value for the index.
   x        file
0  0  data_4.txt
1  1  data_4.txt
2  2  data_4.txt
3  3  data_4.txt
4  4  data_4.txt
I have tried using enumerate, but can't find a solution with this. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works and if someone points me to a solution that solves this problem, I'll happily remove this question.
There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about the .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.
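To make the distinction between these accessors concrete, here is a small sketch with made-up data. It assumes a non-default index so that labels and positions differ; in the question's df1 the two coincide because the default index is 0 through 4.

```python
import pandas as pd

# Hypothetical frame whose index labels are not in positional order
demo = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])

by_label = demo.loc[0, "x"]      # row whose index label is 0
by_position = demo.iloc[0, 0]    # first row, first column
by_label_fast = demo.at[0, "x"]  # scalar access by label

print(by_label, by_position, by_label_fast)
```

.loc and .at look rows up by label, while .iloc counts positions from the top, which is why the first and second values differ here.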
You can use a list comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
   x        file
0  0  data_0.txt
1  1  data_1.txt
2  2  data_2.txt
3  3  data_3.txt
4  4  data_4.txt
Or when creating the DataFrame:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
Or:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)
I've found success with the .at method:
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
   x        file
0  0  data_0.txt
1  1  data_1.txt
2  2  data_2.txt
3  3  data_3.txt
4  4  data_4.txt
When you assign a value to a DataFrame column the way you do, with df['colname'] = 'val', pandas assigns val to every row.
That is why you are seeing only the last value.
Change your code to:
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# collect one value per row in a list
to_assign = []
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    to_assign.append("data_{i}.txt".format(i=i))
# outside of the loop, assign once to all DataFrame rows
df1['file'] = to_assign
As a thought, pandas has a great API for performing these kinds of operations without for loops.
You should start practicing those.
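As one possible illustration of the loop-free style, the file names can be built directly from an existing column with string concatenation. This sketch assumes the number in each file name should equal the value in column x, which happens to hold for the question's example.

```python
import pandas as pd

df1 = pd.DataFrame({"x": range(5)})

# Convert x to strings and concatenate element-wise; no explicit loop is needed
df1["file"] = "data_" + df1["x"].astype(str) + ".txt"
print(df1)
```

If the number should come from the row index instead, df1.index.astype(str) could be used in place of the column.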
I have a pandas dataframe df_causation which I have created as an empty dataframe with a corresponding column name.
df_causation = pd.DataFrame(columns=['Question'])
I have a for loop in which, on each iteration, I get a new string called cause_str, like this:
for i in range(len(X_test)):
    cause_str = hyp.join(f_imp) #cause_str is a new string obtained for each iteration
(Ignore how this is obtained; I just gave an example.)
I would like to append all of these new strings (cause_str) to successive rows of the Question column in my pandas DataFrame df_causation. Is there a suitable way to do this?
EDIT: EXPECTED OUTPUT
df_causation. **Causation**
Row 0 cause_str from i = 0 th iteration in loop
Row 1 cause_str from i = 1 th iteration in loop etc.
IIUC, this should work:
df['**Causation**'] = df['df_causation.'].apply(lambda x: f'cause str from i = {x.split(" ")[1]} th iteration in loop')
df
   df_causation.                              **Causation**
0          Row 0  cause str from i = 0 th iteration in loop
1          Row 1  cause str from i = 1 th iteration in loop
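A common alternative for the situation described in the question is to collect the strings in a plain list inside the loop and build the DataFrame once at the end. The sketch below is not the asker's code: X_test, hyp and f_imp are hypothetical stand-ins, since the question does not show how they are produced.

```python
import pandas as pd

# Hypothetical stand-ins for the question's X_test and hyp
X_test = [0, 1, 2]
hyp = " and "

questions = []
for i in range(len(X_test)):
    # Placeholder for the real per-iteration values of f_imp
    f_imp = [f"feature_{i}", f"feature_{i + 1}"]
    cause_str = hyp.join(f_imp)
    questions.append(cause_str)

# Build the DataFrame once, after the loop has finished
df_causation = pd.DataFrame({"Question": questions})
print(df_causation)
```

Appending to a list and constructing the DataFrame once is generally cheaper than growing a DataFrame row by row.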
I am running through all the cells in my table using this bit of code:
df.loc[(df["month"]=="january"),"a5"]=1
This assigns the value 1 in the a5 column for all rows where the value in the month column is "january". I wanted to know if there is a way to assign 1 not to that row but to the row below.
I have tried simply doing
df.loc[(df["month"]=="january")+1,"a5"]=1
but it doesn't work. For some reason that I don't quite grasp,
df.loc[(df["month"]=="january")+2,"a5"]=1
assigns 1 to the row that says "january" and to the row below.
You can do it as follows:
import pandas as pd
df = pd.DataFrame.from_dict({'month':['j','k','j','k'],'a5':[10,10,10,10]})
# shift the mask down by one row; the vacated first slot is filled with False instead of NaN
index = (df["month"]=="j").shift(1, fill_value=False)
df.loc[index,'a5']=1
print(df)
>>>
  month  a5
0     j  10
1     k   1
2     j  10
3     k   1
When you write a condition like df['something'] == x, it returns a Series containing True where the condition is met and False otherwise.
The idea here is to shift all of those values down by one row and put False at the start.
I hope this clarifies a few things.
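Applied to month names like those in the question, the same idea might look like the following sketch. The sample rows are made up, and the intermediate names mask and next_row are only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["january", "february", "january", "march"],
    "a5": [10, 10, 10, 10],
})

# True on the rows where month is "january"
mask = df["month"] == "january"
# Move every True down one row; fill_value=False fills the slot vacated at the top
next_row = mask.shift(1, fill_value=False)

df.loc[next_row, "a5"] = 1
print(mask.tolist())
print(next_row.tolist())
print(df)
```

Printing both masks side by side shows that only the rows directly below a "january" row are selected for the assignment.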
df.loc[(df["month"]=="january"),"a5"]=1
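Then use shift to move the column down by one row and assign the result back. Note that this shifts every value in a5, not only the 1s, and leaves NaN in the first row:
df["a5"] = df["a5"].shift(1)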
I want to add a column to a Dataframe that will contain a number derived from the number of NaN values in the row, specifically: one less than the number of non-NaN values in the row.
I tried:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df['Num Hits'] = val
Which returns an error:
-c:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
and puts the first val value into every cell of the new column. I've tried reading about .loc and indexing in the Pandas documentation and failed to make sense of it. I gather that .loc wants a row_index and a column_index but I don't know if these are pre-defined in every dataframe and I just have to specify them somehow or if I need to "set" an index on the dataframe somehow before telling the loop where to place the new value, val.
You can totally do it in a vectorized way without using a loop, which is likely to be faster than the loop version:
In [89]:
print(df)
          0         1         2         3
0  0.835396  0.330275  0.786579  0.493567
1  0.751678  0.299354  0.050638  0.483490
2  0.559348  0.106477  0.807911  0.883195
3  0.250296  0.281871  0.439523  0.117846
4  0.480055  0.269579  0.282295  0.170642
In [90]:
# number of valid (non-NaN) numbers - 1; requires import numpy as np
df.apply(lambda x: np.isfinite(x).sum()-1, axis=1)
Out[90]:
0    3
1    3
2    3
3    3
4    3
dtype: int64
@DSM brought up a good point: the above solution is still not fully vectorized. A fully vectorized form is simply (~df.isnull()).sum(axis=1)-1.
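As a sketch of the fully vectorized form on data that actually contains NaN values (the frame below is made up for illustration), notna() can be used as the equivalent of ~isnull():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Count the non-NaN cells in each row, then subtract one
cols = ["a", "b", "c"]
df["Num Hits"] = df[cols].notna().sum(axis=1) - 1
print(df)
```

Restricting the count to the original columns keeps the new "Num Hits" column from being counted if the line is run again.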
You can use the index variable that you define as part of the for loop as the row_index that .loc is looking for:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df.loc[index, 'Num Hits'] = val