I am running through all the cells in my table using this bit of code:
df.loc[(df["month"]=="january"),"a5"]=1
Which assigns the value "1" in the a5 column for all the rows where the value in the month column is "january". I wanted to know if there was a way to assign "1" not to that row but to the row below.
I have tried to simply do
df.loc[(df["month"]=="january")+1,"a5"]=1
but it doesn't work. For some reason that I don't quite grasp:
df.loc[(df["month"]=="january")+2,"a5"]=1
assigns 1 to the row that says "january" and to the row below.
You can do it as follows:
import pandas as pd
df = pd.DataFrame.from_dict({'month':['j','k','j','k'],'a5':[10,10,10,10]})
# shift the mask down by one row; the vacated first slot would be NaN, so fill it with False
index = (df["month"]=="j").shift(1, fill_value=False)
df.loc[index,'a5']=1
print(df)
>>>
month a5
0 j 10
1 k 1
2 j 10
3 k 1
When you write a comparison like df['something'] == x, it returns a boolean Series: True where the condition is met, False otherwise.
The idea here is to shift all of those values down by one row and put False in the first position.
I hope this clarifies things.
df.loc[(df["month"]=="january"),"a5"]=1
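And then use shift to move it to the next row. shift returns a new Series rather than modifying the column in place, so assign the result back:
df["a5"] = df["a5"].shift(1)
Note that this shifts every value in a5 down by one row, not only the 1s, and leaves NaN in the first row.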
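As for why the +1 and +2 attempts in the question behave oddly: adding an integer to a boolean Series turns it into integers (True becomes 1, False becomes 0), and .loc then treats those integers as row labels instead of as a mask. A minimal sketch of this, assuming a default integer index and a small made-up table:

```python
import pandas as pd

# Made-up sample data with the default 0..n-1 index (an assumption)
df = pd.DataFrame({"month": ["january", "february", "january", "march"],
                   "a5": [10, 10, 10, 10]})

mask = df["month"] == "january"   # True, False, True, False
labels = mask + 1                 # now integers: 2, 1, 2, 1
print(labels.tolist())

# .loc receives integers, so it looks up the rows labelled 1 and 2,
# regardless of where "january" actually is
df.loc[labels, "a5"] = 1
print(df["a5"].tolist())
```

With + 2 the labels become 2 and 3, which would explain the result in the question if "january" happened to sit in row 2. Shifting the mask itself, as in the first answer, avoids the problem.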
I have a DataFrame with thousands of rows. Its structure is as below:
A B C D
0 q 20 'f'
1 q 14 'd'
2 o 20 'a'
I want to compare the A column of the current row with that of the next row. If the values are equal, I want to take the B value of whichever row has the lower B and put it in the D column of the row with the greater B. Then I want to remove the B value that was moved. It's like a swap process.
A B C D
0 q 20 'f' 14
1 o 20 'a'
I have thousands of rows, and the iloc, loc, and at methods are slow. I would at least like to use the DataFrame apply method. I tried some code samples but they didn't work.
I want to do something as below:
DataFrame.apply(lambda row: self.compare(row, next(row)), axis=1)
I have a compare method, but I couldn't pass the next row to it. How can I pass it to the method? I am also open to hearing faster pandas solutions.
Best not to do that with apply as it will be slow; you can look at using shift, e.g.
df['A_shift'] = df['A'].shift(1)
df['Is_Same'] = 0
df.loc[df.A_shift == df.A, 'Is_Same'] = 1
Gets a bit more complicated if you're doing the shift within groups, but still possible.
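Since the question asks about the next row rather than the previous one, a negative shift may be closer to what is wanted. The sketch below uses made-up data, and the group column g is hypothetical; it only illustrates the comparison, not the full swap logic from the question.

```python
import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y", "y"],   # hypothetical group column
                   "A": ["q", "q", "q", "o"]})

# shift(-1) pulls the next row's value up, so each row is compared
# with the row below it; the last row is compared with NaN
df["same_as_next"] = (df["A"] == df["A"].shift(-1)).astype(int)

# Shifting inside groupby never looks across a group boundary
next_in_group = df.groupby("g")["A"].shift(-1)
df["same_as_next_in_group"] = (df["A"] == next_in_group).astype(int)

print(df)
```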
I have a data set that I am reading from a CSV file, and I want to grab the row number/index where an if statement is true:
that is, where the value in one column is 0 and the value in another column of the same row is not null.
Right now my loop reports that every row in my data set has a 0 and a non-null value, which is wrong, so it's not working.
What am I doing wrong?
counter = 0
for index, row in raw_csv_data.iterrows():
    if(row['column1'] == 0 and row['column3'] != np.nan):
        print(row['column1'], row['column3'])
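Solution: the part of the if statement that needed fixing is the NaN check. NaN never compares equal to anything, including itself, so row['column3'] != np.nan is always True. Use isna() instead:
row.isna()['column3'] == False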
Another way to do this is the following:
counts = sum((data['column1'].eq(0) & ~data['column3'].isna()))
eq checks element-wise whether the values are equal to 0.
Similarly, isna() flags the missing values, and ~ inverts that mask.
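The question also asks for the row index of the matching rows, which the same mask can provide without a loop. A small sketch on made-up data (the values are assumptions; the column names follow the question):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV contents
raw_csv_data = pd.DataFrame({"column1": [0, 0, 1, 0],
                             "column3": [1.5, np.nan, 2.0, 3.0]})

# NaN is not equal to anything, itself included, which is why
# `!= np.nan` lets every row through
print(np.nan != np.nan)

# Build the condition once for the whole frame
mask = raw_csv_data["column1"].eq(0) & raw_csv_data["column3"].notna()

matching_index = raw_csv_data.index[mask].tolist()  # index labels of the hits
count = int(mask.sum())                             # number of hits
print(matching_index, count)
```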
The problem is in this condition. Comparing against np.nan with != is always True, so every row where column1 is 0 passes the test:
if(row['column1'] == 0 and row['column3'] != np.nan):
Replace the NaN comparison with pd.notna and it will work. Below is your complete code:
counter = 0
for index, row in raw_csv_data.iterrows():
    if row['column1'] == 0 and pd.notna(row['column3']):
        counter += 1
I am trying to compare two columns (key.response and corr_answer) in a csv file using pandas and creating a new column "Correct_or_Not" that will contain a 1 in the cell if the key.response and corr_answer column are equal and a 0 if they are not. When I evaluate on their own outside of the loop they return the truth value I expect. The first part of the code is just me formatting the data to remove some brackets and apostrophes.
I tried using a for loop, but for some reason it puts a 0 in every row of "Correct_or_Not".
import pandas as pd
df= pd.read_csv('exptest.csv')
df['key.response'] = df['key.response'].str.replace(']','')
df['key.response'] = df['key.response'].str.replace('[','')
df['key.response'] = df['key.response'].str.replace("'",'')
df['corr_answer'] = df['corr_answer'].str.replace(']','')
df['corr_answer'] = df['corr_answer'].str.replace('[','')
df['corr_answer'] = df['corr_answer'].str.replace("'",'')
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        df['Correct_or_Not']=1
    else:
        df['Correct_or_Not']=0
df.head()
key.response corr_answer Correct_or_Not
0 1 1 0
1 2 2 0
2 1 2 0
You can generate the Correct_or_Not column all at once without the loop:
df['Correct_or_Not'] = df['key.response'] == df['corr_answer']
and df['Correct_or_Not'] = df['Correct_or_Not'].astype(int) if you need the results as integers.
In your loop you forgot the index [i] when assigning the result. As written, each assignment overwrites the whole column, so the last row's result ends up everywhere.
You can also keep the loop and assign to a single row with .loc:
df['Correct_or_Not']=0
for i in range(df.shape[0]):
    if df['key.response'][i]==df['corr_answer'][i]:
        df.loc[i, 'Correct_or_Not']=1
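Putting the cleanup and the comparison together, here is a compact sketch. It assumes the raw cells look like ['1'], with the brackets and apostrophes only at the ends of each string; the sample rows are made up.

```python
import pandas as pd

# Made-up rows in the raw format described in the question (an assumption)
df = pd.DataFrame({"key.response": ["['1']", "['2']", "['1']"],
                   "corr_answer": ["['1']", "['2']", "['2']"]})

for col in ["key.response", "corr_answer"]:
    # str.strip removes any of the listed characters from both ends
    df[col] = df[col].str.strip("[]'")

# Compare the two columns row by row, then convert True/False to 1/0
df["Correct_or_Not"] = (df["key.response"] == df["corr_answer"]).astype(int)
print(df)
```

If brackets or apostrophes can also appear in the middle of a value, str.strip will not remove them and the replace calls from the question are the safer choice.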
I want to assign a boolean value to a column: "True" if the first column contains only one period and "False" if it contains more than one period.
This is what I've gotten to at this point and I am completely stuck:
for index, row in qbstats.iterrows():
    if qbstats['qb'].count(".") > 1
... so if the count is greater than one I want to set the column labeled "num_periods_in_name" to False; otherwise it should be True.
I would appreciate any help, thanks.
You can use np.where():
df['New Col'] = np.where(df['qb'].str.count(r'\.')>1, False, True)
Note that str.count treats its argument as a regular expression, so you need to escape the . with a \ (the raw string keeps the backslash intact).
Below is an example:
qb
0 Hello.
1 helloo...
2 hello...ooo
3 Hell.o
And applying the code above gives:
qb New Col
0 Hello. True
1 helloo... False
2 hello...ooo False
3 Hell.o True
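Because the comparison already yields booleans, np.where is optional here. A minimal variant using the frame and column names from the question (the sample values are the ones from the example above):

```python
import pandas as pd

qbstats = pd.DataFrame({"qb": ["Hello.", "helloo...", "hello...ooo", "Hell.o"]})

# str.count takes a regular expression, so the period is escaped;
# "<= 1" gives True for at most one period and False for more than one
qbstats["num_periods_in_name"] = qbstats["qb"].str.count(r"\.") <= 1
print(qbstats)
```

Note that a name with no period at all also gets True with this test, just as it does with the np.where version.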
As part of trying to learn pandas I'm trying to reshape a spreadsheet. After removing non zero values I need to get some data from a single column.
For the sample columns below, I want to find the most effective way of finding the row and column index of the cell that contains the value date and getting the value next to it (e.g. here it would be 38477).
In practice this would be a much bigger DataFrame and the date row could change and it may not always be in the first column.
What is the best way to find out where date is in the array and return the value in the adjacent cell?
Thanks
<bound method DataFrame.head of 0 1 2 4 5 7 8 10 \
1 some title
2 date 38477
5 cat1 cat2 cat3 cat4
6 a b c d e f g
8 Z 167.9404 151.1389 346.197 434.3589 336.7873 80.52901 269.1486
9 X 220.683 56.0029 73.73679 428.8939 483.7445 251.1877 243.7918
10 C 433.0189 390.1931 251.6636 418.6703 12.21859 113.093 136.28
12 V 226.0135 418.1141 310.2038 153.9018 425.7491 73.08073 277.5065
13 W 295.146 173.2747 2.187459 401.6453 51.47293 175.387 397.2021
14 S 306.9325 157.2772 464.1394 216.248 478.3903 173.948 328.9304
15 A 19.86611 73.11554 320.078 199.7598 467.8272 234.0331 141.5544
This really just reformats a lot of the iteration you are doing to make it clearer and take advantage of pandas ability to easily select, etc.
First, we need a dummy dataframe (with date in the last row and explicitly ordered the way you have in your setup)
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": [1,2,3,4,np.nan],
                   "B":[5, 3, np.nan, 3, "date"],
                   "C":[np.nan,2, 1,3, 634]})[["A","B","C"]]
A clear way to do it is to find the row and then enumerate over the row to find date:
row = df[df.apply(lambda x: (x == "date").any(), axis=1)].values[0] # will be an array
for i, val in enumerate(row):
    if val == "date":
        print(row[i + 1])
        break
If your spreadsheet only has a few non-numeric columns, you could go by column, check for date and get a row and column index (this may be faster because it searches by column rather than by row, though I'm not sure)
# gives you column labels, which are `True` if at least one entry has `date` in it
# have to check `kind` otherwise you get an error.
col_result = df.apply(lambda x: x.dtype.kind == "O" and (x == "date").any())
# select only columns where True (this should be one entry) and get their index (for the label)
column = col_result[col_result].index[0]
col_index = df.columns.get_loc(column)
# will be True if it contains date
row_selector = df.iloc[:, col_index] == "date"
print(df[row_selector].iloc[:, col_index + 1].values)
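A possible alternative is to locate the cell positionally in one step with np.where. This is a sketch on the same dummy frame and assumes that "date" occurs exactly once and is not in the last column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, np.nan],
                   "B": [5, 3, np.nan, 3, "date"],
                   "C": [np.nan, 2, 1, 3, 634]})

# Boolean grid marking the cells equal to "date"; numeric cells compare as False
hits = (df == "date").to_numpy()

# Positions (not labels) of every matching cell
rows, cols = np.where(hits)
row_pos, col_pos = int(rows[0]), int(cols[0])

# Value in the cell immediately to the right of the match
value = df.iat[row_pos, col_pos + 1]
print(row_pos, col_pos, value)
```

Working with positions keeps this independent of the row and column labels, which matters for a frame like the one in the question where labels have gaps.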