I have a df with these columns:
start end strand
3 90290834 90290905 +
3 90290834 90291149 +
3 90291019 90291149 +
3 90291239 90291381 +
5 33977824 33984550 -
5 33983577 33984550 -
5 33984631 33986386 -
What I am trying to do is add new columns (5ss and 3ss) based on the strand column:
import pandas as pd

f = pd.read_clipboard()
f
def addcolumns(row):
    if row['strand'] == "+":
        row["5ss"] == row["start"]
        row["3ss"] == row["end"]
    else:
        row["5ss"] == row["end"]
        row["3ss"] == row["start"]
    return row
f = f.apply(addcolumns, axis=1)
KeyError: ('5ss', u'occurred at index 0')
Which part of the code is wrong? Or is there an easier way to do this?
Instead of using .apply(), I'd suggest np.where():
import numpy as np

f.loc[:, '5ss'] = np.where(f.strand == '+', f.start, f.end)
f.loc[:, '3ss'] = np.where(f.strand == '+', f.end, f.start)
np.where() creates a new array based on three arguments:
A logical condition (in this case f.strand == '+')
A value to take when the condition is true
A value to take when the condition is false
As for your original code: apply() with axis=1 does pass each row to the function, so that part is fine. The bug is that you wrote == (comparison) where you meant = (assignment). row["5ss"] == row["start"] tries to read the nonexistent key '5ss' in order to compare it, which raises the KeyError. Replacing == with = makes the function work, but given what you're trying to do, it's simpler to use np.where(), which lets you specify the conditional logic for the column assignment directly.
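For reference, a minimal sketch of the fixed apply version (same names as in the question):

def addcolumns(row):
    # '=' assigns; the original '==' only compared, and looked up a key that didn't exist yet
    if row['strand'] == "+":
        row["5ss"] = row["start"]
        row["3ss"] = row["end"]
    else:
        row["5ss"] = row["end"]
        row["3ss"] = row["start"]
    return row

f = f.apply(addcolumns, axis=1)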
Related
I am creating a generic condition for joining two dataframes which have the same key and the same structure, as in the code below. I would like to make it a function for comparing two dataframes. My first idea was to build the condition as a string, since that makes it easy to concatenate conditions in a loop. But it turns out the join couldn't accept the string condition. Could somebody please guide me on this?
import pyspark.sql.functions as F

key = "col1 col2 col3"

def CompareData(df1, df2, key):
    key_list = key.split(" ")
    key_con = ""
    for col in key_list:
        condi = "(F.col(\"" + col + "\") == F.col(\"" + "x_" + col + "\"))"  # trying to generate a generic condition
        key_con = key_con + "&" + condi
    key_condition = key_con.replace('&', '', 1)
    df1_tmp = df1.select([F.col(c).alias("x_" + c) for c in df1.columns])
    df_compare = df2.join(df1_tmp, key_condition, "left")  # The problem is here: key_condition raises an error. If I copy the condition string below and paste it into the join, it works fine.
    # key_condition = (F.col("col1") == F.col("x_col1")) & (F.col("col2") == F.col("x_col2")) & (F.col("col3") == F.col("x_col3"))
Try this:
key_con = F.lit(True)
for col in key_list:
    condi = (F.col(col) == F.col(f"x_{col}"))
    key_con = key_con & condi
In your attempt, the condition is of type string. But join's argument on only accepts a string when it is a plain column name. You are trying to build a column expression and pass it to the on argument. A Column expression is not the same thing as a string, so you need a slightly different method to build a composite column expression.
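Putting it together, a minimal sketch of the full function with that fix, folding the per-column conditions with functools.reduce (names follow the question's code):

import pyspark.sql.functions as F
from functools import reduce

def CompareData(df1, df2, key):
    key_list = key.split(" ")
    df1_tmp = df1.select([F.col(c).alias("x_" + c) for c in df1.columns])
    # Fold the per-column equality tests into a single composite Column expression
    key_condition = reduce(
        lambda acc, c: acc & (F.col(c) == F.col("x_" + c)),
        key_list,
        F.lit(True),
    )
    return df2.join(df1_tmp, key_condition, "left")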
I am trying to select a range within a data frame based on its values. I have the logic for what I am trying to implement worked out in Excel, and I just need to translate it into a Python script. I need to return a range of rows, starting where Column A has a value and ending where Column B has that same value. Example below:
index   A         B         output range
0       dsdfsdf
1
2
3
4       quwfi
5                 dsdfsdf   0:5
6                 quwfi     4:6
One thing to note: the value in Column B will always appear lower down the list than its match in Column A.
So far I have tried to grab the index from Column A and put it in the output range column on Column B's row, using:
df['output range'] = np.where(df['B'] != "", (df.index[df['A'] == df.at[df['B']].value]))
This gives me a ValueError: Invalid call for scalar access (getting)!
Removing the np.where portion of it does not change the result
This should give you the required behavior:
import pandas as pd

df = pd.DataFrame({'A': ['dsdfsdf', '', '', '', 'quwfi', '', ''],
                   'B': ['', '', '', '', '', 'dsdfsdf', 'quwfi']})

def get_range(x):
    if x != '':
        first_index = df[df['A'] == x].index.values[0]
        current_index = df[df['B'] == x].index.values[0]
        return f"{first_index}:{current_index}"
    return ''

df['output range'] = df['B'].apply(get_range)
df
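If the frame is large, scanning both columns for every value in B gets slow. A minimal alternative sketch that builds the value-to-index lookup once (it assumes each value appears at most once in Column A and that every non-empty B value has a match):

# Map each non-empty value in A to the index where it first appears
first_index = {v: i for i, v in df['A'].items() if v != ''}
# For each non-empty value in B, pair its A-index with the current index
df['output range'] = [f"{first_index[v]}:{i}" if v != '' else ''
                      for i, v in df['B'].items()]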
I have a dataframe from which I want to return only the rows where the value in column '1' matches a specific string and the value in column '2' is an integer.
I have the following code, in which I attempt to generate a set of indexes matching the criteria and then pull only those rows from the dataframe.
Ok_index = df[(df['1']== "string") & (df['2'] % 1 == 0)].index
new_df = df.iloc[Ok_index]
I understand the issue will be with the second conditional statement but I don't know how to apply the same logic from the string check to the integer check.
The following dataframe:
1              2
'String'       1.5
'String'       10
'Not string'   10
Should return this dataframe:
1          2
'String'   10
Check with is_integer:
df['2'].apply(lambda x: x.is_integer())
0    False
1     True
2     True
Name: 2, dtype: bool
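Plugged back into the original filter, that could look like this (a sketch; it assumes column '2' holds floats, since is_integer() is a float method):

new_df = df[(df['1'] == "string") & df['2'].apply(lambda x: x.is_integer())]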
Actually, your error is in your second line. You are retrieving the index from the dataframe, so you need to use .loc in order to filter with it. Essentially:
new_df = df.loc[Ok_index]
But if you want to use all pandas' power, you can actually do all this in a single line:
new_df = df[(df['1'] == "string") & (df['2'] % 1 == 0)]
You don't need to get the index for the desirable rows first, and then filter the dataframe. You can do all this at once.
I am attempting to loop through two columns in my dataframe and add either a 1 or 0 to a new column based on the two aforementioned column values. For example, if Column A is > Column B then add a 1 to Column C. However, I keep receiving the following error and I'm not sure why.
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
My code:
for i in df.itertuples():
    if df['AdjClose'] > df['30ma']:
        df['position'] = 1
    elif df['AdjClose'] < df['30ma']:
        df['position'] = 0
You aren't actually using the rows you loop over. In your if statement, the condition isn't True or False: df['AdjClose'] > df['30ma'] compares the whole columns and produces a Series. Hence the error; a Series is not true or false, it's a Series. A more correct way to write your code would be
for i in range(len(df)):
    if df.loc[i, 'AdjClose'] > df.loc[i, '30ma']:
        df.loc[i, 'position'] = 1
    elif df.loc[i, 'AdjClose'] < df.loc[i, '30ma']:
        df.loc[i, 'position'] = 0
A shorter, cleaner, and more pandas-y way to write the code that also has the benefit of running faster would be:
df.loc[df.AdjClose > df['30ma'], 'position'] = 1
df.loc[df.AdjClose < df['30ma'], 'position'] = 0
I highly recommend reading the docs on indexing, it can be a bit tricky in pandas to start with. https://pandas.pydata.org/pandas-docs/stable/indexing.html
Edit:
Note, the for loop code makes the assumption that your index is made of unique values ranging from 0 to n-1. It's a bit more complicated if you have a different index. See https://pandas.pydata.org/pandas-docs/stable/whatsnew.html#deprecate-ix
Your code is calling df.itertuples, but not using the result. You could fix that using one of Ian Kent's suggestions, or something like this:
for row in df[['AdjClose', '30ma']].itertuples():
    if row[1] > row[2]:  # note: row[0] is the index value
        df.loc[row.Index, 'position'] = 1
    elif row[1] < row[2]:
        df.loc[row.Index, 'position'] = 0
If your columns all had names that were valid Python identifiers, you could use something neater:
for row in df.itertuples():
    if row.AdjClose > row.ma30:
        df.loc[row.Index, 'position'] = 1
    elif row.AdjClose < row.ma30:
        df.loc[row.Index, 'position'] = 0
Note that neither of these will work if the index for df has duplicate values.
You might also be able to use df.apply, like this:
def pos(row):
    if row['AdjClose'] > row['30ma']:
        return 1
    elif row['AdjClose'] < row['30ma']:
        return 0
    else:
        return float('nan')  # undefined?

df['position'] = df.apply(pos, axis=1)
or just
df['position'] = df.apply(lambda row: 1 if row['AdjClose'] > row['30ma'] else 0, axis=1)
This should work even if the index has duplicate values. However, you have to define a value for every row, even the ones where row['AdjClose'] == row['30ma'].
Overall, you're probably best off with Ian Kent's second recommendation.
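One more vectorized option worth mentioning: a sketch using np.select, which keeps the equal case explicit (it assumes numpy is imported):

import numpy as np

df['position'] = np.select(
    [df['AdjClose'] > df['30ma'], df['AdjClose'] < df['30ma']],
    [1, 0],
    default=np.nan,  # undefined when the two values are equal
)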
You're trying to test the truth value of a whole Series at once (similar to if pd.Series([False, True, False]):, where it's not clear what the result should be), so pandas raises that error.
The message suggests you could use any() to return if any value (in this case the one value you're testing) is True.
So maybe something like this:
for i in df.itertuples():
    if (df['AdjClose'] > df['30ma']).any():
        df['position'] = 1
    elif (df['AdjClose'] < df['30ma']).any():
        df['position'] = 0
See these docs for further details: Using If/Truth Statements with pandas.
I am seeking to drop some rows from a DataFrame when two conditions are met in the same row. I have 5 columns; if two columns (code1 and code2) have equal values AND another column (count) is greater than 1, then the row is dropped.
I could alternatively keep the rows that meet the condition:
count == 1 OR (as opposed to AND) df_1.code1 != df_1.code2
In terms of the first idea what I am thinking is:
df_1 = '''drop row if''' [df_1.count == 1 & df_1.code1 == df_1.code2]
Here is what I have so far in terms of the second idea;
df_1 = df_1[df_1.count == 1 or df_1.code1 != df_1.code2]
You can use .loc with a combined condition. Note that count clashes with the DataFrame.count method, so use bracket notation for that column:
df_new = df_1.loc[(df_1['count'] == 1) | (df_1.code1 != df_1.code2), :]
df.drop(df[(df['code1'] == df['code2']) & (df['count'] > 1)].index, inplace=True)
Breaking it into steps:
df[(df['code1'] == df['code2']) & (df['count'] > 1)] returns the subset of rows from df where the value in code1 equals the value in code2 and the value in count is greater than 1.
.index returns the indexes of those rows.
The last step calls df.drop(), which expects the indexes to be dropped from the dataframe; passing inplace=True means we don't have to re-assign with df = df.drop(...).
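Equivalently, you could keep the complement with a single boolean mask instead of dropping by index (a sketch of the same logic):

df = df[~((df['code1'] == df['code2']) & (df['count'] > 1))]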