conditionally replacing values in a column - python

I have a pandas dataframe, where the 2nd, 3rd and 6th columns look like so:
start     end     strand
108286    108361  +
734546    734621  -
761233    761309  +
I'm trying to implement a conditional where, if strand is +, the value in end becomes the corresponding value in start plus 1, and if strand is -, the value in start becomes the value in end minus 1, so the output should look like this:
start     end     strand
108286    108287  +
734620    734621  -
761233    761234  +
And where the pseudocode may look like this:
if df["strand"] == "+":
    df["end"] = df["start"] + 1
else:
    df["start"] = df["end"] - 1
I imagine this might be best done with loc/iloc or numpy.where, but I can't seem to get it to work. As always, any help is appreciated!

You are correct; loc is the operator you are looking for:
df.loc[df.strand=='+','end'] = df.loc[df.strand=='+','start']+1
df.loc[df.strand=='-','start'] = df.loc[df.strand=='-','end']-1

You could also use numpy.where:
import numpy as np
df[['start', 'end']] = np.where(df[['strand']]=='-', df[['end','end']]-[1,0], df[['start','start']]+[0,1])
Note that this assumes strand can have one of two values: + or -. If it can have any other values, we can use numpy.select instead.
Output:
start end strand
0 108286 108287 +
1 734620 734621 -
2 761233 761234 +
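For completeness, here is a minimal sketch of that np.select variant (assuming, as noted above, that any strand value other than '+' or '-' should leave the row unchanged):
import numpy as np

# '+' and '-' are handled explicitly; any other strand value keeps the existing values
df['end'] = np.select([df['strand'] == '+'], [df['start'] + 1], default=df['end'])
df['start'] = np.select([df['strand'] == '-'], [df['end'] - 1], default=df['start'])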

Related

Python Pandas Calculation of a Column with conditions

I'm trying to make a calculation on 2 columns using Python pandas. The use case is like this:
I have values like 100, 101, 102, ... in the column "hesapKodu1". I have split this column into 3 new columns based on its first characters: "hesapKodu1_1" is the first character of "hesapKodu1", so it is "1"; "hesapKodu1_2" is the first 2 characters, so it is "10", "11", and so on.
What I'm trying to do is this:
When hesapKodu1 is 123, 125 or 130, I would like to calculate BORC - ALACAK from the columns BORC and ALACAK.
For all other hesapKodu1 values it should be ALACAK - BORC.
At the end, all of the results should be summed and returned.
Right now my code is like this, and it can only do BORC - ALACAK when hesapKodu1 starts with 1. I cannot find a way to handle the conditions above.
source_df['hesapKodu1_1']=source_df['hesapKodu1'].str[:1]
source_df['hesapKodu1_2']=source_df['hesapKodu1'].str[:2]
source_df['hesapKodu1_3']=source_df['hesapKodu1'].str[:3]
hk1 = round(source_df.loc[source_df['hesapKodu1_1'] == '1', 'BORÇ'].sum() - source_df.loc[source_df['hesapKodu1_1'] == '1', 'ALACAK'].sum(),2)
You can make use of np.where() which would be faster than apply().
import numpy as np
source_df["new_column"] = np.where(
source_df["hesapKodu1"].isin(["123", "125", "130"]),
source_df["BORC"] - source_df["ALACAK"],
source_df["ALACAK"] - source_df["BORC"],
)
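Since the question also asks for all of the results to be summed into a single return value, the new column can then be reduced with sum() (a small follow-up sketch, not part of the original answer):
total = round(source_df["new_column"].sum(), 2)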
You can use apply for that -
def my_func(record):
    # record is a single row; compare its hesapKodu1 value against the target codes
    if record['hesapKodu1'] in ['123', '125', '130']:
        record['new_column'] = record['BORC'] - record['ALACAK']
    else:
        record['new_column'] = record['ALACAK'] - record['BORC']
    return record

target_df = source_df.apply(my_func, axis=1)

How to add a prefix to a string if it ends with a particular character (Pandas) i.e. add '-' to string given it ends with '-'

For a particular column (dtype = object), how can I add '-' to the start of the string, given that it ends with '-'?
i.e. convert 'MAY500-' to '-MAY500-'
(I need to add this to every element in the column)
Try something like this:
# setup
import pandas as pd

df = pd.DataFrame({'col': ['aaaa', 'bbbb-', 'cc-', 'dddddddd-']})

mask = df['col'].str.endswith('-')
df.loc[mask, 'col'] = '-' + df.loc[mask, 'col']
Output
df
col
0 aaaa
1 -bbbb-
2 -cc-
3 -dddddddd-
You can use np.select
Given a dataframe like this:
df
values
0 abcd-
1 a-bcd
2 efg-
You can use np.select as follows:
import numpy as np

df['values'] = np.select([df['values'].str.endswith('-')], ['-' + df['values']], df['values'])
output:
df
values
0 -abcd-
1 a-bcd
2 -efg-
import numpy as np
import pandas as pd

def add_prefix(text):
    # If text were None or an empty string, the [-1] index would raise an IndexError,
    # so check truthiness first
    if text and text[-1] == "-":
        return "-" + text
    return text

df = pd.DataFrame(data={'A': ["MAY500", "MAY500-", "", None, np.nan]})
# Change the column to string dtype first
df['A'] = df['A'].astype(str)
df['A'] = df['A'].apply(add_prefix)
0 MAY500
1 -MAY500-
2
3 None
4 nan
Name: A, dtype: object
I tend to use apply with lambda functions a lot; it just makes the code easier to read.
df['value'] = df['value'].apply(lambda x: '-'+str(x) if str(x).endswith('-') else x)
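For larger frames, a vectorized alternative to the lambda is a sketch along these lines (assuming the column already holds strings), using numpy.where:
import numpy as np

df['value'] = np.where(df['value'].str.endswith('-'), '-' + df['value'], df['value'])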

Modify DataFrame column values based on condition

I am trying to modify the formatting of the strings of a DataFrame column according to a condition.
Here is an example of the file (the DataFrame itself was shown as an image in the original post).
Now, as you might see, the values in the object column start either with http or with a capital letter. I want to make it so that:
if the string starts with http, I put it between < and >
if the string starts with a capital letter, I wrap it in double quotes and append '#en' (i.e. "string"#en)
However, I can't seem to be able to do so: I tried a simple if condition with .startswith('http') or .contains('http'), but it doesn't work, because I understand that it actually returns a Series of booleans instead of a single condition.
Maybe it is very simple, but I cannot solve it; any help is appreciated.
Here is my code
import numpy as np
import pandas as pd
import re
ont1 = pd.read_csv('1.tsv',sep='\t',names=['subject','predicate','object'])
ont1['subject'] = '<' + ont1['subject'] + '>'
ont1['predicate'] = '<' + ont1['predicate'] + '>'
So it looks like you have many of the right pieces here. You mentioned boolean indexing, which is what you can use to select and update certain rows; for example, I'll do this on a dummy DataFrame:
df = pd.DataFrame({"a":["http://akjsdhka", "Helloall", "http://asdffa", "Bignames", "nonetodohere"]})
First we can find the rows starting with "http" and update them where that mask is true:
mask = df["a"].str.startswith("http")
df.loc[mask, "a"] = "<" + df["a"] + ">"
Then we do the same for the other condition:
mask2 = df["a"].str[0].str.isupper()
df.loc[mask2, "a"] = "\"" + df["a"] + "\"#en"
Final result:
a
0 <http://akjsdhka>
1 "Helloall"#en
2 <http://asdffa>
3 "Bignames"#en
4 nonetodohere
Try:
ont1.loc[ont1['subject'].str.startswith("http"), 'subject'] = "<" + ont1['subject'] + ">"
Ref to read:
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
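For reference, both rules from the question could also be combined into a single np.select call on the object column (a sketch, assuming the column contains no missing or empty strings):
import numpy as np

obj = ont1['object']
ont1['object'] = np.select(
    [obj.str.startswith('http'), obj.str[0].str.isupper()],  # conditions, checked in order
    ['<' + obj + '>', '"' + obj + '"#en'],                    # matching replacements
    default=obj,                                              # otherwise keep the value as-is
)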

Counting the repeated values in one column base on other column

Using pandas, I am dealing with CSV data of the following type:
f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win
For this part of the raw data, I was trying to return something like:
Column1_name -- t -- count of nowin = 0
Column1_name -- t -- count of won = 3
Column1_name -- f -- count of nowin = 2
Column1_name -- f -- count of win = 1
Based on this idea (get dataframe row count based on conditions), I was thinking of doing something like this:
print(df[df.target == 'won'].count())
However, this always returns the same number of "won" rows based on the last column, without taking into account whether the first column is an "f" or a "t". In other words, I was hoping to use something in pandas that would behave like a SQL GROUP BY, grouping on, for example, the 1st and last columns.
Should I keep pursuing this idea, or should I simply start using for loops?
If you need, the rest of my code:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
df = pd.read_csv(url,names=[
'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
])
features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']
# number of lines
#tot_of_records = np.size(my_data,0)
#tot_of_records = np.unique(my_data[:,1])
#for item in my_data:
# item[:,0]
num_of_won=0
num_of_nowin=0
for item in df.target:
    if item == 'won':
        num_of_won = num_of_won + 1
    else:
        num_of_nowin = num_of_nowin + 1
print(num_of_won)
print(num_of_nowin)
print(df[df.target == 'won'].count())
#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())
This could work -
outdf = df.apply(lambda x: pd.crosstab(index=df.target,columns=x).to_dict())
Basically, we go over each feature column and build a crosstab against the target column.
Hope this helps! :)
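For a single feature column, the same idea can be spelled out directly, e.g. for the first column (a usage sketch):
print(pd.crosstab(df['bkblk'], df['target']))
# or, using the group-by phrasing from the question:
print(df.groupby(['bkblk', 'target']).size())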

regex for string dates in dataframe

Let's say I have a dataframe with value strings that look like:
[26.07. - 08.09.]
and I want to add '2018' after the last '.' of each date, so that my output will be:
[26.07.2018 - 08.09.2018]
and apply this to the rest of the dataframe, which basically has the same format.
So far I have the code:
df.iloc[:,1].replace('.','2018',regex=True)
How can I change my code so that it works as I desire?
I am doing this so that eventually I will be able to transform these into dates and count how many days there are between the two dates.
a = '[26.07. - 08.09.]'
aWithYear = [i[:-1]+'2018'+i[-1] for i in a.split('-')]
print('-'.join(aWithYear))
# prints [26.07.2018 - 08.09.2018]
If you have, for example,
df = pd.DataFrame({'col': ['[05.07. - 18.08.]', '[05.07. - 18.09.]']})
col
0 [05.07. - 18.08.]
1 [05.07. - 18.09.]
You can split the strings and concatenate the str.get(0) and str.get(1) values:
vals = df.col.str.strip('[]').str.split("- ")
get = lambda s: vals.str.get(s).str.strip() + '2018'
df['col'] = '[' + get(0) + ' - ' + get(1) + ']'
col
0 [05.07.2018 - 18.08.2018]
1 [05.07.2018 - 18.09.2018]
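Since the title mentions regex, the year can also be injected with a single vectorized str.replace that targets every dd.mm. group (a sketch, assuming all dates follow that two-digit pattern):
df['col'] = df['col'].str.replace(r'(\d{2}\.\d{2}\.)', r'\g<1>2018', regex=True)
which turns '[05.07. - 18.08.]' into '[05.07.2018 - 18.08.2018]'.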
