I am trying to modify the formatting of the strings in a DataFrame column according to a condition.
Here is an example of the file, loaded as a DataFrame.
Now, as you might see, the object column values start with either http or a capital letter. I want to make it so that:
if the string starts with http, I put it between <>
if the string starts with a capital letter, I wrap it in double quotes and append '#en', i.e. "string"#en
However, I can't seem to be able to do so: I tried a simple if condition with .startswith('http') or .str.contains('http'), but it doesn't work, because I understand it actually returns a Series of booleans instead of a single condition.
Maybe it is very simple, but I cannot solve it; any help is appreciated.
Here is my code:
import numpy as np
import pandas as pd
import re
ont1 = pd.read_csv('1.tsv',sep='\t',names=['subject','predicate','object'])
ont1['subject'] = '<' + ont1['subject'] + '>'
ont1['predicate'] = '<' + ont1['predicate'] + '>'
So it looks like you have many of the right pieces here. You mentioned boolean indexing, which is exactly what you can use to select and update certain rows. For example, I'll do this on a dummy DataFrame:
df = pd.DataFrame({"a":["http://akjsdhka", "Helloall", "http://asdffa", "Bignames", "nonetodohere"]})
First we can find rows starting with "http":
mask = df["a"].str.startswith("http")
df.loc[mask, "a"] = "<" + df["a"] + ">"
Then we update the rows where that mask is true, and the same for the other condition:
mask2 = df["a"].str[0].str.isupper()
df.loc[mask2, "a"] = "\"" + df["a"] + "\"#en"
Final result:
a
0 <http://akjsdhka>
1 "Helloall"#en
2 <http://asdffa>
3 "Bignames"#en
4 nonetodohere
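If you'd rather apply both rules in one pass, Series.mask chains nicely; here is a minimal sketch on the same dummy frame (both masks are computed on the raw column, so the two rules can't interfere):
a = df["a"]  # the raw column, before any rewrites
df["a"] = (a.mask(a.str.startswith("http"), "<" + a + ">")
            .mask(a.str[0].str.isupper(), '"' + a + '"#en'))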
Try:
ont1.loc[ont1['subject'].str.startswith("http"), 'subject'] = "<" + ont1['subject'] + ">"
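The same pattern should cover the capital-letter case on the object column (a sketch along the same lines, not tested against your file):
ont1.loc[ont1['object'].str[0].str.isupper(), 'object'] = '"' + ont1['object'] + '"#en'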
Ref to read:
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
I have a pandas dataframe, where the 2nd, 3rd and 6th columns look like so:
 start     end  strand
108286  108361       +
734546  734621       -
761233  761309       +
I'm trying to implement a conditional where, if strand is +, the value in end becomes the value in start plus 1, and if strand is -, the value in start becomes the value in end minus 1, so the output should look like this:
 start     end  strand
108286  108287       +
734620  734621       -
761233  761234       +
And where the pseudocode may look like this:
if df["strand"] == "+":
df["end"] = df["start"] + 1
else:
df["start"] = df["end"] - 1
I imagine this might be best done with loc/iloc or numpy.where, but I can't seem to get it to work. As always, any help is appreciated!
You are correct: loc is the operator you are looking for.
# where strand is '+', set end to start + 1
df.loc[df.strand=='+','end'] = df.loc[df.strand=='+','start']+1
# where strand is '-', set start to end - 1
df.loc[df.strand=='-','start'] = df.loc[df.strand=='-','end']-1
You could also use numpy.where:
import numpy as np
# the (n, 1) condition broadcasts across both columns:
# '-' rows become [end-1, end], '+' rows become [start, start+1]
df[['start', 'end']] = np.where(df[['strand']]=='-', df[['end','end']]-[1,0], df[['start','start']]+[0,1])
Note that this assumes strand can have one of two values: + or -. If it can have any other values, we can use numpy.select instead.
Output:
start end strand
0 108286 108287 +
1 734620 734621 -
2 761233 761234 +
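Since strand could in principle hold other values, here is a minimal numpy.select sketch of that idea (assuming any unrecognized value should leave the row unchanged):
conditions = [df['strand'] == '+', df['strand'] == '-']
df['end'] = np.select(conditions, [df['start'] + 1, df['end']], default=df['end'])
df['start'] = np.select(conditions, [df['start'], df['end'] - 1], default=df['start'])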
I am trying to add prefixes to URLs in my 'Website' column. I can't figure out how to keep each new iteration of the helper column from overwriting everything from the previous one.
For example, say I have the following URLs in my column:
http://www.bakkersfinedrycleaning.com/
www.cbgi.org
barstoolsand.com
This would be the desired end state:
http://www.bakkersfinedrycleaning.com/
http://www.cbgi.org
http://www.barstoolsand.com
This is as close as I have been able to get:
def nan_to_zeros(df, col):
    new_col = f"nanreplace{col}"
    df[new_col] = df[col].fillna('~')
    return df
df1 = nan_to_zeros(df1, 'Website')
df1['url_helper'] = df1.loc[~df1['nanreplaceWebsite'].str.startswith('http')| ~df1['nanreplaceWebsite'].str.startswith('www'), 'url_helper'] = 'https://www.'
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('http'), 'url_helper'] = ""
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('www'),'url_helper'] = 'www'
print(df1[['nanreplaceWebsite',"url_helper"]])
which just gives me a helper column of all www because the last iteration overwrites all fields.
Any direction appreciated.
Data:
{'Website': ['http://www.bakkersfinedrycleaning.com/',
'www.cbgi.org', 'barstoolsand.com']}
IIUC, there are 3 things to fix here:
df1['url_helper'] = shouldn't be there
| should be & in the first condition because 'https://www.' should be added to URLs that start with neither of the strings in the condition. The error will become apparent if we check the first condition after the other two conditions.
The last condition should add "http://" instead of "www" (a corrected sketch follows below).
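Putting those three fixes together, a corrected sketch of your loc approach (checking the "neither" condition last, as noted above):
starts_http = df1['nanreplaceWebsite'].str.startswith('http')
starts_www = df1['nanreplaceWebsite'].str.startswith('www')
df1.loc[starts_http, 'url_helper'] = ''
df1.loc[starts_www, 'url_helper'] = 'http://'
df1.loc[~starts_http & ~starts_www, 'url_helper'] = 'https://www.'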
Alternatively, your problem could be solved using np.select. Pass in the conditions list and its corresponding choices list, and the values will be assigned accordingly:
import numpy as np
s = df1['Website'].fillna('~')
df1['fixed Website'] = np.select(
    [~(s.str.startswith('http') | ~s.str.contains('www')),  # contains 'www' but lacks the 'http' scheme
     ~(s.str.startswith('http') | s.str.contains('www'))],  # has neither 'http' nor 'www'
    ['http://' + s, 'http://www.' + s],
    s)  # default: already fine, leave as is
Output:
Website fixed Website
0 http://www.bakkersfinedrycleaning.com/ http://www.bakkersfinedrycleaning.com/
1 www.cbgi.org http://www.cbgi.org
2 barstoolsand.com http://www.barstoolsand.com
I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: remove "&FolderCTID" and everything after it
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I did search on Google and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows in columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
First, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reapply the output to your original df.
cols_to_slice = ['ColumnA','ColumnB','ColumnC','ColumnD']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
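As a quick check of the shape round-trip on a toy frame (hypothetical column names and values, just for illustration):
import pandas as pd
df = pd.DataFrame({'ColumnA': ['abc&FolderCTIDxyz'], 'ColumnB': ['def&FolderCTIDuvw']})
out = df[['ColumnA', 'ColumnB']].stack().str.split('&FolderCTID', expand=True)[0].unstack(1)
# out keeps 'abc' and 'def' in the original two-column layout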
You could first get the index of the string using:
# this points just past the end of "&FolderCTID", so the string itself is kept in the result
indexes = len(string1) + df_MasterData[i].str.find(string1)
# if you don't want to keep the string in the result, use this one instead
indexes = df_MasterData[i].str.find(string1)
Now slice each value up to its index (.str[:n] doesn't accept a per-row Series, so pair them up):
df_MasterData[i] = [s[:n] for s, n in zip(df_MasterData[i], indexes)]
This question might have other answers but I could not figure out how to apply them on my current code.
I have to iterate through the DataFrame and modify certain column values as shown below:
NOTE: All of the columns are strings. The ones ending in _Length contain the integer length of the corresponding string columns.
for col in range(0, 200):
    if df['Partial_Input_Length'][col] < 50:
        df['Full_Input'][col] = df['Partial_Input'][col] + " " + df['Input5'][col] + " " + df['Input6'][col]
    else:
        df['Full_Input'][col] = df['Partial_Input'][col]
I used this on a testing DataFrame containing only 200 rows. If I use for col in range(0, 80000): on the 80k-row DataFrame, it takes a huge amount of time until every operation is done.
I also tried out with itertuples() in this way:
for col in df.itertuples():
    if col.Partial_Input_Length < 50:
        col.Full_Input = col.Partial_Input + " " + col.Input5 + " " + col.Input6
    else:
        col.Full_Input = col.Partial_Input
But after running it, I get the following error.
File "", line 23, in
col.Full_Input = col.Partial_Input + " " + col.Input5 + " " + col.Input6
AttributeError: can't set attribute
Moreover, I tried with iterrows() like this:
for index, col in df.iterrows():
    if df['Partial_Input_Length'][index] < 50:
        df['Full_Input'][index] = df['Partial_Input'][index] + " " + df['Input5'][index] + " " + df['Input6'][index]
    else:
        df['Full_Input'][index] = df['Partial_Input'][index]
But the code above takes a huge amount of time as well.
Is it normal that these iterations take so long on a big dataframe, or am I doing something wrong?
I am quite a newbie when it comes to iterating in Python. Which method gives the quickest iteration time for what I am trying to do?
You can do it without looping:
df['Full_Input'] = df['Partial_Input'].str.cat(df['Input5'], sep=" ").str.cat(df['Input6'], sep=" ")
df['Full_Input'] = np.where(df['Partial_Input_Length'].astype(int) < 50, df['Full_Input'], df['Partial_Input'])
First of all, you should not be modifying the elements that you are iterating over.
Almost all iter* functions in pandas return read-only items, so setting anything on them will not work.
To do what you want, I use apply, or run a loop that calls a function returning a dict with the changes to be made, and then either remake the entire dataframe or do a merge.
Something like:
# if your modification is simple enough, a plain apply will also work
df['new_col'] = df.apply(lambda x: f'{x.startDate.year}-{x.startDate.week}', axis=1)

# if you want to do something more complex with all the items in the row
def foo(row):
    def modification_code(item):
        # ... whatever transformation you need ...
        return modified_item
    return {
        'primary_key': row.primary_key,
        'modified_data': modification_code(row.item)
    }

modified_data = [foo(row) for row in df.itertuples()]

# sometimes this may be sufficient
new_df = pd.DataFrame(modified_data)
# alternatively, you can do a merge with the original data
new_df = pd.merge(df, new_df, how='left', on='primary_key')
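To make the pattern concrete, a toy run (hypothetical primary_key and item columns, squaring each item just for illustration):
import pandas as pd
df = pd.DataFrame({'primary_key': [1, 2], 'item': [3, 4]})
def foo(row):
    # return the key plus the modified value for this row
    return {'primary_key': row.primary_key, 'modified_data': row.item ** 2}
modified = pd.DataFrame([foo(row) for row in df.itertuples()])
result = pd.merge(df, modified, on='primary_key')  # item and modified_data side by side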
Let's say I have a dataframe with value strings that look like:
[26.07. - 08.09.]
and I want to add '2018' after the last '.' of each date, such that my output will be:
[26.07.2018 - 08.09.2018]
and apply this to the rest of the dataframe, which basically has the same format.
So far I have this code:
df.iloc[:,1].replace('.','2018',regex=True)
How can I change my code so that it works as I desire?
I am doing this so that eventually I will be able to transform these into dates and count how many days there are between the two dates.
a = '[26.07. - 08.09.]'
aWithYear = [i[:-1]+'2018'+i[-1] for i in a.split('-')]
print('-'.join(aWithYear))
# prints [26.07.2018 - 08.09.2018]
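To run that over the whole column rather than a single string, it can be wrapped in an apply (a sketch, assuming the column is named col as in the next answer):
df['col'] = df['col'].apply(lambda a: '-'.join(i[:-1] + '2018' + i[-1] for i in a.split('-')))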
If you have, for example,
df = pd.DataFrame({'col': ['[05.07. - 18.08.]', '[05.07. - 18.09.]']})
col
0 [05.07. - 18.08.]
1 [05.07. - 18.09.]
You can split and concat the str.get(0) and str.get(1) values
vals = df.col.str.strip('[]').str.split("- ")
get = lambda s: vals.str.get(s).str.strip() + '2018'
df['col'] = '[' + get(0) + ' - ' + get(1) + ']'
col
0 [05.07.2018 - 18.08.2018]
1 [05.07.2018 - 18.09.2018]
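Since the eventual goal is counting the days between the two dates, a possible follow-up sketch (assuming the dd.mm.YYYY format produced above):
dates = df['col'].str.strip('[]').str.split(' - ', expand=True)
start = pd.to_datetime(dates[0].str.strip(), format='%d.%m.%Y')
end = pd.to_datetime(dates[1].str.strip(), format='%d.%m.%Y')
df['days'] = (end - start).dt.days  # e.g. 44 days for [05.07.2018 - 18.08.2018]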