python - remove text of string

I have a string that looks like this
name = '1/23/20151'
And now I want to remove just the trailing 1, at the end of 2015. So that it becomes
1/23/2015
So I tried this:
sep = '2015'
name = name.split(sep, 1)[0]
But this removes the 2015 as well. I want the 2015 to stay; how can I do this?
Thanks for the help in advance.
EDIT
Sorry, I didn't fully explain the problem. I have two kinds of strings: the one mentioned above and a normal date such as '1/22/2015'. I loop through them and only want to remove the extra character if it is there, which is why name = name[:-1] doesn't work.

name = name.rstrip('1')
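removes only trailing '1' characters. Note that it strips every trailing '1', so a date whose year genuinely ends in 1 (for example 2011) would also be shortened.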
name = '1/23/20151'
name = name.rstrip('1') # 1/23/2015
'1/23/2015'.rstrip('1') # 1/23/2015
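A minimal sketch of a stricter alternative, assuming the dates always have the form month/day/year. It drops the last character only when the year part has five digits, so a year that really ends in 1 is left alone. The helper name strip_extra_digit is made up for illustration.

```python
def strip_extra_digit(date_text):
    """Drop one trailing '1' only when the year part has five digits."""
    month, day, year = date_text.split('/')
    if len(year) == 5 and year.endswith('1'):
        year = year[:-1]
    return '/'.join([month, day, year])


print(strip_extra_digit('1/23/20151'))  # 1/23/2015
print(strip_extra_digit('1/22/2015'))   # 1/22/2015
print(strip_extra_digit('1/23/2011'))   # 1/23/2011
print('1/23/2011'.rstrip('1'))          # 1/23/20 (rstrip removes both trailing 1s)
```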

Just do this:
name = name[:-1]
That should do it.
If you only want to remove the fifth digit after the year, I'd do this:
name = name.split('/')
name = '/'.join([name[0],name[1],name[2][:4]])

String slicing can easily accomplish this:
>>> name[:-1]
'1/23/2015'

Related

Loop substrings to new column

I am working on a dataset that looks somewhat like this (using python and pandas):
date text
0 Jul 31 2020 Sentence Numero Uno #cool
1 Jul 31 2020 Second sentence
2 Jul 31 2020 Test sentence 3 #thanks
So I use this bit of code I found online to remove the hashtags like #cool and #thanks, as well as make everything lowercase.
for i in range(df.shape[0]):
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
That works. However, I now don't want to delete the hashtags completely but save them in an extra column like this:
date text hashtags
0 Jul 31 2020 sentence numero uno #cool
1 Jul 31 2020 second sentence
2 Jul 31 2020 test sentence 3 #thanks
Can anyone help me with that? How can I do that?
Thanks in advance.
Edit: As some strings contain multiple hashtags, they should be stored in the hashtags column as a list.
One possible way to go about this would be the following:
df['hashtag'] = ''
for i in range(len(df)):
    df['hashtag'][i] = ' '.join(re.findall("(#[A-Za-z0-9]+)", df['text'][i]))
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
So, first you create an empty string column called hashtag. Then, on every pass through the rows, you first extract all hashtags that exist in the text into the new column (re.findall returns every match, including repeats). If none exist, you end up with an empty string (you can change that to something else if you like). Then you replace each hashtag with a space, as you were already doing before.
If some texts contain more than one hashtag, then depending on how you want to use the hashtags later, it could be easier to store them as a list instead of using " ".join(...). To store them as a list, replace the third line of the code above with:
df['hashtag'][i] = re.findall("(#[A-Za-z0-9]+)", df['text'][i])
which just returns a list of hashtags.
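A minimal sketch of the same two steps as a plain function, so the logic can be checked without a DataFrame. The helper name split_hashtags and the sample rows are assumptions for illustration; with pandas, the function could be applied to each value of the text column.

```python
import re

# Same pattern as in the answer above: '#' followed by letters or digits.
HASHTAG = re.compile(r"#[A-Za-z0-9]+")


def split_hashtags(text):
    """Return (cleaned lowercase text, list of hashtags) for one string."""
    tags = HASHTAG.findall(text)          # collect the hashtags first
    without_tags = HASHTAG.sub(" ", text) # then remove them from the text
    cleaned = " ".join(without_tags.split()).lower()
    return cleaned, tags


rows = ["Sentence Numero Uno #cool", "Second sentence", "Test sentence 3 #thanks"]
for row in rows:
    print(split_hashtags(row))
```

The order matters: the hashtags are extracted before the text is cleaned, otherwise there is nothing left to find.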
Use Series.str.findall with Series.str.join:
df['hashtags'] = df['text'].str.lower().str.findall(r"(\#[A-Za-z0-9]+)").str.join(' ')
You can use this string method of pandas:
pattern = r"(\#[A-Za-z0-9]+)"
df['text'].str.extract(pattern, expand=True)
If your string contains multiple matches, you should use str.extractall:
df['text'].str.extractall(pattern)
I added a few lines to your code. The hashtags have to be collected before they are removed from the text:
df['hashtags'] = ''
for i in range(df.shape[0]):
    words = df['text'][i].split()
    tags = [w for w in words if w.startswith('#')]
    if tags:
        df['hashtags'][i] = ' '.join(tags)
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
Use newdf = pd.DataFrame(df.text.str.split('#', n=1).tolist(), columns=['text', 'hashtags']) instead of your for-loop. This creates a new DataFrame. Then you can set df['text'] = newdf['text'] and df['hashtags'] = newdf['hashtags'].

How to replace string and exclude certain changing integers?

I am trying to replace
'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
with:
'AMAT 10Q Filing Section: Risk'
However, everything up until Filing Section: Risk will be constantly changing, except for positioning. I just want to pull the characters from position 0 to 5 and from 15 through 19.
df['section'] = df['section'].str.replace(
I'd like to manipulate this but am not sure how.
Any help is much appreciated!
Given your series as s
s.str.slice(0, 5) + s.str.slice(15, 19) # if substring-ing
s.str.replace(r'\d{5}', '', regex=True) # for a 5-length digit string
You may need to adjust your numbers to index properly. If that doesn't work, you probably want to use a regular expression to get rid of some length of numbers (as above, with the example of 5).
Or in a single line to produce the final output you have above:
s.str.replace(r'\d{10}_|\d{8}_', '', regex=True).str.replace('_', ' ')
Though, it might not be wise to replace the underscores. Instead, if they change, explode the data into various columns which can be worked on separately.
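A minimal sketch of the "split into separate fields" idea on a single string, assuming every value has exactly five underscore-separated parts. The field names (ticker, identifier, form_type, filing_date, section) are guesses made up for illustration.

```python
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'

# Split on the first four underscores only, so any underscore that might
# appear inside the section title is left alone.
ticker, identifier, form_type, filing_date, section = s.split('_', 4)

label = f'{ticker} {form_type} {section}'
print(label)
```

Keeping the parts as separate values means the numeric fields stay available for later work instead of being thrown away.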
If you want to replace a fixed-length span of characters at a fixed position, use str.slice_replace:
df['section'] = df['section'].str.slice_replace(6, 14, ' ')
Other people would probably use regex to replace pieces in your string. However, I would:
1. Split the string
2. Append each piece if it isn't a number
3. Join the remaining pieces
Like so:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
n = []
for i in s.split('_'):
    try:
        i = int(i)
    except ValueError:
        n.append(i)
print(' '.join(n))
AMAT 10Q Filing Section: Risk
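The same split, filter, join idea can be sketched in one expression. This version uses str.isdigit() instead of catching ValueError, which is an assumption that the numeric pieces consist only of digits (int() would also accept values such as '-5').

```python
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'

# Keep every underscore-separated piece that is not made up purely of digits.
kept = [piece for piece in s.split('_') if not piece.isdigit()]
result = ' '.join(kept)
print(result)
```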
Edit:
Re-reading your question, if you are just looking to substring:
Grabbing the first 4 characters:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
print(s[:4]) # indices 0 to 3, the first 4 characters: 'AMAT'
print(s[15:19]) # indices 15 to 18: '_10Q'
print(s[15:]) # index 15 to the end
If you would like to just replace pieces:
print(s.replace('_', ' '))
you could throw this in one line as well:
print((s[:4] + s[15:19] + s[28:]).replace('_', ' '))
'AMAT 10Q Filing Section: Risk'

How to remove - from values in a field- python or pyspark

I have a field that looks like
field1
231-206-2222
231-206-2344
231-206-1111
231-206-1111
I tried regexing it but to no avail. I am new to this so any ideas would help. Any suggestions?
It seems like a dataframe to me, if so try this:
df['field1'].apply(lambda x: x.replace("-",""))
There are many ways of doing it.
Demo:
import re
import pandas as pd
1) # re.sub replaces each hyphen with an empty string
df = pd.DataFrame({'field1': ['123-456-999', '333-222-111']})
df['field1'] = df['field1'].apply(lambda x: re.sub(r'-', '', x))
2) # \D+ matches one or more non-digits, which are removed (regex=True is needed in pandas 2.0 and later, where str.replace matches literally by default)
df['field1'] = df['field1'].str.replace(r'\D+', '', regex=True)
3) # replace each hyphen with an empty string
df['field1'] = df['field1'].str.replace('-', '')
Result:
field1
0 123456999
1 333222111
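For comparison, a small plain-Python sketch of the same two ideas on a list of strings, with no DataFrame involved. The sample values are taken from the question; the variable names are made up for illustration.

```python
import re

numbers = ['231-206-2222', '231-206-2344']

# Literal replacement: remove every hyphen.
by_replace = [n.replace('-', '') for n in numbers]

# Regex replacement: remove every run of non-digit characters.
by_regex = [re.sub(r'\D+', '', n) for n in numbers]

print(by_replace)
print(by_regex)
```

The regex version also removes spaces, dots, or parentheses, which may or may not be wanted depending on the data.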

Replace string is not changing value

I am trying to replace any i's in a string with capital I's. I have the following code:
str.replace('i ','I ')
However, it does not replace anything in the string. I am looking to include a space after the I to differentiate between any I's in words and out of words.
Thanks if you can provide help!
The exact code is:
new = old.replace('i ','I ')
new = old.replace('-i-','-I-')
new = old.replace('i ','I ')
new = old.replace('-i-','-I-')
You throw away the first new when you assign the result of the second operation over it.
Either do
new = old.replace('i ','I ')
new = new.replace('-i-','-I-')
or
new = old.replace('i ','I ').replace('-i-','-I-')
or use regex.
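A short sketch of why the chaining matters, using a made-up sample string: str.replace returns a new string and never changes the original, so each step has to start from the previous result.

```python
old = 'i think -i- am'

step_one = old.replace('i ', 'I ')         # first replacement
new = step_one.replace('-i-', '-I-')       # second replacement builds on the first

print(old)  # the original string is unchanged
print(new)

# A plain 'i ' pattern also matches the end of ordinary words:
side_effect = 'hi there'.replace('i ', 'I ')
print(side_effect)
```

The last line shows a limitation of matching 'i ' literally, which is one reason a regex with word boundaries may be a better fit.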
I think you need something like this.
>>> import re
>>> s = "i am what i am, indeed."
>>> re.sub(r'\bi\b', 'I', s)
'I am what I am, indeed.'
This replaces only a bare 'i' with 'I'; any 'i' that is part of another word is left untouched.
For your example from comments, you may need something like this:
>>> s = 'i am sam\nsam I am\nThat Sam-i-am! indeed'
>>> re.sub(r'\b(-?)i(-?)\b', r'\1I\2', s)
'I am sam\nsam I am\nThat Sam-I-am! indeed'

Any particular way to strip away multiple words from particular text?

Here is a snippet of the code I wrote:
url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-01-19")
content = url.read()
soup = BeautifulSoup(content)
def yahooscores():
    for table in soup.find_all('table', class_='player-title'):
        for row in table.find_all('tr'):
            date = None
            for cell in row.find_all('td', class_='yspsctnhdln'):
                for text in cell:
                    date = cell.text
            if date is not None:
                print ('%s' % (date) + ", 2013:")
I was trying to strip the words "Scores & Schedule" from the date part of the website, but I could not manage it with the .split() and .strip() methods.
So, let me explain what I wish to do, with the above website as an example.
So far, this is what comes out for a date:
Scores & Schedule: Jan 19, 2013:
I just want this:
Jan 19, 2013:
Is there anything in particular I need to know in order to strip those 3 words?
The actual content of cell.text is:
'\nScores & Schedule: Jan 19\n'
... so it makes more sense to get what you need out of that (the last two words) first, and then add ', 2013:' to it, as I think you're trying to do already. A handy feature of split() called with no arguments is that it splits on any run of whitespace and ignores leading and trailing whitespace, so probably the most robust way to get what you want is to change your last line to:
print(' '.join(date.split()[-2:]) + ', 2013:')
This splits date into a list of words with .split(), then uses [-2:] to get the last two words in the list, then joins them back together with a space using ' '.join(...), and finally adds ', 2013:' to the end before printing the result.
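The steps described above can be sketched one at a time, using the cell text quoted earlier as the input. The intermediate variable names are made up for illustration.

```python
date = '\nScores & Schedule: Jan 19\n'

words = date.split()           # ['Scores', '&', 'Schedule:', 'Jan', '19']
last_two = words[-2:]          # ['Jan', '19']
result = ' '.join(last_two) + ', 2013:'

print(result)  # Jan 19, 2013:
```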
As a side note, '%s' % (date) in your original version does absolutely nothing: all you're doing is replacing date with itself. It might be worth familiarising yourself with the documentation on percent-formatting so that you understand why.
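A small sketch of that point, with a made-up value for date: formatting a string with a bare '%s' gives back the same string, whereas putting the fixed text inside the format string does some actual work.

```python
date = 'Jan 19'

same = '%s' % (date)             # (date) is just date in parentheses, not a tuple
labelled = '%s, 2013:' % date    # the format string supplies the extra text

print(same == date)  # True
print(labelled)      # Jan 19, 2013:
```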
Keeping it simple:
>>> s = "Scores & Schedule: Jan 19, 2013:"
>>> s.replace("Scores & Schedule:", "")
' Jan 19, 2013:'
date = "Scores & Schedule: Jan 19, 2013:"
There are many options:
date = date[19:]
date = date.replace("Scores & Schedule: ", "")
date = date.split(":")[1].strip()+":"
to name a few.
Just replace the unwanted part with an empty string.
>>> "Scores & Schedule: Jan 19, 2013:".replace("Scores & Schedule:", "")
' Jan 19, 2013:'
How about:
print(date[20:].strip('\n') + ', 2013')
This assumes that there will ALWAYS be 'Scores & Schedule: ' in the response.
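If that assumption might not always hold, a hedged variant is to check for the prefix before cutting it off. The helper name drop_prefix is hypothetical; on Python 3.9 and later, str.removeprefix does the same job.

```python
PREFIX = 'Scores & Schedule: '


def drop_prefix(text):
    """Remove the scoreboard prefix if present, otherwise return the text as is."""
    text = text.strip()  # drop the surrounding newlines first
    if text.startswith(PREFIX):
        return text[len(PREFIX):]
    return text


print(drop_prefix('\nScores & Schedule: Jan 19\n') + ', 2013:')  # Jan 19, 2013:
print(drop_prefix('Jan 20') + ', 2013:')                         # Jan 20, 2013:
```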
