Any particular way to strip away multiple words from particular text? - python

I'll give a bit of the snippet of code I made. Here it is:
url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-01-19")
content = url.read()
soup = BeautifulSoup(content)

def yahooscores():
    for table in soup.find_all('table', class_='player-title'):
        for row in table.find_all('tr'):
            date = None
            for cell in row.find_all('td', class_='yspsctnhdln'):
                for text in cell:
                    date = cell.text
            if date is not None:
                print ('%s' % (date) + ", 2013:")
I was trying to strip the words "Scores & Schedules" from the date part of the website, but somehow I could not do it with the .split() and .strip() methods.
So, let me explain what I wish to do, with the above website as an example.
So far, this is what comes out for a date:
Scores & Schedule: Jan 19, 2013:
I just want this:
Jan 19, 2013:
Is there anything in particular I need to know in order to strip those 3 words?

The actual content of cell.text is:
'\nScores & Schedule: Jan 19\n'
... so it makes more sense to get what you need out of that (the last two words) first, and then add ', 2013:' to it, as I think you're trying to do already. A handy feature of split() with no arguments is that it ignores leading and trailing whitespace, so probably the most robust way to get what you want is to change your last line to:
print(' '.join(date.split()[-2:]) + ', 2013:')
This splits date into a list of words with .split(), then uses [-2:] to get the last two words in the list, then joins them back together with a space using ' '.join(...), and finally adds ', 2013:' to the end before printing the result.
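To see the intermediate steps on the actual cell.text value, you can try this in the interpreter:
>>> date = '\nScores & Schedule: Jan 19\n'
>>> date.split()
['Scores', '&', 'Schedule:', 'Jan', '19']
>>> date.split()[-2:]
['Jan', '19']
>>> ' '.join(date.split()[-2:]) + ', 2013:'
'Jan 19, 2013:'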
As a side note, '%s' % (date) in your original version does absolutely nothing: all you're doing is replacing date with itself. It might be worth familiarising yourself with the documentation on percent-formatting so that you understand why.
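For comparison, percent-formatting only does something useful when you substitute values into a larger template, e.g. (a hypothetical rewrite of that line, not part of the original code):
>>> '%s, %d:' % ('Jan 19', 2013)
'Jan 19, 2013:'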

Keeping it simple:
>>> s = "Scores & Schedule: Jan 19, 2013:"
>>> s.replace("Scores & Schedule:", "")
' Jan 19, 2013:'

date = "Scores & Schedule: Jan 19, 2013:"
There are many options:
date = date[19:]
date = date.replace("Scores & Schedule: ", "")
date = date.split(":")[1].strip()+":"
to name a few.
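For reference, all three give the same result on the example string:
>>> date = "Scores & Schedule: Jan 19, 2013:"
>>> date[19:]
'Jan 19, 2013:'
>>> date.replace("Scores & Schedule: ", "")
'Jan 19, 2013:'
>>> date.split(":")[1].strip() + ":"
'Jan 19, 2013:'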

Just replace the unwanted part with an empty string.
>>> "Scores & Schedule: Jan 19, 2013:".replace("Scores & Schedule:", "")
' Jan 19, 2013:'

How about:
print(date[20:].strip('\n') + ', 2013')
This assumes that 'Scores & Schedule: ' will ALWAYS be in the response.
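If that assumption might not hold, a defensive variant (just a sketch, reusing the date value from the question) is to check for the prefix before slicing:
prefix = 'Scores & Schedule: '
text = date.strip('\n')
if text.startswith(prefix):
    # only cut the prefix off when it is actually there
    text = text[len(prefix):]
print(text + ', 2013:')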

Related

How to replace and split a simple data string using Python into 2 separate outputs

This is the input data I am dealing with.
November (5th-26th)
What I want to do is get 2 separate outputs from this data string, i.e. {November 5th} and {November 26th}.
I currently use this Python script to remove the unnecessary characters in it:
Name = input_data['date'].replace("(", "").replace(")", "")
output = [{"Date": Name}]
and use other no-code formatting tools (Zapier) to split the data and make the output come out as {November 5th} and {November 26th}.
I would like to know from you guys if there is a single piece of Python code I could use to get the desired output without using other formatting tools.
Thanks
I haven't tried anything with the code yet.
You could do it like this in two steps:
string = "November (5th-26th)"
month, days = [x.strip('()') for x in string.split(' ')]
start, end = days.split('-')
output = {'date_range' : [f"{month} {start}", f"{month} {end}"]}
print(output)
{'date_range': ['November 5th', 'November 26th']}
2nd task:
Similar approach: replace the & (with leading and trailing space) with e.g. a comma, then separate the data by splitting again, similar to before.
string2 = "December 10th & 11th"
string2 = string2.replace(' & ', ',')
month, days = string2.split(' ')
start, end = days.split(',')
output = {'date_range' : [f"{month} {start}", f"{month} {end}"]}
print(output)
{'date_range': ['December 10th', 'December 11th']}
A simple regex will handle that nicely
import re
value = "November (5th-26th)"
month, date_from, date_to = re.match(r"(\w+)\s+\((.*)-(.*)\)", value).groups()
print([month, date_from, date_to]) # ['November', '5th', '26th']
result = [f"{month} {date_from}", f"{month} {date_to}"]
print(result) # ['November 5th', 'November 26th']
A simple tokenisation and split should do the trick:
s = 'November (5th-26th)'
m, d = s.split()
for dt in d[1:-1].split('-'):
    print(f'{{{m} {dt}}}')
Output:
{November 5th}
{November 26th}

Remove transcript timestamps and join the lines to make paragraph

File: Plain Text Document
Content: Youtube timestamped transcript
I can separately remove each line's timestamp:
for count, line in enumerate(content, start=1):
    if count % 2 == 0:
        s = line.replace('\n', '')
        print(s)
I can also join the sentences if I don't remove the timestamps:
with open('file.txt') as f:
    print(" ".join(line.strip() for line in f))
But I attempted to do these together (removing timestamps and joining the lines) in various formats but no right outcome:
with open('Russell Brand Script.txt') as m:
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence = line.replace('\n', ' ')
            print(" ".join(sentence.rstrip('\n')))
I also tried various forms of print(" ".join(sentence.rstrip('\n'))) and print(" ".join(sentence.strip())), but neither gives the result I want.
How can I remove the timestamps and join the sentences to create a paragraph at once?
Whenever you pass a string to .join(), it inserts the separator between every character of that string. You should also note that print(), by default, adds a newline after the string it prints.
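You can see the first issue directly in the interpreter:
>>> " ".join("hello")
'h e l l o'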
To get around this, you can save each modified sentence to a list, and then output the entire paragraph at once at the end using "".join(). This gets around the newline issue described above, and gives you the ability to do additional processing on the paragraph afterwards, if desired.
with open('put_your_filename_here.txt') as m:
    sentences = []
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence = line.replace('\n', '')
            sentences.append(sentence)
    print(' '.join(sentences))
(Made a small edit to the code -- the old version of the code produced a trailing space after the paragraph.)
TL;DR: a copy-paste solution using a list comprehension with if as a filter and a regex to match the timestamps:
' '.join([line.strip() for line in transcript if not re.match(r'\d{2}:\d{2}', line)]).
Explained
Suppose your text input is:
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
Then you can skip the timestamps by matching them with the regex \d{2}:\d{2}, and append every remaining line as a phrase to a list. Trim each phrase using strip(), which removes leading/trailing whitespace. When you finally join all the phrases into a paragraph, use a space as the delimiter:
import re

def to_paragraph(transcript_lines):
    phrases = []
    for line in transcript_lines:
        trimmed = line.strip()
        if trimmed != '' and not re.match(r'\d{2}:\d{2}', trimmed):
            phrases.append(trimmed)
        else:  # TODO: for debug only, remove
            print(line)  # TODO: for debug only, remove
    return " ".join(phrases)
t = '''
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
'''
paragraph = to_paragraph(t.splitlines())
print(paragraph)
with open('put_your_filename_here.txt') as f:
    print(to_paragraph(f.readlines()))
Outputs:
00:00
00:03
00:05
00:07
00:09
merry christmas it's our christmas video to you i already regret this hat but if we got some fantastic content for you a look at the most joyous and wonderful aspects have a very merry year ho ho ho
The result is the same as what youtubetranscript.com returns for the given YouTube video.

Loop substrings to new column

I am working on a dataset that looks somewhat like this (using python and pandas):
          date                       text
0  Jul 31 2020  Sentence Numero Uno #cool
1  Jul 31 2020            Second sentence
2  Jul 31 2020    Test sentence 3 #thanks
So I use this bit of code I found online to remove the Hashtags like #cool #thanks as well as make everything lowercase.
for i in range(df.shape[0]):
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
That works, however I now don't want to delete the hashtags completely but save them in a extra column like this:
          date                 text hashtags
0  Jul 31 2020  sentence numero uno    #cool
1  Jul 31 2020      second sentence
2  Jul 31 2020      test sentence 3  #thanks
Can anyone help me with that? How can I do that?
Thanks in advance.
Edit: As some strings contain multiple hashtags, they should be stored in the hashtag column as a list.
One possible way to go about this would be the following:
df['hashtag'] = ''
for i in range(len(df)):
    df['hashtag'][i] = ' '.join(re.findall("(#[A-Za-z0-9]+)", df['text'][i]))
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
So, first you create an empty string column called hashtag. Then, in every loop through the rows, you first extract any hashtags that exist in the text into the new column. If none exist, you end up with an empty string (you can change that to something else if you like). Then you replace the hashtags with a space, as you were already doing before.
If some texts have more than one hashtag, depending on how you want to use the hashtags later, it could be easier to store them as a list instead of using " ".join(...). So, if you want to store them as a list, you could replace the third line with:
df['hashtag'][i] = re.findall("(#[A-Za-z0-9]+)", df['text'][i])
which just returns a list of hashtags.
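For example, on a made-up row with two hashtags:
>>> import re
>>> re.findall("(#[A-Za-z0-9]+)", "Sentence Numero Uno #cool #thanks")
['#cool', '#thanks']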
Use Series.str.findall with Series.str.join:
df['hashtags'] = df['text'].str.lower().str.findall(r"(\#[A-Za-z0-9]+)").str.join(' ')
You can use this string method of pandas:
pattern = r"(\#[A-Za-z0-9]+)"
df['text'].str.extract(pattern, expand=True)
If your string contains multiple matches, you should use str.extractall:
df['text'].str.extractall(pattern)
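If you need one row per original text again, a possible follow-up (not part of the original answer) is to group the extractall result by the first index level and collect the matches into lists:
matches = df['text'].str.extractall(pattern)
df['hashtags'] = matches.groupby(level=0)[0].agg(list)  # rows with no hashtag become NaN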
I added a couple of lines to your loop (the hashtags have to be collected before the text is cleaned); it should work:
df['hashtags'] = ''
for i in range(df.shape[0]):
    l = df['text'][i].split()
    s = [k for k in l if k[0] == '#']
    if len(s) >= 1:
        df['hashtags'][i] = ' '.join(s)
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
Use newdf = pd.DataFrame(df['text'].str.split('#', n=1).tolist(), columns=['text', 'hashtags']) instead of your for-loop. This will create a new DataFrame. Then you can set df['text'] = newdf['text'] and df['hashtags'] = newdf['hashtags'].
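Applied to the sample dataframe from the question, that might look like the sketch below (with the lowercasing from the question added); note it only captures one hashtag per row, drops the '#' itself, and leaves None where there is no hashtag:
import pandas as pd

# the sample data from the question
df = pd.DataFrame({'date': ['Jul 31 2020'] * 3,
                   'text': ['Sentence Numero Uno #cool',
                            'Second sentence',
                            'Test sentence 3 #thanks']})

# split each text on the first '#' into two columns
newdf = pd.DataFrame(df['text'].str.split('#', n=1).tolist(),
                     columns=['text', 'hashtags'])
df['text'] = newdf['text'].str.strip().str.lower()
df['hashtags'] = newdf['hashtags']
print(df)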

How to replace string and exclude certain changing integers?

I am trying to replace
'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
with:
'AMAT 10Q Filing Section: Risk'
However, everything up until Filing Section: Risk will be constantly changing, except for positioning. I just want to pull the characters from position 0 to 5 and from 15 through 19.
df['section'] = df['section'].str.replace(
I'd like to manipulate this but not sure how?
Any help is much appreciated!
Given your series as s
s.str.slice(0, 5) + s.str.slice(15, 19) # if substring-ing
s.str.replace(r'\d{5}', '') # for a 5-length digit string
You may need to adjust your numbers to index properly. If that doesn't work, you probably want to use a regular expression to get rid of some length of numbers (as above, with the example of 5).
Or in a single line to produce the final output you have above:
s.str.replace(r'\d{10}_|\d{8}_', '').str.replace('_', ' ')
Though, it might not be wise to replace the underscores. Instead, if they change, explode the data into various columns which can be worked on separately.
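A sketch of that idea (the column names here are made up), assuming every value has the same five underscore-separated fields:
# split the underscored fields into separate columns, then keep the ones you need
parts = s.str.split('_', expand=True)
parts.columns = ['ticker', 'cik', 'form', 'filed', 'section']
print(parts['ticker'] + ' ' + parts['form'] + ' ' + parts['section'])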
If you want to replace a fixed length/position of chars, use str.slice_replace to replace:
df['section'] = df['section'].str.slice_replace(6, 14, ' ')
Other people would probably use regex to replace pieces in your string. However, I would:
Split the string
append the piece if it isn't a number
Join the remaining data
Like so:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
n = []
for i in s.split('_'):
    try:
        i = int(i)
    except ValueError:
        n.append(i)
print(' '.join(n))
AMAT 10Q Filing Section: Risk
Edit:
Re-reading your question, if you are just looking to substring:
Grabbing pieces by index:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
print(s[:4])    # indices 0 to 3 == the first 4 characters, 'AMAT'
print(s[15:19]) # indices 15 to 18, '_10Q'
print(s[15:])   # index 15 to the end
If you would like to just replace pieces:
print(s.replace('_', ' '))
you could throw this in one line as well:
print((s[:4] + s[15:19] + s[28:]).replace('_', ' '))
'AMAT 10Q Filing Section: Risk'

Find values using regex (includes brackets)

It's my first time with regex and I have some issues which I hope you will help me resolve. Let's take an example of the data:
chartData.push({
    date: newDate,
    visits: 9710,
    color: "#016b92",
    description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
    2007,
    10,
    1 );
What I want to retrieve is the date, which is in the last set of brackets, and the corresponding description. I have no idea how to do it with one regex, thus I decided to split it into two.
First part:
I retrieve the value after description:. This was managed with the following pattern: [\n\r].*description:\s*([^\n\r]*). The output gives me the result with quotes, "9710", but I can fairly say that it's alright and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in the brackets after the text newDate.setFullYear. Unfortunately, all I have managed so far is to get the values inside any brackets. For that, I used the pattern \(([^)]*)\). The result is that it picks up all 3 bracket pairs in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for regex which would allow me to construct a pattern that retrieves the data in brackets only after the specific text.
I could, of course, pick every 3rd result, but unfortunately that doesn't work for the whole dataset.
Does anyone of you know the way how to resolve the second part issue?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the values in group 1 you could match everything that comes before the point where you want to capture your values in group 2.
What you could do is repeatedly match all following lines that do not start with newDate.setFullYear(.
Then, when you do encounter that value, match it and capture everything except parentheses in group 2.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
            "date: newDate,\n"
            "visits: 9710,\n"
            "color: \"#016b92\",\n"
            "description: \"9710\"\n"
            "});\n"
            "var newDate = new Date();\n"
            "newDate.setFullYear(\n"
            "2007,\n"
            "10,\n"
            "1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
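A possible way to apply that pattern (a sketch, reusing the same test string as above) is with the third-party regex package, which unlike re supports the \G anchor:
import regex  # third-party: pip install regex

test_str = ("chartData.push({\n"
            "date: newDate,\n"
            "visits: 9710,\n"
            "color: \"#016b92\",\n"
            "description: \"9710\"\n"
            "});\n"
            "var newDate = new Date();\n"
            "newDate.setFullYear(\n"
            "2007,\n"
            "10,\n"
            "1 );")

# the pattern from above, split over several lines for readability
pattern = (r'(?:\r?\ndescription: "([^"]+)"'
           r'(?:\r?\n(?!newDate\.setFullYear\().*)*'
           r'\r?\nnewDate\.setFullYear\(|\G)'
           r'\r?\n(\d+),?(?=[^()]*\);)')

print(regex.findall(pattern, test_str))
# should give [('9710', '2007'), ('', '10'), ('', '1')]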
