I have the following code:
output = requests.get(url=url, auth=oauth, headers=headers, data=payload)
output_data = output.content
type(output_data)
<class 'bytes'>
output_data
Squeezed Text (3632 Lines)
When looking at the squeezed text, I have some values that look like this:
Steve likes to walk his dog. Steve says to John "I like \n Pineapple, oranges, \n and pizza.\n" and then he went to bed \n.
John likes his beer cold.\n
Sally likes her teeth brushed with a bottle of jack.\n
How can I remove the \n characters, but ONLY if it is contained within double quotes, so that my results look like this:
Steve likes to walk his dog. Steve says to John "I like Pineapple, oranges, and pizza." and then he went to bed \n.
John likes his beer cold.\n
Sally likes her teeth brushed with a bottle of jack.\n
I know how to remove \n characters, but I am not sure how to do this if I only want to remove the values if they are contained within double quotes.
Here is what I have tried:
I found this, and used this code:
my_text = re.sub(r'"\\n"','',my_text)
But it doesn't seem to be working.
I might be complicating it a bit, but something like this might work:
parts = content.split("\"")
for i, part in enumerate(parts):
    if i % 2:
        parts[i] = part.replace("\n", "")
content = "\"".join(parts)
Figured it out.
Steps:
Convert bytes to String
Create the pattern for Regex
Use regex to format the values.
Step 1:
my_text = my_text.decode("utf-8")
Step 2:
pattern = re.compile(r'".*?"',re.DOTALL)
Step 3:
my_text = pattern.sub(lambda x:x.group().replace('\n',''),my_text)
This solves my problem.
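The three steps above can be sketched as one small function. This is only a minimal sketch: the function name and the sample bytes are made up for illustration, and it assumes the double quotes are balanced and that the data contains real newline characters rather than a literal backslash followed by n.

```python
import re

# Non-greedy match of everything between a pair of double quotes.
# re.DOTALL lets "." match newlines, so a quoted span can cover several lines.
QUOTED = re.compile(r'".*?"', re.DOTALL)


def strip_newlines_in_quotes(raw):
    """Remove newlines that sit inside double-quoted spans only."""
    # Step 1: decode bytes to str (str input is passed through unchanged).
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # Steps 2 and 3: rewrite each quoted span, leave everything else alone.
    return QUOTED.sub(lambda m: m.group().replace("\n", ""), text)


# Hypothetical sample data shaped like the text in the question.
sample = b'Steve says "I like \n Pineapple, \n and pizza.\n" and left.\nJohn likes his beer cold.\n'
result = strip_newlines_in_quotes(sample)
print(repr(result))
```

Only the newline itself is removed, so the spaces that surrounded it stay behind. Newlines outside the quotes are untouched.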
I hope to extract the full sentence, if containing certain key words (like or love).
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]* like|love [^.]*\.'
re.findall(pattern,text)
Using | as the divider, I expected ['I like blueberry icecream.']
But I only got ['I like']
I also tried pattern = '[^.]*(like|love)[^.]*\.' but only got ['like']
What did I do wrong? I know that a single word works with the following regex: '[^.]* like [^.]*\.'
You need to put a group around like|love. Otherwise the | applies to the entire patterns on either side of it. So it's matching either a string ending with like or a string beginning with love.
pattern = '[^.]* (?:like|love) [^.]*\.'
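After more research, I found that I was missing ?:. With a capturing group, re.findall returns only the contents of the group; (?:...) makes the group non-capturing, so the whole match is returned.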
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]*(?:like|love)[^.]*\.'
Output
['I like blueberry icecream.']
Source: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
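To see the difference side by side, here is a small sketch that runs the three patterns from this thread against the same text. The variable names are just for illustration.

```python
import re

text = 'I like blueberry icecream. He has a green car. She has blue car.'

# No group: "|" splits the WHOLE pattern into "[^.]* like" OR "love [^.]*\."
no_group = re.findall(r'[^.]* like|love [^.]*\.', text)

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r'[^.]*(like|love)[^.]*\.', text)

# Non-capturing group: the alternation is scoped, and findall
# returns the full match.
non_capturing = re.findall(r'[^.]*(?:like|love)[^.]*\.', text)

print(no_group)
print(capturing)
print(non_capturing)
```

The first two calls reproduce the results from the question (['I like'] and ['like']), and the last one returns the full sentence.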
I actually think it would be easier to do this without regex. Just my two cents.
text = 'I like blueberry icecream. He has a green car. She has blue car. I love dogs.'
print([x for x in text.split('.') if any(y in x for y in ['like', 'love'])])
You can use the regex below:
regex = /[^.]* (?:like|love) [^.]*\./g
I have many big strings with many characters (about 1000-1500 characters) and I want to write the string to a text file using python. However, I need the strings to occupy only a single line in a text file.
For example, consider two strings:
string_1 = "Mary had a little lamb
which was as white
as snow"
string_2 = "Jack and jill
went up a hill
to fetch a pail of
water"
When I write them to a text file, I want the strings to occupy only one line and not multiple lines.
text file eg:
Mary had a little lamb which was as white as snow
Jack and Jill went up a hill to fetch a pail of water
How can this be done?
If you want all the strings to be written out on one line in a file without a newline separator between them there are a number of ways as others here have shown.
The interesting issue is how you get them back into a program again if that is needed, and getting them back into appropriate variables.
I like to use json for this kind of thing, and you can get it to output everything onto one line. This:
import json
string_1 = "Mary had a little lamb which was as white as snow"
string_2 = "Jack and jill went up a hill to fetch a pail of water"
strs_d = {"string_1": string_1, "string_2": string_2}
with open("foo.txt","w") as fh:
    json.dump(strs_d, fh)
would write out the following into a file:
{"string_1": "Mary had a little lamb which was as white as snow", "string_2": "Jack and jill went up a hill to fetch a pail of water"}
This can be easily reloaded back into a dictionary and the original strings pulled back out.
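For completeness, a rough sketch of the round trip. The temporary directory is only there so the example does not leave a file behind, and the first string deliberately contains real newlines to show that json escapes them, so the file still holds a single line.

```python
import json
import os
import tempfile

strs_d = {
    # Embedded newlines are written as the two characters "\n" by json.
    "string_1": "Mary had a little lamb\nwhich was as white\nas snow",
    "string_2": "Jack and jill went up a hill to fetch a pail of water",
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.txt")  # same file name as above
    with open(path, "w") as fh:
        json.dump(strs_d, fh)
    with open(path) as fh:
        file_text = fh.read()

# Reload: the dictionary comes back with the newlines restored.
loaded = json.loads(file_text)
print(file_text)
print(loaded["string_1"])
```

Note that json keeps the newlines as escapes rather than replacing them with spaces, so this fits best when the strings need to be read back exactly as they were.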
If you do not care about the names of the original string variable, then you can use a list like this:
import json
string_1 = "Mary had a little lamb which was as white as snow"
string_2 = "Jack and jill went up a hill to fetch a pail of water"
strs_l = [string_1, string_2]
with open("foo.txt","w") as fh:
    json.dump(strs_l, fh)
and it outputs this:
["Mary had a little lamb which was as white as snow", "Jack and jill went up a hill to fetch a pail of water"]
which when reloaded from the file will get the strings all back into a list which can then be split into individual strings.
This all assumes that you want to reload the strings (and so do not mind the extra json info in the output to allow for the reloading) as opposed to just wanting them output to a file for some other need and cannot have the extra json formatting in the output.
Your example output does not have this, but your example output also is on more than one line and the question wanted it all on one line, so your needs are not entirely clear.
In [36]: string_1 = "Mary had a little lamb which was as white as snow"
    ...:
    ...: string_2 = "Jack and jill went up a hill to fetch a pail of water"
In [37]: s = [string_1, string_2]
In [38]: with open("a.txt","w") as f:
    ...:     f.write(" ".join(s))
    ...:
Construct a single line from the multi-line string and then write it to the file as normal. Your example really should use triple quotes to allow for multi-line strings:
string_1 = """Mary had a little lamb
which was as white
as snow"""
string_2 = """Jack and jill
went up a hill
to fetch a pail of
water"""
with open("myfile.txt", "w") as f:
    f.write(" ".join(string_1.split("\n")) + "\n")
    f.write(" ".join(string_2.split("\n")) + "\n")
with open("myfile.txt") as f:
    print(f.read())
Output:
Mary had a little lamb which was as white as snow
Jack and jill went up a hill to fetch a pail of water
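A small variation on the same idea, in case the strings may contain \r\n line endings or indentation: calling split() with no argument splits on any run of whitespace, so joining the pieces with a single space collapses everything onto one line. The helper name, the sample strings and the temporary file are assumptions made for this sketch.

```python
import os
import tempfile


def to_single_line(s):
    """Collapse every run of whitespace (newlines included) into one space."""
    return " ".join(s.split())


strings = [
    "Mary had a little lamb\nwhich was as white\r\nas snow",
    "Jack and jill\n    went up a hill\nto fetch a pail of\nwater",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    with open(path, "w") as f:
        for s in strings:
            # One string per line in the output file.
            f.write(to_single_line(s) + "\n")
    with open(path) as f:
        lines = f.read().splitlines()

print(lines)
```

The trade-off is that this also collapses repeated spaces inside a line, which may or may not be what you want.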
You can split the string across lines using parentheses:
s = (
"First line "
"second line "
"third line"
)
You can also use triple quotes and remove the newline characters using strip and replace:
s = """
First line
Second line
Third line
""".strip().replace("\n", " ")
total_str = [string_1, string_2]
with open(file_path + "file_name.txt", "w") as fp:
    for i in total_str:
        fp.write(i + '\n')
# no explicit fp.close() is needed: the with block closes the file
My List:
['\n\r\n\tThis article is about sweet bananas. For the genus to which
banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas
used in cooking, see Cooking banana. For other uses, see Banana
(disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya
and Australia, and are likely to have been first domesticated in Papua
New Guinea.\n\r\n\tThey are grown in 135
countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between
"bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is
the largest herbaceous flowering plant.\n\r\n\tAll the above-ground
parts of a banana plant grow from a structure usually called a
"corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West
African origin, possibly from the Wolof word banaana, and passed into
English via Spanish or Portuguese.\n']
Example code:
import requests
from bs4 import BeautifulSoup
import re
re=requests.get('http://www.abcde.com/banana')
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list=[]
for tag in soup.select('.page_article_content'):
    list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)
After scraping a web page, I need to clean the data. I want to replace every "\r", "\n", and "\t" with "", but the text also contains subtitles, and if I do this the subtitles and sentences run together.
Every subtitle starts with \n\n and ends with \n\r\n\t. Is it possible to mark them in this list, for example as \aEtymology\a? Replacing \n\n and \n\r\n\t separately with \a first will not work, because other parts contain the same sequences: for example, \n\n\r would become \a\r. Thanks in advance!
Approach
Replace each subtitle in the list with a custom placeholder string, <subtitles>
Remove the \n, \r, and \t characters from the list
Replace each placeholder with the actual subtitle
Code
l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']
import re
regex=re.findall("\n\n.*.\n\r\n\t",l[0])
print(regex)
for x in regex:
    l = [r.replace(x,"<subtitles>") for r in l]
rep = ['\n','\t','\r']
for y in rep:
    l = [r.replace(y, '') for r in l]
for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)
Output
['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']
['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
import re
print([re.sub(r'[\n\r\t]', '', c) for c in list])
I think you may use regex
You can do this by using regular expressions:
import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]
\g<1> is a backreference to the (\w+) group in the first regex. It lets you reuse whatever was matched there.
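Putting both steps together, a possible end-to-end sketch: mark the subtitles with \a first, then strip the remaining \n, \r and \t. The shortened sample list and the function name are made up for illustration, and like the regex above it assumes every subtitle is a single word.

```python
import re

# Shortened, hypothetical sample in the same shape as the scraped list.
raw = ['\n\r\n\tFirst sentence.\n\r\n\tSecond sentence.'
       '\n\nDescription\n\r\n\tThird sentence.\n']

# A subtitle is a single word between "\n\n" and "\n\r\n\t".
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')


def clean(text):
    # 1) Wrap each subtitle in \a markers (this consumes its delimiters).
    marked = subtitle.sub(lambda m: '\a' + m.group(1) + '\a', text)
    # 2) Drop every remaining newline, carriage return and tab.
    return re.sub(r'[\n\r\t]', '', marked)


cleaned = [clean(t) for t in raw]
print(cleaned)
```

Because the subtitles are marked before the cleanup, the \a markers survive and can be used later to split the text into sections.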
If one particular word is not followed by another particular word, skip it. Here is my string:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
I want to print and count all the words between john and dead, death, or died.
If john is not followed by any of the words died, dead, or death, skip it and start again from the next john.
My code:
x = re.sub(r'[^\w]', ' ', x) # removed all dots, commas, special symbols
for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len([word for word in i.split()]))
my output:
got shot
2
with his john got killed or
6
with his wife
3
output which i want:
got shot
2
got killed or
3
with his wife
3
I don't know where I am making a mistake.
This is just a sample input; I have to check 20,000 inputs at a time.
You can use this negative lookahead regex:
>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len([word for word in i.split()]))
...
got shot
2
got killed or
3
with his wife
3
Instead of your .*?, this regex uses (?:(?!john).)*?, which lazily matches zero or more characters while making sure that none of them starts another john.
I also suggest using word boundaries to make it match complete words:
re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
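As a self-contained sketch, here is the word-boundary version run on the sample string from the question, written with print() so it runs on Python 3. The variable names are only for illustration.

```python
import re

x = ('john got shot dead. john with his .... ? , john got killed or '
     'died in 1990. john with his wife dead or died')

# Same preprocessing as in the question: every non-word character
# (punctuation and whitespace alike) becomes a space.
x = re.sub(r'[^\w]', ' ', x)

pattern = re.compile(
    r'(?<=\bjohn\b)'              # must start right after the word "john"
    r'(?:(?!\bjohn\b).)*?'        # lazily consume chars, never entering another "john"
    r'(?=\b(?:dead|died|death)\b)'  # stop right before one of the end words
)

phrases = [m.strip() for m in pattern.findall(x)]
counts = [len(p.split()) for p in phrases]

for phrase, count in zip(phrases, counts):
    print(phrase)
    print(count)
```

The second john ("john with his ...") produces no match, because the next john appears before any of the end words.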
I assume you want to start over when another john appears in your string before dead|died|death occurs.
In that case, you can split your string on the word john and then match against each of the resulting parts:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    m = re.match('(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))
yields:
got shot
2
got killed or
3
with his wife
3
Also, note that after the replacements I propose here (before splitting and matching), the string looks like this:
john got shot dead john with his john got killed or died in 1990 john with his wife dead or died
That is, no runs of multiple whitespace characters are left. You handle this later by splitting on whitespace, but I feel this is a bit cleaner.
I have a large text in which three people are talking.
I read that text into a string variable in Python.
The text looks like this:
JOHN: hello
MIKE: hello john
SARAH: hello guys
Imagine a long talk between 3 people. I want to split the texts into lists like
john = []
mike = []
sarah = []
and I want the list john to contain every sentence john said.
Can anyone help me with the code I need?
See if this is enough to get you started.
# text is a single string, so iterate over its lines, not its characters
for line in text.splitlines():
    if line.startswith('JOHN'):
        john.append(line)
    elif line.startswith('MIKE'):
        mike.append(line)
    elif line.startswith('SARAH'):
        sarah.append(line)