Parsing the String to different format in Python - python

I have a text document and I need to add two # symbols before the keywords present in an array.
Sample text and Array:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
Required Text:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32"

Just use the replace function
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr = ['name','employee_id','blood_group','age']
for w in arr:
    str = str.replace(w, f'##{w}')
print(str)

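You can simply loop over arr and use the str.replace function. Strings are immutable, so assign the result back:
for repl in arr:
    strng = strng.replace(repl, '##' + repl)
print(strng)
However, I urge you to rename the variable str. It is not a reserved keyword, but it is the name of the built-in string type, and assigning to it shadows that type.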

You might use the re module for that task in the following way:
import re
txt = "This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
newtxt = re.sub('('+'|'.join(arr)+')',r'##\1',txt)
print(newtxt)
Output:
This is a sample text document which consists of all demographic information of employee here is the value you may need,##name: George ##employee_id:14296##blood_group:b positive this is the blood group of the employee##age:32
Explanation: I used a regular expression to match the words from your list and replace each with ##word. This is a single pass, as opposed to X passes when using multiple str.replace calls (where X is the length of arr), so it should be more efficient when arr is long.
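A slightly more defensive variant of the single-pass idea, as a sketch only: the helper name mark_keywords and the shortened sample text are made up for illustration. It assumes the keywords may contain regex metacharacters (hence re.escape) and that one keyword may be a prefix of another (hence the longest-first ordering).

```python
import re

def mark_keywords(text, keywords, marker="##"):
    """Prefix every occurrence of each keyword with marker in a single pass."""
    # Longest first, so a keyword that is a prefix of another does not win the alternation.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps characters such as + or . from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    return pattern.sub(lambda m: marker + m.group(0), text)

txt = "need,name: George employee_id:14296blood_group:b positive employeeage:32"
print(mark_keywords(txt, ["name", "employee_id", "blood_group", "age"]))
```

Like the answer above, this does not add a space before ##; it only guards the pattern itself.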

As an alternative, you can slice the string at each keyword's position; for a longer list, convert the lines below into a loop. The required text also seems to have a space before ##.
str= str[:str.find(arr[0])] + ' ##' + str[str.find(arr[0]):]
str= str[:str.find(arr[1])] + ' ##' + str[str.find(arr[1]):]
str= str[:str.find(arr[2])] + ' ##' + str[str.find(arr[2]):]
str= str[:str.find(arr[3])] + ' ##' + str[str.find(arr[3]):]
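The four lines above could be folded into a loop along these lines. This is a sketch, with a hypothetical helper name (prefix_keywords) and a shortened sample text; it also guards against str.find returning -1, which would otherwise splice the marker in before the last character.

```python
def prefix_keywords(text, keywords, marker=" ##"):
    """Insert marker before the first occurrence of each keyword."""
    for kw in keywords:
        idx = text.find(kw)
        if idx != -1:  # str.find returns -1 when the keyword is absent
            text = text[:idx] + marker + text[idx:]
    return text

sample = "need,name: George employee_id:14296blood_group:b positive employeeage:32"
result = prefix_keywords(sample, ["name", "employee_id", "blood_group", "age"])
print(result)
```

Only the first occurrence of each keyword is handled, and a keyword that already follows a space ends up with two spaces before ##.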

You can replace each keyword with itself prefixed by a space and ##, then collapse any double spaces in the result into a single space.
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
for i in arr:
    str = str.replace(i, " ##{}".format(i))
print(str.replace("  ", " "))
Output
This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32

Related

How to filter strings if the first three sentences contain keywords

I have a pandas dataframe called df. It has a column called article. The article column contains 600 strings, each of the strings represent a news article.
I want to only KEEP those articles whose first four sentences contain keywords "COVID-19" AND ("China" OR "Chinese"). But I'm unable to find a way to do this on my own.
(in the string, sentences are separated by \n. An example article looks like this:)
\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
First we define a function to return a boolean based on whether your keywords appear in a given sentence:
def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)
Then we create a boolean series by applying this function (using Series.apply) to the sentences of your df.article column.
Note that we use a lambda function in order to truncate the sentence passed on to the contains_covid_kwds up to the fifth occurrence of '\n', i.e. your first four sentences (more info on how this works here):
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
Then we pass the boolean series to df.loc, in order to localize the rows where the series was evaluated to True:
filtered_df = df.loc[series]
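One edge case with the slicing trick: if an article has fewer than five newlines, str.find returns -1 and the slice silently drops the last character. A plain-Python sketch that avoids this is below; the helper names first_sentences and mentions_covid_and_china are made up here, and empty segments (such as the one produced by a leading '\n') are assumed not to count as sentences.

```python
def first_sentences(article, n=4):
    """Return the first n non-empty newline-separated sentences, joined by newlines."""
    sentences = [s for s in article.split("\n") if s.strip()]
    return "\n".join(sentences[:n])

def mentions_covid_and_china(article, n=4):
    """True if the first n sentences contain COVID-19 and either China or Chinese."""
    head = first_sentences(article, n)
    return "COVID-19" in head and ("China" in head or "Chinese" in head)

early = "\nChina may be past the worst of the COVID-19 pandemic.\nTwo.\nThree.\nFour.\nFive."
late = "\nOne.\nTwo.\nThree.\nFour.\nChina and COVID-19 only appear here."
print(mentions_covid_and_china(early), mentions_covid_and_china(late))
```

The function can be handed to Series.apply in the same way as above, for example df.loc[df.article.apply(mentions_covid_and_china)].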
You can also use the pandas apply method, as follows:
string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
import pandas as pd
df = pd.DataFrame({'article': [string]})
def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag = 0
    keywords = ['china', 'covid-19', 'wuhan']
    # Checking if the article has more than 4 sentences
    if len(string_list) > 4:
        # iterating over the first four sentences in string_list
        for i in range(4):
            # iterating over keywords list
            for key in keywords:
                # checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    # Else block is executed when article has less than or equal to 4 sentences
    else:
        # Iterating over string_list variable, which contains sentences
        for i in range(len(string_list)):
            # iterating over keywords list
            for key in keywords:
                # Checking if sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    if flag == 0:
        return False
    else:
        return True
Then call the pandas apply method on df:
df['Contains Keywords?'] = df['article'].apply(findKeys)
First I create a series which contains just the first four sentences from the original df['article'] column, and convert it to lower case, assuming that searches should be case-independent.
articles = df['article'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[:4])).str.lower()
Then use a simple boolean mask to filter only those rows where the keywords were found in the first four sentences.
df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
Here:
found = []
s1 = "hello"
s2 = "good"
s3 = "great"
for string in article:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)

python dataframe regex create new column from text cell

I have a dataframe and one of the columns contains a bunch of random text. Within the random text is one name per row. I would like to create a new column within the dataframe that is only the name. All of these names start with capital letters and are preceded by phrases like "Meet", "name is", "hello to". I believe I should use regex but am not sure beyond that.
Example texts from a dataframe cells:
"This is John. He is a rock star on tour in Australia." (desired name is John)
"Meet Randy. He probably has the best hairdo on planet Earth." (desired name is Randy)
"Say hello to Mike! His moustache won first prize at the county fair." (desired name is Mike)
I think the code should be something like:
df['name'] = df['text'].str.extract(r'____________')
First get the regex patterns. My logic, seeing your pattern, is that:
every name starts with a capital letter,
has a space before the name,
has a punctuation character right after the name (exclamation mark or full stop),
has a space after that character, else even Earth will be counted, which we do not want
The regex for this is:
re1='(\\s+)' # White Space 1
re2='((?:[A-ZÀ-ÿ][a-zÀ-ÿ]+))' # Word 1
re3='([.!,?\\-])' # Any Single Character 1
re4='(\\s+)' # White Space 2
I use this website to get my regex: https://txt2re.com/
Now do:
df['name'] = df['text'].str.extract(re1+re2+re3+re4, expand=True)[1]
Output:
0 John
1 Randy
2 Mike
3 Amélie
Name: name, dtype: object
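The same pattern can be tried without pandas first. The sketch below folds the four pieces into one expression with a single capture group; NAME_RE and extract_name are names invented for this example, and the three texts are the ones from the question.

```python
import re

# One capture group for the name; the surrounding whitespace and punctuation
# are matched but not captured.
NAME_RE = re.compile(r"\s+([A-ZÀ-ÿ][a-zÀ-ÿ]+)[.!,?\-]\s+")

def extract_name(text):
    """Return the first capitalised word followed by punctuation and a space, or None."""
    match = NAME_RE.search(text)
    return match.group(1) if match else None

texts = [
    "This is John. He is a rock star on tour in Australia.",
    "Meet Randy. He probably has the best hairdo on planet Earth.",
    "Say hello to Mike! His moustache won first prize at the county fair.",
]
names = [extract_name(t) for t in texts]
print(names)
```

With only one group in the pattern, df['text'].str.extract(NAME_RE.pattern, expand=False) should return a Series directly, so the [1] indexing is no longer needed.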

List of multiple strings to single string with same structure

I have a function that returns me a list of strings.
I need the strings to be concatenated and returned in form of a single string.
List of strings:
data_hold = ['ye la AAA TAM tat TE
0042
on the mountain sta
nding mute Saw hi
m ply t VIC 3181',
'Page 2 of 3
ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',
'YOUR USAGE BREAKDOWN
Average cost per day $1.57 kWh Tonnes']
I tried concatenating them as follows -
data_hold[0] + '\n' + data_hold[1]
Actual result:
"ye la AAA TAM tat TE\n0042\n\non the mountain sta\nnding mute Saw hi\nm ply t VIC 3181ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',\n'YOUR USAGE BREAKDOWNAverage cost per day $1.57 kWh Tonnes'\n
Expected result:
'ye la AAA TAM tat TE
0042
on the mountain sta
nding mute Saw hi
m ply t VIC 3181',
'Page 2 of 3
ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',
'YOUR USAGE BREAKDOWN
Average cost per day $1.57 kWh Tonnes'
Your 'expected result' is not a single string. However, running print('\n'.join(data_hold)) will produce the equivalent single string.
You misunderstand the difference between what the actual value of a string is, what is printed if you print() the string and how Python may represent the string to show you its value on screen.
For example, take a string with the value:
One line
Another line, with a word in 'quotes'.
So, the string contains a single text, with two lines and some part of the string has the same quotes in it that you would use to mark the beginning and end of the string.
In code, there are various ways you can construct this string:
one_way = '''One line
Another line, with a word in 'quotes'.'''
another_way = 'One line\nAnother line, with a word in \'quotes\'.'
When you run this, you'll find that one_way and another_way contain the exact same string that, when printed, looks just like the example text above.
Python, when you ask it to show you the representation in code, will actually show you the string like it is specified in the code for another_way, except that it prefers to show it using double quotes to avoid having to escape the single quotes:
>>> one_way = '''One line
... Another line, with a word in 'quotes'.'''
>>> one_way
"One line\nAnother line, with a word in 'quotes'."
Compare:
>>> this = '''Some text
... continued here'''
>>> this
'Some text\ncontinued here'
Note how Python decides to use single quotes if there are no single quotes in the string itself. And if both types of quotes are in there, it'll escape like the example code above:
>>> more = '''Some 'text'
... continued "here"'''
>>> more
'Some \'text\'\ncontinued "here"'
But when printed, you get what you'd expect:
>>> print(more)
Some 'text'
continued "here"
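Tying this back to the question, here is a minimal sketch with shortened stand-in data (not the full list from the question). It shows that '\n'.join gives one string, and that only its representation differs from its printed form.

```python
# Shortened stand-in for the list in the question; each item may itself span lines.
data_hold = [
    "ye la AAA TAM tat TE\n0042",
    "Page 2 of 3\nACCOUNT SUMMARY",
    "YOUR USAGE BREAKDOWN",
]

joined = "\n".join(data_hold)  # a single str, items separated by a newline

print(repr(joined))  # one line, with \n shown as an escape sequence
print(joined)        # five lines of text
```

If the function should hand back the single string, returning joined is enough; the quotes and \n escapes only appear when the value is displayed with repr.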

Regex sub only removes certain expressions

I'm running a program which creates product labels based on csv data. The function which I am struggling with takes a data structure which consists of a number combination(width of a wooden plank) and a string (name of product). Possible combinations I search for are as follows:
5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF
My function needs to take in the data, split the width from the name and return them both as separate variables as follows:
desc = row[1]
if filter.lower() in desc.lower():
    size = re.search(r'(\d{1})(\-*)(\d{0,1})(\/*)(\d{0,2})(\+*)(\d{0,1})(\-*)(\d{0,1})(\/*)(\d{0,2})', desc)
    if size:
        # remove size from description
        desc = re.sub(size.group(), '', desc)
        size = size.group() # extract match from obj
    else:
        size = "None"
The function does as intended with the first two samples, however when it comes across the last product, it recognizes the size but does not remove it from description. Screen shot below shows the output after I print (size + \n + desc)
Is there an issue with my re expression or elsewhere?
Thanks
re.sub() expects its first argument to be a regex. It works for the first two because they don't contain any characters that have special meaning in the context, however the third contains +, which is special.
There's not actually any reason to use regex there... regular string replacement should work:
desc = desc.replace(size.group(), '')
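To make the difference concrete, here is a small sketch using the third sample from the question. It compares passing the matched text to re.sub as-is, passing it through re.escape, and using plain str.replace; the variable names are only for illustration.

```python
import re

desc = "2-1/4+4-1/4 MAPLE TIMBERWOLF"
size = "2-1/4+4-1/4"

# As a pattern, "4+" means "one or more 4s", so the literal "+" is never matched.
unescaped = re.sub(size, "", desc)

# re.escape turns the text into a pattern that matches it literally.
escaped = re.sub(re.escape(size), "", desc).strip()

# Plain string replacement needs no escaping at all.
plain = desc.replace(size, "").strip()

print(unescaped)
print(escaped)
print(plain)
```

The first result is the unchanged description, which is the behaviour described in the question; the other two leave only the product name.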
Why replace and not simply match what you need?
import re
text = """5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF""".split('\n')
print(text)
for t in text:
    pattern = r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
    m = re.search(pattern,t)
    print(m.group('size'))
    print(m.group('species'))
Output:
5
MAPLE PEPPER-ANTIQUE
3-1/4
MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4
MAPLE TIMBERWOLF
Regex:
r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
2 named groups, between them 0-n spaces.
1st group only 0123456789-+/ allowed
2nd group any but 0123456789 allowed

Execute only if string contains a ','?

I'm trying to execute a bunch of code only if the string I'm searching contains a comma.
Here's an example set of rows that I would need to parse (name is a column header for this tab-delimited file, and the column (annoyingly) contains the name, degree, and area of practice):
name
Sam da Man J.D.,CEP
Green Eggs Jr. Ed.M.,CEP
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
My issue is that some of the rows contain a comma, which is followed by an acronym which represents an "area of practice" for the professional and some do not.
My code relies on the principle that each line contains a comma, and I will now have to modify the code in order to account for lines where there is no comma.
def parse_ieca_gc(s):
    ########################## HANDLE NAME ELEMENT ###############################
    degrees = ['M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.']
    degrees_list = []
    # separate area of practice from name and degree and bind this to var 'area'
    split_area_nmdeg = s['name'].split(',')
    area = split_area_nmdeg.pop() # when there is no area of practice and hence no comma, this pops out the name + deg and leaves an empty list, that's why 'print split_area_nmdeg' returns nothing and 'area' returns the name and deg when there's no comma
    print('split area nmdeg')
    print(area)
    print(split_area_nmdeg)
    # Split the name and deg by spaces. If there's a deg, it will match one of the elements and will be stored in the deg list. The deg is removed from the name_deg list and all that's left is the name.
    split_name_deg = re.split(r'\s', split_area_nmdeg[0])
    for word in split_name_deg:
        for deg in degrees:
            if deg == word:
                degrees_list.append(split_name_deg.pop())
    name = ' '.join(split_name_deg)
    # area of practice
    category = area
re.search() and re.match() both do not work, it appears, because they return instances and not a boolean, so what should I use to tell if there's a comma?
The easiest way in python to see if a string contains a character is to use in. For example:
if ',' in s['name']:
if re.match(...) is not None:
Use that instead of looking for a boolean. re.match returns a match object on success, and None on failure.
You are already searching for a comma. Just use the results of that search:
split_area_nmdeg = s['name'].split(',')
if len(split_area_nmdeg) > 1:
    print("Your old code goes here")
else:
    print("Your new code goes here")
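Another way to cover both cases in one code path is str.partition, which always returns three parts and leaves the separator empty when there is no comma. The sketch below is an assumption about how the name handling might be restructured, not the asker's code: parse_name_field is a hypothetical helper, and DEGREES is the list from the question turned into a set.

```python
DEGREES = {'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.',
           'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.'}

def parse_name_field(raw):
    """Split a raw name cell into (name, degrees, area); area is None without a comma."""
    name_deg, sep, area = raw.partition(',')  # sep is '' when no comma is present
    words = name_deg.split()
    degrees = [w for w in words if w in DEGREES]
    name = ' '.join(w for w in words if w not in DEGREES)
    return name, degrees, (area.strip() if sep else None)

for row in ["Sam da Man J.D.,CEP", "Argle Bargle Sr. MA", "Cersei Lannister M.A. Ph.D."]:
    print(parse_name_field(row))
```

Filtering the words instead of popping from the list while iterating also keeps every degree when a row has more than one.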
