Python replace in JSON list with for loop? - python

I'm quite new to Python 3, and I'm developing a REST API that formats some characters in a JSON payload of many strings (sometimes thousands). The JSON has this structure:
[
[
"city",
"Street 158 No 96"
],
[
"city",
"st 144 11a 11 ap 104"
],
[
"city",
"Street83 # 85 - 22"
],
[
"city",
"str13 #153 - 81"
],
[
"city",
"street1h # 24 - 29"
]
]
This is what I did to replace these in an Excel macro:
text = Replace(text, "st", " street ", , , vbTextCompare)
For i = 0 To 9 Step 1
    text = Replace(text, "street" & i, " street " & i, , , vbTextCompare)
    text = Replace(text, "st" & i, " street " & i, , , vbTextCompare)
Next i
This formats every cell to 'street #' regardless of the number. The problem comes when I try to do the same in Python. So far I have learned how to replace multiple values in a list like this:
addressList = []
for address in request.json:
    address = [element
               .replace('st', 'street ')
               .replace('street1', 'street 1')
               .replace('street2', 'street 2')
               .replace('street3', 'street 3')
               .replace('street4', 'street 4')
               .replace('street5', 'street 5')
               # and so on for st too
               for element in address]
    addressList.append(address)
This method is not just long but also really ugly. I'd like to do something like what I had before, but I can't seem to use a for loop inside the replace chain. Should I do it outside?
Thank you for helping.
--EDIT--
Edited the JSON so it's valid.
I tried both revliscano's and The fourth bird's answers and they both work. I'm currently using revliscano's method, as it lets me build the list from my original JSON in just 'one line'.

Instead of using multiple replace calls, you could use a pattern that matches st with an optional reet and an optional space, then captures 1+ digits in a group.
\bst(?:reet)? ?(\d+)\b
Regex demo | Python demo
In the replacement, use the capturing group (street \1) with re.sub.
Example code for a single element:
import re
element = re.sub(r"\bst(?:reet)? ?(\d+)\b", r"street \1", "st 5")
print(element)
Output
street 5
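As a rough sketch, the same pattern could be applied to the whole nested list from the question. The helper name normalize_addresses and the sample rows are only illustrative, and re.IGNORECASE is an assumption added here to mirror the vbTextCompare behaviour of the macro:

```python
import re

# Same pattern as above; IGNORECASE is an assumption to also catch "Street83".
STREET_RE = re.compile(r"\bst(?:reet)? ?(\d+)\b", re.IGNORECASE)


def normalize_addresses(rows):
    """Return a new list of rows with 'st 5' / 'street5' rewritten as 'street 5'."""
    return [[STREET_RE.sub(r"street \1", element) for element in row]
            for row in rows]


sample = [
    ["city", "st 144 11a 11 ap 104"],
    ["city", "Street83 # 85 - 22"],
    ["city", "st5"],
]
result = normalize_addresses(sample)
print(result)
```

Note that values such as "str13" or "street1h" from the question are not matched by this pattern, because it requires st or street directly before the digits and a word boundary directly after them.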

I would use a regular expression to solve this. Try the following:
import re
address_list = [[re.sub(r'(?:st ?(\d)?\b)|(?:street(\d))', r'street \1', element)
                 for element in address]
                for address in request.json]

You can combine regular expressions with a dictionary to make it faster.
I use a function like this in one of my programs:
import re
def multiple_replace(adict, text):
    regex = re.compile("|".join(map(re.escape, adict.keys())))
    return regex.sub(lambda match: adict[match.group(0)], text)
adict is the dictionary holding the mappings of the characters you want to replace.
For you, it can be
adict = {
    'street1': 'street 1',
    'street2': 'street 2',
    'street3': 'street 3',
    'street4': 'street 4',
    'street5': 'street 5',
}
Of course you can't use the exact same function. You will have to write another regular expression according to your needs, like @The fourth bird did.
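A minimal runnable sketch of this dictionary approach follows. Sorting the keys longest-first is an addition of mine, not part of the original function: regex alternation takes the first alternative that matches, so a short key like "st1" listed first could shadow a longer key that starts the same way. The mapping below is only sample data.

```python
import re


def multiple_replace(mapping, text):
    # Longest keys first, so a longer key is tried before any shorter key
    # that is a prefix of it.
    keys = sorted(mapping, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    # Look up the replacement for whichever key matched.
    return regex.sub(lambda match: mapping[match.group(0)], text)


mapping = {"st1": "street 1", "street1": "street 1", "street2": "street 2"}
result = multiple_replace(mapping, "street1 x st1 y street2")
print(result)
```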

Related

Python regex not giving desired output

I'm scraping a site which contains the following string
"1 Year+ in Category"
or in some cases
"1 Year+ by user in Category"
I want to separate the Year, Category and the User. I tried a regular split, but it doesn't work here because there are two delimiters, 'in' and 'by'.
So I used a regex. It kind of works, but not properly. Here is the snippet:
dateandcat=re.split(r'.\s[in , by]',rightside[0])
rightside[0] contains the date, category and user.
It produces the following output:
['1 Year', 'n Movies']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'n Movies']
I could just trim off the first two characters in [1] and [2], but I want to fix the regex. Why is the second character of 'in' and 'by' still showing? How do I fix this?
Try using:
import re
value = "1 Year+ in Category by User"
match = re.match(r"(\d+ \w+\+?) in (\w+)(?: by (\w+)*)?", value)
if match:
    print(match.groups())
Output:
('1 Year+', 'Category', 'User')
You can use regex101 to learn more about that regex and others.
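As for why the 'n' and 'y' remain: square brackets define a character class, so [in , by] matches exactly one character out of i, n, space, comma, b and y. The original pattern therefore consumes the '+', the space and only the first letter of the delimiter. A small sketch of a split that uses a group with alternation instead (the sample strings are the ones from the question; the variable names are arbitrary):

```python
import re

# (?:in|by) matches the whole word "in" or "by"; \s+ eats the surrounding spaces.
DELIMITER_RE = re.compile(r"\s+(?:in|by)\s+")

with_user = DELIMITER_RE.split("1 Year+ by user in Category")
without_user = DELIMITER_RE.split("1 Year+ in Category")

print(with_user)
print(without_user)
```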

How to replace a Word in DF Column with another Word using Python -without including the substring

I am trying to achieve the functionality of SAS TRANWRD Function in Python to replace a 'WORD' with another 'WORD'.
I have tried using str.replace and replace method available in Python but these methods are also replacing the SUBSTRING in addition to replacing the word.
Code:
DICT1 = {'NZ':'NEW ZEALAND'}
for k, v in DICT1.items():
    df['COL1'] = df['COL1'].str.replace(k, v)
e.g.:
NZ COMPANY LIMITED --> NEW ZEALAND COMPANY LIMITED - *(Expected)*
GONZU ENTERPRISE --> GONEW ZEALANDU ENTERPRISE - *(Unexpected)*
In SAS this issue is taken care of because the TRANWRD function only replaces the word after finding space boundaries.
Can someone help me achieve similar functionality in Python?
Use a regular expression search pattern that uses the word boundary metacharacter \b.
Example:
import re
seekword = "Bob"
rplcword = "Bubba"
phrase = "Hello there Bob"
phrasex = re.sub(r"\b"+seekword+r"\b", rplcword, phrase)
print (phrasex)
Example 2:
Loop over multiple phrases and replacements
import re
PHRASES = [
    'Bill is from AU',
    'Bob is from NZ',
    'Bobby visited NZA',
]
WORDMAPS = {
    'Bob': 'Bubba',
    'NZ': 'NEW ZEALAND',
}
for index, phrase in enumerate(PHRASES):
    phrasex = phrase
    for k, v in WORDMAPS.items():
        # \b on both sides so only whole words are replaced
        phrasex = re.sub(r"\b" + k + r"\b", v, phrasex)
    PHRASES[index] = phrasex
print(PHRASES)
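A possible variation, sketched under the assumption that the search words may contain regex metacharacters: escape each key with re.escape and compile a single pattern for the whole mapping. The function name replace_words is made up for this example; DICT1 is the mapping from the question.

```python
import re

DICT1 = {"NZ": "NEW ZEALAND"}

# One alternation of escaped keys, wrapped in word boundaries.
WORD_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, DICT1)) + r")\b")


def replace_words(text):
    """Replace whole words found in DICT1; substrings inside other words are left alone."""
    return WORD_RE.sub(lambda match: DICT1[match.group(0)], text)


print(replace_words("NZ COMPANY LIMITED"))
print(replace_words("GONZU ENTERPRISE"))
```

For a DataFrame column, the same pattern string should be usable with df['COL1'].str.replace(pattern, replacement, regex=True), although that is not tested here.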

Parse sentences with [value](type) format

I want to parse and extract key, values from a given sentence which follow the following format:
I want to get [samsung](brand) within [1 week](duration) to be happy.
I want to convert it into a split list like below:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
I have tried to split it using [ or ) :
re.split('\[|\]|\(|\)',s)
which is giving output:
['I want to get ',
'samsung',
'',
'brand',
' within ',
'1 week',
'',
'duration',
' to be happy.']
and
re.split('\[||\]|\(|\)',s)
is giving below output :
['I want to get ',
'samsung](brand) within ',
'1 week](duration) to be happy.']
Any help is appreciated.
Note: This is similar to Stack Overflow inline links, where typing go to [this link](http://google.com) is parsed as a link.
As a first step we split the string, and in a second step we modify each piece:
s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
import re
s = re.split(r'(\[[^]]*\]\([^)]*\))', s)
s = [re.sub(r'\[([^]]*)\]\(([^)]*)\)', r'\1:\2', i) for i in s]
print(s)
Prints:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
You may use a two step approach: process the [...](...) first to format as needed and protect these using some rare/unused chars, and then split with that pattern.
Example:
import re
s = "I want to get [samsung](brand) within [1 week](duration) to be happy."
print(re.split(r'⦅([^⦅⦆]+)⦆', re.sub(r'\[([^][]*)]\(([^()]*)\)', r'⦅\1:\2⦆', s)))
See the Python demo
The \[([^\][]*)]\(([^()]*)\) pattern matches
\[ - a [ char
([^\][]*) - Group 1 ($1): any 0+ chars other than [ and ]
]\( - ]( substring
([^()]*) - Group 2 ($2): any 0+ chars other than ( and )
\) - a ) char.
The ⦅([^⦅⦆]+)⦆ pattern just matches any ⦅...⦆ substring but keeps what is in between as it is captured.
You could replace the ]( pattern first, then split on the [ and ) characters:
re.split(r'\[|\)', re.sub(r'\]\(', ':', s))
One approach, using re.split and then a lambda function to format each part:
import re
sentence = "I want to get [samsung](brand) within [1 week](duration) to be happy."
parts = re.split(r'(?<=[\])])\s+|\s+(?=[\[(])', sentence)
processTerms = lambda x: re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1:\2', x)
parts = list(map(processTerms, parts))
print(parts)
['I want to get', 'samsung:brand', 'within', '1 week:duration', 'to be happy.']
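If only the key/value pairs are needed rather than the full split list, re.findall with two capture groups might be enough. This is a sketch, not one of the answers above; the function name extract_pairs and the choice to key the dict by type are assumptions.

```python
import re

# Group 1: text inside [...]; group 2: text inside the (...) that follows it.
PAIR_RE = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")


def extract_pairs(sentence):
    """Map each type (e.g. 'brand') to its value (e.g. 'samsung')."""
    return {kind: value for value, kind in PAIR_RE.findall(sentence)}


s = "I want to get [samsung](brand) within [1 week](duration) to be happy."
pairs = extract_pairs(s)
print(pairs)
```

A dict keeps only the last value when the same type appears more than once, so a list of tuples may be the safer return type for such input.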

Extracting specific string after specific character

new = ['mary 2jay 3ken +', 'mary 2jay 3ken +', 'steven +john -']
The items of new, one per line:
mary 2jay 3ken +
mary 2jay 3ken +
steven +john -
How could I get the sign/number after each person's name? I'm wondering whether dict would work in this case as my expected output is:
mary:2
jay:3
ken:+
steven:+
john:-
To get the index of "+" in a string, you can use:
index = a_string.index("+")
To check whether "+" exists in a string, use:
if "+" in a_string:
    # ...
To iterate over a list of strings, you can do:
for text in new:
    # ...
There are fifty ways to do what you want. I suggest you read the Python tutorial.
Edit:
You can use a regex to extract the name/number fields:
import re
for text in new:
    couples = re.findall(r"(\S+)\s+(\d+|\+|\-|$)", text)
    for name, num in couples:
        print(name, num)
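Since the question asks whether a dict would work, here is one possible sketch that collects the pairs into a dict. It assumes the third list item ends with "-" (as in the printed output), that names consist of letters only, and that repeated names can simply overwrite each other.

```python
import re

new = ['mary 2jay 3ken +', 'mary 2jay 3ken +', 'steven +john -']

# A name (letters), optional whitespace, then digits or a single + / - sign.
PAIR_RE = re.compile(r"([A-Za-z]+)\s*(\d+|[+-])")

pairs = {}
for text in new:
    for name, value in PAIR_RE.findall(text):
        pairs[name] = value  # a later duplicate overwrites an earlier one

for name, value in pairs.items():
    print(f"{name}:{value}")
```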

How to split a statement based on dots ('.'), while excluding dots inside angular brackets(< . >) using regular expressions in python?

I want to split statements by dots using regex in python, while excluding certain dots inside the angular brackets.
eg:
Original Statement :
'my name 54. is <not23.> worth mentioning. ok?'
I want to split it into following sentences:
Statement 1 : 'my name 54'
Statement 2 : ' is <not23.> worth mentioning'
Statement 3 : ' ok'
I have attempted
re.split(r'[^<.>]\.','my name 54. is <not23.> worth mentioning. ok?')
But it's not ignoring the dot inside <>, so the result I'm getting is:
['my name 5', ' is <not23', '> worth mentioning', ' ok?']
Split on the following regex:
\.(?![^<]*>)
Live demo
import re
text = 'my name 54. is <not23.> worth mentioning. ok?'  # avoid shadowing the built-in str
regex = re.compile(r"\.(?![^<]*>)")
arr = regex.split(text)
print(arr)
Easy if you can use the newer regex module (it provides the (*SKIP)(*FAIL) functionality):
import regex as re
string = 'my name 54. is <not23.> worth mentioning. ok?'
rx = re.compile(r"<[^>]*>(*SKIP)(*FAIL)|\.")
parts = rx.split(string)
print(parts)
# ['my name 54', ' is <not23.> worth mentioning', ' ok?']
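For comparison, the same split can be sketched without any regex by scanning the characters and tracking whether the scan is currently inside angle brackets. The function name split_outside_brackets is hypothetical, and the sketch assumes brackets are well formed.

```python
def split_outside_brackets(text):
    """Split on '.' characters that are not enclosed in <...>."""
    parts = []
    current = []
    depth = 0  # how many '<' are currently open
    for ch in text:
        if ch == "<":
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        if ch == "." and depth == 0:
            # A dot outside brackets ends the current piece.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


pieces = split_outside_brackets("my name 54. is <not23.> worth mentioning. ok?")
print(pieces)
```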
