List of multiple strings to single string with same structure - python

I have a function that returns me a list of strings.
I need the strings to be concatenated and returned in form of a single string.
List of strings:
data_hold = ['ye la AAA TAM tat TE
0042
on the mountain sta
nding mute Saw hi
m ply t VIC 3181',
'Page 2 of 3
ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',
'YOUR USAGE BREAKDOWN
Average cost per day $1.57 kWh Tonnes']
I tried concatenating them as follows -
data_hold[0] + '\n' + data_hold[1]
Actual result:
"ye la AAA TAM tat TE\n0042\n\non the mountain sta\nnding mute Saw hi\nm ply t VIC 3181ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',\n'YOUR USAGE BREAKDOWNAverage cost per day $1.57 kWh Tonnes'\n
Expected result:
'ye la AAA TAM tat TE
0042
on the mountain sta
nding mute Saw hi
m ply t VIC 3181',
'Page 2 of 3
ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',
'YOUR USAGE BREAKDOWN
Average cost per day $1.57 kWh Tonnes'

Your 'expected result' is not a single string. However, running print('\n'.join(data_hold)) will produce the equivalent single string.

You misunderstand the difference between what the actual value of a string is, what is printed if you print() the string and how Python may represent the string to show you its value on screen.
For example, take a string with the value:
One line.
Another line, with a word in 'quotes'.
So, the string contains a single text, with two lines and some part of the string has the same quotes in it that you would use to mark the beginning and end of the string.
In code, there's various ways you can construct this string:
one_way = '''One line
Another line, with a word in 'quotes'.'''
another_way = 'One line\nAnother line, with a word in \'quotes\'.'
When you run this, you'll find that one_way and another_way contain the exact same string that, when printed, looks just like the example text above.
Python, when you ask it to show you the representation in code, will actually show you the string like it is specified in the code for another_way, except that it prefers to show it using double quotes to avoid having to escape the single quotes:
>>> one_way = '''One line
... Another line, with a word in 'quotes'.'''
>>> one_way
"One line\nAnother line, with a word in 'quotes'."
Compare:
>>> this = '''Some text
... continued here'''
>>> this
'Some text\ncontinued here'
Note how Python decides to use single quotes if there are no single quotes in the string itself. And if both types of quotes are in there, it'll escape like the example code above:
>>> more = '''Some 'text'
... continued "here"'''
>>> more
'Some \'text\'\ncontinued "here"'
But when printed, you get what you'd expect:
>>> print(more)
Some 'text'
continued "here"

Related

how do i use string.replace() to replace only when the string is exactly matching

I have a dataframe with a list of poorly spelled clothing types. I want them all in the same format , an example is i have "trous" , "trouse" and "trousers", i would like to replace the first 2 with "trousers".
I have tried using string.replace but it seems its getting the first "trous" and changing it to "trousers" as it should and when it gets to "trouse", it works also but when it gets to "trousers" it makes "trousersersers"! i think its taking the strings which contain trous and trouse and trousers and changing them.
Is there a way i can limit the string.replace to just look for exactly "trous".
here's what iv troied so far, as you can see i have a good few changes to make, most of them work ok but its the likes of trousers and t-shirts which have a few similar changes to be made thats causing the upset.
newTypes=[]
for string in types:
underwear = string.replace(('UNDERW'), 'UNDERWEAR').replace('HANKY', 'HANKIES').replace('TIECLI', 'TIECLIPS').replace('FRAGRA', 'FRAGRANCES').replace('ROBE', 'ROBES').replace('CUFFLI', 'CUFFLINKS').replace('WALLET', 'WALLETS').replace('GIFTSE', 'GIFTSETS').replace('SUNGLA', 'SUNGLASSES').replace('SCARVE', 'SCARVES').replace('TROUSE ', 'TROUSERS').replace('SHIRT', 'SHIRTS').replace('CHINO', 'CHINOS').replace('JACKET', 'JACKETS').replace('KNIT', 'KNITWEAR').replace('POLO', 'POLOS').replace('SWEAT', 'SWEATERS').replace('TEES', 'T-SHIRTS').replace('TSHIRT', 'T-SHIRTS').replace('SHORT', 'SHORTS').replace('ZIP', 'ZIP-TOPS').replace('GILET ', 'GILETS').replace('HOODIE', 'HOODIES').replace('HOODZIP', 'HOODIES').replace('JOGGER', 'JOGGERS').replace('JUMP', 'SWEATERS').replace('SWESHI', 'SWEATERS').replace('BLAZE ', 'BLAZERS').replace('BLAZER ', 'BLAZERS').replace('WC', 'WAISTCOATS').replace('TTOP', 'T-SHIRTS').replace('TROUS', 'TROUSERS').replace('COAT', 'COATS').replace('SLIPPE', 'SLIPPERS').replace('TRAINE', 'TRAINERS').replace('DECK', 'SHOES').replace('FLIP', 'SLIDERS').replace('SUIT', 'SUITS').replace('GIFTVO', 'GIFTVOUCHERS')
newTypes.append(underwear)
types = newTypes
Assuming you're okay with not using string.replace(), you can simply do this:
lst = ["trousers", "trous" , "trouse"]
for i in range(len(lst)):
if "trous" in lst[i]:
lst[i] = "trousers"
print(lst)
# Prints ['trousers', 'trousers', 'trousers']
This checks if the shortest substring, trous, is part of the string, and if so converts the entire string to trousers.
Use a dict for string to be replaced:
d={
'trous': 'trouser',
'trouse': 'trouser',
# ...
}
newtypes=[d.get(string,string) for string in types]
d.get(string,string) will return string if string is not in d.

Parsing the String to different format in Python

I have a text document and I need to add two # symbols before the keywords present in an array.
Sample text and Array:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
Required Text:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32"
Just use the replace function
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr = ['name','employee_id','blood_group','age']
for w in arr:
str = str.replace(w, f'##{w}')
print(str)
You can simply loop over arr and use the str.replace function:
for repl in arr:
strng.replace(repl, '##'+repl)
print(strng)
However, I urge you to change the variable name str because it is a reserved keyword.
You might use re module for that task following way
import re
txt = "This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
newtxt = re.sub('('+'|'.join(arr)+')',r'##\1',txt)
print(newtxt)
Output:
This is a sample text document which consists of all demographic information of employee here is the value you may need,##name: George ##employee_id:14296##blood_group:b positive this is the blood group of the employee##age:32
Explanation: here I used regular expression to catch words from your list and replace each with ##word. This is single pass, as opposed to X passes when using multiple str.replace (where X is length of arr), so should be more efficient for cases where arr is long.
As an alternative, you can convert the below in a loop for lengthier list. There seems to be space before ## too.
str= str[:str.find(arr[0])] + ' ##' + str[str.find(arr[0]):]
str= str[:str.find(arr[1])] + ' ##' + str[str.find(arr[1]):]
str= str[:str.find(arr[2])] + ' ##' + str[str.find(arr[2]):]
str= str[:str.find(arr[3])] + ' ##' + str[str.find(arr[3]):]
You can replace the value and add space and double ## before the replaced value and in the result replace double spaces with one space.
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
for i in arr:
str = str.replace(i, " ##{}".format(i))
print(str.replace(" ", " "))
Output
This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32

Python RegEx for Australian Phone Numbers - False negative - 2 matches in the same substring

I am trying to extract phone numbers from a web page using Python & RegEx
Australian number format
+61 (international code - shown below as 'i')
02, 03, 07 or 08 (state codes - shown below as 's')
1234-5678 (8 digit local number - shown below as 'x')
Common variations of format (in order of commonality):
Format 1: ss xxxx xxxx (e.g. 02 1234 5678)
Format 2: +ii s xxxx xxxx (e.g. +61 2 1234 5678) (note the first 's' digit is removed here)
Format 3: (seen rarely) +ii (s)s xxxx-xxxx (e.g. +61 (0)2 1234 5678
My RegEx:
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', sample_text))
works well on a simple sample_text:
sample_text =
"610212345678ABC##610312345678ABC##610712345678ABC##610812345678ABC##0212345678ABC##0312345678ABC##0712345678ABC##0812345678ABC##61212345678ABC##61312345678ABC##61712345678ABC##61812345678ABC##0412345678ABC##61412345678ABC##130012345678ABC##180012345678ABC##"
Result:
['0212345678', '0312345678', '0712345678', '0812345678',
'0212345678', '0312345678', '0712345678', '0812345678',
'61212345678', '61312345678', '61712345678', '61812345678',
'0412345678', '61412345678', '1300123456', '1800123456']
The Goal
Using http://www.outware.com.au/contact as an example ...
The 2 actual numbers on the page are:
+61 (0)3 8684 9912 and +61 (0)2 8064 7043 (both numbers appear twice - once in the main section of the page and once in the footer)
The Problem
#take HTML markup from body tags
b = driver.find_element_by_css_selector('body').text
#remove all non-alpha + white space.
b = re.sub(r'\W+', '', b)
Result:
"PORTFOLIOINNOVATIONSERVICESCAREERSINSIGHTSNEWSABOUTCONTACTCONTACTOUTWAREMelbourneLe......AFRFast100Nov92017EXPLOREOUTWAREPortfolioInnovationWorkingatOutwareAboutSitemapCONNECTMELBOURNELevel3469LaTrobeStMelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"
Now if I apply my regex to this string
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', b))
Result:
[u'0386849912', u'0761028064', u'0386849912', u'0761028064']
I am getting a false positive because I have concatenated a postcode "NSW2007" onto the start of the phone number.
I presume because the regex has parsed the first part of "NSW2007610280647043" matching "0761028064" it doesn't then match "0280647043" which is also part of the same substring
I actually don't mind the false positive (i.e. getting "0761028064") but I do need to solve the false negative (i.e. not getting "0280647043")
I know there's some RegEx gurus here who can help on this. :-)
Please help!!
Don't search/replace any text prior to using the regex. That will make your input unusable. Try this:
(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}
https://regex101.com/r/1Q4HuD/3
It might help if you use a negative look ahead to check to see make sure the following character is not a number. For example: (?!\d).
This could create a problem though if some data following a phone number starts with a number.
The look behind looks like this when implemented in your regex:
(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)
(I removed the square brackets as you do not need them when trying to match a single character)
This answer should be a comment, it isn't because of my low reputation!
I've seen you're updating the regex and I think this variation can help you. It should match very uncommon formats!
(\+61 )?(?:0|\(0\))?[2378] (?:[\s-]*\d){8}

Execute only if string contains a ','?

I'm trying to execute a bunch of code only if the string I'm searching contains a comma.
Here's an example set of rows that I would need to parse (name is a column header for this tab-delimited file and the column (annoyingly) contains the name, degree, and area of practice:
name
Sam da Man J.D.,CEP
Green Eggs Jr. Ed.M.,CEP
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
My issue is that some of the rows contain a comma, which is followed by an acronym which represents an "area of practice" for the professional and some do not.
My code relies on the principle that each line contains a comma, and I will now have to modify the code in order to account for lines where there is no comma.
def parse_ieca_gc(s):
########################## HANDLE NAME ELEMENT ###############################
degrees = ['M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.']
degrees_list = []
# separate area of practice from name and degree and bind this to var 'area'
split_area_nmdeg = s['name'].split(',')
area = split_area_nmdeg.pop() # when there is no area of practice and hence no comma, this pops out the name + deg and leaves an empty list, that's why 'print split_area_nmdeg' returns nothing and 'area' returns the name and deg when there's no comma
print 'split area nmdeg'
print area
print split_area_nmdeg
# Split the name and deg by spaces. If there's a deg, it will match with one of elements and will be stored deg list. The deg is removed name_deg list and all that's left is the name.
split_name_deg = re.split('\s',split_area_nmdeg[0])
for word in split_name_deg:
for deg in degrees:
if deg == word:
degrees_list.append(split_name_deg.pop())
name = ' '.join(split_name_deg)
# area of practice
category = area
re.search() and re.match() both do not work, it appears, because they return instances and not a boolean, so what should I use to tell if there's a comma?
The easiest way in python to see if a string contains a character is to use in. For example:
if ',' in s['name']:
if re.match(...) is not None :
instead of looking for boolean use that. Match returns a MatchObject instance on success, and None on failure.
You are already searching for a comma. Just use the results of that search:
split_area_nmdeg = s['name'].split(',')
if len(split_area_nmdeg) > 2:
print "Your old code goes here"
else:
print "Your new code goes here"

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
parse item.title for bandRaw, venue, date
if bandRaw == str(band)
send venue name + ", Dallas, TX" to google for geocoding
return lat,long
list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
pat = re.compile('([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individual, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
if m:
bandRaw, venue, date = m.groups()
if band == bandRaw:
place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
lines.append(",".join([band, venue, date, lat, lng]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.

Categories

Resources