R Regex, get string between quotation marks - python

I'm trying to extract "Document is original" from the string below.
c:1:{s:7:"note";s:335:"Document is original-no need to register again";}

Two thoughts:
With a little bit of work, you can get most components of that structure:
string <- 'c:1:{s:7:"note";s:335:"Document is original-no need to register again";}'
strcapture("(.*):(.*):(.*)",
strsplit(regmatches(string, gregexpr('(?<={)[^}]+(?=})', string, perl = TRUE))[[1]], ";")[[1]],
proto = list(s="", len=1L, x=""))
# s len x
# 1 s 7 "note"
# 2 s 335 "Document is original-no need to register again"
A simpler approach, though a little more hard-coded:
regmatches(string, gregexpr('(?<=")([^;"]+)(?=")', string, perl = TRUE))[[1]]
# [1] "note"
# [2] "Document is original-no need to register again"
From here, you need to figure out how to discard "note", and then perhaps use strsplit(.., "-") to get the substring you want.
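Since the title also mentions Python, here is a rough Python sketch of the simpler approach. It assumes the wanted value is always the last quoted field and that the interesting part ends at the first hyphen; both are guesses based on the single sample string.

```python
import re

# The serialized string from the question.
string = 'c:1:{s:7:"note";s:335:"Document is original-no need to register again";}'

# Capture every run of characters between double quotes that contains
# neither a quote nor a semicolon.
quoted = re.findall(r'"([^";]+)"', string)

# Assumption: the value of interest is the last quoted field, and the
# wanted substring is everything before the first hyphen.
wanted = quoted[-1].split("-", 1)[0]

print(quoted)
print(wanted)
```

If the real data can contain semicolons or escaped quotes inside the values, this pattern would need to be adjusted.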

Related

A better way to use .replace() in Python

I built a pretty basic program that takes English input and encrypts it using random letters from the alphabets of different languages. It can also decrypt it:
def encrypt_decrypt():
    inut = input("Text to convert ::-- ")
    # feel free to replace the symbols with your own characters or numbers or something
    # you can also add numbers and other characters for encryption or decryption
    decideing_variable = input("U wanna encrypt or decrypt ?? ,, write EN or DE ::- ")
    if decideing_variable == "EN":
        deep = inut.replace("a", "ᛟ").replace("b", "ᛃ").replace("c", "Ῡ").replace("d", "ϰ").replace("e", "Г").replace("f", "ξ").replace("g", "ᾫ").replace("h", "ῆ").replace("i", "₪").replace("j", "א").replace("k", "ⴽ").replace("l", "ⵞ").replace("m", "ⵥ").replace("n", "ঙ").replace("o", "Œ").replace("p", "უ").replace("q", "ক").replace("r", "ჶ").replace("s", "Ø").replace("t", "ю").replace("u", "ʧ").replace("v", "ʢ").replace("w", "ұ").replace("x", "Џ").replace("y", "န").replace("z", "໒")
        print(f"\n{deep}\n")
    elif decideing_variable == "DE":
        un_deep = inut.replace("ᛟ", "a").replace("ᛃ", "b").replace("Ῡ", "c").replace("ϰ", "d").replace("Г", "e").replace("ξ","f").replace("ᾫ", "g").replace("ῆ", "h").replace("₪", "i").replace("א", "j").replace("ⴽ", "k").replace("ⵞ", "l").replace("ⵥ", "m").replace("ঙ", "n").replace("Œ", "o").replace("უ", "p").replace("ক", "q").replace("ჶ", "r").replace("Ø", "s").replace("ю", "t").replace("ʧ", "u").replace("ʢ", "v").replace("ұ", "w").replace("Џ", "x").replace("န", "y").replace("໒", "z")
        print(f"\n{un_deep}\n")

encrypt_decrypt()
While writing this, I didn't know any better way than chaining .replace() calls, but I have a feeling that this isn't the proper way to do it.
The code works fine, but does anyone know a better way of doing this?
It looks like you are doing a character-by-character replacement. The function you are looking for is str.maketrans, used together with str.translate. You can give it two strings of equal length, and each character of the first is mapped to the character at the same position in the second. Here is an example:
# strings of equal length: prints a mapping of code points
firstString = "abc"
secondString = "def"
string = "abc"
print(string.maketrans(firstString, secondString))
# strings of unequal length: this call raises ValueError
firstString = "abc"
secondString = "defghi"
string = "abc"
print(string.maketrans(firstString, secondString))
You can also look at the official documentation for further details.
You can make a dictionary of corresponding characters and use it like this:
text = "ababdba"
translation = {'a':'ᛟ', 'b':'ᛃ', 'c':'Ῡ','d': 'ϰ','e': 'Г','f': 'ξ','g': 'ᾫ','h':'ῆ','i': '₪','j': 'א','k': 'ⴽ','l': 'ⵞ','m' :'ⵥ','n': 'ঙ','o': 'Œ','p': 'უ','q': 'ক','r': 'ჶ','s': 'Ø','t': 'ю','u': 'ʧ', 'v':'ʢ','w': 'ұ','x': 'Џ','y': 'န','z': '໒'}
def translate(text, translation):
    result = []
    for char in text:
        # fall back to the original character when it has no mapping (e.g. spaces)
        result.append(translation.get(char, char))
    return "".join(result)

print(translate(text, translation))
result is
ᛟᛃᛟᛃϰᛃᛟ
This might help you.
str.translate() and str.maketrans() are built to do all of the replacements in one go.
For example:
>>> encrypt_table = str.maketrans("abc", "ᛟᛃῩ")
>>> "an abacus".translate(encrypt_table)
'ᛟn ᛟᛃᛟῩus'
Note: this is not string.maketrans(), which is how it was spelled in Python 2 and is now outdated. Python 3 split it into two functions: str.maketrans() for text and bytes.maketrans() for bytes. See "How come string.maketrans does not work in Python 3.1?"
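To tie this back to the original program, here is a minimal sketch of an encrypt/decrypt pair built on two translation tables. The five-letter alphabet and the names PLAIN, CIPHER, encrypt and decrypt are made up for illustration; the full 26-letter mapping from the question would be used the same way, as long as every symbol is a single character.

```python
# Hypothetical five-letter alphabet, only to keep the sketch short;
# both strings must have the same length.
PLAIN = "abcde"
CIPHER = "ᛟᛃЏюГ"

ENCRYPT_TABLE = str.maketrans(PLAIN, CIPHER)
DECRYPT_TABLE = str.maketrans(CIPHER, PLAIN)


def encrypt(text):
    """Replace every mapped character in a single pass."""
    return text.translate(ENCRYPT_TABLE)


def decrypt(text):
    """Apply the inverse mapping."""
    return text.translate(DECRYPT_TABLE)


secret = encrypt("a bad cab")
print(secret)
print(decrypt(secret))
```

Characters without an entry in the table (the spaces here) pass through unchanged, which the chained .replace() version also does.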

Regex match on the values of elements in an array in Python 2.7

I'm using Python 2.7.
I have an array with objects like:
[{"TEMPLATE_NAME": "HP_LaserJet_P2055dn_USB_S29HDY6_HPLIP",
"PRINTER_INFO": "HP LaserJet P2055dn",
"PRINTER_LOCATION": "Локальный принтер",
"DEVICE_URI": "hp:/usb/HP_LaserJet_P2055dn?serial=S29HDY6"},
{"TEMPLATE_NAME": "HP_LaserJet_P2055dn",
"PRINTER_INFO": "HP LaserJet P2055dn",
"PRINTER_LOCATION": "Локальный принтер",
"DEVICE_URI": "usb://HP/LaserJet%20P2055dn?serial=S29HDY6"}]
I need to get the first object in the array in which any of the values matches the argument. Currently it is done like this:
ArgInListFindNewPrinters = next(name for name in ListFindNewPrinters if ArgPrinter in [name['PRINTER_INFO'], name['DEVICE_URI'], name['TEMPLATE_NAME'], name['PRINTER_LOCATION']])
print ArgInListFindNewPrinters
>> {"TEMPLATE_NAME": "HP_LaserJet_P2055dn_49A71E", "PRINTER_INFO": "HP HP LaserJet P2055dn", "PRINTER_LOCATION": "Локальный принтер", "DEVICE_URI": "dnssd://HP%20LaserJet%20P2055dn%20%5B49A71E%5D._pdl-datastream._tcp.local/"}
The disadvantage of this method is that it only finds a complete match between the argument and a string, whereas I need any case-insensitive occurrence.
Example: ArgPrinter = "LaserJe", ArgPrinter = "=S29HD"
The main problem is finding any occurrence of a substring in a string.
===========================================================================
I found a solution, but it is not very practical because converting the object to a string requires a change in encoding:
ArgInListFindNewPrinters = next(name for name in ListFindNewPrinters if re.search(ArgPrinter, str(name), re.IGNORECASE))
If there are better ways to do this, I would be grateful.
Convert both the target string and the searched string to lowercase to perform a case-insensitive search.
Use if x in string to match substrings.
There may be a way to do this more nicely, but this works:
ArgInListFindNewPrinters = next(name for name in ListFindNewPrinters
                                if ArgPrinter.lower() in name['PRINTER_INFO'].lower()
                                or ArgPrinter.lower() in name['DEVICE_URI'].lower()
                                or ArgPrinter.lower() in name['TEMPLATE_NAME'].lower()
                                or ArgPrinter.lower() in name['PRINTER_LOCATION'].lower())
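A possibly tidier variant of the same idea is to loop over all values with any() instead of naming each key. This is only a sketch, written for Python 3 (where the encoding concern from the question mostly goes away); the function name find_first_printer and the shortened sample data are made up for illustration.

```python
def find_first_printer(printers, needle):
    """Return the first dict with a value containing needle, ignoring case."""
    needle = needle.lower()
    for printer in printers:
        if any(needle in str(value).lower() for value in printer.values()):
            return printer
    return None  # nothing matched


# Sample data shortened from the question.
printers = [
    {"TEMPLATE_NAME": "HP_LaserJet_P2055dn",
     "DEVICE_URI": "usb://HP/LaserJet%20P2055dn?serial=S29HDY6"},
]

print(find_first_printer(printers, "=s29hd"))
print(find_first_printer(printers, "Canon"))
```

Returning None avoids the StopIteration that next() raises when no element matches and no default is given.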

Matching multiple strings from an array against multiline text (regular expression) in Python

I am working on a regular expression in Python. I have spent the whole week on it and can't understand what is wrong with my code. It is obvious that multiple strings should match, but I only get a few of them, such as "model" and "US"; I can't match 37abc5afce16xxx or "-104.99875". My goal is just to tell whether any string in the array matches, and which one.
I have a dictionary such as:
text = {'"version_name"': '"8.5.2"', '"abi"': '"arm64-v8a"', '"x_dpi"':
'515.1539916992188', '"environment"': '{"sdk_version"',
'"time_zone"':
'"America\\/Wash"', '"user"': '{}}', '"density_default"': '560}}',
'"resolution_width"': '1440', '"package_name"':
'"com.okcupid.okcupid"', '"d44bcbfb-873454-4917-9e02-2066d6605d9f"': '{"language"', '"country"':
'"US"}', '"now"': '1.515384841291E9', '{"extras"': '{"sessions"',
'"device"': '{"android_version"', '"y_dpi"': '37abc5afce16xxx',
'"model"': '"Nexus 6P"', '"new"': 'true}]', '"only_respond_with"':
'["triggers"]}\n0\r\n\r\n', '"start_time"': '1.51538484115E9',
'"version_code"': '1057', '"-104.99875"': '"0"', '"no_acks"': 'true}',
'"display"': '{"resolution_height"'}
and an array of strings:
Keywords =["37abc5afce16xxx","867686022684243", "ffffffff-f336-7a7a-0f06-65f40033c587", "long", "Lat", "uuid", "WIFI", "advertiser", "d44bcbfb-873454-4917-9e02-2066d6605d9f","deviceFinger", "medialink", "Huawei","Andriod","US","local_ip","Nexus", "android2.10.3","WIFI", "operator", "carrier", "angler", "MMB29M", "-104.99875"]
My code is:
for x in Keywords:
    pattern = r"^.*"+str(x)+"^.*"
    if re.findall(pattern, str(values1),re.M):
        print "Match"
        print x
    else:
        print "Not Match"
Your code's goal is a bit confusing, so this assumes you want to check which items from the Keywords list are also in the text dictionary.
In your code, it looks like you only compare the regex to the dictionary values, not the keys (assuming that's what the values1 variable is).
Also, instead of using the regex "^.*" to match strings, you can simply do:
for X in Keywords:
    if X in yourDictionary.keys():
        doSomething
    if X in yourDictionary.values():
        doSomethingElse
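Note that the membership test above only succeeds when a keyword equals a whole key or value, and the keys and values in the question carry extra quotes and braces. If substring matches are what you are after, something like the following sketch may be closer. The data is a shortened stand-in for the question's dictionary and keyword list, and re.escape is used so that characters such as "." are matched literally.

```python
import re

# Shortened, hypothetical stand-ins for the question's data.
text = {'"country"': '"US"}', '"y_dpi"': '37abc5afce16xxx',
        '"model"': '"Nexus 6P"', '"-104.99875"': '"0"'}
keywords = ["37abc5afce16xxx", "uuid", "US", "Nexus", "-104.99875"]

# Search keys and values together, since some keywords only occur in keys.
haystack = " ".join(list(text.keys()) + list(text.values()))

matched = []
for keyword in keywords:
    # re.escape makes characters such as "." and "-" match literally.
    if re.search(re.escape(keyword), haystack):
        matched.append(keyword)

print(matched)
```

For plain substrings like these, `keyword in haystack` would give the same answer without the re module.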

How to replace digits in string?

OK, say I have a string in Python:
str="martin added 1 new photo to the <a href=''>martins photos</a> album."
The string contains a lot more CSS/HTML in real-world use.
What is the fastest way to change the 1 ('1 new photo') to, say, '2 new photos'? Of course, later the '1' may be '12'.
Note, I don't know what the number is, so doing a replace is not acceptable.
I also need to change 'photo' to 'photos' but I can just do a .replace(...).
Unless there is a neater, easier solution to modify both?
Update
Never mind. From the comments it is evident that the OP's requirement is more complicated than it appears in the question. I don't think it can be solved by my answer.
Original Answer
You can convert the string to a template and store it. Use placeholders for the variables.
template = """%(user)s added %(count)s new %(l_object)s to the
<a href='%(url)s'>%(text)s</a> album."""
options = dict(user = "Martin", count = 1, l_object = 'photo',
url = url, text = "Martin's album")
print template % options
This expects the object of the sentence to be pluralized externally. If you want this logic (or more complex conditions) in your template(s) you should look at a templating engine such as Jinja or Cheetah.
It sounds like this is what you want (although why is another question :^)
import re

def add_photos(s,n):
    def helper(m):
        num = int(m.group(1)) + n
        plural = '' if num == 1 else 's'
        return 'added %d new photo%s' % (num,plural)
    return re.sub(r'added (\d+) new photo(s?)',helper,s)

s = "martin added 0 new photos to the <a href=''>martins photos</a> album."
s = add_photos(s,1)
print s
s = add_photos(s,5)
print s
s = add_photos(s,7)
print s
Output
martin added 1 new photo to the <a href=''>martins photos</a> album.
martin added 6 new photos to the <a href=''>martins photos</a> album.
martin added 13 new photos to the <a href=''>martins photos</a> album.
Since you're not parsing HTML, just use a regular expression:
import re
exp = "{0} added ([0-9]*) new photo".format(name)
number = int(re.findall(exp, strng)[0])
This assumes that you will always pass it a string with the number in it. If not, you'll get an IndexError.
I would store the number and the format string, though, in addition to the formatted string. When the number changes, render the format string again and replace your stored copy of the result. This will be much better than trying to parse a string to get the count.
In response to your question about whether the HTML matters: I don't think so. You are not trying to extract information that the HTML is encoding, so you are not parsing HTML with regular expressions. This is just a string as far as that concern goes.
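A minimal sketch of the "store the number and the format string" idea could look like this. The names TEMPLATE and render, and the default album text, are made up for illustration; where the count is actually stored is up to your application.

```python
TEMPLATE = "{user} added {count} new {noun} to the <a href='{url}'>{album}</a> album."


def render(user, count, url="", album="martins photos"):
    """Rebuild the sentence from stored values instead of parsing it."""
    noun = "photo" if count == 1 else "photos"
    return TEMPLATE.format(user=user, count=count, noun=noun, url=url, album=album)


count = 1                      # stored next to the template
print(render("martin", count))
count += 1                     # a new photo arrives
print(render("martin", count))
```

The pluralization lives in one place, so neither the number nor the word "photo" ever has to be found inside the HTML again.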

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (it is my pseudo-Python, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # trailing parenthesis put back
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
# ('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, call group() on the info object. The first two groups end in a space, so strip them before printing:
print(info.group(1))  # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3)))
# "Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the possible characters in the title. If there are non-alphanumeric characters in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, reading left to right:
([\w\s]+) : Match any word or whitespace characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. \w already covers letters, digits, and the underscore; if there can be dashes or other punctuation here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : A literal closing parenthesis, matching the ")" at the end of the string.
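The breakdown above can also be kept next to the pattern itself by using re.VERBOSE, which allows whitespace and comments inside the regex. The sketch below is a variation rather than the exact pattern above: it uses named groups and lazy quantifiers so that the trailing spaces stay out of the captures. The name TITLE and the group names are made up for illustration.

```python
import re

# Same overall structure as the pattern above, with spaces kept outside the groups.
TITLE = re.compile(r"""
    (?P<band>[\w\s]+?)    # band: word or whitespace characters, as few as possible
    \s*\(                 # optional spaces, then a literal opening parenthesis
    (?P<venue>[\w\s]+?)   # venue: word or whitespace characters
    \s+                   # the space before the date
    (?P<date>\d+/\d+)     # date such as 3/26
    \)                    # a literal closing parenthesis
""", re.VERBOSE)

m = TITLE.match("Michael Schenker Group (House of Blues Dallas 3/26)")
print(m.groupdict())
```

Named groups let you write m.group("venue") instead of remembering which number each part has.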
The Python introduction to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import feedparser  # from www.feedparser.org
from geopy import geocoders  # from GeoPy

us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # lat and lng are numbers, so convert them before joining
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
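For the CSV part of the question, the csv module can handle the commas and quoting instead of joining strings by hand. Here is a small sketch without the feed and geocoding steps, so it runs on its own. The two titles come from the question; the io.StringIO buffer is only a stand-in for a real file opened for writing.

```python
import csv
import io
import re

TITLE = re.compile(r"(.*) \((.*) (\d+/\d+)\)")

# Titles copied from the question, with the closing parenthesis kept.
titles = [
    "randy travis (Billy Bobs 3/21)",
    "Michael Schenker Group (House of Blues Dallas 3/26)",
]

buffer = io.StringIO()          # stand-in for an open .csv file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["band", "venue", "date"])
for title in titles:
    m = TITLE.match(title)
    if m:                       # skip titles that do not fit the pattern
        writer.writerow(m.groups())

print(buffer.getvalue())
```

A venue name containing a comma would be quoted automatically by the writer, which a plain ",".join() would not do.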
