How to replace URL parts with regex in Python? - python

I have a JSON file with URLs that looks something like this:
{
"a_url": "https://foo.bar.com/x/env_t/xyz?asd=dfg",
"b_url": "https://foo.bar.com/x/env_t/xyz?asd=dfg",
"some other property": "blabla",
"more stuff": "yep",
"c_url": "https://blabla.com/somethingsomething?maybe=yes"
}
In this JSON, I want to look up all URLs that have a specific format, and then replace some parts in it.
In URLs that have the format of the first 2 URLs, I want to replace "foo" by "fooa" and "env_t" by "env_a", so that the output looks like this:
{
"a_url": "https://fooa.bar.com/x/env_a/xyz?asd=dfg",
"b_url": "https://fooa.bar.com/x/env_a/xyz?asd=dfg",
"some other property": "blabla",
"more stuff": "yep",
"c_url": "https://blabla.com/somethingsomething?maybe=yes"
}
I can't figure out how to do this. I came up with this regex:
https://foo([a-z]?)\.bar\.com/x/(.+)/.+\"
In regex101 this matches my URLs and captures the groups that I'm seeking to replace, but I can't figure out how to do this with Python's regex.sub().

1. Using regex
import regex
url = 'https://foo.bar.com/x/env_t/xyz?asd=dfg'
new_url = regex.sub(
'(env_t)',
'env_a',
regex.sub('(foo)', 'fooa', url)
)
print(new_url)
output:
https://fooa.bar.com/x/env_a/xyz?asd=dfg
2. Using str.replace
with open('./your-json-file.json', 'r+') as f:
    content = f.read()
    new_content = content \
        .replace('foo', 'fooa') \
        .replace('env_t', 'env_a')
    f.seek(0)
    f.write(new_content)
    f.truncate()
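Both approaches above replace every occurrence of "foo" and "env_t", including ones outside the URLs the question targets. A minimal sketch of a stricter variant, assuming the JSON is a flat object and that only URLs of the exact form https://foo.bar.com/x/env_t/... should change. The names URL_PATTERN and rewrite_urls are made up for illustration, and the pattern is a simplified version of the one in the question:

```python
import json
import re

# Assumed shape of the URLs to rewrite; the two groups capture the parts
# that must be carried over unchanged into the new URL.
URL_PATTERN = re.compile(r"^https://foo(\.bar\.com/x/)env_t(/.*)$")


def rewrite_urls(data):
    """Return a copy of a flat dict with matching URL values rewritten."""
    return {
        key: URL_PATTERN.sub(r"https://fooa\1env_a\2", value)
        if isinstance(value, str) else value
        for key, value in data.items()
    }


raw = '''{
  "a_url": "https://foo.bar.com/x/env_t/xyz?asd=dfg",
  "some other property": "blabla",
  "c_url": "https://blabla.com/somethingsomething?maybe=yes"
}'''

result = rewrite_urls(json.loads(raw))
print(json.dumps(result, indent=2))
```

Values that do not match the pattern, such as c_url, pass through untouched because re.sub returns the input unchanged when nothing matches.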

Related

Python How to add nested fields to Yaml file

I need to modify a YAML file and add several fields. I am using the ruamel.yaml package.
First I load the YAML file:
data = yaml.load(file_name)
I can easily add new simple fields, like-
data['prop1'] = "value1"
The problem I face is that I need to add a nested dictionary that contains a list:
prop2:
  prop3:
    - prop4:
        prop5: "Some title"
        prop6: "Some more data"
I tried to define-
record_to_add = dict(prop2 = dict(prop3 = ['prop4']))
This is working, but when I try to add beneath it prop5 it fails-
record_to_add = dict(prop2 = dict(prop3 = ['prop4'= dict(prop5 = "Value")]))
I get
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
What am I doing wrong?
The problem has little to do with ruamel.yaml. This:
['prop4'= dict(prop5 = "Value")]
is invalid Python as a list ([ ]) expects comma separated values. You would need to use something like:
record_to_add = dict(prop2 = dict(prop3 = dict(prop4= [dict(prop5 = "Some title"), dict(prop6='Some more data'),])))
As your program is incomplete I am not sure if you are using the old API or not. Make sure to use
import ruamel.yaml
yaml = ruamel.yaml.YAML()
and not
import ruamel.yaml as yaml
It's because of ['prop4'= <> ]: a list literal cannot contain an assignment. Instead, record_to_add = dict(prop2 = dict(prop3 = [dict(prop4 = dict(prop5 = "Value"))])) should work.
Another alternative would be:
import yaml
data = {
    "prop2": {
        "prop3": [
            {
                "prop4": {
                    "prop5": "some title",
                    "prop6": "some more data"
                }
            }
        ]
    }
}
with open(filename, 'w') as outfile:
    yaml.dump(data, outfile, default_flow_style=False)
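To tie this back to the structure in the question, here is a minimal sketch that builds the nested value with plain dict and list literals and attaches it to already-loaded data. The `data` dict below is only a stand-in for whatever yaml.load returned (an assumption, since the question's program is incomplete). Dumping it back to a file is left out on purpose:

```python
# Stand-in for the mapping returned by yaml.load(...); assumed for illustration.
data = {"prop1": "value1"}

# prop3 is a list with one element; that element is a mapping whose single
# key, prop4, holds another mapping with prop5 and prop6.
data["prop2"] = {
    "prop3": [
        {
            "prop4": {
                "prop5": "Some title",
                "prop6": "Some more data",
            }
        }
    ]
}

first_item = data["prop2"]["prop3"][0]
print(first_item["prop4"]["prop5"])
```

Writing the structure as literals avoids the dict(key=...) keyword syntax, which is what led to the assignment inside the list in the first place.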

Dictionary from a String with particular structure

I am using python 3 to read this file and convert it to a dictionary.
I have this string from a file and I would like to know how could be possible to create a dictionary from it.
[User]
Date=10/26/2003
Time=09:01:01 AM
User=teodor
UserText=Max Cor
UserTextUnicode=392039n9dj90j32
[System]
Type=Absolute
Dnumber=QS236
Software=1.1.1.2
BuildNr=0923875
Source=LAM
Column=OWKD
[Build]
StageX=12345
Spotter=2
ApertureX=0.0098743
ApertureY=0.2431899
ShiftXYZ=-4.234809e-002
[Text]
Text=Here is the Text files
DataBaseNumber=The database number is 918723
..... (There are more than 1000 lines per file) ...
On the text I have "Name=Something" and then I would like to convert it as follows:
{'Date':'10/26/2003',
'Time':'09:01:01 AM'
'User':'teodor'
'UserText':'Max Cor'
'UserTextUnicode':'392039n9dj90j32'.......}
The word between [ ] can be removed, like [User], [System], [Build], [Text], etc...
In some fields there is only the first part of the string:
[Colors]
Red=
Blue=
Yellow=
DarkBlue=
What you have is an ordinary properties file. You can use this example to read the values into map:
try (InputStream input = new FileInputStream("your_file_path")) {
Properties prop = new Properties();
prop.load(input);
// prop.getProperty("User") == "teodor"
} catch (IOException ex) {
ex.printStackTrace();
}
EDIT:
For a Python solution, refer to the answered question.
You can use configparser to read .ini, or .properties files (format you have).
import configparser
config = configparser.ConfigParser()
config.read('your_file_path')
# config['User'] == {'Date': '10/26/2003', 'Time': '09:01:01 AM'...}
# config['User']['User'] == 'teodor'
# config['System'] == {'Type': 'Absolute', ...}
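The question asks for a single flat dictionary with the section names dropped. A minimal sketch of getting there with configparser follows, assuming key names do not repeat across sections (later sections would overwrite earlier ones). The function name ini_to_flat_dict is made up for illustration. By default configparser lowercases keys, treats ":" as a second delimiter and interpolates "%", so the sketch turns those off:

```python
import configparser


def ini_to_flat_dict(text):
    """Parse INI-style text and merge all sections into one dict."""
    parser = configparser.ConfigParser(
        interpolation=None,   # treat '%' in values literally
        delimiters=("=",),    # do not split on ':'
    )
    parser.optionxform = str  # keep the original key case
    parser.read_string(text)
    flat = {}
    for section in parser.sections():
        flat.update(parser[section])
    return flat


sample = """[User]
Date=10/26/2003
Time=09:01:01 AM
User=teodor
[Colors]
Red=
Blue=
"""

flat = ini_to_flat_dict(sample)
print(flat)
```

Keys with nothing after the = come back as empty strings.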
This can easily be done in Python. Assuming your file is named test.txt.
This will also work for lines with nothing after the = as well as lines with multiple =.
d = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.strip()  # Remove surrounding whitespace and the newline
        key, sep, value = line.partition('=')  # Split at the first `=` only
        if sep:  # Skip lines without `=`, such as [User] or blank lines
            d[key] = value
print(d)
Output:
{
"Date": "10/26/2003",
"Time": "09:01:01 AM",
"User": "teodor",
"UserText": "Max Cor",
"UserTextUnicode": "392039n9dj90j32",
"Type": "Absolute",
"Dnumber": "QS236",
"Software": "1.1.1.2",
"BuildNr": "0923875",
"Source": "LAM",
"Column": "OWKD",
"StageX": "12345",
"Spotter": "2",
"ApertureX": "0.0098743",
"ApertureY": "0.2431899",
"ShiftXYZ": "-4.234809e-002",
"Text": "Here is the Text files",
"DataBaseNumber": "The database number is 918723"
}
I would suggest to do some cleaning to get rid of the [] lines.
After that you can split those lines by the "=" separator and then convert it to a dictionary.

Finding all URLs in a single line

I'm trying to fetch a page that has many URLs and other stuff, all on just one line of plain text, like
"link_url":"http://www.example.com/link1?site=web","mobile_link_url":"http://m.example.com/episode/link1?site=web" link_url":"http://www.example.com/link2?site=web","mobile_link_url":"http://m.example.com/episode/link2?site=web"
I tried
import re
import requests as req
response = req.get("http://api.example.com/?callback=jQuery112")
content = response.text
print content gives me the "link_url": output shown above,
but I need to find
http://www.example.com/link1?site=web
http://www.example.com/link2?site=web
and output only link1 and link2 to a file like
link1
link2
link3
The code below might be what you need.
import re
urls = '''"link_url":"http://www.example.com/link1?site=web","mobile_link_url":"http://m.example.com/episode/link1?site=web" link_url":"http://www.example.com/link2?site=web","mobile_link_url":"http://m.example.com/episode/link2?site=web"'''
links = re.findall(r'http://www[a-z/.?=:]+(link\d)+', urls)
print(links)
If it is a string and not a JSON object, then you could do this even though it's a bit hacky:
s1 ="\"link_url\":\"http://www.example.com/link1?site=web\",\"mobile_link_url\":\"http://m.example.com/episode/link1?site=web\" link_url\":\"http://www.example.com/link2?site=web\",\"mobile_link_url\":\"http://m.example.com/episode/link2?site=web\""
links = [x for x in s1.replace("\":\"", "LINK_DELIM").replace("\"", "").replace(" ", ",").split(",")]
for link in links:
    print(link.split("LINK_DELIM")[1])
Which yields:
http://www.example.com/link1?site=web
http://m.example.com/episode/link1?site=web
http://www.example.com/link2?site=web
http://m.example.com/episode/link2?site=web
Though I think @al76's answer is more elegant for this.
But if it's a JSON which looks like:
[
{
"link_url": "http://www.example.com/link1?site=web",
"mobile_link_url": "http://m.example.com/episode/link1?site=web"
},
{
"link_url": "http://www.example.com/link2?site=web",
"mobile_link_url": "http://m.example.com/episode/link2?site=web"
}
]
Then you could do something like:
import json
s1 = '[{"link_url": "http://www.example.com/link1?site=web", "mobile_link_url": "http://m.example.com/episode/link1?site=web"}, {"link_url": "http://www.example.com/link2?site=web", "mobile_link_url": "http://m.example.com/episode/link2?site=web"}]'
data = json.loads(s1)
links = [y for x in data for y in x.values()]
for link in links:
    print(link)
If this is a JSON api then you can use response.json() to get a python dictionary, as .text will give you the response as one long string.
You also do not need to use regex for something so simple, python comes with a url parser out of the box.
So provided your response is something like
[
{
"link_url": "http://www.example.com/link1?site=web",
"mobile_link_url": "http://m.example.com/episode/link1?site=web"
},
{
"link_url": "http://www.example.com/link2?site=web",
"mobile_link_url": "http://m.example.com/episode/link2?site=web"
}
]
(doesn't matter if IRL it's one line, as long as it's valid JSON)
You can iterate the results as a dictionary, then use urlparse to get specific components of your urls:
from urllib.parse import urlparse
import requests
response = requests.get("http://api.example.com/?callback=jQuery112")
for item in response.json():
    print(urlparse(item["link_url"]).path.rsplit('/', 1)[-1])
urlparse(...).path returns only the path of the URL, e.g. /link1 (or /episode/link1 for the mobile URL), and rsplit then takes the last segment of that path, giving link1, link2, etc.
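The same idea can be tried without a network call. A minimal sketch follows, assuming the response body is the valid JSON array shown above. The helper name last_segments is made up for illustration:

```python
import json
from urllib.parse import urlparse

payload = """[
  {"link_url": "http://www.example.com/link1?site=web",
   "mobile_link_url": "http://m.example.com/episode/link1?site=web"},
  {"link_url": "http://www.example.com/link2?site=web",
   "mobile_link_url": "http://m.example.com/episode/link2?site=web"}
]"""


def last_segments(json_text, key="link_url"):
    """Return the last path segment of the URL stored under `key` in each item."""
    return [
        urlparse(item[key]).path.rsplit("/", 1)[-1]
        for item in json.loads(json_text)
    ]


names = last_segments(payload)
print("\n".join(names))
```

Because urlparse keeps the query string separate from the path, the ?site=web part never reaches the result.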
Try:
urls=""" "link_url":"http://www.example.com/link1?site=web","mobile_link_url":"http://m.example.com/episode/link1?site=web" link_url":"http://www.example.com/link2?site=web","mobile_link_url":"http://m.example.com/episode/link2?site=web" """
re.findall(r'"http://www[^"]+"',urls)
urls=""" "link_url":"http://www.example.com/link1?site=web","mobile_link_url":"http://m.example.com/episode/link1?site=web" link_url":"http://www.example.com/link2?site=web","mobile_link_url":"http://m.example.com/episode/link2?site=web" """
p = [i.split('":')[1] for i in urls.replace(' ', ",").split(",")[1:-1]]
#### Output ####
['"http://www.example.com/link1?site=web"',
'"http://m.example.com/episode/link1?site=web"',
'"http://www.example.com/link2?site=web"',
'"http://m.example.com/episode/link2?site=web"']
*Not as efficient as regex.

Dynamically double-quote "keys" in text to form valid JSON string in python

I'm working with text contained in JS variables on a webpage and extracting strings using regex, then turning it into JSON objects in python using json.loads().
The issue I'm having is the unquoted "keys". Right now, I'm doing a series of replacements (code below) to "" each key in each string, but what I want is to dynamically identify any unquoted keys before passing the string into json.loads().
Example 1 with no space after : character
json_data1 = '[{storeName:"testName",address:"12345 Road",address2:"Suite 500",city:"testCity",storeImage:"http://www.testLink.com",state:"testState",phone:"999-999-9999",lat:99.9999,lng:-99.9999}]'
Example 2 with space after : character
json_data2 = '[{storeName: "testName",address: "12345 Road",address2: "Suite 500",city: "testCity",storeImage: "http://www.testLink.com",state: "testState",phone: "999-999-9999",lat: 99.9999,lng: -99.9999}]'
Example 3 with space after both the , and : characters
json_data3 = '[{storeName: "testName", address: "12345 Road", address2: "Suite 500", city: "testCity", storeImage: "http://www.testLink.com", state: "testState", phone: "999-999-9999", lat: 99.9999, lng: -99.9999}]'
Example 4 with space after : character and newlines
json_data4 = '''[
{
storeName: "testName",
address: "12345 Road",
address2: "Suite 500",
city: "testCity",
storeImage: "http://www.testLink.com",
state: "testState",
phone: "999-999-9999",
lat: 99.9999, lng: -99.9999
}]'''
I need to create a pattern that identifies which tokens are keys, as opposed to string values that happen to contain characters such as :, like the link in storeImage. In other words, I want to dynamically find keys and double-quote them so that json.loads() can parse the string and return a valid JSON object.
I'm currently replacing each key in the text this way
content = re.sub('storeName:', '"storeName":', content)
content = re.sub('address:', '"address":', content)
content = re.sub('address2:', '"address2":', content)
content = re.sub('city:', '"city":', content)
content = re.sub('storeImage:', '"storeImage":', content)
content = re.sub('state:', '"state":', content)
content = re.sub('phone:', '"phone":', content)
content = re.sub('lat:', '"lat":', content)
content = re.sub('lng:', '"lng":', content)
Returned as string representing valid JSON
json_data = [{"storeName": "testName", "address": "12345 Road", "address2": "Suite 500", "city": "testCity", "storeImage": "http://www.testLink.com", "state": "testState", "phone": "999-999-9999", "lat": 99.9999, "lng": -99.9999}]
I'm sure there is a better way of doing this but I haven't been able to find or come up with a regex pattern to handle these. Any help is greatly appreciated!
Something like this should do the job: ([{,]\s*)([^"':]+)(\s*:)
Replace with: \1"\2"\3
Example: https://regex101.com/r/oV0udR/1
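A minimal sketch of how that pattern and replacement could be applied with re.sub before calling json.loads. The helper name quote_keys is made up for illustration. It assumes that no string value contains a comma followed by unquoted text and a colon (for example "a, b: c"), since such a value would be rewritten as if it held a key:

```python
import json
import re

# Group 1: the '{' or ',' that precedes a key, plus any whitespace.
# Group 2: the bare key. Group 3: optional whitespace and the colon.
KEY_PATTERN = re.compile(r"""([{,]\s*)([^"':]+)(\s*:)""")


def quote_keys(js_text):
    """Wrap unquoted object keys in double quotes."""
    return KEY_PATTERN.sub(r'\1"\2"\3', js_text)


compact = '[{storeName:"testName",storeImage:"http://www.testLink.com",lat:99.9999,lng:-99.9999}]'
spaced = '[{storeName: "testName", address: "12345 Road", lat: 99.9999}]'

stores = json.loads(quote_keys(compact))
print(stores[0]["storeImage"])
print(json.loads(quote_keys(spaced)))
```

The colon inside "http://..." is left alone because it is not preceded by a { or , followed only by unquoted characters.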
That repetition is of course unnecessary. You could put everything into a single regex:
content = re.sub(r"\b(storeName|address2?|city|storeImage|state|phone|lat|lng):", r'"\1":', content)
\1 contains the match within the first (in this case, only) set of parentheses, so "\1": surrounds it with quotes and adds back the colon.
Note the use of a word boundary anchor to make sure we match only those exact words.
Regex: (\w+)\s?:\s?("?[^",]+"?,?)
Regex demo
import re
text = 'storeName: "testName", '
text = re.sub(r'(\w+)\s?:\s?("?[^",]+"?,?)', r'"\g<1>":\g<2>', text)
print(text)
Output: "storeName":"testName",

Use a CSV cell value as a regex string using re in Python

So I have a CSV document that has metadata, an XPATH and a Regex String in each row. The script uses the xpath to iterate over API requests and then I want to use the regex stored in the CSV with that xpath to search for something in the API results. My issue is how to use the data in a CSV row as a literal regex search string, like r'^\w{2}.+' versus a string to search against.
with open(rules, 'r+') as rulefile:
    rreader = csv.DictReader(rulefile)
    for row in rreader:
        for ip, apikey in keydict.iteritems():
            rulequery = {'type': 'config', 'action': 'get', 'key': apikey, 'xpath': row["xpath"]}
            rrule = requests.get('https://' + ip + '/api', params = rulequery, verify=False)
            rrex = re.compile('{}'.format(row["regex"]), re.MULTILINE)
            for line in rrule.text:
                config = rrex.findall(line)
                print(config)
So I think I may have found a solution, although I'm not sure it is the best. I'm open to suggestions if anyone has a better way to do it.
with open(rules, 'r+') as rulefile:
    rreader = csv.DictReader(rulefile)
    for row in rreader:
        for ip, apikey in keydict.iteritems():
            regex = row["regex"]
            rulequery = {'type': 'config', 'action': 'get', 'key': apikey, 'xpath': row["xpath"]}
            rrule = requests.get('https://' + ip + '/api', params = rulequery, verify=False)
            config = re.search(regex, rrule.text)
            print rrule.text[config.start():config.end()]
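On the underlying question: the r'...' prefix only affects how Python source code is parsed, so a pattern read from a CSV cell already contains its backslashes literally and can be passed straight to re.compile. A minimal Python 3 sketch without the API calls follows. The CSV columns, the sample pattern and the response text are made up for illustration:

```python
import csv
import io
import re

# Hypothetical rules file content; a real file would be opened with open().
RULES_CSV = r"""name,xpath,regex
hostname,/config/hostname,^host-\w{2}\d+$
"""


def find_matches(rules_text, response_text):
    """Apply each row's regex to the whole response text."""
    results = {}
    for row in csv.DictReader(io.StringIO(rules_text)):
        # The cell value is used as the pattern as-is; no raw-string
        # conversion is needed.
        pattern = re.compile(row["regex"], re.MULTILINE)
        results[row["name"]] = pattern.findall(response_text)
    return results


response_text = "host-ab12\nother line\nhost-cd7\n"
matches = find_matches(RULES_CSV, response_text)
print(matches)
```

Note that iterating over a string, as in for line in rrule.text, yields single characters rather than lines. Searching the whole text with re.MULTILINE, or looping over text.splitlines(), avoids that. A pattern containing a comma has to be quoted in the CSV file.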
