Parse a file with links in Python - python

I have a file to parse that contains a lot of links. Here is an example of how it looks:
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=11908675">colors</p></hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=45103481">yelloW</p></hm>
<td>I have a dream, and it is all good 2</hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=40984930">orangE</p></hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=90648361">pinK</p></hm>
I only need to keep the words in the position of >colors<, so I also want >yelloW<, >orangE< and >pinK<.
In this example, the part the entries have in common is the whole link, except for the number (the id, which is different in every link) and the word.
After finding all the words, I want to save them in a dictionary that uses the first word as the key and the others as its values, so the final result will be:
d = {"colors": ["yelloW", "orangE", "pinK"]}

You can try something like this:
import re
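re.findall(r"http://[^>]+>(\w+)", ree)
Where:
ree - the contents of the file as a string
[^>]+ - match one or more characters other than >
\w+ - match one or more word characters (letters, digits, underscore)
(..) - return only the group between parentheses
Also note that Python dictionaries don't support duplicate keys. You can look at this question.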
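A minimal sketch of the whole task, assuming the file contents have already been read into a string. The name text and the shortened inline sample are illustrative only. The first captured word becomes the key and the remaining ones become its list:

```python
import re

# Shortened sample; the link is split across two lines as in the question.
text = '''<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=11908675">colors</p></hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=45103481">yelloW</p></hm>
<td>I have a dream, and it is all good 2</hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=40984930">orangE</p></hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=90648361">pinK</p></hm>'''

# [^>] also matches the newline inside the link, so the split URL is not a problem.
words = re.findall(r"http://[^>]+>(\w+)", text)

# First word is the key, the rest are its values.
d = {words[0]: words[1:]} if words else {}
print(d)
```

Lines without a link, such as the <td> line, are simply skipped because the pattern requires http:// before the word.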

Related

In python, find tokens in line

A long time ago I wrote a tool that parses text files line by line and performs actions depending on commands and conditions in the file.
I used regex for this; however, I was never good at regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex gives me the keyword "type" and the value "STRING".
However, now I need to update my tool to support several conditions in one line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?
You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Or you can omit the anchors and the [^\[\]]* part to get the group1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
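As a quick check of the second, unanchored pattern: re.findall returns one (keyword, value) tuple per bracketed condition. The variable names below are illustrative only:

```python
import re

# Unanchored pattern: matches each [key==value] block on its own.
pair_re = re.compile(r"\[([^\]\[=]*)==([^\]\[=]*)\]")

line = "[type==STRING][amount==0]"
pairs = pair_re.findall(line)
print(pairs)
```

Because the pattern has two groups, each list entry is a 2-tuple, which can be passed straight to dict() if the keywords are unique.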
Is it a requirement that you use regex? You can alternatively accomplish this pretty easily by stripping the first opening and last closing bracket and using the split function twice.
line_to_parse = "[type==STRING]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
    # x is the keyword, y is the value
    x, y = pair.split("==")
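The same split-based idea, sketched as a small helper for a line with several conditions. The function name parse_conditions is hypothetical, and the sketch assumes the line starts with [ and ends with ] with nothing between the bracketed blocks:

```python
def parse_conditions(line):
    """Return a list of (keyword, value) tuples from '[a==b][c==d]'."""
    # Drop the outer brackets, then split on the '][' between conditions.
    inner = line.strip()[1:-1]
    return [tuple(pair.split("==", 1)) for pair in inner.split("][")]

result = parse_conditions("[type==STRING][amount==0]")
print(result)
```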
It rather depends on the precise "rules" that describe your data. However, for your given data, why not:
import re
text = '[type==STRING][amount==0]'
# raw string so that \w reaches the regex engine unchanged
words = re.findall(r'\w+', text)
lst = []
# words alternate: keyword, value, keyword, value, ...
for i in range(0, len(words), 2):
    lst.append((words[i], words[i+1]))
print(lst)
Output:
[('type', 'STRING'), ('amount', '0')]

List of strings: remove and split elements to extract words

I have the following two different cases of lists of strings:
my_list1=['_','net_my_name','_64', '_66']
my_list2=['net_another_file']
I would like to extract:
net_my_name as my name, for lists such as my_list1;
net_another_file as another file, for lists such as my_list2.
To do so, I was thinking of the following:
for a list like my_list1, remove the elements that are numerical, then split on _ and take the last two items (i.e. my name);
for a list like my_list2, split on _ and take the last two items (i.e. another file).
If I removed the numerical values where they occur, I would have my_name as the last element, i.e. my name as the last two words.
Expected output:
my name
another file
Can you please tell me how to translate the steps above into code? Thank you
Consider this code:
import re
string = "net_another_file777"
string = re.sub("[0-9]", "", string) # "net_another_file"
L = string.split('_')[-2:] # ['another', 'file']
Now you just have to go through the list and apply this to every element.
Hope this helps you.
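One way to apply this to both lists from the question, sketched under the assumption that the wanted words are always the last two non-empty, non-numeric pieces once all elements are joined. The helper name last_two_words is hypothetical:

```python
import re

def last_two_words(items):
    """Join the elements, drop digits, and return the last two words."""
    joined = "_".join(items)
    no_digits = re.sub(r"[0-9]", "", joined)
    # Empty pieces come from leading, trailing or doubled underscores.
    parts = [p for p in no_digits.split("_") if p]
    return " ".join(parts[-2:])

my_list1 = ['_', 'net_my_name', '_64', '_66']
my_list2 = ['net_another_file']

print(last_two_words(my_list1))
print(last_two_words(my_list2))
```

Joining first means the purely numerical elements such as '_64' vanish once the digits are removed, so no separate filtering step is needed.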

Get all substrings between two different start and ending delimiters

I am trying in Python 3 to get a list of all substrings of a given string a which start after a delimiter x and end right before a delimiter y.
I have found solutions which only get me the first occurrence, but the result needs to be a list of all occurrences.
start = '>'
end = '</'
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print((s.split(start))[1].split(end)[0])
The above example is what I've got so far, but I am looking for a more elegant and robust way to get all the occurrences.
So the expected result is a list containing the JavaScript code as the following entries:
a=eval;b=alert;a(b(/XSS/.source));
a=eval;b=alert;a(b(/XSS/.source));
Looking for patterns in strings seems like a decent job for regular expressions.
This should return a list of anything between a pair of <script> and </script>:
import re
pattern = re.compile(r'<script>(.*?)</script>')
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print(pattern.findall(s))
Result:
['a=eval;b=alert;a(b(/XSS/.source));', 'a=eval;b=alert;a(b(/XSS/.source));']
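For arbitrary start and end delimiters, the same idea can be sketched with re.escape, so that characters with a special meaning in regex syntax are treated literally. The helper name substrings_between is hypothetical:

```python
import re

def substrings_between(s, start, end):
    """Return every shortest substring found between start and end."""
    pattern = re.compile(
        re.escape(start) + r"(.*?)" + re.escape(end),
        re.DOTALL,  # let the content span line breaks
    )
    return pattern.findall(s)

html = "<script>x=1;</script><p>text</p><script>y=2;</script>"
found = substrings_between(html, "<script>", "</script>")
print(found)
```

Note that very short delimiters such as '>' and '</' also match between a closing tag and the next opening tag, so the full tag names are the safer choice for input like the one in the question.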

Search for string in file while ignoring id and replacing only a substring

I’ve got a master .xml file generated by an external application and want to create several new .xml files by adapting and deleting some rows with Python. The search strings and replace strings for these adaptations are stored within an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurrences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"' and so on.
Unfortunately the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file regardless of which id value is in between the quotation marks, and to replace only the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I’m not very familiar with regular expressions, but I think I need something like that for my search?
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to use a function as the replacement. The function receives the match object from re.sub and re-inserts the id captured from the string being replaced.
import re
s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
# capture the quoted id, whatever its value
pat = re.compile(r'ref_layerid_mapping=("[^"]*") lyvis="off" toc_visible="off"')
def replacer(m):
    # keep the captured id and switch both flags to "on"
    return 'ref_layerid_mapping=' + m.group(1) + ' lyvis="on" toc_visible="on"'
re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in the replacement pattern (see http://www.regular-expressions.info/replacebackref.html).
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
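Applied to one of the tool entries from the question, the back-reference approach could look like the sketch below. It assumes the ids are always plain digits and that the attribute order never changes; the variable names are illustrative only:

```python
import re

# Group 1 keeps everything up to use=, group 2 keeps the closing "/>".
pattern = re.compile(
    r'(<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use=)"false"(/>)'
)

xml_line = '<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>'
updated = pattern.sub(r'\1"true"\2', xml_line)
print(updated)
```

Only the value of use changes; whatever ids were matched are copied back unchanged through \1.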

Regular Expression Parsing Key Value Pairs in Namelist Input File

I have an input file in Fortran "namelist" format which I would like to parse with Python regular expressions. The easiest way to demonstrate is with a fictitious example:
$VEHICLES
CARS= 1,
TRUCKS = 0,
PLAINS= 0, TRAINS = 0,
LIB='AUTO.DAT',
C This is a comment
C Data variable spans multiple lines
DATA=1.2,2.34,3.12,
4.56E-2,6.78,
$END
$PLOTTING
PLOT=T,
PLOT(2)=12,
$END
So the keys can contain regular variable-name characters as well as parentheses and numbers. The values can be strings, booleans (T, F, .T., .F., TRUE, FALSE, .TRUE., .FALSE. are all possible), integers, floating-point numbers, or comma-separated lists of numbers. Keys are connected to their values with equal signs. Key-value pairs are separated by commas, but can share a line. Values can span multiple lines for long lists of numbers. Comments are any line beginning with a C. Spacing before and after '=' and ',' is generally inconsistent.
I have come up with a working regular expression for parsing the keys and values and getting them into an Ordered Dictionary (need to preserve order of inputs).
Here's my code so far. I've included everything from reading the file to saving to a dictionary for thoroughness.
import re
from collections import OrderedDict
#Read the whole file; the with block closes it afterwards
with open('file.dat','r') as f:
    file_str=f.read()
#Compile regex pattern for requested namelist
name='Vehicles'
p_namelist = re.compile(r"\$"+name.upper()+r"(.*?)\$END",flags=re.DOTALL|re.MULTILINE)
#Execute regex on file string and get a list of captured tokens
m_namelist = p_namelist.findall(file_str)
#Check for a valid result
if m_namelist:
    #The text of the desired namelist is the first captured token
    namelist=m_namelist[0]
    #Split into lines
    lines=namelist.splitlines()
    #List comprehension which returns the list of lines that do not start with "C"
    #Effectively remove comment lines
    #(this relies on data lines being indented, so that a key such as CARS is not in column 1)
    lines = [item for item in lines if not item.startswith("C")]
    #Re-combine now that comment lines are removed
    namelist='\n'.join(lines)
    #Create key-value parsing regex
    p_item = re.compile(r"([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)",flags=re.DOTALL|re.MULTILINE)
    #Execute regex
    items = p_item.findall(namelist)
    #Initialize namelist ordered dictionary
    n = OrderedDict()
    #Remove undesired characters from value
    for item in items:
        n[item[0]] = item[1].strip(',\r\n ')
My question is whether I'm going about this correctly. I realize there is a ConfigParser library, which I have not yet attempted. My focus here is the regular expression:
([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)
but I went ahead and included the other code for thoroughness and to demonstrate what I'm doing with it. For my Regular Expression, because the values can contain commas, and the key-value pairs are also separated by commas, there is no simple way to isolate the pairs. I chose to use a forward look-ahead to find the next key and "=". This allows everything between the "=" and the next key to be the value. Finally, because this doesn't work for the last pair, I threw in "|$" into the forward look-ahead meaning that if another "VALUE=" isn't found, look for the end of the string. I figured matching the value with [^=]+ followed by a look-ahead was better than trying to match all possible value types.
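The look-ahead idea can be tried in isolation on a string, without the file handling. The sketch below is a self-contained check of that one pattern, not a complete namelist parser. The helper name parse_items and the shortened sample are illustrative only, and comment lines are assumed to have been removed already:

```python
import re
from collections import OrderedDict

# Key, "=", then a value that runs up to the next "KEY =" or the end of input.
P_ITEM = re.compile(
    r"([^\s,=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,=]+\s*=|$)",
    flags=re.DOTALL | re.MULTILINE,
)

def parse_items(namelist_body):
    """Return the key/value pairs of a namelist body in input order."""
    pairs = OrderedDict()
    for key, value in P_ITEM.findall(namelist_body):
        # Trim the separating comma and surrounding whitespace.
        pairs[key] = value.strip(",\r\n ")
    return pairs

sample = (
    "CARS= 1,\n"
    "TRUCKS = 0,\n"
    "PLAINS= 0, TRAINS = 0,\n"
    "LIB='AUTO.DAT',\n"
    "DATA=1.2,2.34,\n"
    "4.56E-2,6.78,\n"
)
parsed = parse_items(sample)
for key, value in parsed.items():
    print(key, repr(value))
```

The value group is greedy and then backs off until the look-ahead finds the next key, which is why the multi-line DATA list stays in one piece (including its embedded newline).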
While writing this question I came up with an alternative regular expression that takes advantage of the fact that numbers are the only values that can appear in lists:
([^\s,\=]+?)\s*=\s*((?:\s*\d[\d\.E\+\-]*\s*,){2,}|[^=,]+)
This one matches either a list of 2 or more numbers with (?:\s*\d[\d\.E\+\-]*\s*,){2,} or anything before the next comma with [^=,]+.
Are these somewhat messy Regular Expressions the best way to parse a file like this?
I would suggest developing a slightly more sophisticated parser.
I stumbled upon a project on Google Code that implements very similar parser functionality, Fortran Namelist parser for Python prog/scripts, but it was built for a slightly different format.
I played with it a little and updated it to support the structure of the format in your example.
Please see my version on gist:
Updated Fortran Namelist parser for python https://gist.github.com/4506282
I hope this parser will help you with your project.
Here is example output produced by the script after parsing the Fortran example:
{'PLOTTING':
{'par':
[OrderedDict([('PLOT', ['T']), ('PLOT(2) =', ['12'])])],
'raw': ['PLOT=T', 'PLOT(2)=12']},
'VEHICLES':
{'par':
[OrderedDict([('TRUCKS', ['0']), ('PLAINS', ['0']), ('TRAINS', ['0']), ('LIB', ['AUTO.DAT']), ('DATA', ['1.2', '2.34', '3.12', '4.56E-2', '6.78'])])],
'raw':
['TRUCKS = 0',
'PLAINS= 0, TRAINS = 0',
"LIB='AUTO.DAT'",
'DATA=1.2,2.34,3.12',
'4.56E-2,6.78']}}
