Problem with finding the correct match with regex - python

I have some data that I'm trying to process. Basically, I want to change all the commas (,) to semicolons (;), but some fields contain text, usernames or passwords that also contain commas. How do I change all the commas except the ones enclosed in double quotes (")?
Test data:
Secret Name,URL,Username,Password,Notes,Folder,TOTP Key,TOTP Backup Codes
test1,,username,"pass,word",These are the notes,\Some\Folder,,
test2,,"user1, user2, user3","pass,word","Hello, I'm mr Notes",\Some\Folder,,
test3,http://1.2.3.4/ucsm/ucsm.jnlp,"xxxx\n(use Drop down, select Hello)",password,Use the following\nServer1\nServer2,\Some\Folder,,
What have I tried?
secrets = """Secret Name,URL,Username,Password,Notes,Folder,TOTP Key,TOTP Backup Codes
test1,,username,"pass,word",These are the notes,\Some\Folder,,
test2,,"user1, user2, user3","pass,word","Hello, I'm mr Notes",\Some\Folder,,
test3,http://1.2.3.4/ucsm/ucsm.jnlp,"xxxx\n(use Drop down, select Hello)",password,Use the following\nServer1\nServer2,\Some\Folder,,
"""
import re
test = re.findall(r'(.+?\")(.+)(\".+)', secrets)
for line in test:
    part1, part2, part3 = line
    processed = "".join([part1.replace(",", ";"), part2, part3.replace(",", ";")])
    print(processed)
Result:
test1;;username;"pass,word";These are the notes;\Some\Folder;;
test2;;"user1, user2, user3","pass,word","Hello, I'm mr Notes";\Some\Folder;;
It works fine when there's only one quoted field in the line and no line breaks, but when there are more, or there's a line break within the quotation marks, it breaks. How can I fix this?
FYI: Notes can contain multiple line breaks.

You don't need a regex here; take advantage of a CSV parser:
import csv, io
inp = csv.reader(io.StringIO(secrets),  # or use a file as input
                 quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)
with open('out.csv', 'w', newline='') as out:  # newline='' is recommended by the csv docs for output files
    csv.writer(out, delimiter=';').writerows(inp)
output file:
Secret Name;URL;Username;Password;Notes;Folder;TOTP Key;TOTP Backup Codes
test1;;username;pass,word;These are the notes;\Some\Folder;;
test2;;user1, user2, user3;pass,word;Hello, I'm mr Notes;\Some\Folder;;
test3;http://1.2.3.4/ucsm/ucsm.jnlp;"xxxx
(use Drop down, select Hello)";password;Use the following
Server1
Server2;\Some\Folder;;
Optionally, use the quoting=csv.QUOTE_ALL parameter in csv.writer.
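As a self-contained illustration of the same idea, here is a minimal sketch that converts text in memory instead of writing out.csv. The function name commas_to_semicolons and the sample data are made up for this example; quoting=csv.QUOTE_ALL is passed to the writer so that every output field is quoted.

```python
import csv
import io

def commas_to_semicolons(text):
    # Parse comma-delimited text; quoted fields may contain commas.
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    # Re-emit with ';' as delimiter and every field quoted.
    writer = csv.writer(out, delimiter=";", quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()

sample = 'name,user,password\ntest1,"user1, user2","pass,word"\n'
result = commas_to_semicolons(sample)
print(result, end="")
```

Because the parser, not a pattern, decides where a field ends, commas inside quoted fields are never touched.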

I believe this should do it:
import re
print(re.sub(r'("[^"]*")|,', lambda x: x.group(1) if x.group(1) else x.group().replace(",", ";"), secrets))

mozway's solution looks like the best way to resolve this, but interestingly, SM1312's regex works almost perfectly with a much simpler replacement argument for the sub function (i.e. r'\1;'):
import re
print (re.sub(r'("[^"]*")|,', r'\1;', secrets))
The only issue is this introduces an extra semicolon after a quoted entry. This happens because the first alternation member (i.e. ("[^"]*")) does not consume a comma, but the replacement argument adds a semicolon regardless of which alternation member matches. Simply adding a comma to the first alternation member resolves this and works perfectly for the sample data:
import re
print (re.sub(r'("[^"]*"),|,', r'\1;', secrets))
However, it fails if the data includes a quoted entry as the last (i.e. the TOTP Backup Codes) column of the data; any commas in the last quoted entry will be changed to semicolons. This is likely not an acceptable failure mode since it is changing the data set. The following resolves that issue, but introduces a different error that may be tolerable; it adds an extra semicolon at the end of the line:
import re
print (re.sub(r'("[^"]*")(,|(?=\s+))|,', r'\1;', secrets))
This is accomplished by changing the first part of the original alternation member to use alternation itself. That is, the part that was matching the comma after the quoted entry is changed to additionally check for nothing but whitespace (i.e. (,|(?=\s+))), which includes an end of line, after the quoted entry using the following positive lookahead assertion: (?=\s+). The positive lookahead assertion for whitespace is used instead of simply matching whitespace to avoid consuming the whitespace and eliminating it from the resulting output.
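For comparison, a minimal sketch of the callback form from SM1312's answer, applied to a line whose last column is quoted, which is the case where the \1; variants go wrong. The helper name and the sample line are assumptions for illustration only, and the pattern still assumes that quoted fields contain no escaped double quotes.

```python
import re

# Either a whole quoted field (captured in group 1) or a bare comma.
FIELD_OR_COMMA = re.compile(r'("[^"]*")|,')

def swap_delimiters(text):
    # Keep quoted fields untouched; turn every other comma into a semicolon.
    return FIELD_OR_COMMA.sub(lambda m: m.group(1) or ";", text)

line = 'test4,,user,"pass,word","codes,1,2"\n'
converted = swap_delimiters(line)
print(converted, end="")
```

The callback decides per match what to emit, so no semicolon is added after a quoted field and the final quoted column keeps its commas.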

Related

Python. How to print a certain part of a line after it had been "re.searched" from a file

Could you tell me how to print this part of the line only '\w+.226.\w.+' ?
Code
import re
VSP = input("VSP number (four digits): ")
a = re.compile(r'\w+.226.\w.+'+VSP)
b = re.search(a, open('Sample.txt').read())
print(b.group())
VSP number (four digits): 1020
10.226.27.60 1020
After I have found the intended line associated with my variable "VSP" in the txt file, how can I exclude it from the output, printing "10.226.27.60" only?
You will need to modify your regex slightly to separate the trailing characters in the IP and the spaces that separate it from VSP. Adding a capture group will let you select the portion with just the IP address. The updated regex looks like this:
'(\d+\.226\.\S+)\s+' + VSP
\S (uppercase S) matches any non-whitespace, while \s (lowercase s) matches all whitespace. I replaced the first \w with the more specific \d (digits), and . (any character at all) with \. (actual period). The second \w is now \S, but you could use \d+\.\d+ if you wanted to be more specific.
Using the first capture group will give you the IP address:
print(b.group(1))
If you are looking for a single IP address once, not compiling your regex is fine. Also, reading a file in its entirety is OK as long as the file is small. If either is not the case, I would recommend compiling the regex and going through the file line by line. That will allow you to discard most lines much faster than using a regex would do.
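A minimal sketch of the line-by-line approach with a compiled pattern. The function name find_ip and the sample lines are hypothetical, and re.escape is added on the assumption that the user input should be matched literally. Any iterable of lines works, including an open file object.

```python
import re

def find_ip(lines, vsp):
    # Compile once, then test each line in turn.
    pattern = re.compile(r'(\d+\.226\.\S+)\s+' + re.escape(vsp))
    for line in lines:
        match = pattern.search(line)
        if match:
            return match.group(1)  # only the captured IP address
    return None

sample_lines = ["10.226.27.61 2040\n", "10.226.27.60 1020\n"]
print(find_ip(sample_lines, "1020"))
```

Returning on the first hit means the rest of the input is never read.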
I see you already have an answer. You can also try this regex if you want to separate the two groups by the whitespace:
import re
# edit: added ? to avoid greedy behaviour of the first .+; otherwise multiple
# spaces after the address would be caught in b.group(1), as per @Mad's comment
a = re.compile(r'(.+?)\s+(.+)')
b = re.search(a, '10.226.27.60 1020')
print(b.group(0))
print(b.group(1))
print(b.group(2))
or customize the first group regexp to your needs.
Edit:
This was not meant to be a proper answer but more of a comment, which I didn't think would be readable as such; I am only trying to show group separation using regex, which it seems the OP didn't know about or didn't use.
That is why I am not matching .226., because the OP can do that. I also removed the file-reading part, which isn't needed for the demonstration. Please read @Mad's answer, because it's quite complete and in fact also shows how to use groups.

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
    with open(fileName,'r') as file:
        str = file.read()
        str = str.decode("utf-8")
        str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
    with open(fileName,'wb') as file:
        file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close (within 3-4 characters max) to the hyphen and to the left of it. Of course, this hyphen should not be changed into an en-dash otherwise the latex code will break.
I think the left part condition of the exclusion is important to write the correct exception in regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { on the close right of the hyphen) and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two-step replacement:
import re
str = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
# "$" marks each replaced hyphen so the result is easy to check; use "–" in real code
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", r"\1$\2", str)
str = re.sub(r"(^|[^0-9])-(\d+)", r"\1$\2", str)
print(str)
The first replacement targets all ranges which are not of the LaTeX form {1-9}, i.e. are not contained within curly braces. The second replacement targets all numbers preceded by a non-digit or by the start of the string.
re.sub replaces the entire match. In this case that includes the non-{ character preceding your -. You can wrap that bit in parentheses to create a \1 group and include that in your substitution (you also don't need parentheses around your –):
re.sub(r"([^{]{1,4})-",r"\1–", str)
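A small sketch of the difference, using a cell from the question's table. The constant and variable names are made up for this example. Note that the grouped pattern keeps the parenthesis, but it still matches the hyphen in {2-4}, because the 2 in front of the hyphen is itself a non-{ character; that case needs a separate check.

```python
import re

EN_DASH = "\u2013"
cell = "& (-2.32)"

# Without a group, the characters matched before the hyphen are discarded.
lossy = re.sub(r"[^{]{1,4}(-)", EN_DASH, cell)

# With the prefix captured and re-inserted, only the hyphen changes.
kept = re.sub(r"([^{]{1,4})-", r"\1" + EN_DASH, cell)

# The LaTeX range is still rewritten: "2" satisfies [^{]{1,4}.
latex = re.sub(r"([^{]{1,4})-", r"\1" + EN_DASH, r"\cmidrule(lr){2-4}")

print(lossy, kept, latex, sep="\n")
```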

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces with (.*?). The parentheses capture this expression. Inside them, the dot matches any character, the asterisk repeats it zero or more times, and the question mark makes the match non-greedy, that is, it stops as soon as the following expression matches.
The following expression is a lookahead, starting with (?=, which checks that the enclosed pattern matches at that point without consuming it.
And that pattern is actually two options, separated by the vertical bar |.
The first option, to the left of the bar, is // (in parentheses to make it an atomic unit for the vertical bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
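Putting the pieces together, a minimal runnable sketch against the sample from the question. The variable names are assumptions; the trailing .strip() is an addition here, since the captured group otherwise keeps the space in front of //.

```python
import re

txt = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
}
"""

# re.M lets ^ and $ match at the start and end of each line.
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
description = m.group(1).strip() if m else None
print(description)
```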
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
    with open(file) as f:
        lines = f.readlines()
    for line in lines:
        line = line.lstrip()
        if line.startswith(key + " ="):
            return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
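A sketch of the same string-matching idea working on text that is already in memory. The function name, the comment handling via partition("//") and the use of strip() are additions for this example, not part of the answer above; it assumes values never contain //.

```python
def get_value_from_text(key, text):
    for line in text.splitlines():
        line = line.lstrip()
        if line.startswith(key + " ="):
            value = line.split("=", 1)[1]
            # Drop a trailing // comment, then surrounding whitespace.
            return value.partition("//")[0].strip()
    return None

config = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
}
"""
print(get_value_from_text("description", config))
```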
This is a pretty simple regex; you just need a positive lookbehind. Optionally, to leave out a trailing comment, replace .* with .*?(?=\s*//|$) and pass re.M.
r"(?<=description = ).*"

Python Regex to match YAML Front Matter

I'm having trouble crafting a regex to match YAML Front Matter
This is the front matter I was trying to match:
---
name: me
title: test
cpu: 1
---
This is what I thought would work:
re.search( r'^(---)(.*)(---)$', content, re.MULTILINE)
Any help would be greatly appreciated.
To unpack what you are currently doing with this regular expression:
r'^(---)(.*)(---)$':
r: Treat this as a raw string literal in Python
^: Start the evaluation at the beginning of a line
(---): Match --- and store it in a numbered capture group
(.*): Match any character except a newline (.), zero or more times, greedily (*), up to the next expression
(---): As above
$: End the evaluation at the end of a line
The trouble is that . does not match newlines by default, so this pattern can only match when both sets of dashes are on the same line; it will also fail when whitespace is present around the dashes. You're literally saying: find dashes that occur at the beginning of a line and parse until we find dashes that occur at the end of one. Furthermore, by putting parentheses () around the dashes used to find YAML front matter, you're creating groups that I believe are not necessary to the useful evaluation of your regular expression.
A better expression would be:
r'^\s*---(.*)---\s*$'
This adds \s* to match any whitespace between the beginning of the first line and the dashes, adds it again between the second set of dashes and the end of that line, and captures everything in between in a single capture group that you can then use for additional processing. If extracting the contents of the front matter isn't desired, simply replace (.*) with .*.
Consider re.findall for multiple evaluations of this regular expression in a single file, and use re.DOTALL to allow the dot character to match newlines. With re.DOTALL, a greedy .* runs to the last --- in the text, so use .*? if the body may contain --- as well.
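A minimal sketch of a DOTALL-based match, assuming the front matter opens with --- at the start of a line and is closed by a line containing only ---. The pattern and names below are illustrative; the lazy .*? is used instead of the greedy .* so that the match stops at the first closing delimiter.

```python
import re

FRONT_MATTER = re.compile(
    r'^---[ \t]*\n(.*?)\n---[ \t]*$',
    re.DOTALL | re.MULTILINE,  # . matches newlines; ^ and $ work per line
)

content = "---\nname: me\ntitle: test\ncpu: 1\n---\nBody text\n"

match = FRONT_MATTER.search(content)
front_matter = match.group(1) if match else None
print(front_matter)
```

The captured text can then be handed to a YAML parser of your choice.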
I've used something like this regex, re.findall('^---[\s\S]+?---', text):
import re
import yaml
def extractFrontMatter(markdown):
    with open(markdown, 'r') as md:
        text = md.read()
    # Returns first yaml content, `--- yaml frontmatter ---` from the .md file
    # http://regexr.com/3f5la
    # https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match
    match = re.findall(r'^---[\s\S]+?---', text)
    if match:
        # Strips `---` to create a valid yaml object
        ymd = match[0].replace('---', '')
        try:
            # safe_load avoids constructing arbitrary Python objects
            return yaml.safe_load(ymd)
        except yaml.YAMLError as exc:
            print(exc)
I've also come across python-frontmatter, which has some additional helper functions:
import frontmatter
post = frontmatter.load('/path/to-markdown.md')
print(post.metadata, 'meta')
print(post.keys(), 'keys')

Python regex for reading CSV-like rows

I want to parse incoming CSV-like rows of data. Values are separated with commas (and there could be leading and trailing whitespace around commas), and can be quoted either with ' or with ". For example, this is a valid row:
data1, data2 ,"data3'''", 'data4""',,,data5,
but this one is malformed:
data1, data2, da"ta3", 'data4',
-- quotation marks can only be prepended or trailed by spaces.
Such malformed rows should be recognized. Ideally the malformed value within the row would be marked somehow, but it's also acceptable if the regex simply doesn't match the whole row.
I'm trying to write a regex able to parse this, using either match() or findall(), but every single regex I come up with has some problems with edge cases.
So, maybe someone with experience in parsing something similar could help me on this?
(Or maybe this is too complex for regex and I should just write a function)
EDIT1:
csv module is not much of use here:
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
[['2', ' "dat', 'a1"', " 'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
[['2', 'dat,a1', "'dat", "a2'", '']]
-- unless this can be tuned?
EDIT2: A few language edits - I hope it's more valid English now
EDIT3: Thank you for all the answers. I'm now pretty sure that a regular expression is not such a good idea here, as (1) covering all edge cases can be tricky and (2) the writer's output is not regular. That said, I've decided to check the mentioned pyparsing and either use it or write a custom FSM-like parser.
While the csv module is the right answer here, a regex that could do this is quite doable:
import re
r = re.compile(r'''
\s* # Any whitespace.
( # Start capturing here.
[^,"']+? # Either a series of non-comma non-quote characters.
| # OR
"(?: # A double-quote followed by a string of characters...
[^"\\]|\\. # That are either non-quotes or escaped...
)* # ...repeated any number of times.
" # Followed by a closing double-quote.
| # OR
'(?:[^'\\]|\\.)*'# Same as above, for single quotes.
) # Done capturing.
\s* # Allow arbitrary space before the comma.
(?:,|$) # Followed by a comma or the end of a string.
''', re.VERBOSE)
line = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
print(r.findall(line))
# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']
EDIT: To validate lines, you can reuse the regex above with small additions:
import re
r_validation = re.compile(r'''
^(?: # Capture from the start.
# Below is the same regex as above, but condensed.
# One tiny modification is that it allows empty values
# The first plus is replaced by an asterisk.
\s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
)*$ # And don't stop until the end.
''', re.VERBOSE)
line1 = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""
if r_validation.match(line1):
    print('Line 1 is valid.')
else:
    print('Line 1 is INvalid.')
if r_validation.match(line2):
    print('Line 2 is valid.')
else:
    print('Line 2 is INvalid.')
# Prints:
# Line 1 is valid.
# Line 2 is INvalid.
Although it would likely be possible with some combination of pre-processing, use of csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).
In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...
You can likely find examples of pyparsing-based CSV parsing easily, with this question maybe enough to get you started.
Python has a standard library module to read csv files:
import csv
with open('file.csv', newline='') as f:
    for line in csv.reader(f):
        print(line)
For your example input this prints
['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']
EDIT:
You need to add skipinitialspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.
It evidently contains rules like:
If a field contains any commas or single quotes, quote it with double quotes.
Else if the field contains any double quotes, quote it with single quotes.
Note: the result is still valid if you swap double and single in the above 2 clauses.
Else don't quote it.
The resultant field may have spaces (or other whitespace?) prepended or appended.
The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).
What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.
You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.
How is your data encoded?
This probably sounds too simple, but from the looks of things you are looking for a string that matches [a-zA-Z0-9]["']+[a-zA-Z0-9]. Without in-depth testing against the data, what you're looking for is a single or double quote (or any combination of them) between letters (you could also allow numbers there).
Based on what you were asking, it doesn't really matter that it's a CSV; it matters that you have data that doesn't conform. I believe a search for a letter, then any combination of one or more " or ', and then another letter would find it.
Now, are you looking to get a count, or just a printout of each line that contains it so you know which ones to go back and fix?
I'm sorry, I don't know Python regexes, but in Perl this would look something like this:
# Look for one or more letters/numbers, then one or more ' or ", then one or more
# letters/numbers
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
    # Prints the line if the above regex is found
    print $line;
}
Just simply convert that for when you look at a line.
I'm sorry if I misunderstood the question
I hope it helps!
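In Python, the Perl check above could look roughly like this. It is only a sketch: the function name is made up, and the pattern flags a quote character wedged between letters or digits, which is narrower than full validation of a row.

```python
import re

# One or more letters/digits, one or more quote characters, then letters/digits.
SUSPECT = re.compile(r"""[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+""")

def has_embedded_quote(line):
    return SUSPECT.search(line) is not None

good = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
bad = r"""data1, data2, da"ta3", 'data4',"""
for row in (good, bad):
    if has_embedded_quote(row):
        print(row)  # prints only the malformed row
```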
If your goal is to convert the data to XML (or JSON, or YAML), look at this example for a Gelatin syntax that produces the following output:
<xml>
<line>
<column>data1</column>
<column>data2 </column>
<column>data3'''</column>
<column>data4""</column>
<column/>
<column/>
<column>data5</column>
<column/>
</line>
</xml>
Note that Gelatin also has a Python API:
from Gelatin.util import compile, generate_to_file
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')
