Splitting a string to find words between delimiters? - python

Given a certain line that looks like this:
jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&
I want to return the username and password, in this case being Bob and 12345.
I tried splitting the string on the & sign but could not figure out how to then find the individual values, and I also tried the code below:
left='password='
right='&'
userleft='username='
for x in file.readlines():
    if 'password=' and 'username=' in x:
        text=str(x)
        #password=(text[text.index(left)+len(left):text.index(right)])
        #username=(text[text.index(userleft)+len(userleft):text.index(useright)])

Without using regular expressions, you can split twice: once on & and once on =:
line = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
items = [item.split('=') for item in line.split('&')]
Now you can extract the values:
for item in items:
    if len(item) == 2:
        if item[0].endswith('password'):
            password = item[1]
        elif item[0].endswith('username'):
            username = item[1]
If you had a bunch of keys you were looking for, like ('username', 'password'), you could write a nested loop to build dictionaries:
keys = ('username', 'password')
result = {}
for item in items:
    if len(item) == 2:
        for k in keys:
            if item[0].endswith(k):
                result[k] = item[1]
                break
This makes it a lot easier to check that you got all the values you want, e.g. with if len(keys) == len(result): ....
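Putting it together, a minimal end-to-end sketch (using the example line from the question and the same names as above) might look like this:
line = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
keys = ('username', 'password')

items = [item.split('=') for item in line.split('&')]
result = {}
for item in items:
    if len(item) == 2:
        for k in keys:
            if item[0].endswith(k):
                result[k] = item[1]
                break

if len(keys) == len(result):
    print(result)                                # {'username': 'Bob', 'password': '12345'}
else:
    print('missing:', set(keys) - set(result))   # report whichever keys were not found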

If you want a very simple approach, you could do this:
data = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
#right of "username=" and left of "&"
un = data.split('username=')[1].split('&')[0]
#right of "password=" and left of "&"
pw = data.split('password=')[1].split('&')[0]
print(un, pw) #Bob, 12345
Since the process is identical except for the desired key, you can homogenize it into a small helper that fetches the value for any key in the query. An interesting side effect: even if your example query did not end in "&", this would still work, because everything that is left ends up in the result of .split('&')[0] and there simply isn't a .split('&')[1]. Nothing below uses .split('&')[1], so it wouldn't matter.
query = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
key2val = lambda q,k: q.split(f'{k}=')[1].split('&')[0]
un = key2val(query, 'username')
pw = key2val(query, 'password')
print(un, pw) #Bob, 12345
This approach is likely faster than a regex for a simple case like this, it doesn't require any imports or loops, and it is flexible enough to get the value for any key, regardless of order, without ever changing anything.
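One caveat (my own note, not part of the answer above): key2val raises an IndexError if the key isn't present in the query at all. A small, hedged variant that returns None instead might look like this (the name key2val_safe is just for illustration):
def key2val_safe(q, k, default=None):
    # same split-twice idea, but tolerate a missing key
    parts = q.split(f'{k}=')
    if len(parts) < 2:
        return default
    return parts[1].split('&')[0]

query = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
print(key2val_safe(query, 'username'))  # Bob
print(key2val_safe(query, 'token'))     # None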

Use Regex:
import re
for x in file.readlines():
    if 'password=' in x and 'username=' in x:
        text = str(x)
        username = re.findall(r'username=(\w+)', text)
        password = re.findall(r'password=(\w+)', text)
Note the updated if statement. In the original, the if checks whether "password=" evaluates to True, which it always will, since it is not an empty string.
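To make the pitfall concrete:
x = 'username=Bob&'  # a line with no password at all

# Original condition: 'in' binds tighter than 'and', and the non-empty string
# 'password=' is always truthy, so this collapses to just ('username=' in x).
print('password=' and 'username=' in x)        # True -- the missing password goes unnoticed
# Corrected condition: both substrings must actually be present.
print('password=' in x and 'username=' in x)   # False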

You can use a single regular expression to parse this information out:
import re
s = "jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&"
regex = "username=(?P<username>.+)&.*password=(?P<password>.+)&"
match = re.search(regex, s)
print(match.groupdict())
{'username': 'Bob', 'password': '12345'}
Implementing this while looping over the lines in a file would look like:
regex = "username=(?P<username>.+)&.*password=(?P<password>.+)&"
with open('text') as f:
    for line in f:
        match = re.search(regex, line)
        if match is not None:
            print(match.groupdict())

Update #2
This reads a file named "text" and parses out the username and password for each line if they both exist.
This solution assumes that the username and password fields both end with a "&".
Update #3:
Note that this code will work even if the order of the username and password is reversed.
import re
with open('text') as f:
    for line in f:
        print(line.strip())
        # Note that ([^&]+) captures any characters up to the next &.
        m1 = re.search('username=([^&]+)', line)
        m2 = re.search('password=([^&]+)', line)
        if m1 and m2:
            print('username=', m1[1])
            print('password=', m2[1])
Output:
jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&
username= Bob
password= 12345
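To see the order-independence for yourself, the same two searches can be run against a made-up line where the password comes first:
import re

line = 'xxpassword=12345&yyusername=Bob&'   # reversed order, invented for the test
m1 = re.search('username=([^&]+)', line)
m2 = re.search('password=([^&]+)', line)
if m1 and m2:
    print('username=', m1[1])   # username= Bob
    print('password=', m2[1])   # password= 12345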

Related

Regex that grabs variable number of groups

This is not a question asking how to use re.findall() or the global modifier (?g) or \g. This is asking how to match n groups with one regex expression, with n between 3 and 5.
Rules:
needs to ignore lines with first non-space character as # (comments)
needs to get at least three items, always in order: ITEM1, ITEM2, ITEM3
class ITEM1(stuff)
model = ITEM2
fields = (ITEM3)
needs to get any of the following matches if they exist (UNKNOWN order, and can be missing)
write_once_fields = (ITEM4)
required_fields = (ITEM5)
needs to know which match is which, so either retrieve matches in order, returning None if there is no match, or retrieve pairs.
My question is if this is doable, and how?
I've gotten this far, but it doesn't deal with comments, with unknown order, or with missing items, and it doesn't stop searching when it reaches the next class definition. https://www.regex101.com/r/cG5nV9/8
(?s)\nclass\s(.*?)(?=\()
.*?
model\s=\s(.*?)\n
.*?
(?=fields.*?\((.*?)\))
.*?
(?=write_once_fields.*?\((.*?)\))
.*?
(?=required_fields.*?\((.*?)\))
Do I need a conditional?
Thanks for any kinds of hints.
I'd do something like:
from collections import defaultdict
import re
comment_line = re.compile(r"\s*#")
matches = defaultdict(dict)
with open('path/to/file.txt') as inf:
    d = {}  # should catch and dispose of any matching lines
            # not related to a class
    for line in inf:
        if comment_line.match(line):
            continue  # skip this line
        if line.startswith('class '):
            classname = line.split()[1]
            d = matches[classname]
        if line.startswith('model'):
            d['model'] = line.split('=')[1].strip()
        if line.startswith('fields'):
            d['fields'] = line.split('=')[1].strip()
        if line.startswith('write_once_fields'):
            d['write_once_fields'] = line.split('=')[1].strip()
        if line.startswith('required_fields'):
            d['required_fields'] = line.split('=')[1].strip()
You could probably do this easier with regex matching.
comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>\w+)")
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
                       r")\s+=\s+(?P<value>.*)")
with open( ...
    d = {} # default catcher as above
    for line in ...
        if comment_line.match(line):
            continue
        class_match = class_line.match(line)
        if class_match:
            d = matches[class_match.group('classname')]
            continue # there won't be more than one match per line
        data_match = data_line.match(line)
        if data_match:
            key, value = data_match.group('key'), data_match.group('value')
            d[key] = value
But this might be harder to understand. YMMV.
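For reference, here is a rough end-to-end sketch of the regex version; the sample text and class name are made up purely for illustration, and it assumes the classname group captures a word as fixed above:
from collections import defaultdict
import re

comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>\w+)")
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
                       r")\s+=\s+(?P<value>.*)")

sample = """\
# a comment that should be skipped
class FooSerializer(stuff):
    model = Foo
    fields = ('a', 'b')
    required_fields = ('a',)
"""

matches = defaultdict(dict)
d = {}  # catches stray data lines that appear before any class
for line in sample.splitlines():
    if comment_line.match(line):
        continue
    class_match = class_line.match(line)
    if class_match:
        d = matches[class_match.group('classname')]
        continue
    data_match = data_line.match(line)
    if data_match:
        d[data_match.group('key')] = data_match.group('value')

print(dict(matches))
# {'FooSerializer': {'model': 'Foo', 'fields': "('a', 'b')", 'required_fields': "('a',)"}}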

Parsing key values in string

I have a string that I am getting from a command line application. It has the following structure:
-- section1 --
item11|value11
item12|value12
item13
-- section2 --
item21|value21
item22
what I would like is to parse this to a dict so that I can easily access the values with:
d['section1']['item11']
I already solved it for the case where there are no sections and every key has a value, but I get errors otherwise. I have tried a couple of things but it is getting complicated and nothing seems to work. This is what I have now:
s="""
item11|value11
item12|value12
item21|value21
"""
d = {}
for l in s.split('\n'):
    print(l, l.split('|'))
    if l != '':
        d[l.split('|')[0]] = l.split('|')[1]
Can somebody help me extend this for the section case and when no values are present?
Seems like a perfect fit for the ConfigParser module in the standard library:
import re
from configparser import ConfigParser

d = ConfigParser(delimiters='|', allow_no_value=True)
d.SECTCRE = re.compile(r"-- *(?P<header>[^]]+?) *--") # sections regex
d.read_string(s)
Now you have an object that you can access like a dictionary:
>>> d['section1']['item11']
'value11'
>>> d['section2']['item22'] # no value case
None
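If you specifically want a plain nested dict like d['section1']['item11'] rather than the ConfigParser object, one way to convert it (a small follow-up sketch of my own):
plain = {section: dict(d.items(section)) for section in d.sections()}
print(plain['section1']['item11'])   # value11
print(plain['section2']['item22'])   # None (allow_no_value=True maps it to None)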
Regexes are a good take at this:
import re
def parse(data):
    lines = data.split("\n") #split input into lines
    result = {}
    current_header = ""
    for line in lines:
        if line: #if the line isn't empty
            #tries to match anything between double dashes:
            match = re.match(r"^-- (.*) --$", line)
            if match: #true when the above pattern matches
                #grabs the part inside parentheses:
                current_header = match.group(1)
            else:
                #key = 1st element, value = 2nd element:
                key, value = line.split("|")
                #tries to get the section, defaults to empty section:
                section = result.get(current_header, {})
                section[key] = value #adds data to section
                result[current_header] = section #updates section into result
    return result #done.
print parse("""
-- section1 --
item1|value1
item2|value2
-- section2 --
item1|valueA
item2|valueB""")

capturing the usernames after List: tag

I am trying to create a list named "userlist" with all the usernames listed beside "List:".
My idea is to parse the line containing "List:" and then split it on "," and put the pieces in a list;
however, I am not able to capture the line. Any inputs on how this can be achieved?
output=""" alias: tech.sw.host
name: tech.sw.host
email: tech.sw.host
email2: tech.sw.amss
type: email list
look_elsewhere: /usr/local/mailing-lists/tech.sw.host
text: List tech SW team
list_supervisor: <username>
List: username1,username2,username3,username4,
: username5
Members: User1,User2,
: User3,User4,
: User5 """
#print output
userlist = []
for line in output:
    if "List" in line:
        print line
If it were me, I'd parse the entire input so as to have easy access to every field:
import StringIO
import collections

inFile = StringIO.StringIO(ph)  # ph is the multi-line output string from the question
d = collections.defaultdict(list)
for line in inFile:
    line = line.partition(':')
    key = line[0].strip() or key
    d[key] += [part.strip() for part in line[2].split(',')]
print d['List']
Using regex, str.translate and str.split:
>>> import re
>>> from string import whitespace
>>> strs = re.search(r'List:(.*)(\s\S*\w+):', ph, re.DOTALL).group(1)
>>> strs.translate(None, ':'+whitespace).split(',')
['username1', 'username2', 'username3', 'username4', 'username5']
You can also create a dict here, which will allow you to access any attribute:
def func(lis):
    return ''.join(lis).translate(None, ':'+whitespace)

lis = [x.split() for x in re.split(r'(?<=\w):', ph.strip(), re.DOTALL)]
dic = {}
for x, y in zip(lis[:-1], lis[1:-1]):
    dic[x[-1]] = func(y[:-1]).split(',')
dic[lis[-2][-1]] = func(lis[-1]).split(',')
print dic['List']
print dic['Members']
print dic['alias']
Output:
['username1', 'username2', 'username3', 'username4', 'username5']
['User1', 'User2', 'User3', 'User4', 'User5']
['tech.sw.host']
Try this:
for line in output.split("\n"):
    if "List" in line:
        print line
When Python is asked to treat a string like a collection, it'll treat each character in that string as a member of that collection (as opposed to each line, which is what you're trying to accomplish).
You can tell this by printing each line:
>>> for line in ph:
...     print line
...
a
l
i
a
s
:
t
e
...
By the way, there are far better ways of handling this. I'd recommend taking a look at Python's built-in RegEx library: http://docs.python.org/2/library/re.html
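For instance, a minimal re-based sketch against the output string above (assuming the list of names runs until the next "Label:" line) could look like this:
import re

# capture everything after "List:" up to the next "Something:" label,
# turn the continuation ":" into a comma, then split on commas
m = re.search(r'List:(.*?)\n\s*\w+:', output, re.DOTALL)
if m:
    userlist = [u.strip() for u in m.group(1).replace(':', ',').split(',') if u.strip()]
    print(userlist)
    # ['username1', 'username2', 'username3', 'username4', 'username5']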
Try using strip() to remove the whitespace and line breaks before doing the check:
if 'List:' == line.strip()[:5]:
this should capture the line you need, then you can extract the usernames using split(','):
usernames = [i for i in line.strip()[5:].split(',')]
Here are my two solutions, which are essentially the same, but the first is easier to understand.
import re
output = """ ... """
# First solution: join continuation lines, the look for List
# Join lines such as username5 with previous line
# List: username1,username2,username3,username4,
# : username5
# becomes
# List: username1,username2,username3,username4,username5
lines = re.sub(r',\s*:\s*', ',', output)
for line in lines.splitlines():
    label, values = [token.strip() for token in line.split(':')]
    if label == 'List':
        userlist = [user.strip() for user in values.split(',')]
        print 'Users:', ', '.join(userlist)
# Second solution, same logic as above
# Different means
tokens, = [line for line in re.sub(r',\s*:\s*', ',', output).splitlines()
           if 'List:' in line]
label, values = [token.strip() for token in tokens.split(':')]
userlist = [user.strip() for user in values.split(',')]
print 'Users:', ', '.join(userlist)

Can't print a specific line from text file

So I currently have this code to read an accounts.txt file that looks like this:
username1:password1
username2:password2
username3:password3
I then have this code (thanks to a member here) to read the accounts.txt file and split each line into username and password so I can print them later. When I try to print line 1 with the username and password separated, using this code:
with open('accounts.txt') as f:
    credentials = [x.strip().split(':') for x in f.readlines()]
    for username,password in credentials:
        print username[0]
        print password[0]
It prints out this:
j
t
a
2
a
3
(These are the three lines I have in the text file, properly split, however it's printing all the lines and only the first letter of each line.)
I've tried a few different methods with no luck. Anyone have an idea on what to do?
Thank you for all your help. It's really appreciated. This is my second day programming and I apologize for such a simple question.
username and password are strings. When you do this to a string, you get the first character in the string:
username[0]
Don't do that. Just print username.
Some further explanation. credentials is a list of lists of strings. It looks like this when you print it out:
[['username1', 'password1'], ['username2', 'password2'], ['username3', 'password3']]
To get one username/password pair, you could do this: print credentials[0]. The result would be this:
['username1', 'password1']
Or if you did print credentials[1], this:
['username2', 'password2']
You can also do a thing called "unpacking," which is what your for loop does. You can do it outside a for loop too:
username, password = credentials[0]
print username, password
The result would be
username1 password1
And again, if you take a string like 'username1' and take a single element of it like so:
username[0]
You get a single letter, u.
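So the fix to the loop from the question is simply to drop the [0] indexing and print the whole strings:
with open('accounts.txt') as f:
    credentials = [x.strip().split(':') for x in f.readlines()]

for username, password in credentials:
    print(username)
    print(password)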
First, I'd like to say if this is your second day programming, then you're off to a good start by using the with statement and list comprehensions already!
As the other people already pointed out, since you are using [] indexing with a variable that contains a string, it treats the str as if it were an array, so you get the character at the index you specify.
I thought I'd point out a couple of things:
1) you don't need to use f.readlines() to iterate over the file, since the file object f is an iterable object (it has the __iter__ method defined, which you can check with getattr(f, '__iter__')). So you can do this:
with open('accounts.txt') as f:
    for l in f:
        try:
            (username, password) = l.strip().split(':')
            print username
            print password
        except ValueError:
            # ignore ValueError when encountering line that can't be
            # split (such as blank lines).
            pass
2) You also mentioned you were "curious if there's a way to print only the first line of the file? Or in that case the second, third, etc. by choice?"
The islice(iterable[, start], stop[, step]) function from the itertools package works great for that, for example, to get just the 2nd & 3rd lines (remember indexes start at 0!!!):
from itertools import islice
start = 1; stop = 3
with open('accounts.txt') as f:
    for l in islice(f, start, stop):
        try:
            (username, password) = l.strip().split(':')
            print username
            print password
        except ValueError:
            # ignore ValueError when encountering line that can't be
            # split (such as blank lines).
            pass
Or to get every other line:
from itertools import islice
start = 0; stop = None; step = 2
with open('accounts.txt') as f:
    for l in islice(f, start, stop, step):
        try:
            (username, password) = l.strip().split(':')
            print username
            print password
        except ValueError:
            # ignore ValueError when encountering line that can't be
            # split (such as blank lines).
            pass
Spend time learning itertools (and its recipes!!!); it will simplify your code.

Help parsing text file in python

I've really been struggling with this one for some time now. I have many text files with a specific format from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing, ensuring I get all the info correctly.
the format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) defining the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word, and I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
You are splitting the whole line into words. You need to split it into the first word, the second word, and the rest, like line.split(None, 2).
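For example, with one of the lines from the question:
line = "5 4879870 as well as this one, content will vary!"
print(line.split(None, 2))
# ['5', '4879870', 'as well as this one, content will vary!']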
I would probably use regular expressions, and use the opposite logic: if the line starts with a number 1 through 5, use it, otherwise pass. Like:
import re
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements. Parsing SQL is orders of magnitude slower than executing it.
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = file(filename,'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None,2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K,PA,tokens[2]))
f.close()
for r in records:
    print r # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print result.group('third')
        print result.group('second')
        print result.group('first')
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
#create a regular expression using subpatterns.
#'first', 'second' and 'third' are our own tags,
#we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result: #do stuff with data here, just use the tags we declared earlier.
        print result.group('third')
        print result.group('second')
        print result.group('first')
import re
reg = re.compile('K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'\
                 + 3*'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters:
' ' (space), '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol representing more than what needs to be matched, at the risk of matching stray newlines in places where they shouldn't be.
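A quick demonstration of the difference:
import re

text = "K\nPA"                          # a stray newline between the two tokens
print(re.findall(r"K\s+PA", text))      # ['K\nPA'] -- \s happily crosses the line break
print(re.findall(r"K[ \t]+PA", text))   # []        -- [ \t] does not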
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
