Splitting contents in textfile - python

I have a text file that contains the following:
Number1 (E, P) (F, H)
Number2 (A, B) (C, D)
Number3 (I, J) (O, Z)
I know more or less how to read it and how to get its values into my program, but I want to know how to correctly split each line into "Number1", "(E, P)" and "(F, H)". Later, I also want to be able to check in my program whether "Number1" contains "(E, P)" or not.
def read_srg(name):
    filename = name + '.txt'
    fp = open(filename)
    lines = fp.readlines()
    R = {}
    for line in lines:
        ??? = line.split()
    fp.close()
    return R

I think the easiest/most reliable way would be to use a regex:
import re
regex = re.compile(r"([^()]*) (\([^()]*\)) (\([^()]*\))")
with open("myfile.txt") as text:
    for line in text:
        contents = regex.match(line)
        if contents:
            label, g1, g2 = contents.groups()
            # now do something with these values, e.g. add them to a list
Explanation:
([^()]*) # Match any number of characters besides parentheses --> group 1
[ ] # Match a space
(\([^()]*\)) # Match (, then any non-parenthesis characters, then ) --> group 2
[ ] # Match a space
(\([^()]*\)) # Match (, then any non-parenthesis characters, then ) --> group 3

Because of the whitespace inside the parentheses, you are better off using a regular expression than simply splitting the lines.
Here's your read_srg function, with the regex check integrated:
import re
def read_srg(name):
    with open('%s.txt' % (name, ), 'r') as text:
        matchstring = r'(Number[0-9]+) (\([A-Z,\s]+\)) (\([A-Z,\s]+\))'
        R = {}
        for i, line in enumerate(text):
            match = re.match(matchstring, line)
            if not match:
                print 'skipping exception found in line %d: %s' % (i + 1, line)
                continue
            key, v1, v2 = match.groups()
            R[key] = v1, v2
        return R
from pprint import pformat
print pformat(read_srg('example'))
To read your dictionary and perform checks on keys and values, you can later do something like:
test_dict = read_srg('example')
for key, (v1, v2) in test_dict.iteritems():
    matchstring = ''
    if 'Number1' in key and '(E, P)' in v1:
        matchstring = 'match found: '
    print '%s%s > %s %s' % (matchstring, key, v1, v2)
A big advantage of this approach is that you can also use your regex to check that your file isn't malformed for some reason.
This is why the matching rule is quite strict:
matchstring = r'(Number[0-9]+) (\([A-Z,\s]+\)) (\([A-Z,\s]+\))'
(Number[0-9]+) will match only words made of Number followed by any number of digits
(\([A-Z,\s]+\)) will match only strings enclosed into () which contain capital letters or , or a whitespace
I read in your comment that the format of the file is always the same, so I'm assuming it is procedurally generated.
Still, you might want to check its integrity (or to be sure that your code does not break if at some point the procedure generating the txt file changes its formatting).
Depending how strict you want your sanity check to be, you can push the above even further:
if you know there should never be more than 3 digits after Number, you might change (Number[0-9]+) to (Number[0-9]{1,3}) (which limits the match to 1, 2 or 3 digits)
if you want to be sure the format in parentheses is always two single capital letters split by ", " you can change (\([A-Z,\s]+\)) to (\([A-Z], [A-Z]\))
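A minimal Python 3 sketch of the stricter pattern described in the two suggestions above. The names STRICT and parse_line are hypothetical, and the three-digit limit is only the example constraint mentioned in the list:

```python
import re

# Stricter pattern: 1-3 digits after "Number", and exactly two single
# capital letters separated by ", " inside each pair of parentheses.
STRICT = re.compile(r'(Number[0-9]{1,3}) (\([A-Z], [A-Z]\)) (\([A-Z], [A-Z]\))')

def parse_line(line):
    """Return (key, v1, v2) for a well-formed line, or None otherwise."""
    match = STRICT.match(line)
    return match.groups() if match else None

print(parse_line("Number1 (E, P) (F, H)"))
print(parse_line("Number1234 (E, P) (F, H)"))  # too many digits, so no match
```

Returning None for malformed lines leaves it to the caller to decide whether to skip them or report them.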

You were really close. Try this:
def read_srg(name):
    with open(name + '.txt', 'r') as f:
        R = {}
        for line in f:
            line = line.replace(', ', ',')  # Number1 (E, P) (F, H) -> Number1 (E,P) (F,H)
            header, *contents = line.strip().split()  # `header` gets the first item of the list and all the rest go to `contents`
            R[header] = contents
    return R
Checking for membership can be later done like so:
if "(E,P)" in R["Number1"]:
    # do stuff
I did not test this but it should be fine. Let me know if anything comes up.

Related

Struggling with Regex for adjacent letters differing by case

I am looking to recursively remove adjacent letters in a string that differ only in their case, e.g. if s = AaBbccDd I would want to remove Aa, Bb and Dd but leave cc.
I can do this recursively using lists.
I think it ought to be possible with a regex, but I am struggling:
With the test string 'fffAaaABbe' the answer should be 'fffe', but the regex I am using gives 'fe'.
def test(line):
    res = re.compile(r'(.)\1{1}', re.IGNORECASE)
    #print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
The way that works is:
def test(line):
    result =''
    chr = list(line)
    cnt = 0
    i = len(chr) - 1
    while i > 0:
        if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
            cnt += 1
            chr.pop(i)
            chr.pop(i - 1)
            i -= 2
        else:
            i -= 1
    if cnt > 0: # until we can't find any duplicates.
        return test(''.join(chr))
    result = ''.join(chr)
    print(result)
Is it possible to do this using a regex?
re.IGNORECASE is not the way to solve this problem, as it treats aa, Aa, aA and AA the same way. Technically it is possible using re.sub, in the following way.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly defined all possible letter pairs because, as far as I know, there is no way to say "inverted-case letter" using just a re pattern. Maybe another user will be able to provide a more concise re-based solution.
I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby
def test(line):
    res = []
    for k, g in groupby(line, key=lambda x: x.lower()):
        g = list(g)
        if all(x == x.lower() for x in g):
            res.append(''.join(g))
    print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even if we restrict it to letters with r'([A-Za-z])\1{1}', this would still be bad: with re.IGNORECASE it matches any letter repeated twice, so it would also catch xx and XX, and you don't want to match consecutive identical characters with matching case, as you said in the original question.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on @Daweo's answer, you can generate the regex pattern needed to match pairs of same letters with non-matching case to get the final pattern of aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string
def consecutiveLettersNonMatchingCase():
    # Get all 'xX|Xx' with a list comprehension
    # and join them with '|'
    return '|'.join(['{0}{1}|{1}{0}'.format(s, t)
                     # Iterate through the upper/lowercase characters
                     # in lock-step
                     for s, t in zip(
                         string.ascii_lowercase,
                         string.ascii_uppercase)])
def test(line):
    res = re.compile(consecutiveLettersNonMatchingCase())
    print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
print(consecutiveLettersNonMatchingCase())
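For comparison, here is a non-regex sketch of the same reduction using a stack. This is a different technique from the answers above, not something they proposed. The name remove_case_pairs is hypothetical and the sketch assumes plain ASCII letters:

```python
def remove_case_pairs(line):
    """Remove adjacent letters that differ only in case, repeatedly."""
    stack = []
    for ch in line:
        # A pair cancels when the two characters differ but are equal
        # once case is ignored (e.g. 'A' and 'a', but not 'a' and 'a').
        if stack and stack[-1] != ch and stack[-1].lower() == ch.lower():
            stack.pop()
        else:
            stack.append(ch)
    return ''.join(stack)

print(remove_case_pairs('fffAaaABbe'))
print(remove_case_pairs('AaBbccDd'))
```

Because each character is pushed and popped at most once, a single pass is enough and no recursion or repeated searching is needed.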

Python How to extract specific string into multiple variable

I am trying to extract specific lines from a file into variables.
This is the content of my test.txt:
#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.
#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm
#third set
Task Identification Number: 210CT3
Task title: Final Examination
Weight: 50
fullMark: 100
Description: Close Book Examination
this is my code
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for line in mod:
        taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ")
        print(taskNumber)
        print(taskTile)
        print(weight)
        print(fullMark)
        print(description)
here is what i'm trying to do:
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time
and loop until the third set.
But an error occurred in the output:
ValueError: not enough values to unpack (expected 5, got 2)
Response to SwiftsNamesake:
I tried out your code, but I am still getting an error:
ValueError: too many values to unpack (expected 5)
This is my attempt using your code:
from itertools import zip_longest
def chunks(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
with open(home + '\\Desktop\\PADS Assignment\\210CT.txt', 'r') as mod:
    for group in chunks(mod.readlines(), 5+2, fillvalue=''):
        # Choose the item after the colon, excluding the extraneous rows
        # that don't have one.
        # You could probably find a more elegant way of achieving the same thing
        l = [item.split(': ')[1].strip() for item in group if ':' in item]
        taskNumber , taskTile , weight, fullMark , desc = l
        print(taskNumber , taskTile , weight, fullMark , desc, sep='|')
As previously mentioned, you need some sort of chunking. To chunk it usefully we'd also need to ignore the irrelevant lines of the file. I've implemented such a function with some nice Python witchcraft below.
It might also suit you to use a namedtuple to store the values. A namedtuple is a pretty simple type of object, that just stores a number of different values - for example, a point in 2D space might be a namedtuple with an x and a y field. This is the example given in the Python documentation. You should refer to that link for more info on namedtuples and their uses, if you wish. I've taken the liberty of making a Task class with the fields ["number", "title", "weight", "fullMark", "desc"].
As your variables are all properties of a task, using a named tuple might make sense in the interest of brevity and clarity.
Aside from that, I've tried to generally stick to your approach, splitting by the colon. My code produces the output
================================================================================
number is 210CT1
title is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
================================================================================
number is 210CT2
title is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
================================================================================
number is 210CT3
title is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
which seems to be roughly what you're after - I'm not sure how strict your output requirements are. It should be relatively easy to modify to that end, though.
Here is my code, with some explanatory comments:
from collections import namedtuple
#defines a simple class 'Task' which stores the given properties of a task
Task = namedtuple("Task", ["number", "title", "weight", "fullMark", "desc"])
#chunk a file (or any iterable) into groups of n (as an iterable of n-tuples)
def n_lines(n, read_file):
    return zip(*[iter(read_file)] * n)
#used to strip out empty lines and lines beginning with #, as those don't appear to contain any information
def line_is_relevant(line):
    return line.strip() and line[0] != '#'
with open("input.txt") as in_file:
    #filters the file for relevant lines, and then chunks into 5 lines
    for task_lines in n_lines(5, filter(line_is_relevant, in_file)):
        #for each line of the task, strip it, split it by the colon and take the second element
        #(ie the remainder of the string after the colon), and build a Task from this
        task = Task(*(line.strip().split(": ")[1] for line in task_lines))
        #just to separate each parsed task
        print("=" * 80)
        #iterate over the field names and values in the task, and print them
        for name, value in task._asdict().items():
            print("{} is {}".format(name, value))
You can also reference each field of the Task, like this:
print("The number is {}".format(task.number))
If the namedtuple approach is not desired, feel free to replace the content of the main for loop with
taskNumber, taskTitle, weight, fullMark, desc = (line.strip().split(": ")[1] for line in task_lines)
and then your code will be back to normal.
Some notes on other changes I've made:
filter does what it says on the tin, only iterating over lines that meet the predicate (line_is_relevant(line) is True).
The * in the Task instantiation unpacks the iterator, so each parsed line is an argument to the Task constructor.
The expression (line.strip().split(": ")[1] for line in task_lines) is a generator. This is needed because we're doing multiple lines at once with task_lines, so for each line in our 'chunk' we strip it, split it by the colon and take the second element, which is the value.
The n_lines function works by passing a list of n references to the same iterator to the zip function (documentation). zip then tries to yield the next element from each element of this list, but as each of the n elements is an iterator over the file, zip yields n lines of the file. This continues until the iterator is exhausted.
The line_is_relevant function uses the idea of "truthiness". A more verbose way to implement it might be
def line_is_relevant(line):
    return len(line.strip()) > 0 and line[0] != '#'
However, in Python, every object can implicitly be used in boolean logic expressions. An empty string ("") in such an expression acts as False, and a non-empty string acts as True, so conveniently, if line.strip() is empty it will act as False and line_is_relevant will therefore be False. The and operator will also short-circuit if the first operand is falsy, which means the second operand won't be evaluated and therefore, conveniently, the reference to line[0] will not cause an IndexError.
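A small sketch of the truthiness and short-circuit behaviour just described. It wraps the first operand in bool() only so that the function returns a real True/False, which is a choice made for this illustration rather than part of the answer's code:

```python
def line_is_relevant(line):
    # `and` short-circuits: when line.strip() is empty (falsy), line[0] is
    # never evaluated, so an empty string cannot raise IndexError here.
    return bool(line.strip()) and line[0] != '#'

for sample in ["", "\n", "#first set\n", "Weight: 25\n"]:
    print(repr(sample), line_is_relevant(sample))
```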
Ok, here's my attempt at a more extended explanation of the n_lines function:
Firstly, the zip function lets you iterate over more than one 'iterable' at once. An iterable is something like a list or a file, that you can go over in a for loop, so the zip function can let you do something like this:
>>> for i in zip(["foo", "bar", "baz"], [1, 4, 9]):
...     print(i)
...
('foo', 1)
('bar', 4)
('baz', 9)
The zip function returns a 'tuple' of one element from each list at a time. A tuple is basically a list, except it's immutable, so you can't change it, as zip isn't expecting you to change any of the values it gives you, but to do something with them. A tuple can be used pretty much like a normal list apart from that. Now a useful trick here is using 'unpacking' to separate each of the bits of the tuple, like this:
>>> for a, b in zip(["foo", "bar", "baz"], [1, 4, 9]):
...     print("a is {} and b is {}".format(a, b))
...
a is foo and b is 1
a is bar and b is 4
a is baz and b is 9
A simpler unpacking example, which you may have seen before (Python also lets you omit the parentheses () here):
>>> a, b = (1, 2)
>>> a
1
>>> b
2
Although the n-lines function doesn't use this. Now zip can also work with more than one argument - you can zip three, four or as many lists (pretty much) as you like.
>>> for i in zip([1, 2, 3], [0.5, -2, 9], ["cat", "dog", "apple"], "ABC"):
... print(i)
...
(1, 0.5, 'cat', 'A')
(2, -2, 'dog', 'B')
(3, 9, 'apple', 'C')
Now the n_lines function passes *[iter(read_file)] * n to zip. There are a couple of things to cover here - I'll start with the second part. Note that the first * has lower precedence than everything after it, so it is equivalent to *([iter(read_file)] * n). Now, what iter(read_file) does, is constructs an iterator object from read_file by calling iter on it. An iterator is kind of like a list, except you can't index it, like it[0]. All you can do is 'iterate over it', like going over it in a for loop. It then builds a list of length 1 with this iterator as its only element. It then 'multiplies' this list by n.
In Python, using the * operator with a list concatenates it to itself n times. If you think about it, this kind of makes sense as + is the concatenation operator. So, for example,
>>> [1, 2, 3] * 3 == [1, 2, 3] + [1, 2, 3] + [1, 2, 3] == [1, 2, 3, 1, 2, 3, 1, 2, 3]
True
By the way, this uses Python's chained comparison operators - a == b == c is equivalent to a == b and b == c, except b only has to be evaluated once, which shouldn't matter 99% of the time.
Anyway, we now know that the * operator copies a list n times. It also has one more property - it doesn't build any new objects. This can be a bit of a gotcha -
>>> l = [object()] * 3
>>> id(l[0])
139954667810976
>>> id(l[1])
139954667810976
>>> id(l[2])
139954667810976
Here l is three objects - but they're all in reality the same object (you might think of this as three 'pointers' to the same object). If you were to build a list of more complex objects, such as lists, and perform an in place operation like sorting them, it would affect all elements of the list.
>>> l = [ [3, 2, 1] ] * 4
>>> l
[[3, 2, 1], [3, 2, 1], [3, 2, 1], [3, 2, 1]]
>>> l[0].sort()
>>> l
[[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
So [iter(read_file)] * n is equivalent to
it = iter(read_file)
l = [it, it, it, it... n times]
Now the very first *, the one with the low precedence, 'unpacks' this, again, but this time doesn't assign it to a variable, but to the arguments of zip. This means zip receives each element of the list as a separate argument, instead of just one argument that is the list. Here is an example of how unpacking works in a simpler case:
>>> def f(a, b):
...     print(a + b)
...
>>> f([1, 2]) #doesn't work
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required positional argument: 'b'
>>> f(*[1, 2]) #works just like f(1, 2)
3
So in effect, now we have something like
it = iter(read_file)
return zip(it, it, it... n times)
Remember that when you 'iterate' over a file object in a for loop, you iterate over each line of the file, so when zip tries to 'go over' each of the n objects at once, it draws one line from each object - but because each object is the same iterator, this line is 'consumed' and the next line it draws is the next line from the file. One 'round' of iteration from each of its n arguments yields n lines, which is what we want.
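The shared-iterator behaviour can be tried on a plain list instead of a file. This is a sketch with made-up sample data; it also shows that zip silently drops a trailing incomplete group:

```python
def n_lines(n, iterable):
    it = iter(iterable)      # one iterator object...
    return zip(*[it] * n)    # ...passed to zip n times

lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]
chunks = list(n_lines(2, lines))
# zip stops as soon as the shared iterator runs out mid-round,
# so the leftover "e\n" does not appear in any chunk.
print(chunks)
```

If incomplete groups must be kept, itertools.zip_longest with a fill value can be used in place of zip.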
Your line variable gets only Task Identification Number: 210CT1 as its first input. You're trying to extract 5 values from it by splitting it by :, but there are only 2 values there.
What you want is to divide your for loop into 5, read each set as 5 lines, and split each line by :.
The problem here is that you are splitting the lines by : and each line contains only one :, so there are 2 values.
In this line:
taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ")
you are telling it that there are 5 values but it only finds 2, so it gives you an error.
One way to fix this, since you are not allowed to change the format of the file, is to use the first word of each line to sort the data into different lists, one for each value:
import re
Identification=[]
title=[]
weight=[]
fullmark=[]
Description=[]
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for line in mod:
        list_of_line=re.findall(r'\w+', line)
        if len(list_of_line)==0:
            pass
        else:
            if list_of_line[0]=='Task':
                if list_of_line[1]=='Identification':
                    Identification.append(line[28:-1])
                if list_of_line[1]=='title':
                    title.append(line[12:-1])
            if list_of_line[0]=='Weight':
                weight.append(line[8:-1])
            if list_of_line[0]=='fullMark':
                fullmark.append(line[10:-1])
            if list_of_line[0]=='Description':
                Description.append(line[13:-1])
print('taskNumber is %s' % Identification[0])
print('taskTitle is %s' % title[0])
print('Weight is %s' % weight[0])
print('fullMark is %s' %fullmark[0])
print('desc is %s' %Description[0])
print('\n')
print('taskNumber is %s' % Identification[1])
print('taskTitle is %s' % title[1])
print('Weight is %s' % weight[1])
print('fullMark is %s' %fullmark[1])
print('desc is %s' %Description[1])
print('\n')
print('taskNumber is %s' % Identification[2])
print('taskTitle is %s' % title[2])
print('Weight is %s' % weight[2])
print('fullMark is %s' %fullmark[2])
print('desc is %s' %Description[2])
print('\n')
Of course you can use a loop for the prints, but I was too lazy so I copied and pasted :).
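A possible loop for the prints might look like the sketch below. The lists are filled with sample values here so the snippet runs on its own, and format_tasks is a hypothetical helper name:

```python
# Sample values standing in for the lists built by the parsing loop.
Identification = ['210CT1', '210CT2', '210CT3']
title = ['Assignment 1', 'Assignment 2', 'Final Examination']
weight = ['25', '25', '50']
fullmark = ['100', '100', '100']
Description = ['Program and design and complexity running time.',
               'Shortest Path Algorithm', 'Close Book Examination']

def format_tasks(*columns):
    """Build one text block per task from parallel lists of values."""
    labels = ('taskNumber', 'taskTitle', 'Weight', 'fullMark', 'desc')
    blocks = []
    for row in zip(*columns):  # one tuple of five values per task
        blocks.append('\n'.join('%s is %s' % pair for pair in zip(labels, row)))
    return blocks

for block in format_tasks(Identification, title, weight, fullmark, Description):
    print(block)
    print()
```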
IF YOU NEED ANY HELP OR HAVE ANY QUESTIONS PLEASE PLEASE ASK!!!
THIS CODE ASSUMES THAT YOU ARE NOT THAT ADVANCED IN CODING
Good Luck!!!
As another poster (@Cuber) has already stated, you're looping over the lines one-by-one, whereas the data sets are split across five lines. The error message is essentially stating that you're trying to unpack five values when all you have is two. Furthermore, it looks like you're only interested in the value on the right hand side of the colon, so you really only have one value.
There are multiple ways of resolving this issue, but the simplest is probably to group the data into fives (plus the padding, making it seven) and process it in one go.
First we'll define chunks, with which we'll turn this somewhat fiddly process into one elegant loop (from the itertools docs).
from itertools import zip_longest
def chunks(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
Now, we'll use it with your data. I've omitted the file boilerplate.
for group in chunks(mod.readlines(), 5+2, fillvalue=''):
    # Choose the item after the colon, excluding the extraneous rows
    # that don't have one.
    # You could probably find a more elegant way of achieving the same thing
    l = [item.split(': ')[1].strip() for item in group if ':' in item]
    taskNumber , taskTile , weight, fullMark , desc = l
    print(taskNumber , taskTile , weight, fullMark , desc, sep='|')
The 2 in 5+2 is for the padding (the comment above and the empty line below).
The implementation of chunks may not make sense to you at the moment. If so, I'd suggest looking into Python generators (and the itertools documentation in particular, which is a marvellous resource). It's also a good idea to get your hands dirty and tinker with snippets inside the Python REPL.
You can still read in lines one by one, but you will have to help the code understand what it's parsing. We can use an OrderedDict to lookup the appropriate variable name.
import os
import collections as ct
def printer(dict_, lookup):
    for k, v in lookup.items():
        print("{} is {}".format(v, dict_[k]))
    print()
names = ct.OrderedDict([
    ("Task Identification Number", "taskNumber"),
    ("Task title", "taskTitle"),
    ("Weight", "weight"),
    ("fullMark","fullMark"),
    ("Description", "desc"),
])
filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
    for line in f.readlines():
        line = line.strip()
        if line.startswith("#"):
            header = line
            d = {}
            continue
        elif line:
            k, v = line.split(":")
            d[k] = v.strip(" ")
        else:
            printer(d, names)
    printer(d, names)
Output
taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
You're trying to get more data than is present on one line; the five pieces of data are on separate lines.
As SwiftsNamesake suggested, you can use itertools to group the lines:
import itertools
def keyfunc(line):
    # Ignores comments in the data file.
    if len(line) > 0 and line[0] == "#":
        return True
    # The separator is an empty line between the data sets, so it returns
    # true when it finds this line.
    return line == "\n"
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for k, g in itertools.groupby(mod, keyfunc):
        if not k: # Does not process lines that are separators.
            for line in g:
                data = line.strip().partition(": ")
                print(f"{data[0]} is {data[2]}")
                # print(data[0] + " is " + data[2]) # If python < 3.6
            print("") # Prints a newline to separate groups at the end of each group.
If you want to use the data in other functions, output it as a dictionary from a generator:
from collections import OrderedDict
import itertools
def isSeparator(line):
    # Ignores comments in the data file.
    if len(line) > 0 and line[0] == "#":
        return True
    # The separator is an empty line between the data sets, so it returns
    # true when it finds this line.
    return line == "\n"
def parseData(data):
    for line in data:
        k, s, v = line.strip().partition(": ")
        yield k, v
def readData(filePath):
    with open(filePath, "r") as mod:
        for key, g in itertools.groupby(mod, isSeparator):
            if not key: # Does not process lines that are separators.
                yield OrderedDict((k, v) for k, v in parseData(g))
def printData(data):
    for d in data:
        for k, v in d.items():
            print(f"{k} is {v}")
            # print(k + " is " + v) # If python < 3.6
        print("") # Prints a newline to separate groups at the end of each group.
data = readData(home + '\\Desktop\\PADS Assignment\\test.txt')
printData(data)
Inspired by itertools-related solutions, here is another using the more_itertools.grouper tool from the more-itertools library. It behaves similarly to @SwiftsNamesake's chunks function.
import collections as ct
import more_itertools as mit
names = dict([
    ("Task Identification Number", "taskNumber"),
    ("Task title", "taskTitle"),
    ("Weight", "weight"),
    ("fullMark","fullMark"),
    ("Description", "desc"),
])
filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
    lines = (line.strip() for line in f.readlines())
    # Argument order as in the original answer; newer more-itertools
    # releases document this call as mit.grouper(lines, 7).
    for group in mit.grouper(7, lines):
        for line in group[1:]:
            if not line: continue
            k, v = line.split(":")
            print("{} is {}".format(names[k], v.strip()))
        print()
Output
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
Care was taken to print the variable name with the corresponding value.

Python: How to reform python list and print them line by line?

I have a file like this :
2.nseasy.com.|['azeaonline.com']
ns1.iwaay.net.|['alchemistrywork.com', 'dha-evolution.biz', 'hidada.net', 'sonifer.biz']
ns2.hd28.co.uk.|['networksound.co.uk']
Expected result:
2.nseasy.com.|'azeaonline.com'
ns1.iwaay.net.|'alchemistrywork.com'
ns1.iwaay.net.|'dha-evolution.biz'
ns1.iwaay.net.|'hidada.net'
ns1.iwaay.net.|'sonifer.biz'
ns2.hd28.co.uk.|'networksound.co.uk'
When I try to do that, instead of the items of the value domain_list, I get the characters of the domains, which means that the lists in the values of dictionary d are not recognized as lists but as strings. Here is my code:
d = defaultdict(list)
f = open(file,'r')
start = time()
for line in f:
    NS,domain_list = line.split('|')
    s = json.dumps(domain_list)
    d[NS] = json.loads(s)
for NS, domains in d.items():
    for domain in domains:
        print (NS, domain)
example of the current result:
w
o
o
d
l
a
n
d
f
a
r
m
e
r
s
m
a
r
k
e
t
.
o
r
g
'
]
What you are doing with json is not correct. After line.split('|'), domain_list is already a string. s = json.dumps(domain_list) encodes that string as a JSON string s, and json.loads(s) decodes it back into the very same string, not a list. You then iterate over the string and print it, hence the single characters in the output.
Try something like:
d = defaultdict(list)
f = open(file,'r')
start = time()
for line in f:
    NS,domain_list = line.split('|')
    d[NS] = json.loads(domain_list.replace("'", '"'))
for NS, domains in d.items():
    for domain in domains:
        print (NS, domain)
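The difference between the two json calls can be checked in isolation. This is a small sketch with a sample string standing in for the text after the |, and it also shows ast.literal_eval as an alternative way to parse the same text:

```python
import ast
import json

raw = "['alchemistrywork.com', 'dha-evolution.biz']"  # sample text after the '|'

# Round-tripping a str through JSON gives back the same str, not a list.
round_trip = json.loads(json.dumps(raw))
print(type(round_trip).__name__)

# Swapping the quotes yields valid JSON, provided no domain contains a quote.
as_json = json.loads(raw.replace("'", '"'))

# ast.literal_eval parses the Python list literal directly.
as_literal = ast.literal_eval(raw)
print(as_json == as_literal)
```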
Here's another one (assuming names.txt contains your data):
with open('names.txt') as f:  # Open the file for reading
    for line in f:  # iterate over each line
        host,parts=line.strip().split('|')  # Split the parts on the |
        parts=parts.replace('[','').replace(']','')  # Remove the [] chars
        parts_a=map(str.strip, parts.split(','))  # Split on the comma, and remove any spaces
        for part in parts_a:  # for the split part, iterate through each one
            print '{0}|{1}'.format(host, part)  # print the host and part separated by a |
Note: You could replace the 4th and 5th line with parts_a=json.loads(parts) as well, assuming that the part after the | is JSON...
You don't need json in this case, as it doesn't solve your problem. You can use ast.literal_eval and itertools.repeat inside a list comprehension to create the desired pairs:
>>> from itertools import repeat
>>> import ast
>>> sp_l=[(i.split('|')[0],ast.literal_eval(i.split('|')[1])) for i in s.split('\n')]
>>> for k in [zip(repeat(i,len(j)),j) for i,j in sp_l]:
...     for item in k:
...         print '|'.join(item)
...
2.nseasy.com.|azeaonline.com
ns1.iwaay.net.|alchemistrywork.com
ns1.iwaay.net.|dha-evolution.biz
ns1.iwaay.net.|hidada.net
ns1.iwaay.net.|sonifer.biz
ns2.hd28.co.uk.|networksound.co.uk
Try:
import ast
with open(file, "r") as f:
    d = {k: ast.literal_eval(v) for k, v in map(lambda s: s.split("|"), f)}
for NS, domains in d.items():
    for domain in domains:
        print "%s|'%s'" % (NS, domain)
Or even just:
import ast
with open('file.xyz') as f:
    for thing in f:
        q, r = thing.split('|')
        r = ast.literal_eval(r)
        for other in r:
            print '{}|{}'.format(q, other)
Here is a regex solution:
import re
input = '''2.nseasy.com.|['azeaonline.com']
ns1.iwaay.net.|['alchemistrywork.com', 'dha-evolution.biz', 'hidada.net', 'sonifer.biz']
ns2.hd28.co.uk.|['networksound.co.uk']'''
for line in input.split('\n'):
    splitted = line.split('|')
    left = splitted[0]
    right = re.findall("'([a-z\.-]+?)'", splitted[1])
    for domain in right:
        print '{0}|{1}'.format(left, domain)
Outputs:
2.nseasy.com.|azeaonline.com
ns1.iwaay.net.|alchemistrywork.com
ns1.iwaay.net.|dha-evolution.biz
ns1.iwaay.net.|hidada.net
ns1.iwaay.net.|sonifer.biz
ns2.hd28.co.uk.|networksound.co.uk

Python: Cut and Paste Data on Text File.

I have a text file that contains numbers.
I need to move some of the numbers from the beginning of the file to the end in the correct order.
For example, the Original TEXT file has the following content: 0123456789.
I need to move the first 4 numbers to the end in the same order so it'll look like this:
4567890123.
Unfortunately I have no idea how to do this with Python,
and I don't even know where to start.
Any pointers to solving this problem would be highly appreciated.
See the Python tutorial (section "Strings"; search for "slice notation"):
>>> a = "0123456789"
>>> b = a[4:] + a[:4]
>>> b
'4567890123'
Or what is it you're really trying to do?
The individual characters of the string a = '0123456789' can be accessed through a[i], where for i=0 you get the character at the first position (indexes are numbered from 0), so '0'. You can also extract several characters at once in the form a[i:j], where i is the position of the first character and j is the position of the character after the last character. If you omit one of i or j, it will take all characters from the beginning or until the end of the string.
So:
a[0] = a[0:1] = a[:1] = '0'
a[1] = a[1:2] = '1'
a[4] = a[4:5] = '4'
a[0:3] = a[:3] = '012'
a[3:5] = '34'
a[4:] = '456789'
So the first 4 characters are a[:4] and the rest is a[4:]. Now you concatenate them together:
a[4:] + a[:4]
and it will return
'4567890123'
In order to read the file, you will have to open it in the read mode and use the first line, stripping any whitespace/newlines:
with open('filename.txt', 'r') as f:
    line = f.readline().strip()
print(line[4:] + line[:4])
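The slicing step can be wrapped in a small helper. This is a sketch; rotate_left is a hypothetical name, and wrapping n around the length of the string is an assumption about what should happen when n is larger than the text:

```python
def rotate_left(text, n):
    """Move the first n characters of text to the end, keeping their order."""
    if not text:
        return text
    n %= len(text)  # assumption: wrap around when n exceeds the length
    return text[n:] + text[:n]

print(rotate_left("0123456789", 4))
```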

How to extract the substring between two markers?

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before and after the part I am interested in: AAA comes right before 1234, and ZZZ right after it.
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How to do the same thing in Python?
Using regular expressions - documentation for further reference
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling
# found: 1234
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
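One thing to watch with the `find` approach: `str.find` returns -1 when a marker is absent, so the slice silently produces a wrong result instead of failing. A minimal sketch of a guarded version follows; the function name `between` and the choice to return `None` on a missing marker are assumptions made for illustration.

```python
def between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next
    end_marker, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None  # start marker not present
    start += len(start_marker)
    end = text.find(end_marker, start)  # only look after the start marker
    if end == -1:
        return None  # end marker not present after the start marker
    return text[start:end]


print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```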
regular expression
import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text
string methods
your_text.partition("AAA")[2].partition("ZZZ")[0]
The above will return an empty string if "AAA" doesn't exist in your_text; if only "ZZZ" is missing, it returns everything after "AAA".
PS Python Challenge?
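With `str.partition`, the middle element of the returned tuple is the separator itself, and it is empty when the separator was not found. That makes it possible to tell "marker missing" apart from "nothing between the markers". A small sketch, where the name `between_partition` and the `None` return value are illustrative assumptions, and both markers are assumed to be non-empty:

```python
def between_partition(text, start_marker, end_marker):
    """Return the text between the markers, or None if a marker is missing."""
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        return None  # start marker absent
    middle, found_end, _ = rest.partition(end_marker)
    if not found_end:
        return None  # end marker absent after the start marker
    return middle


print(between_partition('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```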
Surprised that nobody has mentioned this which is my quick version for one-off scripts:
>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'
You can do it in just one line of code:
>>> import re
>>> re.findall(r'\d{1,5}','gfgfdAAA1234ZZZuijjk')
['1234']
The result is a list.
import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))
You can use re module for that:
>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)
In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.
>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']
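The answers above mix greedy patterns such as `(.+)` with non-greedy ones such as `(.+?)`. They give the same result on the example string, but differ once the markers occur more than once. A short sketch with a made-up input string:

```python
import re

# Hypothetical input with two marked sections
s = 'xAAA12ZZZyAAA34ZZZz'

# Greedy: runs from the first AAA to the last ZZZ
greedy = re.findall(r'AAA(.+)ZZZ', s)
# Non-greedy: stops at the nearest ZZZ each time
lazy = re.findall(r'AAA(.+?)ZZZ', s)

print(greedy)  # ['12ZZZyAAA34']
print(lazy)    # ['12', '34']
```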
text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'
print(text[text.index(left)+len(left):text.index(right)])
Gives
string
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with re.sub function using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, a capturing group is written as \(..\), but in Python it is written as (..).
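One caveat with the `re.sub` approach is that when the pattern does not match, `re.sub` returns the input unchanged, so a missing marker is indistinguishable from a successful extraction. The sketch below shows that behavior and one possible way around it; the helper name `extract` and its `default` parameter are assumptions for illustration.

```python
import re

PATTERN = re.compile(r'.*AAA(.*)ZZZ.*')

# No match: re.sub hands back the original string untouched
print(re.sub(PATTERN, r'\1', 'no markers here'))  # no markers here


def extract(text, default=None):
    """Return the captured group, or default when the markers are absent."""
    m = PATTERN.search(text)
    return m.group(1) if m else default


print(extract('gfgfdAAA1234ZZZuijjk'))  # 1234
print(extract('no markers here'))       # None
```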
You can find the first occurrence of a substring (by character index) with this function. You can also get what comes after a substring.
def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset == None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1
# Example:
Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"
print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")
print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")
print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))
# Your answer:
Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"
AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0)
print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))
Using PyParsing
import pyparsing as pp
word = pp.Word(pp.alphanums)
s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)
which yields:
[['1234']]
One liner with Python 3.8 if text is guaranteed to contain the substring:
text[text.find(start:='AAA')+len(start):text.find('ZZZ')]
Just in case somebody has to do the same thing I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'
I.e. you need to escape the parentheses with a backslash (\). Though this is more a problem about regular expressions than about Python.
Also, in some cases you may see an 'r' prefix before the regex definition. Without the r prefix, backslashes have to be escaped as in C string literals. Here is more discussion on that.
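To make the point about the r prefix concrete, here is a small sketch comparing a raw-string pattern with its non-raw equivalent. The variable names are illustrative only, and the pattern is a shortened form of the one above (the surrounding `.*` is not needed with `re.search`).

```python
import re

line = 'US president (Barack Obama) met with ...'

# In a raw string the backslash reaches the regex engine unchanged.
raw_pattern = r'\((.*?)\)'
# Without the r prefix each backslash has to be doubled.
plain_pattern = '\\((.*?)\\)'

print(raw_pattern == plain_pattern)  # True

match = re.search(raw_pattern, line)
if match:
    print(match.group(1))  # Barack Obama
```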
Also, you can find all combinations with the functions below:
s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text,word):
    word_places = []
    i=0
    while True:
        word_place = text.find(word,i)
        if word_place<0:
            break
        word_places.append(word_place)
        # continue searching just past this occurrence
        i=word_place+len(word)
    return word_places
def find_all_combination(text,start,end):
    start_places = find_all_places(text,start)
    end_places = find_all_places(text,end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            print(start_place)
            print(end_place)
            if start_place>=end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list
find_all_combination(s,"Part","Part")
result:
['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']
In case you want to look for multiple occurrences:
content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )
Or more concisely:
strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]
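For comparison, the same multiple-occurrence extraction can be sketched with `re.findall`. This is an alternative to the split-based code, not a replacement the answer itself proposes; `re.escape` is used on the assumption that the markers should be treated as literal text even if they contain regex metacharacters.

```python
import re

content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
prefix, suffix = "Prefix_", "_Suffix"

# Non-greedy group between the two literal markers
pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
strings = re.findall(pattern, content)

print(strings)  # ['helloworld', '42']
```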
Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.
def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]
Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :
string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []
for char in string:
    if char in numbersList: output.append(char)
print(f"output: {''.join(output)}")
### output: 1234
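A more compact sketch of the same digit-filtering idea uses `str.isdigit`. Note two assumptions: `isdigit` also accepts some non-ASCII digit characters, and, like the loop above, this collects digits from anywhere in the string rather than only between the markers, as the second call shows. The function name `extract_digits` is illustrative.

```python
def extract_digits(text):
    """Join every digit character found anywhere in text."""
    return ''.join(ch for ch in text if ch.isdigit())


print(extract_digits('gfgfdAAA1234ZZZuijjk'))  # 1234
# Digits outside the markers are picked up as well
print(extract_digits('a1AAA23ZZZ4'))           # 1234
```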
Typescript. Gets string in between two other strings.
Searches shortest string between prefixes and postfixes
prefixes - string / array of strings / null (means search from the start).
postfixes - string / array of strings / null (means search until the end).
public getStringInBetween(str: string, prefixes: string | string[] | null,
                          postfixes: string | string[] | null): string {
    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }
    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }
    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }
    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);
    let value = str.substring(start.pos + start.sub.length, end.pos);
    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }
    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }
    return value;
}
A simple approach could be the following:
string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]
One-liners that return another string if there was no match.
Edit: the improved version uses the next function; replace "not-found" with something else if needed:
import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )
My other method, which is less optimal, runs a regex a second time; I still haven't found a shorter way:
import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
