splitting multiple lines using str.split - python

I'm trying to split a multiline paragraph using str.split; a single-line split works correctly. Is str.split the correct way to split multiple lines, and what am I missing here?
Example of a single-line split working correctly:
dmap_lines = """Nople Normal Altar1-truck-Altar2,Altar2-train-Cansomme,Cansomme-flight-Karoh,Karoh-truck-Nople"""
destinations = []
remainders1 = []
stages = []
for line in dmap_lines:
    destination, remainder1, remainder = dmap_lines.split(' ')
    destinations.append(destination)
    remainders1.append(remainder1)
    remainder = remainder.split(',')
    stages.append(remainder)
    print(destination)
    print(remainder1)
    print(type(remainder))
    print(remainder)
Expected Output:
Nople
Normal
<class 'list'>
['Altar1-truck-Altar2', 'Altar2-train-Cansomme', 'Cansomme-flight-Karoh', 'Karoh-truck-Nople']
With the multiline code:
dmap_lines = """Nople Normal Altar1-truck-Altar2,Altar2-train-Cansomme,Cansomme-flight-Karoh,Karoh-truck-Nople\nDria Normal Altar1-truck-Altar2,Altar2-train-Mala1,Mala1-truck-Mala2,Mala2-flight-Dria"""
destinations = []
remainders1 = []
stages = []
for line in dmap_lines:
    destination, remainder1, remainder = dmap_lines.split(' ')
    destinations.append(destination)
    remainders1.append(remainder1)
    remainder = remainder.split(',')
    stages.append(remainder)
    print(destination)
    print(remainder1)
    print(type(remainder))
    print(remainder)
Receiving error in output:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-78-9eb9f8fa1c64> in <module>
      4 stages = []
      5 for line in dmap_lines:
----> 6     destination, remainder1, remainder = dmap_lines.split(' ')
      7     destinations.append(destination)
      8     remainders1.append(remainder1)
ValueError: too many values to unpack (expected 3)
Expected output:
Nople
Normal
<class 'list'>
['Altar1-truck-Altar2', 'Altar2-train-Cansomme', 'Cansomme-flight-Karoh', 'Karoh-truck-Nople']
Dria
Normal
<class 'list'>
['Altar1-truck-Altar2,Altar2-train-Mala1,Mala1-truck-Mala2,Mala2-flight-Dria']
Why is the for loop not iterating over multiple lines and splitting the string into the sections?
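For what it's worth, iterating directly over a string yields single characters, not lines, and the loop body splits the whole dmap_lines string on every pass. A minimal sketch of a corrected loop over the same multiline data, iterating with splitlines() and splitting each line instead:
destinations, remainders1, stages = [], [], []
for line in dmap_lines.splitlines():  # yields whole lines, not characters
    # each line holds exactly three space-separated fields
    destination, remainder1, remainder = line.split(' ')
    destinations.append(destination)
    remainders1.append(remainder1)
    stages.append(remainder.split(','))
    print(destination)
    print(remainder1)
    print(type(stages[-1]))
    print(stages[-1])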

Is there any reason why re.findall would not be a better approach here:
dmap_lines = "Nople Normal Altar1-truck-Altar2,Altar2-train-Cansomme,Cansomme-flight-Karoh,Karoh-truck-Nople"
matches = re.findall(r'\b\w+-\w+-\w+\b', dmap_lines)
print(matches)
This prints:
['Altar1-truck-Altar2', 'Altar2-train-Cansomme', 'Cansomme-flight-Karoh', 'Karoh-truck-Nople']
To get a single CSV string, use join:
csv = ','.join(matches)
print(csv)
This prints:
Altar1-truck-Altar2,Altar2-train-Cansomme,Cansomme-flight-Karoh,Karoh-truck-Nople


Iterating over a .txt file with a regular expression conditional

Program workflow:
Open "asigra_backup.txt" file and read each line
Search for the exact string: "Errors: " + {any value ranging from 1 - 100}. e.g "Errors: 12"
When a match is found, open a separate .txt file in write&append mode
Write the match found. Example: "Errors: 4"
In addition to above write, append the next 4 lines below the match found in step 3; as that is additional log information
What I've done:
Tested a regular expressions that matches with my sample data on regex101.com
Used list comprehension to find all matches in my test file
Where I need help (please):
Figuring out how to append the additional 4 lines of log information below each matched string
CURRENT CODE:
import re

result = [line.split("\n")[0] for line in open('asigra_backup.txt')
          if re.match(r'^Errors:\s([1-9]|[1-9][0-9]|100)', line)]
print(result)
CURRENT OUTPUT:
['Errors: 1', 'Errors: 128']
DESIRED OUTPUT:
Errors: 1
Pasta
Fish
Dog
Doctonr
Errors: 128
Lemon
Seasoned
Rhinon
Goat
SAMPLE .TXT FILE
Errors: 1
Pasta
Fish
Dog
Doctonr
Errors: 128
Lemon
Seasoned
Rhinon
Goat
Errors: 0
Rhinon
Cat
Dog
Fish
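A minimal sketch of the core idea (file names taken from the question and the final solution below, and the regex from the current code): on each match, write the matching line plus the next four lines.
import re

with open('asigra_backup.txt') as src:
    lines = [line.rstrip('\n') for line in src]

with open('asigra_errors.txt', 'a') as dst:  # append mode, per step 3
    for i, line in enumerate(lines):
        if re.match(r'^Errors:\s([1-9]|[1-9][0-9]|100)', line):
            for out_line in lines[i:i+5]:  # the match itself plus the next 4 lines
                dst.write(out_line + '\n')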
For those wanting additional clarification, as it may help the next person, this was my final solution:
import re
import sys
from itertools import chain

def errors_to_file(self):
    """
    Opens the file containing Asigra backup logs, "asigra_backup.txt", and collects all errors within the log.
    Uses a regular expression match conditional on each line within the Asigra backup log file. Error number range is 1 - 100.
    Formats the error log by inserting a blank entry after every 9 elements of the errors list.
    Writes the formatted error log to a file in the current directory: "asigra_errors.txt"
    """
    # "asigra_backup.txt" contains log information from the performed backup.
    with open('asigra_backup.txt', "r") as f:
        lines0 = [line.rstrip() for line in f]
    # empty list that is appended with errors found in the log
    lines = []
    for i, line in enumerate(lines0):
        if re.match(r'^Errors:\s([1-9]|[1-9][0-9]|100)', line):
            lines.extend(lines0[i:i+9])
    if len(lines) == 0:
        print("No errors found")
        print("Gracefully exiting")
        sys.exit(1)
    k = ''
    N = 9
    formatted_errors = list(chain(*[
        lines[i:i+N] + [k] if len(lines[i:i+N]) == N else lines[i:i+N]
        for i in range(0, len(lines), N)
    ]))
    with open("asigra_errors.txt", "w") as e:
        for line in formatted_errors:
            e.write(f"{line}\n")
Huge thank you to those that answered my question.
Using a better regex with re.findall can make this easier. The following regex detects each Errors: line together with the 4 lines that follow it.
import re

with open('asigra_backup.txt', 'r') as f:
    regex_matches = re.findall(r'(?:[\r\n]+|^)((Errors:\s*([1-9][0-9]?|100))(?:[\r\n\s\t]+.*){4})', f.read())
with open('separate.txt', 'a') as out:
    out.write('\n' + '\n'.join([i[0] for i in regex_matches]))
To access the error lines or the error numbers, the following can be used:
error_rows = [i[1] for i in regex_matches]
error_numbers = [i[2] for i in regex_matches]
print(error_rows)
print(error_numbers)
I wrote code which prints the output as requested. Note that it only works when an Errors: 1 line is added as the last line (a sentinel). See the text I have parsed:
data_to_parse = """
Errors: 56
Pasta
Fish
Dog
Doctonr
Errors: 0
Lemon
Seasoned
Rhinon
Goat
Errors: 45
Rhinon
Cat
Dog
Fish
Errors: 34
Rhinon
Cat
Dog
Fish1
Errors: 1
"""
See the code below, which gives the desired output without using regex; indices are used to get the desired data.
lines = data_to_parse.splitlines()
errors_indices = []
i = 0
k = 0
for line in lines:  # indices of the 'Errors:' lines are saved in errors_indices
    if 'Errors:' in line:
        errors_indices.append(i)
    i = i + 1
while k < len(errors_indices):
    counter = False  # needed to skip the block when 'Errors: 0' is hit
    for j in range(errors_indices[k-1], errors_indices[k]):
        if 'Errors:' in lines[j]:
            lines2 = lines[j].split(':')
            lines2_val = lines2[1].strip()
            if int(lines2_val) != 0:
                print(lines[j])
            if int(lines2_val) == 0:
                counter = True
        elif 'Errors:' not in lines[j] and counter == False:
            print(lines[j])
    k = k + 1
I have run the code a few times to check that it is working properly, and it gives the requested output.

Keep Getting ValueError: not enough values to unpack (expected 2, got 1) for a text file for sentiment analysis?

I am trying to turn this text file into a dictionary using the code below:
with open("/content/corpus.txt", "r") as my_corpus:
wordpoints_dict = {}
for line in my_corpus:
key, value = line.split('')
wordpoints_dict[key] = value
print(wordpoints_dict)
It keeps returning:
ValueError                                Traceback (most recent call last)
<ipython-input-18-8cf5e5efd882> in <module>()
      2 wordpoints_dict = {}
      3 for line in my_corpus:
----> 4     key, value = line.split('-')
      5     wordpoints_dict[key] = value
      6 print(wordpoints_dict)
ValueError: not enough values to unpack (expected 2, got 1)
The data in the text file looks like this:
[screenshot of the text data]
You are trying to split a text value at '-' and unpack it into two values (key before the dash, value after the dash). However, some lines in your txt file do not contain a dash, so there are not two values to unpack. Try checking for blank lines, as these could be a cause of the issue.
Your code doesn't match the error message. I'm going to assume that the error message is the correct one...
Just add a little logic to handle the case where there isn't a - on a line. I wouldn't be surprised if you fixed that problem and then hit the other side of it: a line with more than one -. If that occurs in your file, you'll get a "too many values to unpack" error instead, so you'll have to deal with that case as well. Here's your code with the added boilerplate for doing both of these things:
with open("/content/corpus.txt", "r") as my_corpus:
wordpoints_dict = {}
for line in my_corpus:
parts = line.split('-')
if len(parts) == 1:
parts = (line, '') # If no '-', use an empty second value
elif len(parts) > 2:
parts = parts[:2] # If too many items from split, use the first two
key, value = [x.strip() for x in parts] # strip leading and trailing spaces
wordpoints_dict[key] = value
print(wordpoints_dict)
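As a side note, str.partition always returns exactly three values (the text before the separator, the separator itself, and the text after), so it sidesteps both unpacking errors. A minimal sketch using the same file:
with open("/content/corpus.txt", "r") as my_corpus:
    wordpoints_dict = {}
    for line in my_corpus:
        # partition splits at the first '-' only and always yields a 3-tuple;
        # sep and value are both '' when the line contains no '-'
        key, sep, value = line.partition('-')
        wordpoints_dict[key.strip()] = value.strip()
print(wordpoints_dict)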

How to strip a comma in the middle of a large number?

I want to convert a str number into a float or int numerical type. It throws an error because of the comma in the value, so I am trying to remove the comma. The comma will not be removed with strip, so I need a way to remove a character at an arbitrary position inside the number, such as the fourth.
power4 = power[power.get('Number of Customers Affected') != 'Unknown']
power5 = power4[pd.notnull(power4['Number of Customers Affected'])]
power6 = power5[power5.get('NERC Region') == 'RFC']
power7 = power6.get('Number of Customers Affected').loc[1]
power8 = power7.strip(",")
power9 = float(power8)
ValueError                                Traceback (most recent call last)
<ipython-input-70-32ca4deb9734> in <module>
      6 power7 = power6.get('Number of Customers Affected').loc[1]
      7 power8 = power7.strip(",")
----> 8 power9 = float(power8)
      9
     10
ValueError: could not convert string to float: '127,000'
Use replace()
float('127,000'.replace(',',''))
Have you tried pandas.to_numeric?
import pandas as pd
a = '1234'
type(a)
a = pd.to_numeric(a)
type(a)
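Note that to_numeric alone will still reject the value from the question because of the comma, so combine it with replace(). A small sketch:
import pandas as pd

s = '127,000'
value = pd.to_numeric(s.replace(',', ''))  # drop the comma, then convert
print(value)  # 127000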
In the
power8 = power7.strip(",")
line, do
power8 = power7.replace(',', '')
strip() will not work here, because it only removes characters from the ends of a string. What is required is the replace() method of string. You may also try
''.join(e for e in s if e.isdigit())
Or,
s = ''.join(s.split(','))
Regex can also be a way to solve this, or you can have a look at this answer: https://stackoverflow.com/a/266162/9851541

python iterate through binary file without lines

I've got some data in a binary file that I need to parse. The data is separated into chunks of 22 bytes, so I'm trying to generate a list of tuples, each tuple containing 22 values. The file isn't separated into lines though, so I'm having problems figuring out how to iterate through the file and grab the data.
If I do this it works just fine:
nextList = f.read(22)
newList = struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList)
where newList contains a tuple of 22 values. However, if I try to apply similar logic to a function that iterates through, it breaks down.
def getAllData():
    listOfAll = []
    nextList = f.read(22)
    while nextList != "":
        listOfAll.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList))
        nextList = f.read(22)
    return listOfAll
data = getAllData()
gives me this error:
Traceback (most recent call last):
  File "<pyshell#27>", line 1, in <module>
    data = getAllData()
  File "<pyshell#26>", line 5, in getAllData
    listOfAll.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList))
struct.error: unpack requires a bytes object of length 22
I'm fairly new to python so I'm not too sure where I'm going wrong here. I know for sure that the data in the file breaks down evenly into sections of 22 bytes, so it's not a problem there.
Since you reported that it kept running when len(nextList) == 0, this is probably because nextList (which isn't a list, despite the name) is an empty bytes object, which isn't equal to an empty string object:
>>> b"" == ""
False
and so the condition in your line
while nextList != "":
is always true, even when nextList is empty, so the loop never exits. That's why using len(nextList) != 22 as a break condition worked, and even
while nextList:
should suffice.
read(22) isn't guaranteed to return a string of length 22. Its contract is to return a string of length anywhere between 0 and 22 (inclusive); a string of length zero indicates there is no more data to be read. In Python 3, file objects opened in binary mode produce bytes objects instead of str, and str and bytes will never be considered equal.
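Under that contract, a minimal sketch of the corrected loop, testing for an empty bytes object rather than comparing against a str:
import struct

def get_all_data(f):
    list_of_all = []
    next_chunk = f.read(22)
    while next_chunk:  # an empty bytes object is falsy, so the loop ends at EOF
        list_of_all.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", next_chunk))
        next_chunk = f.read(22)
    return list_of_all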
If your file is small-ish then you'd be better off reading the entire file into memory and then splitting it up into chunks, e.g.
listOfAll = []
data = f.read()
for i in range(0, len(data), 22):
    t = struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", data[i:i+22])
    listOfAll.append(t)
Otherwise you will need to do something more complicated with checking the amount of data you get back from the read.
def dataiter(f, chunksize=22, buffersize=4096):
    data = b''
    while True:
        newdata = f.read(buffersize)
        if not newdata:  # end of file
            if not data:
                return
            else:
                yield data
                # or raise error as 0 < len(data) < chunksize
                # or pad with zeros to chunksize
                return
        data += newdata
        i = 0
        while len(data) - i >= chunksize:
            yield data[i:i+chunksize]
            i += chunksize
        try:
            data = data[i:]  # keep remainder of unused data
        except IndexError:
            data = b''  # all data was used
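A hypothetical usage sketch for the generator above ('data.bin' is an assumed file name); note that the format string '22B' is equivalent to the explicit run of 22 B characters:
import struct

with open('data.bin', 'rb') as f:
    for chunk in dataiter(f):
        values = struct.unpack('22B', chunk)  # '22B' == 'B' * 22
        print(values)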

I'm trying to save my result into a new file but got problems - Python

I'm trying to write a script which takes all rows starting with 'HELIX', 'SHEET' and 'DBREF' from a .txt file, takes some specific columns from those rows, and then saves the results to a new file.
#!/usr/bin/python
import sys

if len(sys.argv) != 3:
    print("2 Parameters expected: You must introduce your pdb file and a name for output file.")
    exit()

for line in open(sys.argv[1]):
    if 'HELIX' in line:
        helix = line.split()
        cols_h = helix[0], helix[3:6:2], helix[6:9:2]
    elif 'SHEET' in line:
        sheet = line.split()
        cols_s = sheet[0], sheet[4:7:2], sheet[7:10:2], sheet[12:15:2], sheet[16:19:2]
    elif 'DBREF' in line:
        dbref = line.split()
        cols_id = dbref[0], dbref[3:5], dbref[8:10]

modified_data = open(sys.argv[2], 'w')
modified_data.write(cols_id)
modified_data.write(cols_h)
modified_data.write(cols_s)
My problem is that when I try to write my final results it gives this error:
Traceback (most recent call last):
  File "funcional2.py", line 21, in <module>
    modified_data.write(cols_id)
TypeError: expected a character buffer object
When I try to convert to a string using ''.join() it returns another error
Traceback (most recent call last):
  File "funcional2.py", line 21, in <module>
    modified_data.write(' '.join(cols_id))
TypeError: sequence item 1: expected string, list found
What am I doing wrong?
Also, if there is some easy way to simplify my code, it'll be great.
PS: I'm no programmer so I'll probably need some explanation if you do something...
Thank you very much.
cols_id, cols_h and cols_s seem to be lists, not strings.
You can only write a string in your file so you have to convert the list to a string.
modified_data.write(' '.join(cols_id))
and similar.
'!'.join(a_list_of_things) converts the list into a string, separating each element with an exclamation mark.
EDIT:
#!/usr/bin/python
import sys

if len(sys.argv) != 3:
    print("2 Parameters expected: You must introduce your pdb file and a name for output file.")
    exit()

cols_h, cols_s, cols_id = [], [], []
for line in open(sys.argv[1]):
    if 'HELIX' in line:
        helix = line.split()
        cols_h.append(' '.join([helix[0]] + helix[3:6:2] + helix[6:9:2]))
    elif 'SHEET' in line:
        sheet = line.split()
        cols_s.append(' '.join([sheet[0]] + sheet[4:7:2] + sheet[7:10:2] + sheet[12:15:2] + sheet[16:19:2]))
    elif 'DBREF' in line:
        dbref = line.split()
        cols_id.append(' '.join([dbref[0]] + dbref[3:5] + dbref[8:10]))

modified_data = open(sys.argv[2], 'w')
cols = [cols_id, cols_h, cols_s]
for col in cols:
    modified_data.write('\n'.join(col) + '\n')
Here is a solution (untested) that separates data and code a little more. There is a data structure (keyword_and_slices) describing the keywords searched in the lines paired with the slices to be taken for the result.
The code then goes through the lines and builds a data structure (keyword2lines) mapping the keyword to the result lines for that keyword.
At the end the collected lines for each keyword are written to the result file.
import sys
from collections import defaultdict


def main():
    if len(sys.argv) != 3:
        print(
            '2 Parameters expected: You must introduce your pdb file'
            ' and a name for output file.'
        )
        sys.exit(1)
    input_filename, output_filename = sys.argv[1:3]
    #
    # Pairs of keywords and slices that should be taken from the line
    # starting with the respective keyword.
    #
    keyword_and_slices = [
        ('HELIX', [slice(3, 6, 2), slice(6, 9, 2)]),
        (
            'SHEET',
            [slice(a, b, 2) for a, b in [(4, 7), (7, 10), (12, 15), (16, 19)]]
        ),
        ('DBREF', [slice(3, 5), slice(8, 10)]),
    ]
    keyword2lines = defaultdict(list)
    with open(input_filename, 'r') as lines:
        for line in lines:
            for keyword, slices in keyword_and_slices:
                if line.startswith(keyword):
                    parts = line.split()
                    result_line = [keyword]
                    for index in slices:
                        result_line.extend(parts[index])
                    keyword2lines[keyword].append(' '.join(result_line) + '\n')
    with open(output_filename, 'w') as out_file:
        for keyword in ['DBREF', 'HELIX', 'SHEET']:
            out_file.writelines(keyword2lines[keyword])


if __name__ == '__main__':
    main()
The code follows your text in checking whether a line starts with a keyword, instead of your code, which checks whether a keyword appears anywhere within a line.
It also makes sure all files are closed properly by using the with statement.
You need to convert the tuple created on the RHS of your assignments to a string. Note that joining the tuple directly would fail, because dbref[3:5] and dbref[8:10] are lists; flatten everything into one list of strings first:
# Replace this
cols_id = dbref[0], dbref[3:5], dbref[8:10]
# with a string built from the flattened pieces
cols_id = ' '.join([dbref[0]] + dbref[3:5] + dbref[8:10])
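A tiny demonstration with made-up values of why the flattening matters: str.join only accepts strings, and a slice such as dbref[3:5] is a list, not a string.
dbref = ['DBREF', '1ABC', 'A', '1', '245', 'UNP', 'P12345', 'NAME_HUMAN', '1', '245']
# ' '.join((dbref[0], dbref[3:5], dbref[8:10])) would raise the TypeError from the question
cols_id = ' '.join([dbref[0]] + dbref[3:5] + dbref[8:10])
print(cols_id)  # DBREF 1 245 1 245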
