Deleting Relative Lines with Regex - python

Using pdftotext, a text file was created that includes footers from the source pdf. The footers get in the way of other parsing that needs to be done. The format of the footer is as follows:
This is important text.
9
Title 2012 and 2013
\fCompany
Important text begins again.
The line for Company is the only one that does not recur elsewhere in the file. It appears as \x0cCompany\n. I would like to search for this line and remove it and the preceding three lines (the page number, title, and a blank line) based on where the \x0cCompany\n appears. This is what I have so far:
import re

report = open('file.txt').readlines()
data = range(len(report))
name = []
for line_i in data:
    line = report[line_i]
    if re.match('.*\\x0cCompany', line):
        name.append(report[line_i])
print(name)
This allows me to make a list storing which line numbers have this occurrence, but I do not understand how to delete these lines as well as the three preceding lines. It seems I need to create some other loop based on this loop but I cannot make it work.

Instead of iterating through the lines and collecting the indices of the ones you want to delete, iterate through the lines and append only the ones you want to keep.
It would also be more efficient to iterate over the file object itself, rather than reading everything into a list first:
import re

keeplines = []
with open('file.txt') as b:
    for line in b:
        if re.match('.*\\x0cCompany', line):
            keeplines = keeplines[:-3]  # shave off the three preceding lines
        else:
            keeplines.append(line)
with open('file.txt', 'w') as file:
    for line in keeplines:
        file.write(line)
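The same idea can be wrapped in a small function that works on any iterable of lines, which makes it easy to try out without touching the real file. This is only a minimal sketch: the name `strip_footers` and the parameters `marker` and `n_before` are illustrative assumptions, and the position of the blank line in the sample data is a guess at the layout described in the question.

```python
def strip_footers(lines, marker="\x0cCompany", n_before=3):
    """Return lines without each marker line and the n_before lines preceding it."""
    kept = []
    for line in lines:
        if marker in line:
            # Drop the already-kept lines that belong to the footer.
            # Guard against n_before == 0, because kept[-0:] would select everything.
            if n_before:
                del kept[-n_before:]
        else:
            kept.append(line)
    return kept


# Assumed layout: blank line, page number, title, then the marker line.
sample = [
    "This is important text.\n",
    "\n",
    "9\n",
    "Title 2012 and 2013\n",
    "\x0cCompany\n",
    "Important text begins again.\n",
]
print(strip_footers(sample))
# ['This is important text.\n', 'Important text begins again.\n']
```

Because the function takes an iterable, it can be fed an open file object directly, and the result can be written back with `writelines`.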

Related

Printing characters from a given sequence till a certain range only. How to do this in Python?

I have a file in which I have a sequence of characters. I want to read the second line of that file and want to read the characters of that line to a certain range only.
I tried this code, but it prints only a single character from each line instead of a range of characters.
with open("irumfas.fas", "r") as file:
    first_chars = [line[1] for line in file if not line.isspace()]
print(first_chars)
Can anyone help in this regard? How can I give a range?
Below is the sequence that I want to print. I want to start printing the characters from the second line of the sequence, up to a certain range only.
IRUMSEQ
ATTATAAAATTAAAATTATATCCAATGAATTCAATTAAATTAAATTAAAGAATTCAATAATATACCCCGGGGGGATCCAATTAAAAGCTAAAAAAAAAAAAAAAAAA
The following approach can be used.
Consider the file contains
RANDOMTEXTSAMPLE
SAMPLERANDOMTEXT
RANDOMSAMPLETEXT
with open('sampleText.txt') as sampleText:
    content = sampleText.read()
content = content.split("\n")[1]
content = content[:6]
print(content)
Output will be
SAMPLE
I think you want something like this:
with open("irumfas.fas", "r") as file:
    second_line = file.readlines()[1]
print(second_line[0:9])
readlines() will give you a list of the lines -- which we index to get only the 2nd line. Your existing code will iterate over all the lines (which is not what you want).
As for extracting a certain range, you can use slicing to select the range of characters you want from that line. In the example above, second_line[0:9] is the first 9 characters.
You can slice the second line of the file just as you would slice a list.
You were very close:
end = 6  # number of characters
with open("irumfas.fas", "r") as file:
    lines = [line for line in file if not line.isspace()]
print(lines[1][:end])
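For a large file it may be preferable not to load every line just to reach the second one. The following is a minimal sketch of that variation using `itertools.islice`; the function name `second_line_prefix` is made up for illustration, and the sample sequence is shortened from the one in the question.

```python
from itertools import islice


def second_line_prefix(lines, end):
    """Return the first `end` characters of the second non-blank line, or None."""
    non_blank = (line.rstrip("\n") for line in lines if not line.isspace())
    # islice(iterable, 1, 2) yields only the item at index 1 and then stops,
    # so the rest of the input is never read.
    second = next(islice(non_blank, 1, 2), None)
    return None if second is None else second[:end]


sample = ["IRUMSEQ\n", "ATTATAAAATTAAAATTATATCC\n"]
print(second_line_prefix(sample, 10))  # ATTATAAAAT
```

With a real file, pass the open file object as `lines`, for example `second_line_prefix(open("irumfas.fas"), 10)`.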

How to split lines in python

I am looking for a simple way to split lines in python from a .txt file and then just read out the names and compare them to another file.
I had code that split the lines successfully, but I couldn't find a way to read out just the names. Unfortunately, that code was lost.
This is what the .txt file looks like:
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
Here is an example of the code I currently have (it doesn't output anything):
my_file = open("HP_liki.txt","r")
flag = index = 0
x1=""
for line in my_file:
    line.strip().split('\n')
    index+=1
content = my_file.read()
list=[]
lines_to_read = [index-1]
for position, line1 in enumerate(x1):
    if position in lines_to_read:
        list=line1
        x1=list.split(";")
        print(x1[1])
I need a solution that doesn't import pandas or csv.
The first part of your code confuses me as to your purpose.
for line in my_file:
    line.strip().split('\n')
    index+=1
content = my_file.read()
Your for loop iterates through the file and strips each line. Then it splits on a newline, which cannot exist at this point: the for loop already yields one line at a time, and strip() has removed the trailing newline.
In addition, once you've stripped and split the line, you ignore the result and just increment index. As a result, all this loop accomplishes is to count the lines in the file.
The line after the loop reads from a file that has no more data, so read() simply returns an empty string.
If you want the names from the file, then use the built-in file read to iterate through the file, split each line, and extract the second field:
name_list = [line.split(';')[1]
             for line in open("HP_liki.txt", "r")]
name_list also includes the header "Name", which you can easily delete.
Does that handle your problem?
So without using any external library you can use simple file io and then generalize according to your need.
readfile.py
file = open('datafile.txt','r')
for line in file:
    line_split = line.split(';')
    if (line_split[0].isdigit()):
        print(line_split[1])
file.close()
datafile.txt
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
If you run this you'll have output
James
Adam
Clare
You can change the if condition according to your need
I have my dataf.txt file:
Id;Name;Job;
1;James;IT;
2;Adam;Director;
3;Clare;Assisiant;
I have written this to extract information:
with open('dataf.txt','r') as fl:
    data = fl.readlines()
a = [i.replace('\n','').split(';')[:-1] for i in data]
print(a[1:])
Outputs:
[['1', 'James', 'IT'], ['2', 'Adam', 'Director'], ['3', 'Clare', 'Assisiant']]
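None of the answers above covers the comparison with a second file that the question mentions. One possible approach, sketched here under assumptions, is to collect the names from each file and compare them as sets. The helper name `extract_names` is hypothetical, and the contents of the second file are invented purely for the demonstration.

```python
def extract_names(lines):
    """Collect the second ';'-separated field of every data row, skipping the header."""
    names = []
    for line in lines:
        fields = line.strip().split(";")
        # Data rows start with a numeric Id; the header and blank lines do not.
        if len(fields) > 1 and fields[0].isdigit():
            names.append(fields[1])
    return names


first = ["Id;Name;Job;\n", "1;James;IT;\n", "2;Adam;Director;\n", "3;Clare;Assisiant;\n"]
# Invented contents for the second file, for illustration only.
second = ["Id;Name;Job;\n", "1;Adam;Director;\n", "2;Zoe;Analyst;\n"]

names_a = extract_names(first)
names_b = extract_names(second)
print(names_a)                              # ['James', 'Adam', 'Clare']
print(sorted(set(names_a) & set(names_b)))  # names in both: ['Adam']
print(sorted(set(names_a) - set(names_b)))  # only in the first: ['Clare', 'James']
```

With real files, pass the open file objects instead of the lists. No imports are needed, so this stays within the "no pandas or csv" constraint.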

Splitting / Slicing Text File with Python

I'm learning Python, and I've been trying to split this txt file into multiple files, grouped by a string sliced from the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 characters and is terminated by a space (as in WSON33, JHSF3, etc.).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so that it works.
I adapted the code below from another post. It more or less works, splitting the input into multiple files, but it requires the lines to be sorted before I start writing files. I also need to copy the header into each file rather than isolating it in a file of its own.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to, named with the group number as a prefix - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of 5 or six characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times and makes the new files in the same folder as your text and Python files.
# Code which separates lines in a file by an identifier
# and makes a new file for each identifier group.
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)
# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])
# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # .seek(0) rewinds the file to the start, which is needed
        # when looping through a file multiple times
        filehandle.seek(0)
        # makes new file
        newfile = open(str(item) + ".txt", "w+")
        # inserts header, and goes to next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # goes through old file, and adds relevant lines to new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        newfile.close()
print(Unique)
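A single pass over the file is also possible, without sorting first and without rewinding for every identifier. The sketch below groups lines in a dictionary keyed by the first whitespace-delimited token, following the answer above. The function names `group_by_identifier` and `write_groups` are hypothetical, the output file naming simply mirrors the answer, and the repeated third data line in the sample is added only to show grouping.

```python
from collections import defaultdict


def group_by_identifier(lines):
    """Split lines into (header, {identifier: [lines]}) in a single pass."""
    lines = iter(lines)
    header = next(lines, "")  # the first line is treated as the header
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        groups[parts[0]].append(line)
    return header, dict(groups)


def write_groups(header, groups):
    """Write one '<identifier>.txt' file per group, each starting with the header."""
    for identifier, group_lines in groups.items():
        with open(identifier + ".txt", "w") as out:
            out.write(header)
            out.writelines(group_lines)


sample = [
    "H24/06/202000003TORDISTD\n",
    "BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000\n",
    "BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000\n",
    "BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000\n",
]
header, groups = group_by_identifier(sample)
print(sorted(groups))  # ['BJHSF3', 'BWSON33']
```

Calling `write_groups(header, groups)` would then create the files. It is left out of the demonstration so that running the snippet does not write to disk.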

Reading and taking specific file contents in a list in python

I have a file containing:
name: Sam
placing: 2
quote: I'll win.
name: Jamie
placing: 1
quote: Be the best.
and I want to read the file through python and append specific contents into a list. I want my first list to contain:
rank = [['Sam', 2],['Jamie', 1]]
and second list to contain:
quo = ["I'll win", 'Be the best']
First off, I start reading the file with:
def read_file():
    filename = open("player.txt","r")
    playerFile = filename
    player = [] #first list
    quo = [] #second list
    for line in playerFile: #going through each line
        line = line.strip().split(':') #strip new line
        print(line) #checking purpose
        player.append(line[1]) #index out of range
        player.append(line[2])
        quo.append(line[3])
I'm getting an index out of range in the first append. I have split by ':' but I can't seem to access it.
When you run line = line.strip().split(':') on the line "name: Sam", you get ['name', ' Sam'], so the first append should work.
The second one, player.append(line[2]), will not work, because the list has only two elements.
As zython said in the comments, you need to know the format of the file; any blank line or other change in the file can make your script fail.
You should analyze the file differently:
If you can rely on "name" and "quote" always being present in each player's data, look for those field names.
For example:
for line in file:
    # Run on each line and insert into the player list only the lines with "name" in them
    if "name" in line:
        # Line with "name" was found - do what you need with it
        player.append(line.split(":")[1])
A few problems:
The program attempts to read three lines' worth of data in a single iteration of the for loop. That won't work, because the loop and the split call parse only a single line per iteration. It takes three loop iterations to read a single entry from your file.
The program needs handling for blank lines. Generally, when reading files like this, you want a lot of error handling, because the data is usually not formatted perfectly. My suggestion is to check for blank lines, where line has only a single value which is an empty string. When you see that, ignore the line.
The program needs to collect the first and second lines of each entry and put those into a temporary array, then append the temporary array to player. So you'll need to declare that temporary array above, populate it first with the name field and next with the placing field, and finally append it to player.
Zero-based indexing. Remember that the first item of an array is list[0], not list[1].
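The points above can be sketched as follows. This is one possible reading of the advice, not the answerer's own code: the function name `parse_players` is made up, the blank line between the two entries in the sample is an assumption about the file, and `int()` will raise a `ValueError` if a placing is not a number.

```python
def parse_players(lines):
    """Build rank ([[name, placing], ...]) and quo ([quote, ...]) from 'key: value' lines."""
    rank, quo = [], []
    entry = []  # temporary list holding the current player's name and placing
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        # partition splits at the first ':' only, so a value may itself contain ':'
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "name":
            entry = [value]
        elif key == "placing":
            entry.append(int(value))
            rank.append(entry)
        elif key == "quote":
            quo.append(value)
    return rank, quo


sample = [
    "name: Sam\n", "placing: 2\n", "quote: I'll win.\n",
    "\n",
    "name: Jamie\n", "placing: 1\n", "quote: Be the best.\n",
]
rank, quo = parse_players(sample)
print(rank)  # [['Sam', 2], ['Jamie', 1]]
print(quo)   # ["I'll win.", 'Be the best.']
```

Matching on the exact key rather than testing `"name" in line` avoids false hits when a quote happens to contain the word "name".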
I think you are confused on how to check for a line and add content from line to two lists based on what it contains. You could use in to check what line you are on currently. This works assuming your text file is same as given in question.
rank, quo = [], []
with open("player.txt") as playerFile:
    for line in playerFile:
        # strip() removes the trailing newline before splitting
        splitted = line.strip().split(": ")
        if "name" in line:
            name = splitted[1]
        elif "placing" in line:
            rank.append([name, splitted[1]])
        elif "quote" in line:
            quo.append(splitted[1])
print(rank)  # [['Sam', '2'], ['Jamie', '1']]
print(quo)   # ["I'll win.", 'Be the best.']
Try this code:
def read_file():
    filename = open("player.txt", "r")
    playerFile = filename
    player = []
    rank = []
    quo = []
    for line in playerFile:
        value = line.strip().split(": ")
        if "name" in line:
            player.append(value[1])
        if "placing" in line:
            player.append(value[1])
        if "quote" in line:
            quo.append(value[1])
            rank.append(player)
            player = []
    print(rank)
    print(quo)

read_file()

Reading like "list = [][][]" lists from couple of txt files

I have 9 txt files. Every text is like:
[(0.0, 32.633221, 39.91769),
(8.32, 32.633717, 39.917892),
(25.35, 32.633945, 39.917538),
(25.93, 32.634262, 39.916946),
(7.24, 32.634888, 39.91674),
(0.0, 32.635014, 39.916737),
(15.31, 32.635242, 39.916569),
(22.12, 32.635727, 39.916176)....
I want to create a new text file that contains only the first elements of every element. I mean like:
list_firsttxtfile = [(0.0), (8.32), (25.35), (25.93),... ]
Another way of doing it.
list_firsttxtfile = []
with open("mytextfile.txt", "r") as f:
    data = f.read()
# Every tuple starts with "(", so split there and skip the text before the first one.
for part in data.split("(")[1:]:
    sections = part.split(",")
    list_firsttxtfile.append(float(sections[0]))
If every line of your file contains at least one float, you can extract it using regular expression:
import re

first_float = re.compile(r'[-+]?\d*\.?\d+')
with open("firstfile.txt") as txtfile:
    # "if line.strip()" skips blank lines, if any
    list_first_txt_file = [float(first_float.search(line).group())
                           for line in txtfile if line.strip()]
Be aware that if your file contains lines where the regex doesn't match, it will fail with AttributeError: 'NoneType' object has no attribute 'group'. You can change the if part of the list comprehension to if first_float.search(line) to avoid that, but it will run the regex on the same line twice, which might not be great if your files are big.
If you are absolutely sure that there are no blank lines and that every line will have a match, then you can remove the if entirely.
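If each file really holds one complete Python-style list of tuples (the sample in the question is truncated, so this is an assumption), another option is to let `ast.literal_eval` parse the whole text and then pick the first item of each tuple. The function name `first_elements` is made up for this sketch.

```python
import ast


def first_elements(text):
    """Parse a Python-style list of tuples and return the first item of each tuple."""
    rows = ast.literal_eval(text)  # safely evaluates literals only, not arbitrary code
    return [row[0] for row in rows]


sample = """[(0.0, 32.633221, 39.91769),
(8.32, 32.633717, 39.917892),
(25.35, 32.633945, 39.917538)]"""
print(first_elements(sample))  # [0.0, 8.32, 25.35]
```

For the nine files, the function can be called in a loop on the result of `f.read()` for each one. It raises an exception if a file is not a valid literal, for example if the closing bracket is missing.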
