I understand a little, but I want an exact explanation of one particular line (the one that builds text). I'm confused about the syntax.
Otherwise, I know how the code works and what it is doing; I just want to clarify my understanding of the syntax.
Code:
import docx2txt

def extract_text_from_doc(doc_path):
    temp = docx2txt.process("resumes/Chinmaya_Kaundanya_Resume.docx")
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    return ' '.join(text)
It's the list comprehension version for:
text = []
for line in temp.split('\n'):
    if line:
        text.append(line.replace('\t', ' '))
It iterates through temp line by line; if a line is not empty, it replaces each '\t' (tab) with a space and appends the result to the list text.
It's basically a list comprehension.
It iterates over each line, checks that the line is not empty, and then replaces each tab character with a space.
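To make the equivalence concrete, here is a minimal sketch that runs both forms on a made-up string (the sample value of temp is an assumption; the real one would come from docx2txt.process) and checks that they agree:

```python
# Made-up sample input standing in for the text returned by docx2txt.process
temp = "Name\tJohn\n\nSkills\tPython\tSQL\n"

# General shape: [expression for item in iterable if condition]
text = [line.replace('\t', ' ') for line in temp.split('\n') if line]

# The same logic written as an explicit loop
text_loop = []
for line in temp.split('\n'):
    if line:  # an empty string is falsy, so blank lines are skipped
        text_loop.append(line.replace('\t', ' '))

joined = ' '.join(text)
print(text)
print(joined)
```

The `if line` part filters, the `line.replace(...)` part transforms, and the `for` part is the ordinary loop header.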
Sorry for posting such an easy question, but I couldn't find an answer on Google.
I want my code to do something like this:
Code:
lines = open("Bal.txt").write
lines[1] = new_value
lines.close()
P.S. I want to replace a line in a file with a new value.
xxx.txt before:
ddddddddddddddddd
EEEEEEEEEEEEEEEEE
fffffffffffffffff
with open('xxx.txt', 'r') as f:
    x = f.readlines()
x[1] = "QQQQQQQQQQQQQQQQQQQ\n"
with open('xxx.txt', 'w') as f:
    f.writelines(x)
xxx.txt after:
ddddddddddddddddd
QQQQQQQQQQQQQQQQQQQ
fffffffffffffffff
Note: f.read() returns a single string, whereas f.readlines() returns a list of lines, which lets you replace one item in that list.
Including the \n newline character is important: it separates x[1] from x[2] when you next read the file. Without it you would end up with:
ddddddddddddddddd
QQQQQQQQQQQQQQQQQQQfffffffffffffffff
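If you need this more than once, the same read/modify/write steps can be wrapped in a small helper. This is only a sketch: replace_line is a hypothetical name, and the temporary file is there just so the demo does not touch a real file.

```python
import os
import tempfile

def replace_line(path, index, new_text):
    """Replace the line at 0-based `index` in the text file at `path`."""
    with open(path, 'r') as f:
        lines = f.readlines()
    # Make sure the replacement ends with exactly one newline
    lines[index] = new_text.rstrip('\n') + '\n'
    with open(path, 'w') as f:
        f.writelines(lines)

# Demo on a throwaway file with the same three lines as above
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write("ddddddddddddddddd\nEEEEEEEEEEEEEEEEE\nfffffffffffffffff\n")

replace_line(demo_path, 1, "QQQQQQQQQQQQQQQQQQQ")

with open(demo_path) as f:
    result = f.read()
os.remove(demo_path)
print(result)
```

Because the helper adds the newline itself, forgetting the \n in the new value no longer glues two lines together. Note that it reads the whole file into memory, which is fine for small files.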
I have a data file which only contains the following line:
"testA" "testB":1:"testC":2
Now when I split this line and print the resulting list (w), I get the following:
['"testA"', '"testB":1:"testC":2']
[]
When I access w[0], it returns "testA" just fine, but when I print w[1], it crashes with a "list index out of range" error, even though it still prints "testB":1:"testC":2
Any help would be much appreciated!
Your code does not crash on w[1] on the "testA" "testB":1:"testC":2 line, otherwise it would not print "testB":1:"testC":2. Note the additional [] in your output? Your file contains some more empty lines, which are split to [], and which will then produce that error even on w[0].
To fix the problem, you should check whether the line and/or the list created from that line is non-empty. (When testing the line, make sure to strip away any whitespace, such as the trailing newline character.) Your code should then look somewhat like this:
with open("test.txt") as f:
    for line in f:
        if line.strip():  # strip the `\n` before checking whether line is empty
            w = line.split()
            print(w[0], w[1])
It works for me without any error:
>>> a = '"testA" "testB":1:"testC":2'
>>> b = a.split(' ')
>>> b
['"testA"', '"testB":1:"testC":2']
>>> b[0]
'"testA"'
>>> b[1]
'"testB":1:"testC":2'
Are you doing anything different?
I also cannot reproduce your error but want to share a tip that I use in situations where I'm reading from a file and splitting on a delimiter.
cleaned_list = [ a.strip() for a in input_line.split(' ') if a ]
Simple and sanitizes the split list well.
The answer was that the program read the next line after the first one it had split, and that line was empty. So as soon as it tried to index the contents of that empty line, it crashed. That's why it DID print the first line and then crashed. Thanks a lot for your help, though!
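For anyone hitting the same thing, here is a small sketch of a loop that guards on the number of fields rather than on the raw line. The io.StringIO object is just a stand-in for the real data file, and its contents are assumed from the question.

```python
import io

# Stand-in for the data file: one real line followed by blank lines
data = io.StringIO('"testA" "testB":1:"testC":2\n\n   \n')

rows = []
for line in data:
    w = line.split()   # split() with no argument splits on any whitespace
    if len(w) < 2:     # blank or too-short line: w[1] would raise IndexError
        continue
    rows.append((w[0], w[1]))

print(rows)
```

Checking len(w) covers both empty lines and lines that contain only one field, so neither w[0] nor w[1] can go out of range.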
I am attempting to pull out multiple (50-100) sequences from a large .txt file, separated by newlines ('\n'). Each sequence is a few lines long, but not always the same length, so I can't just print lines x-y. The sequences end with " and the next line always starts with the same word, so maybe that could be used as a keyword.
I am using Python 3.3.
This is what I have so far:
searchfile = open('filename.txt' , 'r')
cache = []
for line in searchfile:
    cache.append(line)
for line in range(len(cache)):
    if "keyword1" in cache[line].lower():
        print(cache[line+5])
This pulls out the starting line (which is always 5 lines below the keyword line); however, it only pulls out that one line.
How do I print the whole sequence?
Thank you for your help.
EDIT 1:
Current output = ABCDCECECCECECE ...
Desired output = ABCBDBEBSOSO ...
ABCBDBDBDBDD ...
continued until " or new line
Edit 2
Text file looks like this:
Name (keyword):
Date
Address1
Address2
Sex
Response"................................"
Y/N
The sequence between the " and " is what I need
TL;DR - How do I print from line + 5 to end when end = keyword
I'm not sure I understand your sequence data, but if you're searching for each 'keyword' and then the next " character, the following should work:
keyword_pos = []
endseq_pos = []
for line in range(len(cache)):
    if 'keyword1' in cache[line].lower():
        keyword_pos.append(line)
    if '"' in cache[line]:
        endseq_pos.append(line)
for key in keyword_pos:
    for endseq in endseq_pos:
        if endseq > key:
            print(cache[key:endseq])
            break
This simply compiles a list of all the positions of all the keywords and " characters and then matches the two and prints all the lines in between.
Hope that helps.
I agree with @Michal Frystacky that regex is the way forward. However, as I now understand the problem, we need two searches: one for the 'keyword', then another 5 lines further on to find the 'sequence'.
This should work, but the regex may need to be tweaked:
import re

with open('yourfile.txt') as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        # first search for keyword
        key_match = re.search(r'\((keyword)', line)
        if key_match:
            # if successful, search 5 lines on for the string between the quotation marks
            seq_match = re.search(r'"([A-Z]*)"', lines[i+5])
            if seq_match:
                print(key_match.group(1) + ' ' + seq_match.group(1))
This can be done rather simply with regex
import re

lines = 'Name (keyword):','Date','Address1','Address2','Sex','Response"................................" '
for line in lines:
    # capture everything between the first " and the next "
    match = re.search(r'"(.*?)"', line)
    if match:
        print(match.group(1))
To use this sample code on the real dataset, we would read the lines with lines = f.readlines(). It's important to note that we only catch things between a " and another "; if there is no " mark at the end, we will miss that data, but accounting for that isn't too difficult.
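Since the question says a sequence can run over several lines, one way to account for that is to search the whole text at once instead of line by line. A rough sketch follows; the sample text is made up from the layout in the question's Edit 2, and it assumes the only " characters in the file are the ones around each sequence.

```python
import re

# Made-up sample in the layout described in the question (Edit 2)
sample = (
    'Name (keyword1):\n'
    'Date\n'
    'Address1\n'
    'Address2\n'
    'Sex\n'
    'Response"ABCBDBEBSOSO\n'
    'ABCBDBDBDBDD"\n'
    'Y/N\n'
)

# re.DOTALL lets '.' match newlines, so a sequence may span several lines;
# the lazy '.*?' stops at the first closing quote.
sequences = re.findall(r'"(.*?)"', sample, flags=re.DOTALL)

# Remove the line breaks inside each captured sequence
sequences = [''.join(seq.split()) for seq in sequences]
print(sequences)
```

With a real file you would get the text from f.read() rather than f.readlines(), so that the pattern can see across line boundaries.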
I have a file with one word per line, and I want to join the lines with a single space. I tried this, but it does not work:
for line in file:
    new = ' '.join(line)
    print(new)
This also does not work:
new = file.replace('\n', ' ')
print(new)
You can also use list comprehensions:
whole_string = " ".join([word.strip() for word in file])
print(whole_string)
You can add each line to a list, then join it up after:
L = []
for line in file:
    L.append(line.strip('\n'))
print(" ".join(L))
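Your current solution calls join on a single string (one line) rather than on a list of lines, so it puts a space between the characters of each line instead of between the lines.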
A one line solution to this problem would be the following:
print(open('thefile.txt').read().replace('\n', ' '))
I think this is what you want:
' '.join(l.strip() for l in file)
Yet another way:
with open('yourfilename.txt', 'r') as file:
    words = ' '.join(map(str.rstrip, file))
As you can see from several other answers, file is an iterator, so you can iterate over it and at each loop it will give you a line read from the file (including the \n at the end, that is why we're all stripping it off).
Logically speaking, map applies the given function (i.e. str.rstrip) to each line read in and the results are passed on to join.
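Here is a self-contained sketch of the difference, using io.StringIO as a stand-in for the open file (the three sample words are made up):

```python
import io

# Stand-in for a file with one word per line
file = io.StringIO("one\ntwo\nthree\n")

# Joining the stripped lines gives one space-separated string
words = ' '.join(map(str.rstrip, file))
print(words)

# Joining a single line treats the string as a sequence of characters,
# which is why the attempt in the question did not work
per_char = ' '.join("one\n")
print(repr(per_char))
```

The first join receives an iterable of lines; the second receives one string, so the separator ends up between individual characters, including the trailing newline.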
I want to add some letters to the beginning and end of each line using Python.
I found various methods of doing this; however, whichever method I use, the letters I want to add to the end are always added to the beginning.
input = open("input_file",'r')
output = open("output_file",'w')
for line in input:
    newline = "A" + line + "B"
    output.write(newline)
input.close()
output.close()
I have used various methods I found here. With each one of them, both letters are added to the front.
inserting characters at the start and end of a string
''.join(('L','yourstring','LL'))
or
yourstring = "L%sLL" % yourstring
or
yourstring = "L{0}LL".format(yourstring)
I'm clearly missing something here. What can I do?
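When reading lines from a file, Python leaves the \n on the end of each line, so the letter you append lands after the line break and shows up at the start of the next line. You can .rstrip the newline off and add it back after the suffix: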
yourstring = 'L{0}LL\n'.format(yourstring.rstrip('\n'))
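Applied to the loop from the question, that could look like the sketch below. The io.StringIO objects are stand-ins for the input and output files, and the two sample lines are made up.

```python
import io

# Stand-ins for the input and output files
source = io.StringIO("first\nsecond\n")
out = io.StringIO()

for line in source:
    # Drop the trailing newline, wrap the text, then add the newline back
    out.write('A{0}B\n'.format(line.rstrip('\n')))

result = out.getvalue()
print(result)

# What the original concatenation produces for one line:
broken = "A" + "first\n" + "B"
print(repr(broken))
```

In the broken version the B sits after the \n, so when the file is viewed it appears at the front of the following line, next to that line's A.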