I have this TSV file containing some paths of links; each link is separated by a ';', which I want to use.
In the example below we can see how the text in the file is separated,
and I only want to read through the last column, which is a path starting with '14th':
6a3701d319fc3754 1297740409 166 14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade NULL
3824310e536af032 1344753412 88 14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade 3
415612e93584d30e 1349298640 138 14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade
I want to somehow split the path into a chain like this:
['14th_century', 'Niger', 'Nigeria'....]
How do I read the file and remove the first 3 columns so that I only get the last one?
UPDATE:
I have tried this now:
import re
with open('test.tsv') as f:
    lines = f.readlines()
for line in lines[22:len(lines)]:
    re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
    e_line = line.split(' ')
    real_line = e_line[0]
    print real_line.split(';')
But the problem is that it is not deleting the first 3 columns?
If the separator between the first columns is only a single space and not a series of spaces or a tab, you could do this:
with open('file_name') as f:
    lines = f.readlines()
for line in lines:
    e_line = line.split(' ')
    real_line = e_line[3]
    print real_line.split(';')
Answer to your updated question.
But the problem is that it is not deleting the first 3 columns?
There are several mistakes.
Your code:
import re
with open('test.tsv') as f:
    lines = f.readlines()
for line in lines[22:len(lines)]:
    re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
    e_line = line.split(' ')
    real_line = e_line[0]
    print real_line.split(';')
This line does nothing...
re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
Because the re.sub function doesn't change your line variable; it returns the replaced string.
So you may want to do as below.
line = re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
And your regexp ^\s+ matches only a string which starts with whitespace or tabs, because you use ^.
But I think you just want to replace consecutive whitespace or tabs with one space.
So then, the above code will be as below (just remove the ^ in the regexp):
line = re.sub(r"\s+", " ", line, flags = re.MULTILINE)
Now, the fields in line are separated by just one space, so line.split(' ') will work as you want.
Next, e_line[0] returns the first element of e_line, which is the 1st column of the line.
But you want to skip the first 3 columns and get the 4th column. You can do it like this:
e_line = line.split(' ')
real_line = e_line[3]
OK. Now the entire code looks like this:
for line in lines:  # <--- I also changed this, because there is no need to skip the first 22 lines in your example.
    line = re.sub(r"\s+", " ", line)
    e_line = line.split(' ')
    real_line = e_line[3]
    print real_line
output:
14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade
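And if you still want each chain of links as a list, as in your original question, one more split on ';' (the same call you already had) finishes the job. A sketch of the whole loop, assuming your sample lines where nothing needs to be skipped:

import re

with open('test.tsv') as f:
    lines = f.readlines()

for line in lines:
    line = re.sub(r"\s+", " ", line)
    real_line = line.split(' ')[3]
    print(real_line.split(';'))
    # e.g. ['14th_century', 'Niger', 'Nigeria', 'British_Empire', 'Slavery',
    #       'Africa', 'Atlantic_slave_trade', 'African_slave_trade']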
P.S:
This line can become more pythonic.
before:
for line in lines[22:len(lines)]:
after:
for line in lines[22:]:
And you don't need to use flags = re.MULTILINE, because line is a single line inside the for-loop.
You don't need to use regex for this. The csv module can handle tab-separated files too:
import csv
filereader = csv.reader(open('test.tsv', 'rb'), delimiter='\t')
path_list = [row[3].split(';') for row in filereader]
print(path_list)
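On Python 3 the file has to be opened in text mode for csv; a variant sketch of the same idea (same file name and column index as above):

import csv

# Python 3 variant: open in text mode with newline='' and let csv split on the tabs.
with open('test.tsv', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    path_list = [row[3].split(';') for row in reader]

print(path_list)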
Related
I am using Visual Studio Code to replace text with Python.
I am using a source file with original text and converting it into a new file with new text.
I would like to add quotes to the new text that follows. For example:
Original text: set vlans xxx vlan-id xxx
New text: vlan xxx name "xxx" (add quotes to the remaining portion of the line as seen here)
Here is my code:
with open("SanitizedFinal_E4300.txt", "rt") as fin:
with open("output6.txt", "wt") as fout:
for line in fin:
line = line.replace('set vlans', 'vlan').replace('vlan-id', 'name')
fout.write(line)
Is there a way to add quotes for text in the line that follows 'name'?
Edit:
I tried this code:
with open("SanitizedFinal_E4300.txt", "rt") as fin:
with open("output6.txt", "wt") as fout:
for line in fin:
line = line.replace('set vlans', 'vlan').replace('vlan-id', 'name')
words = line.split()
words[-1] = '"' + words[-1] + '"'
line = ' '.join(words)
fout.write(line)
and received this error:
line 124, in <module>
words[-1] = '"' + words[-1] + '"'
IndexError: list index out of range
I also tried this code with no success:
with open("SanitizedFinal_E4300.txt", "rt") as fin:
with open("output6.txt", "wt") as fout:
for line in fin:
line = line.replace('set vlans', 'vlan').replace('vlan-id', 'name')
import re
t = 'set vlans xxx vlan-id xxx'
re.sub(r'set vlans(.*)vlan-id (.*)', r'vlan\1names "\2"', t)
'vlan xxx names "xxx"'
Again, my goal is to automatically add double quotes to the characters (vlan numbers) at the end of a line.
For example:
Original text: set protocols mstp configuration-name Building 2021.Rm402.access.mstp.zzz
Desired text: set protocols mstp configuration-name "Building 2021.Rm402.access.mstp.zzz"
Use the following regular expression:
>>> import re
>>> t = 'set vlans xxx vlan-id xxx'
>>> re.sub(r'set vlans(.*)vlan-id (.*)', r'vlan\1names "\2"', t)
'vlan xxx names "xxx"'
The parentheses in the search pattern (first parameter) are used to create groups that can be used in the replacement pattern (second parameter). So the first (.*) match in the search pattern will be included in the replacement pattern by means of \1; same thing goes with the second one.
Edit:
The code I shared is just an example of how to use regular expressions. Here's how you should use it.
import re

# whatever imports and code you have down to...
with open("SanitizedFinal_E4300.txt", "rt") as fin, open("output6.txt", "wt") as fout:
    for line in fin:
        line = re.sub(r'set vlans(.*)vlan-id (.*)', r'vlan\1names "\2"', line)
        fout.write(line)
IMPORTANT: if the format of the lines you need to modify is any different from the original text example you shared, you'll need to make adjustments to the regular expression.
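For instance, a hedged sketch of how the pattern might be adjusted for the configuration-name example from your edit (the pattern below is an assumption based on that single sample line):

import re

# Sample line taken from the question; the pattern only covers this exact format.
line = 'set protocols mstp configuration-name Building 2021.Rm402.access.mstp.zzz'
quoted = re.sub(r'^(set protocols mstp configuration-name )(.*)$', r'\1"\2"', line)
print(quoted)  # set protocols mstp configuration-name "Building 2021.Rm402.access.mstp.zzz"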
First, we split the text up into words by splitting them by whitespace (which is what split does by default).
Then, we take the last word, add quotes to it, and join it back together with a space between each word:
with open("SanitizedFinal_E4300.txt", "rt") as fin:
with open("output6.txt", "wt") as fout:
for line in fin:
line = line.replace('set vlans', 'vlan').replace('vlan-id', 'name')
words = line.split()
# print(words) # ['vlan', 'xxx', 'name', 'xxx']
if words: # if the line is empty, just output the empty line
words[-1] = '"' + words[-1] + '"'
line = ' '.join(words)
# print(line) # vlan xxx name "xxx"
fout.write(line)
WARNING: in your question, you say you'd like the output to be vlan xxx name "xxx" which has two spaces after the first xxx. This result would only have one space between each word.
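If preserving the original spacing matters, quoting only the final token with a regex is one option. A minimal sketch (the sample line is invented to show uneven spacing):

import re

line = 'vlan xxx  name xxx\n'                   # made-up line with uneven spacing
line = re.sub(r'(\S+)(\s*)$', r'"\1"\2', line)  # wrap only the last non-space run in quotes
print(repr(line))                               # 'vlan xxx  name "xxx"\n'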
I've learned that we can easily remove blank lines in a file, or remove blanks from each string line, but how about removing all blanks at the end of each line in a file?
One way would be to process each line of the file, like:
with open(file) as f:
    for line in f:
        store line.strip()
Is this the only way to complete the task?
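For instance, a minimal sketch of that line-by-line idea, writing the result to a second file (the file names here are placeholders):

# Strip trailing blanks line by line and write the cleaned lines to a new file.
with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        dst.write(line.rstrip() + '\n')  # drop trailing whitespace, keep the line break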
Possibly the ugliest implementation possible, but here's what I just scratched up :0
def strip_str(string):
    last_ind = 0
    split_string = string.split(' ')
    for ind, word in enumerate(split_string):
        if word == '\n':
            return ''.join([split_string[0]] + [' {} '.format(x) for x in split_string[1:last_ind]])
        last_ind += 1
Don't know if these count as different ways of accomplishing the task. The first is really just a variation on what you have. The second does the whole file at once, rather than line-by-line.
A map that calls the 'rstrip' method on each line of the file:
import operator

with open(filename) as f:
    # basically the same as (line.rstrip() for line in f)
    for line in map(operator.methodcaller('rstrip'), f):
        ...  # do something with the line
Read the whole file and use re.sub():
import re

with open(filename) as f:
    text = f.read()
    text = re.sub(r"\s+(?=\n)", "", text)
If you just want to remove spaces, another solution would be:
line.replace(" ", "")
This strips every space character in the line, though, not only the ones at the end.
I want to replace a line in a file, but my code doesn't do what I want: it doesn't change that line. It seems that the problem is the space between the ALS and 4277 characters in input.txt. I need to keep that space in the file. How can I fix my code?
A part of input.txt:
ALS 4277
Related part of the code:
for lines in fileinput.input('input.txt', inplace=True):
    print(lines.rstrip().replace("ALS"+str(4277), "KLM" + str(4945)))
Desired output:
KLM 4945
Using the same idea that other users have already pointed out, you could also reproduce the same spacing by first matching the spacing and saving it in a variable (spacing in my code):
import re

with open('input.txt') as f:
    lines = f.read()

match = re.match(r'ALS(\s+)4277', lines)
if match is not None:
    spacing = match.group(1)
    lines = re.sub(r'ALS\s+4277', 'KLM%s4945' % spacing, lines.rstrip())
print(lines)
As the spaces vary you will need to use regex to account for the spaces.
import re
lines = "ALS 4277 "
line = re.sub(r"(ALS\s+4277)", "KLM 4945", lines.rstrip())
print(line)
Try:
with open('input.txt') as f:
    for line in f:
        a, b = line.strip().split()
        if a == 'ALS' and b == '4277':
            line = line.replace(a, 'KLM').replace(b, '4945')
        print(line, end='')  # as line has '\n'
I have a file with some lines. Out of those lines I will choose only the lines which start with xxx. Now the lines which start with xxx have a pattern as follows:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string that is in the first double quotes,
i.e., "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
f = f.readlines()
for line in f:
line=line.rstrip()
for phrase in 'xxx:':
if re.match('^xxx:',line):
c=line
break
This code is giving me an error.
Your code is wrongly indented. Your f = f.readlines() has 9 spaces in front while for line in f: has 4 spaces. It should look like the code below.
import re

list_of_prefixes = ["xxx", "aaa"]
resulting_list = []

with open("raw.txt", "r") as f:
    f = f.readlines()
    for line in f:
        line = line.rstrip()
        for phrase in list_of_prefixes:
            if re.match(phrase + r':\(\d+:\"(\w+)', line) != None:
                resulting_list.append(re.findall(phrase + r':\(\d+:\"(\w+)', line)[0])
Well you are heading in the right direction.
If the input is this simple, you can use regex groups.
with open("log.txt","r") as f:
f = f.readlines()
for line in f:
line=line.rstrip()
m = re.match('^xxx:\(\d*:("[^"]*")',line)
if m is not None:
print(m.group(1))
All the magic is in the regular expression.
^xxx:(\d*:("[^"]*") means
Start from the beginning of the line, match on "xxx:(<any number of numbers>:"<anything but ">"
and because the sequence "<anything but ">" is enclosed in round brackets it will be available as a group (by calling m.group(1)).
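A quick check of that pattern against the two sample lines from the question (note that the captured group keeps the surrounding quotes):

import re

for line in ['xxx:(12:"pqrs",223,"rst",-90)', 'xxx:(23:"abc",111,"def",-80)']:
    m = re.match(r'^xxx:\(\d*:("[^"]*")', line)
    if m is not None:
        print(m.group(1))
# "pqrs"
# "abc"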
PS: next time make sure to include the exact error you are getting
results = []
with open("log.txt", "r") as f:
    f = f.readlines()
    for line in f:
        if line.startswith("xxx"):
            line = line.split(":")                # line[2] will be what is after the second :
            result = line[2].split(",")[0][1:-1]  # will be pqrs
            results.append(result)
You want to look for lines that start with xxx,
then split the line on the :. The part after the second : is what you want -- up to the comma. Then your result is that string with the quotes removed. There is no need for regex. Python string functions will be fine.
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
with open("file") as f:
for line in f:
if line.startswith('xxx'):
print(re.search(r'"(.*?)"', line).group(1))
re module docs
I am attempting to pull out multiple (50-100) sequences from a large .txt file, separated by new lines ('\n'). The sequence is a few lines long but not always the same length, so I can't just print lines x-y. The sequences end with " and the next line always starts with the same word, so maybe that could be used as a keyword.
I am writing this using Python 3.3.
This is what I have so far:
searchfile = open('filename.txt', 'r')
cache = []
for line in searchfile:
    cache.append(line)

for line in range(len(cache)):
    if "keyword1" in cache[line].lower():
        print(cache[line+5])
This pulls out the starting line (which is always 5 lines below the keyword line), however it only pulls out this line.
How do I print the whole sequence?
Thank you for your help.
EDIT 1:
Current output = ABCDCECECCECECE ...
Desired output = ABCBDBEBSOSO ...
ABCBDBDBDBDD ...
continued until " or new line
EDIT 2:
Text file looks like this:
Name (keyword):
Date
Address1
Address2
Sex
Response"................................"
Y/N
The sequence between the " and " is what I need
TL;DR - How do I print from line + 5 to end when end = keyword
Not sure if I understand your sequence data, but if you're searching for each 'keyword' and then the next " char, the following should work:
keyword_pos = []
endseq_pos = []

for line in range(len(cache)):
    if 'keyword1' in cache[line].lower():
        keyword_pos.append(line)
    if '"' in cache[line]:
        endseq_pos.append(line)

for key in keyword_pos:
    for endseq in endseq_pos:
        if endseq > key:
            print(cache[key:endseq])
            break
This simply compiles a list of all the positions of all the keywords and " characters and then matches the two and prints all the lines in between.
Hope that helps.
I agree with @Michal Frystacky that regex is the way forward. However, as I now understand the problem, we need two searches: one for the 'keyword', then again 5 lines on to find the 'sequence'.
This should work but may need the regex to be tweaked:
import re

with open('yourfile.txt') as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        # first search for keyword
        key_match = re.search(r'\((keyword)', line)
        if key_match:
            # if successful, search 5 lines on for the string between the quotation marks
            seq_match = re.search(r'"([A-Z]*)"', lines[i+5])
            if seq_match:
                print(key_match.group(1) + ' ' + seq_match.group(1))
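A quick check of that sketch against made-up lines following the layout from EDIT 2 (the sequence string is invented):

import re

lines = ['Name (keyword):\n', 'Date\n', 'Address1\n', 'Address2\n', 'Sex\n',
         'Response"ABCBDBEBSOSO"\n', 'Y/N\n']
for i, line in enumerate(lines):
    key_match = re.search(r'\((keyword)', line)
    if key_match:
        seq_match = re.search(r'"([A-Z]*)"', lines[i+5])
        if seq_match:
            print(key_match.group(1) + ' ' + seq_match.group(1))
# keyword ABCBDBEBSOSO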
This can be done rather simply with regex:
import re

lines = 'Name (keyword):', 'Date', 'Address1', 'Address2', 'Sex', 'Response"................................" '
for line in lines:
    match = re.search(r'.*?"(.*?)"', line)
    if match:
        print(match.group(1))
Eventually, to use this sample code, we would take lines = f.readlines() from the dataset. It's important to note that we catch only things between " and another "; if there is no " mark at the end, we will miss this data, but accounting for that isn't too difficult.
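To account for a missing closing quote, a small sketch (sample strings invented from the EDIT 1 output) can make the closing " optional while excluding quote characters from the capture:

import re

samples = ['Response"ABCBDBEBSOSO"', 'Response"ABCBDBDBDBDD']
for s in samples:
    match = re.search(r'"([^"]*)"?', s)  # [^"]* never swallows a quote, so the trailing "? is safe to leave optional
    if match:
        print(match.group(1))
# ABCBDBEBSOSO
# ABCBDBDBDBDD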