I have a file with data like this:
>1_DL_2021.1123
>2_DL_2021.1206
>3_DL_2021.1202
>3_DL_2021.1214
>4_DL_2021.1214
>4_DL_2021.1214
>6_DL_2021.1214
>7_DL_2021.1214
>8_DL_2021.1214
As you can see, the data is not numbered properly and needs to be renumbered.
What I'm aiming for is this:
>1_DL_2021.1123
>2_DL_2021.1206
>3_DL_2021.1202
>4_DL_2021.1214
>5_DL_2021.1214
>6_DL_2021.1214
>7_DL_2021.1214
>8_DL_2021.1214
>9_DL_2021.1214
The file has a lot of other content between these lines starting with the > sign; I want only the lines starting with > to be affected.
Could someone please help me out with this?
There are 563 such lines, so doing it manually is out of the question.
So, assuming the input data file is "input.txt", you can achieve what you want with this:
import re

with open("input.txt", "r") as f:
    a = f.readlines()

regex = re.compile(r"^>\d+_DL_2021\.\d+\n$")
counter = 1
for i, line in enumerate(a):
    if regex.match(line):
        tokens = line.split("_")
        tokens[0] = f">{counter}"
        a[i] = "_".join(tokens)
        counter += 1

with open("input.txt", "w") as f:
    f.writelines(a)
What it does: it searches for lines matching the regex ^>\d+_DL_2021\.\d+\n$, splits each match on _, rewrites the first (0th) element with the current counter, then increments the counter by 1 and continues. After processing all lines, it writes the updated strings back to "input.txt".
sudden_appearance already provided a good answer.
In case you don't like regex too much, you can use this code instead:
new_lines = []
with open('test_file.txt', 'r') as f:
    c = 1
    for line in f:
        if line[0] == '>':
            after_dash = line.split('_', 1)[1]
            new_line = '>' + str(c) + '_' + after_dash
            c += 1
            new_lines.append(new_line)
        else:
            new_lines.append(line)

with open('test_file.txt', 'w') as f:
    f.writelines(new_lines)
Also you can have a look at this split tutorial for more information about how to use split.
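As a quick illustration of how split behaves with a maxsplit argument on one of these header lines (the numeric value below is made up):

```python
# Split a header line on the first underscore only;
# maxsplit=1 keeps the remainder of the line intact.
line = ">12_DL_2021.1214"
prefix, rest = line.split("_", 1)
print(prefix)  # >12
print(rest)    # DL_2021.1214
```

This is why the code above can rebuild the line as '>' + str(c) + '_' + after_dash without touching anything after the first underscore.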
I have this text file and let's say it contains 10 lines.
Bye
Hi
2
3
4
5
Hi
Bye
7
Hi
Every time it says "Hi" and "Bye" I want it to be removed except for the first time it was said.
My current code is (yes, filename actually points to a file; I just didn't include it here):
text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            del line
text_file.close()
It does detect the duplicates, but it takes a very long time given the number of lines, and I'm not sure how to actually delete them and save the result.
You could use dict.fromkeys to remove duplicates and preserve order efficiently:
with open(filename, "r") as f:
    lines = dict.fromkeys(f.readlines())

with open(filename, "w") as f:
    f.writelines(lines)
Idea from Raymond Hettinger
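To see why this works: dict keys are unique, and since Python 3.7 dicts preserve insertion order, so dict.fromkeys acts as an order-preserving de-duplicator that keeps the first occurrence of each line. A minimal sketch with made-up values:

```python
# dict keys are unique and (since Python 3.7) keep insertion order,
# so fromkeys de-duplicates while preserving first occurrences.
items = ["Hi", "Bye", "2", "Hi", "Bye", "3"]
deduped = list(dict.fromkeys(items))
print(deduped)  # ['Hi', 'Bye', '2', '3']
```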
Using a set & some basic filtering logic:
with open('test.txt') as f:
    seen = set()  # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen:  # if not seen already, keep the line
            deduped.append(line)
            seen.add(line)

# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])
I'm trying to analyze a text file with data laid out in columns and records.
My file:
Name Surname Age Sex Grade
Chris M. 14 M 4
Adam A. 17 M
Jack O. M 8
The text file has some empty fields, as shown above.
The user wants to show Name and Grade:
import csv

with open('launchlog.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split() for line in stripped if line)
    with open('log.txt', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Name', 'Surname', 'Age', 'Sex', 'Grade'))
        writer.writerows(lines)
log.txt :
Chris,M.,14,M,4
Adam,A.,17,M
Jack,O.,M,8
How can I insert the string "None" for the empty fields?
For example:
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
What would be the best way to do this in Python?
Use pandas:
import pandas
data = pandas.read_fwf("file.txt")
To get your dictionary:
data.set_index("Name")["Grade"].to_dict()
Here's something in Pure Python™ that seems to do what you want, at least on the sample data file in your question.
In a nutshell, it first determines where each field name in the column header line starts and ends. Then, for each of the remaining lines of the file, it does the same thing to get a second list of positions, which it uses to determine which column each data item falls under (and therefore its proper position in the row written to the output file).
import csv

def find_words(line):
    """ Return a list of (start, stop) tuples with the indices of the
        first and last characters of each "word" in the given string.
        Any sequence of consecutive non-space characters is considered
        as comprising a word.
    """
    line_len = len(line)
    indices = []
    i = 0
    while i < line_len:
        start, count = i, 0
        while line[i] != ' ':
            count += 1
            i += 1
            if i >= line_len:
                break
        indices.append((start, start+count-1))
        while i < line_len and line[i] == ' ':  # advance to start of next word
            i += 1
    return indices
# convert text file with missing fields to csv
with open('name_grades.txt', 'rt') as in_file, open('log.csv', 'wt', newline='') as out_file:
    writer = csv.writer(out_file)
    header = next(in_file)  # read first line
    fields = header.split()
    writer.writerow(fields)
    # determine the indices of where each field starts and stops based on header line
    field_positions = find_words(header)
    for line in in_file:
        line = line.rstrip('\r\n')  # remove trailing newline
        row = ['None' for _ in range(len(fields))]
        value_positions = find_words(line)
        for (vstart, vstop) in value_positions:
            # determine what field the value is underneath
            for i, (hstart, hstop) in enumerate(field_positions):
                if vstart <= hstop and hstart <= vstop:  # overlap?
                    row[i] = line[vstart:vstop+1]
                    break  # stop looking
        writer.writerow(row)
Here's the contents of the log.csv file it created:
Name,Surname,Age,Sex,Grade
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
I would use baloo's answer over mine, but if you just want to get a feel for where your code went wrong, the solution below mostly works (there is a formatting issue with the Grade field, but I'm sure you can get through that). Add some print statements to your code and to mine and you should be able to pick up the differences.
import csv
<Old Code removed in favor of new code below>
EDIT: I see your difficulty now. Please try the below code; I'm out of time today so you will have to fill in the writer parts where the print statement is, but this will fulfill your request to replace empty fields with None.
import csv

with open('Test.txt', 'r') as in_file:
    with open('log.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        lines = [line for line in in_file]
        name_and_grade = dict()
        for line in lines[1:]:
            parts = line[0:10], line[11:19], line[20:24], line[25:31], line[32:]
            new_line = list()
            for part in parts:
                val = part.replace('\n', '')
                val = val.strip()
                val = val if val != '' else 'None'
                new_line.append(val)
            print(new_line)
Without using pandas:
Edited based on your comment: I hard-coded this solution for your data, so it will not work for rows that don't have the Surname column.
I'm writing out only Name and Grade, since you only need those two columns.
o = open("out.txt", 'w')
with open("inFIle.txt") as f:
    for lines in f:
        lines = lines.strip("\n").split(",")
        try:
            grade = int(lines[-1])
            if (lines[-2][-1]) != '.':
                o.write(lines[0] + "," + str(grade) + "\n")
        except ValueError:
            print(lines)
o.close()
I have a text file with the contents below. I want to split this file into multiple files (1.txt, 2.txt, 3.txt, ...), where each new output file looks like the blocks shown below. The code I tried doesn't split the input file properly. How can I split the input file into multiple files?
My code:
#!/usr/bin/python
with open("input.txt", "r") as f:
    a1 = []
    a2 = []
    a3 = []
    for line in f:
        if not line.strip() or line.startswith('A') or line.startswith('$$'):
            continue
        row = line.split()
        a1.append(str(row[0]))
        a2.append(float(row[1]))
        a3.append(float(row[2]))
f = open('1.txt', 'a')
f = open('2.txt', 'a')
f = open('3.txt', 'a')
f.write(str(a1))
f.close()
Input file:
A
x
k
..
$$
A
z
m
..
$$
A
B
l
..
$$
Desired output 1.txt
A
x
k
..
$$
Desired output 2.txt
A
z
m
..
$$
Desired output 3.txt
A
B
l
..
$$
Read your input file and write out a new output file each time you find a "$$", increasing the output-file counter as you go:
with open("input.txt", "r") as f:
    buff = []
    i = 1
    for line in f:
        if line.strip():  # skips the empty lines
            buff.append(line)
        if line.strip() == "$$":
            output = open('%d.txt' % i, 'w')
            output.write(''.join(buff))
            output.close()
            i += 1
            buff = []  # buffer reset
EDIT: should be efficient too https://wiki.python.org/moin/PythonSpeed/PerformanceTips#String_Concatenation
Try the re.findall() function:
import re

with open('input.txt', 'r') as f:
    data = f.read()

found = re.findall(r'\n*(A.*?\n\$\$)\n*', data, re.M | re.S)
[open(str(i) + '.txt', 'w').write(found[i-1]) for i in range(1, len(found)+1)]
Minimalistic approach for the first 3 occurrences:
import re
found = re.findall(r'\n*(A.*?\n\$\$)\n*', open('input.txt', 'r').read(), re.M | re.S)
[open(str(found.index(f)+1)+'.txt', 'w').write(f) for f in found[:3]]
Some explanations:
found = re.findall(r'\n*(A.*?\n\$\$)\n*', data, re.M | re.S)
will find all occurrences matching the specified RegEx and will put them into the list, called found
[open(str(found.index(f)+1)+'.txt', 'w').write(f) for f in found]
iterates (using a list comprehension) through all elements of the found list and, for each element, creates a text file (named after the element's index plus 1, with a .txt extension) and writes that element (occurrence) to the file.
Another version, without RegEx's:
blocks_to_read = 3
blk_begin = 'A'
blk_end = '$$'
with open('35916503.txt', 'r') as f:
fn = 1
data = []
write_block = False
for line in f:
if fn > blocks_to_read:
break
line = line.strip()
if line == blk_begin:
write_block = True
if write_block:
data.append(line)
if line == blk_end:
write_block = False
with open(str(fn) + '.txt', 'w') as fout:
fout.write('\n'.join(data))
data = []
fn += 1
PS: I personally don't like this version and would use the RegEx one.
Open 1.txt at the beginning for writing. Write each line to the current output file. Additionally, if line.strip() == '$$', close the old file and open a new one for writing.
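That description can be sketched as follows (the sample input written at the top is made up to make the snippet self-contained):

```python
# Create a small sample input: two blocks, each terminated by "$$".
sample = "A\nx\nk\n..\n$$\nA\nz\nm\n..\n$$\n"
with open("input.txt", "w") as f:
    f.write(sample)

# Open the first output file, then roll over to a new one after each "$$" line.
n = 1
out = open("1.txt", "w")
with open("input.txt") as f:
    for line in f:
        out.write(line)            # write every line to the current output
        if line.strip() == "$$":   # block terminator: start the next file
            out.close()
            n += 1
            out = open("%d.txt" % n, "w")
out.close()  # the final file opened after the last "$$" is empty
```

Note that one empty trailing file is created when the input ends with "$$"; you could delete it afterwards or only open a new file lazily on the next write.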
The blocks are divided by empty lines. Try this:
import sys

lines = [line for line in sys.stdin.readlines()]
i = 1
o = open("{}.txt".format(i), "w")
for line in lines:
    if len(line.strip()) == 0:
        o.close()
        i = i + 1
        o = open("{}.txt".format(i), "w")
    else:
        o.write(line)
It looks to me like the condition you should be checking for is a line that contains just the newline (\n) character. When you encounter such a line, write the contents of the parsed file so far, close the file, and open another one for writing.
A very easy way, if you want to split it into 2 files for example:
with open("myInputFile.txt", 'r') as file:
    lines = file.readlines()

with open("OutputFile1.txt", 'w') as file:
    for line in lines[:int(len(lines)/2)]:
        file.write(line)

with open("OutputFile2.txt", 'w') as file:
    for line in lines[int(len(lines)/2):]:
        file.write(line)
Making that dynamic would be:
with open("inputFile.txt", 'r') as file:
    lines = file.readlines()

Batch = 10
end = 0
for i in range(1, Batch + 1):
    if i == 1:
        start = 0
    increase = int(len(lines)/Batch)
    end = end + increase
    with open("splitText_" + str(i) + ".txt", 'w') as file:
        for line in lines[start:end]:
            file.write(line)
    start = end
I have a file which contains following row:
//hva_SaastonJakaumanMuutos/printData/reallocationAssignment/changeUser/firstName>
I want to add "John" at the end of line.
I have written the following code, but for some reason it is not working:
def add_text_to_file(self, file, rowTitle, inputText):
    f = open("check_files/" + file + ".txt", "r")
    fileList = list(f)
    f.close()
    j = 0
    for row in fileList:
        if fileList[j].find(rowTitle) > 0:
            fileList[j] = fileList[j].replace("\n", "") + inputText + "\n"
            break
        j = j + 1
    f = open("check_files/" + file + ".txt", "w")
    f.writelines(fileList)
    f.close()
Do you see where I am going wrong?
str.find may return 0 if the text you are searching for is found at the beginning. After all, it returns the index where the match begins.
So your condition should be:
if fileList[j].find(rowTitle) >= 0 :
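For instance, the difference matters precisely when the title appears at the very start of a line, as in the question's data:

```python
# find returns the index of the match: 0 means "found at the very start",
# and only -1 means "not found" -- so the test `> 0` wrongly rejects
# matches at position 0.
line = "//hva_SaastonJakaumanMuutos/printData"
print(line.find("//hva"))    # 0
print(line.find("missing"))  # -1
```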
Edit:
The correction above will save the day, but it's better if you do things the right way, the Pythonic way.
If you are looking for a substring in a text, you can use the foo in bar comparison. It will be True if foo can be found in bar and False otherwise.
You rarely need a counter in Python. The enumerate built-in is your friend here.
You can combine the iteration and writing and eliminate an unnecessary step.
strip or rstrip is better than replace in your case.
For Python 2.6+, it is better to use the with statement when dealing with files; it handles closing the file the right way. For Python 2.5, you need from __future__ import with_statement.
Refer to PEP8 for commonly preferred naming conventions.
Here is a cleaned up version:
def add_text_to_file(self, file, row_title, input_text):
    with open("check_files/" + file + ".txt", "r") as infile:
        file_list = infile.readlines()
    with open("check_files/" + file + ".txt", "w") as outfile:
        for row in file_list:
            if row_title in row:
                row = row.rstrip() + input_text + "\n"
            outfile.write(row)
You are not giving much information, so even though I wouldn't use the following code (I'm sure there are better ways), it might help clear up your problem.
import os.path

def add_text_to_file(self, filename, row_title, input_text):
    # filename should have the .txt extension in it
    filepath = os.path.join("check_files", filename)
    with open(filepath, "r") as f:
        content = f.readlines()
    for j in range(len(content)):
        if row_title in content[j]:
            content[j] = content[j].strip() + input_text + "\n"
            break
    with open(filepath, "w") as f:
        f.writelines(content)
I'm having a bit of a rough time laying out how I would count certain elements within a text file using Python. I'm a few months into Python and I'm familiar with the following functions;
raw_input
open
split
len
print
rsplit()
Here's my code so far:
fname = "feed.txt"
fname = open('feed.txt', 'r')
num_lines = 0
num_words = 0
num_chars = 0
for line in feed:
    lines = line.split('\n')
At this point I'm not sure what to do next. I feel the most logical way to approach it would be to first count the lines, count the words within each line, and then count the number of characters within each word. But one of the issues I ran into was trying to perform all of the necessary functions at once, without having to re-open the file to perform each function separately.
Try this:
fname = "feed.txt"
num_lines = 0
num_words = 0
num_chars = 0
with open(fname, 'r') as f:
    for line in f:
        words = line.split()
        num_lines += 1
        num_words += len(words)
        num_chars += len(line)
Back to your code:
fname = "feed.txt"
fname = open('feed.txt', 'r')
What's the point of this? fname is a string first and then a file object. You don't really use the string defined in the first line, and you should use one variable for one thing only: either a string or a file object.
for line in feed:
lines = line.split('\n')
line is one line from the file. It does not make sense to split('\n') it.
Functions that might be helpful:
open("file").read() which reads the contents of the whole file at once
'string'.splitlines() which splits the contents into a list of lines (without the trailing newline characters)
By using len() and those functions you could accomplish what you're doing.
fname = "feed.txt"
with open(fname, 'r') as feed:
    lines = feed.read().splitlines()

num_lines = len(lines)
num_words = 0
num_chars = 0
for line in lines:
    num_words += len(line.split())
    num_chars += len(line)
file__IO = input('\nEnter file name here to analyze with path:: ')
with open(file__IO, 'r') as f:
    data = f.read()
    line = data.splitlines()
    words = data.split()
    spaces = data.split(" ")
    charc = len(data) - len(spaces)
    print('\n Line number ::', len(line), '\n Words number ::', len(words),
          '\n Spaces ::', len(spaces), '\n Characters ::', charc)
I tried this code & it works as expected.
Here is one approach I like, though it may only be good for small files:
import re

with open(fileName, 'r') as content_file:
    content = content_file.read()

lineCount = len(re.split("\n", content))
words = re.split(r"\W+", content.lower())
To count words, there are two ways. If you don't care about repetition, you can just do:
words_count = len(words)
if you want the counts of each word you can just do
import collections

words_count = collections.Counter(words)  # count the occurrence of each word
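For example, to look up a word's count or find the most frequent words (the sample word list below is made up):

```python
import collections

# Count occurrences of each word in a small made-up sample.
words = ["hi", "bye", "hi", "hi", "two"]
words_count = collections.Counter(words)
print(words_count["hi"])           # 3
print(words_count.most_common(1))  # [('hi', 3)]
```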