Python prints two lines on the same line when merging files

I am new to Python, and I'm getting this result without being sure how to fix it efficiently.
I have n files (for simplicity, say just two) with some info in this format:
1.250484649 4.00E-02
2.173737246 4.06E-02
... ...
This continues for up to m lines. I'm trying to append all the m lines from the n files into a single file. I prepared this code:
import glob

outfile = open('temp.txt', 'w')
for inputs in glob.glob('*.dat'):
    infile = open(inputs, 'r')
    for row in infile:
        outfile.write(row)
It reads all the .dat files (the ones I am interested in) and does what I want, but it merges the last line of the first file and the first line of the second file into a single line:
1.250484649 4.00E-02
2.173737246 4.06E-02
3.270379524 2.94E-02
3.319202217 6.56E-02
4.228424345 8.91E-03
4.335169497 1.81E-02
4.557886098 6.51E-02
5.111075901 1.50E-02
5.547288248 3.34E-02
5.685118615 3.22E-03
5.923718239 2.86E-02
6.30299944 8.05E-03
6.528018125 1.25E-020.704223685 4.98E-03
1.961058114 3.07E-03
... ...
I'd like to fix this in a smart way. I could fix it by introducing a blank line between each data line and then removing all the blank lines at the end, but that seems suboptimal.
Thank you!

There's no newline on the last line of each .dat file, so you'll need to add it:
import glob

with open('temp.txt', 'w') as outfile:
    for inputs in glob.glob('*.dat'):
        with open(inputs, 'r') as infile:
            for row in infile:
                if not row.endswith("\n"):
                    row = f"{row}\n"
                outfile.write(row)
This also uses with (context managers) to close the files automatically afterwards.
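For illustration, a with block is roughly equivalent to closing the file by hand in a try/finally (a minimal sketch, not part of the original answer):

f = open('temp.txt', 'w')
try:
    f.write('some data\n')
finally:
    f.close()  # runs even if the write raises an exception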
To avoid a trailing newline, there are a few ways to do this, but the simplest one that comes to mind is to load all the input data into memory as individual lines, then write it out in one go using "\n".join(lines). This puts "\n" between each pair of lines but not at the end of the last line in the file.
import glob

lines = []
for inputs in glob.glob('*.dat'):
    with open(inputs, 'r') as infile:
        lines += [line.rstrip('\n') for line in infile.readlines()]

with open('temp.txt', 'w') as outfile:
    outfile.write('\n'.join(lines))
The expression [line.rstrip('\n') for line in infile.readlines()] is a list comprehension. It makes a list of the lines in one input file, with the '\n' removed from the end of each line, which can then be appended to the overall list of lines with +=.
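For illustration, the comprehension is equivalent to this explicit loop (a sketch; 'example.dat' is a placeholder filename):

with open('example.dat', 'r') as infile:
    stripped = []  # ends up identical to the list comprehension's result
    for line in infile.readlines():
        stripped.append(line.rstrip('\n'))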
While we're here - let's use logging to give status updates:
import glob
import logging

# Without this, logging.info() messages are suppressed (the default level is WARNING).
logging.basicConfig(level=logging.INFO)

OUT_FILENAME = 'test.txt'

lines = []
for inputs in glob.glob('*.dat'):
    logging.info(f'Opening {inputs} to read...')
    with open(inputs, 'r') as infile:
        lines += [line.rstrip('\n') for line in infile.readlines()]
    logging.info(f'Finished reading {inputs}')

logging.info(f'Opening {OUT_FILENAME} to write...')
with open(OUT_FILENAME, 'w') as outfile:
    outfile.write('\n'.join(lines))
logging.info(f'Finished writing {OUT_FILENAME}')

Related

Adding a comma to end of first row of csv files within a directory using python

I've got some code that lets me open all the CSV files in a directory and run through them, removing the top two lines of each file. Ideally, during this process, I would also like it to add a single comma at the end of the new first line (what would originally have been line 3).
Another possible approach could be to remove the trailing commas on all the other rows that appear in each of the CSVs.
Any thoughts or approaches would be gratefully received.
import glob

path = 'P:\pytest'
for filename in glob.iglob(path + '/*.csv'):
    with open(filename, 'r') as f:
        lines = f.read().split("\n")
        f.close()
    if len(lines) >= 1:
        lines = lines[2:]
    o = open(filename, 'w')
    for line in lines:
        o.write(line + '\n')
    o.close()
Adding a counter in there can solve this:
import glob

path = r'C:/Users/dsqallihoussaini/Desktop/dev_projects/stack_over_flow'
for filename in glob.iglob(path + '/*.csv'):
    with open(filename, 'r') as f:
        lines = f.read().split("\n")
        print(lines)
        f.close()
    if len(lines) >= 1:
        lines = lines[2:]
    o = open(filename, 'w')
    counter = 0
    for line in lines:
        counter = counter + 1
        if counter == 1:
            o.write(line + ',\n')
        else:
            o.write(line + '\n')
    o.close()
One possible problem with your code is that you are reading the whole file into memory, which might be fine. If you are reading larger files, then you want to process the file line by line.
The easiest way to do that is to use the fileinput module: https://docs.python.org/3/library/fileinput.html
Something like the following should work:
#!/usr/bin/env python3

import glob
import fileinput

# inplace=True makes a backup of the file, then any output to stdout is written
# to the current file.
# Change the glob below; it is just an example.
#
# Iterate through each file in the glob.iglob() results.
with fileinput.input(files=glob.iglob('*.csv'), inplace=True) as f:
    for line in f:  # Iterate over each line of the current file.
        if f.filelineno() > 2:  # Skip the first two lines.
            # Note: 'line' has the newline in it.
            # Insert the comma if this is line 3 of the file; otherwise output the original line.
            print(line[:-1] + ',') if f.filelineno() == 3 else print(line, end="")
I've added an encoding as well, since mine was throwing an error, but specifying the encoding fixed that up nicely:
import glob

path = r'C:/whateveryourfolderis'
for filename in glob.iglob(path + '/*.csv'):
    with open(filename, 'r', encoding='utf-8') as f:
        lines = f.read().split("\n")
        #print(lines)
        f.close()
    if len(lines) >= 1:
        lines = lines[2:]
    o = open(filename, 'w', encoding='utf-8')
    counter = 0
    for line in lines:
        counter = counter + 1
        if counter == 1:
            o.write(line + ',\n')
        else:
            o.write(line + '\n')
    o.close()

How do I split each line into two strings and print without the comma?

I'm trying to get the output without commas, separating each line into two strings and printing them.
My code so far yields:
173,70
134,63
122,61
140,68
201,75
222,78
183,71
144,69
But I'd like it to print without the comma, with the two values on each line separated as strings.
if __name__ == '__main__':
    # Complete main section of code
    file_name = "data.txt"
    # Open the file for reading here
    my_file = open('data.txt')
    lines = my_file.read()
    with open('data.txt') as f:
        for line in f:
            lines.split()
            lines.replace(',', ' ')
    print(lines)
In your sample code, lines contains the full content of the file as a str:
my_file = open('data.txt')
lines = my_file.read()
You then later re-open the file to iterate the lines:
with open('data.txt') as f:
    for line in f:
        lines.split()
        lines.replace(',', ' ')
Note, however, that str.split and str.replace do not modify the existing value, since strs in Python are immutable. Also note that you are operating on lines there, rather than on the for-loop variable line.
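To illustrate the immutability point (a small sketch, not part of the original answer):

s = "173,70"
s.replace(",", " ")  # returns a new string; s itself is unchanged
print(s)             # prints: 173,70
t = s.replace(",", " ")
print(t)             # prints: 173 70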
Instead, you'll need to assign the result of those functions to new values, or give them as arguments (e.g., to print). So you'll want to open the file, iterate over the lines, and print each value with the "," replaced with a " ":
with open("data.txt") as f:
    for line in f:
        print(line.replace(",", " "))
Or, since you are operating on the whole file anyway:
with open("data.txt") as f:
    print(f.read().replace(",", " "))
Or, as your file appears to be CSV content, you may wish to use the csv module from the standard library instead:
import csv

with open("data.txt", newline="") as csvfile:
    for row in csv.reader(csvfile):
        print(*row)
with open('data.txt', 'r') as f:
    for line in f:
        for value in line.split(','):
            print(value)
While Python offers several ways to open files, this is the preferred one for working with them: the file is read lazily (especially preferable for large files), and after exiting the with scope (the indented block) the file handle is closed automatically by the system.
Here we open the file in read mode. File objects follow the iterator protocol, so we can iterate over them like lists; each item is a true line of the file, as a string.
After getting the line in the line variable, we split it (see str.split()) into two tokens, one before the comma and one after it. split returns a newly constructed list of strings. If you need to omit some unwanted characters, you can use the str.strip() method; strip and split are usually combined.
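For example, combining the two (a small illustrative sketch, not from the original answer):

line = " 173,70\n"
parts = line.strip().split(",")  # strip surrounding whitespace first, then split on the comma
print(parts)                     # prints: ['173', '70']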
elegant and efficient file reading - method 1
with open("data.txt", 'r') as io:
    for line in io:
        sl = line.split(',')  # sl is now a list of strings (split the line, not the file object)
        print("{} {}".format(sl[0], sl[1]))  # use str.format to print the results on the screen
non elegant, but efficient file reading - method 2
fp = open("data.txt", 'r')
# Assignment inside the loop condition needs the := operator (Python 3.8+).
while (line := fp.readline()) != '':  # readline() returns '' once the end of file is reached
    sl = line.split(',')
    print("{} {}".format(sl[0], sl[1]))
fp.close()

improve search using a dict and pyahocorasick

I'm new at Python and I don't know how to program well. How do I edit this code so it works using pyahocorasick? My code is very slow, because I need to search for lots of strings in a very big file.
Is there any other way to improve the search?
import sys

with open('C:/dict_search.txt', 'r') as search_list:
    targets = [line.strip() for line in search_list]

with open('C:/source.txt', 'r') as source_file, open('C:/out.txt', 'w') as fout:
    for line in source_file:
        if any(target in line for target in targets):
            fout.write(line)
Dict_search.txt
509344
827276
324194
782211
772854
727246
858908
280903
377881
247333
538710
182734
701212
379326
148310
542129
315285
840427
581092
485581
867746
434527
746814
749479
252045
189668
418513
624231
620284
(...)
source.txt
1,324194,20190103,0000048632,00000000000004870,0000045054!
1,701212,20190103,0000048632,00000000000147072,0000045055!
1,581092,20190103,0000048632,00000000000032900,0000045056!
(...)
I need to find whether a "word" from dict_search.txt is in source.txt, and if the word is on a line, I need to copy that line to another file.
The problem is that my source.txt is very big and I have more than 100k words in dict_search.txt, so my code takes very long to execute. I tried using the set() method, but I got a blank file.
After looking at your files, it looks like each line in the dict_search.txt file matches the format of the second column in the source.txt file. If this is the case, the code below will work for you. It's a linear-time solution, so it is fast at the cost of space, because it builds a dictionary in memory.
d = {}
with open("source.txt", 'r') as f:
    for index, line in enumerate(f):
        l = line.strip().split(",")
        d[l[1]] = line

with open("Dict_search.txt", 'r') as search, open('output.txt', 'w') as output:
    for line in search:
        row = line.strip()
        if row in d:
            output.write(d[row])
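Since the question asked about pyahocorasick specifically, here is a minimal sketch of how the same filtering could look with an Aho-Corasick automaton (assuming the pyahocorasick package is installed; unlike the dictionary approach, it matches the target strings anywhere on the line, not just in the second column):

import ahocorasick

# Build the automaton once from all the search words.
automaton = ahocorasick.Automaton()
with open('C:/dict_search.txt', 'r') as search_list:
    for word in (line.strip() for line in search_list):
        if word:
            automaton.add_word(word, word)
automaton.make_automaton()

# Scan each source line once; iter() yields a match for any word found.
with open('C:/source.txt', 'r') as source_file, open('C:/out.txt', 'w') as fout:
    for line in source_file:
        if next(automaton.iter(line), None) is not None:
            fout.write(line)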

Python - rearrange and write strings (line splitting, unwanted newline)

I have a script that:
Reads in each line of a file
Finds the '*' character in each line and splits the line here
Rearranges the 3 parts (first to last, and last to first)
Writes the rearranged strings to a .txt file
The problem is that it's picking up some newline character or something, and the output isn't what it should be. I have tried stripping newline characters, but there must be something I'm missing.
Thanks in advance for any help!
the script:
## Import packages
import time
import csv

## Make output file
file_output = open('output.txt', 'w')

## Open file and iterate over it, rearranging the order of each string
with open('input.csv', 'rb') as f:
    ## Jump to next line (skips file headers)
    next(f)
    ## Split each line, rearrange, and write the new line
    for line in f:
        ## Strip newline chars
        line = line.strip('\n')
        ## Split original string
        category, star, value = line.rpartition("*")
        ## Make new string
        new_string = value + star + category + '\n'
        ## Write new string to file
        file_output.write(new_string)
file_output.close()

## Require input (stops program from immediately quitting)
k = input(" press any key to exit")
Input file (input.csv):
Category*Hash Value
1*FB1124FF6D2D4CD8FECE39B2459ED9D5
1*FB1124FF6D2D4CD8FECE39B2459ED9D5
1*FB1124FF6D2D4CD8FECE39B2459ED9D5
1*34AC061CCCAD7B9D70E8EF286CA2F1EA
Output file (output.txt)
FB1124FF6D2D4CD8FECE39B2459ED9D5
*1
FB1124FF6D2D4CD8FECE39B2459ED9D5
*1
FB1124FF6D2D4CD8FECE39B2459ED9D5
*1
34AC061CCCAD7B9D70E8EF286CA2F1EA
*1
EDIT: Answered. Thanks everyone! Looks all good now! :)
The file output.txt should exist.
The following works with Python 2 on Debian:
## Import packages
import time
import csv

## Make output file
file_output = open('output.txt', 'w')

## Open file and iterate over it, rearranging the order of each string
with open('input.csv', 'rb') as f:
    ## Jump to next line (skips file headers)
    next(f)
    ## Split each line, rearrange, and write the new line
    for line in f:
        ## Split original string
        category, star, value = line.rpartition("*")
        ## Make new string
        new_string = value.strip() + star + category + '\n'
        ## Write new string to file
        file_output.write(new_string)
file_output.close()

## Require input (stops program from immediately quitting)
k = input(" press any key to exit")
I strip() the value, which contains the \n, in order to sanitize it. You used strip('\n'), which could be ambiguous; just using the method without a parameter does the job, since it removes all surrounding whitespace, including \r.
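To see the difference (a small sketch, not from the original answer; relevant because the file is opened in 'rb' mode, so Windows line endings keep their \r):

line = "1*FB1124FF\r\n"
print(repr(line.strip('\n')))  # '1*FB1124FF\r'  -- the \r survives
print(repr(line.strip()))      # '1*FB1124FF'    -- all surrounding whitespace removed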
Use a DictWriter
import csv

with open('aster.csv') as f, open('out.txt', 'w') as fout:
    reader = csv.DictReader(f, delimiter='*')
    writer = csv.DictWriter(fout, delimiter='*', fieldnames=['Hash Value', 'Category'])
    #writer.writeheader()
    for line in reader:
        writer.writerow(line)
Without csv library
with open('aster.csv') as f:
    next(f)
    lines = [line.strip().split('*') for line in f]

with open('out2.txt', 'w') as fout:
    for line in lines:
        fout.write('%s*%s\n' % (line[1], line[0]))

Import txt file and having each line as a list

I'm a new Python user.
I have a txt file that will be something like:
3,1,3,2,3
3,2,2,3,2
2,1,3,3,2,2
1,2,2,3,3,1
3,2,1,2,2,3
but it may have fewer or more lines.
I want to import each line as a list.
I know you can do it as such:
filename = 'MyFile.txt'
fin = open(filename, 'r')
L1list = fin.readline()
L2list = fin.readline()
L3list = fin.readline()
but since I don't know how many lines I will have, is there another way to create individual lists?
Do not create separate lists; create a list of lists:
results = []
with open('inputfile.txt') as inputfile:
    for line in inputfile:
        results.append(line.strip().split(','))
or better still, use the csv module:
import csv

results = []
with open('inputfile.txt', newline='') as inputfile:
    for row in csv.reader(inputfile):
        results.append(row)
Lists or dictionaries are far superior structures for keeping track of an arbitrary number of things read from a file.
Note that either loop also lets you handle the rows of data individually without reading the full contents of the file into memory; instead of using results.append(), just process each row right there, as sketched below.
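For instance, a minimal sketch (assuming, hypothetically, that you want the sum of each row rather than the rows themselves):

import csv

with open('inputfile.txt', newline='') as inputfile:
    for row in csv.reader(inputfile):
        # Handle each row as it is read; nothing accumulates in memory.
        print(sum(int(value) for value in row))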
Just for completeness' sake, here's the compact one-liner version to read a CSV file into a list in one go:
import csv

with open('inputfile.txt', newline='') as inputfile:
    results = list(csv.reader(inputfile))
Create a list of lists:
with open("/path/to/file") as file:
    lines = []
    for line in file:
        # The rstrip method gets rid of the "\n" at the end of each line
        lines.append(line.rstrip().split(","))
with open('path/to/file') as infile:  # try open('...', 'rb') as well
    answer = [line.strip().split(',') for line in infile]
If you want the numbers as ints:
with open('path/to/file') as infile:
    answer = [[int(i) for i in line.strip().split(',')] for line in infile]
lines = []
with open('file') as file:
    for line in file:  # loop, since a single readline() would read only the first line
        lines.append(line)
