Python: How do I split the file?

I have this txt file which is the output of ls -R on the etc directory of a Linux system. Example file:
etc:
ArchiveSEL
xinetd.d
etc/cmm:
CMM_5085.bin
cmm_sel
storage.cfg
etc/crontabs:
root
etc/pam.d:
ftp
rsh
etc/rc.d:
eth.set.sh
rc.sysinit
etc/rc.d/init.d:
cmm
functions
userScripts
etc/security:
access.conf
console.apps
time.conf
etc/security/console.apps:
kbdrate
etc/ssh:
ssh_host_dsa_key
sshd_config
etc/var:
setUser
snmpd.conf
etc/xinetd.d:
irsh
wu-ftpd
I would like to split it by subdirectory into several files. Example files would be like this: etc.txt, etcCmm.txt, etcCrontabs.txt, etcPamd.txt, ...
Can someone give me some Python code that can do that?
Notice that the subdirectory lines end with ':', but I'm just not smart enough to write the code. Some examples would be appreciated.
Thank you :)

Maybe something like this? re.M compiles a multiline regular expression (so ^ matches at the start of every line), and the last part just iterates over the matches and creates the files...
import re
data = '<your input data as above>' # or open('data.txt').read()
results = map(lambda m: (m[0], m[1].strip().splitlines()),
              re.findall('^([^\n]+):\n((?:[^\n]+\n)*)\n', data, re.M))
for dirname, files in results:
    f = open(dirname.replace('/', '')+'.txt', 'w')
    for line in files:
        f.write(line + '\n')
    f.close()

You will need to do it line by line. If a line.endswith(":"), then you are in a new subdirectory. From then on, each line is a new entry in your subdirectory, until another line ends with :.
From my understanding, you just want to split one text file into several, ambiguously named, text files.
So you'd see if a line ends with :. Then you open a new text file, like etcCmm.txt, and every line that you read from the source text, from that point on, you write into etcCmm.txt. When you encounter another line that ends in :, you close the previously opened file, create a new one, and continue.
I'm leaving a few things for you to do yourself, such as figuring out what to call the text file, reading a file line by line, etc.
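A minimal sketch of that line-by-line approach (the sample data is inlined here so the snippet is self-contained; in practice you would read it from your file, and you may want a fancier naming scheme than simply dropping the slashes):

```python
# Start a new output file each time a line ends with ':';
# write every other line into the currently open file.
data = """etc:
ArchiveSEL
xinetd.d
etc/cmm:
CMM_5085.bin
cmm_sel
"""  # in practice: data = open('data.txt').read()

out = None
for line in data.splitlines():
    if line.endswith(':'):
        if out:
            out.close()
        # 'etc/cmm:' -> 'etccmm.txt'; adjust the naming scheme to taste
        out = open(line[:-1].replace('/', '') + '.txt', 'w')
    elif out:
        out.write(line + '\n')
if out:
    out.close()
```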

Use a regexp like '.*:'.
Use file.readline().
Use loops.

If Python is not a must, you can use this one liner
awk '/:$/{gsub(/:|\//,"");fn=$0}{print $0 > fn".txt"}' file

Here's what I would do:
Read the file into memory (myfile = open(filename).read() should do).
Then split the file along the delimiters:
import re
myregex = re.compile(r"^(.*):[ \t]*$", re.MULTILINE)
arr = myregex.split(myfile)[1:] # dropping everything before the first directory entry
Then convert the array to a dict, removing unwanted characters along the way:
mydict = dict([(re.sub(r"\W+","",k), v.strip()) for (k,v) in zip(arr[::2], arr[1::2])])
Then write the files:
for name, content in mydict.iteritems():
    output = open(name+".txt","w")
    output.write(content)
    output.close()

Related

Code for coyping specific lines from multiple files to a single file (and removing part of the copied lines)

First of all, I am really new to this. I've been reading up on some tutorials over the past days, but now I've hit a wall with what I want to achieve.
To give you the long version: I have multiple files in a directory, all of which contain information in certain lines (23-26). Now, the code would have to find and open all files (naming pattern: *.tag) and then copy lines 23-26 to a new single file. (And add a new line after each new entry...). Optionally it would also remove a specific part from each line that I do not need:
C12b2
-> everything before C12b2 (or similar) would need to be removed.
Thus far I have managed to copy those lines from a single file to a new file, but the rest still eludes me: (no idea how formatting works here)
f = open('2.tag')
n = open('output.txt', 'w')
for i, text in enumerate(f):
    if i >= 23 and i < 27:
        n.write(text)
    else:
        pass
Could anyone give me some advice ? I do not need a complete code as an answer, however, good tutorials that don't skip explanations seem to be hard to come by.
You can look at the glob module: it gives a list of filenames that match the pattern you provide it. Please note this pattern is not a regex, it is a shell-style pattern (using shell-style wildcards).
Example of glob -
>>> import glob
>>> glob.glob('*.py')
['a.py', 'b.py', 'getpip.py']
You can then iterate over each of the files returned by the glob.glob() function.
For each file you can do the same thing you are doing right now.
Then when writing files, you can use str.find() to find the first instance of the string C12b2 and then use slicing to remove the part you do not want.
As an example -
>>> s = "asdbcdasdC12b2jhfasdas"
>>> s[s.find("C12b2"):]
'C12b2jhfasdas'
You can do something similar for each of your lines. Please note that if the use case is that only some lines would have C12b2, then you need to first check whether that string is present in the line before doing the above slicing. Example -
if 'C12b2' in text:
    text = text[text.find("C12b2"):]
You can do the above before writing the line into the output file.
Also, it would be good to look into the with statement: you can use it for opening files, so that it will automatically handle closing the file when you are done with the processing.
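Putting those pieces together, a sketch of the whole task (the sample file written at the top is only there to make the snippet self-contained; the marker string and the lines-23-to-26 range come from the question):

```python
import glob

# Make one small sample input so the sketch runs end-to-end;
# in practice your *.tag files already exist on disk.
with open('sample.tag', 'w') as f:
    for i in range(1, 31):
        f.write('junkC12b2 line %d\n' % i)

# Collect lines 23-26 from every *.tag file, trimming everything
# before the marker 'C12b2' when it is present.
with open('output.txt', 'w') as out:
    for name in glob.glob('*.tag'):
        with open(name) as src:             # closed automatically by 'with'
            lines = src.readlines()[22:26]  # lines 23-26 (zero-indexed)
        for text in lines:
            if 'C12b2' in text:
                text = text[text.find('C12b2'):]
            out.write(text)
        out.write('\n')                     # blank line between entries
```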
Without importing anything but os:
#!/usr/bin/env python3
import os
# set the directory, the outfile and the tag below
dr = "/path/to/directory"; out = "/path/to/newfile"; tag = ".tag"
for f in [f for f in os.listdir(dr) if f.endswith(tag)]:
    open(out, "+a").write(("").join([l for l in open(dr+"/"+f).readlines()[22:26]])+"\n")
What it does
It does exactly as you describe, it:
collects a defined region of lines from all files (that is: of a defined extension) in a directory
pastes the sections into a new file, separated by a new line
Explanation
[f for f in os.listdir(dr) if f.endswith(".tag")]
lists all files of the specific extension in your directory,
[l for l in open(dr+"/"+f).readlines()[22:26]]
reads the selected lines of the file
open(out, "+a").write()
writes to the output file, creates it if it does not exist.
How to use
Copy the script into an empty file, save it as collect_lines.py
set in the head section the directory with your files, the path to the new file and the extension
run it with the command:
python3 /path/to/collect_lines.py
The verbose version, with explanation
If we "decompress" the code above, this is what happens:
#!/usr/bin/env python3
import os
#--- set the path to the directory, the new file and the tag below
dr = "/path/to/directory"; out = "/path/to/newfile"; tag = ".tag"
#---
files = os.listdir(dr)
for f in files:
    if f.endswith(tag):
        # read the file as a list of lines
        content = open(dr+"/"+f).readlines()
        # the first item in a list = index 0, so line 23 is index 22
        needed_lines = content[22:26]
        # convert list to string, add a new line
        string_topaste = ("").join(needed_lines)+"\n"
        # add the lines to the new file, create the file if necessary
        open(out, "+a").write(string_topaste)
Using the glob package you can get a list of all *.tag files:
import glob
# ['1.tag', '2.tag', 'foo.tag', 'bar.tag']
tag_files = glob.glob('*.tag')
If you open your file using the with statement, it is being closed automatically afterwards:
with open('file.tag') as in_file:
    # do something
Use readlines() to read your entire file into a list of lines, which can then be sliced:
lines = in_file.readlines()[22:26]
If you need to skip everything before a specific pattern, use str.split() to separate the string at the pattern and take the last part:
pattern = 'C12b2'
clean_lines = [line.split(pattern, 1)[-1] for line in lines]
Take a look at this example:
>>> lines = ['line 22', 'line 23', 'Foobar: C12b2 line 24']
>>> pattern = 'C12b2'
>>> [line.split(pattern, 1)[-1] for line in lines]
['line 22', 'line 23', ' line 24']
You can use readlines() and writelines(), with a and b as line bounds for the slice of lines to write:
with open('oldfile.txt', 'r') as old:
    lines = old.readlines()[a:b]
with open('newfile.txt', 'w') as new:
    new.writelines(lines)

Searching a text file and grabbing all lines that do not include ## in python

I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
clean_text = []
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line.rstrip("\r\n"))
You would be left with a list object that contains one line of required text per element. You could split each line into individual words using str.split(), which would give you a nested list per original list element, which you could easily index (assuming your text has whitespace, of course).
clean_text[4][7]
would return the 5th line, 8th word.
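For instance, with some made-up sample lines standing in for the cleaned file contents:

```python
# clean_text as it would look after the ## filtering above
clean_text = [
    "alpha bravo charlie",
    "delta echo",
    "foxtrot golf hotel india",
    "juliet kilo",
    "lima mike november oscar papa quebec romeo sierra",
]
# split each line into words, giving a nested list
words = [line.split() for line in clean_text]
print(words[4][7])  # 5th line, 8th word -> 'sierra'
```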
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = []
with open("file.txt", "r") as f:  # "r" = read
    for line in f:
        if line[:2] != "##":  # look at the first two characters
            listoflines.append(line)
print listoflines
If you're feeling brave, you can also do the following, CREDITS GO TO ALEX THORNTON:
listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially teaching the .startswith function, but I think this is the more pythonic way and also has the advantage of automatically closing the file as soon as you're done with it.

How can I incorporate a text file into the body of my Python script?

Currently, my code is reading an external text file, using:
text_file = open("file.txt", 'r', 0)
my_list = []
for line in text_file:
    my_list.append(line.strip().lower())
return my_list
I would like to send my code to a friend without having to send a separate text file. So I am looking for a way of incorporating the content of the text file into my code.
How can I achieve this?
If I convert the text file into list format ([a, b, c, ...]) inside MS notepad using replace function, and then try to copy & paste list into Python IDE (I'm using IDLE), the process is hellishly memory intensive: IDLE tries to string out everything to the right in one line (i.e. no word wrap), and it never ends.
I'm not totally sure what you're asking, but if I'm guessing what you mean correctly, you could do this:
my_list = ['line1', 'line2']
Where each is a line from your text file.
Just put all the file contents into ONE MASSIVE string:
with open('path/to/my/txt/file') as f:
    file_contents = f.read()
So now, your friend can do:
for line in file_contents.split('\n'):
    #code
which is equivalent to
with open('path/to/file') as f:
    for line in f:
        #code
Hope this helps
I would suggest:
assign the contents of the file to a variable in another py file
read the value by importing it in your program
That way the py file will be converted to pyc (send that), or py2exe will take care of it,
and it would not allow your friend to mess with the contents...
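A minimal sketch of that idea (the module name mydata.py and the sample lines are made up; the list comprehension stands in for reading the real text file):

```python
# Generate the helper module once from the text file's contents.
content = ['Apple\n', 'Banana\n', 'Cherry\n']  # stands in for open('file.txt').readlines()
with open('mydata.py', 'w') as mod:
    mod.write('my_list = %r\n' % [line.strip().lower() for line in content])

# The main script (the one you send to your friend) then simply does:
from mydata import my_list
print(my_list)  # ['apple', 'banana', 'cherry']
```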
You could also do something like:
my_file_contents = """file_contents_including_newlines"""
for line in my_file_contents.split('\n'):  # Assuming UNIX line endings, else split on '\r\n'
    *do something with "line" variable*
Note the use of triple quotes around the text to be sent. This would work for non-binary data.

Rename Files Based on File Content

Using Python, I'm trying to rename a series of .txt files in a directory according to a specific phrase in each given text file. Put differently and more specifically, I have a few hundred text files with arbitrary names but within each file is a unique phrase (something like No. 85-2156). I would like to replace the arbitrary file name with that given phrase for every text file. The phrase is not always on the same line (though it doesn't deviate that much) but it always is in the same format and with the No. prefix.
I've looked at the os module and I understand how
os.listdir
os.path.join
os.rename
could be useful but I don't understand how to combine those functions with intratext manipulation functions like linecache or general line reading functions.
I've thought through many ways of accomplishing this task but it seems like easiest and most efficient way would be to create a loop that finds the unique phrase in a file, assigns it to a variable and use that variable to rename the file before moving to the next file.
This seems like it should be easy, so much so that I feel silly writing this question. I've spent the last few hours looking reading documentation and parsing through StackOverflow but it doesn't seem like anyone has quite had this issue before -- or at least they haven't asked about their problem.
Can anyone point me in the right direction?
EDIT 1: When I create the regex pattern using this website, it creates bulky but seemingly workable code:
import re
txt='No. 09-1159'
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
name = m.group(0)
print name
When I manipulate that to fit the glob.glob structure, and make it like this:
import glob
import os
import re
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
for fname in glob.glob("\file\structure\here\*.txt"):
    with open(fname) as f:
        contents = f.read()
        tname = rg.search(contents)
        print tname
Then this prints out the byte location of the pattern -- signifying that the regex pattern is correct. However, when I add in the nname = tname.group(0) line after the original tname = rg.search(contents) and change the print statement to reflect the change, it gives me the following error: AttributeError: 'NoneType' object has no attribute 'group'. When I tried copying and pasting #joaquin's code line for line, it came up with the same error. I was going to post this as a comment to the #spatz answer but I wanted to include so much code that this seemed to be a better way to express the 'new' problem. Thank you all for the help so far.
Edit 2: This is for the #joaquin answer below:
import glob
import os
import re
for fname in glob.glob("/directory/structure/here/*.txt"):
    with open(fname) as f:
        contents = f.read()
        tname = re.search('No\. (\d\d\-\d\d\d\d)', contents)
        nname = tname.group(1)
        print nname
Last Edit: I got it to work using mostly the code as written. What was happening is that there were some files that didn't contain the regex pattern, and I had assumed Python would skip them. Silly me. So I spent three days learning to write two lines of code (I know the lesson is more than that). I also used the error catching method recommended here. I wish I could check all of you as the answer, but I bothered #Joaquin the most so I gave it to him. This was a great learning experience. Thank you all for being so generous with your time. The final code is below.
import os
import re
pat3 = "No\. (\d\d-\d\d\d\d)"
ext = '.txt'
mydir = '/directory/files/here'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat3, txt)
    if s is None:
        continue
    name = s.group(1)
    newpath = os.path.join(mydir, name)
    if not os.path.exists(newpath):
        os.rename(archpath, newpath + ext)
    else:
        print '{} already exists, passing'.format(newpath)
Instead of providing you with some code which you will simply copy-paste without understanding, I'd like to walk you through the solution so that you will be able to write it yourself, and more importantly gain enough knowledge to be able to do it alone next time.
The code which does what you need is made up of three main parts:
Getting a list of all filenames you need to iterate
For each file, extract the information you need to generate a new name for the file
Rename the file from its old name to the new one you just generated
Getting a list of filenames
This is best achieved with the glob module. This module allows you to specify shell-like wildcards and it will expand them. This means that in order to get a list of .txt file in a given directory, you will need to call the function glob.iglob("/path/to/directory/*.txt") and iterate over its result (for filename in ...:).
Generate new name
Once we have our filename, we need to open() it, read its contents using read() and store it in a variable where we can search for what we need. That would look something like this:
with open(filename) as f:
    contents = f.read()
Now that we have the contents, we need to look for the unique phrase. This can be done using regular expressions. Store the new filename you want in a variable, say newfilename.
Rename
Now that we have both the old and the new filenames, we need to simply rename the file, and that is done using os.rename(filename, newfilename).
If you want to move the files to a different directory, use os.rename(filename, os.path.join("/path/to/new/dir", newfilename)). Note that we need os.path.join here to construct the new path for the file using a directory path and newfilename.
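Putting the three steps together, a sketch along those lines (the directory name, sample file, and the regex are assumptions based on the question; the setup block at the top only exists so the sketch runs end-to-end):

```python
import glob
import os
import re

# Set up one sample file so the sketch is runnable end-to-end;
# in practice the directory of *.txt files already exists.
os.mkdir("samples")
with open(os.path.join("samples", "arbitrary_name.txt"), "w") as f:
    f.write("some header\nnothing here or whatever No. 09-1159 you want\n")

# Step 1: iterate the filenames; step 2: extract the unique phrase;
# step 3: rename the file after it.
for filename in glob.iglob(os.path.join("samples", "*.txt")):
    with open(filename) as f:
        contents = f.read()
    match = re.search(r"No\. (\d\d-\d\d\d\d)", contents)
    if match is None:
        continue  # phrase not found, leave this file alone
    newfilename = match.group(1) + ".txt"
    os.rename(filename, os.path.join("samples", newfilename))
```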
There is no checking or protection against failures (check if archpath is a file, if newpath already exists, if the search is successful, etc...), but this should work:
import os
import re
pat = "No\. (\d\d\-\d\d\d\d)"
mydir = 'mydir'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat, txt)
    name = s.group(1)
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)
Edit: I tested the regex to show how it works:
>>> import re
>>> pat = "No\. (\d\d\-\d\d\d\d)"
>>> txt='nothing here or whatever No. 09-1159 you want, does not matter'
>>> s = re.search(pat, txt)
>>> s.group(1)
'09-1159'
>>>
The regex is very simple:
\. -> a dot
\d -> a decimal digit
\- -> a dash
So, it says: search for the string "No. " followed by 2+4 decimal digits separated by a dash.
The parentheses are to create a group that I can recover with s.group(1) and that contains the code number.
And that is what you get, before and after:
Text of files one.txt, two.txt and three.txt is always the same, only the number changes:
this is the first
file with a number
nothing here or whatever No. 09-1159 you want, does not matter
the number is
Create a backup of your files, then try something like this:
import glob
import os
def your_function_to_dig_out_filename(lines):
    import re
    # i'll let you attempt this yourself

for fn in glob.glob('/path/to/your/dir/*.txt'):
    with open(fn) as f:
        spam = f.readlines()
    new_fn = your_function_to_dig_out_filename(spam)
    if not os.path.exists(new_fn):
        os.rename(fn, new_fn)
    else:
        print '{} already exists, passing'.format(new_fn)

Replace string in a specific line using python

I'm writing a python script to replace strings from a each text file in a directory with a specific extension (.seq). The strings replaced should only be from the second line of each file, and the output is a new subdirectory (call it clean) with the same file names as the original files, but with a *.clean suffix. The output file contains exactly the same text as the original, but with the strings replaced. I need to replace all these strings: 'K','Y','W','M','R','S' with 'N'.
This is what I've come up with after googling. It's very messy (2nd week of programming), and it stops at copying the files into the clean directory without replacing anything. I'd really appreciate any help.
Thanks before!
import os, shutil
os.mkdir('clean')
for file in os.listdir(os.getcwd()):
    if file.find('.seq') != -1:
        shutil.copy(file, 'clean')
os.chdir('clean')
for subdir, dirs, files in os.walk(os.getcwd()):
    for file in files:
        f = open(file, 'r')
        for line in f.read():
            if line.__contains__('>'): # indicator for the first line. the first line always starts with '>'. It's a FASTA file, if you've worked with dna/protein before.
                pass
            else:
                line.replace('M', 'N')
                line.replace('K', 'N')
                line.replace('Y', 'N')
                line.replace('W', 'N')
                line.replace('R', 'N')
                line.replace('S', 'N')
some notes:
str.replace and re.sub are not in-place, so you should be assigning the return value back to your variable.
glob.glob is better for finding files in a directory matching a defined pattern...
maybe you should be checking if the directory already exists before creating it (I just assumed this; it might not be your desired behavior)
the with statement takes care of closing the file in a safe way. if you don't want to use it, you have to use try/finally.
in your example you were forgetting to put the suffix *.clean ;)
you were not actually writing the files; you could do it like I did in my example or use the fileinput module (which until today I did not know)
here's my example:
import re
import os
import glob
source_dir = os.getcwd()
target_dir = "clean"
source_files = [fname for fname in glob.glob(os.path.join(source_dir,"*.seq"))]
# check if target directory exists... if not, create it.
if not os.path.exists(target_dir):
    os.makedirs(target_dir)
for source_file in source_files:
    target_file = os.path.join(target_dir, os.path.basename(source_file)+".clean")
    with open(source_file,'r') as sfile:
        with open(target_file,'w') as tfile:
            lines = sfile.readlines()
            # do the replacement in the second line.
            # (remember that arrays are zero indexed)
            lines[1] = re.sub("K|Y|W|M|R|S",'N',lines[1])
            tfile.writelines(lines)
print "DONE"
hope it helps.
You should replace line.replace('M', 'N') with line=line.replace('M', 'N'). replace returns a copy of the original string with the relevant substrings replaced.
An even better way (IMO) is to use re.
import re
line="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
line=re.sub("K|Y|W|M|R|S",'N',line)
print line
Here are some general hints:
Don't use find for checking the file extension (e.g., this would also match "file1.seqdata.xls"). At least use file.endswith('.seq'), or, better yet, os.path.splitext(file)[1]
Actually, don't do that altogether. This is what you want:
import glob
seq_files = glob.glob("*.seq")
Don't copy the files, it's much easier to use just one loop:
for filename in seq_files:
    in_file = open(filename)
    out_file = open(os.path.join("clean", filename), "w")
    # now read lines from in_file and write lines to out_file
Don't use line.__contains__('>'). What you mean is
if '>' in line:
(which will call __contains__ internally). But actually, you want to know whether the line starts with a ">", not whether there's one somewhere within the line, be it at the beginning or not. So the better way would be this:
if line.startswith(">"):
I'm not familiar with your file type; if the ">" check really is just for determining the first line, there's better ways to do that.
You don't need the if block (you just pass). It's cleaner to write
if not something:
    do_things()
other_stuff()
instead of
if something:
    pass
else:
    do_things()
other_stuff()
Have fun learning Python!
you need to allocate the result of the replacement back to "line" variable
line=line.replace('M', 'N')
you can also use the module fileinput for inplace edit
import os, shutil, fileinput
if not os.path.exists('clean'):
    os.mkdir('clean')
for file in os.listdir("."):
    if file.endswith(".seq"):
        shutil.copy(file, 'clean')
os.chdir('clean')
for subdir, dirs, files in os.walk("."):
    for file in files:
        f = fileinput.FileInput(file, inplace=0)
        for n, line in enumerate(f):
            if line.lstrip().startswith('>'):
                pass
            elif n == 1: # replace 2nd line
                for repl in ["M","K","Y","W","R","S"]:
                    line = line.replace(repl, 'N')
            print line.rstrip()
        f.close()
change inplace=0 to inplace=1 for in place editing of your files.
line.replace is not a mutator, it leaves the original string unchanged and returns a new string with the replacements made. You'll need to change your code to line = line.replace('R', 'N'), etc.
I think you also want to add a break statement at the end of your else clause, so that you don't iterate over the entire file, but stop after having processed line 2.
Lastly, you'll need to actually write the file out containing your changes. So far, you are just reading the file and updating the line in your program variable 'line'. You need to actually create an output file as well, to which you will write the modified lines.
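Following that advice, the question's inner loop might be reworked along these lines (a sketch: the sample lines are made up and stand in for reading one .seq file, and the output filename is illustrative):

```python
# Rework of the inner loop: assign replace()'s return value back
# to the variable, and actually write the result to an output file.
lines = ['>seq1 header\n', 'ACGTKYWMRS\n', 'MORE LINES\n']  # stands in for f.readlines()
with open('example.seq.clean', 'w') as out:
    for n, line in enumerate(lines):
        if n == 1:                            # only the second line is cleaned
            for ch in 'KYWMRS':
                line = line.replace(ch, 'N')  # replace() returns a new string
        out.write(line)
```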
