Turning txt files into rows on a CSV using Python

I have a folder filled with text files that all have names similar to these:
2014521RNC Reax to Obama on VA.txt
2014520W.H. Evades Questions On When Obama.txt
2012517Updated Research/ Obama Vets Roll Out.txt
So digits and then letters and/or characters. In each text file, there are words. I'm trying to write a script that will take the first string of digits and add that to a csv in a column titled "date." Then it should take the letters and/or characters after the digits and put those in a column titled "title." And then it should take the text inside the file and add that to a column titled "content." I got kind of far but not the whole cigar. When I run the script below, date = -1 and title = -1 for all of them. What have I done wrong?
import csv
import os

f = open('RNC.csv', 'w')
names = ['date', 'title', 'content']
dw = csv.DictWriter(f, names)
dw.writerow({k: k for k in names})
for root, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        if not filename.endswith('.txt'):
            continue
        title = filename.find(r'\D*')
        date = filename.find(r'^\d*')
        open_doc = open(root + '/' + filename, 'r')
        content = open_doc.read().rstrip()
        open_doc.close()
        dw.writerow({'date': date, 'title': title, 'content': content})
f.close()

The problem is that filename.find(s) returns the position of the substring s in filename. It returns -1 when the substring isn't found.
You can use a regex to perform the matching instead:
import re

for filename in filenames:
    m = re.match(r"\A(\d+)(.*)\.txt\Z", filename)
    if m:
        date = m.group(1)
        title = m.group(2)
        ...

You can't supply regular expressions as parameters to the str.find method; it interprets them as literal substrings to look for in the filename. Probably what you need is something like this (after adding import re to the top of your script):
match = re.search(r'^(\d+)', filename)
date = match.group(1) if match else 'None'
match = re.search(r'(\D+)', filename)
title = match.group(1) if match else 'None'
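Putting the pieces together, here is a minimal sketch of the corrected script; the regex and the column names come from the question, and newline='' is the usual csv-module idiom on Python 3:

import csv
import os
import re

with open('RNC.csv', 'w', newline='') as f:
    names = ['date', 'title', 'content']
    dw = csv.DictWriter(f, names)
    dw.writeheader()
    for root, dirnames, filenames in os.walk('.'):
        for filename in filenames:
            # Leading digits become the date, the rest of the name the title.
            m = re.match(r'(\d+)(.*)\.txt$', filename)
            if not m:
                continue
            with open(os.path.join(root, filename), 'r') as doc:
                content = doc.read().rstrip()
            dw.writerow({'date': m.group(1), 'title': m.group(2), 'content': content})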

Related

Write new file with digit before extension

I have several files in a directory with the following names
example1.txt
example2.txt
...
example10.txt
and a bunch of other files.
I'm trying to write a script that can get all the files with a file name like <name><digit>.txt, find the one with the highest digit (in this case example10.txt), and then write a new file whose name adds 1 to that digit, i.e. example11.txt.
Right now I'm stuck at the part of selecting the .txt files and getting the last one.
Here is the code
import glob
from natsort import natsorted
files = natsorted(glob.glob('*[0-9].txt'))
last_file = files[-1]
print(files)
print(last_file)
You can use a regular expression to split the file name into its text and number parts, increment the number, and join everything back together to get the new file name:
import re
import glob
from natsort import natsorted
files = natsorted(glob.glob('*[0-9].txt'))
last_file = files[-1]
base_name, digits = re.match(r'([a-zA-Z]+)([0-9]+)\.txt', last_file).groups()
next_number = int(digits) + 1
next_file_name = f'{base_name}{next_number}.txt'
print(files)
print(last_file)
print(next_file_name)
Note that the regex assumes that the base name of the file has only alpha characters, with no spaces or _, etc. The regex can be extended if needed.
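For instance, a sketch of a more permissive pattern that allows any characters in the base name (the sample file name here is made up for illustration):

import re

# Non-greedy prefix, then the trailing run of digits before the extension.
pattern = re.compile(r'(.*?)([0-9]+)\.txt$')

print(pattern.match('my_example-v10.txt').groups())  # ('my_example-v', '10')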
You can use this script; it should work well for your purpose, I think.
import os

def get_last_file():
    files = os.listdir('./files')
    for index, file in enumerate(files):
        filename = str(file)[0:str(file).find('.')]
        digit = int(''.join([char for char in filename if char.isdigit()]))
        files[index] = digit
    files.sort()
    return files[-1]

def add_file(file_name, extension):
    last_digit = get_last_file() + 1
    with open('./files/' + file_name + str(last_digit) + '.' + extension, 'w') as f:
        f.write('0')

# call this to create a new incremental file.
add_file('example', 'txt')
Here's a simple solution.
files = ["example1.txt", "example2.txt", "example3.txt", "example10.txt"]
highestFileNumber = max(int(file[7:-4]) for file in files)
fileToBeCreated = f"example{highestFileNumber+1}.txt"
print(fileToBeCreated)
output:
example11.txt
.txt and example are constants, so there's no sense in looking for patterns. Just trim the example prefix and the .txt suffix.
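A sketch of the same trimming without the hard-coded slice indices, using str.removeprefix and str.removesuffix (available since Python 3.9); the file list is the one from the answer above:

files = ["example1.txt", "example2.txt", "example3.txt", "example10.txt"]

# Strip the constant prefix and suffix, leaving only the number.
numbers = (int(f.removeprefix("example").removesuffix(".txt")) for f in files)
print(f"example{max(numbers) + 1}.txt")  # example11.txt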

Python - Extract Code from Text using Regex

I am a Python beginner and looking for help with an extraction problem.
I have a bunch of text files and need to extract all special combinations of an expression ("C" + exactly 9 numerical digits) and write them to a file, including the filename of the text file. Each occurrence of the expression I want to catch starts at the beginning of a new line and ends with a "\n".
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
What the output should look like (in the output file)
My_desired_output_file = "filename,C123456789,C987654321"
My code so far:
import os

min_file_size = 5

def list_textfiles(directory, min_file_size):  # Creates a list of all files stored in DIRECTORY ending on '.txt'
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            if os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)
    return textfiles

for filename in list_textfiles(temp_directory, min_file_size):
    string = str(filename)
    text = infile.read()
    regex = ???
    with open(filename, 'w', encoding="utf-8") as outfile:
        outfile.write(regex)
Your regex is '^C[0-9]{9}$':
^ start of line
C exact match
[0-9] any digit
{9} 9 times
$ end of line
import re

regex = re.compile(r'(^C\d{9})')
matches = []
with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if regex.match(line):
            matches.append(line)
You can then write this list to a file as needed.
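For example, a minimal sketch of writing the matches in the comma-separated layout the question asks for (the names file.txt and output.txt are assumptions):

# Append one line per scanned file: "filename,match1,match2,..."
with open('output.txt', 'a') as outfile:
    outfile.write(','.join(['file.txt'] + matches) + '\n')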
How about:
import re
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
k = re.findall(r'(C\d{9})', sample_text)
print(k)
This will return all occurrences of that pattern. To cover multiple files, you can read each file line by line and store the matching combinations. Something like:
Updated:
import glob
import os
import re

search = {}
os.chdir('/FolderWithTxTs')
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        data = [re.findall(r'(C\d{9})', i) for i in f]
        search.update({f.name: data})
print(search)
This would return a dictionary with file names as keys and a list of found matches.

Issue with Find and Replace using Python

I have a set of .csv files with ; delimiter. There are certain junk values in the data that I need to replace with blanks. A sample problem row is:
103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006
Desired row after find and replace is:
103273;CAN D MAT;;;;03-Apr-2006
In the above example I'm replacing ;B.C.; with ;;
I cannot replace just B.C. as I need to match the entire cell value for this particular error case. The code that I am using is:
import os, fnmatch

def findReplace(directory, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)
            with open(filepath) as f:
                s = f.read()
            for find, replace in zip([';#DIV/0!;', ';B.C.;'], [';;', ';;']):
                s = s.replace(find, replace)
            with open(filepath, "w") as f:
                f.write(s)

findReplace(*Path*, "*.csv")
The output that I'm instead getting is:
103273;CAN D MAT;;B.C.;;03-Apr-2006
Can someone please help with this issue?
Thanks in advance!
The [find, replacement] pairs are not well-suited for your purpose.
Replacing ; + value + ; with ;; is really just a complicated way of saying that you want to blank out the fields holding that value.
So instead of using the [find, replacement] pairs, it is more natural and straightforward to split the line on ; into fields, replace the values that are considered junk with an empty string, and then join the values again:
JUNK = frozenset(['#DIV/0!', 'B.C.'])

def clean(s):
    return ';'.join(map(lambda x: '' if x in JUNK else x, s.split(';')))
You could use this function in your implementation (or copy it inline):
def findReplace(directory, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)
            cleaned_lines = []
            with open(filepath) as f:
                for line in f:  # iterate over lines; f.read() would yield single characters
                    cleaned_lines.append(clean(line.rstrip('\n')))
            with open(filepath, "w") as f:
                f.write('\n'.join(cleaned_lines))
str.replace, once it has made one replacement, continues scanning from the character after the text it just replaced. So when two occurrences of ;B.C.; overlap, it will not replace both.
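A quick demonstration of that scanning behavior:
>>> ";B.C.;B.C.;".replace(";B.C.;", ";;")
';;B.C.;'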
You can use the re module to replace B.C. only when it occurs between two ;, using lookahead and lookbehind assertions:
>>> import re
>>> s = "103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006"
>>> re.sub(r'(?<=;)B[.]C[.](?=;)', "", s)
'103273;CAN D MAT;;;;03-Apr-2006'
... But in this case it may be better to split the line into fields on ;, replace the fields that match the strings you want to erase, and join the strings together again.
>>> fields = s.split(';')
>>> for i, f in enumerate(fields):
... if f in ('B.C.', '#DIV/0!'):
... fields[i] = ''
...
>>> ';'.join(fields)
'103273;CAN D MAT;;;;03-Apr-2006'
This has two main advantages: you don't have to write a fairly complex regular expression for each replaced string; and it will still work if one of the fields is at the beginning or end of the line.
For any CSV parsing more complicated than this (for example, if any fields can contain quoted ; characters, or if the file has a header that should be skipped), look into the csv module.
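As a sketch of that approach with the csv module (the file name data.csv is an assumption; the junk values are the ones from above):

import csv

JUNK = {'#DIV/0!', 'B.C.'}

# Read all rows, blanking out junk cells field by field.
with open('data.csv', newline='') as f:
    rows = [['' if cell in JUNK else cell for cell in row]
            for row in csv.reader(f, delimiter=';')]

# Write the cleaned rows back in place.
with open('data.csv', 'w', newline='') as f:
    csv.writer(f, delimiter=';').writerows(rows)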

Python (Filename + Sha1) generator

I have a problem with my algorithm; apparently it skips a lot of sha1 hashes when executing.
No problem with the filename, but I'm having a problem producing this output:
filename+sha1\n
For each of them. I can guess it's because of os.walk in some way, but I'm not that expert ATM.
import hashlib
import os

txt = open('list', 'w')
for dirpath, dirnames, filenames in os.walk(dir_path):
    text = str(filenames)
    for tag in ("[", "]", " ", "'"):
        text = text.replace(tag, '')
    text = str(text.replace(',', '\n'))
    for i in filenames:
        m = hashlib.sha1(str(text).encode('utf-8')).hexdigest()
        txt.write(text + " " + str(m) + "\n")
txt = txt.close()
Thanks
What looks like a potential issue is that you are converting filenames, which is a list of the individual files in the current folder, into a string and then performing replacements on that string. I assume what you intended was to remove those special characters from each individual filename. Try the below.
import hashlib
import os
import re

txt = open('list', 'w')
for dirpath, dirnames, filenames in os.walk(dir_path):
    for text in filenames:
        text = re.sub(r'[\[\]," "]', "", text)
        m = hashlib.sha1(str(text).encode('utf-8')).hexdigest()
        txt.write(text + " " + str(m) + "\n")
txt = txt.close()
As requested, if you don't want to use re, just do what you had originally done:
text = 'fjkla[] k,, a,[,]dd,]'
for badchar in '[]," "]':
text = text.replace(badchar,"")
Change:
txt = open('list','w')
to:
txt = open('list','a')
You are using "w" which overwrites any previous content. You need "a", which appends to the existing file without overwriting.
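If the intent was to hash each file's contents rather than its name, a minimal sketch (content hashing is an assumption here; the question only ever hashes the name, and dir_path is the same variable as in the question):

import hashlib
import os

with open('list', 'w') as out:
    for dirpath, dirnames, filenames in os.walk(dir_path):
        for name in filenames:
            # Read bytes and hash the file's contents.
            with open(os.path.join(dirpath, name), 'rb') as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            out.write(name + " " + digest + "\n")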

Python, find and print specific cells in CSV files that are in different directories

I have different csv files in different directories, and I want to find specific cells in different columns that correspond to a specific date in my Input.txt file.
Here is what I have so far:
import glob, os, csv, numpy
import re

if __name__ == '__main__':
    Input = open('Input.txt', 'r')
    output = []
    for i, line in enumerate(Input):
        if i == 0:
            header_Input = Input.readline().replace('\n', '').split(',')
        else:
            date_input = Input.readline().replace('\n', '').split(',')
    a = os.walk("path to the directory")
    [x[0] for x in os.walk("path to the directory")]
    print(a)
    b = next(os.walk('.'))[1]  # immediate child directories.
    for dirname, dirnames, filenames in os.walk('.'):
        # print path to all subdirectories first.
        for subdirname in dirnames:
            print(os.path.join(dirname, subdirname))
        # print path to all filenames.
        for filename in filenames:
            # print(os.path.join(dirname, filename))
            csvfile = 'csv_file'
            if csvfile in filename:
                print(os.path.join(dirname, filename))
Now I have the csv files, so I need to find the date_input in every file and print the line that contains all the information, or, if possible, get only the cells that are in the columns with header == header_Input.
This is not intended to be a full answer to your question. But you may want to consider replacing
for i, line in enumerate(Input):
    if i == 0:
        header_Input = Input.readline().replace('\n', '').split(',')
    else:
        date_input = Input.readline().replace('\n', '').split(',')
with
header_Input = Input.readline().strip().split(',')
date_input = Input.readline().strip().split(',')
The enumerate(Input) expression reads lines from the file, and so do the readline() calls in the loop body. This will most likely produce unfortunate results, such as only every other line being processed.
The strip() method removes whitespace from the start and end of the line. Alternatively you may want to know that s[:-1] strips off the last character of s.
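To then locate the matching rows, a minimal sketch using the csv module (the walk from the current directory, the .csv extension check, and the date being the first field of date_input are all assumptions):

import csv
import os

date = date_input[0]  # assuming the date is the first field of the input line

for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        if not filename.endswith('.csv'):
            continue
        with open(os.path.join(dirname, filename), newline='') as f:
            for row in csv.reader(f):
                if date in row:
                    print(os.path.join(dirname, filename), row)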
