Following on from an earlier post, I have written some Python code to count the occurrences of certain phrases (held in the word_list variable; three examples are shown, but there will be many more) in a large number of text files. The code below requires me to take each element of the list and insert it into a string for comparison against each text file. However, it currently writes only the frequencies for the last phrase in the list to the spreadsheet, rather than all of them in their respective columns. Is this just an indentation issue, with the writerow in the wrong position, or is there a logic flaw in my code? Also, is there any way to avoid the list-to-string assignment when comparing the phrases to those in the text files?
word_list = ['in the event of', 'frankly speaking', 'on the other hand']
S = {}
p = 0
k = 0
with open(file_path, 'w+', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Fohone-K"] + word_list)
    for filename in glob.glob(os.path.join(path, '*.txt')):
        if filename.endswith('.txt'):
            f = open(filename)
            Fohone_K = filename[8:]
            data = f.read()
            # new code section from scratch file
            l = len(word_list)
            for s in range(l):
                phrase = word_list[s]
                S = data.count(phrase)
                if S:
                    #k = k + 1
                    print("'{}' match".format(Fohone_K), S)
                else:
                    print("'{}' no match".format(Fohone_K))
                print("\n")
            # for m in word_list:
            if S >= 0:
                print([Fohone_K] + [S])
                writer.writerow([Fohone_K] + [S])
The output currently looks like this:
[screenshot: one count written per row]
When it needs to look like this:
[screenshot: one column of counts per phrase]
You probably were going for something like this:
import csv, glob, os
word_list = ['in the event of', 'frankly speaking', 'on the other hand']
file_path = 'out.csv'
path = '.'
with open(file_path, 'w+', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Fohone-K"] + word_list)
    for filename in glob.glob(os.path.join(path, '*.txt')):
        if filename.endswith('.txt'):
            with open(filename) as f:
                postfix = filename[8:]
                content = f.read()
                matches = [content.count(phrase) for phrase in word_list]
                print(f"'{filename}' {'no ' if all(n == 0 for n in matches) else ''}match")
                writer.writerow([postfix] + matches)
The key problem was that you wrote S on each row, and S only ever held the count for a single phrase (the last one checked). That's fixed here by building a full list of matches and writing it as one row.
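As a quick sanity check of the counting idea on made-up text (note that str.count is case-sensitive, so the lowercasing here is an extra assumption about your data):

content = "In the event of rain, frankly speaking, stay home."
word_list = ['in the event of', 'frankly speaking', 'on the other hand']
matches = [content.lower().count(phrase) for phrase in word_list]
print(matches)  # [1, 1, 0]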
I have a file with data like this:
>1_DL_2021.1123
>2_DL_2021.1206
>3_DL_2021.1202
>3_DL_2021.1214
>4_DL_2021.1214
>4_DL_2021.1214
>6_DL_2021.1214
>7_DL_2021.1214
>8_DL_2021.1214
As you can see, the data is not numbered properly and hence needs renumbering.
What I'm aiming for is this:
>1_DL_2021.1123
>2_DL_2021.1206
>3_DL_2021.1202
>4_DL_2021.1214
>5_DL_2021.1214
>6_DL_2021.1214
>7_DL_2021.1214
>8_DL_2021.1214
>9_DL_2021.1214
Now, the file has a lot of other content between these lines starting with the > sign; I want only the > lines affected.
Could someone please help me out with this?
There are 563 such lines, so doing it manually is out of the question.
So, assuming the input data file is "input.txt", you can achieve what you want with this:
import re
with open("input.txt", "r") as f:
a = f.readlines()
regex = re.compile(r"^>\d+_DL_2021\.\d+\n$")
counter = 1
for i, line in enumerate(a):
if regex.match(line):
tokens = line.split("_")
tokens[0] = f">{counter}"
a[i] = "_".join(tokens)
counter += 1
with open("input.txt", "w") as f:
f.writelines(a)
What it does: it searches for lines matching the regex ^>\d+_DL_2021\.\d+\n$, splits each matching line on _, rewrites the first (0th) token with the running counter, increments the counter by 1, and finally writes the updated lines back to "input.txt".
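To illustrate on a single made-up line (not part of the original answer):

import re

regex = re.compile(r"^>\d+_DL_2021\.\d+\n$")
line = ">3_DL_2021.1214\n"
print(bool(regex.match(line)))  # True
tokens = line.split("_")        # ['>3', 'DL', '2021.1214\n']
tokens[0] = ">4"                # pretend the counter is at 4
print("_".join(tokens))         # >4_DL_2021.1214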
sudden_appearance already provided a good answer.
In case you don't like regex too much you can use this code instead:
new_lines = []
with open('test_file.txt', 'r') as f:
    c = 1
    for line in f:
        if line[0] == '>':
            after_dash = line.split('_', 1)[1]
            new_line = '>' + str(c) + '_' + after_dash
            c += 1
            new_lines.append(new_line)
        else:
            new_lines.append(line)

with open('test_file.txt', 'w') as f:
    f.writelines(new_lines)
Also you can have a look at this split tutorial for more information about how to use split.
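For example, a maxsplit argument of 1 keeps everything after the first underscore intact:

line = '>12_DL_2021.1214\n'
print(line.split('_', 1))  # ['>12', 'DL_2021.1214\n']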
The title isn't big enough for me to explain this so here it goes:
I have a csv file looking something like this:
Example csv contents:
long string with some special characters , number, string, number
long string with some special characters , number, string, number
long string with some special characters , number, string, number
long string with some special characters , number, string, number
I want to go through the first column and, if the length of the string is greater than 20, do this:
LINE 20: long string with som, e special characters
split the string, modify the first csv with the first part of the string, then create a new csv and add the other part on the same line number, leaving the rest just whitespace.
What I have for now is shown below. This first part doesn't do anything right now; it's just what I wrote to explain the problem to myself and figure out how I could write to a new file with splitString:
fileName = file name
maxCollumnLength = number of rows in the whole set
lineNum = line number of a string that is greater than 20
splitString = second part of the split string that should be written to another file

def newopenfile(fileName, maxCollumnLength, lineNum, splitString):
    with open(fileName, 'rw', encoding="utf8") as nf:
        writer = csv.writer(fileName, quoting=csv.QUOTE_NONE)
        for i in range(0, maxCollumnLength-1):
            # write whitespace until reaching the lineNum of a string that's bigger
            # than 20, then write that part of the string to a csv
This goes through the first column and checks the length:
fileName = 'uskrs.csv'
firstColList = []  # an empty list to store the first column
splitString = []
i = 0
with open(fileName, 'rw', encoding="utf8") as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        if len(row[0]) > 20:
            i += 1
            # split the row and pass the other end of it to
            # newopenfile(fileName, len(reader), i, splitString)
            # print(row[0])  # for debugging
            firstColList.append(row[0])
From this point I am stuck on how to actually change the string in the CSV and how to split it.
The string could also have 60+ characters, so it would need splitting more than twice and storing in more than two CSVs.
I'm not great at explaining the problem, so if you have any questions please do ask.
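For the repeated splitting, slicing with a step is the usual idiom; here is a minimal sketch on a made-up string, assuming a cut length of 20:

text = "some very long first-column string that exceeds twenty characters"
chunks = [text[i:i+20] for i in range(0, len(text), 20)]
print(chunks)
# ['some very long first', '-column string that ', 'exceeds twenty chara', 'cters']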
Okay, so I was successful in dividing the first column if its length is greater than the limit, and replacing the first column with the first chunk:
import csv
def checkLength(column, readFile, writeFile, maxLen):
    counter = 0
    i = 0
    idxSplitItems = []
    final = []
    newSplits = 0
    with open(readFile, 'r', encoding="utf8", newline='') as f:
        reader = csv.reader(f)
        your_list = list(reader)
    final = your_list
    for sublist in your_list:
        # del sublist[-1]  # remove last invisible element
        i += 1
        data = removeUnwanted(sublist[column])
        print(data)
        if len(data) > maxLen:
            counter += 1  # number of long entries
            idxSplitItems.append(split_bylen(i, data, maxLen))
            if len(idxSplitItems) > newSplits:
                newSplits = len(idxSplitItems)
            final[i-1][column] = split_bylen(i, data, maxLen)[1]
            final[i-1][column] = removeUnwanted(final[i-1][column])
        print("After split data: " + data)
        print("After split final: " + final[i-1][column])
    writeSplitToCSV(writeFile, final)
    checkCols(final, 6)
    return final, idxSplitItems

def removeUnwanted(data):
    data = data.replace(',', ' ')
    return data

def split_bylen(index, item, maxLen):
    clean = removeUnwanted(item)
    splitList = [clean[ind:ind+maxLen] for ind in range(0, len(item), maxLen)]
    splitList.insert(0, index)
    return splitList

def writeSplitToCSV(writeFile, data):
    with open(writeFile, 'w', encoding="utf8", newline='') as f:
        writer = csv.writer(f)
        writer.writerows(data)

def checkCols(data, columns):
    for sublist in data:
        if len(sublist) - 1 != columns:
            print("[X] This row doesn't have the same number of columns as the others: " + str(sublist))
        else:
            print("All okay")

# len(data)  # how many split items
# print(your_list[0][0])
# print("Number of large: ", counter)

final, idxSplitItems = checkLength(0, 'test.csv', 'final.csv', 30)
print("------------------------")
print(idxSplitItems)
print("-------------------------")
print(final)
Now I have a problem with this part of the code, notice this:
print("After split data: "+ data)
print("After split final: "+ final[i-1][column])
This is to check whether removing the comma worked. With the example
"BUTKOVIĆ VESNA , DIPL.IUR."
data returns
BUTKOVIĆ VESNA DIPL.IUR.
but final returns
BUTKOVIĆ VESNA , DIPL.IUR.
Why does my final have the "," again when it's gone from data? It must be something in split_bylen() that makes it do that.
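A hedged guess, based on the code as posted: str.replace returns a new string rather than changing the list in place, and the cleaned value is only written back into final inside the len(data) > maxLen branch, so rows shorter than maxLen keep their original comma. A tiny demonstration:

row = ["BUTKOVIĆ VESNA , DIPL.IUR."]    # shorter than maxLen=30
data = row[0].replace(',', ' ')         # a new string, comma replaced
print(data)      # BUTKOVIĆ VESNA   DIPL.IUR.
print(row[0])    # BUTKOVIĆ VESNA , DIPL.IUR.  <- the list still holds the original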
Dictionaries are fun!
To overwrite the original csv see here. You would have to use DictReader & DictWriter. I keep your method of reading just for clarity.
writecsvs = {} #store each line of each new csv
# e.g. {'csv1':[[row0_split1,row0_num,row0_str,row0_num],[row1_split1,row1_num,row1_str,row1_num],...],
# 'csv2':[[row0_split2,row0_num,row0_str,row0_num],[row1_split2,row1_num,row1_str,row1_num],...],
# .
# .
# .}
with open(fileName, mode='r', encoding="utf-8-sig") as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        col1 = row[0]
        # check size & split
        # decide number of new csvs
        # overwrite original csv
        # store new content in writecsvs dict

for key in writecsvs:  # loop over each csv in writecsvs
    writelines = writecsvs[key]  # get the list of lines
    out_file = open('csv1.csv', mode='w')  # use the keys in writecsvs for filenames
    for line in writelines:
        out_file.write(line)
Hope this helps.
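A hypothetical filling-in of the sketch above, assuming a 20-character cut and the uskrs.csv filename from the question; treat it as a starting point rather than a finished solution:

import csv

CHUNK = 20
with open('uskrs.csv', encoding='utf-8-sig') as rf:
    rows = list(csv.reader(rf))

# the longest first column decides how many output files are needed
n_files = max((len(r[0]) + CHUNK - 1) // CHUNK for r in rows)

for n in range(n_files):
    with open('csv%d.csv' % (n + 1), 'w', newline='') as out:
        writer = csv.writer(out)
        for r in rows:
            piece = r[0][n*CHUNK:(n+1)*CHUNK]  # '' once the string is exhausted
            # the first file keeps the remaining columns; later files hold only the piece
            writer.writerow(([piece] + r[1:]) if n == 0 else [piece])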
I have a project that I've been working on for my intro computer science class and I'm not quite sure what I'm doing wrong in my current implementation. I'm supposed to read data from a CSV, which includes data of customers of a fictional travel company. I'm then supposed to open a .txt file template and replace the placeholders (in the format [[placeholder]]) with the data in the CSV, then save a new txt file for each customer with the proper replacements, as if I were sending a new email to each customer in the CSV.
I was able to load the CSV and put the data in the CSV into an array, while the headers of the CSV (which are identical to the placeholders, just not in double brackets) are in a list:
file_obj = open(PATH_SAVE_DIR + csv_filename, newline='')
reader = csv.DictReader(file_obj)
headers = reader.fieldnames # list of headers
file_obj.close()
customerdata = []
with open(PATH_SAVE_DIR + csv_filename, 'r') as inf:
    reader = csv.reader(inf)
    row = next(reader)  # skip the header row
    for row in reader:
        customerdata.append(row)
This gives me the following array, with the data from the CSV:
[['James', 'Butt', '6649 N Blue Gum St', 'New Orleans', 'Orleans', 'LA', '70116', '504-621-8927', 'jbutt@gmail.com', 'gold'], ['Josephine', 'Darakjy', '4 B Blue Ridge Blvd', 'Brighton', 'Livingston', 'MI', '48116', '810-292-9388', 'josephine_darakjy@darakjy.org', 'silver'], ['Art', 'Venere', '8 W Cerritos Ave #54', 'Bridgeport', 'Gloucester', 'NJ', '8014', '856-636-8749', 'art@venere.org', 'bronze']]
The part I'm having difficulty understanding is the replacement of the placeholders with the customer's data. My current code is able to replace the data in the file, but it only does it with the first set of data, and never seems to count up in the array and work with the next customer's data:
file_obj = open(PATH_SAVE_DIR + EMAIL_TEMPLATE, 'r')
file_input = file_obj.read()
count = 0
for customer in range(len(customerdata)):
    customernumber = str(customer + 1)
    while count < len(customerdata):
        for word in headers:
            if word in file_input:
                index = headers.index(word)
                file_input = file_input.replace("[[" + word + "]]",
                                                customerdata[count][index])
        count += 1
    file_output = open(PATH_SAVE_DIR + EMAIL + customernumber + ".txt", 'w')
    file_output.write(file_input)
    file_output.close()
file_obj.close()
So it successfully creates email1.txt, email2.txt and email3.txt, but all three files are identical and only include the replaced data of the first customer. I tried putting print(count) in different spots around my for word in headers loop, and it seems it only runs that loop once, with the first customer, and never attempts it again as the count goes up (for the next customer in the array). How can I repeat this loop for each customer?
I suppose the string replacement can be done with regular expression substitution as well, but we haven't been taught regular expressions yet, so I'm still a bit unclear as to how I would implement it using those.
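For reference, here is a hedged sketch of the regex route (fill_template is a made-up helper name, not something from the assignment):

import re

def fill_template(template, headers, values):
    # swap every [[header]] token for the matching customer value
    def lookup(match):
        return values[headers.index(match.group(1))]
    return re.sub(r"\[\[(.+?)\]\]", lookup, template)

# e.g. fill_template("Dear [[first_name]]", ["first_name"], ["James"]) -> "Dear James"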
Edit: I was able to come up with working code. I still have to tidy the formatting, but at least it's now doing what I want it to do:
file_obj = open(PATH_SAVE_DIR + EMAIL_TEMPLATE, 'r')
file_input = file_obj.read()
file_obj.close()
for customer in range(len(customerdata)):
    customernumber = str(customer + 1)
    for word in headers:
        if word in file_input:
            index = headers.index(word)
            replaced = file_input.replace("[[" + word + "]]",
                                          customerdata[customer][index])
            file_input = replaced
    with open(PATH_SAVE_DIR + EMAIL + customernumber + ".txt", 'w') \
            as file_output:
        file_output.write(replaced)
    with open(PATH_SAVE_DIR + EMAIL_TEMPLATE, 'r') as file_obj:
        file_input = file_obj.read()
The issue is that you have two loop structures where you only want one. The first one is for customer in range(len(customerdata)):, and the second one is while count < len(customerdata):. So basically you're looping over all the customers twice. The file-write statement is embedded in one loop but not in the other, so your script is reading data from customers 0, 1, 2, and writing data[2] to the file, reading 0, 1, 2, and writing data[2] to the file... Does that help?
Edit: The other issue is that you're overwriting your template with your messages, so it stops functioning as a template after the first overwrite.
I am trying to analyze a text file with data in columns and records.
My file:
Name Surname Age Sex Grade
Chris M. 14 M 4
Adam A. 17 M
Jack O. M 8
The text file has some empty fields, as above.
The user wants to show Name and Grade:
import csv
with open('launchlog.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split() for line in stripped if line)
    with open('log.txt', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Name', 'Surname', 'Age', 'Sex', 'Grade'))
        writer.writerows(lines)
log.txt:
Chris,M.,14,M,4
Adam,A.,17,M
Jack,O.,M,8
How can I insert the string "None" for the empty data?
For example:
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
What would be the best way to do this in Python?
Use pandas:
import pandas
data=pandas.read_fwf("file.txt")
To get your dictionary:
data.set_index("Name")["Grade"].to_dict()
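If the empty cells should also come out as the literal string "None" (as in the expected output above), one hedged pandas sketch, untested against the exact file:

import pandas

data = pandas.read_fwf("file.txt")  # infers the fixed-width columns
data.fillna("None").to_csv("log.txt", index=False, header=False)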
Here's something in Pure Python™ that seems to do what you want, at least on the sample data file in your question.
In a nutshell, it first determines where each field name in the column-header line starts and ends. Then, for each remaining line of the file, it does the same thing, producing a second list that is used to determine which column each data item in the line sits underneath; each item is then put in its proper position in the row written to the output file.
import csv
def find_words(line):
    """ Return a list of (start, stop) tuples with the indices of the
        first and last characters of each "word" in the given string.
        Any sequence of consecutive non-space characters is considered
        as comprising a word.
    """
    line_len = len(line)
    indices = []
    i = 0
    while i < line_len:
        start, count = i, 0
        while line[i] != ' ':
            count += 1
            i += 1
            if i >= line_len:
                break
        indices.append((start, start+count-1))
        while i < line_len and line[i] == ' ':  # advance to start of next word
            i += 1
    return indices

# convert text file with missing fields to csv
with open('name_grades.txt', 'rt') as in_file, open('log.csv', 'wt', newline='') as out_file:
    writer = csv.writer(out_file)
    header = next(in_file)  # read first line
    fields = header.split()
    writer.writerow(fields)
    # determine the indices of where each field starts and stops based on header line
    field_positions = find_words(header)
    for line in in_file:
        line = line.rstrip('\r\n')  # remove trailing newline
        row = ['None' for _ in range(len(fields))]
        value_positions = find_words(line)
        for (vstart, vstop) in value_positions:
            # determine what field the value is underneath
            for i, (hstart, hstop) in enumerate(field_positions):
                if vstart <= hstop and hstart <= vstop:  # overlap?
                    row[i] = line[vstart:vstop+1]
                    break  # stop looking
        writer.writerow(row)
Here's the contents of the log.csv file it created:
Name,Surname,Age,Sex,Grade
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
I would use baloo's answer over mine -- but if you just want to get a feel for where your code went wrong, the solution below mostly works (there is a formatting issue with the Grade field, but I'm sure you can get through that.) Add some print statements to your code and to mine and you should be able to pick up the differences.
import csv
<Old Code removed in favor of new code below>
EDIT: I see your difficulty now. Please try the below code; I'm out of time today so you will have to fill in the writer parts where the print statement is, but this will fulfill your request to replace empty fields with None.
import csv
with open('Test.txt', 'r') as in_file:
    with open('log.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        lines = [line for line in in_file]
        name_and_grade = dict()
        for line in lines[1:]:
            parts = line[0:10], line[11:19], line[20:24], line[25:31], line[32:]
            new_line = list()
            for part in parts:
                val = part.replace('\n', '')
                val = val.strip()
                val = val if val != '' else 'None'
                new_line.append(val)
            print(new_line)
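One possible way to fill in the writer part the answer leaves open, slotting in where the print statement is:

            # instead of print(new_line):
            writer.writerow(new_line)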
Without using pandas: edited based on your comment, I hard-coded this solution to your data. It will not work for rows that don't have the Surname column.
I'm writing out only Name and Grade, since you only need those two columns.
o = open("out.txt", 'w')
with open("inFIle.txt") as f:
for lines in f:
lines = lines.strip("\n").split(",")
try:
grade = int(lines[-1])
if (lines[-2][-1]) != '.':
o.write(lines[0]+","+ str(grade)+"\n")
except ValueError:
print(lines)
o.close()
This question is a follow-up to this question.
I've posted the solution by @tobiask here as well:
match_region = [map(str, blob.sentences[i-1:i+2])     # from prev to after next
                for i, s in enumerate(blob.sentences) # i is index, s is element
                if search_words & set(s.words)]       # same as your condition
I am having trouble exporting the match_region result. I would like to turn it into a csv with the sentences as columns and every result as a row.
This will write the contents of match_region to a file, though I haven't tested it on your code:

with open('output.csv', 'w') as f:
    for i, s in enumerate(match_region):
        f.write('"' + str(i) + '","' + '","'.join(s) + '"\n')
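A hedged alternative using the csv module, which handles the quoting and any embedded commas for you:

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for i, sentences in enumerate(match_region):
        writer.writerow([i] + list(sentences))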