Using .title() on a CSV file - python

I have a CSV file where I want:
the first letter of every name to be capitalized and
the other letters to be lowercase.
I have tried using .title().
The CSV file that I want to have the capital letters (CleanNames.csv) will be 'pulling' these names from another CSV (ValidNames.csv) which is 'pulling' those names from a list of disorganized names (10000DirtyNames.csv).
Here is what I have so far:
import re
import csv
with open("10000DirtyNames.csv", "r") as file:
    with open('ValidNames.csv', 'w+') as ValidNames_file:
        write = csv.writer(ValidNames_file, delimiter=',')
        data = file.read()
        pattern = "[A-Za-z]{1,}"
        search = re.findall(pattern, data)
        write.writerow(search)
        with open('CleanNames.csv', 'w') as CleanNames_file:
            write2 = csv.writer(CleanNames_file, delimiter=',')
            data2 = ValidNames_file.read()
            write2.writerow(data2.title())
It works except that CleanNames.csv is not being populated at all. There is no error message. What am I doing wrong?

Figured it out on my own but thought I would post my solution in case someone ever needs to solve a similar problem.
import re
import csv
with open("10000DirtyNames.csv", "r") as file:
    with open('ValidNames.csv', 'w+') as ValidNames_file:
        write = csv.writer(ValidNames_file, delimiter=',')
        data = file.read()
        pattern = "[A-Za-z]{1,}"
        search = re.findall(pattern, data)
        write.writerow(search)
with open('ValidNames.csv') as ValidNames_file, open('CleanNames.csv', 'w') as CleanNames_file:
    for name in ValidNames_file:
        CleanNames_file.write(name.title())
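For anyone wondering why the first attempt left the file empty: after write.writerow(search) the position in ValidNames_file is at the end of the file, so ValidNames_file.read() returns an empty string. Reopening the file (as in the solution) or calling ValidNames_file.seek(0) before reading gets around that. Here is a rough sketch of a single-pass variant that skips the intermediate file. The helper name clean_names and the sample input are made up for illustration, and StringIO stands in for the real CSV files:

```python
import csv
import io
import re


def clean_names(dirty_text):
    """Return every alphabetic run in dirty_text, title-cased."""
    return [name.title() for name in re.findall(r"[A-Za-z]+", dirty_text)]


# In-memory stand-ins for 10000DirtyNames.csv and CleanNames.csv
dirty = "jOHN,,mARY;;  peter\nALICE"
out = io.StringIO()
csv.writer(out).writerow(clean_names(dirty))
print(out.getvalue())
```

With real files, open the output with newline="" as the csv module documentation recommends.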

Related

python iterate over a file and replace strings

I'm using the 're' library to replace occurrences of different strings in multiple files. The replacement pattern works fine, but I'm not able to maintain the changes to the files. I'm trying to get the same functionality that comes with the following lines:
with open(KEY_FILE, mode='r', encoding='utf-8-sig') as f:
    replacements = csv.DictReader(f)
    user_data = open(temp_file, 'r').read()
    for col in replacements:
        user_data = user_data.replace(col[ORIGINAL_COLUMN], col[TARGET_COLUMN])
data_output = open(f"{temp_file}", 'w')
data_output.write(user_data)
data_output.close()
The key line here is:
user_data = user_data.replace(col[ORIGINAL_COLUMN], col[TARGET_COLUMN])
It takes care of updating the data in place using the replace method.
I need to do the same but with the 're' library:
with open(KEY_FILE, mode='r', encoding='utf-8-sig') as f:
    replacements = csv.DictReader(f)
    user_data = open(temp_file, 'r').read()
    a = open(f"{test_file}", 'w')
    for col in replacements:
        original_str = col[ORIGINAL_COLUMN]
        target_str = col[TARGET_COLUMN]
        compiled = re.compile(re.escape(original_str), re.IGNORECASE)
        result = compiled.sub(target_str, user_data)
    a.write(result)
I only end up with the last item in the .csv dict changed in the output file. Can't seem to get the changes made in previous iterations of the for loop to persist.
I know that it is pulling from the same file each time... which is why it is getting reset each loop, but I can't sort out a workaround.
Thanks
Try something like this?
#!/usr/bin/env python3
import csv
import re
import sys
from io import StringIO
KEY_FILE = '''aaa,bbb
xxx,yyy
'''
TEMP_FILE = '''here is aaa some text xxx
bla bla aaaxxx
'''
ORIGINAL_COLUMN = 'FROM'
TARGET_COLUMN = 'TO'
user_data = StringIO(TEMP_FILE).read()
with StringIO(KEY_FILE) as f:
    reader = csv.DictReader(f, ['FROM','TO'])
    for row in reader:
        original_str = row[ORIGINAL_COLUMN]
        target_str = row[TARGET_COLUMN]
        compiled = re.compile(re.escape(original_str), re.IGNORECASE)
        user_data = compiled.sub(target_str, user_data)
sys.stdout.write("modified user_data:\n" + user_data)
Some things to note:
The main problem was result = sub(..., user_data) rather than result = sub(..., result). You want to keep updating the same string, rather than always applying to the original.
The compiling of regex is fairly pointless in this case, since each is just used once.
I don't have access to your test files, so I used StringIO versions inline and printing to stdout; hopefully that's easy enough to translate back to your real code (:
In future posts, you might consider doing similar, so that your question has 100% runnable code someone else can try out without guessing.
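One more thing that may bite later: the replacement argument of sub() is treated as a template, so a backslash sequence such as \1 in the target text is read as a group reference. A sketch that sidesteps this by passing the replacement through a lambda, assuming the key file has FROM and TO header columns (those names and the function replace_all are hypothetical):

```python
import csv
import io
import re


def replace_all(text, rows, original_col="FROM", target_col="TO"):
    """Apply every replacement in rows to text, case-insensitively."""
    for row in rows:
        target = row[target_col]
        # Keep reassigning text so each pass builds on the previous one;
        # the lambda returns the target literally, with no template expansion
        text = re.sub(re.escape(row[original_col]), lambda m, t=target: t,
                      text, flags=re.IGNORECASE)
    return text


key_file = io.StringIO("FROM,TO\naaa,bbb\nxxx,y\\1y\n")  # stand-in for KEY_FILE
result = replace_all("here is AAA some text xxx", csv.DictReader(key_file))
print(result)
```

Write result to the output file once, after the loop has finished, rather than inside it.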

How to replace part of text

I am really new to coding so don't be harsh on me, since my question is probably basic. I couldn't find a way to do it.
I would like to learn how to create an automated process for creating custom links (preferably in Python).
Let me give you an example.
https://website.com/questions/ineedtoreplacethis.pdf
I have a database (text file) of names, one name one line
(Oliver
David
Donald
etc.)
I am looking for a way to automatically insert each name into the "ineedtoreplacethis" part of the link and create many custom links like that at once.
Thank you in advance for any help.
f-string is probably the way to go.
Here is an example:
names = ['Olivier', 'David', 'Donald']
for name in names:
    print(f"{name}.txt")
Output:
Olivier.txt
David.txt
Donald.txt
You can do this using string concatenation, as explained below. This assumes you already have the data from the text file; how to read it is explained in the later part of the answer.
a= "Foo"
b= "bar"
a+b will return
"Foobar"
In your case,
original_link = "https://website.com/questions/"
sub_link = "ineedtoreplacethis.pdf"
out = original_link + sub_link
The value of out will be as you required.
To get the sub_link from your text file, read the text file as:
with open("database.txt", "r") as file:
    data = file.read().splitlines()  # one name per line, line endings removed
Once you have the data, which is a list of all the values from your text file, you can iterate over it with a loop.
for sub_link in data:
    search_link = original_link + sub_link + ".pdf"
    # Use this search_link to do your further operations
Use a formatted string
filename = "test.txt"
with open(filename) as my_file:
    lines = my_file.read().splitlines()  # drops the trailing newline of each line
for i in lines:
    print(f"https://website.com/questions/{i}.pdf")
Explanation:
Read the text file into a list of lines
Iterate over the list using a for loop
Print each link using a formatted string
Consider file textFile.txt as
Oliver
David
Donald
You can simply loop over the names in the file as
with open("textFile.txt", "r") as f:
    name_list = f.read().splitlines()  # avoids an empty entry if the file ends with a newline
link_prefix = 'https://website.com/questions/'
link_list = []
for word in name_list:
    link_list.append(link_prefix + word + '.pdf')
print(link_list)
This will print output as (ie. contents of link_list is):
['https://website.com/questions/Oliver.pdf', 'https://website.com/questions/David.pdf', 'https://website.com/questions/Donald.pdf']
from pprint import pprint as pp
import re
url = "https://website.com/questions/ineedtoreplacethis.pdf"
pattern = re.compile(r"/(\w+)\.pdf") # search the end of the link with a regex
sub_link = pattern.search(url).group(1) # find the part to replace
print(f"{sub_link = }")
names = ["Oliver", "David", "Donald"] # text file content loaded into list
new_urls = []
for name in names:
    new_url = url.replace(sub_link, str(name))
    new_urls.append(new_url)
pp(new_urls) # Print out the formatted links to the console
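Putting the answers together, here is a small sketch that reads the names from a file and builds one link per name. The helper build_links is hypothetical, and urllib.parse.quote is added on the assumption that some names might contain spaces or other characters that are not URL-safe:

```python
import os
import tempfile
from urllib.parse import quote


def build_links(path, prefix="https://website.com/questions/", suffix=".pdf"):
    """Return one link per non-empty line of the file at path."""
    with open(path, encoding="utf-8") as f:
        names = [line.strip() for line in f if line.strip()]
    return [f"{prefix}{quote(name)}{suffix}" for name in names]


# Demo with a temporary file standing in for the real name list
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Oliver\nDavid\nMary Ann\n")
links = build_links(tmp.name)
os.remove(tmp.name)
print("\n".join(links))
```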

Python - Extract Code from Text using Regex

I am a Python beginner and looking for help with an extraction problem.
I have a bunch of text files and need to extract all special combinations of an expression ("C" + "exactly 9 numerical digits") and write them to a file, including the filename of the text file. Each occurrence of the expression I want to catch starts at the beginning of a new line and ends with a "\n".
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
What the output should look like (in the output file)
My_desired_output_file = "filename,C123456789,C987654321"
My code so far:
min_file_size = 5
def list_textfiles(directory, min_file_size):  # Creates a list of all files stored in DIRECTORY ending on '.txt'
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            if os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)
for filename in list_textfiles(temp_directory, min_file_size):
    string = str(filename)
    text = infile.read()
    regex = ???
    with open(filename, 'w', encoding="utf-8") as outfile:
        outfile.write(regex)
Your regex is '^C[0-9]{9}$' (pass the re.MULTILINE flag if you search the whole text at once rather than one line at a time):
^ start of line
C exact match
[0-9] any digit
{9} 9 times
$ end of line
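To apply that pattern to a whole text at once, re.MULTILINE makes ^ and $ match at every line boundary. A minimal sketch with shortened sample text, plus two made-up lines that should not match:

```python
import re

sample_text = """Some random text here
C123456789
some random text here
C987654321
XC111111111
C1234567890
and here"""

# re.MULTILINE makes ^ and $ match at the start and end of each line
codes = re.findall(r"^C[0-9]{9}$", sample_text, flags=re.MULTILINE)
print(codes)
```

The line starting with X and the line with ten digits are both skipped because of the anchors.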
import re
regex = re.compile(r'^C\d{9}$')  # raw string; $ rejects lines with extra characters
matches = []
with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if regex.match(line):
            matches.append(line)
You can then write this list to a file as needed.
How about:
import re
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
k = re.findall(r'(C\d{9})', sample_text)
print(k)
This will return all occurrences of that pattern. To do the same across several files, read each file and store the matches under its file name. Something like:
Updated:
import glob
import os
import re
search = {}
os.chdir('/FolderWithTxTs')
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        data = re.findall(r'(C\d{9})', f.read())  # flat list of matches for this file
    search.update({f.name: data})
print(search)
This would return a dictionary with file names as keys and a list of found matches.
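To get the "filename,C123456789,C987654321" layout asked for in the question, one option is to write one CSV row per text file. A sketch, assuming all .txt files sit in one folder; the function name write_codes and the output file name are placeholders:

```python
import csv
import glob
import os
import re
import tempfile

CODE = re.compile(r"^C\d{9}$", re.MULTILINE)


def write_codes(folder, out_path):
    """Write one row per .txt file: file name, then every code found in it."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
            with open(path, encoding="utf-8") as f:
                writer.writerow([os.path.basename(path)] + CODE.findall(f.read()))


# Demo in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "a.txt"), "w", encoding="utf-8") as f:
        f.write("text\nC123456789\nmore text\nC987654321\n")
    out_path = os.path.join(folder, "codes.csv")
    write_codes(folder, out_path)
    with open(out_path, encoding="utf-8") as f:
        output = f.read()
print(output)
```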

Remove RE: and FW: from a CSV file then save the output as an object

I am working with a csv file and attempting to remove "RE:" and "FW:" from a subject line so I can further summarize data on email conversations. With my current code I get the error message "TypeError: expected string or bytes-like object". Any advice on how I may execute this change and then save the output as an object that I can further manipulate? I am new to python, have been searching for similar solutions, but any input at all would be greatly appreciated.
import csv
import re
f = open('examplefile.csv',"r+")
p = re.compile('([\[\(] *)?.*(RE?S?|FWD?|Fwd?|re\[\d+\]?) *([-:;)\]][ :;\])-]*)|\]+ *$', re.IGNORECASE)
data = csv.reader(f)
p.sub("",data)
for row in data:
    print(row)
In your code, data is a csv.reader object, but not the actual contents of your file. My guess is that you want to strip out the 'RE' and 'FW' from one field in your csv file.
If the subject line is the 3rd column (index 2 in Python) in your csv file, you could do:
import csv
import re
p = re.compile(r'([\[\(] *)?.*(RE?S?|FWD?|Fwd?|re\[\d+\]?) *([-:;)\]][ :;\])-]*)|\]+ *$', re.IGNORECASE)
with open('examplefile.csv', "r+") as f:
    f_reader = csv.reader(f)
    for row in f_reader:
        subject = p.sub("", row[2])  # clean the 3rd column
        print(subject)
You need to replace the row data, not the reader object.
For example
p = re.compile(r'([\[\(] *)?.*(RE?S?|FWD?|Fwd?|re\[\d+\]?) *([-:;)\]][ :;\])-]*)|\]+ *$', re.IGNORECASE)
with open('examplefile.csv', "r+") as f:
    data = csv.reader(f)
    for row in data:
        print(p.sub("", row[0]))
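If the goal is only to drop leading reply/forward markers, a narrower pattern may be easier to reason about than the one in the question. A sketch under that assumption; the pattern, the sample rows and the column index are illustrative and not taken from the real file. The cleaned subjects end up in a list that can be processed further:

```python
import csv
import io
import re

# One or more leading "RE:", "FW:" or "FWD:" markers, in any letter case
PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)

# In-memory stand-in for examplefile.csv
sample = io.StringIO(
    "date,sender,subject\n"
    "2020-01-01,a@example.com,RE: FW: Budget review\n"
    "2020-01-02,b@example.com,Fwd: re: Lunch\n"
    "2020-01-03,c@example.com,Reminder: pay rent\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
subjects = [PREFIX.sub("", row[2]) for row in reader]
print(subjects)
```

A subject such as "Reminder: pay rent" is left alone because the marker has to be followed directly by a colon.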

How to replace a list of special characters in a csv in python

I have some csv files that may or may not contain characters like “”à that are undesirable, so I want to write a simple script that will feed in a csv and feed out a csv (or its contents) with those characters replaced with more standard characters, so in the example:
bad_chars = '“”à'
good_chars = '""a'
The problem so far is that my code seems to produce a csv with perhaps the wrong encoding? Any help would be appreciated in making this simpler and/or making sure my output csv doesn't force an incorrect regex encoding--maybe using pandas?
Attempt:
import csv, string, sys
upload_path = sys.argv[1]
input_file = open('{}'.format(upload_path), 'rb')
upload_csv = open('{}_fixed.csv'.format(upload_path.strip('.csv')), 'wb')
data = csv.reader(input_file)
writer = csv.writer(upload_csv, quoting=csv.QUOTE_ALL)
in_chars = '\xd2\xd3'
out_chars = "''"
replace_list = string.maketrans(in_chars, out_chars)
for line in input_file:
    line = str(line)
    new_line = line.translate(replace_list)
    writer.writerow(new_line.split(','))
input_file.close()
upload_csv.close()
As you stamped your question with the pandas tag - here is a pandas solution:
import pandas as pd
(pd.read_csv('/path/to/file.csv')
   .replace(r'RegEx_search_for_str', r'RegEx_replace_with_str', regex=True)
   .to_csv('/path/to/fixed.csv', index=False)
)
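Without pandas, the same kind of replacement can be done in Python 3 with str.maketrans and str.translate. A sketch, assuming the input is UTF-8 encoded (pass a different encoding if it is not); the function name fix_csv and the file names are made up:

```python
import csv
import os
import tempfile

# Map each unwanted character to its plain replacement
TABLE = str.maketrans("“”à", '""a')


def fix_csv(src, dst, encoding="utf-8"):
    """Copy src to dst, translating characters in every field."""
    with open(src, newline="", encoding=encoding) as fin, \
            open(dst, "w", newline="", encoding=encoding) as fout:
        writer = csv.writer(fout, quoting=csv.QUOTE_ALL)
        for row in csv.reader(fin):
            writer.writerow([field.translate(TABLE) for field in row])


# Demo with temporary files
with tempfile.TemporaryDirectory() as folder:
    src = os.path.join(folder, "in.csv")
    dst = os.path.join(folder, "in_fixed.csv")
    with open(src, "w", encoding="utf-8") as f:
        f.write("id,text\n1,voilà “quoted”\n")
    fix_csv(src, dst)
    with open(dst, encoding="utf-8") as f:
        fixed = f.read()
print(fixed)
```

Opening both files in text mode with an explicit encoding is what keeps the output from ending up in a different encoding than the input.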
