.split(",") separating every character of a string - python

At some point in the program I take the user's text input and split it on its commas, and then I ",".join it again when writing it to a txt file. The idea is to have a list with all the comma-separated information.
The problem is that, apparently, when I ",".join it, every single character ends up separated by commas. So if I've got the string info1,info2 it is split into info1 | info2, but when joined back again it ends up like i,n,f,o,1,,,i,n,f,o,2, which is highly inconvenient, since the program later reads the text back from the txt file to show it to the user. Can anyone help me with that?
categories = open('c:/digitalLibrary/' + connectedUser + '/category.txt', 'a')
categories.write(BookCategory + '\n')
categories.close()
categories = open('c:/digitalLibrary/' + connectedUser + '/category.txt', 'r')
categoryList = categories.readlines()
categories.close()
for category in BookCategory.split(','):
    for readCategory in lastReadCategoriesList:
        if readCategory.split(',')[0] == category.strip():
            count = int(readCategory.split(',')[1])
            count += 1
            i = lastReadCategoriesList.index(readCategory)
            lastReadCategoriesList[i] = category.strip() + "," + str(count).strip()
            isThere = True
    if not isThere:
        lastReadCategoriesList.append(category.strip() + ",1")
    isThere = False
lastReadCategories = open('c:/digitalLibrary/' + connectedUser + '/lastReadCategories.txt', 'w')
for category in lastReadCategoriesList:
    if category.split(',')[0] != "" and category != "":
        lastReadCategories.write(category + '\n')
lastReadCategories.close()
global finalList
finalList.append({"Title":BookTitle + '\n', "Author":AuthorName + '\n', "Borrowed":IsBorrowed + '\n', "Read":readList[len(readList)-1], "BeingRead":readingList[len(readingList)-1], "Category":BookCategory + '\n', "Collection":BookCollection + '\n', "Comments":BookComments + '\n'})
finalList = sorted(finalList, key=itemgetter('Title'))
for i in range(len(finalList)):
    categoryList[i] = finalList[i]["Category"]
    toAppend = (str(i + 1) + ".").ljust(7) + finalList[i]['Title'].strip()
    s.append(toAppend)
categories = open('c:/digitalLibrary/' + connectedUser + '/category.txt', 'w')
for i in range(len(categoryList)):
    categories.write(",".join(categoryList[i]))
categories.close()

str.join() expects a list (or another iterable of strings), but you are passing it a single string: each categoryList[i] is one string, not a list.
Strings are sequences too, so ','.join() treats every character as a separate element:
>>> ','.join('Hello world')
'H,e,l,l,o, ,w,o,r,l,d'
>>> ','.join(['Hello', 'world'])
'Hello,world'
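As a minimal sketch of how this applies to the question, assume each category entry arrives as one comma-separated string. The helper name format_categories is hypothetical and not part of the asker's program; it splits the string into a list first, so join() receives whole items instead of characters.

```python
def format_categories(raw):
    """Split a comma-separated string into items, then join the items back."""
    # Hypothetical helper: strip whitespace and drop empty items.
    parts = [part.strip() for part in raw.split(",") if part.strip()]
    return ",".join(parts)


# Joining the string itself separates every character.
print(",".join("info1,info2"))

# Joining the list produced by split() keeps the items whole.
print(format_categories("info1, info2\n"))
```

If the stored string is already in the right shape, writing it to the file unchanged (without any join) would work just as well.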

Related

Simple way to remove duplicate whitespaces and remove all \n efficiently

I have a file called test.txt that contains HTML with a lot of duplicate whitespace. I want to remove all the unnecessary whitespace to reduce the size of the file. How can I remove the duplicate spaces and put the entire string on one line?
test.txt
<center>
<b class="test" >My name
is
fred</ b> <center>
What I want to print
<center><b class="test">My name is fred</b><center>
What gets printed
<center><b class="test" >Mynameisfred</b> <center>
program.py
def is_white_space(before, curr, after):
    # remove duplicate spaces
    if (curr == " " and (before == " " or after == " ")):
        return True
    # Remove all \n
    elif (curr == "\n"):
        return True
    return False
f = open('test.txt', 'r')
contents = f.read()
f.close()
new = ""
i = 0
while (i < len(contents)):
    if (i != 0 and
            i != (len(contents) - 1) and
            not is_white_space(contents[i - 1], contents[i], contents[i + 1])):
        new += contents[i]
    i += 1
print(new)
This will leave a space between digits or letters.
from string import ascii_letters, digits
def main():
    with open('test.txt', 'r') as f:
        parts = f.read().split()
    keep_separated = set(ascii_letters) | set(digits)
    for i in range(len(parts) - 1):
        if parts[i][-1] in keep_separated and parts[i + 1][0] in keep_separated:
            parts[i] = parts[i] + " "
    print(''.join(parts))
if __name__ == '__main__':
    main()
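The same idea can be tried without a file. The sketch below applies the answer's rule to a string; the function name squeeze_whitespace and the sample text (modelled on the test.txt shown in the question) are assumptions made for illustration.

```python
from string import ascii_letters, digits

# Characters that must stay separated by one space when they meet.
KEEP_SEPARATED = set(ascii_letters) | set(digits)


def squeeze_whitespace(text):
    """Drop all whitespace except a single space between two alphanumerics."""
    parts = text.split()  # split() with no argument splits on any whitespace run
    for i in range(len(parts) - 1):
        if parts[i][-1] in KEEP_SEPARATED and parts[i + 1][0] in KEEP_SEPARATED:
            parts[i] += " "
    return "".join(parts)


sample = '<center>\n<b class="test"  >My  name\nis\nfred</ b>   <center>\n'
print(squeeze_whitespace(sample))
```

Note that this treats the input as plain text, so whitespace inside attribute values or preformatted elements would be squeezed as well.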

how to find out how many lines are in a variable

I'm writing a Telegram bot and I want to show strings in an inline keyboard. I have a variable text whose content can change, and I want to check how many items (strings) it contains so that I can apply restrictions such as 0<name<2. How can I do that?
I tried to do it with len(), but I get a "list index out of range" error.
text="head,hand,..."
selectKeyboard = telebot.types.InlineKeyboardMarkup( row_width=1)
if 0<name<2:
    for i in range(len(text)):
        one=types.InlineKeyboardButton(text=str(text[0]['name']),callback_data="first")
        selectKeyboard.add(one)
if 0<name<3:
    for i in range(len(text)):
        one=types.InlineKeyboardButton(text=str(text[0]['name'])+" ",callback_data="first")
        two=types.InlineKeyboardButton(text=str(text[1]['name'])+" ",callback_data="second")
        selectKeyboard.add(one,two)
if 0<name<4:
    for i in range(len(text)):
        one=types.InlineKeyboardButton(text=str(text[0]['name'])+" ",callback_data="first")
        two=types.InlineKeyboardButton(text=str(text[1]['name'])+" ",callback_data="second")
        three = types.InlineKeyboardButton(text=str(text[2]['name']) + " " ,callback_data="three")
        selectKeyboard.add(one,two,three)
if 0<name<5:
    for i in range(len(text)):
        one=types.InlineKeyboardButton(text=str(text[0]['name'])+" ",callback_data="first")
        two=types.InlineKeyboardButton(text=str(text[1]['name'])+" ",callback_data="second")
        three = types.InlineKeyboardButton(text=str(text[2]['name']) + " " ,callback_data="three")
        four = types.InlineKeyboardButton(text=str(text[3]['name']) + " " , callback_data="four")
        selectKeyboard.add(one,two,three,four)
if 0<name<6:
    for i in range(len(text)):
        one=types.InlineKeyboardButton(text=str(text[0]['name'])+" ",callback_data="first")
        two=types.InlineKeyboardButton(text=str(text[1]['name'])+" ",callback_data="second")
        three = types.InlineKeyboardButton(
            text=str(text[2]['name']) + " " ,
            callback_data="three")
        four = types.InlineKeyboardButton(
            text=str(text[3]['name']) + " " ,
            callback_data="four")
        five=types.InlineKeyboardButton(
            text=str(text[4]['name']) + " " ,
            callback_data="five")
        selectKeyboard.add(one, two, three, four,five)
The following expression does not do what you expect. name is a string and you are comparing it against integers: in Python 2 this simply evaluates to False, and in Python 3 it raises a TypeError.
0 < name < 2
Instead, test the number of comma-separated items, like this:
text = "head,hand,..."
num_vars = len(text.split(','))
if 0 < num_vars < 2:
    pass  # build the keyboard here; same for all the other comparisons
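Counting the items also makes the repeated if blocks unnecessary. The sketch below is plain Python and does not call the Telegram library; split_items, button_specs and the item_N callback names are hypothetical, and each (label, callback_data) pair would still have to be turned into a button by the bot code.

```python
def split_items(text):
    """Split a comma-separated string into stripped, non-empty items."""
    return [part.strip() for part in text.split(",") if part.strip()]


def button_specs(text, max_buttons=5):
    """Return one (label, callback_data) pair per item, up to max_buttons."""
    items = split_items(text)[:max_buttons]
    # The callback names are made up for this sketch.
    return [(label, "item_{}".format(index))
            for index, label in enumerate(items, start=1)]


text = "head,hand,foot"
print(len(split_items(text)))  # how many items the variable holds
print(button_specs(text))
```

Because the list is built from the items that actually exist, indexing past the end (the "list index out of range" error) cannot happen.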

Python - surround instances of strings in a given list that are present in another string with HTML

I've written a function that surrounds a search term with a HTML element with given attributes. The idea is the resulting surrounded string is written to a log file later on with the search term highlighted.
def inject_html(needle, haystack, html_element="span", html_attrs={"class":"matched"}):
    # Find all occurrences of a given string in some text
    # Surround the occurrences with a HTML element and given HTML attributes
    new_str = haystack
    start_index = 0
    while True:
        try:
            # Get the bounds
            start = new_str.lower().index(needle.lower(), start_index)
            end = start + len(needle)
            # Needle is present, compose the HTML to inject
            html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
            html_close = "</" + html_element + ">"
            new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
            start_index = end + len(html_close) + len(html_open)
        except ValueError as ex:
            # String doesn't occur in text after index, break loop
            break
    return new_str
I want to open this up to accept a list of needles, locating each of them in the haystack and surrounding it with HTML. I could easily do this by wrapping the code in another loop that iterates through the needles. The problem is that this doesn't protect against accidentally surrounding previously injected HTML code, e.g.
def inject_html(needles, haystack, html_element="span", html_attrs={"class":"matched"}):
    # Find all occurrences of a given string in some text
    # Surround the occurrences with a HTML element and given HTML attributes
    new_str = haystack
    for needle in needles:
        start_index = 0
        while True:
            try:
                # Get the bounds
                start = new_str.lower().index(needle.lower(), start_index)
                end = start + len(needle)
                # Needle is present, compose the HTML to inject
                html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
                html_close = "</" + html_element + ">"
                new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
                start_index = end + len(html_close) + len(html_open)
            except ValueError as ex:
                # String doesn't occur in text after index, break loop
                break
    return new_str
search_strings = ["foo", "pan", "test"]
haystack = "Foobar"
print(inject_html(search_strings,haystack))
<s<span class="matched">pan</span> class="matched">Foo</span>bar
On the second iteration, the code searches for and surrounds the "pan" text from the "span" that was inserted in the previous iteration.
How would you recommend I change my original function to look for a list of needles without the risk of injecting HTML into undesired locations (such as within existing tags).
--- UPDATE ---
I got around this by maintaining a list of "immune" ranges (ones which have already been surrounded with HTML and therefore do not need to be checked again).
def inject_html(needles, haystack, html_element="span", html_attrs={"class":"matched"}):
    # Find all occurrences of a given string in some text
    # Surround the occurrences with a HTML element and given HTML attributes
    immune = []
    new_str = haystack
    for needle in needles:
        next_index = 0
        while True:
            try:
                # Get the bounds
                start = new_str.lower().index(needle.lower(), next_index)
                end = start + len(needle)
                if not any([(x[0] > start and x[0] < end) or (x[1] > start and x[1] < end) for x in immune]):
                    # Needle is present, compose the HTML to inject
                    html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
                    html_close = "</" + html_element + ">"
                    new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
                    next_index = end + len(html_close) + len(html_open)
                    # Add the highlighted range (and HTML code) to the list of immune ranges
                    immune.append([start, next_index])
                else:
                    # Match touches an immune range: move past it so the loop keeps advancing
                    next_index = end
            except ValueError as ex:
                # String doesn't occur in text after index, break loop
                break
    return new_str
It's not particularly Pythonic though, I'd be interested to see if anyone can come up with something cleaner.
I'd use something like this:
def inject_html(phrases, text_body, html_element_name="span", html_attrs={"class":"matched"}):
    new_text_body = []
    # Note the " " between the element name and its attributes
    html_start_tag = "<" + html_element_name + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
    html_end_tag = "</" + html_element_name + ">"
    text_body_lines = text_body.split("\n")
    for line in text_body_lines:
        for p in phrases:
            if line.lower() == p.lower():
                line = html_start_tag + p + html_end_tag
                break
        new_text_body.append(line)
    return "\n".join(new_text_body)
It goes through line by line and replaces each line if the line is an exact match (case-insensitive).
ROUND TWO:
With the requirement that the match needs to (1) be case-insensitive and (2) cover multiple words/phrases on each line, I would use:
import re
def inject_html(phrases, text_body, html_element_name="span", html_attrs={"class": "matched"}):
    html_start_tag = "<" + html_element_name + " " + " ".join(["%s=\"%s\"" % (k, html_attrs[k]) for k in html_attrs]) + ">"
    html_end_tag = "</" + html_element_name + ">"
    for p in phrases:
        # re.escape() keeps characters such as "." or "+" in a phrase from being read as regex syntax
        text_body = re.sub(r"({})".format(re.escape(p)), r"{}\1{}".format(html_start_tag, html_end_tag), text_body, flags=re.IGNORECASE)
    return text_body
For each provided phrase p, this uses a case-insensitive re.sub() to replace all instances of that phrase in the provided text. The parentheses capture the matched phrase in a regular expression group, and \1 is a backreference that re-inserts the text exactly as it was found, enclosed in the HTML tags.
text = """
Somewhat more than forty years ago, Mr Baillie Fraser published a
lively and instructive volume under the title _A Winter’s Journey
(Tatar) from Constantinople to Teheran. Political complications
had arisen between Russia and Turkey - an old story, of which we are
witnessing a new version at the present time. The English government
deemed it urgently necessary to send out instructions to our
representatives at Constantinople and Teheran.
"""
new = inject_html(["TEHERAN", "Constantinople"], text)
print(new)
> Somewhat more than forty years ago, Mr Baillie Fraser published a lively and instructive volume under the title _A Winter’s Journey (Tatar) from <span class="matched">Constantinople</span> to <span class="matched">Teheran</span>. Political complications had arisen between Russia and Turkey - an old story, of which we are witnessing a new version at the present time. The English government deemed it urgently necessary to send out instructions to our representatives at <span class="matched">Constantinople</span> and <span class="matched">Teheran</span>.
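Substituting one phrase at a time still rescans markup inserted by earlier passes, which is the "pan" inside "span" problem from the question. One way around it, sketched here as an alternative rather than as the answer's method, is to combine all phrases into a single alternation and substitute in one pass, so inserted tags are never searched. Using None as the default for html_attrs and sorting the phrases by length are choices made for this sketch.

```python
import re
from html import escape


def inject_html(phrases, text_body, html_element_name="span", html_attrs=None):
    """Wrap every case-insensitive occurrence of any phrase in one pass."""
    attrs = {"class": "matched"} if html_attrs is None else html_attrs
    attr_text = "".join(
        ' {}="{}"'.format(key, escape(str(value), quote=True))
        for key, value in attrs.items()
    )
    start_tag = "<{}{}>".format(html_element_name, attr_text)
    end_tag = "</{}>".format(html_element_name)

    # Longest phrases first, so "foobar" wins over "foo" at the same position.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    if not ordered:
        return text_body

    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)
    # A function replacement avoids interpreting backslashes in the tags.
    return pattern.sub(lambda m: start_tag + m.group(0) + end_tag, text_body)


print(inject_html(["foo", "pan", "test"], "Foobar"))
```

Because re.sub() scans the original text once and never revisits its own output, no list of "immune" ranges is needed.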

Formatting output csv files

Could I please get some help on the following problem. I can't seem to spot where I have gone wrong in my code. I have 2 output csv files from my code. The first produces the right format but the second does not:
First output file (fileB in my code)
A,B,C
D,E,F
Second output file (fileC in my code)
A,B,
C
D,E,
F
Here is my code:
file1 = open ('fileA.csv', 'rt', newline = '')
shore_upstream = open('fileB.csv', 'wt', newline = '')
shore_downstream = open('fileC.csv', 'wt', newline = '')
for line in file1:
    first_comma = line.find(',')
    second_comma = line.find(',', first_comma + 1)
    start_coordinate = line [first_comma +1 : second_comma]
    start_coordinate_number = int(start_coordinate)
    end_coordinte = line [second_comma +1 :]
    end_coordinate_number = int (end_coordinte)
    upstream_start = start_coordinate_number - 2000
    downstream_end = end_coordinate_number + 2000
    upstream_start_string = str(upstream_start)
    downstring_end_string = str(downstream_end)
    upstream_shore = line[:first_comma]+','+ upstream_start_string + ',' + start_coordinate
    shore_upstream.write(upstream_shore + '\n')
    downstream_shore = line[:first_comma]+ ','+ end_coordinte + ',' + downstring_end_string
    shore_downstream.write(downstream_shore + '\n')
file1.close()
shore_upstream.close()
shore_downstream.close()
By the way, I am using python 3.3.
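Your variable end_coordinte is sliced from the second comma to the end of the line, so it still contains the line ending (\n, or \r\n if the file uses Windows line endings) and possibly other non-digit characters. Writing it into the middle of a row is what produces that output.
The simplest solution might be to convert those strings to numbers and write the numbers back as strings.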
Replace:
upstream_shore = line[:first_comma]+','+ upstream_start_string + ',' + start_coordinate
downstream_shore = line[:first_comma]+ ','+ end_coordinte + ',' + downstring_end_string
by:
upstream_shore = line[:first_comma]+','+ upstream_start_string + ',' + str(start_coordinate_number)
downstream_shore = line[:first_comma]+ ','+ str(end_coordinate_number) + ',' + downstring_end_string
And pay attention to the line[:first_comma] output, as it may also contain characters you are not expecting.
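The manual comma searching can also be avoided altogether. The following sketch uses the standard csv module, which strips the line ending while parsing; the function name split_shores, the margin parameter and the sample rows are assumptions for illustration, with io.StringIO standing in for the real files.

```python
import csv
import io


def split_shores(rows, margin=2000):
    """Build the upstream and downstream rows from (name, start, end) rows."""
    upstream, downstream = [], []
    for name, start, end in rows:
        start, end = int(start), int(end)
        upstream.append([name, start - margin, start])
        downstream.append([name, end, end + margin])
    return upstream, downstream


# In-memory stand-in for fileA.csv, with Windows line endings on purpose.
source = io.StringIO("geneA,5000,6000\r\ngeneB,9000,9500\r\n")
upstream, downstream = split_shores(csv.reader(source))

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(downstream)
print(out.getvalue())
```

Working with parsed integers means no stray newline can end up in the middle of an output row.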

Python Conditional XML Writing

I am using Python to convert CSV files to XML format. The CSV files have a varying amount of rows ranging anywhere from 2 (including headers) to infinity. (realistically 10-15 but unless there's some major performance issue, I'd like to cover my bases) In order to convert the files I have the following code:
for row in csvData:
    if rowNum == 0:
        xmlData.write(' <'+csvFile[:-4]+'-1>' + "\n")
        tags = row
        # replace spaces w/ underscores in tag names
        for i in range(len(tags)):
            tags[i] = tags[i].replace(' ', '_')
    if rowNum == 1:
        for i in range(len(tags)):
            xmlData.write(' ' + '<' + tags[i] + '>' \
                + row[i] + '</' + tags[i] + '>' + "\n")
        xmlData.write(' </'+csvFile[:-4]+'-1>' + "\n" + ' <' +csvFile[:-4]+'-2>' + "\n")
    if rowNum == 2:
        for i in range(len(tags)):
            xmlData.write(' ' + '<' + tags[i] + '>' \
                + row[i] + '</' + tags[i] + '>' + "\n")
        xmlData.write(' </'+csvFile[:-4]+'-2>' + "\n")
    if rowNum == 3:
        for i in range(len(tags)):
            xmlData.write('<'+csvFile[:-4]+'-3>' + "\n" + ' ' + '<' + tags[i] + '>' \
                + row[i] + '</' + tags[i] + '>' + "\n")
        xmlData.write(' </'+csvFile[:-4]+'-3>' + "\n")
    rowNum +=1
xmlData.write('</csv_data>' + "\n")
xmlData.close()
As you can see, I have the upper-level tags set to be created manually if the row exists. Is there a more efficient way to achieve my goal of creating the <csvFile-*></csvFile-*> tags rather than repeating my code 15+ times? Thanks!
I would use xml.etree.ElementTree or lxml.etree to write the XML. xml.etree.ElementTree is in the standard library, but before Python 3.9 it had no built-in pretty-printing (3.9 added xml.etree.ElementTree.indent; on older versions you could use a small recursive indent helper).
lxml.etree is a third-party module, but it has built-in pretty-printing in its tostring method.
Using lxml.etree, you could do something like this:
import lxml.etree as ET
csvData = [['foo bar', 'baz quux'],['bing bang', 'bim bop', 'bip burp'],]
csvFile = 'rowboat'
name = csvFile[:-4]  # drop the last four characters, leaving 'row'
root = ET.Element('csv_data')
for num, tags in enumerate(csvData):
    row = ET.SubElement(root, '{f}-{n}'.format(f = name, n = num))
    for text in tags:
        text = text.replace(' ', '_')
        tag = ET.SubElement(row, text)
        tag.text = text
# tostring() returns bytes on Python 3, so decode before printing
print(ET.tostring(root, pretty_print = True).decode())
yields
<csv_data>
  <row-0>
    <foo_bar>foo_bar</foo_bar>
    <baz_quux>baz_quux</baz_quux>
  </row-0>
  <row-1>
    <bing_bang>bing_bang</bing_bang>
    <bim_bop>bim_bop</bim_bop>
    <bip_burp>bip_burp</bip_burp>
  </row-1>
</csv_data>
Some suggestions:
In Python, you almost never need to write
for i in range(len(tags)):
    # do stuff with tags[i]
Instead, write
for tag in tags:
to loop over all the items in tags.
Also, instead of manually counting the passes through a loop with
num = 0
for tags in csvData:
    num += 1
use the enumerate function:
for num, tags in enumerate(csvData):
Strings like
' ' + '<' + tags[i] + '>' \
+ row[i] + '</' + tags[i] + '>' + "\n"
are incredibly difficult to read. They mix the logic of indentation with the XML syntax of tags and with the minutiae of end-of-line characters. That's where xml.etree.ElementTree or lxml.etree will help you. It will take care of the serialization of the XML for you; all you need to provide is the relationship between the XML elements. The code will be much more readable and easier to maintain.
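Putting these suggestions together for the layout described in the question (a header row that supplies the tag names, then one numbered element per data row) could look like the sketch below. It uses only the standard library; the function name rows_to_xml and the sample rows are assumptions, and the numbering starts at 1 to match the question's -1, -2, ... elements.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, base_name):
    """Build <csv_data> with one <base_name-N> element per data row."""
    rows = iter(rows)
    # The first row holds the headers; spaces are not allowed in tag names.
    tags = [header.replace(" ", "_") for header in next(rows)]
    root = ET.Element("csv_data")
    for num, row in enumerate(rows, start=1):
        record = ET.SubElement(root, "{}-{}".format(base_name, num))
        for tag, value in zip(tags, row):
            ET.SubElement(record, tag).text = value
    return root


rows = [["first name", "city"], ["Ada", "London"], ["Alan", "Wilmslow"]]
root = rows_to_xml(rows, "people")
print(ET.tostring(root, encoding="unicode"))
```

The same loop handles 2 rows or 2,000, so no per-row if block is needed.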
