Format individual characters differently within an Excel cell - python

I have a column in Excel 2013 containing letters and the digits 1,2,3 and 4 (representing pinyin pronunciations and tone values). They are all in the same font & format, but I would like to convert the numbers only to superscript. It does not seem that I can use any of Excel's built-in find-and-replace functionality to replace a single character in a cell with its superscript version: the entire cell format gets changed. I saw a thread Format individual characters in a single Excel cell with python which apparently holds a solution, but that was the first time I had heard of Python or xlwt.
Since I have never used Python and xlwt, can someone give me a basic step-by-step set of instructions to install those utilities, customize the script and run it?
Sample:
Li1Shi4
Qin3Fat1
Gon1Lin3
Den1Choi3
Xin1Nen3
Script from other thread:
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')
font0 = xlwt.easyfont('')
font1 = xlwt.easyfont('bold true')
font2 = xlwt.easyfont('color_index red')
style = xlwt.easyxf('font: color_index blue')
seg1 = ('bold', font1)
seg2 = ('red', font2)
seg3 = ('plain', font0)
seg4 = ('boldagain', font1)
ws.write_rich_text(2, 5, (seg1, seg2, seg3, seg4))
ws.write_rich_text(4, 1, ('xyz', seg2, seg3, '123'), style)
wb.save('rich_text.xls')
What is the syntax that will achieve the "find numbers and replace with superscript"? Is it a font or a style? The code from the other thread seems to manually input "seg1" , "seg2" , "seg3" etc. Or am I misunderstanding the code?
Thanks in advance. I am using Windows 8, 64 bit, Excel 2013.

I'm bored and in a teaching mood, so here's a long "answer" that also explains a little bit about how you can figure these things out for yourself in the future :)
I typed abc123def into a cell, and recorded a macro using the macro recorder.
This is where you should always start if you don't know what the correct syntax is.
In any case, I selected the numeric part of this cell, and right-clicked, format cell, change font to superscript.
This is what the macro recorder gives me. This is a lot of code. Fortunately, it's a lot of junk.
Sub Macro2()
    With ActiveCell.Characters(Start:=1, Length:=3).Font 'Applies to the first 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    With ActiveCell.Characters(Start:=4, Length:=3).Font 'Applies to the middle 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = True
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    With ActiveCell.Characters(Start:=7, Length:=3).Font 'Applies to the last 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
End Sub
What it represents is three blocks of formatting: the first is the first 3 characters that aren't changed, then the 3 that we applied superscript to, and then the last three characters.
Almost all of this is default properties, since I made no other changes, so I can revise it to this:
Sub Macro2()
    With ActiveCell.Characters(Start:=4, Length:=3).Font
        .Superscript = True
    End With
End Sub
Now we can see that there are two important parts to this. The first part is how to specify which characters to format. This is done by referring to a cell's .Characters:
ActiveCell.Characters(Start:=4, Length:=3).Font
So we can see that this macro refers to the characters in positions 4-6 of the string "abc123def", i.e. "123".
The next, obvious part is to set the .Font.Superscript property to True.
Now you want to generalize this so that you can apply it anywhere. The above code hardcodes the Start and Length arguments; we need to make it dynamic. The easiest way to do this is to go one character at a time, check whether it's numeric, and if so, apply the superscript.
Sub ApplySuperscriptToNumbers()
    Dim i As Long
    Dim str As String
    Dim rng As Range
    Dim cl As Range
    '## Generally should work on any contiguous "Selection" of cell(s)
    Set rng = Range(Selection.Address)
    '## Iterate over each cell in this selection
    For Each cl In rng.Cells
        str = cl.Value
        '## Iterate over each character in the cell
        For i = 1 To Len(str)
            '## Check if this character is numeric
            If IsNumeric(Mid(str, i, 1)) Then
                '## Apply superscript to this 1 character
                cl.Characters(Start:=i, Length:=1).Font.Superscript = True
            End If
        Next
    Next
End Sub
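Coming back to the Python/xlwt side of the original question: write_rich_text takes a sequence of (text, font) segments, and a font built with easyfont('escapement superscript') is what makes a segment superscript. Here is an untested sketch of the same one-character-at-a-time idea: a helper splits each pinyin string into digit and non-digit runs, and the xlwt calls (shown in comments, since they assume xlwt is installed, and the file name is made up) would map each run to a plain or superscript font:

```python
def split_segments(text):
    """Split text into (chunk, is_digit) runs, e.g. 'Li1Shi4' ->
    [('Li', False), ('1', True), ('Shi', False), ('4', True)]."""
    segments = []
    for ch in text:
        if segments and segments[-1][1] == ch.isdigit():
            # Same kind of character as the previous run: extend it
            segments[-1] = (segments[-1][0] + ch, segments[-1][1])
        else:
            # Kind changed (letter <-> digit): start a new run
            segments.append((ch, ch.isdigit()))
    return segments

# With xlwt it would look roughly like this (untested sketch):
#
# import xlwt
# wb = xlwt.Workbook()
# ws = wb.add_sheet('Sheet1')
# plain = xlwt.easyfont('')
# superf = xlwt.easyfont('escapement superscript')
# for row, word in enumerate(['Li1Shi4', 'Qin3Fat1']):
#     rich = [(chunk, superf if is_digit else plain)
#             for chunk, is_digit in split_segments(word)]
#     ws.write_rich_text(row, 0, rich)
# wb.save('pinyin.xls')

print(split_segments('Li1Shi4'))
# [('Li', False), ('1', True), ('Shi', False), ('4', True)]
```

The segment list maps one-to-one onto the (text, font) tuples that write_rich_text expects, so there is no need to hand-write seg1, seg2, seg3 the way the other thread's example does.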

Hyphens in csv columns/unknown data causing int conversion errors

I know how to convert between data types. Unfortunately, something in the data is breaking my str-to-int conversion during the cleaning process.
My code executes normally when I don't cast to an int. When I examined the csv file I realized that there are hyphens in the BeginDate and EndDate columns. I thought this was the reason for my ValueError, but have learned in the comments that this is not the case.
from csv import reader

opened_file = open('/Users/tymac/Artworks.csv')
read_data = reader(opened_file)
moma = list(read_data)
moma_header = moma[0]
moma = moma[1:]
for row in moma:
    bd = row[5]  # BeginDate
    bd = bd.replace("(", "")
    bd = bd.replace(")", "")
    #bd = int(bd)
    # I've stopped the loop after the first row "moma[0]",
    # therefore no other cells should be causing the error.
    if row == moma[0]:
        print(bd)
        print(type(bd))
As per the comments section, you discovered that the parentheses represent a negative number. Almost certainly, you have a cell that is not an integer type. An easy way to find the issue is to wrap your conversion in a try/except. For now, just print the cell; later, you will need to decide what to do with it.
from csv import reader

opened_file = open('/Users/tymac/Artworks.csv')
read_data = reader(opened_file)
moma = list(read_data)
moma_header = moma[0]
moma = moma[1:]
for row in moma:
    bd = row[5]
    bd = bd.replace("(", "")
    bd = bd.replace(")", "")
    try:
        bd = int(bd)
    except ValueError:
        print(bd)  # Just to find your bad cell, otherwise choose what to do with it.
For example, if I have a csv with the following data:
FName, LName, Number
James, Jones, (20)
Sam, Smith, (30)
Someone, Else, nan
and I run the code (changing to row[2] instead of row[5]), I will get a printed result of "nan" because the conversion to int fails. This tells me that I have a row that contains something other than an integer.
Adding my own answer because this was the solution in code. SteveJ's comments led me to ask myself the questions that resulted in these strict filters, so I marked his answer as correct.
I didn't know that a number with a leading zero is not a valid integer in Python. Some of the cells started with a leading zero and certainly looked like integers, e.g. 0196. In addition, I had tried to use 0000 as a placeholder cell for unknown dates. The exception to the leading-zero rule in Python is a number that contains all zeros, like 0000; however, since I was filtering out zeros with other conditions, it was safer to use 1111 as my placeholder integer.
I had to get aggressive with the cleaning and create filters that eliminated all possible outliers, even ones I could not see: a "just in case" filter to throw out everything that did not leave me with a 4-digit number string. Now I have 4-digit year integers, with 1111 as the integer placeholder for unknown cells, so all is good.
In the end, I was able to clean it using these filters.
def clean_date(string):
    bad_chars = ["(", ")", "\n", "\r", "\t"]
    for char in bad_chars:
        string = string.replace(char, "")
    if len(string) > 4:
        string = string[:4]
    elif len(string) < 4:
        string = "1111"  # Don't use "0000" for padding, placeholders etc.
    elif " " in string:
        string = "1111"
    elif string.isdigit() == False:
        string = "1111"
    elif len(string.split('1', 1)[0]):
        string = "1111"
    return string
for row in moma:
    bd = row[5]  # BeginDate/Birth Date
    bd = clean_date(bd)
    bd = int(bd)  # Conversion
    if row == moma[0]:
        print(bd)
        print(type(bd))
        # Date of birth as an int
        # 1841 <class 'int'>
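To check the filter's behavior, here is a condensed, standalone restatement of the clean_date function above with a few example inputs (the inputs are made up; a leading-zero year like 0196 is the case the last elif catches):

```python
def clean_date(string):
    # Condensed restatement of the clean_date filter above, for a standalone check
    for char in ["(", ")", "\n", "\r", "\t"]:
        string = string.replace(char, "")
    if len(string) > 4:
        string = string[:4]
    elif len(string) < 4:
        string = "1111"            # too short (includes empty cells)
    elif " " in string:
        string = "1111"            # embedded whitespace
    elif not string.isdigit():
        string = "1111"            # any non-digit character
    elif len(string.split('1', 1)[0]):
        string = "1111"            # anything before the first '1', e.g. '0196'
    return string

print(clean_date("(1841)"))   # '1841' - a normal wrapped year
print(clean_date("(0196)"))   # '1111' - leading zero, replaced by the placeholder
print(clean_date(""))         # '1111' - empty cell, replaced as well
```

Note that the elif chain means only one branch ever fires per call, which is why the length check has to come first.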

How do I get my code to show a certain number of characters from a text file

In my assignment I'm expected to look for a word and only return a set number of characters (80 in total, 40 on each side surrounding the word), without the use of nltk or regex.
I've set my code up like so:
open = open("a2.txt", 'r')
file2read = open.readlines()
name = 'word'
for line in file2read:
    s2 = line.split("\n", 1)
    if name in line:
        i = line.find(name)
        half = (80 - len(name) - 2) // 2
        left = line[i - half]
        right = line[i + len(word) + half]
        print(left + word + right)
but then my print out looks like this (updated screenshot) instead of the 80-character lines I'm hoping to find.
Sorry if this is a really newbie error; I'm only 3 weeks into the program and I've been searching and can't seem to get the answer.
Instead of doing readlines, which might not be consistent due to line-ending differences between Windows/Unix, you can also read the entire text at once; you don't need to separate it into lines:
with open('a2.txt', 'r') as file:
    a = file.read()

name = 'word'
if name in a:
    i = a.find(name)
    half = (80 - len(name) - 2) // 2
    left = a[i - half:i]
    right = a[i + len(name):i + len(name) + half]
    print(left + name + right)
This way you are reading the entire text at once, finding your word, and printing the necessary 80 characters. This is the output:
ut. even know say trip tip sandwich. words describe it. meat eater, love it. b
If you want to make it work for all the occurrences in the text, you will need to make a loop =) but that I'm sure you can figure out by yourself!
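For reference, one way that loop could look. This is just a sketch of the idea, using str.find's second argument to keep searching past each match; the sample string and the function name are made up:

```python
def contexts(text, name, width=80):
    """Yield a window of about `width` characters around each occurrence of `name`."""
    half = (width - len(name) - 2) // 2
    i = text.find(name)
    while i != -1:
        # Slice the surrounding text; max() keeps the left edge inside the string
        left = text[max(i - half, 0):i]
        right = text[i + len(name):i + len(name) + half]
        yield left + name + right
        # Continue the search just past the current occurrence
        i = text.find(name, i + 1)

sample = "spam word eggs word ham"
for window in contexts(sample, "word", width=20):
    print(window)
```

Unlike a single .find call, this visits every occurrence, because each search restarts one character after the previous hit.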

ID3v1 Null Byte parsing in python

I am writing a tool to parse ID3 tags from files and edit them in a GUI fashion. Up until now everything is great. However, I am trying to remove the null-byte terminators when displaying the info and then add them back when the user saves, to preserve the ID3v1 format. However, when doing a check for the null terminator I get nothing.
This is the portion of the code related to the handling of the tag:
if bytes.decode(check) == "TAG":
    title = self.__clean(bytes.decode(f.read(30)))
    artist = self.__clean(bytes.decode(f.read(30)))
    album = self.__clean(bytes.decode(f.read(30)))
    year = bytes.decode(f.read(4))
    comment = self.__clean(bytes.decode(f.read(30)))
    tmp_gen = bytes.decode(f.read(1))
    genre = self.__clean(Utils.genreByteToString(tmp_gen))
    return TagV1(title, artist, album, year, comment, genre)
return None
The clean method is here:
def __clean(self, string):
    counter = 0
    for i in range(0, len(string)):
        w = string[i]
        if (not w.strip()) or b"\00" == w or w == b"00" or w == bytes.decode(b"\00"):
            counter += 1
        else:
            counter = 0
        if counter == 2:
            return string[0:i - 1]
    return string
I've tried every possible way of writing the null byte that I know of, either not w or not w.strip(). I even tried converting to bytes and then looping through that for the null byte, but still nothing. My counter always stays 0 in the debugger. Also, when I try to copy the value from the debugger it comes out as an empty space; in the debugger itself it appears as an empty square. I would appreciate the input.
Using PyCharm 2017.1.4.
I figured out that the only solution that works is to use
w == str.decode(b"\00") or rstrip("\0")
as suggested by Marteen.
Everything else seems not to work. However, there are still some places where it fails. For example, the comment in the file I am testing doesn't have null bytes until the last one.
Upon further inspection with a hex editor I found some odd characters. The comment continues with the hex character \20 until position 29, where a null character sits (denoting that the tag has a track indicator); the next character is \01 for the track. Oddly, the genre indicator is 0C, which translates to a character I cannot paste here (it renders as a box with zeros in it).
EDIT: Using the __clean() method, checking for the decoded null terminator as well as w.isspace() seemed to fix the issue in both other cases.
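For what it's worth, the rstrip approach mentioned above is usually all an ID3v1 field needs: decode the fixed-width field, then strip the trailing NUL (and space) padding in one go, instead of scanning character by character. A small standalone sketch with a made-up field value (decoding as Latin-1 is an assumption, though it is a common choice for ID3v1 text):

```python
def clean_field(raw, encoding='latin-1'):
    """Decode a fixed-width ID3v1 field and strip trailing NUL/space padding."""
    return raw.decode(encoding).rstrip('\x00 ')

# A 30-byte title field padded out with NUL bytes, as found in an ID3v1 tag:
field = b'Some Title' + b'\x00' * 20
print(repr(clean_field(field)))  # 'Some Title'
```

Because rstrip only touches the end of the string, an embedded track byte at position 29 of the comment field (the ID3v1.1 layout) should be split off before cleaning rather than relying on the strip to remove it.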

Extracting Data from Multiple TXT Files and Creating a Summary CSV File in Python

I have a folder with about 50 .txt files containing data in the following format.
=== Predictions on test data ===
inst# actual predicted error distribution (OFTd1_OF_Latency)
1 1:S 2:R + 0.125,*0.875 (73.84)
I need to write a program that combines the following: my index number (i), the letter of the true class (R or S), the letter of the predicted class, and each of the distribution predictions (the decimals less than 1.0).
I would like it to look like the following when finished, but preferably as a .csv file.
ID True Pred S R
1 S R 0.125 0.875
2 R R 0.105 0.895
3 S S 0.945 0.055
. . . . .
. . . . .
. . . . .
n S S 0.900 0.100
I'm a beginner and a bit fuzzy on how to get all of that parsed and then concatenated and appended. Here's what I was thinking, but feel free to suggest another direction if that would be easier.
for i in range(1, n):
    s = str(i)
    readin = open('mydata/output/output' + s + 'out', 'r')
    # The files are all named the same but with different numbers associated
    output = open("mydata/summary.csv", "a")
    storage = []
    for line in readin:
        # data extraction/concatenation here
        if line.startswith('1'):
            id = i
            true = # split at the ':' and take the letter after it
            pred = # split at the second ':' and take the letter after it
            # some have error '+'s and some don't so I'm not exactly sure what to do to get the distributions
            ds = # split at the ',' and take the string of 5 digits before it
            if pred == 'R':
                dr = # skip the character after the comma but take the five characters after
            else:
                # take the five characters after the comma
            lineholder = id + ' , ' + true + ' , ' + pred + ' , ' + ds + ' , ' + dr
        else:
            continue
        output.write(lineholder)
I think using the indexes would be another option, but it might complicate things if the spacing is off in any of the files and I haven't checked this for sure.
Thank you for your help!
Well, first of all, if you want to use CSV, you should use the csv module that comes with Python. More about this module here: https://docs.python.org/2.7/library/csv.html. I won't demonstrate how to use it, because it's pretty simple.
As for reading the input data, here's my suggestion how to break down every line of the data itself. I assume that lines of data in the input file have their values separated by spaces, and each value cannot contain a space:
def process_line(id_, line):
    pieces = line.split()  # Now we have an array of values
    true = pieces[1].split(':')[1]  # split at the ':' and take the letter after it
    pred = pieces[2].split(':')[1]  # split at the second ':' and take the letter after it
    if len(pieces) == 6:  # There was an error, the + is there
        p4 = pieces[4]
    else:  # There was no '+', only spaces
        p4 = pieces[3]
    ds = p4.split(',')[0]  # split at the ',' and take the string of 5 digits before it
    if pred == 'R':
        dr = p4.split(',')[1][1:]  # skip the '*' after the comma and take the five characters after it
    else:
        dr = p4.split(',')[1]  # take the five characters after the comma
    return id_ + ' , ' + true + ' , ' + pred + ' , ' + ds + ' , ' + dr
What I mainly used here was the split function of strings: https://docs.python.org/2/library/stdtypes.html#str.split and, in one place, the simple syntax str[1:] to skip the first character of the string (strings are sequences, after all, so we can use this slicing syntax).
Keep in mind that my function won't handle any errors or lines formatted differently than the one you posted as an example. If the values in every line are separated by tabs and not spaces, you should replace the line pieces = line.split() with pieces = line.split('\t').
I think you can separate the floats and then combine them with the strings with the help of the re module, as follows:
import re

with open('sample.txt', 'r') as file:
    strings = [re.findall(r'\d+\.\d+', line) for line in file.readlines()]
print(strings)

with open('sample.txt', 'r') as file:
    num = [re.findall(r'\w+:\w+', line) for line in file.readlines()]
print(num)

s = num + strings
print(s)  # [['1:S', '2:R'], ['0.125', '0.875', '73.84']] for a one-line file
This program is written for one line; it works for multiple lines as well, since the list comprehensions loop over every line of the file.
contents of sample.txt:
1 1:S 2:R + 0.125,*0.875 (73.84)
2 1:S 2:R + 0.15,*0.85 (69.4)
when you run the program with both lines the result will be:
[['1:S', '2:R'], ['1:S', '2:R'], ['0.125', '0.875', '73.84'], ['0.15', '0.85', '69.4']]
simply concatenate them
This uses regular expressions and the CSV module.
import re
import csv

matcher = re.compile(r'\s*1.*:(.).*:(.).* ([^ ]*),[^0-9]?(.*) ')
filenametemplate = 'mydata/output/output%iout'
output = csv.writer(open('mydata/summary.csv', 'w'))
for i in range(1, n):
    for line in open(filenametemplate % i):
        m = matcher.match(line)
        if m:
            output.writerow([i] + list(m.groups()))
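To sanity-check what the pattern captures, here it is run against the sample line from the question (a standalone sketch; the real output files and n aren't needed for this):

```python
import re

# Same pattern as above; \s* is used because Python's re module does not
# support POSIX character classes like [[:blank:]]
matcher = re.compile(r'\s*1.*:(.).*:(.).* ([^ ]*),[^0-9]?(.*) ')

line = '1 1:S 2:R + 0.125,*0.875 (73.84)'
m = matcher.match(line)
print(m.groups())  # ('S', 'R', '0.125', '0.875')
```

The four groups are exactly the True, Pred, S, and R columns of the desired summary; the `[^0-9]?` quietly skips the `*` that marks the predicted class.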

UCSC BLAT output python

Is there a way I can get the position number of the mismatch from the following BLAT result using Python?
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348
As we can see, there are two mismatches in the above output. Can we get the position number of the mismatch/mutation using Python? This is how it appears in the source code also, so I'm a little confused on how to proceed.
Thank you.
You can find the mismatches using the .find method of a string. Mismatches are indicated by a space (' '), so we look for that in the middle line of the blat output. I don't know blat personally, so I'm not sure if the output always comes in triplet lines, but assuming it does, the following function will return a list of mismatching positions, each represented as a tuple of (position in the top sequence, position in the bottom sequence).
blat_src = """00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348"""

def find_mismatch(blat):
    # break the blat input into lines
    lines = blat.split("\n")
    # give some friendly names to the different lines
    seq_a = lines[0]
    seq_b = lines[2]
    # We're not interested in the '<' and '>' so we strip them out with a slice
    matchstr = lines[1][9:-9]
    # Get the integer values of the starts of each sequence segment
    pos_a = int(seq_a[:8])
    pos_b = int(seq_b[:8])
    results = []
    # find the index of the first space character, mmpos = mismatch position
    mmpos = matchstr.find(" ")
    # if a space exists (-1 if none found)
    while mmpos != -1:
        # the position of the mismatch is the start position of the
        # sequence plus the index within the segment
        results.append((pos_a + mmpos, pos_b + mmpos))
        # search the rest of the string (from mmpos+1 onwards)
        mmpos = matchstr.find(" ", mmpos + 1)
    return results

print(find_mismatch(blat_src))
Which produces
[(28, 41629419), (29, 41629420)]
Telling us positions 28 and 29 (indexed according to the top sequence) or positions 41629419 and 41629420 (indexed according to the bottom sequence) are mismatched.
