Currently, I am working on parsing resumes to remove "-" only when it is used at the beginning of each line. I've tried identifying the first character of each string after the text has been split. Below is my code:
for line in text.split('\n'):
    if line[0] == "-":
        line[0] = line.replace('-', ' ')
line is a string. This is my way of thinking, but every time I run this I get the error IndexError: string index out of range. I'm unsure why, because since it is a string, the first element should be recognized. Thank you!
The issue you're getting is because some lines are empty.
Then your replacement is wrong:
first because it will assign the first "character" of the line but you cannot change a string because it's immutable
second because the replacement value is the whole string minus some dashes
third because line is lost at the next iteration. The original list of lines too, by the way.
If you want to remove the first character of a string, there is no need for replace; just slice the string (that way you don't risk removing other similar characters).
A working solution would be to test with startswith and build a new list of strings, then join them back together:
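To make the failure modes listed above concrete, here is a small sketch. The sample strings are made up for illustration; it shows that indexing an empty string raises IndexError, that assigning to a string position raises TypeError, and that slicing only drops the first character while replace touches every dash.

```python
# Sample values are illustrative, not taken from a real resume.
empty_line = ""
dashed_line = "-yes--"

# Indexing an empty string is what triggers "string index out of range".
try:
    empty_line[0]
    index_error_raised = False
except IndexError:
    index_error_raised = True

# Strings are immutable, so assigning to a position raises TypeError.
try:
    dashed_line[0] = " "
    type_error_raised = False
except TypeError:
    type_error_raised = True

# Slicing returns a new string and only drops the first character.
sliced = dashed_line[1:]
# replace() would instead touch every dash in the line.
replaced = dashed_line.replace("-", " ")

print(index_error_raised, type_error_raised, repr(sliced), repr(replaced))
```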
text = """hello
-yes--
who are you"""
new_text = []
for line in text.splitlines():
    if line.startswith("-"):
        line = line[1:]
    new_text.append(line)
print("\n".join(new_text))
result:
hello
yes--
who are you
With more experience, you can pack this code into a list comprehension:
new_text = "\n".join([line[1:] if line.startswith("-") else line for line in text.splitlines()])
Finally, the regular expression module is also a nice alternative:
import re
print(re.sub("^-","",text,flags=re.MULTILINE))
This removes the dash on all lines starting with a dash. The MULTILINE flag tells the regex engine to treat ^ as the start of each line, not just the start of the buffer.
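The effect of the flag can be seen by running the same substitution with and without it. This is a minimal sketch on a made-up three-line string:

```python
import re

text = "hello\n-yes--\n-who are you"

# Without re.MULTILINE, ^ only matches at the very start of the string,
# so no dash is removed here because the text starts with "hello".
without_flag = re.sub(r"^-", "", text)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.sub(r"^-", "", text, flags=re.MULTILINE)

print(repr(without_flag))
print(repr(with_flag))
```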
This could be due to empty lines. You could just check that the line is non-empty before indexing it.
new_text = []
text="-testing\nabc\n\n\nxyz"
for line in text.split("\n"):
    if line and line[0] == '-':
        line = line[1:]
    new_text.append(line)
print("\n".join(new_text))
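As an aside, on Python 3.9 or newer the same result can be had without an explicit check, since str.removeprefix is safe on empty strings. This is only a sketch under that version assumption, with a made-up sample text:

```python
text = "-testing\nabc\n\n\n--xyz"

# str.removeprefix (Python 3.9+) drops the prefix once if present and is
# safe on empty strings, so no length check is needed.
new_text = "\n".join(line.removeprefix("-") for line in text.split("\n"))

print(new_text)
```

Only one leading dash is removed per line, so "--xyz" becomes "-xyz".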
Related
So it's not a very difficult problem and I've been trying to do it. Here is my sample code:
import sys
for s in sys.stdin:
    s = s[0:1].upper() + s[1:len(s)-1] + s[len(s)-1:len(s)].upper()
    print(s)
This code only capitalizes the first letter and not the last letter as well. Any tips?
You are operating on lines, not words, since iterating over sys.stdin will give you strings that consist of each line of text that you input. So your logic won't be capitalizing individual words.
There is nothing wrong with your logic for capitalizing the last character of a string. The reason that you are not seeming to capitalize the end of the line is that there's an EOL character at the end of the line. The capitalization of EOL is EOL, so nothing is changed.
If you call strip() on the input line before you process it, you'll see the last character capitalized:
import sys
for s in sys.stdin:
    s = s.strip()
    s = s[0:1].upper() + s[1:len(s)-1] + s[len(s)-1:len(s)].upper()
    print(s)
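The same idea can be sketched as a small function that is easy to try without piping input. The helper name capitalize_ends is hypothetical, and io.StringIO stands in for sys.stdin here; the single-character guard is an addition so that a one-letter line is not doubled.

```python
import io

def capitalize_ends(line):
    """Upper-case the first and last character of a line (hypothetical helper)."""
    line = line.rstrip("\n")  # drop the EOL so the last real character is reached
    if len(line) < 2:
        return line.upper()  # avoid doubling a single character
    return line[0].upper() + line[1:-1] + line[-1].upper()

# io.StringIO stands in for sys.stdin so the sketch runs without piped input.
fake_stdin = io.StringIO("hello world\nabc\n")
results = [capitalize_ends(line) for line in fake_stdin]
print(results)
```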
@Calculuswhiz's answer shows you how to deal with capitalizing each word in your input.
You first have to split the line of stdin, then you can operate on each word using a map function. Without splitting, the stdin is only read line by line in the for loop.
#!/usr/bin/python
import sys
def capitalize(t):
    # Avoid doubling a single-character word
    if len(t) == 1:
        return t.upper()
    else:
        return t[0].upper() + t[1:-1] + t[-1].upper()
for s in sys.stdin:
    splitLine = s.split()
    l = map(capitalize, splitLine)
    print(' '.join(l))
You could use the capitalize method of str to uppercase the first letter, and then uppercase the last letter individually (note that capitalize also lowercases the rest of the string), something like:
my_string = my_string.capitalize()
my_string = my_string[:-1] + my_string[-1].upper()
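One thing to watch with this approach is what happens to the characters in the middle. The sketch below uses a made-up mixed-case string and compares the capitalize route with plain slicing; it assumes the string has at least two characters.

```python
my_string = "hELLO wORLd"

# str.capitalize() upper-cases the first character AND lower-cases the rest.
via_capitalize = my_string.capitalize()
via_capitalize = via_capitalize[:-1] + via_capitalize[-1].upper()

# Slicing keeps the middle of the string exactly as it was.
via_slicing = my_string[:1].upper() + my_string[1:-1] + my_string[-1:].upper()

print(via_capitalize)
print(via_slicing)
```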
So I want to run a program that will read in a file line by line, and will then print out either Valid or Invalid based on what each line contains.
For this example, I am saying that the input file line can contain ABCabc or a space. If the line only contains these things, the word Valid should be printed. If it is just white space, or contains any other characters or letters, it should print out “Invalid”.
This is what I have come up with:
I can’t seem to get it to ever print out “Valid”
Can you tell why? Thanks!
input = sys.argv[1]
input = open(input,"r")
correctInput = 'ABCabc '
line1 = input.readline()
while line1 != "":
    if all(char in correctInput for char in line1):
        print "Valid"
        line2 = input.readline()
    else:
        print "Invalid"
        line2 = input.readline()
    line1 = line2
If you print out the value of line1 before your if else statement, you'll see it has a newline character in it. (The \n character.) This is the character that gets added to the end of each line whenever you hit the enter key on the keyboard, and you need to either discard the newline characters or include them as valid input.
To include it as valid input
Change correctInput = 'ABCabc '
to
correctInput = 'ABCabc \n'.
Or to discard the newline characters change
if all(char in correctInput for char in line1):
to
if all(char in correctInput for char in line1.replace('\n', '')):
Either method will work.
By the way, input is a built-in function in Python. Although you're allowed to use it as a variable name, doing so will prevent you from using the input function in your program. Because of this, it is considered bad practice to use any of the built-in function names as your variable names.
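Putting the newline handling together with the question's rule that whitespace-only lines are invalid, a Python 3 sketch might look like this. The helper name classify and the sample lines are assumptions made for illustration:

```python
ALLOWED = set("ABCabc ")  # same characters as correctInput in the question

def classify(line):
    """Return 'Valid' or 'Invalid' for one line of text (hypothetical helper)."""
    content = line.rstrip("\n")  # discard the trailing newline, if any
    # Empty or whitespace-only lines count as invalid, per the question.
    if not content.strip():
        return "Invalid"
    return "Valid" if all(ch in ALLOWED for ch in content) else "Invalid"

sample_lines = ["ABC abc\n", "   \n", "abcd\n", "\n", "cab"]
results = [classify(line) for line in sample_lines]
print(results)
```

With a real file, the same function could be applied to each line produced by iterating over the open file object.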
RegEx Solution
Just for fun, I came up with the following solution which solves your problem using regular expressions.
import re
import sys
with open(sys.argv[1]) as fh:
    valid_lines = re.findall('^[ABCabc ]+\n', fh.read(), flags=re.MULTILINE)
This finds any valid lines using the pattern '^[ABCabc ]+\n'. What does this regular expression pattern do?
The ^ symbol signifies the start of a line (this requires the re.MULTILINE flag; without it, ^ only matches at the start of the whole string).
Then comes the [ABCabc ]. When brackets are used, only characters inside of those brackets will be allowed.
The + after the brackets means that the characters that were in brackets must be found 1 or more times.
And lastly we make sure the valid characters we found are followed by a newline character (\n). This ensures we checked the complete line for valid characters. Note that a final line with no trailing newline will not match.
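A per-line variant of the same regular expression idea is sketched below. It swaps in re.fullmatch, which requires the entire string to match, so neither the anchors nor the trailing newline are needed. The contents string is a made-up stand-in for fh.read(), and the helper name is_valid is hypothetical.

```python
import re

LINE_PATTERN = re.compile(r"[ABCabc ]+")

def is_valid(line):
    # fullmatch requires the whole line to match, so no ^, $ or \n is needed.
    return LINE_PATTERN.fullmatch(line) is not None

contents = "ABC abc\nabcd\ncab"  # stand-in for fh.read()
flags = [is_valid(line) for line in contents.splitlines()]
print(flags)
```

As written, a line consisting only of spaces would still count as valid, so the whitespace-only rule from the question would need a separate check.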
It's because readline doesn't remove '\n' from the end of the line. You could avoid that problem by splitting the whole file content into lines and then validating them one by one.
import sys
file_name = sys.argv[1]
file = open(file_name ,"r")
correctInput = 'ABCabc '
lines = file.read().splitlines()
for line1 in lines:
    if all(char in correctInput for char in line1):
        print 'Valid'
    else:
        print 'Invalid'
I am reading a .dat file and the first few lines are just metadata before it gets to the actual data. A shortened example of the .dat file is below.
&SRS
SRSRUN=266128,SRSDAT=20180202,SRSTIM=122132,
fc.fcY=0.9000
&END
energy rc ai2
8945.016 301.32 6.7959
8955.497 301.18 6.8382
8955.989 301.18 6.8407
8956.990 301.16 6.8469
Or as the list:
[' &SRS\n', ' SRSRUN=266128,SRSDAT=20180202,SRSTIM=122132,\n', 'fc.fcY=0.9000\n', '\n', ' &END\n', 'energy\trc\tai2\n', '8945.016\t301.32\t6.7959\n', '8955.497\t301.18\t6.8382\n', '8955.989\t301.18\t6.8407\n', '8956.990\t301.16\t6.8469\n']
I tried this previously:
def import_absorptionscan(file_path,start,end):
    for i in range(start,end):
        lines=[]
        f=open(file_path+str(i)+'.dat', 'r')
        for line in f:
            lines.append(line)
        for line in lines:
            for c in line:
                if c.isalpha():
                    lines.remove(line)
        print lines
But I get this error: ValueError: list.remove(x): x not in list
I started looking through Stack Overflow, but most of what came up was how to strip alphabetical characters from a string, so I made this question.
This produces a list of strings, with each string making up one line in the file. I want to remove any string which contains any alphabet characters as this should remove all the metadata and leave just the data. Any help would be appreciated thank you.
I have a suspicion you will want a more robust rule than "does the string contain a letter?", but you can use a regular expression to check:
re.search("[a-zA-Z]", line)
You'll probably want to take a look at the regular expression docs.
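As a sketch of how that check could be used, the example below filters a shortened copy of the list from the question. The ValueError in the question most likely comes from calling lines.remove(line) once per letter, so the second letter of a line tries to remove a line that is already gone; building a new list sidesteps that.

```python
import re

raw_lines = [' &SRS\n', ' SRSRUN=266128,SRSDAT=20180202,SRSTIM=122132,\n',
             'fc.fcY=0.9000\n', '\n', ' &END\n', 'energy\trc\tai2\n',
             '8945.016\t301.32\t6.7959\n', '8955.497\t301.18\t6.8382\n']

HAS_LETTER = re.compile(r"[a-zA-Z]")

# Build a new list instead of calling remove() while iterating.
data_lines = [line for line in raw_lines if not HAS_LETTER.search(line)]
print(data_lines)
```

The blank line survives this filter because it contains no letters, which is one reason a stricter rule may be needed.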
Additionally, you can use the any function to check for letters. Inside your loop over the lines, test:
if any(word.isalpha() for word in line.split())
Notice that str.isalpha() is only true when every character of the word is a letter, so a word such as "ver9" would not be flagged. If this is a problem, check letter by letter instead:
line_is_meta = False
for word in line.split():
    if any(letter.isalpha() for letter in word):
        line_is_meta = True
        break
if not line_is_meta:
    lines.append(line)
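If a stricter rule than "contains a letter" is wanted, one option is to keep only the lines whose fields all parse as numbers. This is a sketch of that alternative, not something taken from the answers above; the helper name parse_row is hypothetical and the sample is a shortened copy of the question's list.

```python
def parse_row(line):
    """Return the line as a list of floats, or None if it is not numeric data."""
    fields = line.split()
    if not fields:
        return None  # blank line
    try:
        return [float(field) for field in fields]
    except ValueError:
        return None  # at least one field is not a number

raw_lines = [' &END\n', '\n', 'energy\trc\tai2\n', '8945.016\t301.32\t6.7959\n']
rows = [row for row in map(parse_row, raw_lines) if row is not None]
print(rows)
```

Unlike the letter test, this also accepts values written in scientific notation such as 1e5, and it drops blank lines.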
I have a .txt doc full of text. I'd like to search it for specific characters (or ideally groups of characters, i.e. strings), then do things with the character found and with the characters 2 in front of / 4 behind the selected characters.
I made a version that searches lines for the character, but I can't find the equivalent for characters.
f = open("C:\Users\Calum\Desktop\Robopipe\Programming\data2.txt", "r")
searchlines = f.readlines()
f.close()
for i, line in enumerate(searchlines):
    if "_" in line:
        for l in searchlines[i:i+2]: print l, #if i+2 then prints line and the next
        print
If I understand the problem, what you want is to repeatedly search one giant string, instead of searching a list of strings one by one.
So, the first step is, don't use readlines, use read, so you get that one giant string in the first place.
Next, how do you repeatedly search for all matches in a string?
Well, a string is an iterable, just like a list is—it's an iterable of characters (which are themselves strings with length 1). So, you can just iterate over the string:
f = open(path)
searchstring = f.read()
f.close()
for i, ch in enumerate(searchstring):
    if ch == "_":
        print searchstring[i-4:i+2]
However, notice that this only works if you're only searching for a single-character match. And it will fail if you find a _ in the first four characters. And it can be inefficient to loop over a few MB of text character by character.* So, you probably want to instead loop over str.find:
i = 4
while True:
    i = searchstring.find("_", i)
    if i == -1:
        break
    print searchstring[i-4:i+2]
    i += 1  # move past this match, otherwise find returns the same index forever
* You may be wondering how find could possibly be doing anything but the same kind of loop. And you're right, it's still iterating character by character. But it's doing it in optimized code provided by the standard library—with the usual CPython implementation, this means the "inner loop" is in C code rather than Python code, it doesn't have to "box up" each character to test it, etc., so it can be much, much faster.
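The find-based loop can be wrapped in a function that also copes with matches near the start of the text and with multi-character search strings. This is a Python 3 sketch; the name find_contexts and its before/after parameters are assumptions made for illustration, with defaults chosen to mirror the [i-4:i+2] slice used above.

```python
def find_contexts(text, needle, before=4, after=1):
    """Collect text[i-before : i+len(needle)+after] for every match (hypothetical helper)."""
    contexts = []
    i = text.find(needle)
    while i != -1:
        start = max(0, i - before)  # clamp so a negative index never wraps around
        contexts.append(text[start:i + len(needle) + after])
        i = text.find(needle, i + 1)  # resume just past the start of this match
    return contexts

print(find_contexts("a_bcdefgh_ij", "_"))
```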
You could use a regex for this:
The regex searches for any two characters (that are not _), an _, then any four characters that are not an underscore.
import re
with open(path) as f:
    searchstring = f.read()
regex = re.compile("([^_]{2}_[^_]{4})")
for match in regex.findall(searchstring):
    print match
With the input of:
hello_there my_wonderful_friend
The script returns:
lo_ther
my_wond
ul_frie
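If the position of each hit matters as well as its text, re.finditer can be used in place of findall. The sketch below reuses the sample input from above with the same pattern, minus the capturing group:

```python
import re

searchstring = "hello_there my_wonderful_friend"
regex = re.compile(r"[^_]{2}_[^_]{4}")

# finditer yields match objects, which carry the position as well as the text.
hits = [(m.start(), m.group()) for m in regex.finditer(searchstring)]
print(hits)
```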
I've got a spreadsheet of information (UTF-8 CSV file being read in by the csv module) that contains information for a large number of products that need to go into an inventory db. I'm trying to convert descriptions from newline-separated rows of text into HTML list tags.
The issue I'm having is that the following lines fail to replace the newline character in the string:
line[2] = "<ul><li>" + line[2]
line[2].replace('\n', '</li><li>')
line[2] += "</li></ul>"
The string continues to contain newline characters even when the second line is replaced by:
line[2] = line[2].rstrip()
What is going on, and what am I messing up? =)
From the Python manual:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
As you can see, it doesn't replace "in place". Instead, try:
line[2] = "<ul><li>" + line[2]
line[2] = line[2].replace('\n', '</li><li>')
line[2] += "</li></ul>"
Replace isn't in place.
So do
line[2] = "<ul><li>" + line[2].replace('\n', '</li><li>') + "</li></ul>"
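The difference between calling replace and keeping its result can be checked on a plain string. This is a minimal sketch with a made-up description value standing in for line[2]:

```python
description = "red\ngreen\nblue"

# Calling replace() without using the result leaves the original untouched.
description.replace("\n", "</li><li>")
unchanged = description

# Using the returned copy is what makes the change visible.
html_list = "<ul><li>" + description.replace("\n", "</li><li>") + "</li></ul>"
print(html_list)
```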
Don't forget to escape! (cgi.escape is Python 2 only; it was removed in Python 3.8 in favor of html.escape.)
import cgi
escaped = cgi.escape(line[2].rstrip()).replace("\n", "</li><li>")
line[2] = "<ul><li>%s</li></ul>" % escaped
str.replace returns a copy instead of modifying in place, and rstrip with no argument will strip all trailing whitespace. Since this is for HTML and trailing whitespace probably won't include something like "\n \n ", that probably doesn't matter to you, but it is something to be aware of.
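A Python 3 sketch of the same escape-then-wrap idea, using html.escape from the standard library, could look like this. The helper name to_html_list and the sample description are hypothetical:

```python
import html

def to_html_list(description):
    """Turn newline-separated text into an escaped <ul> (hypothetical helper)."""
    items = description.rstrip().split("\n")
    return "<ul>" + "".join("<li>%s</li>" % html.escape(item) for item in items) + "</ul>"

print(to_html_list("5 < 10\nsalt & pepper\n"))
```

Escaping each item before wrapping it keeps characters such as < and & in the product text from being read as markup.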