I have a txt file, single COLUMN, taken from excel, of the following type:
AMANDA (LOUDLY SPEAKING)
JEFF
STEVEN (TEASINGLY)
AMANDA
DOC BRIAN GREEN
As output I want:
AMANDA
JEFF
STEVEN
AMANDA
DOC BRIAN GREEN
I tried a for loop over the whole column and then:
if (str[i] == '('):
    return str.split('(')
but it's clearly not working.
Do you have any possible solution? I would then need an output file as my original txt, so with each name for each line in a single column.
Thanks everyone!
(I am using PyCharm 3.2)
I'd use a regex in this situation. \( and \) match the literal parentheses, and .*? lazily matches zero or more characters between them, so each parenthesised remark is replaced with an empty string.
import re
# Copy each line to the output file, dropping any parenthesised text
with open("mytext.txt", "r") as fi, open("out.txt", "w") as fo:
    for line in fi:
        # \s* also removes the space in front of the opening parenthesis
        fo.write(re.sub(r"\s*\(.*?\)", "", line))
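As a minimal sketch of the same idea on in-memory data, the substitution can be wrapped in a small function and tried on the sample column from the question. The helper name strip_parenthetical is made up for illustration, and the sketch assumes each remark sits in a single pair of parentheses on one line.

```python
import re

# Optional whitespace followed by a parenthesised remark, e.g. " (LOUDLY SPEAKING)"
PAREN = re.compile(r"\s*\(.*?\)")

def strip_parenthetical(line):
    """Return the line without parenthesised remarks or surrounding whitespace."""
    return PAREN.sub("", line).strip()

sample = [
    "AMANDA (LOUDLY SPEAKING)",
    "JEFF",
    "STEVEN (TEASINGLY)",
    "AMANDA",
    "DOC BRIAN GREEN",
]
cleaned = [strip_parenthetical(name) for name in sample]
print("\n".join(cleaned))
```

Writing "\n".join(cleaned) to a file would give one name per line again, as in the original txt.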
You can split the string into a list using a regular expression that matches either everything in parentheses or a full word, remove all elements from the list which contain parentheses, and then join the list into a string again. The advantage is that there will be no double spaces in the result string where a word in parentheses was removed.
import re
text = "AMANDA (LOUDLY SPEAKING) JEFF STEVEN (TEASINGLY) AMANDA DOC BRIAN GREEN"
words = re.findall(r"\(.*?\)|[^\s]+", text)
print(" ".join([x for x in words if "(" not in x]))
Having an issue with Regex and not really understanding its usefulness right now.
I'm trying to extract data from a file. The file consists of first name, last name, and grade.
File:
Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
Opening file code:
##Regex Code r'([A-Za-z]+)(: B)
regcode = r'([A-Za-z]+)(: B)'
answer=re.findall(regcode,file)
return answer
The expected result is first name last name. The given result is last name and letter grade. How do I just get the first name and last name for all B grades?
Since you must use regex for this task, here's a simple regex solution that returns the full name:
'(.*): B'
Which works in this case because:
(.*) returns all text up to a match of : B
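A minimal sketch of how that pattern could be applied to the sample data with re.findall. The variable names are illustrative, and the ^ and $ anchors together with re.MULTILINE are an added assumption, so that a grade such as "B+" would not match by accident.

```python
import re

grades = """Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B"""

# With re.MULTILINE, ^ and $ anchor every line; the group keeps only the name
b_students = re.findall(r"^(.*): B$", grades, flags=re.MULTILINE)
print(b_students)
```

Because . does not match a newline by default, each match stays within one line.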
You can do it without regex:
students = '''Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B'''
for x in students.split('\n'):
    string = x.split(': ')
    if string[1] == 'B':
        print(string[0])
# Robert Right
# Jim Jim
or
[x[0:-3] for x in students.split('\n') if x[-1] == 'B']
If a regex solution is required (I personally prefer Roman Zhak's solution), put what you are interested in, i.e. the first name and the last name, inside groups, followed by the colon and B:
import re
file = """
Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
"""
regcode = r'([A-Za-z]+) ([A-Za-z]+): B'
answer = re.findall(regcode, file)
print(answer) # [('Robert', 'Right'), ('Jim', 'Jim')]
Add a capturing group ('()') to your expression. Everything outside the group will be ignored, even if it matches the expression.
re.findall(r'(\w+\s+\w+):\s+B', file)
# ['Robert Right', 'Jim Jim']
'\w' is any alphanumeric character or underscore, '\s' is any whitespace character.
You can add two groups, one for the first name and one for the last name:
re.findall(r'(\w+)\s+(\w+):\s+B', file)
# [('Robert', 'Right'), ('Jim', 'Jim')]
The latter captures only the last two words if a name on a line has more than two parts.
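To illustrate that caveat, here is a small sketch with a made-up three-word name added to the data. The second pattern is one possible way (an assumption, not part of the answer above) to keep names of any length in a single group.

```python
import re

data = "Robert Right: B\nRalph Waldo Emerson: B\nKim Long: C"

# Two fixed groups: only the two words right before the colon are captured
two_groups = re.findall(r"(\w+)\s+(\w+):\s+B", data)

# One group that repeats "spaces + word" (no newlines), so the whole name is kept
full_names = re.findall(r"(\w+(?:[ \t]+\w+)*):\s+B", data)

print(two_groups)
print(full_names)
```

The first call loses "Ralph", while the second returns each full name as one string.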
So some sample texts are this:
Greece: Rare
Athens
Patras
------
Italy: Unique
Milan
------
and I want to get the whole text between the second occurrence of a newline before the "-" and the "-" itself, i.e. the line just before the dashes.
Expected output:
Patras
Milan
Is this possible through regex or should I try something else?
Just search for the line before the dashes:
import re
text="""Greece: Rare
Athens
Patras
------
"""
print(re.search("(.*)\n-+",text).group(1))
prints
Patras
Note that the (.*) group matches only that line and not the previous ones, because . doesn't match \n by default.
Without regex, this can be done by looking at the index of the dashed line, and printing the previous line.
lines = text.splitlines()
index = next(i for i,x in enumerate(lines) if x.startswith("-"))
print(lines[index-1])
I'd go for the regex solution though.
This is a solution:
import re
texts=["""Greece: Rare
Athens
Patras
------
""","""Italy: Unique
Milan
------"""]
for text in texts:
    print(re.search("\n(.*)\n[-]",text).group(1))
Output:
Patras
Milan
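If all the records live in one string rather than in separate ones, a sketch along the following lines could collect every "line before the dashes" in one pass. It assumes the records are simply concatenated, as in the question's sample.

```python
import re

text = """Greece: Rare
Athens
Patras
------
Italy: Unique
Milan
------
"""

# With re.MULTILINE, ^ and $ match at every line boundary, so each
# "line followed by a line of dashes" is found in a single scan.
last_lines = re.findall(r"^(.*)\n-+$", text, flags=re.MULTILINE)
print(last_lines)
```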
I have a file with names with spaces. I am trying to make files for each of the names in the file, only using their last names. Here is an example of the file:
Ernest Hemingway
Mark Twain
Ralph Waldo Emerson
Edgar Allan Poe
Robert Frost
The files created should be in the format of:
Hemingway.txt
Twain.txt
Waldo_Emerson.txt
Allan_Poe.txt
Where the spaces in the last names are replaced by underscores. I am having trouble with getting rid of the first names when replacing the spaces. This is what I have so far:
file_name=name.replace(" ", "_")
I'm not sure how to somehow ignore the first "element" when it replaces. The other thing I thought about doing is to use split.
Try this:
def get_last_name(name):
    return "_".join(name.split()[1:])
split() splits the string into tokens (separated at whitespaces), and [1:] selects all but the first element of the split. We then join those elements together with an underscore "_".
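Since the question is about actually creating the files, here is a hedged sketch that builds on the same split and join idea. The function names and the use of pathlib are my own choices for illustration, not something the question prescribes.

```python
from pathlib import Path

def last_name_filename(name):
    """Drop the first word and join the rest with underscores, plus '.txt'."""
    return "_".join(name.split()[1:]) + ".txt"

def create_name_files(names, directory):
    """Create one empty file per name inside `directory` and return the paths."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in names:
        if not name.strip():
            continue  # skip blank lines
        path = target / last_name_filename(name)
        path.touch()  # creates the file if it does not exist yet
        paths.append(path)
    return paths

names = ["Ernest Hemingway", "Mark Twain", "Ralph Waldo Emerson"]
print([last_name_filename(n) for n in names])
```

Note that a name consisting of a single word would produce just ".txt", so such lines may need special handling.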
You can just combine replace with a substring:
my_string="Ralph Waldo Emerson"
my_string.split(" ",1)[1].replace(" ", "_")
This should do the trick.
I hope it helps.
BR
Here's one way using split and join along with further slicing to generate a list with the specified output structure:
lines = [line.rstrip('\n') for line in open('my_file.txt')]
['_'.join(i.split()[1:]) + '.txt' for i in lines]
Output
['Hemingway.txt',
'Twain.txt',
'Waldo_Emerson.txt',
'Allan_Poe.txt',
'Frost.txt']
A one-liner using list comprehension, where we ignore the first word, and join all other words in the string with an underscore
li = [ '_'.join(item.split()[1:])+'.txt' for item in open('file.txt')]
print(li)
So if the file.txt is
Ernest Hemingway
Mark Twain
Ralph Waldo Emerson
Edgar Allan Poe
Robert Frost
The output will be
['Hemingway.txt', 'Twain.txt', 'Waldo_Emerson.txt', 'Allan_Poe.txt', 'Frost.txt']
This should work
name_list = ["Ernest Hemingway", "Ralph Waldo Emerson"]
filenames = []
for name in name_list:
    filenames += ["_".join(name.split(" ")[1:]) + ".txt"]
I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:
B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS
I'd like to remove all extra spaces (they're actually different number of spaces, not tabs) in between the columns. I'd also then like to replace it with some delimiter (tab or pipe since there's commas within the data), like so:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:
Python Regular expression must strip whitespace except between quotes,
Remove white spaces from dict : Python.
You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute each match with a single '|' character.
>>> import re
>>> line = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS   '
>>> re.sub(r'\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'
Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.
Your actual code should look similar to this:
import re
with open(filename) as f:
    for line in f:
        subbed = re.sub(r'\s{2,}', '|', line.strip())
        # do something here
What about this?
import re
your_string = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS'
print(re.sub(r'\s{2,}', '|', your_string.strip()))
Output:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Explanation:
I've used re.sub(), which takes three parameters: a pattern, the replacement string, and the string you want to work on.
The pattern matches any run of at least two whitespace characters, and each such run is replaced with a |.
import re
s = """B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS
"""
# Replace two or more spaces between non-space characters with a pipe
print(re.sub(r"(\S) {2,}(\S)(\n?)", r"\1|\2\3", s))
Output:
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Considering there are at least two spaces separating the columns, you can use this:
lines = [
    'B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON   ',
    'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS   '
]
for line in lines:
    parts = []
    for part in line.split('  '):  # split on two consecutive spaces
        part = part.strip()
        if part:  # checking if stripped part is a non-empty string
            parts.append(part)
    print('|'.join(parts))
Output for your input:
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
It looks like your data is in a "text-table" format.
I recommend using the first row to figure out the start point and length of each column (either by hand or write a script with regex to determine the likely columns), then writing a script to iterate the rows of the file, slice the row into column segments, and apply strip to each segment.
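As a rough sketch of the "script with regex" option, the column start positions can be guessed from the first row. This assumes every column of that row is filled and that no value contains two consecutive spaces. The sample row is rebuilt here with ljust under the assumption that the columns are 34, 16 and 9 characters wide, and guess_columns is a hypothetical helper name.

```python
import re

# Hypothetical first row, rebuilt with ljust under the assumption that the
# columns are 34, 16 and 9 characters wide (the last column is open-ended).
first_row = ("B, NICKOLAS".ljust(34) + "CT144531X".ljust(16)
             + "D1026".ljust(9) + "JUDGE ANNIE WHITE JOHNSON")

def guess_columns(row):
    """Guess (start, end) slices from a row whose values contain only single spaces."""
    # A value is a run of non-space tokens separated by exactly one space
    starts = [m.start() for m in re.finditer(r"\S+(?: \S+)*", row)]
    ends = starts[1:] + [None]  # each column ends where the next one starts
    return list(zip(starts, ends))

cols = guess_columns(first_row)
print(cols)
```

The guessed slices should still be checked by eye against a few more rows before they are trusted.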
If you use a regex, you must keep track of the number of columns and raise an error if any given row has more than the expected number of columns (or a different number than the rest). Splitting on two-or-more spaces will break if a column's value has two-or-more spaces, which is not just entirely possible, but also likely. Text-tables like this aren't designed to be split on a regex, they're designed to be split on the column index positions.
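A minimal sketch of that column-count check, assuming four columns as in the sample data. The function name split_row and the second, deliberately broken row are made up for illustration.

```python
import re

def split_row(line, expected_columns=4):
    """Split on runs of two or more whitespace characters and validate the column count."""
    parts = re.split(r"\s{2,}", line.strip())
    if len(parts) != expected_columns:
        raise ValueError(
            "expected %d columns, got %d: %r" % (expected_columns, len(parts), line)
        )
    return parts

good = "ANDREWS VS BALL     JA-15-0050   D0015    JUDGE EDWARD A ROBERTS"
print(split_row(good))

# A value that itself contains two spaces produces an extra column
bad = "ANDREWS  VS BALL     JA-15-0050   D0015    JUDGE EDWARD A ROBERTS"
try:
    split_row(bad)
except ValueError as exc:
    print("rejected:", exc)
```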
In terms of saving the data, you can use the csv module to write/read into a csv file. That will let you handle quoting and escaping characters better than specifying a delimiter. If one of your columns has a | character as a value, unless you're encoding the data with a strategy that handles escapes or quoted literals, your output will break on read.
Parsing the text above would look something like this (I nested a list comprehension with brackets instead of the traditional format so it's easier to understand):
cols = ((0, 34),
        (34, 50),
        (50, 59),
        (59, None),
        )
for line in lines:
    # slice each column out of the line, then strip the padding
    cleaned = [i.strip() for i in [line[s:e] for (s, e) in cols]]
    print(cleaned)
then you can write it with something like:
import csv
# In Python 3 the csv module expects a text-mode file opened with newline=''
with open('output.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        spamwriter.writerow([line[col_start:col_end].strip()
                             for (col_start, col_end) in cols
                             ])
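To show why the csv module is safer than joining with a bare delimiter, here is a small sketch that writes a value containing a | and reads it back. The second row is invented purely to exercise the quoting, and io.StringIO stands in for a real file.

```python
import csv
import io

rows = [
    ["ANDREWS VS BALL", "JA-15-0050", "D0015", "JUDGE EDWARD A ROBERTS"],
    ["A|B HOLDINGS VS DOE", "XX-00-0000", "D0001", "JUDGE EXAMPLE"],  # made-up row with a pipe
]

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.writer(buffer, delimiter="|", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
written = buffer.getvalue()
print(written)

# Reading back with the same dialect settings restores the original values
read_back = list(csv.reader(io.StringIO(written), delimiter="|", quotechar='"'))
```

With QUOTE_MINIMAL only the field that contains the delimiter gets quoted, so the rest of the output stays as plain as a manual join would be.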
Looks like this library can solve this quite nicely:
http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery
Impressive...
I am working with a text file (620KB) that has a list of ID#s followed by full names separated by a comma.
The working regex I've used for this is
^([A-Z]{3}\d+)\s+([^,\s]+)
I want to also capture the first name and middle initial (space delimiter between first and MI).
I tried this by doing:
^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)
Which works, but I want to remove the line break that ends up in the output file. I will be importing the two output files into a database (possibly Access) and I don't want to capture the line breaks. Also, is there a better way of writing the regex?
Full code:
import re
source = open('source.txt')
ticket_list = open('ticket_list.txt', 'w')
id_list = open('id_list.txt', 'w')
for lines in source:
    m = re.search('^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)', lines)
    if m:
        x = m.group()
        print('Ticket: ' + x)
        ticket_list.write(x + "\n")
ticket_list = open('First.txt', 'r')
for lines in ticket_list:
    y = re.search('^(\d+)\s+([^\s]+([\D+])+)', lines)
    if y:
        z = y.group()
        print ('ID: ' + z)
        id_list.write(z + "\n")
source.close()
ticket_list.close()
id_list.close()
Sample Data:
Source:
ABC1000033830 SMITH, Z
100000012 Davis, Franl R
200000655 Gest, Baalio
DEF4528942681 PACO, BETH
300000233 Theo, David Alex
400000012 Torres, Francisco B.
ABC1200045682 Mo, AHMED
DEF1000006753 LUGO, G TO
ABC1200123123 de la Rosa, Maria E.
Depending on what kind of linebreak you're dealing with, a simple positive lookahead may remedy your pattern capturing the linebreak in the result. This was generated by RegexBuddy 4.2.0, and worked with all your test data.
if re.search(r"^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)(?=$)", subject, re.IGNORECASE | re.MULTILINE):
    pass  # Successful match
else:
    pass  # Match attempt failed
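Basically, the positive lookahead makes sure that the end of a line comes directly after the pattern. A lookahead is checked but never consumed, so nothing after that point becomes part of the match. Be aware that \D also matches a newline, so if each line is matched on its own with its trailing newline still attached, the greedy ([\D])+ can still swallow that newline; stripping the line first avoids this.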
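An alternative sketch, not taken from the answer above: keep the newline out of the capture by excluding it from the character class, so no lookahead is needed. The pattern and variable names are my own assumptions, tried on the first few lines of the sample data.

```python
import re

source = """ABC1000033830 SMITH, Z
100000012 Davis, Franl R
DEF4528942681 PACO, BETH
"""

# [^\d\n]+ stops at the end of the line, so the captured name never holds "\n"
ticket_re = re.compile(r"^([A-Z]{3}\d+)[ \t]+([^\d\n]+)$", re.MULTILINE)

tickets = [(ticket_id, name.strip()) for ticket_id, name in ticket_re.findall(source)]
print(tickets)
```

Lines that start with a plain number are skipped by this pattern, which mirrors the ticket half of the original script.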