Python3 - Replacing spaces with underscore, but skip first element - python

I have a file with names with spaces. I am trying to make files for each of the names in the file, only using their last names. Here is an example of the file:
Ernest Hemingway
Mark Twain
Ralph Waldo Emerson
Edgar Allan Poe
Robert Frost
The files created should be in the format of:
Hemingway.txt
Twain.txt
Waldo_Emerson.txt
Allan_Poe.txt
Where the spaces in the last names are replaced by underscores. I am having trouble with getting rid of the first names when replacing the spaces. This is what I have so far:
file_name=name.replace(" ", "_")
I'm not sure how to somehow ignore the first "element" when it replaces. The other thing I thought about doing is to use split.

Try this:
def get_last_name(name):
return "_".join(name.split()[1:])
split() splits the string into tokens (separated at whitespaces), and [1:] selects all but the first element of the split. We then join those elements together with an underscore "_".

You can just mix this replace with a substing:
my_string="Ralph Waldo Emerson"
my_string.split(" ",1)[1].replace(" ", "_")
This should do the trick.
I hope it helps.
BR

Here's one way using split and join along with further slicing to generate a list with the specified output structure:
lines = [line.rstrip('\n') for line in open('my_file.txt')]
['_'.join(i.split()[1:]) + '.txt' for i in lines]
Output
['Hemingway.txt',
'Twain.txt',
'Waldo_Emerson.txt',
'Allan_Poe.txt',
'Frost.txt']

A one-liner using list comprehension, where we ignore the first word, and join all other words in the string with an underscore
li = [ '_'.join(item.split()[1:])+'.txt' for item in open('file.txt')]
print(li)
So if the file.txt is
Ernest Hemingway
Mark Twain
Ralph Waldo Emerson
Edgar Allan Poe
Robert Frost
The output will be
['Hemingway.txt', 'Twain.txt', 'Waldo_Emerson.txt', 'Allan_Poe.txt', 'Frost.txt']

This should work
name_list = ["Ernest Hemingway","Ralph Waldo Emerson"]
filenames = []
for name in names:
filenames += ["_".join(name.split(" ")[1:]) + ".txt"]

Related

How can I break up a file delimited by colons and semi-colons and place each element on a separate line/row?

I have a file with colon and semi-colon delimited string elements. They are email addresses formatted as such:
Tony Stark, <ironman#stark-tesla.com>; Clark Kent, <Ckent1#dailyplanet.com>; Peter Parker, <pparker1#spidy.com>; etc.
What I would like to do is separate each email using it's semi-colon and place it on it's own line or row:
e.g.
Tony Stark, <ironman#stark-tesla.com>
Clark Kent, <Ckent1#dailyplanet.com>
Peter Parker, <pparker1#spidy.com>
What is the most efficient way of accomplishing this?
Thanks
try
string="Tony Stark, <ironman#stark-tesla.com>; Clark Kent, <Ckent1#dailyplanet.com>; Peter Parker, <pparker1#spidy.com>"
for i in string.split("; "):
print(i)
the .split method on strings returns an array containing the string split by the separator.
The for loop loops through the split string and prints out each one on it's own line.
You can use this function to return the expected string. You play with test case to see what you exactly want.

How to extract person name using regular expression?

I am new to Regular Expression and I have kind of a phone directory. I want to extract the names out of it. I wrote this (below), but it extracts lots of unwanted text rather than just names. Can you kindly tell me what am i doing wrong and how to correct it? Here is my code:
import re
directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson#sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA#anet.com
Cellular: 612-393-0029
Dustin Andrews'''
nameRegex = re.compile('''
(
[A-Za-z]{2,25}
\s
([A-Za-z]{2,25})+
)
''',re.VERBOSE)
print(nameRegex.findall(directory))
the output it gives is:
[('Mark Adamson', 'Adamson'), ('net\nAllison', 'Allison'), ('Andrews\nHome', 'Home'), ('com\nCellular', 'Cellular'), ('Dustin Andrews', 'Andrews')]
Would be really grateful for help!
Your problem is that \s will also match newlines. Instead of \s just add a space. That is
name_regex = re.compile('[A-Za-z]{2,25} [A-Za-z]{2,25}')
This works if the names have exactly two words. If the names have more than two words (middle names or hyphenated last names) then you may want to expand this to something like:
name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)
This looks for one or more words and will stretch from the beginning to end of a line (e.g. will not just get 'John Paul' from 'John Paul Jones')
I can suggest to try the next regex, it works for me:
"([A-Z][a-z]+\s[A-Z][a-z]+)"
The following regex works as expected.
Related part of the code:
nameRegex = re.compile(r"^[a-zA-Z]+[',. -][a-zA-Z ]?[a-zA-Z]*$", re.MULTILINE)
print(nameRegex.findall(directory)
Output:
>>> python3 test.py
['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
Try:
nameRegex = re.compile('^((?:\w+\s*){2,})$', flags=re.MULTILINE)
This will only choose complete lines that are made up of two or more names composed of 'word' characters.

How do I eliminate all the parenthesis in a txt files?

I have a txt file, single COLUMN, taken from excel, of the following type:
AMANDA (LOUDLY SPEAKING)
JEFF
STEVEN (TEASINGLY)
AMANDA
DOC BRIAN GREEN
As output I want:
AMANDA
JEFF
STEVEN
AMANDA
DOC BRIAN GREEN
I tried with a for cycle on all the column and then:
if (str[i] == '('):
return str.split('(')
but it's clearly not working.
Do you have any possible solution? I would then need an output file as my original txt, so with each name for each line in a single column.
Thanks everyone!
(I am using PyCharm 3.2)
I'd use regex in this situation. \w will replace letters, the * will select 0 or more. Then we check that it is between parenthesis.
import re
fi = "AMANDA (LOUDLY) JEFF STEVEN (TEASINGLY) AMANDA"
with open("mytext.txt","r") as fi, open("out.txt", "w") as fo:
for line in fi:
fo.write(re.sub("\(.*?\)", "", line))
You can split the string into a list using a regular expression that matches everything in parentheses or a full word, remove all elements from the list which contain parentheses and then join the list to a string again. The advantage is that there will be no double spaces in the result string where a word in parantheses was removed.
import re
text = "AMANDA (LOUDLY SPEAKING) JEFF STEVEN (TEASINGLY) AMANDA DOC BRIAN GREEN"
words = re.findall("\(.*?\)|[^\s]+",text)
print " ".join([x for x in words if "(" not in x])

Removing white space from txt with python

I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:
B, NICKOLAS CT144531X D1026 JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS
I'd like to remove all extra spaces (they're actually different number of spaces, not tabs) in between the columns. I'd also then like to replace it with some delimiter (tab or pipe since there's commas within the data), like so:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:
Python Regular expression must strip whitespace except between quotes,
Remove white spaces from dict : Python.
You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute the matches with a single '|' character.
>>> import re
>>> line = 'ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS '
>>> re.sub('\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'
Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.
Your actual code should look similar to this:
import re
with open(filename) as f:
for line in f:
subbed = re.sub('\s{2,}', '|', line.strip())
# do something here
What about this?
your_string ='ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS'
print re.sub(r'\s{2,}','|',your_string.strip())
Output:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Expanation:
I've used re.sub() which takes 3 parameter, a pattern, a string you want to replace with and the string you want to work on.
What I've done is taking at least two space together , I 've replaced them with a | and applied it on your string.
s = """B, NICKOLAS CT144531X D1026 JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS
"""
# Update
re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
In [71]: print re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Considering there are at least two spaces separating the columns, you can use this:
lines = [
'B, NICKOLAS CT144531X D1026 JUDGE ANNIE WHITE JOHNSON ',
'ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS '
]
for line in lines:
parts = []
for part in line.split(' '):
part = part.strip()
if part: # checking if stripped part is a non-empty string
parts.append(part)
print('|'.join(parts))
Output for your input:
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
It looks like your data is in a "text-table" format.
I recommend using the first row to figure out the start point and length of each column (either by hand or write a script with regex to determine the likely columns), then writing a script to iterate the rows of the file, slice the row into column segments, and apply strip to each segment.
If you use a regex, you must keep track of the number of columns and raise an error if any given row has more than the expected number of columns (or a different number than the rest). Splitting on two-or-more spaces will break if a column's value has two-or-more spaces, which is not just entirely possible, but also likely. Text-tables like this aren't designed to be split on a regex, they're designed to be split on the column index positions.
In terms of saving the data, you can use the csv module to write/read into a csv file. That will let you handle quoting and escaping characters better than specifying a delimiter. If one of your columns has a | character as a value, unless you're encoding the data with a strategy that handles escapes or quoted literals, your output will break on read.
Parsing the text above would look something like this (i nested a list comprehension with brackets instead of the traditional format so it's easier to understand):
cols = ((0,34),
(34, 50),
(50, 59),
(59, None),
)
for line in lines:
cleaned = [i.strip() for i in [line[s:e] for (s, e) in cols]]
print cleaned
then you can write it with something like:
import csv
with open('output.csv', 'wb') as csvfile:
spamwriter = csv.writer(csvfile, delimiter='|',
quotechar='"', quoting=csv.QUOTE_MINIMAL)
for line in lines:
spamwriter.writerow([line[col_start:col_end].strip()
for (col_start, col_end) in cols
])
Looks like this library can solve this quite nicely:
http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery
Impressive...

Regex to help split up list into two-tuples

Given a list of actors, with their their character name in brackets, separated by either a semi-colon (;) or comm (,):
Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda];
Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily];
Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];
Alfie Bass [Harry]
How would I parse this into a list of two-typles in the form of [(actor, character),...]
--> [('Shelley Winters', 'Ruby'), ('Millicent Martin', 'Siddie'),
('Denholm Elliott', 'Mr. Smith; abortionist')]
I originally had:
actors = [item.strip().rstrip(']') for item in re.split('\[|,|;',data['actors'])]
data['actors'] = [(actors[i], actors[i + 1]) for i in range(0, len(actors), 2)]
But this doesn't quite work, as it also splits up items within brackets.
You can go with something like:
>>> re.findall(r'(\w[\w\s\.]+?)\s*\[([\w\s;\.,]+)\][,;\s$]*', s)
[('Shelley Winters', 'Ruby'),
('Millicent Martin', 'Siddie'),
('Julia Foster', 'Gilda'),
('Jane Asher', 'Annie'),
('Shirley Ann Field', 'Carla'),
('Vivien Merchant', 'Lily'),
('Eleanor Bron', 'Woman Doctor'),
('Denholm Elliott', 'Mr. Smith; abortionist'),
('Alfie Bass', 'Harry')]
One can also simplify some things with .*?:
re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)
inputData = inputData.replace("];", "\n")
inputData = inputData.replace("],", "\n")
inputData = inputData[:-1]
for line in inputData.split("\n"):
actorList.append(line.partition("[")[0])
dataList.append(line.partition("[")[2])
togetherList = zip(actorList, dataList)
This is a bit of a hack, and I'm sure you can clean it up from here. I'll walk through this approach just to make sure you understand what I'm doing.
I am replacing both the ; and the , with a newline, which I will later use to split up every pair into its own line. Assuming your content isn't filled with erroneous ]; or ], 's this should work. However, you'll notice the last line will have a ] at the end because it didn't have a need a comma or semi-colon. Thus, I splice it off with the third line.
Then, just using the partition function on each line that we created within your input string, we assign the left part to the actor list, the right part to the data list and ignore the bracket (which is at position 1).
After that, Python's very useful zip funciton should finish the job for us by associating the ith element of each list together into a list of matched tuples.

Categories

Resources