Replace integer with number of spaces - python

If I have these names:
bob = "Bob 1"
james = "James 2"
longname = "longname 3"
And printing these gives me:
Bob 1
James 2
longname 3
How can I make sure that the numbers would be aligned (without using \t or tabs or anything)? Like this:
Bob     1
James   2
longname3

This is a good use for a format string, which can specify a width for a field to be filled with a character (including spaces). But, you'll have to split() your strings first if they're in the format at the top of the post. For example:
"{: <10}{}".format(*bob.split())
# output: 'Bob       1'
The < means left-align, and the space before it is the fill character used to pad the "empty" part of the field; it doesn't have to be a space. 10 is the field width in characters, and the : separates the (here empty) argument name from the format specification, so that <10 isn't mistaken for the name of the argument to insert.
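To make the fill, alignment, and width parts concrete, here is a small sketch. The sample values and variable names are chosen only for illustration and are not from the question.

```python
# Illustrative values only; "Bob", 1 and the width of 10 are assumptions.
padded_with_spaces = "{: <10}{}".format("Bob", 1)   # fill ' ', left-align, width 10
padded_with_dots = "{:.<10}{}".format("Bob", 1)     # fill '.', left-align, width 10
right_aligned = "{:>10}{}".format("Bob", 1)         # default fill, right-align, width 10

for text in (padded_with_spaces, padded_with_dots, right_aligned):
    print(repr(text))
```

Printing with repr() makes the padding visible, which helps when the fill character is a space.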
Based on your example, it looks like you want the width to be based on the longest name. In which case you don't want to hardcode 10 like I just did. Instead you want to get the longest length. Here's a better example:
names_and_nums = [x.split() for x in (bob, james, longname)]
longest_length = max(len(name) for (name, num) in names_and_nums)
format_str = "{: <" + str(longest_length) + "}{}"
for name, num in names_and_nums:
    print(format_str.format(name, num))
See: Format specification docs
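As a possible variation, assuming Python 3.6 or later, the width can be nested inside an f-string replacement field instead of building the format string by concatenation. The extra + 1 is an assumption of this sketch, added so the longest name is still separated from its number by one space.

```python
# Sketch only: the list name and the "+ 1" padding are assumptions.
people = ["Bob 1", "James 2", "longname 3"]

pairs = [entry.split() for entry in people]
width = max(len(name) for name, _ in pairs) + 1  # longest name plus one space

# {name:<{width}} left-aligns name in a field whose width is computed at runtime.
lines = [f"{name:<{width}}{num}" for name, num in pairs]
print("\n".join(lines))
```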

Related

How to replace string and exclude certain changing integers?

I am trying to replace
'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
with:
'AMAT 10Q Filing Section: Risk'
However, everything up until Filing Section: Risk will be constantly changing, except for positioning. I just want to pull the characters from position 0 to 5 and from 15 through 19.
df['section'] = df['section'].str.replace(
I'd like to manipulate this but not sure how?
Any help is much appreciated!
Given your series as s
s.str.slice(0, 5) + s.str.slice(15, 19) # if substring-ing
s.str.replace(r'\d{5}', '', regex=True) # for a 5-length digit string
You may need to adjust your numbers to index properly. If that doesn't work, you probably want to use a regular expression to get rid of some length of numbers (as above, with the example of 5).
Or in a single line to produce the final output you have above:
s.str.replace(r'\d{10}_|\d{8}_', '', regex=True).str.replace('_', ' ')
Though, it might not be wise to replace the underscores. Instead, if they change, explode the data into various columns which can be worked on separately.
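For a single string outside pandas, the same idea can be sketched with the standard re module. This assumes every numeric segment is surrounded by underscores on both sides and that two numeric segments never sit directly next to each other; it is not taken from the answer above.

```python
import re

s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'

# Replace each "_<digits>_" run with a single space.
# "10Q" survives because the digits are not followed directly by an underscore.
cleaned = re.sub(r'_\d+_', ' ', s)
print(cleaned)
```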
If you want to replace characters at a fixed position and length, use str.slice_replace:
df['section'] = df['section'].str.slice_replace(6, 14, ' ')
Other people would probably use regex to replace pieces in your string. However, I would:
Split the string
Append the piece if it isn't a number
Join the remaining data
Like so:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
n = []
for i in s.split('_'):
    try:
        i = int(i)
    except ValueError:
        n.append(i)
print(' '.join(n))
AMAT 10Q Filing Section: Risk
Edit:
Re-reading your question, if you are just looking to substring:
Grabbing the first 4 characters (the end index of a slice is exclusive):
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
print(s[:4]) # indices 0 to 3 == first 4 characters ('AMAT')
print(s[15:19]) # indices 15 to 18 ('_10Q')
print(s[15:]) # index 15 to the end
If you would like to just replace pieces:
print(s.replace('_', ' '))
you could throw this in one line as well:
print((s[:4] + s[15:19] + s[28:]).replace('_', ' '))
'AMAT 10Q Filing Section: Risk'

python dataframe regex create new column from text cell

I have a dataframe and one of the columns contains a bunch of random text. Within the random text is one name per row. I would like to create a new column within the dataframe that is only the name. All of these names start with capital letters and are preceded by phrases like "Meet", "name is", "hello to". I believe I should use regex but I'm not sure beyond that.
Example texts from a dataframe cells:
"This is John. He is a rock star on tour in Australia." (desired name is John)
"Meet Randy. He probably has the best hairdo on planet Earth." (desired name is Randy)
"Say hello to Mike! His moustache won first prize at the county fair." (desired name is Mike)
I think the code should be something like:
df['name'][df['text'].str.extract('r'____________')
First get the regex patterns. Looking at your examples, my logic is that:
every name starts with a capital letter,
there is whitespace before the name,
there is a punctuation character right after the name (exclamation mark or full stop),
and whitespace follows that punctuation, otherwise even Earth would be counted, which we do not want.
The regex for the above is:
re1='(\\s+)' # White Space 1
re2='((?:[A-ZÀ-ÿ][a-zÀ-ÿ]+))' # Word 1
re3='([.!,?\\-])' # Any Single Character 1
re4='(\\s+)' # White Space 2
I use this website to get my regex: https://txt2re.com/
Now do:
df['name'] = df['text'].str.extract(re1+re2+re3+re4, expand=True)[1]
Output:
0 John
1 Randy
2 Mike
3 Amélie
Name: name, dtype: object
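The same pattern can be tried on plain strings with the standard re module before applying it to a dataframe. This is a sketch: the helper name extract_name is hypothetical, and the pattern is the answer's four pieces joined into one raw string with only the name kept as a capture group.

```python
import re

# whitespace, capitalized word (captured), one punctuation character, whitespace
NAME_PATTERN = re.compile(r'\s+([A-ZÀ-ÿ][a-zÀ-ÿ]+)[.!,?\-]\s+')

texts = [
    "This is John. He is a rock star on tour in Australia.",
    "Meet Randy. He probably has the best hairdo on planet Earth.",
    "Say hello to Mike! His moustache won first prize at the county fair.",
]

def extract_name(text):
    """Return the first match of the pattern, or None if nothing matches."""
    match = NAME_PATTERN.search(text)
    return match.group(1) if match else None

found = [extract_name(text) for text in texts]
print(found)
```

Note that this relies on the name being the first capitalized word followed by punctuation and whitespace; text that breaks that assumption would need a different pattern.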

Formatting Python output into rows

So, I'm still sort of new to programming, and I'm trying to format the output of some arrays in Python. I'm finding it hard to wrap my head around some of the aspects of formatting.
I have a few arrays that I want to print, in the format of a table.
headings = ["Name", "Age", "Favourite Colour"]
names = ["Barry", "Eustace", "Clarence", "Razputin", "Harvey"]
age = [39, 83, 90, 15, 23]
favouriteColour = ["Green", "Baby Pink", "Sky Blue", "Orange", "Crimson"]
I want the output to look like this: (where the column widths are a little more than the max length in that column)
Name      Age  Favourite Colour
Barry     39   Green
Eustace   83   Baby Pink
Clarence  90   Sky Blue
Razputin  15   Orange
Harvey    23   Crimson
I tried to do this:
mergeArr = [headings, names, age, favouriteColour]
but (I think) that won't print the headings in the right place?
I tried this:
mergeArr = [names, age, favouriteColour]
col_width = max(len(str(element)) for row in mergeArr for element in row) + 2
for row in mergeArr:
    print("".join(str(element).ljust(col_width) for element in row))
but that prints the data of each object in columns, rather than rows.
Help is appreciated! Thanks.
You'd print the heading on its own (the one with name, age, favourite colour).
Then you use the code you have, but with:
rows = zip(names, age, favouriteColour)
for row in rows...
You might also look into the tabulate package for nicely formatted tables.
Just adding the extra formatting:
ll = [headings] + list(zip(names, age, favouriteColour))
for l in ll:
    print("{:<10}\t{:<2}\t{:<16}".format(*l))
# Name Age Favourite Colour
# Barry 39 Green
# Eustace 83 Baby Pink
# Clarence 90 Sky Blue
# Razputin 15 Orange
# Harvey 23 Crimson
The parts in curly braces are replacement fields from Python's format-string syntax, while the TABs serve as delimiters. In sum, the .format() method looks for those curly braces inside the string to determine which values from the container l go where and how those values should be formatted. For example, in the case of the headers, the following is what's happening:
headings = ["Name", "Age", "Favourite Colour"]
print("{:<10}\t{:<3}\t{:<16}".format(*headings))
We use the asterisk (*) in front of the list to unpack the elements inside that list.
The first curly brace is for the string "Name", and it is formatted with :<10, which means that it is left-aligned and padded with extra space characters if the length of the string is less than 10. In essence, it will print all characters in a given string and add extra spaces to the right of that string.
The second curly brace is for "Age" and is formatted with :<3.
The third curly brace is for "Favourite Colour" and is formatted with :<16.
All those strings are delimited with the TAB character.
The combination of the above steps inside the print function yields:
# Name Age Favourite Colour
I hope this proves useful.
Use zip(*iterables):
print(headings)
for row in zip(names, age, favouriteColour):
    print(row) # formatting is up to you :)
Jacob Krall is perfectly correct about using zip to combine your lists. Once you've done that, though, if you want your columns to align nicely (assuming you are using Python 3.x), take a look at the .format() method of strings, whose result you can pass to print. This allows you to specify field widths in your output.
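Putting the suggestions from this thread together, one possible sketch computes a separate width for each column (the longest cell plus 2) and pads with ljust. The + 2 margin and the variable names rows, widths and lines are assumptions made for this example.

```python
headings = ["Name", "Age", "Favourite Colour"]
names = ["Barry", "Eustace", "Clarence", "Razputin", "Harvey"]
age = [39, 83, 90, 15, 23]
favouriteColour = ["Green", "Baby Pink", "Sky Blue", "Orange", "Crimson"]

# zip(names, age, ...) turns the per-column lists into per-person rows.
rows = [headings] + list(zip(names, age, favouriteColour))

# zip(*rows) transposes rows back into columns so each column gets its own width.
widths = [max(len(str(cell)) for cell in column) + 2 for column in zip(*rows)]

lines = [
    "".join(str(cell).ljust(width) for cell, width in zip(row, widths)).rstrip()
    for row in rows
]
print("\n".join(lines))
```

Using one width per column avoids the very wide columns that a single global col_width would produce when only one column holds long values.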

Learning Python the Hard Way: Example 5

The following gives a syntax error:
my eyes = 'Brown' my_hair = 'Brown'
print "Hes got %s and %s hair" % (my_eyes, my_hair)
The only way this seems to work is if I put Brown, Brown in the last parenthesis.
You're assigning incorrectly: either put each assignment on its own line or unpack a tuple of strings into two variables. In addition, Python variable names cannot contain spaces, so you'll want to use an underscore in my_eyes.
my_eyes, my_hair = 'Brown', 'Brown' # unpacking tuple here
Also, I suggest you use the format method, which is generally preferred in newer code. The % style still works but is the older approach.
print "He's got {0} and {1} hair".format(my_eyes, my_hair)
The problem turned out to be that the period at the end of the print statement was outside of the parenthesis. This now works: % (eyes, hair). The format version also works now.
Here are your variables:
name = "some name"
Age = 57
Height = 64
Weight = 135
Eyes = "brown"
Teeth = "white"
Hair = "brown"
To print a string with variables, use str.format.
print "Let's talk about {}".format(name)
print "She's {} inches tall".format(Height)
... So on
Make sure that your variables contain no spaces. They're case sensitive too :)
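For comparison, here is a sketch of the formatting styles side by side. It is written for Python 3, where print is a function, unlike the Python 2 print statements in this thread; the f-string line additionally assumes Python 3.6 or later.

```python
# Tuple unpacking assigns both variables in one statement.
my_eyes, my_hair = 'Brown', 'Brown'

percent_style = "He's got %s and %s hair" % (my_eyes, my_hair)
format_style = "He's got {0} and {1} hair".format(my_eyes, my_hair)
f_string_style = f"He's got {my_eyes} and {my_hair} hair"

# All three build the same string.
print(percent_style)
print(format_style)
print(f_string_style)
```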

Execute only if string contains a ','?

I'm trying to execute a bunch of code only if the string I'm searching contains a comma.
Here's an example set of rows that I would need to parse (name is a column header for this tab-delimited file, and the column (annoyingly) contains the name, degree, and area of practice):
name
Sam da Man J.D.,CEP
Green Eggs Jr. Ed.M.,CEP
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
My issue is that some of the rows contain a comma, which is followed by an acronym which represents an "area of practice" for the professional and some do not.
My code relies on the principle that each line contains a comma, and I will now have to modify the code in order to account for lines where there is no comma.
import re
def parse_ieca_gc(s):
    ########################## HANDLE NAME ELEMENT ###############################
    degrees = ['M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.']
    degrees_list = []
    # separate area of practice from name and degree and bind this to var 'area'
    split_area_nmdeg = s['name'].split(',')
    area = split_area_nmdeg.pop() # when there is no area of practice and hence no comma, this pops out the name + deg and leaves an empty list, that's why 'print split_area_nmdeg' returns nothing and 'area' returns the name and deg when there's no comma
    print 'split area nmdeg'
    print area
    print split_area_nmdeg
    # Split the name and deg by spaces. If there's a deg, it will match one of the elements and be stored in the degrees list. The deg is removed from the name_deg list and all that's left is the name.
    split_name_deg = re.split(r'\s', split_area_nmdeg[0])
    for word in split_name_deg:
        for deg in degrees:
            if deg == word:
                degrees_list.append(split_name_deg.pop())
    name = ' '.join(split_name_deg)
    # area of practice
    category = area
re.search() and re.match() both do not work, it appears, because they return instances and not a boolean, so what should I use to tell if there's a comma?
The easiest way in python to see if a string contains a character is to use in. For example:
if ',' in s['name']:
if re.match(...) is not None:
Use that instead of looking for a boolean: match returns a MatchObject instance on success and None on failure.
You are already searching for a comma. Just use the results of that search:
split_area_nmdeg = s['name'].split(',')
if len(split_area_nmdeg) > 1: # splitting produced more than one piece, so a comma was found
    print "Your old code goes here"
else:
    print "Your new code goes here"
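As one way to handle both kinds of rows in a single pass, here is a sketch in Python 3 using str.partition, which returns an empty separator when no comma is present. The function name parse_name_field and the choice to return None for a missing area are assumptions of this example, not part of the original code.

```python
DEGREES = {'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.',
           'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.'}

def parse_name_field(field):
    """Split 'Name Degree[,Area]' into (name, degrees, area)."""
    # partition returns (before, separator, after); separator is '' if no comma.
    name_and_degrees, comma, area = field.partition(',')
    area = area.strip() if comma else None

    words = name_and_degrees.split()
    degrees = [word for word in words if word in DEGREES]
    name = ' '.join(word for word in words if word not in DEGREES)
    return name, degrees, area

for row in ['Sam da Man J.D.,CEP', 'Green Eggs Jr. Ed.M.,CEP',
            'Argle Bargle Sr. MA', 'Cersei Lannister M.A. Ph.D.']:
    print(parse_name_field(row))
```

Building new lists instead of popping from the list being iterated avoids skipping or removing the wrong element when a row has more than one degree.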
