I am working with the CSV module, and I am writing a simple program which takes the names of several authors listed in the file, and formats them in this manner: john.doe
So far I've achieved the results I want, but I'm having trouble getting the code to exclude titles such as "Mr.", "Mrs.", etc. I've been thinking about using the split function, but I'm not sure if this would be a good use for it.
Any suggestions? Thanks in advance!
Here's my code so far:
import csv
books = csv.reader(open("books.csv","rU"))
for row in books:
    print '.'.join([row[index].lower() for index in (1, 0)])
It depends on how messy the strings are; in the worst case, this regexp-based solution should do the job:
import re
x=re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)
x.sub("", text)
(I'm using re.compile() here because in Python 2.6 re.sub doesn't accept the flags= kwarg; it was only added in 2.7.)
UPDATE: I wrote some code to test that and, although I wasn't able to figure out a way to automate results checking, it looks like it's working fine. This is the test code:
import re
x=re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)
names = ["".join([a, b, c, d])
         for a in ['', ' ', ' ', '..', 'X']
         for b in ['mr', 'Mr', 'miss', 'Miss', 'mrs', 'Mrs', 'ms', 'Ms']
         for c in ['', '.', '. ', ' ']
         for d in ['Aaaaa', 'Aaaa Bbbb', 'Aaa Bbb Ccc', ' aa ']]
print "\n".join([" => ".join((n,x.sub('',n))) for n in names])
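To automate the checking, each input can be paired with its expected stripped result; the cases and expectations below are my own, not from the original test:

```python
import re

# Same pattern as above; each case pairs an input with the expected
# result after the title is stripped.
title_re = re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)

cases = {
    "Mr. John": "John",
    "mrs Smith": "Smith",
    "Miss  Jane Doe": "Jane Doe",
    "Marion": "Marion",   # no title: must pass through untouched
}
for raw, expected in cases.items():
    assert title_re.sub("", raw) == expected, raw
print("all cases pass")
```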
Depending on the complexity of your data and the scope of your needs you may be able to get away with something as simple as stripping titles from the lines in the csv using replace() as you iterate over them.
Something along the lines of:
titles = ["Mr.", "Mrs.", "Ms", "Dr"]  # and so on
for line in lines:
    line_data = line
    for title in titles:
        line_data = line_data.replace(title, "")
    # your code for processing the line
This may not be the most efficient method, but depending on your needs may be a good fit.
How this could work with the code you posted (I am guessing the Mr./Mrs. is part of column 1, the first name):
import csv

titles = ["Mr.", "Mrs.", "Ms", "Dr"]  # and so on
books = csv.reader(open("books.csv", "rU"))
for row in books:
    first_name = row[1]
    last_name = row[0]
    for title in titles:
        first_name = first_name.replace(title, "")
    print '.'.join((first_name, last_name))
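For reference, here's a self-contained Python 3 sketch of the same replace()-based idea; the sample rows, the column order (last_name,first_name), and the titles list are all assumptions, so adjust them to match the real books.csv:

```python
import csv
import io

# Hypothetical sample data in the assumed last_name,first_name layout.
titles = ["Mr.", "Mrs.", "Ms.", "Miss", "Dr."]
sample = "Doe,Mr. John\nSmith,Mrs. Jane\n"

formatted = []
for row in csv.reader(io.StringIO(sample)):
    last_name, first_name = row[0], row[1]
    # Strip any known title from the first-name column.
    for title in titles:
        first_name = first_name.replace(title, "")
    formatted.append(".".join((first_name.strip().lower(),
                               last_name.strip().lower())))

print(formatted)  # ['john.doe', 'jane.smith']
```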
I have some code using astroquery.Simbad to query star names. Simbad works with names like "LP 944-20", but my data contains names like "LP-944-20". How can I make the code ignore that first dash (hyphen)?
My code:
from astroquery.simbad import Simbad
result_table = Simbad.query_object("LP-944-20", wildcard=True)
print(result_table)
One simple approach would be to just replace the first hyphen with a space:
inp = ["LP-944-20", "944-20", "20"]
output = [x.replace("-", " ", 1) for x in inp]
print(output) # ['LP 944-20', '944 20', '20']
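A slightly stricter variant (my suggestion, not from the question) is to only replace a hyphen that directly follows the leading letter prefix of the identifier, so purely numeric names are left alone:

```python
import re

# Only replace a hyphen that comes right after the leading letters,
# e.g. "LP-944-20" -> "LP 944-20"; "944-20" has no letter prefix and
# is left untouched.
def fix_name(name):
    return re.sub(r"^([A-Za-z]+)-", r"\1 ", name)

print([fix_name(x) for x in ["LP-944-20", "944-20", "20"]])
# ['LP 944-20', '944-20', '20']
```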
I get hourly email alerts that tell me how much revenue the company has made in the last hour. I want to extract this information into a pandas dataframe so that I can run some analysis on it.
My problem is that I can't figure out how to extract the data from the email body in a usable format. I think I need to use regular expressions but I'm not too familiar with them.
This is what I have so far:
import os
import pandas as pd
import datetime as dt
import win32com.client
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
#Empty Lists
email_subject = []
email_date = []
email_content = []
#find emails
for message in messages:
    if message.SenderEmailAddress == 'oracle#xyz.com' and message.Subject.startswith('Demand'):
        email_subject.append(message.Subject)
        email_date.append(message.senton.date())
        email_content.append(message.body)
The email_content list looks like this:
' \r\nDemand: $41,225 (-47%)\t \r\n \r\nOrders: 515 (-53%)\t \r\nUnits: 849 (-59%)\t \r\n \r\nAOV: $80 (12%) \r\nAUR: $49 (30%) \r\n \r\nOrders with Promo Code: 3% \r\nAverage Discount: 21% '
Can anyone tell me how I can split its contents so that I can get the int values of Demand, Orders and Units in separate columns?
Thanks!
You could use a combination of string.split() and string.strip() to first extract each line individually.
string = email_content
lines = string.split('\r\n')
lines_stripped = []
for line in lines:
    line = line.strip()
    if line != '':
        lines_stripped.append(line)
This gives you an array like this:
['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Units: 849 (-59%)', 'AOV: $80 (12%)', 'AUR: $49 (30%)', 'Orders with Promo Code: 3%', 'Average Discount: 21%']
You can also achieve the same result in a more compact (pythonic) way:
lines_stripped = [line.strip() for line in string.split('\r\n') if line.strip() != '']
Once you have this array, you use regexes as you correctly guessed to extract the values. I recommend https://regexr.com/ to experiment with your regex expressions.
After some quick experimenting, r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?' should work.
Here is the code that produces a dict from your lines_stripped we created above:
import re
regex = r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
matched_dict = {}
for line in lines_stripped:
    match = re.match(regex, line)
    matched_dict[match.groups()[0]] = (match.groups()[1], match.groups()[2])
print(matched_dict)
This produces the following output:
{'AOV': ('$80', '12%)'),
'AUR': ('$49', '30%)'),
'Average Discount': ('21%', ''),
'Demand': ('$41,225', '-47%)'),
'Orders': ('515', '-53%)'),
'Orders with Promo Code': ('3%', ''),
'Units': ('849', '-59%)')}
You asked for Units, Orders and Demand, so here is the extraction:
# Remove the dollar sign before converting to float
# Replace , with empty string
demand_string = matched_dict['Demand'][0].strip('$').replace(',', '')
print(int(demand_string))
print(int(matched_dict['Orders'][0]))
print(int(matched_dict['Units'][0]))
As you can see, Demand is a little more complicated because it contains extra characters that Python can't convert when parsing to int.
Here is the final output of those 3 prints:
41225
515
849
Hope I answered your question! If you have more questions about regex, I encourage you to experiment with regexr; it's very well built!
EDIT: Looks like there is a small issue in the regex causing the final ')' to be included in the last group. This does not affect your question though !
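Putting it all together, here's an end-to-end sketch that parses the body shown in the question straight into ints, with a narrower per-label regex so the stray ')' issue doesn't arise (the body string is copied from the question; the label list is an assumption):

```python
import re

# Raw body copied from the question.
body = (' \r\nDemand: $41,225 (-47%)\t \r\n \r\nOrders: 515 (-53%)\t '
        '\r\nUnits: 849 (-59%)\t \r\n \r\nAOV: $80 (12%) \r\nAUR: $49 (30%) '
        '\r\n \r\nOrders with Promo Code: 3% \r\nAverage Discount: 21% ')

# For each label, grab the digits (with optional $ and thousands commas)
# right after "Label:" and convert to int.
values = {}
for label in ('Demand', 'Orders', 'Units'):
    match = re.search(label + r':\s*\$?([\d,]+)', body)
    values[label] = int(match.group(1).replace(',', ''))

print(values)  # {'Demand': 41225, 'Orders': 515, 'Units': 849}
```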
I have txt files that look like this:
word, 23
Words, 2
test, 1
tests, 4
And I want them to look like this:
word, 23
word, 2
test, 1
test, 4
I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:
import nltk
f = raw_input("Please enter a filename: ")

def openfile(f):
    with open(f, 'r') as a:
        a = a.read()
        a = a.lower()
        return a

def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a

def returnfile(f, a):
    with open(f, 'w') as d:
        d = d.write(a)
    #d.close()

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))
I have also tried these 2 definitions instead of the stem definition:
def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line

def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
    return word
Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:
word, 25
test, 5
I'm not sure how to do that. A solution would be nice but not necessary.
If you have complex words to singularize, I don't advise you to use stemming but rather a proper Python package like pattern:
from pattern.text.en import singularize
plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
           'families', 'dogs', 'child', 'wolves']
singles = [singularize(plural) for plural in plurals]
print(singles)
returns:
>>> ['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'family', 'dog', 'child', 'wolf']
It's not perfect, but it's the best I found. It's 96% accurate, according to the docs: http://www.clips.ua.ac.be/pages/pattern-en#pluralization
It seems like you're pretty familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question of depluralizing words. When you read in a multiline file (the word, number csv in your case) with a.read(), you're going to be reading the entire body of the file into one big string.
def openfile(f):
    with open(f,'r') as a:
        a = a.read() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
        return a
This is fine and all, but when you want to pass the result into stem(), it will be as one big string, and not as a list of words. This means that when you iterate through the input with for word in a, you will be iterating through each individual character of the input string and applying the stemmer to those individual characters.
def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a] # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a
This definitely doesn't work for your purposes, and there are a few different things we can do.
1. We can change it so that we read the input file as one list of lines.
2. We can use the big string and break it down into a list ourselves.
3. We can go through and stem each line in the list of lines one at a time.
Just for expedience's sake, let's roll with #1. This will require changing openfile(f) to the following:
def openfile(f):
    with open(f,'r') as a:
        a = a.readlines() # a will be a list like ['soc, 32\n', 'socs, 1\n', ...]
        b = [x.lower() for x in a]
        return b
This should give us b as a list of lines, i.e. ['soc, 32', 'soc, 1', ...]. So the next problem becomes what do we do with the list of strings when we pass it to stem(). One way is the following:
def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',') # break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1] # put it back together
        b.append(new_line) # add it to the new list of lines
    return b
This is definitely a pretty rough solution, but should adequately iterate through all of the lines in your input, and depluralize them. It's rough because splitting strings and reassembling them isn't particularly fast when you scale it up. However, if you're satisfied with that, then all that's left is to iterate through the list of new lines, and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.
def returnfile(f, a):
    with open(f,'w') as d:
        for line in a:
            d.write(line)

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))
When I have the following input.txt
soc, 32
socs, 1
dogs, 8
I get the following stdout:
Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None
And input.txt looks like this:
soc, 32
soc, 1
dog, 8
The second question regarding merging numbers with the same words changes our solution from above. As per the suggestion in the comments, you should take a look at using dictionaries to solve this. Instead of doing this all as one big list, the better (and probably more pythonic) way to do this is to iterate through each line of your input, and stemming them as you process them. I'll write up code about this in a bit, if you're still working to figure it out.
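As a starting point for that, here's a rough sketch of the line-at-a-time merge using a dictionary; a plain trailing-'s' stripper stands in for the NLTK stemmer so the example runs without nltk installed:

```python
# Stand-in for nltk.PorterStemmer().stem(); swap the real stemmer in.
def naive_singular(word):
    return word[:-1] if word.endswith('s') else word

# Lines as readlines() would give them from the question's file format.
lines = ['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']

counts = {}
for line in lines:
    word, number = line.split(',')
    word = naive_singular(word.strip().lower())
    # Merge counts for words that depluralize to the same form.
    counts[word] = counts.get(word, 0) + int(number)

for word, total in counts.items():
    print('%s, %d' % (word, total))
```

With the sample lines this prints `soc, 33` and `dog, 8`, which is the merged output the question asks for.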
The Nodebox English Linguistics library contains scripts for converting plural form to single form and vice versa. Checkout tutorial: https://www.nodebox.net/code/index.php/Linguistics#pluralization
To convert plural to singular, just import the singular module and use the singular() function. It handles proper conversions for words with different endings, irregular forms, etc.
from en import singular
print(singular('analyses'))
print(singular('planetoids'))
print(singular('children'))
>>> analysis
>>> planetoid
>>> child
I'm pretty new to python, but I think I catch on fast.
Anyways, I'm making a program (not for class, but to help me) and have come across a problem.
I'm trying to document a list of things, and by things I mean close to a thousand of them, with some repeating. So my problem is this:
I would not like to add redundant names to the list, instead I would just like to add a 2x or 3x before (or after, whichever is simpler) it, and then write that to a txt document.
I'm fine with reading and writing from text documents, but my only problem is the conditional statement; I don't know how to write it, nor can I find an example online.
for lines in list_of_things:
    if(lines == "XXXX x (name of object here)"):
And then whatever under the if statement. My only problem is that the "XXXX" can be replaced with any string number, but I don't know how to include a variable within a string, if that makes any sense. Even if it is turned into an int, I still don't know how to use a variable within a conditional.
The only thing I can think of is making multiple if statements, which would be really long.
Any suggestions? I apologize for the wall of text.
I'd suggest looping over the lines in the input file and inserting a key in a dictionary for each one you find, then incrementing the value at the key by one for each instance of the value you find thereafter, then generating your output file from that dictionary.
catalog = {}
for line in input_file:
    if line in catalog:
        catalog[line] += 1
    else:
        catalog[line] = 1
Alternatively:
from collections import defaultdict

catalog = defaultdict(int)
for line in input_file:
    catalog[line] += 1
Then just run through that dict and print it out to a file.
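That final step might look something like this (the catalog contents and the "3x name" output format are assumptions based on the question):

```python
# Hypothetical catalog built by the counting loop above; keys still
# carry the newlines they had in the input file.
catalog = {'sword\n': 3, 'shield\n': 1}

out_lines = []
for item, count in catalog.items():
    name = item.strip()
    # Only prefix a count when the item occurred more than once.
    out_lines.append('%dx %s' % (count, name) if count > 1 else name)

print(out_lines)  # ['3x sword', 'shield']
```

To write it out, `open('output.txt', 'w').write('\n'.join(out_lines))` or similar would do.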
You may be looking for regular expressions and something like this:
import re

for line in text:
    match = re.match(r'(\d+) x (.*)', line)
    if match:
        count = int(match.group(1))
        object_name = match.group(2)
        ...
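For example, one concrete thing to do at that point is bump the count for an already-listed object or append a new entry; the list format and the helper name here are made up for illustration:

```python
import re

# Increment the count on an existing "N x name" line, or append a
# new "1 x name" entry if the object isn't listed yet.
def add_item(lines, name):
    for i, line in enumerate(lines):
        match = re.match(r'(\d+) x (.*)', line)
        if match and match.group(2) == name:
            lines[i] = '%d x %s' % (int(match.group(1)) + 1, name)
            return
    lines.append('1 x %s' % name)

inventory = ['2 x rope', '1 x torch']
add_item(inventory, 'rope')
add_item(inventory, 'map')
print(inventory)  # ['3 x rope', '1 x torch', '1 x map']
```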
Something like this?
list_of_things = ['XXXX 1', 'YYYY 1', 'ZZZZ 1', 'AAAA 1', 'ZZZZ 2']
for line in list_of_things:
    for e in ['ZZZZ', 'YYYY']:
        if e in line:
            print line
Output:
YYYY 1
ZZZZ 1
ZZZZ 2
You can also use if line.startswith(e): or a regex (if I am understanding your question...)
To include a variable in a string, use format():
>>> i = 123
>>> s = "This is an example {0}".format(i)
>>> s
'This is an example 123'
In this case, the {0} indicates that you're going to put a variable there. If you have more variables, use "This is an example {0} and more {1}".format(i, j)" (so a number for each variable, starting from 0).
This should do it:
from itertools import groupby

a = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5]
print ["%dx %s" % (len(list(group)), key) for key, group in groupby(a)]
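One caveat worth knowing: groupby only merges *adjacent* duplicates, so if the input isn't already grouped (unlike the sorted list above), sort it first:

```python
from itertools import groupby

# Unsorted input: groupby alone would count the two 2s separately,
# so sort before grouping.
a = [2, 1, 1, 3, 2, 1]
result = ["%dx %s" % (len(list(group)), key) for key, group in groupby(sorted(a))]
print(result)  # ['3x 1', '2x 2', '1x 3']
```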
There are two options to approach this. 1) Something like the following, using a dictionary to capture the count of items and then a list to format each item with its count:
list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = {}
countedList = []
for lines in list_of_things:
    if lines in listItemCount:
        listItemCount[lines] += 1
    else:
        listItemCount[lines] = 1
for id in listItemCount:
    if listItemCount[id] > 1:
        countedList.append(id + ' - x' + str(listItemCount[id]))
    else:
        countedList.append(id)
for item in countedList:
    print(item)
the output of the above would be
sun - x2
grass
green - x2
grey
moon
Or 2) using collections to make things simpler, as shown below:
import collections

list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = collections.Counter(list_of_things)
listItemCountDict = dict(listItemCount)
countedList = []
for id in listItemCountDict:
    if listItemCountDict[id] > 1:
        countedList.append(id + ' - x' + str(listItemCountDict[id]))
    else:
        countedList.append(id)
for item in countedList:
    print(item)
the output of the above would be
sun - x2
grass
green - x2
grey
moon
Could I please have some pointers to websites where I can read and get the skills to write python code to do the following?
So far I can only find python code that reads structured data into lists and dictionaries. I need to see an example with line processing to merge multiple rows of data to a single row.
Problem
I have datasets in a file, each dataset is enclosed in {}, with one item per row.
I need to transpose all the items of a dataset to a single row, i.e. transpose to tabular form. Below is an example.
Input file:
details_book1{
title,txt, book_book1
author,txt,author_book1
price,txt, price_book1 }
details_book2
{
title,txt, book_book2
author,txt,author_book2
price,txt, price_book2
}
Output Required:
details_book1,book_book1,author_book1,price_book1
details_book2,book_book2,author_book2,price_book2
...
details_bookn,book_bookn,author_bookn,price_bookn
I'm sorry, I don't know of particular references other than the Python docs on string and list manipulations (which aren't too bad), but it could perhaps be as simple as something like this:
lines = [line for line in a.split('\n') if line]

books = []
book = ''
for line in lines:
    if '}' in line:
        book += ',' + line
        book = book.replace('{', ' ').replace('}', ' ')
        books.append([x.strip() for x in book.split(',') if x.strip()])
        book = ''
    else:
        book += line + ','
This would create a list of lists of the entities, and you could loop through the list, pulling out all the elements into variables:
for book, title, a, bookbook, author, b, authorbook, price, c, pricebook in books:
    print '%s,%s,%s,%s' % (book, bookbook, authorbook, pricebook)
# result
details_book1,book_book1,author_book1,price_book1
details_book2,book_book2,author_book2,price_book2
This can fail in a few ways, though, and requires that your data match what you've shown so far. In particular, if any of the text contains commas, the split around commas in the second list comprehension will produce too many fields, and the unpacking later in the for loop (last example code snippet) will fail.
Also, if a block starts on the same line as the previous block's }, it will fail to cut up the data correctly. There are ways around this, but I wanted to keep things very simple.
Maybe this can help as a starting point.
I suppose you could do this as well:
import re
for book in re.findall(r'\w.*?{.*?}', a, flags=re.M | re.S):
    book = book.replace('\n', ',').replace('{', ',').replace('}', ',')
    book = [x.strip() for x in book.split(',') if x.strip()]
    print book
This uses a regular expression via the re.findall to find all words followed by any amount of whitespace, and anything at all (non-greedy) between curly braces. This results in a bit of a mess of newlines and missing commas, so then I replace newlines and braces with commas, then use a list comprehension to split around commas, strip whitespace around each split element, and leave out any empty strings that result.
This results in these lists each time in book:
['details_book1', 'title', 'txt', 'book_book1', 'author', 'txt', 'author_book1', 'price', 'txt', 'price_book1']
['details_book2', 'title', 'txt', 'book_book2', 'author', 'txt', 'author_book2', 'price', 'txt', 'price_book2']
Again, splitting around commas is a problem if anything like book titles or txt blurbs have commas in them (but if they do, I don't know how you're able to tell those blurbs apart from the comma-separated bits on each line).
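Assuming the fields do stay comma-free, the parsed lists above can be turned into the requested one-row-per-book output by joining positions 0, 3, 6 and 9 (those positions are based on the sample data, not guaranteed in general):

```python
# Parsed lists as produced by the regex/split approach above.
books = [
    ['details_book1', 'title', 'txt', 'book_book1',
     'author', 'txt', 'author_book1', 'price', 'txt', 'price_book1'],
    ['details_book2', 'title', 'txt', 'book_book2',
     'author', 'txt', 'author_book2', 'price', 'txt', 'price_book2'],
]

# Keep only the dataset name and the three value fields.
rows = [','.join([b[0], b[3], b[6], b[9]]) for b in books]
print('\n'.join(rows))
```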