How to weight the beginnings of strings in Levenshtein distance algorithm - python

I am trying to use the Levenshtein distance algorithm (in Python, if it makes any difference) to do a fuzzy string comparison between two lists of company names. An example would be where list A contains XYZ INDUSTRIAL SUPPLY while list B says XYZ INDUSTRIAL SUPPLY, INC.; they should still match.
Right now, my implementation is terribly inaccurate. As a second example, the algorithm currently finds abc metal finishing and xyz's mach & metal finishing very similar because of their endings, but they are totally different companies. I want to improve this accuracy, and one way I think I can do that is by weighting the beginnings of strings somehow. If the company names are supposed to match, chances are they start out similarly. Looking at my first example, the whole beginning matches; variations occur only at the very end. Is there a way to accomplish this? I haven't been able to work it out.
Thanks!
Edit for more examples:
Should match:
s west tool supply, southwest tool supply
abc indust inc, abc industries
icg usa, icg usa llc
Shouldn't match (but do currently):
ohio state university, iowa state university
m e gill corporation, s g corporation
UPDATE:
Progress has been made :) In case anyone is ever interested in this sort of thing, I ended up experimenting with the costs of inserts, deletes, and substitutions. My idea was to weight the beginnings of the strings more heavily, so I based the weights on the current location within the matrix. The issue this created, though, was that everything was matching to a couple of very short names because of how my weights were being distributed. I fixed this by including the string lengths in my calculation. My insertion weight, for example, ended up being (1 if index<=2/3*len(target) else .1*(len(source)-index)), where source is always the longer of the two strings. I'm planning to continue tweaking this and experimenting with other values, but it has already shown a great improvement. This is by no means anywhere close to an exact science, but if it's working, that's what counts!
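For anyone who wants to experiment with the same idea, here is a minimal sketch of a position-weighted Levenshtein distance in Python. The weight() helper follows the insertion formula above, but the helper names and constants are only illustrative starting points, not a definitive implementation:

def weighted_levenshtein(source, target):
    # Keep the convention above: source is always the longer string.
    if len(source) < len(target):
        source, target = target, source

    def weight(index):
        # Illustrative position-dependent cost: full cost early in the
        # string, decaying cost near the end (per the formula above).
        if index <= (2 / 3) * len(target):
            return 1.0
        return 0.1 * (len(source) - index)

    rows, cols = len(source) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + weight(i)          # deletions from source
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + weight(j)          # insertions into source
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0.0 if source[i - 1] == target[j - 1] else weight(min(i, j))
            d[i][j] = min(d[i - 1][j] + weight(i),     # delete
                          d[i][j - 1] + weight(j),     # insert
                          d[i - 1][j - 1] + sub)       # substitute
    return d[-1][-1]

# Trailing ", inc." is now cheap, so these two score as close:
print(weighted_levenshtein("xyz industrial supply, inc.", "xyz industrial supply"))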

Related

Trying to work out how to produce a synthetic data set using python or javascript in a repeatable way

I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I’m assuming there’s a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug in to a PRNG to produce a series of data points that not are not only distributed like the original's, but also within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do the analysis and generate the data points? As I said, my maths is OK, but my statistics isn't great (and the docs for the tools I've seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
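As a point of reference, here is a minimal sketch of what approach b) could look like with scipy, assuming values holds one numeric column from the real data; the file name and the candidate distributions are illustrative assumptions:

import numpy as np
from scipy import stats

values = np.loadtxt("inter_arrival_seconds.txt")  # hypothetical extract of the real data

# Fit a few off-the-shelf candidates and keep the best by the K-S statistic.
candidates = ["expon", "gamma", "lognorm"]
best_name, best_params, best_stat = None, None, np.inf
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(values)                          # maximum-likelihood fit
    stat, _ = stats.kstest(values, name, args=params)  # goodness-of-fit
    if stat < best_stat:
        best_name, best_params, best_stat = name, params, stat

# Sample synthetic points: same shape of distribution, same sort of ranges.
synthetic = getattr(stats, best_name).rvs(*best_params, size=10_000)
print(best_name, best_params)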
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I'm doing for timestamp and then 'make up' a product for each price that's generated, but I discarded that for a couple of reasons: it might be consistent 'within' a produced dataset, but not 'across' datasets, and I imagine on largish sets it would double-count quite a bit.
So my next thought was to create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but I'm struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out which product to select. I thought maybe I could generate some sort of bucketed histogram to work out what the frequency of purchases was within a range of costs (say $0-1, $1-2, etc.). I could then use that frequency to define the probability that a given transaction's cost falls within one of those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
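For what it's worth, here is a minimal sketch of that bucketed lookup; the products table, the prices, and the $1 bucket width are illustrative placeholders:

import random
from collections import defaultdict

products = [("P001", 0.55), ("P002", 1.85), ("P003", 2.55), ("P004", 7.65)]  # persistent lookup table
prices = [2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85]                    # UnitPrice column from the real data

def bucket(cost, width=1.0):
    return int(cost // width)  # $0-1 -> 0, $1-2 -> 1, ...

# How often real purchases fall into each price bucket.
freq = defaultdict(int)
for p in prices:
    freq[bucket(p)] += 1

# Group the synthetic catalogue by the same buckets.
by_bucket = defaultdict(list)
for pid, cost in products:
    by_bucket[bucket(cost)].append((pid, cost))

# A bucket is only selectable if at least one product falls in it.
buckets = [b for b in freq if by_bucket[b]]
weights = [freq[b] for b in buckets]

def random_product():
    b = random.choices(buckets, weights=weights, k=1)[0]  # bucket by real-world frequency
    return random.choice(by_bucket[b])                    # then a product within it

print(random_product())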
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom

Data Cleanse: Grouping within variable company names

So I'm working on some research on nursing homes, which are often owned by a chain. We have a list of 9,000+ nursing homes with their corporate ownership. Now, if I were MERGING this data into anything, I don't think it would be too much of a challenge, but I am being asked to group the facilities that are associated with each other for another analysis.
For example:
ABCM
ABCM CORP
ABCM CORPORATION
ABCM CORPORATE
I have already removed all the extra spaces and non-alphanumeric characters, and upcased everything. Just trying to think of a way I can do this with something like 90% accuracy. Matching within the same variable is the part that is throwing me off. I do have some other details such as ownership, state, zip, etc. I use STATA, SAS, and Python, if that helps!
Welcome to SO.
String matching is - broadly speaking - a pain, whatever software you are using, and in most cases needs human intervention to yield satisfactory results.
In Stata you may want to try matchit (ssc install matchit) for fuzzy string merging. I won't go into the details (I suggest you look at the help file, which is pretty well-outlined), but the command returns each string matched with multiple similar entries, where "similar" depends on the chosen method, and you can specify a threshold for the level of similarity kept or discarded.
Even with all the above options, though, the final step is up to you: my personal experience tells me that no matter how restrictive you are, you'll always end up with several "false positives" that you'll have to work yourself!
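Since you also have Python available: a minimal sketch of the same idea using only the standard library. The suffix list and the 0.90 threshold are illustrative and will need tuning, and per the caveat above you should still expect to review the output by hand:

import re
from difflib import SequenceMatcher

SUFFIXES = re.compile(r"\b(CORPORATION|CORPORATE|CORP|INC|LLC|CO)\b")  # illustrative list

def normalize(name):
    return SUFFIXES.sub("", name).strip()

def similar(a, b, threshold=0.90):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

names = ["ABCM", "ABCM CORP", "ABCM CORPORATION", "ABCM CORPORATE"]

groups = []  # each group approximates one ownership chain
for name in names:
    for group in groups:
        if similar(name, group[0]):
            group.append(name)
            break
    else:
        groups.append([name])

print(groups)  # [['ABCM', 'ABCM CORP', 'ABCM CORPORATION', 'ABCM CORPORATE']]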
Good luck!

Writing tuple output to a text file

Is it possible to write my output tuple to a text file? I am using the following code to get the text between two strings and write it to a text file:
def format_file(file, start, end):
    f = open('C:\TEMP\Test.txt', 'r').read()
    return tuple(x for x in ''.join(f.split(start)).replace('\n', '').split(end) if x != '')

print(format_file('XYZ', 'Q2 2016 Apple Inc Earnings Call - Final', 'Event Brief of Q1 2016 Apple Inc Earnings Call - Final'))
file = open('C:\TEMP\out.txt', 'w')
file.write(format_file('XYZ', 'Q2 2016 Apple Inc Earnings Call - Final', 'Event Brief of Q1 2016 Apple Inc Earnings Call - Final'))
But I keep getting the following error: TypeError: write() argument must be str, not tuple.
When I try to return output as a string instead of a tuple I get a blank file. I would really appreciate any help on this one.
Here is my input file text:
Q2 2016 Apple Inc Earnings Call - Final
OPERATOR: From Piper Jaffray, we'll hear from Gene Munster.
GENE MUNSTER, ANALYST, PIPER JAFFRAY & CO.: Good afternoon. Tim, can you talk a little bit about the iPhone ASP trends, and specifically you mentioned that the SE is going to impact, but how are you thinking about the aspirational market share that's out there, and your actual market share, and using price to close that gap? Is it just the SE or could there be other iPhone models that will be discounted, to try to be more aggressive in emerging markets?
And one for Luca. Can you talk a little bit about the services segment, in terms of what piece of the services is driving growth, and maybe a little bit about the profitability on a net basis versus the growth basis that you have referred to in the past. Thanks.
TIM COOK: I think the SE is attracting two types of customers. One is customers that wanted the latest technology, but wanted it in a more compact package. And we clearly see even more people than we thought in that category.
Secondly, it's attracting people aspire to own an iPhone, but couldn't quite stretch to the entry price of the iPhone, and we've established a new entry. I think both of these markets are very, very important to us, and we are really excited about where it can take us. I do think that we will be really happy with the new to iPhone customers that we see from here, because of the early returns we've had. We are currently supply constrained, but we'll be able to work our way out of this at some point. But it's great to see the overwhelming demand for it. I will let Luca comment on the ASPs.
LUCA MAESTRI: On the ASPs, Gene we mentioned that we were going to be down sequentially, and this is really the combination of two factors. So when we go from the March quarter to the June quarter, is the fact that we are having the SE entering the mix, and that obviously is going to have a downward pressure on ASP, and also this channel inventory reduction that we have talked about, obviously the channel inventory reduction will come from higher-end models, and that is also affecting the sequential trend on ASPs.
The question on services, when we look at our services business, obviously growing very well across the board. The biggest element, and the part of the services business that is growing very well, we mentioned 35%, is the App Store. It's interesting for us that our music business, which had been declining for a number of quarters, now that we have both a download model and a streaming model, we have now hit an inflection point, and we believe that this would be the bottom, and we can start growing from there over time.
We have many other services businesses that are doing very well, we have an iCloud business that is growing very quickly. Faster than the App Store, from a much lower base but I think it's important for us as we continue to develop these businesses. Tim have talked about Apple Pay. It doesn't provide a meaningful financial contribution at this point, but as we look at the amount of transactions that go into Apple Pay right now, and we think ahead for the long-term, that could be an interesting business for us, as well.
From a profitability standpoint, we have mentioned last time that when you look at it on a gross basis, so in terms of purchase value of these services, the profitability of the business is similar to Company average. Of course, when you met out the amount that is paid to developers, and you look at it, in terms of what is reported in our P&L, obviously that business has a profitability that is higher than Company average. We don't get into the specifics of specific products or services, but it is very clear it is significantly higher than Company average.
GENE MUNSTER: Thank you.
NANCY PAXTON: Thanks, Gene. Could we have the next question please?
OPERATOR: Katy Huberty with Morgan Stanley.
KATY HUBERTY, ANALYST, MORGAN STANLEY: Yes, thank you. First for Luca. This is the worst gross margin guide in a year and a half or so, and over the last couple of quarters, you have talked about number of tailwinds including component cost, the lower accounting deferrals that went into effect in September. You just mentioned the services margins are above corporate average. So the question is, are some of those tailwinds winding down? Or is a significant guide down in gross margin for the June quarter entirely related to volume and the 5 SE? And then I have a follow-up for Tim.
LUCA MAESTRI: Katy, clearly the commodity environment remains quite favorable, and we continue to expect cost improvements. The other dynamics that you have mentioned are still there, obviously what is different, and particularly as we look at it on a sequential basis coming out of the March quarter, we would have loss of leverage, and that obviously is going to have a negative impact on margins. The other factor that's important to keep in mind is this different mix of products.
Particularly when you look at iPhone, what I was mentioning to Gene earlier, I think we've got a couple of things that are affecting not only ASPs, but obviously, they also affects margins. And it's the fact that we have a channel inventory reduction at the top end of the range, and we've got the introduction of the iPhone SE at the entry level of the range. And so when you take into account those factors, those are the big elements that drive our guidance range right now.
KATY HUBERTY: Okay. Thank you. And that a question for Tim, appreciate the optimism around longer-term iPhone unit growth, but with developed market penetration in anywhere from 60% to 80%, the growth is going to have to come from new markets. You talked about India. Could you just spend a little bit more time on that market? What are some of the hurdles you have to overcome, for that to be a larger part of the business? When we expect Apple to have more distribution, and specifically your own stores in that country? Thanks.
TIM COOK: Katy, in the short term, let me just make a couple of comments on the developed markets, just to make sure this is clear. If you look at our installed base of iPhone today versus two years ago, it's increased by 80%. When you think about upgrade cycles, upgrade cycles would have varying rates on it. As I talked about on the comments, the iPhone 6s rate, upgrade rate is slightly higher than the iPhone 5s, but lower than the iPhone 6.
But the other multiplier in that equation is obviously the size of the installed base. The net of the idea is that I think there's still really, really good business in the developed markets, so I wouldn't want to write those off. It's our job to come up with great products that people desire, and also to continue to attract over Android switchers. With our worldwide share there's still quite a bit of room in the developed markets, as well.
From an India point of view, if you look at India, and each country has a different story a bit, but the things that have held not only us back, perhaps, but some others as well, is that the LTE rollout with India just really begins this year. So we will begin to see some really good networks coming on in India. That will unleash the power and capability of the iPhone, in a way that an older network, 2.5G or even some 3G networks, would not do. The infrastructure is one key one, and the second one is building the channel out.
Unlike the US as an example, where the carriers in the US sell the vast majority of phones that are sold in the United States, in India, the carriers in general sell virtually no phones, and it is out in retail, and retail is many, many different small shops. We've been in the process. It's not something we just started in the last few weeks.
We've been working in India now for a couple of years or more, but we've been working with great energy over the last 18 months or so, and I am encouraged by the results that we're beginning to see there, and believe there's a lot, lot more there. It is already the third largest smart phone market in the world, but because the smart phones that are working there are low-end, primarily because of the network and the economics, the market potential has not been as great there. I view India as where China was maybe 7 to 10 years ago from that point of view. I think there's a really great opportunity there.
NANCY PAXTON: Thank you, Katy. Could we have the next question please?
OPERATOR: We will go to Toni Sacconaghi with Bernstein.
TONI SACCONAGHI, ANALYST, BERNSTEIN: I have one, and then a follow-up, as well. My sense is that you talked about adjusting for the changes in channel inventory, that you are guiding for relatively normal sequential growth. And I think if you do the math it's probably the same or perhaps a touch worse in terms of iPhone unit growth sequentially, relative to normal seasonality between fiscal Q2 and Q3. I guess the question is, given that you should be entering new markets and you should see pronounced elasticity from the SE device, why wouldn't we be seeing something that was dramatically above normal seasonal, in terms of iPhone revenues and units for this quarter?
Maybe you could push back on me, but I can't help thinking that when Apple introduced the iPad Mini in a similar move, to move down market, there was great growth for one quarter, and the iPad never grew again and margins and ASPs went down. It looks like you are introducing the SE, and at least on a sequential basis, you not calling for any uplift, even adjusting for channel inventory, and ASPs I presume will go down and certainly it's impacting gross margins as you've guided to. Could you respond to, A, why you're not seeing the elasticity, and B, is the analogy with the iPad mini completely misplaced?
TIM COOK: Toni, it's Tim. Let me see if I can address your question. The channel inventory reduction that Luca referred to, the vast, vast majority of that is in iPhone. That would affect the unit compare that you maybe thinking about. The iPhone SE, we are thrilled with the response that we've seen on it.
It is clear that there is a demand there, even much beyond what we thought, and so that is really why we have the constraint that we have. Do I think it will be like the iPad Mini? No, I don't think so. I don't see that.
I think the tablet market in general, one of the challenges with the tablet market is that the replacement cycle is materially different than in the smart phone market. As you probably know, we haven't had an issue in customer satisfaction on the iPad. It is incredibly high, and we haven't had an issue with usage of the iPad. The usage is incredibly high.
But the consumer behavior there is you tend to hold on for very long period of time, before an upgrade. We continue to be very optimistic on the iPad business, and as I have said in my remarks, we believe we are going to have the best compare for iPad revenue this quarter that we have quite some time. We will report back in July on that one, but I think iPhone has a particularly different kind of cycle to it than the tablet market.
TONI SACCONAGHI: Okay, and if I could follow-up, Tim. You alluded to replacement cycles and differences between the iPad and the iPhone. My sense was, when you were going through the iPhone 6 cycle, was that you had commented that the upgrade cycle was not materially different. I think your characterization was that it accelerated a bit in the US, but international had grown to be a bigger part of your business, and replacement cycles there were typically a little bit longer. I'm wondering if it was only a modest difference between the 5s and the 6, how big a difference are we really seeing in terms of replacement cycles across the last three generations, and maybe you could help us, if the replacement cycle was flat this year relative to what you saw last year, how different would your results have been this quarter in the first half?
TIM COOK: There's a lot there. Let me just say I don't recall saying the things that you said I said about the upgrade cycle, so let me get that out of the way. Now let me describe without the specific numbers, the iPhone 6s upgrade cycle that we have measured for the first half of this year, so the first six months of our fiscal year to be precise, is slightly better than the rate that we saw with the iPhone 5s two years ago, but it's lower than the iPhone 6. I don't mean just a hair lower, it's a lot lower.
Without giving you exact numbers, if we would have the same rate on 6s that we did 6, there would -- it will be time for a huge party. It would be a huge difference. The great news from my point of view is, I think we are strategically positioned very well, because we have announced the SE, we are attracting customers that we previously didn't attract. That's really great, and this tough compare eventually isn't the benchmark. The install base is up 80% over the last two years, and so all of those I think bode well, and the switcher comments I made earlier, I wouldn't underestimate that, because that's very important for us in every geography. Thanks for the question.
NANCY PAXTON: Thanks, Toni. Can we have the next question please?
OPERATOR: From Cross Research Group, we'll hear from Shannon Cross.
SHANNON CROSS, ANALYST, CROSS RESEARCH: I have a couple of questions. One, Tim, can you talk a bit about what's going on in China? The greater China revenue I think was down 26%. You did talk about mainland China, but if you could talk about some of the trends you're seeing there, and how you think it's playing out, and maybe your thoughts on SE adoption within China as well.
TIM COOK: Shannon, thanks for the question. If you take greater China, we include Taiwan, Hong Kong, and mainland China in the greater China segment that you see reported on your data sheet. The vast majority of the weakness in the greater China region sits in Hong Kong, and our perspective on that is, it's a combination of the Hong Kong dollar being pegged to the US dollar, and therefore it carries the burden of the strength of the US dollar, and that has driven tourism, international shopping and trading down significantly compared to what it was in the year ago.
If you look at mainland China, which is one that I am personally very focused on, we are down 11% in mainland China, on a reported basis. On a constant currency basis, we are only down 7%, and the way that we really look at the health or underlying demand is look at sell-through, and if you look at there, we were down 5%. Keep in mind that is down 5% on comp a year ago that was up 81%.
As I back up from this and look at the larger picture, I think China is not weak, as has been talked about. I see China as -- may not have the wind at our backs that we once did, but it's a lot more stable than what I think is the common view of it. We remain really optimistic on China. We opened seven stores there during the quarter.
We are now at 35. We will open 5 more this quarter to achieve 40, which we had talked about before. And the LTE adoption continues to rise there, but it's got a long way ahead of it. And so we continue to be really optimistic about it, and just would ask folks to look underneath the numbers at the details in them before concluding anything. Thanks for the question.
SHANNON CROSS: Thanks. My second question is with regard to OpEx leverage, or thinking about when I look at the revenue, your revenue is below our expectations but OpEx is pretty much in line. So how are you thinking about potential for leverage, cost containment, maybe when macro is bad and revenue is under pressure, and how are you juggling that versus the required investment you need to go forward?
LUCA MAESTRI: It is Luca. Of course, we think about it. We think about it a lot, and so when you look at our results, for example, our OpEx for the quarter, for the March quarter was up 10%, which is the lowest rate that you have seen in years. And when you look within OpEx, you actually see two different dynamics. You see continued significant investments in research and development, because we really believe that's the future of the Company.
We continue to invest in initiatives and projects ahead of revenue. We had a much broader portfolio that we used to have. We do much more in-house technology development than we used to do a few years ago, which we think is a great investment for us to make. And so that parts we didn't need to protect, and we want to continue to invest in the business, right?
And then when you look at our SG&A portion of OpEx for the March quarter, it was actually down slightly. So obviously we think about it, and of course we look at our revenue trends, and we take measures accordingly. When you look at the guidance that we provided for the June quarter, that 10% year-over-year increase that I mentioned to you for the March quarter goes down to a range of 7% to 9% up, and again, the focus is on making investments in Road and continuing to run SG&A extremely tightly, and in a very disciplined way.
As you know, our E2R, expense to revenue ratio, is around 10%. It's something that we are very proud of, it's a number that is incredibly competitive in our industry, and we want to continue to keep it that way. At the same time, we don't want to under-invest in the business.
SHANNON CROSS: Thank you.
NANCY PAXTON: Thank you, Shannon. Could we have the next question please?
OPERATOR: From UBS we hear from Steve Milunovich.
STEVE MILUNOVICH, ANALYST, UBS: Tim, I first wanted to ask you about services and how do you view services? You've obviously highlighted it the last two quarters. Do you view it going forward as a primary driver of earnings, or do you view it, and you mentioned platforms in terms of your operating systems, which I would agree with. In that scenario I would argue it's more a supporter of the ecosystem, and a supporter of the hardware margins over time, and therefore somewhat subservient to hardware. It's great that it's growing, but longer-term, I would view its role as more creating an ecosystem that supports the high margins on the hardware, as opposed to independently driving earnings. How do you think about it?
TIM COOK: The most important thing for us, Steve, is that we want to have a great customer experience, so overwhelmingly, the thing that drives us are to embark on services that help that, and become a part of the ecosystem. The reality is that in doing so, we have developed a very large and profitable business in the services area, and so we felt last quarter and working up to that, that we should pull back the curtain so that people could -- our investors could see the services business, both in terms of the scale of it, and the growth of it. As we said earlier, the purchase value of the installed base services grew by 27% during the quarter, which was an acceleration over the previous quarter, and the value of it hit -- was just shy of $10 billion. It's huge, and we felt it was important to spell that out.
STEVE MILUNOVICH: Okay, and then going back to the upgrades of the installed base, you have clearly mentioned that you've pulled forward some demand, which makes sense, but there does seem to be a lengthening of the upgrade cycle, particularly in the US. AT&T and Verizon have talked about that. Investors I think perceive that maybe the marginal improvements on the phone might be less currently, and could be less going forward. At the same time, I think you just announced that you can get the upgrade program online, which I guess potentially could shorten it. Do you believe that upgrade cycles are currently lengthening, and can continue to do so?
TIM COOK: What we've seen is that it depends on what you compare it to. If you compare to the 5s, what we are seeing is the upgrade rate today is slightly higher, or that there are more people upgrading, if you will, in a similar time period, in terms of a rate, than the 5s. But if you compare to 6, you would clearly arrive at the opposite conclusion. I think it depends on people's reference points, and we thought it very important in this call to be very clear and transparent about what we're seeing. I think in retrospect, you could look at it and say, maybe the appropriate measure is more to the 5s, and I think everybody intuitively thought that the upgrades were accelerated with the 6, and in retrospect, when you look at the periods, they clearly were.
STEVE MILUNOVICH: Thank you.
NANCY PAXTON: Thanks, Steve. Could we have our next question, please?
OPERATOR: We will go to Rod Hall with JPMorgan.
ROD HALL, ANALYST, JPMORGAN: Yes, thanks for fitting me in. I wanted to start with a general, more general question. I guess, Tim, this one is aimed at you. As you think about where you thought things were going to head last quarter, when you reported to us, and how it's changed this quarter, obviously it's kind of a disappointing demand environment. Can you just help us understand what maybe the top two or three things are that have changed? And so as we walk away from this, we understand what the differences are, and what the direction of change is? Then I have a follow-up.
TIM COOK: I think you're probably indirectly asking about our trough comment, if you will, from last quarter. And when we made that, we did not contemplate or comprehend that we were going to make a $2 billion-plus reduction in channel inventory during this quarter. So if you factor that in and look at true customer demand, which is the way that we look at internally, I think you'll find a much more reasonable comparison.
ROD HALL: Okay, great. Thank you. And then for my follow-up, I wanted to ask you about the tax situation a little bit. Treasury obviously has made some rule changes, and I wonder, maybe if Luca, you could comment on what the impact to Apple from those is, if anything? and Tim, maybe more broadly how you see the tax situation for Apple looking forward? Thanks.
LUCA MAESTRI: Yes, Rod, these are new regulations, and we are in the processing of assessing them. Frankly from first read, we don't anticipate that they are going to have any material impact on our tax situation. Some of them relate to inversion transactions, obviously that's not an issue for us. Some of them are around internal debt financing, which is not something that we use, so we don't expect any issue there.
As you know, we are the largest US taxpayer by a wide margin, and we already pay full US tax on all the profits from the sales that we make in the United States, so we don't expect them to have any impact on us on tax reform. I will let Tim continue to provide more color, but we've been strong advocates for comprehensive corporate tax reform in this country. We continue to do that. We think a reform of the tax code would have significant benefits for the entire US economy, and we remain optimistic that we are going to get to a point where we can see that tax reform enacted. At that point in time, of course, we would have much more flexibility around optimizing our capital structure, and around providing more return of capital to our investors.
TIM COOK: The only thing I would add, Rod, is I think there are a growing number of people in both parties that would like to see comprehensive reform, and so I'm optimistic that it will occur. It's just a matter of when and that's difficult to say. But I think most people do recognize that it is in the US's interest to do this.
ROD HALL: Great, thanks.
NANCY PAXTON: Thank you, Rod. A replay of today's call will be available for two weeks as a podcast on the iTunes Store, as webcast on Apple.com/investor and via telephone. And the numbers for the telephone replay are 888-203-1112, or 719-457-0820, and please enter confirmation code 7495552. These replays will be available by approximately 5:00 PM Pacific time today.
Members of the press with additional questions can contact Kristin Huguet at 408-974-2414, and financial analysts can contact Joan Hoover or me with additional questions. Joan is at 408-974-4570, and I am at 408-974-5420. Thanks again for joining us.
OPERATOR: Ladies and gentlemen, that does conclude today's presentation. We do thank everyone for your participation.
[Thomson Financial reserves the right to make changes to documents, content, or other information on this web site without obligation to notify any person of such changes.
In the conference calls upon which Event Transcripts are based, companies may make projections or other forward-looking statements regarding a variety of items. Such forward-looking statements are based upon current expectations and involve risks and uncertainties. Actual results may differ materially from those stated in any forward-looking statement based on a number of important factors and risks, which are more specifically identified in the companies' most recent SEC filings. Although the companies may indicate and believe that the assumptions underlying the forward-looking statements are reasonable, any of the assumptions could prove inaccurate or incorrect and, therefore, there can be no assurance that the results contemplated in the forward-looking statements will be realized.
THE INFORMATION CONTAINED IN EVENT TRANSCRIPTS IS A TEXTUAL REPRESENTATION OF THE APPLICABLE COMPANY'S CONFERENCE CALL AND WHILE EFFORTS ARE MADE TO PROVIDE AN ACCURATE TRANSCRIPTION, THERE MAY BE MATERIAL ERRORS, OMISSIONS, OR INACCURACIES IN THE REPORTING OF THE SUBSTANCE OF THE CONFERENCE CALLS. IN NO WAY DOES THOMSON FINANCIAL OR THE APPLICABLE COMPANY OR THE APPLICABLE COMPANY ASSUME ANY RESPONSIBILITY FOR ANY INVESTMENT OR OTHER DECISIONS MADE BASED UPON THE INFORMATION PROVIDED ON THIS WEB SITE OR IN ANY EVENT TRANSCRIPT. USERS ARE ADVISED TO REVIEW THE APPLICABLE COMPANY'S CONFERENCE CALL ITSELF AND THE APPLICABLE COMPANY'S SEC FILINGS BEFORE MAKING ANY INVESTMENT OR OTHER DECISIONS.]
LOAD-DATE: April 29, 2016
LANGUAGE: ENGLISH
TRANSCRIPT: 042616a5987433.733
PUBLICATION-TYPE: Transcript
Copyright 2016 CQ-Roll Call, Inc.
All Rights Reserved
Copyright 2016 CCBN, Inc.
4 of 9 DOCUMENTS
FD (Fair Disclosure) Wire
January 26, 2016 Tuesday
Event Brief of Q1 2016 Apple Inc Earnings Call - Final
and the output I am expecting is everything between 'Q2 2016 Apple Inc Earnings Call - Final' and 'Event Brief of Q1 2016 Apple Inc Earnings Call - Final' in a text file.
Did you try converting your tuple into a string and writing that to the file?
s = str(format_file('XYZ', 'Q2 2016 Apple Inc Earnings Call - Final', 'Event Brief of Q1 2016 Apple Inc Earnings Call - Final'))
file.write(s)
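If you want the file to contain just the text rather than Python's tuple syntax, joining the elements may be closer to what you're after (a sketch reusing the format_file call from the question):

with open(r'C:\TEMP\out.txt', 'w') as out:
    out.write('\n'.join(format_file('XYZ', 'Q2 2016 Apple Inc Earnings Call - Final', 'Event Brief of Q1 2016 Apple Inc Earnings Call - Final')))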

Need algorithm suggestions for flight routings

I'm in the early stages of thinking through a wild trip that involves visiting every commercial airport in India. A little research shows that the national carrier, Air India, has a special ticket called the Silver Pass that allows unlimited travel on their domestic network for 15 days. I would like to use this as my weapon of choice!
See this for a map of all the airports served by Air India
I have the following information available with me in Excel:
All of the domestic flight routes (departure airports and arrival airports in IATA codes)
Duration for every flight route
Weekly frequency for every flight (not all flights run on all days of the week, for example)
Given this information, how do I figure out the maximum number of airports that I can hit in 15 days using the Silver Pass ticket? Looking online shows that this is either a traveling salesman problem or a graph traversal problem. What would you guys recommend that I look at to solve this?
Some background on myself - I'm just beginning to learn Python and would like to figure out a way to solve this problem using that. Given that, what are the python-based algorithms/libraries that I should be looking at that will help me structure an approach to solving this?
Your problem is closely related to the Hamiltonian Path problem and Traveling Salesman Problem, which are NP-Hard.
Given an instance of the Hamiltonian Path problem, build flight data as follows:
Each vertex is an airport
Each edge is a flight
All flights leave at the same time and take the same duration. (*)
(*) The flight duration and departure time [which are common to all flights] should be calculated so that you will be able to visit all terminals only if you visit each terminal exactly once. This can easily be done in polynomial time: assuming we have a fixed time of k hours for the ticket, we construct the flight table such that each flight takes exactly k/(n-1) hours, and there is a flight every k/(n-1) hours as well (1) [remember, all flights are at the same time].
It is easy to see that you can use the ticket to visit all airports if and only if the graph has a Hamiltonian path, since if we visit a certain airport twice in the path, we need at least n flights and the total time will be at least (k/(n-1)) * n > k, so we fail the time limit [the other direction is similar].
Thus your problem [in the general case] is NP-Hard, and there is no known polynomial solution for it.
(1) We assume it takes no time to pass between flights; this can easily be fixed by simply decreasing the flight length by the time it takes to "jump" between two flights.
Representing your problem as a graph is definitely the best option. Since the duration, number of flights, and number of airports are relatively limited, and since you are (presumably) happy with approximate solutions, attacking this by brute force ought to be practical, and is probably your best option. Here's roughly what I would do:
Represent each airport as a node on the graph, and each flight as an edge.
Given a starting airport and a current time, select all the flights leaving that airport after the current time. Use a scoring function of some sort to rank them, such that flights to airports you haven't visited are ranked higher than flights to airports you have visited, and earlier flights are ranked higher than later ones.
Recursively explore each outgoing edge, in order of score, and repeat the procedure for the arriving airport.
Any time you reach a node with no outgoing valid edges, compare it to the best possible solution. If it's an improvement, output it and set it as the new best solution.
Depending on the number of flights, you may be able to run this procedure exhaustively. The number of solutions grows exponentially with the number of flights, of course, so this will quickly become impractical. This is where the scoring function becomes useful - it prioritizes the solutions more likely to produce useful answers. You can run the procedure for as long as you want, and stop when it produces a solution you're happy with.
The properties of the scoring function will have a big impact on how good the solutions are. If your priority is exploring unique places, you want to put a big premium on unvisited airports, and since you want to explore as many as possible, you need to prioritize short transfer times. My suggestion for a starting point would be to make the penalty for going somewhere you've already been proportional to the time it would take to fly from there to somewhere else. That way, it'll still be explored as a stopover, but avoided where possible. Also, note that your scoring function will need context, namely the set of airports that have been visited by the current candidate path.
You can also use the scoring function to apply other constraints. Say you don't want to travel during the night (a reasonable assumption); you can penalize the score of edges that involve nighttime flights.
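To make the procedure concrete, here is a minimal sketch in Python. The flight tuples, the time limit, and the scoring weights are all illustrative placeholders to be replaced with your Excel data and your own priorities:

from collections import defaultdict

# (origin, destination, departure, arrival), times in hours from the start of the pass.
flights = [
    ("DEL", "BOM", 1.0, 3.0),
    ("BOM", "MAA", 4.0, 6.0),
    ("DEL", "MAA", 2.0, 5.0),
    ("MAA", "CCU", 7.0, 9.5),
]
LIMIT = 15 * 24  # the 15-day pass, in hours

by_origin = defaultdict(list)
for f in flights:
    by_origin[f[0]].append(f)

best = {"count": 0, "path": []}

def score(flight, visited):
    _, dest, depart, _ = flight
    # Prefer unvisited airports, then earlier departures.
    return (0 if dest in visited else 1000) - depart

def search(airport, time, visited, path):
    if len(visited) > best["count"]:
        best["count"], best["path"] = len(visited), list(path)
    options = [f for f in by_origin[airport] if f[2] >= time and f[3] <= LIMIT]
    for f in sorted(options, key=lambda f: score(f, visited), reverse=True):
        dest, newly_visited = f[1], f[1] not in visited
        visited.add(dest)
        search(dest, f[3], visited, path + [f])
        if newly_visited:
            visited.discard(dest)  # backtrack

search("DEL", 0.0, {"DEL"}, [])
print(best["count"], best["path"])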

A good way to make Django geolocation aware? - Django/Geolocation

I would like to be able to associate various models (Venues, places, landmarks) with a City/Country.
But I am not sure what some good ways of implementing this would be.
I'm following a simple route, I have implemented a Country and City model.
Whenever a new city or country is mentioned it is automatically created.
Unfortunately I have various problems:
The database can easily be polluted
Django has no real knowledge of what those City/Countries really are
Any tips or ideas? Thanks! :)
A good starting place would be to get a location dataset from a service like Geonames. There is also GeoDjango, which came up in this question.
As you encounter new location names, check them against your larger dataset before adding them. For your 2nd point, you'll need to design this into your object model and write your code accordingly.
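For the object model itself, a minimal sketch of what that could look like (the field choices are illustrative, not prescriptive):

from django.db import models

class Country(models.Model):
    name = models.CharField(max_length=128, unique=True)
    iso_code = models.CharField(max_length=2, unique=True)  # e.g. keyed to the Geonames dataset

class City(models.Model):
    name = models.CharField(max_length=128)
    country = models.ForeignKey(Country, on_delete=models.CASCADE, related_name="cities")
    geonames_id = models.IntegerField(null=True, blank=True)  # link back to the source record

    class Meta:
        unique_together = ("name", "country")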
Here are some other things you may want to consider:
Aliases & Abbreviations
These come up more than you would think. People often use the names of suburbs or neighbourhoods that aren't official towns. You should also consider cases like LA -> Los Angeles, MTL -> Montreal, MT. Forest -> Mount Forest, Saint vs. its abbreviations (ST, st., ST-), etc.
Fuzzy Search
Looking up city names is much easier when differences in spelling are accounted for. This also helps reduce the number of duplicate names for the same place.
You can do this by pre-computing the Soundex or Double Metaphone values for the cities in your dataset. When performing a lookup, compute the value for the search term and compare it against the pre-computed values. This will work best for English, but you may have success with other Romance-language derivatives (unsure what your options are beyond these).
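A minimal sketch of that pre-computed lookup, assuming the third-party jellyfish package for the phonetic encoding (any Soundex or Metaphone implementation would do):

import jellyfish

cities = ["Montreal", "Mount Forest", "Los Angeles", "Brooklyn"]

# Pre-compute one phonetic key per city, once, at load time.
index = {}
for city in cities:
    index.setdefault(jellyfish.metaphone(city), []).append(city)

def lookup(term):
    # Misspellings that sound alike share a key with the canonical name.
    return index.get(jellyfish.metaphone(term), [])

print(lookup("Montreel"))  # the misspelling still finds 'Montreal'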
Location Equivalence/Inclusion
Be able to determine that Brooklyn is a borough of New York City.
At the end of the day, this is a hard problem, but applying these suggestions should greatly reduce the amount of data corruption and other headaches you have to deal with.
Geocoding datasets from Yahoo and Google can be a good starting point. Also take a look at the geopy library in Python.
