Python - slow speed when extracting numbers/words from a string

Noob here trying to learn python by doing a project as I don't learn well from books.
I am using a huge lump of code to perform what seems to me to be a small operation -
I want to extract 4 variables from the following string
'Miami 0, New England 28'
(variables being home_team, away_team, home_score, away_score)
My program is running pretty slowly, and I think this bit of code might be the cause. I'm looking for the quickest, most efficient way of doing this.
Would a regex be quicker? Thanks

It seems like your text could be split twice: first on the comma, and then on whitespace:
info1,info2 = s.split(',')
home,home_score = info1.rsplit(None,1)
away,away_score = info2.rsplit(None,1)
e.g.:
>>> s = 'Miami 0, New England 28'
>>> info1,info2 = s.split(',')
>>> home,home_score = info1.rsplit(None,1)
>>> away,away_score = info2.rsplit(None,1)
>>> print [home,home_score,away,away_score]
['Miami', '0', ' New England', '28']
Note the leading space in ' New England'; call .strip() on it if you need it removed.
You could do this with regex without too much difficulty -- but you pay for it in terms of readability.
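As a minimal sketch of the same two-step split wrapped in a function (written for Python 3; the name parse_score_line is hypothetical, and converting the scores to int is an assumption about what the asker wants):

```python
def parse_score_line(line):
    """Split 'Home N, Away M' into (home, home_score, away, away_score)."""
    # First split on the comma, then split each half once from the right
    # so that multi-word team names stay intact.
    home_part, away_part = line.split(',')
    home, home_score = home_part.rsplit(None, 1)
    away, away_score = away_part.rsplit(None, 1)
    # strip() removes the leading space left in front of the away team.
    return home.strip(), int(home_score), away.strip(), int(away_score)


print(parse_score_line('Miami 0, New England 28'))
# ('Miami', 0, 'New England', 28)
```

This assumes exactly one comma per line; a line without one raises ValueError at the first unpacking.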

In case you do want a regex:
import re
s='Miami 0, New England 28'
l=re.findall(r'^([^\d]+)\s(\d+)\s*,\s*([^\d]+)\s(\d+)',s)
hm_team,away_team,hm_score,away_score=l[0]
print l
Prints [('Miami', '0', 'New England', '28')] and assigns those values to the variables.

import re
reg = re.compile(r'\s*(\D+?)\s*(\d+)'
                 r'[,;:.#=\s]*'
                 r'(\D+?)\s*(\d+)'
                 r'\s*')
for s in ('Miami 0, New England 28',
          'Miami0,New England28 ',
          ' Miami 0 . New England28',
          'Miami 0 ; New England 28',
          'Miami0#New England28 ',
          ' Miami 0 # New England28'):
    print reg.search(s).groups()
Result:
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
'\D' matches any character that is not a digit.
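The original question was about speed, and none of the answers measure it. A rough way to compare the split approach with a precompiled regex is the standard timeit module. This is only a sketch for Python 3: the helper names with_split and with_regex are hypothetical, and the timings depend on the machine, so no numbers are claimed here.

```python
import re
import timeit

LINE = 'Miami 0, New England 28'
# Precompiled once, outside the timed function.
PATTERN = re.compile(r'\s*(\D+?)\s*(\d+)[,;:.#=\s]*(\D+?)\s*(\d+)\s*')


def with_split(line):
    first, second = line.split(',')
    home, home_score = first.rsplit(None, 1)
    away, away_score = second.rsplit(None, 1)
    return home.strip(), home_score, away.strip(), away_score


def with_regex(line):
    return PATTERN.search(line).groups()


# Run each approach 10000 times and report the total elapsed seconds.
split_seconds = timeit.timeit(lambda: with_split(LINE), number=10000)
regex_seconds = timeit.timeit(lambda: with_regex(LINE), number=10000)
print('split: %.4fs  regex: %.4fs' % (split_seconds, regex_seconds))
```

If both come out in the range of milliseconds for thousands of lines, the slowdown is probably somewhere else in the program.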

Related

Array keeps getting replaced when being added to a dictionary in python

Im trying to make a JSON object, which is basically a dictionary. This is my code which created a dictionary:
# Adding the data to the JSONData object
JSONData[str(gerechtNaam)] = {
"afbeeldingURL": gerechtAfbeelding,
"receptURL": recept,
"prijs": totalePrijs,
"porties": porties,
"moeilijkheid" :moeilijkheid,
"caloriePortie": calorien,
"voorbereidingsTijd": voorbereidingsTijd,
"wachtTijd": wachtTijd,
"totaleTijd": totaleTijd,
"ingredienten": naamEnKwantiteitIngredienten
}
This works, and generates the following:
{
'Gerooktekipsalade met avocado en walnoten': {
'afbeeldingURL': 'https://static-images.jumbo.com/product_images/Recipe_502535-01_560x560.jpg',
'receptURL': 'http://www.jumbo.com/gerooktekipsalade-met-avocado-en-walnoten/502535/',
'prijs': 16.868000000000002,
'porties': '4 porties',
'moeilijkheid': 'Eenvoudig',
'caloriePortie': '842 kcal per persoon',
'voorbereidingsTijd': '15 min',
'wachtTijd': '0',
'totaleTijd': '15 min',
'ingredienten': [
'2 kroppen minisla romaine ',
'200 g cherrytomaatjes',
'4 stengels bleekselderij',
'2 friszoete handappels ',
'380 g Nieuwe Standaard Kip gerookte kipfilet ',
'2 bosuitjes',
'2 avocado',
'150 ml whisky-cocktailsaus',
'3 el bieslook',
'60 g walnoten',
'1 stokbrood',
'1 snufje peper'
]
}
}
Which I then convert using the following code:
with open('receptData.json', 'w') as outfile:
    json.dump(JSONData, outfile)
This works, and generates valid JSON. The only problem appears when the code runs twice in a for loop: the last variable, 'ingredienten', which is a list created inside the loop, gets replaced for all objects in the dictionary. When the second 'ingredienten' list is created, the 'ingredienten' list that had already been added to JSONData is replaced by the new one. All the other variables stay correct, yet the list gets replaced every time the loop runs.
So the second time the code runs, this is the dictionary I get:
{
'Gerooktekipsalade met avocado en walnoten': {
'afbeeldingURL': 'https://static-images.jumbo.com/product_images/Recipe_502535-01_560x560.jpg',
'receptURL': 'http://www.jumbo.com/gerooktekipsalade-met-avocado-en-walnoten/502535/',
'prijs': 16.868000000000002,
'porties': '4 porties',
'moeilijkheid': 'Eenvoudig',
'caloriePortie': '842 kcal per persoon',
'voorbereidingsTijd': '15 min',
'wachtTijd': '0',
'totaleTijd': '15 min',
'ingredienten': **[
'4 avocado',
'100 g gerookte zalm',
'8 kleine eieren ',
'25 g alfalfa',
'1 snufje peper',
'1 bakplaat'
]**
},
'Gevulde avocado met ei en zalm uit de oven': {
'afbeeldingURL': 'https://static-images.jumbo.com/product_images/Recipe_502536-01_560x560.jpg',
'receptURL': 'http://www.jumbo.com/gevulde-avocado-met-ei-en-zalm-uit-de-oven/502536/',
'prijs': 8.72,
'porties': '4 porties',
'moeilijkheid': 'Eenvoudig',
'caloriePortie': '234 kcal per persoon',
'voorbereidingsTijd': '10 min',
'wachtTijd': '15 min',
'totaleTijd': '25 min',
'ingredienten': **[
'4 avocado',
'100 g gerookte zalm',
'8 kleine eieren ',
'25 g alfalfa',
'1 snufje peper',
'1 bakplaat'
]**
}
}
In which the first 'ingredienten' list is now the same as the second one, which should not be the case. I've tried multiple things but none worked....

While you haven't shown the code that creates it, I'm pretty sure the problem is that you're reusing the variable naamEnKwantiteitIngredienten, which is the list you're using as the value pointed to by the 'ingredienten' key in your dictionary. If that list gets modified in place (perhaps by filling it up with a different set of ingredients), you'll also see the modified version in your previous dictionary if you haven't dumped it to a JSON string yet.
I think there are two main ways you could fix the problem.
One is to create the JSON immediately after you make the dictionary, rather than waiting to do it later. While this might resolve this issue, it might be inconvenient for your program (or impossible, if you need all the dictionaries to be defined at the same time for other reasons).
The other solution is to make sure that the dictionaries you create are independent of each other. Rather than reusing the same list in all of them, you should make sure that each one contains a separate list. The most obvious place to fix this may be wherever you create the value that ends up in naamEnKwantiteitIngredienten, but you could instead fix it within the code you show by copying the list just before you put it in the dictionary:
JSONData[str(gerechtNaam)] = {
"afbeeldingURL": gerechtAfbeelding,
"receptURL": recept,
"prijs": totalePrijs,
"porties": porties,
"moeilijkheid" :moeilijkheid,
"caloriePortie": calorien,
"voorbereidingsTijd": voorbereidingsTijd,
"wachtTijd": wachtTijd,
"totaleTijd": totaleTijd,
"ingredienten": naamEnKwantiteitIngredienten[:] # slice here to copy the list!
}
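The effect can be reproduced in a few lines. This is a sketch under the assumption made above, namely that one list object is cleared and refilled on every pass of the loop; the recipe names and the keys 'shared' and 'copied' are made up for illustration.

```python
recipes = {}
ingredients = []  # one list object, reused on every iteration

for name, items in [('salad', ['lettuce', 'tomato']),
                    ('eggs', ['egg', 'salmon'])]:
    del ingredients[:]         # clears the list in place
    ingredients.extend(items)  # refills the same object
    recipes[name] = {
        'shared': ingredients,     # every entry points at the same list
        'copied': ingredients[:],  # a slice makes an independent copy
    }

print(recipes['salad']['shared'])  # ['egg', 'salmon']
print(recipes['salad']['copied'])  # ['lettuce', 'tomato']
```

Creating a fresh list inside the loop (ingredients = []) instead of clearing the old one avoids the problem without needing the copy.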

python regex for incomplete decimals numbers

I have a string of numbers which may have incomplete decimal representation,
for example:
a = '1. 1,00,000.00 1 .99 1,000,000.999'
Desired output:
['1','1,00,000.00','1','.99','1,000,000.999']
So far I have tried the following two patterns:
re.findall(r'[-+]?(\d+(?:[.,]\d+)*)',a)
which gives
['1', '1,00,000.00', '1', '99', '1,000,000.999']
This turns .99 into 99, which is not desired,
while
re.findall(r'[-+]?(\d*(?:[.,]\d+)*)',a)
gives
['1', '', '', '1,00,000.00', '', '', '1', '', '.99', '', '1,000,000.999', '']
which includes undesirable empty strings as well.
This is for finding currency values in a string, so the comma separators don't follow a set pattern and may not be present at all.

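My suggestion is to use the regex below with re.findall. (A pattern wrapped in /.../ will not work here: Python treats the slashes as literal characters, so nothing matches.)
I've implemented a snippet in python.
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
result = re.findall(r'\d+(?:,\d+)*(?:\.\d+)?|\.\d+', a)
print(result)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']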

You can use re.split:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
d = re.split('(?<=\d)\.\s+|(?<=\d)\s+', a)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']

This regex will give you your desired output:
([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)
You can test it here: https://regex101.com/r/VfQIJC/6
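The pattern above contains four capture groups, so re.findall would return a 4-tuple per match rather than plain strings. One way to get the flat list, sketched here for Python 3, is to iterate with finditer and take group(0), the whole match:

```python
import re

PATTERN = re.compile(r'([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)')

a = '1. 1,00,000.00 1 .99 1,000,000.999'
# group(0) is the full matched text, regardless of which alternative matched.
numbers = [m.group(0) for m in PATTERN.finditer(a)]
print(numbers)
# ['1', '1,00,000.00', '1', '.99', '1,000,000.999']
```

One caveat: because the first alternative is tried first, an input such as '12.5' comes back as '12' and '.5' rather than as a single number. The sample string does not contain that shape, but real currency data might.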

Python Variable Amount Of Input

I'm working on a program that determines whether a graph is strongly connected.
I am reading standard input on a sequence of lines.
The lines have two or three whitespace-delimited tokens, the name of the source and destination vertices, and an optional decimal edge weight.
Input might look like this:
'''
Houston Washington 1000
Vancouver Houston 300
Dallas Sacramento 800
Miami Ames 2000
SanFrancisco LosAngeles
ORD PVD 1000
'''
How can I read in this input and add it to my graph?
I believe I will be using a collection like this:
flights = collections.defaultdict(dict)
Thank you for any help!

With d as your data, you can split the input on '\n', strip the trailing whitespace from each line, and find the last occurrence of a space. With that index you can slice each line to get the names and the number associated with them.
Here I've stored the data in a list of dictionaries. You can modify it according to your requirements!
Use re.sub from the regular expression module to remove any extra spaces.
>>> import re
>>> d
'\nHouston Washington 1000\nVancouver Houston 300\nDallas Sacramento 800\nMiami Ames 2000\nSanFrancisco LosAngeles\nORD PVD 1000\n'
>>>[{'Name':re.sub(r' +',' ',each[:each.strip().rfind(' ')]).strip(),'Flight Number':each[each.strip().rfind(' '):].strip()} for each in filter(None,d.split('\n'))]
[{'Flight Number': '1000', 'Name': 'Houston Washington'}, {'Flight Number': '300', 'Name': 'Vancouver Houston'}, {'Flight Number': '800', 'Name': 'Dallas Sacramento'}, {'Flight Number': '2000', 'Name': 'Miami Ames'}, {'Flight Number': 'LosAngeles', 'Name': 'SanFrancisco'}, {'Flight Number': '1000', 'Name': 'ORD PVD'}]
Edit:
To match your flights dict,
>>> flights={'Houston':{'Washington':''},'Vancouver':{'Houston':''}} #sample dict
>>> for each in filter(None,d.split('\n')):
...     flights[each.split()[0]][each.split()[1]]=each.split()[2]
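The loop above indexes the third token unconditionally, which fails on a line such as 'SanFrancisco LosAngeles' that has no weight. A minimal sketch for Python 3 that handles both line shapes with the defaultdict from the question is shown below. The function name read_flights, the use of None for a missing weight, and the io.StringIO stand-in for standard input are all assumptions for illustration.

```python
import collections
import io


def read_flights(lines):
    """Build {source: {destination: weight}} from whitespace-delimited lines."""
    flights = collections.defaultdict(dict)
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if len(tokens) == 2:
            source, dest = tokens
            weight = None  # assumption: None marks "no weight given"
        elif len(tokens) == 3:
            source, dest = tokens[0], tokens[1]
            weight = float(tokens[2])
        else:
            raise ValueError('expected 2 or 3 tokens: %r' % line)
        flights[source][dest] = weight
    return flights


# In the real program, pass sys.stdin instead of this sample stream.
sample = io.StringIO(
    'Houston Washington 1000\n'
    'Vancouver Houston 300\n'
    'SanFrancisco LosAngeles\n'
)
flights = read_flights(sample)
print(dict(flights))
```

Because a file object yields one line per iteration, the same function works unchanged with sys.stdin or an open file.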

python split by "\t" is not showing all elements in it

I am trying to split by "\t" but it is not printing all the elements in it
import sys
reload(sys)
sys.setdefaultencoding('utf8')
s = ['A\t"Ravi"\t"Tirupur"\t"India"\t"641652"\t"arunachalamravi#gmail.com"\t"17379602"\t"+ 2"\t"Government Higher Secondary School', ' Tiruppur"\t\t"1989"\t"Maths',' Science"\t"No"\t"Encotec Energy 2 X 600 MW ITPCL"\t"Associate Vice President- Head Maintenance"\t"2015"\t"2016"\t"No"\t"27-Mar-2017"\t"9937297878"\t\t"2874875"\t"Submitted"\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t']
print s[0].split("\t")
Results
['A', '"Ravi"', '"Tirupur"', '"India"', '"641652"', '"arunachalamravi#gmail.com"', '"17379602"', '"+ 2"', '"Government Higher Secondary School']
But I want the results to go up to this:
2874875, Submitted
How do I fix the code, and what needs to change?

Your list has more than one item, so s[0] only gives you the first one. Either fix your list or fix your code like this:
joined_string = ''.join(s)
print joined_string.split("\t")
It should work

With your data you should do something like this:
s[2].split("\t")[10:12]

You could use itertools.chain to create a single list from the multiple elements:
from itertools import chain
s = ['A\t"Ravi"\t"Tirupur"\t"India"\t"641652"\t"arunachalamravi#gmail.com"\t"17379602"\t"+ 2"\t"Government Higher Secondary School', ' Tiruppur"\t\t"1989"\t"Maths',' Science"\t"No"\t"Encotec Energy 2 X 600 MW ITPCL"\t"Associate Vice President- Head Maintenance"\t"2015"\t"2016"\t"No"\t"27-Mar-2017"\t"9937297878"\t\t"2874875"\t"Submitted"\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t']
result = list(chain.from_iterable(x.rstrip('\t').split('\t') for x in s))
print result
This would give you all of the split entries, and remove the trailing tabs from the end:
['A', '"Ravi"', '"Tirupur"', '"India"', '"641652"', '"arunachalamravi#gmail.com"', '"17379602"', '"+ 2"', '"Government Higher Secondary School', ' Tiruppur"', '', '"1989"', '"Maths', ' Science"', '"No"', '"Encotec Energy 2 X 600 MW ITPCL"', '"Associate Vice President- Head Maintenance"', '"2015"', '"2016"', '"No"', '"27-Mar-2017"', '"9937297878"', '', '"2874875"', '"Submitted"']
If you also want to get rid of the quotes, then use this instead:
result = [v.strip('"') for v in chain.from_iterable(x.rstrip('\t').split('\t') for x in s)]
Giving you:
['A', 'Ravi', 'Tirupur', 'India', '641652', 'arunachalamravi#gmail.com', '17379602', '+ 2', 'Government Higher Secondary School', ' Tiruppur', '', '1989', 'Maths', ' Science', 'No', 'Encotec Energy 2 X 600 MW ITPCL', 'Associate Vice President- Head Maintenance', '2015', '2016', 'No', '27-Mar-2017', '9937297878', '', '2874875', 'Submitted']
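The shape of the list suggests, though the question does not confirm it, that one tab-delimited line was split on commas that actually sat inside quoted fields. Under that assumption, a sketch for Python 3 is to rejoin the pieces with ',' and let the csv module parse the tab-delimited line, which keeps quoted fields whole and removes the quotes. The data below is a shortened, made-up version of the original list.

```python
import csv

# Shortened stand-in for the list in the question.
parts = ['A\t"Ravi"\t"Government Higher Secondary School',
         ' Tiruppur"\t\t"1989"\t"Maths',
         ' Science"\t"No"\t"Submitted"\t\t']

# Assumption: the pieces were produced by splitting on ',', so put it back.
line = ','.join(parts)

# csv.reader accepts any iterable of lines; here a one-element list.
row = next(csv.reader([line], delimiter='\t'))

# Drop the empty fields produced by the trailing tabs.
while row and row[-1] == '':
    row.pop()

print(row)
```

With this approach 'Government Higher Secondary School, Tiruppur' and 'Maths, Science' each come back as a single field.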

Key Error In Python

Here is my code
with open('yvd.txt') as fd:
    name='Trevor Jones'
    input=[x.split('|') for x in fd.readlines()]
    to_search={x[0]:x for x in input}
    print('\n'.join(to_search[name]))
I'm trying to search for a name in a large file and print the information that follows, minus the separators. Here is a part of the text file:
|Trevor Jones|1|MOV|White Male|Light|10||3000|2500|Old Man Living In Retirement Home|
However, when I run the script I get a key error saying "KeyError: 'Trevor Jones'", which doesn't make sense because Trevor Jones exists in the file.
Anyone have any ideas?

>>> text = '|Trevor Jones|1|MOV|White Male|Light|10||3000|2500|Old Man Living In Retirement Home|'
>>> x = text.split('|')
>>> x
['', 'Trevor Jones', '1', 'MOV', 'White Male', 'Light', '10', '', '3000', '2500', 'Old Man Living In Retirement Home', '']
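Here you can see the problem: x[0] is '', because the line starts with a separator.
I would just use text.strip('|').split('|')
If you are wondering why the empty strings appear, consider '|'.join(x): it must be able to reconstruct the original string, including its leading and trailing '|'.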
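Applied to the lookup from the question, the fix might look like the sketch below (Python 3). The function name build_index is hypothetical, and a one-line list stands in for the file yvd.txt so the example runs on its own.

```python
def build_index(lines):
    """Map the first field of each '|'-delimited line to its list of fields."""
    index = {}
    for line in lines:
        # Remove the newline first, then the leading and trailing separators.
        fields = line.strip().strip('|').split('|')
        if fields[0]:
            index[fields[0]] = fields
    return index


# Stand-in for: with open('yvd.txt') as fd: index = build_index(fd)
sample = ['|Trevor Jones|1|MOV|White Male|Light|10||3000|2500|'
          'Old Man Living In Retirement Home|\n']
index = build_index(sample)
print('\n'.join(index['Trevor Jones']))
```

Note that strip('|') removes every leading and trailing '|', so a record whose last fields are empty would lose them; slicing off exactly one character from each end is an alternative if that matters.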
