So I may have the strings 'Bank of China', 'Embassy of China', and 'International China'.
I want to replace all country instances except when they are preceded by 'of ' or 'of the '.
Clearly this can be done by iterating through a list of countries, checking if the name contains a country, then checking whether 'of ' or 'of the ' appears before the country.
If these do exist then we do not remove the country, else we do remove the country. The examples will become:
'Bank of China', or 'Embassy of China', and 'International'
However, iteration can be slow, particularly when you have a large list of countries and a large list of texts to process.
Is there a faster and more conditionally based way of replacing the string? So that I can still use a simple pattern match using the Python re library?
My function is along these lines:
def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name = re.sub(country + '$', '', name).strip()
                return name
    return name
EDIT: I did find some info here. This does describe how to do an if, but I really want something like:
if not 'of '
if not 'of the '
then replace...
You could compile a few sets of regular expressions, then pass your list of input through them. Something like:
import re

countries = ['foo', 'bar', 'baz']

takes = [re.compile(r'of\s+(the)?\s*%s$' % (c), re.I) for c in countries]
subs = [re.compile(r'%s$' % (c), re.I) for c in countries]

def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s

print(remove_country('the bank of foo'))
print(remove_country('the bank of the baz'))
print(remove_country('the nation bar'))
''' Output:
the bank of foo
the bank of the baz
the nation
'''
It doesn't look like anything faster than linear time complexity is possible here. At least you can avoid recompiling the regular expressions a million times and improve the constant factor.
Edit: I had a few typos, but the basic idea is sound and it works. I've added an example.
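As a further untested sketch of that constant-factor idea (not from the original answer): fold all the countries into a single alternation, so each input string is scanned once rather than once per country:

import re

countries = ['foo', 'bar', 'baz']
names = '|'.join(map(re.escape, countries))

# One regex keeps "of <country>" / "of the <country>" endings...
keep = re.compile(r'of\s+(?:the\s+)?(?:%s)$' % names, re.I)
# ...and one strips a bare trailing country name.
drop = re.compile(r'\s*(?:%s)$' % names, re.I)

def remove_country(s):
    if keep.search(s):
        return s
    return drop.sub('', s)

print(remove_country('the bank of foo'))  # the bank of foo
print(remove_country('the nation bar'))   # the nation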
I think you could use the approach in Python: how to determine if a list of words exist in a string to find any countries mentioned, then do further processing from there.
Something like
countries = [
    "Afghanistan",
    "Albania",
    "Algeria",
    "Andorra",
    "Angola",
    "Anguilla",
    "Antigua",
    "Arabia",
    "Argentina",
    "Armenia",
    "Aruba",
    "Australia",
    "Austria",
    "Azerbaijan",
    "Bahamas",
    "Bahrain",
    "China",
    "Russia",
    # etc
]
def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string

get_countries = find_words_from_set_in_string(countries)
then
get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")
returns
set(['Argentina', 'China', 'Russia'])
... which obviously needs more post-processing, but very quickly tells you exactly what you need to look for.
As pointed out in the linked article, you must be wary of words ending in punctuation, which could be handled by something like re.split(r"[ \t\r\n,.!?;:'\"]+", s) (note that plain str.split treats a multi-character argument as one literal delimiter, not as a set of characters). You may also want to look for adjectival forms, i.e. "Russian", "Chinese", etc.
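For example, a punctuation-tolerant variant of the lookup (a sketch; the character class is an assumption):

import re

def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        # Split on whitespace and common punctuation so "Russia," still matches "Russia".
        tokens = re.split(r"[ \t\r\n,.!?;:'\"]+", s)
        return set_.intersection(tokens)
    return words_in_string

get_countries = find_words_from_set_in_string(["China", "Russia"])
print(get_countries("Is the Embassy of China near the Consulate of Russia?"))
# {'China', 'Russia'}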
Not tested:
def removeCountry(name):
    for country in countries:
        name = re.sub(r'(?<!of )(?<!of the )' + country + '$', '', name).strip()
    return name
Using negative lookbehinds, re.sub only matches and replaces when country is not preceded by of or of the. Note that Python's re module requires fixed-width lookbehinds, so a single (?<!of (the )?) is not allowed; two separate assertions are used instead.
The re.sub function accepts a function as the replacement, which is called for each match to get the text that should be substituted. So you could do this:
import re

def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()
regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'
The result might contain some spurious space (in the above case a final strip() is needed). You can fix this by modifying the regex to:
\s*(of(\sthe)?\s)?(?P<state>({}))
To catch the spaces before of or before the country name and avoid the bad spacing in the output.
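For instance, the factory with the adjusted pattern might look like this (a sketch, a small variation on the code above):

def make_regex(countries):
    states = '|'.join(re.escape(country) for country in countries)
    # \s* (instead of \s+) also absorbs the space before "of" or the country name.
    return re.compile(r'\s*(of(\sthe)?\s)?(?P<state>{})'.format(states))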
Note that this solution can handle a whole text, not just text of the form Something of Country and Something Country. For example:
In [38]: regex = make_regex(['China'])
...: text = '''This is more complex than just "Embassy of China" and "International China"'''
In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'
Another example usage:
In [33]: countries = [
...: 'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
...: 'France', 'Italy', 'Australia', 'New Zealand', 'Brazil',
...: 'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
...: 'Spain', 'Portugal', 'Argentina', 'San Marino'
...: ]
In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'
In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)
In [36]: regex = make_regex(countries)
...: result = regex.sub(remove_name, text)
In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'
I have a list with address information
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of the list into a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there is more than one possible pattern for "region". For example, there can be "South reg." or "South r-n".
How can I combine a multiple patterns?
Also, the digit 4 in the list is the building number. It can be only digits, or something like 4k1.
How can I extract building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of which regions are accepted, it would be better to construct the regex from the valid values rather than the first word.
Also, for the building extraction, I am not sure which characters you want to keep versus remove. In this case I chose to keep only alphanumeric characters, meaning everything else is stripped.
CODE
import re

list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']

def GetFirstWord(list2, column):
    return re.search(r'\w+', list2[column].strip()).group()

def KeepAlpha(list2, column):
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())

print(GetFirstWord(list1, 0))
print(KeepAlpha(list1, 2))
OUTPUT
South
4k1
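To address the pattern-combining part of the question directly: alternation lets one compiled regex cover several spellings. A sketch (the exact spellings listed are assumptions from the question):

import re

address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']

# "region", "reg." or "r-n" anywhere in the item; re.search scans the whole string.
r = re.compile(r"region|reg\.|r-n")
region = list(filter(r.search, address))
print(region)  # [' South region']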
I need to get input from the user and check if the word following 'city' is inside of my dictionary (the key).
This is my dic:
mydic = {'Paris':132, 'Rome':42, 'San Remo':23}
I need the user to write 'city' (they will do so, given instructions that I gave them) but after it they have to write a city name. So they will write something like: 'city paris' and that has to return: 'Paris has 132 churches' or if they write 'city san remo' it has to return: 'San remo has 23 churches'.
The code has a conditional because if the user types 'city gba' then it returns a specific thing. So the issue is in the elif part where there are a lot more city options.
This is what I thought could work but doesn't for obvious reasons:
user_input = input().lower()
if user_input == 'city gba':
    print('City gba has', city_data.get('gba'), 'churches')
elif user_input.split() == 'city' + '':
    for x in user_input:
        if x in mydic.keys():
            print(x, 'has', city_data.get(x), ' churches.')
How else can I do this?
Thank you.
Your code is very close to working:
user_input = input().lower()          # Assume user input is "city rome"
tag, city = user_input.split(" ", 1)  # This will set tag = 'city' and city = 'rome'
if tag == 'city':
    if city.capitalize() in mydic:    # You could also check mydic.keys(); it doesn't matter
        print(f"{city.capitalize()} has {mydic[city.capitalize()]} churches.")
EDIT: .split() alone does not work, as city names like "united states" or "vatican city" contain spaces. We should use .split(" ", 1), which only splits on the first occurrence of a space.
city, name = user_input.split(' ', 1)  # Extract "city" and the city name
if name.lower() in mydic.keys():
    print(name, 'has', mydic.get(name.lower()), 'churches.')
Note that you need the name and the key to be an exact match. I recommend that you drop both to lower-case for the comparison. Code the dict as all lower-case.
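For instance, a dict comprehension lower-cases the keys of the existing dict (a sketch of that recommendation):

mydic = {'Paris': 132, 'Rome': 42, 'San Remo': 23}
mydic = {k.lower(): v for k, v in mydic.items()}
# {'paris': 132, 'rome': 42, 'san remo': 23}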
Your use-case is exactly the point of the new Python 3.10 feature Structural Pattern Matching. So your code could be rewritten to:
user_input = input().lower()
match user_input.split():
    case ['city', 'gba']:
        print('City gba has', city_data.get('gba'), 'churches')
    case ['city', *parts]:
        # Join the remaining words so multi-word names like "san remo" match too.
        city = ' '.join(parts)
        if city in mydic:
            print(city, 'has', mydic[city], 'churches.')
Something as simple as this should work.
code
# Prompt the user for a city name.
city = input('Enter a city. Format \'city name\': ')
# Split the user input at spaces -> ['city', 'cityName'] and take the second element.
city = city.split()[1]
# Check whether that city is in the given dictionary; if so, print the city's church count.
if city in mydic:
    print(f'{city} has {mydic[city]} churches')
# Otherwise report that the city is not in the given dictionary.
else:
    print('City not in dictionary')
input
Enter a city. Format 'city name': city Paris
output
Paris has 132 churches
I'm just starting to get to grips with Scrapy. So far, I've figured out how to extract the relevant sections of a web page and to crawl through web pages.
However, I'm still unsure as to how one can format the results in a meaningful tabular format.
When the scraped data is in a table format, it's straightforward enough. However, sometimes the data isn't, e.g. this link.
I can access the names using
response.xpath('//div[@align="center"]//h3').extract()
Then I can access the details using
response.xpath('//div[@align="center"]//p').extract()
Now, I need to format the data like this, so I can save it to a CSV file.
Name: J Speirs Farms Ltd
Herd Prefix: Pepperstock
Membership No. 7580
Dept. Herd Mark: UK244821
Membership Type: Youth
Year Joined: 2006
Address: Pepsal End Farm, Pepperstock, Luton, Beds
Postcode: LU1 4LH
Region: East Midlands
Telephone: 01582450962
Email:
Website:
Ideally, I'd like to define the structure of the data, then populate it according to the scraped data, because in some cases certain fields are not available, e.g. Email: and Website:
I don't need the answer, but would appreciate if someone can point me in the right direction.
All of the data seem to be separated by newlines, so simply use str.splitlines():
> names = response.xpath('//div[@align="center"]//a[@name]')
> details = names[0].xpath('following-sibling::p[1]/text()').extract_first().splitlines()
['J Speirs Farms Ltd ', 'Herd Prefix: Pepperstock ', 'Membership No. 7580 ', 'Dept. Herd Mark: UK244821 ', 'Membership Type: Youth ', 'Year Joined: 2006 ', 'Address: Pepsal End Farm ', ' Pepperstock ', ' Luton ', ' Beds ', 'Postcode: LU1 4LH ', 'Region: East Midlands ', 'Telephone: 01582450962 ']
> name = names[0].xpath('@name').extract_first()
'J+Speirs+Farms+Ltd+++'
Now you just need to figure out how to parse those bits into clean format:
Some values are split across multiple lines, but you can identify and fix the list by checking whether members contain ':' or 'No.'; if not, they belong to the preceding member that does:
clean_details = [f'Name: {details[0]}']
# The first item is the name, handled above, so skip it.
for d in details[1:]:
    if ':' in d or 'No.' in d:
        clean_details.append(d)
    else:
        clean_details[-1] += d
Finally, parse the cleaned-up details list:
item = {}
for detail in clean_details:
    values = detail.split(':')
    if len(values) < 2:  # e.g. "Membership No. 7580"
        values = detail.split('No.')
    if len(values) == 2:  # e.g. "Telephone: 01582450962"
        label, text = values
        item[label] = text.strip()
>>> pprint(item)
{'Address': 'Pepsal End Farm Pepperstock Luton Beds',
'Dept. Herd Mark': 'UK244821',
'Herd Prefix': 'Pepperstock',
'Membership ': '7580',
'Membership Type': 'Youth',
'Name': 'J Speirs Farms Ltd',
'Postcode': 'LU1 4LH',
'Region': 'East Midlands',
'Telephone': '01582450962',
'Year Joined': '2006'}
You can define a class for the items you want to save and import the class to your spider. Then you can directly save the items.
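For example, a minimal sketch of such an item class (the field names are assumptions based on the labels above, not part of the original answer):

import scrapy

class MemberItem(scrapy.Item):
    name = scrapy.Field()
    herd_prefix = scrapy.Field()
    membership_no = scrapy.Field()
    membership_type = scrapy.Field()
    year_joined = scrapy.Field()
    address = scrapy.Field()
    postcode = scrapy.Field()
    region = scrapy.Field()
    telephone = scrapy.Field()
    email = scrapy.Field()
    website = scrapy.Field()

The spider's parse method could then yield a MemberItem populated from the parsed dict instead of a plain dict.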
I want to get the correct result from my condition. Here is my condition.
This is my database, and here is my code.
My defined text:
# define
country = ('america', 'indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')

c1 = "Country"
c2 = "City"
c3 = "<blank>"
And the condition ("text" here is passed in from a database select, looping with a for):
if str(text) in str(country):
    stat = c1
elif str(text) in str(city):
    stat = c2
else:
    stat = c3
And I get the wrong result from the condition, like this.
Any solution to make this code work? It works when there is just one text to check with "in", but this case defines many text conditions.
If I understood you correctly, you need:
text = "i was born in paris"
country = ('america','indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')
def check(text):
    for i in country:
        if i in text.lower():
            return "Country"
    for i in city:
        if i in text.lower():
            return "City"
    return "<blank>"
print(check(text))
print(check("I dnt like vacation in america"))
Output:
City
Country
You could be better off using dictionaries. I assume that text is a list:
dict1 = {
    "countries": ['america', 'indonesia', 'england', 'france'],
    "city": ['new york', 'jakarta', 'london', 'paris']
}

for x in text:
    for y in dict1['countries']:
        if y in x:
            print('country: ' + x)
    for z in dict1['city']:
        if z in x:
            print('city: ' + x)
First of all, check what you are testing.
>>> country = ('america','indonesia', 'england', 'france')
>>> city = ('new york', 'jakarta', 'london', 'paris')
>>>
>>> c1="Country"
>>> c2="City"
>>> c3="<blank>"
Same as your setup. So, you are testing for the presence of a substring.
>>> str(country)
"('america', 'indonesia', 'england', 'france')"
Let's see if we can find a country.
>>> 'america' in str(country)
True
Yes! Unfortunately a simple string test such as the one above, besides involving an unnecessary conversion of the list to a string, also finds things that aren't countries.
>>> "ca', 'in" in str(country)
True
The in test for strings is true if the string to the right contains the substring on the left. The in test for lists is different, however, and is true when the tested list contains the value on the left as an element.
>>> 'america' in country
True
Nice! Have we got rid of the "weird other matches" bug?
>>> "ca', 'in" in country
False
It would appear so. However, using the list inclusion test you need to check every word in the input string rather than the whole string.
>>> "I don't like to vacation in america" in country
False
The above is similar to what you are doing now, but testing list elements rather than the list as a string. This expression generates a list of words in the input.
>>> [word for word in "I don't like to vacation in america".split()]
['I', "don't", 'like', 'to', 'vacation', 'in', 'america']
Note that you may have to be more careful than I have been in splitting the input. In the example above, "america, steve" when split would give ['america,', 'steve'] and neither word would match.
The any function iterates over a sequence of expressions, returning True at the first true member of the sequence (and False if no such element is found). (Here I use a generator expression instead of a list, but the same iterable sequence is generated).
>>> any(word in country for word in "I don't like to vacation in america".split())
True
For extra marks (and this is left as an exercise for the reader) you could write a function that takes two arguments, a sentence and a list of possible matches, and returns True if any of the words in the sentence are present in the list. Then you could use two different calls to that function to handle the countries and the cities.
You could speed things up somewhat by using sets rather than lists, but the principles are the same.
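A possible sketch of that exercise, using a set for the membership test as suggested:

def mentions_any(sentence, candidates):
    # True if any word of the sentence appears in the candidate collection.
    words = sentence.lower().split()
    candidates = set(candidates)
    return any(word in candidates for word in words)

country = ('america', 'indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')

print(mentions_any("I don't like to vacation in america", country))  # True
print(mentions_any("I was born in london", city))                    # True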
Suppose I have a list of strings
"Measles outbreak in the U.S worse than ever"
"MMR vaccination rates in California at all time low"
"I don't live in California"
and two lists of keywords
location = ['California', 'West Coast', 'Los Angeles']
disease = ['Measles', 'MMR', 'Pertussis']
How can I pick out the strings that contain at least one keyword from both disease and location?
For example, the second string should be picked out, but not the first or last.
Make location and disease sets, split the strings into words, and see whether a word from the split string appears in both sets:
location = {'California', 'West Coast', 'Los Angeles'}
disease = {'Measles', 'MMR', 'Pertussis'}
l = ['West Coast MMR',
     "Measles outbreak in the U.S worse than ever",
     "MMR vaccination rates in California at all time low",
     "I don't live in California"]

import re

r = re.compile("West Coast|Los Angeles|California")

for s in l:
    if r.search(s) and any(word in disease for word in s.split()):
        print(s)

for s in l:
    if r.search(s) and disease.intersection(s.split()):
        print(s)
if location.intersection(spl) and disease.intersection(spl): (where spl = s.split()) will only be True if at least one word from the string appears in both sets. r.search(s) catches the multi-word substrings from location.
Depending on how your actual location list looks, a mix of the set and re approaches might be the fastest: check the set first, then fall back to r.search(s), where the regex is compiled to match only the multi-word substrings.
You may also want to use word boundaries so you don't match Californian etc.:
r = re.compile(r"West Coast|Los Angeles|\bCalifornia\b")
Note the raw string: in a plain string literal, "\b" is a backspace character, not a word boundary. Depending on what other words can appear, you may need to make other adjustments; without knowing your actual data set it is impossible to give a definitive or optimal answer.
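A sketch of the mixed approach described above (the split between single-word and multi-word location keywords is an assumption):

import re

disease = {'Measles', 'MMR', 'Pertussis'}
single_locations = {'California'}
multi_locations = re.compile(r"West Coast|Los Angeles")

l = ["Measles outbreak in the U.S worse than ever",
     "MMR vaccination rates in California at all time low",
     "West Coast MMR"]

for s in l:
    words = set(s.split())
    # Cheap set lookup first; regex only covers the multi-word locations.
    has_location = words & single_locations or multi_locations.search(s)
    if has_location and words & disease:
        print(s)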
import re

strings = [
    "Measles outbreak in the U.S worse than ever.",
    "MMR vaccination rates in Los Angeles at all time low.",
    "I don't live in California.",
    "The West Coast has many cases of Pertussis.",
    "Do Californians even get Measles?",
]

kw_sets = [
    ["California", "West Coast", "Los Angeles"],
    ["Measles", "MMR", "Pertussis"],
]

patterns = ('|'.join(r'\b{}\b'.format(re.escape(kw)) for kw in kw_set)
            for kw_set in kw_sets)
compiled_patterns = [re.compile(pattern) for pattern in patterns]

filterfunc = lambda s: all(cp.search(s) for cp in compiled_patterns)
filtered_strings = list(filter(filterfunc, strings))

print(*filtered_strings, sep='\n')
This is a regular expression solution that targets Python 3.x.
Output:
MMR vaccination rates in Los Angeles at all time low.
The West Coast has many cases of Pertussis.
Assuming strings is a list containing the defined strings, then
location = set(['California', 'West Coast', 'Los Angeles'])
disease = set(['Measles', 'MMR', 'Pertussis'])
res = [s for s in strings if set(s.split()) & location and set(s.split()) & disease]
print(res)
will do as needed. Note that the set(s.split()) operation is done twice and should be factored out.
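For instance, factored out (a sketch of that note, reusing the definitions above):

res = []
for s in strings:
    words = set(s.split())
    if words & location and words & disease:
        res.append(s)
print(res)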
Handling multi-word keywords correctly and more than two keyword sets nicely:
strings = ("Measles outbreak in the U.S worse than ever",
           "MMR vaccination rates in California at all time low",
           "I don't live in California",
           "MMR vaccination rates in California at all time",
           "low West Coast Measles")

kwsets = (['California', 'West Coast', 'Los Angeles'],
          ['Measles', 'MMR', 'Pertussis'],
          ['low', 'prices', 'today'])

for string in strings:
    if all(any(kw in string for kw in kws) for kws in kwsets):
        print(string)
If there were no multi-word keywords, this would also work:
strings = ("Measles outbreak in the U.S worse than ever",
           "MMR vaccination rates in California at all time low",
           "I don't live in California",
           "MMR vaccination rates in California at all time")

kwsets = ({'California', 'West Coast', 'Los Angeles'},
          {'Measles', 'MMR', 'Pertussis'},
          {'low', 'prices', 'today'})

for string in strings:
    if all(map(set(string.split()).intersection, kwsets)):
        print(string)