Python split and join confusion

Problem: Write a function that will return a string of country codes from an argument that is a string of prices (containing dollar amounts following the country codes). Your function will take as an argument a string of prices like the following: "US$40, AU$89, JP$200". In this example, the function would return the string "US, AU, JP".
Hint: You may want to break the original string into a list, manipulate the individual elements, then make it into a string again.
Input:
def get_country_codes(prices):
    values = ""
    price_codes = prices.split(',')
    for price_code in price_codes:
        values = values + price_code.strip()[0:2]
    return values

list1 = [ , ]
print(get_country_codes("NZ$300, KR$1200, DK$5").join(list1))

Since some existing currencies have a three letters symbol, such as CAD, we have to expect an unknown number of characters before any amount.
def get_countries(s):
    countries = [c.split('$')[0] for c in s.split(',')]
    return ','.join(countries)
s = "US$40, AU$89, JP$200, CAD$15"
print(get_countries(s))
Output
US, AU, JP, CAD
Alternatively, you can use re to simply remove anything following the country code in your string.
import re
s = "US$40, AU$89, JP$200, CAD$15"
countries = re.sub(r'\W\d+', '', s)
print(countries)

Try this:
codes="NZ$300, KR$1200, DK$5"
get_country_codes=lambda c: ', '.join(e[0:2] for e in c.split(", "))
get_country_codes(codes)
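If you prefer not to split the string at all, a regex sketch along these lines should also work for variable-length codes (`extract_codes` is a hypothetical helper name, not from the original answers):

```python
import re

# Hypothetical helper: grab the alphabetic code that precedes each '$'
def extract_codes(prices):
    return ', '.join(re.findall(r'([A-Z]+)\$', prices))

print(extract_codes("US$40, AU$89, JP$200, CAD$15"))  # US, AU, JP, CAD
```

Because the capturing group stops at the `$`, two- and three-letter codes are handled the same way.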


Convert String to Float (Currency) but has more than one decimal points [duplicate]

This question already has answers here:
Why not use Double or Float to represent currency?
I was web scraping the Foot Locker website, and when I get the price it sometimes contains more than one decimal point.
I want to round it to 2 digits after the decimal point. How can I do that?
My price list:
90.00
170.00
198.00
137.99137.99158.00
When I try the float function I get an error. Can someone please help? :)
print(float(Price))
90.0
170.0
198.0
ValueError: could not convert string to float: '137.99137.99158.00'
I also want to round it off to two decimal places, so 90.0 will become 90.00 :)
After a second look at your prices, it seems to me that the problem with the multiple decimal points is due to missing spaces between the prices. Maybe the web scraper needs a fix? If you want to go on with what you have, you can do it with regular expressions, but my fix only works if prices are always given with two decimal digits.
import re
list_prices = [ '90.00', '170.00', '198.00', '137.99137.99158.00' ]
pattern_price = re.compile(r'[0-9]+\.[0-9]{2}')
list_prices_clean = pattern_price.findall('\n'.join(list_prices))
print(list_prices_clean)
# ['90.00', '170.00', '198.00', '137.99', '137.99', '158.00']
You're getting that error because the input 137.99137.99158.00 is not a valid input for the float function. I have written the below function to clean your inputs.
def clean_invalid_number(num):
    split_num = num.split('.')
    num_len = len(split_num)
    if len(split_num) > 1:
        temp = split_num[0] + '.'
        for i in range(1, num_len):
            temp += split_num[i]
        return temp
    else:
        return num
To explain the above: I used the split function, which returns a list. If the list length is greater than 1, then there is more than one full stop, which means the data needs to be cleaned. (The list does not contain the character you split on.)
As for returning 2 decimal points simply use
Price = round(Price,2)
Returning 90.00 instead of 90.0 does not make sense if you are casting to float.
Here is the full code as a demo:
prices = ['90.00', '170.00', '198.00', '137.99137.99158.00']
prices = [round(float(clean_invalid_number(p)),2 ) for p in prices]
print(prices)
[90.0, 170.0, 198.0, 137.99]
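To make the last point concrete: a float has no notion of trailing zeros, so round() can never give you 90.00; if you need that, format the number as a string instead. A quick sketch:

```python
price = 90.0

print(round(price, 2))  # 90.0 -- still a float, the trailing zero is not kept
print(f'{price:.2f}')   # 90.00 -- formatted as a string with two decimals
```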
The idea: replace the first dot with a temporary delimiter,
delete all other dots,
replace the temporary delimiter with a dot,
round,
and print with two decimals,
like this:
list_prices = ['90.00', '170.00', '198.00', '137.99137.99158.00']

def clean_price(price, sep='.'):
    price = str(price)
    price = price.replace(sep, 'DOT', 1)
    price = price.replace(sep, '')
    price = price.replace('DOT', '.')
    rounded = round(float(price), 2)
    return f'{rounded:.2f}'

list_prices_clean = [clean_price(price) for price in list_prices]
print(list_prices_clean)
# ['90.00', '170.00', '198.00', '137.99']
EDIT:
In case you mean rounding after the last decimal point:
def clean_price(price, sep='.'):
    price = str(price)
    num_seps = price.count(sep)
    price = price.replace(sep, '', num_seps - 1)
    rounded = round(float(price), 2)
    return f'{rounded:.2f}'

list_prices_clean = [clean_price(price) for price in list_prices]
print(list_prices_clean)
# ['90.00', '170.00', '198.00', '1379913799158.00']
No need to write custom methods; use regular expressions (regex) to extract patterns from strings. Your problem is that the long string (137.99137.99158.00) is 3 prices without spaces in between. The regex "[0-9]+\.[0-9][0-9]" finds all patterns with one or more digits before a "." and two digits after the ".".
import re
reg = r"[0-9]+\.[0-9]{0,2}"
test = "137.99137.99158.00"
p = re.compile(reg)
result = p.search(test)
result.group(0)
Output:
137.99
Short explanation:
'[0-9]' matches a digit
'+' means "one or more"
'\.' matches a literal dot
Regex seems to be quite weird at the start, but it is an essential skill. Especially when you want to mine text.
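Using findall instead of search, the same pattern recovers all three prices hidden in the concatenated string:

```python
import re

# findall returns every non-overlapping match, not just the first one
prices = re.findall(r"[0-9]+\.[0-9]{2}", "137.99137.99158.00")
print(prices)  # ['137.99', '137.99', '158.00']
```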
OK, I have finally found a solution to my problem. Thank you everyone for helping out as well.
def Price(s):
    try:
        P = s.find("div", class_="ProductPrice").text.replace("$", "").strip().split("to")[1].split(".")
        return round(float(".".join(P[0:2])), 2)
    except IndexError:
        P = s.find("div", class_="ProductPrice").text.replace("$", "").strip().split("to")[0].split(".")
        return float(".".join(P[0:2]))

Function to extract company register number from text string using Regex

I have a function which extracts the company register number (German: Handelsregisternummer) from a given text. Although my regex for this particular problem matches the correct format (please see the demo), I cannot extract the correct company register number.
I want to extract HRB 142663 B but I get HRB 142663.
Most numbers are in the format HRB 123456 but sometimes there is the letter B attached to the end.
import re

def get_handelsregisternummer(string, keyword):
    # https://regex101.com/r/k6AGmq/10
    reg_1 = fr'\b{keyword}[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*)(?: B)?'
    match = re.compile(reg_1)
    handelsregisternummer = match.findall(string)  # list of matched words
    if handelsregisternummer:  # not empty
        return handelsregisternummer[0]
    else:  # no match found
        handelsregisternummer = ""
        return handelsregisternummer
Example text scraped from a website. Line breaks cause words to run together:
text_impressum = """"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""
Apply function:
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    handelsregisternummer = get_handelsregisternummer(text_impressum, keyword=keyword)
    if handelsregisternummer:  # if a match was found, then do...
        handelsregisternummer = keyword + " " + handelsregisternummer
        break
if not handelsregisternummer:  # if no match was found
    handelsregisternummer = 'not specified'
handelsregisternummer_dict = {'handelsregisternummer': handelsregisternummer}
Afterwards I get:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663'}
But I want this:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663 B'}
You need to use two capturing groups in the regex to capture the keyword and the number, and just match the rest:
reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
#            ^ group 1: the keyword                        ^ group 2: the number, incl. the optional " B"
Then, you need to join all the capturing groups matched and returned by findall:
    if handelsregisternummer:  # if a match was found, then do...
        handelsregisternummer = " ".join(handelsregisternummer)
        break
See the Python demo.
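Putting the two-group regex into the original function, a minimal end-to-end sketch (using the sample text from the question) looks like this:

```python
import re

def get_handelsregisternummer(string, keyword):
    # two capturing groups: the keyword and the number (incl. the optional " B")
    reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
    matches = re.findall(reg_1, string)
    return " ".join(matches[0]) if matches else ""

text_impressum = "Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"
print(get_handelsregisternummer(text_impressum, "HRB"))  # HRB 142663 B
```

With two groups, findall returns a list of tuples such as ('HRB', '142663 B'), so joining the first tuple with a space yields the full identifier.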

String format - French accents

I have an issue with scraping some string data from Wikipedia. Here is my code:
import scrapy
import json
class communes_spider(scrapy.Spider):
    name = "city"
    start_urls = ['https://fr.wikipedia.org/wiki/Liste_des_communes_de_Belgique_par_population']

    def parse(self, response):
        for city in response.css('table.wikitable td a::text').getall():
            if city == '2':
                pass
            elif city == '3':
                pass
            else:
                yield {
                    'cities': city + ', BE'
                }
The issue is that the strings are in French and some city names contain "è" or "é". When I export them to a JSON file, a word like "Liège" is exported as "Li\u00e8ge". How can I turn those strings into French letters?
You don't need to convert them into French.
They are one and the same.
You can check them in ipython as follows
In [1]: l2 = 'Li\u00e8ge'
In [2]: l2
Out[2]: 'Liège'
In [3]: print(l2)
Liège
A character is the smallest possible component of a text. 'A','B','C', etc are all different characters. So are ‘È’ and ‘Í’. Characters are abstractions and vary depending on the language or context you are talking about.
The Unicode standard describes how characters are represented by code points.
A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4810 decimal).
The Unicode standard contains a lot of tables listing characters and their corresponding code points.
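For instance, 'è' is the code point U+00E8, and the escape '\u00e8' denotes exactly the same character:

```python
c = 'è'
print(hex(ord(c)))    # 0xe8 -- the code point of 'è'
print('\u00e8' == c)  # True -- the escape and the literal are the same character
print('Li\u00e8ge')   # Liège
```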
In [14]: a = '\xe8'
In [15]: b = 'è'
In [16]: a == b
Out[16]: True
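Note that the \u00e8 you see in the exported file is just JSON's ASCII escape for the same character. If you want literal accented letters in the output, disable ASCII escaping (in Scrapy, the FEED_EXPORT_ENCODING = 'utf-8' setting does the equivalent for feed exports). A sketch with the standard json module:

```python
import json

city = {'cities': 'Liège, BE'}
print(json.dumps(city))                      # {"cities": "Li\u00e8ge, BE"}
print(json.dumps(city, ensure_ascii=False))  # {"cities": "Liège, BE"}
```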

Using f-strings to insert characters or symbols

I have two variables that store two numbers in total.
I want to combine those numbers and separate them with a comma. I read that I can use {variablename:+} to insert a plus, spaces, or a zero, but the comma doesn't work.
x = 42
y = 73
print(f'the number is {x:}{y:,}')
Here is my weird solution: I'm adding a + and then replacing the + with a comma. Is there a more direct way?
x = 42
y = 73
print(f'the number is {x:}{y:+}'.replace("+", ","))
Let's say I have names and domain names and I want to build a list of email addresses, so I want to fuse the two names with an # symbol in the middle and a .com at the end.
That's just one example off the top of my head.
x = "John"
y = "gmail"
z = ".com"
print(f'the email is {x}{y:+}{z}'.replace(",", "#"))
results in:
print(f'the email is {x}{y:+}{z}'.replace(",", "#"))
ValueError: Sign not allowed in string format specifier
You are over-complicating things.
Since only what's between { and } is going to be evaluated, you can simply do
print(f'the number is {x},{y}') for the first example, and
print(f'the email is {x}#{y}{z}') for the second.
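The confusion comes from the format spec mini-language: inside the braces, :+ is a sign option and :, is a thousands separator, not a way to insert a literal comma. For example:

```python
x = 42
y = 73
print(f'{x},{y}')      # 42,73 -- a literal comma, written outside the braces
print(f'{1234567:,}')  # 1,234,567 -- the :, spec groups digits by thousands
print(f'{y:+}')        # +73 -- the :+ spec forces a sign
```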
When you put something into {} inside an f-string, it's actually being evaluated. So everything which shouldn't be evaluated must go outside the {}.
Some Examples:
x = 42
y = 73
print(f'Numbers are: {x}, {y}') # will print: 'Numbers are: 42, 73'
print(f'Sum of numbers: {x+y}') # will print: 'Sum of numbers: 115'
You can even do something like:
def compose_email(user_name, domain):
    return f'{user_name}#{domain}'
user_name = 'user'
domain = 'gmail.com'
print(f'email is: {compose_email(user_name, domain)}')
>>email is: user#gmail.com
For more examples see:
Nested f-strings

Python Pandas Column and Fuzzy Match + Replace

Intro
Hello, I'm working on a project that requires me to replace dictionary keys within a pandas column of text with values - but with potential misspellings. Specifically I am matching names within a pandas column of text and replacing them with "First Name". For example, I would be replacing "tommy" with "First Name".
However, I realize there's the issue of misspelled names and text within the column of strings that won't be replaced by my dictionary. For example 'tommmmy" has extra m's and is not a first name within my dictionary.
#Create df
d = {'message' : pd.Series(['awesome', 'my name is tommmy , please help with...', 'hi tommy , we understand your quest...'])}
names = ["tommy", "zelda", "marcon"]
#create dict
namesdict = {r'(^|\s){}($|\s)'.format(el): r'\1FirstName\2' for el in names}
#replace
d['message'].replace(namesdict, regex = True)
#output
Out:
0 awesome
1 my name is tommmy , please help with...
2 hi FirstName , we understand your quest...
dtype: object
so "tommmy" doesn't match to "tommy" in the -> I need to deal with misspellings. I thought about trying to do this prior to the actual dictionary key and value replacement, like scan through the pandas data frame and replace the words within the column of strings ("messages") with the appropriate name. I've seen a similar approach using an index on specific strings like this one
but how do you match and replace words within the sentences within a pandas df, using a list of correct spelling? Can I do this within the df.series replace argument? Should I stick with a regex string replace?*
Any suggestions appreciated.
Update , trying Yannis's answer
I'm trying Yannis's answer, but I need to use a list from an outside source for matching, specifically the US census of first names. It's not matching on the whole names with the string I download.
d = {'message' : pd.Series(['awesome', 'my name is tommy , please help with...', 'hi tommy , we understand your quest...'])}
import requests
import re
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
#US Census first names (5000 +)
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
#turn list to string, force lower case
fnstring = ', '.join('"{0}"'.format(w) for w in firstnamelist )
fnstring = ','.join(firstnamelist)
fnstring = (fnstring.lower())
##turn to list, prepare it so it matches the name preceded by either the beginning of the string or whitespace.
names = [x.strip() for x in fnstring.split(',')]
#import jellyfish
import difflib
def best_match(tokens, names):
    for i, t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x, y):
    names = y  # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x
d["message"].apply(lambda x: fuzzy_replace(x, names))
Results in:
Out:
0 FirstName
1 FirstName name is tommy , please help with...
2 FirstName tommy , we understand your quest...
But if I use a smaller list like this it works:
names = ["tommy", "caitlyn", "kat", "al", "hope"]
d["message"].apply(lambda x: fuzzy_replace(x, names))
Is it something with the longer list of names that's causing a problem?
Edit:
Changed my solution to use difflib. The core idea is to tokenize your input text and match each token against a list of names. If best_match finds a match then it reports the position (and the best matching string), so then you can replace the token with "FirstName" or anything you want. See the complete example below:
import pandas as pd
import difflib
df = pd.DataFrame(data=[(0, "my name is tommmy , please help with"), (1, "hi FirstName , we understand your quest")], columns=["A", "message"])

def best_match(tokens, names):
    for i, t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x):
    names = ["tommy", "john"]  # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, names)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x
df.message.apply(lambda x: fuzzy_replace(x))
And the output you should get is the following
0 my name is FirstName , please help with
1 hi FirstName , we understand your quest
Name: message, dtype: object
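The fuzzy matching itself is done by difflib.get_close_matches, which returns the candidates whose similarity ratio exceeds a cutoff (0.6 by default); that is why the misspelled "tommmy" still resolves to "tommy":

```python
import difflib

names = ["tommy", "zelda", "marcon"]
print(difflib.get_close_matches("tommmy", names, n=1))   # ['tommy']
print(difflib.get_close_matches("awesome", names, n=1))  # [] -- no name is close enough
```

Raising the cutoff argument makes matching stricter; with a very large name list (like the census file) a higher cutoff can help avoid spurious matches on ordinary words.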
Edit 2
After the discussion, I decided to have another go, using NLTK for part-of-speech tagging and running the fuzzy matching only for the NNP tags (proper nouns) against the name list. The problem is that sometimes the tagger doesn't get the tag right; e.g. "Hi" might also be tagged as a proper noun. However, if the list of names is lowercased, then get_close_matches doesn't match "Hi" against a name but matches all other names. I recommend not lowercasing df["message"], to increase the chances that NLTK tags the names properly. One can also play with Stanford NER, but nothing will work 100%. Here is the code:
import pandas as pd
import difflib
from nltk import pos_tag, wordpunct_tokenize
import requests
import re
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
# US Census first names (5000+)
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
# force lower case (simplified things here)
names = [w.lower() for w in firstnamelist]

df = pd.DataFrame(data=[(0, "My name is Tommmy, please help with"),
                        (1, "Hi Tommy , we understand your question"),
                        (2, "I don't talk to Johhn any longer"),
                        (3, 'Michale says this is stupid')
                        ], columns=["A", "message"])

def match_names(token, tag):
    print(token, tag)
    if tag == "NNP":
        best_match = difflib.get_close_matches(token, names, n=1)
        if len(best_match) > 0:
            return "FirstName"  # or best_match[0] if you want to return the name found
        else:
            return token
    else:
        return token

def fuzzy_replace(x):
    tokens = wordpunct_tokenize(x)
    pos_tokens = pos_tag(tokens)
    # every token is a tuple (token, tag)
    result = [match_names(token, tag) for token, tag in pos_tokens]
    x = u" ".join(result)
    return x

df['message'].apply(lambda x: fuzzy_replace(x))
And I get in the output:
0 My name is FirstName , please help with
1 Hi FirstName , we understand your question
2 I don ' t talk to FirstName any longer
3 FirstName says this is stupid
Name: message, dtype: object
