Extract dollar amounts at multiple places in a sentence in Python

I have a sentence, shown below, from which I need to extract the dollar amounts (with their commas) so I can populate a dictionary. I have tried a few options but couldn't succeed. Please guide.
For par\n
$3,500 single /$7,000 group
For nonpar\n
$7,000 single /$14,000 group
Expected output is :
"rates":{
"single" : "$3,500 (par) / $7,000 (nonpar)",
"group" : "$7,000 (par) / $14,000 (nonpar)"
}
(\n here marks a new line)
Amounts might have decimal points, and commas after every three digits, as below.
I was able to write a regex for the amount alone, but I can't find the right approach to extend it to my requirement.
re.search(r'^\$\d{1,3}(,\d{3})*(\.\d+)?$','$1,212,500.23')
Edit1:
Went ahead with one more step:
re.findall(r'\$\d{1,3}(?:\,\d{3})*(?:\.\d{2})?', str)
This gets all the values in a list, but I need a strategy to know which value corresponds to what.
Edit2:
re.findall(r'For par\W*(\$\d{1,3}(?:\,\d{3})*(?:\.\d{2})?\W*single)\s*\W*(\$\d{1,3}(?:\,\d{3})*(?:\.\d{2})?\W*group)', str)
Please help me to refine this and make it more generic.
Thanks
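A possible way to make this generic is to capture each section label ('par'/'nonpar') together with its two amounts in one pass. A minimal sketch, assuming the labelled "For par"/"For nonpar" layout shown above:

```python
import re

text = (
    "For par\n"
    "$3,500 single /$7,000 group\n"
    "For nonpar\n"
    "$7,000 single /$14,000 group"
)

# reusable sub-pattern for a dollar amount with optional commas and decimals
amount = r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?'

# each match is a tuple: (label, single amount, group amount)
pairs = re.findall(
    rf'For\s+(\w+)\s*({amount})\s*single\s*/\s*({amount})\s*group', text)

rates = {'single': [], 'group': []}
for label, single, group in pairs:
    rates['single'].append(f'{single} ({label})')
    rates['group'].append(f'{group} ({label})')
rates = {k: ' / '.join(v) for k, v in rates.items()}
print(rates)
# {'single': '$3,500 (par) / $7,000 (nonpar)', 'group': '$7,000 (par) / $14,000 (nonpar)'}
```

Because the label is captured rather than hard-coded, this extends to any number of "For <label>" sections with the same single/group line format.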

Related

How to Filter Rows in a DataFrame Based on a Specific Number of Characters and Numbers

New Python user here, so please pardon my ignorance if my approach seems completely off.
I am having trouble filtering rows of a column based on their character/number format.
Here's an example of the DataFrame and Series
df = {'a':[1,2,4,5,6], 'b':[7, 8, 9, 10], 'target':['ABC1234', 'ABC123', '123ABC', '7KZA23']}
The column I am looking to filter is the "target" column based on their character/number combos and I am essentially trying to make a dict like below
{'ABC1234': counts_of_format,
'ABC123': counts_of_format,
'123ABC': counts_of_format,
'any_other_format': counts_of_format}
Here's my progress so far:
col = df['target'].astype('string')
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
matches = re.findall(abc1234_pat, col)
I keep getting this error:
TypeError: expected string or bytes-like object
I've double-checked the dtype and it comes back as string. I've researched the TypeError, and the only solutions I can find involve converting it to a string.
Any insight or suggestion on what I might be doing wrong, or if this is simply the wrong approach to this problem, will be greatly appreciated!
Thanks in advance!
I am trying to create a dict that returns how many times the different character/number combos occur. For example, how many time does 3 characters followed by 4 numbers occur and so on.
(Your problem would have been understood sooner and more easily had you stated this in the question post itself rather than in a comment.)
By characters, you mean letters; by numbers, you mean digits.
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
Since you want to count occurrences of all character/number combos, this approach of using one concrete pattern would not get very far. I suggest transforming the targets to a canonical form which serves as the key of your desired dict, e.g. substituting every letter with C and every digit with N (using your terms).
Of the many ways to tackle this, one is using str.translate together with a class which does the said transformation.
class classify():
    def __getitem__(self, key):
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)

occ = df.target.str.translate(classify()).value_counts()  # .to_dict()
Note that this will purposely raise an exception if target contains non-alphanumeric characters.
You can convert the resulting Series to a dict with .to_dict() if you like.
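Putting the pieces together, here is a runnable sketch on the sample data from the question (the class name is capitalized here, and the failure on non-alphanumeric characters is made explicit rather than relying on `ord(None)`):

```python
import pandas as pd

df = pd.DataFrame({'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']})

class Classify:
    """Translation table: every letter -> 'C', every digit -> 'N'."""
    def __getitem__(self, key):
        ch = chr(key)
        if ch.isalpha():
            return ord('C')
        if ch.isdigit():
            return ord('N')
        raise TypeError(f'non-alphanumeric character {ch!r} in target')

counts = df['target'].str.translate(Classify()).value_counts().to_dict()
print(counts)
# {'CCCNNNN': 1, 'CCCNNN': 1, 'NNNCCC': 1, 'NCCCNN': 1} (key order may differ)
```

Each distinct canonical form (e.g. `CCCNNNN` for three letters followed by four digits) becomes one key, so the dict covers every combo present without enumerating patterns up front.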

Converting String Data Values with two commas from csv or txt files into float in python

I just received a dataset from an HPLC run, and the problem I ran into is that the txt data from the software contains values with two dot separators, for instance "31.456.234 min". Since I want to plot the data with matplotlib and numpy, I can only use the values that are not written with two dots. Every value smaller than 1 is represented with a single dot, like "0.765298", while the rest of the values, as mentioned, have two.
I tried to solve this with the .split() and .find() methods; however, that is rather inconvenient, and I was wondering whether there is a more elegant way, since in the end I again need x and y values for plotting.
Many thanks for any helping answers in advance.
This is not very clear regarding commas and dots.
For the decimal number you say that you have a comma, but you show a dot: 0.765298
I guess you cannot use a dot as both the thousands separator and the decimal point...
If you have English notation, I guess the numbers are:
"31,456,234 min" and "0.765298"
In this case you can use the replace method :
output = "31,456,234"
number = float(output.replace(',',''))
# result : 31456234.0
EDIT
I am not sure I have understood what you are looking for or the exact format of the numbers...
However, if the second dot in 31.456.234 is unwanted, here is a solution:
def conv(n):
    i = n.find('.')
    return float(n[:i] + '.' + n[i:].replace('.', ''))
x = '31.456.234'
y = '0.765298'
print(conv(x)) # 31.456234
print(conv(y)) # 0.765298
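Since the end goal is x/y arrays for plotting, the same function can simply be mapped over the whole column of strings. A small sketch with invented sample values (it assumes every value contains at least one dot, as in the data described):

```python
def conv(n):
    # keep the first dot, drop any later ones
    i = n.find('.')
    return float(n[:i] + '.' + n[i:].replace('.', ''))

raw = ['0.765298', '31.456.234', '2.100.500']   # made-up sample column
y = [conv(v) for v in raw]
print(y)  # [0.765298, 31.456234, 2.1005]
```

The resulting list of floats can be passed directly to matplotlib, or wrapped in `numpy.array` first.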

Search for specific strings in rows of dataframe and if strings exist then mark in another column in python

I have a dataframe with two columns
Current Dataframe
SE# Response COVID Words Mentioned
123456 As a merchant I appreciated your efforts in pricing with Covid-19
456789 you guys suck and didn't handle our relationship during this pandemic
347896 I love your company
Desired Dataframe
SE# Response COVID Words Mentioned
123456 As a merchant I appreciated your efforts in pricing with Covid-19 Y
456789 you guys suck and didn't handle our relationship during this pandemic Y
347896 I love your company N
terms = ['virus', 'Covid-19','covid19','flu','covid','corona','Corona','COVID-19','co-vid19','Coronavirus','Corona Virus','COVID','purell','pandemic','epidemic','coronaviru','China','Chinese','chinese','crona','korona']
These are the list of strings that need to be checked in each response. The goal is to be able to add or remove elements from the list of terms.
The above are just example records. I have a list of strings related to covid-19 that need to be searched for in each response. If any of the strings exist, mark "Y" in the 'COVID Words Mentioned' column, and "N" if none of the words show up.
How do I code this in python?
Much appreciated!
For each search term, set up a result vector:
d = {}
for term in terms:
    d[term] = df['Response'].str.contains(term, na=False)
I pass na=False because otherwise, Pandas will fill NA in cases where the string column is itself NA. We don't want that behaviour. The complexity of the operation increases rapidly with the number of search terms. Also consider changing this function if you want to match whole words, because contains matches sub-strings.
Regardless, take the results and reduce them with bit-wise or, since a row should count if it mentions any of the terms. You need two imports:
from functools import reduce
from operator import or_
mask = reduce(or_, d.values())
df[mask]
The mask is True for exactly the rows that contain at least one of the words; df[mask] selects them. To build the desired column, map the mask from {True, False} to {'Y', 'N'} using np.where.
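For completeness, a compact alternative sketch that builds the Y/N column with a single combined regex (the frame is reconstructed from the question's examples with a shortened term list; `case=False` is an extra assumption so that 'covid'/'Covid' both hit):

```python
import re
import pandas as pd

df = pd.DataFrame({
    'SE#': [123456, 456789, 347896],
    'Response': [
        'As a merchant I appreciated your efforts in pricing with Covid-19',
        "you guys suck and didn't handle our relationship during this pandemic",
        'I love your company',
    ],
})
terms = ['virus', 'Covid-19', 'covid', 'corona', 'pandemic', 'epidemic']

# escape each term so it is matched literally, then OR them all together
pattern = '|'.join(re.escape(t) for t in terms)
mask = df['Response'].str.contains(pattern, case=False, na=False)
df['COVID Words Mentioned'] = mask.map({True: 'Y', False: 'N'})
print(df['COVID Words Mentioned'].tolist())  # ['Y', 'Y', 'N']
```

Adding or removing terms then only means editing the `terms` list; the pattern is rebuilt from it each time.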

Replacing certain values in a column with messy data

I have a very lengthy dataset, that is stored as a dataframe. The column I am looking at is called "Country". This column has quite a few countries within it. The issue is that I want to change various values to "USA".
The values I am trying to change are
U.S
United States
United states
etc.
There are too many variations and typos (more than 100) for me to go through. Is there any simpler way to change these values? Since there are other countries in the dataset, I cannot just change all the values to USA
One thing you can do is look at the first letter of each word: in all of these instances the first word starts with U, and the second word (if you split the whole string) starts with S. Here, I am using the regular expressions package, which is typically used when working with text.
import re
Split_parts = [re.split(r'[^A-Za-z]', i) for i in df['country']]
The above line splits each string on any non-alphabetic character (e.g. period, comma, space, etc.).
After splitting, you can create a for loop that generates True/False elements depending on whether the first characters are U and S respectively.
value = []
for i in Split_parts:
    # guard against entries that split into fewer than two words
    if len(i) > 1 and i[0][:1] in ('u', 'U') and i[1][:1] in ('s', 'S'):
        value.append(True)
    else:
        value.append(False)
After that you can replace the string with what you need (i.e. USA):
for i in range(len(value)):
    if value[i]:
        df.loc[i, 'country'] = 'USA'
The only country in the world whose words start with U and S respectively is the United States. The solution here is not something that can be used for every problem you may face; for each one you have to look at the differences.
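An alternative sketch that avoids the manual loop: normalize each value (drop punctuation, lowercase), then match one pattern covering "US"/"USA"/"United States"-style spellings. The sample values and the exact pattern are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['U.S', 'United States', 'United states', 'Germany', 'usa']})

# strip non-letter characters (keeping spaces), then lowercase
cleaned = df['Country'].str.replace(r'[^A-Za-z ]', '', regex=True).str.strip().str.lower()
# 'us', 'usa', 'united states', etc. all match this pattern
mask = cleaned.str.match(r'^u(nited)?\s*s(tates)?a?$')
df.loc[mask, 'Country'] = 'USA'
print(df['Country'].tolist())  # ['USA', 'USA', 'USA', 'Germany', 'USA']
```

Using `df.loc[mask, 'Country']` assigns in one vectorized step and avoids the chained-assignment pitfall of `df['Country'][i] = ...`.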

Extract a specific number from a string

I have this string 553943040 21% 50.83MB/s 0:00:39
The length of the numbers can vary
The percent can contain one or two numbers
The spaces between the start of the string and the first number may vary
I need to extract the first number, in this case 553943040
I was thinking that the method could be to:
1) Replace the percent with a separator. something like:
string=string.replace("..%","|") # where the "." represent any character, even an space.
2) Get the first part of the new string by cutting everything after the separator.
string=string.split("|")
string=string[0]
3) Remove the spaces.
string=string.strip()
I know that stages 2 and 3 work, but I'm stuck on the first. Also, if there is a better method of getting it, it would be great to know!
Too much work.
>>> '553943040 21% 50.83MB/s 0:00:39'.split()[0]
'553943040'
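If you prefer the regex route sketched in the question, a single anchored pattern also works, with no replace/split stages at all:

```python
import re

line = '   553943040 21% 50.83MB/s 0:00:39'
m = re.match(r'\s*(\d+)', line)      # skip leading spaces, capture first digit run
first = m.group(1) if m else None
print(first)  # 553943040
```

Both approaches tolerate a varying number of leading spaces; `split()` with no arguments already discards any run of whitespace.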
