I have a set of files composed as follows:
Product: Name
Description: description of product
I want to extract only the name and the description's content, without the 'Product:' and 'Description:' labels. For this I do:
div = re.split(r'Product:\s+|Description:\s+', contentOfFile)
The problem is that I get a list of 3 elements instead of 2, with an extra ' ' (space) element at the beginning. I don't know if that leading element is always produced; in this case I just want to get:
["Name","description of product"]
Let's simplify your example. What if we split on pipe instead of your regular expressions?
>>> "|a|b".split('|')
['', 'a', 'b']
If the string starts with a separator, split will add an extra empty element in the returned value. Now in your case the separator is a regular expression, but similarly, your string begins with something that matches that expression, so the first returned element is an empty string.
To address it, you can just skip the first element:
div = re.split(r'Product:\s+|Description:\s+', contentOfFile)[1:]
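Here is a quick check of the slicing approach, as a sketch only. The sample text is an assumption reconstructed from the two-line format in the question. Note that the name keeps its trailing newline, because that newline sits between the two separators, so a strip() is added on top of the slice.

```python
import re

# Hypothetical file content, following the two-line format from the question.
content_of_file = "Product: Name\nDescription: description of product"

parts = re.split(r"Product:\s+|Description:\s+", content_of_file)
# parts[0] is the empty string found before the leading "Product:" separator.
# Drop it, and strip the newline left at the end of the name.
fields = [part.strip() for part in parts[1:]]
print(fields)
```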
You don't need split, use findall:
>>> re.findall(r":\s+(.*)", a)
['Name', 'description of product']
Using this solution, you won't be dependent on the text before the :, so even when you have:
SomeText: Name
BlaBlaBla: description of product
it'll extract Name and description of product. It's good practice to write a general solution to your problem and to think about possible future scenarios.
A general solution through split method without using regex.
>>> x = """Product: Name
Description: description of product"""
>>> [i.split(':')[1].lstrip() for i in x.split('\n')]
['Name', 'description of product']
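One caveat, shown as a sketch: split(':')[1] keeps only the text between the first and second colon, so a value that itself contains a colon would be cut short. Passing a maxsplit of 1 avoids that. The "16:9" description below is made up for illustration.

```python
# Hypothetical input where the description itself contains a colon.
x = """Product: Name
Description: aspect ratio 16:9"""

# Splitting on every colon truncates the second value.
truncated = [line.split(':')[1].lstrip() for line in x.split('\n')]

# Splitting only once keeps everything after the first colon.
values = [line.split(':', 1)[1].lstrip() for line in x.split('\n')]
print(truncated)
print(values)
```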
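I think you can try the strip function instead of split.
It also helps to remove the space.
Here is a small example using lstrip. Note that lstrip removes any leading characters found in its argument, not a literal prefix, so it works here only because neither value starts with one of those characters:
str1 = "Product: Name"
str2 = "Description: description of product"
print(str1.lstrip('Product:, '))
print(str2.lstrip('Description:, '))
and the output is shown below: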
Name
description of product
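A word of caution about this approach, as a sketch: lstrip treats its argument as a set of characters to remove, not as a prefix, so a value that starts with any of those characters gets eaten too. The "Pod" sample is made up to show the failure, and str.partition is just one possible alternative.

```python
# Hypothetical value that starts with characters also found in "Product: ".
line = "Product: Pod"

# lstrip removes every leading character in the set, so the value vanishes too.
stripped = line.lstrip("Product: ")

# partition splits once on the first colon; the value is the last element.
value = line.partition(":")[2].strip()
print(repr(stripped), repr(value))
```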
I want to get names from a website in a list.
soup = bs4.BeautifulSoup(page.text, 'html.parser')
tbl = soup.find('ul', class_='static-top-names part1')
for link in tbl:
    names = link.get_text()
    print(names)
So I'm trying to get some names from a website, and when I apply the code above and iterate over the result, I get the output below.
John
Mark
Steve and so on.
I want to get rid of the number in the text data and just have the names in a list.
All I want is to get these pure names and, hopefully, put them in list form. Any help?
If the format is always #. name, then you can do the following:
name.split('. ', 1)[1]
Use a regular expression for consistency.
import re
s = '1.TEST'
print(re.sub(r'\d+\.', '', s))
will give you TEST only. This removes a number of any length followed by a dot. Basically, it replaces any number followed by a dot with the empty string. The dot has to be escaped, because an unescaped . matches any character.
Iterate over your original list and do the above at the same time using a list comprehension:
new_list = [re.sub(r'\d+\.', '', s) for s in original_list]
This should give you the new list as per your requirement.
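To see why the dot in the pattern matters, here is a small sketch. The sample list is made up, and anchoring with ^ plus consuming trailing spaces with \s* are my own additions rather than part of the answer above.

```python
import re

# Hypothetical scraped entries in the "#. name" format.
original_list = ["1. John", "2. Mark", "10.R2D2"]

# Unescaped dot: matches a digit run plus ANY character, anywhere in the string.
loose = [re.sub(r"\d+.", "", s) for s in original_list]

# Escaped dot, anchored at the start, also consuming the following spaces.
strict = [re.sub(r"^\d+\.\s*", "", s) for s in original_list]
print(loose)
print(strict)
```

The loose pattern leaves a leading space behind and damages names that contain digits, while the strict one touches only the numeric prefix.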
You can simply split on the '.' character, or on a space if there is a space before the name.
So name.split(' ')[-1] or name.split('.')[-1] would give just the name. Then you can append those names to a list.
Something like this:
names = [link.get_text().split(' ')[-1] for link in tbl]
This will give you a list of just the names. I used [-1] as the index because your text contains only two items after splitting on a space. If there are more items, use the appropriate index.
I have a list of strings that looks like this:
Input:
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
I want to remove everything except .isdigit(), and '.|,'. In other words, I would like to split before the first occurrence of any digit with maxsplit=1:
Desired output:
["1234", "4.421,00", "1,000", "432"]
First attempt (two regex replacements):
# Step 1: Remove special characters
prices_list = [re.sub(r'[^\x00-\x7F]+',' ', price).encode("utf-8") for price in prices_list]
# Step 2: Remove [A-Za-z]
prices_list = [re.sub(r'[A-Za-z]','', price).strip() for price in prices_list]
Current output:
['1234', '$ 4.421,00', '1,000', '432'] # $ still in there
Second attempt (still two regex replacements):
prices_list = [''.join(re.split("[A-Za-z]", re.sub(r'[^\x00-\x7F]+','', price).encode("utf-8").strip())) for price in price_list]
This (of course) leads to the same output as my first attempt. Also, this isn't much shorter and looks very ugly. Is there a better (shorter) way to do this?
Third attempt (list comprehension/nested for-loop/no regex):
prices_list = [''.join(token) for token in price for price in price_list if token.isdigit() or token == ',|;']
which yields:
NameError: name 'price' is not defined
How to best parse the above-mentioned price list?
If you need to leave only specific characters, it's better to tell regex to do exactly that thing:
import re
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
prices = list()
for it in prices_list:
    # Keep the first run of digits, dots and commas
    pattern = r"[\d.,]+"
    s = re.search(pattern, it)
    if s:
        prices.append(s.group())
print(prices)
> ['1234', '4.421,00', '1,000', '432']
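As for the NameError in the third attempt: the for clauses of a comprehension are read left to right, so the loop over prices has to come before the loop over characters. A regex-free sketch along those lines, assuming that keeping digits, dots and commas is all that is needed:

```python
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]

# The loop over prices is the outer one; characters are filtered per price.
prices = [
    "".join(ch for ch in price if ch.isdigit() or ch in ".,")
    for price in prices_list
]
print(prices)
```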
The Problem
Correct me if I'm wrong, but essentially you're trying to remove symbols and such and only leave any trailing digits, right?
I would like to split before the first occurrence of any digit
That, I feel, is the simplest way to frame the regex problem that you are trying to solve.
A Solution
# -*- coding: utf-8 -*-
import re
# Match any contiguous non-digit characters
regex = re.compile(r"\D+")
# Input list
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
# Regex mapping
desired_output = list(map(lambda price: regex.split(price, 1)[-1], prices_list))
This gives me ['1234', '4.421,00', '1,000', '432'] as the output.
Explanation
The reason this works is the combination of the lambda and the map function. Basically, map takes a function (here a lambda, a small one-line anonymous function) and applies it to every element in the list. The negative index takes the last element of the list that the split method generates.
Essentially, this works because of the assumption that you don't want any initial non-digits in your output.
Caveats
This code not only keeps . and , in the resulting substring, but all characters in the resulting substring. So, an input string of "$10e7" will be output as '10e7'.
If the input consists only of digits, . and , (for example "10.00"), the split happens at the first separator instead, and you get '00' in the corresponding position of the output list.
If none of these are desired behavior, you would have to get rid of the negative indexing next to the regex.split(price, 1) and do further processing on the resulting list of lists so that you can handle all of those pesky edge cases that arise with using regex.
Either way, I would try and throw more extreme examples at it just to make sure that it's what you need.
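If those caveats are a problem, one possible alternative is to search for the number instead of splitting around it. This is only a sketch: the pattern and the extract_price helper are my own assumptions, not part of the answer above.

```python
import re

# Assumed pattern: a digit followed by any run of digits, dots and commas.
number = re.compile(r"\d[\d.,]*")

def extract_price(text):
    """Return the first number-like run in text, or None if there is none."""
    match = number.search(text)
    return match.group() if match else None

samples = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432", "10.00", "$10e7", "n/a"]
extracted = [extract_price(s) for s in samples]
print(extracted)
```

With this approach "10.00" stays intact, "$10e7" is cut at the first character that is not a digit, dot or comma, and inputs without any digit yield None.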
Let's say we have this data extract:
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
I want to retrieve the line when from = paris, and type = member.
Which means in this example I have only:
1,paris,berlin,member,12
That satisfy these rules. I am trying to do this with Regex only. I am still learning and I could only get this:
^.*(paris).*(member).*$
However, this will give me also the second line where paris is a destination.
The idea I guess is to:
Divide the line by commas.
Check if the second item is equal to 'paris'
Check if the fourth item is equal to 'member', or even check if there is 'member' in that line as there is no confusion with this part.
Any solution where I can use only regex?
Use [^,]* instead of .* to match a sequence of characters that doesn't include the comma separator. Use this for each field you want to skip when matching the line.
^[^,]*,paris,[^,]*,member,
Note that this is a very fragile mechanism compared to using the csv module, since it will break if any field contains a comma (the csv module understands quoting a field to protect the delimiter).
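A small sketch comparing the two routes on the sample data from the question. The re.MULTILINE flag and the trailing .*$ are additions so that the whole matching line is returned; reading from io.StringIO instead of a real file is an assumption made to keep the example self-contained.

```python
import csv
import io
import re

data = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""

# Regex route: re.MULTILINE lets ^ and $ anchor at every line boundary.
pattern = re.compile(r"^[^,]*,paris,[^,]*,member,.*$", re.MULTILINE)
regex_rows = pattern.findall(data)

# csv route: DictReader uses the header row as the field names.
reader = csv.DictReader(io.StringIO(data))
csv_rows = [row["ID"] for row in reader
            if row["from"] == "paris" and row["type"] == "member"]
print(regex_rows, csv_rows)
```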
This should do it:
^.*,(paris),.*,(member),.*$
As many have pointed out, I would read this into a dictionary using csv. However, if you insist on using regex, this should work:
[0-9]+\,paris.*[^-]member.*
Try this:
import re
# \d+ so that IDs with more than one digit also match
regex = r"\d+,paris,\w+,member,\d+"
text = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""
lines = text.split("\n")
for line in lines:
    # re.match anchors at the start of each line
    if re.match(regex, line):
        print(line)
You can try this:
import re
s = """
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
"""
final_data = re.findall(r'\d+,paris,\w+,member,\d+', s)
Output:
['1,paris,berlin,member,12']
However, note that the best solution is to read the file and use a dictionary:
import csv
# DictReader maps each row to a dict keyed by the header row
with open('filename.csv', newline='') as f:
    rows = list(csv.DictReader(f))
final_data = [','.join(row.values()) for row in rows if row['from'] == 'paris' and row['type'] == 'member']
s = 'myName.Country.myHeight'
required = s.split('.')[0]+'.'+s.split('.')[1]
print(required)
myName.Country
How can I get the same 'required' string in a better and shorter way?
Use str.rpartition like this:
s = 'myName.Country.myHeight'
print(s.rpartition(".")[0])
# myName.Country
rpartition returns a three-element tuple:
the 1st element being the string before the separator,
then the separator itself,
and then the string after the separator.
So, in our case,
s = 'myName.Country.myHeight'
print(s.rpartition("."))
# ('myName.Country', '.', 'myHeight')
And we have picked only the first element.
Note: If you want to do it from the left, instead of doing it from the right, we have a sister function called str.partition.
You have a few options.
1
print(s.rsplit('.', 1)[0])
2
print(s[:s.rfind('.')])
3
print(s.rpartition('.')[0])
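The three options agree when the separator is present but behave differently when it is missing, which may matter for your data. A quick sketch (the 'myName' input without a dot is made up):

```python
s = 'myName.Country.myHeight'

# All three agree when the separator is present.
results = [s.rsplit('.', 1)[0], s[:s.rfind('.')], s.rpartition('.')[0]]

# Without a separator they diverge: rfind returns -1, so the slice drops
# the last character, and rpartition puts the whole string in the last slot.
t = 'myName'
no_sep = [t.rsplit('.', 1)[0], t[:t.rfind('.')], t.rpartition('.')[0]]
print(results)
print(no_sep)
```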
Well, that seems just fine to me... But here are a few other ways I can think of :
required = ".".join(s.split(".")[0:2])  # only one split
# using regular expressions
import re
required = re.sub(r"\.[^.]*$", "", s)
The regex strips the final dot and whatever follows it, so it assumes the part you want to split off contains no dots itself.
Using python 3.3:
I need some help in writing the body for this function that swaps the positions of the last name and first name.
Essentially, I have to write a body to swap the first name from a string to the last name's positions.
The initial order is first name followed by last name (separated by a comma). Example: 'Albus Percival Wulfric Brian, Dumbledore'
The result I want is: 'Dumbledore Albus Percival Wulfric Brian'
My approach was:
name = 'Albus Percival Wulfric Brian, Dumbledore'
name = name[name.find(',')+2:]+", "+name[:name.find(',')]
the answer I get is: 'Dumbledore, Albus Percival Wulfric Brian' (This isn't what I want)
There should be no commas in between.
I'm a new user to Python, so please don't go into too complex ways of solving this.
Thanks kindly for any help!
You can split a string on commas into a list of strings using syntax like astring.split(',')
You can join a list of strings into a single string on whitespace like ' '.join(alist).
You can reverse a list using list slice notation: alist[::-1]
You can strip surrounding white space from a string using astring.strip()
Thus:
' '.join(aname.split(',')[::-1]).strip()
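Put together as a function, a minimal sketch could look like this. The name swap_names is made up, and stripping each part before joining (rather than stripping once at the end) is a small variation on the one-liner above.

```python
def swap_names(name):
    """Turn 'First Middle, Last' into 'Last First Middle'."""
    # Split on the comma and trim the whitespace around each part.
    parts = [part.strip() for part in name.split(',')]
    # Reverse the parts and join them with a single space.
    return ' '.join(parts[::-1])

swapped = swap_names('Albus Percival Wulfric Brian, Dumbledore')
print(swapped)
```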
You're adding the comma in yourself:
name = name[name.find(',')+2:] + ", " + name[:name.find(',')]
Make it (keeping a space between the two parts, but no comma):
name = name[name.find(',')+2:] + " " + name[:name.find(',')]