How do I convert data into comma-separated values? For example,
I have this data in a single Excel cell:
"ABCD x3 ABC, BAC x 3"
I want to convert it to:
ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC
I can't find an easy way to do that.
I am trying to solve it in Python so I can get structured data.
Hi Zeeshan, sorting the string into usable data while also multiplying certain parts of it is kind of tricky for me.
The best solution I can think of is kind of gross, but it seems to work. Hopefully my comments aren't too confusing <3
import re
data = "ABCD x3 AB BAC x2"
# this will split the string into a list of words that you can iterate through.
datalist = re.findall(r'\w+', data)
# create a new list for the final result
newlist = []
for item in datalist:
    # for each item in the datalist list:
    # if the item is a multiplier, i.e. 'x' followed by a number (x3)
    if re.fullmatch(r"x\d+", item):
        # split the x from the multiplier number string
        xvalue = item.split('x')
        # grab and remove the last item added to newlist because it hasn't been multiplied yet.
        lastitem = newlist.pop()
        # now we can add the last item back in as many times as the x value
        newlist.extend([lastitem] * int(xvalue[1]))
    else:
        # if the item isn't a multiplier then we can just add it to the list.
        newlist.extend([item])
# print result
print(newlist)
# re.findall() - returns all non-overlapping matches in a string as a list
# re.fullmatch() - checks whether the whole string matches a pattern
# .split() - splits a string into multiple substrings
# .pop() - removes the last item from a list and returns that item.
# .extend() - adds every item of an iterable to the end of a list
Keep in mind that to find the multiplier it's looking for an x directly followed by a number (x1). If there is a space between them, for example (x 1), the x and the number end up as separate words, so the multiplier is not recognised.
There might be multiple ways around this issue, and I think the best fix is to restructure how the data is formatted in the cell.
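One possible way around the space issue, as a rough sketch rather than a definitive fix: make the multiplier optional in the regex and allow whitespace after the x. The pattern and the `expand` helper below are my own assumptions (they are not part of the code above), and they assume every item is a plain word optionally followed by x and a count.

```python
import re

# Assumed pattern: a word, optionally followed by whitespace, 'x',
# optional whitespace and a number, so both "x3" and "x 3" are accepted.
ITEM = re.compile(r'(\w+)(?:\s+x\s*(\d+)\b)?')

def expand(cell):
    """Hypothetical helper: repeat each item by its multiplier (default 1)."""
    result = []
    for word, count in ITEM.findall(cell):
        # findall returns '' for the count when no multiplier was present
        result.extend([word] * int(count or 1))
    return result

data = "ABCD x3 ABC, BAC x 3"
csv_line = ",".join(expand(data))
print(csv_line)  # ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC
```

An item that is itself just "x" followed by digits would be misread as a multiplier, so this only works if that never occurs in the cell.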
Here are a couple of ways you can work with the data. They won't directly solve your issue, but I hope they help you think about how to approach it (not being rude, I don't actually have a good way to handle your example <3).
split() will split your string at the separator you give it (a space, ' ', in this example) and return a list of substrings you can iterate over.
data = 'ABCD ABCD ABCD ABC BAC BAC BAC'
splitdata = data.split(' ')
print(splitdata)
#prints - ['ABCD', 'ABCD', 'ABCD', 'ABC', 'BAC', 'BAC', 'BAC']
You could also try to match strings from the data:
import re
data2 = "ABCD x3 ABC BAC x3"
result = []
for match in re.finditer(r'(\w+) x(\d+)', data2):
    substring, count = match.groups()
    result.extend([substring] * int(count))
print(result)
re.finditer goes through the string and yields every match of the pattern '(\w+) x(\d+)'.
Each match then gets added to the list, repeated count times.
'\w' matches a word character (a letter, a digit or an underscore).
'\d' matches a digit.
'+' is a quantifier that means one or more.
So we are matching '(\w+) x(\d+)',
which broken down means: (\w+) one or more word characters, followed by a space, then 'x', followed by (\d+) one or more digits. An item without a multiplier, such as ABC in data2, is not matched and is therefore left out of the result.
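To see what the pattern actually captures, here is a small check with re.findall on the same data2 string. It is only an illustration of the breakdown above; the variable name `pairs` is my own.

```python
import re

data2 = "ABCD x3 ABC BAC x3"
# Each match yields a (word, count) tuple, one entry per capturing group.
pairs = re.findall(r'(\w+) x(\d+)', data2)
print(pairs)  # [('ABCD', '3'), ('BAC', '3')]
```

ABC does not appear in the output because it is not followed by a space, an x and a number.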
Because your cell data is essentially a string followed by a multiplier, then a string followed by another string and then another multiplier, the data just feels too random for a general solution. I think this requires a direct solution that can only work if you know exactly what data is already in the cell. That's why I think the best way to fix it is to rework the data in the cell first.
I'm in no way an expert, and this answer is meant to help you think of ways around the problem and to add to the discussion :) If someone wants to correct me and offer a better solution, I would love to know myself.
Related
I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
As I understand it, the digits are always at the end of the string; have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)
print(s)
it should print:
xyz
without regex, keep in mind that this solution removes all digits and not only the ones at the end of a string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
xyz
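If the goal is to strip the caret markers from the question ("^0", "^1", "^2", ...) wherever they occur, a single re.sub with an escaped caret might be enough. This is a sketch that assumes every marker is a literal ^ followed by one or more digits; the sample text is made up.

```python
import re

text = "abc^0 def^12 ghi^2"  # made-up sample input
# '\^' matches a literal caret, '\d+' the number that follows it.
cleaned = re.sub(r"\^\d+", "", text)
print(cleaned)  # abc def ghi
```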
I am trying in Python 3 to get a list of all substrings of a given string a, which start after a delimiter x and end right before a delimiter y.
I have found solutions which only get me the first occurrence, but the result needs to be a list of all occurrences.
start = '>'
end = '</'
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print((s.split(start))[1].split(end)[0])
The above example is what I've got so far, but I am looking for a more elegant and stable way to get all the occurrences.
So the expected list would contain the JavaScript code as the following entries:
a=eval;b=alert;a(b(/XSS/.source));
a=eval;b=alert;a(b(/XSS/.source));
Looking for patterns in strings seems like a decent job for regular expressions.
This should return a list of anything between a pair of <script> and </script>:
import re
pattern = re.compile(r'<script>(.*?)</script>')
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print(pattern.findall(s))
Result:
['a=eval;b=alert;a(b(/XSS/.source));', 'a=eval;b=alert;a(b(/XSS/.source));']
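For arbitrary start and end delimiters, the same idea can be wrapped in a small function. This is a sketch only: `between` is a hypothetical helper name, and re.escape is used so that delimiters containing regex metacharacters are treated literally.

```python
import re

def between(text, start, end):
    """Hypothetical helper: all substrings between start and end (non-greedy)."""
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return pattern.findall(text)

sample = "<script>a=1;</script><p>x</p><script>b=2;</script>"
found = between(sample, "<script>", "</script>")
print(found)  # ['a=1;', 'b=2;']
```

re.DOTALL lets the captured part span line breaks; drop it if matches should stay on one line.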
I receive an input string having values expressed in two possible formats. E.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether the short format is used and, if so, convert it to the extended format.
Considering that the string could be composed of multiple values, e.g.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
#then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
#extract the number
#replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition: in my mind, I model it as a variable substring, in which the variable part is the number inside it, while the rigid format is the $(value name) + ":" part.
"some_value":19
     ^       ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
Indeed, I already have this solution in my code. But I don't like it since the input string could be multilevel and I need to iterate on the leaf values of the resulting dictionary, independently from the dictionary levels. The latter is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace every key other than t0 and tf that is followed by a number, it should work.
Here is an example on a multilevel string; it could probably be put in better shape:
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = r'("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
"interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by one or more digits.
Let's test this
import re
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
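Note that ([.\d]+) also accepts strings such as "1.2.3" or a lone ".". If that matters, a stricter number pattern could be used instead. The sketch below assumes plain decimals only (no sign, no exponent); the `number` variable is my own name.

```python
import re

data = '"interval":19.5,"interval":25'
# Digits, optionally followed by a dot and more digits.
number = r'\d+(?:\.\d+)?'
result = re.sub(r'("interval":)(' + number + r')', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19.5,"tf":19.5},"interval":{"t0":25,"tf":25}
```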
If you want any Unicode standard word characters and not only interval you can use the special sequence \w and because it could be multiple characters the expression will be \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty", but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+":) if it's preceded by {, so we're going to use a negative lookbehind assertion: (?<!{)("\w+":). And after the number part ([.\d]+) we don't want a } or another digit, so we're using negative lookahead assertions here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
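If the constraints do grow, parsing the string and walking the result recursively may be easier to maintain. The following is a sketch under a loud assumption: wrapping the input in braces must yield valid JSON, which holds for the examples in this answer but not necessarily for every input. `extend_leaves` is a hypothetical helper name.

```python
import json

def extend_leaves(value, key=None):
    """Hypothetical helper: turn numeric leaves into {"t0": n, "tf": n}."""
    if isinstance(value, dict):
        return {k: extend_leaves(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [extend_leaves(v) for v in value]
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    # Numbers already stored under t0/tf are left as they are.
    if is_number and key not in ("t0", "tf"):
        return {"t0": value, "tf": value}
    return value

data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
parsed = json.loads("{" + data + "}")  # assumption: this is valid JSON
# Dump without extra spaces and strip the outer braces again.
output = json.dumps(extend_leaves(parsed), separators=(",", ":"))[1:-1]
print(output)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
```

The recursion handles any nesting depth, which is the part that was awkward to express with lookarounds.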
I have a list of strings that looks like this:
Input:
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
I want to remove everything except digits, '.' and ','. In other words, I would like to split before the first occurrence of any digit with maxsplit=1:
Desired output:
["1234", "4.421,00", "1,000", "432"]
First attempt (two regex replacements):
# Step 1: Remove special characters
prices_list = [re.sub(r'[^\x00-\x7F]+',' ', price).encode("utf-8") for price in prices_list]
# Step 2: Remove [A-Za-z]
prices_list = [re.sub(r'[A-Za-z]','', price).strip() for price in prices_list]
Current output:
['1234', '$ 4.421,00', '1,000', '432'] # $ still in there
Second attempt (still two regex replacements):
prices_list = [''.join(re.split("[A-Za-z]", re.sub(r'[^\x00-\x7F]+','', price).encode("utf-8").strip())) for price in prices_list]
This (of course) leads to the same output as my first attempt. Also, this isn't much shorter and looks very ugly. Is there a better (shorter) way to do this?
Third attempt (list comprehension/nested for-loop/no regex):
prices_list = [''.join(token) for token in price for price in prices_list if token.isdigit() or token == ',|;']
which yields:
NameError: name 'price' is not defined
How to best parse the above-mentioned price list?
If you need to keep only specific characters, it's better to tell the regex to do exactly that:
import re
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
prices = list()
# inside a character class '|' is a literal pipe, so it is not needed here
pattern = r"[\d.,]+"
for it in prices_list:
    s = re.search(pattern, it)
    if s:
        prices.append(s.group())
print(prices)
> ['1234', '4.421,00', '1,000', '432']
The Problem
Correct me if I'm wrong, but essentially you're trying to remove symbols and such and only leave any trailing digits, right?
I would like to split before the first occurrence of any digit
That, I feel, is the simplest way to frame the regex problem that you are trying to solve.
A Solution
# -*- coding: utf-8 -*-
import re
# Match any contiguous non-digit characters
regex = re.compile(r"\D+")
# Input list
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
# Regex mapping
desired_output = list(map(lambda price: regex.split(price, 1)[-1], prices_list))
This gives me ['1234', '4.421,00', '1,000', '432'] as the output.
Explanation
The reason this works is the combination of the lambda and the map function. Basically, map takes a function (here a lambda, a small anonymous one-line function) and applies it to every element in the list. The negative index takes the last element of the list that the split method generates.
Essentially, this works because of the assumption that you don't want any initial non-digits in your output.
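To make the split step visible, here is what regex.split returns for one input, followed by the same mapping written as a list comprehension. This is only an equivalent sketch of the solution above; the `parts` variable is my own.

```python
import re

regex = re.compile(r"\D+")
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]

# With maxsplit=1 only the leading run of non-digits is used as separator.
parts = regex.split("CNY1234", 1)
print(parts)  # ['', '1234']

# Same result as the map/lambda version, as a list comprehension.
desired_output = [regex.split(price, 1)[-1] for price in prices_list]
print(desired_output)  # ['1234', '4.421,00', '1,000', '432']
```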
Caveats
This code not only keeps . and , in the resulting substring, but all characters in the resulting substring. So, an input string of "$10e7" will be output as '10e7'.
If you were to have just digits and . and ,, such as "10.00" as an input string, you would get '00' in the corresponding location in the output list.
If none of these are desired behavior, you would have to get rid of the negative indexing next to the regex.split(price, 1) and do further processing on the resulting list of lists so that you can handle all of those pesky edge cases that arise with using regex.
Either way, I would try and throw more extreme examples at it just to make sure that it's what you need.
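One way to sidestep both caveats might be to search for the number directly instead of splitting. The sketch below assumes a price starts at the first digit and continues through digits, '.' and ','; `extract_number` is a hypothetical helper, and the extra sample strings come from the caveats above.

```python
import re

# A digit followed by any run of digits, dots and commas.
NUMBER = re.compile(r"\d[\d.,]*")

def extract_number(text):
    """Hypothetical helper: first number-like run in text, or None."""
    match = NUMBER.search(text)
    return match.group() if match else None

samples = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432", "10.00", "$10e7"]
extracted = [extract_number(s) for s in samples]
print(extracted)  # ['1234', '4.421,00', '1,000', '432', '10.00', '10']
```

A trailing separator (as in "5.") would still be included, so more extreme inputs need their own checks.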
I am trying to do something which I thought would be simple (and probably is), however I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-### however in some cases, where the single digit should be, there are multiple single digits separated by a comma (i.e. ######-#,#,#-###). The number of single digits separated by a comma is variable. Below is an example:
For the string below:
('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
I need to return:
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
I have only gotten as far as returning the strings that match the ######-#-### pattern:
import re
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)
Thanks in advance for any help!
Matt
Perhaps something like this:
>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003
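Or using a list comprehension (with a fresh iterator, because the one above is exhausted after the loop):
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]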
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
Use '\d{6}-\d(,\d)*-\d{3}'.
* means "as many as you want (0 included)".
It is applied to the previous element, here '(,\d)'.
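One thing to watch for with this pattern: because it contains a capturing group, re.findall returns only that group rather than the whole match. A sketch of getting the full matches with re.finditer and group(0) instead (variable names are my own):

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
pattern = r'\d{6}-\d(,\d)*-\d{3}'

# findall reports only the last repetition of the capturing group.
groups_only = re.findall(pattern, s)
print(groups_only)  # [',2', '', ',3', '']

# group(0) is the complete match.
matches = [m.group(0) for m in re.finditer(pattern, s)]
print(matches)
# ['030421-1,2-001', '030421-1-002', '030421-1,2,3-002', '030421-1-003']
```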
I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.
Looping over the list lets you write a single function to parse and fix each item; you can then push the results onto a new list and join them to display your string.
string = string.replace('&', ',')
initialList = string.split(',')
newList = []
for item in initialList:
    newItem = myfunction(item)  # myfunction is a placeholder for your own parse-and-fix function
    newList.append(newItem)
newstring = ','.join(newList)
(\d{6}-)((?:\d,?)+)(-\d{3})
We take 3 capturing groups. We match the first part and the last part the easy way. The center part is a digit optionally followed by a ',', repeated one or more times. A repeated capturing group would only keep its last repetition, so the inner group is made non-capturing with ?: and the outer group captures the whole repeated part. What we're left with is the following result:
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]
You'll have to manually process the 2nd term to split them up and join them, but a list comprehension should be able to do that.
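A sketch of that last step, assuming the tuples produced by findall above: split the middle term on commas and glue each digit back between the first and last parts. The `expanded` name is my own.

```python
import re

p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# For every (prefix, digits, suffix) tuple, emit one entry per digit.
expanded = [a + x + c
            for a, b, c in p.findall(s)
            for x in b.split(',') if x]
print(expanded)
# ['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002',
#  '030421-2-002', '030421-3-002', '030421-1-003']
```

The `if x` filter skips the empty string that a trailing comma in the middle term would otherwise produce.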