Separation of a split string - Python

def getOnlyNames(unfilteredString):
    # Keep only the text after the first colon
    unfilteredString = unfilteredString[unfilteredString.index(":") + 1:]
    # Split into "name,price" entries and skip the empty piece after the last ';'
    namesAndNumbers = [item.strip() for item in unfilteredString.split(';') if item.strip()]
    onlyNames = []
    for entry in namesAndNumbers:
        # The name is everything before the first comma
        onlyNames.append(entry.split(',')[0].strip())
    return onlyNames
I'm trying to write a function that separates a string like the following:
"Cars: Mazda 3,30000; Mazda 5, 49900;"
so that I get only:
Mazda 3,Mazda 5
First I remove everything up to the colon,
then I try to keep only the name of each car, without its price.

You can use a regex for this:
>>> import re
>>> s = "Cars: Mazda 3,30000; Mazda 5, 49900;"
>>> re.findall(r"[:;]\W*([^:;]*?)(?:,)", s)
['Mazda 3', 'Mazda 5']
>>> s = "Mazda 3, 35000; Cars: Mazda 4,30000; Mazda 5, 49900;"
>>> re.findall(r"[:;]\W*([^:;]*?)(?:,)", s)
['Mazda 4', 'Mazda 5']

Start with the string:
"Cars: Mazda 3,30000; Mazda 5, 49900;"
Split on the colon:
['Cars', ' Mazda 3,30000; Mazda 5, 49900;']
Split the last item on the semicolon:
[' Mazda 3,30000', ' Mazda 5, 49900', '']
Split the first two items on the comma:
[' Mazda 3', '30000'], [' Mazda 5', ' 49900']
Take the first item of each and strip the whitespace:
'Mazda 3'
'Mazda 5'
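The steps above can be sketched in Python roughly as follows. The function name get_only_names is only illustrative, and the sketch assumes the input always has the "label: name,price; name,price;" shape shown in the question.

```python
def get_only_names(text):
    # Split on the first colon and keep what follows it
    _, _, rest = text.partition(":")
    names = []
    # Split the remainder on semicolons; the trailing ';' leaves an empty piece
    for entry in rest.split(";"):
        if not entry.strip():
            continue
        # Split on the comma, take the first item and strip the whitespace
        names.append(entry.split(",")[0].strip())
    return names


print(get_only_names("Cars: Mazda 3,30000; Mazda 5, 49900;"))
# ['Mazda 3', 'Mazda 5']
```

Using partition instead of split for the colon step means a string without a colon produces an empty list rather than an exception, which may or may not be what you want.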

Related

Extract a list of values from a column in a pandas dataframe

I’m trying to extract a list of values from a column in a dataframe.
For example:
import pandas as pd
# dataframe with "num_fruit" column
fruit_df = pd.DataFrame({"num_fruit": ['1 "Apple"',
    '100 "Peach Juice3" 1234 "Not_fruit" 23 "Straw-berry" 2 "Orange"']})
# desired output: a list of values from the "num_fruit" column
[['1 "Apple"'],
['100 "Peach Juice3"', '1234 "Not_fruit"', '23 "Straw-berry"', '2 "Orange"']]
Any suggestions? Thanks a lot.
What I’ve tried:
import re
def split_fruit_val(val):
    return re.findall(r'(\d+ ".+")', val)
result_list = []
for val in fruit_df['num_fruit']:
    result = split_fruit_val(val)
    result_list.append(result)
print(result_list)
#output: some values were not split appropriately
[['1 "Apple"'],
['100 "Peach Juice3" 1234 "Not_fruit" 23 "Straw-berry" 2 "Orange"']]
Let's split on each whitespace character that is followed by a number, using a positive lookahead:
fruit_df['num_fruit'].str.split(r'\s(?=\d+)')
0 [1 "Apple"]
1 [100 "Peach Juice3", 1234 "Not_fruit", 23 "Str...
Name: num_fruit, dtype: object
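For comparison, the same idea can be tried outside pandas with the standard re module. This is only a minimal sketch: the helper names split_on_lookahead and find_pairs are made up for illustration, and it assumes a quoted name never contains a double quote or a space followed by a digit.

```python
import re

values = ['1 "Apple"',
          '100 "Peach Juice3" 1234 "Not_fruit" 23 "Straw-berry" 2 "Orange"']


def split_on_lookahead(val):
    # Split on a whitespace character only when a digit follows it
    return re.split(r'\s(?=\d)', val)


def find_pairs(val):
    # Alternative: match each number followed by a quoted name.
    # [^"]* stops at the closing quote, unlike the greedy .+ in the question.
    return re.findall(r'\d+ "[^"]*"', val)


print([split_on_lookahead(v) for v in values])
print([find_pairs(v) for v in values])
```

Both calls should give the nested list requested in the question. The greedy .+ in the original attempt runs to the last double quote in the string, which is why everything ended up in one match.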

Python regex find single digit if no digits before it

I have a list of strings and I want to use regex to get a single digit if there are no digits before it.
import re
strings = ['5.8 GHz', '5 GHz']
for s in strings:
    print(re.findall(r'\d\s[GM]?Hz', s))
# output
['8 GHz']
['5 GHz']
# desired output
['5 GHz']
I want it to just return '5 GHz', the first string shouldn't have any matches. How can I modify my pattern to get the desired output?
As per my comment, it seems that you can use:
(?<!\d\.)\d+\s[GM]?Hz\b
This matches:
(?<!\d\.) - A negative lookbehind asserting that the position is not right after a single digit followed by a literal dot.
\d+ - One or more digits, matching the integer part of the frequency.
\s - A single whitespace character.
[GM]?Hz - An optional uppercase G or M followed by "Hz".
\b - A word boundary.
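A quick way to check the pattern is to run it over a few inputs. The sketch below also includes a stricter variant, named strict here, which is my own assumption and not part of the answer: it rejects a match when any digit or dot comes directly before it.

```python
import re

# Pattern from the answer above
pattern = re.compile(r'(?<!\d\.)\d+\s[GM]?Hz\b')
# Hypothetical stricter variant: no digit or dot directly before the match
strict = re.compile(r'(?<![\d.])\d+\s[GM]?Hz\b')

for s in ['5.8 GHz', '5 GHz', '5.85 GHz']:
    print(s, pattern.findall(s), strict.findall(s))
```

The original lookbehind only inspects the two characters before the match, so a value with two decimal digits such as '5.85 GHz' still yields '5 GHz' with the first pattern, while the stricter variant returns nothing for it. Whether that matters depends on the data you expect.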
>>> import re
>>> strings = ['5.8 GHz', '5 GHz']
>>>
>>> for s in strings:
...     match = re.match(r'^[^0-9]*([0-9] [GM]Hz)', s)
...     if match:
...         print(match.group(1))
...
5 GHz
Updated Answer
import re
a = ['5.8 GHz', '5 GHz', '8 GHz', '1.2', '1.2 Some Random String', '1 Some String', '1 MHz of frequency', '2 Some String in Between MHz']
res = []
for fr in a:
    if re.match(r'^[0-9](?=.[^0-9])(\s)[GM]Hz$', fr):
        res.append(fr)
print(res)
Output:
['5 GHz', '8 GHz']
My two cents:
selected_strings = list(filter(
    lambda x: re.findall(r'(?:^|\s+)\d+\s+(?:G|M)Hz', x),
    strings
))
With ['2 GHz', '5.8 GHz', ' 5 GHz', '3.4 MHz', '3 MHz', '1 MHz of Frequency'] as strings, selected_strings is:
['2 GHz', ' 5 GHz', '3 MHz', '1 MHz of Frequency']

What's the best way to parse through a list of strings and return joined strings based on slices of these strings?

Here is an example list and the desired output:
list = ['1 Michael Jessica', '2 Christopher Ashley', '3 Matthew Brittany', '4 Joshua Amanda']
output = [ 'Michael 1', 'Jessica 1', 'Christopher 2', 'Ashley 2', 'Matthew 3', 'Brittany 3', etc]
# Then I sort it but that doesn't matter right now
I'm a Python newbie and combined the concepts I understand to produce this horrendously ridiculous code that I'm almost embarrassed to post. No doubt there is a proper and easier way! I'd love some advice and help. Please don't worry about my code or editing it; I'm just posting it for reference in case it helps. Ideally, brand new code is what I'm looking for.
list = ['1 Michael Jessica', '2 Christopher Ashley', '3 Matthew Brittany', '4 Joshua Amanda']
list3 = []
list4 = []
y = []
for n in list:
    x = n.split()
    y.append(x)
print(y)
for str in y:
    for pos in range(0, 3, 2): # Number and Name 1
        test = str[pos]
        list3.append(test)
for str in y:
    for pos in range(0, 2): # Number and Name 2
        test = str[pos]
        list4.append(test)
list3.reverse()
list4.reverse()
print(list3)
print(list4)
length = int(len(list3) / 2)
start = 0
finish = 2
length2 = int(len(list4) / 2)
start2 = 0
finish2 = 2
for num in range(0, length):
    list3[start:finish] = [" ".join(list3[start:finish])]
    start += 1
    finish += 1
for num in range(0, length):
    list4[start2:finish2] = [" ".join(list4[start2:finish2])]
    start2 += 1
    finish2 += 1
print(list3)
print(list4)
list5 = list3 + list4
list5.sort()
print(list5)
The other answers also look good, but I believe this is a more flexible approach if the number is not always in the same position. In that case re is a good choice for slicing the strings.
import re
ls = ['1 Michael Jessica', '2 Christopher Ashley', '3 Matthew Brittany', '4 Joshua Amanda']
result = []
for l in ls:
    key = re.findall(r'\d+', l)[0]
    for i in re.findall(r'\D+', l):
        for val in i.split():
            result.append('{} {}'.format(val, key))
print(result)
Below is a one-liner for the same:
result2 = ['{} {}'.format(val, re.findall(r'\d+', l)[0]) for l in ls for i in re.findall(r'\D+', l) for val in i.split()]
print(result2)
Happy Coding !!!
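To illustrate the point about the number's position, here is a small sketch of the same regex approach wrapped in a function. The name pair_names_with_number and the reordered sample lines are made up for illustration; it assumes every line contains at least one number and that names contain no digits.

```python
import re


def pair_names_with_number(lines):
    result = []
    for line in lines:
        # First run of digits anywhere in the line
        key = re.findall(r'\d+', line)[0]
        # Every run of non-digits holds zero or more names
        for chunk in re.findall(r'\D+', line):
            for name in chunk.split():
                result.append('{} {}'.format(name, key))
    return result


print(pair_names_with_number(
    ['1 Michael Jessica', 'Christopher 2 Ashley', 'Matthew Brittany 3']))
# ['Michael 1', 'Jessica 1', 'Christopher 2', 'Ashley 2', 'Matthew 3', 'Brittany 3']
```

A plain split-based solution that takes the first token as the key would mislabel the second and third lines here, which is the case the regex version is meant to cover.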
This is one approach using a simple iteration and str.split
Ex:
lst = ['1 Michael Jessica', '2 Christopher Ashley', '3 Matthew Brittany', '4 Joshua Amanda']
result = []
for i in lst:
    key, *values = i.split()
    for n in values:
        result.append(f"{n} {key}")  # or result.append(n + " " + key)
print(result)
Output:
['Michael 1', 'Jessica 1', 'Christopher 2', 'Ashley 2', 'Matthew 3', 'Brittany 3', 'Joshua 4', 'Amanda 4']
A list comprehension also works (here a is the input list):
[" ".join([item, name.split()[0]]) for name in a for index, item in enumerate(name.split()) if index != 0]
items = ['1 Michael Jessica', '2 Christopher Ashley', '3 Matthew Brittany', '4 Joshua Amanda']
result = []
for item in items:
    item_split = item.split(' ')
    # The first token is the number; the rest are names
    item_number = item_split.pop(0)
    for item_part in item_split:
        result.append('{} {}'.format(item_part, item_number))
print(result)
lst = ['1 Michael Jessica', '2 Christopher Ashley', '3 Matthew Brittany', '4 Joshua Amanda']
result = []
for item in lst:
    a, b, c = item.split()
    result.append("{} {}".format(b, a))
    result.append("{} {}".format(c, a))
print(result)
Output:
['Michael 1', 'Jessica 1', 'Christopher 2', 'Ashley 2', 'Matthew 3', 'Brittany 3', 'Joshua 4', 'Amanda 4']

How to strip and split in pandas

Is there a way to split on newlines and also strip whitespace in a single line of code?
This is how my df looks originally:
df["Source"]
0 test1 \n test2
1 test1 \n test2
2 test1 \ntest2
Name: Source, dtype: object
I used to split on newlines and build a list with the code below:
Data = (df["Source"].str.split("\n").to_list())
Data
[['test1 ', ' test2 '], [' test1 ', ' test2 '], [' test1 ', 'test2 ']]
I want to improve this further and remove any leading or trailing whitespace, but I am not sure how to combine split and strip in a single line.
df['Port']
0 443\n8080\n161
1 25
2 169
3 25
4 2014\n58
Name: Port, dtype: object
When I try to split it on newlines, it fills in NaN for the values that do not contain \n:
df['Port'].str.split("\n").to_list()
[['443', '8080', '161'], nan, nan, nan, ['2014', '58']]
The same approach works perfectly for other columns:
df['Source Hostname']
0 test1\ntest2\ntest3
1 test5
2 test7\ntest8\n
3 test1
4 test2\ntest4
Name: Source Hostname, dtype: object
df["Source Hostname"].str.split('\n').apply(lambda z: [e.strip() for e in z]).tolist()
[['test1', 'test2', 'test3'], ['test5'], ['test7', 'test8', ''], ['test1'], ['test2', 'test4']]
df['Source'].str.split('\n').apply(lambda x: [e.strip() for e in x]).tolist()
Use Series.str.strip to remove leading and trailing whitespace, then split by the regex \s*\n\s*, which matches zero or more whitespace characters before and after \n:
df = pd.DataFrame({'Source':['test1 \n test2 ',
' test1 \n test2 ',
' test1 \ntest2 ']})
print (df)
Source
0 test1 \n test2
1 test1 \n test2
2 test1 \ntest2
Data = (df["Source"].str.strip().str.split(r"\s*\n\s*").to_list())
print (Data)
[['test1', 'test2'], ['test1', 'test2'], ['test1', 'test2']]
Or, if possible, split on arbitrary whitespace (which means spaces or \n here):
Data = (df["Source"].str.strip().str.split().to_list())
print (Data)
[['test1', 'test2'], ['test1', 'test2'], ['test1', 'test2']]
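The same strip-then-split logic can be sketched with the standard library alone, which also makes it easy to see what happens to the Port values. The helper name split_and_strip is hypothetical, and the sketch assumes the NaN entries in the question come from non-string values such as the integer 25, which the question does not confirm.

```python
import re


def split_and_strip(value):
    # str() lets non-string values (for example the integer 25) pass through
    text = str(value).strip()
    # Split on a newline together with any whitespace around it
    return re.split(r'\s*\n\s*', text)


sources = ['test1 \n test2 ', ' test1 \n test2 ', ' test1 \ntest2 ']
ports = ['443\n8080\n161', 25, 169, 25, '2014\n58']

print([split_and_strip(v) for v in sources])
# [['test1', 'test2'], ['test1', 'test2'], ['test1', 'test2']]
print([split_and_strip(v) for v in ports])
# [['443', '8080', '161'], ['25'], ['169'], ['25'], ['2014', '58']]
```

If that assumption holds, converting the column to strings with astype(str) before calling .str.split would presumably avoid the NaN values in pandas as well, since the .str accessor returns NaN for elements that are not strings.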

How can I divide by index and add item in nested list keeping it nested?

My question is: how can I split each item into pieces by index and insert a specific string value ('None') for empty pieces,
while keeping the list nested?
My list:
lst = [[' 21693282.469 7 -4963125.899 7 -3821950.54648 21693275.40648\n',
' 20789781.031 7 887006.789 7 698075.62748 20789776.77048\n',
' 24667814.375 5 24667811.441 8 1051991.202 5 827710.336 8 24667810.98847\n',
' 21414305.883 6 21414301.563 9 -5293000.520 6 -4102616.060 9 21414301.17248\n',
' 23395450.500 6 1349998.701 6 1080794.20346 23395447.42246\n',
' 20965956.617 8 -3636447.948 8 -2813703.22949 20965951.97349\n'],
[' 20670086.656 7 2718596.518 7 2116872.80448 20670081.07848\n',
' 24222343.500 3 2146415.760 3 1607697.95946 24222340.25446\n',
' 22829139.453 6 1683633.646 6 1300012.93847 22829132.80147\n',
' 22934656.609 6 1700166.043 6 1314411.856 7 22934663.711 7\n',
' 20055874.828 9 267080.471 9 212506.020 9 20055882.121 9\n',
' 22774080.570 7 1762178.392 7 1346501.808 8 22774088.434 8\n',
' 20194290.688 8 -2867460.044 8 -2213132.457 9 20194298.629 9\n',
' 21679624.156 7 1345827.111 7 1067174.299 8 21679631.973 8\n']]
My code is here:
result=[]
def extract_line():
    for list in lst:
        for j in list:
            for i in range(0,len(j)-1,16):
                num = j[i:i+16].strip()
                result.append(num if num else 'None')
    yield result
for result in extract_line():
    print(result)
I only get a single flat list, not a nested one.
I want the result to keep the nesting, like this:
[['21693282.469 7', 'None', '-4963125.899 7', '-3821950.54648', '21693275.40648',
'20789781.031 7', 'None', '887006.789 7', '698075.62748', '20789776.77048',
'24667814.375 5', '24667811.441 8', '1051991.202 5', '827710.336 8', '24667810.98847',
'21414305.883 6', '21414301.563 9', '-5293000.520 6', '-4102616.060 9', '21414301.17248',
'23395450.500 6', 'None', '1349998.701 6', '1080794.20346', '23395447.42246',
'20965956.617 8', 'None', '-3636447.948 8', '-2813703.22949', '20965951.97349'],
['20670086.656 7', 'None', '2718596.518 7', '2116872.80448', '20670081.07848',
'24222343.500 3', 'None', '2146415.760 3', '1607697.95946', '24222340.25446',
'22829139.453 6', 'None', '1683633.646 6', '1300012.93847', '22829132.80147',
'22934656.609 6', 'None', '1700166.043 6', '1314411.856 7', '22934663.711 7',
........ '21679631.973 8']]
Sorry, I'm not used to this site yet.
I have edited my question to make it easier to read.
Quick solution: if myList is the list you want to be nested, just write myList = [myList].
First, DON'T use list (which is a class) as a variable name. Python will not say anything, but it will bite you (hard) later.
(Same with dict, file, ...). When in doubt, add my_ before anything
Second, it seems you don't want to return a list of items, but a list which has a single item in it.
The easiest solution would be not to yield result, but
yield [result]
Which would return the list you want (your list as the only element in a list)
Here is a solution: introduce a sublist for each inner list to keep the nesting.
lst = [[' 21693282.469 7 -4963125.899 7 -3821950.54648 21693275.40648\n',
' 20789781.031 7 887006.789 7 698075.62748 20789776.77048\n',
' 24667814.375 5 24667811.441 8 1051991.202 5 827710.336 8 24667810.98847\n',
' 21414305.883 6 21414301.563 9 -5293000.520 6 -4102616.060 9 21414301.17248\n',
' 23395450.500 6 1349998.701 6 1080794.20346 23395447.42246\n',
' 20965956.617 8 -3636447.948 8 -2813703.22949 20965951.97349\n'],
[' 20670086.656 7 2718596.518 7 2116872.80448 20670081.07848\n',
' 24222343.500 3 2146415.760 3 1607697.95946 24222340.25446\n',
' 22829139.453 6 1683633.646 6 1300012.93847 22829132.80147\n',
' 22934656.609 6 1700166.043 6 1314411.856 7 22934663.711 7\n',
' 20055874.828 9 267080.471 9 212506.020 9 20055882.121 9\n',
' 22774080.570 7 1762178.392 7 1346501.808 8 22774088.434 8\n',
' 20194290.688 8 -2867460.044 8 -2213132.457 9 20194298.629 9\n',
' 21679624.156 7 1345827.111 7 1067174.299 8 21679631.973 8\n']]
result = []
def extract_line():
    for group in lst:
        print('list:', group)
        sublist = []  # collects the fields of one inner list
        for j in group:
            # Step through the line in 16-character columns, skipping the trailing newline
            for i in range(0, len(j) - 1, 16):
                num = j[i:i+16].strip()
                sublist.append(num if num else 'None')
        result.append(sublist)
    yield result
for result in extract_line():
    print('\n\n\n')
    print(result)
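The same idea can be written without a generator or a global list. The sketch below is only an illustration: the function names split_fixed_width and extract_nested are made up, and the sample rows are built here with 16-character columns because the original column spacing is not preserved in the data shown above.

```python
def split_fixed_width(line, width=16):
    # Drop the newline, then cut the line into fixed-width columns;
    # blank columns become the string 'None'
    line = line.rstrip("\n")
    return [line[i:i + width].strip() or "None"
            for i in range(0, len(line), width)]


def extract_nested(groups, width=16):
    # One output list per input group, so the nesting is preserved
    return [[field for line in group for field in split_fixed_width(line, width)]
            for group in groups]


def make_row(fields, width=16):
    # Helper for building illustrative fixed-width test data
    return "".join(f"{field:>{width}}" for field in fields) + "\n"


sample = [
    [make_row(["1.5 7", "", "-2.25 7"])],
    [make_row(["3.0 6", "4.0 9"]), make_row(["", "5.0 8"])],
]

print(extract_nested(sample))
# [['1.5 7', 'None', '-2.25 7'], ['3.0 6', '4.0 9', 'None', '5.0 8']]
```

Returning a new list instead of appending to a module-level result also avoids accumulating values across repeated calls.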
