Convert list of string to dict - Remove extra comma [duplicate] - python

This question already has answers here:
Convert a String representation of a Dictionary to a dictionary
(11 answers)
Closed 1 year ago.
I am trying to create a dictionary from a list of strings. My attempt to convert this list of strings to a list of dictionaries is below:
author_dict = [[dict(map(str.strip, s.split(':')) for s in author_transform.split(','))] for author_transform in list_of_strings]
Everything was working fine until I encountered this piece of string:
[[country:United States,affiliation:University of Maryland, Baltimore County,name:tim oates,id:2217452330,gridid:grid.266673.0,affiliationid:79272384,order:2],........,[]]
Because this string has an extra comma (,) in the middle of the intended value of the affiliation key, my list is getting split at the wrong place. Is there a way (or idea) I can use to avoid this kind of situation?
If it is not possible, any suggestions on how I can ignore this kind of list?

I would solve this by using a regular expression for splitting. This way you can split only on those commas that are followed by a colon without another comma in between.
In your code, replace
author_transform.split(',')
with
re.split(',(?=[^,]+:)', author_transform)
(And don’t forget to import re, of course.)
So, the whole code snippet becomes this:
author_dict = [
    [
        dict(map(str.strip, s.split(':'))
             for s in re.split(',(?=[^,]+:)', author_transform))
    ]
    for author_transform in list_of_strings
]
I took the liberty of reformatting the code, so the structure of the list comprehensions becomes clear.
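As a rough, self-contained check of the lookahead split described above, the sketch below runs it on the record from the question. The name `field_separator` and the `maxsplit=1` argument are additions for illustration and are not part of the original answer.

```python
import re

# The problematic record from the question, as one string.
list_of_strings = [
    "country:United States,affiliation:University of Maryland, Baltimore County,"
    "name:tim oates,id:2217452330,gridid:grid.266673.0,affiliationid:79272384,order:2",
]

# Split only on commas that are followed by "key:" with no other comma in between.
field_separator = re.compile(r",(?=[^,]+:)")

author_dict = [
    [
        dict(
            # maxsplit=1 (an assumption added here) keeps any later colon inside the value.
            map(str.strip, field.split(":", 1))
            for field in field_separator.split(author_transform)
        )
    ]
    for author_transform in list_of_strings
]

print(author_dict[0][0]["affiliation"])
```

The lookahead is a heuristic: a value that itself contains a comma followed by text and a colon would still be split in the wrong place.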

Related

Most efficient way to clean this array ['1\xa0790\xa0000\xa0kr', '1\xa0980\xa0000\xa0kr']? [duplicate]

This question already has answers here:
Remove characters except digits from string using Python?
(19 answers)
Closed 3 months ago.
What is the most efficient way to clean this array ['1\xa0790\xa0000\xa0kr', '1\xa0980\xa0000\xa0kr'] into a new array that looks like this ['1790000', '1980000']?
I am a beginner in Python and appreciate any advice, thank you!
I tried a double for loop and deleted chars that were equal to "x" or "a". When I tried the backslash, it failed.
That string does not contain any x's, a's, or backslashes. The string '\xa0' contains one character -- a non-breaking space, with the hex value A0. Use
s = s.replace('\xa0','')
This does not remove the "kr" at the end. You can use another replace to get rid of that.
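A minimal sketch of the two chained replace calls, applied to the array from the question. The variable names `prices` and `cleaned` are only illustrative.

```python
prices = ['1\xa0790\xa0000\xa0kr', '1\xa0980\xa0000\xa0kr']

# '\xa0' is a single non-breaking space character; remove it first,
# then remove the "kr" suffix with a second replace.
cleaned = [p.replace('\xa0', '').replace('kr', '') for p in prices]

print(cleaned)
```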
The re module can do the cleaning with a regular expression, something like this:
import re
def clean_array(array):
    # Keep only the digits in each string
    array = [re.sub(r'\D', '', x) for x in array]
    return array
print(clean_array(['1\xa0790\xa0000\xa0kr', '1\xa0980\xa0000\xa0kr']))

Is there a string function similar to "split()" that works for strings without a repeated character? [duplicate]

This question already has answers here:
How do I split a string into a list of characters?
(15 answers)
Closed 2 years ago.
I want to split ascii_letters* (in the string module) into a list; it doesn't have any repeated characters. I tried to use '' as the separator, but that didn't work; I got a ValueError: empty separator message. Is there a string method other than split() that I can use? I might be able to put spaces in, but that may become tedious and take up a lot of code space.
import string
letters = string.ascii_letters
print(letters.split(''))
*ascii_letters is a string that contains 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
list(letters)
might be what you are looking for.
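You can also use a regex from the re module. Note that re.split(r'.', s) treats every character as a separator and returns only empty strings, so use re.findall instead:
re.findall(r'.', s)
This returns every character (except newlines) as a separate list item.
Or simply use list(s) to get the list of characters, as suggested by @Klaus D.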
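A short comparison of the approaches mentioned in this thread, as a sketch. The variable names are illustrative. It shows that list() and re.findall give the characters, while re.split with '.' as the pattern consumes every character as a separator.

```python
import re
import string

letters = string.ascii_letters

# list() iterates over the string and yields one item per character.
chars = list(letters)

# re.findall returns every match of the pattern; '.' matches any character except a newline.
chars_via_regex = re.findall(r'.', letters)

# re.split treats each match as a separator, so only empty strings remain.
split_result = re.split(r'.', 'abc')

print(chars[:3], chars_via_regex[:3], split_result)
```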

Loop through elements in list of strings and combine if condition is met [duplicate]

This question already has answers here:
Modifying a list while iterating when programming with python [duplicate]
(5 answers)
Closed 2 years ago.
I am trying to write code that loops through the elements in a list of strings and combines each element that starts with a lowercase letter with the previous element. For example, given this list:
test_list = ['Example','This is a sample','sentence','created to illustrate','the problem.','End of example']
I would like to end up with the following list:
test_list = ['Example','This is a sample sentence created to illustrate the problem.','End of example']
Here is the code I have tried (which doesn't work):
for i in range(len(test_list)):
    if test_list[i].islower():
        test_list[i-1:i] = [' '.join(test_list[i-1:i])]
I think there might be a problem with me trying to use this join recursively. Could someone recommend a way to solve this? As background, the reason I need this is because I have many PDF documents of varying sizes converted to text which I split into paragraphs to extract specific items using re.split('\n\s*\n',document) on each doc. It works for most docs but, for whatever reason, some of them have '\n\n' literally after every other word or just in random places that do not correspond to end of paragraph, so I am trying to combine these to achieve a more reasonable list of paragraphs. On the other hand, if anyone has a better idea of how to split raw extracted text into paragraphs, that would be awesome, too. Thanks in advance for the help!
You could use:
output = [test_list[0]]
for a, b in zip(test_list, test_list[1:]):
    if b[0].islower():
        output[-1] = f'{output[-1]} {b}'
    else:
        output.append(b)
output
Output:
['Example',
'This is a sample sentence created to illustrate the problem.',
'End of example']
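The same idea can be wrapped in a function that builds a new list instead of modifying the one being iterated. This is a sketch; the function name `merge_continuations` is hypothetical, and the `item[:1]` slice is an added guard so that empty strings do not raise an IndexError.

```python
def merge_continuations(items):
    """Join each item that starts with a lowercase letter onto the previous item."""
    merged = []
    for item in items:
        # item[:1] is '' for an empty string, and ''.islower() is False,
        # so empty items are appended unchanged instead of raising an error.
        if merged and item[:1].islower():
            merged[-1] = f"{merged[-1]} {item}"
        else:
            merged.append(item)
    return merged


test_list = ['Example', 'This is a sample', 'sentence', 'created to illustrate',
             'the problem.', 'End of example']
result = merge_continuations(test_list)
print(result)
```

Note that the original attempt calls islower() on the whole string, which is False for any string containing an uppercase letter, and it changes the list's length while looping over the old index range.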

Differentiating between double and single characters? [duplicate]

This question already has an answer here:
Dynamic variable name in python
(1 answer)
Closed 6 years ago.
I have a weird variable name for a dictionary that I have to use that looks like so:
val.display_counter1.rjj = {}
This is illegal so I've decided to use this format:
val_display__counter1_rjj = {}
Later in the code I need to match up that dict variable name with the original name. So I'm trying to find a way to replace those single underscores with dots and the double underscores with a single underscore. I'm sure that there is a regex solution, but regex isn't my strong suit.
Is there a way to selectively replace like this?
Edit:
There is some confusion with my question so allow me to clarify. The original name:
val.display_counter1.rjj
This is NOT a variable in itself but merely an item name from the 3D software package Modo. There are many items that share this format. What I am trying to do is create a class of dicts that will store information about these items. I want to name the dicts for the items and be able to match them in program.
To make this match, I need to revert my dict name back to its original form:
val_display__counter1_rjj --> val.display_counter1.rjj
All I need to know is how to make the Regex match ONLY the single underscore and discard the matches that are surrounded by other underscores.
Also, not sure why this is marked as duplicate. But my question doesn't involve dynamic variables.
Well, I am new to Python.
But hope this works!!!
import re
val_display__counter1_rij = {}
l = ['val_display__counter1_rij', 'val_display___counter1_rij', 'val.display__counter1_rij']  # list of variables to match
for x in l:
    if "." not in x:
        article = re.sub(r'(?is)_', '.', x)
        if ".." in article:
            article = article.replace("..", "__")
        if article == 'val.display__counter1.rij':
            print(article)
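For the mapping the question actually asks for (single underscore back to a dot, double underscore back to a single underscore), one possible sketch uses a single re.sub with a replacement function. The helper names `to_identifier` and `to_item_name` are hypothetical, and the sketch assumes item names never contain consecutive underscores or a dot directly next to an underscore, since those cases would make the encoding ambiguous.

```python
import re


def to_identifier(item_name):
    # Hypothetical encoder: double existing underscores first, then turn dots into underscores.
    return item_name.replace('_', '__').replace('.', '_')


def to_item_name(identifier):
    # The alternation tries '__' before '_', so a doubled underscore is consumed
    # as one unit and becomes '_', while a lone underscore becomes '.'.
    return re.sub(r'__|_', lambda m: '_' if m.group() == '__' else '.', identifier)


restored = to_item_name('val_display__counter1_rjj')
print(restored)
```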

Using regex to remove substrings from list items in python

I'm sure this must be a duplicate question, but I can't find an answer anywhere. I have a list with multiple strings as below:
['>ctg7180000016561_3757\nAAAAATTTAGTTAAAACTATAACATTAGCTTGTCAAGCTAAAATTACTATGTAAGTAGTAATTTTTA\n', '>ctg7180000016561_3824\nATCCCTCAAATAGCACCCATTAACTGATTATCCTTATTCTTAATATTCACCACCTCTCTCCTAATATTTAGAGCTTCTAACTATTTCTTTATCATGTACCCCCCCAAAAAATCTGTTTTTTATAAAAAAACTAGTATAAATAACTGATCATGATAACTAACCTCTTTTCGTCTTTCGACCCCTCTACTAACTTAAATACTAACTTTAACTGAGTTAGGACTATCCTCGGGGTGGCTGTAATCCCGAGGATATTTTGGATTATCCCCTCGCGTTTCTCCCTGCTTTGAATAAAACTTATCAGTACTCTTCACAAAGAATTCAAAGTCCTTGTTAACAACAAAAAATCCCAAGGCAGAACCCTAATCCTGATTTCCTTATTTTCTATTATTTTATTTAATAACTTCATAGGACTATTCCCATATATTTTCACATCCACAAGTCACATAGTATTAACCCTGTCCCTGGCTCTCCCCATATGACTAAGATTTATATTGTATGGGTGGGTAAATAATACAACCCACATGCTAGCCCATCTAGTACCCCAAGGAACCCCTGCCGTTCTAATACCATTTATGGTGTGTATTGAAACAATCAGAAATGTTATCCGACCCGGCACCCTGGCAATCCGGCTATCCGCAAATATAATTGCAGGACACCTACTAATAACCCTTCTAGGTAACACGGGAAAC\n', '>ctg7180000016561_4513\nT\n']
And all I want to do is remove the numbers after the underscore, so in this example the output would be:
['>ctg7180000016561\nAAAAATTTAGTTAAAACTATAACATTAGCTTGTCAAGCTAAAATTACTATGTAAGTAGTAATTTTTA\n', '>ctg7180000016561\nATCCCTCAAATAGCACCCATTAACTGATTATCCTTATTCTTAATATTCACCACCTCTCTCCTAATATTTAGAGCTTCTAACTATTTCTTTATCATGTACCCCCCCAAAAAATCTGTTTTTTATAAAAAAACTAGTATAAATAACTGATCATGATAACTAACCTCTTTTCGTCTTTCGACCCCTCTACTAACTTAAATACTAACTTTAACTGAGTTAGGACTATCCTCGGGGTGGCTGTAATCCCGAGGATATTTTGGATTATCCCCTCGCGTTTCTCCCTGCTTTGAATAAAACTTATCAGTACTCTTCACAAAGAATTCAAAGTCCTTGTTAACAACAAAAAATCCCAAGGCAGAACCCTAATCCTGATTTCCTTATTTTCTATTATTTTATTTAATAACTTCATAGGACTATTCCCATATATTTTCACATCCACAAGTCACATAGTATTAACCCTGTCCCTGGCTCTCCCCATATGACTAAGATTTATATTGTATGGGTGGGTAAATAATACAACCCACATGCTAGCCCATCTAGTACCCCAAGGAACCCCTGCCGTTCTAATACCATTTATGGTGTGTATTGAAACAATCAGAAATGTTATCCGACCCGGCACCCTGGCAATCCGGCTATCCGCAAATATAATTGCAGGACACCTACTAATAACCCTTCTAGGTAACACGGGAAAC\n', '>ctg7180000016561\nT\n']
I am using regex and I have a perfect match, but I can't work out how to actually remove the substrings. My code so far is:
pattern = re.compile('_[0-9]*')
for x in SequenceList:
    re.sub(pattern, '', x)
I'm aware that this is just changing the variable x, but even when I just print x within the for loop the pattern isn't removed. How do I actually remove the pattern and alter the list?
Thank you and sorry if this is already answered somewhere!
Strings are immutable, so re.sub returns a new string instead of modifying x. You can use a list comprehension to create a new list with the replaced strings, like this:
import re
pattern = re.compile(r"_\d+")
print([pattern.sub("", item) for item in SequenceList])
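To alter the list itself rather than only print a cleaned copy, the result can be assigned back. The sketch below uses shortened stand-in sequences instead of the full records from the question, and the slice assignment is one option among several (plain reassignment to the same name also works).

```python
import re

# Shortened stand-ins for the FASTA-style records in the question.
SequenceList = [
    '>ctg7180000016561_3757\nAAAAATTTAG\n',
    '>ctg7180000016561_3824\nATCCCTCAAA\n',
    '>ctg7180000016561_4513\nT\n',
]

pattern = re.compile(r'_\d+')

# pattern.sub returns a new string for each item; slice assignment then
# replaces the contents of the existing list object in place.
SequenceList[:] = [pattern.sub('', item) for item in SequenceList]

print(SequenceList)
```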
