How to split and remove a string in a list? - python

Here's my example code:
list1 = [{'name': 'foobar', 'parents': 'John Doe and Bartholomew Shoe'},
{'name': 'Wisteria Ravenclaw', 'parents': 'Douglas Lyphe and Jackson Pot'
}]
I need to split parents into a list and remove the 'and' string. So the output should look like this:
list1 = [{'name': 'foobar', 'parents': ['John Doe', 'Bartholomew Shoe']},
{'name': 'Wisteria Ravenclaw', 'parents': ['Douglas Lyphe', 'Jackson Pot']
}]
Please help me figure this out.
for people in list1:
    people['parents'] = people['parents'].split('and')
I'm not sure how to move that ', ' string.

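Inside the loop, use the loop variable people rather than the list itself, and split on ' and ' (with the surrounding spaces) so that no stray whitespace is left in the names:
for people in list1:
    people['parents'] = people['parents'].split(' and ')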
and then when you print list1, you get:
[{'name': 'foobar', 'parents': ['John Doe', 'Bartholomew Shoe']}, {'name': 'Wisteria Ravenclaw', 'parents': ['Douglas Lyphe', 'Jackson Pot']}]

Expanding on what others said: you may want to split on a regular expression so that
you don't split on "and" when a name happens to contain that substring, and
you also remove the whitespace around "and".
Like so:
import re
list1 = [
{'name': 'foobar', 'parents': 'John Doe and Bartholomew Shoe'},
{'name': 'Wisteria Ravenclaw', 'parents': 'Douglas Lyphe and Jackson Pot'}
]
for people in list1:
    people['parents'] = re.split(r'\s+and\s+', people['parents'])
print(list1)
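To see why the pattern matters, here is a small sketch comparing a plain split('and') with the regular expression split. The helper name split_parents and the sample names are made up for illustration; they are not part of the question.

```python
import re

def split_parents(parents):
    # Hypothetical helper: split only on "and" surrounded by whitespace.
    return re.split(r'\s+and\s+', parents.strip())

sample = 'Sandy Anders and Brandon Land'  # names that contain "and"

naive = sample.split('and')   # also splits inside "Sandy", "Brandon", "Land"
safe = split_parents(sample)  # splits only on the separator word

print(naive)  # ['S', 'y Anders ', ' Br', 'on L', '']
print(safe)   # ['Sandy Anders', 'Brandon Land']
```

Note that the pattern is case-sensitive, so a separator written as "And" would not be matched unless re.IGNORECASE is passed.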

Related

How to Split a Dictionary Value into 2 Separate Key Values

I currently have a dictionary where the values are:
disney_data = {
'title': ['Gus (1976)',
'Johnny Kapahala: Back on Board (2007)',
'The Adventures of Huck Finn (1993)',
'The Simpsons (1989)',
'Atlantis: Milo’s Return (2003)']
}
I would like to split up the title from the year value and have a dictionary like:
new_disney_data = {
'title' : ['Gus',
'Johnny Kapahala: Back on Board',
'The Adventures of Huck Finn',
'The Simpsons',
'Atlantis: Milo’s Return'],
'year' : ['1976',
'2007',
'1993',
'1989',
'2003']
}
I tried using the following, but I know something is off. I'm still relatively new to Python, so any help would be greatly appreciated!
for value in disney_data.values():
    new_disney_data['title'].append(title[0,-7])
    new_disney_data['year'].append(title[-7,-1])
There are two concepts you can use here:
The first is .split(). This is usually more robust than indexing into the string (for example, if someone left a space after the closing parenthesis).
The second is a list comprehension.
Using these two, here is one possible solution.
titles = [item.split('(')[0].strip() for item in disney_data['title']]
years = [item.split('(')[1].split(')')[0].strip() for item in disney_data['title']]
new_disney_data = {
'title': titles,
'year': years
}
print(new_disney_data)
Edit: I also used .strip(). This removes any leading and trailing whitespace (spaces, tabs, or newlines) from the ends of a string.
You're not that far off. In your for-loop you iterate over values of the dict, but you want to iterate over the titles. Also the string slicing syntax is [id1:id2]. So this would probably do what you are looking for:
new_disney_data = {"title":[], "year":[]}
for value in disney_data["title"]:
    new_disney_data['title'].append(value[0:-7])
    new_disney_data['year'].append(value[-5:-1])
new_disney_data = {
'title': [i[:-6].rstrip() for i in disney_data['title']],
'year': [i[-5:-1] for i in disney_data['title']]
}
this code can do it
import re
disney_data = {
'title': ['Gus (1976)',
'Johnny Kapahala: Back on Board (2007)',
'The Adventures of Huck Finn (1993)',
'The Simpsons (1989)',
'Atlantis: Milo’s Return (2003)']
}
disney_data['year'] = []
for index, line in enumerate(disney_data.get('title')):
    match = re.search(r'\d{4}', line)
    if match is not None:
        disney_data['title'][index] = line.split('(')[0].strip()
        disney_data['year'].append(match.group())
print(disney_data)
For each title, it searches for four consecutive digits. If they are found, it appends them to year and strips the parenthesized year from the title.
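One thing to watch with this approach: if a title has no year, nothing is appended to year and the two lists drift out of step. A minimal sketch of one way to keep them aligned follows; the function name split_titles, the pattern name TITLE_YEAR, and the use of None for a missing year are assumptions for illustration, not part of the answer above.

```python
import re

# Matches "<title> (<4-digit year>)", allowing stray whitespace around the parts.
TITLE_YEAR = re.compile(r'\s*(.*?)\s*\((\d{4})\)\s*')

def split_titles(titles):
    result = {'title': [], 'year': []}
    for raw in titles:
        match = TITLE_YEAR.fullmatch(raw)
        if match:
            title, year = match.groups()
        else:
            # No "(YYYY)" suffix: keep the title and record the year as None
            # so both lists always have the same length.
            title, year = raw.strip(), None
        result['title'].append(title)
        result['year'].append(year)
    return result

print(split_titles(['Gus (1976)', 'The Simpsons (1989)', 'Untitled Project']))
# {'title': ['Gus', 'The Simpsons', 'Untitled Project'], 'year': ['1976', '1989', None]}
```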
Something like this
disney_data = {
'title': ['Gus (1976)',
'Johnny Kapahala: Back on Board (2007)',
'The Adventures of Huck Finn (1993)',
'The Simpsons (1989)',
'Atlantis: Milo’s Return (2003)']
}
new_disney_data = {'title': [], 'year': []}
#split title into two columns title and year in new dict
for title in disney_data['title']:
    new_disney_data['title'].append(title.split('(')[0].strip())  # text before '(', without the trailing space
    new_disney_data['year'].append(title.split('(')[1].split(')')[0])  # text between '(' and ')'
print(disney_data)
print(new_disney_data)
Using split and replace.
def split(data):
    o = {'title' : [], 'year' : []}
    for (t, y) in [d.replace(')','').split(' (') for d in data['title']]:
        o['title'].append(t)
        o['year'].append(y)
    return o
Using a regular expression
import re
def regex(data):
    # raw string so that \( and \d reach the regex engine unchanged
    r = re.compile(r"(.*?) \((\d{4})\)")
    o = {'title' : [], 'year' : []}
    for (t, y) in [r.findall(d)[0] for d in data['title']]:
        o['title'].append(t)
        o['year'].append(y)
    return o

How to convert list of lists to dict key value pairs python

I have a list of lists like so:
splitted = [['OID:XXXXXXXXXXX1',
' street:THE ROAD',
'town:NEVERPOOL',
'postcode:M1 2DD',
'Name:SOMEHWERE',
'street:THE ROAD',
'town:NEVERLAND',
'postcode:M1 2DD'],
['OID:XXXXXXXXXXX2',
' Name:30',
'street:DA PLACE',
'town:PERTH',
'postcode:PH1 2DD',
'Name:30',
'street:DA PLACE',
'town:PERTH',
'postcode:PH1 2DD']]
I'd like to convert these to key values pairs like so:
{'OID': 'XXXXXXXXXXX1', ' street': 'THE ROAD', 'town': 'NEVERPOOL', 'postcode': 'M1 2DD', 'Name': 'SOMEWHERE', 'street': 'THE ROAD', 'town': 'NEVERPOOL', 'postcode': 'M1 2DD'}, {'MPXN': 'XXXXXXXXXXX2', ' Name': '30', 'street': 'DA PLACE', 'town': 'PERTH', 'postcode': 'PH1 2DD', 'primaryName': '30', 'street1': 'DA PLACE', 'town': 'PERTH', 'postcode': 'PH1 2DD'}
I am unable to find a way online to convert a list of lists into key-value pairs as a dict to then be consumed by pandas. The purpose of this is to convert the dicts into a pandas DataFrame so I can then consume it and work with it in a tabular format
The code I have used thus far is here:
output = []
for list in splitted:
    for key_value in list:
        key, value = key_value.split(':', 1)
        if not output or key in output[0]:
            output.append({})
        output[-1][key] = value
The problem with the above code is that it does not maintain the list of lists and mixes up the OID field with other data items, I'd like a dict starting from each OID.
Any help would be greatly appreciated :)
The problem is that you append a new dict to output every time that the first element of output has the key you're looking for. This causes your code to fail, because after the first list in splitted has been processed, the first element of output looks like this:
{'OID': 'XXXXXXXXXXX1',
' street': 'THE ROAD',
'town': 'NEVERPOOL',
'postcode': 'M1 2DD',
'Name': 'SOMEHWERE',
'street': 'THE ROAD'}
and all key values you will see henceforth already exist in said element.
What you actually want to do is to add a new dict every time you encounter a new list in splitted.
output = []
for l in splitted:
    output.append(dict())
    for key_value in l:
        key, value = key_value.split(':', 1)
        output[-1][key] = value
And now you get what you expected:
[{'OID': 'XXXXXXXXXXX1',
' street': 'THE ROAD',
'town': 'NEVERLAND',
'postcode': 'M1 2DD',
'Name': 'SOMEHWERE',
'street': 'THE ROAD'},
{'OID': 'XXXXXXXXXXX2',
' Name': '30',
'street': 'DA PLACE',
'town': 'PERTH',
'postcode': 'PH1 2DD',
'Name': '30'}]
While I have your attention:
dicts preserve insertion order in Python 3.7+ (and were unordered before that), so "mixes up the OID field with other data items" isn't really a thing. You wanted to create a single dict with all those keys, but you ended up creating a bunch of dicts, each with one key (after the first one).
list is a built-in class in python, so creating a variable called list shadows this class. You shouldn't do this, because later you might encounter errors if you want to use the list class.
Debugging is a crucial skill for a programmer to have. I encourage you to take a look at these links: "How to debug small programs" and "What is a debugger and how can it help me diagnose problems?". You can use a debugger to step through your code and observe how each statement affects the state of your program, and this helps you figure out where you're going wrong.
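The point about shadowing list can be seen in a small sketch. The function name shadowing_demo is made up for illustration, and the shadowing is kept inside a function so it does not affect the rest of a program.

```python
def shadowing_demo():
    # Inside this function, the name "list" now refers to a list object,
    # not to the built-in list class.
    list = ['OID:XXXXXXXXXXX1', 'town:NEVERPOOL']
    try:
        # This would normally build ['a', 'b', 'c'], but here it tries to
        # call the list object above and fails.
        return list('abc')
    except TypeError as exc:
        return str(exc)

print(shadowing_demo())  # 'list' object is not callable
```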
I believe this is simpler:
splitted = [['OID:XXXXXXXXXXX1',
' street:THE ROAD',
'town:NEVERPOOL',
'postcode:M1 2DD',
'Name:SOMEHWERE',
'street:THE ROAD',
'town:NEVERLAND',
'postcode:M1 2DD'],
['OID:XXXXXXXXXXX2',
' Name:30',
'street:DA PLACE',
'town:PERTH',
'postcode:PH1 2DD',
'Name:30',
'street:DA PLACE',
'town:PERTH',
'postcode:PH1 2DD']]
output = []
for i in range(len(splitted)):
    output.append(dict())
    for j in splitted[i]:
        k,v = j.split(':')
        output[i][k] = v
the output is:
[{'OID': 'XXXXXXXXXXX1',
' street': 'THE ROAD',
'town': 'NEVERLAND',
'postcode': 'M1 2DD',
'Name': 'SOMEHWERE',
'street': 'THE ROAD'},
{'OID': 'XXXXXXXXXXX2',
' Name': '30',
'street': 'DA PLACE',
'town': 'PERTH',
'postcode': 'PH1 2DD',
'Name': '30'}]
You could also try to use dictionary comprehension:
[{x.split(":")[0]:x.split(":")[1] for x in splitted[0]},
{x.split(":")[0]:x.split(":")[1] for x in splitted[1]}]
The output is:
[{'OID': 'XXXXXXXXXXX1', ' street': 'THE ROAD', 'town': 'NEVERLAND', 'postcode': 'M1 2DD', 'Name': 'SOMEHWERE', 'street': 'THE ROAD'}, {'OID': 'XXXXXXXXXXX2', ' Name': '30', 'street': 'DA PLACE', 'town': 'PERTH', 'postcode': 'PH1 2DD', 'Name': '30'}]
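The comprehension above is written out once per sublist. A sketch that handles any number of sublists is shown below; the function name rows_to_dicts and the shortened sample data are assumptions for illustration. It also strips whitespace from keys, so ' street' and 'street' end up as the same key, which may or may not be what you want.

```python
def rows_to_dicts(rows):
    # One dict per inner list. Split each "key:value" item on the first
    # colon only, so values that contain ':' stay intact.
    return [
        {key.strip(): value.strip()
         for key, value in (item.split(':', 1) for item in row)}
        for row in rows
    ]

sample = [
    ['OID:XXXXXXXXXXX1', ' street:THE ROAD', 'town:NEVERPOOL', 'town:NEVERLAND'],
    ['OID:XXXXXXXXXXX2', ' Name:30', 'street:DA PLACE'],
]

records = rows_to_dicts(sample)
print(records)
# [{'OID': 'XXXXXXXXXXX1', 'street': 'THE ROAD', 'town': 'NEVERLAND'},
#  {'OID': 'XXXXXXXXXXX2', 'Name': '30', 'street': 'DA PLACE'}]
```

As in the other answers, a repeated key keeps only its last value. A list of dicts like records can then be passed to pandas.DataFrame to get the tabular form the question asks for.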

How to prevent duplicate UUID's on duplicate texts (in list of dicts) in python?

I have to filter texts that I process by checking if people's names appear in the text (texts). If they do appear, the texts are appended as nested list of dictionaries to the existing list of dictionaries containing people's names (people). However, since in some texts more than one person's name appears, the child document containing the texts will be repeated and added again. As a result, the child document does not contain a unique ID and this unique ID is very important, regardless of the texts being repeated.
Is there a smarter way of adding a unique ID even if the texts are repeated?
My code:
import uuid
people = [{'id': 1,
'name': 'Bob',
'type': 'person',
'_childDocuments_': [{'text': 'text_replace'}]},
{'id': 2,
'name': 'Kate',
'type': 'person',
'_childDocuments_': [{'text': 'text_replace'}]},
{'id': 3,
'name': 'Joe',
'type': 'person',
'_childDocuments_': [{'text': 'text_replace'}]}]
texts = ['this text has the name Bob and Kate',
'this text has the name Kate only ']
for text in texts:
    childDoc={'id': str(uuid.uuid1()), #the id will duplicate when files are repeated
              'text': text}
    for person in people:
        if person['name'] in childDoc['text']:
            person['_childDocuments_'].append(childDoc)
Current output:
[{'id': 1,
'name': 'Bob',
'type': 'person',
'_childDocuments_': [{'text': 'text_replace'},
{'id': '7752597f-410f-11eb-9341-9cb6d0897972', #duplicate ID here
'text': 'this text has the name Bob and Kate'}]},
{'id': 2,
'name': 'Kate',
'type': 'person',
'_childDocuments_': [{'text': 'text_replace'},
{'id': '7752597f-410f-11eb-9341-9cb6d0897972', #duplicate ID here
'text': 'this text has the name Bob and Kate'},
{'id': '77525980-410f-11eb-b667-9cb6d0897972',
'text': 'this text has the name Kate only '}]},
{'id': 3,
'name': 'Joe',
'type': 'person',
'_childDocuments_': [{'text': 'text_replace'}]}]
As you can see in the current output, the ID for the text 'this text has the name Bob and Kate' has the same identifier: '7752597f-410f-11eb-9341-9cb6d0897972' , because it is appended twice. But I would like each identifier to be different.
Desired output:
Same as current output, except we want every ID to be different for every appended text even if these texts are the same/duplicates.
Move the generation of the UUID inside the inner loop:
for text in texts:
    for person in people:
        if person['name'] in text:
            childDoc={'id': str(uuid.uuid1()),
                      'text': text}
            person['_childDocuments_'].append(childDoc)
This does not actually ensure that the UUIDs are unique. For that you need to keep a set of used UUIDs: when generating a new one, check whether it is already used and, if it is, generate another, repeating until you have either found an unused UUID or exhausted the UUID space.
There is a 1 in 2**61 chance that duplicates are generated. I can't accept collisions as they result in data loss. So when I use UUID I have a loop around the generator that looks like this:
used = set()
while True:
    identifier = str(uuid.uuid1())
    if identifier not in used:
        used.add(identifier)
        break
The used set is actually stored persistently. I don't like this code, although I have a program that uses it, because it ends up in an infinite loop when it can't find an unused UUID.
Some document databases provide automatic UUID assignment and they do this for you internally to ensure that a given database instance never ends up with two documents with the same UUID.
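One way to avoid the infinite loop mentioned above is to cap the number of attempts and raise an error instead. This is only a sketch: the function name new_unique_id, the max_attempts default, and the factory parameter are assumptions for illustration, and the used set is kept in memory here rather than stored persistently.

```python
import uuid

def new_unique_id(used, max_attempts=100, factory=uuid.uuid1):
    # Try a bounded number of times instead of looping forever.
    for _ in range(max_attempts):
        identifier = str(factory())
        if identifier not in used:
            used.add(identifier)
            return identifier
    raise RuntimeError(f'no unused UUID found after {max_attempts} attempts')

used = set()
first = new_unique_id(used)
second = new_unique_id(used)
print(first != second, len(used))
```

Inside the nested loop from the answer, the call str(uuid.uuid1()) could be replaced by new_unique_id(used), so every appended child document gets an identifier that has not been handed out before.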

How do I split this list into pairs or single elements? Is there an easier way of doing this?

In python I have a list of names, however some have a second name and some do not, how would I split the list into names with surnames and names without?
I don't really know how to explain it so please look at the code and see if you can understand (sorry if I have worded it really badly in the title)
See code below :D
names = ("Donald Trump James Barack Obama Sammy John Harry Potter")
# the names with surnames are the famous ones
# names without are regular names
list = names.split()
# I want to separate them into a list of separate names so I use split()
# but now surnames like "Trump" are counted as a name
print("Names are:",list)
This outputs
['Donald', 'Trump', 'James', 'Barack', 'Obama', 'Sammy', 'John', 'Harry', 'Potter']
I would like it to output something like ['Donald Trump', 'James', 'Barack Obama', 'Sammy', 'John', 'Harry Potter']
Any help would be appreciated
As said in the comments, you need a list of famous names.
# complete list of famous people
US_PRESIDENTS = (('Donald', 'Trump'), ('Barack', 'Obama'), ('Harry', 'Potter'))
def splitfamous(namestring):
    words = namestring.split()
    # create all tuples of 2 adjacent words and compare them to US_PRESIDENTS
    for i, name in enumerate(zip(words, words[1:])):
        if name in US_PRESIDENTS:
            words[i] = ' '.join(name)
            words[i+1] = None
    # remove empty fields and return the result
    return list(filter(None, words))
names = "Donald Trump James Barack Obama Sammy John Harry Potter"
print(splitfamous(names))
The resulting list:
['Donald Trump', 'James', 'Barack Obama', 'Sammy', 'John', 'Harry Potter']

Query dictionary based on a criteria and skip values that are missing

data = [
{'firstname': 'Tom ', 'lastname': 'Frank', 'title': 'Mr',
'education': 'B.Sc'},{'firstname': 'Anne ', 'middlename': 'David', 'lastname': 'Frank', 'title': 'Doctor',
'education': 'Ph.D'} , {'firstname': 'Ben ', 'lastname': 'William', 'title': 'Mr'}
]
I want to query the list of dictionaries based on the key 'education'. If a person's record does not have this key, the entire dictionary is skipped. The desired output is
[('Mr Tom Frank', 'B.Sc'),
('Doctor Anne David Frank', 'Ph.D')]
My attempt leaves extra spaces between Tom and Frank, as well as between Anne and David. Here is the actual output
[('Mr Tom   Frank', 'B.Sc'), ('Doctor Anne  David Frank', 'Ph.D')]
I would like to avoid this if possible.
Here is the code I have written. I apologize if the code does not seem to be readable enough and I am ready to take any comments.
def qualified_applicants(data):
    full_name_education=[ ]
    keys = ['title','firstname','middlename','lastname']
    for record in data:
        #check to see if 'education' is one of the key
        if 'education' in record.keys():
            full_name=[' '.join([record.get(key,'') for key in keys])]
            # make a tuple of education and full names
            full_name_education.append(tuple(full_name+[record['education']]))
    return full_name_education
You can use regex:
import re
data = [
{'firstname': 'Tom ', 'lastname': 'Frank', 'title': 'Mr',
'education': 'B.Sc'},{'firstname': 'Anne ', 'middlename': 'David', 'lastname': 'Frank', 'title': 'Doctor',
'education': 'Ph.D'} , {'firstname': 'Ben ', 'lastname': 'William', 'title': 'Mr'}
]
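keys = ['title', 'firstname', 'middlename', 'lastname']
# inner re.sub drops trailing whitespace from each part; outer re.sub collapses runs of whitespace
new_data = [
    (re.sub(r'\s{2,}', ' ', ' '.join(re.sub(r'\s+$', '', i.get(b, '')) for b in keys)), i['education'])
    for i in data if 'education' in i
]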
Output:
[('Mr Tom Frank', 'B.Sc'), ('Doctor Anne David Frank', 'Ph.D')]
The 'firstname' entries for your data appear to have a trailing blank. You can trim such leading and trailing white space using the strip method of the string returned by record.get(). This would make your list comprehension line be:
full_name = [' '.join([record.get(key,'').strip() for key in keys])]
to be tolerant of the extra whitespace.
FWIW, I think you would probably be better off having full_name not be a list but a plain string.
The code seems to work with the addition of one line, like so:
temp=[' '.join(record.get(key,'') for key in keys)]
full_name=[' '.join(full_name.split() ) for full_name in temp ]
The rest of the lines didn't need any change.
This could be verbose but it is working. What is the most pythonic way of achieving the same result?
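One possible answer to that last question, as a sketch rather than the definitive way: strip each name part, drop the empty ones, and only then join, so no cleanup pass is needed afterwards. The function name and keys follow the question; the shortened sample data is for illustration.

```python
def qualified_applicants(data):
    keys = ('title', 'firstname', 'middlename', 'lastname')
    result = []
    for record in data:
        if 'education' not in record:
            continue  # skip records without an education entry
        # Strip every part and keep only the non-empty ones, so a missing
        # middlename or a trailing blank cannot produce double spaces.
        parts = (record.get(key, '').strip() for key in keys)
        full_name = ' '.join(part for part in parts if part)
        result.append((full_name, record['education']))
    return result

data = [
    {'firstname': 'Tom ', 'lastname': 'Frank', 'title': 'Mr', 'education': 'B.Sc'},
    {'firstname': 'Anne ', 'middlename': 'David', 'lastname': 'Frank',
     'title': 'Doctor', 'education': 'Ph.D'},
    {'firstname': 'Ben ', 'lastname': 'William', 'title': 'Mr'},
]

print(qualified_applicants(data))
# [('Mr Tom Frank', 'B.Sc'), ('Doctor Anne David Frank', 'Ph.D')]
```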
