Extracting Strings From a List - python

Hi I'm fairly new to Python and needed help with extracting strings from a list. I am using Python on Visual Studios.
I have hundreds of similar strings and I need to extract specific information so I can add it to a table in columns - the aim is to automate this task using python. I would like to extract the data between the headers 'Names', 'Ages' and 'Jobs'. The issue I am facing is that the number of entries of names, ages and jobs varies a lot within all the lists and so I would like to write unique code which could apply to all the lists.
list_x = ['Names', 'Ashley', 'Lee', 'Poonam', 'Ages', '25', '35', '42', 'Jobs', 'Doctor', 'Teacher', 'Nurse']
I am struggling to extract
['Ashley', 'Lee', 'Poonam']
I have tried the following:
for x in list_x:
    if x == 'Names':
        for y in list_x:
            if y == 'Ages':
                print(list_x[x:y])
This, however, raises the following error:
"Exception has occurred: TypeError
slice indices must be integers or None or have an __index__ method"
Is there a way of doing this without specifying exact indices?

As the comment suggested, editing the data is the easiest way to go, but if you have to work with the list as it is:
newList = oldList[oldList.index('Names') + 1:oldList.index("Ages")]
It just finds the indices of "Names" and "Ages" in the list, and extracts the bit between.
Lots can (and will) go wrong with this method though - if there's a name which is "Names", or if they are misspelt, etc.
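To guard against at least the missing-header case, the slice can be wrapped in a small helper. This is only a sketch: the name slice_between is made up for illustration, and returning an empty list when a header is absent is an assumption about what you would want, not the only reasonable choice.

```python
def slice_between(items, start_header, end_header):
    """Return the items strictly between two headers, or [] if either is missing."""
    try:
        start = items.index(start_header) + 1
        # Search for the end header only after the start header
        end = items.index(end_header, start)
    except ValueError:
        # list.index raises ValueError when the value is not found
        return []
    return items[start:end]


list_x = ['Names', 'Ashley', 'Lee', 'Poonam', 'Ages', '25', '35', '42',
          'Jobs', 'Doctor', 'Teacher', 'Nurse']
print(slice_between(list_x, 'Names', 'Ages'))    # ['Ashley', 'Lee', 'Poonam']
print(slice_between(list_x, 'Names', 'Salary'))  # []
```

It still cannot tell a person called "Names" apart from the header, so that caveat remains.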

For completeness' sake, an approach like the one below may be worth considering.
First, build a list of indices of each of the desired headers:
list_x = ['Names', 'Ashley', 'Lee', 'Poonam', 'Ages', '25', '35', '42', 'Jobs', 'Doctor', 'Teacher', 'Nurse']
headers = ('Names', 'Ages', 'Jobs')
header_indices = [list_x.index(header) for header in headers]
print('indices:', header_indices) # [0, 4, 8]
Then, create a list of values for each header, which we can infer from the positions where each header shows up in the list:
values = {}
for i in range(len(header_indices)):
    header = headers[i]
    start = header_indices[i] + 1
    try:
        values[header] = list_x[start:header_indices[i + 1]]
    except IndexError:
        values[header] = list_x[start:]
And finally, we can display it for debugging purposes:
print('values:', values)
# {'Names': ['Ashley', 'Lee', 'Poonam'], 'Ages': ['25', '35', '42'], 'Jobs': ['Doctor', 'Teacher', 'Nurse']}
assert values['Names'] == ['Ashley', 'Lee', 'Poonam']
For O(N) time complexity, we can instead make a single pass over the list and build a dict of the values. Note that this relies on the headers appearing in the list in the same order as in headers:
from collections import defaultdict
values = defaultdict(list)
header_idx = -1
for x in list_x:
    if x in headers:
        header_idx += 1
    else:
        values[headers[header_idx]].append(x)
print('values:', values)
# defaultdict(<class 'list'>, {'Names': ['Ashley', 'Lee', 'Poonam'], 'Ages': ['25', '35', '42'], 'Jobs': ['Doctor', 'Teacher', 'Nurse']})
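Since the question mentions hundreds of lists with varying numbers of entries, the single-pass idea can be packaged as a function and reused. The sketch below is one possible variant, not the answer's exact code: the name split_by_headers is hypothetical, and it tracks the current header by name rather than by counter, so it does not depend on the header order.

```python
def split_by_headers(items, headers=('Names', 'Ages', 'Jobs')):
    """Group items under the most recently seen header (single pass)."""
    sections = {header: [] for header in headers}
    current = None
    for item in items:
        if item in sections:
            current = item                  # start collecting under this header
        elif current is not None:
            sections[current].append(item)  # items before any header are skipped
    return sections


list_x = ['Names', 'Ashley', 'Lee', 'Poonam', 'Ages', '25', '35', '42',
          'Jobs', 'Doctor', 'Teacher', 'Nurse']
short_list = ['Names', 'Ashley', 'Ages', '25', '31', 'Jobs']

print(split_by_headers(list_x)['Names'])  # ['Ashley', 'Lee', 'Poonam']
print(split_by_headers(short_list))
# {'Names': ['Ashley'], 'Ages': ['25', '31'], 'Jobs': []}
```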

Related

Removing duplicates from an array of 1 million dictionaries

I have a HUGE list of objects.
That looks like this
output = [
    {
        'name': 'some name',
        'id': '1'
    }
]
clean_list = []
for i in range(len(output)):
    if output[i] not in output[i + 1:]:
        clean_list.append(output[i])
It has 1 million items, and this is the method I'm currently using. However, it takes hours to complete when my array is massive.
Is there an optimal way to remove duplicates?
There are two issues with the code you propose.
1. speed of "in" test
You wrote
clean_list = []
but you really want
clean_list = set()
That way in completes in O(1) constant
time, rather than O(N) linear time.
The linear probe inside the for
gives the loop O(N^2) quadratic cost.
2. equality test of hashable items
Your source items look like e.g. dict(name='some name', id=1).
You want to turn them into hashable tuples,
e.g. ('some name', 1),
so you can take advantage of set's O(1) fast lookups.
from pprint import pp
clean = set()
for d in output:
    t = tuple(d.values())
    if t not in clean:
        clean.add(t)
pp(sorted(clean))
But wait! No need for checks, since a set will
reject attempts to add same thing twice.
So this suffices:
clean = set()
for d in output:
    clean.add(tuple(d.values()))
And now it's simple enough that a set comprehension makes sense.
clean = {tuple(d.values())
for d in output}
Consider uniquifying on just name,
if the id value doesn't matter to you.
tl;dr: Doing ~ 10^6 operations is way better
than doing ~ 10^12 of them.
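If the result needs to stay a list of dicts in the original order, the same set idea can be used just for the membership test. This is a sketch under two assumptions: all values are hashable, and two dicts count as duplicates when they have the same key/value pairs regardless of key order. The function name dedupe is made up for illustration.

```python
def dedupe(records):
    """Keep the first occurrence of each distinct dict, in O(N) expected time."""
    seen = set()
    unique = []
    for record in records:
        # Sorting the items makes the key independent of key insertion order
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


output = [
    {'name': 'some name 1', 'id': '1'},
    {'name': 'some name 2', 'id': '2'},
    {'id': '1', 'name': 'some name 1'},  # same content, different key order
]
clean_list = dedupe(output)
print(clean_list)
# [{'name': 'some name 1', 'id': '1'}, {'name': 'some name 2', 'id': '2'}]
```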
For such a large dataset it makes sense to use pandas, something like this:
import pandas as pd
output = [
{'name': 'some name 1','id': '1'},
{'name': 'some name 2','id': '2'},
{'name': 'some name 3','id': '3'},
{'name': 'some name 2','id': '2'}]
clean_list = pd.DataFrame(output).drop_duplicates().to_dict('records')
>>> clean_list
[{'name': 'some name 1', 'id': '1'},
 {'name': 'some name 2', 'id': '2'},
 {'name': 'some name 3', 'id': '3'}]

How to print a dictionary based on a value

This is a list of dictionaries.
It is sample data; there are more items in the real list.
I want to get a dictionary based on one of its values.
[{'status_id': '153080620724_10157915294545725', 'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT', 'link_name': 'Timeline Photos', 'status_type': 'photo', 'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3', 'status_published': '10/17/2016 20:56:51', 'num_reactions': '6813', 'num_comments': '543', 'num_shares': '359', 'num_likes': '6178', 'num_loves': '572', 'num_wows': '39', 'num_hahas': '17', 'num_sads': '0', 'num_angrys': '7'}
{'status_id': '153080620724_10157914483265725', 'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 8th, we will END this RIGGED system once and for all!", 'link_name': '', 'status_type': 'video', 'status_link': 'https://www.facebook.com/DonaldTrump/videos/10157914483265725/', 'status_published': '10/17/2016 18:00:41', 'num_reactions': '33768', 'num_comments': '3644', 'num_shares': '17653', 'num_likes': '26649', 'num_loves': '487', 'num_wows': '1155', 'num_hahas': '75', 'num_sads': '191', 'num_angrys': '5211'}
{'status_id': '153080620724_10157913199155725', 'status_message': "Crooked Hillary's State Department colluded with the FBI and the DOJ in a DISGRACEFUL quid pro quo exchange where her staff promised FBI agents more overseas positions if the FBI would alter emails that were classified. This is COLLUSION at its core and Crooked Hillary's super PAC, the media, is doing EVERYTHING they can to cover it up. It's a RIGGED system and we MUST not let her get away with this -- our country deserves better! Vote on Nov. 8 and let's take back the White House FOR the people and BY the people! #AmericaFirst! #RIGGED http://www.politico.com/story/2016/10/fbi-state-department-clinton-email-229880", 'link_name': '', 'status_type': 'video', 'status_link': 'https://www.facebook.com/DonaldTrump/videos/10157913199155725/', 'status_published': '10/17/2016 15:34:46', 'num_reactions': '85627', 'num_comments': '8810', 'num_shares': '32594', 'num_likes': '73519', 'num_loves': '2943', 'num_wows': '1020', 'num_hahas': '330', 'num_sads': '263', 'num_angrys': '7552'}
{'status_id': '153080620724_10157912962325725', 'status_message': 'JournoCash: Media gives $382,000 to Clinton, $14,000 Trump, 27-1 margin:', 'link_name': 'JournoCash: Media gives $382,000 to Clinton, $14,000 Trump, 27-1 margin', 'status_type': 'link', 'status_link': 'http://www.washingtonexaminer.com/journocash-media-gives-382000-to-clinton-14000-trump-27-1-margin/article/2604736', 'status_published': '10/17/2016 14:17:24', 'num_reactions': '22696', 'num_comments': '3665', 'num_shares': '5082', 'num_likes': '14029', 'num_loves': '122', 'num_wows': '2091', 'num_hahas': '241', 'num_sads': '286', 'num_angrys': '5927'}
]
I want to find the highest 'num_likes' value and print the status_id of the dictionary that has it. I also want to understand the method or process for implementing this. Currently I collect the values in a list and then find the maximum; is there any other way to do it?
The output should be just status_id.
Here I'm declaring your list of dictionaries as the variable list_of_objs.
Since the num_likes values are strings, use int(obj['num_likes']) to convert them to integers; passing those to max returns max_likes.
list_of_objs = [{..}, {..}, {..}]
max_likes = max([int(obj['num_likes']) for obj in list_of_objs if 'num_likes' in obj.keys()])
print(max_likes)
max_likes_objs = [obj for obj in list_of_objs if int(obj['num_likes']) == max_likes]
print(max_likes_objs)
The last line prints the list of all the dictionaries that have the maximum num_likes value.
You can try this:
k = max(int(i['num_likes']) for i in d)  # convert to int so '6178' does not beat '14029' as a string
[i['status_id'] for i in d if int(i['num_likes']) == k][0]
Using a simpler list as example:
l = [
    {'likes': 5, 'id': 1},
    {'likes': 2, 'id': 2},
    {'likes': 7, 'id': 3},
    {'likes': 1, 'id': 4},
]
result = list(filter(lambda item: item['likes'] == max([item['likes'] for item in l]), l))
print(result)
This will print [{'likes': 7, 'id': 3}]. The catch is that more than one item can have the maximum number of likes, which is why the result is a list. To print all of the IDs you can do:
print([item['id'] for item in result])
If you are sure there is only one such item, or you need exactly one anyway (say, the first), you can do:
result = list(filter(lambda item: item['likes'] == max([item['likes'] for item in l]), l))
result = result[0]['id']
print(result)
which will print 3 in the example.
Now how to approach this problem: first you need the maximum number of likes:
max([item['likes'] for item in l])
Call it maxLikes. Then you need to take all the items with this likes value:
filter(lambda item: item['likes'] == maxLikes, l)
this is a filter applied on the list l (the last argument on the right), with a lambda function that could be read as "all items with 'likes' property equal to the maxLikes number".
Then you transform this in a list with list.
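When only one item is needed, the two steps can be collapsed by giving max a key function, so it returns the whole dictionary instead of just the number. The sketch below uses shortened, made-up status_id values with the num_likes strings from the question. On a tie, max returns the first maximal item it encounters.

```python
# Hypothetical short IDs; num_likes kept as strings, as in the question's data
posts = [
    {'status_id': 'a', 'num_likes': '6178'},
    {'status_id': 'b', 'num_likes': '26649'},
    {'status_id': 'c', 'num_likes': '73519'},
    {'status_id': 'd', 'num_likes': '14029'},
]

# key= tells max what to compare; int() makes the comparison numeric
top = max(posts, key=lambda post: int(post['num_likes']))
print(top['status_id'])  # c
```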
Declare list_of_status_ids = [{}, {} ...].
Iterate over list_of_status_ids and build a dict with num_likes as keys and lists of status_id as values.
Then take the max of num_likes and get all the status_id values corresponding to it.
from collections import defaultdict
status_id_map = defaultdict(list)
for obj in list_of_status_ids:
    # Convert to int so the keys compare numerically, not as strings
    status_id_map[int(obj['num_likes'])].append(obj['status_id'])
print(status_id_map[max(status_id_map)])

searching for distinct dict in a list

With the following list of dict:
[ {'isn': '1', 'fid': '4', 'val': '1', 'ptm': '05/08/2019 14:22:39', 'sn': '111111', 'procesado': '0'},
{'isn': '1', 'fid': '4', 'val': '0', 'ptm': '05/08/2019 13:22:39', 'sn': '111111', 'procesado': '0'},
<...> ]
For each dict in the list, I need to check whether there is another element with:
equal fid
equal sn
distinct val (if val(elemX)=0 then val(elemY)=1)
distinct ptm (if val=0 of elemX then ptm of elemX < ptm of elemY)
This could be done in the traditional way, using an external for loop and an internal while, but that is not the optimal way to do it.
Trying to find a way to do that, I tried with something like this:
for p in lista:
    print([item for item in lista if ((item["sn"] == p["sn"]) & (item["val"] == 0) & (p["val"] == 1) & (
        datetime.strptime(item["ptm"], '%d/%m/%Y %H:%M:%S') < datetime.strptime(p["ptm"],'%d/%m/%Y %H:%M:%S')))])
But this does not work (and also is not optimal)
Just build a mapping from (fid,sn,val) onto a list of candidates (the whole dict, its index, or just its ptm (shown below), depending on what output you need). Also check whether any of its opposite numbers (under (fid,sn,!val)) are already present and do the ptm comparison if so:
seen={}
for d in dd:
    f=d['fid']; s=d['sn']; v=int(d['val'])
    p=datetime.strptime(d['ptm'],'%d/%m/%Y %H:%M:%S')
    for p0 in seen.get((f,s,not v),()):
        if p0!=p and (p0<p)==v: …
    seen.setdefault((f,s,v),[]).append(p)
If you have a large number of values with the same key, you could use a tree to hasten the ptm comparisons, but that seems unlikely here. Using real data types for the individual values, and perhaps a namedtuple to contain them, would of course make this a lot nicer.
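As a rough illustration of the namedtuple suggestion, here is one way the same mapping idea could look with parsed values. Several things here are assumptions rather than part of the answer: the names Reading, parse and find_pairs are invented, val is assumed to be only 0 or 1, and collecting matching (val=0, val=1) pairs in a list is just one guess at what to do on a match.

```python
from collections import defaultdict, namedtuple
from datetime import datetime

Reading = namedtuple('Reading', 'fid sn val ptm')
PTM_FORMAT = '%d/%m/%Y %H:%M:%S'


def parse(d):
    """Convert one raw dict of strings into a typed Reading."""
    return Reading(d['fid'], d['sn'], int(d['val']),
                   datetime.strptime(d['ptm'], PTM_FORMAT))


def find_pairs(dicts):
    """Return (zero, one) pairs sharing fid and sn where zero.ptm < one.ptm."""
    seen = defaultdict(list)  # (fid, sn, val) -> readings seen so far
    pairs = []
    for r in map(parse, dicts):
        # Only compare against readings with the opposite val
        for other in seen[(r.fid, r.sn, 1 - r.val)]:
            zero, one = (r, other) if r.val == 0 else (other, r)
            if zero.ptm < one.ptm:
                pairs.append((zero, one))
        seen[(r.fid, r.sn, r.val)].append(r)
    return pairs


lista = [
    {'isn': '1', 'fid': '4', 'val': '1', 'ptm': '05/08/2019 14:22:39', 'sn': '111111', 'procesado': '0'},
    {'isn': '1', 'fid': '4', 'val': '0', 'ptm': '05/08/2019 13:22:39', 'sn': '111111', 'procesado': '0'},
]
pairs = find_pairs(lista)
print(len(pairs))  # 1
```

Each element is compared only with earlier elements in the same (fid, sn) group, so every qualifying pair is found once.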

Returning a list of dictionaries by iterating over given data

So let's say I've a list of students and there is a dictionary containing some data for each student as given below :
students= [{'student_name': 'name1',
'regNO': '12',
},{'student_name': 'name2',
'regNO': '13',
},{'student_name': 'name3',
'regNO': '14',
}
]
So based on the above data I want to return another list of dictionaries containing data for each student,
I wrote the following code :
res_dict = {}
res_list = []
for student in students:
    res_dict['name']=student['student_name']
    res_list.append(res_dict)
print(res_list)
I was hoping that in the output, for each student , there would be a dictionary with key being 'name' and value being the student name taken from 'students' list. I expected it to be as follows :
[{'name': 'name1'}, {'name': 'name2'}, {'name': 'name3'}]
But the output turned out to be this :
[{'name': 'name3'}, {'name': 'name3'}, {'name': 'name3'}]
Can anyone help me identify the issue in my code ?
A better way to get the desired result is a list comprehension:
[{'name': student['student_name']} for student in students]
The issue with your code is that you keep updating the same dict object and appending that same object to the list again. Change your code to:
for student in students:
    res_dict = {}  # Create new `dict` object
    res_dict['name'] = student['student_name']
    res_list.append(res_dict)
OR, you may just do:
for student in students:
    res_list.append({'name': student['student_name']})
A subtle point about lists is that append does not copy the item you add. The list just stores a reference to the object, similar to an array of pointers in C. Because the loop appends the same res_dict every time, all the items in the list point to the same object, so print(res_list) gives [{'name': 'name3'}, {'name': 'name3'}, {'name': 'name3'}]. You can verify this with this small fragment of code:
for i in res_list:
    print(id(i))
You will find the same memory address for all the list elements.
But when you take a copy of the dictionary with d.copy() and append that to res_list, the same id(i) loop shows that the list elements now point to different objects.
So finally the corrected code would be
students= [{'student_name': 'name1',
'regNO': '12',
},{'student_name': 'name2',
'regNO': '13',
},{'student_name': 'name3',
'regNO': '14',
}
]
res_dict = {}
res_list = []
for student in students:
    res_dict['name']=student['student_name']
    res_list.append(res_dict.copy())
print(res_list)
Using list comprehension would always expose the contents to be modified.
This is a classic reference issue. In other words you are appending the reference to the same dict object. Thus, when you change the name value on it, for all times it shows up in the list it will reflect its new value. Instead, create a new dict for each iteration and append that :)
Here is a simple way to do it.
n_list = [{'name':i['student_name']} for i in students]
You should reset res_dict = {} to an empty dict within your loop for each new entry.
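The difference between the two behaviours can be checked directly with the is operator, which tests whether two names refer to the same object. A minimal sketch (the variable names shared, aliased and fresh are just for illustration):

```python
# One dict reused on every iteration: the list holds the same object twice
shared = {}
aliased = []
for name in ('name1', 'name2'):
    shared['name'] = name
    aliased.append(shared)
print(aliased[0] is aliased[1])  # True
print(aliased)                   # [{'name': 'name2'}, {'name': 'name2'}]

# A new dict per iteration: the list holds two distinct objects
fresh = []
for name in ('name1', 'name2'):
    fresh.append({'name': name})
print(fresh[0] is fresh[1])      # False
print(fresh)                     # [{'name': 'name1'}, {'name': 'name2'}]
```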

How do I turn list values into an array with an index that matches the other dic values?

Hoping someone can help me out. I've spent the past couple hours trying to solve this, and fair warning, I'm still fairly new to python.
This is a repost of a question I recently deleted, because I misrepresented my code in the last example. The correct example is:
I have a dictionary, with a list that looks similar to:
dic = [
    {
        'name': 'john',
        'items': ['pants_1', 'shirt_2','socks_3']
    },
    {
        'name': 'bob',
        'items': ['jacket_1', 'hat_1']
    }
]
I'm using .append for both 'name', and 'items', which adds the dic values into two new lists:
for x in dic:
    dic_name.append(dic['name'])
    dic_items.append(dic['items'])
I need to split the item value using '_' as the delimiter, so I've also split the values by doing:
name, items = ([i if i is None else i.split('_')[0] for i in dic_name],
               [i if i is None else i.split('_')[0] for i in chain(*dic_items)])
None is used in case there is no value. This provides me with a new list for name, items, with the delimiter used. Disregard the fact that I used '_' split for names in this example.
When I use this, the indexes for name and items no longer match. Do I need to create the listed items in an array to match the name index, and if so, how?
Ideally, I want name[0] (which is john), to also match items[0] (as an array of the items in the list, so pants, shirt, socks). This way when I refer to index 0 for name, it also grabs all the values for items as index 0. The same thing regarding the index used for bob [1], which should match his items with the same index.
@avinash-raj, thanks for your patience, as I've had to update my question to reflect more closely the code I'm working with.
I'm reading a little bit between the lines but are you trying to just collapse the list and get rid of the field names, e.g.:
>>> dic = [{'name': 'john', 'items':['pants_1','shirt_2','socks_3']},
{'name': 'bob', 'items':['jacket_1','hat_1']}]
>>> data = {d['name']: dict(i.split('_') for i in d['items']) for d in dic}
>>> data
{'bob': {'hat': '1', 'jacket': '1'},
'john': {'pants': '1', 'shirt': '2', 'socks': '3'}}
Now the data is directly related vs. indirectly related via a common index into 2 lists. If you want the dictionary split out you can always
>>> dic_name, dic_items = zip(*data.items())
>>> dic_name
('bob', 'john')
>>> dic_items
({'hat': '1', 'jacket': '1'}, {'pants': '1', 'shirt': '2', 'socks': '3'})
You need a list of dictionaries because the duplicate keys name and items are overwritten:
items = [[i.split('_')[0] for i in d['items']] for d in your_list]
names = [d['name'] for d in your_list] # then grab names from list
Alternatively, you can do this in one line with the built-in zip method and generators, like so:
names, items = zip(*((i['name'], [j.split('_')[0] for j in i['items']]) for i in dic))
From Looping Techniques in the Tutorial.
for name, item in dic.items():
    names.append(name)
    items.append(item)
That will work if your dict is structured
{'name':[item1]}
In the loop body of
for x in dic:
    dic_name.append(dic['name'])
    dic_items.append(dic['items'])
you'll probably want to access x (to which the items in dic will be assigned in turn) rather than dic.
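With that change, and with the split applied per person instead of over a flattened chain, the two lists stay aligned by index. A sketch assuming every entry has both a 'name' and an 'items' key (the list names names and items are chosen here for illustration):

```python
dic = [
    {'name': 'john', 'items': ['pants_1', 'shirt_2', 'socks_3']},
    {'name': 'bob', 'items': ['jacket_1', 'hat_1']},
]

names = []
items = []
for x in dic:
    names.append(x['name'])
    # Keep one inner list per person so the indexes line up with names
    items.append([item.split('_')[0] for item in x['items']])

print(names[0], items[0])  # john ['pants', 'shirt', 'socks']
print(names[1], items[1])  # bob ['jacket', 'hat']
```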
