I'm stuck on the following problem:
I have a list with a ton of duplicative data. This includes entry numbers and names.
The following gives me a list of unique (non-duplicative) names of people from the Data2014 table:
tablequery = c.execute("SELECT * FROM Data2014")
tablequery_results = list(tablequery)  # was list(people2014), but people2014 was never defined
people2014_count = len(tablequery_results)  # total row count (overwritten below with the unique-name count)

people2014_list = []
for i in tablequery_results:
    if i[1] not in people2014_list:  # i[1] is the person's name
        people2014_list.append(i[1])
people2014_count = len(people2014_list)

# for i in people2014_list:
#     print(i)
Now that I have a list of people, I need to iterate through tablequery_results again, but this time I need to find the number of unique entry numbers each person has. There are tons of duplicates in tablequery_results. Without writing a separate block of code for each individual person's name, is there a way to iterate through tablequery_results using the names from people2014_list as the unique identifier? I can replicate the code from above to get a list of unique entry numbers, but I can't seem to match the names with the unique entry numbers.
Please let me know if that does not make sense.
Thanks in advance!
I discovered my answer after delving into SQL a bit more. This gives me a list with two columns: the person's name in the first column, and the number of entries that person has in the second column.
def people_data():
    data_fetch = c.execute("SELECT person, COUNT(*) AS `NUM` FROM Data2014 WHERE ACTION='UPDATED' GROUP BY Person ORDER BY NUM DESC")
    people_field_results = list(data_fetch)
    people_field_results_count = len(people_field_results)
    for i in people_field_results:
        print(i)
    print(people_field_results_count)
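One caveat worth noting: COUNT(*) counts rows, so if the same entry number can repeat for one person, COUNT(DISTINCT entry_number) gives the unique count instead (the entry-number column name is an assumption here). And for the pure-Python route the original question asked about, here is a minimal sketch over the rows already fetched, assuming the entry number sits in column 0 and the name in column 1:

from collections import defaultdict

# map each person's name to the set of their unique entry numbers
entries_by_person = defaultdict(set)
for row in tablequery_results:
    # row[1] = name, row[0] = entry number (assumed layout)
    entries_by_person[row[1]].add(row[0])

for person, entries in entries_by_person.items():
    print(person, len(entries))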
I am reading several CSV files. I do not know in advance whether the files contain duplicate customer information.
class customerIdentityCard:
    def __init__(self, id="", firstName="", lastName=""):  # default for id was 0, which would break id.upper()
        self.id = id.upper()  # alphanumeric field
        self.firstName = firstName.upper()
        self.lastName = lastName.upper()
before I blindly do a:
customerList = []
# in my for loop reading data from the csv file
customerList.append(customerIdentityCard(field[0], field[1], field[2]))
I want to check if the customer already exists in my customerList.
Since I know each customer has a unique ID number, I don't care if the other fields have spelling errors or other name variations. Just want to be sure that I am not putting duplicate IDs in my list.
Using Python 3.9.5 on Windows.
I would use a set() to maintain a unique collection of the ids that have already been added. A set() is a little like a list that only contains unique elements, but it is also fast to search. As others have suggested, there are also ways to solve this with a dict, though.
Maybe this to get started with:
all_customers = []  ## list from your csv
unique_customers = set()
customerList = []
for customer in all_customers:
    if customer[0] in unique_customers:
        continue
    customerList.append(customerIdentityCard(customer[0], customer[1], customer[2]))
    unique_customers.add(customer[0])
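As a variant, the same dedup can be done with a single dict keyed on the ID, which keeps the first card seen for each customer. A sketch reusing the customerIdentityCard class and the same row layout as above:

customers_by_id = {}
for customer in all_customers:
    card = customerIdentityCard(customer[0], customer[1], customer[2])
    # setdefault only stores the card if this ID has not been seen yet
    customers_by_id.setdefault(card.id, card)
customerList = list(customers_by_id.values())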
I have produced a set of matching IDs from a database collection that looks like this:
{ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
Each ObjectId represents an ID on a collection in the DB.
I got that list by doing this (which, incidentally, I also think I am doing wrong, but I don't yet know another way):
# Find all question IDs
question_list = list(mongo.db.questions.find())
all_questions = []
for x in question_list:
    all_questions.append(x["_id"])

# Find all con IDs that match the question IDs
con_id = list(mongo.db.cons.find())
con_id_match = []
for y in con_id:
    con_id_match.append(y["question_id"])

matches = set(con_id_match).intersection(all_questions)
print("matches", matches)
print("all_questions", all_questions)
print("con_id_match", con_id_match)
That brings up all the IDs associated with a match, such as the three at the top of this post. What each print produces is shown at the bottom of this post.
Now I want to get each ObjectId separately as a variable so I can search for these in the collection.
mongo.db.cons.find_one({"con": matches})
where matches (which will probably need to be a new variable) would be each individual ObjectId that matches the DB reference.
So, how do I separate the ObjectIds in matches so I get one at a time when iterating? I tried a for loop, but it threw an error; I guess I am writing it wrong for a set. Thanks for the help.
Print Statements:
**matches** {ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
**all_questions** [ObjectId('5feafb52ae1b389f59423a91'), ObjectId('5feafb64ae1b389f59423a92'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feb247f1bb7a1297060342e'), ObjectId('6009b6e42b74a187c02ba9d7'), ObjectId('6010822e08050e32c64f2975'), ObjectId('601d125b3c4d9705f3a9720d')]
**con_id_match** [ObjectId('5feb247f1bb7a1297060342e'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8')]
Usually you can just use the find method, which yields documents one by one, and filter them during iteration in Python like this:
# fetch only the ids
question_ids = {question['_id'] for question in mongo.db.questions.find({}, {'_id': 1})}

matches = []
for con in mongo.db.cons.find():
    con_id = con['question_id']
    if con_id in question_ids:
        # you can process the matched, already-loaded con here
        matches.append(con_id)

print(matches)
If you have a huge amount of data, you can take a look at the aggregation framework.
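To then load the matching documents, you can either iterate the matches one at a time, or push the whole filter into a single query with the standard $in operator. A sketch using the field names from the question (list(matches) works whether matches is the set or the list version):

# one document at a time, as the question's find_one attempt suggests
for match_id in matches:
    con = mongo.db.cons.find_one({"question_id": match_id})
    print(con)

# or a single query for all of them at once
for con in mongo.db.cons.find({"question_id": {"$in": list(matches)}}):
    print(con)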
I am working on a project and I need to write a function to generate IDs for every client in our company. There's already an existing list of clients, and some of them have a numerical 5- to 6-digit ID ranging from 40,000 to 200,000. Other existing clients do not have an ID, and I would like to stay consistent with the already existing ID numbers (e.g. 43606 or 125490).
So, in order to keep a similar format, I created an exclusion list that contains all of the existing ID numbers. Then I was going to write a function using np.random.uniform(low=40000, high=200000) to generate a number within that range that looks similar to the other ID numbers.
The problem is that I don't know how to set up a loop that checks whether the randomly generated ID is already in the exclusion list and, if so, generates a new one.
This is what I have so far:
exclusions = [43606,125490,...]

def ID_Generator(new_clients):  # This is a list of new clients
    new_client_IDs = []
    for client in new_clients:
        ID = int(np.random.uniform(low=40000, high=200000))
        while ID not in exclusions:
            new_client_IDs.append(ID)
I am not sure how to handle the scenario where the randomly generated number is in the exclusion list. I would love the function to output a DataFrame containing the client names in one column and the ID numbers in a second column.
Appreciate any help on this!
Similar answer to Niranjan's, but with no list comprehension needed:
import numpy as np
import pandas as pd

exclusion_list = [43606, 125490]

free_ids = np.arange(40000, 200000)
free_ids = free_ids[~np.isin(free_ids, exclusion_list)]  # drop the excluded ids

def get_ids(client_names):
    new_client_ids = np.random.choice(free_ids, len(client_names), replace=False)
    return pd.DataFrame(data=new_client_ids, index=client_names, columns=["id"])

print(get_ids(["Bob", "Fred", "Max"]))
which gives
          id
Bob   125205
Fred  185058
Max    86158
The simplest approach I can think of right now:
Generate a list from 40000-200000.
Remove all the exclusions from the above list.
Randomly pick any id from the remaining list (In case order matters, use ids sequentially).
import random

exclusions = [43606, 125490]
all_ids = range(40000, 200000)  # renamed from 'all' to avoid shadowing the built-in
remaining = [x for x in all_ids if x not in exclusions]
random.choice(remaining)
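If several IDs are needed at once, random.sample draws from the same remaining pool without replacement, so the batch itself cannot contain duplicates:

# e.g. IDs for three new clients in one go
new_ids = random.sample(remaining, 3)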
from random import randint  # this import was missing

exclusions = [43606,125490,...]

def ID_Generator(new_clients):  # This is a list of new clients
    new_client_IDs = []
    while len(new_client_IDs) < len(new_clients):
        ID = randint(40000, 200000)
        if ID not in exclusions:
            new_client_IDs.append(ID)
        # drop any duplicates generated within this batch (dict preserves insertion order),
        # so the while condition keeps drawing until there are enough unique IDs
        new_client_IDs = list(dict.fromkeys(new_client_IDs))
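To avoid the repeated dedup pass and also produce the name/ID table the question asked for, here is a sketch of a variant that tracks issued IDs in a set and returns a DataFrame (pandas assumed; id_generator is a hypothetical rename):

import random
import pandas as pd

exclusions = {43606, 125490}  # plus the rest of the existing IDs

def id_generator(new_clients):
    issued = set(exclusions)  # copy, so the module-level set is untouched
    new_client_ids = []
    for _ in new_clients:
        candidate = random.randint(40000, 200000)
        while candidate in issued:  # re-roll on any collision
            candidate = random.randint(40000, 200000)
        issued.add(candidate)
        new_client_ids.append(candidate)
    return pd.DataFrame({"client": new_clients, "id": new_client_ids})

print(id_generator(["Bob", "Fred", "Max"]))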
I have a Couch DB with followers and friends ids of a single twitter user. Friends are identified under the group “friend_edges” and followers under “follower_edges”.
I am trying to find ids of those who are both followers and friends (at the same time) of that user.
In order to do that, I was asked to convert the lists of followers and friends into sets, and then use the intersection operation between sets, like set1.intersection(set2).
Below is my code. It returns only 2 as the number of friends who are also followers. Since the dataset has almost 2,000 ids, I'm positive this value is wrong.
Can someone tell me what is wrong with my code?... I appreciate your guidance, but although there are many ways to program these tasks, I do need to use sets and .intersection, so please try to help me using those only... =)
from twitter_login import oauth_login
from twitter_DB import load_from_DB
from sets import Set

def friends_and_followers(users):
    # open a list for friends and another for followers
    friends_list, followers_list = [], []
    # find the user's ids under the label "friend_edges"
    if id in users["friend_edges"]:
        # loop over the "friend_edges" group and find the ids' values
        for value in id:
            # add value to the list of friends
            friends_list += value
    # put the rest of the ids under the followers' list
    else:
        followers_list += value
    return friends_list, followers_list
    print friends_list, followers_list  # unreachable: placed after the return

# convert the list of friends into a set
flist = set(friends_list)
# convert the list of followers into a set
follwlist = set(followers_list)
if __name__ == '__main__':
    twitter_api = oauth_login()
    # check couchdb to look at this database
    DBname = 'users-thatguy-+-only'
    # load all the tweets
    ff_results = load_from_DB(DBname)
    # show the number loaded
    print 'number loaded', len(ff_results)
    # iterate over the values in the file
    for user_id in ff_results:
        # run the function over the values
        both_friends_followers = friends_and_followers(user_id)
        print "Friends and Followers of that guy: ", len(both_friends_followers)
The reason you get a length of two is that you return this:
return friends_list, followers_list
which is a tuple of two lists; you then take the length of that tuple, which is two.
I managed to convert from a dictionary to a set by extracting the values and adding them to a list with list.append(), as follows:
if 'friend_edges' in doc.keys():
    flist = []
    for x in doc['friend_edges']:
        flist.append(x)
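With both lists extracted that way, the intersection the assignment asks for is a single call on the built-in set type (no need for the deprecated sets module). A minimal sketch, assuming each edge group is a flat list of ids, as the append loop above implies:

flist = set(doc['friend_edges'])
follwlist = set(doc['follower_edges'])
both_friends_followers = flist.intersection(follwlist)
print(len(both_friends_followers))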
I have a list containing 700,000 items and a dictionary containing 300,000 keys. Some of the 300k keys are contained within the 700k items stored in the list.
Now, I have built a simple comparison and handling loop:
import datetime

# the list contains about 700k lines - id,firstname,lastname,email,lastupdate
lines = open(r'myfile.csv', 'rb').readlines()  # renamed from 'list' to avoid shadowing the built-in

dictionary = {}
# the dictionary contains 300k ID keys (someID and datetime_object are placeholders)
dictionary[someID] = {'first': 'john',
                      'last': 'smith',
                      'email': 'john.smith@gmail.com',
                      'lastupdate': datetime_object}

for line in lines:
    id, firstname, lastname, email, lastupdate = line.split(',')
    lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')
    if id in dictionary.keys():
        # update dictionary[id]'s keys:values
        if lastupdate > dictionary[id]['lastupdate']:
            pass  # update values in dictionary[id]
    else:
        pass  # create new id inside dictionary and fill with keys:values
I wish to speed things up a little and use multiprocessing for this kind of job. For this, I thought I could split the list into four smaller lists, Pool.map each list, and check each separately with one of the four processes I'd create, producing four new dictionaries. The problem is that in order to create one whole dictionary with the last-updated values, I would have to repeat the process with the four newly created dictionaries, and so on.
Has anyone ever dealt with such a problem, and do you have a solution or an idea for it?
Thanks
if id in dictionary.keys():
NO! Please No! This is an O(n) operation!!! The right way to do it is simply
if id in dictionary
which takes O(1) time!!!
Before thinking about using multiprocessing etc., you should get rid of this really inefficient operation. If the dictionary has 300k keys, that line was probably the bottleneck.
I have assumed Python 2; if that is not the case, you should add the python-3.x tag. In Python 3, using key in dictionary.keys() is O(1), because .keys() now returns a view of the dict instead of a list of keys; however, it is still a bit faster to omit .keys().
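A minimal sketch of the corrected loop, identical to the question's version apart from the membership test (the merge logic stays as comments, as in the question):

for line in lines:
    id, firstname, lastname, email, lastupdate = line.split(',')
    lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')
    if id in dictionary:  # O(1): hashes straight into the dict, no key list built
        if lastupdate > dictionary[id]['lastupdate']:
            pass  # update values in dictionary[id]
    else:
        pass  # create a new id inside dictionary and fill with keys:values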
I think you should also start by not splitting the same line for each token over and over again:
id, firstname, lastname, email, lastupdate = line.split(',')
lastupdate = datetime.datetime.strptime(lastupdate,'%Y-%m-%d %H:%M:%S')