Flagging Entries with the Same Names? - python

I'm working with data where people have entered their names and some contact information. However, since they were unable to enter multiple entries for some of the fields, some people entered their names multiple times, resulting in 'duplicate' entries...
I'm trying to mark duplicate entries by the same user using a variable 'flag'.
For each row, what I want to happen is that if the name entry in the row is NOT the same as the name entry in the next row, the flag entry should increase by one.
How do I do this?
This is the code I currently have:
# FLAG 2
import csv
myjson = []
with(open("ieca_first_col_fake_text.txt", "rU")) as f:
    sheet = csv.DictReader(f,delimiter="\t")
    sheet.fieldnames.append('flag')
    print sheet.fieldnames
    for row in sheet:
        myjson.append(row)
flag_counter = 0
myjson[0]['flag'] = flag_counter
for i in range(len(myjson)-1):
    if myjson[i]['name'] != myjson[i+1]['name']:
        myjson[i+1]['flag'] = flag_counter + 1
    else:
        myjson[i]['flag'] = flag_counter
for i in range(len(myjson)):
    print myjson[i]
This is example data:
name phone email website area degree
Diane Grant Albrecht M.S.
Lannister G. Cersei M.A.T., CEP 111-222-3333 cersei#got.com www.got.com
Argle D. Bargle Ed.M.
Sam D. Man Ed.M. 000-000-1111 dman123#gmail.com www.daManWithThePlan.com
Sam D. Man Ed.M.
Sam D. Man Ed.M. 111-222-333 dman123#gmail.com www.daManWithThePlan.com
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
And this is the output that results from operating on the example data:
['name', 'phone', 'email', 'website', 'flag']
{'website': '', 'phone': '', 'flag': 0, 'name': 'Diane Grant Albrecht M.S.', 'email': ''}
{'website': 'www.got.com', 'phone': '111-222-3333', 'flag': 1, 'name': 'Lannister G. Cersei M.A.T., CEP', 'email': 'cersei#got.com'}
{'website': '', 'phone': '', 'flag': 1, 'name': 'Argle D. Bargle Ed.M.', 'email': ''}
{'website': 'www.daManWithThePlan.com', 'phone': '000-000-1111', 'flag': 0, 'name': 'Sam D. Man Ed.M.', 'email': 'dman123#gmail.com'}
{'website': None, 'phone': '', 'flag': 0, 'name': 'Sam D. Man Ed.M.', 'email': None}
{'website': 'www.daManWithThePlan.com', 'phone': '111-222-333', 'flag': None, 'name': 'Sam D. Man Ed.M.', 'email': ' dman123#gmail.com'}
{'website': '', 'phone': '', 'flag': 1, 'name': 'D G Bamf M.S.', 'email': ''}
{'website': '', 'phone': '', 'flag': 1, 'name': 'Amy Tramy Lamy Ph.D.', 'email': ''}
Note that the flags do not correspond to the desired pattern.
And here is an ideal output (notice the difference in flag entries):
['name', 'phone', 'email', 'website', 'flag']
{'website': '', 'phone': '', 'flag': 0, 'name': 'Diane Grant Albrecht M.S.', 'email': ''}
{'website': 'www.got.com', 'phone': '111-222-3333', 'flag': 1, 'name': 'Lannister G. Cersei M.A.T., CEP', 'email': 'cersei#got.com'}
{'website': '', 'phone': '', 'flag': 2, 'name': 'Argle D. Bargle Ed.M.', 'email': ''}
{'website': 'www.daManWithThePlan.com', 'phone': '000-000-1111', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': 'dman123#gmail.com'}
{'website': None, 'phone': '', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': None}
{'website': 'www.daManWithThePlan.com', 'phone': '111-222-333', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': ' dman123#gmail.com'}
{'website': '', 'phone': '', 'flag': 4, 'name': 'D G Bamf M.S.', 'email': ''}
{'website': '', 'phone': '', 'flag': 5, 'name': 'Amy Tramy Lamy Ph.D.', 'email': ''}

EDIT:
This loop works for me (output as expected). Your original never incremented flag_counter itself, so every flag came out as 0 or 1:
for i in range(len(myjson)-1):
    if myjson[i]['name'] != myjson[i+1]['name']:
        print "not same", myjson[i]['name'], ' ', myjson[i+1]['name']
        flag_counter = flag_counter + 1
    else:
        print 'equal', myjson[i]['name'], ' ', myjson[i+1]['name']
    # Always flag the next row, so the last row of a run of equal names is not left unset
    myjson[i+1]['flag'] = flag_counter
Note that I had to format the csv file by hand (the tabs weren't tabs, but spaces). Make sure it is correct in your file. The names have to match exactly; no additional spaces are allowed.
But I am not sure if this is the only bug, as there are many dangerous 'off-by-one' traps. If it still doesn't work, just update your output and code and we will see!
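The off-by-one risk goes away if you drop the index arithmetic and just remember the previous name. This is a minimal Python 3 sketch, assuming the rows are already loaded as a list of dicts with a 'name' key; the helper name assign_flags and the sample rows are made up for illustration:

```python
def assign_flags(rows, key="name"):
    """Give consecutive rows with the same key value the same flag."""
    flag = -1            # becomes 0 on the first row
    previous = object()  # sentinel that never equals a real name
    for row in rows:
        if row[key] != previous:
            flag += 1
            previous = row[key]
        row["flag"] = flag
    return rows


rows = [
    {"name": "Diane Grant Albrecht M.S."},
    {"name": "Sam D. Man Ed.M."},
    {"name": "Sam D. Man Ed.M."},
    {"name": "Sam D. Man Ed.M."},
    {"name": "D G Bamf M.S."},
]
flags = [row["flag"] for row in assign_flags(rows)]
print(flags)  # [0, 1, 1, 1, 2]
```

Every row is assigned exactly once, so no entry can be left as None.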

Related

get a part from a dictionary

I'm trying to get the pulse as output for the given URL using this code:
from OTXv2 import OTXv2
from OTXv2 import IndicatorTypes
otx = OTXv2("my_key")
test=otx.get_indicator_details_full(IndicatorTypes.DOMAIN, "google.com")
and when I print test I get this output:
{'general': {'sections': ['general', 'geo', 'url_list', 'passive_dns', 'malware', 'whois', 'http_scans'], 'whois': 'http://whois.domaintools.com/google.com', 'alexa': 'http://www.alexa.com/siteinfo/google.com', 'indicator': 'google.com', 'type': 'domain', 'type_title': 'Domain', 'validation': [{'source': 'ad_network', 'message': 'Whitelisted ad network domain www-google-analytics.l.google.com', 'name': 'Whitelisted ad network domain'}, {'source': 'akamai', 'message': 'Akamai rank: #3', 'name': 'Akamai Popular Domain'}, {'source': 'alexa', 'message': 'Alexa rank: #1', 'name': 'Listed on Alexa'}, {'source': 'false_positive', 'message': 'Known False Positive', 'name': 'Known False Positive'}, {'source': 'majestic', 'message': 'Whitelisted domain google.com', 'name': 'Whitelisted domain'}, {'source': 'whitelist', 'message': 'Whitelisted domain google.com', 'name': 'Whitelisted domain'}], 'base_indicator': {'id': 12915, 'indicator': 'google.com', 'type': 'domain', 'title': '', 'description': '', 'content': '', 'access_type': 'public', 'access_reason': ''}, 'pulse_info': {'count': 0, 'pulses': [], 'references': [], 'related': {'alienvault': {'adversary': [], 'malware_families': [], 'industries': []}, 'other': {'adversary': [], 'malware_families': [], 'industries': []}}}, 'false_positive':...
I want to get only the part 'count': 0 in pulse_info.
I tried using test.values(), but the result is several dictionaries nested together.
Any idea how I can solve that?
Thank you
print(test["general"]["pulse_info"]["count"])
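Chained indexing like this raises a KeyError if any level is missing. If you are not sure the response always contains those keys, a more defensive version is sketched below. The helper name get_pulse_count and the trimmed sample dict are assumptions for illustration, not part of the OTX SDK:

```python
def get_pulse_count(details):
    """Return general -> pulse_info -> count, or None if any level is missing."""
    general = details.get("general", {})
    pulse_info = general.get("pulse_info", {})
    return pulse_info.get("count")


# Trimmed stand-in for the structure printed above (not real API output).
sample = {"general": {"indicator": "google.com",
                      "pulse_info": {"count": 0, "pulses": []}}}
count = get_pulse_count(sample)
print(count)  # 0
```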

From MongoDB convert from dictionary to row with Pandas

This is a test document coming from MongoDB that I need to convert to MySQL. Sometimes there is more than one entry in "agents"; in that case I need each agent on its own row, and each of those rows should carry the same "display_name". For example, Walter should have Gloria on one row and Barb on the next, and both rows should have Walter Mosley under "display_name".
[{'name': 'Loomis, Gloria',
  'primaryemail': 'gloria#gmail.com',
  'primaryphone': '212-382-1121'},
 {'name': 'Hogson, Barb',
  'primaryemail': 'bho124#aol.com',
  'primaryphone': ''}]
I've tried this but it just splits out the key/values.
a,b,c = [[d[e] for d in test] for e in sorted(test[0].keys())]
print(a,b,c)
This is the original JSON format:
{'_id': ObjectId('58e6ececafb08d6'),
 'item_type': 'Contributor',
 'role': 0,
 'short_bio': 'Walter Mosley (b. 1952)',
 'firebrand_id': 1588,
 'display_name': 'Walter Mosley',
 'first_name': 'Walter',
 'last_name': 'Mosley',
 'slug': 'walter-mosley',
 'updated': datetime.datetime(2020, 1, 7, 8, 17, 11, 926000),
 'image': 'https://s3.amazonaws.com/8588-book-contributor.jpg',
 'social_media_name': '',
 'social_media_link': '',
 'website': '',
 'agents': [{'name': 'Loomis, Gloria',
             'primaryemail': 'gloria#gmail.com',
             'primaryphone': '212-382-1121'},
            {'name': 'Hogson, Barb',
             'primaryemail': 'bho124#aol.com',
             'primaryphone': ''}],
 'estates': [],
 'deleted': False}
If you have a list of dictionaries from your JSON file, try this:
JSON input :
inputJSON = [{'item_type': 'Contributor',
              'role': 0,
              'short_bio': 'Walter Mosley (b. 1952)',
              'firebrand_id': 1588,
              'display_name': 'Walter Mosley',
              'first_name': 'Walter',
              'last_name': 'Mosley',
              'slug': 'walter-mosley',
              'image': 'https://s3.amazonaws.com/8588-book-contributor.jpg',
              'social_media_name': '',
              'social_media_link': '',
              'website': '',
              'agents': [{'name': 'Loomis, Gloria',
                          'primaryemail': 'gloria#gmail.com',
                          'primaryphone': '212-382-1121'},
                         {'name': 'Hogson, Barb',
                          'primaryemail': 'bho124#aol.com',
                          'primaryphone': ''}],
              'estates': [],
              'deleted': False}]
Code :
import copy
finalJSON = []
for each in inputJSON:
    for agnt in each.get('agents'):
        newObj = copy.deepcopy(each)
        newObj['agents'] = agnt
        finalJSON.append(newObj)
print(finalJSON)
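The loop above still leaves each agent as a nested dict under 'agents'. If the target is a flat MySQL table, you may prefer one flat row per agent. This is a minimal sketch in plain Python, assuming you only need display_name plus the agent fields; the helper name explode_agents is made up for illustration:

```python
def explode_agents(documents):
    """Return one flat row per agent, repeating the contributor's display_name."""
    rows = []
    for doc in documents:
        # "or []" also covers documents where 'agents' is missing or None.
        for agent in doc.get("agents") or []:
            row = {"display_name": doc.get("display_name")}
            row.update(agent)  # name, primaryemail, primaryphone
            rows.append(row)
    return rows


docs = [{
    "display_name": "Walter Mosley",
    "agents": [
        {"name": "Loomis, Gloria", "primaryemail": "gloria#gmail.com",
         "primaryphone": "212-382-1121"},
        {"name": "Hogson, Barb", "primaryemail": "bho124#aol.com",
         "primaryphone": ""},
    ],
}]
agent_rows = explode_agents(docs)
for row in agent_rows:
    print(row["display_name"], "|", row["name"])
```

The resulting list of flat dicts can be passed straight to pandas.DataFrame. Note that a contributor with an empty 'agents' list produces no rows in this sketch.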

Get value from data-set field sublist

I have a dataset (that pulls its data from a dict) that I am attempting to clean and republish. Within this dataset, there is a field with a sublist that I would like to extract specific data from.
Here's the data:
[{'id': 'oH58h122Jpv47pqXhL9p_Q', 'alias': 'original-pizza-brooklyn-4', 'name': 'Original Pizza', 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/HVT0Vr_Vh52R_niODyPzCQ/o.jpg', 'is_closed': False, 'url': 'https://www.yelp.com/biz/original-pizza-brooklyn-4?adjust_creative=IelPnWlrTpzPtN2YRie19A&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=IelPnWlrTpzPtN2YRie19A', 'review_count': 102, 'categories': [{'alias': 'pizza', 'title': 'Pizza'}], 'rating': 4.0, 'coordinates': {'latitude': 40.63781, 'longitude': -73.8963799}, 'transactions': [], 'price': '$', 'location': {'address1': '9514 Ave L', 'address2': '', 'address3': '', 'city': 'Brooklyn', 'zip_code': '11236', 'country': 'US', 'state': 'NY', 'display_address': ['9514 Ave L', 'Brooklyn, NY 11236']}, 'phone': '+17185313559', 'display_phone': '(718) 531-3559', 'distance': 319.98144420799355},
Here's how the data is presented within the csv/spreadsheet:
location
{'address1': '9514 Ave L', 'address2': '', 'address3': '', 'city': 'Brooklyn', 'zip_code': '11236', 'country': 'US', 'state': 'NY', 'display_address': ['9514 Ave L', 'Brooklyn, NY 11236']}
Is there a way to pull location.city for example?
The below code simply adds a few fields and exports it to a csv.
def data_set(data):
    df = pd.DataFrame(data)
    df['zip'] = get_zip()
    df['region'] = get_region()
    newdf = df.filter(['name', 'phone', 'location', 'zip', 'region', 'coordinates', 'rating', 'review_count',
                       'categories', 'url'], axis=1)
    if not os.path.isfile('yelp_data.csv'):
        newdf.to_csv('data.csv', header='column_names')
    else:  # else it exists so append without writing the header
        newdf.to_csv('data.csv', mode='a', header=False)
If that doesn't make sense, please let me know. Thanks in advance!
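One possible way to pull location.city is sketched below. While the record is still a Python dict, the city is just a nested key lookup. Once the dict has been written to a CSV cell, it comes back as a string and has to be parsed first. The helper name get_city and the trimmed sample record are assumptions for illustration:

```python
import ast


def get_city(location):
    """Return the city from a location dict, or from its string form as stored in a CSV cell."""
    if isinstance(location, str):
        # A dict written to CSV comes back as its repr; literal_eval parses it safely.
        location = ast.literal_eval(location)
    return location.get("city")


record = {"name": "Original Pizza",
          "location": {"address1": "9514 Ave L", "city": "Brooklyn",
                       "zip_code": "11236"}}
city_from_dict = get_city(record["location"])
city_from_text = get_city(str(record["location"]))
print(city_from_dict, city_from_text)  # Brooklyn Brooklyn
```

Inside data_set, something like df['city'] = df['location'].apply(get_city) before the filter call should then give you a separate city column, provided 'city' is also added to the list of columns to keep.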

Python/Shell Script - Merging 2 rows of a CSV file where Address column has 'New Line' character

I have a CSV file which contains a couple of columns. For example:
FName,LName,Address1,City,Country,Phone,Email
Matt,Shew,"503, Avenue Park",Auckland,NZ,19809224478,matt#xxx.com
Patt,Smith,"503, Baker Street
Mickey Park
Suite 510",Austraila,AZ,19807824478,patt#xxx.com
Doug,Stew,"12, Main St.
21st Lane
Suit 290",Chicago,US,19809224478,doug#xxx.com
Henry,Mark,"88, Washington Park",NY,US,19809224478,matt#xxx.com
In Excel it looks something like this:
People tend to copy and paste addresses in this manner; sometimes they copy their signature and paste it into the Address column, which creates this situation.
I have tried reading this using the Python csv module, and it looks like Python doesn't distinguish between a '\n' newline inside a field value and the end of a line.
My code :
import csv
with open(file_path, 'r') as f_obj:
    input_data = []
    reader = csv.DictReader(f_obj)
    for row in reader:
        print row
The output looks something like this:
{'City': 'Auckland', 'Address1': '503, Avenue Park', 'LName': 'Shew', 'Phone': '19809224478', 'FName': 'Matt', 'Country': 'NZ', 'Email': 'matt#xxx.com'}
{'City': 'Austraila', 'Address1': '503, Baker Street\nMickey Park\nSuite 510', 'LName': 'Smith', 'Phone': '19807824478', 'FName': 'Patt', 'Country': 'AZ', 'Email': 'patt#xxx.com'}
{'City': 'Chicago', 'Address1': '12, Main St. \n21st Lane \nSuit 290', 'LName': 'Stew', 'Phone': '19809224478', 'FName': 'Doug', 'Country': 'US', 'Email': 'doug#xxx.com'}
{'City': 'NY', 'Address1': '88, Washington Park', 'LName': 'Mark', 'Phone': '19809224478', 'FName': 'Henry', 'Country': 'US', 'Email': 'matt#xxx.com'}
I just want to write the same content to a file where none of the values for the Address1 key contain a '\n' character, so that it looks like:
{'City': 'Auckland', 'Address1': '503, Avenue Park', 'LName': 'Shew', 'Phone': '19809224478', 'FName': 'Matt', 'Country': 'NZ', 'Email': 'matt#xxx.com'}
{'City': 'Austraila', 'Address1': '503, Baker Street Mickey Park Suite 510', 'LName': 'Smith', 'Phone': '19807824478', 'FName': 'Patt', 'Country': 'AZ', 'Email': 'patt#xxx.com'}
{'City': 'Chicago', 'Address1': '12, Main St. 21st Lane Suit 290', 'LName': 'Stew', 'Phone': '19809224478', 'FName': 'Doug', 'Country': 'US', 'Email': 'doug#xxx.com'}
{'City': 'NY', 'Address1': '88, Washington Park', 'LName': 'Mark', 'Phone': '19809224478', 'FName': 'Henry', 'Country': 'US', 'Email': 'matt#xxx.com'}
Any suggestions guys ???
PS:
I have more than 100K such records in my csv file !!!
You can replace the print row with a dict comprehension that replaces newlines in the values:
row = {k: v.replace('\n', ' ') for k, v in row.iteritems()}
print row
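For Python 3, iteritems() no longer exists and print is a function. Below is a minimal sketch of the whole round trip under those assumptions. It uses an in-memory sample instead of your file, the helper name flatten_newlines is made up, and it collapses every run of whitespace (not only '\n') into a single space, which also tidies the "St. \n21st" case:

```python
import csv
import io


def flatten_newlines(reader):
    """Yield each row with line breaks inside values collapsed to single spaces."""
    for row in reader:
        yield {
            key: " ".join(value.split()) if isinstance(value, str) else value
            for key, value in row.items()
        }


raw = (
    "FName,LName,Address1,City\n"
    'Matt,Shew,"503, Avenue Park",Auckland\n'
    'Patt,Smith,"503, Baker Street\nMickey Park\nSuite 510",Austraila\n'
)
cleaned = list(flatten_newlines(csv.DictReader(io.StringIO(raw))))

# Write the cleaned rows back out as CSV.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["FName", "LName", "Address1", "City"])
writer.writeheader()
writer.writerows(cleaned)
print(out.getvalue())
```

Because the rows are processed one at a time, 100K records should not be a memory problem if you write each row as you read it instead of building a list. With real files, open them with newline="" as the csv module documentation recommends.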

Removing duplicate entries?

I need to compare values from different rows. Each row is a dictionary, and I need to compare the values in adjacent rows for the key 'flag'. How would I do this? Simply saying:
for row in range(1,len(myjson)):
    if row['flag'] == (row-1)['flag']:
        print yes
returns a TypeError: 'int' object is not subscriptable
Even though range returns a list of ints...
RESPONSE TO COMMENTS:
List of rows is a list of dictionaries. Originally, I import a tab-delimited file and read it in using csv.DictReader, so that it becomes a list of dictionaries with the keys corresponding to the variable names.
Code: (where myjson is a list of dictionaries)
for row in myjson:
    print row
Output:
{'website': '', 'phone': '', 'flag': 0, 'name': 'Diane Grant Albrecht M.S.', 'email': ''}
{'website': 'www.got.com', 'phone': '111-222-3333', 'flag': 1, 'name': 'Lannister G. Cersei M.A.T., CEP', 'email': 'cersei#got.com'}
{'website': '', 'phone': '', 'flag': 2, 'name': 'Argle D. Bargle Ed.M.', 'email': ''}
{'website': 'www.daManWithThePlan.com', 'phone': '000-000-1111', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': 'dman123#gmail.com'}
{'website': '', 'phone': '', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': ''}
{'website': 'www.daManWithThePlan.com', 'phone': '111-222-333', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': 'dman123#gmail.com'}
{'website': '', 'phone': '', 'flag': 4, 'name': 'D G Bamf M.S.', 'email': ''}
{'website': '', 'phone': '', 'flag': 5, 'name': 'Amy Tramy Lamy Ph.D.', 'email': ''}
Also:
type(myjson)
<type 'list'>
For comparing adjacent items you can use zip:
Example:
>>> lis = [1,1,2,3,4,4,5,6,7,7]
>>> for x,y in zip(lis, lis[1:]):
...     if x == y :
...         print x,y,'are equal'
...
1 1 are equal
4 4 are equal
7 7 are equal
For your list of dictionaries, you can do something like :
from itertools import izip
it1 = iter(list_of_dicts)
it2 = iter(list_of_dicts)
next(it2)  # advance the second iterator so it runs one item ahead
for x,y in izip(it1, it2):
    if x['flag'] == y['flag']:
        print 'yes'
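In Python 3, izip is gone and the built-in zip is already lazy, so the same pairing can be written directly. This is a minimal sketch that also reports which positions match; the helper name adjacent_flag_matches and the sample rows are made up for illustration:

```python
def adjacent_flag_matches(rows, key="flag"):
    """Return (index, index + 1) pairs of neighbouring rows that share the same key value."""
    return [
        (i, i + 1)
        for i, (current, following) in enumerate(zip(rows, rows[1:]))
        if current[key] == following[key]
    ]


rows = [{"flag": 0}, {"flag": 1}, {"flag": 3}, {"flag": 3}, {"flag": 3}, {"flag": 4}]
matches = adjacent_flag_matches(rows)
print(matches)  # [(2, 3), (3, 4)]
```

This also shows why the loop in the question fails: iterating over range() yields integers, while iterating over the list (or zipping it with itself) yields the row dictionaries.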
Update:
For more than 2 adjacent items you can use itertools.groupby:
>>> from itertools import groupby
>>> lis = [1,1,1,1,1,2,2,3,4]
>>> for k,group in groupby(lis):
...     print list(group)
...
[1, 1, 1, 1, 1]
[2, 2]
[3]
[4]
For your code it would be :
>>> for k, group in groupby(dic, key = lambda x : x['flag']):
...     print list(group)
...
[{'website': '', 'phone': '', 'flag': 0, 'name': 'Diane Grant Albrecht M.S.', 'email': ''}]
[{'website': 'www.got.com', 'phone': '111-222-3333', 'flag': 1, 'name': 'Lannister G. Cersei M.A.T., CEP', 'email': 'cersei#got.com'}]
[{'website': '', 'phone': '', 'flag': 2, 'name': 'Argle D. Bargle Ed.M.', 'email': ''}]
[{'website': 'www.daManWithThePlan.com', 'phone': '000-000-1111', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': 'dman123#gmail.com'}, {'website': '', 'phone': '', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': ''}, {'website': 'www.daManWithThePlan.com', 'phone': '111-222-333', 'flag': 3, 'name': 'Sam D. Man Ed.M.', 'email': 'dman123#gmail.com'}]
[{'website': '', 'phone': '', 'flag': 4, 'name': 'D G Bamf M.S.', 'email': ''}]
[{'website': '', 'phone': '', 'flag': 5, 'name': 'Amy Tramy Lamy Ph.D.', 'email': ''}]
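If you only care about the groups that actually contain repeats, you can filter on the group length. This is a minimal Python 3 sketch; the helper name duplicate_groups and the shortened sample rows are assumptions for illustration. Keep in mind that groupby only groups consecutive items, so rows with the same flag must already be next to each other:

```python
from itertools import groupby
from operator import itemgetter


def duplicate_groups(rows, key="flag"):
    """Return runs of consecutive rows sharing a key value, keeping only runs longer than one."""
    groups = (list(group) for _, group in groupby(rows, key=itemgetter(key)))
    return [group for group in groups if len(group) > 1]


rows = [
    {"flag": 2, "name": "Argle D. Bargle Ed.M."},
    {"flag": 3, "name": "Sam D. Man Ed.M.", "phone": "000-000-1111"},
    {"flag": 3, "name": "Sam D. Man Ed.M.", "phone": ""},
    {"flag": 3, "name": "Sam D. Man Ed.M.", "phone": "111-222-333"},
    {"flag": 4, "name": "D G Bamf M.S."},
]
dupes = duplicate_groups(rows)
print(len(dupes), len(dupes[0]))  # 1 3
```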
Your exception indicates that list_of_rows is not what you think it is.
To look at other, adjacent rows, provided list_of_rows is indeed a list, I'd use enumerate() to include the current index and then use that index to load next and previous rows:
for i, row in enumerate(list_of_rows):
    previous_row = list_of_rows[i - 1] if i else None
    # named next_row so the built-in next() is not shadowed
    next_row = list_of_rows[i + 1] if i + 1 < len(list_of_rows) else None
Looks like you want to access list elements in batches:
http://code.activestate.com/recipes/303279/
You could try this
pre_item = list_of_rows[0]['flag']
for row in list_of_rows[1:]:
    if row['flag'] == pre_item:
        print 'yes'
    pre_item = row['flag']
list_of_rows = [ { 'a': 'foo',
                   'flag': 'bar' },
                 { 'a': 'blo',
                   'flag': 'bar' } ]
for row, successor_row in zip(list_of_rows, list_of_rows[1:]):
    if row['flag'] == successor_row['flag']:
        print "yes"
It's simple. If you need to remove the dicts that share a value for the key "flag", as the title of your post suggests (the title is somewhat misleading, because your dictionaries are not, strictly speaking, duplicates), you can loop over the whole list of dictionaries while keeping track of the flags seen so far in a separate list. If an item has a flag that is already in that list, don't add it to the result. It would look something like:
def filterDicts(listOfDicts):
    result = []
    flags = []
    for di in listOfDicts:
        if di["flag"] not in flags:
            result.append(di)
            flags.append(di["flag"])
    return result
When called with the list of dictionaries that you have provided, it returns a list with 6 items, each with a unique value of flag.
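For a large list, tracking the seen flags in a set instead of a list keeps each membership test constant time on average. This is a minimal Python 3 sketch of the same idea, assuming the flag values are hashable (ints or strings); the name filter_dicts and the sample rows are made up for illustration:

```python
def filter_dicts(list_of_dicts, key="flag"):
    """Keep the first dict seen for each value of key, preserving order."""
    seen = set()
    result = []
    for item in list_of_dicts:
        if item[key] not in seen:
            seen.add(item[key])
            result.append(item)
    return result


rows = [{"flag": f} for f in (0, 1, 2, 3, 3, 3, 4, 5)]
unique_rows = filter_dicts(rows)
print([row["flag"] for row in unique_rows])  # [0, 1, 2, 3, 4, 5]
```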
