I'm attempting to merge two datasets using unique client IDs that are present in only one dataset. I've assigned the unique IDs to each full name in a dictionary, but each person is unique, even if two people share the same name, so I need to assign each unique ID iteratively to each instance of that person's name.
example of the dictionary:
{'Corey Davis': {'names_id':[1472]}, 'Jose Hernandez': {'names_id': [3464,15202,82567,98472]}, ...}
I've already attempted using the .map() function as well as
referrals['names_id'] = referrals['full_name'].copy()
for key, val in m.items():
    referrals.loc[referrals.names_id == key, 'names_id'] = val
but of course, it only assigns the last value encountered, 98472.
I am hoping for something along the lines of:
full_name       names_id
Corey Davis     1472
Jose Hernandez  3464
Jose Hernandez  15202
Jose Hernandez  82567
Jose Hernandez  98472
but I get
full_name       names_id
Corey Davis     1472
Jose Hernandez  98472
Jose Hernandez  98472
Jose Hernandez  98472
Jose Hernandez  98472
Personally, what I would do is:
inputs = [{'full_name': 'test', 'names_id': [1]},
          {'full_name': 'test2', 'names_id': [2, 3, 4]}]
# Create a list of dictionaries, one per 'entry'
entries = []
for item in inputs:
    for name_id in item['names_id']:
        entries.append({'full_name': item['full_name'], 'names_id': name_id})
# Now you have a list of dicts - each being one line of your table
# entries is now
# [{'full_name': 'test', 'names_id': 1},
# {'full_name': 'test2', 'names_id': 2},
# {'full_name': 'test2', 'names_id': 3},
# {'full_name': 'test2', 'names_id': 4}]
# I like pandas and use it for its dataframes; you can create a dataframe from a list of dicts
import pandas as pd
final_dataframe = pd.DataFrame(entries)
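If the mapping is already in the nested-dict shape from the question, a newer-pandas alternative is `DataFrame.explode` (a sketch, assuming pandas >= 1.1 for `ignore_index`):

```python
import pandas as pd

# The name -> ids mapping from the question
m = {'Corey Davis': {'names_id': [1472]},
     'Jose Hernandez': {'names_id': [3464, 15202, 82567, 98472]}}

# One row per name, with the id list still nested...
df = pd.DataFrame([{'full_name': k, 'names_id': v['names_id']} for k, v in m.items()])
# ...then explode the lists into one row per (name, id) pair
df = df.explode('names_id', ignore_index=True)
print(df)
```

This produces the same one-row-per-id table without the intermediate flat list.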
data.txt (a TXT file, not .py):
Mich->
Anne
Luke
Carl
Marl->
Fill
Luke
Anne->
Luke
Fill
Python file:
with open('data.txt') as f:
    dati = f.read()
dati = dati.strip()
dati = dati.splitlines()
diz = {}
for items in dati:
    if items[-2:] == '->':
        key = items.replace('->', '')
        if key in diz:
            continue
        else:
            diz[key] = []
    else:
        diz[key].append(items)
print(diz)
OUTPUT: d = {'Mich': ['Anne', 'Luke', 'Carl'], 'Marl': ['Fill', 'Luke'], 'Anne': ['Luke', 'Fill']}
I would like to understand how I can access the lists and compare names here, when elements of d are also in another file (data.txt).
For example, if I need to know which keys have the same names, what do I have to do?
Thanks everybody.
I tried sets, to do an intersection, but I can't do this with these lists.
For the output I thought of something like (Mich, Marl and Anne know Luke).
I searched everywhere on the internet for how to analyse lists in a dictionary; maybe it's impossible?
One way to iterate the list would be like this:
d = {'Mich': ['Anne', 'Luke', 'Carl'],
     'Marl': ['Fill', 'Luke'],
     'Anne': ['Luke', 'Fill']}
names = []
for k, v in d.items():
    print(k, "has: ")
    for items in v:
        print(items)
        # here you can check if this data is in the other file
        names.append(items)
print(names)
This will result in:
Mich has:
Anne
Luke
Carl
Marl has:
Fill
Luke
Anne has:
Luke
Fill
['Anne', 'Luke', 'Carl', 'Fill', 'Luke', 'Luke', 'Fill']
You should give more information about how the data is structured in the other file too; this is all I can do with the information given.
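To get at the original question (which keys share a name), one option is to invert the dictionary into a name-to-keys mapping; a sketch using only the dict shown above:

```python
d = {'Mich': ['Anne', 'Luke', 'Carl'],
     'Marl': ['Fill', 'Luke'],
     'Anne': ['Luke', 'Fill']}

# Invert: for each name, collect the keys whose lists contain it
knows = {}
for key, names in d.items():
    for name in names:
        knows.setdefault(name, []).append(key)

# Names that appear under more than one key
shared = {name: keys for name, keys in knows.items() if len(keys) > 1}
print(shared)
```

This reproduces the desired output: Luke is known by Mich, Marl and Anne, and Fill by Marl and Anne.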
I have a pandas dataframe containing some business data, and a json that contains (among other things) a way to map one identifier to another (more granular to less granular). Is there any python or pandas function to accomplish this without converting the json to a dataframe first?
e.g.
input data frame
Bob, 123, Jakes Lane
mapping json
json = {'street number': 123, 'street name': 'Jakes Lane', 'postalcode': 'A1B2C3'}
What I want as the output is
Bob, 123, Jakes Lane, A1B2C3
Per your comment, here is fixed code to do the join, using 2 methods.
First, we need to define the problem data. There are 2 datasets:
dataframe with name, street number/address (no postal code)
json list of addresses: street number/address, and postal code
goal is to join the 2 datasets to generate a list of name, street number, street name, postalcode, and generate a CSV (one comma-separated line per person)
Below are 2 methods to do the join. IMHO, doing it with a temp dataframe would be much better. Unsure of the size of each dataset; assume they are LARGE. Thus METHOD 2 seems to be the better approach, as it does 1 pass through the json and 1 pass through the dataframe.
Here is the data, set up with a json.loads():
import json
import numpy as np
import pandas as pd

df = pd.DataFrame([{"name": "Bob", "street number": 123, "street name": "Jakes Lane"}])
# Define JSON content (standing in for file content)
jstr = '{"addrList" : [{"street number":123, "street name":"Jakes Lane", "postalcode" : "A1B2C3"}]}'
# Parse the json string into a dict
jDict = json.loads(jstr)
METHOD 1: for every address in the json, use apply to add the postal code that matches.
df['postalcode'] = np.nan
sn = 'street number'
snm = 'street name'
infoList = []
for addr in jDict['addrList']:
    pc = addr['postalcode']
    infoList = list(df.apply(lambda r: '%s, %s, %s, %s' % (r['name'], r[sn], r[snm], pc)
                             if (r[sn] == addr[sn]) & (r[snm] == addr[snm])
                             else r.postalcode, axis=1))
infoList
METHOD 2: transform the json into a dict indexed by street number/name; search the dataframe for matches.
d = {str(a[sn]) + a[snm]: a['postalcode'] for a in jDict['addrList']}

def matchAddr(r):
    a = str(r[sn]) + r[snm]
    try:
        pc = d[a]
    except KeyError:
        pc = np.nan
    return '%s, %s, %s, %s' % (r['name'], r[sn], r[snm], pc)

list(df.apply(matchAddr, axis=1))
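METHOD 2's lookup can also be written without apply; a sketch keyed on a (number, name) tuple so nothing has to be concatenated into a string (the sample data here mirrors the snippet above):

```python
import pandas as pd

df = pd.DataFrame([{'name': 'Bob', 'street number': 123, 'street name': 'Jakes Lane'}])
addr_list = [{'street number': 123, 'street name': 'Jakes Lane', 'postalcode': 'A1B2C3'}]

# Dict keyed on (street number, street name) tuples
lookup = {(a['street number'], a['street name']): a['postalcode'] for a in addr_list}

# .get returns None for unmatched addresses instead of raising
df['postalcode'] = [lookup.get((num, st))
                    for num, st in zip(df['street number'], df['street name'])]
```

Tuple keys avoid collisions that string concatenation could produce (e.g. number 12 + "3 Oak" vs number 123 + " Oak").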
This demo code shows how to iterate over all rows of a dataframe and collect values (zipcodes) from a JSON. In the final step, the values are used to create a new column of the dataframe.
import pandas as pd
from io import StringIO

data2 = """ID, Name, Number, Street
1, Bob, 123, Jakes Lane
2, Cat, 415, Carrot Alley
3, Derek, 741, Dirt Lane"""

# Proposed JSON object data
better_json = {
    123: {'street number': 123, 'street name': 'Jakes Lane', 'postalcode': 'A1B2C3'},
    741: {'street number': 741, 'street name': 'Dirt Lane', 'postalcode': 'A2Z4C6'},
    531: {'street number': 531, 'street name': 'Fair Lane', 'postalcode': 'A0Z1C9'}
}

df2 = pd.read_csv(StringIO(data2), sep=r',\s+', index_col='ID', engine='python')
print(df2)

# loop over rows and get the zipcode (items() replaces the removed iteritems())
link_values = []
for i, item in df2.Number.items():
    try:
        # print(i, item, better_json[item]['postalcode'])
        link_values.append(better_json[item]['postalcode'])
    except KeyError:
        # print(i, item, "No match!")
        link_values.append("")
df2["zipcode"] = link_values
print()
print(df2)
The output printed by the code:
Name Number Street
ID
1 Bob 123 Jakes Lane
2 Cat 415 Carrot Alley
3 Derek 741 Dirt Lane
Name Number Street zipcode
ID
1 Bob 123 Jakes Lane A1B2C3
2 Cat 415 Carrot Alley
3 Derek 741 Dirt Lane A2Z4C6
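The row loop above can also be collapsed into a single Series.map call; a sketch reusing the same df2 and better_json:

```python
import pandas as pd
from io import StringIO

data2 = """ID, Name, Number, Street
1, Bob, 123, Jakes Lane
2, Cat, 415, Carrot Alley
3, Derek, 741, Dirt Lane"""
better_json = {
    123: {'street number': 123, 'street name': 'Jakes Lane', 'postalcode': 'A1B2C3'},
    741: {'street number': 741, 'street name': 'Dirt Lane', 'postalcode': 'A2Z4C6'},
    531: {'street number': 531, 'street name': 'Fair Lane', 'postalcode': 'A0Z1C9'},
}

df2 = pd.read_csv(StringIO(data2), sep=r',\s+', index_col='ID', engine='python')
# .get with an empty-dict default makes unmatched numbers fall through to ''
df2['zipcode'] = df2['Number'].map(lambda n: better_json.get(n, {}).get('postalcode', ''))
```

Same output as the explicit loop, with the try/except replaced by dict .get defaults.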
I have a very deeply nested list of dictionaries. I am trying to capture 'keys' from a specific nested dictionary and convert them to a data frame. How do I do this? I have basic dictionary knowledge for generating keys; I have tried appending [] and {} and it didn't quite work. Any guidance appreciated!
import pandas as pd
from pprint import pprint
d = {'Main': {
    'SecondLevel':
        [{'Identifier': 'abc',
          'StudentInfo': {'Name': 'Mike', 'Grade': '1',
                          'TeachersAssigned': [{'Name': 'Paul'},
                                               {'Name': 'Smith'}]}},
         {'StudentInfo': {'Name': 'Mandy', 'Grade': '1',
                          'TeachersAssigned': [{'Name': 'Baker'},
                                               {'Name': 'Smith'}]}}]}}
pprint(d)

list_dict = []
for doc in d['Main']['SecondLevel']:
    identifier = '' if doc.get('Identifier') is None else doc['Identifier']
    studentname = doc['StudentInfo']['Name']
    list_dict.append(identifier)
    list_dict.append(studentname)
    for teach in doc['StudentInfo']['TeachersAssigned']:
        teachers_name = teach['Name']
        list_dict.append(teachers_name)
pprint(list_dict)
>>> ['abc', 'Mike', 'Paul', 'Smith', '', 'Mandy', 'Baker', 'Smith']
pd.DataFrame(list_dict)
>>> single column with list of the values from above
I am trying to get it to like this:
Identifier StudentInfo TeachersAssigned
abc Mike Paul
abc Mike Smith
Mandy Baker
Mandy Smith
Am I doing the for loop wrong for list comprehension?
Given your dictionary, this is how I'd manage it. But as I explained before, you cannot have columns of different lengths in a DataFrame, so you can use np.nan for the missing values:
import numpy as np
import pandas as pd

d = {'Main': {
    'SecondLevel':
        [{'Identifier': 'abc',
          'StudentInfo': {'Name': 'Mike', 'Grade': '1',
                          'TeachersAssigned': [{'Name': 'Paul'},
                                               {'Name': 'Smith'}]}},
         {'StudentInfo': {'Name': 'Mandy', 'Grade': '1',
                          'TeachersAssigned': [{'Name': 'Baker'},
                                               {'Name': 'Smith'}]}}]}}

data = {'Identifier': [], 'Name': [], 'TeachersAssigned': []}
for i in range(len(d['Main']['SecondLevel'])):
    for j in range(len(d['Main']['SecondLevel'][i]['StudentInfo']['TeachersAssigned'])):
        try:
            data['Identifier'].append(d['Main']['SecondLevel'][i]['Identifier'])
        except KeyError:
            data['Identifier'].append(np.nan)
        data['Name'].append(d['Main']['SecondLevel'][i]['StudentInfo']['Name'])
        data['TeachersAssigned'].append(d['Main']['SecondLevel'][i]['StudentInfo']['TeachersAssigned'][j]['Name'])

df = pd.DataFrame(data)
print(df)
Output:
Identifier Name TeachersAssigned
0 abc Mike Paul
1 abc Mike Smith
2 NaN Mandy Baker
3 NaN Mandy Smith
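For this particular shape, pandas' json_normalize can also build the table directly; a sketch, where errors='ignore' lets the record that lacks 'Identifier' come through as NaN:

```python
import pandas as pd

d = {'Main': {'SecondLevel': [
    {'Identifier': 'abc',
     'StudentInfo': {'Name': 'Mike', 'Grade': '1',
                     'TeachersAssigned': [{'Name': 'Paul'}, {'Name': 'Smith'}]}},
    {'StudentInfo': {'Name': 'Mandy', 'Grade': '1',
                     'TeachersAssigned': [{'Name': 'Baker'}, {'Name': 'Smith'}]}}]}}

# record_path walks down to the innermost list (one row per teacher);
# meta pulls the outer fields onto each of those rows
df = pd.json_normalize(d['Main']['SecondLevel'],
                       record_path=['StudentInfo', 'TeachersAssigned'],
                       meta=['Identifier', ['StudentInfo', 'Name']],
                       errors='ignore')
print(df)
```

The columns come out as 'Name' (the teacher), 'Identifier', and 'StudentInfo.Name', which can be renamed to match the desired headers.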
I am using PySpark (Python 3, Spark 2.1.0) and I have a list of different lists, such as:
lista_archivos = [[['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']], [['FILE','458.xml'],
['NAME','JUAN'],['SURNAME','PÉREZ'],['NATIONALITY','ESP']], [['FILE','789.xml'],
['NAME','PEDRO'],['SURNAME','CASTRO'],['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']]]
This list has elements of different lengths. So now I would like to create a DataFrame from this list, where the columns are the first attribute (i.e. FILE, NAME, SURNAME, BIRTHDATE, NATIONALITY) and the data is the second attribute.
As you can see, the second list does not have the column 'BIRTHDATE'; I need the DataFrame to create this column with a NaN or a blank space in its place.
Also, I need DataFrame to be like this:
FILE NAME SURNAME BIRTHDATE NATIONALITY
----------------------------------------------------
123.xml ANA LÓPEZ 05-05-2000 ESP
458.xml JUAN PÉREZ NaN ESP
789.xml PEDRO CASTRO 07-07-2007 ESP
The data from the lists has to end up in the same columns. I have written this code, but it doesn't look like the table I'd like:
dictOfWords = {i: lista_archivos[i] for i in range(len(lista_archivos))}
d = dictOfWords
tabla = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in dictOfWords.items()]))
tabla_final = tabla.transpose()
tabla_final
Also, I have done this:
dictOfWords = {i: lista_archivos[i] for i in range(len(lista_archivos))}
print(dictOfWords)
tabla = pd.DataFrame.from_dict(dictOfWords, orient='index')
tabla
And the result is not good.
I would like a pandas DataFrame and a Spark DataFrame if it is possible.
Thanks!!
The following should work in your case:
In [5]: lista_archivos = [[['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
...: ['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']], [['FILE','458.xml'],
...: ['NAME','JUAN'],['SURNAME','PÉREZ'],['NATIONALITY','ESP']], [['FILE','789.xml'],
...: ['NAME','PEDRO'],['SURNAME','CASTRO'],['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']]]
In [6]: pd.DataFrame(list(map(dict, lista_archivos)))
Out[6]:
BIRTHDATE FILE NAME NATIONALITY SURNAME
0 05-05-2000 123.xml ANA ESP LÓPEZ
1 NaN 458.xml JUAN ESP PÉREZ
2 07-07-2007 789.xml PEDRO ESP CASTRO
Essentially, you convert your sublists to dict objects, and feed a list of those to the data-frame constructor. The data-frame constructor works with list-of-dicts very naturally.
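For the Spark side of the question, the same list-of-dicts trick works: build the pandas frame first, then hand it to a SparkSession. A sketch below; the Spark lines are commented out because they assume a running Spark 2.1 cluster with a session named spark:

```python
import pandas as pd

lista_archivos = [[['FILE', '123.xml'], ['NAME', 'ANA'], ['SURNAME', 'LÓPEZ'],
                   ['BIRTHDATE', '05-05-2000'], ['NATIONALITY', 'ESP']],
                  [['FILE', '458.xml'], ['NAME', 'JUAN'], ['SURNAME', 'PÉREZ'],
                   ['NATIONALITY', 'ESP']]]

# Each sublist of [key, value] pairs becomes a dict; missing keys become NaN
pdf = pd.DataFrame(list(map(dict, lista_archivos)))

# Hypothetical Spark hand-off (needs an active SparkSession named `spark`):
# sdf = spark.createDataFrame(pdf.fillna(''))  # replace NaN, which Spark's string type rejects
# sdf.show()
```

fillna('') before the hand-off is one way to get the "white space" the question mentions instead of NaN in the Spark column.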
I have a bunch of lines in text with names and teams in this format:
Team (year)|Surname1, Name1
e.g.
Yankees (1993)|Abbot, Jim
Yankees (1994)|Abbot, Jim
Yankees (1993)|Assenmacher, Paul
Yankees (2000)|Buddies, Mike
Yankees (2000)|Canseco, Jose
and so on for several years and several teams.
I would like to aggregate the names of players according to the team (year) combination, deleting any duplicated names (it may happen that the original database contains some redundant information). In the example, my output should be:
Yankees (1993)|Abbot, Jim|Assenmacher, Paul
Yankees (1994)|Abbot, Jim
Yankees (2000)|Buddies, Mike|Canseco, Jose
I've written this code so far:
from collections import defaultdict

file_in = open('filein.txt')
file_out = open('fileout.txt', 'w+')

teams = defaultdict(set)
for line in file_in:
    items = [entry.strip() for entry in line.split('|') if entry]
    team = items[0]
    name = items[1]
    teams[team].add(name)
I end up with a big dictionary whose keys are the team-and-year strings and whose values are sets of names. But I don't know exactly how to go on to aggregate things.
I would also like to be able to compare my final sets of values (e.g. how many players do the Yankees teams of 1993 and 1994 have in common?). How can I do this?
Any help is appreciated
You can use a tuple as a key here, e.g. ('Yankees', '1994'):
from collections import defaultdict

dic = defaultdict(list)
with open('abc') as f:
    for line in f:
        key, val = line.split('|')
        keys = tuple(x.strip('()') for x in key.split())
        vals = [x.strip() for x in val.split(', ')]
        dic[keys].append(vals)

print(dic)
for k, v in dic.items():
    print("{}({})|{}".format(k[0], k[1], "|".join([", ".join(x) for x in v])))
Output:
defaultdict(<class 'list'>,
            {('Yankees', '1993'): [['Abbot', 'Jim'], ['Assenmacher', 'Paul']],
             ('Yankees', '1994'): [['Abbot', 'Jim']],
             ('Yankees', '2000'): [['Buddies', 'Mike'], ['Canseco', 'Jose']]})
Yankees(1993)|Abbot, Jim|Assenmacher, Paul
Yankees(1994)|Abbot, Jim
Yankees(2000)|Buddies, Mike|Canseco, Jose
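For the comparison part of the question (players two team-years have in common), sets make it a one-liner; a sketch using the sample lines from the question in place of the file:

```python
from collections import defaultdict

lines = ["Yankees (1993)|Abbot, Jim",
         "Yankees (1994)|Abbot, Jim",
         "Yankees (1993)|Assenmacher, Paul",
         "Yankees (2000)|Buddies, Mike",
         "Yankees (2000)|Canseco, Jose"]

teams = defaultdict(set)
for line in lines:
    team, name = [part.strip() for part in line.split('|')]
    teams[team].add(name)  # set membership deduplicates for free

# Players both rosters share; len(common) answers "how many in common?"
common = teams['Yankees (1993)'] & teams['Yankees (1994)']
print(common)
```

The same `&` (intersection) works between any pair of the value sets, and `|` and `-` give union and difference if you need them.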