I have a bunch of lines in text with names and teams in this format:
Team (year)|Surname1, Name1
e.g.
Yankees (1993)|Abbot, Jim
Yankees (1994)|Abbot, Jim
Yankees (1993)|Assenmacher, Paul
Yankees (2000)|Buddies, Mike
Yankees (2000)|Canseco, Jose
and so on for several years and several teams.
I would like to aggregate the names of players by team (year) combination, deleting any duplicate names (the original database may contain some redundant information). For the example, my output should be:
Yankees (1993)|Abbot, Jim|Assenmacher, Paul
Yankees (1994)|Abbot, Jim
Yankees (2000)|Buddies, Mike|Canseco, Jose
I've written this code so far:
from collections import defaultdict

file_in = open('filein.txt')
file_out = open('fileout.txt', 'w+')

teams = defaultdict(set)
for line in file_in:
    items = [entry.strip() for entry in line.split('|') if entry]
    team = items[0]
    name = items[1]
    teams[team].add(name)
I end up with a big dictionary keyed by team name and year, with a set of names as each value. But I don't know exactly how to go on to aggregate things.
I would also like to be able to compare my final sets of values (e.g. how many players do the Yankees teams of 1993 and 1994 have in common?). How can I do this?
Any help is appreciated.
You can use a tuple as a key here, e.g. ('Yankees', '1994'):
from collections import defaultdict

dic = defaultdict(list)
with open('abc') as f:
    for line in f:
        key, val = line.split('|')
        keys = tuple(x.strip('()') for x in key.split())
        vals = [x.strip() for x in val.split(', ')]
        dic[keys].append(vals)

print(dic)
for k, v in dic.items():
    print("{}({})|{}".format(k[0], k[1], "|".join([", ".join(x) for x in v])))
Output:
defaultdict(<class 'list'>,
{('Yankees', '1993'): [['Abbot', 'Jim'], ['Assenmacher', 'Paul']],
('Yankees', '1994'): [['Abbot', 'Jim']],
('Yankees', '2000'): [['Buddies', 'Mike'], ['Canseco', 'Jose']]})
Yankees(1993)|Abbot, Jim|Assenmacher, Paul
Yankees(1994)|Abbot, Jim
Yankees(2000)|Buddies, Mike|Canseco, Jose
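Since the original post already builds sets, a set-based variant of the same idea both drops the duplicate rows and answers the comparison question from the post. This is a minimal sketch, with inline data standing in for filein.txt:

```python
from collections import defaultdict
import io

# inline stand-in for filein.txt
data = io.StringIO(
    "Yankees (1993)|Abbot, Jim\n"
    "Yankees (1994)|Abbot, Jim\n"
    "Yankees (1993)|Abbot, Jim\n"  # redundant row, silently dropped by the set
    "Yankees (1993)|Assenmacher, Paul\n"
)

teams = defaultdict(set)
for line in data:
    team, name = (part.strip() for part in line.split('|'))
    teams[team].add(name)

# one aggregated line per team (year); names sorted for stable output
for team in sorted(teams):
    print(team + "|" + "|".join(sorted(teams[team])))

# how many players do the 1993 and 1994 teams have in common?
common = teams['Yankees (1993)'] & teams['Yankees (1994)']
print(len(common), common)
```

The `&` operator is plain set intersection, so the same pattern answers any pairwise "players in common" question.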
I have a csv file with the columns: Name, Height, City
Now I need to return all the heights corresponding to the same city.
So I have created a variable for all unique cities:
uniqueCity = []
for i in city:
    if i not in uniqueCity:
        uniqueCity.append(i)
I am able to print all heights corresponding to each city, but I can't seem to sort them by height value per city.
def printCity(city):
    for i in uniqueCity:
        print(i)
        for j in range(len(city)):
            if i == city[j]:
                print(name[j], height[j])
What am I missing?
I am not allowed to use any third party libraries.
Full code:
import csv

with open('heightData.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)
    name = []
    city = []
    height = []
    for row in csvreader:
        name.append(row[0])
        city.append(row[1])
        height.append(int(row[2]))

city.sort()

uniqueCity = []
for i in city:
    if i not in uniqueCity:
        uniqueCity.append(i)

def printCity(city):
    for i in uniqueCity:
        print(i)
        for j in range(len(city)):
            if i == city[j]:
                print(name[j], height[j])

printCity(city)
Sample data:
name,city,height
Mariam Cox,St_Paul,67
Daniel Ashley,St_Paul,65
Oliver Clay,Minneapolis,75
Rae Finley,Minneapolis,81
Brady Joyce,Virginia,68
Harding Jones,Virginia,80
Expected output:
Minneapolis:
Oliver Clay 75
Rae Finley 81
St_Paul:
Daniel Ashley 65
Mariam Cox 67
Virginia:
Brady Joyce 68
Harding Jones 80
The problem is, once you've separated the data into separate lists for each column, there's nothing connecting the same row across columns. Then, when you do city.sort(), the other columns don't get sorted with it, and the city column ends up out of order with respect to the others.
Instead, you could put each row into a tuple and add all the tuples to a list. Then sort() that list using the key argument to select any column (in this case, the [2] item of each row, to sort by height):
import csv

with open('heightData.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)
    csvdata = []
    for row in csvreader:
        row[2] = int(row[2])
        csvdata.append(tuple(row))

csvdata.sort(key=lambda row: row[2])
Which gives:
csvdata = [('Daniel Ashley', 'St_Paul', 65),
           ('Mariam Cox', 'St_Paul', 67),
           ('Brady Joyce', 'Virginia', 68),
           ('Oliver Clay', 'Minneapolis', 75),
           ('Harding Jones', 'Virginia', 80),
           ('Rae Finley', 'Minneapolis', 81)]
From your edit, I see that you want to first group your data by city, and then print the names of people, sorted by their heights. You have two options to group your data:
Sort by city and then use Python's built-in itertools.groupby()
import itertools
csvdata.sort(key=lambda row: row[1]) # Sort by city
grouped_rows = {k: list(v) for k, v in itertools.groupby(csvdata, key=lambda row: row[1])} # Group by city
Create a dictionary where the keys are cities and the values are lists of rows belonging to that city.
import collections
grouped_rows = collections.defaultdict(list)
for row in csvdata:
    city = row[1]
    grouped_rows[city].append(row)
Then, you can iterate over either of these grouped_rows objects, sort the lists within on the [2] item, and print them:
for city in sorted(grouped_rows.keys()):
    city_rows = sorted(grouped_rows[city], key=lambda row: row[2])
    print(city)
    for row in city_rows:
        print("\t", row[0], row[2])
Minneapolis
     Oliver Clay 75
     Rae Finley 81
St_Paul
     Daniel Ashley 65
     Mariam Cox 67
Virginia
     Brady Joyce 68
     Harding Jones 80
For the assignment, it had to be a function. But this seems to work for me.
# collect all heights corresponding to each city
import collections

def heightTuple(city):
    cityHeight = collections.defaultdict(list)
    for i in range(len(city)):
        cityHeight[city[i]].append(height[i])
    for i in cityHeight:
        cityHeight[i].sort()
    print(cityHeight)
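If the names are needed alongside the sorted heights (as in the expected output), the same grouping idea can store (height, name) pairs instead of bare heights. A sketch, assuming the parallel name, city and height lists built earlier; the sample values below are taken from the question's data:

```python
import collections

def cityHeights(name, city, height):
    # group (height, name) pairs under each city, sorted by height within the city
    grouped = collections.defaultdict(list)
    for n, c, h in zip(name, city, height):
        grouped[c].append((h, n))
    return {c: sorted(pairs) for c, pairs in grouped.items()}

name = ['Mariam Cox', 'Daniel Ashley', 'Oliver Clay', 'Rae Finley']
city = ['St_Paul', 'St_Paul', 'Minneapolis', 'Minneapolis']
height = [67, 65, 75, 81]

grouped = cityHeights(name, city, height)
for c in sorted(grouped):
    print(c + ":")
    for h, n in grouped[c]:
        print(n, h)
```

Returning the dict instead of printing inside the function keeps it reusable; the printing loop reproduces the expected layout.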
data.txt (a TXT file, NOT a .py file):
Mich->
Anne
Luke
Carl
Marl->
Fill
Luke
Anne->
Luke
Fill
python file:
with open('data.txt') as f:
    dati = f.read()

dati = dati.strip()
dati = dati.splitlines()
diz = {}
for items in dati:
    if items[-2:] == '->':
        key = items.replace('->', '')
        if key not in diz:
            diz[key] = []
    else:
        diz[key].append(items)
print(diz)
OUTPUT: {'Mich': ['Anne', 'Luke', 'Carl'], 'Marl': ['Fill', 'Luke'], 'Anne': ['Luke', 'Fill']}
I would like to understand how I can access the lists and compare the names, given that the elements of d come from another file (data.txt).
For example, if I want to know which keys contain the same names, what should I do?
Thanks everybody.
I tried sets, to do an intersection, but I couldn't do it with these lists.
For the output I was thinking of something like (Mich, Marl and Anne know Luke).
I searched everywhere on the internet for how to analyse lists inside a dictionary; maybe it's impossible?
One way to iterate the list would be like this:
d = {'Mich': ['Anne', 'Luke', 'Carl'],
     'Marl': ['Fill', 'Luke'],
     'Anne': ['Luke', 'Fill']}

names = []
for k, v in d.items():
    print(k, "has: ")
    for items in v:
        print(items)
        # here you can check if this data is in other file
        names.append(items)

print(names)
This will result in:
Mich has:
Anne
Luke
Carl
Marl has:
Fill
Luke
Anne has:
Luke
Fill
['Anne', 'Luke', 'Carl', 'Fill', 'Luke', 'Luke', 'Fill']
You should give more information about how the data is structured in the other file too; this is all I can do with the information given.
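To get the asked-for "Mich, Marl and Anne know Luke" style of result, one option is to invert the dictionary: for every name, collect the keys whose list contains it. A sketch using the d from above:

```python
d = {'Mich': ['Anne', 'Luke', 'Carl'],
     'Marl': ['Fill', 'Luke'],
     'Anne': ['Luke', 'Fill']}

# invert: for every name, collect the keys whose list contains it
who_knows = {}
for key, people in d.items():
    for person in people:
        who_knows.setdefault(person, []).append(key)

# names that appear under more than one key
for person, keys in who_knows.items():
    if len(keys) > 1:
        print(", ".join(keys), "know", person)
```

The same who_knows mapping also supports set operations, e.g. set(who_knows['Luke']) & set(who_knows['Fill']) gives the keys that know both.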
Consider the following .csv:
country: region: rank1: rank2: rank3:
switzerland West Europe 1 7.25 0.04
Iceland West Europe 2 7.04 1.03
Canada North America 3 7.32 0.03
I need to create a dictionary in the following structure, with one big dictionary containing dictionaries for each region containing another dictionary for countries, with tuples for the ranks (rank 1 and 2 contained in the same tuple). Like this:
{
'Western Europe':
{'switzerland': ((1, 7.25), (0.04)),
'Iceland': ((2, 7.04), (1.03))
...}
'North America':
{'Canada': ((3, 7.32), (0.03))
...}
}
Should I first read in the columns to be tuples, then create dictionaries of the countries containing those tuples? then add those to dictionaries of region?
import csv

def build_dict():
    d = {}
    with open('2015.csv', mode='r') as fp:
        reader = csv.reader(fp)
        next(fp, None)
        # problem area
        for row in reader:
            key = row[0]
            if key in d:
                pass
            d[key] = row[1:]
    print(d)
Which I'm aware gives me a dictionary that I don't want (dictionary with dictionaries for each country), but I've never worked with this many nested data structures and if anybody could show me the light I would be very grateful. I just feel like there's a lot going on here, and there's got to be an efficient way to do something like this.
This should do what you need.
import csv

def build_dict():
    d = {}
    with open('2015.csv', mode='r') as fp:
        reader = csv.reader(fp)
        next(fp, None)
        for row in reader:
            key = row[1]
            if key not in d:  # If this is a region we haven't seen yet
                d[key] = {}   # Add it to the top-level dictionary
            # Fill in data for that country, converting the ranks to numbers
            d[key][row[0]] = ((int(row[2]), float(row[3])), float(row[4]))
    print(d)
There's probably a much better way to do this with something like pandas, but this gets the job done while being close to your original code.
I suggest working with pandas (excuse me for not working with your code):
Load the data as a DataFrame, group by region, and iterate over the grouped sub-DataFrames. There might be more efficient solutions without iterating over the countries inside each region.
import pandas as pd

df = pd.read_csv('2015.csv')
d = {}  # initiate an empty dictionary
for region, df_region in df.groupby(by='region'):
    # region is the region name (string), and df_region is a small DataFrame with only one region
    d_region = {}  # a new sub-dictionary for each region
    for _, row in df_region.iterrows():
        # each row is a pandas.Series representing a different country;
        # we can access its values by, for example, row['country']
        d_region[row['country']] = ((row['rank1'], row['rank2']), row['rank3'])
    # add the new sub-dictionary to the general dictionary, with region as its key
    d[region] = d_region
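For comparison, the same nested structure can be built with only the standard library. This sketch assumes a comma-separated file with the header country,region,rank1,rank2,rank3 (column names taken from the question's table), and uses an inline sample in place of 2015.csv:

```python
import csv
import io

# inline sample standing in for 2015.csv
sample = io.StringIO(
    "country,region,rank1,rank2,rank3\n"
    "switzerland,West Europe,1,7.25,0.04\n"
    "Iceland,West Europe,2,7.04,1.03\n"
    "Canada,North America,3,7.32,0.03\n"
)

d = {}
for row in csv.DictReader(sample):
    # create the region's sub-dictionary on first sight, then add the country
    d.setdefault(row['region'], {})[row['country']] = (
        (int(row['rank1']), float(row['rank2'])), float(row['rank3']))
```

dict.setdefault collapses the "if region not in d" check into one line, which keeps the two-level nesting readable.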
I'm attempting to merge two datasets using a unique ID for clients that are present in only one dataset. I've assigned the unique IDs to each full name as a dictionary, but each person is unique, even if they have the same name. I need to assign each unique ID iteratively to each instance of that person's name.
example of the dictionary:
{'Corey Davis': {'names_id':[1472]}, 'Jose Hernandez': {'names_id': [3464,15202,82567,98472]}, ...}
I've already attempted using the .map() function as well as
referrals['names_id'] = referrals['full_name'].copy()
for key, val in m.items():
    referrals.loc[referrals.names_id == key, 'names_id'] = val
but of course, it only assigns the last value encountered, 98472.
I am hoping for something along the lines of:
full_name names_id \
Corey Davis 1472
Jose Hernandez 3464
Jose Hernandez 15202
Jose Hernandez 82567
Jose Hernandez 98472
but I get
full_name names_id \
Corey Davis 1472
Jose Hernandez 98472
Jose Hernandez 98472
Jose Hernandez 98472
Jose Hernandez 98472
Personally, what I would do is:
inputs = [{'full_name': 'test', 'names_id': [1]}, {'full_name': 'test2', 'names_id': [2, 3, 4]}]

# Create list of dictionaries for each 'entry'
entries = []
for entry in inputs:
    for name_id in entry['names_id']:
        entries.append({'full_name': entry['full_name'], 'names_id': name_id})

# Now you have a list of dicts - each being one line of your table
# entries is now
# [{'full_name': 'test', 'names_id': 1},
#  {'full_name': 'test2', 'names_id': 2},
#  {'full_name': 'test2', 'names_id': 3},
#  {'full_name': 'test2', 'names_id': 4}]

# I like pandas and use it for its dataframes; you can create a dataframe from a list of dicts
import pandas as pd
final_dataframe = pd.DataFrame(entries)
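The same flattening can also be done inside pandas with DataFrame.explode (available since pandas 0.25); m below is a hypothetical stand-in for the question's name-to-ids dictionary:

```python
import pandas as pd

# hypothetical mapping shaped like the one in the question
m = {'Corey Davis': {'names_id': [1472]},
     'Jose Hernandez': {'names_id': [3464, 15202, 82567, 98472]}}

# one row per person holding the whole id list, then explode to one row per id
df = pd.DataFrame(
    [{'full_name': k, 'names_id': v['names_id']} for k, v in m.items()]
).explode('names_id').reset_index(drop=True)
```

The result has one (full_name, names_id) row per id, which is the shape needed for the merge.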
I need some help with an assignment for Python.
The task is to convert a .csv file to a dictionary and make some changes. The problem is that the .csv file has only 1 column, but 3 rows.
The .csv file looks like this in excel
A B
1.male Bob West
2.female Hannah South
3.male Bruce North
So everything is in column A.
My code looks so far like this:
import csv
reader = csv.reader(open("filename.csv"))
d={}
for row in reader:
    d[row[0]] = row[0:]
print(d)
And the output
{'\ufeffmale Bob West': ['\ufeffmale Bob West'], 'female Hannah South':
['female Hannah South'], 'male Bruce North': ['male Bruce North']}
but I want
{1 : Bob West, 2 : Hannah South, 3 : Bruce North}
The male/female should be replaced with an ID (1, 2, 3), and I don't know how to deal with the single-column issue.
Thanks in advance.
You can use a dict comprehension and enumerate over the csv object:
import csv
reader = csv.reader(open("filename.csv"))
x = {num+1:name[0].split(" ",1)[-1].rstrip() for (num, name) in enumerate(reader)}
print(x)
# output,
{1: 'Bob West', 2: 'Hannah South', 3: 'Bruce North'}
Or you can do it without using csv module simply by reading the file,
with open("filename.csv", 'r') as t:
    x = {num+1: name.split(" ", 1)[-1].strip() for (num, name) in enumerate(t)}
print(x)
# output,
{1: 'Bob West', 2: 'Hannah South', 3: 'Bruce North'}
As per Simit's answer, but using regular expressions, and realising that your "1." and "A"/"B" are just Excel row and column identifiers:
import re, csv

reader = csv.reader(open("data.csv"))
out = {}
for i, line in enumerate(reader, 1):
    m = re.match(r'^(male|female) (.*)$', line[0])  # line is a list; the text is in line[0]
    if not m:
        print(f"error processing {repr(line)}")
        continue
    out[i] = m[2]
print(out)
I like to use Pandas for stuff like this. You can use Pandas to import it and then export it to a dict.
import pandas as pd

df = pd.read_csv('test.csv', header=None)

# Create new columns in the dataframe based on the rules of the question
df['Name'] = df[0].str.split(' ', n=1).str.get(1)
df['ID'] = df[0].str.split('.', n=1).str.get(0)
The dataframe should have three columns:
0 - This is the raw data.
Name - The name as defined in the problem.
ID - The number that comes before the period.
I didn't include gender, but it really won't fit into the dict. I'm also assuming your data does not have a header.
The next part converts your pandas dataframe to a dict in the output that you want.
output_dict = dict()
for i in range(len(df)):
    output_dict[df.iloc[i]['ID']] = df.iloc[i]['Name']
This should work for the given input:
data.csv:
1.male Bob West,
2.female Hannah South,
3.male Bruce North,
Code:
import csv

reader = csv.reader(open("data.csv"))
d = {}
for row in reader:
    splitted = row[0].split('.')
    # print(splitted[0])
    # print(' '.join(splitted[1].split(' ')[1:]))
    d[splitted[0]] = ' '.join(splitted[1].split(' ')[1:])
print(d)
Output
{'1': 'Bob West', '3': 'Bruce North', '2': 'Hannah South'}
import csv

with open('Employee_address.txt', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
        print(f'\t{row["name"]} works in the {row["department"]} department, and lives in {row["living address"]}.')
        line_count += 1
    print(f'Processed {line_count} lines.')