I am currently writing some code that reads lines in from a text file. The line is split into 3 different segments, with the first segment being a user ID.
For example, one line would look like this:
11 490 5
I have a list with as many elements as there are users, where each element corresponds with a user (eg exampleList[4] stores data for the 5th user).
Each list element contains a dictionary of indefinite length, where the key is the second segment of the line, and the value is the third segment of the line.
The length of the dictionary (the number of key-value pairs) increases if the same user's ID occurs in another line. The idea is that when another line with the same user ID is encountered, the data from that line is appended to the dictionary in the list element that corresponds to that user.
For example, the above line would be stored in something like this:
exampleList[10] = {490:5}
and if the program read another line like this: 11 23 9
the list item would update itself to this:
exampleList[10] = {490:5, 23:9}
The way my program works is that it first collects the number of users, and then creates a list like this:
exampleList = [{}] * numberOfUsers
It then extracts the position of whitespace in the line using re.finditer, which is then used to extract the numbers through basic string operations.
That part works perfectly, but I'm unsure of how to update dictionaries within a list, namely appending new key-value pairs to the dictionary.
I've read about using a for loop here, but that won't work for me since that adds it to every dictionary in the cell instead of just appending it to the dictionary in a certain cell only.
Sample code:
oFile = open("file.txt", encoding = "ISO-8859-1")
text = oFile.readlines()
cL = [{}] * numOfUsers #imported from another method
for line in text:
    a = [m.start() for m in re.finditer('\t', line)]
    userID = int(line[0:a[0]])
    uIDIndex = userID - 1
    cL[uIDIndex].update({int(line[a[0]+1:a[1]]):int(line[a[1]+1:a[2]])})
print(cL)
file.txt:
1 242 3
3 302 3
5 333 10
1 666 9
expected output:
[{242:3 , 666:9},{},{302:3},{},{333:10}]
actual output:
[{242: 3, 333: 10, 302: 3, 666: 9}, {242: 3, 333: 10, 302: 3, 666: 9}, {242: 3, 333: 10, 302: 3, 666: 9}, {242: 3, 333: 10, 302: 3, 666: 9}, {242: 3, 333: 10, 302: 3, 666: 9}]
For some reason, it populates all dictionaries in the list with all the values.
I'm not positive I understand your problem correctly but I was able to get the output you desired.
Note that this solution completely ignores the fourth value in the list
import re
fileData = [] #data from file.txt parsed through regex
with open("file.txt") as f:
    for line in f:
        regExp = re.match(r"(\d+)\s+(\d+)\s(\d+)", line) #extracts data from row in file
        fileData.append((int(regExp.group(1)), int(regExp.group(2)), int(regExp.group(3)))) #make 2-d list of data
maxIndex = max(fileData, key=lambda x: x[0])[0] #biggest index in the list (5 in this case)
finaList = [] #the list where your output will be stored
for i in range(1, maxIndex+1): #your example output showed a 1-indexed dict
    thisDict = {} #start with empty dict
    for item in fileData:
        if item[0] == i:
            thisDict[item[1]] = item[2] #for every item with same index as this dict, add new key-value to dict
    finaList.append(thisDict) #add this dict to output list
print(finaList)
You can just access the dictionary by the index. Here is a simple example:
>>> A = []
>>> A.append(dict())
>>> A.append(dict())
>>> A[0][5] = 7
>>> A
[{5: 7}, {}]
>>> A[1][4] = 8
>>> A[0][3] = 9
>>> A[1][8] = 10
>>> A
[{3: 9, 5: 7}, {8: 10, 4: 8}]
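The symptom in the question, where every dictionary ends up holding every value, most likely comes from `[{}] * numOfUsers`: multiplying a list repeats a reference to one dict object rather than creating separate dicts. Here is a minimal sketch of the difference, assuming tab-separated lines like the sample file (the `lines` list and `num_users` are illustrative stand-ins for the file contents and the imported user count):

```python
# Illustrative stand-ins for the file contents and the user count.
lines = ["1\t242\t3", "3\t302\t3", "5\t333\t10", "1\t666\t9"]
num_users = 5

# List multiplication copies the reference, so every slot is the same dict.
shared = [{}] * num_users
print(shared[0] is shared[1])  # True

# A comprehension evaluates {} once per slot, giving independent dicts.
per_user = [{} for _ in range(num_users)]
for line in lines:
    user_id, key, value = (int(part) for part in line.split())
    per_user[user_id - 1][key] = value

print(per_user)  # [{242: 3, 666: 9}, {}, {302: 3}, {}, {333: 10}]
```

Using `line.split()` also avoids the manual slicing with `re.finditer`, since it splits on any run of whitespace.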
I have a pandas DataFrame with three columns. The third is derived from the second by string slicing.
For every warranty_claim_number there is a key_part_number (the first column).
The DataFrame has a lot of rows.
I also have a separate list containing 70 randomly selected warranty_claim_numbers.
I would like to find the key_part_number corresponding to each of those 70 claims in my dataset.
Then I would like to create a dictionary with the key_part_number as the key and the corresponding warranty_claim_number as the value.
Finally, I want to count how often each key_part_number appears in this dataset and update the key.
It should look like this:
dicti = {4:'000120648353',10:'000119582589',....}
First of all, you need to change the datatype of warranty_claim_number to string, or you won't get the leading 0's.
You can subset your df from that list of claim numbers:
df = df[df["warranty_claim_number"].isin(claimnumberlist)]
This gives you a dataframe with only the rows with those claim numbers.
countofkeyparts = df["key_part_number"].value_counts()
This gives you a pandas Series of counts indexed by key_part_number, and you can convert it to a dict with to_dict():
countofkeyparts = countofkeyparts.to_dict()
The keys in a dict have to be unique, so if you want the count as the key, the value can be a list of key_part_numbers:
values = {}
for key, value in countofkeyparts.items():
    values[value]= values.get(value,[])
    values[value].append(key)
As your example stands, you can't use the number of occurrences as the dictionary key, because dictionary keys are unique and several key_part_numbers can occur with the same frequency. I therefore recommend storing the result in this format: dicti = {4:['000120648353', '09824091'],10:['000119582589'] ,....}
I'll use randomly generated data as an example
from collections import Counter
import random
lst = [random.randint(1, 10) for i in range(20)]
counter = Counter(lst)
print(counter) # First element, then number of occurrences
nums = set(counter.values()) # All occurrences
res = {item: [val for val in counter if counter[val] == item] for item in nums}
print(res)
# Counter({5: 6, 8: 4, 3: 2, 4: 2, 9: 2, 2: 2, 6: 1, 10: 1})
# {1: [6, 10], 2: [3, 4, 9, 2], 4: [8], 6: [5]}
This does what you want:
# Select the key_part_number of rows whose warranty_claim_number is in lst:
df_wanted = df.loc[df["warranty_claim_number"].isin(lst), "key_part_number"]
# Count the values in that column:
count_values = df_wanted.value_counts()
# Transform to dictionary:
print(count_values.to_dict())
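For readers who want to see the whole pipeline (filter by claim number, count key parts, then group key parts by count) without pandas, here is a rough sketch using only the standard library. The `records` and `claim_list` values are made-up sample data, not taken from the question's dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (key_part_number, warranty_claim_number) pairs; claim numbers
# are kept as strings so the leading zeros survive.
records = [
    ("P1", "000120648353"),
    ("P1", "000119582589"),
    ("P2", "000111111111"),
    ("P3", "000122222222"),
]
claim_list = {"000120648353", "000119582589", "000111111111"}

# Count key parts among the selected claims only.
counts = Counter(part for part, claim in records if claim in claim_list)

# Invert the counts: each count maps to the list of key parts that have it.
parts_by_count = defaultdict(list)
for part, n in counts.items():
    parts_by_count[n].append(part)

print(dict(parts_by_count))  # {2: ['P1'], 1: ['P2']}
```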
I'd like to write a function that takes one argument (a text file), uses its contents as keys, and assigns values to the keys. I'd like the values to go from 1 to n:
{'A': 1, 'B': 2, 'C': 3, 'D': 4... }.
I tried to write something like this:
Base code which kind of works:
filename = 'words.txt'
with open(filename, 'r') as f:
    text = f.read()
ready_text = text.split()
def create_dict(lst):
    """ go through the arg, stores items in it as keys in a dict"""
    dictionary = dict()
    for item in lst:
        if item not in dictionary:
            dictionary[item] = 1
        else:
            dictionary[item] += 1
    return dictionary
print(create_dict(ready_text))
The output: {'A': 1, 'B': 1, 'C': 1, 'D': 1... }.
Attempt to make the thing work:
def create_dict(lst):
    """ go through the arg, stores items in it as keys in a dict"""
    dictionary = dict()
    values = list(range(100)) # values
    for item in lst:
        if item not in dictionary:
            for value in values:
                dictionary[item] = values[value]
        else:
            dictionary[item] = values[value]
    return dictionary
The output: {'A': 99, 'B': 99, 'C': 99, 'D': 99... }.
My attempt doesn't work. It gives all the keys 99 as their value.
Bonus question: How can I optimize my code and make it look more elegant/cleaner?
Thank you in advance.
You can use dict comprehension with enumerate (note the start parameter):
words.txt:
colorless green ideas sleep furiously
Code:
with open('words.txt', 'r') as f:
    words = f.read().split()
dct = {word: i for i, word in enumerate(words, start=1)}
print(dct)
# {'colorless': 1, 'green': 2, 'ideas': 3, 'sleep': 4, 'furiously': 5}
Note that "to be or not to be" will result in {'to': 5, 'be': 6, 'or': 3, 'not': 4}, which may not be what you want. Ending up with only one entry for a repeated word is not a quirk of this algorithm. Rather, it is inevitable as long as you use a dict, because keys are unique.
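To make the duplicate behaviour concrete, here is a small sketch contrasting the comprehension (where the last occurrence of a word wins) with numbering each word only the first time it is seen. The helper name `number_first_seen` is made up for illustration.

```python
words = "to be or not to be".split()

# Later occurrences overwrite earlier ones, so 'to' and 'be' keep 5 and 6.
last_wins = {word: i for i, word in enumerate(words, start=1)}

def number_first_seen(items):
    """Number each distinct item 1..n in order of first appearance."""
    numbering = {}
    for item in items:
        if item not in numbering:
            numbering[item] = len(numbering) + 1
    return numbering

first_seen = number_first_seen(words)

print(last_wins)   # {'to': 5, 'be': 6, 'or': 3, 'not': 4}
print(first_seen)  # {'to': 1, 'be': 2, 'or': 3, 'not': 4}
```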
Your program sends a list of strings to create_dict. For each string in the list, if that string is not in the dictionary, then the dictionary value for that key is set to 1. If that string has been encountered before, then the value of that key is increased by 1. So, since every key is being set to 1, then that must mean there are no repeat keys anywhere, meaning you're sending a list of unique strings.
So, in order to have the numerical values increase with each new key, you just have to increment some number during your loop:
num = 0
for item in lst:
    num += 1
    dictionary[item] = num
There's an easier way to loop through both numbers and list items at the same time, via enumerate():
for num, item in enumerate(lst, start=1): # start at 1 and not 0
    dictionary[item] = num
You can use this code. If an item appears in lst more than once, it is assigned an index only the first time it is seen:
def create_dict(lst):
    """ go through the arg, stores items in it as keys in a dict"""
    dictionary = dict()
    idx = 1
    for item in lst:
        if item not in dictionary:
            dictionary[item]=idx
            idx += 1
    return dictionary
I was looking for a way to read a csv file with an unknown number of columns into a nested dictionary. i.e. for input of the form
file.csv:
1, 2, 3, 4
1, 6, 7, 8
9, 10, 11, 12
I want a dictionary of the form:
{1:{2:{3:4}, 6:{7:8}}, 9:{10:{11:12}}}
This is in order to allow O(1) search of a value in the csv file.
Creating the dictionary can take a relatively long time, since in my application I create it only once but search it millions of times.
I also wanted an option to name the relevant columns, so that I can ignore unnecessary ones.
Here's a simple, albeit brittle approach:
>>> import csv, io
>>> s = "1, 2, 3, 4\n1, 6, 7, 8\n9, 10, 11, 12\n" # contents of file.csv
>>> d = {}
>>> with io.StringIO(s) as f: # fake a file
...     reader = csv.reader(f)
...     for row in reader:
...         nested = d
...         for val in map(int, row[:-2]):
...             nested = nested.setdefault(val, {})
...         k, v = map(int, row[-2:]) # this will fail if you don't have enough columns
...         nested[k] = v
...
>>> d
{1: {2: {3: 4}, 6: {7: 8}}, 9: {10: {11: 12}}}
However, this assumes the number of columns is at least 2.
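One way to relax that assumption is to skip rows that are too short. The following is only a sketch: the names `build_nested` and `lookup` are invented here, and silently skipping short rows is an assumption about the desired behaviour rather than something the question specifies.

```python
import csv
import io

def build_nested(lines):
    """Build a nested dict from CSV lines; the last two columns form key: value."""
    nested_root = {}
    # skipinitialspace drops the blank that follows each comma in the sample.
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 2:
            continue  # assumption: rows with fewer than two columns are ignored
        values = [int(cell) for cell in row]
        node = nested_root
        for key in values[:-2]:
            node = node.setdefault(key, {})
        node[values[-2]] = values[-1]
    return nested_root

def lookup(nested, *keys):
    """Follow one key per nesting level; each step is a plain dict access."""
    node = nested
    for key in keys:
        node = node[key]
    return node

sample = io.StringIO("1, 2, 3, 4\n1, 6, 7, 8\n9, 10, 11, 12\n")
d = build_nested(sample)
print(d)                   # {1: {2: {3: 4}, 6: {7: 8}}, 9: {10: {11: 12}}}
print(lookup(d, 1, 6, 7))  # 8
```

A row whose leading columns collide with an existing leaf value (for example `1, 2, 3, 4` followed by `1, 2, 3, 4, 5`) would still fail, so rows are assumed to have a consistent depth.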
Here is what I came up with. Feel free to comment and suggest improvements.
import csv
import itertools
def list_to_dict(lst):
    # Takes a list, and recursively turns it into a nested dictionary, where
    # the first element is a key, whose value is the dictionary created from the
    # rest of the list. the last element in the list will be the value of the
    # innermost dictionary
    # INPUTS:
    # lst - a list (e.g. of strings or floats)
    # OUTPUT:
    # A nested dictionary
    # EXAMPLE RUN:
    # >>> lst = [1, 2, 3, 4]
    # >>> list_to_dict(lst)
    # {1:{2:{3:4}}}
    if len(lst) == 1:
        return lst[0]
    else:
        data_dict = {lst[-2]: lst[-1]}
        lst.pop()
        lst[-1] = data_dict
        return list_to_dict(lst)
def dict_combine(d1, d2):
    # Combines two nested dictionaries into one.
    # INPUTS:
    # d1, d2: Two nested dictionaries. The function might change d1 and d2,
    #         therefore if the input dictionaries are not to be mutated,
    #         you should pass copies of d1 and d2.
    #         Note that the function works more efficiently if d1 is the
    #         bigger dictionary.
    # OUTPUT:
    # The combined dictionary
    # EXAMPLE RUN:
    # >>> d1 = {1: {2: {3: 4, 5: 6}}}
    # >>> d2 = {1: {2: {7: 8}, 9: {10: 11}}}
    # >>> dict_combine(d1, d2)
    # {1: {2: {3: 4, 5: 6, 7: 8}, 9: {10: 11}}}
    for key in d2:
        if key in d1:
            d1[key] = dict_combine(d1[key], d2[key])
        else:
            d1[key] = d2[key]
    return d1
def csv_to_dict(csv_file_path, params=None, n_row_max=None):
    # NAME: csv_to_dict
    #
    # DESCRIPTION: Reads a csv file and turns relevant columns into a nested
    #              dictionary.
    #
    # INPUTS:
    # csv_file_path: The full path to the data file
    # params: A list of relevant column names. The resulting dictionary
    #         will be nested in the same order as parameters in 'params'.
    #         Default is None (read all columns)
    # n_row_max: The maximum number of rows to read. Default is None
    #            (read all rows)
    #
    # OUTPUT:
    # A nested dictionary containing all the relevant csv data
    csv_dictionary = {}
    with open(csv_file_path, 'r') as csv_file:
        csv_data = csv.reader(csv_file, delimiter=',')
        names = next(csv_data) # Read title line
        if not params:
            # A list of column indices to read from csv (all columns)
            relevant_param_indices = list(range(len(names)))
        else:
            # A list of column indices to read from csv
            relevant_param_indices = []
            for name in params:
                if name not in names:
                    # Parameter name is not found in title line
                    raise ValueError('Could not find {} in csv file'.format(name))
                else:
                    # Get indices of the relevant columns
                    relevant_param_indices.append(names.index(name))
        # The title line is already consumed, so start at the first data row
        for row in itertools.islice(csv_data, n_row_max):
            # Get a list containing relevant columns only
            relevant_cols = [row[i] for i in relevant_param_indices]
            # Turn the string to numbers. Not necessary
            float_row = [float(element) for element in relevant_cols]
            # Build nested dictionary
            csv_dictionary = dict_combine(csv_dictionary, list_to_dict(float_row))
    return csv_dictionary
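For comparison, the column-selection idea can also be sketched with `csv.DictReader` and `dict.setdefault`, which avoids building and merging a separate dictionary per row. The function name `nested_from_columns` and the sample header `a, b, c, d` are hypothetical, and the values are left as strings rather than converted to floats.

```python
import csv
import io

def nested_from_columns(csv_file, columns):
    """Nest the chosen columns in order; the last two become key: value."""
    if len(columns) < 2:
        raise ValueError("need at least two columns")
    # DictReader takes the field names from the first line of the file.
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    root = {}
    for record in reader:
        node = root
        for name in columns[:-2]:
            node = node.setdefault(record[name], {})
        node[record[columns[-2]]] = record[columns[-1]]
    return root

sample = io.StringIO("a, b, c, d\n1, 2, 3, 4\n1, 6, 7, 8\n9, 10, 11, 12\n")
nested = nested_from_columns(sample, ["a", "c", "d"])  # column b is ignored
print(nested)  # {'1': {'3': '4', '7': '8'}, '9': {'11': '12'}}
```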
I am not used to coding in Python, but I have to use it for this task. What I am trying to do is reproduce the result of an SQL statement like this:
SELECT T2.item, AVG(T1.Value) AS MEAN FROM TABLE_DATA T1 INNER JOIN TABLE_ITEMS T2 ON T1.ptid = T2.ptid GROUP BY T2.item.
In Python, I have two lists of dictionaries with the common key 'ptid'. dctData contains around 100,000 ptid values and dctItems around 7,000. Using a comparison like [i for i in dctData for j in dctItems if i['ptid'] == j['ptid']] takes forever:
ptid = 1
for line in lines[6:]: # Skipping header
    data = line.split()
    for d in data:
        dctData.append({'ptid' : ptid, 'Value': float(d)})
        ptid += 1
dctData = [{'ptid':1,'Value': 0}, {'ptid':2,'Value': 2}, {'ptid':3,'Value': 2}, {'ptid':4,'Value': 5}, {'ptid':5,'Value': 3}, {'ptid':6,'Value': 2}]
for line in lines[1:]: # Skipping header
    data = line.split(';')
    dctItems.append({'ptid' : int(data[1]), 'item' : data[3]})
dctItems = [{'item':21, 'ptid':1}, {'item':21, 'ptid':2}, {'item':21, 'ptid':6}, {'item':22, 'ptid':2}, {'item':22, 'ptid':5}, {'item':23, 'ptid':4}]
Now, what I would like to get as a result is a third list that presents the average value for each item in the dctItems list, where the link between the two lists is based on the 'ptid' value.
For example, for item 21 it would calculate a mean of about 1.3 from the values (0, 2, 2) of ptid 1, 2 and 6.
Finally, the result would look something like this, where the key Value holds the calculated mean:
dctResults = [{'id':21, 'Value':1.3}, {'id':22, 'Value':2.5}, {'id':23, 'Value':5}]
How can I achieve this?
Thank you all for your help.
Given those data structures that you use, this is not trivial, but it will become much easier if you use a single dictionary mapping items to their values, instead.
First, let's try to re-structure your data in that way:
values = {entry['ptid']: entry['Value'] for entry in dctData}
items = {}
for item in dctItems:
    items.setdefault(item['item'], []).append(values[item['ptid']])
Now, items has the form {21: [0, 2, 2], 22: [2, 3], 23: [5]}. Of course, it would be even better if you could create the dictionary in this form in the first place.
Now, we can pretty easily calculate the average for all those lists of values:
avg = lambda lst: float(sum(lst))/len(lst)
result = {item: avg(values) for item, values in items.items()}
This way, result is {21: 1.3333333333333333, 22: 2.5, 23: 5.0}
Or if you prefer your "list of dictionaries" style:
dctResult = [{'id': item, 'Value': avg(values)} for item, values in items.items()]
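Put together as one runnable sketch with the sample data from the question, the join, group and average steps look roughly like this. It assumes each ptid appears at most once in dctData, and it skips items whose ptid has no value, which mirrors the INNER JOIN in the SQL. The names `value_by_ptid` and `values_by_item` are illustrative.

```python
from collections import defaultdict

dctData = [{'ptid': 1, 'Value': 0}, {'ptid': 2, 'Value': 2}, {'ptid': 3, 'Value': 2},
           {'ptid': 4, 'Value': 5}, {'ptid': 5, 'Value': 3}, {'ptid': 6, 'Value': 2}]
dctItems = [{'item': 21, 'ptid': 1}, {'item': 21, 'ptid': 2}, {'item': 21, 'ptid': 6},
            {'item': 22, 'ptid': 2}, {'item': 22, 'ptid': 5}, {'item': 23, 'ptid': 4}]

# One pass over dctData builds a ptid -> value lookup (the "join" side).
value_by_ptid = {row['ptid']: row['Value'] for row in dctData}

# One pass over dctItems groups the joined values by item (the "GROUP BY").
values_by_item = defaultdict(list)
for row in dctItems:
    if row['ptid'] in value_by_ptid:  # inner join: drop unmatched ptids
        values_by_item[row['item']].append(value_by_ptid[row['ptid']])

# Average each group (the "AVG").
dctResults = [{'id': item, 'Value': sum(vals) / len(vals)}
              for item, vals in values_by_item.items()]
print(dctResults)
```

Both passes are linear, so the cost grows with len(dctData) + len(dctItems) instead of their product, which is what makes the nested comprehension so slow.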
After many attempts I was able to get the code below to index specific columns and rows for a given specific csv file. Now I would like to convert the code below into a dictionary, I read the documentation on dict and zip, however I'm still not clear...
CSV file contains 500 records and columns A:L corresponding to the headers below:
first_name, last_name, company, address, city, county, state, zip, phone1, phone2, email, web
import csv
f= open('us-500.csv', 'rU')
reader = csv.reader(f) # use list or next
rows = list(reader)
for row in rows[0:20]:
    print "".join(row[8])
I'm going to take a wild guess at what you want.
You have a CSV file with, say, 10 columns.
You want a dictionary that uses column 8 of each row as the keys, and the whole row (that is, a list of all of the columns) as the corresponding values.*
So, instead of list(reader), which just gives you a list of rows, you want this:
d = {row[8]: row for row in reader}
Or, if you're using, say, Python 2.5 and don't have dictionary comprehensions:
d = dict((row[8], row) for row in reader)
So, given this input file:
John, Smith, 2, 3, 4, 5, 6, 7, 8, 9, 10
Ed, Jones, 20, 30, 40, 50, 60, 70, 80, 90, 100
You'd get this dictionary:
{'8': ['John', 'Smith', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
 '80': ['Ed', 'Jones', '20', '30', '40', '50', '60', '70', '80', '90', '100']}
* This assumes that the column 8 values are unique. Otherwise, this wouldn't make sense at all. You might instead want, say, a multi-dict, mapping each column-8 value to the list of all rows that have that column-8 value, or a dict mapping each column-8 value to a "multi-row" that zips together each of the column values of each of the rows that have that column-8 value, or… who knows what. All of these are pretty easy to write once you understand the basic idea and know which one you want.
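As a rough illustration of the multi-dict variant mentioned in the footnote, `collections.defaultdict(list)` can collect every row that shares a column-8 value. This is a Python 3 sketch on in-memory data; the third row (`Ann`, `Lee`) is invented here so that one key has two rows.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for the CSV file; the last row is made up for the demo.
sample = io.StringIO(
    "John,Smith,2,3,4,5,6,7,8,9,10\n"
    "Ed,Jones,20,30,40,50,60,70,80,90,100\n"
    "Ann,Lee,1,1,1,1,1,1,8,1,1\n"
)

# Each column-8 value maps to the list of all rows carrying that value.
rows_by_col8 = defaultdict(list)
for row in csv.reader(sample):
    rows_by_col8[row[8]].append(row)

print([row[0] for row in rows_by_col8['8']])   # ['John', 'Ann']
print([row[0] for row in rows_by_col8['80']])  # ['Ed']
```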
EDIT --> Based on the asker's comment, I think this is more what is wanted (using DictReader makes this much simpler):
import csv
with open(r'c:\us-500.csv', 'rU') as f:
    reader = csv.DictReader(f)
    address_book = {}
    for row in reader:
        address_book[row['phone1']] = row
This gives a dictionary for the file keyed on the "phone1" column (index 8). Access values like this:
address_book['555-1212']['first_name']
address_book['978-3425']['email']
Edit2 --> Removing the original answer now. Basically it was re-implementing the DictReader functionality.
If you can get your data into two lists (in the order you want them in), then you are ready to convert them to a dictionary.
>>> list_1 = ['pie','farts','boo']
>>>
>>> list_2 = ['apple','stanky','scary']
>>>
>>> dict(zip(list_1,list_2))
{'boo': 'scary', 'farts': 'stanky', 'pie': 'apple'}
>>>
>>> dict(zip(list_2,list_1))
{'apple': 'pie', 'stanky': 'farts', 'scary': 'boo'}
>>>
The zip function is kinda cool because it pairs up two lists into one sequence of tuples:
>>> list(zip(list_1,list_2))
[('pie', 'apple'), ('farts', 'stanky'), ('boo', 'scary')]
then you just convert that into a dict
>>> dict(zip(list_1,list_2))
{'boo': 'scary', 'farts': 'stanky', 'pie': 'apple'}
You can use dict comprehension.
list1 = range(10)
list2 = range(20)
a = {k: v for k, v in zip(list1, list2)}
print a
You can even use the dict() constructor:
b = dict(zip(list1, list2))
The output in both cases is:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}
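The output has only ten entries even though list2 has twenty elements, because zip stops at the shorter input. A short Python 3 sketch of that behaviour, with `itertools.zip_longest` shown as one option when the extra elements should be kept (padding with None is that function's default, not something the answer above relies on):

```python
from itertools import zip_longest

list1 = range(10)
list2 = range(20)

# zip stops as soon as the shorter input is exhausted.
paired = dict(zip(list1, list2))
print(len(paired))  # 10

# zip_longest keeps going and fills the missing side with None.
padded = dict(zip_longest(list2, list1))
print(len(padded))  # 20
print(padded[15])   # None
```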
From your comments, it sounds like what you're asking for is something like this:
A list of rows.
One or more "index" multi-dicts, that map the values from a particular column to a set of row numbers that have that value.
A "multi-dict" is just a dict mapping keys to some kind of collection, like a set or list. You can build one very easily by using defaultdict.
You can get each row number together with its list of values using the enumerate function.
So, let's build a couple of indices on your data:
import collections
import csv
f= open('us-500.csv', 'rU')
reader = csv.reader(f) # use list or next
rows = list(reader)
phone1_index = collections.defaultdict(set)
phone2_index = collections.defaultdict(set)
for i, row in enumerate(rows):
    phone1_index[row[8]].add(i)
    phone2_index[row[9]].add(i)
(Note that this really isn't quite the same as an index in a typical database—it's just as good for finding all rows where phone1 == ?, but not helpful for where phone1 < ?.)
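If range queries such as phone1 < ? were needed, one possible approach (not part of the approach above) is to keep the keys in sorted order and use the standard `bisect` module. The rows and the helper name `rows_with_phone_below` are made up for this sketch, and the comparison is plain string ordering.

```python
import bisect

# Hypothetical (name, phone) rows.
rows = [
    ("Ann", "555-0101"),
    ("Bob", "555-0199"),
    ("Cy", "555-0150"),
]

# Sorted (phone, row number) pairs play the role of an ordered index.
sorted_index = sorted((phone, i) for i, (name, phone) in enumerate(rows))
phones = [phone for phone, _ in sorted_index]

def rows_with_phone_below(limit):
    """Return the rows whose phone sorts strictly before limit."""
    cut = bisect.bisect_left(phones, limit)
    return [rows[i] for _, i in sorted_index[:cut]]

print(rows_with_phone_below("555-0160"))
# [('Ann', '555-0101'), ('Cy', '555-0150')]
```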
But really, there's no reason to think in terms of indices. If you just store the rows themselves inside the dicts, you're not wasting any space; you can have two references to the same object in Python without having to copy all the data.
There is a minor technical problem, in that rows are lists, and therefore mutable, and therefore can't be stored in sets. But you probably don't actually want them to be mutable, they just happen to come out that way, so you can use tuples instead:
f= open('us-500.csv', 'rU')
reader = csv.reader(f) # use list or next
phone1_map = collections.defaultdict(set)
phone2_map = collections.defaultdict(set)
for row in reader:
    row = tuple(row)
    phone1_map[row[8]].add(row)
    phone2_map[row[9]].add(row)
While we're at it, this looks like a good job for namedtuple:
header = 'first_name, last_name, company, address, city, county, state, zip, phone1, phone2, email, web'
Row = collections.namedtuple('Row', header.split(', '))
f= open('us-500.csv', 'rU')
reader = csv.reader(f) # use list or next
phone1_map = collections.defaultdict(set)
phone2_map = collections.defaultdict(set)
for row in reader:
    row = Row(*row) # unpack the list: Row takes one argument per field
    phone1_map[row.phone1].add(row)
    phone2_map[row.phone2].add(row)
So, now if you want to find the last name of everyone whose phone1 or phone2 is 1.555.555.1212:
matches = phone1_map['1.555.555.1212'] | phone2_map['1.555.555.1212']
names = {match.last_name for match in matches}
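Here is a self-contained Python 3 sketch of the namedtuple approach on in-memory data. It assumes a reduced, made-up header with four of the twelve columns, takes the field names from the file's first line, and unpacks each row with `Row(*fields)` because a namedtuple expects one argument per field.

```python
import collections
import csv
import io

# In-memory stand-in for us-500.csv with a reduced, hypothetical column set.
sample = io.StringIO(
    "first_name,last_name,phone1,phone2\n"
    "Ann,Lee,1.555.555.1212,1.555.555.0000\n"
    "Bob,Ray,1.555.555.3434,1.555.555.1212\n"
    "Cy,Poe,1.555.555.5656,1.555.555.7878\n"
)
reader = csv.reader(sample)
Row = collections.namedtuple("Row", next(reader))  # header line -> field names

phone1_map = collections.defaultdict(set)
phone2_map = collections.defaultdict(set)
for fields in reader:
    row = Row(*fields)  # namedtuples are hashable, so they can live in sets
    phone1_map[row.phone1].add(row)
    phone2_map[row.phone2].add(row)

target = "1.555.555.1212"
# .get avoids creating empty entries in the defaultdicts for unknown numbers.
matches = phone1_map.get(target, set()) | phone2_map.get(target, set())
last_names = {match.last_name for match in matches}
print(sorted(last_names))  # ['Lee', 'Ray']
```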