I'm working with a huge CSV that I'm parsing with csv.DictReader. What would be the most efficient way to trim the data in the resulting dictionary based on the key name?
Say, just keep the keys that contain "JAN".
Thanks!
result = {key:val for key, val in row.items() if 'JAN' in key}
where row is a dictionary obtained from DictReader.
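For example, against a small in-memory CSV (the headers and values here are made up for illustration; the real code would iterate the large file the same way):

```python
import csv
import io

# hypothetical CSV data standing in for the huge file
data = io.StringIO("JAN11,FEB11,JAN12\n1,2,3\n4,5,6\n")

trimmed = []
for row in csv.DictReader(data):
    # keep only the columns whose header contains "JAN"
    trimmed.append({key: val for key, val in row.items() if 'JAN' in key})

print(trimmed)
# [{'JAN11': '1', 'JAN12': '3'}, {'JAN11': '4', 'JAN12': '6'}]
```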
Okay, here's a dirt-stupid example of using csv.DictReader with /etc/passwd:

import csv

keepers = dict()
with open('/etc/passwd') as f:
    r = csv.DictReader(f, delimiter=':',
                       fieldnames=('login', 'pw', 'uid', 'gid', 'gecos', 'homedir', 'shell'))
    for i in r:
        # DictReader values are strings, so convert before the numeric comparison
        if int(i['uid']) < 1:
            continue
        keepers[i['login']] = i
Now, trying to apply that to your question ... I'm only guessing that you were building a dictionary of dictionaries, based on the phrase "from the resulting dictionary." It seems obvious that the reader object is going to return a dictionary for every input record, so there will be one resulting dictionary for every line of your file (assuming any of the common CSV "dialects").
Naturally I could have used if int(i['uid']) > 1 or if "Jan" in i['gecos'] and only added to keepers when the condition held true. I wrote it this way to emphasize how easily you can skip the records you're not interested in, so that the rest of your for suite can do various interesting things with the records that are of interest.
However, this answer is so simple that I have to suspect I'm not understanding the question. (I'm using /etc/passwd and a colon-separated list simply because it's an extremely well-known format, and world-readable copies are readily available on Linux, Unix, and Mac OS X systems.)
You could do something like this:
>>> import csv
>>> with open('file.csv') as f:
...     culled = [{k: d[k] for k in d if "JAN" in k} for d in csv.DictReader(f)]
When I tried this on a simple CSV file with the following contents:
JAN11,FEB11,MAR11,APR11,MAY11,JUN11,JUL11,AUG11,SEP11,OCT11,NOV11,DEC11,JAN12,FEB12,MAR12,APR12
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
I got the following result:
>>> with open('file.csv') as f:
...     culled = [{k: d[k] for k in d if "JAN" in k} for d in csv.DictReader(f)]
...
>>> culled
[{'JAN11': '1', 'JAN12': '13'}, {'JAN11': '17', 'JAN12': '29'}]
I'm trying to write some code that opens about 900 nested dictionaries with roughly 99% similar content (very large) and stores the value of each key in a list named after the key. For example, if I had two dictionaries, {data=37} and {data=74}, I would want to combine those two values into a list named data that outputs the following: [37, 74]
Here is the code I'm currently using to accomplish this:
import sys
import json
import pandas as pd

df = pd.read_csv("/Users/---.csv")
i = True

def sort(d):
    for k, v in d.items():
        if isinstance(v, dict):
            sort(v)
        else:
            global i
            if i == True:
                print("{0} : {1}".format(k, v))
                setattr(sys.modules[__name__], k, [v])
                i = False
            else:
                print("{0} : {1}".format(k, v))
                globals()["{}".format(k)].append(v)

for i in df['file_num']:
    with open("/Users/--/allDAFs{}.json".format(i)) as f:
        data = json.load(f)
        sort(data)
The problem with this is twofold:
a. There are some duplicates, and I'm not sure why: there are 1400 values for some keys when there are only 900 files.
b. I can't link the values to a file_num. As you can see, I'm iterating through the files using file_num, and I'd like to link each value to the file_num it came from.
I know I may not be doing this the best way possible so any insight or advice would be greatly appreciated.
EDIT: This is how I need the end result to look, preferably in a pandas DataFrame:
I would use a defaultdict. Maybe I'm missing something, but I don't really see the problem.
import json
import collections
import pandas as pd

def sort(d, output):
    for k, v in d.items():
        if isinstance(v, dict):
            sort(v, output)
        else:
            output[k].append(v)

df = pd.read_csv("/Users/---.csv")
results = collections.defaultdict(list)
for i in df['file_num']:
    with open("/Users/--/allDAFs{}.json".format(i)) as f:
        data = json.load(f)
        sort(data, results)
Then, results will be a dictionary of lists that you can address by the same keys.
For your issue with ZIP codes, make sure the keys are always the same data type (str), maybe even ASCII.
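To address point (b) from the question, a sketch of the same idea that tags every value with the file it came from. The file numbers and JSON contents below are made up stand-ins; the real code would load each allDAFs{}.json instead:

```python
import collections

def flatten(d, output):
    # recursively collect leaf values by key, like sort() above
    for k, v in d.items():
        if isinstance(v, dict):
            flatten(v, output)
        else:
            output[k].append(v)

# stand-ins for the ~900 JSON files, keyed by a made-up file_num
files = {
    101: {'data': 37, 'meta': {'zip': '02139'}},
    102: {'data': 74, 'meta': {'zip': '10001'}},
}

results = collections.defaultdict(list)
for file_num, doc in files.items():
    per_file = collections.defaultdict(list)
    flatten(doc, per_file)
    # tag every value with the file_num it came from
    for k, vals in per_file.items():
        results[k].extend((file_num, v) for v in vals)

print(dict(results))
# {'data': [(101, 37), (102, 74)], 'zip': [(101, '02139'), (102, '10001')]}
```

The (file_num, value) pairs can then be fed straight into a DataFrame per key.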
I used pd.json_normalize() like Richie suggested, which worked well.
I'm pretty new to Python, and for two days now I've been struggling to get a hierarchy encoded in strings into a Python dict/list structure so I can handle it better:
Example Strings:
Operating_System/Linux/Apache
Operating_System/Linux/Nginx
Operating_System/Windows/Docker
Operating_System/FreeBSD/Nginx
What I'm trying to achieve is to split each string up and pack it into a Python dict, something like:
{'Operating_System': [{'Linux': ['Apache', 'Nginx']}, {'Windows': ['Docker']}, {'FreeBSD': ['Nginx']}]}
I've tried multiple ways, including zip() and splitting on '/' followed by nested iteration, but I could not solve it yet. Does anyone know a good/elegant way to achieve something like this with Python 3?
best regards,
Chris
One way about it: a defaultdict can help here.
# assumption is that it is a collection of strings
strings = ["Operating_System/Linux/Apache",
           "Operating_System/Linux/Nginx",
           "Operating_System/Windows/Docker",
           "Operating_System/FreeBSD/Nginx"]

from collections import defaultdict

d = defaultdict(dict)
e = defaultdict(list)

m = [entry.split('/') for entry in strings]
print(m)
[['Operating_System', 'Linux', 'Apache'],
 ['Operating_System', 'Linux', 'Nginx'],
 ['Operating_System', 'Windows', 'Docker'],
 ['Operating_System', 'FreeBSD', 'Nginx']]

for a, b, c in m:
    e[b].append(c)
    d[a] = e

print(d)
defaultdict(dict,
{'Operating_System': defaultdict(list,
{'Linux': ['Apache', 'Nginx'],
'Windows': ['Docker'],
'FreeBSD': ['Nginx']})})
If you want them exactly as you shared in your output, you can skip the defaultdict(dict) part:
mapp = {'Operating_System':[{k:v} for k,v in e.items()]}
mapp
{'Operating_System': [{'Linux': ['Apache', 'Nginx']},
{'Windows': ['Docker']},
{'FreeBSD': ['Nginx']}]
}
This post was also useful.
In a directory images, images are named like 1_foo.png, 2_foo.png, 14_foo.png, etc.
The images are OCR'd, and the extracted text is stored in a dict by the code below:
import os
import pytesseract

data_dict = {}
for i in os.listdir(images):
    if str(i[1]) != '_':
        k = str(i[:2])  # Get first two characters of image name and use as 'key'
    else:
        k = str(i[:1])  # Get first character of image name and use as 'key'
    # Initializes a list for each key and allows storing multiple entries
    data_dict.setdefault(k, [])
    data_dict[k].append(pytesseract.image_to_string(i))
The code performs as expected.
The images can have varying numbers in their name ranging from 1 to 99.
Can this be reduced to a dictionary comprehension?
No. Each iteration in a dict comprehension assigns a value to a key; it cannot update an existing value list. Dict comprehensions aren't always better; the code you wrote seems good enough. Although maybe you could write:
data_dict = {}
for i in os.listdir(images):
    k = i.partition("_")[0]
    image_string = pytesseract.image_to_string(i)
    data_dict.setdefault(k, []).append(image_string)
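A collections.defaultdict does the same grouping job without the explicit setdefault; here is a sketch with plain filenames standing in for the OCR call:

```python
from collections import defaultdict

names = ['1_foo.png', '2_foo.png', '1_baz.png']

data_dict = defaultdict(list)
for name in names:
    k = name.partition('_')[0]
    # pytesseract.image_to_string(name) would go here; the filename is a stand-in
    data_dict[k].append(name)

print(dict(data_dict))
# {'1': ['1_foo.png', '1_baz.png'], '2': ['2_foo.png']}
```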
Yes. Here's one way, but I wouldn't recommend it:
{k: d.setdefault(k, []).append(pytesseract.image_to_string(i)) or d[k]
 for d in [{}]
 for k, i in ((i.split('_')[0], i) for i in names)}
That might be as clean as I can make it, and it's still bad. Better to use a normal loop, especially a clean one like Dennis's.
Slight variation (if I do the abuse once, I might as well do it twice):
{k: d.setdefault(k, []).append(pytesseract.image_to_string(i)) or d[k]
 for d in [{}]
 for i in names
 for k in i.split('_')[:1]}
Edit: kaya3 has now posted a good one using a dict comprehension. I'd recommend that over mine as well. Mine are really just the dirty results of me thinking, "Someone said it can't be done? Challenge accepted!"
In this case itertools.groupby can be useful; you can group the filenames by the numeric part. But making it work is not easy, because the groups have to be contiguous in the sequence.
That means before we can use groupby, we need to sort using a key function which extracts the numeric part. That's the same key function we want to group by, so it makes sense to write the key function separately.
from itertools import groupby

def image_key(image):
    return str(image).partition('_')[0]

images = ['1_foo.png', '2_foo.png', '3_bar.png', '1_baz.png']

result = {
    k: list(v)
    for k, v in groupby(sorted(images, key=image_key), key=image_key)
}
# {'1': ['1_foo.png', '1_baz.png'],
# '2': ['2_foo.png'],
# '3': ['3_bar.png']}
Replace list(v) with list(map(pytesseract.image_to_string, v)) for your use-case.
I currently have this piece of code:
name_set = set()
reader = [{'name': 'value1'}, {'name': ''}, {'name': 'value2'}]
for row in reader:
    name = row.get('name', None)
    if name:
        name_set.add(name)
print(name_set)
In the real code the reader is a DictReader, but I use a list of dicts to represent it here.
Note that the if name: check is there to handle:
an empty string present in the dictionary (i.e. ""),
not the case where the key does not exist in the dictionary.
I think this code is easily readable, but I'm wondering if there is a shorter way, since it takes six lines simply to extract values from dicts and save them in a set.
Your existing code is fine.
But since you asked for a "short" way, you could just use set comprehensions/arithmetic:
>>> reader = [{'name':'value1'}, {'name':''}, {'name':'value2'}]
>>> {d['name'] for d in reader} - {''}
{'value1', 'value2'}
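If the 'name' key could also be missing entirely (not just empty), dict.get keeps the comprehension safe; a small sketch:

```python
reader = [{'name': 'value1'}, {'name': ''}, {}, {'name': 'value2'}]

# d.get('name') returns None for a missing key; subtract both falsy sentinels
name_set = {d.get('name') for d in reader} - {'', None}
print(name_set)  # {'value1', 'value2'} (set order may vary)
```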
I am reading data from an xls spreadsheet with xlrd. First, I gather the index for the column that contains the data that I need (may not always be in the same column in every instance):
amr_list, pssr_list, inservice_list = [], [], []
for i in range(sh.ncols):
    for j in range(sh.nrows):
        if 'amrprojectnumber' in sh.cell_value(j, i).lower():
            amr_list.append(sh.cell_value(j, i))
        if 'pssrnumber' in sh.cell_value(j, i).lower():
            pssr_list.append(sh.cell_value(j, i))
        if 'inservicedate' in sh.cell_value(j, i).lower():
            inservice_list.append(sh.cell_value(j, i))
Now I have three lists, which I need to use for writing data to a new workbook. The values in a row are related. So the index of an item in one list corresponds to the same index of the items in the other lists.
The amr_list has repeating string values. For example:
['4006BA','4006BA','4007AC','4007AC','4007AC']
The pssr_list always shares the same value as the amr_list but with additional info:
['4006BA(1)','4006BA(2)','4007AC(1)','4007AC(2)','4007AC(3)']
Finally, the inservice_list may or may not contain a variable date (as read from excel):
[40780.0, '', 40749.0, 40764.0, '']
This is the result I want from the data:
amr = {
    '4006BA': [('4006BA(1)', 40780.0), ('4006BA(2)', '')],
    '4007AC': [('4007AC(1)', 40749.0), ('4007AC(2)', 40764.0), ('4007AC(3)', '')],
}
But I am having a hard time figuring out an easy way to get there. Thanks in advance.
Maybe this can help:
A = ['4006BA', '4006BA', '4007AC', '4007AC', '4007AC']
B = ['4006BA(1)', '4006BA(2)', '4007AC(1)', '4007AC(2)', '4007AC(3)']
C = [40780.0, '', 40749.0, 40764.0, '']

result = dict()
for item in range(len(A)):
    key = A[item]
    result.setdefault(key, [])
    result[key].append((B[item], C[item]))

print(result)
This will print the data in the format you are looking for.
Look into itertools.groupby and
zip(amr_list, pssr_list, inservice_list)
For your case:

import itertools

dict((x, list(a[1:] for a in y)) for x, y in
     itertools.groupby(zip(amr_list, pssr_list, inservice_list), lambda z: z[0]))
Note that this assumes your input is sorted by amr_list.
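If the input were not already sorted by amr_list, sorting the zipped rows by their first element first makes groupby safe. A sketch using the question's sample data, deliberately shuffled:

```python
import itertools

# the question's sample rows, in a deliberately unsorted order
amr_list = ['4007AC', '4006BA', '4007AC', '4006BA', '4007AC']
pssr_list = ['4007AC(1)', '4006BA(1)', '4007AC(2)', '4006BA(2)', '4007AC(3)']
inservice_list = [40749.0, 40780.0, 40764.0, '', '']

# sort by the amr value so groupby sees each key as one contiguous run
rows = sorted(zip(amr_list, pssr_list, inservice_list), key=lambda z: z[0])
amr = {x: [a[1:] for a in y]
       for x, y in itertools.groupby(rows, key=lambda z: z[0])}

print(amr)
# {'4006BA': [('4006BA(1)', 40780.0), ('4006BA(2)', '')],
#  '4007AC': [('4007AC(1)', 40749.0), ('4007AC(2)', 40764.0), ('4007AC(3)', '')]}
```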
Another approach would be:
combined = {}
for k, v in zip(amr_list, zip(pssr_list, inservice_list)):
    combined.setdefault(k, []).append(v)
This does not require your input to be sorted.
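Run against the sample lists from the question, that loop produces exactly the desired amr mapping:

```python
amr_list = ['4006BA', '4006BA', '4007AC', '4007AC', '4007AC']
pssr_list = ['4006BA(1)', '4006BA(2)', '4007AC(1)', '4007AC(2)', '4007AC(3)']
inservice_list = [40780.0, '', 40749.0, 40764.0, '']

combined = {}
for k, v in zip(amr_list, zip(pssr_list, inservice_list)):
    combined.setdefault(k, []).append(v)

print(combined)
# {'4006BA': [('4006BA(1)', 40780.0), ('4006BA(2)', '')],
#  '4007AC': [('4007AC(1)', 40749.0), ('4007AC(2)', 40764.0), ('4007AC(3)', '')]}
```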