Python: Remove value from dictionary and multiply remaining two - python

I'm working with a large CSV file where each row includes the date and two values. I'm trying to set up a dictionary with the date as the key for the two values. I then need to multiply the two values of each key and record the answer. I have 3000 rows in the file.
Sample:
So far I have the date set as the key for each pair of values however it's also reusing the date as a third value for each key set, is there a way to remove this?
Once I've removed this, is there a way to multiply the values by eachother in each key set?
This is my code so far:
main_file = "newnoblanks.csv"
import csv
import collections
import pprint
with open(main_file) as fp:
root = csv.reader(fp, delimiter=',')
result = collections.defaultdict(list)
for row in root:
date = row[0].split(",")[0]
result[date].append(row)
print ("Result:-")
pprint.pprint(result)
This is my output:

I don't think you even need to use a defaultdict here, just assign the whole row (minus the date) to the key of the dict. You should just be able to do
with open(main_file) as fp:
root = csv.reader(fp, delimiter=',')
result = dict()
for row in root:
date = row[0].split(",")[0]
result[date] = row[1:]
If you want to get the product of the two values, you could do something like
for key in result:
result[key] = reduce(lambda x, y: x*y, result[key])

I know this has been answered, but feel there is an alternative worth considering:
import csv
from pprint import pprint
with open('newnoblanks.csv') as fp:
root = csv.reader(fp)
result = dict((date, float(a) * float(b)) for date, a, b in root)
pprint(result)
With the following data file:
19/08/2004,49.8458,44994500
20/08/2004,53.80505,23005800
23/08/2004,54.34653,18393200
The output is:
{'19/08/2004': 2242786848.1,
'20/08/2004': 1237828219.29,
'23/08/2004': 999606595.5960001}

Related

Convert CSV file to a dictionary of lists

My csv file looks like this:
Name,Surname,Fathers_name
Prakash,Patel,sudeep
Rohini,Dalal,raghav
Geeta,vakil,umesh
I want to create a dictionary of lists which should be like this:
dict = {Name: [Pakash,Rohini,Geeta], Surname: [Patel,Dalal,vakil], Fathers_name: [sudeep,raghav,umesh]}
This is my code:
with open(ram_details, 'r') as csv_file:
csv_content = csv.reader(csv_file,delimiter=',')
header = next(csv_content)
if header != None:
for row in csv_content:
dict['Name'].append(row[0])
It is throwing an error that key does not exists? Also, if there is any better way to get the desired output!!! Can someone help me with this?
Your code looks fine. It should work, still if you are getting into any trouble you can always use defaultdict.
from collections import defaultdict
# dict = {'Name':[],'Surname':[],'FatherName':[]}
d = defaultdict(list)
with open('123.csv', 'r') as csv_file:
csv_content = csv.reader(csv_file,delimiter=',')
header = next(csv_content)
if header != None:
for row in csv_content:
# dict['Name'].append(row[0])
# dict['Surname'].append(row[1])
# dict['FatherName'].append(row[2])
d['Name'].append(row[0])
d['Surname'].append(row[1])
d['FatherName'].append(row[2])
Please don't name a variable similar to a build in function or type (such as dict).
The problem is that you haven't initialized a dictionary object yet. So you try to add a key and value to an object which is not known to be dict yet. In any case you need to do the following:
result = dict() # <-- this is missing
result[key] = value
Since you want to create a dictionary and want to append to it directly you can also use python's defaultdict.
A working example would be:
import csv
from collections import defaultdict
from pprint import pprint
with open('details.csv', 'r') as csv_file:
csv_content = csv.reader(csv_file, delimiter=',')
headers = list(map(str.strip, next(csv_content)))
result = defaultdict(list)
if headers != None:
for row in csv_content:
for header, element in zip(headers, row):
result[header].append(element)
pprint(result)
Which leads to the output:
defaultdict(<class 'list'>,
{'Fathers_name': ['sudeep', 'raghav', 'umesh'],
'Name': ['Prakash', 'Rohini ', 'Geeta '],
'Surname': ['Patel ', 'Dalal ', 'vakil ']})
Note 1) my csv file had some extra trailing spaces, which can be removed using strip(), as I did for the headers.
Note 2) I am using the zip function to iterate over the elements and headers at the same time (this saves me to index the row).
Possible alternative is using pandas to_dict method (docs)
You may try to use pandas to achieve that:
import pandas as pd
f = pd.read_csv('todict.csv')
d = f.to_dict(orient='list')
Or if you like a one liner:
f = pd.read_csv('todict.csv').to_dict('orient='list')
First you read your csv file to a pandas data frame (I saved your sample to a file named todict.csv). Then you use the dataframe to dict method to convert to dictionary, specifying that you want lists as your dictinoary values, as explained in the documentation.

how to read csvfile into dictionary format

I have a csv file as below:
a,green
a,red
a,blue
b,white
b,black
b,brown
I want to read it into python dictionary as below
{'a':{'green','red','blue'},'b':{'white','black','brown'}}
How can I do? Help me please
I will assume you want a dict with a list of values per key
{'a':['green','red','blue'],'b':['white','black','brown']}
If so, a possible quick workaround could be something like this
import csv
# Get a all the rows in the format [[k, v], ...]
rows = list(csv.reader(open('file_path_here', 'r')))
# Get all the unique keys
keys = set(r[0] for r in rows)
# Get a list of values for the given key
def get_values_list(key, _rows):
return [r[1] for r in _rows if r[0] == key]
# Generate the dict
keys_dict = dict((k, get_values_list(k, rows)) for k in keys)
print keys_dict
But I'm pretty sure this solution has a lot of room for improvement if you spend some time on it.

python3 csv with duplicate keys + python defaultdict

I have a csv file which is having lot of serial numbers and material numbers for ex: show below (I need only first 2columns i.e serial and chassis and rest is not required).
serial chassis type date
ZX34215 Test XX YY
ZX34215 final-001 XX YY
AB30000 Used XX YY
ZX34215 final-002 XX YY
I have below snippet which gets all the serial and material numbers into a dictionary but here duplicate keys are eliminated and it captures latest serial key.
Working code
import sys
import csv
with open('file1.csv', mode='r') as infile:
reader = csv.reader(infile)
mydict1 = {rows[0]:rows[1] for rows in reader}
print(mydict1)
I need to capture duplicate keys with respective values also but it failed. I used python defaultdict and looks like I missed something here.
not working
from collections import defaultdict
with open('file1.csv',mode='r') as infile:
data=defaultdict(dict)
reader=csv.reader(infile)
list_res = list(reader)
for row in reader:
result=data[row[0]].append(row[1])
print(result)
Can some one correct me to capture duplicate keys into dictionary.
You need to pass a list to your defaultdict not dict :
data=defaultdict(list)
Also you don't need to convert the reader object to list, for iterating over it, you also shouldn't assign the append snipped to a variable in each iteration:
data=defaultdict(list)
with open('file1.csv') as infile:
reader=csv.reader(infile)
for row in reader:
try:
data[row[0]].append(row[1])
except IndexError:
pass
print(data)

Remove specific rows from CSV file in python

I am trying to remove rows with a specific ID within particular dates from a large CSV file.
The CSV file contains a column [3] with dates formatted like "1962-05-23" and a column with identifiers [2]: "ddd:011232700:mpeg21:a00191".
Within the following date range:
01-01-1951 to 12-31-1951
07-01-1962 to 12-31-1962
01-01 to 09-30-1963
7-01 to 07-31-1965
10-01 to 10-31-1965
04-01-1966 to 11-30-1966
01-01-1969 to 12-31-1969
01-01-1970 to 12-31-1989
I want to remove rows that contain the ID ddd:11*
I think I have to create a variable that contains both the date range and the ID. And look for these in every row, but I'm very new to python so I'm not sure what would be an eloquent way to do this.
This is what I have now. -CODE UPDATED
import csv
import collections
import sys
import re
from datetime import datetime
csv.field_size_limit(sys.maxsize)
dateranges = [("01-01-1951","12-31-1951"),("07-01-1962","12-31-1962")]
dateranges = list(map(lambda dr: tuple(map(lambda x: datetime.strptime(x,"%m-%d-%Y"),dr)),dateranges))
def datefilter(x):
x = datetime.strptime(x,"%Y-%m-%d")
for r in dateranges:
if r[0]<=x and r[1]>=x: return True
return False
writer = csv.writer(open('filtered.csv', 'wb'))
for row in csv.reader('my_file.csv', delimiter='\t'):
if datefilter(row[3]):
if not row[2].startswith("dd:111"):
writer.writerow(row)
else:
writer.writerow(row)
writer.close()
I'd recommend using pandas: it's great for filtering tables. Nice and readable.
import pandas as pd
# assumes the csv contains a header, and the 2 columns of interest are labeled "mydate" and "identifier"
# Note that "date" is a pandas keyword so not wise to use for column names
df = pd.read_csv(inputFilename, parse_dates=[2]) # assumes mydate column is the 3rd column (0-based)
df = df[~df.identifier.str.contains('ddd:11')] # filters out all rows with 'ddd:11' in the 'identifier' column
# then filter out anything not inside the specified date ranges:
df = df[((pd.to_datetime("1951-01-01") <= df.mydate) & (df.mydate <= pd.to_datetime("1951-12-31"))) |
((pd.to_datetime("1962-07-01") <= df.mydate) & (df.mydate <= pd.to_datetime("1962-12-31")))]
df.to_csv(outputFilename)
See Pandas Boolean Indexing
Here is how I would approach that, but it may not be the best method.
from datetime import datetime
dateranges = [("01-01-1951","12-31-1951"),("07-01-1962","12-31-1962")]
dateranges = list(map(lambda dr: tuple(map(lambda x: datetime.strptime(x,"%m-%d-%Y"),dr)),dateranges))
def datefilter(x):
# The date format is different here to match the format of the csv
x = datetime.strptime(x,"%Y-%m-%d")
for r in dateranges:
if r[0]<=x and r[1]>=x: return True
return False
with open(main_file, "rb") as fp:
root = csv.reader(fp, delimiter='\t')
result = collections.defaultdict(list)
for row in root:
if datefilter(row[3]):
# use a regular expression or any other means to filter on id here
if row[2].startswith("dd:111"): #code to remove item
What I have done is create a list of tuples of your date ranges (for brevity, I only put 2 ranges in it), and then I convert those into datetime objects.
I have used maps for doing that in one line: first loop over all tuples in that list, applying a function which loops over all entries in that tuple and converts to a date time, using the tuple and list functions to get back to the original structure. Doing it the long way would look like:
dateranges2=[]
for dr in dateranges:
dateranges2.append((datetime.strptime(dr[0],"%m-%d-%Y"),datetime.strptime(dr[1],"%m-%d-%Y"))
dateranges = dateranges2
Notice that I just convert each item in the tuple into a datetime, and add the tuples to the new list, replacing the original (which I don't need anymore).
Next, I create a datefilter function which takes a datestring, converts it to a datetime, and then loops over all the ranges, checking if the value is in the range. If it is, we return True (indicating this item should be filtered), otherwise return False if we have checking all ranges with no match (indicating that we don't filter this item).
Now you can check out the id using any method that you want once the date has matched, and remove the item if desired. As your example is constant in the first few characters, we can just use the string startswith function to check the id. If it is more complex, we could use a regex.
My kinda approach workds like this -
import csv
import re
import datetime
field_id = 'ddd:11'
d1 = datetime.date(1951,1,01) #change the start date
d2 = datetime.date(1951,12,31) #change the end date
diff = d2 - d1
date_list = []
for i in range(diff.days + 1):
date_list.append((d1 + datetime.timedelta(i)).isoformat())
with open('mwevers_example_2016.01.02-07.25.55.csv','rb') as csv_file:
reader = csv.reader(csv_file)
for row in reader:
for date in date_list:
if row[3] == date:
print row
var = re.search('\\b'+field_id,row[2])
if bool(var) == True:
print 'olalala'#here you can make a function to copy those rows into another file or any list
import csv
import sys
import re
from datetime import datetime
csv.field_size_limit(sys.maxsize)
field_id = 'ddd:11'
dateranges = [("1951-01-01", "1951-12-31"),
("1962-07-01", "1962-12-31"),
("1963-01-01", "1963-09-30"),
("1965-07-01", "1965-07-30"),
("1965-10-01", "1965-10-31"),
("1966-04-01", "1966-11-30"),
("1969-01-01", "1989-12-31")
]
dateranges = list(map(lambda dr:
tuple(map(lambda x:
datetime.strptime(x, "%Y-%m-%d"), dr)),
dateranges))
def datefilter(x):
x = datetime.strptime(x, "%Y-%m-%d")
for r in dateranges:
if r[0] <= x and r[1] >= x:
return True
return False
output = []
with open('my_file.csv', 'r') as f:
reader = csv.reader(f, delimiter='\t', quotechar='"')
next(reader)
for row in reader:
if datefilter(row[4]):
var = re.search('\\b'+field_id, row[3])
if bool(var) == False:
output.append(row)
else:
output.append(row)
with open('output.csv', 'w') as outputfile:
writer = csv.writer(outputfile, delimiter='\t', quotechar='"')
writer.writerows(output)

Need more efficient way to parse out csv file in Python

Here's a sample csv file
id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601
This is the output I'm looking for (list of serial_no withing a list of ids):
[2, [500,501,502]]
[3, [600, 601]]
I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
each_row = []
each_row.append(row[0])
each_row.append(row[1])
zipped_data.append(each_row)
for rec in zipped_data:
if rec[0] not in ids:
ids.append(rec[0])
for id in ids:
for rec in zipped_data:
if rec[0] == id:
ser_no.append(rec[1])
tmp.append(id)
tmp.append(ser_no)
print tmp
tmp = []
ser_no = []
**I've omitted var initializing for simplicity of code
print tmp
Gives me output I mentioned above. I know there's a better way to do this or pythonic way to do it. It's just too messy! Any suggestions would be great!
from collections import defaultdict
records = defaultdict(list)
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
records[row[0]].append(row[1])
#sorting by ids since keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results
If the list of serial_nos need to be unique just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1])
Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.
result = collections.defaultdict(list)
for row in data:
result[row[0]].append(row[1])
Here's a version I wrote, looks like there are plenty of answers for this one already though.
You might like using csv.DictReader, gives you easy access to each column by field name (from the header / first line).
#!/usr/bin/python
import csv
myFile = open('sample.csv','rb')
csvFile = csv.DictReader(myFile)
# first row will be used for field names (by default)
myData = {}
for myRow in csvFile:
myId = myRow['id']
if not myData.has_key(myId): myData[myId] = []
myData[myId].append(myRow['serial_no'])
for myId in sorted(myData):
print '%s %s' % (myId, myData[myId])
myFile.close()
Some observations:
0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...
1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.
2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.
3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.
4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.
5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.
6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).
Applying these ideas, we get:
filename = 'test.csv'
with open(filename) as in_file:
data = csv.reader(in_file)
data.next() # ignore the field labels
rows = list(data) # read the rest of the rows from the iterator
print [
# We want a list of all serial numbers from rows with a matching ID...
[serial_no for row_id, serial_no in rows if row_id == id]
# for each of the IDs that there is to match, which come from making
# a set from the first column of the data.
for id in set(zip(*rows)[0])
]
We can probably do even better than this by using the groupby function from the itertools module.
example using itertools.groupby. This only works if the rows are already grouped by id
from csv import DictReader
from itertools import groupby
from operator import itemgetter
filename = 'test.csv'
# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:
# group by id - this requires that the rows are already grouped by id
groups = groupby(DictReader(infile), key=itemgetter('id'))
# loop through the groups printing a list for each one
for i,j in groups:
print [i, map(itemgetter(' serial_no'), list(j))]
note the space in front of ' serial_no'. This is because of the space after the comma in the input file

Categories

Resources