Extract various information - python

Overview
I would like to extract various pieces of information, like name, date and address, from a two-column CSV file before writing them to another CSV file.
Conditions
Extract the Name from the first row, as it will always be the first row.
Extract the Date by regex (is there regex in Python?) in ##/##/#### format.
Extract the Address by the constant keyword 'road'.
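Yes, Python has regular expressions in its built-in `re` module. A minimal sketch of the date condition (assuming one- or two-digit day and month are both acceptable; the helper name `is_date` is mine):

```python
import re

# Matches dates like 2/06/2016 or 12/02/2016 (d/mm/yyyy or dd/mm/yyyy)
DATE_RE = re.compile(r'^\d{1,2}/\d{1,2}/\d{4}$')

def is_date(value):
    """Return True if value looks like a ##/##/#### date."""
    return bool(DATE_RE.match(value))
```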
Example dummy source CSV data (reference file format, as viewed from Excel):
ID,DATA
88888,DADDY
88888,2/06/2016
88888,new issac road
99999,MUMMY
99999,samsung road
99999,12/02/2016
Desired CSV outcome
ID,Name,Address,DATE
88888,DADDY,new issac road,2/06/2016
99999,MUMMY,samsung road,12/02/2016
What I have so far:
import csv
from collections import defaultdict

columns = defaultdict(list)  # each value in each column is appended to a list

with open('dummy_data.csv') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    for row in reader:  # read a row as {column1: value1, column2: value2,...}
        for (k, v) in row.items():  # go over each column name and value
            columns[k].append(v)  # append the value into the appropriate list
                                  # based on column name k

uniqueidstatement = columns['receipt_id']
print uniqueidstatement

resultFile = open("wtf.csv", 'wb')
wr = csv.writer(resultFile, dialect='excel')
wr.writerow(uniqueidstatement)

You can group the sections by ID, and from each group you can determine which value is the date and which is the address with some simple logic.
import csv
from itertools import groupby
from operator import itemgetter

with open("test.csv") as f, open("out.csv", "w") as out:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    writer = csv.writer(out)
    writer.writerow(["ID", "NAME", "ADDRESS", "DATE"])
    groups = groupby(reader, key=itemgetter(0))
    for k, v in groups:
        id_, name = next(v)
        add_date_1, add_date_2 = next(v)[1], next(v)[1]
        date, add = (add_date_1, add_date_2) if "road" in add_date_2 else (add_date_2, add_date_1)
        writer.writerow([id_, name, add, date])
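If the keyword 'road' turns out to be less constant than assumed, the date pattern from the question can drive the classification instead. A sketch under that assumption (the helper name `classify` is mine, not from the original code; it assumes the name comes first and exactly one remaining value matches the date pattern):

```python
import re

DATE_RE = re.compile(r'^\d{1,2}/\d{1,2}/\d{4}$')

def classify(values):
    """Split one group's data values into (name, address, date).

    Assumes the first value is always the name, and that exactly one
    of the remaining values matches the ##/##/#### date pattern.
    """
    name, rest = values[0], values[1:]
    date = next(v for v in rest if DATE_RE.match(v))
    address = next(v for v in rest if not DATE_RE.match(v))
    return name, address, date
```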

Related

How can I combining like-rows in CSV to fill missing data?

I have a CSV file that might look like:
id, value01, value02,
01, , 01b,
01, 01a, ,
02, , 02b,
02, 02a, 02b,
...
As you can see, I have duplicate rows (determined as duplicates by the id; there can be more than two), where each duplicate is missing values that the other duplicates contain.
I think someone who managed this CSV output wrote twice to the CSV rather than combine results and output once, so now I need to find a clean way to do this.
So far, my work is:
import csv

def combine_dups():
    data = []
    with open("file.csv", newline='', mode='r') as csvFile:
        csvData = csv.DictReader(csvFile)
        for row in csvData:
            for lookahead in csvData:
                # Check if lookahead and current row have matching ids
                if row["id"] == lookahead["id"]:
                    # Loop through columns of row and lookahead
                    for col in row:
                        # If current row's column is blank, take value from lookahead
                        if row[col] == '' or row[col] is None:
                            row[col] = lookahead[col]
            data.append(row)  # Add new filled out, completed row
    # Manage data to no longer contain excess duplicates
    # Code here??
    return data
This code isn't correct, as it:

Loops through csvData for each row, rather than looping through all data after the current row. This is easily solved with an index-based for loop, but I left that out for simplicity.
Fills in the row with the missing data, but repeats this operation for every other row with an identical id. How can I avoid this?
Edit:
For clarity, the NEW csv should look like:
id, value01, value02,
01, 01a, 01b,
02, 02a, 02b,
...
Using the csv module.
Ex:
import csv

data = {}
with open(filename) as csvInFile:
    csvData = csv.DictReader(csvInFile)
    fieldnames = csvData.fieldnames
    for row in csvData:
        # Combine data and update.
        data.setdefault(row['id'], dict()).update(
            {k: v for k, v in row.items()
             if v.strip() and not data.get(row['id']).get(k)}
        )

with open(filename, "w", newline='') as csvOutFile:
    csvOutData = csv.DictWriter(csvOutFile, fieldnames=fieldnames)
    csvOutData.writeheader()
    csvOutData.writerows(data.values())  # Write data.
Output:
id, value01, value02
01, 01a, 01b
02, 02a, 02b
This could be solved by using a dictionary to store the values. The CSV is then traversed just once.
You can refer to the code below:
import csv

c_dict = {}
with open("file.csv", newline='', mode='r') as csvFile:
    csvData = csv.DictReader(csvFile)
    for row in csvData:
        if row['id'] not in c_dict.keys():
            c_dict[row['id']] = {}
        if (row['value01'] == '' or row['value01'] is None) or 'value01' in c_dict[row['id']].keys():
            pass
        else:
            c_dict[row['id']]['value01'] = row['value01']
        if (row['value02'] == '' or row['value02'] is None) or 'value02' in c_dict[row['id']].keys():
            pass
        else:
            c_dict[row['id']]['value02'] = row['value02']

keys = ['id', 'value01', 'value02']
with open('output.csv', 'w', newline='') as output_file:
    dict_writer = csv.writer(output_file)
    dict_writer.writerow(keys)
    for k, v in c_dict.items():
        dict_writer.writerow([k, v['value01'], v['value02']])
Note: if you have more than these two columns, you can put the if-else clause in a loop over the column names.
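A sketch of that generalization, looping over the reader's fieldnames so nothing but the `id` key is hard-coded (the helper `combine` and the use of `io.StringIO` for a self-contained demo are my illustrative assumptions):

```python
import csv
import io

def combine(csv_text, key='id'):
    """Collapse duplicate rows, keeping the first non-blank value per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    merged = {}
    for row in reader:
        target = merged.setdefault(row[key], {})
        for col in reader.fieldnames:
            # Only fill a column that is still empty in the merged row
            if not target.get(col) and (row[col] or '').strip():
                target[col] = row[col]
    return list(merged.values())
```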

Python (3.7) CSV Sort/Sum by Field Value

I have a csv file (of indefinite size) that I would like to read and do some work with.
Here is the structure of the csv file:
User, Value
CN,500.00
CN,-250.00
CN,360.00
PT,200.00
PT,230.00
...
I would like to read the file and sum the Value column for all rows that share the same first field.
I have been trying the following just to try and identify a value for the first field:
import csv

with open("Data.csv", newline='') as data:
    reader = csv.reader(data)
    for row in reader:
        if row.startswith('CN'):
            print("heres one")
This fails because startswith does not work on a list object. I have also tried using readlines().
EDIT 1:
I can currently print the following dataframe object with the sorted sums:
Value
User
CN 3587881.89
D 1000.00
KC 1767783.99
REC 12000.00
SB 25000.00
SC 1443039.12
SS 0.00
T 9966998.93
TH 2640009.32
ls 500.00
I get this output using this code:
mydata=pd.read_csv('Data.csv')
out = mydata.groupby(['user']).sum()
print(out)
Id now like be able to write if statements for this object. Something like:
if out contains User 'CN'
varX = Value for 'CN'
Because this is now a DataFrame, I am having trouble assigning the Value for a specific user to a variable.
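Since `groupby(...).sum()` returns a DataFrame indexed by user, one way to test for a user and pull out its value is via `.index` and `.loc`. A minimal sketch (the helper name `value_for` and the column names are assumptions based on the example data):

```python
import pandas as pd

def value_for(summed, user):
    """Return the summed Value for user, or None if the user is absent."""
    if user in summed.index:
        return summed.loc[user, 'Value']
    return None

# Example: build the same kind of grouped sum as in the question
df = pd.DataFrame({'User': ['CN', 'CN', 'PT'], 'Value': [500.0, -250.0, 200.0]})
out = df.groupby('User').sum()
```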
You can do the following:
import pandas as pd

my_data = pd.read_csv('Data.csv')
my_data.groupby('user').sum()
You can use the row's first element:
import csv

with open("Data.csv", newline='') as data:
    reader = csv.reader(data)
    for row in reader:
        if row[0].startswith('CN'):
            print("heres one")
Using collections.defaultdict
Ex:
import csv
from collections import defaultdict

result = defaultdict(int)
with open(filename, newline='') as data:
    reader = csv.reader(data)
    next(reader)  # skip the header row
    for row in reader:
        result[row[0]] += float(row[1])

print(result)
Output
defaultdict(<class 'int'>, {'CN': 610.0, 'PT': 430.0})

Python - Extract data from csvfile1 and write to csvfile2 based on values in columns

I have data stored in a csv file :
ID;Event;Date
ABC;In;05/01/2015
XYZ;In;05/01/2016
ERT;In;05/01/2014
... ... ...
ABC;Out;05/01/2017
First, I am trying to extract all rows where Event is "In" and save those rows to a new CSV file. Here is the code I've tried so far:
[UPDATED : 05/18/2017]
import csv

with open('csv_in', 'r') as f, open('csv_out', 'w') as f2:
    fieldnames = ['ID', 'Event', 'Date']
    reader = csv.DictReader(f, delimiter=';', lineterminator='\n',
                            fieldnames=fieldnames)
    wr = csv.DictWriter(f2, dialect='excel', delimiter=';',
                        lineterminator='\n', fieldnames=fieldnames)
    rows = [row for row in reader if row['Event'] == 'In']
    for row in rows:
        wr.writerows(row)
I am getting the following error: "ValueError: dict contains fields not in fieldnames: 'I', 'D'"
[/UPDATED]
1/ Any thoughts on how to fix this?
2/ Next step: how would you proceed to do a "lookup" on the ID (which may exist several times, as with ID "ABC") and extract the given "Date" value where Event is "Out"?
output desired :
ID Date Exit date
ABC 05/01/2015 05/01/2017
XYZ 05/01/2016
ERT 05/01/2014
Thanks in advance for your input.
PS: I can't use pandas, only the standard library.
You can interpret the raw CSV with the standard library like so:
oldcsv = open('csv_in.csv', 'r').read().split('\n')
newcsv = []

# this next part checks for events that are in
for line in oldcsv:
    if 'In' in line.split(';'):
        newcsv.append(line)

new_csv_file = open('new_csv.csv', 'w')
for line in newcsv:
    new_csv_file.write(line + '\n')
new_csv_file.close()
You would use the same method to do your look-up; just change the keyword in that for loop. If there is more than one item in the newly generated list, you have more than one occurrence of your ID, so modify the condition to include two keywords.
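A sketch of that look-up step in the same raw-split style (the helper `exit_dates` is hypothetical; it assumes the `;` delimiter and the ID;Event;Date column order from the question):

```python
def exit_dates(lines):
    """Map each ID to its 'Out' date, scanning raw ';'-delimited lines."""
    out = {}
    for line in lines:
        fields = line.split(';')
        if 'Out' in fields:
            out[fields[0]] = fields[2]
    return out
```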
The error here is because you have not added a delimiter.
Syntax:
csv.DictReader(f, delimiter=';')
For Part 2.
import csv
import datetime

with open('csv_in', 'r') as f, open('csv_out', 'w') as f2:
    reader = csv.DictReader(f, delimiter=';')
    wr = csv.writer(f2, dialect='excel', lineterminator='\n')
    result = {}
    for row in reader:
        if row['ID'] not in result:
            # Assign values if not in dictionary
            if row['Event'] == 'In':
                result[row['ID']] = {'IN': datetime.datetime.strptime(row['Date'], '%d/%m/%Y')}
            else:
                result[row['ID']] = {'OUT': datetime.datetime.strptime(row['Date'], '%d/%m/%Y')}
        else:
            # Compare dates with those present in csv.
            if row['Event'] == 'In':
                # If 'IN' is not present, use the max value of datetime to compare
                result[row['ID']]['IN'] = min(result[row['ID']].get('IN', datetime.datetime.max),
                                              datetime.datetime.strptime(row['Date'], '%d/%m/%Y'))
            else:
                # Similarly, if 'OUT' is not present, use the min value of datetime to compare
                result[row['ID']]['OUT'] = max(result[row['ID']].get('OUT', datetime.datetime.min),
                                               datetime.datetime.strptime(row['Date'], '%d/%m/%Y'))
    # Format the results back to the desired representation
    for v1 in result.values():
        for k2, v2 in v1.items():
            v1[k2] = datetime.datetime.strftime(v2, '%d/%m/%Y')
    wr.writerow(['ID', 'Entry', 'Exit'])
    for row in result:
        wr.writerow([row, result[row].get('IN'), result[row].get('OUT')])
This code should work just fine; I have tested it on a small input.

Referencing keys in a Python dictionary for CSV reader

New to Python and trying to build a simple CSV reader to create new trades off an existing instrument. Ideally, I'd like to build a dictionary to simplify the parameters required to set up a new trade (instead of using row[1], row[2], row[3], etc., I'd like to use my headers that read Value Date, Trade Date, Price, Quantity, etc.).
I've created dictionary keys below, but am having trouble linking them to my script to create the new trade. What should I put to substitute the rows? Any advice appreciated! Thanks...
Code below:
import acm
import csv

# Opening CSV file
with open('C:\Users\Yina.Huang\Desktop\export\TradeBooking.csv', 'rb') as f:
    reader = csv.DictReader(f, delimiter=',')
    next(reader, None)
    for row in reader:
        # Match column header with column number
        d = {
            row["Trade Time"],
            row["Value Day"],
            row["Acquire Day"],
            row["Instrument"],
            row["Price"],
            row["Quantity"],
            row["Counterparty"],
            row["Acquirer"],
            row["Trader"],
            row["Currency"],
            row["Portfolio"],
            row["Status"]
        }
        NewTrade = acm.FTrade()
        NewTrade.TradeTime = "8/11/2016 12:00:00 AM"
        NewTrade.ValueDay = "8/13/2016"
        NewTrade.AcquireDay = "8/13/2016"
        NewTrade.Instrument = acm.FInstrument[row["Instrument"]]
        NewTrade.Price = row[4]
        NewTrade.Quantity = row[5]
        NewTrade.Counterparty = acm.FParty[row[6]]
        NewTrade.Acquirer = acm.FParty[row[7]]
        NewTrade.Trader = acm.FUser[row[8]]
        NewTrade.Currency = acm.FCurrency[row[9]]
        NewTrade.Portfolio = acm.FPhysicalPortfolio[row[10]]
        NewTrade.Premium = (int(row[4]) * int(row[5]))
        NewTrade.Status = row[11]
        print NewTrade
        NewTrade.Commit()
The csv module already provides this functionality with the csv.DictReader object.
with open('C:\Users\Yina.Huang\Desktop\export\TradeBooking.csv', 'rb') as f:
    reader = csv.DictReader(f)
    for row in reader:
        NewTrade = acm.FTrade()
        NewTrade.TradeTime = row['Trade Time']
        NewTrade.ValueDay = row['Value Day']
        NewTrade.AcquireDay = row['Acquire Day']
        NewTrade.Instrument = acm.FInstrument[row['Instrument']]
        NewTrade.Price = row['Price']
        NewTrade.Quantity = row['Quantity']
        # etc
From the documentation:
Create an object which operates like a regular reader but maps the
information read into a dict whose keys are given by the optional
fieldnames parameter. The fieldnames parameter is a sequence whose
elements are associated with the fields of the input data in order.
These elements become the keys of the resulting dictionary. If the
fieldnames parameter is omitted, the values in the first row of the
csvfile will be used as the fieldnames. If the row read has more
fields than the fieldnames sequence, the remaining data is added as a
sequence keyed by the value of restkey. If the row read has fewer
fields than the fieldnames sequence, the remaining keys take the value
of the optional restval parameter.
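In other words, the header row supplies the dictionary keys automatically. A minimal self-contained demonstration (the sample data here is made up, using `io.StringIO` in place of a real file):

```python
import csv
import io

sample = "Price,Quantity\n101.5,20\n"
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)
# Each row is a dict keyed by the header names, with string values
print(rows[0]['Price'], rows[0]['Quantity'])
```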

How do I merge two CSV files based on field and keep same number of attributes on each record?

I am attempting to merge two CSV files based on a specific field in each file.
file1.csv
id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"
file2.csv
id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False
This is the code I am using:
import csv
from collections import OrderedDict

with open('file2.csv', 'r') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader, None)  # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}

with open('file1.csv', 'r') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader, None)  # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)

result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)

with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)
I get output like this, which merges appropriately, but does not have the same number of attributes for all rows:
1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure
file2 will not have a record for every id in file1. I'd like the output to have empty fields from file2 in the merged file. For example, id 1 would look like this:
1,True,7,Purple,,,
How can I add the empty fields to records that don't have data in file2 so that all of my records in the merged CSV have the same number of attributes?
If we're not using pandas, I'd refactor to something like
import csv
from collections import OrderedDict

filenames = "file1.csv", "file2.csv"
data = OrderedDict()
fieldnames = []
for filename in filenames:
    with open(filename, "rb") as fp:  # python 2
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            data.setdefault(row["id"], {}).update(row)

fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.itervalues():
        writer.writerow([row.get(field, '') for field in fieldnames])
which gives
id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
For comparison, the pandas equivalent would be something like
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)
which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.
You can use pandas to do this:
import pandas
csv1 = pandas.read_csv('file1.csv')
csv2 = pandas.read_csv('file2.csv')
merged = csv1.merge(csv2, on='id', how='outer')
merged.to_csv("output.csv", index=False)
I haven't tested this yet but it should put you on the right track until I can try it out. The code is quite self-explanatory; first you import the pandas library so that you can use it. Then using pandas.read_csv you read the 2 csv files and use the merge method to merge them. The on parameter specifies which column should be used as the "key". Finally, the merged csv is written to output.csv.
Use a dict of dicts, then update it. Like this:
import csv
from collections import OrderedDict

with open('file2.csv', 'r') as f2:
    reader = csv.reader(f2)
    lines2 = list(reader)

with open('file1.csv', 'r') as f1:
    reader = csv.reader(f1)
    lines1 = list(reader)

dict1 = {row[0]: dict(zip(lines1[0][1:], row[1:])) for row in lines1[1:]}
dict2 = {row[0]: dict(zip(lines2[0][1:], row[1:])) for row in lines2[1:]}

# merge
updatedDict = OrderedDict()
mergedAttrs = OrderedDict.fromkeys(lines1[0][1:] + lines2[0][1:], "?")
for id, attrs in dict1.iteritems():
    d = mergedAttrs.copy()
    d.update(attrs)
    updatedDict[id] = d
for id, attrs in dict2.iteritems():
    updatedDict[id].update(attrs)

# out
with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for id, rest in sorted(updatedDict.iteritems()):
        w.writerow([id] + rest.values())
