Python (3.7) CSV Sort/Sum by Field Value - python

I have a csv file (of indefinite size) that I would like to read and do some work with.
Here is the structure of the csv file:
User, Value
CN,500.00
CN,-250.00
CN,360.00
PT,200.00
PT,230.00
...
I would like to read the file and get the sum of each row where the first field is the same.
I have been trying the following just to try and identify a value for the first field:
with open("Data.csv", newline='') as data:
    reader = csv.reader(data)
    for row in reader:
        if row.startswith('CN'):
            print("heres one")
This fails because startswith does not work on a list object. I have also tried using readlines().
EDIT 1:
I can currently print the following dataframe object with the sorted sums:
           Value
User
CN    3587881.89
D        1000.00
KC    1767783.99
REC     12000.00
SB      25000.00
SC    1443039.12
SS          0.00
T     9966998.93
TH    2640009.32
ls        500.00
I get this output using this code:
mydata=pd.read_csv('Data.csv')
out = mydata.groupby(['user']).sum()
print(out)
I'd now like to be able to write if statements against this object. Something like:
if out contains User 'CN'
    varX = Value for 'CN'
Because this is now a DataFrame, I am having trouble assigning the Value for a specific user to a variable.

You can do the following:
import pandas as pd
my_data = pd.read_csv('Data.csv')
my_data.groupby('user').sum()
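For the follow-up in EDIT 1 (testing whether a user is present and storing that user's total), a minimal sketch is shown here. It assumes pandas is installed, and it uses a small in-memory sample in place of Data.csv; the names `sample`, `totals` and `var_x` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample standing in for Data.csv (header written without stray spaces).
sample = io.StringIO(
    "User,Value\n"
    "CN,500.00\n"
    "CN,-250.00\n"
    "CN,360.00\n"
    "PT,200.00\n"
    "PT,230.00\n"
)

# Group by user and sum the Value column; the result is a Series indexed by User.
totals = pd.read_csv(sample).groupby("User")["Value"].sum()

# Membership is tested against the index (the user names), not the summed values.
if "CN" in totals.index:
    var_x = totals.loc["CN"]
    print(var_x)
```

If the grouped result is kept as a DataFrame, as in the question's `out`, the equivalent lookup would be `out.loc['CN', 'Value']`.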

You can test the first element of each row:
import csv
with open("Data.csv", newline='') as data:
    reader = csv.reader(data)
    for row in reader:
        if row[0].startswith('CN'):
            print("heres one")

Using collections.defaultdict
Ex:
import csv
from collections import defaultdict
filename = "Data.csv"
result = defaultdict(int)
with open(filename, newline='') as data:
    reader = csv.reader(data)
    next(reader)  # skip the header row
    for row in reader:
        result[row[0]] += float(row[1])
print(result)
Output:
defaultdict(<class 'int'>, {'CN': 610.0, 'PT': 430.0})
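To use the total for one user in an if statement, as asked in EDIT 1, a dictionary of sums can be queried directly. The sketch below uses only the standard library; `sum_by_user`, the in-memory sample and the variable names are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def sum_by_user(lines):
    """Return {user: total} for rows shaped like user,value that follow a header."""
    totals = defaultdict(float)
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row if there is one
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        totals[row[0]] += float(row[1])
    return dict(totals)


# Hypothetical sample standing in for Data.csv.
sample = io.StringIO(
    "User, Value\nCN,500.00\nCN,-250.00\nCN,360.00\nPT,200.00\nPT,230.00\n"
)
totals = sum_by_user(sample)

# Test for a user before reading its total.
if "CN" in totals:
    var_x = totals["CN"]

# Or fall back to a default when the user is absent.
missing = totals.get("ZZ", 0.0)
```

Passing an open file object instead of the `StringIO` sample should work the same way, since `csv.reader` accepts any iterable of lines.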

Related

Split values in CSV that look like JSON

So I have a CSV file with a column called content. However, the contents of that column look like they are based on JSON and therefore hold more columns. I would like to split these contents into multiple columns or extract the final part after "value". See the picture below for an example of the file. Any ideas how to get this? I would prefer using Python. I don't have any experience with JSON.
Using pandas you could do it in a simpler way.
EDIT: updated to handle the single quotes:
import pandas as pd
import json
data = pd.read_csv('test.csv', delimiter="\n")["content"]
res = [json.loads(row.replace("'", '"')) for row in data]
result = pd.DataFrame(res)
result.head()
# Export result to CSV
result.to_csv("result.csv")
This script will create a new CSV file with the 'value' added as an additional column
(make sure that input_csv and output_csv are different filenames):
import csv
import json
input_csv = "data.csv"
output_csv = "data_updated.csv"
values = []
with open(input_csv) as f_in:
    dr = csv.DictReader(f_in)
    for row in dr:
        value = json.loads(row["content"].replace("'", '"'))["value"]
        values.append(value)
with open(input_csv) as f_in:
    with open(output_csv, "w+") as f_out:
        w = csv.writer(f_out, lineterminator="\n")
        r = csv.reader(f_in)
        all_rows = []  # renamed from `all` to avoid shadowing the builtin
        row = next(r)
        row.append("value")
        all_rows.append(row)
        i = 0
        for row in r:
            row.append(values[i])
            all_rows.append(row)
            i += 1
        w.writerows(all_rows)
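The two passes over the input can probably be folded into one. The following is a sketch under the same assumption as the script above (the content column holds JSON-like text that uses single quotes); the `id` column, the sample rows and the function name `add_value_column` are hypothetical.

```python
import csv
import io
import json


def add_value_column(f_in, f_out):
    """Copy a CSV and append a 'value' column parsed from the 'content' column."""
    reader = csv.DictReader(f_in)
    fieldnames = list(reader.fieldnames) + ["value"]
    writer = csv.DictWriter(f_out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Swap single quotes for double quotes so json.loads accepts the text.
        content = json.loads(row["content"].replace("'", '"'))
        row["value"] = content["value"]
        writer.writerow(row)


# Hypothetical input; the real file's columns may differ.
source = io.StringIO(
    "id,content\n"
    "1,\"{'type': 'a', 'value': 'x'}\"\n"
    "2,\"{'type': 'b', 'value': 'y'}\"\n"
)
target = io.StringIO()
add_value_column(source, target)
print(target.getvalue())
```

The quote replacement is a shortcut: it would break on content whose strings themselves contain apostrophes.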

How to edit a CSV file row by row in Python without using Pandas

I have a CSV file and when I read it by importing the CSV library I get as the output:
['exam', 'id_student', 'grade']
['maths', '573834', '7']
['biology', '573834', '8']
['biology', '578833', '4']
['english', '581775', '7']
# goes on...
I need to edit it by creating a 4th column called 'Passed' with two possible values: True or False depending on whether the grade of the row is >= 7 (True) or not (False), and then count how many times each student passed an exam.
If it's not possible to edit the CSV file that way, I would need to just read the CSV file and then create a dictionary of lists with the following output:
dict = {'id_student':[573834, 578833, 581775], 'passed_count': [2,0,1]}
# goes on...
Thanks
Try importing the CSV as a pandas DataFrame:
import pandas as pd
data = pd.read_csv('data.csv')
And then use (the column in the question is named grade):
data['Passed'] = data['grade'] >= 7
And then save the DataFrame to CSV:
data.to_csv('final.csv', index=False)
It is totally possible to "edit" CSV.
Assuming you have a file students.csv with the following content:
exam,id_student,grade
maths,573834,7
biology,573834,8
biology,578833,4
english,581775,7
Iterate over the input rows, augment the field list of each row with an additional item, and save the result to another CSV:
import csv
with open('students.csv', 'r', newline='') as source, open('result.csv', 'w', newline='') as result:
    csvreader = csv.reader(source)
    csvwriter = csv.writer(result)
    # Deal with the header
    header = next(csvreader)
    header.append('Passed')
    csvwriter.writerow(header)
    # Process data rows
    for row in csvreader:
        row.append(str(int(row[2]) >= 7))
        csvwriter.writerow(row)
Now result.csv has the content you need.
If you need to replace the original content, use os.remove() and os.rename() to do that:
import os
os.remove('students.csv')
os.rename('result.csv', 'students.csv')
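As an alternative, `os.replace()` renames a file and overwrites an existing destination in a single call, so the separate remove step may not be needed. The sketch below works in a temporary directory so that no real file is touched; the paths and file contents are made up for the demonstration.

```python
import os
import tempfile

# Scratch directory so the example does not touch real files.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "students.csv")
updated = os.path.join(workdir, "result.csv")

with open(original, "w", newline="") as f:
    f.write("exam,id_student,grade\n")
with open(updated, "w", newline="") as f:
    f.write("exam,id_student,grade,Passed\n")

# Move the new file over the old one; the destination is overwritten.
os.replace(updated, original)
```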
Counting can be done independently; you don't need to modify the CSV for that:
import csv
from collections import defaultdict
with open('students.csv', 'r', newline='') as source:
    csvreader = csv.reader(source)
    next(csvreader)  # Skip header
    stats = defaultdict(int)
    for row in csvreader:
        if int(row[2]) >= 7:
            stats[row[1]] += 1
print(stats)
You can include counting into the code above and have both pieces in one place. defaultdict (stats) has the same interface as dict if you need to access that.
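The counting code above only records students who passed at least once, whereas the output requested in the question also lists a student with zero passes. A possible variant that registers every student and returns the dictionary of lists is sketched here; `count_passes`, `pass_mark` and the in-memory sample are illustrative names, not part of the original answer.

```python
import csv
import io


def count_passes(lines, pass_mark=7):
    """Return {'id_student': [...], 'passed_count': [...]} in first-seen order."""
    counts = {}
    for row in csv.DictReader(lines):
        student = int(row["id_student"])
        counts.setdefault(student, 0)  # register the student even with no passes
        if int(row["grade"]) >= pass_mark:
            counts[student] += 1
    return {
        "id_student": list(counts),
        "passed_count": list(counts.values()),
    }


# Sample rows taken from the question.
sample = io.StringIO(
    "exam,id_student,grade\n"
    "maths,573834,7\n"
    "biology,573834,8\n"
    "biology,578833,4\n"
    "english,581775,7\n"
)
summary = count_passes(sample)
print(summary)
```

This relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onwards.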

Python - Extract data from csvfile1 and write to csvfile2 based on values in columns

I have data stored in a csv file :
ID;Event;Date
ABC;In;05/01/2015
XYZ;In;05/01/2016
ERT;In;05/01/2014
... ... ...
ABC;Out;05/01/2017
First, I am trying to extract all rows where Event is "In" and save those rows in a new CSV file. Here is the code I've tried so far:
[UPDATED : 05/18/2017]
with open('csv_in', 'r') as f, open('csv_out','w') as f2:
    fieldnames=['ID','Event','Date']
    reader = csv.DictReader(f, delimiter=';', lineterminator='\n',
                            fieldnames=fieldnames)
    wr = csv.DictWriter(f2,dialect='excel',delimiter=';',
                        lineterminator='\n',fieldnames=fieldnames)
    rows = [row for row in reader if row['Event'] == 'In']
    for row in rows:
        wr.writerows(row)
I am getting the following error: "ValueError: dict contains fields not in fieldnames: 'I', 'D'"
[/UPDATED]
1/ Any thoughts on how to fix this ?
2/ Next step, how would you proceed to do a "lookup" on the ID (if exists several times as per ID "ABC") and extract the given "Date" value where Event is "Out"
output desired :
ID Date Exit date
ABC 05/01/2015 05/01/2017
XYZ 05/01/2016
ERT 05/01/2014
Thanks in advance for your input.
PS: I can't use pandas, only the standard library.
You can process the raw CSV with the standard library like so:
# Read every line of the input file
with open('csv_in.csv', 'r') as old_csv_file:
    oldcsv = old_csv_file.read().split('\n')
newcsv = []
# This next part keeps the lines whose event is "In"
for line in oldcsv:
    if 'In' in line.split(';'):
        newcsv.append(line)
with open('new_csv.csv', 'w') as new_csv_file:
    for line in newcsv:
        new_csv_file.write(line + '\n')
You would use the same method for the lookup: change the keyword in the for loop. If the newly generated list contains more than one item, the ID occurs more than once; in that case, modify the condition to include two keywords.
The error here is because you have not added a delimiter.
Syntax-
csv.DictReader(f, delimiter=';')
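In the updated code from the question the delimiter is already set, and the reported ValueError (fields 'I' and 'D') appears consistent with passing a single row dict to writerows(), which expects a sequence of dicts and so ends up treating each key string as a row. A hedged sketch of the filtering step with that call corrected is shown here; `copy_in_events`, `FIELDNAMES` and the in-memory sample are assumptions for illustration.

```python
import csv
import io

FIELDNAMES = ["ID", "Event", "Date"]


def copy_in_events(f_in, f_out):
    """Write only the rows whose Event is 'In' to f_out, keeping the header."""
    # The header row of the input supplies the field names.
    reader = csv.DictReader(f_in, delimiter=";")
    writer = csv.DictWriter(
        f_out, fieldnames=FIELDNAMES, delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    rows = [row for row in reader if row["Event"] == "In"]
    # writerows() takes a list of dicts; use writerow(row) for a single dict.
    writer.writerows(rows)


# Sample rows taken from the question.
source = io.StringIO(
    "ID;Event;Date\n"
    "ABC;In;05/01/2015\n"
    "XYZ;In;05/01/2016\n"
    "ERT;In;05/01/2014\n"
    "ABC;Out;05/01/2017\n"
)
target = io.StringIO()
copy_in_events(source, target)
print(target.getvalue())
```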
For Part 2.
import csv
import datetime
with open('csv_in', 'r') as f, open('csv_out','w') as f2:
    reader = csv.DictReader(f, delimiter=';')
    wr = csv.writer(f2,dialect='excel',lineterminator='\n')
    result = {}
    for row in reader:
        if row['ID'] not in result:
            # Assign Values if not in dictionary
            if row['Event'] == 'In':
                result[row['ID']] = {'IN' : datetime.datetime.strptime(row['Date'], '%d/%m/%Y') }
            else:
                result[row['ID']] = {'OUT' : datetime.datetime.strptime(row['Date'], '%d/%m/%Y') }
        else:
            # Compare dates with those present in csv.
            if row['Event'] == 'In':
                # if 'IN' is not present, use the max value of Datetime to compare
                result[row['ID']]['IN'] = min(result[row['ID']].get('IN', datetime.datetime.max), datetime.datetime.strptime(row['Date'], '%d/%m/%Y'))
            else:
                # Similarly if 'OUT' is not present, use the min value of datetime to compare
                result[row['ID']]['OUT'] = max(result[row['ID']].get('OUT', datetime.datetime.min), datetime.datetime.strptime(row['Date'], '%d/%m/%Y'))
    # format the results back to desired representation
    for v1 in result.values():
        for k2,v2 in v1.items():
            v1[k2] = datetime.datetime.strftime(v2, '%d/%m/%Y')
    wr.writerow(['ID', 'Entry', 'Exit'])
    for row in result:
        wr.writerow([row, result[row].get('IN'), result[row].get('OUT')])
This code should work just fine. I have tested it on a small input

Python - splitting data as columns in csv file

I have data in a CSV file that is imported like this:
import csv
with open('Half-life.csv', 'r') as f:
    data = list(csv.reader(f))
The data comes out as rows, so that data[0] = ['10', '2', '2'] and so on.
What I want, though, is to retrieve the data as columns instead of rows; in this case there are 3 columns.
You can create three separate lists, and then append to each using csv.reader.
import csv
c1 = []
c2 = []
c3 = []
with open('Half-life.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        c1.append(row[0])
        c2.append(row[1])
        c3.append(row[2])
A little more automatic and flexible version of Alexander's answer:
import csv
from collections import defaultdict
columns = defaultdict(list)
with open('Half-life.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for i in range(len(row)):
            columns[i].append(row[i])
# Following line is only necessary if you want a key error for invalid column numbers
columns = dict(columns)
You could also modify this to use column headers instead of column numbers.
import csv
from collections import defaultdict
columns = defaultdict(list)
with open('Half-life.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    headers = next(reader)
    column_nums = range(len(headers)) # Do NOT change to xrange
    for row in reader:
        for i in column_nums:
            columns[headers[i]].append(row[i])
# Following line is only necessary if you want a key error for invalid column names
columns = dict(columns)
Another option, if you have numpy installed, you can use loadtxt to read a csv file into a numpy array. You can then transpose the array if you want more columns than rows (I wasn't quite clear on how you wanted the data to look). For example:
import numpy as np
# Load data
data = np.loadtxt('csv_file.csv', delimiter=',')
# Transpose data if needs be
data = np.transpose(data)
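Without numpy, the same transposition can be sketched with the built-in `zip`, which pairs up the n-th item of every row. The sample rows below are hypothetical apart from the first, which comes from the question.

```python
import csv
import io

# Hypothetical contents standing in for Half-life.csv.
sample = io.StringIO("10,2,2\n20,4,3\n30,8,5\n")

rows = list(csv.reader(sample))

# zip(*rows) yields one tuple per column; convert each to a list.
columns = [list(col) for col in zip(*rows)]
print(columns[0])
```

Note that `zip` stops at the shortest row, so ragged files would lose trailing fields; `itertools.zip_longest` is one way to keep them.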

How to read a csv file into something like a "record" data type?

For Python 3.4.0
Hey everyone,
I have a csv file that looks like this:
string1;value1
string2;value2
string3;value3
What I wanna do is getting this csv file into some kind of "record" data type, so that I can e.g. look for stringX in stringbig and, if stringX is found, then add +1 to valueX.
What is the easiest way to code that?
Thanks in advance
You can make rows into namedtuples. Here's a simple example:
import csv
from collections import namedtuple
Record = namedtuple('Record', ['product', 'part_number', 'category'])
# Open in text mode with newline='' on Python 3; 'rb' is the Python 2 idiom
with open('inventory.csv', newline='') as inf:
    for rec in map(Record._make, csv.reader(inf)):
        print(rec.part_number)
You can just use the built-in csv module and a simple Python dictionary:
import csv
records = {}
# csv.reader needs a text-mode file on Python 3, so 'rb' is replaced by newline=''
with open('/path/to/your/file.csv', newline='') as fileobj:
    reader = csv.reader(fileobj, delimiter=';')
    for key, value in reader:
        records[key] = int(value)
Then you can easily update the valueX for stringX by doing:
records[stringX] = records.get(stringX, 0) + 1
You can use DictReader for that:
CSV:
Name;Value
string1;0
string2;20
string3;12
Python:
import csv
with open("data.csv", 'r') as f:
    r = csv.DictReader(f, delimiter=';')
    for row in r:
        if 'string2' in row['Name']:
            # DictReader yields strings, so convert before adding
            row['Value'] = int(row['Value']) + 1
            print(row)
{'Value': 21, 'Name': 'string2'}
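Putting the pieces together for the original request (look for each stringX in a larger string and add 1 to its value when found), a minimal sketch might look like this. The helper names `load_records` and `bump_matches`, the sample data and the search text are all assumptions, and the file is assumed to have no header row, as in the question.

```python
import csv
import io


def load_records(lines):
    """Read key;value rows into a dict with integer values."""
    return {key: int(value) for key, value in csv.reader(lines, delimiter=";")}


def bump_matches(records, text):
    """Add 1 to the value of every key that occurs as a substring of text."""
    for key in records:
        if key in text:
            records[key] += 1
    return records


# Hypothetical file contents and search text.
sample = io.StringIO("string1;0\nstring2;20\nstring3;12\n")
records = load_records(sample)
bump_matches(records, "xx string2 yy string3")
print(records)
```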
