Trying to join two datasets via csv and python

Trying to join two datasets via csv and python - python

I currently have two csv files full of objects. The objects in one of the csv files contains an object id, and various other info fields. The other contains the object id's that reference to the first file along with other info about the objects.
I'm trying to output a third csv file that contains all of the information for each object. Looping through these traditionally is too slow as there are ~3 million objects in one of the files. Does there exist a python package or other solution that makes this process more efficient?

This only requires that data from the smaller csv file be kept in memory.
import csv
extra_data {}
with open('smaller.csv', newline='') as fin1:
reader = csv.reader(fin1)
for row in reader:
objid = row[0] # or whichever field has the object id
extra_data[objid] = row[1:]
with open('bigger.csv', newline='') as fin2, open('combined.csv', 'w', newline='') as fout:
reader = csv.reader(fin2)
writer = csv.writer(fout)
for row in reader:
objid = row[0] # or whichever field has the object id
new_row = row + extra_data.get(objid, [])
writer.writerow(new_row)

Related

large file delete rows python

Need some help with a use case. I have two file one is about 9GB (test_data) and the other 42MB(master_data). test_data contains data with several columns one of the column i.e. #7 contains email address . master_data is my master data file which has just one column which is email address only.
What I am trying to achieve is to compare the emails in master_data file with the emails in test_data if they match, the whole row is to be deleted. I need an efficient way to achieve the same.
The below piece of code is written to achieve but I am stuck at deleting the lines from master_data file but am not sure if this is an efficient way of achieving this requirement.
import csv
import time
# open the file in read mode
filename = open('master_data.csv', 'r')
# creating dictreader object
file = csv.DictReader(filename)
# creating empty lists
email = []
# iterating over each row and append
# values to empty list
for col in file:
email.append(col['EMAIL'])
# printing lists
print('Email:', email)
datafile = open('test_data.csv', 'r+')
for line in datafile:
#print(line)
# str1,id=line.split(',')
split_line=line.split(',')
str1=split_line[7] # Whatever columns
id1=split_line[0]
for w in email:
print(w)
print(str1)
#time.sleep(2.4)
if w in str1:
print(id1)
datafile.remove(id1)

Removing lines from a file is difficult. Its a lot easier to write a new file, filtering out rows as you go. Put your existing emails in a set for easy lookup, write to a temporary file and rename when done. This also has the advantage that you don't loose data if something goes wrong along the way.
You'll need to "normalize" the email. Most email systems are case-insensitive and ignore periods in addresses. Addresses can also contain extra name information as in John Doe <j.doe#Gmail.com>. Write a function that puts addresses into a single form and use it for both of your files.
import csv
import os
import email.utils
def email_normalize(val):
# discard optional full name. lower case, remove '.' in local name
_, addr = email.utils.parseaddr(val)
local, domain = addr.lower().split('#', 1)
local = local.replace('.', '')
return f'{local}#{domain)'
# create set of user emails to keep
with open('master_data.csv', newline='') as file:
emails = set(email_normalize(row[0]) for row in csv.reader(file))
with open('test_data.csv', newline='') as infile, \
open('test_data.csv.tmp', 'w', newline='') as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile)
writer.writerow(next(reader)) # write header
writer.writerows(row for row in reader
if email_normalize(row[7]) not in emails) # email is column #7
del reader, writer
os.rename('test_data.csv', 'test_data.csv.deleteme')
os.rename('test_data.csv.tmp', 'test_data.csv')
os.remove('test_data.csv.deleteme')

You could load the master file and store the emails in a dict, then as you iterate the rows of test, you can check if the email for a row is in that (master) dict.
Given these CSVs:
test.csv:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,,,,,,foo#mail.com
2,,,,,,bar#mail.com
3,,,,,,baz#mail.com
4,,,,,,dog#mail.com
5,,,,,,foo#mail.com
master.csv:
Col1
foo#mail.com
cat#mail.com
dog#mail.com
When I run:
import csv
emails: dict[str, None] = {}
with open("master.csv", newline="") as f:
reader = csv.reader(f)
next(reader) # skip header
for row in reader:
emails[row[0]] = None
out_line = "{:<20} {:>8}"
with open("test.csv", newline="") as f:
reader = csv.reader(f)
next(reader) # skip header
print(out_line.format("Email", "Test ID"))
for row in reader:
if row[6] in emails:
print(out_line.format(row[6], row[0]))
I get:
Email Test ID
foo#mail.com 1
dog#mail.com 4
foo#mail.com 5
That proves that you can read the emails in from master and compare with them while reading from test.
As others have pointed out, actually deleting anything from a file is difficult; it's far easier to create a new file and just exclude (filter out) the things you don't want:
f_in = open("test.csv", newline="")
reader = csv.reader(f_in)
f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)
for row in reader:
if row[6] in emails:
continue
writer.writerow(row)
f_in.close()
f_out.close()
Iterating your CSV reader and writing-out with your CSV writer is a very efficient means of transforming a CSV (test.csv → output.csv, in this case): you only need the memory to hold row in each step of the loop.
When I run that, after having populated the emails dict like before, my output.csv looks like:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
2,,,,,,bar#mail.com
3,,,,,,baz#mail.com
For the real-world performance of your situation, I mocked up a 42MB CSV file for master—1.35M rows of 32-character-long hex strings. Reading those 1.35M unique strings and saving them in the dict took less than 1 second real time and used 176 MB of RAM (on my M1 Macbook Air, with dual-channel SSD).
Also, I recommend using the csv module every time you need to read/write a CSV. No matter how simple the CSV looks, using the csv readers/writers will be 100% correct and there's almost 0 overhead compared to trying and manually split or join on a comma.

getting average of some digits from a csv file as input and Write the averages in an output csv file in python 3

I am learning python3 :), and I am trying to read a CSV file with different rows
and take the average of the scores for each person(in each row)
and write it in a CSV file as an output in python 3.
The input file is like below:
David,5,2,3,1,6
Adele,3,4,1,5,2,4,2,1
...
The output file should seem like below:
David,4.75
Adele,2.75
...
It seems that I am reading the file correctly, as I print
the average for each name in the terminal, but in CSV
output file it prints only the average of the last name
of the input file, while I want to print all names and
corresponding averages in CSV output file.
Anybody can help me with it?
import csv
from statistics import mean
these_grades = []
name_list = []
reader = csv.reader(open('input.csv', newline=''))
for row in reader:
name = row[0]
name_list.append(name)
with open('result.csv', 'w', newline='\n') as f:
writer = csv.writer(f,
delimiter=',',
quotechar='"',
quoting=csv.QUOTE_MINIMAL)
for grade in row[1:]:
these_grades.append(int(grade))
for item in name_list:
writer.writerow([''.join(item), mean(these_grades)])
print('%s,%f' % (name , mean(these_grades)))

There are several issues in your code:
You're not using a context manager (with) when you read the input file. There's no reason to use it when writing but not when reading - you consequently don't close the "input.csv" file
You're using a list to store data from rows. This doesn't easily distinguish between the person's name and the scores associated with the person. It would be better to use a dictionary in which the key is the person's name, and the values stored against that key are the individual scores
You repeatedly open the file within a for loop in 'w' mode. Every time you open a file in write mode, it just wipes all the previous contents. You actually do write each row to the file, but you just wipe it again when you open the file on the next iteration.
You can use:
import csv
import statistics
# use a context manager to read the data in too, not just for writing
with open('input.csv') as infile:
reader = csv.reader(infile)
data = list(reader)
# Create a dictionary to store the scores against the name
scores = {}
for row in data:
scores[row[0]] = row[1:] # First item in the row is the key (name) and the rest is values
with open('output.csv', 'w', newline='') as outfile:
writer = csv.writer(outfile)
# Now we need to iterate the dictionary and average the score on each iteration
for name, scores in scores.items():
ave_score = statistics.mean([int(item) for item in scores])
writer.writerow([name, ave_score])
This can be further consolidated, but it's less easy to see what's happening:
with open('input.csv') as infile, open('output.csv', 'w', newline='') as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile)
for row in reader:
name = row[0]
values = row[1:]
ave_score = statistics.mean(map(int, values))
writer.writerow([name, ave_score])

How to delete lines from csv file using python?

I have a CSV file:It contain the classes name and type of code smell and for each class Icalculated the number of a code smell .the final calcul is on the last line so there are many repeated classes name .
I need just the last line of the class name.
This is a part of my CSV file beacause it's too long :
NameOfClass,LazyClass,ComplexClass,LongParameterList,FeatureEnvy,LongMethod,BlobClass,MessageChain,RefusedBequest,SpaghettiCode,SpeculativeGenerality
com.nirhart.shortrain.MainActivity,NaN,NaN,NaN,NaN,NaN,NaN,1,NaN,NaN,NaN
com.nirhart.shortrain.path.PathParser,NaN,1,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.PathParser,NaN,1,NaN,1,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.PathParser,NaN,1,1,1,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.PathParser,NaN,1,2,1,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.PathParser,NaN,1,2,1,1,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.PathPoint,1,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.PathPoint,1,NaN,1,NaN,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.TrainPath,NaN,NaN,NaN,1,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.rail.RailActionActivity,NaN,NaN,NaN,1,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.rail.RailActionActivity,NaN,NaN,NaN,1,1,NaN,NaN,NaN,NaN,NaN

To filter out the last entry for groups of NameOfClass, you can make use of Python's groupby() function to return lists of rows with the same NameOfClass. The last entry from each can then be written to a file.
from itertools import groupby
import csv
with open('data_in.csv', newline='') as f_input, open('data_out.csv', 'w', newline='') as f_output:
csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)
for key, rows in groupby(csv_input, key=lambda x: x[0]):
csv_output.writerow(list(rows)[-1])
For the data you have given, this would give you the following output:
NameOfClass,LazyClass,ComplexClass,LongParameterList,FeatureEnvy,LongMethod,BlobClass,MessageChain,RefusedBequest,SpaghettiCode,SpeculativeGenerality
com.nirhart.shortrain.MainActivity,NaN,NaN,NaN,NaN,NaN,NaN,1,NaN,NaN,NaN
com.nirhart.shortrain.path.PathParser,NaN,1,2,1,1,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.PathPoint,1,NaN,1,NaN,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.path.TrainPath,NaN,NaN,NaN,1,NaN,NaN,NaN,NaN,NaN,NaN
com.nirhart.shortrain.rail.RailActionActivity,NaN,NaN,NaN,1,1,NaN,NaN,NaN,NaN,NaN

To get just the unique class names (ignoring repeated rows, not deleting them), you can do this:
import csv
with open('my_file.csv', 'r') as csvfile:
reader = csv.reader(csvfile)
classNames = set(row[0] for row in reader)
print(classNames)
# {'com.nirhart.shortrain.MainActivity', 'com.nirhart.shortrain.path.PathParser', 'com.nirhart.shortrain.path.PathPoint', ...}
This is just using the csv module to open a file, getting the first value in each row, and then taking only the unique values of those. You can then manipulate the resulting set of strings (you might want to cast it back to a list via list(classNames)) however you need to.

If you intend to later process the data in pandas, filtering duplicates is trivial:
import pandas as pd
df = pd.read_csv('file.csv')
df = df.loc[~df.NameOfClass.duplicated(keep='last')]
If you just want to build a new csv file with only the expected lines, pandas is overkill and the csv module is enough:
import csv
with open('file.csv') as fdin, file('new_file.csv', 'w', newline='') as fdout:
rd = csv.reader(fdin)
wr = csv.writer(fdout)
wr.writerow(next(rd)) # copy the header line
old = None
for row in rd:
if old is not None and old[0] != row[0]:
wr.writerow(old)
old = row
wr.writerow(old)

parse a csv into python dictionaries by structured range

I've got a csv file that has ranges of information dumped into rows and comma separated. Each line is a separate row, and the recordset is 'bookended' by rows that say startrecord (with an unique id) and stoprecord (matching id). The data between the record bookmarks various, no there is some repeatability.
StartRecord, ID
name, value, other
name, value,value,value
EndRecord, ID
StartRecord, ID
name, value, other
something, another, another, something new, different
EndRecord, ID
StartRecord, ID
various, name, value, name, value, other
EndRecord, ID
I'm reading in the records using csv reader and iterating the rows. How can I capture the data between and startrecord and endrecord into separate dictionary objects (in this example, I'd have 3 records)?

The continue statement should be helpful here. Try this.
import csv
dict = {}
current_id = ''
with open('filename.csv', newline='') as csvfile:
reader = csv.reader(csvfile, delimiter=',')
for row in reader:
print(row)
if row[0] == 'StartRecord':
current_id = row[1]
dict[row[1]] = []
continue
if row[0] == 'EndRecord':
continue
dict[current_id].append(row)
print(dict)
Edit: I realised that you may need to modify the delimiter argument if your csv file has spaces after the commas like in your example.

Parsing CSV files using Python 2.7

I'm trying to write a script that will open a CSV file and write rows from that file to a new CSV file based on the match criteria of a unique telephone number in column 4 of csv.csv. The phone numbers are always in column 4, and are often duplicated in the file, however the other columns are often unique, thus each row is inherently unique.
A row from the csv file I'm reading looks like this: (the TN is 9259991234)
2,PPS,2015-09-17T15:44,9259991234,9DF51758-A2BD-4F65-AAA2
I hit an error with the code below saying that '_csv.writer' is not iterable and I'm not sure how to modify my code to solve the problem.
import csv
import sys
import os
os.chdir(r'C:\pTest')
with open(r'csv.csv', 'rb') as f:
reader = csv.reader(f, delimiter=',')
with open (r'new_csv.csv', 'ab') as new_f:
writer = csv.writer(new_f, delimiter=',')
for row in reader:
if row[3] not in writer:
writer.writerow(new_f)

Your error stems from this expression:
row[3] not in writer
You cannot test for membership against a csv.writer() object. If you wanted to track if you already have processed a phone number, use a separate set() object to track those:
with open(r'csv.csv', 'rb') as f:
reader = csv.reader(f, delimiter=',')
with open (r'new_csv.csv', 'ab') as new_f:
writer = csv.writer(new_f, delimiter=',')
seen = set()
for row in reader:
if row[3] not in seen:
seen.add(row[3])
writer.writerow(row)
Note that I also changed your writer.writerow() call; you want to write the row, not the file object.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Trying to join two datasets via csv and python - python

Related

large file delete rows python

getting average of some digits from a csv file as input and Write the averages in an output csv file in python 3

How to delete lines from csv file using python?

parse a csv into python dictionaries by structured range

Parsing CSV files using Python 2.7

Categories

Resources