Input:
I have two csv files (file1.csv and file2.csv).
file1 looks like:
ID,Name,Gender
1,Smith,M
2,John,M
file2 looks like:
name,gender,city,id
Problem:
I want to compare the header of file1 with the header of file2 and copy the data of the matching columns. The headers in file1 need to be lowercased before finding the matching columns in file2.
Output:
the output should be like this:
name,gender,city,id # name, gender, and id are the only matching columns between file1 and file2
Smith,M, ,1 # the data copied for the name, gender, and id columns
John,M, ,2
I have tried the following code so far:
import csv

file1 = csv.DictReader(open("file1.csv"))  # reading file1.csv
file1_Dict = {}  # dictionary of lists that will store the keys and values as lists
for row in file1:
    for column, value in row.items():  # items(), not iteritems(), in Python 3
        file1_Dict.setdefault(column, []).append(value)
for key in list(file1_Dict):  # converting the keys of the dictionary to lowercase
    file1_Dict[key.lower()] = file1_Dict.pop(key)

file2 = open("file2.csv")  # reading file2.csv
file2_Dict = {}  # store the keys into a dictionary with empty values
for row2 in file2:
    row2 = row2.split(",")
    for i in row2:
        file2_Dict[i] = ""
Any idea how to solve this problem?
I had a crack at this problem using Python, without taking performance into consideration. Took me quite a while, phew!
This is my solution.
import csv

csv_data1_filepath = './file1.csv'
csv_data2_filepath = './file2.csv'

def main():
    # Read both CSV files into memory.
    with open(csv_data1_filepath) as f:
        data1 = list(csv.reader(f))
    with open(csv_data2_filepath) as f:
        data2 = list(csv.reader(f))

    file1_header = data1[0][:]  # get f1 header
    file2_header = data2[0][:]  # get f2 header

    lowered_file1_header = [item.lower() for item in file1_header]  # lowercase it
    lowered_file2_header = [item.lower() for item in file2_header]  # do it for header 2 anyway

    # Map each column name to its index in file1 (or -1 if file1 has no data for it).
    col_index_dict = {}
    for column in lowered_file1_header:
        if column in lowered_file2_header:
            col_index_dict[column] = lowered_file1_header.index(column)
        else:
            col_index_dict[column] = -1  # mark as column that will not be worked on later
    for column in lowered_file2_header:
        if column not in lowered_file1_header:
            col_index_dict[column] = -1  # mark as column that will not be worked on later

    # build header
    output = [list(col_index_dict.keys())]
    is_header = True
    for row in data1:
        if is_header is False:
            row_data = []
            for column in col_index_dict:
                column_index = col_index_dict[column]
                if column_index != -1:
                    row_data.append(row[column_index])
                else:
                    row_data.append('')
            output.append(row_data)
        else:
            is_header = False
    print(output)

if __name__ == '__main__':
    main()
This will give you the output of:
[
    ['id', 'name', 'gender', 'city'],
    ['1', 'Smith', 'M', ''],
    ['2', 'John', 'M', '']
]
Note that the output does not follow file2's column ordering, but this should be fixable by building the index map in file2's header order, e.g. with an ordered dictionary.
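For example, a minimal sketch of that fix, reusing the lowered header lists from the code above (on Python 3.7+ a plain dict preserves insertion order too, so OrderedDict is optional):

from collections import OrderedDict

# Key the index map in file2's header order so the output columns
# come out in the same order as file2.csv.
col_index_dict = OrderedDict()
for column in lowered_file2_header:
    if column in lowered_file1_header:
        col_index_dict[column] = lowered_file1_header.index(column)
    else:
        col_index_dict[column] = -1  # no matching column in file1

With this, the header row becomes ['name', 'gender', 'city', 'id'], matching the order asked for in the question.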
Hope this helps.
You don't need Python for this. This is a task for SQL.
SQLite Browser supports CSV Import. Take the below steps to get the desired output:
Download and install SQLite Browser
Create a new Database
Import both CSVs as tables (let's say the table names are file1 and file2, respectively)
Now, you can decide how you want to match the data sets. If you only want to match the files on ID, then you can do something like:
select *
from file1 f1
inner join file2 f2
on f1.id = f2.id
If you want to match on every column, you can do something like:
select *
from file1 f1
inner join file2 f2
on f1.id = f2.id and f1.name = f2.name and f1.gender = f2.gender
Finally, simply export the query results back to a CSV.
I spent a lot of time myself trying to perform tasks like this with scripting languages. The benefit of using SQL is that you simply tell it what you want to match on, and then let the database do the optimization for you. Generally, it ends up doing the matching faster than any code I could write.
In case you're interested, Python also has a sqlite3 module that comes out of the box. I've gravitated towards using this as my source for data in Python scripts for the above reason, and I simply import the required CSVs in SQLite Browser before running the Python script.
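For reference, a minimal sketch of that workflow with the built-in sqlite3 module, assuming the two CSVs were already imported as tables file1 and file2 into a database file (matches.db is just an example name):

import sqlite3

# Connect to the database created in SQLite Browser.
conn = sqlite3.connect("matches.db")

# Run the same inner join as above and print the matching rows.
query = """
    select *
    from file1 f1
    inner join file2 f2
    on f1.id = f2.id
"""
for row in conn.execute(query):
    print(row)

conn.close()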
I have two CSV files named file1.csv and file2.csv. file2.csv contains only one column with just five records, while file1.csv has three columns with more than a thousand records. I want to get the records from file1.csv that are contained in file2.csv.
For example, this is my file1.csv:
'A J1, Jhon1',jhon1#jhon.com, A/B-201 Test1
'A J2, Jhon2',jhon2#jhon.com, A/B-202 Test2
'A J3, Jhon3',jhon3#jhon.com, A/B-203 Test3
'A J4, Jhon4',jhon4#jhon.com, A/B-204 Test4
.......and more records
and inside my file2.csv I have only five records right now, but in the future there can be many:
A/B-201 Test1
A/B-2012 Test12
A/B-203 Test3
A/B-2022 Test22
So I have to match records from my file1.csv at index [2] (i.e. index [-1]).
This is what I did, but it's not giving me any output; it just returns an empty list.
import csv

file1 = open('file1.csv', 'r')
file2 = open('file2.csv', 'r')
f1 = list(csv.reader(file1))
f2 = list(csv.reader(file2))
new_list = []
for i in f1:
    if i[-1] in f2:
        new_list.append(i)
print('New List : ', new_list)
It gives me output like this:
New List : []
Please help; if I did anything wrong, correct me.
Method 1: pandas
This task can be done with relative ease using pandas; see the DataFrame documentation.
Example:
In the example below, the two CSV files are read into two DataFrames. The DataFrames are merged using an inner join on the matching columns.
The output shows the merged result.
import pandas as pd

df1 = pd.read_csv('file1.csv', names=['col1', 'col2', 'col3'], quotechar="'", skipinitialspace=True)
df2 = pd.read_csv('file2.csv', names=['match'])
df = pd.merge(df1, df2, left_on='col3', right_on='match', how='inner').drop(columns='match')
The quotechar and skipinitialspace parameters are needed because the first column in file1 is quoted with single quotes and contains a comma, and there is leading whitespace after the comma before the last field. The match column from df2 is dropped after the merge so the result contains only file1's columns.
Output:
col1 col2 col3
0 A J1, Jhon1 jhon1#jhon.com A/B-201 Test1
1 A J3, Jhon3 jhon3#jhon.com A/B-203 Test3
If you choose, the output can easily be written back to a CSV file as:
df.to_csv('path/to/output.csv')
For other DataFrame operations, refer to the documentation linked above.
Method 2: Core Python
The method below does not use any libraries, only core Python.
Read the matches from file2 into a list.
Iterate over file1 and search each line to determine if the last value is a match for an item in file2.
Report the output.
Any subsequent data cleaning (if required) will be up to your personal requirements or use-case.
Example:
output = []

# Read the matching values into a list.
with open('file2.csv') as f:
    matches = [i.strip() for i in f]

# Iterate over file1 and place any matches into the output.
with open('file1.csv') as f:
    for i in f:
        match = i.split(',')[-1].strip()
        if any(match == j for j in matches):
            output.append(i)
Output:
["'A J1, Jhon1',jhon1#jhon.com, A/B-201 Test1\n",
"'A J3, Jhon3',jhon3#jhon.com, A/B-203 Test3\n"]
Use sets or dicts for in checks (complexity is O(1) for them, instead of O(N) for lists and tuples).
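Applied to the core-Python example above, that change might look like this (a sketch; only the matches container and the membership test change):

output = []

# Build a set instead of a list: 'in' checks become O(1).
with open('file2.csv') as f:
    matches = {i.strip() for i in f}

with open('file1.csv') as f:
    for i in f:
        if i.split(',')[-1].strip() in matches:
            output.append(i)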
Have a look at the convtools library (on GitHub): it has a Table helper for working with table data as with streams (see the table docs).
from convtools import conversion as c
from convtools.contrib.tables import Table

# creating a set of allowed values
allowed_values = {
    item[0] for item in Table.from_csv("input2.csv").into_iter_rows(tuple)
}

result = list(
    # reading a file with custom quotechar
    Table.from_csv("input.csv", dialect=Table.csv_dialect(quotechar="'"))
    # stripping last column values
    .update(COLUMN_2=c.col("COLUMN_2").call_method("strip"))
    # filtering based on allowed values
    .filter(c.col("COLUMN_2").in_(c.naive(allowed_values)))
    # returning iterable of tuples
    .into_iter_rows(tuple)
    # # OR outputting csv if needed
    # .into_csv("result.csv")
)
"""
>>> In [36]: result
>>> Out[36]:
>>> [('A J1, Jhon1', 'jhon1#jhon.com', 'A/B-201 Test1'),
>>> ('A J3, Jhon3', 'jhon3#jhon.com', 'A/B-203 Test3')]
"""
I'm currently trying to write a function that takes an integer and a dataset (one that I already have, named data), and looks for a column in this dataset called name. It then has to return the number of different types of names there are in the column (there are 4 values, but only 3 types of values; two of them are the same).
I'm having a hard time with this problem, but this is what I have so far:
def name_count(data):
    unique = []
    for name in data:
        if name.strip() not in unique:
            unique[name] += 1
        else:
            unique[name] = 1
            unique.append(name)
The only import I'm allowed to use for this challenge is math.
Does anyone have any help or advice they can offer with this problem?
You can use a set to remove duplicates, for example:
data = ['name1', 'name2', 'name3', 'name3 ']
cleaned_data = map(lambda x: x.strip(), data)
count = len(set(cleaned_data))
print(count)
>>> 3
You almost had it. Unique should be a dictionary, not a list.
def name_count(data):
    unique = {}
    for name in data:
        name = name.strip()  # strip once so the key matches the check
        if name in unique:
            unique[name] += 1
        else:
            unique[name] = 1
    return unique
#test
print(name_count(['Jack', 'Jill', 'Mary', 'Sam', 'Jack', 'Mary']))
#output
{'Jack': 2, 'Jill': 1, 'Mary': 2, 'Sam': 1}
import pandas

def name_count(data):
    df = pandas.DataFrame(data)
    unique = []
    for name in df["name"]:  # if the column name is "name"
        if name:
            if name not in unique:
                unique.append(name)
    return unique
You need to pass the complete dataset to the function and not just the integers.
It is not clear what kind of data variable you already have there.
So, I will suggest a solution, starting from reading the file.
Considering that you have a CSV file and that you are restricted to importing only the math module (as you mentioned), this should work.
def name_count(filename):
    with open(filename, 'r') as fh:
        headers = next(fh).strip().split(',')
        name_col_idx = headers.index('name')
        names = [
            line.strip().split(',')[name_col_idx]
            for line in fh
        ]
    return len(set(names))
Here we read the first line, identify the location of name in the header, collect all items in the name column into a variable names and finally return the length of the set, which contains only unique elements.
Here is the solution if you are feeding a csv file to your function. It reads the csv file, gets rid of the header line, accumulates the names which are on index 1 of each line, casts the list as a set to get rid of the duplicates and returns the length of the set which is the same as the number of unique names.
import csv

def name_count(filename):
    with open(filename, "r") as csvfile:
        csvreader = csv.reader(csvfile)
        names = [row[1] for row in csvreader if row][1:]
    return len(set(names))
Alternatively, if you don't want to use a csv reader, you can use a plain text file reader without any imports, as follows. The code splits each line on commas.
def name_count(filename):
    with open(filename, "r") as infile:
        names = [row.rstrip('\n').split(',')[1] for row in infile if row][1:]
    return len(set(names))
I want to read an entire row of data and store it in variables, then later use them in Selenium to write to web elements. The programming language is Python.
Example: I have an Excel sheet of incidents and their details regarding priority, date, assignee, etc.
If I give the string INC00000, it should match the Excel data, fetch all of the above details, and store them in separate variables like:
INC #= INC0000 Priority= Moderate Date = 11/2/2020
Is this feasible? I tried and failed to write the code. Please suggest possible ways to do this.
I would,
load the sheet into a pandas DataFrame
filter the corresponding column in the DataFrame by the INC # of interest
convert the row to dictionary (assuming the INC filter produces only 1 row)
get the corresponding value in the dictionary to assign to the corresponding webelement
Example:
import pandas as pd

df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
# assuming the INC numbers are in a column named "INC #" in the spreadsheet,
# and that the filter produces exactly one row
dict_data = df[df["INC #"] == "INC00000"].to_dict("records")[0]

webelement1.send_keys(dict_data["columnname1"])
webelement2.send_keys(dict_data["columnname2"])
webelement3.send_keys(dict_data["columnname3"])
.
.
.
Save your Excel file as CSV, then adapt the code below to your variables:
import csv

# Set up the input for the script
gTrack = open("file1.csv", "r")

# Set up the CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)
print(header)

id_index = header.index("id")
date_index = header.index("date ")
var1_index = header.index("var1")
var2_index = header.index("var2")

# Make an empty list
cList = []

# Loop through the lines in the file and get the required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id, date, var1, var2])

# Print the collected list
print(cList)
I have a dictionary I created from a csv file and would like to use this dict to update the values in a specific column of a different csv file called sheet2.csv.
Sheet2.csv has many columns with different headers, and I need to update only the PartNumber column, based on the key/value pairs in my dict.
My question is: how would I use the keys in the dict to search through sheet2.csv and update/write only the PartNumber column with the appropriate value?
I am new to python so I hope this is not too confusing and any help is appreciated!
This is the code I used to create the dict:
import csv

a = open('sheet1.csv', 'r')
csvReader = csv.DictReader(a)
dict = {}
for line in csvReader:
    dict[line["ReferenceID"]] = line["PartNumber"]
print(dict)
dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}
To make things even more confusing, I also need to make sure that already existing rows in sheet2 remain unchanged. For example, if there is a row with ReferenceID as R1234 and PartNumber as PN000000, it should stay untouched. So I would need to skip rows which are not in my dict.
Link to sample CSVs:
http://dropbox.com/s/zkagunnm0xgroy5/Sheet1.csv
http://dropbox.com/s/amb7vr48mdc94v6/Sheet2.csv
EDIT: Let me rephrase my question and provide a better example CSV file.
Let's say I have a Dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}.
I need to fill in this csv file: https://www.dropbox.com/s/c95mlitjrvyppef/sheet.csv
Specifically, I need to fill in the PartNumber column using the keys of the dict I created. So I need to iterate through the ReferenceID column and compare each value to the keys in my dict. If there is a match, I need to fill in the corresponding PartNumber cell with that value... I'm sorry if this is all confusing!
The code below should do the trick. It first builds a dictionary just like your code and then moves on to read Sheet2.csv row by row, possibly updating the part number. The output goes to temp.csv, which you can compare with the initial Sheet2.csv. In case you want to overwrite Sheet2.csv with the contents of temp.csv, simply uncomment the line with shutil.move.
Note that the sample files you provided do not contain any updateable data, so Sheet2.csv and temp.csv will be identical. I tested this with a slightly modified Sheet1.csv where I made sure that it actually contains a reference ID used by Sheet2.csv.
import csv
import shutil

def createReferenceIdToPartNumberMap(csvToReadPath):
    result = {}
    print('read part numbers to update from', csvToReadPath)
    with open(csvToReadPath, newline='') as csvInFile:
        csvReader = csv.DictReader(csvInFile)
        for row in csvReader:
            result[row['ReferenceID']] = row['PartNumber']
    return result

def updatePartNumbers(csvToUpdatePath, referenceIdToPartNumberMap):
    tempCsvPath = 'temp.csv'
    print('update part numbers in', csvToUpdatePath)
    with open(csvToUpdatePath, newline='') as csvInFile:
        csvReader = csv.reader(csvInFile)
        # Figure out which columns contain the reference ID and part number.
        titleRow = next(csvReader)
        referenceIdColumn = titleRow.index('ReferenceID')
        partNumberColumn = titleRow.index('PartNumber')
        # Write temporary CSV file with updated part numbers.
        with open(tempCsvPath, 'w', newline='') as tempCsvFile:
            csvWriter = csv.writer(tempCsvFile)
            csvWriter.writerow(titleRow)
            for row in csvReader:
                # Check if there is an updated part number.
                referenceId = row[referenceIdColumn]
                newPartNumber = referenceIdToPartNumberMap.get(referenceId)
                # If so, update the row just read accordingly.
                if newPartNumber is not None:
                    row[partNumberColumn] = newPartNumber
                    print('  update part number for %s to %s' % (referenceId, newPartNumber))
                csvWriter.writerow(row)
    # TODO: Move the temporary CSV file over the initial CSV file.
    # shutil.move(tempCsvPath, csvToUpdatePath)

if __name__ == '__main__':
    referenceIdToPartNumberMap = createReferenceIdToPartNumberMap('Sheet1.csv')
    updatePartNumbers('Sheet2.csv', referenceIdToPartNumberMap)
The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
def getCSV(fpath):
    with open(fpath, "rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)

allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to be keeping all rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you so choose. Just to note, you mention the 20th column, but row[20] will be the 21st column...
import csv

matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin)  # <--- if you want to skip header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row)  # or do something with it here
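For instance, the hard-coded tuple could be swapped for a glob pattern (a sketch, assuming the CSVs sit in the working directory):

import glob

# e.g. every CSV in the working directory instead of a hard-coded tuple
for fname in glob.glob('*.csv'):
    ...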
You only want csv.DictReader if you have a header row and want to access your columns by name.
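For comparison, a DictReader sketch of the same filter ('status' is a hypothetical column name standing in for whatever header your file has):

import csv

matching = []
with open('file1.csv') as fin:
    for row in csv.DictReader(fin):  # each row is a dict keyed by the header
        if row['status'] == 'value':  # 'status' is a hypothetical column name
            matching.append(row)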
This should work; you don't need to make another list to have access to the columns.

import csv

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
testlist = []
pos = 20
for row in allRows:
    if row[pos] == 'value':
        testlist.append(row)
(I haven't tested this, but let me know if that works).