Joining the columns of one CSV to another CSV - python

So I'm trying to combine column values from one csv with another while saving the result into a final csv file. I want to iterate through all the rows, adding the column values of each row in the second csv to the corresponding row of the original csv.
In other words say csv1 has 3 rows.
Row 1: Frog,Rat,Duck
Row 2: Cat,Dog,Cow
Row 3: Moose,Fox,Zebra
And I want to combine 2 more column values from csv2 to each of those rows.
Row 1: Chicken,Pig
Row 2:
Row 3: Bear,Boar
So csv3 would end up looking like this:
Row 1: Frog,Rat,Duck,Chicken,Pig
Row 2: Moose,Fox,Zebra,Bear,Boar
But at the same time, if there's a row in csv2 that has no values at all, I don't want it to copy the row from csv1. In other words, that row will not exist at all in the final csv file. I prefer not to use pandas, as I have just been using the csv module thus far throughout my code, but any method is appreciated.
So far I have come across this method, which works if there's only one single row. But when there's more than that, it just adds random lines and appends the values all over the place. And it combines both of the columns into one string, while adding an extra blank line at the end of the csv for some odd reason.
import csv

f1 = open("2.csv", "r", encoding='utf-8')
with open("3.csv", "w", encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    with open("1.csv", "r", encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile, delimiter=",")
        for row in reader:
            row[6] = f1.readline()
            writer.writerow(row)
f1.close()
Using the same example csvs above, the results given are:
Frog,Rat,Duck,Chicken,Pig
Cat,Dog,Cow
Moose,Fox,Zebra,Bear,Boar

You can zip together the two files and then iterate through each row. Then you can concatenate the two lists and write the result to a file.
To check whether a row is empty, we can compare the set of the row's values to the set containing only an empty string.
import csv

EMPTY_ROW = set([""])

with open("1.csv", "r", newline="") as first_file, \
     open("2.csv", "r", newline="") as second_file, \
     open("3.csv", "w", newline="") as out_file:
    first_file_reader = csv.reader(first_file)
    second_file_reader = csv.reader(second_file)
    out_file_writer = csv.writer(out_file)
    # The iterator will stop when the shortest file is finished
    for row_1, row_2 in zip(first_file_reader, second_file_reader):
        # Check if the second row is empty, skipping if it is
        if not row_2 or set(row_2) == EMPTY_ROW:
            continue
        out_file_writer.writerow(row_1 + row_2)

Related

How to delete a row in a CSV file if a cell is empty using Python

I want to go through large CSV files and remove any rows with missing data. This is row specific: if there is a cell that equals 0 or has no value, I want to remove the entire row. I want this to happen for all the columns, so if any column has a blank cell it should delete the row, and return the corrected data in a corrected csv.
import csv

with open('data.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)
        if not row[0]:
            print("12")
This is what I found and tried, but it doesn't seem to be working and I don't have any ideas about how to approach this problem. Help please?
Thanks!
Due to the way in which the CSV reader presents rows of data, you need to know how many columns there are in the original CSV file. For example, if the CSV file content looks like this:
1,2
3,
4
Then the lists returned by iterating over the reader would look like this:
['1','2']
['3','']
['4']
As you can see, the third row only has one column whereas the first and second rows have 2 columns albeit that one is (effectively) empty.
This function allows you to either specify the number of columns (if you know them before hand) or allow the function to figure it out. If not specified then it is assumed that the number of columns is the greatest number of columns found in any row.
So...
import csv

DELIMITER = ','

def valid_column(col):
    try:
        return float(col) != 0
    except ValueError:
        pass
    return len(col.strip()) > 0

def fix_csv(input_file, output_file, cols=0):
    if cols == 0:
        with open(input_file, newline='') as indata:
            cols = max(len(row) for row in csv.reader(indata, delimiter=DELIMITER))
    with open(input_file, newline='') as indata, open(output_file, 'w', newline='') as outdata:
        writer = csv.writer(outdata, delimiter=DELIMITER)
        for row in csv.reader(indata, delimiter=DELIMITER):
            if len(row) == cols:
                if all(valid_column(col) for col in row):
                    writer.writerow(row)

fix_csv('original.csv', 'fixed.csv')
Maybe like this:
import csv

with open('data.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    data = list(csvreader)

data = [x for x in data if '' not in x and '0' not in x]
You can then rewrite the csv file if you like.
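For example, the write-back step might look like this (a sketch; 'data_clean.csv' is just a placeholder name):

import csv

with open('data_clean.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(data)  # data is the filtered list from above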
Instead of using csv, you should use the Pandas module, something like this:
import pandas as pd

df = pd.read_csv('file.csv')
print(df)

index = 1  # index of the row that you want to remove
df = df.drop(index)
print(df)

df.to_csv('file.csv', index=False)  # index=False stops pandas writing the row index as an extra column

Python: How to iterate every third row starting with the second row of a csv file

I'm trying to write a program that iterates through the length of a csv file row by row. It will create 3 new csv files and write data from the source csv file to each of them. The program does this for the entire row length of the csv file.
For the first if statement, I want it to copy every third row starting at the first row and save it to a new csv file (the next rows it copies would be row 4, row 7, row 10, etc.).
For the second if statement, I want it to copy every third row starting at the second row and save it to a new csv file (the next rows it copies would be row 5, row 8, row 11, etc.).
For the third if statement, I want it to copy every third row starting at the third row and save it to a new csv file (the next rows it copies would be row 6, row 9, row 12, etc.).
The second "if" statement I wrote that creates the first "agentList1.csv" works exactly the way I want it to but I can't figure out how to get the first "elif" statement to start from the second row and the second "elif" statement to start from the third row. Any help would be much appreciated!
Here's my code:
for index, row in Sourcedataframe.iterrows():  # going through each row line by line
    # this for loop counts the number of times it has gone through the csv file.
    # If it has gone through it more than three times, it resets the counter back to 1.
    for column in Sourcedataframe:
        if count > 3:
            count = 1
        # if the program is on its first count, it opens the 'Sourcedataframe',
        # reads/writes every third row to a new csv file named 'agentList1.csv'.
        if count == 1:
            with open('blankAgentList.csv') as infile:
                with open('agentList1.csv', 'w') as outfile:
                    reader = csv.DictReader(infile)
                    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    for row in reader:
                        count2 += 1
                        if not count2 % 3:
                            writer.writerow(row)
        elif count == 2:
            with open('blankAgentList.csv') as infile:
                with open('agentList2.csv', 'w') as outfile:
                    reader = csv.DictReader(infile)
                    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    for row in reader:
                        count2 += 1
                        if not count2 % 3:
                            writer.writerow(row)
        elif count == 3:
            with open('blankAgentList.csv') as infile:
                with open('agentList3.csv', 'w') as outfile:
                    reader = csv.DictReader(infile)
                    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    for row in reader:
                        count2 += 1
                        if not count2 % 3:
                            writer.writerow(row)
    count = count + 1  # counts how many times it has run through the main for loop.
Convert the csv to a dataframe (df.to_csv(header=True)) so that indexing starts from the second row, then pass the row/record number to the iloc function to fetch a particular record, e.g. df.iloc[3, :].
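Building on that idea, here is a minimal pandas sketch (assuming the source file is 'blankAgentList.csv', as in the question): iloc with a step of 3 picks every third row, and the starting offset chooses which of the three output files a row belongs to.

import pandas as pd

df = pd.read_csv('blankAgentList.csv')

# offsets 0, 1, 2 give every third row starting at rows 1, 2 and 3 respectively
for offset in range(3):
    df.iloc[offset::3].to_csv(f'agentList{offset + 1}.csv', index=False)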
You are opening your csv file from the beginning in each if clause. I believe you already loaded your file into Sourcedataframe, so just get rid of reader = csv.DictReader(infile) and read the data like this:
Sourcedataframe.iloc[column]
Using plain Python we can create a solution that works for any number of interleaved data rows (let's call it NUM_ROWS), not just three.
Nota bene: the solution does not require reading and keeping all the input data in memory. It processes one line at a time, grouping only the last few lines needed, and works fine for a very large input file.
Assuming your input file contains a number of data rows which is a multiple of NUM_ROWS, i.e. the rows can be split evenly across the output files:
NUM_ROWS = 3

outfiles = [open(f'blankAgentList{i}.csv', 'w') for i in range(1, NUM_ROWS + 1)]

with open('blankAgentList.csv') as infile:
    header = infile.readline()  # read/skip the header
    for f in outfiles:          # repeat header in all output files if needed
        f.write(header)
    row_groups = zip(*[iter(infile)] * NUM_ROWS)
    for rg in row_groups:
        for f, r in zip(outfiles, rg):
            f.write(r)

for f in outfiles:
    f.close()
Otherwise, for any number of data rows, we can use:
import itertools as it

NUM_ROWS = 3

outfiles = [open(f'blankAgentList{i}.csv', 'w') for i in range(1, NUM_ROWS + 1)]

with open('blankAgentList.csv') as infile:
    header = infile.readline()  # read/skip the header
    for f in outfiles:          # repeat header in all output files if needed
        f.write(header)
    row_groups = it.zip_longest(*[iter(infile)] * NUM_ROWS)
    for rg in row_groups:
        for f, r in it.zip_longest(outfiles, rg):
            if r is None:
                break
            f.write(r)

for f in outfiles:
    f.close()
which, for example, with an input file of
A,B,C
r1a,r1b,r1c
r2a,r2a,r2c
r3a,r3b,r3c
r4a,r4b,r4c
r5a,r5b,r5c
r6a,r6b,r6c
r7a,r7b,r7c
produces (output copied straight from the terminal)
(base) SO $ cat blankAgentList.csv
A,B,C
r1a,r1b,r1c
r2a,r2a,r2c
r3a,r3b,r3c
r4a,r4b,r4c
r5a,r5b,r5c
r6a,r6b,r6c
r7a,r7b,r7c
(base) SO $ cat blankAgentList1.csv
A,B,C
r1a,r1b,r1c
r4a,r4b,r4c
r7a,r7b,r7c
(base) SO $ cat blankAgentList2.csv
A,B,C
r2a,r2a,r2c
r5a,r5b,r5c
(base) SO $ cat blankAgentList3.csv
A,B,C
r3a,r3b,r3c
r6a,r6b,r6c
Note: I understand the line
row_groups = zip(*[iter(infile)]*NUM_ROWS)
may be intimidating at first (it was for me when I started).
All it does is simply group consecutive lines from the input file.
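A quick illustration with a plain list shows the grouping:

>>> lines = ['a', 'b', 'c', 'd', 'e', 'f']
>>> list(zip(*[iter(lines)] * 3))
[('a', 'b', 'c'), ('d', 'e', 'f')]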
If your objective includes learning Python, I recommend studying it thoroughly via a book or a course or both and practising a lot.
One key subject is the iteration protocol, along with all the other protocols. And namespaces.

Using Python to combine csv files with headers on different rows

I'm trying to combine a bunch of csv files. Each csv file has a different number of columns. This is not a problem; I can easily loop through the files and pull in all the column headers, pasting them into an empty file to use as a base.
The problem I'm having is that the column headers are on different rows in each file.
For example:
Table1
Random Text
!,Header1,Header2,Header3
*,123,124,5235
*,124,15,23624
*,135,677,234
Table2
Random Text
Random Text
!,Header1,Header2,Header4
*,124,2156,7478
*,126,12357,547
*,237,12,267
Output:
Table,Header1,Header2,Header3,Header4
Table1,123,124,5235
Table1,124,15,23624
Table1,135,677,234
Table2,124,2156,7478
Table2,126,12357,547
Table2,237,12,267
My existing code looks something like this:
import csv
import glob

files = glob.glob(r'//Directory/*.csv')

# This block goes through each file and works out which variables exist
variablelist = []
for f in files:
    with open(f, 'r') as csvfile:
        read_rows = csv.reader(csvfile)
        for row in read_rows:
            if row[0] != "*":  # The last row with no * in column 1 is the header row
                rowlist = row
        variablelist.extend(x for x in rowlist if x not in variablelist)
list.sort(variablelist)
I use the fact that the header row is the last row without a * in the first column. I work out which row the headers are on and then store the header names in a list, combining the same list across all files.
I then try and combine the files together using this code that I found by searching this website:
with open("out.csv", "w", newline="") as f_out: # Comment 2 below
writer = csv.DictWriter(f_out, fieldnames=variablelist)
for f in files:
with open(f, "r", newline="",) as f_in:
reader = csv.DictReader(f_in) # Uses the field names in this file
for line in reader:
# Comment 3 below
writer.writerow(line)
The problem is, I don't know how to deal with the headers being on different lines. I tried using code to work out the header row number, but don't know how to fit this into the code above. (Can DictReader skip a dynamic number of rows before finding the headers?)
with open(f, 'r') as csvfile:
    read_rows = csv.reader(csvfile)
    header_row_number = 0
    for row in read_rows:
        if row[0] != "*":
            header_row_number = read_rows.line_num
Any help would be much appreciated
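One possible way to put those pieces together (an untested sketch, not a definitive answer: the 'Table' column comes from the desired output, the table name is assumed to sit on the first line of each file as in the samples, and the header-detection rule is the no-* rule from the question):

import csv
import glob

files = glob.glob(r'//Directory/*.csv')

def header_index(rows):
    # The header row is the last row without "*" in the first column
    return max(i for i, r in enumerate(rows) if r and r[0] != "*")

# First pass: collect the union of all header names (skipping the "!" marker column)
fieldnames = []
parsed = {}
for f in files:
    with open(f, "r", newline="") as f_in:
        rows = list(csv.reader(f_in))
    idx = header_index(rows)
    headers = rows[idx][1:]
    parsed[f] = (rows[0][0], headers, rows[idx + 1:])  # assumes line 1 holds the table name
    fieldnames.extend(h for h in headers if h not in fieldnames)

# Second pass: write every data row under the combined, sorted header
with open("out.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=["Table"] + sorted(fieldnames))
    writer.writeheader()
    for table, headers, data_rows in parsed.values():
        for r in data_rows:
            line = dict(zip(headers, r[1:]))  # drop the "*" marker column
            line["Table"] = table
            writer.writerow(line)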

Create new CSV that excludes rows from old CSV

I need guidance on code to write a CSV file that drops rows with specific numbers in the first column [0]. My script writes a file, but it contains the rows that I am working to delete. I suspect that I may have an issue with the spreadsheet being read as one long string rather than ~150 rows.
import csv
Property_ID_To_Delete = {4472738, 4905985, 4905998, 4678278, 4919702, 4472936, 2874431, 4949190, 4949189, 4472759, 4905977, 4905995, 4472934, 4905982, 4906002, 4472933, 4905985, 4472779, 4472767, 4472927, 4472782, 4472768, 4472750, 4472769, 4472752, 4472748, 4472751, 4905989, 4472929, 4472930, 4472753, 4933246, 4472754, 4472772, 4472739, 4472761, 4472778}
with open('2015v1.csv', 'rt') as infile:
    with open('2015v1_edit.csv', 'wt') as outfile:
        writer = csv.writer(outfile)
        for row in csv.reader(infile):
            if row[0] != Property_ID_To_Delete:
                writer.writerow(row)
Here is the data:
https://docs.google.com/spreadsheets/d/19zEMRcir_Impfw3CuexDhj8PBcKPDP46URZ9OA3uV9w/edit?usp=sharing
You need to check whether an id, converted into an integer (you stored them as integers in your set), is contained in the ids to delete, and write the line only if it's not contained. As written, you compare the id in the first column with the whole set of ids to be deleted. A string is always not equal to a set:
>>> '1' != {1}
True
Therefore, you get all rows in your output.
Change:
if row[0] != Property_ID_To_Delete:
into:
if int(row[0]) not in Property_ID_To_Delete:
EDIT
You need to write the header of your infile first, before trying to convert the first column entry into an integer:
with open('2015v1.csv', 'rt') as infile:
    with open('2015v1_edit.csv', 'wt') as outfile:
        writer = csv.writer(outfile)
        reader = csv.reader(infile)
        writer.writerow(next(reader))
        for row in reader:
            if int(row[0]) not in Property_ID_To_Delete:
                writer.writerow(row)

Replace column in csv with modified column

I got a csv file with a couple of columns and a header consisting of 4 rows. The first column contains the timestamp. Unfortunately it also gives milliseconds, but whenever those are 00, they are not given in the file. It looks like this:
"TOA5","CR1000","CR1000","E9048"
"TIMESTAMP","RECORD","BattV_Avg","PTemp_C_Avg"
"TS","RN","Volts","Deg C"
"","","Avg","Avg"
"2015-08-28 12:40:23.51",1,12.91,32.13
"2015-08-28 12:50:43.23",2,12.9,32.34
"2015-08-28 13:12:22",3,12.91,32.54
As I don't need the milliseconds, I want to get rid of those, as this makes further calculations containing time a bit complicated. My approach so far:
Extract the first 20 characters in each row to get a format such as 2015-08-28 12:40:23
timestamp = []
with open(filepath) as f:
    for _ in xrange(4):  # skip 4 header rows
        next(f)
    for line in f:
        time = line[1:20]       # Get values for the current line
        timestamp.append(time)  # Add values to list
From here on I'm struggling with how to proceed. I want to exchange the first column in the csv file with the newly created timestamp list.
I tried creating a dictionary, but I don't know how to use the header caption in row 2 as the key:
d = {}
with open(filepath, 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for col in csv_reader:
        pass  # use header info from row 2 as key here
This would import the whole csv file into a dict and I'd then change the TIMESTAMP entry in the dict with the timestamp list above. Is this even possible?
Or is there an easier approach to just changing the first column in the csv with my new list, so that my csv file in the end contains the timestamps just without the millisecond information?
So the first column in my csv should look like this:
"TOA5"
"TIMESTAMP"
"TS"
""
2015-08-28 12:40:23
2015-08-28 12:50:43
2015-08-28 13:12:22
This should do it and preserve the quoting:
import csv

with open(filepath1, 'rb') as fin, open(filepath2, 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
    for _ in xrange(4):  # copy first 4 header rows
        writer.writerow(next(reader))
    for row in reader:  # process data lines
        row[0] = row[0][:19]  # strip fractional seconds from first column
        writer.writerow([row[0], int(row[1])] + map(float, row[2:]))
Since a csv.reader returns the columns of each row as a list of strings, it's necessary to convert any which contain numeric values into their actual int or float numeric value before they're written out to prevent them from being quoted.
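To illustrate, a quick sketch with an in-memory writer (Python 3 shown here; the snippet above uses Python 2 file modes):

import csv
import io

buf = io.StringIO()
w = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
w.writerow(['2015-08-28 12:40:23', '12.91'])  # '12.91' is a string, so it gets quoted
w.writerow(['2015-08-28 12:40:23', 12.91])    # 12.91 is a float, so it is left unquoted
print(buf.getvalue())
# "2015-08-28 12:40:23","12.91"
# "2015-08-28 12:40:23",12.91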
I believe you can easily create a new csv by iterating over the original csv and replacing the timestamp as you want.
Example:
with open(filepath, 'rb') as csv_file, open('<new file>', 'wb') as outfile:
    csv_reader = csv.reader(csv_file, delimiter=',')
    csv_writer = csv.writer(outfile, delimiter=',')
    for i, row in enumerate(csv_reader):  # Enumerating as we only need to change rows after the 3rd index
        if i <= 3:
            csv_writer.writerow(row)
        else:
            # csv.reader already strips the quotes, so keep the first 19 characters
            csv_writer.writerow([row[0][:19]] + row[1:])
I'm not entirely sure how to parse your csv, but I would do something of this sort:
time = time.split(".")[0]
so if it does have milliseconds they get removed, and if it doesn't, nothing happens.
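For example, with the timestamps from the sample file:

>>> "2015-08-28 12:40:23.51".split(".")[0]
'2015-08-28 12:40:23'
>>> "2015-08-28 13:12:22".split(".")[0]
'2015-08-28 13:12:22'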
