Comparing 2 csv files to remove rows - python

I have 2 csv files that contain related information: each row of one file corresponds to a row in the other. To prepare the data, I needed to remove certain values from the first csv file, which removed certain rows from that file. Now when I print those rows out, the index jumps around; for example, a portion of the first csv file jumps from row number 20838 to 20842, 20843, etc. What I want to do is compare the first csv file (which had rows removed) to the second csv file, remove from the second file the rows that are no longer in the first, and then renumber the rows so that both csv files run from 0 to 20000. I am using Pandas and numpy.
This is the code I have used to remove the information from the first csv file:
import pandas as pd

data_csv1 = pd.read_csv("address1")
data_csv2 = pd.read_csv("address2")
data_csv1 = data_csv1.drop(data_csv1.columns[0], axis=1)  # drop the first column
data_csv1 = data_csv1[(data_csv1 != 0).all(axis=1)]       # keep only rows with no zeros
How would I go about doing this? I personally do not care whether the data is removed or simply ignored; I just need both csv files to contain the same row numbers.

Assuming that your two files started with exactly the same index, you can apply the index of the first file to the second file after your post-processing:
data_csv2 = data_csv2.iloc[data_csv1.index]
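Since the second frame was never filtered, it still has its default 0..N-1 RangeIndex, so the surviving index labels of the first frame double as row positions for iloc. To get both files renumbered from 0 afterwards, reset both indexes; a minimal sketch continuing from the code above:

data_csv2 = data_csv2.iloc[data_csv1.index]   # keep only the surviving rows
data_csv1 = data_csv1.reset_index(drop=True)  # renumber both frames 0..N-1
data_csv2 = data_csv2.reset_index(drop=True)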

Related

Ignore carriage returns (U+000D) with read_csv in python pandas

I regularly get sent a csv containing 100+ columns and millions of rows. These csv files always contain a certain set of columns, Core_cols = [col_1, col_2, col_3], and a variable number of other columns, Var_col = [a, b, c, d, e]. The core columns are always there, and there can be 0-200 of the variable columns. Sometimes one of the variable columns will contain a carriage return. I know which columns this can happen in: bad_cols = [a, b, c].
When I import the csv with pd.read_csv, these carriage returns create corrupt rows in the resulting dataframe. I can't re-make the csv without these columns.
How do I either:
Ignore these columns and the carriage return contained within? or
Replace the carriage returns with blanks in the csv?
My current code looks something like this:
df = pd.read_csv('data.csv', dtype=str)
I've tried things like removing the columns after the import, but the damage seems to have already been done by that point. I can't find the code now, but when testing one fix the error said something like "invalid character \u000D in data". I don't control the source of the data, so I can't edit it.
Pandas supports multiline CSV files if the file is properly escaped and quoted. If you cannot read a CSV file in Python using the pandas or csv modules, nor open it in MS Excel, then it's probably a non-compliant "CSV" file.
I recommend manually editing a sample of the CSV file until it opens in Excel, then recreating those steps programmatically in Python to process the large file.
Use this code to create a sample CSV file by copying the first ~100 lines into a new file:
with open('bigfile.csv', 'r') as csvin, open('test.csv', 'w') as csvout:
    line = csvin.readline()
    count = 0
    while line and count < 100:
        csvout.write(line)
        count += 1
        line = csvin.readline()
Now you have a small test file to work with. If the original CSV file has millions of rows and the "bad" rows only appear much later in the file, then you will need to add some logic to find the "bad" lines.
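If the stray characters turn out to be lone \r bytes (not part of a \r\n line ending), one possible normalization step is to rewrite the file before parsing; a sketch, assuming hypothetical file names:

import pandas as pd

# read the raw bytes, normalize real line endings, then blank out the
# remaining carriage returns, which are the stray ones inside fields
with open('bigfile.csv', 'rb') as f:
    raw = f.read()
cleaned = raw.replace(b'\r\n', b'\n').replace(b'\r', b' ')
with open('bigfile_clean.csv', 'wb') as f:
    f.write(cleaned)

df = pd.read_csv('bigfile_clean.csv', dtype=str)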

How to read a bunch of excel files in python/pandas with keyword-starting column names

I have a bunch of excel files where the data starts at a random row beyond the first row of a sheet.
There is no pattern to the starting row number, but the column names are constant. For example, in one file I have the columns PolicyNumber and AccountValue in row 3, in another they are in row 5, and in another in row 10.
I need to extract the data for these columns from the files in python. How could I accomplish that with the xlsx file format?
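One possible approach (a sketch; the file name is hypothetical, the column names come from the example above): read the sheet once with header=None, find the row containing a known column name, then re-read the file using that row as the header.

import pandas as pd

def read_with_keyword_header(path, keyword='PolicyNumber', sheet_name=0):
    # first pass: read with no header so every sheet row is plain data
    raw = pd.read_excel(path, sheet_name=sheet_name, header=None)
    # find the first row that contains the known column name
    for i, row in raw.iterrows():
        if keyword in row.values:
            # second pass: skip the rows above it and use it as the header
            return pd.read_excel(path, sheet_name=sheet_name, skiprows=i)
    raise ValueError(f'{keyword!r} not found in {path}')

df = read_with_keyword_header('policies.xlsx')
print(df[['PolicyNumber', 'AccountValue']])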

Remove Duplicate rows from csv [headers + Content]

I have a data set which is more than 100 MB in size, spread across many files. These files have more than 20 columns and more than 1 million rows.
The main problems with the data are:
Headers are repeating -- duplicate header rows
Rows duplicated in full, i.e. the data in every column of that particular row is a duplicate.
Without worrying about which column or how many columns are involved, I only need to keep the first occurrence and remove the rest.
I found plenty of examples, but in my case the input and the output need to be the same file. The only reason I'm seeking help is that I need the same file to be edited in place.
Sample input here:
https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=0
Appreciate the help, thanks in advance.
If the number of duplicate headers is known and constant, skip those rows:
csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1', skiprows=4)
Alternatively, with the bonus of removing all duplicate rows (based on all columns), do this:
csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1')
csv = csv.drop_duplicates()
Now you still have a header line in the data, just skip it:
csv = csv.iloc[1:]
You can certainly overwrite the input file afterwards with pandas.DataFrame.to_csv, for example:
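A minimal example, assuming a local copy of the sample file:
csv.to_csv('sample_duplicate.csv', index=False)  # index=False keeps the row index out of the file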

python: compare two large csv files by two reference columns and update another column

I have quite a large csv file, about 400,000 lines, like:
54.10,14.20,34.11
52.10,22.20,22.11
49.20,17.30,29.11
48.40,22.50,58.11
51.30,19.40,13.11
and a second one, about 250,000 lines, with updated data for the third column; the first and second columns are the reference for the update:
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
I would like to build a third file like:
54.10,14.20,34.11
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
51.30,19.40,13.11
It has to contain all the data from the first file, except that for the lines which also appear in the second file, the value of the third column is taken from the second file.
I suggest you look at the Pandas merge functions. You should be able to do what you want, and it will also handle reading the data from CSV (create a dataframe for each file, then merge them).
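A sketch of that approach; the column names are made up, since the files have no header row:

import pandas as pd

cols = ['x', 'y', 'value']  # hypothetical names; the files have no header
# read everything as strings so the two reference columns compare exactly
df1 = pd.read_csv('file1.csv', names=cols, dtype=str)
df2 = pd.read_csv('file2.csv', names=cols, dtype=str)

# left-merge on the two reference columns, then prefer the updated value
merged = df1.merge(df2, on=['x', 'y'], how='left', suffixes=('', '_new'))
merged['value'] = merged['value_new'].fillna(merged['value'])

merged[cols].to_csv('output.csv', index=False, header=False)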
A stdlib solution with just the csv module; the second file is read into memory (into a dictionary):
import csv

with open('file2.csv', newline='') as updates_fh:
    updates = {tuple(r[:2]): r for r in csv.reader(updates_fh)}

with open('file1.csv', newline='') as infh, open('output.csv', 'w', newline='') as outfh:
    writer = csv.writer(outfh)
    writer.writerows(updates.get(tuple(r[:2]), r) for r in csv.reader(infh))
The first with statement opens the second file and builds a dictionary keyed on the first two columns; it is assumed that these are unique in the file.
The second block then opens the first file for reading and the output file for writing, and writes each row from the input file to the output file, replacing any row present in the updates dictionary with its updated version.

Write data from multiple input CSV file to a single CSV in column format

I'm trying to:
loop over multiple csv files,
extract information from them, and
write an output into one new csv file with a row for each of the original files.
I take the information:
Name, Date, Time, Test, Navg, Percent
for each row.
I have tried to do it, however I have these problems:
It writes each of Name, Date, Time, Test, Navg, Percent to a NEW ROW... I want each value in its own column.
It writes each new file to a new row underneath (I do want it underneath, but with each value in its own column).
b = open(r'C:\Users\AClayton\Desktop\Data.csv', 'a')  # raw string so the path backslashes aren't escapes
a = csv.writer(b, delimiter='\t', lineterminator='\n')
a.writerows((Name, Date, Time, Test, Navg, Percent))
b.close()
Note the file has been read and the data extracted in earlier code.
writerows interprets its argument as a list of rows, so each item in your tuple is written into a separate row.
Using writerow instead writes the whole tuple as a single row.
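A minimal sketch of the corrected write; the values here are placeholders for the ones extracted by the earlier code:

import csv

# placeholder values standing in for the data extracted earlier
Name, Date, Time, Test, Navg, Percent = 'S1', '01/01/14', '12:00', 'T1', '3.2', '95'

with open(r'C:\Users\AClayton\Desktop\Data.csv', 'a') as b:
    a = csv.writer(b, delimiter='\t', lineterminator='\n')
    # writerow treats the tuple as ONE row: each value gets its own column
    a.writerow((Name, Date, Time, Test, Navg, Percent))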
