I have two large CSV files and I want to compare column 1 in csv1 with column 1 in csv2. I was able to do this using Python lists: I read csv1 and put its column 1 into list1, do the same thing with csv2, and then check whether each element of list1 is present in list2.
olist = []

def oldList(self):
    for row in self.csvreaderOld:
        self.olist.append(row[1])

nlist = []

def newList(self):
    for row in self.csvreaderNew:
        self.nlist.append(row[1])

def new_list(self):
    return [item for item in self.olist if item not in self.nlist]
The code works but takes a long time to complete. I am trying to see if using a dictionary instead would be faster, so I could check whether keys in dictionary1 exist in dictionary2, but so far I haven't been successful owing to my limited knowledge.
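(For what it's worth, a minimal sketch of that idea: a set gives the same constant-time lookups as dictionary keys, so only new_list needs to change.)
def new_list(self):
    # build the lookup structure once; membership tests are then O(1)
    nset = set(self.nlist)
    return [item for item in self.olist if item not in nset]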
If it's a big CSV file or you're planning to continue working with tables, I would suggest doing it with the Pandas module.
To be honest, even if it's a small file, or you're not going to continue working with tables, Pandas is an excellent module.
From what I understand (and I might be mistaken), Pandas is one of the quickest libraries for reading CSV files.
import pandas as pd

df = pd.read_csv("path to your csv file", usecols=["column1", "column2"])

def new_list(df):
    return [item for item in df["column2"].values if item not in df["column1"].values]
It's important to use .values when checking for an item in a pandas Series (when you extract a column from a DataFrame, you get a pandas Series).
You could also use list(df["column1"]) and the other methods suggested in How to determine whether a Pandas Column contains a particular value for determining whether a value is contained in a pandas column.
For example:
df = pd.DataFrame({"column1":[1,2,3,4], "column2":[2,3,4,5]})
the data frame would be
column1 column2
1 2
2 3
3 4
4 5
and new_list would return [5]
You can read both files into objects and compare in a single loop.
Here is a short code snippet for the idea (not a class implementation):
with open('oldFile.csv', 'r') as fsOld, open('newFile.csv', 'r') as fsNew:
    fsLinesOld = fsOld.readlines()
    fsLinesNew = fsNew.readlines()

outList = []
# assumes both files have the same number of lines, in the same order
for i in range(len(fsLinesOld)):
    if fsLinesOld[i] == fsLinesNew[i]:
        outList.append(fsLinesOld[i])
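The same pairwise comparison reads more idiomatically with zip (a sketch; it stops at the shorter of the two files):
outList = [old for old, new in zip(fsLinesOld, fsLinesNew) if old == new]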
First of all, change the way you read the CSV files: if you want just one column, mention that in usecols, like this:
df = pd.read_csv("sample_file.csv", usecols=col_list)
And second, you can use set difference if you are not comparing row to row, like this:
set(df.col.to_list()).difference(set(df2.col.to_list()))
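For example, a minimal end-to-end sketch for the original two-file setup (the file names and column name "col" here are assumptions):
import pandas as pd

# read only the one column you need from each file
old = pd.read_csv("oldFile.csv", usecols=["col"])
new = pd.read_csv("newFile.csv", usecols=["col"])

# set lookups are O(1), so this scales far better than list membership tests
print(set(old["col"]) - set(new["col"]))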
Here's my problem, I need to compare two procmon scans which I converted into CSV files.
Both files have identical column names, but obviously the contents differ.
I need to compare the "Path" (5th column) from the first file to the one in the second file and print the ENTIRE row of the second file into a third CSV if there are corresponding matches.
I've been googling for quite a while and can't seem to get this to work like I want it to, any help is appreciated!
I've tried numerous online tools and other python scripts, to no avail.
Have you tried using pandas and numpy together?
It would look something like this:
import pandas as pd
import numpy as np
#get your second file as a Dataframe, since you need the whole rows later
file2 = pd.read_csv("file2.csv")
#get your columns to compare
file1Column5 = pd.read_csv("file1.csv")["name of column 5"]
file2Column5 = file2["name of column 5"]
#add a column where if values match, row marked True, else False
file2["ColumnsMatch"] = np.where(file1Column5 == file2Column5, 'True', 'False')
#filter rows based on that column and remove the extra column
file2 = file2[file2['ColumnsMatch'] == 'True'].drop(columns='ColumnsMatch')
#write to new file
file2.to_csv(r'file3.csv')
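Note that np.where compares the two columns position by position. If instead you want every row of file2 whose column-5 value appears anywhere in file1, a sketch using isin (reusing the names from the snippet above):
# keep every row of file2 whose column-5 value occurs anywhere in file1's column 5
matches = file2[file2["name of column 5"].isin(file1Column5)]
matches.to_csv("file3.csv", index=False)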
Just write your own code for such things. It's probably easier than you expect.
#!/usr/bin/env python
import pandas as pd

# read the csv files
csv1 = pd.read_csv('<first_filename>')
csv2 = pd.read_csv('<second_filename>')

# create a compare Series of the files
iseq = csv1['Path'] == csv2['Path']

# push the compared rows marked True from csv2 into csv3
csv3 = pd.DataFrame(csv2[iseq])

# write to a new csv file
csv3.to_csv('<new_filename>')
I am using pandas to read a csv file into my python code. I understand I can grab a specific value from a specific column for all rows and append it to an array as follows:
import pandas as pd

df = pd.read_csv('File.txt')
playerNames = []
for row in df[df.columns[0]]:
    playerNames.append(row)
However, I want to, instead, grab the values from columns 0 and 2 at the same time to populate a dictionary. In my head it would be something like:
for row in df[df.columns[0,2]]:
    playerNameDictionary[row.columns[0]] = row.columns[2]
Obviously this is wrong (I don't even think it compiles), but I am just at a loss as to how I would go about doing this.
dict_sample = dict(zip(df.column1, df.column2))
column1 and column2 stand for the column names. It will create key-value pairs, with the key being the column1 data and the value being the column2 data. I hope I understood the question right.
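For example, with a toy frame (the column names are just for illustration):
import pandas as pd

df = pd.DataFrame({"name": ["joe", "jane"], "team": ["red", "blue"]})
print(dict(zip(df.name, df.team)))  # {'joe': 'red', 'jane': 'blue'}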
Loops are an anti-pattern in pandas. More efficiently, you can use pd.Series.to_dict:
key_col, val_col = df.columns[[0, 2]]
playerNameDictionary = df.set_index(key_col)[val_col].to_dict()
Make sure your future keys are not duplicated; with duplicates, the last occurrence silently wins.
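A quick sketch of what duplicates do (toy data; note how the last occurrence wins):
import pandas as pd

df = pd.DataFrame({"name": ["joe", "jane", "joe"], "score": [1, 2, 3]})
key_col, val_col = df.columns[[0, 1]]
# 'joe' appears twice, so its earlier value is overwritten
print(df.set_index(key_col)[val_col].to_dict())  # {'joe': 3, 'jane': 2}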
For Python 2:
my_dict = {}
for key, value in zip(df.column0, df.column2):
    my_dict[key] = value
For Python 3:
my_dict = dict(zip(df.column0, df.column2))
I have a CSV file, and I want to delete some rows of it based on the values in one of the columns. I do not know the code to delete specific rows of a CSV file that is of type pandas.core.frame.DataFrame.
I read related questions, and I found that people suggest writing every acceptable line to a new file. I do not want to do that. What I want is:
1) to delete the rows whose index (row number) I know
or
2) to make a new CSV in Python's memory (not to write it out and read it again)
Here's an example of what you can do with pandas. If you need more detail, you might find Indexing and Selecting Data a helpful resource.
import pandas as pd
from io import StringIO
mystr = StringIO("""speed,time,date
12,22:05,1
15,22:10,1
13,22:15,1""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr)
# convert time column to timedelta, assuming mm:ss
df['time'] = pd.to_timedelta('00:'+df['time'])
# filter for >= 22:10, i.e. second item
df = df[df['time'] >= df['time'].loc[1]]
print(df)
speed time date
1 15 00:22:10 1
2 13 00:22:15 1
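For case (1), where the row indices are already known, a minimal sketch using DataFrame.drop (the index labels 0 and 2 are just example values):
import pandas as pd

df = pd.DataFrame({"speed": [12, 15, 13], "time": ["22:05", "22:10", "22:15"]})
# drop the rows with index labels 0 and 2; everything stays in memory
df = df.drop([0, 2])
print(df)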
I'm doing some manipulation on a CSV file using Python and the csv module. I take a CSV file, do some operations, and output an XML file. Very simplified, but the input data looks similar to this:
name,group
joe,staff
jane,student
bill,staff
barry,support
jack,student
I have a list as follows:
outputList = ['staff', 'support']
Essentially, what I want to do is remove the line of data if the group field isn't contained in the outputList. So what I would end up with is:
name,group
joe,staff
bill,staff
barry,support
The main reason I need to remove the rows is because I then need to sort by outputList (which is a lot longer than in this example, and in a specific non-alphabetical order).
Doing the sorting is relatively easy:
csvData = sorted(csvData, key=lambda k: (outputList.index(k['group'])))
However, obviously without removing the rows that aren't needed I get an error that the group value isn't in the outputList.
Is there an easy way of removing the data, or do I just need to iterate over each row and check whether the value is present? I've seen methods of doing it when you just have two lists, e.g.:
data = ['staff', 'support', 'student']
csvData = [data for data in csvData if data not in outputList]
There's no way to filter the data without scanning all of it, of course. You can simply do something like this:
import csv

def parser(fp, groups):
    with open(fp) as fin:
        reader = csv.reader(fin)
        for row in reader:
            if row[1] in groups:
                yield row

csvData = parser('~/some_loc/file.csv', outputList)
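Note that parser returns a generator, so materialize it before sorting (a sketch; the header row is filtered out automatically because 'group' is not in outputList):
# the generator yields plain lists, so index by position rather than by key
rows = list(parser('~/some_loc/file.csv', outputList))
rows.sort(key=lambda row: outputList.index(row[1]))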
Load your CSV into a pandas DataFrame df.
Then you can use:
df = df[df.group.isin(outputList)]
isin creates a boolean Series (mask) which you can use to select only the relevant rows.
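Putting the pieces together, a sketch with the sample data from the question; pd.Categorical encodes the custom, non-alphabetical sort order:
import pandas as pd
from io import StringIO

csvData = pd.read_csv(StringIO("""name,group
joe,staff
jane,student
bill,staff
barry,support
jack,student"""))

outputList = ['staff', 'support']
df = csvData[csvData['group'].isin(outputList)].copy()
# order rows by the position of their group in outputList
df['group'] = pd.Categorical(df['group'], categories=outputList, ordered=True)
print(df.sort_values('group'))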
I am reading data from a tsv file using DataFrame from Pandas module in Python.
df = pandas.DataFrame.from_csv(filename, sep='\t')
The file has around 5000 columns (4999 test parameters and 1 result / output value).
I iterate through the entire tsv file and check if the result value matches the value that is expected. I then write this row inside another csv file.
expected_value = 'some_value'
with open(file_to_write, 'w') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter='\t')
    for index, row in df.iterrows():
        result = row['RESULT']
        if expected_value.lower() in str(result).lower():
            csvwriter.writerow(row)
But in the output csv file, the result is not proper, i.e. the individual column values are not going into their respective columns / cells. It is getting appended as rows. How do I write this data correctly in the csv file?
The suggested answers work well; however, I need to check for multiple conditions. I have a list with some values:
vals = ['hello', 'foo', 'bar']
One of the columns has values that look like 'hello,foo,bar' in every row. I need to do two checks: whether any value in the vals list is present in that column, or whether the result value matches the expected value. I have written the following code:
df = pd.DataFrame.from_csv(filename, sep='\t')
for index, row in df.iterrows():
    csv_vals = row['COL']
    values = str(csv_vals).split(",")
    if len(set(vals).intersection(values)) > 0 or expected_value.lower() in str(row['RESULT_COL']).lower():
        print(row['RESULT_COL'])
You should create a dataframe where you have a column 'RESULT' and one 'EXPECTED'.
Then you can filter the rows where both match and output only those to CSV using:
df.loc[df['EXPECTED'] == df['RESULT']].to_csv(filename)
You can filter the values like this:
df[df['RESULT'].str.lower().str.contains(expected_value.lower())].to_csv(filename)
This will work for filtering values that contain your expected_value as you did in your code.
If you want to get exact match you can use:
df.loc[df['Result'].str.lower() == expected_value.lower()].to_csv(filename)
As you suggested in the comments, for multiple criteria you will need something like this:
expected_values = [expected_value1, expected_value2, expected_value3]
df[df['Result'].isin(expected_values)]
UPDATE:
And to filter on multiple criteria and on the desired column at the same time:
df[df.isin(vals).any(axis=1) & (df['Result'].str.lower() == expected_value.lower())].to_csv(filename)
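For the question's actual OR condition (a hit in the comma-separated COL or a match in RESULT_COL), a vectorised sketch; the column names come from the question:
# True where COL shares at least one entry with vals
has_val = df['COL'].astype(str).str.split(',').apply(lambda items: bool(set(items) & set(vals)))
# True where RESULT_COL contains the expected value, case-insensitively
matches = df['RESULT_COL'].astype(str).str.lower().str.contains(expected_value.lower(), regex=False)
df[has_val | matches].to_csv(filename, sep='\t')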