Python with fuzzy, pandas - python

I have two csv's in which the rows can be matched by the value in one column (after some tweaking of this column). After the matching I want to take some values from both of them and make a new, combined row. I thought of a simple script using csv.DictReader for both of them and then a double loop:
for row1 in csv1:
    for row2 in csv2:
        if row1['someID'] == row2['someID']:
            newdict = ...  # etc.
However, one file has 9 million rows and the other has 500k rows, so my code would take 4.5 * 10^12 iterations. Hence my question: what is a fast way to match them? Important:
This 'someID' on which they are matched is not unique in either csv.
I want additional rows for every match. So if a 'someID' appears twice in csv1 and 3 times in csv2, I expect 6 rows with this 'someID' in the final result.

Try this: instead of iterating, use pandas.read_csv() on both files, and merge them on someID. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
For example:
import pandas as pd
csv1 = pd.read_csv(path1)
csv2 = pd.read_csv(path2)
merged = csv1.merge(csv2, on='someID')
merged['new_column'] = ...
Pandas operations work on entire NumPy arrays, which is much faster than iterating at the element level.
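Note that merge performs a many-to-many join on duplicate keys, producing one output row per matching pair, which gives exactly the 2 x 3 = 6 rows required above. A minimal sketch with made-up data to illustrate:
import pandas as pd

# two tiny frames with a non-unique key: 2 matching rows in one, 3 in the other
csv1 = pd.DataFrame({'someID': ['x', 'x'], 'a': [1, 2]})
csv2 = pd.DataFrame({'someID': ['x', 'x', 'x'], 'b': [10, 20, 30]})

merged = csv1.merge(csv2, on='someID')
print(len(merged))  # 6 rows: one per (csv1 row, csv2 row) pair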

Related

Combine duplicate rows in Pandas

I have a dataframe where some rows contain almost duplicate values. I'd like to combine these rows as much as possible to reduce the row count. Let's say I have the following dataframe:
One  Two  Three
A    B    C
B    B    B
C    A    B
In this example I'd like the output to be:
One  Two  Three
ABC  AB   CB
The real dataframe has thousands of rows and eight columns.
The csv from a dataframe-sample:
Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,A,B
A,A,A,A,A,A,A,C
A,A,A,A,A,A,B,A
A,A,A,A,A,A,B,B
A,A,A,A,A,A,B,C
A,A,A,A,A,A,C,A
A,A,A,A,A,A,C,B
A,A,A,A,A,A,C,C
C,C,C,C,C,C,A,A
C,C,C,C,C,C,A,B
C,C,C,C,C,C,A,C
C,C,C,C,C,C,B,A
C,C,C,C,C,C,B,B
C,C,C,C,C,C,B,C
C,C,C,C,C,C,C,A
C,C,C,C,C,C,C,B
C,C,C,C,C,C,C,C
To better show what the desired outcome would look like:
Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
AC,AC,AC,AC,AC,AC,ABC,ABC
I've tried some code but ended up with really long code snippets which I doubt are the best and most natural solution. Any suggestions?
If your data are all characters, you can use this approach and collapse everything into one single row:
import pandas as pd
data = pd.read_csv("path/to/data")
# sum() concatenates each column's strings; set() then keeps each character once
collapsed = data.astype(str).sum().apply(lambda x: ''.join(sorted(set(x))))
Check this answer on how to get unique characters in a string.
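On the eight-column sample csv above, this would give one collapsed string per column (a sketch of the expected output, assuming the sorted variant shown):
print(collapsed)
# Column_1     AC
# Column_2     AC
# Column_3     AC
# Column_4     AC
# Column_5     AC
# Column_6     AC
# Column_7    ABC
# Column_8    ABC
# dtype: object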
You can use something like this:
df = df.groupby('Two')[['One', 'Three']].agg(''.join).reset_index()
If you can provide a small bit of code that creates the first df it'd be easier to try out solutions.
Also this other post may help: pandas - Merge nearly duplicate rows based on column value
EDIT:
Does this get you the output you're looking for?
joined_df = df.apply(''.join, axis=0)
variation of this: Concatenate all columns in a pandas dataframe
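If duplicate characters should also be dropped, as in the desired output above, a small sketch that combines this column-wise join with the set idea from the first answer, shown on the three-column example (sorted, so 'CB' comes out as 'BC'):
import pandas as pd

df = pd.DataFrame({'One': ['A', 'B', 'C'], 'Two': ['B', 'B', 'A'], 'Three': ['C', 'B', 'B']})

# join each column into one string, then keep each character once (sorted for a stable order)
joined_df = df.apply(lambda col: ''.join(sorted(set(''.join(col)))), axis=0)
print(joined_df)
# One      ABC
# Two       AB
# Three     BC
# dtype: object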

Iterate over a pandas DataFrame & Check Row Comparisons

I'm trying to iterate over a large DataFrame that has 32 fields, 1 million plus rows.
What I'm trying to do is iterate over each row and check whether any of the other rows have duplicate information in 30 of the fields, while the other two fields have different information.
I'd then like to store the ID info of the rows that meet these conditions.
So far I've been trying to figure out how to check two rows with the code below; it seems to work when comparing single columns but throws an error when I try more than one column. Could anyone advise on how best to approach this?
for index in range(len(df)):
    for row in range(index, len(df)):
        if df.iloc[index][1:30] == df.iloc[row][1:30]:
            print(df.iloc[index])
As a general rule, you should always try not to iterate over the rows of a DataFrame.
It seems that what you need is the pandas duplicated() method. If you have a list of the 30 columns you want to use to determine duplicate rows, the code looks something like this:
df.duplicated(subset=['col1', 'col2', 'col3']) # etc.
Full example:
# Set up test df
import pandas as pd
from io import StringIO

sub_df = pd.read_csv(
    StringIO("""ID;col0;col1;col2;col3
One;23;451;42;31
Two;24;451;42;54
Three;25;513;31;31"""),
    sep=";"
)
Find which rows are duplicates in col1 and col2. Note that the default is that the first instance is not marked as a duplicate, but later duplicates are. This behaviour can be changed as described in the documentation I linked to above.
mask = sub_df.duplicated(["col1", "col2"])
This looks like:
0    False
1     True
2    False
dtype: bool
Now, filter using the mask.
sub_df["ID"][sub_df.duplicated(["col1", "col2"])]
Of course, you can do the last two steps in one line.
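To come back to the original goal (rows whose 30 shared fields repeat while at least one of the other two fields differs), one possible sketch builds on the same idea with groupby; dup_cols, other_cols and the "ID" column name below are placeholders for your actual columns:
dup_cols = ["col1", "col2", "col3"]   # stand-in for the 30 shared columns
other_cols = ["colA", "colB"]         # stand-in for the two columns that may differ

# within each group of rows that agree on dup_cols, count distinct values of the
# other two columns; more than one distinct value means those fields differ somewhere
differs = df.groupby(dup_cols)[other_cols].transform("nunique").gt(1).any(axis=1)
ids = df.loc[differs, "ID"]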

How to compare two columns in two CSV's using dictionary?

I have two large csv files and I want to compare column1 in csv1 with column1 in csv2. I was able to do this using Python lists, where I read csv1 and put column1 into list1, do the same with csv2, and then check whether each element in list1 is present in list2.
olist = []
def oldList(self):
    for row in self.csvreaderOld:
        self.olist.append(row[1])

nlist = []
def newList(self):
    for row in self.csvreaderNew:
        self.nlist.append(row[1])

def new_list(self):
    return [item for item in self.olist if item not in self.nlist]
The code works but takes a long time to complete. I am trying to see if I can use a dictionary instead, to see if that would be faster, so I can check whether keys in dictionary1 exist in dictionary2, but so far I haven't been successful owing to my limited knowledge.
If it's a big CSV file or you're planning to continue working with tables, I would suggest doing it with the Pandas module.
To be honest, even if it's a small file, or you're not going to continue working with tables, Pandas is an excellent module.
From what I understand (and I might be mistaken), Pandas is one of the quickest libraries for reading CSV files.
import pandas as pd

df = pd.read_csv("path to your csv file", usecols=["column1", "column2"])

def new_list(df):
    return [item for item in df["column2"].values if item not in df["column1"].values]
It's important to use .values when checking for an item in a pandas Series (when you extract a column from a DataFrame you get a pandas Series).
You could also use list(df["column1"]) and the other methods suggested in How to determine whether a Pandas Column contains a particular value for determining whether a value is contained in a pandas column.
For example:
df = pd.DataFrame({"column1": [1, 2, 3, 4], "column2": [2, 3, 4, 5]})
The data frame would be
column1  column2
      1        2
      2        3
      3        4
      4        5
and new_list would return [5].
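If the files really are large, the list comprehension above rescans column1 once per item; a vectorised variant using Series.isin should be considerably faster. A sketch on the same hypothetical frame:
# same result as new_list, but with a vectorised membership test instead of a Python loop
def new_list_fast(df):
    return df.loc[~df["column2"].isin(df["column1"]), "column2"].tolist()

print(new_list_fast(df))  # [5]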
You can read both files into objects and compare them in a single loop.
Here is a short code snippet for the idea (not a class implementation):
fsOld = open('oldFile.csv', 'r')
fsNew = open('newFile.csv', 'r')
fsLinesOld = fsOld.readlines()
fsLinesNew = fsNew.readlines()

outList = []
# assumes both files have the same number of lines in the same order
for i in range(len(fsLinesOld)):
    if fsLinesOld[i] == fsLinesNew[i]:
        outList.append(fsLinesOld[i])
First of all, change the way you read the CSV files: if you want just one column, mention that in usecols, like this
df = pd.read_csv("sample_file.csv", usecols=col_list)
And second, you can use set difference if you are not comparing row to row, like this
set(df.col.to_list()).difference(set(df2.col.to_list()))

split groups in a table into tables of its sub-groups

I have a table that is already grouped according to the first column. I would like to split the table into sub-tables containing only the corresponding second column. I would like to use pandas or something else in Python. I am not keen on using awk because that would require subprocess or os. In the end I actually only need the entries in the second column separated according to the first. The table is about 10000 rows x 6 columns.
These are similar posts that I found, but I could not figure out how to modify them for my purpose.
Split pandas dataframe based on groupby
Splitting groupby() in pandas into smaller groups and combining them
The table/dataframe that I have looks like this:
P0A910 sp|A0A2C5WRC3| 84.136 0.0 100
P0A910 sp|A0A068Z9R6| 73.816 0.0 99
Q9HVD1 sp|A0A2G2MK84| 37.288 4.03e-34 99
Q9HVD1 sp|A0A1H2GM32| 40.571 6.86e-32 98
P09169 sp|A0A379DR81| 52.848 2.92e-117 99
P09169 sp|A0A127L436| 49.524 2.15e-108 98
And I would like it to be split like the following
group1:
P0A910 A0A2C5WRC3
P0A910 A0A068Z9R6
group2:
Q9HVD1 A0A2G2MK84
Q9HVD1 A0A1H2GM32
group3:
P09169 A0A379DR81
P09169 A0A127L436
OR into lists
P0A910:
A0A2C5WRC3
A0A068Z9R6
Q9HVD1:
A0A2G2MK84
A0A1H2GM32
P09169:
A0A379DR81
A0A127L436
So your problem is rather to separate the strings. Is this what you want:
new_col = df[1].str[3:-1]
list(new_col.groupby(df[0]))
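To get the "OR into lists" form shown in the question, the same grouping can be turned into a dict of lists. A sketch, assuming the table was read with header=None so the columns are 0 and 1:
groups = {key: vals.tolist() for key, vals in new_col.groupby(df[0])}
# e.g. groups['P0A910'] == ['A0A2C5WRC3', 'A0A068Z9R6']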
So I managed to get a solution of some sort. In this solution I removed the prefixes in the second column and used groupby in pandas to group the entries by the first column. Then I looped through the groups and wrote each group separately to a csv file. I took help from @Quang's answer and this link. It could probably be done in better ways, but here is my code:
import pandas as pd

# read .csv as dataframe
data = pd.read_csv("BlastOut.csv")

# truncate sp| | from the second column (['B'])
new_col = data['B'].str[3:-1]

# replace the second column with new_col
data['B'] = new_col

# group the dataframe by the first column (['A'])
grouped = data.groupby('A')

# loop through the groups and write each group to a .csv file named
# after the group (Out_[group_name].csv)
for group_name, group in grouped:
    group.to_csv('Out_{}.csv'.format(group_name))
Update: removed all columns except the column of interest. This is a continuation of the previous code.
import glob

# read all csv files whose filename starts with "Out_"
files = glob.glob("Out_*.csv")

# loop through all csv files
for f in files:
    df = pd.read_csv(f, index_col=0)
    # drop column by column title (["A"])
    df.drop(["A"], axis=1, inplace=True)
    df.to_csv(f, index=False)
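The two passes above could also be folded into one by writing only column B for each group in the first loop; a sketch of that variant, continuing from the grouped dataframe above:
for group_name, group in data.groupby('A'):
    # write just the second column for this group
    group['B'].to_csv('Out_{}.csv'.format(group_name), index=False)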

Empty values in pandas -- most memory-efficient way to filter out empty values for some columns but keep empty values for one column?

Using Python, I have a large file (millions of rows) that I am reading in with Pandas using pd.read_csv. My goal is to minimize the amount of memory I use as much as possible.
Out of about 15 columns in the file, I only want to keep 6 columns. Of those 6 columns, I have different needs for the empty rows.
Specifically, for 5 of the columns, I'd like to filter out / ignore all of the empty rows. But for 1 of the columns, I need to keep only the empty rows.
What is the most memory-efficient way to do this?
I guess I have two problems:
First, looking at the documentation for Pandas read_csv, it's not clear to me whether there is a way to filter out empty rows. Is there a set of parameters for read_csv, or some other method, that I can use to filter out empty rows?
Second, is it possible to filter out empty rows only for some columns but then keep all of the empty rows for one of my columns?
I would advise you use dask.dataframe. Syntax is pandas-like, but it deals with chunking and optimal memory management. Only when you need the result in memory should you translate the dataframe back to pandas, where of course you will need sufficient memory to hold the result in a dataframe.
import dask.dataframe as dd
df = dd.read_csv('file.csv')
# filtering and manipulation logic
df = df.loc[....., ....]
# compute & return to pandas
df_pandas = df.compute()
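For this particular question, the filtering placeholder might look something like the sketch below, with hypothetical column names standing in for the six columns you keep (five that must be non-empty, one that must be empty):
import dask.dataframe as dd

keep_cols = ["a", "b", "c", "d", "e", "must_be_empty"]  # hypothetical names
df = dd.read_csv('file.csv', usecols=keep_cols)

# keep rows where the five columns are filled in and the sixth is empty
filled = ["a", "b", "c", "d", "e"]
df = df[df[filled].notnull().all(axis=1) & df["must_be_empty"].isnull()]

df_pandas = df.compute()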
