How to Compare two columns of CSV simultaneously in Python?

I have a csv file with a huge dataset containing two columns. I want to compare the data of these two columns such that, if a duplicated pair is present then it gets deleted. For example, if my data file looks something like this:
Column A Column B
DIP-1N DIP-1N
DIP-2N DIP-3N
DIP-3N DIP-2N
DIP-4N DIP-5N
Then the first entry gets deleted because I don't want two "DIP-1Ns". Also, the order within a pair does not matter as long as the entry is unique. For example, here DIP-2N & DIP-3N and DIP-3N & DIP-2N are paired, but both entries mean the same thing, so I want to keep one entry and delete the rest.
I have written the following code, but I don't know how to compare simultaneously the entry of both the columns.
import csv
import pandas as pd
file = pd.read_csv("/home/staph.csv")
for i in range(len(file['Column A'])):
    for j in range(len(file['Column B'])):
        list1 = []
        list2 = []
        list1.append(file[file['Column A'].str.contains('DIP-'+str(i)+'N')])
        list2.append(file[file['Column B'].str.contains('DIP-'+str(i)+'N')])
        for ele1,ele2 in list1,list2:
            if(list1[ele1]==list2[ele2]):
                print("Duplicate")
            else:
                print("The 1st element is :", ele1)
                print("The 2nd element is :", ele2)
Seems like something is wrong, as there is no output. The program just ends without any output or error. Any help would be much appreciated in terms of whether my code is wrong or if I can optimize the process in a better way. Thanks :)

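This might not be the best way to get what you need, but it works.
# build an order-independent key: join both columns, split, sort, and re-join
df['temp'] = df['Column A'] + " " + df['Column B']
df['temp'] = df['temp'].str.split(" ")
df['temp'] = df['temp'].apply(lambda list_: " ".join(sorted(list_)))
# keep only the first row for each key
df.drop_duplicates(subset=['temp'], inplace=True)
# drop rows where both columns hold the same value
df = df[df['Column A'] != df['Column B']]
# reassign instead of using inplace=True on a filtered frame (avoids SettingWithCopyWarning)
df = df.drop(columns='temp')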
Output:
index  Column A  Column B
1      DIP-2N    DIP-3N
3      DIP-4N    DIP-5N

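With some tweaking you could use the pandas drop_duplicates method:
# get positions of the first occurrence of each unordered (Column A, Column B) pair;
# frozenset is used because plain sets are unhashable and cannot be deduplicated
keep_ind = pd.Series(df[["Column A", "Column B"]].values.tolist()).apply(frozenset).drop_duplicates(keep="first").index
# use these positions to filter the DataFrame
df = df.iloc[keep_ind]
Note that this keeps rows where both columns hold the same value (such as DIP-1N / DIP-1N); filter those out separately if needed.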
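Both answers rely on an order-independent key for each row. A minimal sketch of that idea follows, assuming the question's column names. The helper name drop_pair_duplicates and the in-memory sample frame (standing in for the CSV file) are illustrative assumptions, not part of either answer:

```python
import pandas as pd


def drop_pair_duplicates(df, col_a="Column A", col_b="Column B"):
    """Drop self-pairs (A == B) and keep one row per unordered pair."""
    # Remove rows where both columns hold the same value.
    out = df[df[col_a] != df[col_b]]
    # frozenset is hashable and ignores order, so (x, y) and (y, x) share a key.
    keys = pd.Series(
        [frozenset(pair) for pair in zip(out[col_a], out[col_b])],
        index=out.index,
    )
    # Keep the first row seen for each key.
    return out[~keys.duplicated(keep="first")]


# Sample data copied from the question.
df = pd.DataFrame({
    "Column A": ["DIP-1N", "DIP-2N", "DIP-3N", "DIP-4N"],
    "Column B": ["DIP-1N", "DIP-3N", "DIP-2N", "DIP-5N"],
})
result = drop_pair_duplicates(df)
print(result)
```

On the sample data this should leave only the DIP-2N/DIP-3N and DIP-4N/DIP-5N rows, under their original index labels 1 and 3.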

Related

Not able to assign values to a column. Bag_of_words

I am trying to assign values to a column in my pandas df, however I am getting a blank column, here's the code:
df['Bag_of_words'] = ''
columns = ['genre', 'director', 'actors', 'key_words']
for index, row in df.iterrows():
    words = ''
    for col in columns:
        words += ' '.join(row[col]) + ' '
    row['Bag_of_words'] = words
The output is an empty column, can someone please help me understand what is happening here, as I am not getting any errors.

From the iterrows documentation:
You should never modify something you are iterating over.
This is not guaranteed to work in all cases. Depending on the
data types, the iterator returns a copy and not a view, and writing
to it will have no effect.
So when you do row[...] = ..., row turns out to be a copy, and the assignment does not affect the original DataFrame.
iterrows is frowned upon anyway, so you can instead join each row's word lists into strings, aggregate those strings row-wise with " ".join, and append a trailing space:
df["Bag_of_words"] = (df[columns].apply(lambda col: col.str.join(" "))
                                 .agg(" ".join, axis="columns")
                                 .add(" "))
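To see those three steps on concrete data, here is a small sketch. The sample frame is hypothetical (only the column names come from the question), and it uses apply(..., axis=1) and the + operator as equivalents of the agg and add calls above:

```python
import pandas as pd

# Hypothetical sample data: every cell holds a list of words.
df = pd.DataFrame({
    "genre": [["crime", "drama"], ["action"]],
    "director": [["jane", "doe"], ["john", "roe"]],
    "actors": [["actor1", "actor2"], ["actor3"]],
    "key_words": [["family"], ["hero", "city"]],
})
columns = ["genre", "director", "actors", "key_words"]

# Step 1: turn each list into a space-separated string, column by column.
joined = df[columns].apply(lambda col: col.str.join(" "))
# Steps 2 and 3: join the four strings of each row, then append a trailing space.
df["Bag_of_words"] = joined.apply(" ".join, axis=1) + " "

print(df["Bag_of_words"].tolist())
```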

Instead of:
row['Bag_of_words'] =words
Use:
df.at[index,'Bag_of_words'] = words
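Put together, the corrected loop might look like the sketch below. The two-column sample frame is a hypothetical stand-in for the asker's data:

```python
import pandas as pd

# Hypothetical two-row frame with list-valued cells.
df = pd.DataFrame({
    "genre": [["crime", "drama"], ["action"]],
    "key_words": [["family"], ["hero", "city"]],
})
columns = ["genre", "key_words"]

df["Bag_of_words"] = ""
for index, row in df.iterrows():
    words = ""
    for col in columns:
        words += " ".join(row[col]) + " "
    # Write through the DataFrame itself; `row` is only a copy.
    df.at[index, "Bag_of_words"] = words

print(df["Bag_of_words"].tolist())
```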

pandas - If partial string match exists, put value in new column

I've got a tricky problem in pandas to solve. I was previously referred to this thread as a solution but it is not what I am looking for.
Take this example dataframe with two columns:
df = pd.DataFrame([['Mexico', 'Chile'], ['Nicaragua', 'Nica'], ['Colombia', 'Mex']], columns = ["col1", "col2"])
I first want to check each row in column 2 to see if that value exists in column 1. This is checking full and partial strings.
df['compare'] = df['col2'].apply(lambda x: 'Yes' if df['col1'].str.contains(x).any() else 'No')
I can check to see that I have a match of a partial or full string, which is good but not quite what I need. Here is what the dataframe looks like now:
What I really want is the value from column 1 that the value in column 2 matched with. I have not been able to figure out how to associate them.
My desired result looks like this:

Here's a "pandas-less" way to do it. Probably not very efficient but it gets the job done:
def compare_cols(match_col, partial_col):
    series = []
    for partial_str in partial_col:
        for match_str in match_col:
            if partial_str in match_str:
                series.append(match_str)
                break  # matches to the first value found in match_col
        else:  # for loop did not break = no match found
            series.append(None)
    return series
df = pd.DataFrame([['Mexico', 'Chile'], ['Nicaragua', 'Nica'], ['Colombia', 'Mex']], columns = ["col1", "col2"])
df['compare'] = compare_cols(match_col=df.col1, partial_col=df.col2)
Note that if a string in col2 matches to more than one string in col1, the first occurrence is used.
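The same first-match rule can be written more compactly with next() and a default value. This is only a sketch of an equivalent formulation; the helper name first_match is an assumption, not part of the answer:

```python
import pandas as pd


def first_match(partial, candidates):
    """Return the first candidate that contains `partial`, or None if none does."""
    return next((c for c in candidates if partial in c), None)


df = pd.DataFrame(
    [["Mexico", "Chile"], ["Nicaragua", "Nica"], ["Colombia", "Mex"]],
    columns=["col1", "col2"],
)
df["compare"] = [first_match(p, df["col1"]) for p in df["col2"]]
print(df)
```

Here "Chile" finds no match, "Nica" maps to "Nicaragua", and "Mex" maps to "Mexico".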

Making a new column based on 2 other columns

I am trying to calculate a new column labeled in the code as "Sulphide-S(calc)-C_%S". This column can be calculated from one of two options (see the code below). Both of these columns won't be filled at the same time, so I want it to calculate from the column that has data present. Presently, I have this, but the second equation overwrites the first.
df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"]
df.head()
df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"]- df["Sulphate-S_%S"]
df.head()

You can use the apply function in pandas to create a new column based on other columns; it returns a Series that you can add to your original dataframe. Without knowing what your dataframe looks like, the code below assumes empty cells are stored as NaN; if your data marks them differently, replace the if condition accordingly.
def create_sulfide_col(row):
    # use the HCl-leachable sulphate when it is present, otherwise fall back to Sulphate-S
    if pd.notna(row["Sulphate-S(HCL Leachable)_%S"]):
        val = row["Total-S_%S"] - row["Sulphate-S(HCL Leachable)_%S"]
    else:
        val = row["Total-S_%S"] - row["Sulphate-S_%S"]
    return val
df["Sulphide-S(calc)-C_%S"] = df.apply(create_sulfide_col, axis='columns')
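If the empty cells are NaN, the same choice can be made without a row-wise function by filling one sulphate column from the other before subtracting. A minimal sketch under that assumption, with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN marks the sulphate column that was not measured.
df = pd.DataFrame({
    "Total-S_%S": [2.0, 3.5, 1.25],
    "Sulphate-S(HCL Leachable)_%S": [0.5, np.nan, np.nan],
    "Sulphate-S_%S": [np.nan, 1.0, 0.25],
})

# Prefer the HCl-leachable value and fall back to Sulphate-S where it is missing.
sulphate = df["Sulphate-S(HCL Leachable)_%S"].fillna(df["Sulphate-S_%S"])
df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"] - sulphate

print(df["Sulphide-S(calc)-C_%S"].tolist())
```

This keeps a single output column, so neither equation overwrites the other.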

If I'm understanding what you're saying correctly, the second equation overwrites the first because they have the same column name. Try changing the column name in one or both of the "Sulphide-S(calc)-C_%S" to something else like "Sulphide-S(calc)-C_%S_A" and "Sulphide-S(calc)-C_%S_B":
df["Sulphide-S(calc)-C_%S_A"] = df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"]
df.head()
df["Sulphide-S(calc)-C_%S_B"] = df["Total-S_%S"]- df["Sulphate-S_%S"]
df.head()

Pandas - contains from other DF

I have 2 dataframes:
DF A:
and DF B:
I need to check every row in the DFA['item'] if it contains some of the values in the DFB['original'] and if it does, then add new column in DFA['my'] that would correspond to the value in DFB['my'].
So here is the result I need:
I thought of converting DFB['original'] into a list and then using a regex, but this way I won't get the matching result from column 'my'.

OK, maybe not the best solution, but it seems to work.
I did a Cartesian join and then kept the records that contain the data needed:
dfa['join'] = 1
dfb['join'] = 1
dfFull = dfa.merge(dfb, on='join').drop('join', axis=1)
# use x['item'], not x.item: Series.item is a method, so attribute access would not return the column value
dfFull['match'] = dfFull.apply(lambda x: x['original'] in x['item'], axis=1)
dfFull[dfFull['match']]
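In pandas 1.2 and later, merge accepts how="cross", which removes the need for the helper 'join' column. The sketch below uses hypothetical frames (only the column names item, original, and my come from the question) and merges the matches back so that unmatched items are kept:

```python
import pandas as pd

# Hypothetical frames; the real data was not shown in the question.
dfa = pd.DataFrame({"item": ["red apple pie", "green pear", "plain bread"]})
dfb = pd.DataFrame({"original": ["apple", "pear"], "my": ["A", "P"]})

# Every row of dfa paired with every row of dfb (requires pandas 1.2+).
pairs = dfa.merge(dfb, how="cross")
# Keep the pairs where 'original' occurs as a substring of 'item'.
mask = [orig in item for orig, item in zip(pairs["original"], pairs["item"])]
matches = pairs.loc[mask, ["item", "my"]]
# Left-merge back so items without any match get NaN in 'my'.
result = dfa.merge(matches, on="item", how="left")
print(result)
```

A cross join grows as len(dfa) * len(dfb), so this is only practical when at least one frame is small.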

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?

for column in df:
    print(df[column])
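Applied to the question's regression, the same loop might look like the sketch below. To stay self-contained it swaps in numpy's least-squares solver for statsmodels and uses synthetic data in place of the downloaded prices, so treat it as an illustration of the loop rather than the asker's exact setup. Like sm.OLS without add_constant, it fits no intercept. Real pct_change() output starts with a NaN row, which would likely need dropping first (for example with dropna()).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's `returns` frame.
rng = np.random.default_rng(0)
returns = pd.DataFrame(
    rng.normal(size=(50, 4)),
    columns=["FIUIX", "FSAIX", "FSAVX", "FSTMX"],
)

# Single regressor as a (n, 1) matrix, no intercept term.
x = returns[["FSTMX"]].to_numpy()
resids = {}
for column in returns:
    if column == "FSTMX":
        continue  # do not regress the benchmark on itself
    y = returns[column].to_numpy()
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    # Residuals: observed values minus fitted values.
    resids[column] = pd.Series(y - x @ beta, index=returns.index)

print(list(resids))
```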

You can use items() (named iteritems() before it was removed in pandas 2.0):
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values.iloc[0]))

This answer covers iterating over selected columns as well as all columns in a DataFrame.
df.columns gives an Index containing all the column names in the DataFrame. That isn't needed if you want to iterate over all the columns, but it comes in handy when you want to iterate over only the columns of your choosing.
We can slice df.columns like a Python list to suit our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly, to iterate over all the columns in reverse order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over the columns in many ways using this technique. Also remember that you can get the positions of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)

You can index DataFrame columns by position using iloc (the older ix indexer was removed in pandas 1.0).
df1.iloc[:, 1]
This returns the column at position 1, i.e. the second column (positions start at 0).
df1.iloc[0]
This returns the first row.
df1.iloc[0, 1]
This returns the value at the intersection of row 0 and column 1, and so on. So you can enumerate() returns.keys() and use the number to index the DataFrame.
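A short sketch of that enumerate-and-index pattern, using a small hypothetical frame named df1:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})  # hypothetical data

for pos, name in enumerate(df1.keys()):
    # Select the column by its integer position rather than by its label.
    column = df1.iloc[:, pos]
    print(pos, name, column.tolist())
```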

A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)

Using list comprehension, you can get all the columns names (header):
[column for column in df]

Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
Each df[column] above is a Series, which can easily be converted into a numpy ndarray:
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))

I'm a bit late, but here's how I did this. The steps:
1. Create a list of all columns.
2. Use itertools to take combinations of x columns.
3. Append each result's R-squared value to a result DataFrame, along with the list of excluded columns.
4. Sort the result DataFrame in descending order of R-squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to adapt it to your use case.
import itertools
import pandas as pd
import statsmodels.formula.api as smf
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)
# collect one result row per combination (DataFrame.append was removed in pandas 2.0)
rows = []
# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    # columns left out of this combination
    excluded = "+".join(y for y in itercols if y not in x)
    rows.append([f.rsquared, lmstr, excluded])
# results DF
regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])
regression_res.sort_values(by="Rsq", ascending=False)

I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work: this unpacks the column names, not the Series
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
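The difference between the two unpacking forms can be checked on a small hypothetical frame. Iterating over the column names directly is one possible alternative to going through items():

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})  # hypothetical data

# Direct unpacking iterates over the column labels, so this yields strings.
a, b = df[["x", "y"]]

# Unpacking a generator of Series yields the columns themselves.
x, y = (df[name] for name in ("x", "y"))

print(a, b)
print(x.tolist(), y.tolist())
```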

Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
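The duplicate-name problem is easy to reproduce. In this sketch (with a hypothetical frame whose label "a" appears twice), selecting by name returns a DataFrame holding both columns, while positional access always returns a single Series:

```python
import pandas as pd

# Hypothetical frame in which the label "a" is used for two columns.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Selecting by name returns both "a" columns at once, as a DataFrame.
by_name = df["a"]

# Selecting by position returns exactly one Series per column.
by_position = [df.iloc[:, i] for i in range(df.shape[1])]

print(type(by_name).__name__, [s.tolist() for s in by_position])
```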

Assuming X holds the factors and y the labels (multiple columns):
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values
