Excel diff with Python

I am looking for an algorithm to compare two Excel sheets, based on their column names, in Python.
I do not know what the columns are, so one sheet may have an additional column, or both sheets may have several columns with the same name.
The easiest case is when a column in the first sheet corresponds to exactly one column in the second sheet. Then I can perform the diff on the rows of that column using xlrd.
If the column name is not unique, I can verify if the columns have the same position.
Does anyone know of an already existing algorithm or have any experience in this domain?

Fast and dirty:
# Since the order of the names doesn't matter, we can use set()
matching_names = set(sheet_one_names) & set(sheet_two_names)
...
# Here, order does matter, since we're comparing row data,
# not just whether they match at some point.
mismatching_rowdata = [i for i, j in zip(columndata_one, columndata_two) if i != j]
Note: this assumes that you've done a few things beforehand:
get the column names for sheet 1 via xlrd, and the same for the second sheet,
get the row data for both sheets into two different variables.
This is to give you an idea.
Also note that with the list comprehension (the second option), the rows must be of equal length, since zip() stops at the end of the shorter sequence and extra trailing items are skipped. This is a mismatch scenario; reverse the comparison to get the matches in the data flow.
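A minimal sketch of those prerequisite steps with xlrd (the filename is illustrative, and the first row is assumed to hold the column names):
import xlrd

book = xlrd.open_workbook('workbook.xls')  # illustrative filename
sheet_one = book.sheet_by_index(0)
sheet_one_names = sheet_one.row_values(0)  # assumes headers in the first row
# one list of cell values per column, skipping the header row
columndata_one = [sheet_one.col_values(c, start_rowx=1)
                  for c in range(sheet_one.ncols)]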
This is a slower but functional solution:
column_a_name = ['Location', 'Building', 'Location']
column_a_data = [['Floor 1', 'Main', 'Sweden'],
                 ['Floor 2', 'Main', 'Sweden'],
                 ['Floor 3', 'Main', 'Sweden']]
column_b_name = ['Location', 'Building']
column_b_data = [['Sweden', 'Main', 'Floor 1'],
                 ['Norway', 'Main', 'Floor 2'],
                 ['Sweden', 'Main', 'Floor 3']]
matching_names = []
for pos in range(len(column_a_name)):
    try:
        if column_a_name[pos] == column_b_name[pos]:
            matching_names.append((column_a_name[pos], pos))
    except IndexError:
        pass  # index out of range; the column lists are not the same length

mismatching_data = []
for row in range(len(column_a_data)):
    rowa = column_a_data[row]
    rowb = column_b_data[row]
    for name, _id in matching_names:
        if rowa[_id] != rowb[_id] and (rowa[_id] not in rowb or rowb[_id] not in rowa):
            mismatching_data.append((row, rowa[_id], rowb[_id]))

print(mismatching_data)
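For the duplicate-name case mentioned in the question, here is a minimal sketch of one way to pair columns up (the names column_positions and comparable are mine, purely illustrative):
from collections import defaultdict

def column_positions(names):
    # map each column name to every position where it occurs
    positions = defaultdict(list)
    for pos, name in enumerate(names):
        positions[name].append(pos)
    return positions

pos_a = column_positions(column_a_name)
pos_b = column_positions(column_b_name)
# a duplicated name is only comparable where it occupies the same
# position(s) in both sheets, per the strategy in the question
comparable = {name: [p for p in pos_a[name] if p in pos_b.get(name, [])]
              for name in pos_a}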

Related

Pandas method chaining with str.contains and str.split

I'm learning Pandas method chaining and am having trouble using str.contains and str.split in a chain. The data is one week's worth of information scraped from a Wikipedia page; I will be scraping several years' worth of weekly data.
This code without chaining works:
#list of data scraped from web:
list = ['Unnamed: 0','Preseason-Aug 11','Week 1-Aug 26','Week 2-Sep 2',
'Week 3-Sep 9','Week 4-Sep 23','Week 5-Sep 30','eek 6-Oct 7','Week 7-Oct 14',
'Week 8-Oct 21','Week 9-Oct 28','Week 10-Nov 4','Week 11-Nov 11','Week 12-Nov 18',
'Week 13-Nov 25','Week 14Dec 2','Week 15-Dec 9','Week 16 (Final)-Jan 4','Unnamed: 18']
#load to dataframe:
df = pd.DataFrame(list)
#rename column 0 to text:
df = df.rename(columns = {0:"text"})
#remove rows that contain "Unnamed":
df = df[~df['text'].str.contains("Unnamed")]
#split column 0 into 'week' and 'released' at the hyphen:
df[['week', 'released']] = df["text"].str.split(pat = '-', expand = True)
Here's my attempt to rewrite it as a chain:
#load to dataframe:
df = pd.DataFrame(list)
#function to remove rows that contain "Unnamed"
def filter_unnamed(df):
    df = df[~df["text"].str.contains("Unnamed")]
    return df
clean_df = (df
    .rename(columns = {0:"text"})
    .pipe(filter_unnamed)
    #[['week','released']] = lambda df_:df_["text"].str.split('-', expand = True)
)
The first line of the clean_df chain to rename column 0 works.
The second line removes rows that contain "Unnamed"; it works, but is there a better way than using pipe and a function?
I'm having the most trouble with str.split in the third line (it doesn't work, so it's commented out). I tried assign for this and think it should work, but I don't know how to pass the new column names ("week" and "released") to the str.split function.
Thanks for the help.
I also couldn't figure out how to create two columns in one go from the split, but I was able to do it by splitting twice and taking parts 0 and 1 in succession (not ideal): df.assign(week = ...[0], released = ...[1]).
Note also I reset the index.
(df
    .assign(week = df[0].str.split(pat = '-', expand=True)[0],
            released = df[0].str.split(pat = '-', expand=True)[1])
    [~df[0].str.contains("Unnamed")]
    .reset_index(drop=True)
    .rename(columns = {0: "text"})
)
I'm sure there's a sleeker way, but this may help.
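For the record, one way to keep everything in a single chain with just one split is to join the expanded split back on. This is a sketch assuming a reasonably recent pandas; raw replaces the question's list, which shadows the built-in:
import pandas as pd

raw = ['Unnamed: 0', 'Preseason-Aug 11', 'Week 1-Aug 26']  # sample of the scraped strings

clean_df = (
    pd.DataFrame(raw, columns=["text"])
    .loc[lambda d: ~d["text"].str.contains("Unnamed")]
    .pipe(lambda d: d.join(
        d["text"].str.split(pat='-', n=1, expand=True)
                 .set_axis(["week", "released"], axis=1)))
    .reset_index(drop=True)
)
The .pipe(lambda d: ...) step is what lets you refer to the intermediate frame mid-chain, so the split only runs once.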

Python dataframe loop row by row would not change value no matter what

I am trying to change a value in my pandas dataframe, but it is stubborn and will not take the desired value. I have used df.at as suggested in some other posts, and it is not working as a way to change/modify data in the dataframe.
HOUSING_PATH = "datasets/housing"
csv_path = os.path.join(HOUSING_PATH, "property6.csv")
housing = pd.read_csv(csv_path)
headers = ['Sold Price', 'Longitude', 'Latitude', 'Land Size', 'Total Bedrooms', 'Total Bathrooms', 'Parking Spaces']
# housing.at[114, headers[6]] is 405, and I want to change this to empty or 0 or None,
# as 405 parking spaces does not make sense.
for index in housing.index:
    # total parking spaces in this cell
    row = housing.at[index, headers[6]]
    # if total parking spaces is greater than 20
    if row > 20:
        # change to nothing
        row = ''
print(housing.at[114, headers[6]])
# however, this is still 405
Why is this happening? Why can't I replace the value in the dataframe? The values are <class 'numpy.float64'>, I have checked, so the if statement should work, and it is working. But the value just never changes.
You cannot do it like this. When you assign the value of housing.at[index, headers[6]] to row, you create a new variable which contains a copy of that value. You then change the new variable, not the original data.
for index in housing.index:
    # if total parking spaces is greater than 20
    if housing.at[index, headers[6]] > 20:
        # set the value in the original data to an empty string
        housing.at[index, headers[6]] = ''
This can easily be done without a for loop. Use .loc to filter the data frame based on the condition and change the values:
CODE
import pandas as pd
import os
HOUSING_PATH = "datasets/housing"
csv_path = os.path.join(HOUSING_PATH, "property6.csv")
housing = pd.read_csv(csv_path)
housing.loc[housing["Parking Spaces"] > 20, "Parking Spaces"] = ""
There are several built-in functions for such tasks (where, mask, replace, etc.).
# Series.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', ...)
data2 = data.iloc[:, 6]
data2.where(data2 <= 20, '', inplace=True)
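For completeness, a sketch of the other two helpers on the same column (the 405 literal is the bad value from the question; treat this as illustrative):
# mask() is the inverse of where(): it replaces values where the condition holds
housing["Parking Spaces"] = housing["Parking Spaces"].mask(housing["Parking Spaces"] > 20, '')
# replace() swaps specific values, e.g. the bad 405 seen in the question
housing["Parking Spaces"] = housing["Parking Spaces"].replace(405, '')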

Searching through strings in a dataframe and increasing the numbers found by 1

I have a dataframe that I have created by hand. I am working on code that copies the dataframe and concatenates the new dataframe to the end of the first one. For now, I need the code to look through each value of the 'Name' column, which contains strings, and if there is a number in the string, increase that number by 1. I need the number turned into an int so that I can write a function that looks through the dataframe and automatically adds 1 to the largest number found. An example:
import re

import numpy as np
import pandas as pd

data = {'ID': [1, 2, 3, 4],
        'Name': ['BN #1', 'HHC', 'A comp', 'B Comp']}
df = pd.DataFrame(data)
df['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df['Name'].values]
Afterwards the new df looks like
data2 = {'ID': [1, 2, 3, 4, 5, 6, 7, 8],
         'Name': ['BN #1', 'HHC', 'A comp', 'B Comp', 'BN #2', 'HHC', 'A comp', 'B Comp']}
When I run this, I receive a 'NoneType' object is not subscriptable error. This makes sense because only the BN # row has a number and re.search returns None when the string parameters are not met, but I cannot figure out how to tell python to ignore the other rows.
EDIT
Only the first row of each dataframe will increase by 1, so if there is an easier way that does not use re.search, that is fine. I know there are a couple of ways of doing this, but I want to be able to always look through the string value of BN and increase it by 1 every time I run the code.
REGEX EDIT
df2['BaseName'] = [re.sub(r'\d', '', x) for x in df2['Name'].values]
df['BaseName'] = [re.sub(r'\d', '', x) for x in df['Name'].values]
df2['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df2['Name'].values]
# df2['SysNum'] = df2['Name'].get(r'(?<=#)\d').astype(int)
# df['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df['Name'].values]
df['SysNum'] = df['Name'].str.contains(r'(?<=#)\d').astype(int)

m = re.search(r'(?<=#)\d', df2['Name'].iloc[0])
if m:
    df2['SysNum'] = int(m.group(0)) + 1
n = re.search(r'(?<=#)\d', df['Name'].iloc[0])
if n:
    df['SysNum'] = int(n.group(0)) + 1

new_names = df2['BaseName'].unique()
maxes2 = np.zeros((len(new_names), ))
for j in range(len(new_names)):
    un2 = new_names[j]
    maxes2[j] = df['SysNum'].loc[df['BaseName'] == un2].max()
    df2['SysNum'].loc[df2['BaseName'] == un2] = np.linspace(1, len(df2['SysNum'].loc[df2['BaseName'] == un2]), len(df2['SysNum'].loc[df2['BaseName'] == un2]))
    df2['SysNum'].loc[df2['BaseName'] == un2] += maxes2[j]
    newnames2 = [s + '%d' % num for s, num in zip(df2['BaseName'].loc[df2['BaseName'] == un2].values, df2['SysNum'].loc[df2['BaseName'] == un2].values)]
    df2['Name'].loc[df2['BaseName'] == un2] = newnames2
I have this code working for two dataframes, and the numbering works out how I would like it to. The first two have a "Name-###" naming convention for all the rows in the dataframe, which allows the commented-out re.search line at the top to run just fine. The next two dataframes I am working on are like the examples I put up earlier with BN #1, where the rest of the names do not have a number. When I run the commented-out re.search lines, the code tries to convert the NoneTypes to int, and it cannot do that. When I run the code as it is now, a new number is put on each and every row immediately following the name, but I need it to add a new number only to the row with the #. So what I am struggling with is a piece of code that looks through the dataframe, looks for a # sign, turns the number after the # sign into an int, finds the max int in a loop and adds 1 to it, puts that new number in the new dataframe, and appends the new dataframe onto the old one for a larger master list.
You can access the value on the first row of the Name column using df['Name'].iloc[0].
Thus, you can search for a sequence of digits after a # sign in that value using
m = re.search(r'#(\d+)', df['Name'].iloc[0])
if m:
    df['SysNum'] = int(m.group(1)) + 1
Output:
>>> df
   ID    Name  SysNum
0   1   BN #1       2
1   2     HHC       2
2   3  A comp       2
3   4  B Comp       2
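If more than one row can carry a number, here is a sketch of the max-then-increment step described in the question; str.extract yields NaN for rows without a '#', which is one way to ignore the other rows (the names nums and next_num are illustrative):
# pull out the digits that follow '#'; rows without one become NaN
nums = df['Name'].str.extract(r'#(\d+)', expand=False).dropna().astype(int)
# the next number is one past the current maximum (or 1 if nothing matched)
next_num = nums.max() + 1 if not nums.empty else 1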

How to compare two str values dataframe python pandas

I am trying to compare two different values in a dataframe. I wasn't able to make use of the questions/answers I've found.
import pandas as pd
# from datetime import timedelta
"""
read csv file
clean date column
convert date str to datetime
sort for equity options
replace date str column with datetime column
"""
trade_reader = pd.read_csv('TastyTrades.csv')
trade_reader['Date'] = trade_reader['Date'].replace({'T': ' ', '-0500': ''}, regex=True)
date_converter = pd.to_datetime(trade_reader['Date'], format="%Y-%m-%d %H:%M:%S")
options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]
clean_frame = options_frame.replace(to_replace=['Date'], value='date_converter')
# Separate opening transaction from closing transactions, combine frames
opens = clean_frame[clean_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_OPEN'])]
closes = clean_frame[clean_frame['Action'].isin(['BUY_TO_CLOSE', 'SELL_TO_CLOSE'])]
open_close_set = set(opens['Symbol']) & set(closes['Symbol'])
open_close_frame = clean_frame[clean_frame['Symbol'].isin(open_close_set)]
'''
convert Value to float
sort for trade readability
write
'''
ocf_float = open_close_frame['Value'].astype(float)
ocf_sorted = open_close_frame.sort_values(by=['Date', 'Call or Put'], ascending=True)
# for readability, revert back to ocf_sorted below
ocf_list = ocf_sorted.drop(
['Type', 'Instrument Type', 'Description', 'Quantity', 'Average Price', 'Commissions', 'Fees', 'Multiplier'], axis=1
)
ocf_list.reset_index(drop=True, inplace=True)
ocf_list['Strategy'] = ''
# ocf_list.to_csv('Sorted.csv')
# create strategy list
debit_single = []
debit_vertical = []
debit_calendar = []
credit_vertical = []
iron_condor = []
# shift columns
ocf_list['Symbol Shift'] = ocf_list['Underlying Symbol'].shift(1)
ocf_list['Symbol Check'] = ocf_list['Underlying Symbol'] == ocf_list['Symbol Shift']
# compare symbols, append depending on criteria met
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
print(type(ocf_list['Underlying Symbol']))
ocf_list.to_csv('Sorted.csv')
print(debit_vertical)
# delta = timedelta(seconds=10)
The error I get is:
line 51, in <module>
if row['Symbol Check'][-1] is row['Underlying Symbol'][-1]:
TypeError: string indices must be integers
I am trying to compare the newly created, shifted column to the original, and if they are the same, append to a list. Is there a way to compare two string values at all in Python? I've tried checking whether Symbol Check is true, and it still returns an error that str indices must be int. .iterrows() didn't work either.
Here, you will actually iterate through the columns of your DataFrame, not the rows:
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
You can use the methods iterrows or itertuples to iterate through the rows instead, but note what they yield: iterrows gives (index, Series) pairs and itertuples gives namedtuples, so you have to unpack the index and access the columns accordingly. A plain for row in df loop like the one above never sees the rows at all, only the column labels (strings), which is why indexing them with column names fails.
Second, you should use == instead of is since you are probably comparing values, not identities.
Lastly, I would skip iterating over the rows entirely, as pandas is made for selecting rows based on a condition. You should be able to replace the aforementioned code with this:
debit_vertical = ocf_list[ocf_list['Symbol Shift'] == ocf_list['Underlying Symbol']].values.tolist()
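For comparison, a sketch of the row-wise version with iterrows, kept only to show the unpacking; the vectorized line above is the better choice:
debit_vertical = []
for idx, row in ocf_list.iterrows():
    # row is a Series here, so column-name indexing works
    if row['Symbol Shift'] == row['Underlying Symbol']:
        debit_vertical.append(row.tolist())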

Cleaning up excel document - formatting cells based on their content

Quite new to Python, and doing my first project: Excel data cleaning.
The idea is to check data before uploading it to the system. Cells which do not meet the requirements have to be highlighted, and a comment should be added to the comment column.
Requirements to check:
Mark First or Last Names which contain numbers/symbols - action: highlight the cell and add a comment to the comment column
Check empty cells - action: highlight the cell and add a comment
I tried different ways (especially using if statements) to highlight cells which do not meet the requirements and comment at the same time, but nothing works.
import pandas as pd
import numpy as np
df_i = pd.DataFrame({'Email' : ['john#yahoo.com','john#outlook.com','john#gmail.com'], 'First Name': ['JOHN',' roman2 ',''], 'Last Name': ['Smith','','132'], 'Comments':['','','']})
emails_to_exclude = ('#gmail', '#yahoo')
print(df_i)
#Proper names
def proper_name(name):
    return name.str.title()

df_i['First Name'] = proper_name(df_i['First Name'])
df_i['Last Name'] = proper_name(df_i['Last Name'])
#Trim spaces
def trim(cell):
    return cell.apply(lambda x: x.str.strip())

df_i = trim(df_i)
#Check public email domains
df_i.loc[df_i['Email'].str.contains('|'.join(emails_to_exclude), case=False),'Comments'] = df_i['Comments'].astype(str) + 'public email domain'
#Check first and last name
list_excl = ["1","2","3","4","5","6","7","8","9","0"]
df_i.loc[df_i['First Name'].str.contains('|'.join(list_excl), case=False), 'Comments'] = df_i['Comments'].astype(str) + " Check 'First Name'"
df_i.loc[df_i['Last Name'].str.contains('|'.join(list_excl), case=False), 'Comments'] = df_i['Comments'].astype(str) + " Check 'Last Name'"
print(df_i)
I would write a function that uses re to check whether a string matches a defined pattern. I understood the desired pattern to be a sequence of upper- or lower-case letters (I am not sure whether the names can contain whitespace characters).
For the formatting part, use df.style. Basically, you write a function that defines how each cell should be formatted using CSS. You will need to export to Excel (csv does not contain any information about formatting). You can also render it as an HTML table; see the pandas styling documentation for more. Note that after using df.style, the object you are working with is no longer a pd.DataFrame but a pandas.io.formats.style.Styler, so do whatever you want to do with your DataFrame before styling it.
import pandas as pd
import numpy as np
import re

def highlight_invalid(string, invalid_colour='yellow', empty_colour='red'):
    if string:
        # the string must consist of one or more ASCII letters
        # (note: [A-z] would be a bug, as it also matches '[', '_', etc.)
        pattern = re.compile(r'^[A-Za-z]+$')
        if pattern.match(string):
            # do not highlight valid strings
            return ''
        else:
            # highlight non-matching strings in invalid_colour
            return f'background-color: {invalid_colour}'
    else:
        # highlight empty strings in empty_colour
        return f'background-color: {empty_colour}'

cols = ['First Name', 'Last Name']
for col in cols:
    # it didn't work with missing values, so make sure to replace them
    df_i[col] = df_i[col].replace(np.nan, '')

# apply the highlighting function to every cell of 'First Name' and 'Last Name'
df_i = df_i.style.applymap(highlight_invalid, subset=cols)
df_i.to_excel(fname)  # fname: path to the output .xlsx file
Maybe you want to write a separate function that does the data verification and use it both in highlighting and adding a comment. I will leave that to you as that is not related to formatting per se and should be asked as a separate question.
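As a starting point, a sketch of that shared-validation idea (is_valid_name and the comment texts are mine, purely illustrative; it assumes NaNs were already replaced with '' and must run on the DataFrame before the df_i.style step above):
NAME_PATTERN = re.compile(r'^[A-Za-z]+$')

def is_valid_name(s):
    # illustrative helper: non-empty and letters only
    return bool(s) and NAME_PATTERN.match(s) is not None

# add comments from the same check that drives the highlighting
for col in ['First Name', 'Last Name']:
    bad = ~df_i[col].map(is_valid_name)
    df_i.loc[bad, 'Comments'] = df_i.loc[bad, 'Comments'].astype(str) + f" Check '{col}'"
The same predicate can then be called inside highlight_invalid, so both the comments and the colours always agree on what counts as invalid.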
