I'm quite new to Python and working on my first project: Excel data cleaning.
The idea is to check the data before uploading it to the system. Cells that do not meet the requirements have to be highlighted, and a comment should be added to the comment column.
Requirements to check:
Mark First or Last Names which contain numbers/symbols - action: highlight the cell and add a comment to the comment column
Check empty cells - action: highlight the cell and add a comment
I tried different ways (especially using an if statement) to highlight cells that do not meet the requirements and add a comment at the same time, but nothing works.
import pandas as pd
import numpy as np
df_i = pd.DataFrame({'Email': ['john#yahoo.com', 'john#outlook.com', 'john#gmail.com'],
                     'First Name': ['JOHN', ' roman2 ', ''],
                     'Last Name': ['Smith', '', '132'],
                     'Comments': ['', '', '']})
emails_to_exclude = ('#gmail', '#yahoo')
print(df_i)
# Proper names
def proper_name(name):
    return name.str.title()

df_i['First Name'] = proper_name(df_i['First Name'])
df_i['Last Name'] = proper_name(df_i['Last Name'])

# Trim spaces
def trim(cell):
    return cell.apply(lambda x: x.str.strip())

df_i = trim(df_i)
# Check public email domains
df_i.loc[df_i['Email'].str.contains('|'.join(emails_to_exclude), case=False), 'Comments'] = df_i['Comments'].astype(str) + 'public email domain'

# Check first and last name
list_excl = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0"]
df_i.loc[df_i['First Name'].str.contains('|'.join(list_excl), case=False), 'Comments'] = df_i['Comments'].astype(str) + " Check 'First Name'"
df_i.loc[df_i['Last Name'].str.contains('|'.join(list_excl), case=False), 'Comments'] = df_i['Comments'].astype(str) + " Check 'Last Name'"
print(df_i)
I would write a function that uses re to check whether a string matches a defined pattern. As I understand it, the desired pattern is a sequence of upper- or lower-case letters (I am not sure whether names may contain whitespace characters).
For the formatting part, use df.style. Basically, you write a function that defines how each cell should be formatted using CSS. You will need to export to Excel (csv does not carry any formatting information); you can also render it as an HTML table. See the pandas styling documentation for more. Note that after using df.style, the object you are working with is no longer a pd.DataFrame but a pandas.io.formats.style.Styler, so do whatever you want to do with your DataFrame before styling it.
import pandas as pd
import numpy as np
import re
def highlight_invalid(string, invalid_colour='yellow', empty_colour='red'):
    if string:
        # The string contains only one or more ASCII letters.
        # Note: [A-z] would be a bug here, since it also matches [, ], ^, _ and `.
        pattern = re.compile(r'^[A-Za-z]+$')
        if pattern.match(string):
            # do not highlight valid strings
            return ''
        else:
            # highlight non-matching strings in invalid_colour
            return f'background-color: {invalid_colour}'
    else:
        # highlight empty strings in empty_colour
        return f'background-color: {empty_colour}'

cols = ['First Name', 'Last Name']
for col in cols:
    # It didn't work when I tried it with missing values, so replace them first
    df_i[col] = df_i[col].replace(np.nan, '')

# Apply the highlighting function to every cell of 'First Name' and 'Last Name'
df_i = df_i.style.applymap(highlight_invalid, subset=cols)
fname = 'output.xlsx'  # any .xlsx path works; csv cannot store formatting
df_i.to_excel(fname)
Maybe you want to write a separate function that does the data verification and use it both in highlighting and adding a comment. I will leave that to you as that is not related to formatting per se and should be asked as a separate question.
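As a starting point, here is a minimal sketch of that idea, reusing the column names from the question (the helper name, the colour, and the output filename are my own assumptions):
import re
import pandas as pd

df_i = pd.DataFrame({'First Name': ['JOHN', ' roman2 ', ''],
                     'Last Name': ['Smith', '', '132'],
                     'Comments': ['', '', '']})

def is_valid_name(value):
    # Valid: a non-empty string of ASCII letters only (empty cells fail too).
    return isinstance(value, str) and bool(re.fullmatch(r'[A-Za-z]+', value))

def highlight_invalid(value):
    # Reuse the same check for the styling.
    return '' if is_valid_name(value) else 'background-color: yellow'

cols = ['First Name', 'Last Name']
for col in cols:
    # ...and reuse it again for the comments.
    bad = ~df_i[col].apply(is_valid_name)
    df_i.loc[bad, 'Comments'] = df_i.loc[bad, 'Comments'] + f" Check '{col}'"

styled = df_i.style.applymap(highlight_invalid, subset=cols)
styled.to_excel('checked.xlsx')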
I am reading a csv file using pandas read_csv which contains data like this:
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720983f0000c0bf0000000014ae47bf0fe7c23ad1de3039;
Id;LibId;1;mod;modId;4;f9e9003e;
...
In the last column, I want to remove the Index, Step, data= and want to retain the hex value part.
I have created a list with the unwanted values and used regex, but nothing seems to work.
to_remove = ['Index','Step','data=']
rex = '[' + re.escape(''.join(to_remove)) + ']'
output_csv['Column_name'].str.replace(rex, '', regex=True)
I suggest that you fix your code using:
to_remove = ['Index','Step','data=']
output_csv['Column_name'] = output_csv['Column_name'].str.replace('|'.join([re.escape(x) for x in to_remove]), '', regex=True)
The '|'.join([re.escape(x) for x in to_remove]) part creates a regex like Index|Step|data= and matches any of the to_remove substrings.
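You can see what the joined pattern does on one of the sample values. Note that it removes only the literal substrings, so the =10, =0, fragments stay behind; the next answer strips everything up to data= instead:
import re

to_remove = ['Index', 'Step', 'data=']
pattern = '|'.join(re.escape(x) for x in to_remove)
print(re.sub(pattern, '', 'Index=10, Step=0, data=d720983f'))
# prints: =10, =0, d720983f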
Input (column names added for reference; they can be omitted):
col1;col2;col3;col4;col5;col6;col7
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720983f0000c0bf0000000014ae47bf0fe7c23ad1de3039
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d7203ad1de3039
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720e47bf0fe7c23ad1de3039
Code:
import pandas as pd
df = pd.read_csv(r"check.csv", sep=";")
df["col7"].replace(regex=True, to_replace="(Index=)(.*)(data=)", value="", inplace=True)
This will extract only the hex value from "data" part and remove everything else. Do not forget about inplace=True.
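To illustrate on one of the sample rows: the greedy .* consumes everything between Index= and the last data=, so only the hex part survives:
import re

s = "Index=10, Step=0, data=d720983f0000c0bf0000000014ae47bf0fe7c23ad1de3039"
print(re.sub("(Index=)(.*)(data=)", "", s))
# prints: d720983f0000c0bf0000000014ae47bf0fe7c23ad1de3039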
I am using python3 and pandas to create a script to:
Read unstructured xlsx data of varying column lengths
Total the "this", "last" and "diff" columns
Add Total under the brands columns
Dynamically bold the entire row that contains "total"
On the last point, the challenge I have been struggling with is that the row index changes depending on the data being fed into the script. The code provided does not include a solution to this issue. I have tried every variation I can think of using style.applymap(bold), with and without variables.
(Example screenshots of the input and the desired outcome were attached to the question.)
Script:
import pandas as pd
import io
import sys
import warnings

def bold(val):
    return 'font-weight: bold'

excel_file = 'testfile1.xlsx'
df = pd.read_excel(excel_file)
product = df.loc[df['Brand'] == "widgit"]
product = product.append({'Brand': 'Total',
                          'This': product['This'].sum(),
                          'Last': product['Last'].sum(),
                          'Diff': product['Diff'].sum(),
                          '% Chg': product['This'].sum() / product['Last'].sum()},
                         ignore_index=True)
product = product.append({'Brand': ' '}, ignore_index=True)
product.fillna(' ', inplace=True)
Try something like this:
import numpy as np
import pandas as pd

def highlight_max(x):
    # Bold every cell in the column that equals the value at row label 4.
    return ['font-weight: bold' if v == x.loc[4] else '' for v in x]

df = pd.DataFrame(np.random.randn(5, 2))
df.style.apply(highlight_max)
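Because the 'Total' row's position changes with the input, a sketch that keys off the Brand label instead of a hard-coded index may fit better (building on the product frame from the question; the output filename is an assumption):
def bold_total_row(row):
    # Bold the whole row whose Brand cell reads 'Total', wherever it sits.
    style = 'font-weight: bold' if row['Brand'] == 'Total' else ''
    return [style] * len(row)

product.style.apply(bold_total_row, axis=1).to_excel('styled.xlsx', index=False)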
I have a data frame created with pandas. One of the columns in the data frame holds URLs, which I would like to match and count the number of occurrences of.
My logic is that if it does not return 'None', then at this stage print('Match'); however, that does not appear to work. Here is a sample of my current code, and I would appreciate any tips on how to match a value using pandas, as I have really just come back from using a lot of R and don't have much experience with pandas and data frames in Python.
Title,URL,Date,Unique Pageviews
Preparing and Starting DS career,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:242750,20-Jan-15,163
The Rogue Data Scientist,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:273425,4-May-15,1108
Is it safe to code after one bottle of wine?,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:349416,9-Nov-15,1736
Short-Term Forecasting of Electricity Demand,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:350421,12-Nov-15,1117
Visual directory of 339 tools. Wow!,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:373786,14-Jan-16,4228
8 Types of Data,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:377008,23-Jan-16,2829
Very funny video for people who write code,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:379578,30-Jan-16,2444
Code block (PEP 8 requires two blank lines between functions):
def count_set_words(as_pandas):
    reg_exp = re.match('\b/forum', as_pandas['URL']).any()
    if as_pandas['URL'].str.match(reg_exp, case=False, flags=0, na=np.NAN).any():
        print("Match")


def set_new_columns(as_pandas):
    titles_list = ['Year > 2014', 'Forum', 'Blog', 'Python', 'R',
                   'Machine_Learning', 'Data_Science', 'Data', 'Analytics']
    for number, word in enumerate(titles_list):
        as_pandas.insert(len(as_pandas.columns), titles_list[number], 0)


def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader


def main():
    multi_sets = open_as_dataframe('HDT_data5.txt')
    set_new_columns
    count_set_words(multi_sets)

main()
reg_exp in the first line of count_set_words is not a regexp; it checks whether the elements in the URL column match '\b/forum'. I think something like:
df = pd.read_csv(file_name_in, encoding='windows-1251')
for ix, row in df.iterrows():
    # re.match only matches at the start of the string, so use re.search;
    # the raw string keeps \b a word boundary rather than a backspace.
    if re.search(r'\b/forum', row['URL']) is not None:
        print('this is a match')
would solve your problem.
Or even simpler:
df['is_a_match'] = df['URL'].apply(lambda url: re.search(r'\b/forum', url) is not None)
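Since the question also asks for the number of occurrences, a vectorised variant (assuming the column really is named 'URL', as in the sample data):
df['is_a_match'] = df['URL'].str.contains(r'\b/forum', na=False)
print(df['is_a_match'].sum())  # number of URLs containing '/forum'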
I have the following code,
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)
It had always been working until a csv file arrived without full coverage of all weekdays. For example, with the following .csv file:
DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532
I'll get the following error:
KeyError: "['5Thu' '7Sat'] not in index"
It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.
Use reindex to get all the columns you need. It preserves the ones that are already there and inserts the missing ones; pass fill_value=0 so the new columns are filled with zeros rather than NaN (which would make the later astype(int) cast fail).
p = p.reindex(columns=['1Sun', '2Mon', '3Tue', '4Wed', '5Thu', '6Fri', '7Sat'], fill_value=0)
So, your entire code example should look like this:
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=columns, fill_value=0)
p[columns] = p[columns].astype(int)
I had a very similar issue. I got the same error because the csv contained spaces in the header: my csv had a header "Gender " (with a trailing space) while I had it listed as:
[['Gender']]
If it's easy enough for you to access your csv, you can use the Excel formula TRIM() to clip any spaces from the cells.
Or remove them like this:
df.columns = df.columns.str.strip()
Please try this to clean and format your column names:
df.columns = (df.columns.str.strip().str.upper()
.str.replace(' ', '_')
.str.replace('(', '')
.str.replace(')', ''))
I had the same issue.
During initial development I used a .csv file (comma-separated) that I had modified a bit before saving it; after saving, the commas had become semicolons.
On Windows this depends on the "Regional and Language Options" customize screen, where you find a List separator; that is the character Windows applications expect as the CSV separator.
When testing with a brand-new file I ran into this issue, so I removed the 'sep' argument from the read_csv call.
before:
df1 = pd.read_csv('myfile.csv', sep=',')
after:
df1 = pd.read_csv('myfile.csv')
That way, the issue disappeared.
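That said, if you know the file actually uses semicolons, it is more robust to state the separator explicitly than to depend on the system's list separator:
df1 = pd.read_csv('myfile.csv', sep=';')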
I am looking for an algorithm to compare two Excel sheets, based on their column names, in Python.
I do not know what the columns are, so one sheet may have an additional column or both sheets can have several columns with the same name.
The easiest case is when a column in the first sheet corresponds to only one column in the second excel sheet. Then I can perform the diff on rows of that column using xlrd.
If the column name is not unique, I can verify if the columns have the same position.
Does anyone know of an already existing algorithm or have any experience in this domain?
Fast and dirty:
# Since order of the names doesn't matter, we can use the set() option
matching_names = set(sheet_one_names) & set(sheet_two_names)
...
# Here, order does matter since we're comparing row data,
# not just whether they match at some point.
mismatching_rowdata = [i for i, j in zip(columndata_one, columndata_two) if i != j]
Note: this assumes you've done a few things beforehand:
get the column names for sheet 1 via xlrd, and the same for the second sheet;
get the row data for both sheets into two different variables.
This is to give you an idea; see the xlrd sketch below.
Also note that with the list-comprehension option (the second one) the rows must be of the same length, because zip stops at the shorter sequence and any extra rows are silently skipped. This is a MISMATCH scenario; reverse the comparison to get the matches in the data flow.
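For the first of those notes, a minimal xlrd sketch of pulling the header and row data (the filename is a placeholder, and I'm assuming an xlrd version that can open your workbook):
import xlrd

book = xlrd.open_workbook('workbook_one.xls')  # placeholder filename
sheet = book.sheet_by_index(0)
sheet_one_names = sheet.row_values(0)  # header row
columndata_one = [sheet.row_values(r) for r in range(1, sheet.nrows)]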
This is a slower but functional solution:
column_a_name = ['Location', 'Building', 'Location']
column_a_data = [['Floor 1', 'Main', 'Sweden'],
                 ['Floor 2', 'Main', 'Sweden'],
                 ['Floor 3', 'Main', 'Sweden']]
column_b_name = ['Location', 'Building']
column_b_data = [['Sweden', 'Main', 'Floor 1'],
                 ['Norway', 'Main', 'Floor 2'],
                 ['Sweden', 'Main', 'Floor 3']]

matching_names = []
for pos in range(len(column_a_name)):
    try:
        if column_a_name[pos] == column_b_name[pos]:
            matching_names.append((column_a_name[pos], pos))
    except IndexError:
        pass  # column lengths differ, so one sheet ran out of names

mismatching_data = []
for row in range(len(column_a_data)):
    rowa = column_a_data[row]
    rowb = column_b_data[row]
    for name, _id in matching_names:
        if rowa[_id] != rowb[_id] and (rowa[_id] not in rowb or rowb[_id] not in rowa):
            mismatching_data.append((row, rowa[_id], rowb[_id]))

print(mismatching_data)
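For the sample data above this prints [(1, 'Floor 2', 'Norway')]: only row 1 contains a value ('Norway') that is genuinely absent from the other sheet's row, while the remaining rows merely hold the same values in a different column order.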