Find and replace strings in Excel (.xlsx) using Python

I am trying to replace a bunch of strings in an .xlsx sheet (~70k rows, 38 columns). I have a list of the strings to be searched and replaced, stored in a file formatted as below:
bird produk - bird product
pig - pork
ayam - chicken
...
kuda - horse
The word to be searched is on the left, and the replacement is on the right (find 'bird produk', replace with 'bird product'). My .xlsx sheet looks something like this:
name      type of animal   ID
ali       pig              3483
abu       kuda             3940
ahmad     bird produk      0399
...
ahchong   pig              2311
I am looking for the fastest solution for this, since I have around 200 words in the list to be searched, and the .xlsx file is quite large. I need to use Python for this, but I am open to any other faster solutions.
Edit: added sheet example
Edit 2: tried some Python code to read the cells, but it took quite a long time to read. Any pointers?
from xlrd import open_workbook

wb = open_workbook('test.xlsx')
for s in wb.sheets():
    print('Sheet:', s.name)
    for row in range(s.nrows):
        for col in range(s.ncols):
            print(s.cell(row, col).value)
Thank you!
Edit 3: I finally figured it out. Both the VBA module and the Python code work. I resorted to .csv instead to make things easier. Thank you! Here is my version of the Python code:
# Our dictionary with our key:value pairs.
reps = {
    'JUALAN (PRODUK SHJ)': 'SALE( PRODUCT)',
    'PAMERAN': 'EXHIBITION',
    'PEMBIAKAN': 'BREEDING',
    'UNGGAS': 'POULTRY'}

def replace_all(text, dic):
    # Apply every key -> value replacement in turn.
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

with open('test.csv', 'r') as f:
    text = f.read()

text = replace_all(text, reps)

with open('file2.csv', 'w') as w:
    w.write(text)
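The word list shown in the question ('bird produk - bird product', one pair per line) could also be parsed into the reps dictionary instead of hard-coding it. A minimal sketch, assuming ' - ' separates each pair and never appears inside a term (parse_mappings is a hypothetical helper, not part of the original code):

```python
def parse_mappings(lines):
    """Parse 'old - new' pairs (one per line) into a replacement dict."""
    reps = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        old, sep, new = line.partition(' - ')
        if sep:  # only keep lines that actually contain the separator
            reps[old] = new
    return reps

pairs = """bird produk - bird product
pig - pork
ayam - chicken
kuda - horse"""

reps = parse_mappings(pairs.splitlines())
print(reps['bird produk'])  # -> bird product
```

In a real script the pairs would come from the mapping file (e.g. open('words.txt')) rather than an inline string.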

I would copy the contents of your text file into a new worksheet in the Excel file and name that sheet "Lookup." Then use Text to Columns to get the data into the first two columns of this new sheet, starting in the first row.
Paste the following code into a module in Excel and run it:
Sub Replacer()
    Dim w1 As Worksheet
    Dim w2 As Worksheet
    Dim i As Long
    'The sheet with the words from the text file:
    Set w1 = ThisWorkbook.Sheets("Lookup")
    'The sheet with all of the data:
    Set w2 = ThisWorkbook.Sheets("Data")
    For i = 1 To w1.Range("A1").CurrentRegion.Rows.Count
        w2.Cells.Replace What:=w1.Cells(i, 1), Replacement:=w1.Cells(i, 2), LookAt:=xlPart, _
            SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
            ReplaceFormat:=False
    Next i
End Sub

Make 2 arrays
A[bird produk, pig, ayam, kuda] //words to be changed
B[bird product, pork, chicken, horse] //result after changing the word
Now check each row of your Excel sheet and compare it with every element of A. If it matches, replace it with the corresponding element of B.
for example
// not actual code, something like pseudocode
for (i = 1 to number of rows)
{
    for (j = 1 to 200)
    {
        if (contents of row[i] == A[j])
        {
            contents of row[i] = B[j];
            break;
        }
    }
}
To make it fast, stop the current iteration as soon as the word is replaced and move on to the next row.
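A direct translation of that pseudocode into Python might look like the sketch below; the sample rows stand in for the spreadsheet column and are made up for illustration:

```python
A = ['bird produk', 'pig', 'ayam', 'kuda']        # words to be changed
B = ['bird product', 'pork', 'chicken', 'horse']  # replacements

rows = ['pig', 'kuda', 'bird produk', 'pig']      # stand-in for the sheet column

for i, content in enumerate(rows):
    for j, word in enumerate(A):
        if content == word:
            rows[i] = B[j]
            break  # stop as soon as the word is replaced

print(rows)  # -> ['pork', 'horse', 'bird product', 'pork']
```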

Similar idea to #coder_A's, but use a dictionary to do the "translation" for you: the keys are the original words, and the value for each key is what it gets translated to.

For reading and writing xls with Python, use xlrd and xlwt, see http://www.python-excel.org/
A simple xlrd example:
from xlrd import open_workbook

wb = open_workbook('simple.xls')
for s in wb.sheets():
    print('Sheet:', s.name)
    for row in range(s.nrows):
        for col in range(s.ncols):
            print(s.cell(row, col).value)
and for replacing target text, use a dict
replace = {
    'bird produk': 'bird product',
    'pig': 'pork',
    'ayam': 'chicken',
    ...
    'kuda': 'horse'
}
A dict gives you O(1) time complexity (most of the time, if keys don't collide) when checking membership with 'text' in replace; there's no way to get better performance than that.
Since I don't know what your bunch of strings look like, this answer may be inaccurate or incomplete.
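Put together, the per-cell replacement with that dict could look like this sketch (the rows are made up for illustration; a real script would read them from the sheet):

```python
replace = {
    'bird produk': 'bird product',
    'pig': 'pork',
    'ayam': 'chicken',
    'kuda': 'horse',
}

rows = [
    ['ali', 'pig', '3483'],
    ['abu', 'kuda', '3940'],
    ['ahmad', 'bird produk', '0399'],
]

# One O(1) dict lookup per cell, instead of scanning a 200-word list.
cleaned = [[replace.get(cell, cell) for cell in row] for row in rows]
print(cleaned)
```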

Related

Is there a way to strip out white space from excel sheet name

I am looking to import an Excel file and a particular sheet within the file called 'Cover sheet'. How can I ensure that if the sheet name is misspelt, e.g. 'Cover sheet ' (there is an extra space there), the correct sheet is still selected?
This is what I have at the moment:
df = pd.read_excel('../blabla/bla.xlsx', sheet_name='Cover sheet')
A simple space removal is:
text = "english language"
text_without_spaces = text.replace(" ", "")
print(text_without_spaces)
Then you can try importing the one with space and the one without space and handle errors accordingly.
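Rather than trying each spelling in turn, you could normalise the workbook's sheet names first and pick the one that matches while ignoring spaces and case. A stdlib-only sketch (find_sheet and the sample names are hypothetical):

```python
def find_sheet(sheet_names, wanted):
    """Return the first sheet name matching `wanted`, ignoring spaces and case."""
    target = wanted.replace(" ", "").lower()
    for name in sheet_names:
        if name.replace(" ", "").lower() == target:
            return name
    return None  # no match found

names = ['Summary', 'Cover sheet ', 'Data']
print(find_sheet(names, 'Cover sheet'))  # -> 'Cover sheet ' (with the stray space)
```

The returned name can then be passed straight to the reader, e.g. as the sheet_name argument.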
If you want a broader approach for this kind of use case I would advise using (wisely) difflib's SequenceMatcher.
SequenceMatcher will compare two strings and return you a similarity coefficient from 0 (totally different) to 1 (identical).
Here's an example:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

original_text = "english language"
test1_text = "english language"
test2_text = "Englishlanguage"

print(similar(original_text, test1_text))
print(similar(original_text, test2_text))
Output
1.0
0.9032258064516129
Then you could import the Excel file as a whole and compare the sheets names using the above function and act if the ratio is, for example, > than 0.8:
for sheet_name in xls.sheet_names():
    if similar(sheet_name, name_to_compare) > 0.8:
        # do something
Be sure that you take into account false positives.
Might not be exactly what you are looking for, but you can get a list of the sheets in the Excel file and work from there using xlrd:
import xlrd
xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
xls.sheet_names()

Filter rows in Excel file by specific words

I have been struggling to devise Python code that will search for 'N' words in an Excel file. Wherever any of the 'N' words exist, the code should output the entire row in which they exist. I am searching for multiple word occurrences in an Excel file.
Assume an Excel file of this type (say it is called File.xlsx):
ID   Date        Time      Comment
123  12/23/2017  11:10:02  Trouble with pin
98y  01/17/2016  12:45:01  Great web experience. But I had some issues.
76H  05/39/2017  09:55:59  Could not log into the portal.
The question, in light of the above data is:
If I were to search for the words 'pin' and 'log' and find them in the above Excel file, I want my Python code to output line 1 and, below it, line 3.
Conceptually, I can think of ways to solve this, but the Python implementation befuddles me. Furthermore, I have extensively searched in Stack Overflow but could not find a post that addressed this question.
Any and all help is deeply appreciated.
There are many ways you could accomplish this, as there are many Python packages to read Excel files (http://www.python-excel.org/), but xlrd may be the most straightforward:
import xlrd  # package to read Excel files

book = xlrd.open_workbook("File.xls")  # open the file
sh = book.sheet_by_index(0)            # get the first sheet
words = ['pin', 'log']                 # list of words to search for
for rx in range(sh.nrows):             # for each row in the file
    for word in words:                 # for each word in the list
        if word in str(sh.row(rx)):    # check if the word is in the row
            print('line', rx)          # if so, print the row number
outputs:
line 1
line 3
Here is a solution using openpyxl module which I've been using successfully for many projects.
Row indexing starts at one, including the header row; hence, if you don't want to count the header, reduce the index by one (row - 1).
from openpyxl import load_workbook

wb = load_workbook(filename='afile.xlsx')
ws = wb.active

search_words = ['pin', 'log']
for row in range(1, ws.max_row + 1):
    for col in range(1, ws.max_column + 1):
        _cell = ws.cell(row=row, column=col)
        if any(word in str(_cell.value) for word in search_words):
            print("line {}".format(row - 1))
            break
>>>
line 1
line 3
If you want to output the actual lines, just add the following print_row function:
def print_row(row):
    line = ''
    for col in range(1, ws.max_column + 1):
        _cell = ws.cell(row=row, column=col).value
        if _cell:
            line += ' ' + str(_cell)
    return line
And replace print("line {}".format(row - 1)) with print(print_row(row))
>>>
123 2017-12-23 00:00:00 11:10:02 Trouble with pin
76H 05/39/2017 09:55:59 Could not log into the portal.
>>>

Comparing csv files in python to see what is in both

I have 2 csv files that I want to compare one of which is a master file of all the countries and then another one that has only a few countries. This is an attempt I made for some rudimentary testing:
chars = {}
with open('all.csv', 'r') as lookupfile:
    for number, line in enumerate(lookupfile):
        chars[line.strip()] = number

with open('locations.csv') as textfile:
    text = textfile.read()
    print(text)
    for char in text:
        if char in chars:
            print("Country found {0} found in row {1}".format(char, chars[char]))
I am trying to get a final output of the master file of countries with a secondary column indicating whether each one came up in the other list.
Thanks!
Try this:
Write a function to turn the CSV into a Python dictionary containing as keys each of the country you found in the CSV. It can just look like this:
{'US':True, 'UK':True}
Do this for both CSV files.
Now, iterate over the dictionary.keys() for the csv you're comparing against, and just check to see if the other dictionary has the same key.
This will be an extremely fast algorithm because dictionaries give us constant time lookup, and you have a data structure which you can easily use to see which countries you found.
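The dictionary comparison described above can be sketched with plain dicts; the country lists are inlined here in place of the two CSV files:

```python
# Keys found in the master CSV and in the smaller CSV, respectively.
master = {'US': True, 'UK': True, 'France': True, 'Japan': True}
subset = {'UK': True, 'Japan': True}

# For each country in the master file, flag whether it appears in the other list.
report = {country: (country in subset) for country in master}
print(report)
```

Each `country in subset` check is a constant-time dict lookup, which is what makes this fast even for large files.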
As Eric mentioned in comments, you can also use set membership to handle this. This may actually be the simpler, better way to do this:
set1 = set()  # a new empty set
set1.add("country")
if "country" in set1:
    # do something
You could use exactly the same logic as the original loop:
with open('locations.csv') as textfile:
    for line in textfile:
        country = line.strip()
        if country in chars:
            print("Country found {0} found in row {1}".format(country, chars[country]))

A solution to remove the duplicates?

My code is below. Basically, I've got a CSV file and a text file "input.txt". I'm trying to create a Python application which will take the input from "input.txt" and search through the CSV file for a match and if a match is found, then it should return the first column of the CSV file.
import csv

csv_file = csv.reader(open('some_csv_file.csv', 'r'), delimiter=",")
header = next(csv_file)
data = list(csv_file)

input_file = open("input.txt", "r")
lines = input_file.readlines()

for row in lines:
    inputs = row.strip().split(" ")
    for input in inputs:
        input = input.lower()
        for row in data:
            if any(input in terms.lower() for terms in row):
                print(row[0])
Say my CSV file looks like this:
book title, author
The Rock, Herry Putter
Business Economics, Herry Putter
Yogurt, Daniel Putter
Short Story, Rick Pan
And say my input.txt looks like this:
Herry
Putter
Therefore when I run my program, it prints:
The Rock
Business Economics
The Rock
Business Economics
Yogurt
This is because it searches for all titles with "Herry" first, and then searches all over again for "Putter". So in the end, I have duplicates of the book titles. I'm trying to figure out a way to remove them...so if anyone can help, that would be greatly appreciated.
If the original order does not matter, then stick the results into a set first and print them out at the end. Your example is small enough that speed does not matter much.
Stick the results in a set (which is like a list but only contains unique elements), and print at the end.
Something like:
if any(input in terms.lower() for terms in row):
    if row[0] not in my_set:
        my_set.add(row[0])
During the search stick results into a list, and only add new results to the list after first searching the list to see if the result is already there. Then after the search is done print the list.
First, get the set of search terms you want to look for in a single list. We use set(...) here to eliminate duplicate search terms:
search_terms = set(open("input.txt", "r").read().lower().split())
Next, iterate over the rows in the data table, selecting each one that matches the search terms. Here, I'm preserving the behavior of the original code, in that we search for the case-normalized search term in any column for each row. If you just wanted to search e.g. the author column, then this would need to be tweaked:
results = [row for row in data
           if any(search_term in item.lower()
                  for item in row
                  for search_term in search_terms)]
Finally, print the results.
for row in results:
    print(row[0])
If you wanted, you could also list the authors or any other info in the table. E.g.:
for row in results:
    print('%30s (by %s)' % (row[0], row[1]))

Searching CSV Files (Python)

I've made this CSV file up to play with.. From what I've been told before, I'm pretty sure this CSV file is valid and can be used in this example.
Basically I have this CSV file 'book_list.csv':
name,author,year
Lord of the Rings: The Fellowship of the Ring,J. R. R. Tolkien,1954
Nineteen Eighty-Four,George Orwell,1984
Lord of the Rings: The Return of the King,J. R. R. Tolkien,1954
Animal Farm,George Orwell,1945
Lord of the Rings: The Two Towers, J. R. R. Tolkien, 1954
And I also have this text file 'search_query.txt', whereby I put in keywords or search terms I want to search for in the CSV file:
Lord
Rings
Animal
I've currently come up with some code (with the help of stuff I've read) that allows me to count the number of matching entries. I then have the program write a separate CSV file 'results.csv' which just returns either 'Matching' or ' '.
The program then takes this 'results.csv' file and counts how many 'Matching' results I have and it prints the count.
import csv
import collections

f1 = open('book_list.csv', 'r')
f2 = open('search_query.txt', 'r')
f3 = open('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

input = [row for row in c2]

for booklist_row in c1:
    row = 1
    found = False
    for input_row in input:
        results_row = []
        if input_row[0] in booklist_row[0]:
            results_row.append('Matching')
            found = True
            break
        row = row + 1
    if not found:
        results_row.append('')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()

d = collections.defaultdict(int)
with open("results.csv", "r") as info:
    reader = csv.reader(info)
    for row in reader:
        for matches in row:
            matches = matches.strip()
            if matches:
                d[matches] += 1

results = [(matches, count) for matches, count in d.items() if count >= 1]
results.sort(key=lambda x: x[1], reverse=True)
for matches, count in results:
    print('There are', count, 'matching results.')
In this case, my output returns:
There are 4 matching results.
I'm sure there is a better way of doing this and avoiding writing a completely separate CSV file.. but this was easier for me to get my head around.
My question is, this code that I've put together only returns how many matching results there are.. how do I modify it in order to return the ACTUAL results as well?
i.e. I want my output to return:
There are 4 matching results.
Lord of the Rings: The Fellowship of the Ring
Lord of the Rings: The Return of the King
Animal Farm
Lord of the Rings: The Two Towers
As I said, I'm sure there's a much easier way to do what I already have.. so some insight would be helpful. :)
Cheers!
EDIT: I just realized that if my keywords were in lower case, it wouldn't work. Is there a way to avoid case-sensitivity?
Throw away the query file and get your search terms from sys.argv[1:] instead.
Throw away your output file and use sys.stdout instead.
Append matched booklist titles to a result_list. The result_row that you currently have has a rather misleading name. The count that you want is len(result_list). Print that. Then print the contents of result_list.
Convert your query words to lowercase once (before you start reading the input file). As you read each book_list row, convert its title to lowercase. Do your matching with the lowercase query words and the lowercase title.
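Put together, the advice above might look like the sketch below. The search terms and rows are passed in directly so it stays self-contained; a real script would read sys.argv[1:] and the CSV file (match_titles and the sample books are hypothetical):

```python
def match_titles(rows, query_words):
    """Return booklist titles that contain any of the query words, case-insensitively."""
    words = [w.lower() for w in query_words]        # lowercase the query once
    result_list = []
    for row in rows:
        title = row[0]
        if any(w in title.lower() for w in words):  # compare in lowercase
            result_list.append(title)
    return result_list

books = [
    ['Animal Farm', 'George Orwell', '1945'],
    ['Lord of the Rings: The Two Towers', 'J. R. R. Tolkien', '1954'],
]
matches = match_titles(books, ['LORD', 'animal'])
print('There are', len(matches), 'matching results.')
for title in matches:
    print(title)
```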
Overall plan:
Read in the entire book list csv into a dictionary of {title: info}.
Read in the query terms. For each keyword, filter the dictionary:
[key for key, value in books.items() if "Lord" in key]
say. Do what you will with the results.
If you want, put the results in another csv.
If you want to deal with casing issues, try turning all the titles to lowercase ("FOO".lower()) when you store them in the dictionary.
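A sketch of that plan, with the book list inlined as a {title: info} dictionary and titles stored lowercased to sidestep the casing issue (the data here is abbreviated from the question):

```python
books = {
    'lord of the rings: the fellowship of the ring': ('J. R. R. Tolkien', 1954),
    'nineteen eighty-four': ('George Orwell', 1984),
    'animal farm': ('George Orwell', 1945),
}

keywords = ['Lord', 'Animal']

results = set()  # a set so a title matched by several keywords appears once
for kw in keywords:
    kw = kw.lower()  # titles were stored lowercased
    results.update(key for key in books if kw in key)

print(sorted(results))
```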
