Locating a row based on a cell value - python

//EDIT: This question is kind of a sub-question. For a shorter and better example, which has better replies, check This Post
I'm very new to Python and even newer to pandas.
I've been working with them for about a month and I think I have most of the basics down.
My current task is to write values into certain cells, at certain positions inside an xlsx file.
Situation
I have a very big Excel file containing various data, from names to
email addresses and everything else. I also have two lists (.txt files)
containing the same email addresses as the Excel file; those
emails were verified against certain security checks, which they either
pass or fail. Depending on the outcome, each address was stored in either the
"Secured.txt" or the "Unsecured.txt" file.
To read and write the Excel file, I use pandas.
Task
Next to the 'Emails' column in the Excel file there is a column in which you mark whether the email is secured or unsecured. My actual task is to insert those entries, depending on which text file the email is in.
Possible Solution
My approach to solving this problem is to read each .txt file and store the email addresses in a list, using a for loop. Iterating through those emails, I now want to look up the location of each email address inside the Excel file and access the cell right next to it: same row, different column. Since the emails were already sorted by their security validation, I can simply put the corresponding value into the validation cell right next to the email.
Question
My question is the following: how do I find a specific row based on a value in it?
I want to find the cell whose content equals the current element of "mails" (the loop variable "item"), so I can move over to the cell right next to it. Since I know all the column names, I really just need the index of the row the email is in. I have the x-coordinate and need the y-coordinate.
Example
What I have up until now is the readout of the .txt-file:
import pandas as pd
import os
import re

# fetch the mail addresses out of the list file
with open('Protected/Protected G.txt', 'r') as file:
    # regex pattern to pick out the mail addresses
    rgx = r'\S+#\S+'
    # read the file and join the list of lines into a single string
    content = file.readlines()
    content_str = ''.join(content)
    # pull the mails out of the string with the regex
    mails = re.findall(rgx, content_str)

# each mail address in turn
for item in mails:
    print(item)
This dummy dataframe represents the Excel sheet I'm working with:
Dummy-Dataframe:
   Forename  Last Name  Email            Protection
1  John      Kennedy    John#gmx.net
2  Donald    Trump      Donald#gmx.net
3  Bill      Clinton    Bill#gmx.net
4  Richard   Nixton     Richard#gmx.net
I now want to pass the actual address, stored in the variable 'item', to some kind of "locate" function in pandas in order to find out which row the email is in. As soon as I know the row, I can tell pandas to write either an "x", meaning the mail is protected, or an "o", meaning the mail is unprotected, into the very next column.
My finished dataframe could look like this:
Finished Dataframe:
   Forename  Last Name  Email            Protection
1  John      Kennedy    John#gmx.net     x
2  Donald    Trump      Donald#gmx.net   o
3  Bill      Clinton    Bill#gmx.net     x
4  Richard   Nixton     Richard#gmx.net  x
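A sketch of the kind of lookup I imagine (untested; 'df' and the column names are just my assumptions from the dummy dataframe above):

# assuming df is the dataframe above and item holds one address from mails
row_idx = df.index[df['Email'] == item]   # index of the row containing the address
df.loc[row_idx, 'Protection'] = 'x'       # write into the neighbouring column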
I really appreciate the help.

To make sure I understand: you have one text file for protected emails and one for unprotected. I am making a large assumption that you never have an email in both.
import pandas as pd
import numpy as np

df = pd.read_csv('Protected/Protected G.txt', header=None, sep=" ")
df.columns = ['Protected Emails']
df2 = pd.read_excel('dummy-excel')

# isin gives a boolean Series; np.where maps True -> 'x' and False -> 'o' per row
df2['Protection'] = np.where(df2['Email'].isin(df['Protected Emails']), 'x', 'o')

writer = pd.ExcelWriter('ProtectedEmails.xlsx')
df2.to_excel(writer, 'Sheet1')  # or whatever you want to name your sheet
writer.save()
maybe something like that, though I don't know what the text file of emails looks like.
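Alternatively, if you prefer to write the flag row by row, a sketch with .loc and a boolean mask (same assumed names as above):

# for each protected address, write 'x' into the Protection column of its row
for item in df['Protected Emails']:
    df2.loc[df2['Email'] == item, 'Protection'] = 'x'
# rows that were never matched stay empty; fill them with 'o' afterwards
df2['Protection'] = df2['Protection'].fillna('o')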

Your question is a bit different from the body of your post. This is a simple answer that might, somehow, be useful.
Assume that this is a dataframe:
Z = pd.DataFrame([1,2,4,6])
Now, let us access the number 4. There is a single column. By default, the first column is assigned the name 0 as its heading. The required number, 4, is in the third place of the dataframe. Since Python starts the indexes of lists, dataframes, arrays, etc. at 0, the index of the number 4 is 2.
print(Z[0][2])
This would output 4.
Try applying the same thing to your data. Just make sure you know the names of the headings. Sometimes they are not numbers, but strings.
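For reference, the same lookup can be written with the label-based accessors .loc and .at (a minimal sketch on the toy dataframe above):

import pandas as pd

Z = pd.DataFrame([1, 2, 4, 6])
print(Z[0][2])      # column label 0, then row label 2 -> 4
print(Z.loc[2, 0])  # row label first with .loc -> 4
print(Z.at[2, 0])   # .at for a single scalar -> 4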

Related

Can I amend one data sheet to match another data frame's ID that are almost similar?

I have multiple data frames to compare. My problem is the product IDs. One is set up like:
000-000-000-000
vs.
000-000-000
(gross)
I have looked on here, Reddit, YouTube, and even went deep down the rabbit hole trying .join, .append, and some other methods I've never seen before or don't even understand yet. Is there a way (or even better, some documentation I can read to learn this) to pull the product ID from the main Excel sheet and compare it to the one(s) that should match? Then I will more than likely make the IDs consistent across all sheets. That way I can use those IDs as the index and do a side-by-side compare of the ID to the row data. Each ID has about 113 values to compare; that's 113 columns per row, if that makes sense.
Example (the colorful columns are the main sheet that the non-colored column will be compared to):
Additional notes:
The highlighted yellow IDs are "unique", and I won't be changing those; instead I'll write them to a list or something and use an if statement to ignore them when found.
Edit:
So I wrote this code, which does almost exactly what I need.
It takes out the "-", which I apply to all my IDs. I just need to make a list of the IDs that are unique, to skip over when taking away the zeros:
dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "")
Then this will keep only the first 9 digits, except for the unique IDs:
dfSS["Product ID"] = dfSS["Product ID"].str[:9]
Will add the full code below here once i get it to work 100%
I am now trying to figure out how to say something like:
lst = [1, 2, 3, 4, 5]
if dfSS["Product ID"] not in lst:
    dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "").str[:9]
This code does not work, but every day I get closer and closer to being able to compare these similar yet different data frames. The lst is just an example of the 000-000-000 product IDs in a list that I do not want to filter at all, but keep in the data frame.
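For what it's worth, a sketch of the skip-list logic I'm after, using a boolean mask with isin (dfSS and lst as above; untested against the real data):

# rows whose Product ID is NOT in the skip-list get the dash-stripping treatment
keep = ~dfSS["Product ID"].isin(lst)
dfSS.loc[keep, "Product ID"] = (
    dfSS.loc[keep, "Product ID"].str.replace("-", "", regex=False).str[:9]
)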
If the ID transformation is predictable, then one option is to use regex to homogenize the IDs. For example, if the situation is just removing the first three digits, then something like the following can be used:
df['short_id'] = df['long_id'].str.extract(r'\d\d\d-([\d-]*)')
If the ID transformation is not so predictable (e.g. due to transcription errors or some other noise in the data) then the best option is to first disambiguate the ID transformation using something like recordlinkage, see the example here.
Ok, solved this for every product ID with or without dashes, #, letters, etc.:
(\d\d\d-)?[_#\d-]?[a-zA-Z]?
(\d\d\d-)? - this handles the first and second three-integer sets: zero or one match of three digits followed by a dash (optional)
[_#\d-]? - this is for any special chars and additional numbers (optional)
[a-zA-Z]? - this one, not sure why, but I had to separate it from the last part because otherwise it wouldn't pick up every letter (optional)
With the above I solved everything I needed for RE.
Where I learned how to improve my RE skills:
RE Documentation
Automate the Boring Stuff - Ch. 7
You can test your REs here
An additional way to write it, included to show there is no single way of doing it. RE is super awesome:
(\d{3}-)?[_#\d{3}-]?[a-zA-Z]?

Removing excess tabs from .txt file after user-input-error

We receive multiple .txt files every night from our ERP. Sometimes we have product names ending in a TAB, after the person who inserts the product name has copy-pasted it from somewhere else. Long story short, this breaks the process: an automated Python script performs very modest cleaning and then inserts the data into our MySQL database.
The script that imports them to our database errors out and breaks when this happens, as the extra tab makes one row in the file one column longer, and I need to keep that from happening because when it does it breaks our BI reporting.
I've thought of some rules for how to pinpoint where the user input error is in the file. I reckon the right way would be to write a Python script that imports the .txt file as a pandas dataframe, finds all rows where the [amount] column is blank, and then fixes said rows. Unfortunately, to my understanding the fixing can't happen in pandas: by the time I import the file into a pandas dataframe the problem has already happened, and it needs to be fixed prior to importing, unless it is somehow possible to remove the blank cell from column X and move all the other columns one step back, filling the void left behind. This is what happens with the error rows:
So I need to find a way to either move all the cells one step back (left) when column X is blank, or some other way. All help is welcome.
EDIT:
I suppose there is a way afterall to do this in pandas with shift, if anyone can assist on how to make it shift when columnX is blank, would be greatly appreciated!
EDIT2:
Here are headers in the .txt file, and 2nd row which is fine, and 3rd row which errors out:
tilausnro tasiakasnro ttkoodi lasiakasnro ltkoodi tilpvm myyja kasittelija myypri toiala tila tyonro toimpvm tryhma tuote nimi maara hinta valuutta mtili kpka s.posti kirjpvm aspvm ensvahaspvm vahvpvm tulpvm
100000-1 121007 121007 20-10-15 oer oer 8 100000-1 27-10-15 2100 ESP_734249 Wisby Hopfwis. Wei 5,6% 50EG Buk 150000 2032,26 SEK 3350 2 20-10-15 30-10-15 ? ? ?
500072-2 121110 121110 20-10-20 jra NTA 1 500072-2 21-10-20 2000 EVILN_007 Kwas Ostrabramski 0,5l back 60000 82,8 3350 600 20-10-20 23-10-20 ? ? ?
Managed to fix this with a little help from kind people at Discord.
lst = open('directory//filename.txt').readlines()
fixed = []
for line in lst:
    inner = line.split('\t')  # string to list
    if inner[16] == '':
        inner.pop(16)         # drop the extra blank field
    inner = "\t".join(inner)  # list back to string
    fixed.append(inner)
with open("directory//filename.txt", "w") as output:
    for item in fixed:
        output.write("%s" % item)

Skipping Empty Values in a Python Directed Mail Merge

I am running a mail merge from Excel to Word utilizing Python (openpyxl). I'm running into a problem of blank values being merged in as a single space ' ' rather than showing a true blank as they normally would. I have a numbered list that pulls 8 different merge fields (each on a new line) and should skip the number/line if the cell is blank. Is it possible to make openpyxl treat an empty cell as a true blank value rather than a blank space, which Word then merges in? A snippet of the mail merge code is below:
from __future__ import print_function
import os
import openpyxl
from mailmerge import MailMerge
from datetime import date

os.chdir(r'CURRENT WORKING FOLDER')
wb = openpyxl.load_workbook('FullMerge.xlsm', data_only=True)
sheet = wb["Database"]
max_col = 104
sheet.delete_rows(sheet.min_row, 1)
template = "FullMerge.docx"
document1 = MailMerge("FullMerge.docx")

First = str(sheet.cell(row=1, column=1).value or '')
Second = str(sheet.cell(row=1, column=2).value or '')
Third = str(sheet.cell(row=1, column=3).value or '')

document1.merge(
    First=First,
    Second=Second,
    Third=Third
)
document1.write("FinishedMerge.docx")
EXAMPLE:
If the value in Second is blank and I manually mail merge, I get:
First Text
Third Text
If the value in Second is blank and I Python mail merge, I get:
First Text
'single blank space'
Third Text
Edited to take account of the revised question.
Here, a value of '' (empty string) merges as an empty string, as I would expect, not as ' ' (i.e. a space). So I assume the problem is really that docx-mailmerge does not do the same suppression of empty lines that Word does. I don't think it actually has anything to do with openpyxl.
The code for docx-mailmerge is quite small - it's a few hundred lines of Python - and it only really does substitution of { MERGEFIELD } fields with the values you provide. It doesn't really deal with any other merge field constructs such as IF fields or field switches. If it processed IF fields then you could deal with line suppression using that mechanism, but I think it would need quite a substantial change to the docx-mailmerge code to do it. (The code would probably have to "remember" where it had done every substitution, then concatenate all the <w:t/> elements within a <w:p/> (paragraph) element and remove the <w:p/> if (a) there was no text or other object in it, and (b) removing the <w:p/> element did not result in an invalid .docx.)
Otherwise, it's a question of whether or not there is another library that does what you need. (Off-topic question here, unfortunately!)
Original text of the Answer:
"None" is the value of the Python "NoneType" variable, and tells us that openpyxl is interpreting an empty cell as "NoneType". Actually, I am sure there are a lot of things wrong with what I just said from a Python point of view, but
a. it's actually a good thing that openpyxl returns "None" in this scenario because it allows you to decide what you really want to insert in your merge. For some field types, for example, you might want to insert "0"
b. There is a discussion about how to deal with None here . In the specific example you give, you could use this
myval = str(sheet.cell(row = 1, column = 1).value or '')
but IMO some people would be happier to use a function that made it more obvious what was going on.
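For instance, a small helper along these lines (the name cell_text is just an illustration):

def cell_text(sheet, row, column):
    # return the cell value as a string, mapping an empty cell (None) to ''
    value = sheet.cell(row=row, column=column).value
    return '' if value is None else str(value)

First = cell_text(sheet, 1, 1)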

python Win32 Excel is cell a range

I am writing a bit of Python code to automate the manipulation of Excel spreadsheets. The idea is to use spreadsheet templates to create daily reports. I saw this idea working several years ago using Perl. Anyway.
Here are the simple rules:
Sheets within the workbook are processed in the order they appear.
Within each sheet, cells are processed left to right, then top to bottom.
There are defined names which are single-cell ranges; they can contain static values or the results of queries. Cells can contain comments which contain SQL queries to run. ...
Here is the problem: as I process the cells I need to check whether the cell has an attached comment and whether the cell has a name. I am able to handle processing the attached cell comments, but I cannot figure out how to determine whether a cell is within a named range - in my case, a single-cell range.
I saw a posting that suggested this would work:
cellName = ws.ActiveCell.Name.Name
No luck.
Does anybody have any idea how to do this?
I am so close but no cigar.
Thanks for your attention to this matter.
KD
What you may consider doing is first building a list of all the addresses of names in the worksheet, then checking the address of each cell against that list to see if it's named.
In VBA, you obtain the names collection (all the names in a workbook) this way:
Set ns = ActiveWorkbook.Names
You can determine whether the names point to part of the current sheet, and to a single cell, this way:

shname = ActiveSheet.Name
Dim SheetNamedCellAddresses(1 To ActiveWorkbook.Names.Count) As String
i = 1
For Each n In ns
    If Split(n.Value, "!")(0) = "=" & shname And InStr(n.Value, ":") = 0 Then
        ' The name's Value is something like "=Sheet1!A1".
        ' If there is no colon, it is a single cell, not a range of cells.
        SheetNamedCellAddresses(i) = Split(n.Value, "=")(1) ' Add the address to the array, dropping the "="
        i = i + 1
    End If
Next
So now you have a string array containing the addresses of all the named cells in your current sheet. Move that array into a python list and you are good to go.
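Translated to the win32com side, a rough sketch of the same idea (assuming xlApp is your Excel Application object, as in the question):

wb = xlApp.ActiveWorkbook
sheet_name = wb.ActiveSheet.Name
named_cells = []
for n in wb.Names:
    ref = n.Value  # e.g. "=Sheet1!$A$1"
    # restrict to names on the current sheet that refer to a single cell
    if ref.startswith("=" + sheet_name + "!") and ":" not in ref:
        named_cells.append(ref.split("=", 1)[1])  # drop the leading "="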
OK, so it errors out if the cell does NOT have a range name. If the cell has a range name, the following bit of code returns the name. Great success!
ws.Cells(r, c).Activate()
cell = xlApp.ActiveCell
cellName = cell.Name.Name
If there is no name associated with the cell, an exception is thrown.
So even in VBA you would have to wrap this bit of code in exception handling. It seems expensive to me to use exception processing for this call.
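A sketch of that wrapper on the Python side (pywin32 raises pywintypes.com_error for COM failures; ws, xlApp as above):

import pywintypes

def name_of_cell(ws, xlApp, r, c):
    # return the range name of cell (r, c), or None if the cell is unnamed
    ws.Cells(r, c).Activate()
    try:
        return xlApp.ActiveCell.Name.Name
    except pywintypes.com_error:
        return None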

Sorting rows into groups based on shared values

I have a CSV with a large number of rows, from a user-submitted form. Each row includes a user email and a field for them to list the other user emails in their group. I've written a short script so far, using Python and pandas, that loads the CSV into a dataframe and cleans up the entries.
I want to sort the rows by group, but am running into a few conceptual problems. Since the data is user-entered, the lists are not necessarily complete or correctly spelled. What's the best way to deal with this? I'm entirely new to parsing data like this and rather inexperienced overall.
Here's some example data to show what I mean:
email,group
user1#a.com, "['user4#b.com','user3#c.com']"
user2#a.com,
user3#c.com, "['user1#a.com']"
user4#b.com, "['user1#a.com','user3#b.com']"
So here user1, user3, and user4 are in a group. The problem is that user3 only listed user1.
My first thought was to append the submitting user's email to the group list, then sort the list and then the column alphabetically. However, that only works if everyone's group entries are complete.
I'd rather not pick out 200 groups by hand, but I'm lost as to how to proceed.
This is my current plan in pseudocode:
data           # dataframe containing imported CSV
sorted_groups  # result dataframe with equivalent rows, but sorted into groups

sort(data) by len(data[group])
for each row in data:
    append row to sorted_groups
    search for rows where email == entry in groups
    append matching rows to sorted_groups
    remove matching rows from data
    remove initial row from data
This will definitely fail on misspellings, and only works if at least one person in the group got everything right. It's the best I have at the moment, though.
Thanks for taking the time to read this. Please let me know if I can clarify anything, and point me in the right direction!
I'm not sure how your data is stored, so I'm writing this assuming you have a list of rows of data, where each row contains all of the email addresses entered in the form, e.g.,
rows = [['user1#a.com', 'user4#b.com', 'user3#c.com'],
        ['user2#a.com'],
        ['user3#c.com', 'user1#a.com'],
        ['user4#b.com', 'user1#a.com', 'user3#b.com']]
I'm also assuming that each user belongs to one and only one group, each user has submitted the form, and each user did not misspell their email.
We can obtain a set of valid email addresses using
valid = {row[0] for row in rows}
We can build a dictionary mapping users to groups, merge groups as we go, and remove invalid emails.
ugDict = {}
for row in rows:
    mergedGroup = set(row) & valid        # drop invalid (mistyped) addresses
    for user in row:
        if user in ugDict:
            mergedGroup |= ugDict[user]   # merge with any group we already know about
    for user in mergedGroup:
        ugDict[user] = mergedGroup
This will result in a mapping from users to groups, with any mistyped email addresses dropped by the intersection with valid. You'll have to decide how to handle the mistyped ones -- you might just want to ignore them.
Now, to get a sorted list of groups, create a set of all groups, and use the sorted function.
sortedGroups = sorted({frozenset(g) for g in ugDict.values()})
frozenset(g) makes Python's set object hashable, so the groups can themselves be collected into a set (deduplicating them) before sorting.
The result?
sortedGroups = [frozenset({'user2#a.com'}),
                frozenset({'user1#a.com', 'user3#b.com', 'user4#b.com'})]
