//EDIT: This question is kind of a sub-question. For a shorter and better example, which has better replies, check This Post
I'm very new into python and even newer into pandas.
I'm working with it for at least a month and I think I got most of the basics together.
My current task is to write values into a certrain cell, in a certain space inside of an xslx-file.
Situation
I have a very big excel-file including various data, from names to
email-adresses and everything. As well I have two lists (.txt-files)
with the same email-adresses of the excel-file in it, yet those
emails got verified either if they match certain security-cheks or
not. Depending on the outcome, they got stored inside of the
"Secured.txt" or the "Unsecured.txt" file.
To write and read in the excel-file, I use pandas.
Task
Next to the 'Emails'-column in the excel file, there is a column in which you mark with an entry either if the email is secured, or unsecured. My actual task is to insert those entrys, depending in which text-file the email lies.
Possible Solution
My approach to solve this problem is to read out each .txt-file and store each email-adress in a variable using a list and a for-loop. Iterating through those emails, I know want to look for the location of the email-adress inside of the excel-file and access the cell right next to it. Same row, different column. Since the emails got sorted matching for their security-validation before, I just can put in the according value into the validation-cell right next to the email.
Question
My question is the following: How do I approach a specific row based on a value in it?
I want to find the place of the cell which includes the actual content of the variable "mails", so I can move over to the cell right next to it. Since I know all the names of the columns, I actually just need the index of the row in which the email lies. I got the x-coordinate and need the y-coordinate.
Example
What I have up until now is the readout of the .txt-file:
import pandas as pd
import os
import re
#fetching the mail adress through indexnumber out of the list
with open('Protected/Protected G.txt', 'r') as file:
#creating the regex pattern to sort out the mail adresses
rgx = '\S+#\S+'
#read the file and convert the list into a string
content = file.readlines()
content_str = ''.join(content)
#get the mails out of the "list" with regex
mails = re.findall(rgx, content_str)
#put each mailadress in a variable
for item in mails:
print(item)
This dummy-dataframe represents the excel sheet I'm working with:
Dummy-Dataframe:
Forename Last Name Email Protection
1 John Kennedy John#gmx.net
2 Donald Trump Donald#gmx.net
3 Bill Clinton Bill#gmx.net
4 Richard Nixton Richard#gmx.net
I know want to pass the actual adress, stored in the variable 'item', to some kind of "locate"-function of pandas in order to find out in which row the actual email lies. As soon as I know in which row the adress lies, I can now tell pandas to write either an "x", saying the mail is protected, or an "o", meaning the mail is unprotected, in the very next column.
My finished dataframe could look like this:
Finished Dataframe:
Forename Last Name Email Protection
1 John Kennedy John#gmx.net x
2 Donald Trump Donald#gmx.net o
3 Bill Clinton Bill#gmx.net x
4 Richard Nixton Richard#gmx.net x
I really appreciate the help.
To make sure I understand you have a text file for protected and one for unprotected. I am making a large assumption you never have an email in both.
import pandas as pd
df = pd.read_csv('Protected/Protected G.txt', header = None, sep = " ")
df.columns = ['Protected Emails']
df2 = pd.read_excel('dummy-excel')
if df2['Email'].isin(df) :
df2['Protection'] = 'x'
else :
df2['Protection'] = 'o'
writer = pd.ExcelWriter('ProtectedEmails.xlsx')
df2.to_excel(writer,'Sheet1') #or whatever you want to name your sheet
writer.save()
maybe something like that, though I don't know what the text file of emails looks like.
Your question is different from the content. This is a simple answer might, somehow, be useful.
Assume that this is a dataframe:
Z = pd.DataFrame([1,2,4,6])
Now, let us access to number 4. There is a single column. Usually, the first column is assigned the name 0 as a heading. The required number, 4, exists in the third place of the dataframe. As python starts the indexes of lists, dfs, arrays.. etc from 0, then the number of index of number 4 is 2.
print(Z[0][2])
This would output [4]
Try applying the same thing on your data. Just male sure to know the names of the headings. Sometimes they are not numbers, but strings.
My code is
import pymysql
conn=pymysql.connect(host=.................)
curs=conn.cursor()
import csv
f=open('./kospilist.csv','r')
data=f.readlines()
data_kp=[]
for i in data:
data_kp.append(i[:-1])
c = csv.writer(open("./test_b.csv","wb"))
def exportFunc():
result=[]
for i in range(0,len(data_kp)):
xp="select date from " + data_kp[i] + " where price is null"
curs.execute(xp)
result= curs.fetchall()
for row in result:
c.writerow(data_kp[i])
c.writerow(row)
c.writerow('\n')
exportFunc()
data_kp is reading the tables name
the tables' names are like this (string, ex: a000010)
I collect table names from here.
Then, execute and get the result.
The actual output of my code is ..
My expectation is
(not 3 columns.. there are 2000 tables)
I thought my code is near the answer... but it's not working..
My work is almost done, but I couldn't finish this part.
I had googled for almost 10 hours..
I don't know how.. please help
I think something is wrong with these part
for row in result:
c.writerow(data_kp[i])
c.writerow(row)
The csvwriter.writerow method allows you to write a row in your output csv file. This means that once you have called the writerow method, the line is wrote and you can't come back to it. When you write the code:
for row in result:
c.writerow(data_kp[i])
c.writerow(row)
You are saying:
"For each result, write a line containing data_kp[i] then write a
line containing row."
This way, everything will be wrote verticaly with alternation between data_kp[i] and row.
What is surprising is that it is not what we get in your actual output. I think that you've changed something. Something like that:
c.writerow(data_kp[i])
for row in result:
c.writerow(row)
But this has not entirely solved your issue, obviously: The names of the tables are not correctly displayed (one character on each column) and they are not side-by-side. So you have 2 problems here:
1. Get the table name in one cell and not splitted
First, let's take a look at the documentation about the csvwriter:
A row must be an iterable of strings or numbers for Writer objects
But your data_kp[i] is a String, not an "iterable of String". This can't work! But you don't get any error either, why? This is because a String, in python, may be itself considered as an iterable of String. Try by yourself:
for char in "abcde":
print(char)
And now, you have probably understood what to do in order to make the things work:
# Give an Iterable containing only data_kp[i]
c.writerow([data_kp[i]])
You have now your table name displayed in only 1 cell! But we still have an other problem...
2. Get the table names displayed side by side
Here, it is a problem in the logic of your code. You are browsing your table names, writing lines containing them and expect them to be written side-by-side and get columns of dates!
Your code need a little bit of rethinking because csvwriter is not made for writing columns but lines. We'll then use the zip_longest function of the itertools module. One can ask why don't I use the zip built-in function of Python: this is because the columns are not said to be of equal size and the zip function will stop once it reached the end of the shortest list!
import itertools
c = csv.writer(open("./test_b.csv","wb"))
# each entry of this list will contain a column for your csv file
data_columns = []
def exportFunc():
result=[]
for i in range(0,len(data_kp)):
xp="select date from " + data_kp[i] + " where price is null"
curs.execute(xp)
result= curs.fetchall()
# each column starts with the name of the table
data_columns.append([data_kp[i]] + list(result))
# the * operator explode the list into arguments for the zip function
ziped_columns = itertools.zip_longest(*data_columns, fillvalue=" ")
csvwriter.writerows(ziped_columns)
Note:
The code provided here has not been tested and may contain bugs. Nevertheless, you should be able (by using the documentation I provided) to fix it in order to make it works! Good luck :)
I am writing a bit of Python code to automate the manipulation of Excel spreadsheets. The idea is to use spreadsheet templates to create daily reports. Saw this idea working several years ago using Perl. Anyway.
Here are the simple rules:
Sheets with the Workbook are process in the order they appear.
Within the sheets cells are process left to right, then top to bottom.
There are names defined which are single cell ranges, can contain static values or the results of queries. Cells can contain comments which contain SQL queries to run. ...
Here is the problem, as I process the cells I need to check if the cell has an attached comment and if the cell has a name. I am able to handle processing the attached cell comments. But I can not figure out how to determine if a cell is within a named range. In my case the single cell within the range.
I saw a posting the suggested this would work:
cellName = ws.ActiveCell.Name.Name
No luck.
Does anybody have any idea how to do this?
I am so close but no cigar.
Thanks for your attention to this matter.
KD
What you may consider doing is first building a list of all addresses of names in the worksheet, and checking the address of each cell against the list to see if it's named.
In VBA, you obtain the names collection (all the names in a workbook) this way:
Set ns = ActiveWorkbook.Names
You can determine if the names are pointed toward part of the current sheet, and a single cell, this way:
shname = ActiveSheet.Name
Dim SheetNamedCellAddresses(1 To wb.Names.Count) as String
i = 1
For Each n in ns:
If Split(n.Value, "!")(0) = "=" & shname And InStr(n.Value, ":") = 0 Then
' The name Value is something like "=Sheet1!A1"
' If there is no colon, it is a single cell, not a range of cells
SheetNamedCellAddresses(i) = Split(n,"=")(1) 'Add the address to your array, remove the "="
i = i + 1
End If
Next
So now you have a string array containing the addresses of all the named cells in your current sheet. Move that array into a python list and you are good to go.
OK so it errors out if the cell does NOT have a range name. If the cell has a range name the following bit of code returns the name: Great success!!
ws.Cells(r,c).Activate()
c = xlApp.ActiveCell
cellName = c.Name.Name
If there is no name associated with the cell, an exception is tossed.
So even in VBA you would have to wrap this bit of code in exception code. Sounds expensive to me to use exception processing for this call.
To start I am a complete new comer to Python and programming anything other than web languages.
So, I have developed a script using Python as an interface between a piece of Software called Spendmap and an online app called Freeagent. This script works perfectly. It imports and parses the text file and pushes it through the API to the web app.
What I am struggling with is Spendmap exports multiple lines per order where as Freeagent wants One line per order. So I need to add the cost values from any orders spread across multiple lines and then 'flatten' the lines into One so it can be sent through the API. The 'key' field is the 'PO' field. So if the script sees any matching PO numbers, I want it to flatten them as per above.
This is a 'dummy' example of the text file produced by Spendmap:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
The above has been formatted for easier reading and normally is just one line after the next with no text formatting.
The 'key' or PO field is the first bold item and the second bold/italic item is the cost to be totalled. So if this example was to be passed through the script id expect the first row to be left alone, the Second and Third row costs to be added as they're both from the same PO number and the Fourth line to left alone.
Expected result:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,401.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
Any help with this would be greatly appreciated and if you need any further details just say.
Thanks in advance for looking!
I won't give you the solution. But you should:
Write and test a regular expression that breaks the line down into its parts, or use the CSV library.
Parse the numbers out so they're decimal numbers rather than strings
Collect the lines up by ID. Perhaps you could use a dict that maps IDs to lists of orders?
When all the input is finished, iterate over that dict and add up all orders stored in that list.
Make a string format function that outputs the line in the expected format.
Maybe feed the output back into the input to test that you get the same result. Second time round there should be no changes, if I understood the problem.
Good luck!
I would use a dictionary to compile the lines, using get(key,0.0) to sum values if they exist already, or start with zero if not:
InputData = """5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP COMMENT,002143"""
OutD = {}
ValueD = {}
for Line in InputData.split('\n'):
# commas in comments won't matter because we are joining after anyway
Fields = Line.split(',')
PO = Fields[3]
Value = float(Fields[5])
# set up the output string with a placeholder for .format()
OutD[PO] = ",".join(Fields[:5] + ["{0:.3f}"] + Fields[6:])
# add the value to the old value or to zero if it is not found
ValueD[PO] = ValueD.get(PO,0.0) + Value
# the output is unsorted by default, but you could sort or preserve original order
for POKey in ValueD:
print OutD[POKey].format(ValueD[POKey])
P.S. Yes, I know Capitals are for Classes, but this makes it easier to tell what variables I have defined...
I am using XLRD to attempt to read from and manipulate string text encapsulated within the cells of my excel document. I am posting my code, as well as the text that is returned when I choose to print a certain column.
import xlrd
data = xlrd.open_workbook('data.xls')
sheetname = data.sheet_names()
employees = data.sheet_by_index(0)
print employees.col(2)
>>>[text:u'employee_first', text:u'\u201cRichard\u201d', text:u'\u201cCatesby\u201d', text:u'\u201cBrian\u201d']
My intention is to create a dict or either reference the excel documents using strings in python. I would like to have a number of my functions in my program manipulate the data locally and then output at a later point (not within the scope of this question) to a second excel file.
How do I get rid of this extra information?
If you are only interested in the values of the cells, then you should do:
values = sheet.col_values(colx=2)
instead of:
cells = sheet.col(colx=2)
values = [c.value for c in cells]
because it's more concise and more efficient (Cell objects are constructed on the fly as/when requested).
employees.col(2) is a list of xlrd.sheet.Cell instances. To get all the values from the column (instead of the Cell objects), you can use the col_values method:
values = employees.col_values(2)
You could also do this (my original suggestion):
values = [c.value for c in employees.col(2)]
but that is much less efficient than using col_values.
\u201c and \u201d are unicode left and right double quotes, respectively. If you want to get rid of those, you can use, say, the lstrip and rstrip string methods. E.g. something like this:
values = [c.value.lstrip(u'\u201c').rstrip(u'\u201d') for c in employees.col(2)]