I am cataloging the attribute fields of each feature class in the input list below, then writing the output to a spreadsheet that records the occurrence of each attribute in one or more of the feature classes.
import arcpy, collections, re

arcpy.env.overwriteOutput = True

input = []    # list of feature classes
outfile = ''  # path to csv file

f = open(outfile, 'w')
f.write('ATTRIBUTE,FEATURE CLASS\n\n')

mydict = collections.defaultdict(list)
for fc in input:
    cmp = []
    lstflds = arcpy.ListFields(fc)
    for fld in lstflds:
        cmp.append(fld.name)
    for item in cmp:
        mydict[item].append(fc)

for keys, vals in mydict.items():
    # remove these characters
    char_removal = ["[", "'", ",", "]"]
    new_char = '[' + re.escape(''.join(char_removal)) + ']'
    v = re.sub(new_char, '', str(vals))
    line = ','.join([keys, v]) + '\n'
    print line
    f.write(line)
f.close()
This code gets me 90% of the way to the intended solution. I still cannot get the feature classes (the values) to separate with commas within the same cell; since the file is comma-delimited, each value shifts over to the next column, as I mentioned. With this code, the feature class names in "v" are written to the spreadsheet separated by a space (" ") in the same cell. Not a huge deal, since replacing " " with "," can be done very quickly in the spreadsheet itself, but it would be nice to work this into the code to improve reusability.
For a CSV file, use double-quotes around the cell content to preserve interior commas, like this:
content1,content2,"content3,contains,commas",content4
Generally speaking, many libraries that output CSV just put all contents in quotes, like this:
"content1","content2","content3,contains,commas","content4"
As a side note, I'd strongly recommend using an existing library to create CSV files instead of reinventing the wheel. One such library is built into Python 2.6+.
As they say, "Good coders write. Great coders reuse."
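For instance, Python's built-in csv module can be told to quote every field, reproducing the all-quoted style shown above. A minimal sketch:

import csv, sys

writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
writer.writerow(['content1', 'content2', 'content3,contains,commas', 'content4'])
# "content1","content2","content3,contains,commas","content4"

Applied to your script: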
import arcpy, collections, re, csv

arcpy.env.overwriteOutput = True

input = []    # list of feature classes
outfile = ''  # path to output csv file

f = open(outfile, 'wb')
csv_write = csv.writer(f)
csv_write.writerow(['Field', 'Feature Class'])
csv_write.writerow([])

mydict = collections.defaultdict(list)
for fc in input:
    cmp = []
    lstflds = arcpy.ListFields(fc)
    for fld in lstflds:
        cmp.append(fld.name)
    for item in cmp:
        mydict[item].append(fc)

for keys, vals in mydict.items():
    # remove these characters
    char_removal = ["[", "'", "]"]
    new_char = '[' + re.escape(''.join(char_removal)) + ']'
    v = re.sub(new_char, '', str(vals))
    # v still contains commas, so csv.writer will quote the cell for us
    csv_write.writerow([keys, v])
f.close()
I have a text file ("input.param"), which serves as an input file for a package. I need to modify the value of one argument. The lines that need to be changed are the following:
param1 0.01
model_name run_param1
I need to search for the argument param1 and modify its value of 0.01 over a range of different values; meanwhile, model_name will also be changed accordingly for each value of param1. For example, if param1 is changed to 0.03, then model_name is changed to 'run_param1_p03'. Below is my attempt:
import numpy as np
import os

param1_range = np.arange(0.01, 0.5, 0.01)

with open('input.param', 'r') as file:
    filedata = file.read()

for p_value in param1_range:
    filedata.replace('param1 0.01', 'param1 ' + str(p_value))
    filedata.replace('model_name run_param1', 'model_name run_param1' + '_p0' + str(int(round(p_value * 100))))
    with open('input.param', 'w') as file:
        file.write(filedata)
    os.system('./bin/run_app param/input.param')
However, this is not working. I guess the main problem is that the replace command cannot recognize the space, but I do not know how to search for the arguments param1 or model_name and change their values.
I'm editing this answer to more accurately answer the original question, which it did not adequately do.
The problem, as stated, is that "the replace command can not recognize the space". For this, the re (regex) module can be of great help. Your document is composed of an entry and its value, separated by spaces:
param1 0.01
model_name run_param1
In regex, a general capture would look like so:
import re
someline = 'param1 0.01'
pattern = re.match(r'^(\S+)\s+(\S+)$', someline)
pattern.groups()
# ('param1', '0.01')
The regex functions as follows:
^ matches the start of the line
\S is any non-space char, i.e. anything not in ('\t', ' ', '\r', '\n')
+ indicates one or more, as a greedy search (it will go forward until the pattern stops matching)
\s is any whitespace char (the opposite of \S; note the case here), so \s+ matches the run of spaces between the fields
() indicates a group, i.e. how you want to group your search
The groups make it fairly easy for you to unpack your arguments into variables if you so choose. To apply this to the code you have already:
import numpy as np
import re

param1_range = np.arange(0.01, 0.5, 0.01)

filedata = []
with open('input.param', 'r') as file:
    # This will put the lines in a list
    # so you can use ^ and $ in the regex
    for line in file:
        filedata.append(line.strip())  # get rid of trailing newlines

# filedata now looks like:
# ['param1 0.01', 'model_name run_param1']

# It might be easier to use a dictionary to keep all of your param vals
# since you aren't changing the names, just the values
groups = [re.match(r'^(\S+)\s+(\S+)$', x).groups() for x in filedata]

# Now you have a list of tuples which can be fed to dict()
my_params = dict(groups)
# {'param1': '0.01', 'model_name': 'run_param1'}

# Now just use that dict for setting your params
for p_value in param1_range:
    my_params['param1'] = str(p_value)
    my_params['model_name'] = 'run_param1_p0' + str(int(round(p_value * 100)))

    # And for the formatting back into the file, you can do some quick
    # padding to get the format you want
    with open('somefile.param', 'w') as fh:
        content = '\n'.join([k.ljust(20) + v.rjust(20) for k, v in my_params.items()])
        fh.write(content)
The padding is done using the str.ljust and str.rjust methods, so you get output that looks like this:

for k, v in dict(groups).items():
    intstr = k.ljust(20) + v.rjust(20)
    print(intstr)

param1                              0.01
model_name                    run_param1
Though you could arguably leave out the rjust if you felt so inclined.
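As a quick sanity check, the naming scheme used above reproduces the example from the question:

>>> p_value = 0.03
>>> 'run_param1_p0' + str(int(round(p_value * 100)))
'run_param1_p03'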
Well, I'm learning Python, so I'm working on a project that consists of pulling some numbers out of PDF files into xlsx, placing them in their corresponding columns, with the rows determined according to the row heading.
The idea I came up with is to convert the PDF files to txt and build a dictionary from the txt files, whose key is a part of the file name (because it contains a part of the row header) and whose values are the numbers I need.
I have already managed to convert the files to txt; now I'm working on the script to build the dictionary. At the moment it looks like this:
import os
import re

p = re.compile(r'\w+\f+')
'''
I'm not entirely sure at the moment how re.compile works. I know I'm
missing something to indicate that what I want is immediately to the
right, and I'm also not sure whether the keywords will be ignored; I
just want to take out the numbers.
'''
m = p.match('These are the keywords' or 'That are immediately to the left' or 'The numbers I want')

def IsinDict(txtDir):
    ToData = ()
    if txtDir == "":
        txtDir = os.getcwd() + "\\"
    for txt in os.listdir(txtDir):
        ToKey = txt[9:21]
        if ToKey == (r"\w+"):
            Data = open(txt, "r")
            for string in Data:
                ToData += m.group()
            Diccionary = dict.fromkeys(ToKey, ToData)
            return Diccionary

txtDir = "Absolute/Path/OfTheText/Files"
IsinDict(txtDir)
Any contribution is welcome, thanks for your attention.
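For what it's worth, here is a minimal sketch of the approach described above: keying on a slice of the file name and pulling out the numbers that appear immediately to the right of known keywords. The keyword list and the number pattern are hypothetical placeholders, so adjust them to the actual layout of your text files:

import os
import re

def numbers_by_file(txtDir, keywords=('keyword1', 'keyword2')):
    # Capture a number that appears immediately after any of the keywords
    # (the keywords here are hypothetical placeholders)
    pattern = re.compile(r'(?:%s)\s+(-?\d+(?:\.\d+)?)' % '|'.join(keywords))
    result = {}
    for txt in os.listdir(txtDir):
        if not txt.endswith('.txt'):
            continue
        key = txt[9:21]  # the slice of the file name that holds the row header
        with open(os.path.join(txtDir, txt)) as fh:
            result[key] = pattern.findall(fh.read())
    return result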
I have a file in a special format, .cns, which is a segmented file used to analyze copy number. It is a text file that looks like this (header plus first data line):
head -1 copynumber.cns
chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
We transformed it to a .csv so we could split it on tabs (but that didn't work well). The .cns is comma-separated, but the genes are a single string delimited by quotes. I hope this is useful. The output I need is something like this:
gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
PIK3CA 0.35475
NRAS 3.35475
The first step would be to separate everything by commas, then transpose the columns, and finally print the log2 value for each gene contained in that quoted string. If you could help me with an R or Python script, it would help a lot. Perhaps awk would work too.
I am using Linux Ubuntu 16.04.
I'm not sure if I am being clear; let me know if this is useful.
Thank you!
Hope the following Python code helps:
import csv

list1 = []
with open('copynumber.cns', 'r') as file:
    exampleReader = csv.reader(file)
    for row in exampleReader:
        list1.append(row)

for row in list1:
    strings = row[3].split(',')  # get the fourth column (the gene column) and split it on commas
    for string in strings:       # loop through each gene name
        print(string + ' ' + str(row[4]))
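Run against the sample line from the question, this prints the header pair followed by one gene per row:

gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
LOC100133331 -0.28067
LOC100132062 -0.28067
LOC100132287 -0.28067
LOC100133331 -0.28067
LINC00115 -0.28067
SAMD11 -0.28067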
I'm trying to implement Vigenere's Cipher. I want to be able to obfuscate every single character in a file, not just alphabetic characters.
I think I'm missing something with the different types of encoding. I have made some test cases and some characters are getting replaced badly in the final result.
This is one test case:
,.-´`1234678abcde^*{}"¿?!"·$%&/\º
end
And this is the result I'm getting:
).-4`1234678abcde^*{}"??!"7$%&/:
end
As you can see, ',' is being replaced badly with ')' as well as some other characters.
My guess is that the others (for example, '¿' being replaced with '?') come from the original character not being in the range [0, 127], so it's normal that those are changed. But I don't understand why ',' is failing.
My intent is to obfuscate CSV files, so the ',' problem is the one I'm mainly concerned about.
In the code below, I'm using modulus 128, but I'm not sure if that's correct. To run it, put a file named "OriginalFile.txt" containing the content to cipher in the same folder and execute the script. Two files will be generated, Ciphered.txt and Deciphered.txt.
"""
Attempt to implement Vigenere cipher in Python.
"""
import os
key = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
fileOriginal = "OriginalFile.txt"
fileCiphered = "Ciphered.txt"
fileDeciphered = "Deciphered.txt"
# CIPHER PHASE
if os.path.isfile(fileCiphered):
os.remove(fileCiphered)
keyToUse = 0
with open(fileOriginal, "r") as original:
with open(fileCiphered, "a") as ciphered:
while True:
c = original.read(1) # read char
if not c:
break
k = key[keyToUse]
protected = chr((ord(c) + ord(k))%128)
ciphered.write(protected)
keyToUse = (keyToUse + 1)%len(key)
print("Cipher successful")
# DECIPHER PHASE
if os.path.isfile(fileDeciphered):
os.remove(fileDeciphered)
keyToUse = 0
with open(fileCiphered, "r") as ciphered:
with open(fileDeciphered, "a") as deciphered:
while True:
c = ciphered.read(1) # read char
if not c:
break
k = key[keyToUse]
unprotected = chr((128 + ord(c) - ord(k))%128) # +128 so that we don't get into negative numbers
deciphered.write(unprotected)
keyToUse = (keyToUse + 1)%len(key)
print("Decipher successful")
Assumption: you're trying to produce a new, valid CSV with the contents of cells enciphered via Vigenere, not to encipher the whole file.
In that case, you should check out the csv module, which will properly handle reading and writing CSV files for you (including cells that contain commas in the value, which can happen after you encipher a cell's contents, as you saw). Very briefly, you can do something like:
import csv

with open("...", "r") as fpin, open("...", "w") as fpout:
    reader = csv.reader(fpin)
    writer = csv.writer(fpout)
    for row in reader:
        # row will be a list of strings, one per column in the row
        # encipher() here stands in for your Vigenere routine
        ciphered = [encipher(cell) for cell in row]
        writer.writerow(ciphered)
When using the csv module you should be aware of the notion of "dialects": the ways different programs (usually spreadsheet-like things, think Excel) format CSV data. The csv.Sniffer class can usually infer the dialect of your input file, but you might need to tell csv.writer() what dialect you want for the output file. You can get the list of built-in dialects with csv.list_dialects(), or you can make your own by registering a custom dialect.
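For example, a minimal sketch of registering and using a custom dialect (the name and parameters here are made up for illustration):

import csv

# A hypothetical pipe-delimited dialect that quotes every field
csv.register_dialect('pipes', delimiter='|', quoting=csv.QUOTE_ALL)

with open("...", "w") as fpout:
    writer = csv.writer(fpout, dialect='pipes')

As an aside, the mod-128 arithmetic also explains the ',' to ')' corruption in your test case, assuming Python 3's default universal-newline handling. With key character 'a', ',' enciphers to a carriage return, which is translated to '\n' when Ciphered.txt is read back in text mode:

chr((ord(',') + ord('a')) % 128)         # (44 + 97) % 128 == 13, i.e. '\r'
chr((128 + ord('\n') - ord('a')) % 128)  # deciphering the translated '\n' gives chr(41) == ')'

Opening the intermediate file in binary mode (or enciphering only the cell contents, as suggested above) avoids this.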
I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract id (the file number I extracted from my document), sentence id (automatically generated), and each sentence of the abstract on its own row.
I would want a table that looks like this
abstractID    SentenceID    Sentence
a9001755      0000001       Myxococcus xanthus development is regulated by (1st sentence)
a9001755      0000002       The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on its own row) to the table and assign the sentence IDs as shown above?
This is my code:
import glob
import re
import json

org = "NSF Org"
fileNo = "File"
AbstractString = "Abstract"
abstractFlag = False
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt'
files = glob.glob(path)
for name in files:
    fileA = open(name, 'r')
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content) - 1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split()
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough of to refactor but that feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer

    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:].strip()  # strip the newline so the table lines up
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments

    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', ' ').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In the section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not, because the loop keeps overwriting your variables for as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say exactly how to fix it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it, if this is in its header) rather than looping over it.
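For instance (a sketch only, since the real header layout isn't visible here), a single multiline regex over the whole file would pin down the first occurrence of each field without the overwriting problem. The pattern below assumes header lines shaped like 'File : a9001755', which is a guess:

import re

text = open(filename).read()
# Hypothetical pattern; adjust to the real header layout
m = re.search(r'^File\s*:\s*(\S+)', text, re.MULTILINE)
if m:
    absID = m.group(1)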
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, collecting a lot of cruft that spreads state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. That creates a lot of action at a distance, which is best avoided. In the smaller version, you can see how I substituted string literals in the places where they are only used once and called print immediately instead of storing the results for later. The result is usually more concise and more easily understood.