I am working with Python 2.7 and arcpy in a Jupyter notebook environment.
I would like to adapt my code iteratively to a reference table.
This is my reference table, which contains the 3 variables that I use for the tool I am running in arcpy:
RegY HunCal CRY
1 1718 BL1
1 1112 JU1
1 1112 JU1
1 1213 JU1
This is a simple xls table which I imported into my Jupyter notebook. I keep it as a visual reference for when I have to change these variables in my code.
In the beginning I was doing it by hand, because there were only a few changes to make. But now there are more than 150 changes to adapt, and this number increases with time. Therefore, I would like to modify the code so that it uses the reference table to iterate through every feature each time the reference table changes.
This is the code I am using:
# 2011
# Set geoprocessor object property to overwrite existing output
arcpy.gp.overwriteOutput = True
arcpy.env.workspace = r'C:\Users\GeoData\simSear\SBA_D.gdb'
# Process: Group Similar Features
SS.SimilaritySearch("redD_RegY_1_1112","blackD_CRY_JU1_1112","SS_JU1_1112","NO_COLLAPSE",
"MOST_SIMILAR","ATTRIBUTE_PROFILES",0,
"Temperatur;Precipitat", 'DateFin')
How can I adapt the code in such a way that the variables from the reference table are inserted into my code in the following way:
From the reference table, the values of RegY would replace the **1** in redD_RegY_**1**_1112.
The values of CRY would replace the **JU1** in blackD_CRY_**JU1**_1112 and SS_**JU1**_1112.
The values of HunCal would replace the **1112** in redD_RegY_1_**1112**, blackD_CRY_JU1_**1112**, and SS_JU1_**1112**.
Any hint or suggestion would be highly appreciated.
You should iterate through each row of the table to get your reference table values, then use them to build the unique strings for your input, candidate, and output features.
for row in table:
    regY = row[0]
    hunCal = row[1]
    cry = row[2]

    input_features_to_match = 'redD_RegY_{}_{}'.format(regY, hunCal)
    candidate_features = 'blackD_CRY_{}_{}'.format(cry, hunCal)
    output_features = 'SS_{}_{}'.format(cry, hunCal)

    SS.SimilaritySearch(
        input_features_to_match,
        candidate_features,
        output_features,
        'NO_COLLAPSE',
        'MOST_SIMILAR',
        'ATTRIBUTE_PROFILES',
        0,
        'Temperatur;Precipitat',
        'DateFin')
Or much more compactly:
for row in table:
    SS.SimilaritySearch(
        'redD_RegY_{}_{}'.format(row[0], row[1]),
        'blackD_CRY_{}_{}'.format(row[2], row[1]),
        'SS_{}_{}'.format(row[2], row[1]),
        'NO_COLLAPSE',
        'MOST_SIMILAR',
        'ATTRIBUTE_PROFILES',
        0,
        'Temperatur;Precipitat',
        'DateFin')
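If the reference table only exists as the xls file, one way to build the `table` iterated above is to read it with pandas. This is only a sketch: the file path is hypothetical, it assumes pandas can read the xls in your Python 2.7 environment, and the cast to str keeps HunCal codes such as 1112 formatting cleanly into the layer names:

import pandas as pd

ref = pd.read_excel(r'C:\Users\GeoData\reference_table.xls')  # hypothetical path
table = ref[['RegY', 'HunCal', 'CRY']].astype(str).values.tolist()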
I have to create numerous PDFs of tables every year, so I was trying to write a script with python-docx to create the tables in Word, with each column having its own set width and left or right alignment. Right now I am working on two tables - one with 262 rows, the other with 1036 rows.
The code works great for the table with 262 rows, but will not set the column widths correctly for the table with 1036 rows. Since the code is identical for both tables, I am thinking it is a problem with the data itself, or possibly the size of the table? I tried creating the second, larger table without any data, and the widths are correct. I then tried creating the table below with a subset of 7 rows from the 1036 rows, including the rows with the largest numbers of characters, in case a column was not wrapping the text but instead forcing the column widths to change. It runs fine - the widths are correct. I use the exact same code on the full data set of 1036 rows, and the widths change. Any ideas?
Below is the code for only 7 rows of data. It works correctly - the first column is 3.5 inches, the second and third columns are 1.25 inches.
from docx import Document
from docx.shared import Cm, Pt, Inches
import docx
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_ALIGN_VERTICAL
year = '2019'
set_width_dic = {
    'PUR' + year + '_subtotals_indexed_by_site.txt':
        (Inches(3.50), Inches(1.25), Inches(1.25)),
    'PUR' + year + '_subtotals_indexed_by_chemical.txt':
        (Inches(3.50), Inches(1.25), Inches(1.25))}

set_alignment_dic = {
    'PUR' + year + '_subtotals_indexed_by_site.txt':
        [[0, WD_ALIGN_PARAGRAPH.LEFT, 'Commodity or site'],
         [1, WD_ALIGN_PARAGRAPH.RIGHT, 'Pounds Applied'],
         [2, WD_ALIGN_PARAGRAPH.RIGHT, 'Agricultural Applications']],
    'PUR' + year + '_subtotals_indexed_by_chemical.txt':
        [[0, WD_ALIGN_PARAGRAPH.LEFT, 'Chemical'],
         [1, WD_ALIGN_PARAGRAPH.RIGHT, 'Pounds Applied'],
         [2, WD_ALIGN_PARAGRAPH.RIGHT, 'Agricultural Applications']]}
# the data
list_of_lists = [
    ['Chemical', 'Pounds Applied', 'Agricultural Applications'],
    ['ABAMECTIN', '51,276.54', '69,659'],
    ['ABAMECTIN, OTHER RELATED', '0.03', 'N/A'],
    ['S-ABSCISIC ACID', '1,856.38', '230'],
    ['ACEPHATE', '158,054.76', '11,082'],
    ['SULFUR', '49,038,554.00', '170,396'],
    ['BACILLUS SPHAERICUS 2362, SEROTYPE H5A5B, STRAIN ABTS 1743 FERMENTATION SOLIDS, SPORES AND INSECTICIDAL TOXINS', '11,726.29', 'N/A']]
doc = docx.Document()  # create an instance of a Word document
col_ttl = 3   # number of columns = headings in the first list in list_of_lists
row_ttl = 7   # number of rows = total number of lists in list_of_lists

# Creating a table object
table = doc.add_table(rows=row_ttl, cols=col_ttl)
table.style = 'Light Grid Accent 1'

for r in range(len(list_of_lists)):
    row = table.rows[r]
    # `file` and `path` are defined earlier in the full script;
    # `file` is one of the keys of set_width_dic / set_alignment_dic.
    widths = set_width_dic[file]  # e.g. (Inches(3.50), Inches(1.25), Inches(1.25))
    for c, cell in enumerate(table.rows[r].cells):  # c is an index, cell is the empty cell of the table
        table.cell(r, c).vertical_alignment = WD_ALIGN_VERTICAL.BOTTOM
        table.cell(r, c).width = widths[c]
        par = cell.add_paragraph(str(list_of_lists[r][c]))
        for l in set_alignment_dic[file]:
            if l[0] == c:
                par.alignment = l[1]

doc.save(path + 'try.docx')
When I try to run the exact same code for the entire list_of_lists (a list of 1036 lists), the widths are incorrect: column 1 = 4.23", column 2 = 1.04", and column 3 = 0.89".
I printed the full 1036-row list_of_lists to my cmd box, then pasted it into a text file, thinking I might be able to include it here. However, when I attempted to run the full list, it would not paste back into the cmd box - it gave an EOL error and only showed the first 65 lists in list_of_lists. python-docx is able to make the full table, it just won't set the correct widths. I am baffled. I have looked through every Stack Exchange python-docx table width post I can find, and many other googled sites. Any thoughts much appreciated.
Just figured out the issue. I needed to add autofit = False. Now the code works for the longer table as well:
table.autofit = False
table.allow_autofit = False
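For placement, a short sketch reusing the names from the question's snippet (the second autofit line is kept from the answer above; the key change is turning autofit off right after the table is created, before the widths are assigned):

table = doc.add_table(rows=row_ttl, cols=col_ttl)
table.style = 'Light Grid Accent 1'
table.autofit = False        # stop Word from recalculating the column widths
table.allow_autofit = False  # kept from the answer above
# ...then set each cell's width inside the row/cell loop as before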
I have a data frame made of tweets and their authors; there is a total of 45 authors. I want to divide the data frame into groups of 2 authors at a time, so that I can export them later into csv files.
I tried the following (given that the authors are in the column named 'B' and the tweets are in the column named 'A').
I took the following from this question:
df.set_index(keys=['B'],drop=False,inplace=True)
authors = df['B'].unique().tolist()
in order to separate the lists:
dgroups = []
for i in range(0, len(authors)-1, 2):
    dgroups.append(df.loc[df.B == authors[i]])
    dgroups.extend(df.loc[df.B == authors[i+1]])
but instead it gives me sub-lists like this:
dgroups = [['A'],['B'],
[tweet,author],
['A'],['B'],
[tweet,author2]]
Prior to this I was able to divide them correctly into 45 sub-lists, following the same linked question, as follows:
for i in authors:
    groups.append(df.loc[df.B == i])
So how would I do that for 2 authors, or 3 authors, and so on?
EDIT: from @Jonathan Leon's answer, I thought I would do the following, which worked but isn't a dynamic solution and is inefficient I guess, especially if n > 3:
dgroups = []
for i in range(2, len(authors)+1, 2):
    tempset1 = []
    tempset2 = []
    tempset1 = df.loc[df.B == authors[i-2]]
    if i-1 != len(authors):
        tempset2 = df.loc[df.B == authors[i-1]]
        dgroups.append(tempset1.append(tempset2))
    else:
        dgroups.append(tempset1)
This imports the foreign-language text incorrectly, but the logic works to create a new csv for every two authors.
import pandas as pd

df = pd.read_csv('TrainDataAuthorAttribution.csv')
# df.groupby('B').count()
authors = df.B.unique().tolist()

auths_in_subset = 2
for i in range(auths_in_subset, len(authors) + auths_in_subset, auths_in_subset):
    # print(authors[i-auths_in_subset:i])
    dft = df[df.B.isin(authors[i-auths_in_subset:i])]
    # print(dft)
    dft.to_csv('df' + str(i) + '.csv')
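If it is the non-English text that gets mangled on import, passing an explicit encoding to the read and write calls may help; this is only a guess, since the right codec depends on how the source csv was written:

df = pd.read_csv('TrainDataAuthorAttribution.csv', encoding='utf-8')
# ...and later, inside the loop:
dft.to_csv('df' + str(i) + '.csv', encoding='utf-8')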
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolsets. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good. Except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I call a more non-mathematical/numerical purpose makes me feel like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
    WorkCode   Status  WorkName                      StartDate   EndDate     siteType        Supplier
0   AT-W34319  None    Second building               2020-05-04  2020-05-31  Type A          Acem 1
1   AT-W67713  None    Left of the red office tower  2019-02-11  2020-08-28  Type B          Quester Q
2   AT-W68713  None    12 main street                2019-05-23  2020-11-03  Class 1 Type B  Dettlim Group
3   AT-W70105  None    city central                  2019-03-07  2021-08-06  Other           Hans Int
4   AT-W73855  None    top floor                     2019-05-06  2020-10-28  Type a          None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to Excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2',
                                                           'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers

    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has got more than WorkCode, because there are different versions (as different records)
    # Only want the last one.

    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching

        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)

    # now I should have a new container of "enriched" data,
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.

    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas data frame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated, and then how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
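For readers wanting something concrete, here is one possible shape of such a pandas-only approach. It is only a sketch: the stand-in frames, Ids, and file names are illustrative, and this is not necessarily the exact code the answer above refers to.

import pandas as pd

# In the real script the source frame comes straight from arcpy, e.g.
#   source = pd.DataFrame(arcpy.da.FeatureClassToNumPyArray(input, [...]))
# which keeps the field names as column headers. A stand-in keeps this runnable:
source = pd.DataFrame({
    "WorkCode": ["AT-W34319", "AT-W67713"],
    "WorkName": ["Second building", "Left of the red office tower"],
})

# Lookup data shaped like the CSV sample; keep only the latest Version per Id.
lookup = pd.DataFrame({
    "Id": ["AT-W34319", "AT-W34319", "XM-N33463"],
    "Version": [1.0, 2.1, 7.1],
    "Principal Contractor Contact": ["555-12345", "555-54321", "555-12345"],
})
lookup = lookup.sort_values("Version").drop_duplicates("Id", keep="last")

# One left join does the row-by-row enrichment, then the result goes straight to Excel.
enriched = source.merge(lookup, how="left", left_on="WorkCode", right_on="Id")
enriched.to_excel("finalReport.xlsx", index=False)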
Cheers
I have 2 tables in my database (visits, events).
visits has a primary key visit_id.
events_and_pages has a column visit_id, which is sort of a foreign key to visits (an events row can belong to 0 or 1 visit).
What I want to do: filter out from the events table all the visit_id values that don't belong to the visits table. A simple task.
I have the data for each of those tables stored in pandas.DataFrame, respectively df_visits and df_events
I do the following operation:
len(set(df_visits.visit_id) - set(df_events.visit_id))
I get a result of 1670, which is consistent with what I should expect.
But when I do
filter_real_v = df_events.visit_id.isin(set(visits.visit_id))
filter_real_v.value_counts() # I get only True values
filter_real_v = df_events.visit_id.isin(visits.visit_id)
filter_real_v.value_counts() # I get only True values
Even weirder, when I use
pd.DataFrame(df_events.visit_id).isin(real_visits).visit_id.value_counts()  # I get all False values except 8 that are True
pd.DataFrame(df_events.visit_id).isin(set(real_visits)).visit_id.value_counts()  # I get all True values
What is going on here? And how can I define a filter for which visit_id exists in events but not in visits?
Please find in this link the df_events and df_visits csv files to reproduce this error (comma-separated index,visit_id).
EDIT : Add snippet for minimal reproducible code:
Download the files in the link and put them in a file_path_events & file_path_visits of your choosing.
Execute the code below:
import pandas as pd
events = pd.read_csv("df_events.csv")
events.set_index('index',inplace=True)
visits = pd.read_csv("df_visits.csv")
visits.set_index('index',inplace=True)
correct_delta = len(set(visits.visit_id) - set(events.visit_id))
print(correct_delta) #1670
filter_real_v = events.visit_id.isin(set(visits.visit_id))
bad_delta = filter_real_v.value_counts()
print(bad_delta[True]) #702680
Best regards
Everything is behaving correctly; you're just misinterpreting the set operation "-".
len(set(df_visits.visit_id) - set(df_events.visit_id))
This will return the values of df_visits.visit_id that are not in df_events.visit_id. Note: if values of df_events.visit_id are not in df_visits.visit_id, they will not be represented here. This is how sets work.
For example:
set([1,2,3,9]) - set([9,10,11])
Output:
{1, 2, 3}
Notice how 10 and 11 do not show up in the answer; in fact, none of the values unique to the second set will. Only the values in the second set are taken away from the first set.
With isin() you are effectively doing:
visits['visit_id'].isin(df_events['visit_id'].values).value_counts()
True 56071
False 1670
# Note 1670 is the exact same you got in your set operation
and not:
df_events['visit_id'].isin(visits['visit_id'].values).value_counts()
True 702680
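For the filter the question actually asks for, a short sketch using the same frames (df_events / df_visits from the question):

# events rows whose visit_id does exist in visits (what the question wants to keep):
events_with_visit = df_events[df_events.visit_id.isin(df_visits.visit_id)]

# events rows whose visit_id does NOT exist in visits (the ones to filter out):
events_without_visit = df_events[~df_events.visit_id.isin(df_visits.visit_id)]

# visit_ids present in visits but with no events at all (the 1670 from the set difference):
visits_without_events = df_visits[~df_visits.visit_id.isin(df_events.visit_id)]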
I'm using PyTables 2.2.1 w/ Python 2.6, and I would like to create a table which contains nested arrays of variable length.
I have searched the PyTables documentation, and the tutorial example (PyTables Tutorial 3.8) shows how to create a nested array of length = 1. But for this example, how would I add a variable number of rows to data 'info2/info3/x' and 'info2/info3/y'?
For perhaps an easier to understand table structure, here's my homegrown example:
"""Desired Pytable output:
DIEM    TEMPUS  Temperature     Data
5       0       100             Category1 <--||--> Category2
                                x <--|  |--> y     z <--|
                                0       0          0
                                2       1          1
                                4       1.33       2.67
                                6       1.5        4.5
                                8       1.6        6.4
5       1       99
                                2       2          0
                                4       2          2
                                6       2          4
                                8       2          6
5       2       96
                                4       4          0
                                6       3          3
                                8       2.67       5.33
Note that nested arrays have variable length.
"""
import tables as ts
tableDef = {'DIEM': ts.Int32Col(pos=0),
            'TEMPUS': ts.Int32Col(pos=1),
            'Temperature': ts.Float32Col(pos=2),
            'Data':
                {'Category1':
                    {
                     'x': ts.Float32Col(),
                     'y': ts.Float32Col()
                    },
                 'Category2':
                    {
                     'z': ts.Float32Col(),
                    }
                }
            }
# create output file
fpath = 'TestDb.h5'
fh = ts.openFile(fpath, 'w')
# define my table
tableName = 'MyData'
fh.createTable('/', tableName, tableDef)
tablePath = '/'+tableName
table = fh.getNode(tablePath)
# get row iterator
row = table.row
for i in xrange(3):
    print '\ni=', i

    # calc some fake data
    row['DIEM'] = 5
    row['TEMPUS'] = i
    row['Temperature'] = 100 - i**2

    for j in xrange(5 - i):
        # Note that nested array has variable number of rows
        print 'j=', j,
        # calc some fake nested data
        val1 = 2.0 * (i + j)
        val2 = val1 / (j + 1.0)
        val3 = val1 - val2

        ''' Magic happens here...
        How do I write 'j' rows of data to the elements of
        Category1 and/or Category2?

        In bastardized pseudo-code, I want to do:

        row['Data/Category1/x'][j] = val1
        row['Data/Category1/y'][j] = val2
        row['Data/Category2/z'][j] = val3
        '''

    row.append()

table.flush()
fh.close()
I have not found any indication in the PyTables docs that such a structure is not possible... but in case it is in fact not possible, what are my alternatives to variable-length nested columns?
EArray? VLArray? If so, how do I integrate these data types into the structure described above?
some other idea?
Any assistance is greatly appreciated!
EDIT w/ additional info:
It appears that the PyTables gurus have already addressed the "is such a structure possible" question:
PyTables Mail Forum - Hierachical Datasets
So has anyone figured out a way to create an analogous PyTable data structure?
Thanks again!
I have a similar task: to dump fixed-size data together with arrays of variable length.
I first tried using fixed-size StringCol(64*1024) fields to store my variable-length data (they are always < 64K), but it was rather slow and wasted a lot of disk space, despite blosc compression.
After days of investigation I ended up with the following solution
(spoiler: we store the array fields in separate EArray instances, one EArray per array field).
I store the fixed-size data in a regular PyTables table.
I added 2 additional fields to these tables: arrFieldName_Offset and arrFieldName_Length:
class Particle(IsDescription):
    idnumber = Int64Col()
    ADCcount = UInt16Col()
    TDCcount = UInt8Col()
    grid_i = Int32Col()
    grid_j = Int32Col()
    pressure = Float32Col()
    energy = FloatCol()
    buffer_Offset = UInt32Col()  # note this field!
    buffer_Length = UInt32Col()  # and this one too!
I also create one EArray instance per array field:
datatype = StringAtom(1)
buffer = h5file.createEArray('/detector', 'arr', datatype, (0,), "")
Then I add the rows corresponding to the fixed-size data:
row['idnumber'] = ...
...
row['energy'] = ...
row['buffer_Offset'] = buffer.nrows
# my_buf is a string (I get it from a stream)
row['buffer_Length'] = len(my_buf)
table.append(row)
Ta-dah! Now add the buffer into the array:
buffer.append(np.ndarray((len(my_buf),), buffer=my_buf, dtype=datatype))
That's the trick. In my experiments this approach is 2-10x faster than storing ragged fixed-size arrays (like StringAtom(HUGE_NUMBER)), and the resulting DB is a few times smaller (2-5x).
Getting the buffer data is easy. Suppose that row is a single row you read from your DB:
# Open the array for reading (it was created as '/detector/arr' above)
buffer = h5file.getNode('/detector', 'arr')
...
row = ...
...
bufferDataYouNeed = buffer[row['buffer_Offset'] : row['buffer_Offset'] + row['buffer_Length']]
This is a common thing that folks starting out with PyTables want to do. Certainly, it was the first thing I tried to do. As of 2009, I don't think this functionality was supported. You can look here for one solution I always recommend:
http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg01207.html
In short, just put each VLArray in a separate place. If you do that, maybe you don't end up needing VLArrays. If you store separate VLArrays for each trial (or whatever), you can keep metadata on those VLArrays (guaranteed to stay in sync with the array across renames, moves, etc.) or put it in a table (easier to search).
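A rough sketch of that "one VLArray per trial" idea, using the PyTables 2.x spellings from the question (all names and values here are illustrative, not a prescribed layout):

import numpy as np
import tables as ts

fh = ts.openFile('trials.h5', 'w')
group = fh.createGroup('/', 'trials')

for trial_id, values in enumerate([[0.4, 0.5, 0.45], [0.3, 0.4, 0.45, 0.56]]):
    arr = fh.createVLArray(group, 'trial_%d' % trial_id, ts.Float32Atom())
    arr.append(np.array(values, dtype='float32'))  # one variable-length row
    arr.attrs.trial = trial_id                     # metadata stays attached to the array

fh.close()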
But you may also do well to pick whatever a single time-point would be for your column atom, then simply add another column for a time stamp. This would allow for a "ragged" array that still has a regular, repeated (tabular) structure in memory. For example:
Trial   Data
1       0.4, 0.5, 0.45
2       0.3, 0.4, 0.45, 0.56
becomes
Trial   Timepoint   Data
1       1           0.4
1       2           0.5
...
2       4           0.56
Data above is a single number, but it could be, e.g. a 4x5x3 atom.
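A minimal sketch of that flattened layout as a plain Table, again in the PyTables 2.x API used in the question (class, file, and column names are illustrative):

import tables as ts

class TrialPoint(ts.IsDescription):
    Trial = ts.Int32Col(pos=0)
    Timepoint = ts.Int32Col(pos=1)
    Data = ts.Float32Col(pos=2)  # could be e.g. ts.Float32Col(shape=(4, 5, 3)) for a bigger atom

fh = ts.openFile('flat.h5', 'w')
tbl = fh.createTable('/', 'ragged', TrialPoint)
row = tbl.row
for trial, series in [(1, [0.4, 0.5, 0.45]), (2, [0.3, 0.4, 0.45, 0.56])]:
    for t, value in enumerate(series, start=1):
        row['Trial'] = trial
        row['Timepoint'] = t
        row['Data'] = value
        row.append()
tbl.flush()
fh.close()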
If nested VLArrays are supported in PyTables now, I'd certainly love to know!
Alternatively, I think h5py does support the full HDF5 feature set, so if you're really committed to the nested data layout, you may have more luck there. You'll be losing out on a lot of nice features, though! And in my experience, naive neuroscientists end up with quite poor performance, since they don't get PyTables' intelligent choices for data layout, chunking, etc. Please report back if you go that route!
I also ran into this, and I ended up using a fixed array size. The arrays I was trying to store were of variable length, so I created new ones from them with the correct fixed length.
I did something along the lines of:
def filled_list(src_list, targ_len):
    """Takes a variable len() list and creates a new one with a fixed len()."""
    for i in range(targ_len):
        try:
            yield src_list[i]
        except IndexError:
            yield 0
src_list = [1,2,3,4,5,6,7,8,9,10,11]
new_list = [x for x in filled_list(src_list, 100)]
That did the trick for me.