Project #2: Creating Custom Geoprocessing Tools with Python Scripts (or, How to Bring Your Wildest Dreams to Life)

While I think many GIS analysts cherish Python for its ability to automate large-scale geoprocessing tasks and do a lot of the heavy lifting when it comes to data management, I find that its true creative power lies in writing custom geoprocessing scripts. So, just to back up for a moment: all GIS applications are based on the idea of passing spatial files (e.g. shapefiles, rasters, feature classes) through script tools that do something to those files. These scripts are now often written in Python, especially since Esri began writing its own tools in the language. For most GIS users, however, script tools operate almost like a black box – files are passed in and new files pop out. We all have a sense of what the input needs to look like and what the output should look like, but what happens in the middle remains fairly opaque. That opacity frustrated me when I first began using GIS. I wanted to know more about what was actually going on "under the hood" and to be able to manipulate that process to make different things happen.

So for this second project, I posed a new challenge to myself – to create a custom geoprocessing tool and, in the process, learn more about how geoprocessing tools actually function. My professor suggested that I try making a tool that would tackle a somewhat tricky (though regularly used) technique in GIS known as areal-weighted reaggregation (AWR). The goal of AWR is to take data aggregated by one areal unit and reaggregate it by a different areal unit, particularly when the two sets of units are not nested. The script I wrote for Project #1 uses areal-weighted reaggregation, taking Census data that gives the number of people of each race in each block group and reaggregating it by the Chicago Police Department's beats (patrol areas). This is ultimately what allows us to relate the density of arrests for possession of certain drugs in each beat to the racial makeup of that same beat.
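To make the arithmetic concrete, here is a minimal, purely illustrative sketch (plain Python, with made-up numbers) of the weighting AWR performs for a single block group that straddles two police beats:

# Hypothetical numbers: a block group of 1,000,000 sq. meters with 300 Black
# residents is split by an intersect into two pieces, one in each of two beats.
block_group_area = 1000000.0
black_pop = 300
pieces = [("Beat 1011", 250000.0), ("Beat 1012", 750000.0)]

beat_totals = {}
for beat, piece_area in pieces:
    ratio = piece_area / block_group_area      # the piece's share of the block group
    beat_totals[beat] = beat_totals.get(beat, 0) + black_pop * ratio

# beat_totals is now {'Beat 1011': 75.0, 'Beat 1012': 225.0}; summing these
# weighted values across every piece that falls within a beat is the reaggregation.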

So why is creating a tool to do this process actually useful? Well, areal-weighted reaggregation is a surprisingly common technique that needs to be performed regularly when working with spatial data. All too often, we have data aggregated by one areal unit, but we need it aggregated by a different areal unit. By creating a custom tool to do this process, we can use it on any problem that has the required inputs. For example, take the Chicago map I made in Project #1. With a custom AWR tool, we could make the same type of map for every city in the United States, or for every major city in the world (assuming reliable and extant data – which you really can never assume). Or let's say you're interested in completely different statistics (age, gender, socioeconomic class, education, etc.) or completely different scales of analysis (i.e. counties, states, countries, continents). Again, this tool can do that type of analysis as well. The power of creating custom tools is in their diversity of application, their ability to be applied to many different contexts and problems. The script from Project #1 is limited in the sense that it only works for one particular case (singularity of application).

Alright, now for the goods. Click here to access my script for the Areal-Weighted Reaggregation tool. So, a little explanation of the tool itself. The tool has a fair number of parameters that need to be entered (because, well, it does a fair number of things). These include:

Workspace – where are you storing your files that you're working with? The script needs to know this in order to know where to look on your computer to get at the files of interest.

Data Feature Class – the feature class holding your data of interest. This is the file that contains the aggregated data that you will be reaggregating. For example, if you have data in block groups and you want to get it by police beats, your block groups feature class would be the input for this parameter.

Areal Unit Feature Class – the feature class holding your areal unit of interest. This is the file that contains the areal units by which you are reaggregating. For example, if you have data in block groups and you want to get it by police beats, your police beats feature class would be the input for this parameter.

Dissolve Field – the field in your areal unit feature class that uniquely identifies each areal unit in the feature class. Your data will be reaggregated (dissolved) based on this field. For example, in your police beats feature class, there is likely a field (e.g. BEAT_NUM) that uniquely identifies each police beat. This would be the input for this parameter.

Intersect Feature Class – the intersect feature class created as part of the areal-weighted reaggregation. This is one of two outputs created by the tool and holds the result of an intersect analysis between your data feature class and your areal unit feature class. This feature class isn't strictly necessary for the analysis, but it has all sorts of applications within and beyond running an AWR. For this parameter, you provide a name and the script will write the intersect feature class with that name.

Statistic Field(s) – the field(s) that you actually want to aggregate (SUM) in the output feature class. These fields hold the data of interest contained in your data feature class. For example, if I wanted to aggregate the number of Black, White, Hispanic, and Asian people by police beat, I would select the fields "BLACK", "WHITE", "HISPANIC", and "ASIAN" for this parameter. Note that these fields should be numeric (the tool would likely not work on string fields, although I can't say I've tested that).

Output Feature Class – the output of our areal-weighted reaggregation. The output feature class of our AWR will provide the statistics specified in the parameter above reaggregated by the areal units specified in the Dissolve Field parameter. These statistics will appear with a prefix "SUM_" and a suffix "_awr". For example, performing an AWR on the Chicago data outputs the following fields in the output feature class: "SUM_WHITE_awr", "SUM_BLACK_awr", "SUM_HISPANIC_awr", "SUM_ASIAN_awr", "SUM_MULT_RACE_awr."
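For reference, the order of these parameters matters when you wire up the script tool in ArcToolbox – they correspond, one for one, to the GetParameterAsText indices in the script below (the variable names here are just labels for illustration):

workspace      = arcpy.GetParameterAsText(0)   # Workspace
feature_data   = arcpy.GetParameterAsText(1)   # Data Feature Class
areal_unit     = arcpy.GetParameterAsText(2)   # Areal Unit Feature Class
dissolve_field = arcpy.GetParameterAsText(3)   # Dissolve Field
intersect_fc   = arcpy.GetParameterAsText(4)   # Intersect Feature Class (output)
calc_fields    = arcpy.GetParameterAsText(5)   # Statistic Field(s)
output_fc      = arcpy.GetParameterAsText(6)   # Output Feature Class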

To ensure that the tool actually worked on more than just the Chicago data, I ran it on a similar dataset for North Carolina and Indiana. In this case, I used my custom AWR tool to take the number of Black and White people in each Census block group and reaggregate those counts by house district. The results are pictured below.

Notice the fields in the attribute table "SUM_WHITE_awr", "SUM_BLACK_awr", etc. These are the outputs of the tool and are visualized on the map on screen. Notice as well that I included the number of males and females as statistics in my AWR – you can imagine, however, that the distribution of men and women might be more homogeneous than that of race.

Ah, the sweet feeling of success when a custom tool runs correctly. Notice too that the tool is quite a behemoth – it took over 5 minutes to run on this dataset. Be aware that as the number of features in your feature classes increases, so does the time it takes the script to run.

Here is the good ol' Chicago map once again. This time, however, I used my custom AWR tool to perform the analysis rather than the specific script from Project #1. Comparing the two attribute tables, we see that both the script from Project #1 and the AWR tool produce the same results. The GUI on the left is the AWR tool come to life, and the one on the right shows the Properties of the AWR tool.

Ultimately, it was a remarkably satisfying experience to write this tool and see it come to life. It was also a terrific lesson in abstracting a specific problem into something that can be applied more universally. If you have further questions about how this tool works or how to get it to work in your own ArcMap project, feel free to get in touch on the Contact page. Check back in a few weeks for an update on my upcoming work with relational databases and my presentation at AAG! Below is the full code written out.

# Name: ArealWeightedReaggregation.py
# Description: Perform an areal-weighted reaggregation given two input polygon vectors. Statistics and dissolve fields must be specified in parameters.
# Author: Parker Ziegler

# Import system modules.

import arcpy
import os

# Set environments and output workspace.

from arcpy import env
env.overwriteOutput = True
env.workspace = arcpy.GetParameterAsText(0)

# Set local variables to the polygon feature class with the data of interest (feature data) and the polygon feature class with the new areal units (areal units). Specify the dissolve field in the areal units feature class that will be used to aggregate the data.

feature_data = arcpy.GetParameterAsText(1)
areal_unit = arcpy.GetParameterAsText(2)
dissolve_field = arcpy.GetParameterAsText(3)

# Add a field called "AREA_M" that specifies the area of each polygon in the polygon feature class with your data of interest (feature data). This will be used in a later step to calculate an area ratio.

arcpy.AddField_management(feature_data, "AREA_M", "DOUBLE", 10)
arcpy.CalculateField_management(feature_data, "AREA_M", '!shape.geodesicArea@squaremeters!', "PYTHON_9.3")

# Intersect the feature data polygons with the areal units polygons.

intersect_fcs = [feature_data, areal_unit]
out_fcs = arcpy.GetParameterAsText(4)
arcpy.Intersect_analysis(intersect_fcs, out_fcs)

# Now, create a field called "AREA_RATIO" that will divide the areas of the intersected polygons by the areas of the feature data polygons.

arcpy.AddField_management(out_fcs, "AREA_RATIO", "DOUBLE", 10)
arcpy.CalculateField_management(out_fcs, "AREA_RATIO", '!shape.geodesicArea@meters!/!AREA_M!', "PYTHON_9.3")

# Next, multiply this ratio by the statistic(s) of interest. This will calculate the value of the statistic(s) for each intersect polygon.

# First, create a field(s) that will hold each new statistic.

# Set local variables.
calc_fields = arcpy.GetParameterAsText(5)
fieldList = calc_fields.split(";")
fieldList.sort()
field_type = "DOUBLE"
field_length = 10

for field in fieldList:
    arcpy.AddField_management(out_fcs, field + '_awr', field_type, field_length)

# Now, calculate the values for all five fields using the ratio created above and populate the fields created from the previous step.

awr_fields = []
awrList = arcpy.ListFields(out_fcs, "*_awr")
for field in awrList:
    awr_fields.append(field.name)
awr_fields.sort()
i = 0
for field in awr_fields:
    arcpy.CalculateField_management(out_fcs, awr_fields[i], '!' + fieldList[i] + '! * !AREA_RATIO!', "PYTHON_9.3")
    i += 1

# The last step in the AWR is to dissolve our intersected features along the boundaries of the new areal units. This will reaggregate our data by these new areal units.

# Define variables.

in_features = arcpy.GetParameterAsText(4)
out_features = arcpy.GetParameterAsText(6)
field_names = []
stat_fields = arcpy.ListFields(in_features, "*_awr")
for field in stat_fields:
    field_names.append(field.name)
field_names.sort()
statList = [field + " SUM" for field in field_names]
dissolveStats = "; ".join(statList)

# Run the Dissolve tool.

arcpy.Dissolve_management(in_features, out_features, dissolve_field, dissolveStats)

Project #1: Python Scripting for Automated Geoprocessing (or, How to Make Magic Happen)

When I first developed the syllabus for my senior research, I spent a lot of time asking fellow GIS students, professors, and analysts where they thought I should begin my study. And nearly everyone gave me the same answer – Python scripting.

Python scripting in GIS is primarily desirable because it allows you to automate almost all of the geoprocessing tasks you do in GIS without so much clicking. Rather than moving tediously from one GUI (Graphical User Interface, those pop-up boxes where you tell a tool what files and parameters to use) to the next or attempting to build a flawless model from scratch, a Python script allows you to write code that is flexible, dynamic, and, most importantly, adaptable to a variety of contexts. Let's say, for example, that you needed to project 1,000 feature classes or run a particular technique on a group of shapefiles 500 times. Doing this manually would likely take days, if not weeks, of work. Conversely, a Python script allows you to direct the computer to a group of files and say, "Hey, could you bring these files into a GIS, run x, y, or z operations on them, and tell me what happens?" For me, Python scripts evoke a sensation of performing magic (particularly when layers you only conceived of creating actually start appearing on screen – ok, that's it for the meta stuff for now).
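Just to give a flavor of what that kind of batch job looks like, here is a minimal sketch (the folder path and the choice of coordinate system are made up for illustration) that projects every shapefile in a workspace in one pass:

import arcpy
from arcpy import env

# Hypothetical folder containing the shapefiles to project.
env.workspace = r"C:/data/shapefiles_to_project"
env.overwriteOutput = True
out_cs = arcpy.SpatialReference("NAD 1983 UTM Zone 16N")

# Loop over every feature class in the workspace and write a projected copy.
for fc in arcpy.ListFeatureClasses():
    out_fc = fc.replace(".shp", "_project.shp")
    arcpy.Project_management(fc, out_fc, out_cs)
    print out_fc + " written."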

Another beautiful aspect of Python scripting is the ability to add a lot of creativity to your own geospatial analysis. Before I started scripting with Python, I didn't realize how much unseen functionality the folks at Esri and the QGIS Project had programmed into their geoprocessing tools, how many little details are taken care of for you. On the one hand, this is part of what makes GIS a lot more user-friendly and approachable nowadays – you don't have to worry too much about the nitty-gritty and you don't need a programming background to get started. However, I also think working exclusively with the built-in functionality of ArcGIS or QGIS confines you somewhat to very specific techniques or routines (which, ultimately, produces an inundation of very similar types of maps).

So, for my first project, I decided to tackle something that felt ambitious for a self-taught programmer like myself: could I write a Python script that would make this map?

I made this map towards the end of last semester using data and a workflow courtesy of Professor Jeff Howarth in the Middlebury Geography Department. I did deviate in considerable ways from the original method for making this map, and dressed it up with my own cartography, but it was in many ways a standard GIS procedure for me. I used GUIs, Esri geodatabases, all of the built-in functionality I could get my hands on. The challenge now would be to see if I could do the same without so much support.

Fortunately, I stumbled upon a solid book to help me develop my scripting capabilities quickly: Python Scripting for ArcGIS by Paul A. Zandbergen. The book focuses on using the ArcPy site package with ArcGIS to write effective, efficient code. It acts more as a conceptual skeleton than a how-to guide, giving you the initial tools you need to start exploring ArcPy and developing your own scripts. For this first project, I really wanted to stick with ArcPy because 1) it has a whole slew of solid methods and functions custom built for geoprocessing, 2) it works seamlessly with standard Python functions (i.e. you can nest 'regular' Python functions inside larger ArcPy functions), and 3) there is well-written, easy-to-use documentation available from Esri. At the same time, there is some dissonance between the stated goals of my research and the choice to use ArcPy. It isn't open source (it is built directly into ArcGIS, a very expensive piece of software), it has a good deal of custom functionality that still obscures what's going on "under the hood", and code written with ArcPy can't be run by folks outside the Esri suite of products. So I want to take a moment to acknowledge the benefits and drawbacks of this decision, and to recognize that this tension between convenience and accessibility is palpable in the GIS world today and demands further investigation – something I will continue to do throughout this semester. Now, on to the results!

Click here to access the script I used to make this map. Note that this script is specific to the files and databases written to my personal drive at Middlebury. If you want to replicate this map, you will have to grab the data yourself (don't worry, I give you some sources where you can find it in the comments at the beginning of the script).

This script took a fair amount of troubleshooting to actually work through. ArcPy requires very specific syntax and does not like to perform operations on conflicting data types (trust me, I found out the hard way). This is true of most programming languages, but what is perhaps more difficult with ArcPy is that you have to do a fair amount of data type conversion and concatenation to actually get seemingly simple calculations to go through. It's also difficult to iterate through individual feature classes to do specific operations on each of them. You'll notice that I used multiple for loops with lists and indexes to take care of this problem; you could also use cursors, although I find these to be a bit more cumbersome.
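For what it's worth, here is a rough sketch of the cursor alternative I mention, using the arcpy.da module (available in ArcGIS 10.1 and later) to do the same field math as one of the CalculateField loops later in the script – the feature class and field names are the ones that script creates:

import arcpy

# Sketch of the cursor approach: calculate black_pop = BLACK * area_ratio row
# by row in the intersect feature class, rather than building an expression
# string for CalculateField. Assumes env.workspace points at the geodatabase.
with arcpy.da.UpdateCursor("blk_grps_beats_intersect",
                           ["BLACK", "area_ratio", "black_pop"]) as cursor:
    for row in cursor:
        row[2] = row[0] * row[1]
        cursor.updateRow(row)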

Getting this script to work was a remarkably satisfying moment for me. The only items the script starts with are a .mxd file (ArcMap Document file), a shapefile of police beats for the city of Chicago, a shapefile of all Census block groups for the United States, and a .csv file (Comma Separated Value file) with XY coordinate data for arrests for possession of crack cocaine. This means that the script takes care of:

  • Setting Environments
  • Creating a file geodatabase
  • Performing a select by location to get block groups only for the city of Chicago
  • Making an XY Event Layer to display the crime data and writing this data to a new feature class
  • Converting all shapefiles to feature classes in a geodatabase
  • Projecting all feature classes
  • Performing a spatial join
  • Running an areal-weighted reaggregation to get the number of people of each race by police beat across the city (this is the entire second half of the script and encompasses about 8 separate operations...but more on this next post)

My hope is that in making this script available I can help others to write similar scripts that also take on the challenge of visualizing social injustice. While I am proud of the accomplishment, I am more excited about the story that this map tells – one of racially biased policing in a city that has seen too much violence. And so I am reminded that this thing I have made, this script, is only really as valuable as the story it helps to tell and the minds it helps to change. Below, you will find the map that this script produces (Note: you will have to do the symbology manually to get your map to look like mine).

And here is the code, fully written out, that made this map.

# NAME: AWR_Script
# Description: Run an areal-weighted reaggregation (AWR) on crime data and racial Census data for the city of Chicago given a .csv file of crimes, a shapefile for Census block groups, and a shapefile for CPD police beats.
# Author: Parker Ziegler

# Preparatory Steps
# Create a .mxd file to hold your map (i.e. "Chicago_Arrest_Map.mxd")
# Create a .csv file of the point data you're interested in. In this case, I used data showing the locations of arrests for crack cocaine possession in Chicago, courtesy of the Chicago Data Portal.
# You'll need to acquire shapefiles of the City of Chicago police beats as well as Census data. These can be accessed via the Chicago Data Portal and census.gov, respectively.
# Place all files and their metadata into your workspace folder.

# Import system modules.

import arcpy
import os

# Set environment workspace, enable file overwrite.

from arcpy import env
env.workspace = r"W:/GG700/Chicago_Arrests"
env.overwriteOutput = True
print "Env is set to Chicago_Arrests."

# Create a file geodatabase to store your data

out_folder_path = r"W:/GG700/Chicago_Arrests"
out_name = "Chicago_Arrest_Data.gdb"
arcpy.CreateFileGDB_management(out_folder_path, out_name)
print "Geodatabase created."

# Before starting analysis, we need to clean up our Census layer. In order to run the SelectLayerByLocation tool we first have to create a layer file from our Census shapefile.

arcpy.MakeFeatureLayer_management("block_groups.shp", "block_groups_lyr")
print "Census layer created."

# Perform a select by location on the Census data to get only the blocks groups that intersect the CPD police beats.

arcpy.SelectLayerByLocation_management("block_groups_lyr", "INTERSECT", "police_beats.shp")
print "Intersecting features selected."

# Write these block groups to a new shapefile.
arcpy.CopyFeatures_management("block_groups_lyr", "chicago_blk_grps.shp")
print "Features copied to shapefile."

# Display the Crack_XYData.csv as points and write a point feature class.
 
# Set variables.

in_table = "Crack_XYData.csv"
x_coords = "Longitude"
y_coords = "Latitude"
out_layer = "crack_as_points"
saved_layer = "crack_as_points.lyr"
sr = arcpy.SpatialReference("WGS 1984")

# Make the XY Event Layer

arcpy.MakeXYEventLayer_management(in_table, x_coords, y_coords, out_layer, sr)
print "XY Event Layer created."

# Create an output layer file (.lyr) file.

arcpy.SaveToLayerFile_management(out_layer, saved_layer)
print "Layer file created."

# Copy the layer file to a shapefile (.shp).

arcpy.CopyFeatures_management(saved_layer, "crack_as_points.shp")
print "Shapefile successfully created."

# Move your "crack_as_points.shp", your "police_beats.shp", and your "chicago_blk_grps.shp" into your file geodatabase.

in_features = ["crack_as_points.shp", "police_beats.shp", "chicago_blk_grps.shp"]
out_gdb = r"W:/GG700/Chicago_Arrests/Chicago_Arrest_Data.gdb"
arcpy.FeatureClassToGeodatabase_conversion(in_features, out_gdb)
print "Shapefiles have been converted to feature classes in the geodatabase."

# Now, get all your feature classes into the same spatial reference.

env.workspace = out_gdb
infcList = arcpy.ListFeatureClasses()
infcList.sort()
outfcList = ["chicago_blk_grps_project","crack_as_points_project", "police_beats_project"]
outCS = arcpy.SpatialReference("NAD 1983 UTM Zone 16N")
i = 0
for infc in infcList:
    arcpy.Project_management(infc, outfcList[i], outCS)
    i += 1

print "All layers are in the same coordinate system."

# By now, your three features classes should be in the same coordinate system. Now, perform a spatial join to get the number of crimes in each police beat.

# Join "crack_as_points_project" (points) to "police_beats_project" (polygons).
# Define variables.

target_features = "police_beats_project"
join_features = "crack_as_points_project"
outfc = "police_beats_crack_joined"

# Run the Spatial Join tool.

arcpy.SpatialJoin_analysis(target_features, join_features, outfc)
print "Crack points have been joined to police beats."

# Performing a spatial join will automatically create a field labeled "Join_Count"; use this field, as well as area information from your police beats feature class, to create a new field that gives arrests / sq. km.

# Add fields for police beat geometry and crime density.

arcpy.AddField_management("police_beats_crack_joined", "area_beats", "FLOAT", 9)
arcpy.AddField_management("police_beats_crack_joined", "arrest_dns", "FLOAT", 9)
print "Two fields have been added."

# Calculate geometry, use this value to calculate a crime density for each police beat.

arcpy.CalculateField_management("police_beats_crack_joined", "area_beats", '!shape.geodesicArea@squarekilometers!', "PYTHON_9.3")
arcpy.CalculateField_management("police_beats_crack_joined", "arrest_dns", 'float(!Join_Count!)/float(!area_beats!)', "PYTHON_9.3")
print "Crime density field has been calculated."

# Write this joined value to a new feature class, as joins are stored only in memory.

arcpy.CopyFeatures_management("police_beats_crack_joined", "arrest_density")
print "Arrest Density feature class written to geodatabase."

# You have now successfully created the choropleth layer representing arrest density for possession of crack cocaine by CPD police beat. The next part of the workflow involves performing an areal-weighted reaggregation on our Census data to obtain the number of people living in each police beat by race.

# Calculate the area of each Census block group.

arcpy.AddField_management("chicago_blk_grps_project", "a_blk_grps", "DOUBLE", 10)
arcpy.CalculateField_management("chicago_blk_grps_project", "a_blk_grps", '!shape.geodesicArea@squarekilometers!', "PYTHON_9.3")
print "Area of each block group calculated."

# Intersect the Census block groups with the police beats.

in_features = ["chicago_blk_grps_project", "police_beats_project"]
out_features = "blk_grps_beats_intersect"
arcpy.Intersect_analysis(in_features, out_features)
print "Intersect completed."

# Now, create a field that will divide the areas of the intersected features by the areas of the Census block groups.

arcpy.AddField_management(out_features, "area_ratio", "DOUBLE", 10)
arcpy.CalculateField_management(out_features, "area_ratio", '!shape.geodesicArea@squarekilometers!/!a_blk_grps!', "PYTHON_9.3")
print "Area ratio calculated."

# Next, multiply this ratio by the number of Black, White, Hispanic, Asian, and Multiple Race people in each Census block group. This will give us the number of people of each race in each intersect feature.

# Add five fields, one for each race.

fieldList = ["black_pop", "white_pop", "hisp_pop", "asian_pop", "mult_pop"]
field_type = "DOUBLE"
field_length = 10
for field in fieldList:
    arcpy.AddField_management(out_features, field, field_type, field_length)
print "All fields for racial data added."

# Calculate the values for all five fields using the ratio created above and the Census data included from the intersect.

calcList = ["BLACK", "WHITE", "HISPANIC", "ASIAN", "MULT_RACE"]
i = 0
for field in fieldList:
    arcpy.CalculateField_management(out_features, fieldList[i], '!' + calcList[i] + '! * !area_ratio!', "PYTHON_9.3")
    i += 1
print "All fields for racial data calculated."

# The last step in the AWR is to dissolve our intersected features along the boundaries of the police beats. This will reaggregate our racial data by police beats, which we will represent using a dot density.

# Define variables.
 
in_features = "blk_grps_beats_intersect"
out_features = "race_by_beats"
dissolve_field = "BEAT_NUM"
stats_fields = "black_pop SUM; white_pop SUM; hisp_pop SUM; asian_pop SUM; mult_pop SUM"

# Run the Dissolve tool.

arcpy.Dissolve_management(in_features, out_features, dissolve_field, stats_fields)
print "Dissolve completed."

# The script is now complete. The final map should be composed of two layers – "arrest_density" (a choropleth density of arrests for possession of crack cocaine) and "race_by_beats" (a dot density representation of the racial makeup of Chicago). You may also want to add a base map depending on your purposes.
# Use the symbology GUI to change the symbology and set boundaries as you wish. I recommend doing most of your design work in Adobe Illustrator.