Things that we tried but decided were not good ideas.
Adding the ability to pre-allocate builtin list storage. At best we can speed up appends by about 7-8% for lists of 50-100 elements. For the most part the benefit is 0-2%. For lists under 20 elements, performance is actually reduced when pre-allocating.
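To get a feel for this kind of measurement, here is a rough sketch using timeit. It is not the patch that was tried: the experiment changed builtin list storage itself, whereas this stand-in (an assumption for illustration) compares plain append() against filling a list created with [None] * n. The function names are hypothetical, and the timings will vary by machine and Python version.

```python
import timeit


def build_with_append(n):
    """Grow the list one append at a time (storage is resized as needed)."""
    out = []
    for i in range(n):
        out.append(i)
    return out


def build_preallocated(n):
    """Stand-in for pre-allocation: create all n slots first, then fill them."""
    out = [None] * n
    for i in range(n):
        out[i] = i
    return out


def compare(n, repeat=5, number=2000):
    """Return the best-of-`repeat` timings (append, preallocated) in seconds."""
    t_append = min(timeit.repeat(lambda: build_with_append(n),
                                 repeat=repeat, number=number))
    t_prealloc = min(timeit.repeat(lambda: build_preallocated(n),
                                   repeat=repeat, number=number))
    return t_append, t_prealloc


if __name__ == "__main__":
    # Sizes chosen to bracket the ranges discussed above.
    for n in (10, 50, 100):
        a, p = compare(n)
        print(f"n={n:4d}  append={a:.4f}s  prealloc={p:.4f}s  ratio={p / a:.2f}")
```

Taking the minimum over several repeats is the usual way to reduce noise, which matters when the differences being measured are only a few percent.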
CPython spends considerable time moving exception info around among thread states, frame objects, and the sys module. This code is complicated and under-documented. Patch 1145039 took a stab at reverse-engineering a weak invariant and exploited it for a bit of speed. That worked fine so far as it went (and a variant was checked in), but more speed can probably still be gained. Alas, the tim-exc_sanity branch set up to try that consumed a lot of time fighting mysteries, and looks like it's more trouble than it's worth.
Singleton of StopIteration
As part of the new exceptions implementation, we tried making a singleton StopIteration instance. No speedup was detected. This is primarily because most uses of StopIteration use the type object directly (i.e. "raise StopIteration" vs. "raise StopIteration()"). Even for a crafted test case where the instance use was forced, there was no detectable change in speed.
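The two spellings can be compared from Python code with a small sketch. The class and function names here are hypothetical, and it only shows what a handler observes; it says nothing about how the interpreter handles the two forms internally. In current Python 3, a handler that binds the exception with "as" sees a StopIteration instance either way, and each raise yields a separate object rather than a shared singleton.

```python
class RaisesType:
    """Iterator whose __next__ raises the exception type directly."""

    def __iter__(self):
        return self

    def __next__(self):
        raise StopIteration


class RaisesInstance:
    """Iterator whose __next__ raises an explicitly created instance."""

    def __iter__(self):
        return self

    def __next__(self):
        raise StopIteration()


def caught(iterator_cls):
    """Call next() once and return the exception object the handler sees."""
    try:
        next(iterator_cls())
    except StopIteration as exc:
        return exc
    return None


if __name__ == "__main__":
    for cls in (RaisesType, RaisesInstance):
        exc = caught(cls)
        print(cls.__name__, type(exc).__name__, isinstance(exc, StopIteration))
```

Both iterators behave identically in a for loop or list(); the difference is only in how the exception is spelled at the raise site.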
Making a PyDict_GET_SIZE macro like PyTuple_GET_SIZE doesn't give a measurable improvement in pybench or pystone. This is likely because the compiler notices that the functions that use it have already done NULL checks, and frequently PyDict_Check, so we aren't telling it anything it didn't already know.
Conversely, changing all Py(Tuple|List)_GET_SIZE to point to the plain Py(Tuple|List)_Size functions has no measurable slowdown! Well, in the range of 0.5%, which may just be noise. A few calls to the Tuple/List GET_SIZE macros are also in error; they work because ->ob_size is the right thing to get, but they aren't actually accessing Py(Tuple|List)Object variables (may just need a cast).