Things that we tried but decided were not good ideas.
Using more aggressive calling conventions/inlining in ceval
The Py_LOCAL/Py_LOCAL_INLINE macros can be used instead of static to force the use of a more efficient C calling convention, on platforms that support that. We tried applying that to ceval, and saw small speedups on some platforms, and somewhat larger slowdowns on others.
Adding the ability to pre-allocate builtin list storage. At best we can speed up appends by about 7-8% for lists of 50-100 elements. For the large part the benefit is 0-2%. For lists under 20 elements, peformance is actually reduced when pre-allocating.
CPython spends considerable time moving exception info around among thread states, frame objects, and the sys module. This code is complicated and under-documented. Patch 1145039 took a stab at reverse-engineering a weak invariant, and exploited it for a bit of speed. That worked fine so far as it went (and a variant was checked in), but it's likely more remains to be gotten. Alas, the tim-exc_sanity branch set up to try that consumed a lot of time fighting mysteries, and looks like it's more trouble than it's worth.
Singleton of StopIteration
As part of the new exceptions implementation, we tried making a singleton StopIteration instance. No speedup was detected. This is primarily due to most uses of StopIteration using the type object directly (ie "raise StopIteration" vs. "raise StopIteration()"). Even for a crafted test case where the instance use was forced there was no detectable change in speed.
Making a PyDict_GET_SIZE like PyTuple_GET_SIZE doesn't give a measurable improvement in pybench or pystone. This is likely because the compiler notices that those functions that use it have alreaday done NULL checks and frequently PyDict_Check so we aren't telling it anything it didn't already know.
Conversely changing all Py(Tuple|List)_GET_SIZE to point to plain Size has no measurable slowdown! Well, in the range of 0.5%, which may just be noise. Switching the #define to the real functions generates some spurious warnings because the regular methods expect PyObjects and not the more specific types.
One man-day was spent trying to seperate the dicts used in namespaces (module.dict, instance/type/class dicts) the result was changing over a quarter of the PyDict* macros in the trunk to PySymdict (over half if you exclude Modules/). This was such a massive change it was abandoned after sprint Day1.
Copyless network I/O
A "hot buffer" class was implemented in order to perform I/O and parsing (using struct) without creating or copying strings around. This is compared for performance with a loop that creates strings via slices and concatenates strings. It turns out that the hot buffer version is not much faster (about the same speed) because it replaces string allocations and concatenations with dict lookup in order to access its attributes (position and limit). The version that uses strings is pretty fast since we write it using the string protocol, i.e. no dict lookup is performed. However, by implementing the common use patterns in C we should be able to make parsing of common input (e.g. netstrings) much faster. The hot buffer provides a more intuitive interface to parsing. The fact is that currently the performance gains with the initial version are not significant.