Differences between revisions 3 and 4
Revision 3 as of 2005-01-09 04:04:10
Size: 12817
Editor: user-12l2d8t
Comment:
Revision 4 as of 2005-02-15 23:10:24
Size: 12914
Editor: user-12l2d8t
Comment:
Deletions are marked like this. Additions are marked like this.
Line 5: Line 5:

My company can be found at http://www.imperialdata.net all the development I do is in python.

Chad Crabtree

I reside in Livonia, Michigan and I work at a coffee shop. I have been using python for about 5 years after reading How to Become a Hacker article. I have thus far written about 10k loc and feel that I'm finally starting to get this programing thing. I recently took part in a Code Clinic from the Tutor mailing list check it out on BrianvandenBroek's page.

My company can be found at http://www.imperialdata.net all the development I do is in python.

Random Writer Code Clinic


My Analysis

Thanks for your implementations. I looked it over interesting how differently we all ended up doing this and how similar they where also. I think Christian’s object approach is the most elegant I've seen thus far for this problem. Mine is object based with procedural driver code. David’s is purely procedural. Brian’s was objective. I did a little bench marking for the fun of it and I wanted to know why my implementation was so much slower than Christians and about the same as David’s, I figured when I wrote this it would be very competitive on the timing because I only needed to loop through the file once. However the initial benchmark gave the below results with the parameters (5 500 tom.txt out.txt) Mine: 2.41900018019 David's: 2.3409288052 Christian's:1.92717454313 Brian’s: 8.40194857168

I was thinking why are mine about the same even though I pre-cached all seed combinations? After looking through their code I saw that they looped through the file each time they needed a new character which is how I did it at first, however I did it like this approximately

s[n:n+len(seed)] for n,v in enumerate(s) if s[n:n+len(seed)]==seed]

I did a pure python loop instead of the implied C loops that the other two did. This is why their implementation of looping through the file for each character is pretty fast.

So I thought 'Ah-ha I know how to hit their algorithms.’ With the parameters of (10 5000 tom.txt out.txt) I got theses results. Mine:3.67060087246 David's:18.8382646398 Christian's:15.3224465679 Brian’s: 143.980883604

Now here's some profiles with the settings (10 5000 tom.txt out.txt)

Mine:

        9988 function calls in 3.856 CPU seconds 

  Ordered by: standard name 

  ncalls  tottime  percall  cumtime  percall filename:lineno(function) 

       1    0.000    0.000    3.829    3.829 <string>:1(?) 
       1    0.026    0.026    3.856    3.856 profile:0(main()) 
       0    0.000             0.000          profile:0(profiler) 
    4991    0.057    0.000    0.057    0.000 random.py:229(choice) 
       1    2.637    2.637    2.637    2.637 randomwriter.py:11(index_text) 
       1    0.040    0.040    0.040    0.040 randomwriter.py:24(get_first_seed) 
    4990    0.180    0.000    0.237    0.000 randomwriter.py:36(get_next_letter) 
       1    0.000    0.000    2.637    2.637 randomwriter.py:4(__init__) 
       1    0.074    0.074    2.989    2.989 randomwriter.py:51(buildoutput) 
       1    0.840    0.840    3.829    3.829 randomwriter.py:59(main) 

David's:

        15010 function calls in 18.463 CPU seconds 

Ordered by: standard name 

  ncalls  tottime  percall  cumtime  percall filename:lineno(function) 
       1    0.000    0.000   18.436   18.436 <string>:1(?) 
       1    0.027    0.027   18.463   18.463 profile:0(main()) 
       0    0.000             0.000          profile:0(profiler) 
    5001    0.115    0.000    0.115    0.000 random.py:229(choice) 
       1    0.000    0.000    0.000    0.000 random.py:90(seed) 
       1    0.096    0.096   18.436   18.436 rwspiffy.py:102(main) 
       1    0.001    0.001    0.001    0.001 rwspiffy.py:38(validateArgs) 
       1    0.000    0.000    0.175    0.175 rwspiffy.py:65(newSeed) 
    5000   17.773    0.004   17.773    0.004 rwspiffy.py:76(nextAfterSeed) 
    5000    0.277    0.000   18.164    0.004 rwspiffy.py:85(doSeed) 
       1    0.000    0.000    0.000    0.000 sets.py:119(__iter__) 
       1    0.175    0.175    0.175    0.175 sets.py:356(_update) 
       1    0.000    0.000    0.175    0.175 sets.py:425(__init__) 

Christian's:

        29958 function calls in 16.217 CPU seconds 

  Ordered by: standard name 

  ncalls  tottime  percall  cumtime  percall filename:lineno(function) 
       1    0.000    0.000   16.050   16.050 <string>:1(?) 
       1    0.166    0.166   16.217   16.217 profile:0(main()) 
       0    0.000             0.000          profile:0(profiler) 
       1    0.000    0.000    0.000    0.000 random.py:135(randrange) 
       1    0.000    0.000    0.000    0.000 random.py:198(randint) 
    4990    0.096    0.000    0.096    0.000 random.py:229(choice) 
       1    0.001    0.001   16.050   16.050 randomwrite.py:8(main) 
       1    0.000    0.000    0.000    0.000 randomwriter.py:28(_selectSeed) 
    4990   15.404    0.003   15.404    0.003 randomwriter.py:33(_getMatches) 
       1    0.004    0.004    0.005    0.005 randomwriter.py:4(__init__) 
    4990    0.094    0.000    0.094    0.000 randomwriter.py:43(_getSubChars) 
    5000    0.075    0.000    0.075    0.000 randomwriter.py:55(_writeChar) 
    4990    0.046    0.000    0.046    0.000 randomwriter.py:60(_updateSeed) 
    4990    0.281    0.000   15.995    0.003 randomwriter.py:64(Step) 
       1    0.049    0.049   16.045   16.045 randomwriter.py:74(Run) 

Brian’s:

         81353 function calls in 141.329 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.089    0.089  141.303  141.303 <string>:1(?)
        1    0.000    0.000    0.000    0.000 brian_random_writer.py:177(__init__)
        1   50.521   50.521  141.208  141.208 brian_random_writer.py:192(get_randomwriter_text)
        1    0.001    0.001    0.001    0.001 brian_random_writer.py:228(get_input)
        1    0.116    0.116    0.130    0.130 brian_random_writer.py:255(get_orig_text)
        1    0.000    0.000    0.000    0.000 brian_random_writer.py:269(get_novel_seed)
    27871    0.908    0.000   90.555    0.003 brian_random_writer.py:287(get_new_character)
    25600   89.043    0.003   89.043    0.003 brian_random_writer.py:316(get_choices)
        1    0.006    0.006  141.214  141.214 brian_random_writer.py:365(main)
        1    0.014    0.014    0.014    0.014 brian_random_writer.py:79(reader)
        1    0.026    0.026  141.329  141.329 profile:0(main())
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.000    0.000 random.py:135(randrange)
        1    0.000    0.000    0.000    0.000 random.py:198(randint)
    27871    0.604    0.000    0.604    0.000 random.py:229(choice)

I'm going to paste the code from the hot spots identified here.

This is the one that came with mine.

   def index_text(self): 
       index={} 
       textsize=len(self.text) 
       for n,v in enumerate(self.text): 
           if n+self.size<=textsize: 
               seed=self.text[n:n+self.size] 
           else: 
               seed=self.text[n:textsize] 
           if index.has_key(seed): 
               index[seed].append(n) 
           else: 
               index[seed]=[n] 
       return index 

The reason I ended up doing this is because my first time solving the problem I got correct results but terrible execution time so I profiled it to find out what the problem was. I was using a loop very similar to the one I'm using above. Before I was doing this for each character, instead I cached my results here and used them on every seed look up. Every possible seed is already in the dictionary. Apparently dictionary lookups are quite fast.

This is David's hot spot.

def nextAfterSeed(seed): 
   """ takes seed to Sort and find probability for next character based on source """ 
   # abstracted for onceandonlyonce principle from doSeed 
   tempvariable = sourcefile.split(seed) 
   del tempvariable[0] # first entry of split sourcefile is not a probable character 
   # OneLiner, return list of first characters in tempvariable, no blanks 
   return [item[:1] for item in tempvariable if len(item) > 0] 

I thought this was interesting and it never crossed my mind, and gives good performance on small values because it uses implied 'C' based loops. The one I used was python only so it was very slow.

Where he loops through the file each time for each character needed in the output. This makes good sense split the file on the seed then the first character of each subsequent string is the one that is wanted. I have a feeling the string slices are slowing this one down a little I'm not really sure.

This is Christian's hot spot.

   def _getMatches(self): 
       """build a list of indexes of the current seed in the text""" 
       matches = [] 
       match = self.text.find(self.seed) 
       matches.append(match) 
       while match != -1: 
           match = self.text.find(self.seed, match+1) 
           matches.append(match) 
       return matches 

I needed to look this up so I'm putting it here for the sake of others too.

find(sub[,start[,end]]) Return the lowest index in the string where substring sub is found, such that sub is contained in the range [start, end). Optional arguments start and end are interpreted as in slice notation. Return |-1| if sub is not found.

This is not so different in concept from the previous hot spot. Find the seed matches then keep their index then he passes this to the character grabber function that takes the index adds the seed length then take that character. It is fast enough because of the implied C loops also.

This is Brian’s hot spot.

    def get_choices(self):
        search_point = self.orig_text.find(self.seed, 0, -1)
        choices = []
        while True:
            if search_point == -1:
                if choices == []:
                    self.get_novel_seed
                else:
                    break 
            else:
                choices.append(
                             self.orig_text[search_point + len(self.seed)])
                search_point = self.orig_text.find(self.seed, search_point + 1, -1)
        return choices

This does not look to different from how Christian solved the problem. From the profile and comparing it to our profiles Brian’s has an unreasonable number of calls to the biggest hotspot. 27k calls to this function. To figure this out we need to look at function that is the second heavy hotspot, to look at why this is so.

def get_randomwriter_text(self):
        self.get_input()
        self.get_orig_text()
        word_count = old_word_count = 0
        self.get_novel_seed()
        while word_count < self.output_text_length:
            new_character = self.get_new_character()  # stored as used x2
            self.output_text = self.output_text + new_character
            old_word_count = word_count
            word_count = len(self.output_text.split())
            self.seed = self.seed[1:] + new_character\

I was baffled for a little while what was going on in the above code that caused all the function calls, then on the line

Word_count=len(self.output_text.split())

I noticed this would cause the output to be much longer. I changed the line in his program to not split() the string and got the correct output and a much speedier 1.77 seconds on 5 500 tom.txt. The rest of his algorithm was the same as ours just those 8 characters changed things and caused a great slow down. On the parameters 10 5000 tom.txt he got 16.5 seconds, he uses basically the same algorithm as Christian but he caches his previous results, so every time he gets a new seed he looks in the cache and if it already exists then he uses that if not he makes a new one and caches it. This causes his implementation sans .split() to be a bit faster than Christian’s. Yet another metric to look at when inspecting code is lines of code.

Brian’s code was 364 loc with a whole bunch of comments and doc strings. Christians Code was 128 loc with comments and spaces between functions. David’s was 124 with copious comments and good spacing. Mine was 108 after I took out all the extra white

you can find the originals here [http://www.imperialdata.net/randomwriter_clinic.zip Randomwriter Clinic Source]

Note: I did make changes to the sources in order to do the profiling and to use the timeit module.

  • In the case of Brians code I hard coded the paramaters in order to get accurate results for profile and timeit

Email: base64.decodestring('ZmxheGVhdGVyQHlhaG9vLmNvbQ==\n')

...


CategoryHomepage

ChadCrabtree (last edited 2008-11-15 14:00:12 by localhost)

Unable to edit the page? See the FrontPage for instructions.