Darwinweb http://www.darwinweb.net/ gabe@websaviour.com (Gabe da Silveira) en-us Convert Syck to Psych YAML format <p>If you deal with a lot of <span class="caps">YAML</span> in your Ruby code, especially migrating from Ruby 1.8.x to 1.9.2 and 1.9.3, then you may have run into some of the numerous issues with <span class="caps">YAML</span> incompatibility. I won&#8217;t bore you with a lot of specifics, because most of these bugs are transient, but here are the basic facts:</p> <ol> <li>Syck is the antiquated built-in <span class="caps">YAML</span> driver for Ruby.</li> <li>Psych is the new-fangled <span class="caps">YAML</span> driver for Ruby.</li> <li>Ruby 1.9.3 <em>always</em> defaults to Psych.</li> <li>Ruby 1.9.2 <em>may</em> default to Psych, depending on whether libyaml was available when Ruby was compiled.</li> </ol> <p>You can check which engine is active in Ruby 1.9.x like so:</p> <pre class="pastels_on_dark">1.9.3p327 :001 &gt; require &#39;yaml&#39; =&gt; true 1.9.3p327 :002 &gt; YAML =&gt; Psych </pre><p>Hopefully you are using Psych for everything, but if you have legacy Syck <span class="caps">YAML</span> files lying around you could be in for some pain, because they are not necessarily compatible. In my case, I had a bunch of i18n translation files emitted with Syck, which uses an incompatible escape-code structure instead of plain <span class="caps">UTF</span>-8, thereby rendering the files unreadable by Psych and humans alike. My solution was a little script utilizing the fact that the engine can be swapped dynamically:</p> <pre class="pastels_on_dark"><span class="Keywords">require</span> <span class="Strings"><span class="Strings">&#39;</span>yaml<span class="Strings">&#39;</span></span> <span class="Keywords">require</span> <span class="Strings"><span class="Strings">&#39;</span>fileutils<span class="Strings">&#39;</span></span> <span class="ControlStructures">if</span> <span class="Variables">ARGV</span>.empty? 
<span class="Variables"><span class="Variables">$</span>stderr</span>.puts <span class="Strings"><span class="Strings">&quot;</span>! Must pass one or more filenames to convert from Syck output to Psych output.<span class="Strings">&quot;</span></span> exit <span class="Numbers">1</span> <span class="ControlStructures">end</span> bad_files <span class="Operators">=</span> <span class="Variables">ARGV</span>.select{ |<span class="Variables">f</span>| <span class="Operators">!</span> File.exists?(f) } <span class="ControlStructures">if</span> bad_files.any? <span class="Variables"><span class="Variables">$</span>stderr</span>.puts <span class="Strings"><span class="Strings">&quot;</span>! Aborting because the following files do not exist:<span class="Strings">&quot;</span></span> <span class="Variables"><span class="Variables">$</span>stderr</span>.puts bad_files exit <span class="Numbers">1</span> <span class="ControlStructures">end</span> <span class="ControlStructures">def</span> use_syck <span class="Variables">YAML</span>::ENGINE.yamler <span class="Operators">=</span> <span class="Strings"><span class="Strings">&#39;</span>syck<span class="Strings">&#39;</span></span> <span class="Keywords">raise</span> <span class="Strings"><span class="Strings">&quot;</span>Oops! Something went horribly wrong.<span class="Strings">&quot;</span></span> <span class="ControlStructures">unless</span> <span class="Variables">YAML</span> <span class="Operators">==</span> <span class="Variables">Syck</span> <span class="ControlStructures">end</span> <span class="ControlStructures">def</span> use_psych <span class="Variables">YAML</span>::ENGINE.yamler <span class="Operators">=</span> <span class="Strings"><span class="Strings">&#39;</span>psych<span class="Strings">&#39;</span></span> <span class="Keywords">raise</span> <span class="Strings"><span class="Strings">&quot;</span>Oops! 
Something went horribly wrong.<span class="Strings">&quot;</span></span> <span class="ControlStructures">unless</span> <span class="Variables">YAML</span> <span class="Operators">==</span> <span class="Variables">Psych</span> <span class="ControlStructures">end</span> <span class="Variables">ARGV</span>.each <span class="ControlStructures">do </span>|<span class="Variables">filename</span>| <span class="Variables"><span class="Variables">$</span>stdout</span>.print <span class="Strings"><span class="Strings">&quot;</span>Converting <span class="Strings"><span class="Strings">#{</span>filename<span class="Strings">}</span></span> from Syck to Psych...<span class="Strings">&quot;</span></span> use_syck hash <span class="Operators">=</span> <span class="Variables">YAML</span>.load(File.read filename) FileUtils.cp filename, <span class="Strings"><span class="Strings">&quot;</span><span class="Strings"><span class="Strings">#{</span>filename<span class="Strings">}</span></span>.bak<span class="Strings">&quot;</span></span> use_psych File.open(filename, <span class="Strings"><span class="Strings">&#39;</span>w<span class="Strings">&#39;</span></span>){ |<span class="Variables">file</span>| file.write(<span class="Variables">YAML</span>.dump(hash)) } <span class="Variables"><span class="Variables">$</span>stdout</span>.puts <span class="Strings"><span class="Strings">&quot;</span> done.<span class="Strings">&quot;</span></span> <span class="ControlStructures">end</span> </pre> gabe@websaviour.com (Gabe da Silveira) Mon, 12 Nov 2012 23:35:00 +0000 http://localhost:3000/articles/convert-syck-to-psych-yaml-format http://localhost:3000/articles/convert-syck-to-psych-yaml-format Converting Hash to OrderedHash in Serialized ActiveRecord Column <p>Suppose you have an ActiveRecord class like this:</p> <pre class="pastels_on_dark"><span class="ControlStructures">class</span> Report &lt; ActiveRecord::Base serialize <span class="Constants"><span 
class="Constants">:</span>options</span>, <span class="Variables">Hash</span> <span class="ControlStructures">end</span> </pre><p>And, because you are foolishly still using Ruby 1.8 and doing direct comparisons of the serialized format, you need to change it to this:</p> <pre class="pastels_on_dark"><span class="ControlStructures">class</span> Report &lt; ActiveRecord::Base serialize <span class="Constants"><span class="Constants">:</span>options</span>, ActiveSupport::OrderedHash <span class="ControlStructures">end</span> </pre><p>The problem you face is that the change will cause any existing rows to start throwing <code>ActiveRecord::SerializationTypeMismatch</code>. The solution is to run something like this in a migration or console simultaneously with the deploy:</p> <pre class="pastels_on_dark"><span class="Variables"><span class="Variables">@</span>tables</span> <span class="Operators">=</span> <span class="Strings"><span class="Strings">%w(</span>reports<span class="Strings">)</span></span> <span class="Variables"><span class="Variables">@</span>tables</span>.each <span class="ControlStructures">do </span>|<span class="Variables">table</span>| puts <span class="Strings"><span class="Strings">&quot;</span>============== <span class="Strings"><span class="Strings">#{</span>table<span class="Strings">}</span></span> ==============<span class="Strings">&quot;</span></span> query <span class="Operators">=</span> <span class="Strings"><span class="Strings">&quot;</span>SELECT id,options FROM `<span class="Strings"><span class="Strings">#{</span>table<span class="Strings">}</span></span>` ORDER BY id<span class="Strings">&quot;</span></span> result <span class="Operators">=</span> ActiveRecord::Base.connection.execute(query) result.each <span class="ControlStructures">do </span>|<span class="Variables">row</span>| id <span class="Operators">=</span> row[<span class="Numbers">0</span>] hash <span class="Operators">=</span> <span class="Variables">YAML</span>.load(row[<span 
class="Numbers">1</span>]) <span class="ControlStructures">if</span> hash.is_a?(ActiveSupport::OrderedHash) puts <span class="Strings"><span class="Strings">&quot;</span>Skipping row #<span class="Strings"><span class="Strings">#{</span>id<span class="Strings">}</span></span> which is already an OrderedHash<span class="Strings">&quot;</span></span> <span class="ControlStructures">else</span> ordered_hash <span class="Operators">=</span> ActiveSupport::OrderedHash.new hash.keys.sort{ |<span class="Variables">a</span>,<span class="Variables">b</span>| a.to_s <span class="Operators">&lt;=&gt;</span> b.to_s }.each <span class="ControlStructures">do </span>|<span class="Variables">k</span>| ordered_hash[k] <span class="Operators">=</span> hash[k] <span class="ControlStructures">end</span> escaped_options <span class="Operators">=</span> ActiveRecord::Base.sanitize(ordered_hash.to_yaml) query <span class="Operators">=</span> <span class="Strings"><span class="Strings">&quot;</span>UPDATE `<span class="Strings"><span class="Strings">#{</span>table<span class="Strings">}</span></span>` SET options=<span class="Strings"><span class="Strings">#{</span>escaped_options<span class="Strings">}</span></span> WHERE id=<span class="Strings"><span class="Strings">#{</span>id<span class="Strings">}</span></span><span class="Strings">&quot;</span></span> puts <span class="Strings"><span class="Strings">&quot;</span>Running <span class="Strings">&quot;</span></span> <span class="Operators">+</span> query ActiveRecord::Base.connection.execute query <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> puts <span class="Strings"><span class="Strings">&quot;</span><span class="Strings">&quot;</span></span> <span class="ControlStructures">end</span> </pre> gabe@websaviour.com (Gabe da Silveira) Thu, 01 Nov 2012 04:11:00 +0000 http://localhost:3000/articles/converting-hash-to-orderedhash-in-serialized-activerecord-column 
http://localhost:3000/articles/converting-hash-to-orderedhash-in-serialized-activerecord-column Drupal vs Rails <p>During <a href="http://news.ycombinator.com/item?id=4604555">this discussion on HN</a>, I was surprised to learn that Drupal—the high-level framework I left behind years ago in favor of Rails—has a high-octane elite consulting industry behind it with the same prestige as, and perhaps even more clout than, the Rails community.</p> <p>This gave me pause, because for me Drupal was always the option for demanding clients with razor-thin budgets—the type who waxes on casually about a laundry list of features that sounds like Yahoo&#8217;s homepage and then comes with a four-figure budget as if that should amply fund a Facebook clone.</p> <p>But companies like Acquia are an obvious indicator of just how shallow and murky my end of the pond was when I first formed my opinion on Drupal. <strong>What are the true advantages of Drupal vs Rails for elite web consultants?</strong></p> <h2>Contrasts</h2> <p>Drupal has much more common functionality and it&#8217;s better integrated. Rails&#8217; functionality is more modular and lower-level.</p> <p>Drupal theming and layout are easier, getting you to a working site in minutes, and styling in general is much less work. Rails is more flexible, being <span class="caps">CSS</span>-framework agnostic and allowing minimalism or complexity to taste.</p> <p>Drupal has a heavy data model and a rather unique <span class="caps">API</span>, which allows modules to follow common patterns and work together more naturally. Rails starts with no data model, so you build from scratch, but you can tailor to your exact use case without any unused overhead.</p> <h2>Primary Use Cases</h2> <p><strong>Drupal excels</strong> in cases where you have huge amounts of content and a wide variety of essentially boilerplate functionality to integrate all over the place. 
Large content sites like <a href="http://bbc.co.uk">bbc.co.uk</a> or <a href="http://whitehouse.gov">whitehouse.gov</a> would be the poster children for this type of site.</p> <p><strong>Rails excels</strong> where you have a very specific and unique data model and you need to tailor every aspect of the site and UX around that data model. Almost any startup fits the bill here, since it is generally implicit that the backend powering the service is unique. It&#8217;s hard to imagine building GitHub in Drupal, for example.</p> <h2>Summary</h2> <p>In essence, Drupal hands you a mountain of functionality that is surprisingly flexible for its sheer volume, but for which the cost of adding new functionality is relatively high. Rails gives you only primitives (maybe with some higher level engine gems that still require a fair amount of glue code to integrate fully), but the cost of building functionality is relatively minimal.</p> <p>In order for Drupal to make economic sense, you need to be able to use a significant portion of that mountain of functionality. In fact, you need to use enough of that mountain to cover the additional cost of custom development when there is no applicable module, or when UX assumptions diverge from Drupal core&#8217;s.</p> <p>I would summarize the distinction as <strong>Portals vs Startups</strong>. The archetypical Portal is exactly the kind of site for which Drupal is optimal and Rails would be a money-sink. Likewise, the archetypical Startup would not want to get mired down in Drupal&#8217;s prescribed data model when they are trying to iterate rapidly on a concept.</p> <p>Of course, the vast majority of sites out there are neither as massive as a Portal, nor as unique as a Startup. 
Suddenly the water is murky again, but I think this is still a useful lens with which to make early website technology choices, leaving personal biases aside.</p> gabe@websaviour.com (Gabe da Silveira) Wed, 03 Oct 2012 19:30:00 +0000 http://localhost:3000/articles/drupal-vs-rails http://localhost:3000/articles/drupal-vs-rails The Problem with Rails' Catch-all Route <p>Today I am upgrading several hundred lines of Rails routes from the old pre-3.0 syntax to the new Rails 3.0 routing <span class="caps">DSL</span>. I do like the new syntax, though the upgrade is far from a trivial job, as there are a number of subtle edge cases I&#8217;ve been dealing with. As I meticulously considered and tested each route, I discovered some route matches that I didn&#8217;t think should have happened.</p> <p>For example:</p> <pre class="pastels_on_dark">namespace <span class="Constants"><span class="Constants">:</span>services</span> <span class="ControlStructures">do</span> resources <span class="Constants"><span class="Constants">:</span>fanships</span> <span class="ControlStructures">do</span> collection { put <span class="Constants"><span class="Constants">:</span>create</span> } <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> </pre><p>This was matching:</p> <pre class="pastels_on_dark"><span class="Variables">GET</span> <span class="Operators">/</span>services<span class="Operators">/</span>fanships<span class="Operators">/</span>create </pre><p>Which was clearly not what was intended.</p> <p>Okay, I admit this route is an unusual construction in a Rails app. At first glance I thought maybe something funky was going on due to a conflict between the default resourceful <code>create</code> route and the additional one I was creating. 
So I stripped it down:</p> <pre class="pastels_on_dark">namespace <span class="Constants"><span class="Constants">:</span>services</span> <span class="ControlStructures">do</span> put <span class="Strings"><span class="Strings">&#39;</span>fanships/create<span class="Strings">&#39;</span></span> =&gt; <span class="Strings"><span class="Strings">&#39;</span>fanships#create<span class="Strings">&#39;</span></span> <span class="ControlStructures">end</span> </pre><p>However, even after verifying that there were no other fanships routes getting in the way (the file is over 500 lines), there was still a match on <code>GET</code>.</p> <p>At this point it hit me, the catch-all route:</p> <pre class="pastels_on_dark">map.connect <span class="Strings"><span class="Strings">&#39;</span>:controller/:action/:id<span class="Strings">&#39;</span></span> </pre><p>This app began life in Rails 1.1, so the default route has always been there even though the vast majority of the app is served by standard Railsy resources. I never thought much of it, and in fact I&#8217;ve relied on it a handful of times over the years. But suddenly, doing this exercise, it hit me like a ton of bricks that the default route is clobbering all my carefully considered RESTful constraints on URLs. Then I realized that the <code>:controller</code> symbol is traversing slashes as well, so even my namespaced controllers aren&#8217;t safe. As far as I can tell there is no straightforward way to prevent this effect as long as the catch-all is active.</p> <p>I&#8217;m sure this is no revelation to a lot of people—in fact there&#8217;s a note about it in the default routes.rb comments—but I had never thought through the implications. I guess the reason it&#8217;s not a bigger problem is that the default resourceful route for <code>show</code> shields the <code>update</code>, <code>create</code>, and <code>destroy</code> methods from a casual <code>GET</code>. 
But even so, this violates the principle of least surprise in a dangerous way. From now on I&#8217;ll be sure to never use the default catch-all route.</p> gabe@websaviour.com (Gabe da Silveira) Fri, 28 Oct 2011 00:52:00 +0000 http://localhost:3000/articles/the-problem-with-rails-catch-all-route http://localhost:3000/articles/the-problem-with-rails-catch-all-route In Defense of ORMs <p>Laurie Voss just posted an article that <a href="http://seldo.com/weblog/2011/06/15/orm_is_an_antipattern"><span class="caps">ORM</span> is an anti-pattern</a> in which he calls out ActiveRecord (although nothing about ActiveRecord is specifically cited) as doing more harm than good. Some of the flaws he mentions are real issues, but he hand-wavingly dismisses the benefits without any real analysis. Let&#8217;s examine the claims:</p> <blockquote> <p>Some <span class="caps">ORM</span> layers will tell you that they &#8220;eliminate the need for <span class="caps">SQL</span>&#8221;. This is a promise I have yet to see delivered.</p> </blockquote> <p>Which <span class="caps">ORM</span> says this? This is just a strawman set up by painting <span class="caps">ORM</span> projects as making ridiculous claims. There are quotes around the words, but if you plug them into Google, nothing on the first page is a claim from an <span class="caps">ORM</span>, they&#8217;re all about eliminating [MS] <span class="caps">SQL</span> Server or <span class="caps">SQL</span> experts.</p> <blockquote> <p>Others will more realistically claim that they reduce the need to write <span class="caps">SQL</span> but allow you to use it when you need it. For simple models, and early in a project, this is definitely a benefit: you will get up and running faster with <span class="caps">ORM</span>, no doubt about it. 
However, you will be running in the wrong direction.</p> </blockquote> <p>Now he brings it back to reality and admits you can get up and running faster with <span class="caps">ORM</span>, but then claims you&#8217;re running in the wrong direction, full stop. There&#8217;s no reasoning about <em>why</em> you are running in the wrong direction, just a blind assertion.</p> <p>&#8220;<strong>Code Generation</strong>&#8221; is also mentioned as a benefit, but without going into much detail. &#8220;<strong>Efficiency is &#8216;good enough&#8217;</strong>&#8221; is mentioned here as well but seems out of place; it&#8217;s a known tradeoff, not something to be touted as a benefit.</p> <p>Now we get to the meat of the article.</p> <h2>Inadequate abstraction</h2> <blockquote> <p>The most obvious problem with <span class="caps">ORM</span> as an abstraction is that it does not adequately abstract away the implementation details.</p> </blockquote> <p>This is because there is a fundamental impedance mismatch between <span class="caps">SQL</span> and imperative code. It&#8217;s certainly valid that ORMs are among the leakiest abstractions imaginable. That&#8217;s a fundamental challenge of using <span class="caps">SQL</span> regardless of whether you are using an <span class="caps">ORM</span> or not. Writing your <span class="caps">SQL</span> by hand does nothing to address that; you still have to convert back and forth between imperative code and data structures on one side, and <span class="caps">SQL</span> and flat rows of data on the other. The challenge of doing this well and creating a nice interface is the whole reason there are so many different ORMs and why they vary so widely in form and function. 
If you reject ORMs, you&#8217;re really just saying that you have a better way of doing it, at least for the project at hand, which may very well be true, or you may wind up creating your very own anti-pattern.</p> <blockquote> <p>The whole point of an abstraction is that it is supposed to simplify. An abstraction of <span class="caps">SQL</span> that requires you to understand <span class="caps">SQL</span> anyway is doubling the amount you need to learn: first you need to learn what the <span class="caps">SQL</span> you&#8217;re trying to run is, then you have to learn the <span class="caps">API</span> to get your <span class="caps">ORM</span> to write it for you.</p> </blockquote> <p>The premise of <span class="caps">ORM</span> is not that it precludes the need to understand <span class="caps">SQL</span>; anyone telling you otherwise is selling snake oil. You&#8217;re not learning &#8220;twice as much&#8221;; the <span class="caps">ORM</span> is there to provide a standard way to get your data in and out of an <span class="caps">SQL</span> database. It certainly is quicker to learn the primitive <span class="caps">SQL</span> driver functions, but all the time you saved by not learning the <span class="caps">ORM</span>, and more, will be spent figuring out how to construct <span class="caps">SQL</span> queries and map row data to objects. 
Again, you might create a more optimized solution for your project, but just as likely you will run into dead-ends and require refactorings because you don&#8217;t have the experience of thousands of developer hours with widely varying use-cases behind your homegrown solution.</p> <blockquote> <p>A defender of <span class="caps">ORM</span> will say that this is not true of every project, that not everyone needs to do complicated joins, that <span class="caps">ORM</span> is an &#8220;80/20&#8221; solution, where 80% of users need only 20% of the features of <span class="caps">SQL</span>, and that <span class="caps">ORM</span> can handle those. All I can say is that in my fifteen years of developing database-backed web applications that has not been true for me. Only at the very beginning of a project can you get away with no joins or naive joins. After that, you need to tune and consolidate queries. Even if 80% of users need only 30% of the features of <span class="caps">SQL</span>, then 100% of users have to break your abstraction to get the job done.</p> </blockquote> <p>Well, in my <em>27 years</em> of programming (that&#8217;s right, I first sat down in front of an AppleSoft <span class="caps">BASIC</span> prompt when I was 6), I&#8217;ve found that on 34% of projects, we need 58% of the features of 66% of the <span class="caps">RDBMS</span> engines available.</p> <p>But seriously, if you work with ActiveRecord for any amount of time you realize that assertion about the percentage of <span class="caps">SQL</span> feature coverage is meaningless. There is no such percentage. I&#8217;m hard pressed to think of a feature of <span class="caps">SQL</span> that ActiveRecord does not support. It&#8217;s true that there are some higher level methods (associations for instance) that do more with less code, but even those can take additional options that tweak the <span class="caps">SQL</span> in a variety of ways. 
Even when you drop down to the more basic building-block methods (<code>#select</code>, <code>#where</code>, <code>#group</code>, etc.) you still gain the benefits of the <span class="caps">ORM</span> in that you can chain them with each other and with higher level methods and not have to worry about munging your own <span class="caps">SQL</span> strings.</p> <h2>Incorrect abstraction</h2> <blockquote> <p>If your project really does not need any relational data features, then <span class="caps">ORM</span> will work perfectly for you, but then you have a different problem: you&#8217;re using the wrong datastore.</p> </blockquote> <p>This is continuing with the straw man that ORMs don&#8217;t do joins, but they do. The subsequent line of thought follows directly from this fallacy. Then he comes to:</p> <blockquote> <p>On the other hand, if your data is relational, then your object mapping will eventually break down.</p> </blockquote> <p>This is true in one sense. No <span class="caps">ORM</span> can make all possible relations be first-class citizens in the library. At some point you have to drop to a lower level. Depending on what it is, you may need to drop down to the raw data, or maybe an intermediate level. In any case, this does not invalidate the thousands of other queries for which your app happily makes use of the high-level <span class="caps">ORM</span> features. The two can very happily co-exist.</p> <h2>Death by a thousand queries</h2> <blockquote> <p>This leads naturally to another problem of <span class="caps">ORM</span>: inefficiency. When you fetch an object, which of its properties (columns in the table) do you need? <span class="caps">ORM</span> can&#8217;t know, so it gets all of them (or it requires you to say, breaking the abstraction).</p> </blockquote> <p>So the solution is to force yourself to specify the columns every time without exception? 
Again, that throws the baby right out with the bathwater; it&#8217;s no harder to specify in an <span class="caps">ORM</span> than it is in raw <span class="caps">SQL</span>. Also, you may end up shooting yourself in the foot, because sometimes <span class="caps">RDBMS</span> software has optimizations around <code>SELECT *</code>, and therefore you hurt yourself by removing columns unless you really are shaving a lot of data.</p> <blockquote> <p>Initially this is not a problem, but when you are fetching a thousand records at a time, fetching 30 columns when you only need 3 becomes a pernicious source of inefficiency.</p> </blockquote> <p>It&#8217;s not pernicious. You just add <code>.select('col1,col2,col3')</code> after you&#8217;ve profiled the action and realized it was the bottleneck. Or you could trace the usage of every query, optimize the <code>SELECT</code> clause for each and every case, and endlessly refactor as your application and interface grow and change. You could also write the whole thing in binary, and I bet it would be pretty easy to squeeze out twice the performance in 100x the development time.</p> <blockquote> <p>Many <span class="caps">ORM</span> layers are also notably bad at deducing joins, and will fall back to dozens of individual queries for related objects.</p> </blockquote> <p>On the other hand, if you are joining several relations to a base table, this is exactly what you want in order to avoid a cartesian product explosion. Contrary to the implication, <span class="caps">ORM</span> developers do profile and optimize their code. You can often do better for a specific case, but it&#8217;s far from being necessary every time.</p> <blockquote> <p>The problem, I have discovered with experience, is that there is seldom a single &#8220;magic bullet&#8221; query that needs to be optimized: the death of database-backed applications is not the efficiency of any one query, but the number of queries. 
ORM&#8217;s lack of context-sensitivity means that it cannot consolidate queries, and must fall back on caching and other mechanisms to attempt to compensate.</p> </blockquote> <p>The use of &#8220;magic bullet&#8221; here frames the idea of selective optimization in a negative light, because all programmers <em>know</em> there are no magic bullets. Unfortunately it&#8217;s a poor use of metaphor, because a bullet refers to a solution, not a problem. A more apt platitude would be that premature optimization is always bad. There is likely more than one problem query in a large application, but there are also likely hundreds of inconsequential boilerplate queries needing no more fine-tuning than an index or two.</p> <p>In any case, it would be disingenuous to say that ORMs don&#8217;t come with performance overhead, but it&#8217;s a measured tradeoff. There&#8217;s no cliff-dropping performance hit, despite the vaguely damning closing sentence about using caching to &#8220;compensate&#8221;, whatever that means.</p> <h2>Solutions: Use Objects</h2> <blockquote> <p>If your data is objects, stop using a relational database. The programming world is currently awash with key-value stores that will allow you to hold elegant, self-contained data structures in huge quantities and access them at lightning speed.</p> </blockquote> <p>Okay, this is a bit off-topic for a rant like this, but I can&#8217;t let it slide. Your data <em>&#8220;is&#8221;</em> not either objects or relational. The choice of data structures and data stores is an abstract representation of concepts that can be interpreted in many different ways. It really bothers me whenever someone offhandedly says that there are many types of data that aren&#8217;t relational. Quite frankly, I think that&#8217;s horseshit. 
Let me go on record that I think there is no useful data that cannot be made more useful through solid relational modeling as opposed to some object DB dump.</p> <p>With a relational model you have a far more powerful querying paradigm than you have on any object data store. By its nature, a relational model is meant to slice and dice data in unforeseen ways. Of course, this flexibility comes at a price. The reason for all these new high-performance data stores is to commit to a less flexible data model that allows you to break out of the performance limitations of a strict <span class="caps">ACID</span> <span class="caps">RDBMS</span>.</p> <p>The reason people start with a relational database is that it&#8217;s the most flexible way for your data to sit. You can pivot your startup to completely different purposes using the same data, and all you need to do is change a few indexes around to support your new query structure. When you use some kind of document or object-oriented store, you are making a much firmer assertion about exactly how this data will be used, and what kind of queries and reports you intend to run from it.</p> <p>The fact is that most applications will never outgrow the capability of a single moderately powered relational DB server. If and when you do, you will have scaling issues that don&#8217;t magically go away just because you chose the high-performance datastore du jour. You&#8217;ll have a lot of architecting and optimization to do regardless of the path you take to eliminate bottlenecks, so I think it&#8217;s entirely justified to slap in an <span class="caps">SQL</span> database on day one with the realization that you may have to change it later.</p> <h2>Solutions: Use <span class="caps">SQL</span> in the Model</h2> <blockquote> <p>However, remember that the job of your model layer is not to <em>represent objects</em> but to <strong>answer questions</strong>. 
Provide an <span class="caps">API</span> that answers the questions your application has, as simply and efficiently as possible.</p> </blockquote> <p>I would add changing state to this, but fundamentally I agree with the philosophy. Given the fact that interacting with a relational database can be messy, I think it&#8217;s very smart if you can keep the logic entirely contained in the model, and therefore I could accept that keeping ActiveRecord primitives out of your controllers and views might be a good idea. In practice I only work this way for more complex types of &#8220;questions&#8221; that I&#8217;m asking the model, and Rails sort of blurs the lines by defining a bunch of methods, which could theoretically be replaced if you ripped out ActiveRecord and replaced your backend with something completely different.</p> <blockquote> <p>OO is itself an abstraction, a beautiful and hugely flexible one, but relational data is one of its boundaries, and pretending objects can do something they can&#8217;t is the fundamental, root problem in all <span class="caps">ORM</span>.</p> </blockquote> <p>This closing statement implies a false belief about the design criteria and intention of <span class="caps">ORM</span> systems. It seems like a case of misattributing the beliefs of newbies who wish <span class="caps">SQL</span> didn&#8217;t exist to <span class="caps">ORM</span> developers who clearly have a deep understanding of <span class="caps">SQL</span>. In reality ORMs are mostly designed for experts. 
I haven&#8217;t used a lot of other ORMs extensively, so I can&#8217;t comment on the design criteria or optimal usage of something like SQLAlchemy, but for ActiveRecord and probably a lot of ORMs there are some clear wins:</p> <ul> <li><strong>Automatically map database types to the equivalent programming language types and vice-versa.</strong> If you&#8217;re managing your own <span class="caps">SQL</span> then you&#8217;re going to waste a lot of time figuring this out. It&#8217;s one place where sensible defaults go a long way. Why debug the edge cases yourself?</li> <li><strong>Greatly reduce error-prone <span class="caps">SQL</span> string munging.</strong> Assuming you have your pants on straight and don&#8217;t open up any injection holes, you still have to have complex string manipulation logic to compile conditions and joins, or else a bunch of very similar but different queries that are difficult to read and hard to refactor.</li> <li><strong>Composability of queries.</strong> With a proper relational algebra engine, you can programmatically build up a query by chaining clause methods together. Doing so allows you to compose many complex queries and keep the code in the logical place without the design being driven by how you are organizing the <span class="caps">SQL</span> itself. An example would be if you have the concept of Authors who might have a &#8220;new&#8221; flag, and Posts that might have a &#8220;rant&#8221; flag. It&#8217;s clear that the definition of those flags belongs in the respective models, but what if you want to query both at once? In the case where you manually craft your <span class="caps">SQL</span> you have to put it somewhere, maybe in one model or the other, or somewhere else. You also have to decide if you are going to munge a bunch of conditions together in one query or split them out. 
Throw something like an advanced search form in the app and this becomes a minefield of niggling details.</li> <li><strong>Simplify the simple stuff.</strong> There are tons of boilerplate queries in any sizable app. An <span class="caps">ORM</span> gives you a concise way to call them and get a standard default mapping of fields to attributes.</li> <li><strong>Facilitate higher level patterns.</strong> The concepts of <code>belongs_to</code> and <code>has_many</code> in ActiveRecord are a simple encapsulation of very common patterns that come up again and again in relational databases. Through a series of simple options they let you specify and execute a wide variety of queries for counting, instantiating, or loading multiple collections. They can be chained with the lower level functions, and always output correct and straightforward <span class="caps">SQL</span>. Sure they can&#8217;t do everything, but they certainly are capable of handling the majority of queries in a typical app quite elegantly.</li> </ul> <p>The bottom line is that the concept of an <span class="caps">ORM</span> is a catchall for a range of projects that implement common patterns for using <span class="caps">SQL</span> in an object-oriented system. An <span class="caps">ORM</span> is nothing more than a codified set of patterns. Because of the nature of the relational model, there are many, many different approaches one can take. To dismiss ORMs entirely is to dismiss <span class="caps">SQL</span> as well. You might claim that you don&#8217;t use an <span class="caps">ORM</span> in your code, but in effect you are just implementing your own idea of an <span class="caps">ORM</span>.
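</p>

<p>Even a hand-rolled &#8220;non-ORM&#8221; tends to re-grow exactly these patterns. As a toy sketch (the <code>Relation</code> class below is invented for illustration; it is not ActiveRecord code), query composition boils down to objects that accumulate clauses and only render <span class="caps">SQL</span> at the end:</p>

```ruby
# Toy relation: each `where` returns a NEW relation, so composed queries
# never mutate each other -- the heart of the composability win.
class Relation
  def initialize(table, conditions = [])
    @table = table
    @conditions = conditions
  end

  def where(clause)
    Relation.new(@table, @conditions + [clause])
  end

  def to_sql
    sql = "SELECT * FROM #{@table}"
    sql += " WHERE #{@conditions.join(' AND ')}" unless @conditions.empty?
    sql
  end
end

# The "new author" and "rant" conditions can live in separate models
# and still be combined at the call site:
new_authors = Relation.new("authors").where("authors.new = 1")
new_author_rants = new_authors.where("posts.rant = 1")
puts new_author_rants.to_sql
# => SELECT * FROM authors WHERE authors.new = 1 AND posts.rant = 1
```

<p>Each clause definition stays in the model it belongs to, and the combined query is assembled wherever it is needed, which is the property described in the list above.</p>

<p>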
If you are just peppering queries around your model then it might be a very lightweight <span class="caps">ORM</span>, but the fact is you are still Mapping Objects to Relations and vice-versa.</p> gabe@websaviour.com (Gabe da Silveira) Thu, 16 Jun 2011 06:00:00 +0000 http://localhost:3000/articles/in-defense-of-orms http://localhost:3000/articles/in-defense-of-orms Aws::S3 Gem Bucket Does Not Exist Bug <p>I just spent two hours debugging the most banal bug. It eventually reduced to the following wtf moment:</p> <pre class="pastels_on_dark"><span class="Variables">S3Object</span>.store(<span class="Strings"><span class="Strings">&#39;</span>a_key<span class="Strings">&#39;</span></span>, <span class="Strings"><span class="Strings">&#39;</span>some_content<span class="Strings">&#39;</span></span>, <span class="Strings"><span class="Strings">&#39;</span>a_bucket<span class="Strings">&#39;</span></span>) =&gt; success <span class="Variables">S3Object</span>.find(<span class="Strings"><span class="Strings">&#39;</span>a_key<span class="Strings">&#39;</span></span>, <span class="Strings"><span class="Strings">&#39;</span>a_bucket<span class="Strings">&#39;</span></span>) =&gt; <span class="Variables">AWS</span>::S3::NoSuchBucket: <span class="Variables">The</span> specified bucket does <span class="Operators">not</span> exist from ...<span class="Operators">/</span>vendor<span class="Operators">/</span>gems<span class="Operators">/</span>aws<span class="Operators">-</span>s3<span class="Operators">-</span><span class="Numbers">0.6</span>.<span class="Numbers">2</span><span class="Operators">/</span>lib<span class="Operators">/</span>aws<span class="Operators">/</span>s3<span class="Operators">/</span>error.rb:<span class="Numbers">38</span><span class="Constants"><span class="Constants">:</span>in</span> <span class="Strings"><span class="Strings">`</span>raise&#39;</span> <span class="Strings"> from .../vendor/gems/aws-s3-0.6.2/lib/aws/s3/base.rb:76:in <span 
class="Strings">`</span></span>request<span class="Strings"><span class="Strings">&#39;</span></span> <span class="Strings"> from .../vendor/gems/aws-s3-0.6.2/lib/aws/s3/base.rb:92:in `get<span class="Strings">&#39;</span></span> from ...<span class="Operators">/</span>vendor<span class="Operators">/</span>gems<span class="Operators">/</span>aws<span class="Operators">-</span>s3<span class="Operators">-</span><span class="Numbers">0.6</span>.<span class="Numbers">2</span><span class="Operators">/</span>lib<span class="Operators">/</span>aws<span class="Operators">/</span>s3<span class="Operators">/</span>bucket.rb:<span class="Numbers">104</span><span class="Constants"><span class="Constants">:</span>in</span> <span class="Strings"><span class="Strings">`</span>find&#39;</span> <span class="Strings"> from .../vendor/gems/aws-s3-0.6.2/lib/aws/s3/object.rb:172:in <span class="Strings">`</span></span>find<span class="Strings"><span class="Strings">&#39;</span></span> <span class="Strings"> from (irb):7</span> </pre><p>Okay, that makes no sense. It works in a vanilla irb. I smell a monkeypatch.</p> <p>Sure enough, it turns out the <a href="http://github.com/cardmagic/contacts">Contacts gem</a> and the <a href="http://github.com/marcel/aws-s3"><span class="caps">AWS</span>-S3 gem</a> both define a <code>Hash#to_query_string</code> method. Here&#8217;s the kicker: <strong>both of those gems only call this method one time</strong>!</p> <p>This is why monkeypatching should be considered harmful. 
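</p>

<p>In miniature, the collision looks like this (the method bodies are invented for illustration; the real gems&#8217; implementations differ). Whichever definition is loaded last wins, with no warning:</p>

```ruby
# Gem A opens Hash and adds its serializer...
class Hash
  def to_query_string
    map { |k, v| "#{k}=#{v}" }.join("&")
  end
end

# ...later, gem B opens Hash again with a different format. Ruby emits
# no warning, and gem A's single call site now gets gem B's output.
class Hash
  def to_query_string
    map { |k, v| "#{k}:#{v}" }.join(";")
  end
end

puts({ "bucket" => "a_bucket" }.to_query_string)
# => bucket:a_bucket  (a caller expecting gem A's format sees garbage)
```

<p>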
I just wasted two hours of my life so someone could call <code>options.to_query_string</code> instead of <code>to_query_string(options)</code>.</p> gabe@websaviour.com (Gabe da Silveira) Thu, 10 Jun 2010 06:27:00 +0000 http://localhost:3000/articles/awss3-gem-bucket-does-not-exist-bug http://localhost:3000/articles/awss3-gem-bucket-does-not-exist-bug Configuring MySQL for utf8 under homebrew <p>Man, every time I set up mysql it takes me half an hour to google up all the documentation of how to make it just work™ with utf-8 instead of the god forsaken latin1 default. So here it is once and for all.</p> <p>Server <code>my.cnf</code> should contain:</p> <pre class="pastels_on_dark">[mysqld] collation_server=utf8_general_ci character_set_server=utf8 </pre><p>The location of my.cnf varies depending on how you install it. dev.mysql.com has <a href="http://dev.mysql.com/tech-resources/articles/mysql_intro.html#SECTION0001500000">a nice summary of how my.cnf</a> is looked up.</p> <p>Since I use homebrew my location is <code>/usr/local/var/mysql/my.cnf</code>. 
Binary installations differ.</p> <p>Server config variables are described under <a href="http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_character_set_client">Server System Variables</a></p> <p>Client configuration goes in <code>~/.my.cnf</code>:</p> <pre class="pastels_on_dark">[client] default-character-set=utf8 </pre><p>Client config variables are described under <a href="http://dev.mysql.com/doc/refman/5.0/en/mysql-command-options.html">mysql Options</a></p> <p>You can check the status of these with <a href="http://dev.mysql.com/doc/refman/5.1/en/show-variables.html"><span class="caps">SHOW</span> <span class="caps">VARIABLES</span></a></p> <p>Finally, as a reminder for homebrew users, if you use the com.mysql.mysqld launchd plist that comes with the homebrew formula then you will want to start and stop the server via launchctl:</p> <pre class="pastels_on_dark">launchctl stop com.mysql.mysqld ... launchctl start com.mysql.mysqld </pre> gabe@websaviour.com (Gabe da Silveira) Sat, 08 May 2010 04:13:00 +0000 http://localhost:3000/articles/configuring-mysql-for-utf8-under-homebrew http://localhost:3000/articles/configuring-mysql-for-utf8-under-homebrew The Case for Git Rebase <p>When you first sit down with git they tell you to watch out for rebase. &#8220;Git is fast. Git is great. Git gets merging right. If you screw up git reflog has your back. <strong><em>But watch out for rebase</em>, as long as you avoid rebase you&#8217;ll never get in too far over your head.</strong>&#8221;</p> <p>Okay, I can see the wisdom in that.</p> <p>In fact for the first year I avoided rebase almost entirely. I read <a href="http://changelog.complete.org/archives/586-rebase-considered-harmful">Rebase Considered Harmful</a> early on and it reaffirmed my choice. 
But as I came to understand git&#8217;s internals, and as <a href="http://progit.org/book/ch3-6.html">new descriptions of rebase</a> came online, the distinct feeling that I was missing something started creeping in.</p> <p>Ironically, what pushed me to really grok rebase was the need to perform a surgical 3-way rebase for a series of commits that were drastically misapplied. Once I understood how to do that, rebase was laid bare to me. At its essence, rebase simply takes a series of commits as if they were individual patches and applies them at a different point in history. The confusion and opacity of rebase come in large part from the fact that the range of commits and the &#8220;base&#8221; commit are determined somewhat magically by the branch names that are passed to <code>git-rebase</code>.</p> <p>Once I understood rebase, I started using it more, but I still held back from suggesting the use of rebase to the rest of my team; the clean history was nice, but I didn&#8217;t see a compelling enough advantage to become an advocate for a rebase-based workflow.</p> <p>Things have changed.</p> <h2>A private team workflow</h2> <p>Git was built to manage Linux kernel development, so it&#8217;s no surprise that discussion of rebase tends to be focused on open source workflows. In open source, once you push a commit to a public repo, you don&#8217;t know who else has it, and rebasing public commits will lead to dangerous cascading effects.
In a private repo it&#8217;s still a good rule of thumb not to rebase pushed commits, but with small teams you can bend the rules just a hair to keep history linear.</p> <p>The advice I&#8217;m proposing to my team is to:</p> <ol> <li><strong>Always use <code>git pull --rebase</code></strong></li> <li><strong>Rebase topic branches against master</strong> (or the current base branch) <strong>before pushing for the first time</strong></li> <li><strong>Rebase topic branches just before merging and deleting them</strong> (and let other people know the branch is officially dead so they don&#8217;t keep committing to their local copy)</li> </ol> <p>Why go through the trouble of all this rebasing? Won&#8217;t we be losing history? Well yes, rebase vs merge is always a tradeoff. For a long time I thought it was basically a wash: readability in exchange for precise history. However, as I came to understand the tradeoffs, things kept shifting towards rebase.</p> <h2>Readability</h2> <p>If you are following good agile practices and keeping your stories really short, ideally your topic branches are short and sweet and they all get created and merged within a day or two, right? In that case a few merge commits are not really a burden, and you can easily parse out the exact history of who did what, when. Right, but in the real world some branches end up sticking around longer for various reasons. It doesn&#8217;t take long to reach a threshold where the full history becomes unparseable by the human brain. Here is a recent example&#8230; this is with just <strong>4 developers</strong>!</p> <p><img src="http://img.skitch.com/20100424-fujrjnxfh23akyb41kd45cgwpm.jpg" alt="" /></p> <p>Once you reach that point you&#8217;ve lost the benefit of having full history, and all those merge commits are just useless noise. And it gets worse.</p> <h2>Bug Locality</h2> <p>All else being equal, readability alone is not enough to tip me in favor of rebase.
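</p>

<p>Played out on a disposable repo, the three rules above produce a history with no merge commits at all. This is only a sketch: every name is invented, and <code>git init -b</code> requires git 2.28 or newer.</p>

```shell
set -e
cd "$(mktemp -d)"
git init -q -b master demo && cd demo
git config user.email dev@example.com && git config user.name dev
echo base > app.rb && git add app.rb && git commit -qm "initial commit"

# Parallel work: a topic branch plus more commits on master.
git checkout -qb topic
echo feature > feature.rb && git add feature.rb && git commit -qm "add feature"
git checkout -q master
echo tweak >> app.rb && git commit -qam "tweak app"

# Rules 2 and 3: rebase the topic branch onto master, then fast-forward
# merge and delete it -- no merge commit is ever created.
git checkout -q topic && git rebase -q master
git checkout -q master && git merge -q --ff-only topic && git branch -q -d topic

git log --oneline --merges   # prints nothing: the history is linear
```

<p>Rule 1 can also be made automatic with <code>git config --global pull.rebase true</code>, so nobody has to remember the flag.</p>

<p>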
However, when it comes to debugging, a linear rebased history is your friend as well.</p> <p>Oftentimes the combination of topic branches results in a conflict. Merging seems simpler because you resolve all the conflicts at once. With rebase you have to fix each conflict as it occurs, commit by commit. However, each individual conflict is easier to resolve because the commit is (hopefully) more focused than the entire branch, so the resolution is done in the same context as the original commit. You think to yourself, &#8220;what was the purpose of this commit, and how should it be different given the wider changes that occurred on the base branch?&#8221;</p> <p>With either workflow you have the possibility of bugs, either because you flubbed the merge or, worse, because of some subtle interaction that you may not discover until much later. This is where bug locality comes in.</p> <p>If you&#8217;re using <code>git-bisect</code> months after the fact, what commit will appear to have caused the bug? In the case of merging it&#8217;s going to be a huge commit combining two branches with many changes. This is fairly likely to be completely useless. However, if you have always rebased and have a perfectly linear history, you will <strong>always</strong> be able to trace it back to a single logical commit. This is the kind of thing that is hard to appreciate until you&#8217;ve actually seen <code>git-bisect</code> turn up useless a few times.</p> <p>Okay, so I admit it&#8217;s not always feasible (or even worth it) to maintain a perfectly linear history. A few merge commits here and there aren&#8217;t going to hurt anybody. This is one reason I kept my rebasing to myself for a long time. However, if you rebase at all, there is a third downside with merging.</p> <h2>Merging is Viral</h2> <p>If you merge all the time, you can find yourself in situations where you&#8217;d like to rebase but can&#8217;t for practical reasons.
This isn&#8217;t really a weakness in git so much as the fact that rebasing is easier the closer it&#8217;s done to the actual commits. Rebasing your own commits a la <code>git pull --rebase</code> is more or less the same difficulty as merging (most of the time).</p> <p>However, if you go back to rebase a sequence that has a bunch of merge commits in it, <code>git-rebase</code> will not be able to make use of any conflict resolution done in those merges. This is because the individual commits are replayed one by one in temporal order, which means conflicts that were resolved in later merges have to be re-resolved piece by piece.</p> <p>Consider rebasing a long-lived topic branch back to master:</p> <p><img src="http://img.skitch.com/20100424-tptk1p2ujpgutibf1f99jpdqg7.jpg" alt="" /></p> <p>On the left you have a topic branch worked on by two people who were regularly merging. On the right you see the master branch, which had its own line of development going on simultaneously. Now when it comes time to merge this branch down to master, you want to rebase and then delete it. The only problem is that as each commit is replayed, <strong>you hit every conflict that originally occurred and was resolved in those merge commits</strong>, except now these changes are potentially ancient history, and even if you were the one who originally did the merge, you may not clearly remember the context of each individual commit.</p> <p>If the two developers had done <code>git pull --rebase</code> every time, they would have resolved conflicts locally so that the later rebase to master would not have any old conflicts to resolve. In this case the conflicts were gnarly enough that rebasing was not practical.
Once that happens, rebasing becomes impossible for any branch containing this sequence.</p> <p>Of course eventually you expect to merge everything back to master and you get a clean slate, but the point is that little merges require bigger and bigger merges as a topic branch grows. Since you don&#8217;t necessarily know the life cycle of a topic branch when you start, keeping history clean is a smart hedge.</p> <h2>Conclusion</h2> <p>It took a couple years of daily git use on private projects, but I&#8217;ve now come to believe that the benefits of a clean linear history outweigh the benefits of a perfect historical record. <code>git-rebase</code> also maintains the commit dates, so you can infer a good deal about the original history. An original history <strong>may</strong> give me a clue about what a developer was thinking at the time, but this is not necessarily of greater benefit than knowing in a clear order what changes were applied to the codebase. In the end, a more powerful <code>git-bisect</code> is the trump card that puts me firmly in the camp of rebase, at least for private projects.</p> gabe@websaviour.com (Gabe da Silveira) Sun, 25 Apr 2010 06:18:00 +0000 http://localhost:3000/articles/the-case-for-git-rebase http://localhost:3000/articles/the-case-for-git-rebase Testing ThinkingSphinx with Test::Unit and Transactional Fixtures <p>I first started using Sphinx about a year ago. At The Auteurs we use it to solve some thorny localization and filtering issues, and we have been pushing its limits almost since the beginning. Our film index in particular is quite complex, hitting ten tables and defining two dozen attributes. I&#8217;ve been impressed with how flexible it is with a pretty basic set of primitives. But one thing that never had an obvious solution was testing.
I just finished the third revision of our test harnesses, and I thought I&#8217;d share my approach because it runs a bit contrary to the current zeitgeist in Ruby testing circles.</p> <h2>Integration vs Unit Testing</h2> <p>I&#8217;m a big believer in integration testing, and when we first set up Sphinx we really wanted to test it end-to-end. Our first two harnesses reflected this approach. We ended up with a fairly elegant method to start and stop Sphinx on classes that needed it. This gave us good test coverage, but eventually cracks started to show. We already needed to speed up our tests as I <a href="http://darwinweb.net/articles/83">outlined a couple weeks ago</a>, and integrated Sphinx support was one of the worst offenders:</p> <ul> <li>Startup time for Sphinx is on the order of a couple of seconds, which adds up over multiple tests</li> <li>Indexing time is also significant, and it adds up even faster if you reindex for every individual test</li> <li>We&#8217;re regularly adding new models to be indexed by Sphinx, which is magnifying these sources of overhead</li> <li>Our indexes are getting more complex, which means exercising them well at the integration level is too slow</li> <li>Rails foxy fixtures create sparse ids, which Sphinx is not in love with</li> <li>Even worse, if the ID overflows Sphinx&#8217;s internal key size (which I think depends on whether it is compiled as 32 or 64-bit) then those records are silently dropped.</li> <li>We rely on transactional fixtures for performance reasons, which are not efficient to turn off on a case-by-case basis to solve the previous problems</li> </ul> <h2>Designing Sphinx Tests for Performance</h2> <p>Clearly Sphinx can be quite complex and good test coverage is critical. But what constitutes &#8216;good&#8217;? The meat of Sphinx is simply querying some efficiently indexed data.
Verifying the results that <code>ThinkingSphinx#search</code> returns allows you to test that everything between the index definition and the query is working as expected. There are a lot of moving parts under there, so it&#8217;s reassuring to have a solid sanity check that your indexes are doing what you think they&#8217;re doing.</p> <p>Given that Sphinx is a high-performance fulltext search, there&#8217;s no reason you can&#8217;t run a lot of tests quickly. I decided, for the health of our test suite and testing discipline, to stub out Sphinx in all our tests and move the Sphinx tests into their own suite of highly efficient tests covering the actual indexes. In the end we lose a bit of regression coverage over the actual Sphinx calls in the application, but that is a comparatively small surface area for potential problems in exchange for encouraging better testing of the complex part. Plus, it&#8217;s not likely to be subtly broken by some distant change, and if breakage does make it to production, exception notification has our back.</p> <h2>Step One: New Test Suite</h2> <p>In order to isolate these tests from fixtures and minimize startup overhead, I moved all Sphinx tests to <code>test/sphinx</code> and wrote this Rake task:</p> <pre class="pastels_on_dark">namespace <span class="Constants"><span class="Constants">:</span>test</span> <span class="ControlStructures">do</span> task <span class="Constants"><span class="Constants">:</span>sphinx</span> =&gt; <span class="Strings"><span class="Strings">&#39;</span>ts:config_test<span class="Strings">&#39;</span></span> <span class="ControlStructures">do</span> puts <span class="Strings"><span class="Strings">&quot;</span>!
Starting Sphinx by rake<span class="Strings">&quot;</span></span> silence_stream(<span class="Variables">STDOUT</span>){ Rake::Task[<span class="Strings"><span class="Strings">&quot;</span>thinking_sphinx:start<span class="Strings">&quot;</span></span>].invoke } <span class="ControlStructures">begin</span> Rake::Task[<span class="Strings"><span class="Strings">&quot;</span>test:sphinx_without_daemon<span class="Strings">&quot;</span></span>].invoke <span class="ControlStructures">ensure</span> puts <span class="Strings"><span class="Strings">&quot;</span>! Stopping Sphinx by rake<span class="Strings">&quot;</span></span> silence_stream(<span class="Variables">STDOUT</span>){ Rake::Task[<span class="Strings"><span class="Strings">&quot;</span>thinking_sphinx:stop<span class="Strings">&quot;</span></span>].invoke } <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> Rake::TestTask.new(<span class="Constants"><span class="Constants">:</span>sphinx_without_daemon</span>) <span class="ControlStructures">do </span>|<span class="Variables">t</span>| t.libs <span class="Operators">&lt;&lt;</span> <span class="Strings"><span class="Strings">&quot;</span>test<span class="Strings">&quot;</span></span> t.pattern <span class="Operators">=</span> <span class="Strings"><span class="Strings">&#39;</span>test/sphinx/*<span class="Strings">&#39;</span></span> t.verbose <span class="Operators">=</span> <span class="LanguageConstants">true</span> <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> namespace <span class="Constants"><span class="Constants">:</span>ts</span> <span class="ControlStructures">do</span> task <span class="Constants"><span class="Constants">:</span>config_test</span> <span class="ControlStructures">do</span> <span class="Variables">ENV</span>[<span class="Strings"><span class="Strings">&quot;</span>RAILS_ENV<span class="Strings">&quot;</span></span>] <span class="Operators">=</span> <span 
class="Strings"><span class="Strings">&#39;</span>test<span class="Strings">&#39;</span></span> <span class="Comments"><span class="Comments">#</span> This is a horrible sin against humanity, but it ensures that A) the test config is</span> <span class="Comments"> <span class="Comments">#</span> always up to date, B) doesn&#39;t require devs remembering to add RAILS_ENV=test when</span> <span class="Comments"> <span class="Comments">#</span> they call rake and C) doesn&#39;t incur a penalty of shelling out to 2nd env</span> Rake::Task[<span class="Strings"><span class="Strings">&quot;</span>environment<span class="Strings">&quot;</span></span>].invoke <span class="Comments"><span class="Comments">#</span> This must be run after the test environment (as opposed to as a prereq) is</span> <span class="Comments"> <span class="Comments">#</span> forced or sphinx gets the wrong env.</span> puts <span class="Strings"><span class="Strings">&quot;</span>! Configuring Sphinx by rake<span class="Strings">&quot;</span></span> silence_stream(<span class="Variables">STDOUT</span>){ Rake::Task[<span class="Strings"><span class="Strings">&quot;</span>thinking_sphinx:configure<span class="Strings">&quot;</span></span>].invoke } <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> task <span class="Strings"><span class="Strings">&quot;</span>test<span class="Strings">&quot;</span></span> =&gt; [<span class="Strings"><span class="Strings">&quot;</span>test:sphinx<span class="Strings">&quot;</span></span>] </pre><p>The main task here is <code>test:sphinx</code> which first ensures that the test config is up to date (by means of some fairly brutal methods as noted in the comments) and starts and stops Sphinx before running the actual tests.</p> <h2>Step Two: SphinxTestCase</h2> <p>To finish the optimization and make the test-writing experience smooth, a custom test case was required. 
The following has three purposes:</p> <ul> <li>Start and stop Sphinx, but only if it&#8217;s not already started globally by Rake</li> <li><code>TRUNCATE</code> tables this test uses so that leftover fixture data gets cleared and ids start at 1 again</li> <li>Instantiate a set of test data, but <strong>only once per test class</strong></li> </ul> <p>The result:</p> <pre class="pastels_on_dark"><span class="ControlStructures">class</span> SphinxTestCase &lt; ActiveSupport::TestCase cattr_accessor <span class="Constants"><span class="Constants">:</span>sphinx_started_within_test_case</span> class_inheritable_accessor <span class="Constants"><span class="Constants">:</span>sphinx_tables_to_index</span> <span class="ControlStructures">def</span> self.indexes_tables(<span class="Variables"><span class="Operators">*</span>args</span>) <span class="Variables">self</span>.sphinx_tables_to_index <span class="Operators">=</span> args.map(<span class="Operators">&amp;</span><span class="Constants"><span class="Constants">:</span>to_s</span>).map(<span class="Operators">&amp;</span><span class="Constants"><span class="Constants">:</span>tableize</span>) <span class="ControlStructures">end</span> <span class="Comments"> <span class="Comments">#</span> We use inherited hook to define this only on the concrete test class otherwise the method gets called twice.</span> <span class="ControlStructures">def</span> self.inherited(<span class="Variables">subclass</span>) <span class="ControlStructures">def</span> subclass.suite(<span class="Variables"><span class="Operators">*</span>args</span>) mysuite <span class="Operators">=</span> <span class="ControlStructures">super</span> <span class="ControlStructures">def</span> mysuite.run(<span class="Variables"><span class="Operators">*</span>args</span>) <span class="ControlStructures">unless</span> ThinkingSphinx.sphinx_running?
<span class="Comments"><span class="Comments">#</span> When running via rake we only start sphinxd once</span> puts <span class="Strings"><span class="Strings">&quot;</span>! Starting Sphinx on per-test basis<span class="Strings">&quot;</span></span> <span class="Comments"><span class="Comments">#</span> If running by rake and somehow sphinx doesn&#39;t start we should know about it</span> silence_stream(<span class="Variables">STDOUT</span>){ ThinkingSphinxTestHelper.start! } SphinxTestCase.sphinx_started_within_test_case <span class="Operators">=</span> <span class="LanguageConstants">true</span> <span class="ControlStructures">end</span> <span class="ControlStructures">super</span> <span class="ControlStructures">if</span> SphinxTestCase.sphinx_started_within_test_case print <span class="Strings"><span class="Strings">&quot;</span><span class="CharacterConstants">\n</span>! Stopping Sphinx on per-test basis<span class="Strings">&quot;</span></span> silence_stream(<span class="Variables">STDOUT</span>){ ThinkingSphinxTestHelper.stop! } <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> mysuite <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> <span class="Comments"> <span class="Comments">#</span> Instance variables set here are automatically set on each individual test instance</span> <span class="ControlStructures">def</span> self.setup_database(<span class="Variables"><span class="Operators">&amp;</span>block</span>) <span class="Variables"><span class="Variables">@</span>database_setup_block</span> <span class="Operators">=</span> block <span class="ControlStructures">end</span> <span class="ControlStructures">def</span> setup_fixtures <span class="ControlStructures">unless</span> sphinx_tables_to_index.present? 
<span class="Keywords">raise</span> <span class="Strings"><span class="Strings">&quot;</span>Tables to be cleared must be defined using indexes_tables on the test class<span class="Strings">&quot;</span></span> <span class="ControlStructures">end</span> <span class="Variables"><span class="Variables">@@</span>sphinx_test_case_already_loaded</span> <span class="Operators">||=</span> {} <span class="ControlStructures">unless</span> <span class="Variables"><span class="Variables">@@</span>sphinx_test_case_already_loaded</span>[<span class="Variables">self</span>.class] sphinx_tables_to_index.each <span class="ControlStructures">do </span>|<span class="Variables">table</span>| ActiveRecord::Base.connection.execute(<span class="Strings"><span class="Strings">&quot;</span>TRUNCATE <span class="Strings"><span class="Strings">#{</span>table<span class="Strings">}</span></span><span class="Strings">&quot;</span></span>) <span class="ControlStructures">end</span> db_setup_block <span class="Operators">=</span> <span class="Variables">self</span>.class.instance_variable_get(<span class="Constants"><span class="Constants">:</span>@database_setup_block</span>) <span class="Keywords">raise</span> <span class="Strings"><span class="Strings">&quot;</span>setup_database was not called for <span class="Strings"><span class="Strings">#{</span><span class="Variables">self</span><span class="Strings"><span class="Strings">.</span><span class="Strings">class</span></span><span class="Strings">}</span></span><span class="Strings">&quot;</span></span> <span class="ControlStructures">unless</span> db_setup_block db_setup_block.call ThinkingSphinxTestHelper.index! 
<span class="Variables"><span class="Variables">@@</span>sphinx_test_case_already_loaded</span>[<span class="Variables">self</span>.class] <span class="Operators">=</span> <span class="LanguageConstants">true</span> <span class="ControlStructures">end</span> <span class="Variables">self</span>.class.instance_variables.each <span class="ControlStructures">do </span>|<span class="Variables">ivar</span>| <span class="Variables">self</span>.instance_variable_set(ivar, <span class="Variables">self</span>.class.instance_variable_get(ivar)) <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> <span class="ControlStructures">def</span> teardown_fixtures <span class="Comments"> <span class="Comments">#</span> Do nothing.</span> <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> </pre><h3>Test class methods</h3> <p>There are two class methods, <code>indexes_tables</code> and <code>setup_database</code>, that a SphinxTestCase uses to define its environment in lieu of fixtures. These methods respectively take a list of symbols a la standard Rails class methods and a block inserting a set of database rows.</p> <h3>Once-per-class hook</h3> <p>The <code>self.inherited</code> block is a bit hard to parse, but it&#8217;s essentially just a way to run something once per test class, similar to RSpec&#8217;s <code>before(:all)</code>. It has to be called in the inherited callback because otherwise the super call ends up recursing and the code executes twice.</p> <p>This idea was adapted from work by <a href="http://lazyatom.com/">James Adam</a>, thanks James!</p> <h3>Override fixtures setup</h3> <p>The <code>setup_fixtures</code> and <code>teardown_fixtures</code> methods are defined by <code>ActiveSupport::TestCase</code>. They are completely overridden here since there&#8217;s nothing efficient that can be done with transactional fixtures enabled anyway.
Instead we just verify that some tables are specified, truncate them, set up the data, and index it. Because this only happens once per test class, we have to copy over the instance variables to provide the usual semantics.</p> <h3>Isn&#8217;t this data sharing A Bad Thing?</h3> <p>This design means that adding or deleting data within tests will be error-prone, since it can affect other tests. However, for Sphinx you would need to reindex anyway, and because Sphinx is all about querying full datasets, it made sense to me to set up one big set of test data and then write multiple tests to query against it. The worst case is that you simply need to create a new test class to set up a new context, which personally I find no more distasteful than contexts nested 3 or 4 levels deep.</p> <h3>Why not just truncate all tables?</h3> <p>That&#8217;s not a bad idea actually. Forgetting to specify tables tends to lead to bizarre duplicate key errors. However, in our database we have 150 tables, so I decided to require selectivity.
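</p>

<p>For reference, the truncate-everything approach is only a few lines. The helper below is a hypothetical sketch (the name <code>truncate_statements</code> is mine, not from the codebase); it assumes ActiveRecord supplies the table list and skips Rails&#8217; <code>schema_migrations</code> bookkeeping table:</p>

```ruby
# Hypothetical sketch: build TRUNCATE statements for every table except
# Rails' schema bookkeeping table. With ActiveRecord you would pass in
# ActiveRecord::Base.connection.tables and execute each statement.
def truncate_statements(tables, keep = ["schema_migrations"])
  (tables - keep).map { |table| "TRUNCATE #{table}" }
end

# e.g. statements = truncate_statements(ActiveRecord::Base.connection.tables)
#      statements.each { |sql| ActiveRecord::Base.connection.execute(sql) }
```

<p>Wired into <code>setup_fixtures</code>, this would make the per-class <code>indexes_tables</code> list unnecessary, at the cost of truncating every table on every test class.</p>

<p>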
<span class="caps">YMMV</span>.</p> <h3>Where&#8217;s the code?</h3> <p>This seemed like too specific a use-case to release in any formal fashion, but let me know if you disagree.</p> <h2>Step 3: Write Your Tests</h2> <p>Here&#8217;s an example using FactoryGirl and Shoulda:</p> <pre class="pastels_on_dark"><span class="ControlStructures">class</span> UserSphinxTest &lt; SphinxTestCase indexes_tables <span class="Constants"><span class="Constants">:</span>users</span>, <span class="Constants"><span class="Constants">:</span>emails</span> setup_database <span class="ControlStructures">do</span> <span class="Variables"><span class="Variables">@</span>user</span> <span class="Operators">=</span> <span class="Variables">Factory</span>(<span class="Constants"><span class="Constants">:</span>user</span>, <span class="Constants"><span class="Constants">:</span>first_name</span> =&gt; <span class="Strings"><span class="Strings">&quot;</span>Will<span class="Strings">&quot;</span></span>, <span class="Constants"><span class="Constants">:</span>last_name</span> =&gt; <span class="Strings"><span class="Strings">&quot;</span>Fulignoranz<span class="Strings">&quot;</span></span>) <span class="Variables"><span class="Variables">@</span>user2</span> <span class="Operators">=</span> <span class="Variables">Factory</span>(<span class="Constants"><span class="Constants">:</span>user</span>, <span class="Constants"><span class="Constants">:</span>first_name</span> =&gt; <span class="Strings"><span class="Strings">&quot;</span>Farmer<span class="Strings">&quot;</span></span>, <span class="Constants"><span class="Constants">:</span>last_name</span> =&gt; <span class="Strings"><span class="Strings">&quot;</span>William<span class="Strings">&quot;</span></span>) <span class="Variables"><span class="Variables">@</span>user3</span> <span class="Operators">=</span> <span class="Variables">Factory</span>(<span class="Constants"><span class="Constants">:</span>user</span>, <span class="Constants"><span class="Constants">:</span>first_name</span> =&gt; <span class="Strings"><span class="Strings">&quot;</span>Bubby<span class="Strings">&quot;</span></span>, <span class="Constants"><span class="Constants">:</span>last_name</span> =&gt; <span class="Strings"><span class="Strings">&quot;</span><span class="Strings">&quot;</span></span>) <span class="ControlStructures">end</span> should <span class="Strings"><span class="Strings">&quot;</span>find will<span class="Strings">&quot;</span></span> <span class="ControlStructures">do</span> results <span class="Operators">=</span> User.search(<span class="Strings"><span class="Strings">&#39;</span>will<span class="Strings">&#39;</span></span>) assert results.include?(<span class="Variables"><span class="Variables">@</span>user</span>) assert results.include?(<span class="Variables"><span class="Variables">@</span>user2</span>) assert <span class="Operators">!</span> results.include?(<span class="Variables"><span class="Variables">@</span>user3</span>) <span class="ControlStructures">end</span> should <span class="Strings"><span class="Strings">&quot;</span>find ignorance<span class="Strings">&quot;</span></span> <span class="ControlStructures">do</span> assert_equal <span class="Variables"><span class="Variables">@</span>user</span>, User.search(<span class="Strings"><span class="Strings">&#39;</span>fulignoranz<span class="Strings">&#39;</span></span>).first <span class="ControlStructures">end</span> should <span class="Strings"><span class="Strings">&quot;</span>not find the drummer<span class="Strings">&quot;</span></span> <span class="ControlStructures">do</span> assert_equal <span class="Numbers">0</span>, User.search(<span class="Strings"><span class="Strings">&#39;</span>friendly fred<span class="Strings">&#39;</span></span>).size <span class="ControlStructures">end</span> <span class="ControlStructures">end</span> </pre><h2>Results</h2> <p>19 tests, 36 assertions, finishes in 7 seconds, full
test suite is 20 seconds faster with 25 more assertions, and the marginal cost of additional Sphinx tests is now negligible.</p> <p>Questions?</p> gabe@websaviour.com (Gabe da Silveira) Sat, 13 Feb 2010 09:25:00 +0000 http://localhost:3000/articles/testing-thinkingsphinx-with-testunit-and-transacti http://localhost:3000/articles/testing-thinkingsphinx-with-testunit-and-transacti Benchmarking Rake Tasks and Trivial Rails Testing Optimizations <p>I&#8217;ve been thinking a lot about testing recently. Perhaps the most inconvenient truth about automated testing is that even though it&#8217;s many orders of magnitude faster than manual testing, it&#8217;s not infinitely fast, and the marginal cost does add up. This is especially true in slower, interpreted languages like Ruby.</p> <p>Recently our test suite at The Auteurs has been approaching 15 minutes on a fast MacBook Pro. The obvious solution to this is continuous integration, but I believe there is lower-hanging fruit. Being able to run the full suite (or at least the majority of it) fast means a lot to the agility of the team. There is always room to optimize the tests themselves, but I&#8217;ll leave that topic for another post.
What was of interest to me is that the default Rails testing tasks seemed to be getting quite slow on a code base of our size (30k lines).</p> <p>How to attack the problem?</p> <h2>Benchmarking Rake</h2> <p>Rake 0.8.7 doesn&#8217;t come with any built-in benchmarking options, but hooking in proved extremely straightforward (thank you Ruby!):</p> <pre class="pastels_on_dark"><span class="Keywords">require</span> <span class="Strings"><span class="Strings">&#39;</span>benchmark<span class="Strings">&#39;</span></span> <span class="ControlStructures">class</span> Rake::Task <span class="ControlStructures">def</span> execute_with_benchmark(<span class="Variables"><span class="Operators">*</span>args</span>) bench <span class="Operators">=</span> Benchmark.measure <span class="ControlStructures">do</span> execute_without_benchmark(<span class="Operators">*</span>args) <span class="ControlStructures">end</span> puts <span class="Strings"><span class="Strings">&quot;</span> <span class="Strings"><span class="Strings">#{</span>name<span class="Strings">}</span></span> --&gt; <span class="Strings"><span class="Strings">#{</span>bench<span class="Strings">}</span></span><span class="Strings">&quot;</span></span> <span class="ControlStructures">end</span> alias_method_chain <span class="Constants"><span class="Constants">:</span>execute</span>, <span class="Constants"><span class="Constants">:</span>benchmark</span> <span class="ControlStructures">end</span> </pre><p>That produces output like:</p> <pre class="pastels_on_dark">environment --&gt; 8.580000 1.390000 9.970000 ( 10.408723) db:abort_if_pending_migrations --&gt; 2.600000 0.040000 2.640000 ( 2.662062) db:test:purge --&gt; 0.010000 0.000000 0.010000 ( 0.146904) db:schema:load --&gt; 0.810000 0.070000 0.880000 ( 49.263301) db:test:load --&gt; 0.820000 0.070000 0.890000 ( 49.274854) db:test:prepare --&gt; 0.830000 0.070000 0.900000 ( 49.422403) test:units --&gt; 0.050000 0.080000 103.930000 (118.107608) 
</pre><p>It&#8217;s important to note here that this output doesn&#8217;t explicitly communicate nested tasks, and thus you can&#8217;t naively sum the totals. Rake mostly operates using prerequisites so that things run one at a time without being nested, but in this case <code>db:test:prepare</code> invokes <code>db:test:load</code>, which invokes <code>db:schema:load</code>, so the former includes the latter&#8217;s execution time.</p> <p>I also benchmarked a bunch of things in <code>test_helper.rb</code>, but the most significant ended up being the environment:</p> <pre class="pastels_on_dark"><span class="Keywords">require</span> <span class="Strings"><span class="Strings">&#39;</span>benchmark<span class="Strings">&#39;</span></span> bench <span class="Operators">=</span> Benchmark.measure <span class="ControlStructures">do</span> <span class="Keywords">require</span> File.expand_path(File.dirname(<span class="Variables">__FILE__</span>) <span class="Operators">+</span> <span class="Strings"><span class="Strings">&quot;</span>/../config/environment<span class="Strings">&quot;</span></span>) <span class="ControlStructures">end</span> puts <span class="Strings"><span class="Strings">&quot;</span> Require Rails Env --&gt; <span class="Strings"><span class="Strings">#{</span>bench<span class="Strings">}</span></span><span class="Strings">&quot;</span></span> </pre><p>which ended up being around 10 seconds, just like in the rake task:</p> <pre class="pastels_on_dark">Require Rails Env --&gt; 8.270000 1.820000 10.090000 ( 10.205786) </pre><h2>Rolling Up Testing Tasks Into One</h2> <p>If you aren&#8217;t thinking carefully you might assume that rake loading the environment is basically free since the tests need it anyway, but remember, the rake process is just running commands in a shell; it&#8217;s not invoking the tests in the same Ruby process. Therefore each block of tests (i.e. unit, functional, integration) needs to start up the environment from scratch.
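</p>

<p>To make the arithmetic concrete, here is a back-of-the-envelope sketch. The ~10.4 second figure is the environment load measured above; everything else is purely illustrative:</p>

```ruby
# Back-of-the-envelope sketch of the fixed startup cost when each test
# block runs in its own Ruby process. ENV_LOAD_SECONDS is the measured
# environment load from the benchmark output above.
ENV_LOAD_SECONDS = 10.4
test_blocks = ["test:units", "test:functionals", "test:integration"]

separate_cost = ENV_LOAD_SECONDS * test_blocks.size # one process per block
combined_cost = ENV_LOAD_SECONDS                    # one process for everything
savings = separate_cost - combined_cost

puts "separate processes: #{separate_cost.round(1)}s of startup"
puts "single process:     #{combined_cost}s of startup"
puts "saved:              #{savings.round(1)}s"
```

<p>With the standard three blocks that is roughly 20 seconds of pure startup cost recovered by running everything in a single process, before a single test has run.</p>

<p>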
That works fine for small applications, but as they grow the overhead becomes significant. In our case we even have a couple extra blocks of tests beyond the Rails defaults, all of which compounds the startup cost.</p> <p>However, it&#8217;s easy to write a task that runs all the tests in one block:</p> <pre class="pastels_on_dark">namespace <span class="Constants"><span class="Constants">:</span>test</span> <span class="ControlStructures">do</span> Rake::TestTask.new(<span class="Constants"><span class="Constants">:</span>fast</span>) <span class="ControlStructures">do </span>|<span class="Variables">t</span>| files <span class="Operators">=</span> FileList[<span class="Strings"><span class="Strings">&quot;</span>test/unit/**/*_test.rb<span class="Strings">&quot;</span></span>, <span class="Strings"><span class="Strings">&quot;</span>test/functional/**/*_test.rb<span class="Strings">&quot;</span></span>, <span class="Strings"><span class="Strings">&quot;</span>test/integration/**/*_test.rb<span class="Strings">&quot;</span></span>] t.libs <span class="Operators">&lt;&lt;</span> <span class="Strings"><span class="Strings">&#39;</span>test<span class="Strings">&#39;</span></span> t.verbose <span class="Operators">=</span> <span class="LanguageConstants">true</span> t.test_files <span class="Operators">=</span> files <span class="ControlStructures">end</span> Rake::Task[<span class="Strings"><span class="Strings">&#39;</span>test:fast<span class="Strings">&#39;</span></span>].comment <span class="Operators">=</span> <span class="Strings"><span class="Strings">&quot;</span>Runs unit/functional/integration tests in a single block<span class="Strings">&quot;</span></span> <span class="ControlStructures">end</span> </pre><p>The potential issue with this is that you lose the namespacing of the tests, which could lead to test name collisions, or there may be other types of environment collisions depending on the specifics of your app.
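</p>

<p>If you want to check up front whether the combined run is safe, a quick scan for duplicate test class names can help. This is a hypothetical helper (<code>test_class_collisions</code> is not part of Rails or Rake), just a sketch of the idea:</p>

```ruby
# Hypothetical helper: given a list of test file paths, report test class
# names defined in more than one file. Duplicates would collide once all
# three suites load into a single Ruby process.
def test_class_collisions(paths)
  seen = Hash.new { |hash, key| hash[key] = [] }
  paths.each do |path|
    File.read(path).scan(/^\s*class\s+(\w+Test)\b/) do |(name)|
      seen[name] << path
    end
  end
  seen.select { |_, files| files.size > 1 }
end

# Example: scan the three directories the combined task globs together.
collisions = test_class_collisions(
  Dir.glob("test/{unit,functional,integration}/**/*_test.rb")
)
collisions.each { |name, files| warn "#{name} defined in: #{files.join(', ')}" }
```

<p>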
However in our case this worked right out of the box.</p> <h2>Drop Test Preparations</h2> <p>The other thing you&#8217;ll notice about that rake task if you compare it to the ones defined in <code>railties/lib/tasks/testing.rake</code> is that I&#8217;ve dropped the prerequisite <code>db:test:prepare</code> from the definition. That means that the task won&#8217;t automatically reload the database schema, and it won&#8217;t warn you if migrations haven&#8217;t been run. It also means the environment isn&#8217;t required, yielding a savings of over a minute. Obviously this is highly dependent on the number of tables in the schema. In our case we have 130, which doesn&#8217;t seem like a tremendous amount&#8212;each table is adding over a third of a second of overhead. If our unit tests were optimized a bit this overhead could end up dominating the execution time.</p> <p>The downside here is that you are basically on the hook for calling <code>db:migrate</code> and <code>db:test:prepare</code>, but the number of times we need to call those is dwarfed by the frequency of test runs. It does require a bit of mental overhead to recognize test failures caused by a bad schema, but the performance gains are well worth it for a large app.</p> <h2>Isolate Slow Tests</h2> <p>Now that the rake overhead is removed, the next logical step is to optimize individual tests. So grab the <a href="http://github.com/timocratic/test_benchmark">test_benchmark</a> gem and profile your tests. In our case there were a few super slow tests dominating the runtime profile. These tests are important, but not necessarily in proportion to their runtime.
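</p>

<p>If you can&#8217;t pull in the gem, even a crude hand-rolled timer will surface the outliers. This sketch is not the test_benchmark <span class="caps">API</span>, just an illustration of the idea:</p>

```ruby
# Minimal hand-rolled timing sketch: wrap anything in `timed` and list
# the worst offenders afterwards.
TIMINGS = []

def timed(label)
  started = Time.now
  result = yield
  TIMINGS << [label, Time.now - started]
  result
end

def slowest(count = 5)
  TIMINGS.sort_by { |_, seconds| -seconds }.first(count)
end
```

<p>In a Test::Unit suite you could record each test by calling <code>timed</code> from <code>setup</code>/<code>teardown</code>; the gem does this far more thoroughly, so treat this as a stopgap.</p>

<p>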
My solution was to cordon off the slow tests in their own directory:</p> <pre class="pastels_on_dark">Rake::TestTask.new(<span class="Constants"><span class="Constants">:</span>slow</span>) <span class="ControlStructures">do </span>|<span class="Variables">t</span>| t.libs <span class="Operators">&lt;&lt;</span> <span class="Strings"><span class="Strings">&#39;</span>test<span class="Strings">&#39;</span></span> t.verbose <span class="Operators">=</span> <span class="LanguageConstants">true</span> t.pattern <span class="Operators">=</span> <span class="Strings"><span class="Strings">&#39;</span>test/slow/**/*_test.rb<span class="Strings">&#39;</span></span> <span class="ControlStructures">end</span> Rake::Task[<span class="Strings"><span class="Strings">&#39;</span>test:slow<span class="Strings">&#39;</span></span>].comment <span class="Operators">=</span> <span class="Strings"><span class="Strings">&quot;</span>Runs the slow tests<span class="Strings">&quot;</span></span> </pre><p>Just to be clear, I recognize that having a set of isolated tests like this is a code smell. In order to be fully comfortable with this I need to have a continuous integration setup running these tests so that any developer pushing code will be notified within 20-30 minutes if one of these tests is failing. The other mitigating factor is that these tests should not be regression tests of core functionality. That&#8217;s a judgement call, but with experience it&#8217;s not hard to classify some tests as more important than others.</p> <h2>Roll It In</h2> <p>I don&#8217;t want to mess around with the built-in Rails tasks because that is confusing to new developers. However I do want <code>test:fast</code> to be the default task because that&#8217;s going to be our bread and butter.
So in the end I came up with these changes:</p> <pre class="pastels_on_dark">Rake::Task[<span class="Constants"><span class="Constants">:</span>default</span>].clear task <span class="Constants"><span class="Constants">:</span>default</span> =&gt; <span class="Strings"><span class="Strings">&quot;</span>test:fast<span class="Strings">&quot;</span></span> task <span class="Strings"><span class="Strings">&quot;</span>test<span class="Strings">&quot;</span></span> =&gt; [<span class="Strings"><span class="Strings">&quot;</span>test:slow<span class="Strings">&quot;</span></span>] </pre><p>The last bit ensures that <code>rake test</code> will first run <code>test:slow</code>, which means that we now have the best of both worlds at our fingertips. <code>rake</code> will run the fast tests, and <code>rake test</code> will run the standard Rails test blocks, preceded by the slow tests.</p> <p>I chose to move 4 out of 250 test classes to the <code>slow/</code> set. Testing the total runtime from the shell using the <code>time</code> utility yields:</p> <pre class="pastels_on_dark">sh&gt; time(rake test) real 11m54.247s user 8m56.002s sys 0m30.125s sh&gt; time(rake) real 8m1.043s user 6m36.532s sys 0m16.860s </pre><p>So we went from 12 minutes down to 8 minutes by dropping less than 2% of the test classes. Not a bad first effort. Next up: optimization techniques for Rails test suites.</p> gabe@websaviour.com (Gabe da Silveira) Fri, 29 Jan 2010 19:00:00 +0000 http://localhost:3000/articles/benchmarking-rake-tasks-and-trivial-rails-testing- http://localhost:3000/articles/benchmarking-rake-tasks-and-trivial-rails-testing-