Choosing Your Programming Language – The Inside Scoop January 28, 2012
Posted by jeffvroom in Java/JEE, Software. 11 comments
Many programmers prefer typeless, interpreted languages like PHP and Ruby for several reasons. They are more concise and easier for a novice to read and write. Because they tend to be interpreted rather than compiled, they are simpler to use and typically offer a faster round trip between making a change and seeing the result. They also more easily support a “google, cut and paste” workflow which, frankly, is how many programmers operate these days.
And yet strongly typed languages are still more widely used, particularly as the complexity of the project and the number of developers grow. I have discussed this issue with a number of colleagues and wanted to write down my thoughts. It’s important to choose the right language for the job, and unfortunately there is no one-size-fits-all answer today, so knowing the details may help. My opinions were formed by poking around in the guts of the JVM, Python, PHP, Ruby, and Flash interpreters, and from coding extensively in Java, C, and C++.
Typeless versus Typed
One reason I believe typed languages are used is the robustness of the code itself. Typeless languages offer a single point of failure with each code construct. If you misspell a variable name, you do not find out until runtime, and then only by debugging the problem or through code inspection. With a typed language, a misspelling is normally caught at compile time because every name must occur in the program at least twice: once for the declaration and once for the usage. This fact alone will often make up for the extra keystrokes you need in a typed language.
With typed languages, more is known about the system during the code editing process. This makes the tooling opportunities richer and reduces keystrokes, which can make it faster to write code in a typed language than in an untyped one, even though the typed language is more verbose. Examples include handling imports and completing member or method names. The “find all usages” feature is extremely valuable for tracing code paths and refactoring. Typeless languages may offer such features, but they are much less specific because they can match only on name, not on type plus name. The ability to change a field or method name and reliably update all references is a big time saver when modifying a large existing project.
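To make the misspelling problem concrete, here is a minimal sketch in Python. The function name `describe` and the typo `cuont` are made up for illustration. The file loads and the common path works, so the mistake stays hidden until the rare branch actually runs.

```python
def describe(count):
    """Return a short label for an item count."""
    if count >= 0:
        return f"{count} items"
    # Typo on the next line: "cuont" instead of "count".
    # Python accepts the function definition anyway; the name is only
    # looked up when this line executes.
    return f"invalid count: {cuont}"


print(describe(3))  # the common path never touches the typo

try:
    describe(-1)    # the rare path finally hits it
except NameError as err:
    print("found only at runtime:", err)
```

A compiler for a typed language would reject the undeclared name before the program ever ran.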
Another reason people prefer typed languages, of course, is runtime performance. But why exactly do typed languages run so much faster? The biggest reason is that they offer a much faster way to evaluate “a.b” expressions and do method lookups (a.b()) at runtime. With a dynamic language, every single indirection requires a hashtable lookup or binary search, which turns into dozens or hundreds of instructions. With a typed language, a compiler can frequently generate an “a.b” with just a few instructions using a “load from fixed offset” pattern. That’s why a typeless language will usually run at least 10X slower than a typed language, no matter how many engineers Facebook puts on the problem.
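The two lookup styles can be sketched in Python, with the caveat that Python itself is interpreted, so this shows the mechanism and not the speed difference. The record layout, `Y_OFFSET`, and the function names are assumptions made up for the example.

```python
import struct

# "Typed" record: the layout is decided up front, so field y always
# lives at byte offset 4. A compiler can bake that offset into the code.
POINT_LAYOUT = struct.Struct("<ii")  # two 32-bit little-endian ints: x, y
Y_OFFSET = 4


def typed_get_y(record: bytes) -> int:
    """Load y straight from its fixed offset; no search by name."""
    return struct.unpack_from("<i", record, Y_OFFSET)[0]


def dynamic_get(obj: dict, name: str):
    """Find a field by name: hash the string, probe the table, compare keys."""
    return obj[name]


typed_point = POINT_LAYOUT.pack(3, 7)
dynamic_point = {"x": 3, "y": 7}

print(typed_get_y(typed_point))
print(dynamic_get(dynamic_point, "y"))
```

In compiled code the first form is roughly one load instruction, while the second has to hash and compare on every access.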
Some folks today are trying to infer types in typeless languages to improve runtime performance. In limited cases they could compile typeless code to use fixed offsets. That may well be an area of research which could improve the performance of some typeless code. I suspect though that the code which will speed up will need to be well organized around common types and so written a lot like a typed language.
It is also perhaps poorly understood that even typed languages do not always realize the a.b speedup from using fixed offsets. For example, when you use a feature like interfaces in Java, you do end up with some searching to find the right method in the general case. You may not see this all the time because Java employs a trick to cache the offset for the last type seen, which eliminates that search in many cases. I have a project in which changing one interface to an abstract class improved performance by over 50%.
One other poorly understood performance factor in comparing typeless and typed languages is what happens when interpreted code calls native code, for example when PHP or Java calls some C function. Native transitions are usually substantially slower than normal method calls because of the extra work they need to do in translating data types, pinning down memory used by the native code, copying memory between the managed and unmanaged environments, etc.
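The “cache the last type seen” trick is generally known as an inline cache. Below is a rough Python sketch of the idea only; it is not how any particular JVM implements it, and `CallSite`, `Circle`, and `Square` are hypothetical names.

```python
import math


class CallSite:
    """One call site that remembers the method found for the last receiver type."""

    def __init__(self, method_name):
        self.method_name = method_name
        self.cached_type = None
        self.cached_method = None
        self.slow_lookups = 0  # how many times we had to search

    def _search(self, cls):
        # The general case: walk the type hierarchy looking for the name.
        self.slow_lookups += 1
        for klass in cls.__mro__:
            if self.method_name in klass.__dict__:
                return klass.__dict__[self.method_name]
        raise AttributeError(self.method_name)

    def invoke(self, receiver, *args):
        cls = type(receiver)
        if cls is not self.cached_type:  # cache miss: search, then remember
            self.cached_method = self._search(cls)
            self.cached_type = cls
        return self.cached_method(receiver, *args)


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return math.pi * self.r * self.r


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


site = CallSite("area")
for shape in [Circle(1.0), Circle(2.0), Circle(3.0)]:
    site.invoke(shape)
print(site.slow_lookups)  # one search serves all three Circle calls
```

The cache pays off when a call site keeps seeing the same type. When types alternate, every call misses and the search cost comes back.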
Though both typed and typeless languages suffer the same problem, in general typeless languages use more, higher level C libraries. That’s probably because writing them in the language itself is too slow or just the effort involved in writing the code itself is too high given the limited commercial support for typeless languages. With more native transitions, the performance hit for this design increases so just moving more code into the native layer may not make things faster when you need to make lots of native method calls.
Of course more use of native code turns into an advantage when you have a small amount of typeless code which just strings together a few efficient but long running native methods, like copying a file. In these systems the typeless language is almost as fast as C.
In general, typeless languages have faster round-trip times between changing code and seeing the change. Because they are typeless, when you update a module you do not have to update the entire application. Changed code constructs can coexist with unchanged constructs. In a typed language, however, you have to update the type in a way that preserves the stricter typing contracts. Since the code itself relies on fixed offsets, when those offsets change you have to update all of the code atomically, which is hard to get right. Most typed languages cannot do that seamlessly and, worse still, there’s no way to know when it will or won’t work. That makes “class patching” useful only in special cases where you can isolate all dependencies on the class being changed.
Interpreted versus Compiled
To get good performance as a project grows, even interpreted languages these days must cache compiled descriptions of the code. They do however retain the ease of use benefits in most cases because this is all done transparently, by the browser or the runtime engine. When the code changes, these caches are updated automatically. Without such a feature, interpreted languages bog down as code sizes grow. Each time a process restarts, too much code must be interpreted before you can use the system.
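As a rough illustration of such a transparent cache, here is a small Python sketch that keys compiled code on a hash of the source text, so a change to the code automatically produces a fresh compile. The name `load_compiled` and the in-memory dictionary are assumptions for the example; real runtimes usually persist the cache on disk.

```python
import hashlib

_compiled_cache = {}  # source hash -> compiled code object


def load_compiled(source: str):
    """Return compiled code for source, compiling only when the text is new."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    code = _compiled_cache.get(key)
    if code is None:
        code = compile(source, "<cached-module>", "exec")
        _compiled_cache[key] = code
    return code


script = "answer = 6 * 7"
namespace = {}
exec(load_compiled(script), namespace)  # first call compiles
exec(load_compiled(script), namespace)  # second call reuses the cached code
print(namespace["answer"])
```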
Thread Architecture
Java, C, and C++ are all multithreaded using operating system thread scheduling. In general, this means that all code must be “thread aware”, though in practice frameworks try to reduce the likelihood of thread conflicts. When a framework is well designed, the burden of synchronization is not imposed on application code.
You need a threaded architecture when you need to share a large pool of memory or efficiently perform I/O with a bunch of sockets or files. You can more easily leverage a multi CPU environment with OS threading.
In contrast, even multi-threaded VMs like Python may have a global interpreter lock or will do VM-based thread scheduling. Either of these architectures limits how much work can truly run in parallel inside one process unless you switch to a multi-process model. For example, PHP will typically run each HTTP request in a separate process and so achieves some form of parallelism that way. But in doing so, it eliminates the use of shared memory, which reduces the efficiency of memory caching. It also means that any data structure used by all HTTP requests must be replicated across all PHP processes, further increasing both computation and RAM usage.
So for PHP, you’ll need even more memory and more CPU to populate that memory. You do still benefit from OS level file caching of course.
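To show what sharing memory across requests buys you, here is a minimal Python sketch of several worker threads using one in-process cache. `SharedCache`, `expensive_load`, and the request keys are hypothetical. In a process-per-request model, each process would need its own copy of the cache and would repeat the loads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedCache:
    """One cache shared by every worker thread in the process."""

    def __init__(self, loader):
        self._loader = loader
        self._data = {}
        self._lock = threading.Lock()  # the synchronization threads require
        self.loads = 0

    def get(self, key):
        with self._lock:
            if key not in self._data:
                self._data[key] = self._loader(key)
                self.loads += 1
            return self._data[key]


def expensive_load(key):
    # Stand-in for a slow database or file read.
    return key.upper()


cache = SharedCache(expensive_load)


def handle_request(key):
    return cache.get(key)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_request, ["user", "cart", "user", "cart"] * 5))

print(cache.loads)  # each distinct key was loaded once for all 20 requests
```

The lock is the price of admission: it is exactly the kind of synchronization a well-designed framework tries to keep out of application code.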
What about the Future?
I tried to be neutral in my analysis but you can probably tell from the above that I like the benefits of typed languages. When you consider long term costs, and include modifications, enhancements, transfer of code between developers, runtime efficiency for either large scale or mobile deployments, strongly typed languages win out.
I agree, however, with Ruby and PHP developers that we have not yet reached the point where a strongly typed language will beat PHP and Ruby on any given project. As long as the code is easier to read and edit for most people, the typed language advantages may easily be outweighed by availability of people, cost, and the poor workflows that complex typed languages like Java, C, and C++ offer designers, analysts, and admins.
To bridge the gap, we need a strongly typed language which has:
- simplified tools – the Java IDE is too complex for entry level programmers and others who work with PHP and Ruby code today
- syntax improvements – eliminating imports, using inferred typing, and simplifying the syntax in general will bring typed languages much closer to untyped languages in readability and brevity.
- mixed interpreted/compiled modes and a way to migrate code between them as it solidifies
- updating of types in the common cases for immediate code updates, and when that’s not possible, the ability to know as soon as the code is changed that a restart is required.
- built in compilation, dependency management for automated builds, updates, deployments. Maven, ant, and IDE configuration are too complex today.
What do you think? Did I miss any important issues that affect your choice of a language? Let me know in the comments!
Evolution of Forms (More about Why I left Adobe) July 4, 2010
Posted by jeffvroom in All, Flex, BlazeDS, LCDS, Java/JEE, Software.
An article of mine about evolution of forms technology was published on The Register. The need for this technology is why I went to work at Adobe and why I left when I realized they would not market LCDS this way.
BTW, Froyo (aka the Android 2.2 update) arrived on my Nexus One July 1. My phone runs Flash! Congrats to my friends at Adobe for creating the first/best universal portable runtime for rich UIs. As a stockholder, I just wish you had a better monetization vehicle for it (hint, hint). Thanks, Google, for not being afraid of Flash, plus all of the great things you did with Android: tethering, navigation, My Tracks, maps, Gmail, etc.
MS and Oracle’s big dev tools – who needs ‘em? March 2, 2010
Posted by jeffvroom in All, Flex, BlazeDS, LCDS, Java/JEE, Software. 3 comments
My article on The Register.
It’s Gotta Be Git February 27, 2010
Posted by jeffvroom in All, Flex, BlazeDS, LCDS, Java/JEE, Software.
Source control plays an essential role in software engineering. I’ve been using it ever since my first job and it transformed how I code. But like every tool it seems, it can be your best friend or at times your worst enemy. Most painfully, CVS, SVN and P4 for example all are terrible at merging a branch the second time. They lose track of what was already merged and start registering false conflicts.
At Adobe, on some complex projects during lockdown you’d have to coordinate with someone before each checkin. He’d bracket batches of commits with tags, then carefully merge a set of deltas one batch at a time. Not a fun job – everyone’s waiting on you, while you are trying to juggle lots of code you did not write at a critical juncture of the project.
The other time source control let me down in a big way was on my trip to India a couple of years ago. Access to the source control system back in San Jose was so poor, it made me change how I worked – in a bad way. I did not verify the diffs and checkin comments for affected code before making changes. I batched up all sync/checkins during breaks (and yes took more breaks).
The reasons Git is superior:
Local history, local branches
I started a new project by creating a git repository on my local machine (git init, git add). A few months later, I wanted to share the code with a friend. I cloned my local repository into a bare repository on a hosted linux vps, then gave out that URL (git clone ssh://myserver.com/var/git/myapp.git). Now I can “git push” and “git pull” changes to/from that remote server as needed to share or backup my work. Each repository maintains the entire history of shared branches so even if there is a central repository, you use it less often. When you have conflicts trying to push or pull, there’s one straightforward process to merge and resolve them.
Stashing changes
Occasionally you need to put work on hold to fix some other more important bug. Git lets you stash away your changes temporarily (“git stash”), do the fix, then bring your changes back with “git stash apply”, all without touching a server.
Smaller checkins
Because you can check in changes to your own repository without affecting others and without having to run the complete test suite, your checkins tend to be smaller which improves the quality of your version history. At Adobe I was known for massive checkins sometimes with as many as 10 bug fixes. That’s because the test suites would take an hour or more to run. I could run them at most two times a day without interrupting my work. Later this cost me time when trying to identify or merge a particular fix. With Git you make checkins to your local project at natural intervals for history. You push/pull at natural intervals for synchronization.
Staging/Live Repositories
On all but the smallest projects, you need a way to test environments that are isolated from active development prior to release. Usually you tell coders to stop checking in changes during lockdown or you might create a branch and start merging. Either way slows you down at the most critical phase of the project.
With Git you define a separate server repository for each level of isolation that is required. You might have a development repository which developers sync to, a staging repository for testing primarily used by QA, followed by a live one that is used to mirror what is actually released or to be released. During normal development, you might have staging automatically pull from development so QA stays on the latest. But after lockdown, you turn this off. QA can move changes as needed from the development repository into staging and sync that to live as needed. Any developer can change their default repository and sync to either staging or live as needed when problems arise.
No Waiting
So far, I like the performance characteristics of Git. Given the architecture, some things are faster and some things are slower, but I suspect that since Linus wrote the core, most things you do day-to-day are faster even on large projects. Version information is maintained per-repository, not per-file, so getting the changes which affect an individual file can be slower, e.g. the “git blame” command (similar to cvs annotate). But commits, push and pull commands have so far been very fast for me. Despite the fact that Git does not store changes as “diffs”, but instead stores file contents as compressed blobs, space has not been an issue.
Smarter Than You’d Expect
Renaming a file? Git figures that out automatically by comparing SHA1 hashes. Git can even figure out when you refactor a big chunk of one file into another one. It prints fancy ASCII art during each push/pull to show you added/removed chunks.
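Here is a small Python sketch of how hash comparison can spot an exact rename. The `blob_id` function follows Git’s blob hashing (SHA1 over a “blob <size>” header plus the content); `find_exact_renames` and the snapshot dictionaries are made up for illustration and only cover files moved without edits. Real Git also uses similarity heuristics for files that changed while being renamed.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash file content the way Git hashes a blob object."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def find_exact_renames(before: dict, after: dict) -> list:
    """Pair removed paths with added paths whose content hash is identical."""
    removed = {path: blob_id(data) for path, data in before.items() if path not in after}
    added = {blob_id(data): path for path, data in after.items() if path not in before}
    return sorted((old, added[h]) for old, h in removed.items() if h in added)


before = {"src/util.py": b"def helper():\n    return 1\n", "README": b"docs\n"}
after = {"src/helpers.py": b"def helper():\n    return 1\n", "README": b"docs\n"}

print(find_exact_renames(before, after))
```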
Verifies All Files
Kernel programmers tend to be paranoid (a good thing). Git verifies the integrity of all files using SHA1 hashes. If any bit is out of place, it will barf with some cryptic error that may require a Google search to fix. But this has already paid off for me. One problem I had with Git on Windows was running it in cygwin without newlines getting destroyed (it only works in cygwin’s binary mode). Git complained, which prevented me from checking in any corrupted files.
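The idea behind that verification is content addressing: each object is stored under the hash of its own content, so any corruption shows up as a mismatch. A minimal Python sketch follows; the `ObjectStore` class is a hypothetical stand-in, not Git’s actual storage format, though `object_id` uses Git’s blob hashing scheme.

```python
import hashlib


def object_id(content: bytes) -> str:
    """SHA1 over a Git-style blob header plus the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressed store: the key is the hash of the value."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = object_id(content)
        self.objects[oid] = content
        return oid

    def verify(self) -> list:
        """Return the ids of objects whose content no longer matches their id."""
        return [oid for oid, data in self.objects.items() if object_id(data) != oid]


store = ObjectStore()
oid = store.put(b"line one\nline two\n")
print(store.verify())  # nothing is corrupted yet

# Simulate newline mangling, like the cygwin text-mode problem.
store.objects[oid] = b"line one\r\nline two\r\n"
print(store.verify() == [oid])  # the damaged object is reported
```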
Flexible
My favorite app server, Resin, is now using Git behind the scenes to sync files across a cluster of servers. I like that use since a) it is pretty fast, b) it makes it easy to make an isolated change on a live server while tracking that change robustly, c) you can check the history even on production, d) the verification comes in super handy here – any local changes can be detected and traced.
As with all new technology there are caveats. Git is still fairly low-level, has numerous options and does not fully follow industry-standard conventions (e.g. to revert: “git checkout file”). It takes more thought to set up repositories and workflows, and the two-phase commit/push process requires some mental re-wiring. Because it is so flexible, people are still figuring out how to use it best for different purposes. Since no one is making money off of git (except maybe github?), it is evolving fairly slowly in the “polish” area. But from now on, for me it’s gotta be git.