Speed Read 2 0 1 – Reading Techniques
Reading is one of the main ways we absorb information, and being able to read fast is something most of us would like. If you could read 100% faster, for instance, you would read twice as many books every year, or get through the newspaper twice as fast every morning.
The problem is that most speed reading tricks and techniques never work as promised.
- Word-chunking closely parallels the idea of eliminating the inner monologue. It is the act of reading multiple words at once, and it is the key to reading faster. All of these reading tips tie together, yet word-chunking is probably the most active tool to use when you work to increase your reading speed.
- Eye exercises: training your eyes to expand your vision vertically and horizontally will allow you to read and understand several words at a time, on two or three different lines, with just a single glance.
This week I came across a video explaining a trick that surprised me, though. It takes some time to get used to, but once you do, I am sure you will feel a tangible increase in your reading speed. Here is the video:
The basic idea is that you need to stop reading with your larynx (i.e., trying to pronounce each word as you go, even if just mentally) and start reading with your eyes (i.e., the information is processed by the brain immediately as you see it).
The trick to shift from one to the other is to try to pronounce something else as you read, like “aeiou” or “123.” It sounds weird, but it worked for me.
44 Responses to “One Speed Reading Trick That Does Work”
- Sean Morrissy
Thanks for that. I read a ton of books and it would be great to get through them twice as fast. I’ll definitely try and work at this, my ever expanding library requires it!
Cheers,
Sean
- Farnoosh Brock
Fantastic, fantastic video. I love it. I think it worked already. I will need to practice because I am an extremely avid reader, both of information online and of heaps of books. I am now ready to take my husband up on his challenge for a reading race (he is a fast reader, but wait till I show him). Seriously, good stuff. Thanks!
- Tom Bradshaw
I think this may help some people. But for me to be able to take in what I’m reading I have to ‘say’ or think every word to myself. I read every day but probably not enough to be able to see any improvement in my speed.
- mmSeason
Interesting; I haven’t seen this method before and will give it a go. Of course it will take some practice, but it’s bothering me that I read more slowly than I used to, and I was never a fast reader! I’ve always put that down to having a brain that prefers the visual, which in this method will help instead of hinder.
(I don’t imagine it’s useful when you need to remember the specific wording, say when learning quotations for an essay.)
Thanx for posting. :0)
- Karen, author of “My Funny Dad, Harry”
I’ll definitely give this a try. It sounds like kind of the same principle for typing faster, going from thinking each letter as you type to thinking words as you type to thinking a phrase as you type and just registering through your eyes and not worrying about comprehending. I’m going to share about this on my Friday “Things I Learned This Week” post with a link back here.
- hugerewards
I think this technique will suit me, because it gets rid of my habit of always using my throat when I read. I don’t know well what this is analogous to, but I’ll spend time testing this speed reading method.
- Brandy
Very cool! I like 1234 better than the vowels. I used the technique to speed read the rest of these comments and it worked! In the beginning I had to read the sentence 2x but after awhile I was looking for the main words and skipping some but still understood the sentence or paragraph.
- Chester
Wow. I should have read this trick way back. It’s working!
- Jorge Delgado
Well I’ll give it a try…I do need to read faster.
Thanks
Jorge
- Props Blog Ideas
I’ve heard of a ton of different methods for speed reading. One of the big things all of them preach is word recognition: not pronouncing the word, but just seeing the word and recognizing it. I actually think AEIOU flows better than 123. For stuff like reading blogs or reviews, this technique seems really useful, but I’d be hesitant to use it for very technical things (I’m a chemist; I couldn’t read an SOP like this and expect to get it right).
On the other hand, it’s kind of like touch typing. At first it’s hard to use the “home keys.” After you get used to using the home keys, you slowly stop looking at the keyboard to type. Finally, you get to where you can type without looking at the keyboard, and your typing speed is really a function of how quickly you want to go and how complex the words you are using are.
Blake Waddill
- The Laughington Post
I also read quickly without saying the words, but I always react to what I read in a noticeable manner, something that surprises people around me.
- Web Marketing Tips
Wow, I never knew that I am using three senses or organs to read a word. This is really informative. I would love to try this on weekends and will share it with my nephews as well, so that their reading speed can increase.
- Brian D. Hawkins
I’m struggling with this. I can’t seem to understand a single word while saying aeiou. I guess I need to keep practicing.
- GoBusiness101
Thanks for the tip. It works well. I was already using this technique, but I didn’t have a name for it until today.
- Josh Stauffer
I’ll give it a shot. Hope this works for me.
- Rocque
I am definitely going to do this. I read really slow. I was speaking to our school librarian and told her I read 3 books this summer. Hey that is really good for me. She read 50! I can not believe she read 50 books in 10 weeks.
So you can see I can use this. I am also going to add this to my blog and give Daily Blog Tips as the place where I found this.
Thanks for another great post.
- Brandon
Yeah, this just hits the spot. I always try to pronounce each word, but the more I think about it, the more ridiculous it seems.
But I still don’t think that this will be any help, because reading headlines doesn’t take any time at all, and if the content is good, there’s no feeling of time lost.
- Fatin Pauzi
Bloggers need to learn speed reading so that it can help them save time whenever they comment on blogs. I think it is a vital subject to discuss, and if bloggers who write posts cooperate on it, it can help them save a lot of time and get the same result for the effort.
- Igor Kheifets
Thanks Daniel,
where can I get more videos like that?
Igor
- ROW
Will go with Stefan above. To me it’s too distracting.
I’m not sure how this can increase the speed. I am trying to comprehend one thing while saying something else, so my brain is processing two different pieces of information at the same time. It looks like, instead of increasing my speed, it may decrease it!
- ffoucaud
Thank you for this article.
But with a text instead of a video, we could learn this method even faster, couldn’t we? 🙂
- Stefan
I tried reading a few sentences but didn’t really get the text even though I could see the words. I’m guessing this is not a great method when you are studying or trying to learn something interesting.
- Boerne Search
I remember learning this my junior year. But I haven’t thought of it since. But it is a good idea.
Kane
- Matej
I thought everyone knew that; that is the main problem with reading: most people need to pronounce words in their head, every letter…
I didn’t know about the exercise. 1234 sounds worth trying… aeiou sounds hard to pronounce over and over lol
- Alex Lim
Thanks for bringing this up. I’ve been practicing reading out loud for years; in fact, it makes my reading slow. The timing is great, because I wanted to change this habit gradually. I think the technique mentioned will be applicable to me since it won’t get rid of the use of the larynx, which I always use. This is pretty new to me, but I’ll take time to try this speed reading trick. Just out of curiosity, how much time did it take you to master this trick?
- srikanth
Nice video, thanks for sharing..
- Colby
Thanks for posting this. I do a lot of reading everyday, but I’m very slow probably closer to the 125 WPM. If I can accelerate my reading I could save a ton of time. I think I’ll work on this everyday for the next month and see if I can improve my reading skills.
- MLDina
I started doing this a long time ago and while it’s great for scanning news or a book, you have to be careful with emails. Sometimes your brain will process what it thinks is one word and it can make a HUGE difference if you’re a letter or two off. Definitely helpful for catching up on the news and social media sites, though.
- Nicholas Cardot
Thanks for posting this. I’m going to devote some time to trying to become better at using this method of accelerated reading. I may also share this video with some of my readers because it really is amazing.
- John (Human3rror)
this is awesome. i was taught this a long time ago… and it works.
In the following sections, we briefly go through a few techniques that can help make your Julia code run as fast as possible.
A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.
Any code that is performance critical or being benchmarked should be inside a function.
We find that global names are frequently constants, and declaring them as such (for example, const DEFAULT_VAL = 0) greatly improves performance.
Uses of non-constant globals can be optimized by annotating their types at the point of use, for example by writing x::Vector{Float64} wherever a global x is read.
Passing arguments to functions is better style. It leads to more reusable code and clarifies what the inputs and outputs are.
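A minimal sketch of the three options above. The global vector data and the functions sum_annotated and sum_arg are hypothetical names chosen for illustration; only the const declaration and the point-of-use annotation come from the text.

```julia
const DEFAULT_VAL = 0          # constant global: its type can never change

data = rand(1000)              # non-constant global: its type could change at any time

# Option 1: annotate the non-constant global's type where it is used,
# so the loop body can be compiled for Vector{Float64}.
function sum_annotated()
    s = 0.0
    for v in data::Vector{Float64}
        s += v
    end
    return s
end

# Option 2 (preferred): pass the data in as an argument.
function sum_arg(x)
    s = 0.0
    for v in x
        s += v
    end
    return s
end
```

Both functions compute the same sum; the argument-passing version is also reusable with any vector, not just the one global.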
All code in the REPL is evaluated in global scope, so a variable defined and assigned at top level will be a global variable. Variables defined at top level scope inside modules are also global.
For example, in a REPL session, typing x = 1.0 at the prompt is equivalent to typing global x = 1.0, so all the performance issues discussed previously apply.
Measure performance with @time and pay attention to memory allocation
A useful tool for measuring performance is the @time macro. We here repeat the example with the global variable above, but this time with the type annotation removed:
On the first call (@time sum_global()) the function gets compiled. (If you've not yet used @time in this session, it will also compile functions needed for timing.) You should not take the results of this run seriously. For the second run, note that in addition to reporting the time, it also indicates that a significant amount of memory was allocated. We are here just computing a sum over all elements in a vector of 64-bit floats, so there should be no need to allocate memory (at least not on the heap, which is what @time reports).
Unexpected memory allocation is almost always a sign of some problem with your code, usually a problem with type-stability or creating many small temporary arrays. Consequently, in addition to the allocation itself, it's very likely that the code generated for your function is far from optimal. Take such indications seriously and follow the advice below.
If we instead pass x as an argument to the function, it no longer allocates memory (the allocation reported below is due to running the @time macro in global scope) and is significantly faster after the first call:
The 5 allocations seen are from running the @time macro itself in global scope. If we instead run the timing in a function, we can see that indeed no allocations are performed:
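A minimal sketch of the timing experiment described above, assuming a global vector x of 1000 random floats. The function names follow the text (sum_global); sum_arg and time_sum are illustrative. The printed timings and allocation counts depend on your machine and Julia version, so none are shown here.

```julia
x = rand(1000)

# Reads the non-constant global x with no type annotation: slow, allocates
function sum_global()
    s = 0.0
    for i in x
        s += i
    end
    return s
end

# Takes the vector as an argument: type-stable and allocation-free
function sum_arg(x)
    s = 0.0
    for i in x
        s += i
    end
    return s
end

# Running @time inside a function avoids the allocations that come from
# evaluating the @time macro itself in global scope
time_sum(x) = @time sum_arg(x)

@time sum_global()   # first call includes compilation; ignore it
@time sum_global()   # second call: note the reported heap allocation
@time sum_arg(x)     # first call compiles sum_arg
@time sum_arg(x)
time_sum(x)
```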
In some situations, your function may need to allocate memory as part of its operation, and this can complicate the simple picture above. In such cases, consider using one of the tools below to diagnose problems, or write a version of your function that separates allocation from its algorithmic aspects (see Pre-allocating outputs).
For more serious benchmarking, consider the BenchmarkTools.jl package, which among other things evaluates the function multiple times in order to reduce noise.
Julia and its package ecosystem include tools that may help you diagnose problems and improve the performance of your code:
- Profiling allows you to measure the performance of your running code and identify lines that serve as bottlenecks. For complex projects, the ProfileView package can help you visualize your profiling results.
- The Traceur package can help you find common performance problems in your code.
- Unexpectedly large memory allocations, as reported by @time, @allocated, or the profiler (through calls to the garbage-collection routines), hint that there might be issues with your code. If you don't see another reason for the allocations, suspect a type problem. You can also start Julia with the --track-allocation=user option and examine the resulting *.mem files to see information about where those allocations occur. See Memory allocation analysis.
- @code_warntype generates a representation of your code that can be helpful in finding expressions that result in type uncertainty. See @code_warntype below.
When working with parameterized types, including arrays, it is best to avoid parameterizing with abstract types where possible.
Consider the following:
Because a is an array of abstract type Real, it must be able to hold any Real value. Since Real objects can be of arbitrary size and structure, a must be represented as an array of pointers to individually allocated Real objects. However, if we instead only allow numbers of the same type, e.g. Float64, to be stored in a, these can be stored more efficiently:
Assigning numbers into a will now convert them to Float64, and a will be stored as a contiguous block of 64-bit floating-point values that can be manipulated efficiently.
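A minimal sketch of the two containers just described. The text reuses the name a for both; here the concrete version is called b (an illustrative rename) so both can be inspected side by side.

```julia
# Element type is the abstract type Real: each element is stored as a
# pointer to a separately allocated (boxed) object.
a = Real[]
push!(a, 1)        # stored as an Int
push!(a, 2.0)      # stored as a Float64
push!(a, π)        # stored as an Irrational{:π}

# Element type is the concrete type Float64: values are converted on
# insertion and stored inline as one contiguous block of 64-bit floats.
b = Float64[]
push!(b, 1)        # converted to 1.0
push!(b, 2.0)
push!(b, π)        # converted to the Float64 nearest to π
```

Checking isconcretetype(eltype(a)) and isconcretetype(eltype(b)) is a quick way to tell which layout a given array will get.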
See also the discussion under Parametric Types.
In many languages with optional type declarations, adding declarations is the principal way to make code run faster. This is not the case in Julia. In Julia, the compiler generally knows the types of all function arguments, local variables, and expressions. However, there are a few specific instances where declarations are helpful.
Types can be declared without specifying the types of their fields:
This allows a to be of any type. This can often be useful, but it does have a downside: for objects of type MyAmbiguousType, the compiler will not be able to generate high-performance code. The reason is that the compiler uses the types of objects, not their values, to determine how to build code. Unfortunately, very little can be inferred about an object of type MyAmbiguousType:
The values of b and c have the same type, yet their underlying representation of data in memory is very different. Even if you stored just numeric values in field a, the fact that the memory representation of a UInt8 differs from a Float64 also means that the CPU needs to handle them using two different kinds of instructions. Since the required information is not available in the type, such decisions have to be made at run-time. This slows performance.
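A minimal sketch of the ambiguous struct discussed above, with two illustrative values b and c. The comments note what the types do and do not reveal; Int is Int64 on a 64-bit system.

```julia
# No field type is given, so field a may hold a value of any type
struct MyAmbiguousType
    a
end

b = MyAmbiguousType("Hello")
c = MyAmbiguousType(17)

typeof(b)                          # MyAmbiguousType
typeof(c)                          # MyAmbiguousType: same type as b
typeof(b.a)                        # String
typeof(c.a)                        # Int
fieldtype(MyAmbiguousType, :a)     # Any: the type says nothing about a
```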
You can do better by declaring the type of a. Here, we are focused on the case where a might be any one of several types, in which case the natural solution is to use parameters. For example:
This is a better choice than declaring the field as a::AbstractFloat in a type such as MyStillAmbiguousType, because the first version specifies the type of a from the type of the wrapper object. For example:
The type of field a can be readily determined from the type of m, but not from the type of t. Indeed, in t it's possible to change the type of the field a:
In contrast, once m is constructed, the type of m.a cannot change:
The fact that the type of m.a is known from m's type, coupled with the fact that its type cannot change mid-function, allows the compiler to generate highly optimized code for objects like m but not for objects like t.
Of course, all of this is true only if we construct m with a concrete type. We can break this by explicitly constructing it with an abstract type:
For all practical purposes, such objects behave identically to those of MyStillAmbiguousType.
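A minimal sketch of the parametric and abstract-field versions compared above. The structs are declared mutable so that reassigning the field is possible; the names m, t and func follow the text, while m_abs is an illustrative name.

```julia
# Field type comes from a parameter: m.a's type is known from typeof(m)
mutable struct MyType{T<:AbstractFloat}
    a::T
end

# Field declared with an abstract type: typeof(t) says little about t.a
mutable struct MyStillAmbiguousType
    a::AbstractFloat
end

m = MyType(3.2)                 # a MyType{Float64}
t = MyStillAmbiguousType(3.2)

t.a = 4.5f0                     # allowed: t.a now holds a Float32
m.a = 4.5f0                     # converted: m.a stays a Float64

# Constructing with an abstract parameter brings the ambiguity back
m_abs = MyType{AbstractFloat}(3.2)
m_abs.a = 4.5f0                 # now a Float32, just like t.a

# Simple function for comparing generated code, e.g. in the REPL:
# code_llvm(func, Tuple{MyType{Float64}}) versus
# code_llvm(func, Tuple{MyType{AbstractFloat}})
func(m::MyType) = m.a + 1
```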
It's quite instructive to compare the sheer amount of code generated for a simple function such as func(m::MyType) = m.a + 1 using code_llvm(func, Tuple{MyType{Float64}}) and code_llvm(func, Tuple{MyType{AbstractFloat}}).
For reasons of length the results are not shown here, but you may wish to try this yourself. Because the type is fully specified in the first case, the compiler doesn't need to generate any code to resolve the type at run-time. This results in shorter and faster code.
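The same best practices also work for container types. For example, consider a container MySimpleContainer{A<:AbstractVector} whose single field a is declared with type A.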
For MySimpleContainer, the object is fully specified by its type and parameters, so the compiler can generate optimized functions. In most instances, this will probably suffice.
While the compiler can now do its job perfectly well, there are cases where you might wish that your code could do different things depending on the element type of a. Usually the best way to achieve this is to wrap your specific operation (here, foo) in a separate function:
This keeps things simple, while allowing the compiler to generate optimized code in all cases.
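A minimal sketch of wrapping the element-specific operation in its own function, assuming MySimpleContainer has a single field a whose type is its parameter A<:AbstractVector. The behaviour of foo (identity for integers, rounding for floats) and the outer function sumfoo are illustrative choices.

```julia
# Container whose field type is fully determined by its type parameter
struct MySimpleContainer{A<:AbstractVector}
    a::A
end

# Element-type-specific behaviour lives in a small separate function...
foo(x::Integer) = x
foo(x::AbstractFloat) = round(x)

# ...so this outer loop is compiled once per concrete container type,
# and each call to foo is resolved at compile time.
function sumfoo(c::MySimpleContainer)
    s = zero(eltype(c.a))     # start from a zero of the element type
    for x in c.a
        s += foo(x)
    end
    return s
end
```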
However, there are cases where you may need to declare different versions of the outer function for different element types or types of the AbstractVector of the field a in MySimpleContainer. You could do it like this:
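A minimal sketch of dispatching the outer function on the container's parameter. The function myfunc and the constants it adds are illustrative; the point is that each method constrains the AbstractVector type (or its element type) stored in the field a.

```julia
struct MySimpleContainer{A<:AbstractVector}
    a::A
end

# Any AbstractVector with integer elements (e.g. the range 1:3)
myfunc(c::MySimpleContainer{<:AbstractVector{<:Integer}}) = c.a[1] + 1

# Any AbstractVector with floating-point elements
myfunc(c::MySimpleContainer{<:AbstractVector{<:AbstractFloat}}) = c.a[1] + 2

# Only a concrete Vector of integers; more specific than the first method
myfunc(c::MySimpleContainer{Vector{T}}) where {T<:Integer} = c.a[1] + 3
```

For a Vector{Int}, both the first and third methods match, and dispatch selects the third because it is more specific.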
It is often convenient to work with data structures that may contain values of any type (arrays of type Array{Any}). But, if you're using one of these structures and happen to know the type of an element, it helps to share this knowledge with the compiler:
Here, we happened to know that the first element of a would be an Int32. Making an annotation like this has the added benefit that it will raise a run-time error if the value is not of the expected type, potentially catching certain bugs earlier.
In the case that the type of a[1] is not known precisely, x can be declared via x = convert(Int32, a[1])::Int32. The use of the convert function allows a[1] to be any object convertible to an Int32 (such as UInt8), thus increasing the genericity of the code by loosening the type requirement. Notice that convert itself needs a type annotation in this context in order to achieve type stability. This is because the compiler cannot deduce the type of the return value of a function, even convert, unless the types of all the function's arguments are known.
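A minimal sketch of both annotation styles on an Array{Any}. The function names first_plus_one and first_plus_one_generic are hypothetical; Int32(1) is used instead of a plain 1 so that the result stays an Int32.

```julia
# a may hold anything, but we happen to know a[1] is an Int32
function first_plus_one(a::Vector{Any})
    x = a[1]::Int32          # type assertion: errors if a[1] is not an Int32
    return x + Int32(1)      # the compiler can treat x as Int32 from here on
end

# Looser variant: accept anything convertible to Int32 (e.g. a UInt8).
# The ::Int32 on convert's result is what keeps x type-stable.
function first_plus_one_generic(a::Vector{Any})
    x = convert(Int32, a[1])::Int32
    return x + Int32(1)
end
```

Calling first_plus_one(Any[5]) throws a TypeError, because the literal 5 is an Int, not an Int32.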
Type annotation will not enhance (and can actually hinder) performance if the type is constructed at run-time. This is because the compiler cannot use the annotation to specialize the subsequent code, and the type-check itself takes time. For example, in the code:
the annotation of c harms performance. To write performant code involving types constructed at run-time, use the function-barrier technique discussed below, and ensure that the constructed type appears among the argument types of the kernel function so that the kernel operations are properly specialized by the compiler. For example, in the above snippet, as soon as b is constructed, it can be passed to another function k, the kernel. If, for example, function k declares b as an argument of type Complex{T}, where T is a type parameter, then a type annotation appearing in an assignment statement within k of the form c = (b + 1.0f0)::Complex{T} does not hinder performance (but does not help either), since the compiler can determine the type of c at the time k is compiled.
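A minimal sketch contrasting an annotation on a run-time-constructed type with a function barrier. The function nr, its prec argument (32 or 64) and the kernel k are illustrative; nr_barrier is a hypothetical name for the barrier version.

```julia
# Annotation on a type chosen at run time: legal, but it only adds a check
function nr(a, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(a)
    c = (b + 1.0f0)::Complex{ctype}   # ctype is unknown when nr is compiled
    return abs(c)
end

# Kernel: compiled separately for each concrete T it is called with
function k(b::Complex{T}) where {T}
    c = (b + 1.0f0)::Complex{T}       # T is known when k is compiled
    return abs(c)
end

# Function barrier: build b, then hand it to the specialized kernel
function nr_barrier(a, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(a)
    return k(b)
end
```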
As a heuristic, Julia avoids automatically specializing on argument type parameters in three specific cases: Type, Function, and Vararg. Julia will always specialize when the argument is used within the method, but not if the argument is just passed through to another function. This usually has no performance impact at runtime and improves compiler performance. If you find it does have a performance impact at runtime in your case, you can trigger specialization by adding a type parameter to the method declaration. Here are some examples:
This will not specialize:
but this will:
These will not specialize:
but this will:
This will not specialize:
but this will:
One only needs to introduce a single type parameter to force specialization, even if the other types are unconstrained. For example, this will also specialize, and is useful when the arguments are not all of the same type:
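A minimal sketch of the Type, Function and Vararg cases listed above, with the non-specializing form first and the type-parameter form second. All function names are illustrative; every argument in the non-specializing versions is merely passed through to another function (ones, ntuple, tuple).

```julia
# Type: t is only passed through to ones, so Julia may not specialize on it
function f_type(t)
    x = ones(t, 10)
    return sum(map(sin, x))
end

# Adding a type parameter forces specialization on the concrete type
function g_type(t::Type{T}) where {T}
    x = ones(T, 10)
    return sum(map(sin, x))
end

# Function: the function argument is only passed through to ntuple
f_func(f, num) = ntuple(f, div(num, 2))                # not specialized
g_func(g::Function, num) = ntuple(g, div(num, 2))      # still not specialized
h_func(h::H, num) where {H} = ntuple(h, div(num, 2))   # specialized

# Vararg
f_vararg(x::Int...) = tuple(x...)                       # not specialized
g_vararg(x::Vararg{Int,N}) where {N} = tuple(x...)      # specialized

# A single parameter is enough, even when the arguments differ in type
h_vararg(x::Vararg{Any,N}) where {N} = tuple(x...)
```

All of these return the same values either way; the difference is only in which compiled specializations Julia generates.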
Note that @code_typed and friends will always show you specialized code, even if Julia would not normally specialize that method call. You need to check the method internals if you want to see whether specializations are generated when argument types are changed, i.e., if (@which f(...)).specializations contains specializations for the argument in question.
Writing a function as many small definitions allows the compiler to directly call the most applicable code, or even inline it.
Here is an example of a 'compound function' that should really be written as multiple definitions:
This can be written more concisely and efficiently as:
It should however be noted that the compiler is quite efficient at optimizing away the dead branches in code written as the mynorm example.
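A minimal sketch of the compound function and its split-up equivalent. The split version is named norm_split here (a hypothetical name) so that it does not clash with LinearAlgebra.norm; dot and svdvals are the LinearAlgebra standard-library functions.

```julia
using LinearAlgebra: dot, svdvals

# 'Compound' version: one method that branches on the argument's type
function mynorm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A, A)))
    elseif isa(A, Matrix)
        return maximum(svdvals(A))
    else
        error("mynorm: invalid argument")
    end
end

# Same behaviour as separate definitions; dispatch picks the right one,
# and anything else raises a MethodError instead of the custom error
norm_split(x::Vector) = sqrt(real(dot(x, x)))
norm_split(A::Matrix) = maximum(svdvals(A))
```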
When possible, it helps to ensure that a function always returns a value of the same type. Consider the following definition:
Although this seems innocent enough, the problem is that 0 is an integer (of type Int) and x might be of any type. Thus, depending on the value of x, this function might return a value of either of two types. This behavior is allowed, and may be desirable in some cases. But it can easily be fixed as follows:
There is also a oneunit function, and a more general oftype(x, y) function, which returns y converted to the type of x.
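A minimal sketch of the return-type problem and its fix. The function name pos follows the usual form of this example; pos_unstable is an illustrative name for the original version.

```julia
# Type-unstable: returns the Int 0 for negative x, otherwise x itself
pos_unstable(x) = x < 0 ? 0 : x

# Type-stable: zero(x) is a zero of the same type as x
pos(x) = x < 0 ? zero(x) : x

pos_unstable(-1.5)   # 0 (an Int, although the input was a Float64)
pos(-1.5)            # 0.0 (a Float64)
oneunit(2.5)         # 1.0: the value 1 in the type of the argument
oftype(2.5, 3)       # 3.0: 3 converted to the type of 2.5
```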
An analogous 'type-stability' problem exists for variables used repeatedly within a function:
Local variable x starts as an integer, and after one loop iteration becomes a floating-point number (the result of the / operator). This makes it more difficult for the compiler to optimize the body of the loop. There are several possible fixes:
- Initialize x with x = 1.0
- Declare the type of x explicitly as x::Float64 = 1
- Use an explicit conversion by x = oneunit(Float64)
- Initialize with the first loop iteration, to x = 1 / rand(), then loop for i = 2:10
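A minimal sketch of the unstable loop and three of the fixes listed above. The function names are hypothetical; each loop divides x by a fresh random number ten times.

```julia
# Type-unstable: x starts as an Int and becomes a Float64 after one iteration
function foo_unstable()
    x = 1
    for i = 1:10
        x /= rand()
    end
    return x
end

# Fix: start from a Float64 (x = oneunit(Float64) works the same way)
function foo_float()
    x = 1.0
    for i = 1:10
        x /= rand()
    end
    return x
end

# Fix: declare the local's type; the assigned 1 is converted to 1.0
function foo_declared()
    x::Float64 = 1
    for i = 1:10
        x /= rand()
    end
    return x
end

# Fix: peel off the first iteration so x is a Float64 from the start
function foo_peeled()
    x = 1 / rand()
    for i = 2:10
        x /= rand()
    end
    return x
end
```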
Many functions follow a pattern of performing some set-up work, and then running many iterations to perform a core computation. Where possible, it is a good idea to put these core computations in separate functions. For example, the following contrived function returns an array of a randomly-chosen type:
This should be written as:
Julia's compiler specializes code for argument types at function boundaries, so in the original implementation it does not know the type of a during the loop (since it is chosen randomly). Therefore the second version is generally faster, since the inner loop can be recompiled as part of fill_twos! for different types of a.
The second form is also often better style and can lead to more code reuse.
This pattern is used in several places in Julia Base. For example, see vcat and hcat in abstractarray.jl, or the fill! function, which we could have used instead of writing our own fill_twos!.
Functions like strange_twos occur when dealing with data of uncertain type, for example data loaded from an input file that might contain either integers, floats, strings, or something else.
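A minimal sketch of the function-barrier pattern described above, using the names from the text (strange_twos, fill_twos!). The suffix _slow marks an illustrative copy of the original single-function version.

```julia
# Original: the element type is chosen at random, so inside the loop
# the compiler does not know the concrete type of a
function strange_twos_slow(n)
    a = Vector{rand(Bool) ? Int64 : Float64}(undef, n)
    for i = 1:n
        a[i] = 2
    end
    return a
end

# Kernel: compiled separately for each concrete type of a it receives
function fill_twos!(a)
    for i in eachindex(a)
        a[i] = 2
    end
    return a
end

# Set-up work, then a call across the function barrier
function strange_twos(n)
    a = Vector{rand(Bool) ? Int64 : Float64}(undef, n)
    fill_twos!(a)
    return a
end
```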
Let's say you want to create an N-dimensional array that has size 3 along each axis. Such arrays can be created like this: A = fill(5.0, (3, 3)).
This approach works very well: the compiler can figure out that A is an Array{Float64,2} because it knows the type of the fill value (5.0::Float64) and the dimensionality ((3, 3)::NTuple{2,Int}). This implies that the compiler can generate very efficient code for any future usage of A in the same function.
But now let's say you want to write a function that creates a 3×3×... array in arbitrary dimensions; you might be tempted to write a function such as array3(fillval, N) = fill(fillval, ntuple(d -> 3, N)).
This works, but (as you can verify for yourself using @code_warntype array3(5.0, 2)) the problem is that the output type cannot be inferred: the argument N is a value of type Int, and type inference does not (and cannot) predict its value in advance. This means that code using the output of this function has to be conservative, checking the type on each access of A; such code will be very slow.
Now, one very good way to solve such problems is by using the function-barrier technique. However, in some cases you might want to eliminate the type-instability altogether. In such cases, one approach is to pass the dimensionality as a parameter, for example through Val{T}() (see 'Value types'):
Julia has a specialized version of ntuple that accepts a Val{::Int} instance as the second parameter; by passing N as a type-parameter, you make its 'value' known to the compiler. Consequently, this version of array3 allows the compiler to predict the return type.
However, making use of such techniques can be surprisingly subtle. For example, it would be of no help if you called array3 from a function like this:
Here, you've created the same problem all over again: the compiler can't guess what n is, so it doesn't know the type of Val(n). Attempting to use Val, but doing so incorrectly, can easily make performance worse in many situations. (Only in situations where you're effectively combining Val with the function-barrier trick, to make the kernel function more efficient, should code like the above be used.)
An example of correct usage of Val would be:
In this example, N is passed as a parameter, so its 'value' is known to the compiler. Essentially, Val(T) works only when T is either hard-coded/literal (Val(3)) or already specified in the type-domain.
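A minimal sketch of the array3 variants discussed above. The run-time and Val methods follow the text; call_array3 reproduces the incorrect pattern, and ones3_like is a hypothetical name for a correct use where N comes from the type of an input array.

```julia
# Creating the array directly: the type is inferred as Matrix{Float64}
A = fill(5.0, (3, 3))

# N is a run-time Int, so the output dimensionality cannot be inferred
array3(fillval, N::Integer) = fill(fillval, ntuple(d -> 3, N))

# N is a type parameter (via Val), so the output type is known to the compiler
array3(fillval, ::Val{N}) where {N} = fill(fillval, ntuple(d -> 3, Val(N)))

# Incorrect use: n is a run-time value, so the type of Val(n) is unknown
call_array3(fillval, n) = array3(fillval, Val(n))

# Correct use: N is taken from the type of the input array
ones3_like(X::AbstractArray{T,N}) where {T,N} = array3(one(T), Val(N))
```

All three calls return the same arrays; only the Val method called with a compile-time N lets the compiler predict the result type.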
Once one learns to appreciate multiple dispatch, there's an understandable tendency to go overboard and try to use it for everything. For example, you might imagine using it to store information, e.g. by defining a struct Car{Make, Model} with a year::Int field (plus any other fields), and then dispatching on objects like Car{:Honda,:Accord}(year, args...).
This might be worthwhile when either of the following is true:
- You require CPU-intensive processing on each Car, and it becomes vastly more efficient if you know the Make and Model at compile time and the total number of different Make or Model that will be used is not too large.
- You have homogeneous lists of the same type of Car to process, so that you can store them all in an Array{Car{:Honda,:Accord},N}.
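A minimal sketch of the values-as-parameters idea and the homogeneous case above. The struct has only a year field for brevity; the vectors accords and mixed are illustrative, and get_year is the kind of custom helper the text mentions.

```julia
# Make and model stored as type parameters
struct Car{Make,Model}
    year::Int
end

# Homogeneous: every element has the same concrete type, so code that
# processes this vector can resolve its method calls at compile time
accords = [Car{:Honda,:Accord}(2010), Car{:Honda,:Accord}(2015)]

# Heterogeneous: the element type widens to the non-concrete Car, so
# each element's type has to be looked up at run time
mixed = [Car{:Honda,:Accord}(2010), Car{:Toyota,:Camry}(2012)]

# Julia compiles a separate specialization of this for every concrete
# Car{Make,Model} it is called with
get_year(c::Car) = c.year
```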
When the latter holds, a function processing such a homogeneous array can be productively specialized: Julia knows the type of each element in advance (all objects in the container have the same concrete type), so Julia can 'look up' the correct method calls when the function is being compiled (obviating the need to check at run-time) and thereby emit efficient code for processing the whole list.
When these do not hold, then it's likely that you'll get no benefit; worse, the resulting 'combinatorial explosion of types' will be counterproductive. If items[i+1] has a different type than items[i], Julia has to look up the type at run-time, search for the appropriate method in method tables, decide (via type intersection) which one matches, determine whether it has been JIT-compiled yet (and do so if not), and then make the call. In essence, you're asking the full type system and JIT-compilation machinery to basically execute the equivalent of a switch statement or dictionary lookup in your own code.
Some run-time benchmarks comparing (1) type dispatch, (2) dictionary lookup, and (3) a 'switch' statement can be found on the mailing list.
Perhaps even worse than the run-time impact is the compile-time impact: Julia will compile specialized functions for each different Car{Make, Model}; if you have hundreds or thousands of such types, then every function that accepts such an object as a parameter (from a custom get_year function you might write yourself, to the generic push! function in Julia Base) will have hundreds or thousands of variants compiled for it. Each of these increases the size of the cache of compiled code, the length of internal lists of methods, etc. Excess enthusiasm for values-as-parameters can easily waste enormous resources.
Multidimensional arrays in Julia are stored in column-major order. This means that arrays are stacked one column at a time. This can be verified using the vec function or the syntax [:] as shown below (notice that the array is ordered [1 3 2 4], not [1 2 3 4]):
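A minimal sketch of that check on a 2×2 matrix (the variable name x is illustrative):

```julia
x = [1 2; 3 4]     # 2×2 matrix: first row is 1 2, second row is 3 4

vec(x)             # [1, 3, 2, 4]: elements in memory order, column by column
x[:]               # the same elements as vec(x), returned as a copy
```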
This convention for ordering arrays is common in many languages like Fortran, Matlab, and R (to name a few). The alternative to column-major ordering is row-major ordering, which is the convention adopted by C and Python (numpy) among other languages. Remembering the ordering of arrays can have significant performance effects when looping over arrays. A rule of thumb to keep in mind is that with column-major arrays, the first index changes most rapidly. Essentially this means that looping will be faster if the inner-most loop index is the first to appear in a slice expression. Keep in mind that indexing an array with : is an implicit loop that iteratively accesses all elements within a particular dimension; it can be faster to extract columns than rows, for example.
Consider the following contrived example. Imagine we wanted to write a function that accepts a Vector and returns a square Matrix with either the rows or the columns filled with copies of the input vector. Assume that it is not important whether rows or columns are filled with these copies (perhaps the rest of the code can be easily adapted accordingly). We could conceivably do this in at least four ways (in addition to the recommended call to the built-in repeat):
Now we will time each of these functions using the same random 10000 by 1 input vector:
Notice that copy_cols is much faster than copy_rows. This is expected because copy_cols respects the column-based memory layout of the Matrix and fills it one column at a time. Additionally, copy_col_row is much faster than copy_row_col because it follows our rule of thumb that the first element to appear in a slice expression should be coupled with the inner-most loop.
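A minimal sketch of the four functions compared above, using the names from the text. Allocating the output with Matrix{T}(undef, n, n) is a simplification that assumes a 1-based input vector. No timings are reproduced here; with a 10000-element input each result is a 10000×10000 Float64 matrix (about 800 MB), so try smaller sizes first.

```julia
# Fill each column with a copy of x: contiguous writes (fast)
function copy_cols(x::Vector{T}) where {T}
    n = length(x)
    out = Matrix{T}(undef, n, n)
    for i in 1:n
        out[:, i] = x
    end
    return out
end

# Fill each row with a copy of x: strided writes (slow)
function copy_rows(x::Vector{T}) where {T}
    n = length(x)
    out = Matrix{T}(undef, n, n)
    for i in 1:n
        out[i, :] = x
    end
    return out
end

# Element by element, inner loop over the first (row) index: fast
function copy_col_row(x::Vector{T}) where {T}
    n = length(x)
    out = Matrix{T}(undef, n, n)
    for col in 1:n, row in 1:n
        out[row, col] = x[row]
    end
    return out
end

# Element by element, inner loop over the second (column) index: slow
function copy_row_col(x::Vector{T}) where {T}
    n = length(x)
    out = Matrix{T}(undef, n, n)
    for row in 1:n, col in 1:n
        out[row, col] = x[col]
    end
    return out
end
```

In a nested for with two ranges, the first range is the outer loop, so copy_col_row varies row fastest, matching the column-major layout.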
If your function returns an Array or some other complex type, it may have to allocate memory. Unfortunately, oftentimes allocation and its converse, garbage collection, are substantial bottlenecks.
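Sometimes you can circumvent the need to allocate memory on each function call by preallocating the output. As a trivial example, compare a function xinc(x) that returns a new array [x, x+1, x+2] on every call with an in-place variant xinc!(ret, x) that writes the same three values into a preallocated array ret supplied by the caller.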
Preallocation has other advantages, for example by allowing the caller to control the 'output' type from an algorithm. In the example above, we could have passed a SubArray rather than an Array, had we so desired.
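A minimal sketch of the allocating and preallocating versions. The names xinc and xinc! follow the usual form of this example; the loop drivers loopinc and loopinc_prealloc take the iteration count n as an illustrative parameter.

```julia
# Allocates a new 3-element vector on every call
xinc(x) = [x, x + 1, x + 2]

function loopinc(n)
    y = 0
    for i in 1:n
        ret = xinc(i)
        y += ret[2]
    end
    return y
end

# Writes into a caller-supplied buffer instead of allocating
function xinc!(ret::AbstractVector{T}, x::T) where {T}
    ret[1] = x
    ret[2] = x + 1
    ret[3] = x + 2
    return nothing
end

function loopinc_prealloc(n)
    ret = Vector{Int}(undef, 3)   # allocated once, reused every iteration
    y = 0
    for i in 1:n
        xinc!(ret, i)
        y += ret[2]
    end
    return y
end

# Because xinc! accepts any AbstractVector, the caller can pass a view
buf = zeros(Int, 4)
xinc!(view(buf, 2:4), 5)          # fills buf[2:4] in place
```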
Taken to its extreme, pre-allocation can make your code uglier, so performance measurements and some judgment may be required. However, for 'vectorized' (element-wise) functions, the convenient syntax x .= f.(y) can be used for in-place operations with fused loops and no temporary arrays (see the dot syntax for vectorizing functions).
Julia has a special dot syntax that converts any scalar function into a 'vectorized' function call, and any operator into a 'vectorized' operator, with the special property that nested 'dot calls' are fusing: they are combined at the syntax level into a single loop, without allocating temporary arrays. If you use .= and similar assignment operators, the result can also be stored in-place in a pre-allocated array (see above).
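A small sketch of a fused, in-place update into a preallocated array; update! and update_at! are hypothetical names used only for illustration:

```julia
# Writes sin(x) + 2x^2 into the preallocated y in a single fused loop;
# no temporaries are created for sin.(x), x .^ 2 or 2 .* x .^ 2.
function update!(y, x)
    y .= sin.(x) .+ 2 .* x .^ 2
    return y
end

# Same computation: @. adds the dots for you, including the one on `=`.
function update_at!(y, x)
    @. y = sin(x) + 2x^2
    return y
end

x = rand(1000)
y = similar(x)   # allocated once by the caller
update!(y, x)
```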
In a linear-algebra context, this means that even though operations like vector + vector and vector * scalar are defined, it can be advantageous to instead use vector .+ vector and vector .* scalar because the resulting loops can be fused with surrounding computations. For example, consider the two functions:
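A sketch of two such functions. The particular polynomial is an assumption; any expression that mixes un-dotted * and + on arrays shows the same effect:

```julia
# Un-dotted: each `*` and `+` (and each separate .^) produces its own
# temporary array and runs its own loop.
f(x) = 3x.^2 + 4x + 7x.^3

# @. dots every call and operator, so the whole expression fuses into one
# loop, equivalent to 3 .* x.^2 .+ 4 .* x .+ 7 .* x.^3.
fdot(x) = @. 3x^2 + 4x + 7x^3
```

Calling f on a scalar works too, so f.(x) broadcasts the scalar function and is also fused.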
Both f and fdot compute the same thing. However, fdot (defined with the help of the @. macro) is significantly faster when applied to an array.
Benchmarked on an array, fdot(x) is ten times faster and allocates 1/6 the memory of f(x), because each * and + operation in f(x) allocates a new temporary array and executes in a separate loop. (Of course, if you just do f.(x) then it is as fast as fdot(x) in this example, but in many contexts it is more convenient to sprinkle some dots in your expressions than to define a separate function for each vectorized operation.)
In Julia, an array 'slice' expression like array[1:5, :] creates a copy of that data (except on the left-hand side of an assignment, where array[1:5, :] = ... assigns in-place to that portion of array). If you are doing many operations on the slice, this can be good for performance because it is more efficient to work with a smaller contiguous copy than it would be to index into the original array. On the other hand, if you are just doing a few simple operations on the slice, the cost of the allocation and copy operations can be substantial.
An alternative is to create a 'view' of the array, which is an array object (a SubArray) that actually references the data of the original array in-place, without making a copy. (If you write to a view, it modifies the original array's data as well.) This can be done for individual slices by calling view, or more simply for a whole expression or block of code by putting @views in front of that expression. For example:
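A sketch comparing a copying slice with a view, using the names fcopy and fview from the discussion below, followed by a write through a view:

```julia
# Slicing x[2:end-1] allocates a new array before summing it.
fcopy(x) = sum(x[2:end-1])

# @views turns the slice into a view (a SubArray), so nothing is copied.
@views fview(x) = sum(x[2:end-1])

# A view shares memory with its parent, so writes go through to it.
x = collect(1.0:10.0)
v = view(x, 2:4)      # SubArray referencing x[2], x[3], x[4]
v[1] = 100.0          # x[2] is now 100.0
```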
When benchmarked, the fview version of the function shows both a 3× speedup and decreased memory allocation.
Arrays are stored contiguously in memory, lending themselves to CPU vectorization and fewer memory accesses due to caching. These are the same reasons that it is recommended to access arrays in column-major order (see above). Irregular access patterns and non-contiguous views can drastically slow down computations on arrays because of non-sequential memory access.
Copying irregularly-accessed data into a contiguous array before operating on it can result in a large speedup, such as in the example below. Here, a matrix and a vector are being accessed at 800,000 of their randomly-shuffled indices before being multiplied. Copying the views into plain arrays speeds up the multiplication even with the cost of the copying operation.
Provided there is enough memory for the copies, the cost of copying the view to an array is far outweighed by the speed boost from doing the matrix multiplication on a contiguous array.
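A scaled-down sketch of that comparison. The sizes are hypothetical (smaller than the 800,000 indices described above) and the helper names slow_product and copied_product are illustrative only:

```julia
using Random

# Hypothetical sizes, chosen so the example runs quickly.
n, m = 20_000, 16_000
A = randn(50, n)
x = randn(n)
inds = randperm(n)[1:m]          # irregular, randomly shuffled indices

# Multiply directly through non-contiguous views.
slow_product(A, x, inds) = sum(view(A, :, inds) * view(x, inds))

# Copy the irregularly indexed data into contiguous arrays, then multiply.
function copied_product(A, x, inds)
    Atmp = A[:, inds]            # indexing with a vector copies into a new Matrix
    xtmp = x[inds]
    return sum(Atmp * xtmp)
end
```

Comparing @time slow_product(A, x, inds) with @time copied_product(A, x, inds) shows whether the copy pays for itself on a given machine; with preallocated buffers, copyto!(xtmp, view(x, inds)) avoids allocating the copies on every call.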
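When writing data to a file (or other I/O device), forming extra intermediate strings is a source of overhead. Instead of interpolating values into a string and then writing that string, pass the values to the writing function as separate arguments. The interpolating version first forms a string and then writes it to the file, while the direct version writes the values straight to the file. Also notice that in some cases string interpolation can be harder to read, for example when the interpolated expressions are themselves function calls.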
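A sketch of both styles, using an IOBuffer as a stand-in for a file; the helper g is hypothetical and only illustrates interpolating function calls:

```julia
a, b = 1.5, "text"

# Interpolation first builds the intermediate string "1.5 text", then writes it.
io1 = IOBuffer()
println(io1, "$a $b")

# Passing the values as arguments writes each piece directly to the stream.
io2 = IOBuffer()
println(io2, a, " ", b)

# With function calls inside, the direct form is also easier to read.
g(v) = string(v, "!")   # hypothetical helper, for illustration only
println(io1, "$(g(a))$(g(b))")
println(io2, g(a), g(b))
```

Both buffers end up holding the same text; replacing io1 or io2 with an open file handle works the same way.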
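When executing a remote function in parallel, calling remotecall_fetch from @async tasks inside a @sync block is faster than launching the calls with @spawnat and then fetching each result.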
The former results in a single network round-trip to every worker, while the latter results in two network calls: first by the @spawnat and second due to the fetch (or even a wait). The fetch/wait is also being executed serially, resulting in an overall poorer performance.
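A sketch of both patterns with the Distributed standard library. The wrapper names gather_fast and gather_slow are hypothetical; the sketch assumes workers were added with addprocs and that the called function is defined on all of them (for example with @everywhere). With no workers added, workers() returns [1] and everything runs on the master process:

```julia
using Distributed

# One round-trip per worker: remotecall_fetch sends the call and returns
# the result in a single exchange; @async lets the calls overlap.
function gather_fast(f, args...)
    responses = Vector{Any}(undef, nworkers())
    @sync for (idx, pid) in enumerate(workers())
        @async responses[idx] = remotecall_fetch(f, pid, args...)
    end
    return responses
end

# Two round-trips per worker: @spawnat returns a Future, then fetch asks
# for the value; the fetches here also run one after another.
function gather_slow(f, args...)
    refs = Vector{Any}(undef, nworkers())
    for (idx, pid) in enumerate(workers())
        refs[idx] = @spawnat pid f(args...)
    end
    return [fetch(r) for r in refs]
end
```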
A deprecated function internally performs a lookup in order to print a relevant warning only once. This extra lookup can cause a significant slowdown, so all uses of deprecated functions should be modified as suggested by the warnings.
These are some minor points that might help in tight inner loops.
- Avoid unnecessary arrays. For example, instead of sum([x,y,z]) use x+y+z.
- Use abs2(z) instead of abs(z)^2 for complex z. In general, try to rewrite code to use abs2 instead of abs for complex arguments.
- Use div(x,y) for truncating division of integers instead of trunc(x/y), fld(x,y) instead of floor(x/y), and cld(x,y) instead of ceil(x/y).
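A quick sketch checking that each replacement computes the same value as the slower form it replaces:

```julia
x, y, z = 1.0, 2.0, 3.0
s1 = sum([x, y, z])   # allocates a temporary array
s2 = x + y + z        # no allocation

w = 3.0 + 4.0im
a1 = abs(w)^2         # computes a square root, then squares it
a2 = abs2(w)          # real(w)^2 + imag(w)^2, no square root

# Integer division without a round trip through floating point.
p, q = -7, 2
@assert div(p, q) == trunc(Int, p / q)   # truncates toward zero: -3
@assert fld(p, q) == floor(Int, p / q)   # rounds toward -Inf: -4
@assert cld(p, q) == ceil(Int, p / q)    # rounds toward +Inf: -3
```

Besides speed, the integer functions avoid the rounding that p / q can introduce once the operands exceed the range Float64 represents exactly.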
Sometimes you can enable better optimization by promising certain program properties.
- Use @inbounds to eliminate array bounds checking within expressions. Be certain before doing this. If the subscripts are ever out of bounds, you may suffer crashes or silent corruption.
- Use @fastmath to allow floating point optimizations that are correct for real numbers, but lead to differences for IEEE numbers. Be careful when doing this, as this may change numerical results. This corresponds to the -ffast-math option of clang.
- Write @simd in front of for loops to promise that the iterations are independent and may be reordered. Note that in many cases, Julia can automatically vectorize code without the @simd macro; it is only beneficial in cases where such a transformation would otherwise be illegal, including cases like allowing floating-point re-associativity and ignoring dependent memory accesses (@simd ivdep). Again, be very careful when asserting @simd, as erroneously annotating a loop with dependent iterations may result in unexpected results. In particular, note that setindex! on some AbstractArray subtypes is inherently dependent upon iteration order. This feature is experimental and could change or disappear in future versions of Julia.
The common idiom of using 1:n to index into an AbstractArray is not safe if the array uses unconventional indexing, and may cause a segmentation fault if bounds checking is turned off. Use LinearIndices(x) or eachindex(x) instead (see also Arrays with custom indices).
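A sketch of a loop that does not assume indices start at 1; mysum is a hypothetical name. Because eachindex only yields valid indices of A, pairing it with @inbounds is safe here:

```julia
# Sums any AbstractArray without assuming 1-based indexing.
function mysum(A::AbstractArray)
    s = zero(eltype(A))
    for i in eachindex(A)      # the valid indices of A, whatever they are
        @inbounds s += A[i]    # safe: i comes from eachindex(A)
    end
    return s
end
```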
While @simd needs to be placed directly in front of an innermost for loop, both @inbounds and @fastmath can be applied to either single expressions or all the expressions that appear within nested blocks of code, e.g., using @inbounds begin ... end or @inbounds for ... end.
Here is an example with both @inbounds and @simd markup (here we use @noinline to prevent the optimizer from trying to be too clever and defeat our benchmark):
On a computer with a 2.4GHz Intel Core i5 processor, this benchmark prints a GFlop/sec figure for the plain and the SIMD version of the loop. (GFlop/sec measures the performance, and larger numbers are better.)
Here is an example with all three kinds of markup. This program first calculates the finite difference of a one-dimensional array, and then evaluates the L2-norm of the result:
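A sketch of such a program. The names init!, deriv!, mynorm and main are illustrative, and the repetition count is a parameter so the example can be run quickly:

```julia
function init!(u::Vector)
    n = length(u)
    dx = 1.0 / (n - 1)
    # u is a Vector, so 1-based indexing is guaranteed and 1:n is safe here.
    @fastmath @inbounds @simd for i in 1:n
        u[i] = sin(2pi * dx * i)
    end
    return u
end

# Finite-difference derivative: one-sided at the ends, central inside.
function deriv!(u::Vector, du::Vector)
    n = length(u)
    @assert length(du) == n   # the @inbounds accesses below rely on this
    dx = 1.0 / (n - 1)
    @fastmath @inbounds du[1] = (u[2] - u[1]) / dx
    @fastmath @inbounds @simd for i in 2:n-1
        du[i] = (u[i+1] - u[i-1]) / (2 * dx)
    end
    @fastmath @inbounds du[n] = (u[n] - u[n-1]) / dx
    return du
end

# L2-norm of u.
function mynorm(u::Vector)
    s = zero(eltype(u))
    @fastmath @inbounds @simd for i in eachindex(u)
        s += u[i]^2
    end
    return @fastmath sqrt(s)
end

function main(n = 2000, reps = 10^6)
    u = init!(Vector{Float64}(undef, n))
    du = similar(u)
    nu = mynorm(deriv!(u, du))
    @time for i in 1:reps
        deriv!(u, du)
        nu = mynorm(du)
    end
    println(nu)
    return nu
end
```

Running a script that calls main() once normally and once with julia --math-mode=ieee compares the two modes.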
On a computer with a 2.7 GHz Intel Core i7 processor, this program was run twice: once normally, and once with the option --math-mode=ieee, which disables the @fastmath macro so that we can compare results.
In this case, the speedup due to @fastmath is a factor of about 3.7. This is unusually large; in general, the speedup will be smaller. (In this particular example, the working set of the benchmark is small enough to fit into the L1 cache of the processor, so that memory access latency does not play a role, and computing time is dominated by CPU usage. In many real world programs this is not the case.) Also, in this case this optimization does not change the result; in general, the result will be slightly different. In some cases, especially for numerically unstable algorithms, the result can be very different.
The annotation @fastmath re-arranges floating point expressions, e.g. changing the order of evaluation, or assuming that certain special cases (inf, nan) cannot occur. In this case (and on this particular computer), the main difference is that the expression 1 / (2*dx) in the function deriv is hoisted out of the loop (i.e. calculated outside the loop), as if one had written idx = 1 / (2*dx). In the loop, the expression ... / (2*dx) then becomes ... * idx, which is much faster to evaluate. Of course, both the actual optimization that is applied by the compiler and the resulting speedup depend very much on the hardware. You can examine the change in generated code by using Julia's code_native function.
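To make the hoisting concrete, here is a hand-written version of that transformation; central_div! and central_mul! are hypothetical names. The two can differ in the last bits, which is why the compiler only applies this rewrite when @fastmath permits it:

```julia
# Divides by 2dx inside the loop on every iteration.
function central_div!(du, u, dx)
    for i in 2:length(u)-1
        du[i] = (u[i+1] - u[i-1]) / (2 * dx)
    end
    return du
end

# Hoisted by hand: compute the reciprocal once, multiply inside the loop.
function central_mul!(du, u, dx)
    idx = 1 / (2 * dx)
    for i in 2:length(u)-1
        du[i] = (u[i+1] - u[i-1]) * idx
    end
    return du
end
```

Outside the REPL, using InteractiveUtils makes @code_native available, so @code_native central_mul!(zeros(20), rand(20), 0.1) shows the generated machine code for the hoisted loop.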
Note that @fastmath also assumes that NaNs will not occur during the computation, which can lead to surprising behavior:
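A minimal sketch of that surprise: the same check with and without @fastmath. The exact result of the fast version depends on the Julia version and hardware:

```julia
f(x) = isnan(x)
f_fast(x) = @fastmath isnan(x)

f(NaN)        # true
f_fast(NaN)   # may return false: @fastmath lets the compiler assume no NaNs
```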
Subnormal numbers, formerly called denormal numbers, are useful in many contexts, but incur a performance penalty on some hardware. A call set_zero_subnormals(true) grants permission for floating-point operations to treat subnormal inputs or outputs as zeros, which may improve performance on some hardware. A call set_zero_subnormals(false) enforces strict IEEE behavior for subnormal numbers.
Below is an example where subnormals noticeably impact performance on some hardware:
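A sketch of such an example: a simple explicit heat-flow simulation timed with and without subnormals flushed to zero. The names timestep and heatflow are illustrative:

```julia
# One explicit time step of a 1D heat equation, written into b.
function timestep(b::Vector{T}, a::Vector{T}, Δt::T) where T
    @assert length(a) == length(b)
    n = length(b)
    b[1] = 1                                   # boundary condition
    for i in 2:n-1
        b[i] = a[i] + (a[i-1] - T(2) * a[i] + a[i+1]) * Δt
    end
    b[n] = 0                                   # boundary condition
    return b
end

function heatflow(a::Vector{T}, nstep::Integer) where T
    b = similar(a)
    for t in 1:div(nstep, 2)                   # assumes nstep is even
        timestep(b, a, T(0.1))
        timestep(a, b, T(0.1))
    end
    return a
end

heatflow(zeros(Float32, 10), 2)                # force compilation
for trial in 1:6
    a = zeros(Float32, 1000)
    set_zero_subnormals(iseven(trial))         # odd trials use strict IEEE arithmetic
    @time heatflow(a, 1000)
end
set_zero_subnormals(false)                     # restore the default
```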
Timing each trial shows that each even iteration, which runs with subnormals treated as zeros, is significantly faster.
This example generates many subnormal numbers because the values in a become an exponentially decreasing curve, which slowly flattens out over time.
Treating subnormals as zeros should be used with caution, because doing so breaks some identities, such as x-y == 0 implies x == y:
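A sketch of the broken identity. The helper diff_and_equal is hypothetical and keeps the subtraction inside a function so it happens at run time. Whether the flushed case prints a zero difference depends on hardware support, which is why it is guarded by the return value of set_zero_subnormals:

```julia
diff_and_equal(x, y) = (x - y, x == y)

x, y = 3f-38, 2f-38            # both normal Float32; their difference is subnormal

set_zero_subnormals(false)
diff_and_equal(x, y)           # strict IEEE: nonzero difference, and x != y

if set_zero_subnormals(true)   # returns false if the hardware lacks support
    diff_and_equal(x, y)       # difference may flush to 0.0f0, yet x != y
end
set_zero_subnormals(false)     # restore strict IEEE behavior
```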
In some applications, an alternative to zeroing subnormal numbers is to inject a tiny bit of noise. For example, instead of initializing a with zeros, initialize it with:
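A one-line sketch for the Float32 array in the heat-flow example; the scale factor is a judgment call, chosen to be tiny but still far above the subnormal range:

```julia
# Tiny positive noise instead of exact zeros: values in [0, 1f-9].
a = rand(Float32, 1000) * 1f-9
```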
The macro @code_warntype (or its function variant code_warntype) can sometimes be helpful in diagnosing type-related problems. Here's an example:
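A sketch of a function whose intermediate result is type-unstable, matching the pos and f discussed below. Outside the REPL, @code_warntype needs using InteractiveUtils:

```julia
using InteractiveUtils

# pos returns the Int 0 for negative input and x otherwise, so for a Float64
# argument its return type is Union{Float64, Int64}.
@noinline pos(x) = x < 0 ? 0 : x

function f(x)
    y = pos(x)
    return sin(y * x + 1)   # y * x is Float64 either way
end

@code_warntype f(3.2)
```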
Interpreting the output of @code_warntype, like that of its cousins @code_lowered, @code_typed, @code_llvm, and @code_native, takes a little practice. Your code is being presented in a form that has been heavily digested on its way to generating compiled machine code. Most of the expressions are annotated by a type, indicated by the ::T (where T might be Float64, for example). The most important characteristic of @code_warntype is that non-concrete types are displayed in red; since this document is written in Markdown, which has no color, red text is denoted here by uppercase.
At the top, the inferred return type of the function is shown as Body::Float64. The next lines represent the body of f in Julia's SSA IR form. The numbered boxes are labels and represent targets for jumps (via goto) in your code. Looking at the body, you can see that the first thing that happens is that pos is called and the return value has been inferred as the Union type UNION{FLOAT64, INT64}, shown in uppercase since it is a non-concrete type. This means that we cannot know the exact return type of pos based on the input types. However, the result of y*x is a Float64 no matter whether y is a Float64 or an Int64. The net result is that f(x::Float64) will not be type-unstable in its output, even if some of the intermediate computations are type-unstable.
How you use this information is up to you. Obviously, it would be far and away best to fix pos to be type-stable: if you did so, all of the variables in f would be concrete, and its performance would be optimal. However, there are circumstances where this kind of ephemeral type instability might not matter too much: for example, if pos is never used in isolation, the fact that f's output is type-stable (for Float64 inputs) will shield later code from the propagating effects of type instability. This is particularly relevant in cases where fixing the type instability is difficult or impossible. In such cases, the tips above (e.g., adding type annotations and/or breaking up functions) are your best tools to contain the 'damage' from type instability. Also, note that even Julia Base has functions that are type unstable. For example, the function findfirst returns the index into an array where a key is found, or nothing if it is not found, a clear type instability. In order to make it easier to find the type instabilities that are likely to be important, Unions containing either missing or nothing are color highlighted in yellow, instead of red.
The following examples may help you interpret expressions marked as containing non-leaf types:
- Function body starting with Body::UNION{T1,T2}
  - Interpretation: function with unstable return type
  - Suggestion: make the return value type-stable, even if you have to annotate it
- invoke Main.g(%%x::Int64)::UNION{FLOAT64, INT64}
  - Interpretation: call to a type-unstable function g.
  - Suggestion: fix the function, or if necessary annotate the return value
- invoke Base.getindex(%%x::Array{Any,1}, 1::Int64)::ANY
  - Interpretation: accessing elements of poorly-typed arrays
  - Suggestion: use arrays with better-defined types, or if necessary annotate the type of individual element accesses
- Base.getfield(%%x, :(:data))::ARRAY{FLOAT64,N} WHERE N
  - Interpretation: getting a field that is of non-leaf type. In this case, ArrayContainer had a field data::Array{T}. But Array needs the dimension N, too, to be a concrete type.
  - Suggestion: use concrete types like Array{T,3} or Array{T,N}, where N is now a parameter of ArrayContainer
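A sketch of the last suggestion. ArrayContainer follows the name used above; LooseContainer is a hypothetical name for the non-concrete variant, kept only for comparison:

```julia
# Field type Array{T} leaves the dimension N unspecified, so it is abstract.
struct LooseContainer{T}
    data::Array{T}
end

# Making N a type parameter gives the field a concrete type.
struct ArrayContainer{T,N}
    data::Array{T,N}
end

c = ArrayContainer(rand(2, 2))   # T and N inferred: ArrayContainer{Float64,2}
```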
Consider the following example that defines an inner function:
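A sketch of such a function, matching the abmult described below:

```julia
function abmult(r::Int)
    if r < 0
        r = -r          # r is reassigned in the outer scope
    end
    f = x -> x * r      # closure capturing r
    return f
end
```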
Function abmult returns a function f that multiplies its argument by the absolute value of r. The inner function assigned to f is called a 'closure'. Inner functions are also used by the language for do-blocks and for generator expressions.
This style of code presents performance challenges for the language. The parser, when translating it into lower-level instructions, substantially reorganizes the above code by extracting the inner function to a separate code block. 'Captured' variables such as r that are shared by inner functions and their enclosing scope are also extracted into a heap-allocated 'box' accessible to both inner and outer functions because the language specifies that r in the inner scope must be identical to r in the outer scope even after the outer scope (or another inner function) modifies r.
The discussion in the preceding paragraph referred to the 'parser', that is, the phase of compilation that takes place when the module containing abmult is first loaded, as opposed to the later phase when it is first invoked. The parser does not 'know' that Int is a fixed type, or that the statement r = -r transforms an Int to another Int. The magic of type inference takes place in the later phase of compilation.
Thus, the parser does not know that r has a fixed type (Int), nor that r does not change value once the inner function is created (so that the box is unneeded). Therefore, the parser emits code for a box that holds an object with an abstract type such as Any, which requires run-time type dispatch for each occurrence of r. This can be verified by applying @code_warntype to the above function. Both the boxing and the run-time type dispatch can cause loss of performance.
If captured variables are used in a performance-critical section of the code, then the following tips help ensure that their use is performant. First, if it is known that a captured variable does not change its type, then this can be declared explicitly with a type annotation (on the variable, not the right-hand side):
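A sketch of this first technique, as a variant of abmult named abmult2 (the name is illustrative):

```julia
function abmult2(r0::Int)
    r::Int = r0         # the annotation is on the variable r itself
    if r < 0
        r = -r
    end
    f = x -> x * r      # r is still boxed, but the box now has a concrete type
    return f
end
```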
The type annotation partially recovers lost performance due to capturing because the parser can associate a concrete type to the object in the box. Going further, if the captured variable does not need to be boxed at all (because it will not be reassigned after the closure is created), this can be indicated with let blocks as follows.
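A sketch of the let technique, named abmult3 to match the reference below:

```julia
function abmult3(r::Int)
    if r < 0
        r = -r
    end
    # let introduces a fresh r that is never reassigned after the closure
    # captures it, so no box is needed.
    f = let r = r
        x -> x * r
    end
    return f
end
```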
The let block creates a new variable r whose scope is only the inner function. The second technique recovers full language performance in the presence of captured variables. Note that this is a rapidly evolving aspect of the compiler, and it is likely that future releases will not require this degree of programmer annotation to attain performance. In the meantime, some user-contributed packages like FastClosures automate the insertion of let statements as in abmult3.
When checking if a value is equal to some singleton, it can be better for performance to check for identicality (===) instead of equality (==). The same advice applies to using !== over !=. These types of checks frequently occur, e.g., when implementing the iteration protocol and checking if nothing is returned from iterate.
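A sketch of a manual iteration loop that checks for the end of iteration with !== nothing; mysum_iter is a hypothetical name:

```julia
# Manual use of the iteration protocol: iterate returns nothing when done.
function mysum_iter(itr)
    s = 0
    result = iterate(itr)
    while result !== nothing        # identity check against the singleton
        (x, state) = result
        s += x
        result = iterate(itr, state)
    end
    return s
end
```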