Difference between revisions of "ArrayStructures"

From ECLR
(Expanding to Higher Dimensions)
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
 
=1.0
 
 
 
= Intro =
 
= Intro =
  
Sometimes you would like to store data which have more than two dimensions. For example, interest rates for different maturities for different countries over time. In this case there are three natural indices: country index, maturity index, and time. Another example could be regression results of Monte-Carlo simulations for different sample sizes and different model specifications. In MATLAB you can address this issue using three different approaches:
+
In many cases you would like to either (i) store data which have more than two dimensions or (ii) collect data of different type/dimension under the same variable name. Examples of the former could be interest rates for different maturities for different countries over time or Monte-Carlo simulations for different sample sizes and different model specifications. Examples of the latter could be real time prices grouped on a daily basis (number of observations varies from day to day) or results of OLS estimation stored in one variable. In MATLAB you can address these issues using:
  
 
# Multidimensional arrays
 
# Multidimensional arrays
Line 11: Line 7:
 
# Cells/Cell arrays
 
# Cells/Cell arrays
  
In the next sections we will briefly review these data objects and discuss the most straightforward way to handle them.
+
Each of these approaches has its advantages and disadvantages. For example: the easiest way to handle data in a uniform way is to use multidimensional arrays. Multidimensional arrays, however, require that the data are of the same type (logical, numerical, character). In the next sections we will briefly review these data objects and discuss the most straightforward way to handle them.
  
 
= Multidimensional Arrays =
 
= Multidimensional Arrays =
Line 17: Line 13:
 
== Elementwise Operations ==
 
== Elementwise Operations ==
  
Multidimensional arrays are just a generalization of a matrix. Almost all MATLAB functions can generate and operate with multivariate arrays, for example, to generate a three-dimensional array of normal random numbers with dimensions <math>T,p,k</math>, you have to type:
+
Multidimensional arrays are a generalization of matrices. Almost all MATLAB functions can generate and operate on multidimensional arrays. For example, to generate a three-dimensional array of standard normal random numbers with dimensions <math>T\times p\times k</math>, type:
  
 
<source>  A=randn(T,p,k);</source>
 
<source>  A=randn(T,p,k);</source>
Line 29: Line 25:
 
     0    0
 
     0    0
 
     0    0</source>
 
     0    0</source>
Taking an average over the second dimension and taking a maximum over the third dimension looks like:
+
Taking an average over the second dimension is quite intuitive:
 +
 
 +
<source>  meanA=mean(A,2);</source>
 +
Taking a maximum over the third dimension is less so:
 +
 
 +
<source>  maxA=max(A,[],3);</source>
 +
The bottom line is: please check MATLAB help system first.
  
<source> meanA=mean(A,2);
+
Multidimensional arrays support element-by-element binary operations <source enclose="none">+,-,.*,./,.^,<,>,~= </source> between two arrays with the same dimensions and between arrays and scalars. MATLAB will correctly compute
  maxA=max(A,[],3);</source>
 
Please note the syntax of <source enclose="none">max</source> function. For this function the second argument stands for a matrix/scalar that you are comparing with your matrix A. If you don’t want to experience any surprises, please check with the function help first. Multidimensional arrays support element-by-element operations (operations with dots), as well as operations with scalars. MATLAB will correctly compute
 
  
 
<source>  B=A.^2;
 
<source>  B=A.^2;
Line 46: Line 46:
 
== Collapsing Singleton Dimensions ==
 
== Collapsing Singleton Dimensions ==
  
All operations of extracting sub-arrays from these arrays work as usual (see link). However, there are some particularities that I would like to mention. Assume that we have a <math>3\times 3\times 3</math> array A. MATLAB recognizes the following operation :
+
All operations of extracting sub-arrays from the original arrays work as usual (see link). However, there are some particularities that I would like to mention. Assume that we have a <math>3\times 3\times 3</math> array A. MATLAB recognizes the following operation :
  
 
<source>  A(:,:,1)*A(:,:,2);</source>
 
<source>  A(:,:,1)*A(:,:,2);</source>
Line 52: Line 52:
  
 
<source>  A(1,:,:)*A(1,:,:);</source>
 
<source>  A(1,:,:)*A(1,:,:);</source>
generates an error. We get the same result if we try to plot our data:
+
generates an error. We get similar results for the <source enclose="none">plot</source> function:
  
 
<source>  plot(A(:,:,1))</source>
 
<source>  plot(A(:,:,1))</source>
plots a figure, while
+
generates a graph, while
  
 
<source>  plot(A(1,:,:))</source>
 
<source>  plot(A(1,:,:))</source>
generates an error. It happens because from MATLAB point of view <source enclose="none">A(1,:,:)</source> is not a matrix, it is still a 3-D array. There are two ways to deal with this issue:
+
generates an error. This happens because, from MATLAB's point of view, <source enclose="none">A(1,:,:)</source> is not a matrix but a 3-D array with a singleton first dimension. There are two ways to deal with this problem:
  
 
<ol>
 
<ol>
Line 72: Line 72:
 
== Expanding to Higher Dimensions ==
 
== Expanding to Higher Dimensions ==
  
By default, element-wise operations and operations with a scalar are defined on any multi-dimensional array. An additional set of operations is defined on conformable vectors and matrices. The natural question to ask is “What to do if you want to subtract a vector from a matrix, or a matrix from N-D array?” In MATLAB there is a special command that can create a higher-dimensional array from a lower-dimensional one via replication:
+
By default, element-wise operations are defined on multidimensional arrays of the same size. An additional set of operations is defined on conformable vectors and matrices. The natural question to ask is “What should you do if you want to subtract a vector from a matrix, or a matrix from an N-D array?” MATLAB has a special function, <source enclose="none">repmat</source>, that transforms a lower-dimensional array into a higher-dimensional one by replicating the original content:
  
 
<source> A=repmat([1,2,3]',2,3)
 
<source> A=repmat([1,2,3]',2,3)
Line 82: Line 82:
 
     2    2    2
 
     2    2    2
 
     3    3    3</source>
 
     3    3    3</source>
replicates a column-vector <math>[1\ 2\ 3]'</math> twice along each row and 3 times along each column. For 3-D vector replication we have to use slightly different syntax:
+
This command replicates the column vector <math>[1\ 2\ 3]'</math> six times: twice along the first dimension (down the rows) and 3 times along the second dimension (across the columns). For 3-D replication we have to use a slightly different syntax:
  
 
<source>  A=repmat([1,2,3]',[1 2 3])
 
<source>  A=repmat([1,2,3]',[1 2 3])
Line 97: Line 97:
 
     2    2
 
     2    2
 
     3    3</source>
 
     3    3</source>
This command does not change the length of the vector, but replicates it twice along the second dimension and 3 times along the third dimension.  
+
This command keeps the length of the vector the same (first index is 1), replicates it twice along the second dimension (second index is 2) and 3 times along the third dimension (third index is 3). Consider the following two real-life examples:
  
Lets consider two real-life examples:
+
# Assume that we have a cross-section of returns and we would like to subtract a series of risk-free returns from each return series.
 +
# Assume that we need to subtract a mean along the third dimension from a 3-D array.
  
# Assume that we have a cross-section of returns and we would like to subtract a series of risk-free returns from each of the series
+
These examples can be implemented either by using a loop, or by using <source enclose="none">repmat</source>. In the first case the algorithm with a loop looks like:
# Assume that we need to subtract a mean along the third dimension from a 3-D array.
 
  
For the first example we can use loop:
+
# Initialize a matrix of excess returns <source enclose="none">Rex</source>
 +
# Compute a difference of <source enclose="none">R(:,i)-rf</source> and assign it to a column <source enclose="none">Rex(:,i)</source> for <math>i=1</math>
 +
# Repeat (2) for i=2, 3, …, <source enclose="none">size(R,2)</source>
  
 
<source>% rf - a series of risk-free returns
 
<source>% rf - a series of risk-free returns
 
% R  - a cross-section of returns, it is assumed that size(R,1)=size(rf,1)
 
% R  - a cross-section of returns, it is assumed that size(R,1)=size(rf,1)
R=zeros(size(R));
+
Rex=zeros(size(R));
 
for i=1:size(R,2)
 
for i=1:size(R,2)
 
Rex(:,i)=R(:,i)-rf;
 
Rex(:,i)=R(:,i)-rf;
 
end</source>
 
end</source>
or <source enclose="none">repmat</source> command:
+
With <source enclose="none">repmat</source>, the same computation takes two steps:
 +
 
 +
# Replicate <source enclose="none">rf</source> vector <source enclose="none">size(R,2)</source> times
 +
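# Compute <source enclose="none">Rex</source> as <source enclose="none">Rex=R-Rf</source>, where <source enclose="none">Rf</source> is the matrix of replicated <source enclose="none">rf</source> columns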
  
 
<source>% rf - a series of risk-free returns
 
<source>% rf - a series of risk-free returns
Line 118: Line 123:
 
Rf=repmat(rf,1,size(R,2));
 
Rf=repmat(rf,1,size(R,2));
 
Rex=R-Rf;</source>
 
Rex=R-Rf;</source>
For the second example, to implement loop, we have to:
+
For the second example, the algorithm with a loop is:
  
# Compute a mean of 3-D array
+
# Compute the mean of the 3-D array along the third dimension, <source enclose="none">Rmean3</source>
 
# Initiate a 3-D array of demeaned values
 
# Initiate a 3-D array of demeaned values
# Subtract piece-by-piece a matrix of means from each layer of the 3-D array
+
# Compute the difference <source enclose="none">R(:,:,i)-Rmean3</source> and assign it to <source enclose="none">Rdemean(:,:,i)</source> for i=1
 +
# Repeat (3) for i=2, 3, …, <source enclose="none">size(R,3)</source>
 
# check whether the mean is indeed subtracted
 
# check whether the mean is indeed subtracted
  
Line 131: Line 137:
 
     Rdemean(:,:,i)=R(:,:,i)-Rmean3;
 
     Rdemean(:,:,i)=R(:,:,i)-Rmean3;
 
   end
 
   end
   disp(mean(Rdemean,3)) %has to be a matrix of 0s </source>
+
   disp(mean(Rdemean,3)) % which is a matrix of zeros implying we worked correctly</source>
# Compute a mean of 3-D array
+
The same algorithm with <source enclose="none">repmat</source>:
# Construct a 3-D array of means
+
 
 +
# Compute a mean of 3-D array <source enclose="none">Rmean3</source>
 +
# Construct a 3-D array of means using <source enclose="none">repmat</source>
 
# subtract one from another
 
# subtract one from another
 
# check whether the mean is indeed subtracted
 
# check whether the mean is indeed subtracted
Line 141: Line 149:
 
   Rmean3expand=repmat(Rmean3,[1 1 size(R,3)]); %Please note, for more than 2-dimensional arrays, repmat accepts a vector of replications instead of a variable number of inputs.
 
   Rmean3expand=repmat(Rmean3,[1 1 size(R,3)]); %Please note, for more than 2-dimensional arrays, repmat accepts a vector of replications instead of a variable number of inputs.
 
   Rdemean=R-Rmean3expand;
 
   Rdemean=R-Rmean3expand;
   disp(mean(Rdemean,3)) %has to be a matrix of 0s</source>
+
   disp(mean(Rdemean,3)) %matrix of zeros, which means that we did it correctly</source>
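
As one more way of handling both examples, the replication can be left to MATLAB itself. The sketch below uses the built-in <code>bsxfun</code> function, which expands singleton dimensions on the fly. The data are randomly generated for illustration only, and the names <code>Rex_repmat</code>, <code>Rex_bsxfun</code> and <code>R3</code> are assumptions of this sketch, not part of the examples above. In MATLAB R2016b and later, the plain expressions <code>R-rf</code> and <code>R3-Rmean3</code> should give the same result through implicit expansion.

```matlab
% Hypothetical data: 6 periods, 4 return series, 3 layers.
R  = randn(6, 4);                 % cross-section of returns
rf = randn(6, 1);                 % series of risk-free returns

% First example: excess returns, with and without explicit replication.
Rex_repmat = R - repmat(rf, 1, size(R, 2));
Rex_bsxfun = bsxfun(@minus, R, rf);        % rf is expanded along dimension 2
disp(isequal(Rex_repmat, Rex_bsxfun))

% Second example: subtract the mean along the third dimension.
R3      = randn(6, 4, 3);
Rmean3  = mean(R3, 3);                     % 6-by-4 matrix of means
Rdemean = bsxfun(@minus, R3, Rmean3);      % Rmean3 is expanded along dimension 3
check   = mean(Rdemean, 3);
disp(max(abs(check(:))))                   % close to zero, up to rounding error
```

Whether <code>repmat</code> or <code>bsxfun</code> is preferable is largely a matter of taste; <code>bsxfun</code> avoids creating the replicated array in memory.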
Obviously, there are many ways of handling these problems.
 
 
 
Each of these approaches has its advantages and disadvantages. For example: the easiest way to handle data in a uniform way is to use multidimensional arrays. Multivariate arrays, however, require that the data are of the same type (logical, numerical, character).
 
 
 
 
= Structures and Cell Arrays =
 
= Structures and Cell Arrays =
  
Sometimes it is natural to keep data of different type under the same roof, i.e. using the same variable name. Multidimensional arrays do not allow that. Therefore structures/arrays of structures or cell arrays have to be used. These two objects are very similar and, in fact, interchangeable. For some applications, however, cell arrays are more suitable while, for other, structures are preferable. For obvious reasons, arithmetic operations are not defined.
+
Sometimes it is natural to keep data of different types under the same roof, i.e. using the same variable name. Multidimensional arrays are not designed for this purpose. Therefore structures, arrays of structures or cell arrays have to be used in such cases. Structures and cell arrays are very similar and, in fact, interchangeable. For some applications, however, cell arrays are more suitable while, for others, structures are preferable. For obvious reasons, most binary operations are not defined on these objects.
  
 
== Structures ==
 
== Structures ==
  
A structure variable is a variable that has “fields”. The variable name is separated from field name by a dot. For example, if you want to keep all OLS regression results, i.e. beta coefficients, covariance matrix, t-stats and vector of residuals in one place.
+
Structure variables are variables that have “fields”. The variable name is separated from the field name by a dot. For example, if you want to keep all OLS regression results, i.e. beta coefficients, covariance matrix, t-stats and vector of residuals in one place, you can do it using the following:
  
 
<source>%Assuming X and y are already defined, the whole filling-up process of the structure would look as follows:
 
<source>%Assuming X and y are already defined, the whole filling-up process of the structure would look as follows:
Line 162: Line 166:
 
OLS.tstat=OLS.beta./sqrt(diag(OLS.cov));
 
OLS.tstat=OLS.beta./sqrt(diag(OLS.cov));
 
OLS.name='Regression one';</source>
 
OLS.name='Regression one';</source>
Any assignment of this variable to another one creates a copy of the structure with all fields and values defined.
+
An assignment of <source enclose="none">OLS</source> variable to another one creates a copy of the structure with all fields and values.
  
 
<source>>> OLSnew=OLS;
 
<source>>> OLSnew=OLS;
Line 174: Line 178:
 
ans =
 
ans =
 
Regression one</source>
 
Regression one</source>
Moreover, all fields will be carried in and out of a function. In this way you should not care about the order of inputs and outputs for a function with many inputs/outputs. There are several useful functions that you can use with structures:
+
Moreover, all field names will be carried in and out of a function. In this way you should not worry about the order of inputs and outputs for a function with many inputs/outputs.
  
 +
There are several useful functions that are quite helpful for dealing with structures:
 +
 +
<ul>
 +
<li><p><source enclose="none">isfield(struct,field_name)</source> checks whether a structure <source enclose="none">struct</source> has a field name <source enclose="none">field_name</source>. It returns either 1 (true) or 0 (false)</p>
 
<source>  >> isfield(OLS,'name')
 
<source>  >> isfield(OLS,'name')
 
   ans =
 
   ans =
Line 181: Line 189:
 
   >> isfield(OLS,'pvalues')
 
   >> isfield(OLS,'pvalues')
 
   ans =
 
   ans =
   0</source>
+
   0</source></li>
This command checks the existence of a particular field within the structure. You can also get a list of all fields within a current structure using <source enclose="none">fieldnames</source>
+
<li><p><source enclose="none">fieldnames(struct)</source> generates a cell array (see Section [cell]) with all field names of a structure <source enclose="none">struct</source></p>
 
 
 
<source>>> fnames=fieldnames(OLSnew)'
 
<source>>> fnames=fieldnames(OLSnew)'
 
fnames =
 
fnames =
     'beta'    'resid'    'sigma2'    'cov'    'tstat'    'name'    'old'</source>
+
     'beta'    'resid'    'sigma2'    'cov'    'tstat'    'name'    'old'</source></li>
This command creates a cell array (see Section [cell]) <source enclose="none">fnames</source> with field names of OLSnew structure. You can also access the values of these field names using indirect referencing. In particular,
+
<li><p>Indirect referencing. You can access the values of fieldnames using the information generated by <source enclose="none">fieldnames(struct)</source> using the following syntax:</p>
 
 
 
<source>>> OLSnew.(fnames{6}) %please note curly brackets!
 
<source>>> OLSnew.(fnames{6}) %please note curly brackets!
 
ans =
 
ans =
 
Regression one</source>
 
Regression one</source>
Since <source enclose="none">fnames{6}=’name’</source> and, as a result, we indirectly refer to the field <source enclose="none">name</source> of the structure <source enclose="none">OLSnew</source>. In this way we can compare two structures field-by-field. Other useful commands for working with structure field names are:
+
<p>Since <source enclose="none">fnames{6}</source> equals <source enclose="none">'name'</source>, we indirectly refer to the field <source enclose="none">name</source> of the structure <source enclose="none">OLSnew</source>. In this way we can compare two structures field-by-field.</p></li>
 
+
<li><p>Useful commands for working with structure field names are:</p>
* <source enclose="none">intersect(A,B)</source>, in terms of set operations, it corresponds to <math>A\cap B</math>
+
<ul>
* <source enclose="none">union(A,B)</source>, corresponds to <math>A\cup B</math>
+
<li><p><source enclose="none">intersect(A,B)</source>, in terms of set operations, it corresponds to <math>A\cap B</math></p></li>
* <source enclose="none">setdiff(A,B)</source>, corresponds to <math>A/B</math> (please note, <math>A/B \ne B/A</math>)
+
<li><p><source enclose="none">union(A,B)</source>, corresponds to <math>A\cup B</math></p></li>
* <source enclose="none">setxor(A,B)</source>, corresponds to <math>(A/B)\cup (B/A)</math>
+
<li><p><source enclose="none">setdiff(A,B)</source>, corresponds to the set difference <math>A\setminus B</math> (please note, <math>A\setminus B \ne B\setminus A</math>)</p></li>
 +
<li><p><source enclose="none">setxor(A,B)</source>, corresponds to <math>(A\setminus B)\cup (B\setminus A)</math></p></li></ul>
  
 
<source>  f1={'first','second'}; %Creating a cell array, see next section for details
 
<source>  f1={'first','second'}; %Creating a cell array, see next section for details
Line 210: Line 217:
 
   'third'
 
   'third'
 
>> disp(setxor(f1,f2))
 
>> disp(setxor(f1,f2))
   'first'    'third'</source>
+
   'first'    'third'</source></li></ul>
You can collect structures in arrays. Please note, all member structures of an array have to have the same set of fields. Creating a field for one element of the array, you are automatically creating empty fields of the same name for all members of your array. However, despite the fact that the field names are the same, there are no restriction on the field types. Example:
+
 
 +
Structures can be collected into arrays. Please note, all member structures of an array have to have the same set of fields. Creating a field for one element of the array automatically creates empty fields of the same name for all elements of your array. However, although the field names are the same, there is no restriction on the field types. Example:
  
 
<source>>> s(2).f1=1;
 
<source>>> s(2).f1=1;
Line 222: Line 230:
 
>> s(1).f1=3;
 
>> s(1).f1=3;
 
>> s(1).f2=[83  116  117  100  101  110  116 32];</source>
 
>> s(1).f2=[83  116  117  100  101  110  116 32];</source>
You can print all values of a structure field in an array using “:” operator:
+
By assigning 1 to <source enclose="none">s(2).f1</source>, you automatically create
 +
 
 +
# the first element of the array <source enclose="none">s</source>
 +
# an empty field <source enclose="none">s(1).f1</source>
 +
 
 +
By assigning ‘Sally’ to <source enclose="none">s(2).f2</source>, you automatically create an empty field <source enclose="none">s(1).f2</source>. You can print all values of a structure field in an array using “:” operator:
  
 
<source>>> s(:).f1
 
<source>>> s(:).f1
Line 239: Line 252:
 
ans =
 
ans =
 
     3    1</source>
 
     3    1</source>
though results may vary (try to figure it out by yourself):
+
Results, though, may vary (try to figure it out by yourself!):
  
 
<source>  [s(:).f2]
 
<source>  [s(:).f2]
Line 249: Line 262:
 
ans =
 
ans =
 
     83  116  117  100  101  110  116    32    83    97  108  108  121</source>
 
     83  116  117  100  101  110  116    32    83    97  108  108  121</source>
Some MATLAB functions require structures as inputs (See non-linear optimization for details). Other MATLAB functions return structures when they are asked. For example, <source enclose="none">data=load(fname)</source> creates a structure <source enclose="none">data</source> with fieldnames that coincide with variables of the workspace saved in the data file <source enclose="none">fname</source>. Thus, for comparing variables from different MATLAB data files, we first need to create two structures:
+
Some MATLAB functions require structures as inputs (See non-linear optimization for details). Other MATLAB functions return structures when they are asked to do so. For example, <source enclose="none">data=load(fname)</source> creates a structure <source enclose="none">data</source> with fieldnames that coincide with variables of the workspace saved in the data file <source enclose="none">fname</source>. Thus, for comparing variables from different MATLAB data files, we first need to create two structures
  
 
<source>  data1=load(fname1);
 
<source>  data1=load(fname1);
 
   data2=load(fname2);</source>
 
   data2=load(fname2);</source>
You can automatically compare variables with the same names from these datasets, using <source enclose="none">intersect</source> and <source enclose="none">setdiff</source> commands:
+
and then compare the fields of interest.
 +
 
 +
It is easy to write code that automatically compares variables with the same names from different data files, using the <source enclose="none">intersect</source> and <source enclose="none">setdiff</source> commands:
  
 
<source>data1=load('data1');%data1.mat has to exist
 
<source>data1=load('data1');%data1.mat has to exist
 
data2=load('data2');%data2.mat has to exist
 
data2=load('data2');%data2.mat has to exist
if isequal(data1,data2)
+
if isequal(data1,data2) %the only binary operation defined on structures and cell arrays
     disp('These structures are the same')
+
     disp('These structures are equal')
 
else
 
else
     fname1=fieldnames(data1);
+
     fname1=fieldnames(data1);%retrieving field names from data1 structure
     fname2=fieldnames(data2);
+
     fname2=fieldnames(data2);%retrieving field names from data2 structure
     fnamejoint=intersect(fname1,fname2);
+
     fnamejoint=intersect(fname1,fname2);%constructing a collection of field names that belong to both structures
 
     for i=1:length(fnamejoint)
 
     for i=1:length(fnamejoint)
         if isequal(data1.(fnamejoint{i}),data2.(fnamejoint{i}))
+
         if isequal(data1.(fnamejoint{i}),data2.(fnamejoint{i})) %indirect referencing
             disp([fnamejoint{i} ' fields are the same'])
+
             disp([fnamejoint{i} ' fields are equal'])
 
         else
 
         else
             disp([fnamejoint{i} ' fields are different'])
+
             disp([fnamejoint{i} ' fields are not equal'])
 
         end
 
         end
 
     end
 
     end
Line 275: Line 290:
 
     disp(setdiff(fname2,fname1)')
 
     disp(setdiff(fname2,fname1)')
 
end</source>
 
end</source>
''Please note: the script above detects only a limited number of “equality” cases. Due to a finite number of bits reserved for storing a number in memory (64 bits for double, 32 bits for single), the final result may depend on the computing path. For example, in theory, <math>\ln(\exp(x))\equiv x</math>. In reality this is not exactly the case. If we generate a <math>1000\times 1</math> vector of random numbers <source enclose="none">a</source> and compare it with <source enclose="none">log(exp(a))</source>, we obtain slightly different results. Thus, a standard comparison using <source enclose="none">isequal</source> does not work: ''
+
''Please note: the script above detects only a limited number of “equality” cases. Because a finite number of bits is reserved for storing a number in memory (64 bits for double, 32 bits for single), the final result may depend on the computing path. For example, in theory, <math>1+a-1-a\equiv 0</math>. In floating-point arithmetic this is not always the case: a computer cannot simultaneously keep the very large and the very small parts of the same number, so it drops the small part when needed. This problem is known as “rounding error”. If we compare <math>1/3+1-1</math> and <math>1/3</math>, we obtain slightly different results: ''
  
<source>>> a=rand(1000,1);
+
<source>>> a=1/3;
>> isequal(a,log(exp(a)))
+
>> disp(a-a)
   ans =
+
    0
        0
+
>> disp(a+1-1-a)
>> disp(mean(a-log(exp(a))))
+
   -5.5511e-17
   9.9335e-17</source>
+
>> disp(a+10-10-a)
As a result, if you want to compare two numbers/vectors/arrays, you have to specify your tolerance level and compare not vectors or matrices, but rather a measure of a distance between the two:
+
  6.1062e-16
 +
>>disp(isequal(a,a+1-1))
 +
   0</source>
 +
The difference is very often negligible. However, a standard comparison using <source enclose="none">isequal</source> does not work. As a result, if you want to compare two numbers/vectors/arrays, you have to specify your tolerance level (precision) and compare not vectors or matrices, but rather a measure of the distance between the two:
  
<source>%function that compares two vectors with predefined tolerance level tol:
+
<source>%function comparing two vectors with predefined tolerance level tol:
 
function out=myisequal(a,b,tol)
 
function out=myisequal(a,b,tol)
if mean(abs(a-b))>tol %checks whether an average absolute difference between two vectors is larger than the tolerance level.
+
if mean(abs(a-b))>tol %checks whether the average absolute difference between the two vectors is larger than the tolerance level.
 
     out=0; %if &quot;Yes&quot;, then these two vectors are different given the current tolerance level
 
     out=0; %if &quot;Yes&quot;, then these two vectors are different given the current tolerance level
 
else
 
else
Line 302: Line 320:
 
     isdir
 
     isdir
 
     datenum</source>
 
     datenum</source>
Then you can load all these files to memory (if you have enough of RAM). If all mat files in the directory have the same variables, you can load all of them by constructing a standard vector of structures. However, if the number of variables varies from file to file, using cell arrays would be a better idea (for details, see next section).
+
Then you can load all these files into memory (if you have enough RAM). If all mat files in the directory have the same variables, you can load all of them by constructing a vector of structures. However, if the number of variables varies from file to file, using cell arrays would be a better idea (for details, see Section [cell]).
  
 
== Cell Arrays[cell] ==
 
== Cell Arrays[cell] ==
  
The least structured variable type in MATLAB is cell array. You can store a sequence of any data objects in it. You may want to think about cell arrays as a structure where each field is coded by one number (if it is one-dimensional) or some set of numbers, if it is multidimensional. Since everything is coded by numbers, it is easier to use cell arrays inside loops. However, since sequence of numbers is not as informative as structure field names, it is harder to follow. Cell arrays are defined using curly brackets:
+
The least structured variable type in MATLAB is the cell array. You can store a sequence of any data objects in it. You may want to think about a cell array as a structure where each field is coded by one number if it is one-dimensional, or by a set of numbers if it is multidimensional. Since everything is coded with numbers, it is easier to use cell arrays inside loops. However, since a sequence of numbers is not as informative as structure field names, the code is harder to follow. Cell arrays are defined using curly brackets:
  
 
<source>>> c={'test',rand(10),1:10 }
 
<source>>> c={'test',rand(10),1:10 }
Line 350: Line 368:
  
 
<source>  c(6)={randn(4)};</source>
 
<source>  c(6)={randn(4)};</source>
Curly brackets convert double array to a cell element, and you can assign a cell element to another cell element.
+
Curly brackets convert a double array to a cell element, and you can assign a cell element to another cell element.
  
 
Now, equipped with this information, we can load all mat files in current directory in MATLAB using cell arrays:
 
Now, equipped with this information, we can load all mat files in current directory in MATLAB using cell arrays:
Line 363: Line 381:
 
   for i=1:length(fname)
 
   for i=1:length(fname)
 
     curr_field=['f' fname(i).name(1:end-4)];%Filename is selected, .mat is dropped. The leading 'f' stands for a file. Please note that the code might not work for some file names
 
     curr_field=['f' fname(i).name(1:end-4)];%Filename is selected, .mat is dropped. The leading 'f' stands for a file. Please note that the code might not work for some file names
     data.(curr_field)=load(fname(1).name);
+
     data.(curr_field)=load(fname(i).name);
 
   end</source>
 
   end</source>
 
It is possible to convert structures to cell arrays and vice versa. Also, it is possible to apply a function to each cell of a cell array
 
It is possible to convert structures to cell arrays and vice versa. Also, it is possible to apply a function to each cell of a cell array
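
A minimal sketch of both ideas, assuming the built-in functions <code>struct2cell</code>, <code>cell2struct</code> and <code>cellfun</code>. The structure <code>res</code> and all variable names are made up for this illustration:

```matlab
% A small structure with two fields of different types.
res = struct('beta', [1; 2], 'name', 'Regression one');

values = struct2cell(res);                 % 2-by-1 cell array of field values
names  = fieldnames(res);                  % 2-by-1 cell array of field names
resBack = cell2struct(values, names, 1);   % rebuild the structure from both
disp(isequal(res, resBack))

% Apply a function to every cell of a cell array.
c = {'test', rand(10), 1:10};
nElements = cellfun(@numel, c);            % one number per cell
classes   = cellfun(@class, c, 'UniformOutput', false);  % text output needs a cell array
disp(nElements)
disp(classes)
```

The <code>'UniformOutput', false</code> option is needed whenever the function does not return a scalar for each cell, because the results can then only be collected in another cell array.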

Latest revision as of 18:42, 23 October 2012

Intro

In many cases you would like to either (i) store data which have more than two dimensions or (ii) collect data of different type/dimension under the same variable name. Examples of the former could be interest rates for different maturities for different countries over time or Monte-Carlo simulations for different sample sizes and different model specifications. Examples of the latter could be real time prices grouped on a daily basis (number of observations varies from day to day) or results of OLS estimation stored in one variable. In MATLAB you can address these issues using:

  1. Multidimensional arrays
  2. Structures/Structure Arrays
  3. Cells/Cell arrays

Each of these approaches has its advantages and disadvantages. For example: the easiest way to handle data in a uniform way is to use multidimensional arrays. Multidimensional arrays, however, require that the data are of the same type (logical, numerical, character). In the next sections we will briefly review these data objects and discuss the most straightforward way to handle them.

Multidimensional Arrays

Elementwise Operations

Multidimensional arrays are a generalization of matrices. Almost all MATLAB functions can generate and operate on multidimensional arrays. For example, to generate a three-dimensional array of standard normal random numbers with dimensions [math]T\times p\times k[/math], type:

  A=randn(T,p,k);

MATLAB prints all multidimensional arrays as a sequence of matrices. For example, a [math]2\times 2\times 2[/math] array is printed as

>> A=zeros(2,2,2)
A(:,:,1) =
     0     0
     0     0
A(:,:,2) =
     0     0
     0     0

Taking an average over the second dimension is quite intuitive:

  meanA=mean(A,2);

Taking a maximum over the third dimension is less so:

  maxA=max(A,[],3);

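In max, the second argument is reserved for a matrix or scalar that A is compared with, so it has to be left empty ([]) when the dimension is passed as the third argument. The bottom line is: please check the MATLAB help system first.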
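
To see what these calls return, it helps to look at the sizes of the results. The following sketch uses arbitrary values of T, p and k chosen only for illustration; the variable wrongMax is a hypothetical name showing what happens when the empty placeholder is forgotten:

```matlab
% Arbitrary dimensions for this sketch.
T = 5; p = 4; k = 3;
A = randn(T, p, k);

meanA = mean(A, 2);       % average over the second dimension
maxA  = max(A, [], 3);    % maximum over the third dimension

disp(size(meanA))         % T-by-1-by-k: dimension 2 becomes a singleton
disp(size(maxA))          % T-by-p: the trailing singleton dimension is dropped

% Without the empty placeholder, 3 is treated as a value to compare with:
% the result has the same size as A and every element is at least 3.
wrongMax = max(A, 3);
disp(size(wrongMax))
```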

Multidimensional arrays support element-by-element binary operations +,-,.*,./,.^,<,>,~= between two arrays with the same dimensions and between arrays and scalars. MATLAB will correctly compute

  B=A.^2;
  C=A.^B+B;
  D=C./A;

In the first line, the 3-D array B consists of the squared elements of the 3-D array A. In the second line, each element of C is the corresponding element of A raised to the power of the corresponding element of B, plus that element of B. The 3-D array D is the element-wise ratio of C to A. However, the following commands

A*A;
A^2;

will throw an error, since ^,* are matrix operations and they are not defined on multidimensional arrays.
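
If a matrix product is needed for every two-dimensional slice of a 3-D array, one option is to loop over the third dimension. A minimal sketch, with illustrative dimensions and a hypothetical result name P:

  A=randn(3,3,4);                    % four 3x3 matrices stacked along the third dimension
  P=zeros(size(A));                  % preallocating the result
  for k=1:size(A,3)
    P(:,:,k)=A(:,:,k)*A(:,:,k);      % ordinary matrix product of the k-th slice with itself
  end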

Collapsing Singleton Dimensions

All operations for extracting sub-arrays from the original arrays work as usual (see link). However, a few particularities are worth mentioning. Assume that we have a [math]3\times 3\times 3[/math] array A. MATLAB accepts the following operation:

  A(:,:,1)*A(:,:,2);

while

  A(1,:,:)*A(1,:,:);

generates an error. We get similar results for plot function:

  plot(A(:,:,1))

generates a graph, while

  plot(A(1,:,:))

generates an error. This happens because, from MATLAB's point of view, A(1,:,:) is not a matrix but a 3-D array of size [math]1\times 3\times 3[/math]. There are two ways to deal with this problem:

  1. Make MATLAB realize that we would like to see a matrix instead of a 3-D array with a singleton in the first dimension

            B(:,:)=A(1,:,:);
            plot(B)
  2. Collapse all singleton dimensions using a special command squeeze

            B=squeeze(A(1,:,:));
            plot(B)
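
The effect of squeeze is easiest to see from the sizes of the extracted sub-arrays. A short sketch on a random array (the array itself is only an illustration):

  A=rand(3,3,3);
  disp(size(A(:,:,1)))             % 3 3   : already a matrix
  disp(size(A(1,:,:)))             % 1 3 3 : 3-D array with a leading singleton dimension
  disp(size(squeeze(A(1,:,:))))    % 3 3   : the singleton dimension is removed
  disp(size(squeeze(A(1,1,:))))    % 3 1   : here squeeze returns a column vector

Since squeeze removes every singleton dimension, the orientation of the result is not always the one you expect. When the order of the remaining dimensions matters, permute(A(1,:,:),[2 3 1]) states it explicitly.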

Expanding to Higher Dimensions

By default, element-wise operations are defined on multidimensional arrays of the same size. An additional set of operations is defined on conformable vectors and matrices. The natural question to ask is “What should you do if you want to subtract a vector from a matrix, or a matrix from an N-D array?” MATLAB has a special function that can transform a lower-dimensional array into a higher-dimensional one by replicating the original content:

 A=repmat([1,2,3]',2,3)
A =
     1     1     1
     2     2     2
     3     3     3
     1     1     1
     2     2     2
     3     3     3

This command replicates the column-vector [math][1\ 2\ 3]'[/math] six times: twice along the first (row) dimension and 3 times along the second (column) dimension, which gives a [math]6\times 3[/math] matrix. For 3-D replication we have to use slightly different syntax:

   A=repmat([1,2,3]',[1 2 3])
A(:,:,1) =
     1     1
     2     2
     3     3
A(:,:,2) =
     1     1
     2     2
     3     3
A(:,:,3) =
     1     1
     2     2
     3     3

This command keeps the length of the vector the same (first index is 1), replicates it twice along the second dimension (second index is 2) and 3 times along the third dimension (third index is 3). Consider the following two real-life examples:

  1. Assume that we have a cross-section of returns and we would like to subtract the risk-free rate series from each of them.
  2. Assume that we need to subtract the mean along the third dimension from a 3-D array.

These examples can be implemented either by using a loop, or by using repmat. In the first case the algorithm with a loop looks like:

  1. Initialize a matrix of excess returns Rex
  2. Compute the difference R(:,i)-rf and assign it to the column Rex(:,i) for [math]i=1[/math]
  3. Repeat (2) for i=2, 3, …, size(R,2)

  % rf - a series of risk-free returns
  % R  - a cross-section of returns, it is assumed that size(R,1)=size(rf,1)
  Rex=zeros(size(R));
  for i=1:size(R,2)
    Rex(:,i)=R(:,i)-rf;
  end

The same algorithm using repmat looks like:

  1. Replicate the rf vector size(R,2) times to obtain the matrix Rf
  2. Compute Rex as Rex=R-Rf

  % rf - a series of risk-free returns
  % R  - a cross-section of returns, it is assumed that size(R,1)=size(rf,1)
  Rf=repmat(rf,1,size(R,2));
  Rex=R-Rf;

For the second example, to implement an algorithm with a loop:

  1. Compute the mean of the 3-D array along the third dimension, Rmean3
  2. Initialize a 3-D array of demeaned values
  3. Compute the difference R(:,:,i)-Rmean3 and assign it to Rdemean(:,:,i) for i=1
  4. Repeat (3) for i=2, 3, …, size(R,3)
  5. Check whether the mean is indeed subtracted

  R=rand(3,4,10); % constructing a 3-D array of U(0,1) random variables
  Rmean3=mean(R,3); %constructing a matrix of means
  Rdemean=zeros(size(R)); %initializing a 3-D array of zeros
  for i=1:size(R,3)
    Rdemean(:,:,i)=R(:,:,i)-Rmean3;
  end
  disp(mean(Rdemean,3)) % a matrix of zeros (up to rounding error), implying we worked correctly

The same algorithm with repmat:

  1. Compute the mean of the 3-D array along the third dimension, Rmean3
  2. Construct a 3-D array of means using repmat
  3. Subtract the array of means from the original array
  4. Check whether the mean is indeed subtracted

  R=rand(3,4,10); % constructing a 3-D array of U(0,1) random variables
  Rmean3=mean(R,3); %constructing a matrix of means
  %Please note: for arrays with more than 2 dimensions, repmat accepts a vector of replications instead of a variable number of inputs.
  Rmean3expand=repmat(Rmean3,[1 1 size(R,3)]);
  Rdemean=R-Rmean3expand;
  disp(mean(Rdemean,3)) %a matrix of zeros (up to rounding error), which means that we did it correctly
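
Both examples can also be written without building the replicated array explicitly. The sketch below uses bsxfun, which applies an element-wise operation after virtually expanding singleton dimensions; from release R2016b onwards MATLAB performs this expansion automatically, so that R-rf can be typed directly. The data are simulated and the name R3 is introduced for this illustration only:

  % Example 1: excess returns
  R=rand(100,5);                             % a cross-section of returns (simulated)
  rf=rand(100,1);                            % a series of risk-free returns (simulated)
  Rex=bsxfun(@minus,R,rf);                   % same result as R-repmat(rf,1,size(R,2))
  % Example 2: demeaning along the third dimension
  R3=rand(3,4,10);
  Rdemean=bsxfun(@minus,R3,mean(R3,3));      % the matrix of means is expanded along the third dimension
  disp(max(max(abs(mean(Rdemean,3)))))       % a number very close to zero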

Structures and Cell Arrays

Sometimes it is natural to keep data of different types under the same roof, i.e. under the same variable name. Multidimensional arrays are not designed for this purpose. Therefore structures, arrays of structures or cell arrays have to be used in such cases. Structures and cell arrays are very similar and, in fact, interchangeable. For some applications, however, cell arrays are more suitable while, for others, structures are preferable. For obvious reasons, most binary operations are not defined on these objects.

Structures

Structure variables are variables that have “fields”. The variable name is separated from the field name by a dot. For example, if you want to keep all OLS regression results, i.e. beta coefficients, covariance matrix, t-stats and vector of residuals in one place, you can do it using the following:

%Assuming X and y are already defined, the whole filling-up process of the structure would look as follows:
OLS.beta=X\y;%short and more efficient way to write inv(X'*X)*X'*y
OLS.resid=y-X*OLS.beta;
[T,k]=size(X);
OLS.sigma2=sum(OLS.resid.^2)/(T-k);%computing residual variance
OLS.cov=OLS.sigma2*inv(X'*X);
OLS.tstat=OLS.beta./sqrt(diag(OLS.cov));
OLS.name='Regression one';

An assignment of OLS variable to another one creates a copy of the structure with all fields and values.

>> OLSnew=OLS;
>> OLSnew.name
ans =
Regression one

A field can be a structure by itself. For example,

>> OLSnew.old=OLS;
>> OLSnew.old.name
ans =
Regression one

Moreover, when a structure is passed to or returned from a function, all of its field names travel with it. You therefore do not have to worry about the order of inputs and outputs of a function with many inputs/outputs.
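
As an illustration, the OLS computations above could be wrapped in a function that returns a single structure. The function name myols and the variable name res are hypothetical choices made for this sketch; the code is meant to be saved in a file myols.m:

  function res=myols(y,X)
  % MYOLS (hypothetical helper) collects OLS results in one structure
  res.beta=X\y;                              % OLS coefficients
  res.resid=y-X*res.beta;                    % residuals
  [T,k]=size(X);
  res.sigma2=sum(res.resid.^2)/(T-k);        % residual variance
  res.cov=res.sigma2*inv(X'*X);              % covariance matrix of the coefficients
  res.tstat=res.beta./sqrt(diag(res.cov));   % t-statistics
  end

The caller then picks results by name, e.g. res=myols(y,X); disp(res.tstat), regardless of the order in which the fields were created.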

Several functions are helpful for dealing with structures:

  • isfield(struct,field_name) checks whether a structure struct has a field name field_name. It returns either 1 (true) or 0 (false)

      >> isfield(OLS,'name')
      ans =
      1
      >> isfield(OLS,'pvalues')
      ans =
      0
  • fieldnames(struct) generates a cell array (see the Cell Arrays section below) with all field names of a structure struct

    >> fnames=fieldnames(OLSnew)'
    fnames =
        'beta'    'resid'    'sigma2'    'cov'    'tstat'    'name'    'old'
  • Indirect referencing. You can access the value of a field whose name is stored in a variable, for example one of the names returned by fieldnames(struct), using the following syntax:

    >> OLSnew.(fnames{6}) %please note curly brackets!
    ans =
    Regression one

    Since fnames{6}=’name’, we indirectly refer to the field name of the structure OLSnew. In this way we can compare two structures field-by-field.

  • Useful commands for working with structure field names are:

    • intersect(A,B), in terms of set operations, it corresponds to [math]A\cap B[/math]

    • union(A,B), corresponds to [math]A\cup B[/math]

    • setdiff(A,B), corresponds to [math]A\setminus B[/math] (please note, [math]A\setminus B \ne B\setminus A[/math])

    • setxor(A,B), corresponds to [math](A\setminus B)\cup (B\setminus A)[/math]

      f1={'first','second'}; %Creating a cell array, see next section for details
      f2={'second','third'}; %Creating a cell array, see next section for details
    >> disp(intersect(f1,f2))
      'second'
    >> disp(union(f1,f2))
      'first'    'second'    'third'
    >> disp(setdiff(f1,f2))
      'first'
    >> disp(setdiff(f2,f1))
      'third'
    >> disp(setxor(f1,f2))
      'first'    'third'
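
Indirect referencing is most useful inside a loop over field names. A minimal sketch, using a small structure made up for this example:

  s.beta=[0.5; 1.2];
  s.name='Regression one';
  fn=fieldnames(s);                      % cell array with the field names
  for i=1:numel(fn)
    val=s.(fn{i});                       % value of the i-th field, accessed by name
    fprintf('%s is of class %s\n',fn{i},class(val));
  end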

Structures can be collected into arrays. Please note, all member structures of an array have to have the same set of fields. Creating a field for one element of the array automatically creates empty fields of the same name for all elements of your array. However, although the field names are the same, there is no restriction on the field types. Example:

>> s(2).f1=1;
>> disp(s(1))
    f1:[]
>> s(2).f2='Sally';
>> disp(s(1))
    f1:[]
    f2:[]
>> s(1).f1=3;
>> s(1).f2=[83   116   117   100   101   110   116 32];

By assigning 1 to s(2).f1, you automatically create

  1. the first element of the array s
  2. an empty field s(1).f1

By assigning ‘Sally’ to s(2).f2, you automatically create an empty field s(1).f2. You can print all values of a structure field in an array using “:” operator:

>> s(:).f1
ans =
     3
ans =
     1
>> s(:).f2
ans =
     83   116   117   100   101   110   116 32
ans =
     Sally

and try to construct an array out of them using concatenation operator [ ]:

[s(:).f1]
ans =
     3     1

Results, though, may vary (try to figure it out by yourself!):

  [s(:).f2]
ans =
Student Sally

An additional hint:

  [s(:).f2]+0
ans =
    83   116   117   100   101   110   116    32    83    97   108   108   121
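
When the field values differ in type or size, collecting them with curly brackets instead of square brackets keeps every value intact, because a cell array (see the Cell Arrays section) does not force a common type. A short sketch that rebuilds the field f2 from above under a different, illustrative variable name t:

  t(1).f2=[83 116 117 100 101 110 116 32];   % numeric codes
  t(2).f2='Sally';                           % character array
  joined=[t.f2];          % square brackets: everything is converted to characters
  kept={t.f2};            % curly brackets: 1x2 cell array, original types preserved
  disp(class(joined))     % char
  disp(class(kept{1}))    % double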

Some MATLAB functions require structures as inputs (See non-linear optimization for details). Other MATLAB functions return structures when they are asked to do so. For example, data=load(fname) creates a structure data with fieldnames that coincide with variables of the workspace saved in the data file fname. Thus, for comparing variables from different MATLAB data files, we first need to create two structures

  data1=load(fname1);
  data2=load(fname2);

and then compare the fields of interest.

It is easy to write code that automatically compares variables with the same names from different data files, using the intersect and setdiff commands:

data1=load('data1');%data1.mat has to exist
data2=load('data2');%data2.mat has to exist
if isequal(data1,data2) %the only binary operation defined on structures and cell arrays
    disp('These structures are equal')
else
    fname1=fieldnames(data1);%retrieving field names from data1 structure
    fname2=fieldnames(data2);%retrieving field names from data2 structure
    fnamejoint=intersect(fname1,fname2);%constructing a collection of field names that belong to both structures
    for i=1:length(fnamejoint)
        if isequal(data1.(fnamejoint{i}),data2.(fnamejoint{i})) %indirect referencing
            disp([fnamejoint{i} ' fields are equal'])
        else
            disp([fnamejoint{i} ' fields are not equal'])
        end
    end
    disp('Unique for data1:')
    disp(setdiff(fname1,fname2)')
    disp('Unique for data2:')
    disp(setdiff(fname2,fname1)')
end

Please note: the script above detects only a limited number of “equality” cases. Because a finite number of bits is reserved for storing a number in memory (64 bits for double, 32 bits for single), the final result may depend on the order of computations. For example, in theory, [math]1+a-1-a\equiv 0[/math]. In computer reality this is not always the case. Computers cannot simultaneously keep information about very large and very small parts of the same number, so the small part is dropped when needed. This problem is known as “rounding error”. If we compare [math]1/3+1-1[/math] and [math]1/3[/math], we obtain slightly different results:

>> a=1/3;
>> disp(a-a)
     0
>> disp(a+1-1-a)
  -5.5511e-17
>> disp(a+10-10-a)
   6.1062e-16
>> disp(isequal(a,a+1-1))
   0

The difference is very often negligible. However, a standard comparison using isequal does not work. As a result, if you want to compare two numbers/vectors/arrays, you have to specify your tolerance level (precision) and compare not vectors or matrices, but rather a measure of the distance between the two:

%function comparing two vectors with predefined tolerance level tol:
function out=myisequal(a,b,tol)
if mean(abs(a-b))>tol %checks whether the average absolute difference between the two vectors is larger than the tolerance level.
    out=0; %if "Yes", then these two vectors are different given the current tolerance level
else
    out=1; %if "No", then these two vectors are equal given the current tolerance level
end
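
The same idea can be checked directly at the command line. The tolerance value below is an arbitrary choice made for this illustration:

  a=1/3;
  b=a+1-1;                 % equal to a in theory, but not in floating-point arithmetic
  tol=1e-12;               % illustrative tolerance level
  disp(isequal(a,b))       % 0: the exact comparison fails
  disp(abs(a-b)<=tol)      % 1: the two numbers are equal up to the tolerance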

Another command that generates a (possibly empty) array of structures is dir(file_mask). To obtain a list of all mat files in the current directory, you have to type:

>>   matfiles=dir('*.mat')
matfiles =
1004x1 struct array with fields:
    name
    date
    bytes
    isdir
    datenum

Then you can load all these files into memory (if you have enough RAM). If all mat files in the directory contain the same variables, you can load all of them by constructing a vector of structures. However, if the number of variables varies from file to file, using cell arrays would be a better idea (for details, see the Cell Arrays section).

Cell Arrays

The least structured variable in MATLAB is the cell array. You can store a sequence of any data objects in it. You may want to think of a cell array as a structure where each field is coded by one number if the array is one-dimensional, or by a set of numbers if it is multidimensional. Since everything is coded with numbers, it is easier to use cell arrays inside loops. However, since a sequence of numbers is not as informative as structure field names, the code is harder to follow. Cell arrays are defined using curly brackets:

>> c={'test',rand(10),1:10 }
c =
    'test'    [10x10 double]    [1x10 double]

The same result can be achieved in three steps:

>>  c{1}='test';
>>  c{2}=rand(10);
>>  c{3}=1:10;
>>  c
c =
    'test'    [10x10 double]    [1x10 double]

To refer to the second element of the array stored in c{3}, the following syntax has to be employed:

>> c{3}(2)
ans =
      2

Important:

There is a very important difference between c(3) and c{3}. The former refers to a [math]1\times 1[/math] cell array, while the latter refers to its content.

>>  mean(c(3));
Undefined function 'sum' for input arguments
of type 'cell'.
Error in mean (line 28)
  y = sum(x)/size(x,dim);

generates an error, since math functions are not defined on cell variables, while

>> mean(c{3})
ans =
    5.5000

works as expected. By the same logic, the assignment below

  c(4:5)={'ostrich', randn(50)};

works by extending cell array c to [math]1\times 5[/math], while

  c{4:5}={'ostrich', randn(50)};

generates an error. Also an error is generated by

  c(6)=randn(4);

as we are trying to assign a double array randn(4) to a cell element c(6). The correct assignment is

  c(6)={randn(4)};

Curly brackets convert a double array to a cell element, and you can assign a cell element to another cell element.
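
The difference between the two kinds of brackets can be verified with class. A short sketch that reuses the cell array c defined earlier in this section:

  c={'test',rand(10),1:10};
  disp(class(c(3)))      % cell:   round brackets return a 1x1 cell array
  disp(class(c{3}))      % double: curly brackets return the content
  c(4)={randn(4)};       % assigning a cell to a cell element
  c{5}=randn(4);         % assigning the content directly, an equivalent way to fill a cell
  disp(size(c))          % 1 5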

Now, equipped with this information, we can load all mat files in current directory in MATLAB using cell arrays:

  fname=dir('*.mat');
  data=cell(1,length(fname)); %preallocating the cell array
  for i=1:length(fname)
    data{i}=load(fname(i).name); %loading the i-th file
  end

The same goal can be achieved using structure:

  fname=dir('*.mat');
  for i=1:length(fname)
    %The file name is selected and the extension .mat is dropped. The leading 'f' stands for "file".
    %Please note that the code might not work for some file names, since not every file name is a valid field name
    curr_field=['f' fname(i).name(1:end-4)];
    data.(curr_field)=load(fname(i).name);
  end
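
File names are not always valid field names. One possible remedy, sketched below, is to strip the extension with fileparts and pass the result through matlab.lang.makeValidName (available in recent MATLAB releases, R2014a or later to the best of my knowledge). The file names in the list are made up for this example, and nothing is read from disk:

  names={'data1.mat','2015-returns.mat'};              % hypothetical file names
  for i=1:numel(names)
    [~,base]=fileparts(names{i});                      % dropping the path and the extension
    curr_field=matlab.lang.makeValidName(['f' base]);  % replacing characters that are not allowed in field names
    fprintf('%s -> %s\n',names{i},curr_field);
  end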

It is possible to convert structures to cell arrays and vice versa. It is also possible to apply a function to each cell of a cell array:

  • c = struct2cell(s) converts a structure s to a cell array c, fields are converted to cells using the same order as in structure s

  • s = cell2struct(c,fields) converts a cell array c to a structure array s with fieldnames defined in fields

      >> s=struct('f1',rand(10),'f2','MSFT') %an alternative way to create a structure
    s =
        f1: [10x10 double]
        f2: 'MSFT'
    >> fldnm=fieldnames(s)' %recording fieldnames of structure s
    fldnm =
        'f1'    'f2'
    >> c=struct2cell(s)'  %converting structure s to cell array c
    c =
        [10x10 double]    'MSFT'
    >> s2=cell2struct(c',fldnm) %converting cell array c back to a structure s2 using fldnm which was recorded earlier. By default the field names are matched along the first dimension of the cell array, hence the column-vector c'
    s2 =
        f1: [10x10 double]
        f2: 'MSFT'
  • cellfun(@function_name,cell1,cell2,...,celln,options) evaluates the function
    function_name on the first elements of cell arrays cell1,cell2,...,celln, then on their second elements, etc.

    >> c1={rand(10,10,10),rand(10,10,10),rand(10,10,10)};
    >> c2={1,2,3};
    >> average=cellfun(@mean,c1,c2, 'UniformOutput',0)
    average =
        [1x10x10 double]    [10x1x10 double]    [10x10 double]

    Without 'UniformOutput',0 MATLAB tries to collect the three results into a numeric vector with three elements and fails, because each result is an array rather than a scalar.
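
The two cases can be contrasted on a small example. In the sketch below the cell array and the variable names n and sz are made up for this illustration:

  c={1:3,rand(2,2),'abc'};
  n=cellfun(@numel,c);                         % scalar outputs: n is the numeric vector [3 4 3]
  sz=cellfun(@size,c,'UniformOutput',false);   % non-scalar outputs: sz is a 1x3 cell array
  disp(n)
  disp(class(sz))                              % cell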