13. Intention/Extending

13.1. Intention and Remarks

13.1.1. Genesis

This package grew out of my personal need to fit multiple datasets together which differ in attributes defined by the measurements, a very common task that is not covered by numpy/scipy or most other fit programs. What I wanted is a numpy ndarray with its matrix-like functionality for evaluating my data, but including attributes related to the data, e.g. from a measurement. For multiple measurements I need a list of these with variable length. ==> dataArray and dataList.

As the same models are used repeatedly, a module with physical models grew over time. Many of these models are also found in small-angle scattering programs like SasView or SASfit. For my purpose the dynamic models such as diffusion, Zimm, Rouse and other things mainly for protein dynamics were missing.

Some programs (under an open license) are difficult to extend as the models are hidden in classes, or access/reuse requires a specially designed interface to get parameters instead of simple function calls. Here Python functions are easier to use for non-programmers, as most PhD students are. Models are just Python functions (or one-line lambda functions) with the arguments accessed by their name (keyword arguments). Scripting in Python with numpy/scipy is easy to learn even without extended programming skills.

The main difficulty, besides finding the right model for your problem, is proper multidimensional fitting including errors. This is included in dataArray/dataList using scipy.optimize to allow fitting of the models in a simple and easy way. The user can concentrate on reading data, model fitting and presenting results.
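
As a minimal sketch of that workflow, close to the pattern of the Beginners Guide (the data file and start values here are only illustrative):

import numpy as np
import jscatter as js

# a simple diffusion model; all parameters are keyword arguments
def diffusion(A, D, t, elastic=0, q=1):
    return A * np.exp(-q**2 * D * t) + elastic

# read a list of measurements (each with attribute q) and fit them together
data = js.dL(js.examples.datapath + '/iqt_1hho.dat')
data.fit(model=diffusion,
         freepar={'D': 0.1, 'A': 1},      # start values of free parameters
         fixpar={'elastic': 0.0},         # fixed contributions
         mapNames={'t': 'X', 'q': 'q'})   # map model arguments to data names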

13.1.2. Scripting over GUI

Documentation of the evaluation of scientific data is difficult in GUI-based programs (a sequence of clicked buttons?). Script-oriented evaluation (MATLAB, Python, Jupyter, …) allows easy repetition with stepwise improvement and at the same time documents what was done.

Complex models have multiple contributions, background contributions, … which can easily be defined in a short script including documentation. I cannot guess whether the background in a measurement is constant, linear, parabolic or whatever, and each choice is also a limitation. Therefore the intention is to supply non-obvious and complex models (with a scientific reference) and let the user adapt them to their needs, e.g. add background and amplitude or resolution convolution. Simple models are quickly implemented in one line as lambda functions, more complex things in scripts; see the sketch below. The mathematical basis such as integration or linear algebra can be used from scipy/numpy.
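
For illustration, a one-line lambda model and a library model wrapped with a user-defined background might look like this (js.ff.sphere from the formfactor module serves as an example; the background choice is up to the user):

import numpy as np
import jscatter as js

# a one-line model: parabolic background, parameters by keyword
parabola = lambda q, a, b, c: a * q**2 + b * q + c

# wrap a library model and add amplitude and linear background
def sphereWithBackground(q, R, I0, bgr, slope):
    sphere = js.ff.sphere(q=q, radius=R)
    sphere.Y = I0 * sphere.Y + bgr + slope * q
    return sphere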

13.1.3. Plotting

Matplotlib seems to be the standard for numpy/scipy users. You can use it if you want. For fast and live (interactive) plotting it is complicated and slow, and 3D plotting has strong limitations.

Frequently I run scripts that show results of different datasets, and I want to keep these open for comparison and be able to modify the plot. Some of this is possible in matplotlib, but not by default. As I want to think about physics and not plotting, I prefer xmgrace, with a GUI interface after plotting. A simple one-line command should result in a 90% finished plot; the final 10% fine adjustment can be done in the GUI if needed or from additional commands. I adapted the original Graceplot module (a Python interface to XmGrace) to my needs and added dataArray functionality. For the errorPlot of a fit a simple matplotlib interface is included. Meanwhile, the module mpl is a rudimentary interface to matplotlib to make plotting easier for beginners.
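
A sketch of such a one-line plot (assuming XmGrace is installed; model and legend are only illustrative):

import numpy as np
import jscatter as js

p = js.grace()    # open an XmGrace plot window
p.plot(js.ff.sphere(np.r_[0.1:5:0.05], 3), legend='sphere R=3')
# fine adjustment later in the GUI or from additional commands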

The nice thing about Xmgrace is that it stores the plot as ASCII text instead of a JPG or PDF. So it is easy to reopen the plot and change it later if your supervisor/boss/reviewer asks for log-log, other colors or whatever. For data inspection, zooming, hiding of data, simple fitting for trends and more are possible on a WYSIWYG/GUI basis. If you want to retrieve the data (or forgot to save your results separately) they are accessible in the ASCII file. Export in scientific paper quality is possible. A simple interface for annotations, lines, … is included. Unfortunately it is only 2D, but this covers 99% of my work.

13.1.4. Speed/Libraries

The most common libraries for scientific computing in Python are NumPy and SciPy, and these are the main obligatory dependencies of Jscatter (matplotlib and Pillow for image reading were added later). Python in combination with numpy can be quite fast if ndarray methods are used consistently instead of explicit for loops; e.g. the numpy.einsum function immediately uses compiled C to do the computation (see the NumPy documentation under “Why are NumPy arrays efficient”). SciPy offers all the math needed and optimized algorithms, also from BLAS/LAPACK. To speed up on a multiprocessor machine, if needed, the module Parallel execution offers an easy interface to the standard Python module multiprocessing within a single command. If your model still needs long computing times, the common methods such as Cython, Numba or f2py (Fortran with OpenMP for parallel processing) should be used in your model. As these are more difficult, they are left to the advanced user.
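
A small sketch of the difference, with illustrative array shapes: the same contraction written as an explicit loop and as a single numpy.einsum call:

import numpy as np

q = np.random.rand(3)          # a wavevector
r = np.random.rand(20000, 3)   # particle positions

# explicit Python loop: interpreted, element by element
qr_loop = np.array([q[0]*x + q[1]*y + q[2]*z for x, y, z in r])

# the same contraction with einsum: one call into compiled C
qr_fast = np.einsum('i,ji->j', q, r)

assert np.allclose(qr_loop, qr_fast)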

Nevertheless, the critical point in these cases is the model and not the small overhead in dataArray/dataList or fitting.

As some models depend on f2py and Fortran code, an example of how to use f2py and finally contribute a function to Jscatter is given in Extending/Contributing/Fortran below.


13.1.5. Development environment/Testing

The development platform is mainly current Linux (Manjaro/Rocky Linux) with testing on GitLab (Ubuntu docker image). I regularly use Jscatter on macOS and on our Linux cluster. The code is Python 3.x compatible. I rarely use Windows (only if a manufacturer of an instrument forces me…). Jscatter works under native Windows 10, except for things that rely on pipes or gfortran, such as the connection to XmGrace and the DLS module, which calls CONTIN through a pipe. As matplotlib is slow, fits give no intermediate output.

13.2. Native Windows or WSL => USE WSL, it's faster

Meanwhile (2021) the Windows Subsystem for Linux (WSL) is the best choice to run Jscatter on Windows, as any Linux runs natively without limitations. Since 2024 I also work under macOS on an M3, including the Fortran modules.

But should one use Windows natively or with WSL? Some parts (Fortran-compiled things in bio) currently do not work on native Windows, but e.g. cloudscattering does, which allows a comparison.

On Linux the Fortran speedup (first example in cloudscattering, only the ffe calculation) is a factor of 8 compared to the Python version. This compares multiprocessing with compiled Fortran against efficient NumPy (Python) code.

The main difference between Linux and Windows is that new subprocesses for multiprocessing on Windows use the ‘spawn’ method, while Linux can use ‘fork’. Fork is much faster.

For comparison we use the same example (first example in cloudscattering, only ‘ffe’, 7368 points, relError=100). On Linux, using ‘fork’ needs 382 ms (using NumPy code). Using ‘spawn’ needs 2050 ms, which is about 5 times slower. There is a small overhead (‘fork’ with Fortran-compiled code needs 233 ms), but just using WSL instead of native Windows is already about 5 times faster because of the limitations of ‘spawn’.
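
To check which start method your platform uses, and to request ‘fork’ explicitly where it is available, a small sketch using only the standard library (‘fork’ is not available on native Windows):

import multiprocessing as mp

if __name__ == '__main__':
    # default start method: 'fork' on Linux, 'spawn' on Windows and macOS
    print(mp.get_start_method())

    # request 'fork' explicitly (raises ValueError on native Windows)
    ctx = mp.get_context('fork')
    with ctx.Pool(4) as pool:
        print(pool.map(abs, [-1, -2, -3]))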

My advice: if methods from cloudscattering (some formfactors, lattice structure factors and bio) are used, install Jscatter via WSL. It benefits from ‘fork’ and Fortran.

13.3. Extending/Contributing/Fortran

There are different ways to extend Jscatter with new models.
  • Write a Python function (see the Beginners Guide).

  • Write a Fortran function (preferably f90 or newer) with a Python wrapper that contains the documentation.

  • Use Cython or Numba … (for experts, not in the scope here).

The simplest way is writing a Python function, as most functions are written like that. The function should contain the basic model without background or other trivial contributions needed for fitting of real data. These should be added in user-defined functions, as the user knows best how to treat the background.

Contributions of a single function or of a complete module dealing with a new topic are welcome. As users should know what a model is about, the documentation is important and should contain a scientific reference for the model.

Contributors accept that all new modules or functions included will be covered by the GPLv3 or later.

If modules or code from other open-source projects are used, the respective license and copyright should be mentioned and respected in the source code. Please also credit important contributions in the documentation.

Using Fortran

Premature optimization is the root of all evil – Donald Knuth

The speedup from more sophisticated methods is not too big (not ×100) if the numpy and scipy functions are already used.

Here is an example with a speedup factor of about 6:

The main part of the cloudScattering function is evaluation of

\[F(q)= \sum_{i=1}^{N} b_i(q)\, e^{i\vec{q}\cdot\vec{r}_i}\]

for e.g. \(N=20000\). We calculate \(S(q)=F(q) \cdot F^*(q) = |F(q)|^2\) and pass \(F(q)\) for the calculation of the spherical averages \(S(q) = \langle |F(q)|^2 \rangle\) and \(\beta = |\langle F(q) \rangle|^2 / \langle |F(q)|^2 \rangle\).

import numpy as np

def _scattering(point, r, q, blength, formfactor):
    # scattering amplitudes at q, interpolated from the tabulated formfactor
    fa = blength * np.interp(q, formfactor[0, :], formfactor[1, :])
    qx = q * point                                 # wavevector along point on unit sphere
    iqr = np.einsum('i,ji', qx, r) * 1j            # i*q.r for all positions r
    beiqrsum = np.einsum('i,i', fa, np.exp(iqr))   # F(q) = sum_i fa_i exp(i*q*r_i)
    Sq = beiqrsum * beiqrsum.conj()                # |F(q)|^2
    return q, Sq.real, beiqrsum.real

The main work is done in np.einsum and np.exp(iqr), which both are already compiled C functions (standard numpy functions). We may lose time in coming back to Python between these steps.

We can reach a speedup of a factor of 6 if we use a Fortran function inside _scattering. The speedup is higher if the computation was done in Python loops (very inefficient), but we use numpy exactly to avoid such loops. The speedup depends strongly on the model and how it is implemented. Often one is already happy if the Python version simply works.

The Fortran function/subroutine should be encapsulated in a module, as it looks cleaner and the module may contain several functions. Additionally, it works smoothest in combination with f2py to use a module of functions only (always give intent(in) properly for smooth working). To avoid conflicts and dependency problems it is convenient to copy all needed routines into this module. Please respect copyrights.

module cloud
    use typesandconstants
    use utils
    implicit none
contains

function ffq(point, r, q, blength, formfactor) result(ff)
    ! point: point on unit sphere, 3 x 1
    real(dp), intent(in) :: point(:)
    real(dp)             :: ff(3)
    ! ..... see fortran/cloud.f95 for all declarations

    fa = blength * interpolatesorted(sizerms*q, formfactor(1,:), formfactor(2,:))
    qx = q * point
    iqr = j1 * matmul(r, qx)        ! i*q.r, j1 is the imaginary unit
    Fq = sum(fa * exp(iqr))         ! F(q)
    ff(1) = q
    ff(2) = real(Fq * conjg(Fq))    ! |F(q)|^2
    ff(3) = real(Fq)

end function ffq
end module cloud

The function interface and the compilation are done by f2py, resulting in a shared object file, here fsca.so:

f2py -c filename.f95 -m fsca

The module can be imported and its functions used like any other function in a module:

import fsca

def scattering_f(point, r, q, blength, formfactor):
    """ A docstring is needed if the function should be used.
    Parameters ...
    Returns ...
    """
    # call the compiled Fortran function inside the f2py module
    ret = fsca.cloud.ffq(point, r, q, blength, formfactor)
    return ret

The wrapper is also needed if this function should be called in parallel (using multiprocessing). The reason is that the function processed in parallel needs to be serialized (pickled), and f2py functions cannot be; the wrapper can.
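
A short sketch of this difference, assuming the fsca module built above:

import pickle
import fsca                          # the f2py module built above

def scattering_f(*args):
    # thin Python wrapper around the compiled function
    return fsca.cloud.ffq(*args)

# the raw f2py function cannot be serialized:
# pickle.dumps(fsca.cloud.ffq)       # raises TypeError: cannot pickle 'fortran' object

# the wrapper is pickled by reference (module name + function name)
pickle.dumps(scattering_f)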

If multiprocessing is desired and each process needs to get large arrays, Python multiprocessing uses a lot of memory (without shared memory each process gets a pickled copy). In this case Fortran with OpenMP is easier for multiprocessing with shared memory.
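
Shared memory is possible from Python as well, but verbose compared to an OpenMP loop; a standard-library sketch (Python >= 3.8):

import numpy as np
from multiprocessing import shared_memory

a = np.arange(1_000_000, dtype=np.float64)

# allocate a shared block and view it as an ndarray
shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
b[:] = a          # copy the data once into shared memory

# workers attach by name instead of receiving a pickled copy:
# shared_memory.SharedMemory(name=shm.name)

shm.close()
shm.unlink()      # free the block when done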

Parallel execution with Fortran using OpenMP is used in cloudScattering. Different compiler options for OpenMP with other compilers may lead to failure. The speedup is not very big: in a similar example as above (with individual atomic formfactors) it is about a factor of 1.5 on a 6-core Ryzen for a very complex function. In that specific case the reduced memory consumption was worth it.

Including in a Jscatter clone

For contributing or your own development you need to clone the Jscatter repository (see GitLab for documentation):

git clone https://gitlab.com/biehl/jscatter.git

To use the Jscatter clone, run in the main clone directory (don't forget the last dot “.”):

pip install --user -e .

The install procedure compiles and builds the wrapper for all source files in the source folder. Additionally, a link to the clone is placed in your Python site-packages folder.

To include a new function in the Jscatter package we only need to place the working Fortran module in the source folder and repeat the above command. The setup procedure adds the new function to the fscatter module. The function is accessible in a Jscatter module after import of fscatter:

from . import fscatter
data=fscatter.cloud.ffq(q,..... )

The module can be imported where needed, and the Python wrapper with documentation can be placed in the appropriate module where it is used.

If you are happy with it and want to contribute, send it to the author or open a merge request on GitLab.