<h4><em>System attributes</em></h4>
<p><em>This subsection defines requirements related to reliability, availability and security of the system. </em></p>

<p>Information related to verification and all other additional information can be added here in the form of appendices (or independent sections, if you prefer).</p>

<h2>A suggested way for defining requirements</h2>
<p>The requirements definition process usually starts with a kick-off meeting where the decision to start the new project is made. Even if you work on the project independently, it is always helpful to discuss it with someone first. The outputs of the first meetings are usually very raw, but they help later in the process. It is always helpful to keep written documentation (minutes of the meeting) even in the early stages. Once you are aware of the raw content of the project, you can try to create the first broader description with a focus on the components of the output. This will help you map everything that is needed in the project.</p>

<p>After you roughly know what the project is about and what the top-level components are, you can start creating requirements sets. The preferred approach is to use iteration and recursion in this process: iterate through all elements and recursively define requirements to the desired depth. Usually, requirements are repetitive - especially if elements (components) of the system are similar. When it comes to the required depth, it is necessary to use common sense and cover only up to a reasonable level. However, requirements should cover every important aspect of the system, so after reading the SRS, everyone should know what the system is about and how it works. It is also necessary to collect regular feedback from prospective users (customers) and all stakeholders during the process.</p>

<p>It is also worth noting that adding figures and diagrams to the documents is always helpful. Many smart tools can help, for example, Lucidchart (paid service) or Draw.io (free). Other tools can help you keep documentation safe, for example, GitHub wiki pages, which use Git to store documentation in Markdown (MD) files and render them online. A GitHub wiki with Markdown is arguably the cleanest combination for storing and versioning your documentation. However, if you prefer a more high-level approach, Confluence or SharePoint are good choices (although Confluence has become a rather resource-hungry application, so the author cannot recommend it).</p>

<h2>Summary</h2>
<p>Defining requirements at the beginning of the project is always beneficial. It helps to clear minds and specify what exactly is needed. Ultimately, clearly defined requirements at the beginning of the project can save a lot of effort and money. The most popular standard for requirements engineering is ISO/IEC/IEEE 29148:2018. It provides a thorough description of all required processes, approaches, relevant documents and other outcomes. From the technical point of view, the most significant document is the Software Requirements Specification (aka SRS). After reading the SRS, every engineer should know what the system does and be capable of designing it. When writing requirements documents, it is always necessary to use common sense. Also, many tools can help you keep documentation safe (like GitHub wiki or SharePoint), and other tools can help you with diagrams (like Lucidchart or Draw.io).</p>
"""

ENTITY = cr.Article(title="Practical aspects of requirements engineering",
                    url_alias='practical-aspects-of-requirements-engineering',
                    large_image_path="images/req_big.jpg",
                    small_image_path="images/req_small.jpg",
                    date=datetime.datetime(2021, 12, 17),
                    tags=[
                        cr.Tag('Project Management', 'project-management'),
                        cr.Tag('Design', 'design'),
                        cr.Tag('Essentials', 'essentials'),
                        cr.Tag('Performance', 'performance'),
                        cr.Tag('Administration', 'administration')
                    ],
                    content=content,
                    lead=lead)
<h2>Infrastructure</h2>
<p>You have to consider many things when it comes to designing infrastructure. One of them: do not believe you can stay cloud-agnostic for long. Also, do not believe there is a significant difference between cloud providers (AWS, G-Cloud and Azure are more or less the same). It is also necessary to use an infrastructure-as-code approach (for example, Terraform, Kubernetes). Finally, regarding deployment in the cloud, you need to be aware of the astronomical prices - expect at least a £1,000 bill per month (in 2021 prices) plus other costs related to operational work. You can also easily keep a full-time DevOps engineer busy with it.</p>
<p>It also makes perfect sense to have a powerful local machine for data science computations. However, when using Python, be aware that you cannot rely on multithreading because of the GIL (Global Interpreter Lock). So it is better to have a machine with a high single-core boost clock (Turbo Boost) than one with many cores.</p>
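<p>The following minimal sketch illustrates the point about the GIL: a CPU-bound function run in several threads takes roughly as long as running it sequentially, because only one thread executes Python bytecode at a time (the function and numbers below are purely illustrative).</p>
<pre class="code"><code>import threading
import time

def cpu_bound(n: int) -> int:
    # Purely CPU-bound work - no I/O, so threads cannot overlap under the GIL
    return sum(i * i for i in range(n))

N = 2_000_000

start = time.perf_counter()
for _ in range(4):
    cpu_bound(N)
print("sequential:", time.perf_counter() - start)

start = time.perf_counter()
threads = [threading.Thread(target=cpu_bound, args=(N,)) for _ in range(4)]
for thr in threads:
    thr.start()
for thr in threads:
    thr.join()
# Roughly the same (or worse) wall time as the sequential run
print("threaded:  ", time.perf_counter() - start)</code></pre>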
<h2>Team and budget</h2>
<p>Generally, it does not make sense to start your project if you cannot hire at least five full-time employees for at least two (or rather three) years. You will need at least one or two people dedicated to data science, one or two dedicated to back-end development and the same for front-end development. On top of that, it is helpful to have applied meteorologists on your team and a full-time project manager. When it comes to the organization of people, SCRUM/agile is arguably the only feasible path to success. The project price is usually between £200,000 and £400,000 annually (in 2021 prices).</p>
<p>All these things need to be considered when you are preparing a budget. Also, be aware that geospatial data are costly (free data are almost useless for practical purposes).</p>
<h2>Summary</h2>
<p>Suitable programming languages for such systems were discussed. There are many options available, but the most popular is Python, because open-source libraries available in Python can significantly help with development. Other languages (mainly Go) have advantages, but they lack essential libraries. The standard choice of web application framework is Django (another option is to use FastAPI and design the system as microservices). Before starting the project, knowing the price and the required team size is essential - generally, at least five people are needed to be successful.</p>
"""

ENTITY = cr.Article(
    title=
    "Software engineering perspective of the system for renewable energy prediction",
    url_alias=
    'software-engineering-perspective-of-the-system-for-renewable-energy-prediction',
    large_image_path="images/system_ren_big.jpg",
    small_image_path="images/system_ren_small.jpg",
    date=datetime.datetime(2021, 4, 2),
    tags=[
        cr.Tag('Python', 'python'),
        cr.Tag('Design', 'design'),
        cr.Tag('Renewable energy', 'renewable-energy'),
        cr.Tag('Geospatial', 'geospatial'),
        cr.Tag('Web application', 'web-application')
    ],
    content=content,
    lead=lead,
    description=
    "Article focused on the design and implementation details of the system for the prediction of renewable energy with some project management perspective."  # noqa: E501
)
<h2>Where to get information?</h2>
<p>People often wonder how they could know all these essential things. Maybe nobody has shared them so far. Also, some people are selfish - they need to prove that they are necessary to their employer, and the simplest way to achieve this is not to share any knowledge - no one can replace you if nobody knows how to deal with the problems. Unfortunately, mindsets like this are prevalent (especially in start-up environments).</p>

<p>Online courses are also often not sufficient - not every lecturer is highly competent in the subject, or they are paid to advertise one concrete technology. Also, every acceptable course needs to be based on some simple example that does not cover all use cases (it is not easy to deduce all possible use cases of a particular technology).</p>

<p>Similar problems arise when it comes to literature. There are, however, a few exceptions. For example, one of the very interesting books about data processing issues is Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (author: Martin Kleppmann). I strongly recommend this book to everyone who deals with applications based on big data.</p>

<h2>Summary</h2>
<p>There are two ways to process data - OLAP and OLTP. Online Analytical Processing (OLAP) is the approach for processing big data sets - mainly used to answer complex analytical queries. On the other hand, software engineers use Online Transaction Processing (OLTP) when they develop applications (the response from the database has to be returned to the user quickly). Each of these approaches has its challenges (and its suite of technologies). The differences can cause tunnel vision for engineers, leading to practical problems (either too expensive or too slow applications). Therefore, it is important to reflect these problems in the project's design phase.</p>
"""

ENTITY = cr.Article(
    title="Two universes in the big data environment",  # noqa: E501
    url_alias='two-universes-in-the-big-data-environment',  # noqa: E501
    large_image_path="images/bd_new_big.jpg",
    small_image_path="images/bd_new_small.jpg",
    date=datetime.datetime(2021, 10, 2),
    tags=[
        cr.Tag('Database', 'database'),
        cr.Tag('Design', 'design'),
        cr.Tag('Programming', 'programming'),
        cr.Tag('Performance', 'performance'),
        cr.Tag('Essentials', 'essentials')
    ],
    content=content,
    lead=lead,
    description=
    "There are two ways to process data - OLAP and OLTP. Online Analytical Processing (OLAP) is the approach for processing big data sets in analytical queries."  # noqa: E501
)
<p>There are many cloud-service providers - the most popular ones are Google Cloud Platform, Amazon Web Services, and Microsoft Azure. Many other smaller providers also exist. When it comes to the main providers, it is quite difficult to find many differences - the services provided are similar in both quality and pricing. Naturally, some of them are better for one or another technology - Azure for Windows-based services, AWS for DynamoDB or S3, G-Cloud for everything from Google, etc. When it comes to prices, for a typical application that deals with large datasets, it is usually something around £1,000 - £3,000 per month (at the 2021 price level).</p>
<p>Small providers usually offer either dedicated servers or virtual machines where you need to deploy your own logic using Kubernetes, Docker Swarm or a similar technology. Prices are usually lower - but you need to consider one important aspect: components in such an infrastructure are (usually) always on and running. Big cloud infrastructure providers bill every minute a resource is on - so you can set up a smart policy that spins up a particular resource only for the time when it is really needed and then turns it off again (this can save a lot of money).</p>
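<p>As a minimal sketch of such a policy (assuming AWS and the boto3 library; the instance ID and region below are placeholders), a scheduled job can simply stop a compute instance outside working hours and start it again in the morning:</p>
<pre class="code"><code>import boto3

# Placeholder instance ID - replace with a real one
INSTANCE_ID = "i-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name="eu-west-2")

def stop_worker() -> None:
    # Called by a scheduler (e.g. cron or a cloud scheduler) in the evening
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])

def start_worker() -> None:
    # Called by the scheduler in the morning, before the workload starts
    ec2.start_instances(InstanceIds=[INSTANCE_ID])</code></pre>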
<p>Generally, setting up infrastructure correctly is not always a simple task. The usual approach is to use Infrastructure as Code (IaC), for example Terraform, which can help to set up services and policies - like when to run a particular instance, security, access management, etc. For many popular applications (e.g. CKAN, WordPress), there are usually ready-made solutions on GitHub. Some popular components (e.g. Elasticsearch, databases) are available as managed services (on most cloud providers), which can save a lot of configuration time.</p>

<h2>On-premises solution</h2>
<p>A special case is when all servers (machines) are physically present in the office - the so-called on-premises solution. Although this approach is considered by many to be an anachronism, it is still applicable and meaningful for many applications. There are also cases when it is the only way - for example, when processing sensitive datasets (like classified data or medical records).</p>
<p>The main disadvantage is that hardware requires maintenance (often a dedicated employee), computers require physical space, and there are many additional costs (particularly electricity bills). The advantages are also quite clear - you can rely on your own infrastructure, use it whenever you want, etc. In many cases, on-premises computers are the cheapest option - for example, when used for scientific computations or similar time-consuming processes (e.g. video processing, satellite image classification). This is because cloud-based services are really expensive when powerful machines are required.</p>

<h2>Useful SAAS</h2>
<p>There are many useful SaaS (software as a service) tools that can be used during development. For example, a free account on Dropbox (or Google Drive) can work as archive storage for larger files. The same holds true for similar services around Office 365. In many cases (like Dropbox), there is also an API for most programming languages that allows handling the stored files. Often there is also no need to pay for expensive SaaS in the cloud - as in the case of SMTP servers, which can easily run externally (on cheap web hosting). Similarly, OpenStreetMap works sufficiently well for most cases and does not require any expensive licence. It is often good to think outside the box and consider things from a wider perspective (especially if there is no near-unlimited budget available).</p>
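<p>For illustration, a minimal sketch of archiving a file via the official Dropbox Python SDK might look as follows (the access token, file name and target path are placeholders; error handling is omitted):</p>
<pre class="code"><code>import dropbox

# Placeholder token - generate one in the Dropbox developer console
dbx = dropbox.Dropbox("ACCESS_TOKEN")

with open("backup_2021-12.tar.gz", "rb") as file_handle:
    # Upload the archive into an app folder on Dropbox
    dbx.files_upload(file_handle.read(), "/archives/backup_2021-12.tar.gz")</code></pre>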

<h2>Summary</h2>
<p>There are many things that have to be considered when the infrastructure for an application is planned. One of the most common dilemmas is whether the system should run in the cloud or on-premises. Cloud services provide many comfortable ready-made solutions but are, generally speaking, quite expensive. One of the biggest advantages of cloud services is the possibility to run your solution only when needed (so no computer has to run permanently). The on-premises solution is in many cases cheaper (especially when used for longer computations). Also, there are many lesser-known services outside of the cloud-service providers that can save a lot of money (like using Dropbox for archiving).</p>
"""

ENTITY = cr.Article(
    title="DevOps challenges of the system for processing big data",
    url_alias='devops-challenges-of-the-system-for-processing-big-data',
    large_image_path="images/devops_big.jpg",
    small_image_path="images/devops_small.jpg",
    date=datetime.datetime(2020, 3, 7),
    tags=[cr.Tag('Python', 'python'),
          cr.Tag('Design', 'design'),
          cr.Tag('DevOps', 'devops'),
          cr.Tag('Big Data', 'big-data'),
          cr.Tag('Web application', 'web-application')],
    content=content,
    lead=lead,
    description="There are many common challenges related to systems that process large data sets. The most important decision is if to deploy on a cloud service or locally."  # noqa: E501
)
<h2>What is not yet available outside GDAL?</h2>
<p>There are a few applications of GDAL where no other sufficient product is available. One of the very common applications is geospatial coordinate transformation (for example, from EPSG:4326 to EPSG:3857). There are some ways to circumvent this issue using the underpinning library of QGIS - but this cure may be worse than the disease (as installing QGIS is in many ways more difficult than installing GDAL). For these specific purposes, GDAL is a very helpful tool. Also, it is good to be aware that many libraries mentioned above use GDAL internally (it is just intelligently wrapped so that you can install the library in a user-friendly way).</p>
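<p>A minimal sketch of such a transformation with GDAL's osr module is shown below (note that the axis order for EPSG:4326 differs between GDAL versions, so the latitude/longitude order should be verified for your installation):</p>
<pre class="code"><code>from osgeo import osr

# Source: WGS 84 (geographic coordinates), target: Web Mercator
source = osr.SpatialReference()
source.ImportFromEPSG(4326)
target = osr.SpatialReference()
target.ImportFromEPSG(3857)

transformation = osr.CoordinateTransformation(source, target)

# With GDAL 3+, EPSG:4326 expects latitude first by default
x, y, z = transformation.TransformPoint(51.5074, -0.1278)
print(x, y)</code></pre>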
<p>Another library worth mentioning is Fiona - even though it uses GDAL internally, it can be easily installed using pip. It is beneficial for creating custom shapefiles and for processing shapefiles generally. Moreover, Fiona is performance-optimized (working faster than the GeoPandas library, which can also be used in a limited way for these purposes).</p>
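<p>For example, reading features from a shapefile with Fiona can be as simple as the following sketch (the file name is an illustrative placeholder):</p>
<pre class="code"><code>import fiona

# Iterate through all features in a shapefile
with fiona.open("regions.shp") as collection:
    print(collection.crs)  # Coordinate reference system of the layer
    for feature in collection:
        # Each feature exposes a GeoJSON-like geometry and an attribute table
        print(feature["properties"], feature["geometry"]["type"])</code></pre>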

<h2>Tools influenced by GDAL</h2>
<p>Many tools use GDAL internally. The most typical example is GeoServer - a popular tool for serving map tiles (and layers), developed again by OSGeo. Unfortunately, it has many disadvantages. Similarly to GDAL, the code quality is poor. In addition, the application is written as a monolithic single-server application - definitely not a small service that is simple to scale (scaling is generally very difficult), which causes challenges for DevOps engineers. Also, recovery processes often fail, which makes GeoServer challenging to use.</p>
<p>GDAL also influenced many other tools, for example, PDAL. It processes point cloud data and uses similar logic as GDAL. The most common use case is the processing of LiDAR data. Similarly to GDAL, it is available under the BSD licence.</p>

<h2>Summary</h2>
<p>This article presents some of the most common use cases for dealing with geospatial data and tools for handling them (tools that are quickly accessible and easy to use and install). All these tools can replace GDAL in most applications. Furthermore, these tools are Python libraries simply installable via pip - namely: rasterstats, rasterio, shapely, Fiona, netCDF4, geopandas, xarray. So it is worth spending some time studying them. However, it is good to be aware that many of these tools use GDAL internally (but in a way that makes installation easier).</p>
"""

ENTITY = cr.Article(
    title="How to replace GDAL with more efficient tools?",
    url_alias='how-to-replace-gdal-with-more-efficient-tools',
    large_image_path="images/gdal_replacement_big.jpg",
    small_image_path="images/gdal_replacement_small.jpg",
    date=datetime.datetime(2020, 10, 28),
    tags=[
        cr.Tag('GDAL', 'gdal'),
        cr.Tag('Big Data', 'big-data'),
        cr.Tag('Python', 'python'),
        cr.Tag('Geospatial', 'geospatial'),
        cr.Tag('NetCDF', 'netcdf')
    ],
    content=content,
    lead=lead,
    description=
    "Python library GDAL for processing of geospatial data has become a synonym of obsoleteness and inefficiency. There are fortunately better tools to replace it."  # noqa: E501
)
thr.join()</code></pre>
<p>The event handler method contains an infinite loop in practice. Also, tasks are usually added to the queue at a random time.</p>
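<p>A minimal sketch of such an event handler is shown below - a worker thread runs an (effectively) infinite loop and consumes tasks from a queue as they arrive, with a sentinel value used for shutdown (names are illustrative):</p>
<pre class="code"><code>import queue
import threading

tasks = queue.Queue()
STOP = object()  # Sentinel value signalling the worker to finish

def worker() -> None:
    # Infinite loop consuming tasks as they arrive at random times
    while True:
        task = tasks.get()  # Blocks until a task is available
        if task is STOP:
            break
        print("processing", task)
        tasks.task_done()

thr = threading.Thread(target=worker, daemon=True)
thr.start()

for item in range(5):
    tasks.put(item)  # Tasks can be added at any time

tasks.put(STOP)
thr.join()</code></pre>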

<h2>Process-based parallelism</h2>
<p>The main disadvantage of process-based parallelism is the overhead of creating the process (and fetching results from it). This makes processes helpful almost exclusively for long-running algorithms (where processing takes a lot of time) or for special use cases (like the publish-subscribe pattern).</p>
<p>Python contains a built-in library called multiprocessing. It contains all the tools needed to create multiple processes (especially queues and the encapsulation of processes as tasks). But generally speaking, this library is a bit fragile (in terms of portability), and unless you know what you are doing, it is better to avoid it.</p>
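<p>For completeness, a minimal sketch of the built-in library is shown below - a pool of worker processes maps a CPU-heavy function over a list of inputs (the function itself is only illustrative):</p>
<pre class="code"><code>from multiprocessing import Pool

def heavy_computation(n: int) -> int:
    # Stand-in for a long-running, CPU-bound task
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # The __main__ guard is required for portability (e.g. on Windows/macOS)
    with Pool(processes=4) as pool:
        results = pool.map(heavy_computation, [10_000_000] * 4)
    print(results)</code></pre>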
<p>Another approach is to use permanent workers that are idle and wait for tasks sent from the master process. This is an example of the publish-subscribe pattern. The typical use case is a web application that performs computations that consume a lot of resources (or just take a lot of time). In this case, you need to separate them from the main platform (as you cannot afford to wait because of the risk of timeouts).</p>
<p>Arguably the most popular tool for this purpose is called Celery. It allows you to use queues in a very natural way. Celery is perfect for simple applications - like sending an email or running some time-consuming queries on a database. However, if you want to use Celery for more complex applications, you have to expect trouble (especially when you want to dynamically spin up multiple workers based on demand).</p>
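<p>A minimal Celery sketch might look like the following (the broker URL and the task body are placeholders; a real application would configure a result backend, retries, etc.):</p>
<pre class="code"><code>from celery import Celery

# Placeholder broker URL - e.g. a local Redis instance
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def send_email(recipient: str, subject: str) -> None:
    # Stand-in for a slow operation executed by a separate worker process
    print(f"Sending '{subject}' to {recipient}")

# In the web application, the task is only enqueued (returns immediately):
#   send_email.delay("user@example.com", "Welcome")
# A worker started with `celery -A tasks worker` picks it up and runs it.</code></pre>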


<h2>Summary</h2>
<p>This article presents the most important features of asynchronous programming in Python - namely, the libraries asyncio and threading (and the differences between them). They allow you to implement simple multithreading tasks as well as complex pipelines. The limits of multithreading in Python (CPython) related to the GIL (Global Interpreter Lock) are also presented. Some other approaches are discussed as well (like multiprocessing and how to implement the publish-subscribe pattern using Celery).</p>
"""

ENTITY = cr.Article(
    title="Practical aspects of asynchronous programming in Python",
    url_alias='practical-aspects-of-asynchronous-programming-in-python',
    large_image_path="images/asyncio_big.jpg",
    small_image_path="images/asyncio_small.jpg",
    date=datetime.datetime(2021, 5, 16),
    tags=[cr.Tag('Python', 'python'),
          cr.Tag('Design', 'design'),
          cr.Tag('Programming', 'programming'),
          cr.Tag('Performance', 'performance'),
          cr.Tag('Essentials', 'essentials')],
    content=content,
    lead=lead,
    description="Asynchronous programming in Python has some specific. A developer is limited by GIL and effectively the only reasonable application is using the pipeline logic."  # noqa: E501
)
<p>The first problem is the simple one; the best approach is to have the time dimension in the last place.</p>
<p>On the other hand, the second use case is quite problematic. The optimal dimension order has to be determined by the average size of the polygons. If your polygons are big enough, having the time dimension in the first place is the best option (and having it in the last place is the worst). This is due to the computation that is performed at each time step. If you have a big polygon and the time dimension in the first place, you can very swiftly compute the average value at each time step. On the other hand, if the time dimension is in the last place, you must read each column (in the geometric representation) separately, which is slow.</p>
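<p>The following NumPy sketch illustrates the difference (shapes and names are purely illustrative): computing a per-time-step spatial mean over a polygon's bounding box reads one contiguous block per time step when time is the first dimension, but gathers scattered elements when time is the last one.</p>
<pre class="code"><code>import numpy as np

rng = np.random.default_rng(0)
T, Y, X = 500, 100, 100

# The same data materialized in two different dimension orders
time_first = rng.random((T, Y, X))                                # (time, lat, lon)
time_last = np.ascontiguousarray(np.moveaxis(time_first, 0, -1))  # (lat, lon, time)

# Spatial average over a (large) polygon's bounding box for every time step
mean_a = time_first[:, 25:75, 25:75].mean(axis=(1, 2))  # contiguous per time step
mean_b = time_last[25:75, 25:75, :].mean(axis=(0, 1))   # scattered per time step

assert np.allclose(mean_a, mean_b)</code></pre>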

<h2>Column-oriented data stores (and DBMS)</h2>
<p>There is one surprising place where the dimension order matters a lot: the environment of Apache Hadoop-based databases (or similar proprietary environments). Technically, a database table is just a two-dimensional array. And again, the order of dimensions matters. Typically, rows are the last index when data are serialized. That is meaningful because the typical request is to select rows based on some condition. However, when doing analytical processing (OLAP), the standard request might look very different. For example, if you mainly count sums or averages or select whole series of columns, the optimal order would be to have columns as the last dimension. There are analytical tools (in the Hadoop environment) that take this into account - like Apache Kudu, which implements precisely this logic. Many other database systems are also available (not only based on HDFS) - like time-series databases (most commonly InfluxDB).</p> 
<p>A typical example of storing massive datasets is the telemetry of a large e-shop. Business analysts often need to analyze how many items have been sold, the average price of a package, the weight of items, how many things are in stock, and many other figures. These are typical examples of analytical queries performed primarily on the column space. They also take a lot of time to run (if you have more than a million items), so the optimization makes perfect sense.</p>
<h2>Read vs write trade-off</h2>
<p>Again, it is important to know that if your data arrive in some specific dimension order, it is not trivial to store them in a different order. It is always important to be aware of the trade-off between fast reading and fast writing. So, if you write data just for archiving (and do not expect frequent processing), it is handier to write them as they are and not to permute any dimensions. The problem of dimension permutation is even more challenging if you write a massive amount of data (that cannot fit in operational memory). In this case, chunking of the data should take place. Generally, there is no simple manual for dealing with this issue - you have to use your common sense.</p>
<h2>Remarks and summary</h2>
<p>Suppose you are lucky enough to have a simple problem that fits the above categories. In that case, you can achieve a significant performance improvement by swapping the dimension order of your data. In many cases, unfortunately, there is no simple solution. You may often need more than one copy of the data to optimize performance (each with a different dimension order). The trade-off between writing data in a changed dimension order (which is slow) and reading them must also be considered. Do not be afraid to use your common sense when dealing with this issue, as there is generally no precise manual for your problem.</p>
"""

ENTITY = cr.Article(
    title="Dimension order problem when storing big data",
    url_alias='dimension-order-problem-when-storing-big-data',
    large_image_path="images/dim_orders_big.jpg",
    small_image_path="images/dim_orders_small.jpg",
    date=datetime.datetime(2020, 9, 26),
    tags=[
        cr.Tag('Dimensions', 'dimension'),
        cr.Tag('Big Data', 'big-data'),
        cr.Tag('Performance', 'performance'),
        cr.Tag('Geospatial', 'geospatial'),
        cr.Tag('NetCDF', 'netcdf')
    ],
    content=content,
    lead=lead,
    description=
    "Presents general rules for selecting the optimal order of dimension when storing big multidimensional data. The correct dimension order makes reading faster."  # noqa: E501
)
1 ⋅ 2<sup>0</sup> + 1 ⋅ 2<sup>1</sup> + 1 ⋅ 2<sup>2</sup> + 0 ⋅ 2<sup>3</sup> + 0 ⋅ 2<sup>4</sup> + 1 ⋅ 2<sup>5</sup> + 0 ⋅ 2<sup>6</sup> + 1 ⋅ 2<sup>7</sup> &#8801; 
1 ⋅ 1 + 1 ⋅ 2 + 1 ⋅ 4 + 0 ⋅ 3 + 0 ⋅ 1 + 1 ⋅ 2 + 0 ⋅ 4 + 1 ⋅ 3 &#8801;
12 (mod 5)
</p>

<p>As you can see, we just rewrote the number, replacing each power of two with its value modulo 5. This would work for every modulus. And this is a game-changer: the number on the right side is much smaller than the number on the left side. Also, an algorithm for multiplication is straightforward and fast (compared to division) from the hardware perspective. You can also apply this rule recursively. So, after you shrink the number sufficiently, you can check the divisibility simply by division with a remainder (dividing small numbers is fast enough).</p>
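<p>A small Python sketch of this reduction is shown below (illustrative only - the point is that the intermediate value shrinks quickly, so the final check uses a division of small numbers; in hardware, the residues of the powers of two would be precomputed constants):</p>
<pre class="code"><code>def reduce_modulo(number: int, modulus: int) -> int:
    """Replace each power of two in the binary expansion of `number`
    by its value modulo `modulus` and sum the results."""
    total = 0
    power_residue = 1  # residue of 2**0 modulo `modulus`
    while number:
        if number % 2:
            total += power_residue
        power_residue = (power_residue * 2) % modulus
        number //= 2
    return total

def is_divisible(number: int, modulus: int) -> bool:
    # Apply the reduction recursively until the value is small,
    # then finish with an ordinary division with remainder.
    while number >= modulus * modulus:
        number = reduce_modulo(number, modulus)
    return number % modulus == 0

print(is_divisible(0b10100111, 5))  # 167 is not divisible by 5 -> False</code></pre>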
<p>The presented approach provides a good basis for optimizing the GNFS algorithm. You can also use the feature of modern FPGAs that allows re-programming their content based on a specific condition (for example, checking the divisibility of some input chunk of integers against some subset of the factor base in each step). Also, you can use pipeline logic to optimize throughput.</p>

<h2>Optimizations on the client-side</h2>
<p>As mentioned above, it is critically important to implement RSA correctly on the client side. However, there are many problems related to a successful implementation. Arguably the most crucial issue is random number generation. It is not trivial to generate truly random numbers on a deterministic device (which means all computers, terminals and similar). This is also the reason why random numbers are called pseudo-random in information technology. It is therefore essential to follow standards and avoid custom implementations whenever possible. If you are forced to use some custom solution, pay extra attention to random integer generation.</p>
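<p>In Python, for instance, the standard library already provides a cryptographically strong generator in the secrets module, so there is no reason to roll your own (a minimal sketch; the bit length is illustrative):</p>
<pre class="code"><code>import secrets

# Cryptographically strong randomness (backed by the operating system)
session_token = secrets.token_hex(32)   # e.g. for session or reset tokens

# Random odd candidate with the top bit set - a starting point for prime search
candidate = secrets.randbits(1024) | (2 ** 1023) | 1

# Never use the `random` module for anything security-related -
# it is a deterministic Mersenne Twister intended for simulations.</code></pre>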
<p>Another problem related to implementation is verifying whether an input is a prime number or not (a critical feature from the security point of view). The only correct approach is to use a combination of some quick algorithm (like the Euler pseudoprime test) and the AKS test (which can answer whether the input is prime with 100% confidence). The logic behind this split is that the Euler pseudoprime test is much faster than AKS - so you can quickly sieve inputs before using the AKS test. It is also worth noting that many developers (and mathematicians) are still not aware that the primality test can be done with absolute confidence by a deterministic algorithm working in polynomial time. The mentioned AKS test (Agrawal-Kayal-Saxena primality test) was presented in 2002 (so it is still a relatively new algorithm). There are many implementations already available (and worth using).</p>
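<p>The fast pre-sieve can be sketched as follows (an Euler-style pseudoprime check based on Euler's criterion; candidates that survive it would then be passed to a deterministic test such as AKS, which is not implemented here):</p>
<pre class="code"><code>import secrets

def euler_pseudoprime_test(n: int, rounds: int = 10) -> bool:
    """Quick probabilistic sieve: returns False if n is surely composite,
    True if n is a probable prime (a deterministic test should follow)."""
    if n in (2, 3):
        return True
    if n == 1 or n % 2 == 0:
        return False
    for _ in range(rounds):
        a = 2 + secrets.randbelow(n - 3)        # random base in [2, n - 2]
        # Euler's criterion: for a prime n, a**((n-1)/2) is +1 or -1 (mod n)
        if pow(a, (n - 1) // 2, n) not in (1, n - 1):
            return False
    return True

print(euler_pseudoprime_test(2_147_483_647))  # a known Mersenne prime -> True</code></pre>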

<h2>Conclusions</h2>
<p>Even though there is no known method for successfully attacking RSA with a 1024-bit key, such a method may well be created soon. There are two main threats to RSA security: the first is represented by quantum computers, the second by fast implementations of the GNFS factorization method. We can expect advancement in both areas soon. Quantum computers are the subject of intensive research by large companies such as Google, IBM and Microsoft. The GNFS factorization method can now be implemented in a highly effective form on commodity devices (FPGAs), which have more significant potential than ever before. Due to these facts, it is highly recommended to migrate your systems from RSA to different cryptosystems, optimally quantum-resistant ones, or to increase the modulus size to 4096 bits.</p>
"""

ENTITY = cr.Article(title="The security perspective of RSA cryptosystem",
                    url_alias='the-security-perspective-of-rsa-cryptosystem',
                    large_image_path="images/security_big.jpg",
                    small_image_path="images/security_small.jpg",
                    date=datetime.datetime(2019, 1, 22),
                    tags=[
                        cr.Tag('Security', 'security'),
                        cr.Tag('Web application', 'web-application'),
                        cr.Tag('RSA', 'rsa'),
                        cr.Tag('Design', 'design'),
                        cr.Tag('Cryptosystem', 'cryptosystem')
                    ],
                    content=content,
                    lead=lead)
<figure>
    <img src="images/prj_github.gif" alt="Figure 4: A concrete example of GitHub project board">
    <figcaption>Figure 4: A concrete example of GitHub project board</figcaption>
</figure>
<p>GitHub significantly updated its project management features in late 2021 - there is now support for backlogs, links to GitHub Issues, and many other interesting features that make GitHub projects suitable even for large projects.</p>

<h3>Zenhub</h3>
<p>Zenhub is an extension of GitHub. Compared to native GitHub projects, it adds many features - like epics, performance monitoring, and others. However, this solution is a bit clumsy compared to JIRA - the way Zenhub deals with epics is ineffective. Also, there is no clear split between sprints - the whole task set is always on the board, which makes the last column very big (and the entire board does not look clean). Rather than using Zenhub, it is worth considering other alternatives.</p>

<h2>Summary</h2>
<p>There are many ways to manage a project. Most frameworks use agile logic, splitting each project into smaller tasks and splitting the time frame into sprints. The opposite is the waterfall approach, where things are done step by step using bigger tasks. SCRUM is currently the most popular agile framework; it has its own rituals and naming conventions. Many tools can help with project management. Arguably the most popular is the paid service called JIRA. However, it has many flaws. That is why many other tools have emerged recently - like Zenhub and GitHub projects. It is worth doing some testing before choosing the right project management tool for your project.</p>

"""

ENTITY = cr.Article(
    title="Organizing work in software engineering projects (Agile, SCRUM and Waterfall)",
    url_alias='organizing-work-in-software-engineering-projects-agile-scrum-and-waterfall',
    large_image_path="images/team_big.jpg",
    small_image_path="images/team_small.jpg",
    date=datetime.datetime(2020, 6, 12),
    tags=[cr.Tag('Project Management', 'project-management'),
          cr.Tag('Design', 'design'),
          cr.Tag('Essentials', 'essentials'),
          cr.Tag('Performance', 'performance'),
          cr.Tag('Administration', 'administration')],
    content=content,
    lead=lead,
    description="This might help you to decide how to organise work in your project by presenting available frameworks (particularly Waterfall, Agile and SCRUM frameworks)."  # noqa: E501
)
<h3>Pipeline</h3>
<p>The pipeline is a useful concept in Unix. It passes the output of one command as the input of another. It is characterised by the '|' symbol. A typical example is:</p>
<pre class="code"><code>ps -e | grep "firefox"</code></pre>
<p>This example prints all the system processes and then finds those with the name "firefox" using the grep command.</p>

<h3>Print to file and append to file</h3>
<p>If you want to print the output of a command to a file, use the '>' symbol. For appending to an existing file, use '>>'.</p>
<pre class="code"><code>ls -la > list_of_files.txt</code></pre>
<p>This example creates a file list_of_files.txt with a list of all entities in the directory.</p>
<h2>Summary</h2>
<p>This article describes the most popular Unix commands. The list is naturally not complete, but it covers the most important ones. These commands are as general as they can be: they are typically available on all Linux distributions, and some of them are also available in the macOS terminal (as macOS is technically a Unix system). If you spend some time searching through other articles, you can find a lot of other helpful tools.</p>
"""

ENTITY = cr.Article(
    title=
    "Helpful commands for Linux terminal with a quick introduction to Unix shell",
    url_alias=
    'helpful-commands-for-linux-terminal-with-a-quick-introduction-to-unix-shell',
    large_image_path="images/terminal_big.jpg",
    small_image_path="images/terminal_small.jpg",
    date=datetime.datetime(2020, 7, 17),
    tags=[
        cr.Tag('Linux', 'linux'),
        cr.Tag('Programming', 'programming'),
        cr.Tag('Essentials', 'essentials'),
        cr.Tag('Terminal', 'terminal'),
        cr.Tag('Administration', 'administration')
    ],
    content=content,
    lead=lead)
    open("source.json").read()
)

sites.generate_pages(
    # Output path
    pathlib.Path("./demo")
)</code></pre>

<p>There are many configuration options that you can find in the documentation (or in the project's GitHub repository).</p>
<p>To generate a website, install Crinita locally (<code>pip install crinita</code>) and run the Python script defined above. The output is a set of HTML files that can be easily uploaded to some static website hosting (like GitHub Pages).</p>

<h2>Summary</h2>
<p>There are multiple ways to generate static websites. The simplest way for smaller applications is to use a template engine (like Jinja2) directly. For more complex applications like blogs, there are ready-made frameworks like Pelican or Crinita (in Python). They use Jinja2 internally and implement many useful additional features. Another option is to use a front-end framework like Vue.js - this option allows the biggest flexibility, but it has cons like resource-hungry outputs. The third way is to combine a static website generator with a front-end framework. This article discusses only the most popular Python frameworks and template engines.</p>
"""

ENTITY = cr.Article(
    title="Suitable ways to generate complex static websites in Python",
    url_alias='suitable-ways-to-generate-complex-static-websites-in-python',
    large_image_path="images/blog_big.jpg",
    small_image_path="images/blog_small.jpg",
    date=datetime.datetime(2020, 11, 3),
    tags=[cr.Tag('Web application', 'web-application'),
          cr.Tag('Programming', 'programming'),
          cr.Tag('Python', 'python'),
          cr.Tag('Design', 'design'),
          cr.Tag('Crinita', 'crinita')],
    content=content,
    lead=lead,
    description="Generators of static pages present a suitable way for the effective creation of secure and compact websites. Crinita is highly efficient Python generator."  # noqa: E501
)
    a(9)</code></pre>

<p>Real problems are typically so complicated that the generated figure is not easily readable (the generated graphs are massive). The tool is also quite fragile and cannot handle many useful things (multithreading and multitasking, many external libraries, etc.).</p>

<h2>Other types of profiling</h2>
<p>So far, only generic code profiling has been discussed. However, there might be other essential metrics depending on the application type. One of the common things for a web application is to measure the number of database hits - meaning how many times the application accesses the database server for a specific request (or set of requests). This information is important because every request is delayed by latency when accessing the database server. A similarly essential metric is the number of input and output operations on disk (or through the network) - for the same reason (each IO operation has its latency). This metric can be measured using the presented tools (focusing on IO operation calls). There is always a trade-off between reading or writing more data less often versus the opposite.</p>
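<p>For instance, a rough way to count IO-related calls with the standard library profiler could look like this sketch (the profiled function and the filter pattern are illustrative):</p>
<pre class="code"><code>import cProfile
import pstats

def process_files() -> None:
    # Stand-in for the real workload whose IO behaviour we want to inspect
    for _ in range(100):
        with open("input.txt", "a+") as handle:
            handle.write("line\n")
            handle.seek(0)
            handle.read()

cProfile.run("process_files()", "profile.stats")

stats = pstats.Stats("profile.stats")
# Restrict the report to read/write-like calls to estimate the number of IO hits
stats.sort_stats("ncalls").print_stats("read|write|open")</code></pre>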

<h2>Summary</h2>
<p>There are many similar tools for code profiling. It is, however, always good to bear in mind that the measurement itself changes the code's behaviour. This article summarises the essential tools available in Python for code profiling. For performance (run-time) profiling, there are cProfile + SnakeViz, or vprof. For profiling memory usage, there are the packages memory-profiler and guppy3. When it comes to dynamic call graphs, there is mainly the package pycallgraph2. All presented packages are available through pip (pycallgraph2 also requires one system package).</p>

<h2>Related GIT repo</h2>
<p>All the source code used is available in the following <a href="https://github.com/david-salac/python-memory-profiling">GIT repo</a>.</p>
<p>See the <a href="https://github.com/zhuyifei1999/guppy3">guppy3 documentation on GitHub</a></p>

"""

ENTITY = cr.Article(title="Helpful tools for code profiling in Python",
                    url_alias='helpful-tools-for-code-profiling-in-python',
                    large_image_path="images/profiling_big.jpg",
                    small_image_path="images/profiling_small.jpg",
                    date=datetime.datetime(2020, 4, 16),
                    tags=[
                        cr.Tag('Data Visualisation', 'data-visualisation'),
                        cr.Tag('Profiling', 'profiling'),
                        cr.Tag('Python', 'python'),
                        cr.Tag('Design', 'design'),
                        cr.Tag('Performance', 'performance')
                    ],
                    content=content,
                    lead=lead)
<p>Deterministic prediction using the algorithm described above is essential - but it provides the estimated value without error bounds. Many investors, banks and insurance companies are, however, interested in worst-case scenarios - something that happens at a probability quantile of around 0.9 or similar. The error is typically modelled using a multivariate Gaussian distribution, reflecting spatial and temporal covariances. Large weather datasets with high temporal resolution are required for accurate results - usually at least ten years with a resolution of ten minutes. Also, it is necessary to have some ground measurements from existing installations to model the errors.</p>
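<p>As a toy illustration of this idea (entirely synthetic numbers), errors for a few correlated sites can be sampled from a multivariate Gaussian and an upper quantile of the aggregated production error can be estimated empirically:</p>
<pre class="code"><code>import numpy as np

rng = np.random.default_rng(42)

# Synthetic example: zero-mean errors at three nearby sites with spatial correlation
mean = np.zeros(3)
covariance = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.6],
    [0.3, 0.6, 1.0],
])

samples = rng.multivariate_normal(mean, covariance, size=100_000)
portfolio_error = samples.sum(axis=1)  # aggregated error of the whole portfolio

# Empirical 0.9 quantile - the "worst-case" bound asked for by investors
print(np.quantile(portfolio_error, 0.9))</code></pre>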

<h2>Summary</h2>

<p>There are many tools for predicting renewable energy from various types of installations. This post focused only on predicting the energy from wind turbines and photovoltaic installations. The most important library for photovoltaic installations is called PVLIB. And the most important source of weather data (for long-term forecasts) is ERA5. When it comes to wind turbines (aka wind energy converters), there is no comprehensive library like PVLIB, and you have to rely on basic formulas. The essential weather data source is again ERA5 (for long-term forecasts). Often, it is also vital to know the error distribution of the prediction. That is not always simple, as ground-based measurements from existing installations are required for the modelling.</p>

<h2>Useful links</h2>
<ol>
    <li>ERA-5 at ECMWF, available at <a href="https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5">https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5</a></li>
    <li>Copernicus Global Land Service, Surface Albedo, available at <a href="https://land.copernicus.eu/global/products/sa">https://land.copernicus.eu/global/products/sa</a></li>
    <li>Fast SZA and SAA computation, available at <a href="https://github.com/david-salac/Fast-SZA-and-SAA-computation">https://github.com/david-salac/Fast-SZA-and-SAA-computation</a></li>
    <li>Jupyter Notebook with codes for predicting energy, available at <a href="https://github.com/david-salac/Renewable-energy-prediction">https://github.com/david-salac/Renewable-energy-prediction</a></li>
</ol>
"""

ENTITY = cr.Article(
    title="Prediction of renewable energy from the software engineering perspective",
    url_alias='prediction-of-renewable-energy-from-the-software-engineering-perspective',
    large_image_path="images/solar_panel_big.jpg",
    small_image_path="images/solar_panel_small.jpg",
    date=datetime.datetime(2020, 5, 25),
    tags=[cr.Tag('Data Visualisation', 'data-visualisation'),
          cr.Tag('Renewable energy', 'renewable-energy'),
          cr.Tag('Python', 'python'),
          cr.Tag('Big Data', 'big-data'),
          cr.Tag('Geospatial', 'geospatial')],
    content=content,
    lead=lead,
    description="Renewable energy sources become more efficient and mainly more available. It also becomes crucial to understand how to predict the energy which we describe."  # noqa
)
<p>Django provides many helpful tools for testing your application. One of the best is the class <code>TransactionTestCase</code> (in the package <code>django.test</code>). It allows you to test models in a very efficient way. More than that, it contains a method called <code>assertNumQueries</code>, which allows you to test the number of queries executed in your test case. This method is particularly helpful for optimizing application performance.</p>
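<p>A minimal sketch of such a test might look as follows (the <code>Article</code> model, its app path and the expected query count are hypothetical placeholders):</p>
<pre class="code"><code>from django.test import TransactionTestCase

from myapp.models import Article  # hypothetical model


class ArticleQueryCountTest(TransactionTestCase):
    def test_listing_uses_single_query(self):
        Article.objects.create(title="First")
        Article.objects.create(title="Second")

        # Fails if the block executes more (or fewer) than one SQL query
        with self.assertNumQueries(1):
            titles = [article.title for article in Article.objects.all()]

        self.assertCountEqual(titles, ["First", "Second"])</code></pre>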

<h3>Using fields specific to DBMS</h3>
<p>Another way to optimize performance is to use fields specific to the particular DBMS provider, most commonly PostgreSQL. For example, it is possible to use fields such as arrays of a specific type or nested JSON structures. These fields are optimized so that you can query them efficiently. Also, many proprietary providers support Django ORM - it is always helpful to study the specific options and optimize as much as possible when designing tables.</p>
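<p>A short sketch of a model using such PostgreSQL-specific fields is shown below (the model, field names and example lookups are illustrative):</p>
<pre class="code"><code>from django.contrib.postgres.fields import ArrayField
from django.db import models


class Measurement(models.Model):
    # Nested JSON structure stored and queryable inside PostgreSQL
    metadata = models.JSONField(default=dict)
    # Array of floats stored natively as a PostgreSQL array column
    hourly_values = ArrayField(models.FloatField(), size=24)

# Example queries on the nested structure / array (illustrative):
#   Measurement.objects.filter(metadata__sensor__type="pv")
#   Measurement.objects.filter(hourly_values__contains=[0.0])</code></pre>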

<h2>Other ORMs</h2>
<p>Many other ORM tools are available, including ORM tools in different languages, but the principles described above are more or less generally valid. The most popular ORM in Python (besides Django ORM) is SQLAlchemy. This ORM is much more low-level than the Django version. It allows querying the database almost as efficiently as raw SQL. That is a considerable advantage when performance matters. But it is simultaneously a significant disadvantage in terms of safety, and it is not a very user-friendly ORM from the programmer's point of view.</p>
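<p>To give a flavour of this lower-level style, here is a minimal SQLAlchemy Core sketch (SQLAlchemy 1.4+ style; an in-memory SQLite database and table are used just for illustration):</p>
<pre class="code"><code>from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

articles = Table(
    "articles", metadata,
    Column("id", Integer, primary_key=True),
    Column("title", String),
)
metadata.create_all(engine)

with engine.begin() as connection:
    connection.execute(articles.insert(), [{"title": "First"}, {"title": "Second"}])
    # Explicitly select only the needed column - close to hand-written SQL
    rows = connection.execute(select(articles.c.title)).fetchall()
    print(rows)</code></pre>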

<h2>Summary</h2>
<p>There are many ways to optimize the performance of your application. One of them is to find the optimal way of accessing data in the database. Django ORM offers many vital tools to help you achieve this goal. Some of the most important are the classes F and Q, which allow you to apply intelligent filters to your queries. Another essential function is QuerySet.values(COLUMNS), which allows you to select only the required columns in a query. There are also beneficial tools for testing models in Django contained in the class TransactionTestCase, among others the function assertNumQueries, which checks the number of database hits in a test case. In the end, some general rules for making an application faster (not dependent on Django) were also presented.</p>
"""

ENTITY = cr.Article(
    title="Optimizing database queries (not only) in Django ORM",
    url_alias='optimizing-database-queries-not-only-in-django-orm',
    large_image_path="images/database_big.jpg",
    small_image_path="images/database_small.jpg",
    date=datetime.datetime(2020, 12, 6),
    tags=[
        cr.Tag('Web application', 'web-application'),
        cr.Tag('Design', 'design'),
        cr.Tag('Performance', 'performance'),
        cr.Tag('Programming', 'programming'),
        cr.Tag('Python', 'python')
    ],
    content=content,
    lead=lead,
    description=
    "Technical analysis presenting ways for effective querying of the database using Django ORM like reducing the number of selected columns or using F or Q class."  # noqa: E501
)
<p>Suppose that we have the following problem: a burglar knows that the probability that someone catches him at any given house is about 20 per cent. So how many houses can he pick to keep the likelihood of being caught below 50 per cent?</p>
<p>As you can see, in this problem you know everything that is needed. The only question is: how many places can the burglar visit? The solution is the maximal <em>n</em> in the inequality:</p>
<p class="center">(1 - 0.2)<sup><em>n</em></sup> > 0.5</p>
<p>In other words, if the probability that someone catches the burglar at each house is <em>p</em> (our 20 per cent) and the probability of not being caught must stay above <em>q</em> (our 50 per cent), the inequality has the form:</p>
<p class="center">(1 - <em>p</em>)<sup><em>n</em></sup> > <em>q</em></p>
<p>To solve this inequality, you can use a simple logarithm (or just try all the options). The solution has the form:</p>
<p class="center"><em>n</em> = floor(log(<em>q</em>) / log(1 - <em>p</em>))</p>
<p>The floor function rounds a number down to the nearest integer. For our case, it gives <em>n</em>&nbsp;=&nbsp;3, which means that the burglar can rob three houses while keeping the probability of being caught below 50 per cent (about 48.8 per cent after the third house).</p>
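<p>The whole computation is a few lines in Python (numbers taken from the example above):</p>
<pre class="code"><code>import math

p = 0.2  # probability of being caught at each house
q = 0.5  # required probability of staying uncaught

n = math.floor(math.log(q) / math.log(1 - p))
print(n)                     # 3
print(1 - (1 - p) ** n)      # probability of being caught after n houses: ~0.488</code></pre>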
<h2>Summary</h2>
<p>This article discusses a fascinating problem called optimal stopping and demonstrates its most common use cases. The most common case is searching for a new employee (aka the secretary problem or the 37 per cent rule); then some generalisations are presented (for example, searching for a flat). Another class of problems is also discussed - the significant difference in this case is that we have a norm (we can measure how good something is in an absolute way, not just by comparison to another entity). Another interesting problem related to optimal stopping is the burglar dilemma (or bomb squad dilemma). It concerns the number of attempts at some risky activity that can safely be performed.</p>
"""

ENTITY = cr.Article(
    title="Optimal stopping: pure mathematics in real life",  # noqa: E501
    url_alias='optimal-stopping-pure-mathematics-in-real-life',  # noqa: E501
    large_image_path="images/opt_stop_big.jpg",
    small_image_path="images/opt_stop_small.jpg",
    date=datetime.datetime(2021, 1, 17),
    tags=[
        cr.Tag('Mathematics', 'mathematics'),
        cr.Tag('Design', 'design'),
        cr.Tag('Programming', 'programming'),
        cr.Tag('Performance', 'performance'),
        cr.Tag('Essentials', 'essentials')
    ],
    content=content,
    lead=lead,
    description=
    "The article discusses a very interesting problem called optimal stopping. It demonstrates the most common use-cases. The most common case is searching for a new employee."  # noqa: E501
)
<p>These databases are beneficial in the authorization and authentication process. For example, they can help store authentication tokens or tokens of various types (like a token for resetting passwords). In addition, some key-value databases support multiple keys (like Memcached, which is both an in-memory and a key-value database).</p>
<p>Generally, key-value databases act as a dictionary (or map), returning the stored value for a given key. Some solutions allow you to store almost any type as a value, and the requirements for the key are also not strict - as in Memcached (which is simultaneously an in-memory database).</p>
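<p>As a small illustration of the token use case (using Redis here purely as an example of a key-value store; Memcached offers an analogous pattern), a password-reset token can be stored under a key with a short expiry:</p>
<pre class="code"><code>import secrets

import redis

client = redis.Redis(host="localhost", port=6379)

# Store a password-reset token for user 42, valid for 15 minutes
token = secrets.token_urlsafe(32)
client.set(f"password-reset:{token}", 42, ex=900)

# Later, when the user clicks the reset link:
user_id = client.get(f"password-reset:{token}")  # None if expired or invalid</code></pre>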

<h2>Time series databases</h2>
<p>This form is practically identical to a traditional relational database, but the primary key is a time series. This feature makes it helpful for storing real-time observations (like some physical, time-dependent variables). The most common use case is monitoring systems (monitoring traffic, temperature, bandwidth and other quantities over time). The time series (with data) can then be easily transformed (with different binning, etc.) and shown in graphs. Tools like Grafana work precisely on this principle.</p>
<p>The most popular solution for time-series databases is InfluxDB. A rudimentary form is available under the MIT licence - the advanced application is proprietary software.</p>

<h2>Graph databases</h2>
<p>These databases are crucial when the relation between objects is the most important thing. There are many use-cases for this type of database. Arguably the most common are smart suggestions - showing product suggestions (for example, on some e-shop) based on the product you are currently analyzing. Another widespread use case is the relation between roles, groups and users (for authentication and security purposes). An optimized graph can make queries for such data much more effective than standard trees in a relational database. Analyzing relations between entities is also an underpinning principle of social networks.</p> 
<p>A prevalent solution for graph databases is neo4j (its source code is available, but technically, it is proprietary software). Also, since 2019, a standard for querying these databases (Graph Query Language) has been in development.</p> 

<h2>Summary</h2>
<p>The most popular technology for document databases is MongoDB. Document databases are helpful when you need to store a whole structure (like a JSON file) with minimal restrictions. In-memory databases are significant when it comes to caching. They allow the effective implementation of task queues or caching of results from workers. Key-value databases act as an optimized mapping structure that lets you store a value under some key (or several keys) and load it in optimized time. Key-value databases are often also in-memory databases (which is the case for Memcached). Time-series databases help store observations and monitoring outputs (like some physical variables). Finally, graph databases are essential if you need to select objects based on a set of relations.</p>
"""

ENTITY = cr.Article(
    title="Most common use cases for NoSQL databases",  # noqa: E501
    url_alias='most-common-use-cases-for-nosql-databases',  # noqa: E501
    large_image_path="images/bigdata_big.jpg",
    small_image_path="images/bigdata_small.jpg",
    date=datetime.datetime(2021, 7, 18),
    tags=[cr.Tag('NoSQL', 'nosql'),
          cr.Tag('Design', 'design'),
          cr.Tag('Programming', 'programming'),
          cr.Tag('Performance', 'performance'),
          cr.Tag('Essentials', 'essentials')],
    content=content,
    lead=lead,
    description="This article analyzes the most common use cases for well-known NoSQL databases (like document, in-memory, key-value, time-series and graph database solutions)."  # noqa: E501
)
[8] Y. Y. Song, Computational Number Theory and Modern
Cryptography, Higher Education Press, 2017, pp. 191-260. doi:
10.1002/9781118188606.ch5<br>
[9] L. T. Yang, Ying Huang, J. Feng, Q. Pan and C. Zhu, &#8220;An improved
parallel block Lanczos algorithm over GF(2) for integer factorization&#8221;,
Information Sciences, doi: 10.1016/j.ins.2016.09.052.<br>
[10] L. T. Yang, G. Huang, J. Feng and L. Xu, &#8220;Parallel GNFS algorithm
integrated with parallel block Wiedemann algorithm for RSA security in
cloud computing&#8221;, Information Sciences, doi: 10.1016/j.ins.2016.10.017.</p>
"""

ENTITY = cr.Article(
    title="The concept for an acceleration of public-key cryptanalysis methods",
    url_alias=
    'the-concept-for-an-acceleration-of-public-key-cryptanalysis-methods',
    large_image_path="images/hardware_big.jpg",
    small_image_path="images/hardware_small.jpg",
    date=datetime.datetime(2018, 11, 5),
    tags=[
        cr.Tag('Security', 'security'),
        cr.Tag('FPGA', 'FPGA'),
        cr.Tag('Hardware', 'hardware'),
        cr.Tag('Design', 'design'),
        cr.Tag('Cryptosystem', 'cryptosystem')
    ],
    content=content,
    lead=lead,
    description=
    "This article presents a new way of dealing with integer factorization and discrete logarithm problems from a custom hardware perspective with Zynq-7000 as SoC FPGA."  # noqa: E501
)
<figure>
    <img alt="Figure 2: Word construction process" src="images/ps_composition.png">
    <figcaption>Figure 2: Word construction process</figcaption>
</figure>

<p>Generally, each operation has its prefixes, suffixes, separators and operands. The order and values of each must be specified when the operation grammar is defined. As in Excel, words (operations) can be nested, which means inserting words into another word. Practically, each language defines many operations, like aggregation functions, binary and unary operations, references to cells, and others. All the library does is compose words based on overloaded operators (or direct method calls) and keep their values for each language and cell.</p>

<h2>Summary </h2>
<p>If there is a need to export data into the Excel format with formulas, two options are generally available: one uses some Excel file driver directly, and the other uses the Portable Spreadsheet library. If you choose to use drivers directly, there are two popular drivers for the XLSX 2010 file format in Python - XlsxWriter and openpyxl. The disadvantage of this approach is that the code becomes complex and fragile. The other option is to use an encapsulation of the driver, the Portable Spreadsheet library. It encapsulates all standard operations and allows export into multiple formats like JSON, XLSX or XML.</p>
"""

ENTITY = cr.Article(
    title="Python library for exporting formulas to Excel and other formats",
    url_alias=
    'python-library-for-exporting-formulas-to-excel-and-other-formats',
    large_image_path="images/portable_spreadsheet_big.jpg",
    small_image_path="images/portable_spreadsheet_small.jpg",
    date=datetime.datetime(2020, 8, 16),
    tags=[
        cr.Tag('Data Visualisation', 'data-visualisation'),
        cr.Tag('Exporting', 'exporting'),
        cr.Tag('Python', 'python'),
        cr.Tag('Design', 'design'),
        cr.Tag('Excel', 'excel')
    ],
    content=content,
    lead=lead,
    description=
    "We present a new library called Portable Spreadsheet - it can easily export simply defined formulas in Python to many formats including Excel, JSON, etc."  # noqa: E501
)
if __name__ == '__main__':
    # Call client
    main()</code></pre>

<p>The client code is again very similar to the gRPC client; creating the connection is more complex (three lines more), but the rest is almost the same. In addition, error handling is more straightforward.</p>

<p>Similarly to gRPC, Apache Thrift can be used to serialize data to disk. As mentioned above, you can easily find ready-made examples of how to do it. The principle is the same (you still need to define data structures in Thrift's interface definition language).</p> 

<h2>Summary</h2>
<p>One of the common challenges when storing data or sending them over the network is serializing them effectively (converting them into a technically suitable representation). Binary serialization is one of the essential parts of most Remote Procedure Call (RPC) frameworks. The most popular RPC frameworks are gRPC and Apache Thrift. The principle of an interface definition language (IDL) is to describe the interface (the methods and data types that are the subject of the transfer). The IDL of gRPC is called Protocol Buffers; Apache Thrift uses an IDL of the same name (the Thrift file). There are, of course, many other similar technologies - like Apache Avro, designed for specific purposes (and extended to support RPC).</p>
"""

ENTITY = cr.Article(
    title="Technical possibilities in binary serialization and RPC",
    url_alias='technical-possibilities-in-binary-serialization-and-rpc',
    large_image_path="images/protocol_big.jpg",
    small_image_path="images/protocol_small.jpg",
    date=datetime.datetime(2021, 10, 24),
    tags=[
        cr.Tag('Python', 'python'),
        cr.Tag('Design', 'design'),
        cr.Tag('Programming', 'programming'),
        cr.Tag('Performance', 'performance'),
        cr.Tag('Essentials', 'essentials')
    ],
    content=content,
    lead=lead,
    description=
    "Remote Procedure Call (RPC) frameworks are the fundamental programming concept. The crucial part of RPC is the binary serialization of data usable also when writing on a disk."  # noqa: E501
)
<h3>Appending to the file</h3>
<p>When writing to a variable in chunks, it is useful to know that neither xarray nor the native driver supports multidimensional variables with one dimension unlimited in size. The native driver supports unlimited dimensions only for 1D variables (xarray does not support them at all); neither supports an unlimited dimension for nD variables (n > 1). Unlimited dimensions are especially helpful if you need to append data to the file. Furthermore, the appending mode itself is currently not supported in xarray (the native driver works without any problems in the 'a' mode).</p>
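<p>A minimal sketch of appending with the native driver (the netCDF4 package) could look as follows - the file and variable names are only illustrative. The dimension is created as unlimited (size <code>None</code>), and the file is later reopened in the 'a' mode:</p>
<pre><code>import numpy as np
from netCDF4 import Dataset

# Create a file with an unlimited 1D dimension and a variable along it
with Dataset("series.nc", "w") as ds:
    ds.createDimension("time", None)          # None means unlimited
    var = ds.createVariable("values", "f8", ("time",))
    var[:] = np.arange(10.0)

# Reopen in append mode and extend the variable along the unlimited dimension
with Dataset("series.nc", "a") as ds:
    var = ds.variables["values"]
    var[var.shape[0]:] = np.arange(10.0, 20.0)</code></pre>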

<h3>Other inconveniences</h3>
<p>There are also many other issues. One of the best known is that you always have to be careful about what you receive at the end of an operation chain (if you use xarray). Sometimes it is a Dask object, and sometimes it is a Numpy object. This is very inconvenient, and you have to read the documentation very carefully to avoid surprises. Regarding types, you also cannot be sure how strings such as "1" and other numeric literals of type string are interpreted in the xarray world.</p>

<p>It is also good to remember that xarray does not evaluate a value until you explicitly ask for it. This can lead to a very long pipeline that causes memory and performance issues. Therefore, it is necessary to run the pipeline explicitly from time to time - technically, it means breaking a big pipeline into smaller ones.</p>
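<p>A short sketch of this behaviour (the file path and variable name are just for illustration; the <code>chunks</code> argument requires Dask to be installed):</p>
<pre><code>import xarray as xr

# Opening with chunks returns Dask-backed (lazy) arrays
ds = xr.open_dataset("measurements.nc", chunks={"time": 1000})

# Nothing is computed yet - this only builds the task graph
anomaly = ds["temperature"] - ds["temperature"].mean("time")

# Explicitly evaluate the (sub)pipeline to keep the task graph small
anomaly = anomaly.compute()</code></pre>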

<p>Another issue is the number of dependencies that are installed together with xarray. By default, there are six direct dependencies, and many more depending on the purpose you want to use it for. A large number of sub-dependencies causes the usual problems when a sub-dependency is no longer compatible with a previously installed version - which can crash the whole system. Everyone who has ever maintained a larger Python application is very familiar with this issue.</p>

<h2>Summary</h2>
<p>Although xarray provides many incredible features that can save development time, it also contains a lot of pitfalls you have to consider before moving to production. This article describes some of the most common problems, like memory leaks, troubles with chunking, the impossibility of appending to a file, type misinterpretation, etc. Generally speaking, none of the described issues is critical enough to make xarray useless. On the other hand, it is essential to make an informed decision before deploying to production. Working directly with the native driver is in many cases a safer (if more verbose) alternative to xarray.</p>
"""

ENTITY = cr.Article(
    title="Pros and Cons of using xarray when accessing NetCDF files",
    url_alias='pros-and-cons-of-using-xarray-when-accessing-netcdf-files',
    large_image_path="images/globe_big.jpg",
    small_image_path="images/globe_small.jpg",
    date=datetime.datetime(2020, 4, 10),
    tags=[
        cr.Tag('xarray', 'xarray'),
        cr.Tag('Big Data', 'big-data'),
        cr.Tag('Python', 'python'),
        cr.Tag('Geospatial', 'geospatial'),
        cr.Tag('NetCDF', 'netcdf')
    ],
    content=content,
    lead=lead)
</ul>
<p>The second sentence requires some explanation because it partially contradicts the overall logic. As far as we know, current computers are not capable of computing more than about one billion complex operations per second, and no known technology could increase this performance by the required orders of magnitude. One billion seems to be a huge number, but it actually is not. Consider how many possible combinations a 128-bit number has: exactly 340,282,366,920,938,463,463,374,607,431,768,211,456 - a number with 39 decimal digits. Now suppose we use all available computers in the world to find the correct input value. We have roughly ten billion computers, each computing a billion combinations per second (in reality, computers are much slower, but we can ignore that for now). In that case, it would take on the order of 10<sup>19</sup> seconds (about 3.4 × 10<sup>19</sup>) to find the correct value, which is more than 100,000,000,000 years. You do not have to worry that someone will crack your password after such a long time.</p>
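<p>The back-of-the-envelope estimate from the previous paragraph can be reproduced in a few lines of Python:</p>
<pre><code>combinations = 2 ** 128          # possible values of a 128-bit key
computers = 10 ** 10             # roughly ten billion computers
rate_per_computer = 10 ** 9      # a billion guesses per second each

seconds = combinations / (computers * rate_per_computer)
years = seconds / (60 * 60 * 24 * 365)
print(f"{seconds:.1e} seconds ~ {years:.1e} years")
# roughly 3.4e19 seconds, i.e. on the order of 10^12 years</code></pre>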

<p>Of course, it is critically important to mention that you have to use a genuinely random password - a random set of characters and numbers. For example: Sz2xNgVaRmQJrkL0eEAI8H, and definitely not passwords like JoHnOhMGoDItIs2So1CooL. Be aware of this, and learn one sufficiently long password properly. You can also use a purely alphabetical password (lower and upper case) of 23 characters - it is equally secure.</p>

<h2>Practical complications</h2>
<p>To make the situation even more complicated, almost every provider uses a different validator for passwords. Some require a composition that must contain a special character, some a composition that must not contain one. The restrictions on what that special character may be also differ significantly (for some providers, the exclamation mark is a valid character; for others, it is not acceptable). In addition, the allowed size of the password is often restricted to an insufficient length. All these restrictions make the internet an even less safe place - rather than preventing attackers from succeeding, they often achieve the opposite.</p>
<p>It is also good to be aware of the many password generators that are available as free tools. Unfortunately, they often generate passwords that are too weak, or their internal algorithm is not sufficiently random. It is not easy to generate a random series on a computer, because computers use deterministic algorithms (for the same input, always the same output). That makes the situation complicated. Some tools even generate a random series of words instead of characters. That is potentially dangerous because each whole word then behaves like a single character (which is counterintuitive and can leave users complacent about their password).</p>
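<p>For illustration, generating a genuinely random password is straightforward with the cryptographically secure generator from the Python standard library (the length of 23 characters follows the reasoning above):</p>
<pre><code>import secrets
import string

ALPHABET = string.ascii_letters          # lower and upper case letters
LENGTH = 23                              # 52^23 is roughly 131 bits of entropy

password = "".join(secrets.choice(ALPHABET) for _ in range(LENGTH))
print(password)</code></pre>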
<p>Another issue worth discussing is password managers, meaning programs that store passwords. These tools are frequently vulnerable - you usually need a master password to access them (which makes life difficult, as all passwords depend on this one password). So it is not surprising that hackers frequently target these programs (and are very often successful). Famously, there were troubles with FTP clients that hold passwords as plain text (like FileZilla or Total Commander), but also with browsers that keep passwords in the same way (you know that clingy store-password button).</p>
<p>Password expiry also presents a severe threat to security. Nobody wants to change passwords too often, so the new password often ends up less secure than the old one. As a result, people often change passwords in a loop (swapping two passwords each time they are forced to change). That, of course, does not make the system more secure (in fact, quite the opposite holds true). It is mathematically reasonable to change your passwords from time to time and use a unique password for every service - but the theory often fails in practice because it is difficult to memorize so many passwords.</p>

<h2>Conclusions</h2>
<p>The main conclusion is that it is not so important whether special characters are included in your password or not. It is also not so important whether numbers are included. A password of 23 random alphabetical characters (lower and upper case) is secure enough. What is critically important is the length of the password. Anything with fewer than 22 alphanumeric characters (or 23 alphabetical lower/upper-case characters) can be considered vulnerable. Generally, use common sense for passwords (do not store them, and try to change them reasonably often).</p>
"""

ENTITY = cr.Article(
    title="What is really the optimal size and composition of the password",
    url_alias='what-is-really-the-optimal-size-and-composition-of-the-password',
    large_image_path="images/password_big.jpg",
    small_image_path="images/password_small.jpg",
    date=datetime.datetime(2018, 11, 15),
    tags=[
        cr.Tag('Security', 'security'),
        cr.Tag('Web application', 'web-application'),
        cr.Tag('Password', 'password'),
        cr.Tag('Design', 'design'),
        cr.Tag('Cryptosystem', 'cryptosystem')
    ],
    content=content,
    lead=lead)
<h3>Available technologies </h3>
<p>The available technologies depend on the solution that you choose. In the Python land, the simple approach is to use Celery-based workers and a RabbitMQ (or Redis) based task queue. Both are under open licences and use free and widely supported technologies for storing and accessing data (Redis / Memcached). Another option is the combination of Faust (as the broker and worker interface) and Apache Kafka (as the task queue). The advantage of Celery is that the code for workers is written as if it were code of the main application (the broker), making the codebase easier to maintain - see the sketch below. The disadvantage of Celery is that it is hard to follow the task state based on its ID, as Celery cannot distinguish non-existent (e.g. lost) tasks from tasks that no worker has accepted yet (and that are therefore still in the queue).</p>
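<p>A minimal sketch of the Celery approach (the broker and backend URLs are purely illustrative, and a running broker is assumed) shows both the simplicity of defining workers and the mentioned limitation - the state of an unknown or lost task ID is reported simply as PENDING:</p>
<pre><code>from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task
def process_video(video_id):
    # Heavy-weight processing runs in a separate worker process
    return f"processed {video_id}"

# In the main application (broker side):
result = process_video.delay("abc-123")
print(result.status)                          # PENDING - queued, running, or lost?
print(app.AsyncResult("no-such-id").status)   # also PENDING</code></pre>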

<h2>Is there something in the middle? Yes, the streaming logic based on WebSocket</h2>
<p>There is also a way based on a combination of previous approaches. Technically, it is helpful when your application processes data sets where results are splittable into small chunks. For example, in the case of splitting the video into frames - you do not need to wait till the whole video is processed to work with already generated frames.</p>
<figure>
    <img alt="Figure 3: The logic of the streaming component" src="images/lds_streaming.png">
    <figcaption>Figure 3: The logic of the streaming component</figcaption>
</figure>
<p>The logic is based on an asynchronous connection between client and server, which can be implemented using WebSocket. Technically, instead of sending a notification about finishing the task, the server sends results continuously as they come. That keeps the network connection optimally busy all the time, and there is no risk of a timeout (thanks to WebSocket).</p>
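<p>The server side of such streaming can be sketched, for example, with the third-party <code>websockets</code> package (the handler signature varies slightly between versions of the package, and the frame generator is purely illustrative):</p>
<pre><code>import asyncio
import websockets

async def stream_results(websocket):
    # Send partial results (e.g. processed frames) as soon as they are ready
    for index in range(100):
        await websocket.send(f"frame-{index}")

async def main():
    async with websockets.serve(stream_results, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())</code></pre>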
<p>Although the streaming logic seems to be the optimal approach (if it is deployable), the reality is not that optimistic. There are many technical challenges when it comes to implementation. For example, keeping a WebSocket connection open is a resource-hungry operation (both CPU time and memory are required to keep the connection alive). Currently, there is no correctly working open-source framework available for managing WebSocket streaming in larger systems (with many customers).</p>
<h2>Summary</h2>
<p>This article presents three ways of dealing with big-data processing in web applications. The first uses the naive (synchronous) approach with one heavyweight worker, the second uses a message broker (task queue), and the third streams results continuously using WebSocket. The naive approach is unsuitable for practical problems (it can only prove that a concept works). The task-queue approach is initially more difficult to implement but is more flexible, and the system's throughput (as well as its security) is higher. Even the price for deploying such a system is lower. The third option is to stream data continuously. If this option is possible, it is the optimal one. Unfortunately, no simple, ready-made open-source implementation is currently available.</p>
"""

ENTITY = cr.Article(
    title="Design of a system for on-demand processing of the large datasets",
    url_alias='design-of-a-system-for-on-demand-processing-of-the-large-datasets',
    large_image_path="images/emails_big.jpg",
    small_image_path="images/emails_small.jpg",
    date=datetime.datetime(2020, 4, 26),
    tags=[cr.Tag('Task Queue', 'task-queue'),
          cr.Tag('Big Data', 'big-data'),
          cr.Tag('Python', 'python'),
          cr.Tag('Design', 'design'),
          cr.Tag('Performance', 'performance')],
    content=content,
    lead=lead
)
Exemple #23
print(obj_b.some_method(4))
print(obj_b.some_method.cache_info())
# >>> CacheInfo(hits=0, misses=2, maxsize=3, currsize=1)
# => Great!!! Now it works correctly.</code></pre>

<p>The drawback of the <code>methodtools</code> library is that it contains quite a few sub-dependencies (which can sometimes cause issues).</p>

<h2>Summary</h2>
<p>The article presents some theory behind caching and then practical examples of caching in Python using the lru_cache decorator from the functools and methodtools packages. Caching is generally helpful for making your code run faster, and it is convenient to have some support for it on the language level. In Python, it is good to know about the unexpected behaviour of lru_cache in the functools package (memory leaks) and how to overcome it. Also, if needed, you can usually implement your own cache logic quickly - almost anything hashable can be a dictionary key, so a dictionary itself can serve as the data structure for the cache.</p>
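<p>As an illustration of the last point, a minimal hand-made cache is just a dictionary keyed by the function arguments:</p>
<pre><code>import functools

def simple_cache(function):
    """Memoize a function of hashable positional arguments in a plain dict."""
    storage = {}

    @functools.wraps(function)
    def wrapper(*args):
        if args not in storage:
            storage[args] = function(*args)
        return storage[args]

    return wrapper

@simple_cache
def slow_square(number):
    return number * number

print(slow_square(4))  # computed
print(slow_square(4))  # served from the dict</code></pre>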
"""

ENTITY = cr.Article(
    title="A bit more about cache and the way how to implement it in Python",
    url_alias=
    'a-bit-more-about-cache-and-the-way-how-to-implement-it-in-python',
    large_image_path="images/cache_big.gif",
    small_image_path="images/cache_small.gif",
    date=datetime.datetime(2021, 3, 27),
    tags=[
        cr.Tag('Python', 'python'),
        cr.Tag('Design', 'design'),
        cr.Tag('Programming', 'programming'),
        cr.Tag('Performance', 'performance'),
        cr.Tag('Essentials', 'essentials')
    ],
    content=content,
    lead=lead,
    description=
    "The article presents some theory behind caching and then practical examples of caching in Python language using functools and methodtools lru_cache decorator."  # noqa: E501
)
Exemple #24
<figure>
    <img src="images/venv_pycharm.gif" alt="Figure 1: Screen of the PyCharm requirements window.">
    <figcaption>Figure 1: Screen of the PyCharm requirements window.</figcaption>
</figure>

<p>If you use a different IDE, the exact way to manage dependencies might differ - but the logic is always the same. For example, another popular IDE in the Python community is called Visual Studio Code. It is also available on all platforms and, similarly to PyCharm, provides a simple integrated way of managing dependencies.</p>

<h2>Summary</h2>
<p>A virtual environment is a fundamental concept in Python. Its main goal is to separate the dependencies of each Python application. There are many ways of managing virtual environments - the most straightforward is to use the native venv module. That requires a system installation that allows running venv - there is a simple way to install it on most Linux distributions and a slightly more cumbersome way on Windows. Another way is to use Anaconda, which allows installing packages that require system dependencies without installing anything at the system level. The most popular way in Python software engineering is to use the built-in support for virtual environments in the IDE. Every popular IDE for Python supports virtual environments in some way.</p>
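<p>For completeness, the standard-library <code>venv</code> module can also be driven programmatically - the following is equivalent to running <code>python -m venv .venv</code> on the command line (the directory name is arbitrary):</p>
<pre><code>import venv

# Create a virtual environment in the .venv directory, including pip
venv.create(".venv", with_pip=True)</code></pre>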

"""

ENTITY = cr.Article(
    title="Virtual environments in Python language",
    url_alias='virtual-environments-in-python-language',
    large_image_path="images/python_venv_big.jpg",
    small_image_path="images/python_venv_small.jpg",
    date=datetime.datetime(2020, 8, 14),
    tags=[
        cr.Tag('Virtual Environment', 'virtual-environment'),
        cr.Tag('Programming', 'programming'),
        cr.Tag('Python', 'python'),
        cr.Tag('Performance', 'performance'),
        cr.Tag('Essentials', 'essentials')
    ],
    content=content,
    lead=lead,
    description=
    "Virtual environments in Python are the fundamental concept that makes developing of application much easier and cleaner. There is a simple way of managing it."
)
<p>The biggest drawback of the presented example is the time needed for serialization and deserialization. However, this can be overcome if you use a native format - like regularly shaped n-dimensional numeric arrays - which supports native serialization to REDIS. Although REDIS does not support n-dimensional arrays, you can overcome this issue by mapping them to a 1D array (this mapping is trivial, as the physical representation in memory is always a one-dimensional array anyway). NumPy, for example, provides simple methods for flattening the n-dimensional input.</p>
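<p>A minimal sketch of this approach with the redis-py client (the key names, dtype and shape handling are simplified for illustration, and a running REDIS instance is assumed):</p>
<pre><code>import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

array = np.random.rand(100, 200, 30)          # n-dimensional input

# Store the flattened raw bytes plus the metadata needed to restore the shape
r.set("grid:data", array.astype(np.float64).tobytes())
r.set("grid:shape", ",".join(map(str, array.shape)))

# Restore: read the 1D buffer back and reshape it
shape = tuple(int(x) for x in r.get("grid:shape").decode().split(","))
restored = np.frombuffer(r.get("grid:data"), dtype=np.float64).reshape(shape)</code></pre>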
<p>When it comes to particular cases - say geospatial data - be aware of the potentially massive size of these data sets. It can be a big hurdle in successfully speeding up your system using the in-memory database approach. Technically, you may achieve only slightly better performance at a huge price (or no better performance at all if you choose the wrong serialization process). Therefore, it is necessary to examine the requirements for your system thoroughly. For example, if you want to share data among many workers in a message-passing pattern, you need to be aware of the latencies when accessing the REDIS instance from each worker.</p>
<p>Many other aspects are critical if you need to increase the performance of your system. The most notorious example is the dimension order of multi-dimensional data - there is a whole theory behind it. Also, in a cloud environment, where the REDIS instance runs on a different machine than your primary application, delays caused by communication latencies can be critical.</p>
<p>It also still holds that when dealing with any database (including REDIS), optimising your queries in advance is beneficial. For example, if possible, always read data in one call rather than in many separate calls, as decreasing the number of database hits significantly increases overall performance - see the sketch below. This approach holds for all database technologies and is well known in the world of relational databases.</p>
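<p>With redis-py, this means preferring <code>mget</code> (or a pipeline) over repeated single reads - a simplified sketch, using purely illustrative keys:</p>
<pre><code>import redis

r = redis.Redis()

keys = [f"grid:{i}" for i in range(1000)]

# Many round trips - one network hit per key
values_slow = [r.get(key) for key in keys]

# One round trip - all keys fetched in a single call
values_fast = r.mget(keys)</code></pre>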

<h2>Different approaches for caching</h2>
<p>The main advantage of caching in a REDIS instance is that the cached values can be shared, but this comes at the cost of the latency of accessing REDIS. If you only need to cache locally, there is a simpler way - the most straightforward is the <code>lru_cache</code> decorator from the <code>functools</code> package (there is a separate article about caching here). There are also many ready-made caching tools in various frameworks - like Django (which internally uses mainly REDIS, sometimes Memcached).</p>
<p>Sometimes, if your application is correctly optimized, even reading data from a disk can be sufficient. This is the most common case when you process a large data set - you usually need to load all auxiliary data once at the beginning, and as that process is relatively slow anyway, it does not matter much whether it takes one second or five. Also, with fast SSD disks, you can reach performance comparable to in-memory caching. Disk space is generally much cheaper, so this line of reasoning should be reflected in the design phase.</p>

<h2>Conclusions and further research</h2>
<p>This article shows that storing a multi-dimensional array of values in a REDIS in-memory database (when dealing with a Pandas DataFrame in Python) is a suitable way of accelerating the system's overall performance. Furthermore, it shows that the threshold table size is about one million elements on an Intel Xeon CPU E5-2673 v4 @ 2.30 GHz processor. We also discuss other possibilities for caching and serialization of multi-dimensional (and scalar) data, like in-place caching using decorators or a fast SSD. Finally, some general rules for reading from databases are recalled.</p>
"""

ENTITY = cr.Article(
    title=
    "Acceleration of frequently accessed multi-dimensional values in Python using REDIS",
    url_alias=
    'acceleration-of-frequently-accessed-multi-dimensional-values-in-python-using-redis',
    large_image_path="images/frequency_big.jpg",
    small_image_path="images/frequency_small.jpg",
    date=datetime.datetime(2019, 2, 3),
    tags=[
        cr.Tag('Pandas', 'pandas'),
        cr.Tag('Big Data', 'big-data'),
        cr.Tag('Performance', 'performance'),
        cr.Tag('Geospatial', 'geospatial'),
        cr.Tag('REDIS', 'redis')
    ],
    content=content,
    lead=lead)
Exemple #26
<h3>Typical authentication logic</h3>
<p>The authentication process usually follows the logic depicted in the schema below:</p>
<figure>
    <img src="images/auth_schema.png" alt="Schema of authentication logic with standard components">
    <figcaption>Figure 1: Schema of authentication logic with standard components</figcaption>
</figure>
<p>As you can see, the authentication module (or service) is logically independent of the rest of the application - which is the most common (and the most secure) design. Internally, it verifies user credentials against a database and generates a token (stored in a database, usually an in-memory cache). The application then accepts a request with the token, verifies it using the authentication component, and serves the response (either a forbidden error or the actual response).</p>
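<p>The core of this flow can be sketched in a few lines of Python (the user store and the token store below are simplified stand-ins for the real database and in-memory cache; in practice, password hashes would be stored rather than plain passwords):</p>
<pre><code>import secrets

USERS = {"alice": "correct-horse-battery-staple"}   # illustrative user store
TOKENS = {}                                         # stand-in for the in-memory cache

def authenticate(username, password):
    """Verify credentials and issue a random token."""
    if USERS.get(username) != password:
        return None
    token = secrets.token_urlsafe(32)
    TOKENS[token] = username
    return token

def serve_request(token):
    """Verify the token before serving the actual response."""
    user = TOKENS.get(token)
    if user is None:
        return "403 Forbidden"
    return f"200 OK (hello, {user})"

token = authenticate("alice", "correct-horse-battery-staple")
print(serve_request(token))        # 200 OK
print(serve_request("bogus"))      # 403 Forbidden</code></pre>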

<h2>Conclusions</h2>
<p>The most common technical ways of dealing with authentication in web applications are presented: basic authentication, session-based authentication and token-based authentication. Each of these approaches has its pros and cons (and some of them are not mutually exclusive). Generally, the most secure is token-based authentication. The most common third-party implementations are also presented - mainly OAuth - these services are helpful for implementing single sign-on logic (the user uses a single credential for multiple services). Finally, the multi-factor authentication concept is presented as a way to improve application security by using multiple verification channels.</p>
"""

ENTITY = cr.Article(
    title=
    "The concepts for the secure authentication process in web application",
    url_alias=
    'the-concepts-for-the-secure-authentication-process-in-web-application',
    large_image_path="images/web_big.jpg",
    small_image_path="images/web_small.jpg",
    date=datetime.datetime(2019, 4, 25),
    tags=[
        cr.Tag('Security', 'security'),
        cr.Tag('Web application', 'web-application'),
        cr.Tag('REST', 'rest'),
        cr.Tag('Design', 'design'),
        cr.Tag('Services', 'services')
    ],
    content=content,
    lead=lead)