Cassandra: the introduction

Distributed, partitioned, multi-master, incrementally and horizontally scalable NoSQL data management system for global mission-critical use cases handling petabyte-sized datasets

Dynamo, Amazon, Facebook, Apache…

Reliability at massive scale is one of the biggest challenges we face at Amazon

source: “Dynamo: Amazon’s highly available key-value store”

In 2004 Amazon experienced performance issues in its e-commerce platform due to high traffic. By 2007 the Dynamo concept had materialized as an internal key-value store and was described in the paper quoted above. Then in 2008 Facebook, with one of the co-authors of Amazon Dynamo, developed its own distributed NoSQL system, Cassandra. In 2009 Cassandra became an Apache project. Amazon’s DynamoDB arrived in 2012; it also builds on the Dynamo concept, but it is a fully proprietary solution.

So we have the Dynamo concept, we have Amazon and Facebook, and finally the Apache Cassandra project. Without diving deep into the details of who was first and who borrowed what… Cassandra has been on the market for 16 years already (as of 2024).

Characteristics

Cassandra is meant to run in a multi-node environment and is especially useful in dispersed networks. It is masterless, which means that there is no single point of failure, provided that keyspaces and tables are designed properly, especially in terms of replication factor. A replication factor greater than 1 introduces redundancy into the data layout, which makes a Cassandra cluster resilient. Although Cassandra is distributed, it is still visible as a single entity from the user’s perspective.

The key concept that provides almost endless scaling is that Cassandra is distributed and uses data partitions. Compared to an RDBMS (like PostgreSQL), you need no additional configuration to run partitions and shards, as this is built into the Cassandra engine. Data in Cassandra is placed on particular nodes according to the partition key. If you need more storage or more computational power, you just add nodes to the cluster.

“Each node owns a particular set of tokens, and Cassandra distributes data based on the ranges of these tokens across the cluster.”

source: cassandra.apache.org/_/cassandra-basics.html

Data is spread across the cluster based on the partition key: a hash function turns the key into a token, which maps to the nodes that own that particular token range. Any node can act as a coordinator, locating the target node and gossiping with other nodes about the cluster structure. There are of course many challenges, for instance when you define a replication factor > 1 and experience networking issues. A write or read may or may not succeed depending on the configuration, that is, on how many nodes responded to the request and confirmed it.

source: https://cassandra.apache.org/doc/stable/cassandra/data_modeling/data_modeling_logical.html
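The token idea can be sketched in a few lines of Python. This is only a toy model under loud assumptions: it uses MD5 instead of Cassandra’s Murmur3 partitioner, one token per node instead of virtual nodes, and SimpleStrategy-like placement where replicas are simply the next nodes on the ring. The names `Ring`, `token_for` and `quorum` are made up for this illustration.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 here, an assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, node_tokens):
        # Sort (token, node) pairs so the ring can be walked clockwise.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def replicas(self, partition_key, replication_factor):
        """Owner of the key's token plus the next nodes on the ring."""
        # First node whose token is >= the key's token; wrap around at the end.
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


def quorum(replication_factor: int) -> int:
    """Number of replicas that must answer for a QUORUM request."""
    return replication_factor // 2 + 1


# Three hypothetical nodes with evenly spaced tokens.
step = 2**64 // 3
ring = Ring({"node-a": 0, "node-b": step, "node-c": 2 * step})

print(ring.replicas("customer-42", replication_factor=2))
print(quorum(3))  # 2 of 3 replicas must respond
```

With a replication factor of 3 and QUORUM consistency, a request still succeeds when one replica is unreachable, which is the kind of trade-off the paragraph above hints at.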

Concerning the data placement strategy: for non-production or simple production setups you can use SimpleStrategy; for bigger production setups it is recommended to use NetworkTopologyStrategy, which takes DC and rack definitions into consideration.

“Data modeling that considers the querying patterns and assigns primary keys based on the queries will have the lowest latency in fetching data”

source: cassandra.apache.org/doc/stable/cassandra/cql/ddl.html

And how is data organized into files?

“Sorted Strings Table (SSTable) is a persistent file format used by ScyllaDB, Apache Cassandra, and other NoSQL databases to take the in-memory data stored in memtables, order it for fast access, and store it on disk in a persistent, ordered, immutable set of files. Immutable means SSTables are never modified. They are later merged into new SSTables or deleted as data is updated.”

source: www.scylladb.com/glossary/sstable/

Cassandra QL quirks

Keep in mind that…

“The append and prepend operations are not idempotent by nature”

“lists have limitations and specific performance considerations that you should take into account before using them”

Currently, aliases aren’t recognized in the WHERE or ORDER BY clauses in the statement. You must use the original column name instead.

The primary key uniquely identifies a row in the table, as described above. A consequence of this uniqueness is that if another row is inserted using the same primary key, then an UPSERT occurs and an existing row with the same primary key is replaced. Columns that are not part of the primary key cannot define uniqueness.

…but as Cassandra allows the client to provide any timestamp on any table, it is theoretically possible to use another convention. Please be aware that if you do so, dropping a column will not correctly execute.

What I find interesting

I think the speculative_retry feature is worth noticing: it defines when the query coordinator may query another node for data in case of a slow response or the unavailability of some node.

The Vector Search feature is also worth trying out:

Vector Search is a new feature added to Cassandra 5.0. It is a powerful technique for finding relevant content within large datasets and is particularly useful for AI applications.

I found in the documentation that it is not recommended to run Cassandra on NFS/SAN, as the whole distributed concept of Cassandra relies on separate nodes. However, it also says that Cassandra will perform better on RAID0 or JBOD (Just a Bunch Of Disks) than on RAID1 or RAID5. Well… it is obvious that things run better without additional overhead, and I think it should be stated that this suggestion is valid only if we aim for maximum performance at the cost of a little bit of data safety.

Installation in Docker/Portainer/Proxmox

To get started with Cassandra we can use a Proxmox virtual environment and Docker Swarm with Portainer on top of it. However, there are certain issues with running it in Swarm mode. Long story short, Swarm adds an ingress layer, which adds additional IP addresses to the container. This somehow confuses Cassandra. I believe there is some solution for this, however I found only a few bug reports on this matter without a clear conclusion.

So, we can stay with Swarm mode, but deploy Cassandra as a regular container, not a service. Yes, we can run regular containers without putting them into services while running in Swarm mode.

I will use a previously prepared Docker Swarm cluster.

For this deployment I will go for Docker image cassandra:5.0.0:

And here it starts:

Every Cassandra node should open port 7000 for inter-node communication. Port 9042 is for query handling.

We need to point to the exact Docker Swarm node on which we would like to place our Cassandra node. Then, in the environment section, define CASSANDRA_BROADCAST_ADDRESS and CASSANDRA_SEEDS. It is important to spread seed nodes across the cluster, so that in case of an outage whatever is left of the cluster remains operational.
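As a rough illustration of those settings, here is a small Python helper that only assembles a `docker run` command line for one node. It is a sketch, not the deployment used in this post: the helper name, the container name and the choice of which addresses act as seeds are assumptions, and the IP addresses are simply borrowed from the driver example later in the text.

```python
import shlex


def cassandra_run_command(name, broadcast_address, seeds, image="cassandra:5.0.0"):
    """Build a `docker run` argument list for one Cassandra node (hypothetical helper)."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-p", "7000:7000",  # inter-node communication
        "-p", "9042:9042",  # CQL query handling
        "-e", f"CASSANDRA_BROADCAST_ADDRESS={broadcast_address}",
        "-e", f"CASSANDRA_SEEDS={','.join(seeds)}",
        image,
    ]


# Assumed layout: two seed nodes placed in different parts of the cluster.
cmd = cassandra_run_command(
    "cassandra-1", "192.168.2.0", ["192.168.2.0", "192.168.3.0"]
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell on the chosen Docker host; the same two environment variables go into the environment section when the container is defined in Portainer.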

Monitoring nodes

Every node container contains the nodetool utility, which helps identify the status of our Cassandra cluster. We can query for general status (status command), detailed info (info command), initiate compaction (compact command) and many, many more.

cd /opt/cassandra/bin
./nodetool status

For this demo I decided to go for a SimpleStrategy cluster with one main node for seeding and two workers.

Data analysis with Redash

To query Cassandra you can use either CQLSH (present on every Cassandra node in /opt/cassandra/bin) or install Redash. It is a complete data browser and visualizer with the ability to connect to all major RDBMS and to Cassandra as well. To install Redash, download the https://github.com/getredash/setup repository and follow the instructions.

To start playing with CQL (Cassandra Query Language) we need to define a keyspace (using CQLSH), which is a sort of database. We define it with SimpleStrategy and 3 copies, so on our three-node cluster every node will hold all of the data. This way we will be resilient to hardware or network failure. For more complex scenarios use NetworkTopologyStrategy with defined DC and rack parameters.

create keyspace domains
with replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

Now that we have created the keyspace, we can go to Redash and define a Data Source.

Then start a new query and play around.

CQL DDL

We already created a keyspace/database, and it is time to create a table with a single-column primary key.

create table mylist
(
  myid int primary key,
  mytext text
);

In return in Redash you will get:

Error running query: 'NoneType' object is not iterable

which means that Redash expects to receive an iterable object and instead got nothing, because creating a table or a keyspace returns nothing.

CQL DML

Cassandra restricts the user: you can query for data only if the WHERE clause matches the primary key columns.

insert into mylist (myid, mytext) values (1, 'test 1');
insert into mylist (myid, mytext) values (2, 'test 2');

select * from mylist where myid = 1;
select * from mylist where myid in (1,2);

Take another example:

create table myotherlist (
  myid int,
  myotherid int,
  mylastid int,
  primary key (myid, myotherid, mylastid)
);

Then insert some data:

insert into myotherlist (myid, myotherid, mylastid) values (1, 1, 1);
insert into myotherlist (myid, myotherid, mylastid) values (2, 2, 2);
insert into myotherlist (myid, myotherid, mylastid) values (3, 3, 3);
insert into myotherlist (myid, myotherid, mylastid) values (4, 4, 4);

And then try various combinations of the where clause. The following will return the error “column cannot be restricted as preceding column is not restricted”. It means that Cassandra does not keep individual indices that could help locate values of those different columns. Instead, a row is located by its partition key, and rows inside a partition are sorted by the clustering columns in their declared order (on disk this lives in a Log Structured Merge Tree), so the primary key can be traversed only by using consecutive, adjacent components starting from the first one:

select * from myotherlist where mylastid = 1; 

But these will work:

select * from myotherlist where myid = 1;
select * from myotherlist where myid = 2 and myotherid = 2;
select * from myotherlist where myid = 3 and myotherid = 3 and mylastid = 3;
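The rule behind these examples can be modelled with a few lines of Python. This is a deliberately simplified sketch: it covers only equality restrictions and ignores secondary indexes, ALLOW FILTERING, range and IN restrictions. The function name `can_query` is made up for this illustration and is not part of any Cassandra API.

```python
def can_query(primary_key, restricted):
    """Simplified model: restrictions must form a gap-free prefix of the primary key."""
    restricted = set(restricted)
    if not restricted or not restricted <= set(primary_key):
        return False
    # The restricted columns must be exactly the first N primary key columns.
    return set(primary_key[:len(restricted)]) == restricted


PRIMARY_KEY = ["myid", "myotherid", "mylastid"]

print(can_query(PRIMARY_KEY, ["mylastid"]))                       # False
print(can_query(PRIMARY_KEY, ["myid"]))                           # True
print(can_query(PRIMARY_KEY, ["myid", "myotherid"]))              # True
print(can_query(PRIMARY_KEY, ["myid", "myotherid", "mylastid"]))  # True
print(can_query(PRIMARY_KEY, ["myid", "mylastid"]))               # False, myotherid is skipped
```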

As the primary key must be unique, inserting values that already exist in all primary key columns does not create a second row; the existing row is upserted. Moreover, when inserting you cannot skip any of the required clustering keys (myotherid and mylastid).

The same applies to the partition key (myid):

Design principles

Contrary to RDBMS, Cassandra’s design principles focus more on denormalization than normalization. You need to design your data model around the way you will be using it, instead of just describing the schema.

By contrast, in Cassandra you don’t start with the data model; you start with the query model

The sort order available on queries is fixed, and is determined entirely by the selection of clustering columns you supply in the CREATE TABLE command

In relational database design, you are often taught the importance of normalization. This is not an advantage when working with Cassandra because it performs best when the data model is denormalized.

A key goal that you will see as you begin creating data models in Cassandra is to minimize the number of partitions that must be searched in order to satisfy a given query. Because the partition is a unit of storage that does not get divided across nodes, a query that searches a single partition will typically yield the best performance.

Application example & flushing commitlog

So, previously I defined a table called myotherlist with three integer columns contained in the primary key. Let’s use Python to insert some data. First install the driver:

pip3 install cassandra-driver

Then define the program. We are going to use prepared statements as they save CPU cycles.

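import cassandra
from cassandra.cluster import Cluster

# Show which driver version is in use.
print(cassandra.__version__)

# Contact points: reachable nodes of the cluster.
cluster = Cluster(['192.168.2.0', '192.168.2.1', '192.168.3.0'])
session = cluster.connect()

# Switch to the keyspace created earlier.
session.execute("USE domains")

rangeofx = range(100)
rangeofy = range(100)
rangeofz = range(100)

# Prepare once, execute many times: the statement is parsed only once.
stmt = session.prepare("INSERT INTO myotherlist (myid, myotherid, mylastid) VALUES (?, ?, ?)")

for x in rangeofx:
    for y in rangeofy:
        for z in rangeofz:
            print(x, y, z)
            session.execute(stmt, [x, y, z])

# Close all connections to the cluster.
cluster.shutdown()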

Interestingly, this data will not be present in the data files instantly. Instead it will appear in the commitlog; let’s take a look:

You can see that there is not much happening here. However, when we take a look at the commitlog, we can see that this is most probably where our data is located.

In order to write our data into SSTable files, run ./nodetool flush. This writes the in-memory memtables to data files on disk, after which the corresponding commitlog segments are no longer needed.
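The write path described here (commitlog first, memtable in memory, SSTable only after a flush) can be mimicked with a toy class. This is a sketch for intuition only: `ToyNode` is an invented name, and the real engine adds compaction, tombstones, bloom filters and on-disk formats that are not modelled.

```python
class ToyNode:
    """Toy model of a single node's write path (not real Cassandra code)."""

    def __init__(self):
        self.commitlog = []  # append-only log used for crash recovery
        self.memtable = {}   # recent writes kept in memory
        self.sstables = []   # immutable, sorted snapshots "on disk"

    def write(self, key, value):
        # A write goes to the commitlog and the memtable, not to an SSTable.
        self.commitlog.append((key, value))
        self.memtable[key] = value

    def flush(self):
        # Persist the memtable as a sorted, immutable SSTable.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commitlog = []  # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable wins, so search from the most recent one.
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None


node = ToyNode()
node.write((1, 1, 1), "row")
node.write((0, 0, 0), "row")
print(len(node.commitlog), len(node.sstables))  # 2 0
node.flush()
print(len(node.commitlog), len(node.sstables))  # 0 1
```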

PostgreSQL manual partitioning

Have you ever wondered how many tables we can create and use in a PostgreSQL database server? Shall we call them partitions or shards? Why not use the built-in “automatic” partitioning?

Partitions or shards?

Let’s first define the difference between partitions and shards. Partitions are placed on the same server, while shards can be spread across various machines. We can use inheritance or the more recent “automatic” partitioning. However, both of these solutions lead to tight coupling with the PostgreSQL RDBMS, which in some situations we would like to avoid. Imagine the prospect of migrating our schemas to a different RDBMS like Microsoft SQL Server. Not using any vendor-specific syntax and configuration would be beneficial.

Vendor agnostic partitions

So instead, we can just try to create partition-like tables manually:

sudo apt install postgresql -y
sudo -u postgres -i
psql
CREATE DATABASE partitions;
exit

Then, after installing PostgreSQL and creating the new database:

for i in $(seq 1 10000);
do
  echo $i;
  psql -c "create table demo_$i (id int, val int);" partitions;
done

This way we created 10 000 tables with just generic SQL syntax, which is compatible with practically any other RDBMS. More importantly, we do not rely on shared memory configuration and the limits that come from attaching too many partitions to a main table.
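With manual partitions the routing moves into the application, which has to pick the right table name itself. A minimal sketch of such routing is shown here, assuming a simple modulo scheme over the `id` column; the scheme, the function names and the `%s` placeholder style (as used by common Python PostgreSQL drivers) are assumptions, not something prescribed by PostgreSQL.

```python
TABLE_COUNT = 10_000  # matches the demo_1 .. demo_10000 tables created above


def table_for(row_id: int, table_count: int = TABLE_COUNT) -> str:
    """Map an integer id to one of the demo_N tables (assumed modulo routing)."""
    return f"demo_{row_id % table_count + 1}"


def insert_statement(row_id: int, val: int):
    """Return SQL text and parameters; only the values are parameterized."""
    # The table name comes from an integer calculation, never from user input.
    sql = f"INSERT INTO {table_for(row_id)} (id, val) VALUES (%s, %s)"
    return sql, (row_id, val)


print(table_for(0))       # demo_1
print(table_for(12345))   # demo_2346
print(insert_statement(7, 100))
```

The same function can be reused on the read side, so every query touches exactly one small table.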

How many regular partitions can I have?

In the case of PostgreSQL (regular partitions), if we attach too many tables we can quickly notice a negative impact on performance and memory consumption. So if you would like to use PostgreSQL “automatic” partitioning, keep in mind not to attach too many tables. How many is too many? I started noticing it after attaching just 100 – 200 tables, which in small/medium deployments should be our upper limit.

How big can my data be?

In terms of how big a single PostgreSQL node can be, I would say that 5 – 10 TB of data with tables reaching 2 TB (including TOAST) is a fairly normal situation and regular hardware will handle it. If you have 512 GB of RAM on the server, then buffers and cache will be sufficient to operate on such huge tables.

How many tables can I create in a single node?

As mentioned before, you are restricted by storage, memory and CPU, as always. However, you should also monitor the inode count as well as the file descriptor count in the system, because these separate tables are stored in separate files, and this matters even more if records contain lengthy data which goes into TOAST tables. However, using regular tables as partitions is the most denormalized way of dividing our data physically.

I can tell that 50 000 tables in a single node is just fine even on a small/mid-sized system.

But what is the actual limit? I think the only practical limit comes from hardware and operating system constraints. On an Ubuntu 22 LXC container with an 8 GB drive, 1 vCPU and 512 MB of memory we have 524k inodes available. After adding 50k tables we can see that inode usage increased to 77126 entries, which is 15% of the total available.

postgres@z10-test:~$ df -i
Filesystem                        Inodes IUsed   IFree IUse% Mounted on
/dev/mapper/pve-vm--131--disk--0  524288 77126  447162   15% /
none                             4070043    25 4070018    1% /dev
tmpfs                            4070043     2 4070041    1% /dev/shm
tmpfs                             819200   149  819051    1% /run
tmpfs                            4070043     2 4070041    1% /run/lock
tmpfs                             814008    11  813997    1% /run/user/110

I think that at least from the inode perspective we are on the good side, even with 50k tables.
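The same inode figures that `df -i` prints can be read from a script, which is handy for monitoring while tables are being created. Here is a minimal sketch using Python’s standard `os.statvfs` (Unix only); the helper name `inode_usage` is made up, and some filesystems report zero total inodes, which the code treats as 0%.

```python
import os


def inode_usage(path="/"):
    """Return (used, total, percent_used) inodes for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_files          # total inodes on the filesystem
    used = total - st.f_ffree   # total minus free inodes
    percent = 100.0 * used / total if total else 0.0
    return used, total, percent


used, total, percent = inode_usage("/")
print(f"inodes used: {used} of {total} ({percent:.1f}%)")
```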

How to design data architecture then?

Now let’s imagine a real-world scenario of a system serving 1 million customers. With that number I would recommend having multiple nodes in different locations to decrease latency (in the case of global services). The application architecture would also need to be distributed. So we would use both partitioning within one node and sharding “per se”. On the other hand, we may stick with a single shard with all customers in a single node, but with the actual work done within Cassandra nodes and not the RDBMS….

About me

Hello. I’m Michael, an IT professional and enthusiast.

I graduated in computer programming as well as economics from Szkoła Główna Handlowa in Warsaw. I’m the author of 5 books concerning software design, development, quality assurance and performance. Since 2005 I have been working with many different companies, supporting them in various aspects of the software development process. I’m specifically interested in corporate architecture but also in bootstrapping new startup ideas.

My motto is “getting things done” and I like to learn new things. Would you like to build something special? Contact me.

Block AI web-scrapers from stealing your website content

Did you know that you can block AI-related web-scrapers from downloading your whole website and effectively stealing your content? This way LLMs will need to find a different data source for their learning process!

Why, you may ask? First of all, AI companies make money on their LLMs, so using your content without paying you is just stealing. This applies to texts, images and sounds. It is intellectual property which has a certain value. A long time ago I placed an “Attribution-NonCommercial-NoDerivatives” license on my website and guess what… it does not matter. I did not receive any attribution. Dozens of various bots visit my website and just download all the content. So I decided…

… to block those AI-related web-crawling, web-scraping bots. And no, not by modifying the robots.txt file (or any XML sitemaps), as that might not be sufficient in the case of some Chinese bots which just “don’t give a damn”. Nor did I decide to use any kind of plugins or server extensions. I decided to go the hard way:

location / {
  if ($http_user_agent ~* "Bytespider") { return 403; }
  ...
}

And decide exactly which HTTP User Agent (the client “browser”, in other words) I would like to show the middle finger. For those who do not stare at server logs for at least a few minutes a day, “Bytespider” is a scraping bot from ByteDance, the company which owns TikTok. It is said that this bot could possibly download content to feed some Chinese LLM. Chinese or US, it actually does not matter. If you would like to use my content, either pay me or attribute the usage of my content. How, you may ask? To be honest, I do not know.
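Before adding more agents to the NGINX rule it can help to test the list offline against User-Agent strings copied from the server logs. Here is a small Python sketch that mirrors the case-insensitive match of the `~*` operator; the list content beyond “Bytespider”, the function name and the sample User-Agent strings are assumptions made for illustration.

```python
import re

# Substrings of User-Agent headers to block; extend with entries from your own logs.
BLOCKED_AGENTS = ["Bytespider"]

# One case-insensitive pattern, like `~*` in the NGINX `if` condition.
_blocked = re.compile("|".join(re.escape(a) for a in BLOCKED_AGENTS), re.IGNORECASE)


def status_for(user_agent: str) -> int:
    """Return the HTTP status this toy filter would answer with."""
    return 403 if _blocked.search(user_agent or "") else 200


# Made-up sample headers for a quick check.
print(status_for("Mozilla/5.0 (compatible; Bytespider)"))       # 403
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox"))    # 200
```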

There is either the hard way (as with NGINX blocking certain UAs) or the diplomatic way, which could lead to creating a catalogue of websites which do not want to participate in the AI feeding process for free. I think there are many more content creators who would like to get a piece of the AI birthday cake…

How to build a computer inside a computer?

Ever wondered how a computer is built? And no, I’m not talking about unscrewing your laptop… but about how exactly things happen inside the CPU. If so, then check out TINA from Texas Instruments and open my custom-made all-in-one computer.

I spent a few weeks preparing this schematic. It contains a clock, program counter, memory address register, RAM, ALU, A and B registers, instruction register and microcode decoder. Well, that’s a lot of stuff you need to build a computer with 8-bit data and 4-bit addresses, even in a simulator.

A sample program in my assembly, together with its binary representation, needs to be entered manually into memory in the simulator, as there is no input or output device designed for this machine. You need to program directly into memory and read the results directly from the same memory, but from a different region.
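To get a feeling for what such a machine does on every clock cycle, here is a tiny software model of an 8-bit data, 4-bit address computer. Everything about the instruction set is an assumption: the opcodes below (LDA, ADD, SUB, STA, JMP, JZ, HLT) are a hypothetical encoding with the opcode in the upper four bits and the address in the lower four, not the microcode of the TINA schematic.

```python
# Hypothetical opcodes (upper 4 bits of each instruction byte).
LDA, ADD, SUB, STA, JMP, JZ, HLT = 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0xF


def run(memory, max_steps=1000):
    """Execute a 16-byte memory image and return the final memory contents."""
    mem = list(memory) + [0] * (16 - len(memory))
    pc, a = 0, 0  # program counter and A register
    for _ in range(max_steps):
        instruction = mem[pc]                     # fetch
        opcode = instruction >> 4                 # decode: upper nibble
        addr = instruction & 0x0F                 # decode: lower nibble
        pc = (pc + 1) & 0x0F                      # 4-bit program counter
        if opcode == LDA:
            a = mem[addr]
        elif opcode == ADD:
            a = (a + mem[addr]) & 0xFF            # 8-bit wrap-around
        elif opcode == SUB:
            a = (a - mem[addr]) & 0xFF
        elif opcode == STA:
            mem[addr] = a
        elif opcode == JMP:
            pc = addr
        elif opcode == JZ:
            if a == 0:                            # conditional jump on zero
                pc = addr
        elif opcode == HLT:
            break
        # Any other opcode is treated as a no-op.
    return mem


# LDA 14; ADD 15; STA 13; HLT, with data 5 and 7 at addresses 14 and 15.
program = [0x1E, 0x2F, 0x4D, 0xF0] + [0] * 9 + [0, 5, 7]
print(run(program)[13])  # 12
```

As in the simulator, the program and its data share one memory, and the result is read back from a different address of that same memory.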

Here is a general overview:

Clock, program counter, memory address register:

Memory address register, RAM:

ALU (arithmetic logic unit):

A register with bus:

Microcode area:

Comparator to achieve conditional jumps:

BLOOM LLM: how to use?

When asked “what is love?”, BLOOM-560M replies with “The woman who had my first kiss in my life had no idea that I was a man”. wtf?!

Intro

I’ve been into parallel computing since 2021, playing with OpenCL (you can read about it here), looking to maximize device capabilities. I’ve got pretty decent in-depth knowledge of how the computational process works on GPUs and I’m curious how the most recent AI/ML/LLM technology works. And here you have my little introduction to the LLM topic from a practical point of view.

Course of Action

  • BLOOM overview
  • vLLM
  • Transformers
  • Microsoft Azure NV VM
  • What’s next?

What is BLOOM?

It is the BigScience Large Open-science Open-access Multilingual language model. It is based on the transformer deep-learning concept, where text is converted into tokens and then into vectors via lookup tables. Deep learning itself is a machine learning method based on neural networks, where you train artificial neurons. BLOOM is free and was created by over 1000 researchers. It has been trained on about 1.6 TB of pre-processed multilingual text.

There are a few variants of this model: 176 billion parameters (called just BLOOM), but also BLOOM 1b7 with 1.7 billion parameters. There is even BLOOM 560M:

  • to load and run 176B you need about 700 GB of VRAM with FP32 and half of that with FP16
  • to load and run 1B7 you need somewhere between 10 and 12 GB of VRAM and half of that with FP16

So in order to use my NVIDIA GeForce RTX 3050 Ti with 4 GB of VRAM I would either need to run BLOOM 560M, which requires 2 to 3 GB of VRAM and even below 2 GB of VRAM when using FP16 mixed precision, or… use the CPU. In that case 176B requires 700 GB of RAM, 1B7 requires 12 – 16 GB of RAM and 560M requires 8 – 10 GB of RAM.

Are those solid numbers? Let’s find out!
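A quick back-of-the-envelope check is to multiply the number of parameters by the bytes per parameter. The sketch below does only that and uses the nominal parameter counts from the model names, which is an approximation. It covers the weights alone; activations, caches and framework overhead come on top, which is why practical requirements are higher than these floor values.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Nominal parameter counts taken from the model names (approximate).
MODELS = {"bloom": 176e9, "bloom-1b7": 1.7e9, "bloom-560m": 560e6}


def weights_gb(params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in MODELS.items():
    fp32 = weights_gb(params, "fp32")
    fp16 = weights_gb(params, "fp16")
    print(f"{name}: {fp32:.1f} GB in FP32, {fp16:.1f} GB in FP16")
```

For the biggest model this gives roughly 704 GB in FP32 and 352 GB in FP16, while the 560M variant needs a bit over 2 GB in FP32 for its weights.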

vLLM

“vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.”

“A high-throughput and memory-efficient inference and serving engine for LLMs”

You can download language models (from Hugging Face, a company created in 2016 in the USA) and serve them with these few steps:

pip install vllm
vllm serve "bigscience/bloom"

And then once it’s started (and to be honest it won’t start just like that…):

curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bigscience/bloom",
		"messages": [
			{"role": "user", "content": "Hello!"}
		]
	}'

You can back your vLLM runtime with a GPU or CPU, but also with ROCm, OpenVINO, Neuron, TPU and XPU. It requires GPU compute capability 7.0 or higher. My RTX 3050 Ti has 8.6, but my Tesla K20Xm with 6 GB of VRAM has only 3.5, so it will not be able to use it.

Here is the Python program:

from vllm import LLM, SamplingParams

model_name = "bigscience/bloom-560m"
# Cap GPU memory use at 60% and allow offloading to CPU RAM and swap space.
llm = LLM(model=model_name, gpu_memory_utilization=0.6, cpu_offload_gb=4, swap_space=2)
question = "What is love?"
sampling_params = SamplingParams(
    temperature=0.5,
    max_tokens=10,
)
output = llm.generate([question], sampling_params)
print(output[0].outputs[0].text)

In return, there is either:

[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 736.00 MiB. GPU 0 has a total capacity of 3.81 GiB of which 73.00 MiB is free. Including non-PyTorch memory, this process has 3.73 GiB memory in use. Of the allocated memory 3.56 GiB is allocated by PyTorch, and 69.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

or the following:

No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

I may try it later on a bigger GPU, but for now I will try to run it using the transformers library, which is the next topic.

Transformers

So I picked the same BLOOM 560M model. First, you need to install the following main packages and plenty of dependencies:

pip install transformers
pip install torch
pip install accelerate

The source code of the Python program using those libraries is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Download (on first run) and load the tokenizer and the model weights.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map="auto",   # place weights on the GPU if available (needs accelerate)
    torch_dtype="auto"   # use the dtype stored in the checkpoint
)

def ask_bloom(question, max_length=100, temperature=0.7):
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output = model.generate(
        inputs['input_ids'],
        max_length=max_length,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True   # sample instead of greedy decoding
    )
    # Decode the generated token ids back to text (the prompt is included).
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer

question = "What is love?"
answer = ask_bloom(question)
print(f"Q: {question}\nAnswer: {answer}")

To run:

python3 transformers-torch.py

It will download the model and run it within the Python program.

We can see that the memory of my NVIDIA GeForce 940MX with 2 GB of VRAM is around 60% full, with over 90% GPU utilization (results from nvidia-smi).

“What is love?”. The answer is as follows:

What is love? She never asked me the question but I am the one who has to give my answer. She is a beautiful, beautiful, very beautiful woman. The first thing you know about love is that it is the most complicated thing in the world. I was so shocked when I saw the man I thought I was marrying. My life was over. The woman who had my first kiss in my life had no idea that I was a man. She was so shocked. She said something

Next I asked for “The sun is shining and?”

I love this scene and I like when it comes to it. The sun, the sky and all the other stars are bright and I love that! I really enjoy this scene and it sets the scene for the rest of the story. It seems like the characters are just going about their day and the sun is shining. The world seems to be in its perfect place and everything is beautiful. I love the lighting in this scene and the warmth of the sunlight that

Does it make any sense? What is the concept of “sense” at all? Anyway, it works, somehow. Let’s find out the other possibilities.

Microsoft Azure N-series virtual machines

Instead of buying an MSI Vector, ASUS ROG, Lenovo Legion Pro, MSI Raider or any other ultimate gaming laptop, you can go to Azure and pick one of their NV virtual machines, especially since they have 14 and 28 GB of VRAM onboard. It costs around 400 Euro per month, but you will not be using it all the time (I suppose).

We have:

root@z92-az-bloom:/home/adminadmin# lspci 
0002:00:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 [Instinct MI25 MxGPU/MI25x2 MxGPU/V340 MxGPU/V340L MxGPU]

And I was not so sure how to use an AMD GPU, so instead I decided to request a quota increase:

However, that request got rejected on my account:

Unfortunately, changing parameters and virtual machine types did not change the situation. I still got rejected and needed to submit a support ticket to Microsoft in order to have it processed manually. So until next time!

What’s next to check?

AWS g6 and Hetzner GEX44. Keep reading!

Big-tech cloud vs competitors: price-wise

Imagine you would like to run a virtual machine with 2 vCPUs and 4 GB of RAM. Which service provider do you choose? Azure, AWS or Hetzner?

With AWS you pay 65 USD (c5.large instance… and by the way, why is this called large at all?). If you pick Microsoft Azure you pay 36 Euro (B2s). If you pick DigitalOcean you pay 24 USD (no-name “droplet”). Choosing Scaleway you pay 19 Euro (PLAY2-NANO compute instance in the Warsaw DC). However, with Hetzner Cloud you pay as little as 4.51 Euro (CX22 virtual server). How is that even possible? So it goes like this (converted from USD to Euro):

  • AWS: 59 Euro
  • Azure: 36 Euro
  • DigitalOcean: 22 Euro
  • Scaleway: 19 Euro
  • Hetzner Cloud: less than 5 Euro
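To make the gap easier to see, the prices from the list can be expressed as multiples of the cheapest offer. This small Python sketch uses only the Euro figures quoted above, taking 4.51 Euro for Hetzner Cloud; the dictionary and function names are made up for this illustration.

```python
# Prices in Euro as quoted in the list above.
PRICE_EUR = {
    "AWS": 59,
    "Azure": 36,
    "DigitalOcean": 22,
    "Scaleway": 19,
    "Hetzner Cloud": 4.51,
}


def multiples(prices, baseline="Hetzner Cloud"):
    """Express every price as a multiple of the baseline provider's price."""
    base = prices[baseline]
    return {name: round(price / base, 1) for name, price in prices.items()}


for name, factor in multiples(PRICE_EUR).items():
    print(f"{name}: {factor}x")
```

With these inputs AWS comes out at roughly 13 times the Hetzner price and Azure at about 8 times.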

With different vCPU and RAM offers the proportion stays similar, especially in Microsoft Azure. Both Azure and AWS belong to big-tech companies which make billions of dollars on this offer. Companies like DigitalOcean, Scaleway and Hetzner cannot be called big-tech, because their businesses are much narrower and their platforms do not offer as many features, contrary to Azure and AWS which have hundreds of features available. Keep in mind that this is a very synthetic comparison, but if you go with that specific use case scenario you will see the difference.

Every platform has its strengths. AWS was among the first on the market. Microsoft Azure has its gigantic platform and is the most recognizable. DigitalOcean, Scaleway, Hetzner and many more (like Rackspace, for instance) are the most popular and reliable among the non-big-tech dedicated server and cloud service providers. I personally especially like services from Hetzner, not only because of their prices but also their excellent customer service, which is hard to get from Microsoft or Amazon. If you want a somewhat more personal approach, then go for non-big-tech solutions.

Important notice: any recommendations here are my personal opinion and have not been backed by any of those providers.

Who’s got the biggest load average?

Ever wondered what the highest load average on a Unix-like system can be? Do we even know what this parameter tells us? It shows the average number of processes that are either actively running or waiting. Ideally it stays at or below the number of logical processors present on the system; if it is greater than that, some things will need to wait in order to be executed.
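A simple way to relate the load average to the number of processors is to divide one by the other. Here is a minimal sketch using Python’s standard `os.getloadavg()` and `os.cpu_count()` (Unix-like systems only); the helper name `load_per_cpu` is made up.

```python
import os


def load_per_cpu():
    """Return the 1, 5 and 15 minute load averages divided by logical CPU count."""
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return {"1min": one / cpus, "5min": five / cpus, "15min": fifteen / cpus}


for window, value in load_per_cpu().items():
    # Values above 1.0 mean more runnable work than there are processors.
    print(f"{window}: {value:.2f} per logical CPU")
```

For comparison, a 1 minute load average of 5719 on 24 logical processors works out to about 238 per CPU.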

So I was testing 1000 LXC containers on a 2 x 6 core Xeon system (totalling 24 logical processors) and left it for a while. Once I got back I saw that something was wrong with system responsiveness.

And my load average was 1 min: 5719, 5 min: 2642, 15 min: 1707. I think this is the highest I have ever seen on systems under my supervision. What is interesting is that the system was not totally unresponsive, rather a little sluggish. The Proxmox UI recorded load up to somewhere around 100, which should be a quite okay value. But then it skyrocketed and Proxmox lost its ability to keep track of it.

I managed to log in to the system and at that moment the load average was already at 1368/2030/1582, which is way less than a few minutes before. I tried to cancel the top command and reboot, but even such a trivial operation was too much at that time.

Once I managed to initiate a system restart, it started to shut down all those 1000 LXC containers present on the system. It took somewhere around 20 minutes to shut everything down and proceed with the reboot.

Make Opera browser great again!

Get rid of ads and stop sending your data for free to Opera

Little history

Opera was my favourite browser for so many years. Between 2000 and 2005 it was adware showing, well… ads. In 2005 the ads were removed as financing came from Google, Opera’s default search engine. In 2013 Opera dropped its own rendering engine in favor of Chromium. In 2023 Opera got some AI features.

What is it all about?

I still like Opera.

It has this great multi-workspace feature, a battery saving mode, and in general it is much more capable of running plenty of tabs compared to other major browsers like Firefox or Chrome. However…

Opera has tons of “features” like shopping, Booking.com, promotional offers, AI services etc. Most of those features, including wallet, address data, spelling and payment options are enabled by default. Image how much data you share with Opera this way. Image how many of features can be used against you. Fortunately you can disable all of these, which makes Opera the great browser again.

Start with blank configuration page

Click on the Opera logo and select Settings. You will go to the configuration page, in which you find multiple sections like Basic, Advanced, Privacy & security, Features and Browser. Remember that in some cases the configuration page navigation is not linear.

Privacy & Security

In this section you find settings which concern suggestions and diagnostics, but here you can also find promotional notifications and promotional Speed Dials, bookmarks and campaigns. As you can see, it is mixed. This is the main issue with Opera settings: they are mixed to confuse you, so that you cannot tell whether you have already disabled all unwanted features.

In this section you can find:

  • Improve search suggestions
  • Automatically send crash reports to Opera
  • Fetch images for suggested sources in News, based on history
  • Display promotional notifications
  • Receive promotional Speed Dials, bookmarks, and campaigns

Change search engine to non-big-tech

Instead of using Google and feeding big tech with loads of your search data, you can use DuckDuckGo as your primary engine. Fortunately there is an option to change the default search engine.

In this section you can find:

  • Search engine used in the address bar – set to DuckDuckGo 🙂

Password manager

I prefer to manually enter passwords, which I keep in a secure, encrypted place where I know how they are secured. Saving passwords in any other form could be dangerous, as you do not know to whom you give those passwords and in what form. And there are several examples of similar tools that have been hacked in the past.

In this section you can find:

  • Offer to save passwords
  • Sign in automatically

Payment methods

I think that anything (that is, any information) remembered or saved about my person, location or browsing habits could potentially be monetized by companies like Opera or Google which offer browsers. You may say that things like payment types or passwords are probably saved locally. Maybe, but how about future upgrades? Will someone give me a guarantee about this? I’m not so sure.

In this section you can find:

  • Save and fill payment methods
  • Allow sites to check if you have payment method saved

Address forms

In the case of form data it is more about malicious websites stealing data than about Opera as such. There are known attacks which rely on hidden form elements that get auto-filled even though you cannot see them. Keeping this option “on” may lead to similar issues in the future. And actually it does not matter whether Opera is vulnerable to this kind of “attack” today or not; it is all about the approach.

In this section you can find:

  • Save and fill addresses

Crypto wallet

If you own some crypto, you may wonder if this option is a safe place for your crypto wallets. I am not so sure about this. As far as I remember, it is all about having some private key. So keep your private key private. Keeping any keys or IDs in such a place is not a good idea from my perspective. You may see this differently and keep using it, but this is my opinion.

In this section you can find:

  • Enable Wallet Selector

AI services

There is nothing bad about having AI features in a browser. I do not see any major issues with this one, as I do not think that Opera would send all the traffic and data to those machine learning pipelines. So with that crossed out, you may only need to think about your battery life as you enable more and more features. Please note that I did not conduct any tests, so it is only my opinion about this one.

In this section you can find:

  • Aria in the sidebar
  • AI Prompts in text highlight popup

My Flow

My files on my computer and phone at the same time? It sounds like sending my data outside of my device. I would not do this: I do not use OneDrive or Dropbox, and as soon as I find such software on my device it is immediately uninstalled. If I want to send some files somewhere else, I send them myself and on my own terms. You may choose differently; this is my approach, the secure way.

In this section you can find:

  • Enable My Flow
  • Enable Pinboards

Start page

So here you have suggestions, which are based on your data. You have Booking.com options. It should be self-explanatory that these are commercial contracts based either on data or on affiliation, which may still identify you as a person making a purchase somewhere else.

In this section you can find:

  • Hide search box
  • Hide Speed Dial
  • Show Continue Shopping section
  • Show Continue on Booking.com section
  • Show weather widget

Spell check

This feature itself is not harmful, but it consumes battery. You may leave it enabled if you want.

In this section you can find:

  • Check for spelling errors when you type text on web page

Social media

Messenger and WhatsApp are the most popular ways of communicating nowadays, but having Telegram here… well, I have heard that there are some issues with it, so be sure you know what you are doing. WhatsApp works just fine. Messenger is just a little bit less crippled than the whole Facebook thing.

In this section you can find:

  • You can disable Telegram 🙂

What’s next?

With the aforementioned adjustments you can start using Opera in a way that is more secure than with its default settings, which are excessive but still somehow understandable. Opera is a commercial company which would like to make money, and it makes money through various channels like ads, affiliations, “by-defaulting” things, data/diagnostics, and features included as services. With just a little time spent on this configuration you get a great and efficient workspace. I think it is worth spending this time.