IP2Location – complete IPv4 ranges ingestion

IP2Location is an IP address database where you can find the latest IP-to-location associations. The complete IPv4 range is 4 294 967 296 addresses, i.e. the full 32-bit address space. IP2Location contains 4 291 944 710 addresses, which is slightly less. However, as many as 608 487 295 addresses come with no location set. This is because of the following special ranges:

  • 0.0.0.0/8 (“this network”)
  • 10.0.0.0/8 (private, class A)
  • 100.64.0.0/10 (shared address space)
  • 127.0.0.0/8 (loopback)
  • 169.254.0.0/16 (link-local)
  • 172.16.0.0/12 (private, class B)
  • 192.0.0.0/24 (IETF protocol assignments)
  • 192.0.2.0/24 (documentation, TEST-NET-1)
  • 192.88.99.0/24 (reserved, former 6to4 relay)
  • 192.168.0.0/16 (private, class C)
  • 198.18.0.0/15 (benchmarking)
  • 198.51.100.0/24 (documentation, TEST-NET-2)
  • 203.0.113.0/24 (documentation, TEST-NET-3)
  • 224.0.0.0/4 (multicast, class D)
  • 233.252.0.0/24 (documentation, MCAST-TEST-NET)
  • 240.0.0.0/4 (reserved, class E)
  • 255.255.255.255/32 (limited broadcast)

The ranges above should not have a location set, as they are special-purpose IPv4 ranges. So the count of IPv4 addresses valid for commercial (and also non-commercial) use should be somewhere near 3 683 457 415, which is the database's address count minus the addresses without a location. IPv4, developed under DARPA, entered use in 1981, and the IANA free pool was exhausted in 2011.
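
The arithmetic above is easy to sanity-check; the constants are the figures quoted in the text:

```python
# Figures from the text: addresses in the IP2Location database
# and the count of addresses that come with no location set.
db_addresses = 4_291_944_710
no_location = 608_487_295

usable = db_addresses - no_location
print(usable)  # 3683457415
```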

How does this apply to Cassandra databases?

The complete set of roughly 3.6 billion addresses weighs around 40 GB of data on a single Apache Cassandra 5.0 node. Using commodity hardware with an Intel i5-10200H and a Western Digital SN530 NVMe drive, we can get up to 29k inserts per second. Doing the math, the complete ingestion should finish within about 35 hours. However, if we put multiple Cassandra nodes in place to split the writes, it would most probably be much faster. Let's say we ran at 100k/s; the ingestion time would then be about 10 hours. At 1M/s it would run for only about 1 hour.
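
The back-of-the-envelope timing can be reproduced like this (29k inserts/s is the single-node rate measured above; the multi-node figures assume idealized linear scaling):

```python
addresses = 3_683_457_415  # usable IPv4 addresses from the previous section
rate = 29_000              # measured inserts per second on a single node

hours = addresses / rate / 3600
print(round(hours, 1))  # 35.3 hours on a single node

# Hypothetical aggregate write rates across multiple nodes:
for r in (100_000, 1_000_000):
    print(r, round(addresses / r / 3600, 1))
```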

PostgreSQL manual partitioning

Have you ever wondered how many tables we can create and use in a PostgreSQL database server? Shall we call them partitions or shards? Why not use the built-in “automatic” partitioning?

Partitions or shards?

Let's first define the difference between partitions and shards. Partitions are placed on the same server, while shards can be spread across various machines. We can use inheritance or the more recent “automatic” declarative partitioning. However, both of these solutions lead to tight coupling with the PostgreSQL RDBMS, which in some situations we would like to avoid. Imagine the prospect of migrating our schemas to a different RDBMS like Microsoft SQL Server. Not using any vendor-specific syntax and configuration would be beneficial.

Vendor agnostic partitions

So instead, we can just try to create partition-like tables manually:

sudo apt install postgresql -y
sudo -u postgres -i
psql
CREATE DATABASE partitions;
exit

Then, after installing PostgreSQL and creating new database:

for i in `seq 1 10000`; 
do 
  echo $i;
  psql -c "create table demo_$i (id int, val int);" partitions; 
done

This way we created 10 000 tables with just generic SQL syntax, which is 100% compatible with all other RDBMSes. More importantly, we do not rely on shared memory configuration or the limits that come from attaching too many partitions to a main table.
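
Since the database no longer knows these tables belong together, the routing logic has to live in the application. A minimal sketch, assuming simple modulo routing over the demo_$i tables created above (the function name is hypothetical):

```python
def partition_for(key: int, partitions: int = 10_000) -> str:
    """Route a record to one of the manually created tables demo_1..demo_10000."""
    return f"demo_{key % partitions + 1}"

print(partition_for(0))      # demo_1
print(partition_for(12345))  # demo_2346
```

The application then interpolates the table name into its queries; the SQL itself stays completely vendor-agnostic.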

How many regular partitions can I have?

In the case of PostgreSQL (regular partitions), if we attach too many tables, we can quickly start to notice it negatively in terms of performance and memory consumption. So if you would like to use PostgreSQL “automatic” partitioning, keep in mind not to attach too many tables. How many is too many? I started noticing it after attaching just 100 – 200 tables, so for small/medium deployments that should be our upper bound.

How big can my data be?

In terms of how big a single PostgreSQL node can be, I would say that 5 – 10 TB of data, with tables reaching (including TOAST) 2 TB, is a fairly normal situation and regular hardware will handle it. If you have 512 GB of RAM on the server, then buffers and cache will be sufficient to operate on such huge tables.

How many tables can I create in a single node?

As mentioned before, you are restricted by storage, memory and CPU – as always. However, you should also monitor the inode count as well as the file descriptor count in the system, because these separate tables may be put in different files, and this matters even more if we place lengthy data in records, which goes into TOAST. That said, using regular tables as partitions is the most denormalized way of achieving the goal of dividing our data physically.

I can tell that 50 000 tables in a single node is just fine, even on a small/mid system.

But what is the actual limit? I think the only practical limit comes from hardware and operating system constraints. On an Ubuntu 22 LXC container with an 8 GB drive, 1 vCPU and 512 MB of memory, we have 524k inodes available. After adding 50k tables, inode usage increased to 77 126 entries, which is 15% of the total available.

postgres@z10-test:~$ df -i
Filesystem                        Inodes IUsed   IFree IUse% Mounted on
/dev/mapper/pve-vm--131--disk--0  524288 77126  447162   15% /
none                             4070043    25 4070018    1% /dev
tmpfs                            4070043     2 4070041    1% /dev/shm
tmpfs                             819200   149  819051    1% /run
tmpfs                            4070043     2 4070041    1% /run/lock
tmpfs                             814008    11  813997    1% /run/user/110

I think that, at least from the inodes perspective, we are on the safe side, even with 50k tables.
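
The percentage quoted above checks out against the df -i output:

```python
iused, inodes = 77_126, 524_288  # IUsed and Inodes for / from df -i above
print(round(iused / inodes * 100))  # 15
```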

How to design data architecture then?

Now let's imagine a real-world scenario of a system comprising 1 million customers. With that number I would recommend having multiple nodes in different locations to decrease latency (in the case of global services). The application architecture would also need to be distributed. So we would use both partitioning within one node and sharding per se. On the other hand, we might stick with a single shard holding all customers on a single node, with the actual work done within Cassandra nodes and not the RDBMS.

Load OSM map files into PostgreSQL

You can use OpenStreetMap on your own hardware. You need to grab the map files first, which can be found at https://download.geofabrik.de. Once you have downloaded them, install PostgreSQL and enable a few extensions:

CREATE DATABASE yourdbname;
\c yourdbname
CREATE EXTENSION postgis;
CREATE EXTENSION hstore;

These may be useful at some point later on, especially if you would like to try to build your own tile server. Next you need to install osm2pgsql, which can be found in the system packages on Ubuntu 22. Then:

osm2pgsql -c -d yourdbname -U youruser -H yourdbhost somefile.osm.pbf -W

Now give it the database password and it will start loading. Depending on hardware, it might take from a few minutes to tens of hours in the case of large files like the Europe area. You should end up with something like this:

geography_columns
geometry_columns
planet_osm_line
planet_osm_point
planet_osm_polygon
planet_osm_roads
spatial_ref_sys 

These are tables with points, lines, polygons and roads. Most of the data is located in the PBF file itself, but a few other things are calculated on the fly while loading. There are plenty of options for such data. You can use it in desktop software like QGIS (which can load the PBF file directly) or set up a complete tile server similar to the one behind the official OSM website.

Ruby on Rails vs PostgreSQL 14 partitioning primary keys

Starting from PostgreSQL 10, a new type of partitioning is available: declarative partitioning. Before, we had inheritance, which is also good but has some limitations. Now, with declarative partitioning, the user does not need to specify the partition name while inserting data. To create a partitioned table:

CREATE TABLE tablename 
( 
  id BIGSERIAL NOT NULL, 
  created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  CONSTRAINT tablename_pkey PRIMARY KEY (id, created_at)
) PARTITION BY RANGE(created_at);

There are a few things worth explaining. The first one is the BIGSERIAL data type. It is a bigger form of the integer type with automatic sequence creation, which is very convenient. The second thing is PARTITION BY RANGE(columnname). It defines that the data will be spread across partitions based on the created_at date range. It positively affects query performance.

To create a partition:

CREATE TABLE tablename_2022_09 PARTITION OF tablename
    FOR VALUES FROM ('2022-09-01 00:00:00') TO ('2022-10-01 00:00:00');

Note that the upper bound of a range partition is exclusive, so using the first moment of the next month avoids leaving a gap at the end of the month.
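
Creating one partition per month by hand gets tedious; a small DDL generator (a hypothetical helper, names follow the example above) keeps the bounds exclusive and gap-free:

```python
from datetime import date

def month_partition_ddl(table: str, year: int, month: int) -> str:
    """Generate DDL for a monthly range partition with an exclusive upper bound."""
    start = date(year, month, 1)
    # The exclusive upper bound is the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (
        f"CREATE TABLE {table}_{start:%Y_%m} PARTITION OF {table}\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )

print(month_partition_ddl("tablename", 2022, 9))
```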

With table inheritance I’ve used partitioning with over 1000 tables. Generally speaking, PostgreSQL, even in older versions, could easily handle 100 000 tables on a single server with a data volume around 10 TB. However, with declarative partitioning you need to remember that when executing a query on the master table there will be a shared lock on each and every partition, and that costs memory. Although it might be a slight simplification, you need to increase the max_locks_per_transaction parameter to fit the number of partitions created and attached. If you do not need online access to some old data, or you created future partitions upfront, then either do not attach them or just detach them if unused.

The last and most important thing relates to the multi-column primary key. Ruby on Rails and the ActiveRecord library do not support it by default. You could use some gem to overcome this limitation, but all you actually need is to point at the specific column in the model definition:

self.primary_key = :id

It works just fine.

pg_stat_statements must be loaded via shared_preload_libraries

The latest ClusterControl – version 1.9.4 – installation somehow requires pg_stat_statements to be activated in PostgreSQL 14. I had never noticed it before, on 9.6, 12 or 14. This is a simple fix. Change the postgresql.conf configuration file (which should be either in /etc or /var/lib/ in standard package installations):

shared_preload_libraries = 'pg_stat_statements'

Then restart the server and activate the extension in psql:

CREATE EXTENSION pg_stat_statements;

and you are done.

Installing PostgreSQL 14 on CentOS 7

PostgreSQL 14 has a few improvements over previous versions, especially older ones like 9.6. This includes automatic routing of data using partitioning. So, let’s install it on CentOS 7.

yum -y install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install -y postgresql14-server postgresql14
/usr/pgsql-14/bin/postgresql-14-setup initdb
systemctl enable postgresql-14
service postgresql-14 start

Then you should log in to the database and set a password:

sudo -u postgres -i
psql
\password

Basic configuration includes firewall setup (if it’s active), access control and performance settings:

/var/lib/pgsql/14/data/pg_hba.conf
/var/lib/pgsql/14/data/postgresql.auto.conf
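
For access control, pg_hba.conf could get a line like the following (the address range here is just an assumption, adjust it to your network; scram-sha-256 is the PostgreSQL 14 default password method):

```
host    all    all    192.168.0.0/16    scram-sha-256
```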

I strongly recommend visiting pgtune.leopard.in.ua for tuning parameters. The last thing is to set the interface address on which the server will listen; it’s set in:

/var/lib/pgsql/14/data/postgresql.conf
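
For example, to listen on all interfaces (an example only, narrow it down in production):

```
listen_addresses = '*'
```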

Now you are good to go.