Cassandra -NoSQL database

From berki WIKI
Revision as of 10:15, 7 October 2018 by Adam (talk | contribs) (Bevezető)

Jump to: navigation, search
ClipCapIt-180930-152752.PNG

Fontos fogalmak

Segéd fogalmak

Referential integrity

Referential integrity is a property of data stating references within it are valid. In the context of relational databases, it requires every value of one attribute (column) of a relation (table) to exist as a value of another attribute (column) in a different (or the same) relation (table).[1]

For referential integrity to hold in a relational database, any column in a base table that is declared a foreign key can contain either a null value, or only values from a parent table's primary key or a candidate key. In other words, when a foreign key value is used it must reference a valid, existing primary key in the parent table. For instance, deleting a record that contains a value referred to by a foreign key in another table would break referential integrity.


ACID

Atomicity, Consistency, Isolation, Durability

Atomicity (database systems)

In database systems, atomicity is one of the ACID (Atomicity, Consistency, Isolation, Durability) transaction properties. An atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs.[1] A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright

Consistency

Consistency in database systems refers to the requirement that any given database transaction must change affected data only in allowed ways. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.

Isolation

Isolation is typically defined at database level as a property that defines how/when the changes made by one operation become visible to other.

Durability

In database systems, durability is the ACID property which guarantees that transactions that have committed will survive permanently.


Normal forms

Codd introduced the concept of normalization and what is now known as the first normal form (1NF) in 1970.[4] Codd went on to define the second normal form (2NF) and third normal form (3NF) in 1971,[5] and Codd and Raymond F. Boyce defined the Boyce-Codd normal form (BCNF) in 1974.[6]

Informally, a relational database relation is often described as "normalized" if it meets third normal form.[7] Most 3NF relations are free of insertion, update, and deletion anomalies.

First normal form

First normal form is an essential property of a relation in a relational database. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal is a minimal requirement.

First normal form enforces these criteria:

  • Eliminate repeating groups in individual tables.
  • Create a separate table for each set of related data.
  • Identify each set of related data with a primary key


Second normal form

A relation that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a relation is in 2NF if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the relation. A non-prime attribute of a relation is an attribute that is not a part of any candidate key of the relation.

Put simply, a relation is in 2NF if it is in 1NF and every non-prime attribute of the relation is dependent on the whole of every candidate key.

Third normal form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that:-

  1. the entity is in second normal form
  2. all the attributes in a table are determined only by the candidate keys of that relation and not by any non-prime attributes.

3NF was designed to improve database processing while minimizing storage costs.


An example of a 2NF table that fails to meet the requirements of 3NF is:

Tournament Winners
Tournament Year Winner Winner Date of Birth
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

Tournament Winners
Tournament Year Winner
Indiana Invitational 1998 Al Fredrickson
Cleveland Open 1999 Bob Albertson
Des Moines Masters 1999 Al Fredrickson
Indiana Invitational 1999 Chip Masterson
Winner Dates of Birth
Winner Date of Birth
Chip Masterson 14 March 1977
Al Fredrickson 21 July 1975
Bob Albertson 28 September 1968

Update anomalies cannot occur in these tables, because unlike before, Winner is now a primary key in the second table, thus allowing only one value for Date of Birth for each Winner.



Database Normalization

Database normalization is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as an integral part of his relational model.



Denormalization

Bevezető

A Cassandra ugyan úgy mint az RDBMS elkülönített adatbázisokat használ, és azon belül táblákat hozhatunk létre, amiben sorok és oszlopok vannak.

  • Keyspace: ez felel meg egy adatbázisnak az RDBMS-ben, vagy egy Index-nek az Elasticsearch-ben
  • Column families: egy tábla az adatbázisban, amihez sorok tartoznak. Azonban az oszlopok fajtája soronként eltérhet.


eventually consistent

Telepítés

# docker run --name cassandra -p 9042:9042 -d cassandra

A parancssori Cassandra kliens része a telepítőnek, a bin mappában ül. Ezért indítunk még egy cassandra konténert, és abból fogunk csatlakozni a "cassandra" nevű konténerhez a 9042-es porton.

# docker run -it --rm cassandra /bin/bash


Majd indítsuk el a cqlsh programot: cqlsh host port

root@0ce39e38988a:/# cqlsh 192.168.0.101 9042
Connected to Test Cluster at create keyspace dev
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> 

INDEX vs Filter

Hatékonysági okokból alap esetben a Cassandra csak arra az oszlopra enged lekérdezni, amire van index.Csak akkor lehet index nélküli oszlopra hivatkozni a WHERE kifejezésben ha ezt implicit engedélyezzük az ALLOW FILTERING kulcsszóval a lezáró ; előtt. Ugyanis ha egy oszlopon nincs index, akkor a Cassandra az összes sort be fogja olvasni, és filterrel fogja kiválasztani WHERE-ben definiált mezőket. Milliós sorszám esetén ez már problémás lehet.

https://www.datastax.com/2012/01/getting-started-with-cassandra

Lekérdezések

https://www.datastax.com/2012/01/getting-started-with-cassandra

Keyspace és cluster

cqlsh:adam> DESCRIBE KEYSPACE adam;

CREATE KEYSPACE adam WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE adam.emp (
    empid int PRIMARY KEY,
    emp_dept text,
    emp_first text,
    emp_last text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
 ...


cqlsh:adam> DESCRIBE TABLE adam.emp;

CREATE TABLE adam.emp (
    empid int PRIMARY KEY,
    emp_dept text,
...


Időbélyegek

A timestamp és a TTL is egy mezőre vonatkozik, nem az egész sorra. Tehát minden egyes column pair-nek van egy TTL és egy timestamp metadata értéke is.

cqlsh:adam> SELECT first_name, last_name,
        ... writetime(last_name) FROM user;

 first_name | last_name | writetime(last_name)
------------+-----------+----------------------
       Mary | Rodriguez |     1538771050876617
       Bill |    Nguyen |     1538771031333072


Ha a TTL lejár, NULL-ra fogja állítani a mező értékét.

UPDATE user USING TTL 3600 SET last_name =
'McDonald' WHERE first_name = 'Mary' ;
cqlsh:adam> SELECT first_name, last_name, TTL(last_name)
        ... FROM user WHERE first_name = 'Mary';

 first_name | last_name | ttl(last_name)
------------+-----------+----------------
       Mary | Rodriguez |           3588

Adatok

Adatbázis létrehozása:

cqlsh> create keyspace adam with replication = {'class':'SimpleStrategy','replication_factor':1};


Tábla létrehozása:

cqlsh> use adam;
cqlsh:adam> create table emp (empid int primary key, emp_first varchar, emp_last varchar, emp_dept varchar);


Adat beszúrása:

cqlsh:adam> insert into emp (empid, emp_first, emp_last, emp_dept)  values (1,'fred','smith','eng');
WarningIcon.png

Warning
A primary key értékét beszúrás után soha többet nem lehet megváltoztatni, mivel az határozza meg, hogy melyik node-ra kerül a row


Lekérdezés index-ra:

cqlsh:adam> SELECT * FROM adam.emp;
 empid | emp_dept | emp_first | emp_last
-------+----------+-----------+----------
     1 |      eng |      fred |    smith
     2 |      eng |      fred |    smith


Lekérdezés filterrel:

cqlsh:adam> SELECT * FROM adam.emp WHERE emp_last = 'smith' ALLOW FILTERING;

 empid | emp_dept | emp_first | emp_last
-------+----------+-----------+----------
     1 |      eng |      fred |    smith
     2 |      eng |      fred |    smith

Sets

A set esetében mivel egy elem többször is előfordulhat, hatékonyabb mint a lista, ahol egy elem beszúrásához az egész listát végig kell olvasni.

cqlsh:adam> ALTER TABLE user ADD emails set<text>;


cqlsh:adam> UPDATE user SET emails = {
        ... 'mary@example.com', 'mary2@example.com' } WHERE first_name = 'Mary';
cqlsh:adam> SELECT emails FROM user WHERE first_name = 'Mary';

 emails
-------------------------------------------
 {'mary2@example.com', 'mary@example.com'}

(1 rows)


List

cqlsh:adam> ALTER TABLE user ADD
        ... phone_numbers list<text>;
cqlsh:adam> UPDATE user SET phone_numbers = [
        ... '1-800-999-9999' ] WHERE first_name = 'Mary';
cqlsh:adam> SELECT phone_numbers FROM user WHERE
        ... first_name = 'Mary';

 phone_numbers
--------------------
 ['1-800-999-9999']
UPDATE user SET phone_numbers[0] =
'480-111-1111' WHERE first_name = 'Mary';


Map

cqlsh:adam> ALTER TABLE user ADD
        ... login_sessions map<timeuuid, int>;
cqlsh:adam> UPDATE user SET login_sessions =
        ... { now(): 13, now(): 18} WHERE first_name = 'Mary';
cqlsh:adam> SELECT login_sessions FROM user WHERE
        ... first_name = 'Mary';

 login_sessions
--------------------------------------------------------------------------------------
 {601a2e20-c8e3-11e8-8684-6d2c86545d91: 13, 601a2e21-c8e3-11e8-8684-6d2c86545d91: 18}


User defined type

cqlsh:adam> CREATE TYPE address (street text, city text, state text, zip_code int);

cqlsh:adam> ALTER TABLE user ADD addresses map<text, frozen<address>>;

cqlsh:adam> UPDATE user SET addresses = addresses + {'home': { street: '7712 E. Broadway', city: 'Tucson',
        ... state: 'AZ', zip_code: 85715} } WHERE first_name = 'Mary';
cqlsh:adam> 
cqlsh:adam> SELECT addresses FROM user WHERE first_name = 'Mary';

 addresses
--------------------------------------------------------------------------------------
 {'home': {street: '7712 E. Broadway', city: 'Tucson', state: 'AZ', zip_code: 85715}}


Másodlagos INDEX-ek

A másodlagos indexekekkel az a baj, hogy lehet hogy több node-on lesznek szétszorva, mivel az elsődleges index határozza meg hogy melyik node-ra kerül az elem. Mivel több node-on is le kell futtatni a keresést, azért nagyon drága lehet a használata.
Mikor nem szabad másodlagos indexet használni:

  • Ha nem túl nagy az elemek kardinalitása. Ha szinte minden elem különbözik egymástól, akkor gyakorlatilag majd az összes node-ot végig kell járni.
  • Ha túl kicsi a kardinalitás. Az sem jó ha szinte minden egyes sorban ugyan az az érték van, túl sok lesz a találat.
  • Gyakran átírt mezőknél: Ha gyorsabban gyűlnek fel a tombstones-ok mint ahogy azokat a Cassandra fel tudná dolgozni, hibára fog futni az update egy idő után
CREATE INDEX ON user ( last_name );
ImportantIcon.png

Note
Másodlagos indexeket a set, list, map elemeire is létrehozhatunk, akár még a user-defined-type belsejében lévő elemekre is, még akkor is ha egy map belsejében annak


CREATE INDEX ON user ( emails );



Cassandara 3.4-től használhatjuk az Apple által kifejlesztet SASI másodlagos index implementációt, ami több funkcionalitást tesz lehetővé mint a beépített változat:

  • Kisebb, nagyobb -ra keresés
  • LIKE használata string típusú mezők esetén.


Mikor létrehozzuk a másodlagos indexet meg kell mondani, hogy egy custom implementációt szeretnénk használni.

CREATE CUSTOM INDEX user_last_name_sasi_idx ON user (last_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex';

Adatbázis GUI

Sokféle grafikus eszközzel csatlakozhatunk a Cassandra adatbázishoz. A legtöbben a RazorSQL-t ajánlották, ami fizetős: https://razorsql.com/

ClipCapIt-180930-154548.PNG