author Vincent Ambo <mail@tazj.in> 2023-01-18T21·49+0300
committer clbot <clbot@tvl.fyi> 2023-01-18T21·58+0000
commit 0dfe460fbb8cda0831fbcf4d9e42948c2bb88afa
tree 8a2cb0708df5450b5308a6dea5e57a4ccc344c0d
parent db26825eecacb22b60abebf2879bf1420493b8c5
docs(corp/data-import): document OpenRussian format r/5703
This is the second dataset I want to integrate as it contains some
more practically useful, but somewhat less structured, information.

Change-Id: Ib46b2597a33e76f59e030f889a0961ecc5a144eb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7873
Tested-by: BuildkiteCI
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: tazjin <tazjin@tvl.su>
-rw-r--r-- corp/russian/data-import/src/main.rs 57
1 files changed, 53 insertions, 4 deletions
diff --git a/corp/russian/data-import/src/main.rs b/corp/russian/data-import/src/main.rs
index 85e89a905b8c..21d4209991c5 100644
--- a/corp/russian/data-import/src/main.rs
+++ b/corp/russian/data-import/src/main.rs
@@ -1,10 +1,10 @@
-//! This program imports Russian language data from OpenCorpora
-//! ("Открытый корпус") into a SQLite database that can be used for
-//! [//corp/russian][corp-russian] projects.
+//! This program imports Russian language data from OpenCorpora
+//! ("Открытый корпус") and OpenRussian into a SQLite database that
+//! can be used for [//corp/russian][corp-russian] projects.
 //!
 //! [corp-russian]: https://at.tvl.fyi/?q=%2F%2Fcorp%2Frussian
 //!
-//! Ideally, running this on an OpenCorpora dump should yield a fully
+//! Ideally, running this on intact dumps should yield a fully
 //! functional SQLite database compatible with all other tools
 //! consuming it.
 //!
@@ -53,6 +53,55 @@
 //!
 //!   For example, a relationship `cardinal/ordinal` might be established
 //!   between the lemmas "два" and "второй".
+//!
+//! ## OpenRussian format
+//!
+//! The [OpenRussian](https://en.openrussian.org/dictionary) project
+//! lets users export its database as a set of CSV files. For our
+//! purposes, we download the files using `<tab>` separators.
+//!
+//! Whereas OpenCorpora opts for a flat structure with a "tag" system
+//! (through its flexible grammemes), OpenRussian has a fixed, predefined
+//! structure into which it sorts some words with their morphologies.
+//! The OpenRussian database is much smaller as of January 2023 (~1.7
+//! million words vs. >5 million for OpenCorpora), but some of the
+//! information is much more practically useful.
+//!
+//! Two very important bits of information OpenRussian has are accent
+//! marks (most tables containing actual words have a normal form
+//! containing an accent mark, and a "bare" form without) and
+//! translations into English and German.
+//!
+//! The full dump includes the following tables (and some more):
+//!
+//! * `words`: List of lemmas in the corpus, with various bits of
+//!   metadata as well as hand-written notes.
+//!
+//! * `adjectives`: Contains IDs for words that are adjectives.
+//!
+//! * `nouns`: IDs for words that are nouns, along with noun metadata
+//!   (e.g. gender, declinability).
+//!
+//! * `verbs`: IDs of words that are verbs, including their aspect and
+//!   "partnered" verb in the other aspect.
+//!
+//! * `words_forms`: Contains all morphed variants of the lemmas from
+//!   `words`, including information about their grammeme, and accent
+//!   marks.
+//!
+//! * `words_rels`: Contains relations between words, such as
+//!   "synonyms" or other general relations.
+//!
+//! * `translations`: Contains translations tagged by target language,
+//!   as well as examples and (occasionally) additional information.
+//!
+//! The following tables are also part of the dump, but have not been
+//! analysed yet:
+//!
+//! * `expressions_words`
+//! * `sentences`
+//! * `sentences_translations`
+//! * `sentences_words`
 
 use log::{error, info};
 use rusqlite::{Connection, Result};
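
As an illustration of the tab-separated format and the accented/"bare" pairing described in the new doc comment, here is a minimal sketch in Rust. It is not taken from the commit: the `WordRow` struct, the column order `id<tab>bare<tab>accented`, and the use of the combining acute accent (U+0301) as the accent marker are all assumptions made for the example, and the real export may differ.

```rust
/// Hypothetical row of the `words` table. The field names and their order
/// are assumptions for illustration, not the actual OpenRussian schema.
#[derive(Debug, PartialEq)]
struct WordRow {
    id: u64,
    bare: String,
    accented: String,
}

/// Derives a "bare" form by dropping every occurrence of the accent marker.
/// The marker is a parameter because the real dump's convention is not
/// confirmed here.
fn strip_accent(accented: &str, marker: char) -> String {
    accented.chars().filter(|c| *c != marker).collect()
}

/// Parses one tab-separated line with the assumed layout
/// `id<tab>bare<tab>accented`. Returns `None` if a field is missing or the
/// ID is not a number. Any further columns are ignored.
fn parse_word_row(line: &str) -> Option<WordRow> {
    let mut fields = line.split('\t');
    let id = fields.next()?.trim().parse().ok()?;
    let bare = fields.next()?.to_string();
    let accented = fields.next()?.to_string();
    Some(WordRow { id, bare, accented })
}

fn main() {
    // Made-up sample line; U+0301 follows the stressed vowel.
    let row = parse_word_row("1\tпривет\tприве\u{301}т").expect("well-formed row");
    // Stripping the marker from the accented form should give the bare form.
    assert_eq!(strip_accent(&row.accented, '\u{301}'), row.bare);
    println!("{:?}", row);
}
```

Keeping the marker as a parameter means the same helper can be reused once the dump's actual accent convention has been checked.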