path: root/corp/russian/data-import
Age | Commit message | Author | Files | Lines
2023-01-22 | r/5732 feat(corp/data-import): add import of OR 'translations' table | Vincent Ambo | 3 | -0/+70
The original dataset contains translations into different languages, but only the English ones are imported here. Note that translations are for lemmata only.
Change-Id: Ifb9c32c25fda44c38ad899efca9d205c520c0fa3
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7895
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
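The English-only filter described above could look roughly like the following. This is a minimal sketch, not the tool's actual code: the column names `word_id`, `lang` and `tl`, the language code `en`, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def english_translations(csv_file, lang_column="lang", wanted="en"):
    """Yield only the rows whose language column matches `wanted`.

    The column name and language code are hypothetical; adjust them to
    whatever the real export uses.
    """
    for row in csv.DictReader(csv_file):
        if row[lang_column] == wanted:
            yield row


# Made-up sample data standing in for the real 'translations' export.
sample = io.StringIO(
    "word_id,lang,tl\n"
    "1,en,house\n"
    "1,de,Haus\n"
    "2,en,to read\n"
)
rows = list(english_translations(sample))
```

Because each translation row points at a `word_id`, it attaches to a lemma rather than to an individual word form, which matches the note in the commit message.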
2023-01-21 | r/5730 feat(corp/data-import): add import of OR 'words_forms' table | Vincent Ambo | 3 | -6/+69
This is the full morphological set table for all the words from the lemmata table (which the dataset does not call by that name).
Change-Id: I6f5be673c5f59f11e36bd8c8c935844a7d4fd170
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7894
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-21 | r/5729 feat(corp/data-import): add import of OpenRussian 'words' table | Vincent Ambo | 6 | -30/+348
This is actually the lemmata table of this corpus, not the forms of all words (they're in a separate table).
Change-Id: I89a2c2817ccce840f47406fa2a636f4ed3f49154
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7893
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-21 | r/5728 chore(corp/data-import): make OR data archive available in env | Vincent Ambo | 1 | -8/+15
Change-Id: Idacf42743051eae0cf7010f952a4f91af17ad708
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7892
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 | r/5703 docs(corp/data-import): document OpenRussian format | Vincent Ambo | 1 | -4/+53
This is the second dataset I want to integrate as it contains some more practically useful, but somewhat less structured, information.
Change-Id: Ib46b2597a33e76f59e030f889a0961ecc5a144eb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7873
Tested-by: BuildkiteCI
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 | r/5702 chore(corp/data-import): namespace tables for OpenCorpora data | Vincent Ambo | 2 | -22/+22
I'm changing strategy: I will import both OC and another dataset before continuing to normalise the data, as normalisation might be easier to do as a set of table-constructing queries inside of SQLite with all raw data in place.
Change-Id: I26b41af80586fc1bfd8e26a6be20579068a82507
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7872
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
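To illustrate the idea of namespaced raw tables plus a table-constructing query, here is a small sketch. It assumes a prefix scheme (`oc_` for OpenCorpora, `or_` for OpenRussian) and invents the table names, columns and rows; none of these are taken from the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Raw tables, one namespace (prefix) per source dataset (hypothetical names).
    CREATE TABLE oc_lemmas (id INTEGER PRIMARY KEY, lemma TEXT NOT NULL);
    CREATE TABLE or_words  (id INTEGER PRIMARY KEY, bare  TEXT NOT NULL);

    INSERT INTO oc_lemmas VALUES (1, 'дом'), (2, 'читать');
    INSERT INTO or_words  VALUES (10, 'дом'), (11, 'книга');

    -- With both raw datasets in place, a normalised table can be built
    -- by a single query instead of by import-time code.
    CREATE TABLE lemmas_joined AS
        SELECT oc.id AS oc_id, ow.id AS or_id, oc.lemma
        FROM oc_lemmas AS oc
        JOIN or_words AS ow ON ow.bare = oc.lemma;
    """
)
joined = conn.execute("SELECT oc_id, or_id, lemma FROM lemmas_joined").fetchall()
```

The prefixes keep the two sources from colliding on generic names such as `words` or `lemmas`, so both can live in one database file while the normalised schema is still being worked out.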
2023-01-18 | r/5693 feat(corp/data-import): build morphology database in derivation | Vincent Ambo | 1 | -6/+10
This makes the actual imported database of the ~whole Russian language (all lemmas, grammemes, forms etc.) a Nix build target which is built in CI.
This still needs schema normalisation (it's fairly directly mapped to the raw data), but it's already starting to be a useful data set.
This also happens to be a pretty cool demonstration of the power of Nix. You can do `nix-build -A corp.russian.data-import.database` and out comes a perfectly valid SQLite database with a valid external data import!
Change-Id: I5d6d15e67d0e4a7ff590fad06252be34f5d561fd
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7866
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
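Once such a database has been built, it can be inspected like any other SQLite file. The helper below is a hypothetical sketch (the function name `list_tables` is not part of the tool) that opens a database read-only and lists its tables.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at `db_path`."""
    # mode=ro opens the file read-only, which suits build outputs that
    # must not be modified.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

A call such as `list_tables("result")` would apply if the `result` symlink created by `nix-build` points directly at the database file; that layout is an assumption, since the commit message does not say whether the build output is a file or a directory.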
2023-01-18 | r/5692 feat(corp/data-import): let users specify output path | Vincent Ambo | 1 | -6/+14
Change-Id: I61ad021c7a5318b099f3adc8bc6aedef65500974
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7865
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 | r/5691 feat(corp/data-import): parse and import links | Vincent Ambo | 2 | -3/+78
Change-Id: Iebdbc8f884f28064d7b00b8f8808b5030fa3d05c
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7864
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 | r/5690 feat(corp/data-import): parse and import link types | Vincent Ambo | 2 | -2/+54
Change-Id: Iae01d1dc6894117dc693b4690d8bc79861212ae6
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7863
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 | r/5689 fix(corp/data-import): commit the final transaction, too | Vincent Ambo | 1 | -0/+2
Otherwise up to 1000 elements might be missing.
Change-Id: I20d6238424eec27f0e758e7737c9c31bcb81b23d
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7862
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
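The class of bug fixed here can be shown with a short batched-insert sketch. It is not the tool's code: the batch size of 1000 is inferred from the message, and the `items` table and the `import_rows` function are hypothetical.

```python
import sqlite3

# Inferred from "up to 1000 elements"; the tool's real constant is not shown.
BATCH_SIZE = 1000


def import_rows(conn, rows, batch_size=BATCH_SIZE):
    """Insert rows in batches, committing once per full batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, value TEXT)"
    )
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO items (id, value) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()
            pending = 0
    # The fix: the last batch is usually not full, so it never reaches the
    # commit inside the loop. Without this, its rows are rolled back when
    # the connection closes.
    if pending:
        conn.commit()
```

With 2500 input rows and the final commit omitted, only the first 2000 would survive closing the connection; the trailing 500 sit in an open transaction that is discarded.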
2023-01-18 | r/5688 feat(corp/data-import): insert OpenCorpora data into SQLite | Vincent Ambo | 2 | -9/+155
This is an initial and kind of dumb table structure, but there's some massaging that needs to be done before this makes more sense.
Change-Id: I441288b684ef86be507099bcc4ebf984598789c8
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7861
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 | r/5684 feat(corp/data-import): parse lemmas from OpenCorpora dump | Vincent Ambo | 2 | -14/+135
Change-Id: I1e4efcfc8e555f61578b563411d5e6ed9590d8e8
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7860
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 | r/5683 feat(corp/russian/data-import): new OpenCorpora data import tool | Vincent Ambo | 6 | -0/+829
Adds the beginning of a tool which can import OpenCorpora data into a SQLite database.
This is quite a lot of toil and there's probably a better way to do this, but overall becoming this intimately familiar with the data structures is quite helpful for understanding what I can/can't do with only this dataset.
Change-Id: Ieab33a8ce07ea4ac87917b9c8132226bbc6523b1
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7859
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI