path: root/corp/russian/data-import/src
Age | Commit message | Author | Files | Lines
2023-01-21 | r/5730 feat(corp/data-import): add import of OR 'words_forms' table | Vincent Ambo | 3 | -6/+69
    This table holds the full set of morphological forms for all the words in the lemmata table (which the dataset does not call by that name).
    Change-Id: I6f5be673c5f59f11e36bd8c8c935844a7d4fd170
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7894
    Tested-by: BuildkiteCI
    Reviewed-by: tazjin <tazjin@tvl.su>
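The relationship described here (one lemma, many inflected forms) can be sketched roughly as follows. The table names `words` and `words_forms` come from the commit titles; every column name, the sample rows, and the `forms_of` helper are assumptions made for illustration and are not taken from the actual import tool or from the OpenRussian schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lemmata table (hypothetical columns).
    CREATE TABLE words (id INTEGER PRIMARY KEY, lemma TEXT NOT NULL);

    -- One row per inflected form, pointing back at its lemma (hypothetical columns).
    CREATE TABLE words_forms (
        id        INTEGER PRIMARY KEY,
        word_id   INTEGER NOT NULL REFERENCES words(id),
        form_type TEXT NOT NULL,
        form      TEXT NOT NULL
    );

    INSERT INTO words VALUES (1, 'дом');
    INSERT INTO words_forms (word_id, form_type, form) VALUES
        (1, 'nominative singular', 'дом'),
        (1, 'genitive singular',   'дома'),
        (1, 'dative singular',     'дому');
""")


def forms_of(conn, lemma):
    """Return all stored forms of a lemma, in insertion order."""
    query = """
        SELECT f.form
        FROM words_forms AS f
        JOIN words AS w ON w.id = f.word_id
        WHERE w.lemma = ?
        ORDER BY f.id
    """
    return [row[0] for row in conn.execute(query, (lemma,))]


print(forms_of(conn, "дом"))
```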
2023-01-21 | r/5729 feat(corp/data-import): add import of OpenRussian 'words' table | Vincent Ambo | 3 | -27/+223
    This is actually the lemmata table of this corpus, not the forms of all words (they're in a separate table).
    Change-Id: I89a2c2817ccce840f47406fa2a636f4ed3f49154
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7893
    Reviewed-by: tazjin <tazjin@tvl.su>
    Tested-by: BuildkiteCI
2023-01-18 | r/5703 docs(corp/data-import): document OpenRussian format | Vincent Ambo | 1 | -4/+53
    This is the second dataset I want to integrate as it contains some more practically useful, but somewhat less structured, information.
    Change-Id: Ib46b2597a33e76f59e030f889a0961ecc5a144eb
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7873
    Tested-by: BuildkiteCI
    Autosubmit: tazjin <tazjin@tvl.su>
    Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 | r/5702 chore(corp/data-import): namespace tables for OpenCorpora data | Vincent Ambo | 2 | -22/+22
    I'm changing strategies to importing both OC and another dataset before continuing to normalise the data, as it might be easier to do in a set of table-constructing queries inside of SQLite with all raw data in place.
    Change-Id: I26b41af80586fc1bfd8e26a6be20579068a82507
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7872
    Autosubmit: tazjin <tazjin@tvl.su>
    Reviewed-by: tazjin <tazjin@tvl.su>
    Tested-by: BuildkiteCI
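As a rough illustration of that strategy, the sketch below keeps the raw data of each dataset in its own prefixed tables and then derives a normalised table with a single table-constructing query. The `oc_` and `or_` prefixes, all table and column names, and the sample rows are hypothetical; the commit only states that the OpenCorpora tables were namespaced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw tables, namespaced per source dataset (hypothetical names).
    CREATE TABLE oc_lemmas (id INTEGER PRIMARY KEY, lemma TEXT NOT NULL);
    CREATE TABLE or_words  (id INTEGER PRIMARY KEY, bare  TEXT NOT NULL);

    INSERT INTO oc_lemmas VALUES (1, 'дом'), (2, 'кот');
    INSERT INTO or_words  VALUES (10, 'дом'), (11, 'вода');

    -- With all raw data in place, a normalised table can be built by a
    -- query. UNION (without ALL) removes duplicates across both sources.
    CREATE TABLE lemmas AS
        SELECT lemma AS word FROM oc_lemmas
        UNION
        SELECT bare FROM or_words;
""")

merged = {row[0] for row in conn.execute("SELECT word FROM lemmas")}
print(sorted(merged))
```

Keeping the raw tables untouched means the derived table can be dropped and rebuilt whenever the normalisation query changes, without re-running the import.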
2023-01-18 | r/5692 feat(corp/data-import): let users specify output path | Vincent Ambo | 1 | -6/+14
    Change-Id: I61ad021c7a5318b099f3adc8bc6aedef65500974
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7865
    Tested-by: BuildkiteCI
    Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 | r/5691 feat(corp/data-import): parse and import links | Vincent Ambo | 2 | -3/+78
    Change-Id: Iebdbc8f884f28064d7b00b8f8808b5030fa3d05c
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7864
    Reviewed-by: tazjin <tazjin@tvl.su>
    Tested-by: BuildkiteCI
2023-01-18 | r/5690 feat(corp/data-import): parse and import link types | Vincent Ambo | 2 | -2/+54
    Change-Id: Iae01d1dc6894117dc693b4690d8bc79861212ae6
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7863
    Tested-by: BuildkiteCI
    Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 | r/5689 fix(corp/data-import): commit the final transaction, too | Vincent Ambo | 1 | -0/+2
    Otherwise up to 1000 elements might be missing.
    Change-Id: I20d6238424eec27f0e758e7737c9c31bcb81b23d
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7862
    Tested-by: BuildkiteCI
    Reviewed-by: tazjin <tazjin@tvl.su>
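The class of bug fixed here can be sketched as follows: when rows are inserted in batches with one commit per full batch, the last partial batch stays in an open transaction unless it is committed explicitly. This is a minimal illustration using Python's `sqlite3` module, not the tool's actual code. The `items` table, the `import_rows` function, and the batch size of 1000 (inferred from the "up to 1000 elements" remark) are assumptions.

```python
import sqlite3

BATCH_SIZE = 1000  # assumed, based on the "up to 1000 elements" remark


def import_rows(conn, rows, batch_size=BATCH_SIZE):
    """Insert rows in batches, committing after each full batch and at the end."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, value TEXT)"
    )
    pending = 0
    for row in rows:
        # sqlite3 implicitly opens a transaction before the first INSERT.
        conn.execute("INSERT INTO items (id, value) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()
            pending = 0
    # The last, partially filled batch is still in an open transaction.
    # Without this final commit those rows are rolled back on close.
    conn.commit()
```

With 2500 input rows, the loop commits twice (after rows 1000 and 2000), and the final commit persists the remaining 500.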
2023-01-18 | r/5688 feat(corp/data-import): insert OpenCorpora data into SQLite | Vincent Ambo | 2 | -9/+155
    This is an initial and kind of dumb table structure, but there's some massaging that needs to be done before this makes more sense.
    Change-Id: I441288b684ef86be507099bcc4ebf984598789c8
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7861
    Reviewed-by: tazjin <tazjin@tvl.su>
    Tested-by: BuildkiteCI
2023-01-18 | r/5684 feat(corp/data-import): parse lemmas from OpenCorpora dump | Vincent Ambo | 2 | -14/+135
    Change-Id: I1e4efcfc8e555f61578b563411d5e6ed9590d8e8
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7860
    Reviewed-by: tazjin <tazjin@tvl.su>
    Tested-by: BuildkiteCI
2023-01-18 | r/5683 feat(corp/russian/data-import): new OpenCorpora data import tool | Vincent Ambo | 2 | -0/+388
    Adds the beginning of a tool which can import OpenCorpora data into a SQLite database. This is quite a lot of toil and there's probably a better way to do this, but overall becoming this intimately familiar with the data structures is quite helpful for understanding what I can/can't do with only this dataset.
    Change-Id: Ieab33a8ce07ea4ac87917b9c8132226bbc6523b1
    Reviewed-on: https://cl.tvl.fyi/c/depot/+/7859
    Reviewed-by: tazjin <tazjin@tvl.su>
    Tested-by: BuildkiteCI