Scripts

pipeline-copy

Installation

The preferred way to use pipeline-copy is inside a project virtual environment, because of pipeline’s dependencies. I use poetry for this:

poetry add 'tanbih-pipeline[full]'

Or, if you know you will only use it with one backend, for example MongoDB:

poetry add 'tanbih-pipeline[mongodb]'

Usage

# dump jsonl to mongodb
poetry run pipeline-copy --model-definition [./model.py] \
    --in-kind FILE --in-content-only \
    --in-filename [input.jsonl] \
    --out-kind MONGO \
    --out-uri [mongodb_uri] \
    --out-database [database] \
    --out-keyname key1,key2,key3 \
    --out-topic [collection]

Since the JSON format does not support datetimes, you can provide a model definition via the --model-definition argument so that pipeline-copy treats datetime fields as datetimes instead of strings. An example of such a model definition follows (the class name must be Model):

from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class Model(BaseModel):
    hashtag: str
    username: str
    text: str
    tweet_id: str
    location: Optional[str] = None  # explicit default so the field may be omitted
    created_at: datetime
    retweet_count: int
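
To see why the model is needed, the standard-library sketch below round-trips one record through JSON. The field names follow the Model above, but the values are made up, and the explicit fromisoformat call is only a stand-in for the conversion that declaring created_at as datetime requests. How pipeline-copy performs that conversion internally is not shown here.

```python
import json
from datetime import datetime

# One record as it might appear on a single line of input.jsonl.
# Field names follow the Model above; the values are illustrative only.
line = json.dumps({
    "hashtag": "#example",
    "username": "alice",
    "text": "hello",
    "tweet_id": "1",
    "location": None,
    "created_at": "2021-03-01T12:30:00",
    "retweet_count": 3,
})

record = json.loads(line)

# JSON has no datetime type, so the timestamp comes back as a plain string.
assert isinstance(record["created_at"], str)

# Stand-in for the conversion the model requests: ISO 8601 string to datetime.
created_at = datetime.fromisoformat(record["created_at"])
print(type(record["created_at"]).__name__, "->", type(created_at).__name__)
```

Without a model, a field such as created_at would stay a string like the one in record. With the model, pydantic can parse an ISO 8601 string into a datetime for any field annotated as datetime.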