I'm thinking of creating my own database of the Internet Archive Twitter dataset for research purposes (a master's thesis).
I have a few questions before I go on this journey:
- Is it feasible to store the complete dataset in MongoDB on my 8 GB RAM laptop, which has approximately 2 TB of free disk space? A rough calculation puts all the compressed JSONs at approximately 1 TB. What kind of query times would I be looking at for a simple query, e.g. return only tweets from country X (see the first sketch after this list)?
- The ultimate goal is to end up with a fraction of what is in there, but I'm not yet entirely sure which tweets to include or exclude. Is it feasible to load everything in and query afterwards, or should I select a subsample, work out my inclusion/exclusion criteria, and then filter during staging?
- I'm expecting to use Python to load all the individual compressed JSONs into the database (see the second sketch after this list). Is this preferable, or are there other methods you would recommend?
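
For context, the country query I have in mind would look roughly like this minimal sketch (assuming pymongo, hypothetical database/collection names, and the place.country_code field that geotagged tweets carry in the standard Twitter JSON):

    # Minimal country-query sketch; "twitter"/"tweets" are hypothetical names.
    from pymongo import MongoClient

    tweets = MongoClient().twitter.tweets  # local mongod on the default port

    # Without an index this is a full collection scan over the whole dataset;
    # an index on the nested field should make the lookup itself cheap.
    tweets.create_index("place.country_code")

    # Only geotagged tweets carry a "place" object, so this matches the
    # (small) subset of tweets whose place is in country X, here "NL".
    for tweet in tweets.find({"place.country_code": "NL"}).limit(5):
        print(tweet["id"], tweet.get("text", ""))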
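
And this is roughly how I picture the Python loading step with a staging filter, assuming the archive files are bz2-compressed, line-delimited JSON (one tweet per line); the directory layout, the language filter, and the batch size are illustrative guesses on my part:

    # Loading sketch: stream compressed JSON lines into MongoDB in batches.
    import bz2
    import json
    from pathlib import Path

    from pymongo import MongoClient

    tweets = MongoClient().twitter.tweets  # hypothetical db/collection names
    BATCH_SIZE = 1000

    def keep(tweet):
        # Example staging filter: drop stream "delete" notices and keep
        # only one language; my real criteria are still to be decided.
        return "delete" not in tweet and tweet.get("lang") == "en"

    batch = []
    for path in Path("archive").rglob("*.json.bz2"):  # assumed layout
        with bz2.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    tweet = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip the occasional truncated line
                if keep(tweet):
                    batch.append(tweet)
                if len(batch) >= BATCH_SIZE:
                    tweets.insert_many(batch, ordered=False)
                    batch = []
    if batch:
        tweets.insert_many(batch, ordered=False)  # flush the remainder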
Any help is greatly appreciated.