Relational database versus R/Python data frames

January 15, 2010

I was exposed to the world of tables and data structures in R before I met RDBMS and other database systems. It is quite elegant in R/Python to create tables and lists from structured data (.csv or other formats) and then do the data manipulation programmatically.

Last year, I attended a course in database management and learnt about structured and unstructured databases. I also noticed that it is the norm to feed data from multiple sources into databases rather than to use them directly in R (for convenience and discipline?).

For research purposes, R seems to suffice for joining, appending, or even complicated data manipulations.

The question that keeps arising is: when should I use R directly, with commands such as read.csv, and when should I use R by creating a database and querying its tables through an R-SQL interface?
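To make the two routes in the question concrete, here is a minimal sketch in Python using only the standard library. The CSV content, the `survey` table, and its column names are invented for illustration. With pandas, the equivalent calls would be `pandas.read_csv` and `pandas.read_sql_query`.

```python
import csv
import io
import sqlite3

# Hypothetical survey extract; in practice this would be a file on disk.
CSV_TEXT = """person_id,age,smoker,score
1,34,yes,7
2,51,no,4
3,29,yes,9
"""


def mean_score_direct(csv_text):
    """Route 1: read the file and compute in code (the read.csv style)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    scores = [int(r["score"]) for r in rows if r["smoker"] == "yes"]
    return sum(scores) / len(scores)


def mean_score_sql(csv_text):
    """Route 2: load the same rows into a database table, then query it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE survey "
        "(person_id INTEGER, age INTEGER, smoker TEXT, score INTEGER)"
    )
    conn.executemany(
        "INSERT INTO survey VALUES (:person_id, :age, :smoker, :score)",
        csv.DictReader(io.StringIO(csv_text)),
    )
    (mean,) = conn.execute(
        "SELECT AVG(score) FROM survey WHERE smoker = 'yes'"
    ).fetchone()
    conn.close()
    return mean


print(mean_score_direct(CSV_TEXT), mean_score_sql(CSV_TEXT))
```

Both routes give the same number for a single small file. The difference only starts to matter once several sources and several users are involved.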

For instance, suppose I have multi-source data: (a) person-level information (age, gender, smoking habits), (b) outcome variables (such as surveys taken by the participants in real time), (c) covariate information (environment characteristics), (d) treatment input (the occurrence of an event that modifies the outcome, i.e. the survey response), and (e) time and space information for the participants taking the survey.

How should I approach data collection and processing in this case? There may be standard industry procedures, but I put the question forward here to understand the list of feasible and optimal approaches that individuals and small groups of researchers can adopt.

What you're describing when you say that "it is the norm to feed data from multiple sources into databases" sounds more like a data warehouse. Databases are used for many reasons, and in plenty of situations they hold data from only one source. For instance, a database used as the data store of a transactional system will typically hold the data needed to run the system and the data produced by the system.

The process you're describing is commonly called extract, transform, load (ETL), and you might find it helpful to look up information on ETL and data warehousing if you decide to go in the direction of combining the data prior to working with it in R.
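As a rough illustration of the three ETL steps, here is a small Python sketch that uses the standard library's `csv` and `sqlite3` modules. The two sources, the table layouts, and the gender-code cleanup rule are all assumptions made up for the example. A real pipeline would read from files or remote systems and would have its own cleaning rules.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with inconsistent conventions.
PERSONS_CSV = "person_id,age,gender\n1,34,F\n2,51,male\n3,29,f\n"
SURVEYS_CSV = (
    "person_id,taken_at,score\n"
    "1,2009-11-02T09:15,7\n"
    "2,2009-11-02T10:40,4\n"
    "1,2009-11-09T09:20,6\n"
)


def extract(text):
    """Extract: read raw rows from one source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform_person(row):
    """Transform: cast types and normalise gender codes to one letter."""
    gender = row["gender"].strip().lower()[:1]
    return (int(row["person_id"]), int(row["age"]), gender)


def transform_survey(row):
    """Transform: cast types; timestamps stay as ISO 8601 text."""
    return (int(row["person_id"]), row["taken_at"], int(row["score"]))


def load(conn, persons, surveys):
    """Load: write the cleaned rows into the target tables."""
    conn.execute(
        "CREATE TABLE person "
        "(person_id INTEGER PRIMARY KEY, age INTEGER, gender TEXT)"
    )
    conn.execute(
        "CREATE TABLE survey (person_id INTEGER REFERENCES person(person_id),"
        " taken_at TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)", persons)
    conn.executemany("INSERT INTO survey VALUES (?, ?, ?)", surveys)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(
    conn,
    [transform_person(r) for r in extract(PERSONS_CSV)],
    [transform_survey(r) for r in extract(SURVEYS_CSV)],
)
genders = [g for (g,) in conn.execute(
    "SELECT gender FROM person ORDER BY person_id")]
n_surveys = conn.execute("SELECT COUNT(*) FROM survey").fetchone()[0]
print(genders, n_surveys)
```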

I can't tell you which you should choose, or the optimal way of accomplishing it, because that will vary between situations and might even come down to opinion. What I can tell you is some of the reasons why people create data warehouses, and you can decide whether that might be useful in your situation:

A data warehouse can provide a central location to hold the combined data. This means that people do not need to combine the data themselves each time they need a specific combination of it. Unlike a simple one-off report or extract of combined data, it should provide some flexibility, letting people obtain the combined set of data they need for a specific task. Often, in enterprise situations, multiple things are run on top of the same combined set of data: multidimensional data analysis tools (cubes), reports, data mining, etc.
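One lightweight way to get this "combine once, reuse many times" effect in a small research setting is a database view. The sketch below uses SQLite through Python's `sqlite3` module. The tables, columns, and sample rows are hypothetical, loosely following the person, survey, and environment sources from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, age INTEGER, smoker INTEGER);
CREATE TABLE survey (person_id INTEGER, taken_at TEXT, score INTEGER);
CREATE TABLE environment (taken_at TEXT PRIMARY KEY, temperature REAL);

INSERT INTO person VALUES (1, 34, 1), (2, 51, 0);
INSERT INTO survey VALUES
    (1, '2009-11-02', 7), (2, '2009-11-02', 4), (1, '2009-11-09', 6);
INSERT INTO environment VALUES ('2009-11-02', 12.5), ('2009-11-09', 8.0);

-- The combination is defined once, in one central place.
CREATE VIEW survey_combined AS
SELECT s.person_id, p.age, p.smoker, s.taken_at, s.score, e.temperature
FROM survey AS s
JOIN person AS p ON p.person_id = s.person_id
JOIN environment AS e ON e.taken_at = s.taken_at;
""")

# Task A: mean score by smoking status.
by_smoker = conn.execute(
    "SELECT smoker, AVG(score) FROM survey_combined "
    "GROUP BY smoker ORDER BY smoker"
).fetchall()

# Task B: responses given on cold days, using the same combined data.
cold_days = conn.execute(
    "SELECT person_id, score FROM survey_combined "
    "WHERE temperature < 10 ORDER BY person_id"
).fetchall()

print(by_smoker, cold_days)
```

Each task selects only what it needs from `survey_combined`, so the join logic is never rewritten per analysis.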

Some of the benefits of this might include:

Individuals save time when they would otherwise have needed to combine the data themselves.
If the data that needs to be combined is complex, or some people are not proficient at handling that part of the process, there is less risk of the data being combined incorrectly; you can be sure that different pieces of work have used the same source data.
If the data suffers from data quality issues, you resolve them once in the data warehouse, rather than working around them or resolving them repeatedly in code.
If new data is being received, its collection and integration into the data warehouse can be carried out automatically.
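The last two benefits can be sketched together: a single cleaning rule applied at load time, and a load step that is safe to re-run as new batches arrive. This is only an illustration under assumed names. The `survey` table, the 0 to 10 score range, and the `ingest` helper are hypothetical, and `INSERT OR IGNORE` is just one of several ways SQLite can handle re-delivered rows.

```python
import sqlite3


def open_warehouse():
    """Create an in-memory stand-in for the central store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE survey ("
        " person_id INTEGER, taken_at TEXT, score INTEGER,"
        " PRIMARY KEY (person_id, taken_at))"
    )
    return conn


def clean(row):
    """The one place the quality rule lives: scores outside 0-10 become NULL."""
    person_id, taken_at, score = row
    if score is None or not 0 <= score <= 10:
        score = None
    return (person_id, taken_at.strip(), score)


def ingest(conn, batch):
    """Load a batch and return how many rows were actually added.

    Rows whose (person_id, taken_at) key is already stored are skipped,
    so re-running the same batch does not create duplicates.
    """
    count = "SELECT COUNT(*) FROM survey"
    before = conn.execute(count).fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO survey VALUES (?, ?, ?)",
        [clean(r) for r in batch],
    )
    conn.commit()
    return conn.execute(count).fetchone()[0] - before


conn = open_warehouse()
first = ingest(conn, [(1, "2009-11-02", 7), (2, "2009-11-02 ", 99)])
# The second delivery overlaps with the first; only the new row is added.
second = ingest(conn, [(2, "2009-11-02", 99), (1, "2009-11-09", 6)])
bad_score = conn.execute(
    "SELECT score FROM survey WHERE person_id = 2").fetchone()[0]
print(first, second, bad_score)
```

A scheduled job calling something like `ingest` would cover the "carried out automatically" part. Analyses that read from the table afterwards never see the out-of-range score.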

Like I say, I can't decide for you whether this is a useful direction or not. With a decision of this kind you'll need to weigh the costs of implementing such a solution against the benefits, and both are specific to your individual case. Hopefully this answers your core question of why you might choose to do this work in a database instead of in code, and gives you a starting point to work from.
