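Understanding exactly when a data.table is a reference to another data.table
I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to understand exactly what is happening.
When I create a data.table from another data.table (via <-) and then update the new table with :=, the original table is also altered. This is expected, as per ?data.table::copy and the Stack Overflow question on the pass-by-reference operator in the data.table package.
Here's an example: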
library(data.table)
DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
# a b
# [1,] 1 11
# [2,] 2 12
newDT <- DT # reference, not copy
newDT[1, a := 100] # modify new DT
print(DT) # DT is modified too.
# a b
# [1,] 100 11
# [2,] 2 12
However, if I insert a modification that is not based on := between the <- assignment and the := line above, DT is no longer modified:
DT = data.table(a=c(1,2), b=c(11,12))
newDT <- DT
newDT$b[2] <- 200 # new operation
newDT[1, a := 100]
print(DT)
# a b
# [1,] 1 11
# [2,] 2 12
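So it seems that the newDT$b[2] <- 200 line somehow 'breaks' the reference. I'd guess that this invokes a copy somehow, but I would like to understand fully how R treats these operations, so that I don't introduce potential bugs in my code.
I'd very much appreciate it if someone could explain this to me.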
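Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)), then it is the copy that gets modified by reference.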
DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
tracemem(newDT)
# [1] "<0x0000000003b7e2a0"
newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..
Notice how even the a vector was copied (a different hex value indicates a new copy of the vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table.
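A lighter-weight way to see the same thing is data.table's address() helper, which returns an object's memory address as a string. The following is only a minimal sketch under that assumption; the variable names same_before and same_after are made up for illustration.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT                      # second name bound to the same object

# Both names should report the same memory address at this point.
same_before <- identical(address(DT), address(newDT))

newDT$b[2] <- 200                # base subassignment: newDT becomes a copy

# newDT should now live at a different address; DT keeps its old values.
same_after <- identical(address(DT), address(newDT))

print(c(same_before = same_before, same_after = same_after))
print(DT$b)
```

Comparing addresses before and after a suspicious line is usually enough to tell whether that line triggered a copy.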
Now, with our copied newDT we can modify it by reference:
newDT
# a b
# [1,] 1 11
# [2,] 2 200
newDT[2, b := 400]
# a b # See FAQ 2.21 for why this prints newDT
# [1,] 1 11
# [2,] 2 400
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip ..
Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.
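The same check can be sketched without .Internal(inspect()) by recording addresses with data.table's address() before and after the update. This is an illustrative sketch on a fresh table, not the session shown above; the variable names are hypothetical.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

# Record the addresses of the table and of column b before the update.
table_before <- address(DT)
col_before   <- address(DT$b)

DT[2, b := 400]                  # update one element by reference

# If nothing was copied, both addresses are the same as before.
table_unchanged <- identical(table_before, address(DT))
col_unchanged   <- identical(col_before, address(DT$b))

print(c(table_unchanged = table_unchanged, col_unchanged = col_unchanged))
```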
Or, we can modify the original DT by reference:
DT[2, b := 600]
# a b
# [1,] 1 11
# [2,] 2 600
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..
Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.
Btw, if you tracemem(DT) then DT[2,b:=600] you'll see one copy reported. That is a copy of the first 10 rows that the print method does. When wrapped with invisible() or when called within a function or script, the print method isn't called.
All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But remember, data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
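To make the function behaviour concrete, here is a small sketch contrasting a function that uses := directly with one that calls copy() first. The helper names add_flag_by_ref and add_flag_on_copy, and the flag column, are hypothetical and exist only for this illustration.

```r
library(data.table)

# Hypothetical helper: modifies the caller's table by reference.
add_flag_by_ref <- function(x) {
  x[, flag := TRUE]
  invisible(x)
}

# Hypothetical helper: takes an explicit copy first, so the caller's
# table is left alone and the modified copy is returned instead.
add_flag_on_copy <- function(x) {
  x <- copy(x)
  x[, flag := TRUE]
  x
}

DT <- data.table(a = c(1, 2), b = c(11, 12))

res <- add_flag_on_copy(DT)
caller_untouched <- !("flag" %in% names(DT))   # DT has no flag column yet

add_flag_by_ref(DT)
caller_modified <- "flag" %in% names(DT)       # now DT itself has it

print(c(caller_untouched = caller_untouched, caller_modified = caller_modified))
```

The cost of the safe variant is a full copy of the table on every call, which is exactly what := was designed to avoid for large data.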
Just a quick sum up.
<- with data.table is just like base; i.e., no copy is taken until a subassign is done afterwards with <- (such as changing the column names or changing an element such as DT[i,j]<-v). Then it takes a copy of the whole object just like base. That's known as copy-on-write. It would be better known as copy-on-subassign, I think! It DOES NOT copy when you use the special := operator, or the set* functions provided by data.table. If you have large data you probably want to use them instead. := and set* will NOT COPY the data.table, EVEN WITHIN FUNCTIONS.
Given this example data:
DT <- data.table(a=c(1,2), b=c(11,12))
The following just "binds" another name DT2 to the same data object currently bound to the name DT:
DT2 <- DT
This never copies, and never copies in base either. It just marks the data object so that R knows that two different names (DT2 and DT) point to the same object. And so R will need to copy the object if either is subassigned to afterwards.
That's perfect for data.table, too. The := isn't for doing that. So the following is a deliberate error, as := isn't for just binding object names:
DT2 := DT # not what := is for, not defined, gives a nice error
:= is for subassigning by reference. But you don't use it like you would in base:
DT[3,"foo"] := newvalue # not like this
you use it like this:
DT[3,foo:=newvalue] # like this
That changed DT by reference. Say you add a new column new by reference to the data object; there is no need to do this:
DT <- DT[,new:=1L]
because the RHS already changed DT by reference. The extra DT <- misunderstands what := does. You can write it there, but it's superfluous.
DT is changed by reference, by :=, EVEN WITHIN FUNCTIONS:
f <- function(X){
    X[,new2:=2L]
    return("something else")
}
f(DT) # will change DT
DT2 <- DT
f(DT) # will change both DT and DT2 (they're the same data object)
data.table is for large datasets, remember. If you have a 20GB data.table in memory then you need a way to do this. It's a very deliberate design decision of data.table.
Copies can be made, of course. You just need to tell data.table that you're sure you want to copy your 20GB dataset, by using the copy() function:
DT3 <- copy(DT) # rather than DT3 <- DT
DT3[,new3:=3L] # now, this just changes DT3 because it's a copy, not DT too.
To avoid copies, don't use base-style assignment or update:
DT$new4 <- 1L # will make a copy, so use := instead
attr(DT,"sorted") <- "a" # will make a copy, so use setattr() instead
If you want to be sure that you are updating by reference, use .Internal(inspect(x)) and look at the memory address values of the constituents (see Matthew Dowle's answer).
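As a rough sketch of the by-reference alternatives, the following uses set(), setattr() and setnames() from data.table and checks with address() that the table is still the same object afterwards. The attribute name "note" and the new column name b2 are arbitrary choices for this example.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
addr_before <- address(DT)

set(DT, j = "new4", value = 1L)        # instead of DT$new4 <- 1L
setattr(DT, "note", "by reference")    # instead of attr(DT, "note") <- ...
setnames(DT, "b", "b2")                # instead of names(DT)[2] <- "b2"

# If no copy was taken, the table is still at the same address.
same_object <- identical(addr_before, address(DT))

print(same_object)
print(names(DT))
```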
Writing := in j like that allows you to subassign by reference by group. You can add a new column by reference by group. So that's why := is done that way inside [...]:
DT[, newcol:=mean(x), by=group]
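For a self-contained illustration of the by-group form, here is a small sketch with made-up data; the column names group, x and newcol follow the line above, and the values are invented for the example.

```r
library(data.table)

# Made-up example data: two groups with different means.
DT <- data.table(group = c("a", "a", "b"), x = c(1, 3, 10))

# Add the per-group mean of x as a new column, by reference.
DT[, newcol := mean(x), by = group]

print(DT)
```

Every row receives the mean of its own group, so the two rows in group "a" share one value and the single row in group "b" keeps its own.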