







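NASA has more than 32,000 datasets, and NASA is interested in understanding the connections between these datasets, as well as connections to other important datasets at government organizations outside NASA. Metadata about the NASA datasets is available online in JSON format. Let's use tf-idf to find important words in the description fields and connect them to the keywords.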



library(jsonlite)
library(dplyr)
library(tidyr)

# Read the downloaded metadata file and list the available fields
metadata <- fromJSON("data.json")
names(metadata$dataset)
##  [1] "_id"                "@type"              "accessLevel"        "accrualPeriodicity"
##  [5] "bureauCode"         "contactPoint"       "description"        "distribution"
##  [9] "identifier"         "issued"             "keyword"            "landingPage"
## [13] "language"           "modified"           "programCode"        "publisher"
## [17] "spatial"            "temporal"           "theme"              "title"
## [21] "license"            "isPartOf"           "references"         "rights"
## [25] "describedBy"

# One row per dataset: its id and its free-text description
# (tibble() replaces the deprecated data_frame())
nasadesc <- tibble(id = metadata$dataset$`_id`$`$oid`,
                   desc = metadata$dataset$description)
nasadesc
## # A tibble: 32,089 x 2
##                          id
##                       <chr>
## 1  55942a57c63a7fe59b495a77
## 2  55942a57c63a7fe59b495a78
## 3  55942a58c63a7fe59b495a79
## 4  55942a58c63a7fe59b495a7a
## 5  55942a58c63a7fe59b495a7b
## 6  55942a58c63a7fe59b495a7c
## 7  55942a58c63a7fe59b495a7d
## 8  55942a58c63a7fe59b495a7e
## 9  55942a58c63a7fe59b495a7f
## 10 55942a58c63a7fe59b495a80
## # ... with 32,079 more rows, and 1 more variables: desc <chr>
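The nested `_id`/`$oid` access above may look odd, so here is a minimal sketch of why it works, assuming a tiny inline JSON string in place of the real `data.json`. The ids, descriptions and keywords in `toy_json` are made up for illustration. `fromJSON()` turns an array of objects into a data frame, a nested object into a nested data frame column, and arrays of unequal length into a list column.

```r
library(jsonlite)
library(dplyr)

# Hypothetical two-record stand-in for data.json (values are invented)
toy_json <- '{"dataset": [
  {"_id": {"$oid": "a1"}, "description": "Cloud estimates", "keyword": ["CLOUDS"]},
  {"_id": {"$oid": "a2"}, "description": "RDR", "keyword": ["MOON", "RADAR"]}
]}'

toy <- fromJSON(toy_json)

# `_id` is itself a data frame with a single column named `$oid`
toy_ids <- toy$dataset$`_id`$`$oid`

# Same shape as nasadesc: one row per dataset
toy_desc <- tibble(id = toy_ids, desc = toy$dataset$description)
toy_desc
```

Backticks are needed because `_id` and `$oid` are not syntactically valid R names.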





nasadesc %>% select(desc) %>% sample_n(5)
## # A tibble: 5 x 1
##                                                                                                                                                      desc
##                                                                                                                                                     <chr>
## 1  A Group for High Resolution Sea Surface Temperature (GHRSST) Level 4 sea surface temperature analysis produced as a retrospective dataset at the JPL P
## 2  ML2CO is the EOS Aura Microwave Limb Sounder (MLS) standard product for carbon monoxide derived from radiances measured by the 640 GHz radiometer. The
## 3                                                                                                              Crew lock bag. Polygons: 405 Vertices: 514
## 4  JEM Engineering proved the technical feasibility of the FlexScan array?a very low-cost, highly-efficient, wideband phased array antenna?in Phase I, an
## 5 MODIS (or Moderate Resolution Imaging Spectroradiometer) is a key instrument aboard the\nTerra (EOS AM) and Aqua (EOS PM) satellites. Terra's orbit aro



## # A tibble: 126,814 x 2
##                          id       keyword
##                       <chr>         <chr>
## 1  55942a57c63a7fe59b495a77 EARTH SCIENCE
## 2  55942a57c63a7fe59b495a77   HYDROSPHERE
## 3  55942a57c63a7fe59b495a77 SURFACE WATER
## 4  55942a57c63a7fe59b495a78 EARTH SCIENCE
## 5  55942a57c63a7fe59b495a78   HYDROSPHERE
## 6  55942a57c63a7fe59b495a78 SURFACE WATER
## 7  55942a58c63a7fe59b495a79 EARTH SCIENCE
## 8  55942a58c63a7fe59b495a79   HYDROSPHERE
## 9  55942a58c63a7fe59b495a79 SURFACE WATER
## 10 55942a58c63a7fe59b495a7a EARTH SCIENCE
## # ... with 126,804 more rows
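The code that produced the keyword table above is not shown. One plausible way to get a table of that shape, with one row per dataset and keyword pair, is to put the `keyword` list column next to the id and expand it with `tidyr::unnest()`. This is a sketch on invented toy data, not necessarily the original code; `toy_keywords` and its ids are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the metadata: each dataset id carries a vector of keywords
toy_keywords <- tibble(
  id = c("a1", "a2"),
  keyword = list(c("EARTH SCIENCE", "HYDROSPHERE"), c("Oceans"))
)

# unnest() expands the list column so each row holds one (id, keyword) pair
toy_long <- toy_keywords %>% unnest(keyword)
toy_long
```

A dataset with several keywords therefore appears on several rows, which is why 32,089 datasets yield 126,814 rows above.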



## # A tibble: 1,774 x 2
##                    keyword     n
##                      <chr> <int>
## 1            EARTH SCIENCE 14362
## 2                  Project  7452
## 3               ATMOSPHERE  7321
## 4              Ocean Color  7268
## 5             Ocean Optics  7268
## 6                   Oceans  7268
## 7                completed  6452
## 8  ATMOSPHERIC WATER VAPOR  3142
## 9                   OCEANS  2765
## 10            LAND SURFACE  2720
## # ... with 1,764 more rows


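It looks like "Project" and "completed" may not be useful keywords for some purposes, and we may want to change all keywords to either lowercase or uppercase to get rid of duplicates such as "OCEANS" and "Oceans".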
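As a minimal sketch of that cleanup, assuming a small invented keyword table (`toy_kw` is hypothetical), converting to uppercase before counting merges the case variants into a single keyword:

```r
library(dplyr)

# Toy keyword table with a case-variant duplicate
toy_kw <- tibble(
  id = c("a1", "a2", "a3", "a3"),
  keyword = c("OCEANS", "Oceans", "Project", "completed")
)

kw_counts <- toy_kw %>%
  mutate(keyword = toupper(keyword)) %>%  # "Oceans" and "OCEANS" now match
  count(keyword, sort = TRUE)             # most frequent keyword first
kw_counts
```

Keywords judged unhelpful could be dropped at the same point with a `filter()` step; which ones to drop depends on the purpose of the analysis.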


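What is tf-idf? One way to measure how important a word is in a document is its term frequency (tf), that is, how often the word occurs in the document. However, some words that occur frequently are not important; in English, these are words such as "the", "is", and "of". Another approach is to look at a term's inverse document frequency (idf), which decreases the weight of commonly used words and increases the weight of words that are rarely used in a collection of documents. Multiplying the two gives tf-idf.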
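To make the definition concrete, here is a small hand computation on invented counts (`toy_counts` is hypothetical). It assumes tf is a word's share of all words in its document and idf is the natural log of the number of documents divided by the number of documents containing the word. That choice of log agrees with the outputs in this section, where a word found in exactly one of the 32,089 descriptions has idf 10.376269, which is ln(32,089).

```r
library(dplyr)

# Toy word counts: one row per (document, word) pair
toy_counts <- tibble(
  doc  = c("d1", "d1", "d2", "d2", "d3"),
  word = c("the", "ocean", "the", "cloud", "the"),
  n    = c(2, 1, 1, 1, 3)
)

n_docs <- n_distinct(toy_counts$doc)

toy_tf_idf <- toy_counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                       # share of the document's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc))) %>%   # natural log; rarer words score higher
  ungroup() %>%
  mutate(tf_idf = tf * idf)
toy_tf_idf
```

Because "the" occurs in every toy document, its idf is ln(3/3) = 0 and its tf-idf vanishes no matter how often it appears, while "ocean" and "cloud" keep a positive score.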

## # A tibble: 2,728,224 x 3
##                          id  word     n
##                       <chr> <chr> <int>
## 1  55942a88c63a7fe59b498280   amp   679
## 2  55942a88c63a7fe59b498280  nbsp   655
## 3  55942a8ec63a7fe59b4986ef    gt   330
## 4  55942a8ec63a7fe59b4986ef    lt   330
## 5  55942a8ec63a7fe59b4986ef     p   327
## 6  55942a8ec63a7fe59b4986ef   the   231
## 7  55942a86c63a7fe59b49803b   amp   208
## 8  55942a86c63a7fe59b49803b  nbsp   204
## 9  56cf5b00a759fdadc44e564a   the   201
## 10 55942a86c63a7fe59b4980a2    gt   191
## # ... with 2,728,214 more rows
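A per-dataset word count like the one above could be produced with the tidytext package; the sketch below assumes that approach and uses two invented descriptions (`toy_desc` is hypothetical). `unnest_tokens()` splits each description into lowercase words and `count()` tallies them per dataset. The second description contains escaped HTML, which shows where "words" such as `amp`, `nbsp`, `gt`, `lt` and `p` come from.

```r
library(dplyr)
library(tidytext)

# Toy descriptions; the second one contains escaped HTML markup
toy_desc <- tibble(
  id = c("a1", "a2"),
  desc = c("Sea surface temperature analysis of the sea",
           "&lt;p&gt;Oxygen regulator&nbsp;&amp;&nbsp;suit&lt;/p&gt;")
)

toy_words <- toy_desc %>%
  unnest_tokens(word, desc) %>%   # one row per word, lowercased, punctuation dropped
  count(id, word, sort = TRUE)    # occurrences of each word within each dataset
toy_words
```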



## # A tibble: 1 x 1
##                                                                                                                                                     desc
##                                                                                                                                                    <chr>
## 1 &lt;p&gt;The objective of the Variable Oxygen Regulator Element is to develop an oxygen-rated, contaminant-tolerant oxygen regulator to control suit p



## # A tibble: 2,728,224 x 6
##                          id  word     n         tf       idf      tf_idf
##                       <chr> <chr> <int>      <dbl>     <dbl>       <dbl>
## 1  55942a88c63a7fe59b498280   amp   679 0.35661765 3.1810813 1.134429711
## 2  55942a88c63a7fe59b498280  nbsp   655 0.34401261 4.2066578 1.447143322
## 3  55942a8ec63a7fe59b4986ef    gt   330 0.05722213 3.2263517 0.184618705
## 4  55942a8ec63a7fe59b4986ef    lt   330 0.05722213 3.2903671 0.188281801
## 5  55942a8ec63a7fe59b4986ef     p   327 0.05670192 3.3741126 0.191318680
## 6  55942a8ec63a7fe59b4986ef   the   231 0.04005549 0.1485621 0.005950728
## 7  55942a86c63a7fe59b49803b   amp   208 0.32911392 3.1810813 1.046938133
## 8  55942a86c63a7fe59b49803b  nbsp   204 0.32278481 4.2066578 1.357845252
## 9  56cf5b00a759fdadc44e564a   the   201 0.06962245 0.1485621 0.010343258
## 10 55942a86c63a7fe59b4980a2    gt   191 0.12290862 3.2263517 0.396546449
## # ... with 2,728,214 more rows



## # A tibble: 2,728,224 x 6
##                          id                                          word     n    tf       idf
##                       <chr>                                         <chr> <int> <dbl>     <dbl>
## 1  55942a7cc63a7fe59b49774a                                           rdr     1     1 10.376269
## 2  55942ac9c63a7fe59b49b688 palsar_radiometric_terrain_corrected_high_res     1     1 10.376269
## 3  55942ac9c63a7fe59b49b689  palsar_radiometric_terrain_corrected_low_res     1     1 10.376269
## 4  55942a7bc63a7fe59b4976ca                                          lgrs     1     1  8.766831
## 5  55942a7bc63a7fe59b4976d2                                          lgrs     1     1  8.766831
## 6  55942a7bc63a7fe59b4976e3                                          lgrs     1     1  8.766831
## 7  55942ad8c63a7fe59b49cf6c                      template_proddescription     1     1  8.296827
## 8  55942ad8c63a7fe59b49cf6d                      template_proddescription     1     1  8.296827
## 9  55942ad8c63a7fe59b49cf6e                      template_proddescription     1     1  8.296827
## 10 55942ad8c63a7fe59b49cf6f                      template_proddescription     1     1  8.296827
##       tf_idf
##        <dbl>
## 1  10.376269
## 2  10.376269
## 3  10.376269
## 4   8.766831
## 5   8.766831
## 6   8.766831
## 7   8.296827
## 8   8.296827
## 9   8.296827
## 10  8.296827
## # ... with 2,728,214 more rows
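The ranking above is headed by descriptions that consist of a single word, so tf is 1 and tf-idf equals idf. A sketch of how such a ranking could be computed, assuming tidytext's `bind_tf_idf()` followed by a descending sort, is shown below on invented counts (`toy_counts` is hypothetical; only the word "rdr" is borrowed from the output above).

```r
library(dplyr)
library(tidytext)

# Toy word counts; document a3 consists of a single word
toy_counts <- tibble(
  id   = c("a1", "a1", "a2", "a2", "a3"),
  word = c("the", "ocean", "the", "cloud", "rdr"),
  n    = c(2L, 1L, 1L, 1L, 1L)
)

toy_ranked <- toy_counts %>%
  bind_tf_idf(word, id, n) %>%   # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))          # highest tf-idf first
toy_ranked
```

In the toy data, "rdr" ranks first for the same reason as in the real output: it is the only word in its document and appears in no other document.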



## # A tibble: 1 x 1
##    desc
##   <chr>
## 1   RDR





## # A tibble: 11,013,838 x 7
##                          id  word     n         tf      idf    tf_idf              keyword
##                       <chr> <chr> <int>      <dbl>    <dbl>     <dbl>                <chr>
## 1  55942a88c63a7fe59b498280   amp   679 0.35661765 3.181081 1.1344297              ELEMENT
## 2  55942a88c63a7fe59b498280   amp   679 0.35661765 3.181081 1.1344297 JOHNSON SPACE CENTER
## 3  55942a88c63a7fe59b498280   amp   679 0.35661765 3.181081 1.1344297                  VOR
## 4  55942a88c63a7fe59b498280   amp   679 0.35661765 3.181081 1.1344297               ACTIVE
## 5  55942a88c63a7fe59b498280  nbsp   655 0.34401261 4.206658 1.4471433              ELEMENT
## 6  55942a88c63a7fe59b498280  nbsp   655 0.34401261 4.206658 1.4471433 JOHNSON SPACE CENTER
## 7  55942a88c63a7fe59b498280  nbsp   655 0.34401261 4.206658 1.4471433                  VOR
## 8  55942a88c63a7fe59b498280  nbsp   655 0.34401261 4.206658 1.4471433               ACTIVE
## 9  55942a8ec63a7fe59b4986ef    gt   330 0.05722213 3.226352 0.1846187 JOHNSON SPACE CENTER
## 10 55942a8ec63a7fe59b4986ef    gt   330 0.05722213 3.226352 0.1846187              PROJECT
## # ... with 11,013,828 more rows
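The table above has one row per combination of word and keyword within a dataset, which is why it grows from about 2.7 million to about 11 million rows. A join on the dataset id produces exactly this shape; the sketch below assumes `dplyr::full_join()` and uses invented toy tables (`toy_words` and `toy_keywords` are hypothetical, with values loosely modelled on the first rows above).

```r
library(dplyr)

# Toy per-dataset word scores
toy_words <- tibble(
  id = c("a1", "a1"),
  word = c("amp", "nbsp"),
  tf_idf = c(1.13, 1.45)
)

# Toy per-dataset keywords
toy_keywords <- tibble(
  id = c("a1", "a1"),
  keyword = c("ELEMENT", "VOR")
)

# Every word of a dataset is paired with every keyword of that dataset.
# Recent dplyr versions warn about many-to-many matches here, which is expected.
toy_joined <- full_join(toy_words, toy_keywords, by = "id")
toy_joined
```

Two words and two keywords for the same id give four rows, the same multiplication seen in the real output.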




## # A tibble: 122 x 7
##                          id      word     n        tf      idf   tf_idf    keyword
##                       <chr>    <fctr> <int>     <dbl>    <dbl>    <dbl>      <chr>
## 1  55942a60c63a7fe59b49612f estimates     1 0.5000000 3.172863 1.586432     CLOUDS
## 2  55942a76c63a7fe59b49728d      ncdc     1 0.1666667 7.603680 1.267280     CLOUDS
## 3  55942a60c63a7fe59b49612f     cloud     1 0.5000000 2.464212 1.232106     CLOUDS
## 4  55942a5ac63a7fe59b495bd8      fife     1 0.2000000 5.910360 1.182072     CLOUDS
## 5  55942a5cc63a7fe59b495deb allometry     1 0.1428571 7.891362 1.127337 VEGETATION
## 6  55942a5dc63a7fe59b495ede       tgb     3 0.1875000 5.945452 1.114772 VEGETATION
## 7  55942a5ac63a7fe59b495bd8      tovs     1 0.2000000 5.524238 1.104848     CLOUDS
## 8  55942a5ac63a7fe59b495bd8  received     1 0.2000000 5.332843 1.066569     CLOUDS
## 9  55942a5cc63a7fe59b495dfd       sap     1 0.1250000 8.430358 1.053795 VEGETATION
## 10 55942a60c63a7fe59b496131  abstract     1 0.3333333 3.118561 1.039520     CLOUDS
## # ... with 112 more rows
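The 122-row table above contains only a few keywords and no words with tf equal to 1, which suggests a filtering step whose code is not shown. The following is only a sketch of one way to do such filtering, not the original code: the toy values are rounded from the output above, the `SOLAR ACTIVITY` row is invented, and keeping two words per keyword is an arbitrary choice.

```r
library(dplyr)

# Toy joined table (values rounded from the output above, plus one invented row)
toy <- tibble(
  word    = c("estimates", "cloud", "ncdc", "allometry", "sap", "rdr"),
  tf      = c(0.5, 0.5, 0.17, 0.14, 0.125, 1),
  tf_idf  = c(1.59, 1.23, 1.27, 1.13, 1.05, 10.4),
  keyword = c("CLOUDS", "CLOUDS", "CLOUDS", "VEGETATION", "VEGETATION",
              "SOLAR ACTIVITY")
)

top_words <- toy %>%
  filter(tf < 1) %>%                                   # drop one-word descriptions
  filter(keyword %in% c("CLOUDS", "VEGETATION")) %>%   # keep keywords of interest
  group_by(keyword) %>%
  slice_max(tf_idf, n = 2) %>%                         # two highest-scoring words each
  ungroup()
top_words
```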


## # A tibble: 1 x 1
##              desc
##             <chr>
## 1 Cloud estimates


