If you did not already know

Fast Randomized PCA

Principal component analysis (PCA) is widely used for dimension reduction and embedding of real data in social network analysis, information retrieval, and natural language processing, etc. In this work we propose a fast randomized PCA algorithm for processing large sparse data. The algorithm has similar accuracy to the basic randomized SVD (rPCA) algorithm (Halko et al., 2011), but is largely optimized for sparse data. It also has good flexibility to trade off runtime against accuracy for practical usage. Experiments on real data show that the proposed algorithm is up to 9.1X faster than the basic rPCA algorithm without accuracy loss, and is up to 20X faster than the svds in Matlab with little error. The algorithm computes the first 100 principal components of a large information retrieval data with 12,869,521 persons and 323,899 keywords in less than 400 seconds on a 24-core machine, while all conventional methods fail due to the out-of-memory issue. …


A hackable text editor for the 21st Century. Atom-IDE is a set of optional packages to bring IDE-like functionality to Atom and improve language integrations. Index ide screenshot Get smarter context-aware auto-completion, code navigation features such as an outline view, go to definition and find all references. You can also hover-to-reveal information, diagnostics (errors and warnings) and document formatting. To get all these IDE features, open Atom IDE UI in Atom and install the package. …


Browsing news articles on multiple devices is now possible. The lengths of news article headlines have precise upper bounds, dictated by the size of the display of the relevant device or interface. Therefore, controlling the length of headlines is essential when applying the task of headline generation to news production. However, because there is no corpus of headlines of multiple lengths for a given article, prior researches on controlling output length in headline generation have not discussed whether the evaluation of the setting that uses a single length reference can evaluate multiple length outputs appropriately. In this paper, we introduce two corpora (JNC and JAMUL) to confirm the validity of prior experimental settings and provide for the next step toward the goal of controlling output length in headline generation. The JNC provides common supervision data for headline generation. The JAMUL is a large-scale evaluation dataset for headlines of three different lengths composed by professional editors. We report new findings on these corpora; for example, while the longest length reference summary can appropriately evaluate the existing methods controlling output length, the methods do not manage the selection of words according to length constraint. …

Hierarchical Compartmental Model

A variety of triangle-based stochastic reserving techniques have been proposed for estimating future general insurance claims payments, ranging from generalized linear models (England and Verrall, 2002) to nonlinear hierarchical models (Guszcza, 2008). Methods incorporating both paid and incurred information have been explored (Martínez-Miranda, Nielsen and Verrall, 2012; Quarg and Mack, 2004), which provide richer inference and improved interpretability. Furthermore, Bayesian methods (Zhang, Dukic and Guszcza, 2012; Meyers, 2007; England and Verrall, 2005; Verrall, 2004) have become increasingly ubiquitous; providing flexibility and the ability to robustly incorporate judgment into uncertainty projections. This paper explores a new triangle-based (and optionally-Bayesian) stochastic reserving framework which considers the relationship between exposure, case reserves and paid claims. By doing so, it enables practitioners to build communicable models that are consistent with their understanding of the insurance claims process. Furthermore, it supports the identification and quantification of claims process characteristics to provide tangible business insights.

Hierarchical compartmental reserving models