data

Posts tagged with data

Why it matters: This article demonstrates how to automate the challenging process of migrating and scaling stateful Hadoop clusters, significantly reducing manual effort and operational risk. It offers a blueprint for managing large-scale distributed data infrastructure efficiently.

  • Pinterest developed Hadoop Control Center (HCC) to automate complex migration and scaling operations for its large, stateful Hadoop clusters on AWS.
  • Traditional manual scale-in procedures for Hadoop clusters were tedious and error-prone, involving many steps such as updating exclude files, monitoring data drainage, and managing Auto Scaling Groups (ASGs).
  • HCC enables in-place cluster migrations by introducing new ASGs with updated AMIs or instance types, avoiding costly and risky full cluster replacements.
  • The tool streamlines scaling-in by managing node decommissioning and ensuring HDFS data replication to new nodes before termination, preventing data loss or workload impact.
  • HCC provides a centralized platform for various Hadoop-related tasks, including ASG resizing, node status monitoring, YARN application reporting, and AWS event tracking.
  • Its architecture includes a manager node for API calls and caching, and worker nodes per VPC to manage clusters, facilitating automated and efficient cluster administration.
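
The scale-in sequence described above (mark nodes for exclusion, wait for HDFS data to drain, then terminate) can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not HCC's implementation: `Node`, `FakeCluster`, `scale_in`, and the `drain_rate` and `max_ticks` parameters are hypothetical names, and the cluster is an in-memory stand-in rather than a real HDFS or AWS client.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    host: str
    blocks_to_replicate: int    # blocks that must be copied elsewhere before removal
    state: str = "in_service"   # in_service -> decommissioning -> decommissioned -> terminated


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for a Hadoop cluster (assumption)."""
    nodes: dict
    exclude_file: set = field(default_factory=set)

    def tick(self, drain_rate):
        # Simulate one monitoring interval of re-replication to other nodes.
        for node in self.nodes.values():
            if node.state == "decommissioning":
                node.blocks_to_replicate = max(0, node.blocks_to_replicate - drain_rate)
                if node.blocks_to_replicate == 0:
                    node.state = "decommissioned"


def scale_in(cluster, hosts, drain_rate=100, max_ticks=1000):
    # Step 1: list the hosts in the exclude file so decommissioning starts.
    for host in hosts:
        cluster.exclude_file.add(host)
        cluster.nodes[host].state = "decommissioning"

    # Step 2: monitor data drainage; never terminate a node that still holds data.
    ticks = 0
    while not all(cluster.nodes[h].state == "decommissioned" for h in hosts):
        if ticks >= max_ticks:
            raise TimeoutError("data drain did not finish; no nodes were terminated")
        cluster.tick(drain_rate)
        ticks += 1

    # Step 3: only now is it safe to terminate (in practice, shrink the ASG).
    for host in hosts:
        cluster.nodes[host].state = "terminated"
    return list(hosts)
```

The ordering is the point of the sketch: termination happens only after every selected node reports that its data has been replicated elsewhere, and a timeout leaves the nodes running rather than risking data loss.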

Why it matters: This article details how to build secure, privacy-preserving enterprise search and AI features. It offers a blueprint for integrating external data without compromising user data, leveraging RAG, federated search, and strict access controls. Essential for engineers building secure data platforms.

  • Slack's enterprise search and AI uphold strict security and privacy by keeping customer data within Slack's trust boundary, utilizing an AWS escrow VPC for LLMs.
  • The system employs Retrieval Augmented Generation (RAG) instead of training Large Language Models (LLMs) on customer data, ensuring data privacy and preventing retention.
  • Enterprise search operates on a federated, real-time model, never storing external source data in Slack's databases, but rather fetching it via partner APIs.
  • Access to external content is strictly permissioned based on the user's existing Access Control Lists (ACLs) and requires explicit user/admin consent, adhering to the principle of least privilege.
  • External data and permissions are always up-to-date with the source system, ensuring accuracy and compliance.
  • Search Answer summaries generated by the AI are ephemeral, shown to the user and immediately discarded, further enhancing privacy.
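
The combination of federated retrieval, ACL filtering, and RAG described above can be illustrated with a short sketch. Everything here is an assumption for illustration: `Document`, `federated_search`, `build_prompt`, and `answer` are hypothetical names, each source is modeled as a plain callable standing in for a partner API, and `llm` is any function that maps a prompt to text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_users: frozenset  # ACL as reported by the source system at query time


def federated_search(query, user, sources):
    """Query each connected source at request time and keep only permitted hits."""
    hits = []
    for fetch in sources:           # each `fetch` stands in for a partner API call
        for doc in fetch(query):
            if user in doc.allowed_users:
                hits.append(doc)
    return hits


def build_prompt(query, docs):
    # Retrieved text is passed as context; the model is not trained on it.
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return f"Answer using only the context below.\n{context}\nQuestion: {query}"


def answer(query, user, sources, llm):
    docs = federated_search(query, user, sources)
    prompt = build_prompt(query, docs)
    # The summary is returned to the caller and nothing is written to storage.
    return llm(prompt)
```

Because the ACL check runs on results fetched at request time, a permission revoked in the source system takes effect on the next query, with no stored copy left to go stale.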

Why it matters: Managing content quality at scale requires balancing real-time signals with static analysis. This approach shows how to operationalize quality metrics and use multi-stage ML pipelines to protect users while maintaining high-performance recommendation systems.

  • Combined manual labeling with classifier scores to create calibrated metrics for statistically significant A/B testing results.
  • Developed 'read-path' models that utilize real-time engagement signals like comments and likes to improve detection precision.
  • Maintained 'write-path' filters at the sourcing level to handle low-prevalence violations and ensure a baseline of benign content.
  • Implemented a multi-stage pipeline that balances high-precision sourcing filters with fine-tuned ranking models.
  • Established continuous model performance tracking to identify edge cases and maintain user safety standards.
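
One common way to combine manual labels with classifier scores into a single metric is a stratified estimate: bucket content by classifier score, hand-label a sample from each bucket, and weight each bucket's labeled violation rate by its share of impressions. The summary does not say which method was used, so the sketch below is an assumption, and `calibrated_prevalence` and its arguments are hypothetical names.

```python
def calibrated_prevalence(bucket_impressions, bucket_labels):
    """Estimate the share of impressions that are violating.

    bucket_impressions: impressions per classifier-score bucket.
    bucket_labels: manual labels (1 = violating, 0 = benign) for a sample
                   drawn from each bucket; every bucket needs at least one label.
    """
    total = sum(bucket_impressions.values())
    estimate = 0.0
    for bucket, impressions in bucket_impressions.items():
        labels = bucket_labels[bucket]
        violation_rate = sum(labels) / len(labels)   # human-labeled rate in this bucket
        estimate += (impressions / total) * violation_rate
    return estimate


# Example: most traffic is in the low-score bucket, where reviewers found no violations.
impressions = {"low_score": 900, "high_score": 100}
labels = {"low_score": [0, 0, 0, 0], "high_score": [1, 1, 0, 0]}
prevalence = calibrated_prevalence(impressions, labels)
```

Computing the same estimate separately for the control and treatment arms of an experiment gives a metric grounded in human judgment while needing far fewer labels than uniform sampling would.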

Why it matters: Engineers must balance performance and resource consumption. This case study shows how optimizing data usage through prefetching and resolution controls can improve user engagement and retention in data-constrained markets, proving that efficiency and growth can go hand-in-hand.

  • Instagram launched Data Saver Mode for Android to address high data consumption and improve efficiency relative to other Meta apps.
  • The implementation focuses on three levers: disabling video prefetch, disabling video autoplay, and offering manual media resolution controls.
  • Disabling prefetch ensures video data is only downloaded when a user stops scrolling, preventing waste on unviewed content.
  • Users can configure high-resolution media settings to 'Never,' 'Wi-Fi Only,' or 'Cellular and Wi-Fi' to manage their data budgets.
  • Global A/B testing showed that reducing data usage led to unexpected increases in user interactions and content creation.
  • The custom solution provides a smoother experience than Android's native Data Saver, which often blocks media loading entirely.
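
The download and resolution rules above reduce to a few small decisions. The sketch below is illustrative only: the three option labels come from the summary, while `HighResSetting`, `use_high_resolution`, and `should_download_video` are hypothetical names, not Instagram's code.

```python
from enum import Enum


class HighResSetting(Enum):
    NEVER = "Never"
    WIFI_ONLY = "Wi-Fi Only"
    CELLULAR_AND_WIFI = "Cellular and Wi-Fi"


def use_high_resolution(setting, on_wifi):
    """Decide whether to request high-resolution media on the current network."""
    if setting is HighResSetting.NEVER:
        return False
    if setting is HighResSetting.WIFI_ONLY:
        return on_wifi
    return True  # CELLULAR_AND_WIFI: high resolution on any network


def should_download_video(data_saver_on, user_is_scrolling):
    """With Data Saver on, prefetch is disabled: fetch only once scrolling stops."""
    if data_saver_on:
        return not user_is_scrolling
    return True  # normal mode may prefetch upcoming videos while scrolling
```

Keeping the decisions in pure functions like these makes each lever easy to toggle independently in an A/B test.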

Why it matters: This article provides a blueprint for building massive-scale recommendation engines. It demonstrates how custom DSLs and multi-stage filtering balance high-velocity experimentation with the extreme computational efficiency required to serve millions of users in real-time.

  • Instagram uses a three-stage ranking funnel to filter billions of media items into a personalized feed for each user in real-time.
  • Engineers developed IGQL, a domain-specific language optimized in C++, to allow high-level algorithm design with low-latency execution.
  • The system utilizes 'ig2vec' account embeddings to identify topical similarities based on user interaction sequences, similar to word2vec.
  • Facebook’s FAISS library is used for efficient nearest-neighbor retrieval across millions of account embeddings.
  • The infrastructure supports massive scale, processing 65 billion features and making 90 million model predictions every second.
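
Retrieving similar accounts from embeddings comes down to a nearest-neighbor search. The sketch below uses brute-force cosine similarity in plain Python to show the idea; it is not FAISS, which replaces this linear scan with indexed search at scale. The function names and the toy two-dimensional vectors are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_accounts(query_id, embeddings, k):
    """Return the k accounts whose embeddings are most similar to query_id's."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vector), account)
        for account, vector in embeddings.items()
        if account != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [account for _, account in scored[:k]]


# Toy embeddings: two cooking accounts point in a similar direction.
embeddings = {
    "chef_a": [1.0, 0.0],
    "chef_b": [0.9, 0.1],
    "surfer": [0.0, 1.0],
}
similar = nearest_accounts("chef_a", embeddings, k=1)
```

The linear scan costs one similarity computation per account for every query, which is why an approximate index becomes necessary once there are millions of embeddings.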

Why it matters: Cache-first rendering provides immediate UI feedback but creates complex state sync challenges. This approach shows how to use Git-like rebase patterns in Redux to ensure user interactions aren't lost when merging stale cached data with fresh server responses.

  • Implemented cache-first rendering by storing a subset of the Redux store in IndexedDB to allow immediate page hydration.
  • Addressed race conditions where user interactions on cached data, such as likes or comments, could be overwritten by incoming server responses.
  • Developed a staging mechanism that treats cached state as a local branch and server data as master, performing a rebase-like operation for state updates.
  • Created a staging API using stagingAction and stagingCommit to queue dispatched actions while network requests are pending.
  • Used a Redux reducer enhancer to apply queued actions to the fresh server state before committing it to the main store.
  • Achieved significant performance gains, including a 2.5% improvement in feed display time and an 11% improvement in stories tray display time.
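
The rebase-like staging idea can be shown in a few lines, written here in Python rather than JavaScript to keep the sketch self-contained. This is a simplified illustration under stated assumptions, not the actual staging API: `StagedStore`, its `dispatch` and `commit` methods, and the `LIKE` action are hypothetical, loosely mirroring the roles of stagingAction and stagingCommit.

```python
def reducer(state, action):
    """A tiny Redux-style reducer: returns a new state, never mutates the old one."""
    if action["type"] == "LIKE":
        return {**state, "liked": state["liked"] | {action["post_id"]}}
    return state


class StagedStore:
    def __init__(self, reducer, cached_state):
        self.reducer = reducer
        self.state = cached_state   # rendered immediately from the cache
        self.pending = []           # actions dispatched while the request is in flight
        self.staging = True

    def dispatch(self, action):
        # Apply right away so the UI responds, and remember the action for replay.
        self.state = self.reducer(self.state, action)
        if self.staging:
            self.pending.append(action)

    def commit(self, server_state):
        # "Rebase": start from the fresh server state and replay the queued actions.
        rebased = server_state
        for action in self.pending:
            rebased = self.reducer(rebased, action)
        self.state = rebased
        self.pending = []
        self.staging = False


store = StagedStore(reducer, {"posts": ["p1"], "liked": frozenset()})
store.dispatch({"type": "LIKE", "post_id": "p1"})                 # user likes a cached post
store.commit({"posts": ["p2", "p1"], "liked": frozenset()})       # server response arrives
```

After the commit, the store holds the server's newer post list and still records the like made on the cached page, which is the outcome the rebase analogy is meant to guarantee.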

Why it matters: This interview offers a look into how Instagram uses data science and experimentation to drive product strategy. It highlights the intersection of technical leadership, user-centric culture, and the professional development skills necessary to succeed in high-scale engineering organizations.

  • Tamar Shapiro leads Instagram's analytics team, overseeing data scientists and engineers focused on experimentation and data-driven product decisions.
  • A key project discussed is the 'private like counts' test, which aimed to shift user focus from quantity to quality of interactions.
  • Instagram's engineering culture is characterized as highly collaborative with a 'people first' value system centered on the user community.
  • Effective communication and context sharing are emphasized as vital for maintaining alignment in fast-paced development environments.
  • Career growth advice for engineers includes prioritizing networking outside immediate teams and building confidence to advocate for accomplishments.