Loading...
Loading...
Loading...
- [Data Analytics Fundamentals](#data-analytics-fundamentals)
# Tổng Hợp Data Analytics & Business Intelligence ## 📋 Mục Lục - [Data Analytics Fundamentals](#data-analytics-fundamentals) - [Business Intelligence Tools](#business-intelligence-tools) - [Data Warehousing](#data-warehousing) - [ETL/ELT Tools](#etlelt-tools) - [Statistical Analysis](#statistical-analysis) - [Data Visualization](#data-visualization) - [Big Data Analytics](#big-data-analytics) - [Real-time Analytics](#real-time-analytics) - [Self-Service Analytics](#self-service-analytics) ## 📊 Data Analytics Fundamentals ### Types of Analytics | Type | Purpose | Questions Answered | Techniques | Business Value | |------|---------|-------------------|------------|----------------| | **Descriptive** | What happened? | Historical analysis | Reporting, dashboards | Understanding past | | **Diagnostic** | Why did it happen? | Root cause analysis | Drill-down, correlation | Problem identification | | **Predictive** | What will happen? | Future forecasting | ML, statistical models | Planning, preparation | | **Prescriptive** | What should we do? | Optimization | AI, simulation | Decision support | ### Analytics Maturity Model | Level | Characteristics | Capabilities | Tools | ROI | |-------|----------------|-------------|-------|-----| | **Level 1: Basic** | Spreadsheets, manual reports | Basic reporting | Excel, simple BI | Low | | **Level 2: Opportunistic** | Some automation | Standard dashboards | BI tools, databases | Medium | | **Level 3: Systematic** | Integrated analytics | Self-service analytics | Advanced BI, data lakes | High | | **Level 4: Differentiating** | Predictive analytics | ML-driven insights | AI/ML platforms | Very High | | **Level 5: Transformational** | AI-first organization | Autonomous decisions | Advanced AI, real-time | Exceptional | ### Data Types & Sources | Data Type | Characteristics | Sources | Analysis Methods | |-----------|----------------|---------|------------------| | **Structured** | Organized, tabular | Databases, CRM, ERP | SQL, statistical analysis | | **Semi-structured** | Partially organized | JSON, XML, logs | NoSQL, text mining | | **Unstructured** | No predefined format | Text, images, video | NLP, computer vision | | **Streaming** | Real-time flow | IoT, social media, sensors | Stream processing | ## 📈 Business Intelligence Tools ### Enterprise BI Platforms | Platform | Vendor | Strengths | Weaknesses | Target Market | |----------|--------|-----------|------------|---------------| | **Tableau** | Salesforce | Visualization, ease of use | Cost, performance with big data | Enterprise, analysts | | **Power BI** | Microsoft | Office integration, cost-effective | Limited customization | Microsoft ecosystem | | **QlikView/QlikSense** | Qlik | Associative model, in-memory | Learning curve | Enterprise | | **Looker** | Google | Modern architecture, Git integration | Technical complexity | Data-driven organizations | | **Sisense** | Sisense | Simplicity, AI-driven | Limited advanced features | Mid-market | ### Self-Service BI Tools | Tool | Vendor | User Type | Strengths | Use Cases | |------|--------|-----------|-----------|-----------| | **Tableau Desktop** | Salesforce | Analysts | Powerful visualization | Ad-hoc analysis | | **Power BI Desktop** | Microsoft | Business users | Easy to learn | Departmental reporting | | **QlikSense** | Qlik | Business users | Associative exploration | Self-service discovery | | **Spotfire** | TIBCO | Data scientists | Advanced analytics | Scientific analysis | ### Open Source BI | Tool | Type | Strengths | Limitations | Community | |------|------|-----------|-------------|-----------| | **Apache Superset** | Web-based BI | Modern, extensible | Smaller ecosystem | Growing | | **Metabase** | Simple BI | Easy setup, user-friendly | Limited advanced features | Active | | **Grafana** | Monitoring/BI | Time-series focus | Not general-purpose BI | Large | | **Apache Zeppelin** | Notebook-based | Data science integration | Technical users only | Moderate | ## 🏢 Data Warehousing ### Traditional Data Warehouses | Platform | Vendor | Architecture | Strengths | Use Cases | |----------|--------|--------------|-----------|-----------| | **Oracle Exadata** | Oracle | Engineered systems | Performance, Oracle integration | Enterprise Oracle shops | | **IBM Db2 Warehouse** | IBM | MPP architecture | IBM ecosystem integration | IBM environments | | **Microsoft SQL Server** | Microsoft | SMP/MPP hybrid | .NET integration | Microsoft-centric orgs | | **SAP HANA** | SAP | In-memory | Real-time analytics | SAP environments | ### Cloud Data Warehouses | Platform | Vendor | Architecture | Pricing Model | Strengths | |----------|--------|--------------|---------------|-----------| | **Snowflake** | Snowflake | Multi-cluster | Compute + storage | Scalability, ease of use | | **Amazon Redshift** | AWS | MPP columnar | On-demand/reserved | AWS integration | | **Google BigQuery** | Google | Serverless | Query-based | Serverless, ML integration | | **Azure Synapse** | Microsoft | Unified analytics | Pay-per-use | Analytics + data integration | | **Databricks** | Databricks | Lakehouse | DBU-based | Unified analytics platform | ### Modern Data Architecture | Architecture | Characteristics | Benefits | Challenges | |--------------|----------------|----------|------------| | **Data Lake** | Raw data storage | Flexibility, cost | Data governance | | **Data Lakehouse** | Lake + warehouse hybrid | Best of both worlds | Complexity | | **Data Mesh** | Decentralized domains | Scalability, ownership | Coordination overhead | | **Data Fabric** | Unified data layer | Integration, governance | Implementation complexity | ## 🔄 ETL/ELT Tools ### Enterprise ETL Platforms | Platform | Vendor | Approach | Strengths | Target Market | |----------|--------|----------|-----------|---------------| | **Informatica PowerCenter** | Informatica | ETL | Enterprise features, performance | Large enterprises | | **IBM DataStage** | IBM | ETL | Parallel processing, IBM integration | IBM shops | | **Microsoft SSIS** | Microsoft | ETL | SQL Server integration | Microsoft environments | | **Talend** | Talend | ETL/ELT | Open source + enterprise | Mixed environments | | **Pentaho** | Hitachi Vantara | ETL | Open source option | Cost-conscious orgs | ### Cloud-Native ETL/ELT | Platform | Vendor | Approach | Strengths | Use Cases | |----------|--------|----------|-----------|-----------| | **AWS Glue** | AWS | Serverless ETL | AWS integration, serverless | AWS data pipelines | | **Azure Data Factory** | Microsoft | Cloud ETL/ELT | Azure integration, hybrid | Azure environments | | **Google Dataflow** | Google | Stream/batch processing | Apache Beam, auto-scaling | GCP data processing | | **Fivetran** | Fivetran | ELT-focused | Pre-built connectors | SaaS data integration | | **Stitch** | Talend | Simple ELT | Easy setup, affordable | Small to medium businesses | ### Modern Data Integration | Tool | Type | Approach | Strengths | |------|------|----------|-----------| | **Apache Airflow** | Workflow orchestration | Code-based | Flexibility, Python | | **Prefect** | Workflow orchestration | Modern Python | Developer experience | | **dbt** | Data transformation | SQL-based | Analytics engineering | | **Apache NiFi** | Data flow | Visual interface | Real-time, drag-and-drop | ## 📊 Statistical Analysis ### Statistical Software | Software | Vendor | Strengths | Use Cases | Learning Curve | |----------|--------|-----------|-----------|----------------| | **R** | Open source | Statistical computing, packages | Research, advanced analytics | Steep | | **SAS** | SAS Institute | Enterprise analytics, reliability | Regulated industries | Moderate | | **SPSS** | IBM | User-friendly, comprehensive | Social sciences, surveys | Easy | | **Stata** | StataCorp | Econometrics, data management | Economic research | Moderate | | **Minitab** | Minitab | Quality improvement, Six Sigma | Manufacturing, quality | Easy | ### Python Statistical Libraries | Library | Purpose | Strengths | Use Cases | |---------|---------|-----------|-----------| | **Pandas** | Data manipulation | Data analysis, cleaning | Data preprocessing | | **NumPy** | Numerical computing | Fast arrays, mathematical operations | Scientific computing | | **SciPy** | Scientific computing | Statistical functions, optimization | Research, analysis | | **Statsmodels** | Statistical modeling | Statistical tests, econometrics | Statistical analysis | | **Scikit-learn** | Machine learning | Easy-to-use ML algorithms | Predictive modeling | ### Statistical Methods | Method | Purpose | When to Use | Tools | |--------|---------|-------------|-------| | **Descriptive Statistics** | Summarize data | Data exploration | All statistical tools | | **Hypothesis Testing** | Test assumptions | Validate theories | R, SAS, SPSS | | **Regression Analysis** | Predict relationships | Forecasting, modeling | R, Python, SAS | | **Time Series Analysis** | Analyze temporal data | Forecasting, trends | R, Python, specialized tools | | **Multivariate Analysis** | Multiple variables | Complex relationships | R, SAS, SPSS | ## 📊 Data Visualization ### Visualization Tools | Tool | Type | Strengths | Best For | Cost | |------|------|-----------|----------|------| | **Tableau** | Desktop/Server | Rich visualizations, interactivity | Business dashboards | High | | **Power BI** | Cloud/Desktop | Microsoft integration, cost-effective | Microsoft environments | Medium | | **D3.js** | JavaScript library | Complete customization | Custom web visualizations | Free | | **Plotly** | Multi-language | Interactive plots, web deployment | Data science, web apps | Freemium | | **Matplotlib/Seaborn** | Python | Statistical plots, publication-ready | Scientific visualization | Free | ### Chart Types & Use Cases | Chart Type | Purpose | Best For | Avoid When | |------------|---------|----------|------------| | **Bar Charts** | Compare categories | Categorical data comparison | Too many categories | | **Line Charts** | Show trends over time | Time series data | Non-temporal data | | **Scatter Plots** | Show relationships | Correlation analysis | No clear relationship | | **Heatmaps** | Show patterns in matrices | Correlation matrices, geographic data | Sparse data | | **Box Plots** | Show distributions | Statistical distributions | Non-statistical audiences | ### Dashboard Design Principles | Principle | Description | Implementation | |-----------|-------------|----------------| | **Clarity** | Clear, unambiguous information | Simple layouts, clear labels | | **Relevance** | Show what matters | Focus on key metrics | | **Consistency** | Uniform design elements | Standard colors, fonts, layouts | | **Interactivity** | Enable exploration | Filters, drill-down capabilities | | **Performance** | Fast loading and response | Optimized queries, caching | ## 🗄️ Big Data Analytics ### Big Data Platforms | Platform | Type | Strengths | Use Cases | |----------|------|-----------|-----------| | **Apache Hadoop** | Distributed storage/processing | Mature ecosystem, cost-effective | Batch processing, data lakes | | **Apache Spark** | In-memory processing | Speed, unified analytics | Real-time + batch processing | | **Cloudera** | Hadoop distribution | Enterprise features, support | Enterprise Hadoop deployments | | **Hortonworks** | Hadoop distribution | Open source focus | Cost-conscious Hadoop | | **MapR** | Converged platform | Performance, real-time | High-performance requirements | ### Big Data Processing Frameworks | Framework | Type | Strengths | Use Cases | |-----------|------|-----------|-----------| | **Apache Spark** | Unified analytics | In-memory, multi-language | Batch + stream processing | | **Apache Flink** | Stream processing | Low latency, exactly-once | Real-time stream processing | | **Apache Storm** | Stream processing | Real-time, fault-tolerant | Event processing | | **Apache Kafka** | Stream platform | High throughput, durable | Event streaming, messaging | ### NoSQL Analytics | Database | Type | Analytics Capabilities | Use Cases | |----------|------|----------------------|-----------| | **MongoDB** | Document | Aggregation pipeline, Atlas | Content analytics | | **Cassandra** | Wide-column | Spark integration | Time-series analytics | | **Neo4j** | Graph | Cypher queries, graph algorithms | Network analysis | | **Elasticsearch** | Search engine | Kibana, aggregations | Log analytics, search | ## ⚡ Real-time Analytics ### Stream Processing Platforms | Platform | Vendor | Latency | Scalability | Use Cases | |----------|--------|---------|-------------|-----------| | **Apache Kafka Streams** | Apache | Low | High | Event-driven applications | | **Apache Flink** | Apache | Ultra-low | Very high | Real-time analytics | | **Amazon Kinesis** | AWS | Low | High | AWS real-time processing | | **Azure Stream Analytics** | Microsoft | Low | High | Azure real-time analytics | | **Google Dataflow** | Google | Low | High | GCP stream processing | ### Real-time Use Cases | Use Case | Requirements | Technologies | Benefits | |----------|--------------|--------------|---------| | **Fraud Detection** | Sub-second response | ML + stream processing | Prevent losses | | **Recommendation Engines** | Low latency | Real-time ML | Increase engagement | | **IoT Analytics** | High throughput | Time-series databases | Operational insights | | **Trading Systems** | Ultra-low latency | Specialized hardware/software | Competitive advantage | ### Event-Driven Architecture | Component | Purpose | Technologies | Considerations | |-----------|---------|--------------|---------------| | **Event Producers** | Generate events | Applications, IoT devices | Event schema design | | **Event Brokers** | Route events | Kafka, Pulsar, cloud services | Scalability, durability | | **Event Processors** | Process events | Stream processing frameworks | Stateful vs stateless | | **Event Stores** | Store events | Event databases, data lakes | Retention policies | ## 🔧 Self-Service Analytics ### Self-Service BI Platforms | Platform | Target Users | Capabilities | Governance | |----------|--------------|-------------|------------| | **Tableau** | Analysts, power users | Advanced visualization | Tableau Server governance | | **Power BI** | Business users | Easy report creation | Power BI governance | | **QlikSense** | Business users | Associative exploration | Qlik governance framework | | **Looker** | Technical users | Git-based modeling | LookML governance | ### Data Preparation Tools | Tool | Vendor | Approach | Target Users | |------|--------|----------|--------------| | **Tableau Prep** | Salesforce | Visual data prep | Tableau users | | **Power Query** | Microsoft | Formula-based | Excel/Power BI users | | **Trifacta** | Alteryx | ML-assisted prep | Data analysts | | **Dataiku** | Dataiku | Collaborative platform | Data teams | ### Governance Considerations | Aspect | Challenges | Solutions | |--------|------------|-----------| | **Data Quality** | Inconsistent definitions | Data catalogs, lineage | | **Security** | Unauthorized access | Role-based access control | | **Compliance** | Regulatory requirements | Audit trails, data classification | | **Performance** | Resource contention | Query optimization, caching | ## 🎯 Analytics Career Paths ### Entry Level Roles | Role | Responsibilities | Skills Required | Salary Range | |------|-----------------|-----------------|--------------| | **Data Analyst** | Reporting, basic analysis | SQL, Excel, BI tools | $45K-$70K | | **Business Analyst** | Requirements, process analysis | Business knowledge, basic analytics | $50K-$80K | | **Junior Data Scientist** | Model building, analysis | Python/R, statistics | $60K-$90K | ### Mid-Level Roles | Role | Responsibilities | Skills Required | Salary Range | |------|-----------------|-----------------|--------------| | **Senior Data Analyst** | Advanced analysis, mentoring | Advanced SQL, statistics, domain expertise | $70K-$100K | | **BI Developer** | Dashboard development, ETL | BI tools, SQL, data modeling | $75K-$110K | | **Data Engineer** | Data pipelines, infrastructure | Programming, big data tools | $90K-$130K | ### Senior Level Roles | Role | Responsibilities | Skills Required | Salary Range | |------|-----------------|-----------------|--------------| | **Analytics Manager** | Team leadership, strategy | Leadership, business acumen | $100K-$150K | | **Principal Data Scientist** | Technical leadership, research | Advanced analytics, domain expertise | $130K-$200K | | **Chief Data Officer** | Data strategy, governance | Executive leadership, data strategy | $200K-$400K+ | ## 📊 Industry Applications ### Retail & E-commerce - **Customer Analytics**: Segmentation, lifetime value - **Recommendation Systems**: Product recommendations - **Price Optimization**: Dynamic pricing strategies - **Inventory Analytics**: Demand forecasting ### Financial Services - **Risk Analytics**: Credit scoring, market risk - **Fraud Detection**: Transaction monitoring - **Algorithmic Trading**: Quantitative strategies - **Regulatory Reporting**: Compliance analytics ### Healthcare - **Clinical Analytics**: Treatment effectiveness - **Population Health**: Public health insights - **Drug Discovery**: Pharmaceutical research - **Healthcare Operations**: Resource optimization ### Manufacturing - **Quality Analytics**: Defect prediction - **Predictive Maintenance**: Equipment optimization - **Supply Chain Analytics**: Logistics optimization - **Process Optimization**: Operational efficiency ## 📚 Learning Resources ### Certifications - **Microsoft Certified: Data Analyst Associate** - **Tableau Desktop Specialist/Certified Associate** - **Google Analytics Individual Qualification** - **SAS Certified Specialist** - **Qlik Sense Business Analyst Certification** ### Online Learning - **Coursera**: Data Science specializations - **edX**: Analytics and BI courses - **Udacity**: Data Analyst Nanodegree - **DataCamp**: Interactive data science learning - **Pluralsight**: Technology skills platform ### Books - **"The Data Warehouse Toolkit"** - Ralph Kimball - **"Storytelling with Data"** - Cole Nussbaumer Knaflic - **"The Analytics Setup Guidebook"** - Paul Kamp - **"Data Science for Business"** - Foster Provost ### Practice Datasets - **Kaggle Datasets**: Real-world data challenges - **UCI ML Repository**: Classic datasets - **Google Dataset Search**: Discover datasets - **AWS Open Data**: Cloud-hosted datasets - **Government Open Data**: Public sector data --- *Cập nhật lần cuối: December 2024*
> 屬於 [research/](./README.md)。涵蓋 LLM-as-Judge、Reasoning Model、評估維度、Judge 設計原則。
> ⚠️ Note (Option A): `hwp-web (planned)` is intentionally excluded/disabled in this repo snapshot.
Here are three new, highly specialized AI agents for the T20 framework:
The **LLM Judge** is LLMTrace's third security detector alongside the