
S3 Data Access

This guide explains how to access and analyze Lens Protocol data stored in Amazon S3. The data is synchronized from PostgreSQL databases using AWS Database Migration Service (DMS).

Configure AWS CLI

  1. Install the AWS CLI by following the official AWS CLI installation guide

  2. Test your access:

    aws s3 ls s3://lens-protocol-mainnet-data/ --no-sign-request

Access the Data

  1. List available schemas:

    aws s3 ls s3://lens-protocol-mainnet-data/ --no-sign-request
  2. List available tables within a schema:

    aws s3 ls s3://lens-protocol-mainnet-data/{schema_name}/ --no-sign-request
  3. Download data using the AWS CLI:

    aws s3 cp s3://lens-protocol-mainnet-data/schema/table/LOAD00000001.parquet . --no-sign-request
  4. Download data using HTTPS:

    curl -O https://lens-protocol-mainnet-data.s3.us-east-1.amazonaws.com/schema/table/LOAD00000001.parquet

Data Organization

The data in S3 is organized in the following structure:

s3://lens-protocol-mainnet-data/
├── schema/                            # e.g., account, post, etc.
│   ├── table/                         # e.g., metadata, record, etc.
│   │   ├── LOAD[0-9A-F]{8}.parquet    # Initial data load
│   │   └── YYYY/MM/DD/                # CDC changes by date
│   │       └── YYYYMMDD-HHMMSSXXX.parquet
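
The same layout can be explored programmatically. Below is a minimal sketch using boto3 with unsigned (anonymous) requests, mirroring the --no-sign-request flag used above; the helper name list_prefixes is illustrative.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client (no credentials needed for this public bucket)
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
BUCKET = "lens-protocol-mainnet-data"

def list_prefixes(prefix=""):
    """List the immediate 'directories' under a prefix (schemas, then tables)."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, Delimiter="/")
    return [p["Prefix"] for p in resp.get("CommonPrefixes", [])]

for schema in list_prefixes():           # e.g. account/, post/, ...
    print(schema)
    for table in list_prefixes(schema):   # e.g. post/reaction/, ...
        print("   ", table)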

File Types

  1. Initial Load Files

    • Named: LOAD[0-9A-F]{8}.parquet

    • Contains: Base data snapshot(s)

    • Format: Apache Parquet

    • Example:
      • Table with a single file (e.g., LOAD00000001.parquet)

      • Table with multiple files (e.g., LOAD00000001.parquet through LOAD0000000F.parquet)

  2. Change Data Capture (CDC) Files

    • Named: YYYYMMDD-HHMMSSXXX.parquet

    • Contains: Incremental changes

    • Operations:
      • I: Insert (new record)

      • U: Update (modified record)

      • D: Delete (removed record)
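
The operation codes above can be replayed against an initial load to rebuild the current state of a table. Below is a minimal sketch under two assumptions: the CDC files expose the DMS operation column (named Op here) and the table has a single primary-key column (key_col); both names are illustrative and should be adapted to the actual schema.

import pandas as pd

def apply_cdc(base_df, cdc_df, key_col, op_col="Op"):
    """Replay one CDC DataFrame onto a base snapshot, oldest change first."""
    current = base_df.set_index(key_col)
    for _, row in cdc_df.iterrows():
        key = row[key_col]
        values = row.drop(labels=[op_col, key_col])
        if row[op_col] == "D":                 # delete removed records
            current = current.drop(key, errors="ignore")
        else:                                  # "I" or "U": insert / overwrite
            current.loc[key] = values
    return current.reset_index()

In practice you would read and concatenate all LOAD*.parquet files for a table first, then apply the CDC files in chronological order (the date-based folder names give that order).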

Example

Install Python Dependencies

  1. Create a virtual environment (optional but recommended):

    python -m venv lens-env
    source lens-env/bin/activate  # On Unix/macOS
  2. Install required packages:

    pip install "pandas>=2.0.0" "pyarrow>=14.0.1"
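
Note: pandas reads remote Parquet files through fsspec, so the s3:// usage examples further down will likely also need the s3fs package (pip install s3fs) plus anonymous storage options for this public bucket. This is an assumption about your environment rather than part of the original script; a minimal example:

import pandas as pd

df = pd.read_parquet(
    "s3://lens-protocol-mainnet-data/schema/table/LOAD00000001.parquet",
    storage_options={"anon": True},  # anonymous access, like --no-sign-request
)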

Save the following code as read_lens_data.py:

#!/usr/bin/env python3
import pandas as pd
import argparse
from datetime import datetime
import sys
import binascii


def format_timestamp(ts):
    """Convert timestamp to a readable format"""
    try:
        return pd.to_datetime(ts).strftime('%Y-%m-%d %H:%M:%S')
    except:
        return ts


def format_binary(binary_data):
    """Convert binary data to a readable hex format"""
    try:
        if isinstance(binary_data, bytes):
            # Convert to hex and remove the '0x' prefix
            return binascii.hexlify(binary_data).decode('utf-8')
        return binary_data
    except:
        return str(binary_data)


def read_parquet_file(url):
    """Read a parquet file from S3 and return as DataFrame"""
    try:
        df = pd.read_parquet(url)
        return df
    except Exception as e:
        print(f"Error reading parquet file: {e}")
        sys.exit(1)


def display_data(df):
    """Display the data in a human-readable format"""
    # Create a copy of the dataframe for display
    display_df = df.copy()

    # Format timestamp columns
    timestamp_cols = [col for col in df.columns
                      if 'timestamp' in col.lower() or 'time' in col.lower() or 'date' in col.lower()]
    for col in timestamp_cols:
        display_df[col] = display_df[col].apply(format_timestamp)

    # Format binary columns
    binary_cols = ['post', 'account', 'app']  # Known binary columns
    for col in binary_cols:
        if col in display_df.columns:
            display_df[col] = display_df[col].apply(format_binary)

    # Display basic information
    print("\n=== Dataset Information ===")
    print(f"Number of records: {len(df)}")
    print(f"Columns: {', '.join(df.columns)}")

    # Display column types
    print("\n=== Column Types ===")
    for col in df.columns:
        print(f"{col}: {df[col].dtype}")

    print("\n=== Sample Data (first 5 rows) ===")

    # Set display options for better readability
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)

    # Display the formatted data
    print(display_df.head().to_string())

    # Display value counts for categorical columns
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        if col not in binary_cols and df[col].nunique() < 10:
            print(f"\n=== {col} Distribution ===")
            print(df[col].value_counts().to_string())


def save_to_csv(df, output_path):
    """Save DataFrame to CSV with proper handling of binary data"""
    # Create a copy for saving
    save_df = df.copy()

    # Convert binary columns to hex
    binary_cols = ['post', 'account', 'app']
    for col in binary_cols:
        if col in save_df.columns:
            save_df[col] = save_df[col].apply(format_binary)

    # Save to CSV
    save_df.to_csv(output_path, index=False)


def main():
    parser = argparse.ArgumentParser(description='Read Lens Protocol Parquet files')
    parser.add_argument('url', help='URL of the parquet file')
    parser.add_argument('--output', '-o', help='Output file path (optional)')
    args = parser.parse_args()

    print(f"Reading data from: {args.url}")
    df = read_parquet_file(args.url)

    display_data(df)

    if args.output:
        save_to_csv(df, args.output)
        print(f"\nData saved to: {args.output}")


if __name__ == "__main__":
    main()

Running the Script

  1. Basic Usage

    python3 read_lens_data.py "s3://lens-protocol-testnet-data/post/reaction/LOAD00000001.parquet"
  2. Save to CSV

    python3 read_lens_data.py "s3://lens-protocol-testnet-data/post/reaction/LOAD00000001.parquet" -o reactions.csv

Output

Reading data from: https://lens-protocol-testnet-data.s3.us-east-1.amazonaws.com/post/reaction/LOAD00000001.parquet

=== Dataset Information ===
Number of records: 354
Columns: timestamp, post, account, type, action_at, app

=== Column Types ===
timestamp: object
post: object
account: object
type: object
action_at: datetime64[us, UTC]
app: object

=== Sample Data (first 5 rows) ===
             timestamp                                                              post                                   account    type                         action_at                                       app
0  2025-06-26 11:17:46  ab47c67b39b399a62aacdcc2d9b78f3460b627a9d6740564d0a0d1b7c1e3f01b  7479b233fb386ed4bcc889c9df8b522c972b09f2  UPVOTE  2025-04-14 20:52:13.656723+00:00  4abd67c2c42ff2b8003c642d0d0e562a3f900805
1  2025-06-26 11:17:46  33ca34ec9dc9da2b08931d97e092508f01a4e1f695f56e4822ab2491c98fe278  41bf9732b1e83f62d56f834ba090af3d79b21d83  UPVOTE  2025-04-14 22:43:20.331929+00:00  4abd67c2c42ff2b8003c642d0d0e562a3f900805
2  2025-06-26 11:17:46  33ca34ec9dc9da2b08931d97e092508f01a4e1f695f56e4822ab2491c98fe278  b9b0358d9f2461b2852255a07d3736a45300442b  UPVOTE  2025-04-14 22:56:46.519339+00:00  4abd67c2c42ff2b8003c642d0d0e562a3f900805
3  2025-06-26 11:17:46  5c8a610bef5ea5576b863e1a48ef4078974d4aeff3bf22a7b1748aaad98b5b38  7479b233fb386ed4bcc889c9df8b522c972b09f2  UPVOTE  2025-04-14 23:11:54.549963+00:00  4abd67c2c42ff2b8003c642d0d0e562a3f900805
4  2025-06-26 11:17:46  204e761b3eb1493470bcdc839a102bc0df57d3a43539565dad449855676906fe  7754f6ffd9a8cbcbc3d59aa54cce76cc0f27902c  UPVOTE  2025-04-15 11:42:47.021848+00:00  10651b74b26ac31aafe23615fc23872402086e85

Best Practices

  1. Performance Optimization

    • Use appropriate data types when reading Parquet files

    • Implement column filtering to read only needed data

    • Process data in batches for large datasets
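
    As a rough sketch of the first two points, pandas can push column selection down to the Parquet reader, and pyarrow can stream a file in record batches instead of loading it whole; the column names below come from the reaction table shown earlier, and the file is assumed to have been downloaded locally first:

    import pandas as pd
    import pyarrow.parquet as pq

    # Column filtering: only the listed columns are read from the file
    df = pd.read_parquet("LOAD00000001.parquet", columns=["post", "account", "type"])

    # Batch processing: iterate over record batches instead of loading everything
    total = 0
    for batch in pq.ParquetFile("LOAD00000001.parquet").iter_batches(batch_size=50_000):
        total += batch.num_rows  # replace with your own per-batch processing
    print(f"Processed {total} rows in batches")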

  2. Error Handling

    import pandas as pd
    from tenacity import retry, stop_after_attempt, wait_exponential

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
    def read_parquet_with_retry(url):
        return pd.read_parquet(url)
  3. Data Processing

    • Handle binary data appropriately (convert to hex for readability)

    • Format timestamps for your timezone

    • Implement proper error handling for data type conversions
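
    For the timestamp point, a one-line sketch converting the UTC action_at column (seen in the sample output) to a local timezone; the timezone name is only an example:

    import pandas as pd

    df["action_at"] = pd.to_datetime(df["action_at"], utc=True).dt.tz_convert("Europe/Paris")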

Troubleshooting

  1. Common Issues

    • Binary data handling: Use proper conversion for bytea columns

    • Timestamp parsing: Handle timezone information correctly

    • Memory management: Process large files in chunks

  2. Data Validation

    • Verify data types match schema definitions

    • Check for missing or null values

    • Validate binary data lengths for addresses and hashes
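
    A lightweight validation pass along these lines might look as follows. The expected lengths assume hex-formatted values as produced by the script above (40 characters for account/app addresses, 64 for post hashes, matching the sample output); adjust them per table:

    import pandas as pd

    def validate_reactions(df):
        """Basic sanity checks on a hex-formatted reaction DataFrame."""
        report = {"null_counts": df.isnull().sum().to_dict()}
        expected_hex_lengths = {"account": 40, "app": 40, "post": 64}  # assumed lengths
        for col, length in expected_hex_lengths.items():
            if col in df.columns:
                bad = int(df[col].astype(str).str.len().ne(length).sum())
                report[f"{col}_bad_length"] = bad
        return report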
