S3 Data Access
This guide explains how to access and analyze Lens Protocol data stored in Amazon S3. The data is synchronized from PostgreSQL databases using AWS Database Migration Service (DMS).
Configure AWS CLI
- Install the AWS CLI by following the official AWS CLI installation guide.
- Test your access (the command below targets Mainnet; for Testnet, use the lens-protocol-testnet-data bucket):
  aws s3 ls s3://lens-protocol-mainnet-data/ --no-sign-request
Access the Data
The commands below target Mainnet; for Testnet, replace lens-protocol-mainnet-data with lens-protocol-testnet-data. A Python alternative using boto3 is sketched after this list.
- List available schemas:
  aws s3 ls s3://lens-protocol-mainnet-data/ --no-sign-request
- List available tables within a schema:
  aws s3 ls s3://lens-protocol-mainnet-data/{schema_name}/ --no-sign-request
- Download data using the AWS CLI:
  aws s3 cp s3://lens-protocol-mainnet-data/schema/table/LOAD00000001.parquet . --no-sign-request
- Download data using HTTPS:
  curl -O https://lens-protocol-mainnet-data.s3.us-east-1.amazonaws.com/schema/table/LOAD00000001.parquet
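The same anonymous access also works from Python. A minimal boto3 sketch (boto3 is an assumption here; it is not otherwise required by this guide) that lists schemas much like the first CLI command above:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned requests mirror the CLI's --no-sign-request flag
s3 = boto3.client("s3", region_name="us-east-1", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(
    Bucket="lens-protocol-mainnet-data",
    Delimiter="/",              # group keys by their top-level prefix (schema)
)
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])     # e.g. "account/", "post/", ...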
Data Organization
The data in S3 is organized in the following structure:
s3://lens-protocol-mainnet-data/
└── schema/                                  # e.g., account, post, etc.
    └── table/                               # e.g., metadata, record, etc.
        ├── LOAD[0-9A-F]{8}.parquet          # Initial data load
        └── YYYY/MM/DD/                      # CDC changes by date
            └── YYYYMMDD-HHMMSSXXX.parquet
The Testnet bucket (lens-protocol-testnet-data) follows the same layout.
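For example, one day's CDC files for a table can be listed with a date-scoped prefix. A boto3 sketch following the layout above (boto3 is assumed installed; the date is illustrative):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client, equivalent to the CLI's --no-sign-request
s3 = boto3.client("s3", region_name="us-east-1", config=Config(signature_version=UNSIGNED))

# schema/table/YYYY/MM/DD/ -- here the post/reaction table on an example date
resp = s3.list_objects_v2(
    Bucket="lens-protocol-testnet-data",
    Prefix="post/reaction/2025/06/27/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])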
File Types
- Initial Load Files
  - Named: LOAD[0-9A-F]{8}.parquet
  - Contains: base data snapshot(s)
  - Format: Apache Parquet
  - Examples:
    - Table with a single file (e.g., LOAD00000001.parquet)
    - Table with multiple files (e.g., LOAD00000001.parquet through LOAD0000000F.parquet)
- Change Data Capture (CDC) Files
  - Named: YYYYMMDD-HHMMSSXXX.parquet
  - Contains: incremental changes
  - Operations (a sketch of replaying these changes over the initial load follows below):
    - I: Insert (new record)
    - U: Update (modified record)
    - D: Delete (removed record)
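To reconstruct a table's current state, the CDC files are replayed over the initial load in chronological order. A minimal pandas sketch, assuming the change files carry the operation code in a column named Op (the usual DMS column name; not confirmed by this guide) and that key_cols uniquely identifies a row (both illustrative):

import pandas as pd

def apply_cdc(base: pd.DataFrame, change_frames: list[pd.DataFrame], key_cols: list[str]) -> pd.DataFrame:
    """Replay CDC files (already ordered oldest to newest) over the initial load."""
    snapshot = base.copy()
    snapshot["Op"] = "I"                                   # treat snapshot rows as inserts
    merged = pd.concat([snapshot, *change_frames], ignore_index=True)
    # Keep only the latest version of each row...
    latest = merged.drop_duplicates(subset=key_cols, keep="last")
    # ...and drop rows whose final operation was a delete.
    return latest[latest["Op"] != "D"].drop(columns="Op").reset_index(drop=True)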
Example
Install Python Dependencies
- Create a virtual environment (optional but recommended):
  python -m venv lens-env
  source lens-env/bin/activate  # On Unix/macOS
- Install required packages:
  pip install "pandas>=2.0.0" "pyarrow>=14.0.1"
  The version specifiers are quoted so the shell does not treat ">" as a redirect. If you pass s3:// URLs to the script below, also install s3fs (pip install s3fs); HTTPS URLs such as the one in the sample output work with pandas and pyarrow alone.
Save the following code as read_lens_data.py:
#!/usr/bin/env python3
import pandas as pd
import argparse
from datetime import datetime
import sys
import binascii

def format_timestamp(ts):
    """Convert timestamp to a readable format"""
    try:
        return pd.to_datetime(ts).strftime('%Y-%m-%d %H:%M:%S')
    except Exception:
        return ts

def format_binary(binary_data):
    """Convert binary data to a readable hex format"""
    try:
        if isinstance(binary_data, bytes):
            # Convert raw bytes to a hex string (no '0x' prefix)
            return binascii.hexlify(binary_data).decode('utf-8')
        return binary_data
    except Exception:
        return str(binary_data)

def read_parquet_file(url):
    """Read a parquet file from S3 and return as DataFrame"""
    try:
        if url.startswith("s3://"):
            # Public bucket: read anonymously (requires the s3fs package)
            df = pd.read_parquet(url, storage_options={"anon": True})
        else:
            df = pd.read_parquet(url)
        return df
    except Exception as e:
        print(f"Error reading parquet file: {e}")
        sys.exit(1)

def display_data(df):
    """Display the data in a human-readable format"""
    # Create a copy of the dataframe for display
    display_df = df.copy()

    # Format timestamp columns
    timestamp_cols = [col for col in df.columns
                      if 'timestamp' in col.lower() or 'time' in col.lower() or 'date' in col.lower()]
    for col in timestamp_cols:
        display_df[col] = display_df[col].apply(format_timestamp)

    # Format binary columns
    binary_cols = ['post', 'account', 'app']  # Known binary columns
    for col in binary_cols:
        if col in display_df.columns:
            display_df[col] = display_df[col].apply(format_binary)

    # Display basic information
    print("\n=== Dataset Information ===")
    print(f"Number of records: {len(df)}")
    print(f"Columns: {', '.join(df.columns)}")

    # Display column types
    print("\n=== Column Types ===")
    for col in df.columns:
        print(f"{col}: {df[col].dtype}")

    print("\n=== Sample Data (first 5 rows) ===")
    # Set display options for better readability
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)

    # Display the formatted data
    print(display_df.head().to_string())

    # Display value counts for categorical columns
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        if col not in binary_cols and df[col].nunique() < 10:
            print(f"\n=== {col} Distribution ===")
            print(df[col].value_counts().to_string())

def save_to_csv(df, output_path):
    """Save DataFrame to CSV with proper handling of binary data"""
    # Create a copy for saving
    save_df = df.copy()

    # Convert binary columns to hex
    binary_cols = ['post', 'account', 'app']
    for col in binary_cols:
        if col in save_df.columns:
            save_df[col] = save_df[col].apply(format_binary)

    # Save to CSV
    save_df.to_csv(output_path, index=False)

def main():
    parser = argparse.ArgumentParser(description='Read Lens Protocol Parquet files')
    parser.add_argument('url', help='URL of the parquet file')
    parser.add_argument('--output', '-o', help='Output file path (optional)')
    args = parser.parse_args()

    print(f"Reading data from: {args.url}")
    df = read_parquet_file(args.url)
    display_data(df)

    if args.output:
        save_to_csv(df, args.output)
        print(f"\nData saved to: {args.output}")

if __name__ == "__main__":
    main()
Running the Script
- Basic Usage
  python3 read_lens_data.py "s3://lens-protocol-testnet-data/post/reaction/LOAD00000001.parquet"
- Save to CSV
  python3 read_lens_data.py "s3://lens-protocol-testnet-data/post/reaction/LOAD00000001.parquet" -o reactions.csv
Output
Reading data from: https://lens-protocol-testnet-data.s3.us-east-1.amazonaws.com/post/reaction/LOAD00000001.parquet
=== Dataset Information ===
Number of records: 354
Columns: timestamp, post, account, type, action_at, app

=== Column Types ===
timestamp: object
post: object
account: object
type: object
action_at: datetime64[us, UTC]
app: object

=== Sample Data (first 5 rows) ===
             timestamp                                                              post                                   account    type                         action_at                                       app
0  2025-06-26 11:17:46  ab47c67b39b399a62aacdcc2d9b78f3460b627a9d6740564d0a0d1b7c1e3f01b  7479b233fb386ed4bcc889c9df8b522c972b09f2  UPVOTE  2025-04-14 20:52:13.656723+00:00  4abd67c2c42ff2b8003c642d0d0e562a3f900805
1  2025-06-26 11:17:46  33ca34ec9dc9da2b08931d97e092508f01a4e1f695f56e4822ab2491c98fe278  41bf9732b1e83f62d56f834ba090af3d79b21d83  UPVOTE  2025-04-14 22:43:20.331929+00:00  4abd67c2c42ff2b8003c642d0d0e562a3f900805
2  2025-06-26 11:17:46  33ca34ec9dc9da2b08931d97e092508f01a4e1f695f56e4822ab2491c98fe278  b9b0358d9f2461b2852255a07d3736a45300442b  UPVOTE  2025-04-14 22:56:46.519339+00:00  4abd67c2c42ff2b8003c642d0d0e562a3f900805
3  2025-06-26 11:17:46  5c8a610bef5ea5576b863e1a48ef4078974d4aeff3bf22a7b1748aaad98b5b38  7479b233fb386ed4bcc889c9df8b522c972b09f2  UPVOTE  2025-04-14 23:11:54.549963+00:00  4abd67c2c42ff2b8003c642d0d0e562a3f900805
4  2025-06-26 11:17:46  204e761b3eb1493470bcdc839a102bc0df57d3a43539565dad449855676906fe  7754f6ffd9a8cbcbc3d59aa54cce76cc0f27902c  UPVOTE  2025-04-15 11:42:47.021848+00:00  10651b74b26ac31aafe23615fc23872402086e85
Best Practices
- Performance Optimization
  - Use appropriate data types when reading Parquet files
  - Implement column filtering to read only the columns you need
  - Process large datasets in batches (both techniques are sketched below)
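A minimal sketch of column filtering and batched reading with pandas and pyarrow; the file name and column names (taken from the reaction table shown earlier) are illustrative:

import pandas as pd
import pyarrow.parquet as pq

path = "LOAD00000001.parquet"   # a file downloaded earlier

# Column filtering: read only the columns you actually need
df = pd.read_parquet(path, columns=["account", "type", "action_at"])

# Batched processing: stream record batches instead of loading the whole file
pf = pq.ParquetFile(path)
for batch in pf.iter_batches(batch_size=10_000, columns=["account", "type"]):
    chunk = batch.to_pandas()                # process each chunk independently
    print(len(chunk), "rows in this batch")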
- Error Handling
  Retry transient failures when reading over the network, for example with tenacity:

  from tenacity import retry, stop_after_attempt, wait_exponential

  @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
  def read_parquet_with_retry(url):
      return pd.read_parquet(url)
- Data Processing
  - Handle binary data appropriately (convert to hex for readability)
  - Format timestamps for your timezone
  - Implement proper error handling for data type conversions
Troubleshooting
- Common Issues
  - Binary data handling: use proper conversion for bytea columns
  - Timestamp parsing: handle timezone information correctly
  - Memory management: process large files in chunks (see the batched-reading sketch under Best Practices)
- Data Validation
  - Verify data types match schema definitions
  - Check for missing or null values
  - Validate binary data lengths for addresses and hashes (a small sketch follows below)
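A small pandas sketch of these checks; the expected byte lengths are inferred from the sample output above (20-byte account/app addresses, 32-byte post hashes) and should be confirmed against the actual schema for your table:

import pandas as pd

def validate(df: pd.DataFrame) -> None:
    # Data types and missing values per column
    print(df.dtypes)
    print(df.isna().sum())

    # Binary length checks for known bytea columns (expected lengths are assumptions)
    expected_len = {"account": 20, "app": 20, "post": 32}
    for col, length in expected_len.items():
        if col in df.columns:
            bad = (df[col].dropna()
                          .map(lambda v: isinstance(v, bytes) and len(v) != length)
                          .sum())
            print(f"{col}: {bad} value(s) with unexpected length")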