uv is not required to use Firecrawl, AWS, or Pinecone with Python).
Install uv
To install uv, run one of the following commands, depending on your operating system:
To use curl with sh:
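The standard installer command from the uv documentation for macOS and Linux is:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```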
To use wget with sh instead:
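The corresponding wget form of the installer command is:

```bash
wget -qO- https://astral.sh/uv/install.sh | sh
```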
To install uv by using other approaches such as PyPI, Homebrew, or WinGet, see Installing uv.
Install Python
uv will detect and use Python if you already have it installed.
To view a list of installed Python versions, run the following command:
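uv's python list subcommand shows both installed Python versions and versions that are available to install:

```bash
uv python list
```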
If Python is not installed, you can use uv to install it by running the following command. For example, this command installs Python 3.12 for use with uv:
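The matching uv subcommand for Python 3.12 is:

```bash
uv python install 3.12
```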
Create the project directory
The following command creates a directory named firecrawl_unstructured_demo within your current working directory and then switches to this new project directory:
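A typical form of this command on macOS and Linux is:

```bash
mkdir firecrawl_unstructured_demo && cd firecrawl_unstructured_demo
```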
Initialize the project
Use uv to initialize the project by running the following command:
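uv's init subcommand scaffolds the project, including a pyproject.toml file. Run it from the project directory:

```bash
uv init
```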
Create a venv virtual environment
This walkthrough uses a virtual environment created with venv (venv is not required to use Firecrawl, AWS, or Pinecone with Python).
From the root of your project directory, use uv to create a virtual environment with venv by running the following command:
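uv's venv subcommand creates the environment in a .venv directory by default:

```bash
uv venv
```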
Activate the virtual environment
To activate the venv virtual environment, run one of the following commands from the root of your project directory:
For bash or zsh, run source .venv/bin/activate
For fish, run source .venv/bin/activate.fish
For csh or tcsh, run source .venv/bin/activate.csh
For pwsh, run .venv/bin/Activate.ps1
Install the Firecrawl Python SDK
Use uv to install the Firecrawl Python SDK package by running the following command:
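The Firecrawl Python SDK is published on PyPI as firecrawl-py, so the command looks like this:

```bash
uv add firecrawl-py
```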
Install the AWS SDK for Python
Use uv to install the AWS SDK for Python package by running the following command:
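The AWS SDK for Python is the boto3 package:

```bash
uv add boto3
```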
If you do not already have one, create a file named credentials in the ~/.aws/ directory for macOS or Linux, or the <drive>:\Users\<username>\.aws\ directory for Windows. Then add the AWS access key ID and secret access key of the AWS IAM user that has access to your Amazon S3 bucket, and the short code for the AWS Region of the bucket (for example, us-east-1), to the credentials file.
In the following credentials file example, replace the following placeholders:
<your-access-key-id> with the AWS access key ID of the AWS IAM user that has access to the bucket.
<your-secret-access-key> with the secret access key for the related access key ID.
<the-aws-short-region-code-for-your-bucket> with the short code for the AWS Region of the bucket.
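A minimal sketch of such a credentials file, using a default profile and the placeholders above, is:

```ini
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
region = <the-aws-short-region-code-for-your-bucket>
```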
Install the Pinecone Python SDK
Use uv to install the Pinecone Python SDK package, along with the grpc extra to enable the programmatic creation of a Pinecone serverless index later, by running the following command:
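The Pinecone Python SDK is published on PyPI as pinecone, with the grpc extra requested in brackets:

```bash
uv add "pinecone[grpc]"
```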
Add your Firecrawl API key to your project
In the root of your project directory, create a file named .env. Add the following line to the .env file, and replace <your-firecrawl-api-key> with your Firecrawl API key:
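The variable name FIRECRAWL_API_KEY is an assumed name, used by the script sketches later in this walkthrough; use whatever name your own script reads:

```
FIRECRAWL_API_KEY="<your-firecrawl-api-key>"
```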
Use uv to install the dotenv package by running the following command:
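Assuming the dotenv package here refers to the python-dotenv distribution, which provides the dotenv module used by the script sketches below:

```bash
uv add python-dotenv
```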
In a .gitignore file in the root of your project, add the following line, to help prevent accidentally checking in your Firecrawl API key (or anything else in the .env file) into any shared code repositories later:
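The line to add is simply the name of the environment file:

```
.env
```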
Create the Python script to extract the website data
In your .env file, add the following line, which defines an environment variable representing the base URL of the website to crawl. This walkthrough uses a website named Books to Scrape that contains fictitious data (although you can use any accessible website you want):
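Assuming the variable name FIRECRAWL_BASE_URL (an assumed name read by the script sketch below) and the public Books to Scrape site:

```
FIRECRAWL_BASE_URL="https://books.toscrape.com"
```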
In your .env file, add the following line, which defines an environment variable representing the name of the target Amazon S3 bucket to have Firecrawl store the website crawl results in. Replace <the-name-of-your-bucket> with the name of the bucket:
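Assuming the variable name S3_BUCKET_NAME (again, an assumed name that the script sketch below reads):

```
S3_BUCKET_NAME="<the-name-of-your-bucket>"
```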
Do not include the s3:// prefix. Do not include any trailing slash (/) after the bucket name.
Create a file named firecrawl_extract.py in the root of your project directory, and add the following code to it:
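The following is a minimal sketch of such a script, consistent with the description that follows. The environment variable names, the S3 key layout, the page limit of 20, and the exact Firecrawl call shown (FirecrawlApp.crawl_url with a params dictionary, which varies across Firecrawl SDK versions) are assumptions; adjust them to match your setup:

```python
# firecrawl_extract.py -- minimal sketch; variable names, key layout, and
# Firecrawl call details are assumptions.
import json
import os
from datetime import datetime, timezone

import boto3
from dotenv import load_dotenv
from firecrawl import FirecrawlApp

load_dotenv()

FIRECRAWL_API_KEY = os.environ["FIRECRAWL_API_KEY"]
FIRECRAWL_BASE_URL = os.environ["FIRECRAWL_BASE_URL"]
S3_BUCKET_NAME = os.environ["S3_BUCKET_NAME"]


def save_to_s3(pages: list[dict], bucket: str) -> str:
    """Add the website crawl results to the S3 bucket and return the crawl's prefix."""
    s3 = boto3.client("s3")
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

    # Change full_results_key and page_key to store the results differently, for
    # example in a single folder that is overwritten on each run.
    full_results_key = f"crawls/{timestamp}/full_results.json"
    s3.put_object(Bucket=bucket, Key=full_results_key, Body=json.dumps(pages, default=str))

    # One object per crawled page, under a pages/ subfolder.
    for i, page in enumerate(pages):
        page_key = f"crawls/{timestamp}/pages/page_{i}.html"
        s3.put_object(Bucket=bucket, Key=page_key, Body=page.get("html", ""))

    return f"s3://{bucket}/crawls/{timestamp}/"


def main() -> None:
    # Use Firecrawl to crawl the website. The "limit" argument caps the number of
    # crawled pages; the output format requested here is HTML.
    app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
    crawl_result = app.crawl_url(
        FIRECRAWL_BASE_URL,
        params={"limit": 20, "scrapeOptions": {"formats": ["html"]}},
    )

    # Depending on the SDK version, the result may be a dict with a "data" list or
    # an object with a .data attribute; normalize to a list of plain dicts.
    raw_pages = crawl_result["data"] if isinstance(crawl_result, dict) else crawl_result.data
    pages = [p if isinstance(p, dict) else p.dict() for p in raw_pages]

    prefix = save_to_s3(pages, S3_BUCKET_NAME)
    print(f"Results saved to {prefix}")


if __name__ == "__main__":
    main()
```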
This script loads from the .env file your Firecrawl API key, the base URL for Firecrawl to use for website crawling, and the S3 bucket name for Firecrawl to send the website crawl results to. It defines a function named main that uses Firecrawl to crawl the website. The main function then calls the save_to_s3 function, which adds the website crawl results to the S3 bucket.
You can change how the crawl results are organized in the bucket by changing the full_results_key and page_key variables. For example, you might want to save the website crawl results in a single folder and then keep overwriting those results with new results as they come in, instead of adding new results to separate subfolders. You can also change the number of crawled pages by changing the limit argument, output the results as Markdown instead of HTML, and so on. For more information, see the Firecrawl Python SDK documentation.
Run the script to extract the data
Use uv to run the script to extract the data from the website by running the following command:
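uv's run subcommand executes the script inside the project's environment:

```bash
uv run firecrawl_extract.py
```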
After the script finishes, its output includes a line similar to the following, where <your-bucket-name> is the name of your bucket, and <timestamp> is the timestamp generated by the script:
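Based on the sketch above, that line looks like this:

```
Results saved to s3://<your-bucket-name>/crawls/<timestamp>/
```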
You can find the crawl results in the crawls folder within your bucket.
Add your Pinecone API key to your project
Add the following line to your .env file, and replace <your-pinecone-api-key> with your Pinecone API key:
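Assuming the variable name PINECONE_API_KEY (an assumed name used by the script sketches below):

```
PINECONE_API_KEY="<your-pinecone-api-key>"
```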
Add the Pinecone index name to your project
Add the following line to your .env file, and replace <the-name-of-your-index> with the name of the serverless index you want to create:
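Assuming the variable name PINECONE_INDEX_NAME (also an assumed name):

```
PINECONE_INDEX_NAME="<the-name-of-your-index>"
```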
Create the Python script to create the Pinecone serverless index
Create a file named pinecone_create_index.py in the root of your project directory and add the following code to it:
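The following is a minimal sketch of such a script. The dimension, metric, cloud, and region values are assumptions; in particular, the dimension must match the embedding model you select for the workflow later:

```python
# pinecone_create_index.py -- minimal sketch; dimension, metric, cloud, and region
# are assumptions.
import os

from dotenv import load_dotenv
from pinecone import ServerlessSpec
from pinecone.grpc import PineconeGRPC

load_dotenv()

pc = PineconeGRPC(api_key=os.environ["PINECONE_API_KEY"])
index_name = os.environ["PINECONE_INDEX_NAME"]

# Create the serverless index only if it does not already exist.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # assumption: must match your embedding model's dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    print(f"Created serverless index '{index_name}'.")
else:
    print(f"Index '{index_name}' already exists.")
```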
Run the script to create the index
Use uv to run the script to create the Pinecone serverless index by running the following command:
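As before, run the script with uv:

```bash
uv run pinecone_create_index.py
```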
Create the source connector
In Unstructured, create an Amazon S3 source connector named s3-firecrawl-source, pointing it at s3://<your-bucket-name>/crawls/<timestamp>/pages/. Replace <your-bucket-name> with the name of your bucket, and <timestamp> with the timestamp generated by the script in the previous step.
Create the destination connector
Create a Pinecone destination connector named pinecone-firecrawl-destination. For the index name, use the name of the index you created earlier, such as firecrawl-dense-index. Keep the batch size at 50.
Create the workflow
Create a workflow that uses s3-firecrawl-source as its source connector and pinecone-firecrawl-destination as its destination connector. Name the workflow firecrawl-s3-to-pinecone-workflow. Then click the checkmark (save) icon.
Run the workflow as a job
Monitor the job
View the results
Create a file named pinecone_fetch_from_index.py in the root of your project directory and add the following code to it:
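The following is a minimal sketch of such a script; it prints the index statistics and a few of the records that the workflow wrote. It assumes the records were written to the index's default namespace:

```python
# pinecone_fetch_from_index.py -- minimal sketch; assumes the default namespace.
import os

from dotenv import load_dotenv
from pinecone import Pinecone

load_dotenv()

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

# Show overall statistics, such as the number of records the workflow wrote.
print(index.describe_index_stats())

# List a few record IDs, then fetch those records to inspect their metadata.
for ids in index.list(limit=5):
    fetched = index.fetch(ids=list(ids))
    for record_id, record in fetched.vectors.items():
        print(record_id)
        print(record.metadata)
    break
```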
Use uv to run the script as follows:
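As with the earlier scripts, run it inside the project's environment:

```bash
uv run pinecone_fetch_from_index.py
```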