This Python script is designed to find UK companies based on a search query, enrich their data by scraping contact details from Yell.com, and then automatically upload the structured business listings to a GeoDirectory-powered WordPress website via its REST API.
It's essentially an automated data aggregation and listing tool for building a business directory.
Key Steps and Functionality
The script follows a multi-step process:
1. Configuration & Setup
Imports: It imports the libraries it needs for web requests (`requests`), JSON handling, time delays (`time`), data manipulation (`pandas`), web scraping (`BeautifulSoup`), and API authorization (`HTTPBasicAuth`, `OAuth1`).
API Keys: It stores numerous sensitive credentials for:
Companies House API (UK business registry).
GeoDirectory API (WordPress plugin for business directories) with both Basic Auth (username/password) and OAuth1 (consumer keys).
Google Maps Geocoding API.
OpenAI API (although the key is set, it is not used in the provided code snippet).
Search Parameters: It defines the search query (e.g., "Digital Marketing") and the directory category ("Test").
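Because the script keeps several sensitive keys in its source, one common alternative is to read them from environment variables. The following is a minimal sketch of that idea, not the script's actual code: the `Settings` class, the `load_settings` function, and every environment variable name are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Credentials and search parameters (all field names are illustrative)."""
    companies_house_key: str
    geodir_user: str
    geodir_password: str
    google_maps_key: str
    search_query: str = "Digital Marketing"
    category_name: str = "Test"


def load_settings(env=None):
    """Build Settings from environment variables, failing fast if any is missing."""
    env = os.environ if env is None else env
    # Hypothetical variable names; the original script hard-codes its keys.
    required = {
        "companies_house_key": "COMPANIES_HOUSE_API_KEY",
        "geodir_user": "GEODIR_USERNAME",
        "geodir_password": "GEODIR_PASSWORD",
        "google_maps_key": "GOOGLE_MAPS_API_KEY",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(**{field: env[var] for field, var in required.items()})
```

Calling `load_settings()` at startup would then raise immediately when a key is absent, rather than failing later in the middle of an API call.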
2. Helper Functions (Data Collection & Processing)
| Function | Purpose | Data Source |
| --- | --- | --- |
| `get_or_create_category` | Creates a new category on the GeoDirectory site or retrieves the ID of an existing one. This is crucial for correctly tagging the new business listings. | GeoDirectory API |
| `get_companies` | Searches the Companies House public registry for businesses matching the `search_query` (e.g., "Digital Marketing") and returns basic company details like name, company number, address snippet, status, and type. | Companies House API |
| `scrape_yell` | Attempts to find contact information (phone and website) for a company by performing a targeted search on the Yell.com business directory and scraping the results. | Yell.com (via web scraping) |
| `get_coordinates_from_address` | Converts a company's street address into geographical coordinates (latitude and longitude). | Google Maps Geocoding API |
| `check_existing_company` | Checks the GeoDirectory site to see if a company with the same name has already been listed, preventing duplicates. | GeoDirectory API |
| `post_to_geodirectory` | Sends the final, prepared business data to the GeoDirectory API to create a new live listing on the target WordPress site. | GeoDirectory API |
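To make the two lookup helpers more concrete, here is a rough sketch of how their JSON responses might be reduced to the fields described above. It assumes the documented response shapes of the Companies House search endpoint (`items` entries with `title`, `company_number`, `address_snippet`, `company_status`, `company_type`) and of the Google Geocoding API (`results[0].geometry.location`). The function names `parse_companies` and `parse_coordinates` and the output keys are hypothetical and not taken from the script.

```python
def parse_companies(search_response, limit=10):
    """Extract name, number, address, status and type from a company search payload."""
    rows = []
    for item in search_response.get("items", [])[:limit]:
        rows.append({
            "name": item.get("title", ""),
            "company_number": item.get("company_number", ""),
            "address": item.get("address_snippet", ""),
            "status": item.get("company_status", ""),
            "type": item.get("company_type", ""),
        })
    return rows


def parse_coordinates(geocode_response):
    """Return (lat, lng) from a geocoding payload, or None if nothing matched."""
    if geocode_response.get("status") != "OK" or not geocode_response.get("results"):
        return None
    location = geocode_response["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

Keeping the parsing separate from the HTTP call makes each helper testable with a canned dictionary, without needing network access or API keys.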
3. Main Execution (The Workflow)
Category Setup: It calls `get_or_create_category("Test")` to ensure all subsequent listings are added under the correct category ID.
Company Search: It retrieves the top 10 companies matching "Digital Marketing" from Companies House using `get_companies`.
Data Structuring: It converts the Companies House data into a Pandas DataFrame (`df`) for easier processing and initializes "phone" and "website" columns as empty.
Enrichment and Validation (First Loop): The script iterates through the Companies House results:
Scraping: There is a logical gap here. `scrape_yell` is defined but never called in the first processing loop, so the "phone" and "website" columns of the DataFrame are never filled. Because the loop immediately checks whether `row["phone"]` and `row["website"]` are empty or "N/A", and both columns still hold their initial empty values, the script will likely skip most, if not all, companies.
Validation: It checks for missing address, phone, or website (leading to skips).
Duplicate Check: It uses `check_existing_company` to prevent relisting.
Payload Creation: It constructs the final JSON payload for the GeoDirectory API, including the company name, status, type, address, and contact details. It also sets `latitude` and `longitude` to empty strings (the `get_coordinates_from_address` function is also defined but not called in the final payload preparation).
Posting: It attempts to create the listing using `post_to_geodirectory` and pauses for 2 seconds (`time.sleep(2)`) between posts to respect API rate limits.
Final Loop (Redundant/Modified Post): A second, simpler loop also iterates through the DataFrame and attempts to post listings again if they still don't exist, using less detailed content.
Data Export: Finally, it saves the DataFrame to a file named companies_house_with_contacts.csv.
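The gaps noted in the workflow (scraping never called, coordinates left empty, a second redundant loop) could be closed by a single pass that enriches each company before validating it. The sketch below is one possible shape for that loop and is not the script's own code. The helpers are passed in as plain callables standing in for `scrape_yell`, `get_coordinates_from_address`, `check_existing_company`, and `post_to_geodirectory`. The name `run_pipeline`, the return shapes of those callables, and the payload keys are all assumptions.

```python
import time


def is_missing(value):
    """Treat None, empty strings and the "N/A" placeholder as missing."""
    return value is None or str(value).strip() in ("", "N/A")


def run_pipeline(companies, scrape, geocode, exists, post, pause=2.0, sleep=time.sleep):
    """Enrich, validate, de-duplicate and post each company in one pass.

    Assumed callable contracts (hypothetical, not from the original script):
      scrape(name)  -> dict with "phone" and "website" keys
      geocode(addr) -> (lat, lng) tuple, or None
      exists(name)  -> True if a listing is already present
      post(payload) -> True on success
    """
    posted, skipped = [], []
    for company in companies:
        # Enrich first, so the validation below sees real contact data.
        contact = scrape(company["name"]) or {}
        company["phone"] = contact.get("phone", "")
        company["website"] = contact.get("website", "")

        if any(is_missing(company.get(key)) for key in ("address", "phone", "website")):
            skipped.append((company["name"], "missing data"))
            continue
        if exists(company["name"]):
            skipped.append((company["name"], "duplicate"))
            continue

        coords = geocode(company["address"])
        payload = {  # illustrative field names, not a confirmed GeoDirectory schema
            "title": company["name"],
            "street": company["address"],
            "phone": company["phone"],
            "website": company["website"],
            "latitude": coords[0] if coords else "",
            "longitude": coords[1] if coords else "",
        }
        if post(payload):
            posted.append(payload)
        sleep(pause)  # same 2-second pause the original applies between posts
    return posted, skipped
```

Injecting the helpers and the `sleep` function keeps the loop easy to dry-run with stubs before pointing it at the live APIs, and the returned `skipped` list records why each company was left out.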