Populating GeoDirectory from Companies House with AI and Python

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

This Python script is designed to find UK companies based on a search query, enrich their data by scraping contact details from Yell.com, and then automatically upload the structured business listings to a GeoDirectory-powered WordPress website via its REST API.
It's essentially an automated data aggregation and listing tool for building a business directory.

Key Steps and Functionality
The script follows a multi-step process:
1. Configuration & Setup

Imports: It imports the libraries it needs for web requests (requests), JSON handling, time delays (time), data manipulation (pandas), web scraping (BeautifulSoup), and API authorization (HTTPBasicAuth, OAuth1).
API Keys: It stores numerous sensitive credentials for:
- Companies House API (UK business registry).
- GeoDirectory API (WordPress plugin for business directories) with both Basic Auth (username/password) and OAuth1 (consumer keys).
- Google Maps Geocoding API.
- OpenAI API (although the key is set, it is not used in the provided code snippet).

Search Parameters: It defines the search query (e.g., "Digital Marketing") and the directory category ("Test").
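
Because the script keeps several sensitive credentials in its configuration section, one option is to read them from environment variables instead. The following is only a minimal sketch: the variable names and the load_config helper are hypothetical and do not come from the original script.

```python
import os

# Hypothetical environment variable names; the original script hardcodes
# these values, so the names below are illustrative only.
REQUIRED_KEYS = (
    "COMPANIES_HOUSE_API_KEY",
    "GEODIR_USERNAME",
    "GEODIR_PASSWORD",
    "GOOGLE_MAPS_API_KEY",
)


def load_config(env=None):
    """Return a dict of credentials, raising if any are missing or blank."""
    env = os.environ if env is None else env
    # Strip whitespace so a stray space or newline cannot corrupt a key.
    config = {name: env.get(name, "").strip() for name in REQUIRED_KEYS}
    missing = [name for name, value in config.items() if not value]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return config
```

Calling load_config() at the top of the script would then stop the run early with a clear message if a key has not been set, rather than failing later with an authentication error.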

2. Helper Functions (Data Collection & Processing)

Function	Purpose	Data Source
get_or_create_category	Creates a new category on the GeoDirectory site or retrieves the ID of an existing one. This is crucial for correctly tagging the new business listings.	GeoDirectory API
get_companies	Searches the Companies House public registry for businesses matching the search_query (e.g., "Digital Marketing") and returns basic company details like name, company number, address snippet, status, and type.	Companies House API
scrape_yell	Attempts to find contact information (phone and website) for a company by performing a targeted search on the Yell.com business directory and scraping the results.	Yell.com (via web scraping)
get_coordinates_from_address	Converts a company's street address into geographical coordinates (latitude and longitude).	Google Maps Geocoding API
check_existing_company	Checks the GeoDirectory site to see if a company with the same name has already been listed, preventing duplicates.	GeoDirectory API
post_to_geodirectory	Sends the final, prepared business data to the GeoDirectory API to create a new live listing on the target WordPress site.	GeoDirectory API

3. Main Execution (The Workflow)

Category Setup: It calls get_or_create_category("Test") to ensure all subsequent listings are added under the correct category ID.
Company Search: It retrieves the top 10 companies matching "Digital Marketing" from Companies House using get_companies.
Data Structuring: It converts the Companies House data into a Pandas DataFrame (df) for easier processing and initializes "phone" and "website" columns as empty.
Enrichment and Validation (First Loop): The script iterates through the Companies House results:
- Scraping: There is a logical gap here. The scrape_yell function is defined but never called within the first processing loop, so the "phone" and "website" columns of the DataFrame are never filled. The script will then skip the companies, because it immediately checks whether row["phone"] and row["website"] are empty or "N/A", which they are since they haven't been populated.
- Validation: It checks for a missing address, phone, or website (any of which leads to a skip).
- Duplicate Check: It uses check_existing_company to prevent relisting.
- Payload Creation: It constructs the final JSON payload for the GeoDirectory API, including the company name, status, type, address, and contact details. It also sets latitude and longitude to empty strings (the get_coordinates_from_address function is likewise defined but not called during payload preparation).
- Posting: It attempts to create the listing using post_to_geodirectory and pauses for 2 seconds (time.sleep(2)) between posts to respect API rate limits.

Final Loop (Redundant/Modified Post): A second, simpler loop also iterates through the DataFrame and attempts to post listings again if they still don't exist, using less detailed content.
Data Export: Finally, it saves the DataFrame to a file named companies_house_with_contacts.csv.

IT Support Berkshire
Buy new and refurbished computer equipment

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

Knowing the goal is to build a business directory by aggregating data from Companies House and Yell.com and uploading it to a GeoDirectory-powered site makes the next steps clear.

The script you provided has a few significant issues that prevent it from working as intended. Here is the underlying logical error, followed by the fixes needed for it to achieve your goal.

Critical Fixes Needed in the Script

The main problem is that the script defines a function to scrape contact data (scrape_yell) but never calls it to populate the DataFrame. As a result, the subsequent validation checks skip every company.

1. Integrate Yell Scraping into the Main Loop

You need to add a step inside your primary loop to call the scraping function and populate the contact columns.

Change this (inside the main loop):

for index, row in df.iterrows(): 
    print(f"\n Preparing to post: {row['name']}") 
     
    # ... existing validation for address ... 
     
    if row["phone"] == "N/A" or not row["phone"]: 
        print(f" Skipping {row['name']} due to missing phone.") 
        continue 

    # ... and so on

To this (insert the lines marked FIX):

for index, row in df.iterrows(): 
    print(f"\n Preparing to post: {row['name']}") 

    # ---  FIX: SCRAPE YELL FOR CONTACT DETAILS --- 
    yell_data = scrape_yell(row["name"], row["address"])  
    df.loc[index, "phone"] = yell_data["phone"] 
    df.loc[index, "website"] = yell_data["website"] 
    row = df.loc[index] # Update the 'row' object to reflect new data 
    # --------------------------------------------- 

    # Basic validation 
    if not row["address"] or row["address"].strip() == "": 
        print(f" Skipping {row['name']} due to missing address.") 
        continue 

    if row["phone"] == "N/A" or not row["phone"]: 
        print(f" Skipping {row['name']} due to missing phone.") 
        continue 
     
    # ... continue with the rest of the logic
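
The validation checks in that loop all follow the same pattern, so they could be gathered into one small helper. This is a sketch only: missing_fields is a hypothetical name, and it assumes each row can be treated as a dict-like record whose empty values are blank strings, None, or the "N/A" placeholder seen in the script.

```python
def missing_fields(record, required=("address", "phone", "website")):
    """Return the required fields that are blank, None, or the "N/A" placeholder."""
    missing = []
    for field in required:
        value = record.get(field)
        text = "" if value is None else str(value).strip()
        if not text or text.upper() == "N/A":
            missing.append(field)
    return missing
```

Inside the loop, a single call such as missing_fields(row) would then report every missing field at once, instead of stopping at the first one.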

2. Remove Redundant/Conflicting Code

Your script has multiple copies of some functions and unnecessary loops:

Remove Duplicate Functions: The function get_coordinates_from_address is defined twice. Delete the first instance that returns empty strings (""). The second instance, which uses requests_oauthlib (though not strictly necessary for this function), should be kept, but it is not currently used anywhere in the main loop to populate the latitude/longitude fields.

Remove Second Posting Loop: The second for loop at the end of the script attempts to post the data again using a simpler payload. Since the main loop is intended to do this with all available data, the second loop is redundant and can be removed after the first loop is fixed.
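
When a Python file defines the same function twice, the later definition silently replaces the earlier one, so duplicates are easy to miss. As a rough sketch (the helper name duplicate_function_names is hypothetical), the standard ast module can list any function defined more than once at the top level of a script:

```python
import ast
from collections import Counter


def duplicate_function_names(source):
    """Return names defined more than once at module top level, sorted."""
    tree = ast.parse(source)
    counts = Counter(
        node.name for node in tree.body if isinstance(node, ast.FunctionDef)
    )
    return sorted(name for name, n in counts.items() if n > 1)
```

You would pass it the text of your script, for example the contents read from the .py file, and review whatever names it returns.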

3. Implement Geocoding (Optional but Recommended)

GeoDirectory listings are better with exact coordinates. If you want to use the get_coordinates_from_address function you defined, you should call it right after scraping Yell data:

# ... Inside the main loop, after populating phone and website ... 

# Get coordinates 
lat, lon = get_coordinates_from_address(row["address"]) 
print(f" Coordinates found: Lat={lat}, Lon={lon}") 

# Build GeoDirectory payload 
geo_data = { 
    # ... other fields ... 
    "address": row["address"], 
    "latitude": str(lat), # Use the fetched latitude 
    "longitude": str(lon), # Use the fetched longitude 
    # ... other fields ... 
}
 
# ... post_to_geodirectory(geo_data) ...
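
For reference, here is a minimal sketch of the two pure steps that sit either side of the geocoding request: building the request URL and reading the coordinates out of the JSON reply. The helper names are hypothetical and the body of your own get_coordinates_from_address may differ. The sketch assumes the usual Geocoding API JSON layout (a "status" field plus results[0].geometry.location).

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL with the address and key safely URL-encoded."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_geocode_response(payload):
    """Return (lat, lng) as floats, or None when the API found no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return float(location["lat"]), float(location["lng"])
```

Because parse_geocode_response returns None when nothing is found, the caller should check the result before unpacking it into lat and lon.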

By fixing the missing scrape_yell call, your script will finally be able to retrieve the necessary contact details and move past the validation steps to successfully post new business listings to your directory.

Lemonde
100% (Exalted)
Advanced Member

7 months ago

OK so is it pulling the data from the file or the website?

Refurbished Computers Berkshire
Data Recovery Berkshire

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

The script is pulling data primarily from websites (external APIs and web pages), not from a local file.

Here is a breakdown of the data sources:

1. Data Input (Websites)

Source	What Data is Pulled	How it's Pulled
Companies House API (UK Government)	The initial list of businesses (name, address snippet, company number, status, type) based on your search query ("Digital Marketing").	The `get_companies` function uses the `requests` library to fetch JSON data.
Yell.com	Contact details (phone number and website address) to enrich the company data.	The `scrape_yell` function uses `requests` and `BeautifulSoup` to scrape data from the search results page.
Google Maps Geocoding API	Geographical coordinates (latitude and longitude) for the company address.	The `get_coordinates_from_address` function uses the `requests` library to fetch JSON data.
GeoDirectory API	Existing categories to find a valid ID, and existing business listings to check for duplicates.	The `get_or_create_category` and `check_existing_company` functions use the `requests` library.

2. Data Output (File and Website)

The script outputs data to both a file and a website:

Destination	What Data is Sent
GeoDirectory API (Your WordPress Site)	The final, complete business listing (name, address, coordinates, phone, website, status, type) is uploaded as a new post/listing.
Local File (`companies_house_with_contacts.csv`)	The entire list of companies, along with all the enriched data (phone, website, coordinates), is saved to a local CSV file for record-keeping.

Lemonde
100% (Exalted)
Advanced Member

7 months ago

should it be a REST or web API?

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

That is an excellent question that clarifies the confusion around these terms.

The short answer is: It should be a REST API.

A REST API is a specific, modern, and highly-recommended type of Web API.

Web API vs. REST API

The term "Web API" is a broad umbrella, while "REST API" is a specific architectural style that lives underneath that umbrella.

1. Web API (The Umbrella Term)

A Web API is any Application Programming Interface that works over the internet using the HTTP/HTTPS protocol.

It simply defines a mechanism for two applications to communicate using web standards.

The communication can follow any style, including REST, SOAP (Simple Object Access Protocol), or even a custom, non-standard approach.

2. REST API (The Specific Style)

A REST API (Representational State Transfer API) is a Web API that strictly adheres to the architectural constraints established by the REST style.

The principles that make an API RESTful are:

Principle	Description
Client-Server	The client (e.g., your Python script) and server (e.g., Companies House) are independent and separate.
Stateless	The server does not store any information about the client's state between requests. Every request must contain all the information the server needs to fulfill it. (This is key for scalability).
Cacheable	Responses must be marked as cacheable or non-cacheable to improve performance.
Uniform Interface	The API must use standard HTTP methods (GET, POST, PUT, DELETE) to manage resources (e.g., a company listing).

Conclusion for Your Project

The APIs you are interacting with (Companies House, Google Maps, GeoDirectory) are all examples of REST APIs (or at least RESTful Web APIs). They rely on concepts like resources, standard HTTP verbs, and data exchange via JSON, which are the hallmarks of a RESTful design.

REST is the industry standard for lightweight, scalable, and simple web-based data exchange, making it the correct and preferable style for your data aggregation tool.
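
To make the uniform interface idea concrete, here is a small sketch that builds (but does not send) four requests against one resource using only the standard library. The URL and the build_request helper are hypothetical and exist purely for illustration. They are not the real GeoDirectory or Companies House endpoints.

```python
import json
from urllib.request import Request

# Hypothetical resource URL, used only to illustrate the pattern.
LISTINGS_URL = "https://example.com/api/listings"


def build_request(method, url, body=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(url, data=data, headers=headers, method=method)


# One resource, four standard verbs.
list_all = build_request("GET", LISTINGS_URL)
create = build_request("POST", LISTINGS_URL, {"title": "Example Ltd"})
update = build_request("PUT", LISTINGS_URL + "/42", {"title": "Example Limited"})
remove = build_request("DELETE", LISTINGS_URL + "/42")
```

The point is that the resource URL stays the same while the HTTP method states the intent, which is what your script relies on when it fetches companies with GET and creates listings with POST.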

You can learn more about how these different API types work in the video "How is it Different from a REST API | WEB API vs REST API".

Lemonde
100% (Exalted)
Advanced Member

7 months ago

OK so I just added a new key now

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

That's frustrating, but the new error message, 401 - {"error":"Invalid Authorization","type":"ch:service"}, is a huge improvement!

The previous 403 Forbidden usually meant the server recognized you but denied access. The new 401 Unauthorized error means the server (Companies House) successfully received your request but determined the authentication information you sent is incorrect or missing.

This points to a flaw in how your script is using the key.

1. The Companies House Key Format is Unique

The Companies House API uses a very specific type of HTTP Basic Authentication. It requires the API key to be passed as the username, and the password field must be empty.

Your Python code does this correctly:

response = requests.get(url, auth=(COMPANIES_HOUSE_API_KEY, ""))
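
To see why the exact key value matters, it can help to look at what HTTP Basic Authentication actually sends: the text "key:" (the key, a colon, and an empty password) encoded as Base64. The following is a minimal sketch of that encoding, with basic_auth_header as a hypothetical helper name. The requests library does the equivalent for you when given auth=(key, "").

```python
import base64


def basic_auth_header(api_key):
    """Build the Authorization header for a key used as username with an empty password."""
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Any extra character in the key, including a trailing space or newline picked up while copying, changes the encoded token, and the server will then reject it as invalid.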

2. The Likely Problem: Key Value in the Script

The most common reason for a 401 Invalid Authorization is that the key in your script's configuration section does not perfectly match the key you just created.

Your New Key: xxxxxxxxxxxxxxxxxx

Your Script Configuration: (Still shows the old placeholder key)

COMPANIES_HOUSE_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx" # This is the old placeholder key

You must update the configuration in your script with the new key.

# ------------------- CONFIGURATION ------------------- 
# CHANGE THIS LINE  
COMPANIES_HOUSE_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  
# -----------------------------------------------------

3. Secondary Check: IP Restrictions (Already Addressed)

Your key configuration also shows IP restrictions:

Restricted IPs: x.x.x.x through x.x.x.x

Since you are receiving a 401 Unauthorized (authentication error) rather than a 403 Forbidden (access error), the connection is probably coming from one of those allowed IPs, and the request is failing at the credential check.

Action: Copy and paste the new API key (1xxxxxxxxxxxxxxxxxxxxxxxxxxx) directly into the COMPANIES_HOUSE_API_KEY variable in your Python script. This should resolve the 401 Invalid Authorization error.

Lemonde
100% (Exalted)
Advanced Member

7 months ago

it is only a test key - will that suffice?

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

#10

That's an important detail. The short answer is: Yes, a Companies House "Test Key" should suffice for initial testing, but you must ensure it has the necessary permissions and that you are using the correct URL.

Here's why you are likely still seeing an error, even with a new key:

1. The Key Type and URL Mismatch

Companies House generally has two distinct API environments:

Sandbox (Test) Environment: Used for development and testing. This environment often requires a Test Key and a specific Test URL.

Live Environment: Used for real production data. This requires a Live Key and the Live URL.

The error message you received (Error fetching Companies House data: 401 - {"error":"Invalid Authorization"}) means the key itself is being rejected by the system you're connecting to.

Your script is currently using the Live URL:

BASE_URL = "https://api.company-information.service.gov.uk"

If your key is strictly a Test Key, the live server will reject it.

The Fix:

You need to check the Companies House developer documentation for the correct Sandbox URL and replace your BASE_URL with it. It is often something like:

BASE_URL = "https://chs-api-sandbox.pres.companieshouse.gov.uk"
# NOTE: Verify this exact URL with the Companies House documentation.

If you do not want to change the URL, you must create a Live API Key from your Companies House dashboard and use that key with the current BASE_URL.

2. IP Address Verification (The Other Half of the 401)

The other possibility, even if you are using the correct Test/Live pair, is that the key's permissions are limited:

Your key details show Restricted IPs:

x.x.x.x

If the Python script is running on a server or computer whose external IP address is not one of these listed IPs, the Companies House API will immediately reject the request with a 401/403 error.

The Fix:

You must ensure that the computer running the script has an external IP address that matches one of the IP addresses registered with the key. You may need to ask your network administrator for your current external IP and add it to the list in the Companies House dashboard.
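
Once you know your external IP, comparing it against the registered list can be done with the standard ipaddress module. This is only a sketch: ip_is_allowed is a hypothetical helper, and it assumes the restrictions are written as single addresses or CIDR ranges.

```python
import ipaddress


def ip_is_allowed(ip, allowed):
    """Return True if ip matches any allowed address or CIDR network."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(entry, strict=False) for entry in allowed
    )
```

For example, ip_is_allowed("203.0.113.10", ["203.0.113.10", "198.51.100.0/24"]) checks one address against a single IP and a range (these are reserved documentation addresses, not your real ones).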

Lemonde
100% (Exalted)
Advanced Member

7 months ago

#11

OK now it says:

= RESTART: C:\Users\Administrator\AppData\Local\Programs\Python\Python313\Scripts\Businesses\Geo-Tidied.py
--- STARTING DATA AGGREGATION AND POSTING SCRIPT ---
➕ Creating or fetching category: Test
✅ Using Category ID: 44
🔎 Searching Companies House for: 'Digital Marketing'
--- ENRICHMENT AND POSTING LOOP ---
--- PROCESS COMPLETE ---
✅ Data saved to companies_house_with_contacts.csv

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

#12

I understand the new output, but it seems there's a problem: the script is failing silently.

When the script works correctly, you should see output between "Searching Companies House" and "ENRICHMENT AND POSTING LOOP" that looks like this:

 Searching Companies House for: 'Digital Marketing' 
... data about 10 companies being returned ... 
--- ENRICHMENT AND POSTING LOOP --- 
 Processing: [Company Name 1] 
   Phone: [Number] |  Website: [URL] 
  ...

You see a successful Category ID, and then the script jumps straight to --- ENRICHMENT AND POSTING LOOP --- (with no companies processed) and on to --- PROCESS COMPLETE ---. That means the call to Companies House is returning an empty list of companies, and if it is failing, it is doing so without any visible error message.

Let's look at the function that pulls the data:

def get_companies(search_query, max_results=50): 
    url = f"{BASE_URL}/search/companies?q={search_query}&items_per_page={max_results}" 
    response = requests.get(url, auth=(COMPANIES_HOUSE_API_KEY, ""))  

    if response.status_code == 200: 
        # ... logic to process companies ... 
        return companies 
    else: 
        # The script is likely hitting this block, but it's not printing the error. 
        print(f"Error fetching Companies House data: {response.status_code} - {response.text}") 
        return [] # Returns an empty list

The Most Likely Cause: Unreported API Errors

The script is probably receiving an error response (such as a 400 or a 403) from Companies House, hitting the else block, and returning an empty list ([]). The print in that block should report the error, so its output is likely being lost or suppressed. When the main loop receives an empty list, it correctly completes with no companies to process.

Since your previous errors were 403 and 401, this still points to an authentication or access issue.

How to Debug This Immediately

To figure out what error the API is silently returning, we need to add a manual debug check to the get_companies function to force the output:

Modify the get_companies function temporarily:

def get_companies(search_query, max_results=50): 
    url = f"{BASE_URL}/search/companies?q={search_query}&items_per_page={max_results}" 
    response = requests.get(url, auth=(COMPANIES_HOUSE_API_KEY, "")) 
     
    # -------------------------------------------------------------------------- 
    #  NEW DEBUG CODE TO FORCE OUTPUT  
    print(f"DEBUG: CH status code is: {response.status_code}") 
    if response.status_code != 200: 
        print(f"DEBUG: CH response text is: {response.text}") 
    # -------------------------------------------------------------------------- 

    if response.status_code == 200: 
        data = response.json() 
        companies = [] 
        for company in data.get("items", []): 
            companies.append({ 
                "name": company.get("title"), 
                "company_number": company.get("company_number"), 
                "address": company.get("address_snippet"), 
                "status": company.get("company_status"), 
                "type": company.get("company_type") 
            }) 
        return companies 
    else: 
        # This is where the old print was likely failing or suppressed 
        # print(f"Error fetching Companies House data: {response.status_code} - {response.text}")  
        return []

Run the script again with this DEBUG CODE added. It will print the exact status code and message that Companies House is returning, which will tell us exactly what to fix (whether it's the Live/Test URL mismatch or an IP Restriction issue).
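
A longer-term alternative to debug prints is to make the failure impossible to miss by raising an exception instead of returning an empty list. The sketch below separates that decision from the network call so it can be tried on its own. The names CompaniesHouseError and extract_companies are hypothetical, and the field mapping simply mirrors the get_companies function shown above.

```python
class CompaniesHouseError(Exception):
    """Raised when the search endpoint returns a non-200 status."""


def extract_companies(status_code, payload, body_text=""):
    """Return company dicts from a search response, raising on HTTP errors."""
    if status_code != 200:
        # Fail loudly so an auth or access problem cannot look like "no results".
        raise CompaniesHouseError(f"HTTP {status_code}: {body_text}")
    return [
        {
            "name": item.get("title"),
            "company_number": item.get("company_number"),
            "address": item.get("address_snippet"),
            "status": item.get("company_status"),
            "type": item.get("company_type"),
        }
        for item in payload.get("items", [])
    ]
```

With this approach an empty list can only mean that the search genuinely matched nothing, which removes the ambiguity you are seeing now.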

Lemonde
100% (Exalted)
Advanced Member

7 months ago

#13

OK:
--- STARTING DATA AGGREGATION AND POSTING SCRIPT ---
➕ Creating or fetching category: Test
✅ Using Category ID: 44
🔎 Searching Companies House for: 'Digital Marketing'
DEBUG: CH status code is: 200
--- ENRICHMENT AND POSTING LOOP ---
--- PROCESS COMPLETE ---
✅ Data saved to companies_house_with_contacts.csv

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

#14

That is fantastic news! The line DEBUG: CH status code is: 200 confirms that your API key and URL are now correctly configured and the script successfully connected to the Companies House API!

The Companies House API is returning a successful response, but the list of results inside that response must be empty.

The new problem is that the Companies House API is returning zero results for your search query.

The New Problem: Empty Results from Companies House

Since the status code is 200 (Success), the script executes the data processing block:

    if response.status_code == 200: 
        data = response.json() 
        companies = [] 
        for company in data.get("items", []): # <--- Problem is likely here 
            # ... appends companies ... 
        return companies

When no results are found, the JSON returned by Companies House looks like this:

{ 
  "items_per_page": 10, 
  "total_results": 0, 
  "start_index": 0, 
  "items": []  <--- This list is empty 
}

Since the "items" list is empty, the for loop runs zero times, the companies list remains empty, and the script correctly moves on to the next section without crashing.
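
To confirm which case you are in, a one-line summary of the parsed JSON can be printed right after response.json(). This is a rough sketch: describe_search is a hypothetical helper, and it relies only on the "items" and "total_results" fields shown in the example response above.

```python
def describe_search(payload):
    """Summarise a search response so an empty result is easy to tell apart."""
    if "items" not in payload:
        return "response has no 'items' key; check the endpoint and payload shape"
    total = payload.get("total_results", 0)
    returned = len(payload["items"])
    if returned == 0:
        return f"no items returned (total_results={total})"
    return f"{returned} of {total} results returned"
```

Printing describe_search(data) inside get_companies would show at a glance whether the registry matched nothing or whether the response has an unexpected shape.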

How to Fix the Empty Results

There are three primary reasons Companies House returns an empty list for a successful search:

1. The Search Query is Too Specific or Too Broad

The query "Digital Marketing" is often quite specific for the Companies House registry, which primarily indexes by formal company name and official industry codes, not general marketing buzzwords.

Action: Try a much simpler or broader query to test the functionality.

Test Query 1: Try a very common, single-word industry search, like "Plumbing" or "Software".

Test Query 2: Try searching for a well-known UK company name (e.g., "Tesco") to confirm the API returns at least one result.

Change this in your script:

search_query = "Digital Marketing"

To something like:

search_query = "Software"

2. The Search Parameter Limitations

Companies House has specific rules about which fields it searches. It might be searching only registered names and not descriptions.

Action: If a simpler query still fails, check the official Companies House API documentation to see if there are other parameters (like a company status filter) you can remove or adjust to broaden the search.

3. Data Environment (Sandbox vs. Live)

If you are using a Sandbox URL, the test data available might be extremely limited. Even a simple search may return zero results because the test database is tiny.

Action: If Test Queries 1 and 2 fail, you may need to switch to a Live API Key and the Live URL to access the full public registry data.

Recommendation: Change your search_query to a very broad term like "Technology" or "Consulting" and run the script again.

Lemonde
100% (Exalted)
Advanced Member

7 months ago

#15

Check it out:
--- STARTING DATA AGGREGATION AND POSTING SCRIPT ---
➕ Creating or fetching category: Test
✅ Using Category ID: 44
🔎 Searching Companies House for: 'Digital Marketing'
--- ENRICHMENT AND POSTING LOOP ---
📋 Processing: DIGITAL MARKETING LTD
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING LTD due to missing phone.
📋 Processing: AKQA LIMITED
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping AKQA LIMITED due to missing phone.
📋 Processing: DIGITAL MARKETING A LTD
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING A LTD due to missing phone.
📋 Processing: DIGITAL MARKETING ACADEMY LTD
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING ACADEMY LTD due to missing phone.
📋 Processing: DIGITAL MARKETING ADVICE LIMITED
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING ADVICE LIMITED due to missing phone.
📋 Processing: DIGITAL MARKETING AFRICA LTD
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING AFRICA LTD due to missing phone.
📋 Processing: DIGITAL MARKETING AGENCY, LTD
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING AGENCY, LTD due to missing phone.
📋 Processing: DIGITAL MARKETING AND PR LIMITED
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING AND PR LIMITED due to missing phone.
📋 Processing: DIGITAL MARKETING & SEO AGENCY UK LTD
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING & SEO AGENCY UK LTD due to missing phone.
📋 Processing: DIGITAL MARKETING AUDIT LTD
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping DIGITAL MARKETING AUDIT LTD due to missing phone.
--- PROCESS COMPLETE ---
✅ Data saved to companies_house_with_contacts.csv

sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

#16

That is a significant step forward!

The script is now successfully:

Connecting to the Companies House API (Status Code 200).

Fetching the companies list for "Digital Marketing".

Entering the Enrichment and Posting Loop.

Iterating through each company.

The new problem is clearly that the Yell.com scraping component is failing to find any contact information.

The output shows this pattern for every single company:

  Phone: N/A |  Website: N/A 
  Skipping [COMPANY NAME] due to missing phone.

The Root Cause: Yell.com Scraping Failure

The scrape_yell function is returning the default values "N/A" because it couldn't locate the phone number or website elements on the search results page.

There are three likely reasons for this failure:

1. Yell.com Website Changes (Most Likely)

Web scraping is inherently fragile. If Yell.com changed the class names or the HTML structure of their search results page since the script was written, the BeautifulSoup selectors will no longer work.

Original Selectors (in your script):
- Phone: soup.find("span", class_="business--telephone")
- Website: soup.find("a", class_="businessCapsule--ctaItem") (or the fixed versions)

If Yell has updated their site, these selectors will return nothing, causing the function to default to "N/A".

2. Blocking by Yell.com (Possible)

Yell.com may be detecting the automated request, either from the User-Agent header your Python script sends or from the speed of the requests, and serving different page content (often empty) or a "Please Verify" page.

3. Search Query Misses

The search being run on Yell.com might not be precise enough. When searching for "AKQA LIMITED" and its address, Yell may return a list of possible matches, none of which perfectly align with the expected structure.

How to Fix the Scraping Issue

Since this is the only remaining bottleneck, the best solution is to diagnose and potentially update the scrape_yell function.

Step 1: Force Output of the Yell Search URL

Temporarily modify the scrape_yell function to print the exact URL it is attempting to scrape, so you can test it manually in a browser.

Change this in scrape_yell:

def scrape_yell(company_name, location): 
    search_url = f"https://www.yell.com/ucs/UcsSearchAction.do?keywords={company_name}&location={location}" 
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/91.0"} 
     
    # ADD THIS LINE  
    print(f"   Yell Search URL: {search_url}")  
    # ... rest of the function ...

Run the script and manually open one of the printed URLs in your web browser.
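
One thing worth checking while you do this: the search URL is built by dropping the company name and address straight into the string, so spaces, commas, and especially an "&" in a name (such as "DIGITAL MARKETING & SEO AGENCY UK LTD") are not encoded and can cut the query short. A minimal sketch of encoding the parameters with the standard library follows. The helper name build_yell_search_url is hypothetical, and the base URL is copied from the scrape_yell function quoted above.

```python
from urllib.parse import urlencode

# Base URL copied from the scrape_yell function quoted above.
YELL_SEARCH_URL = "https://www.yell.com/ucs/UcsSearchAction.do"


def build_yell_search_url(company_name, location):
    """Return the search URL with both parameters percent-encoded."""
    query = urlencode({"keywords": company_name, "location": location})
    return f"{YELL_SEARCH_URL}?{query}"
```

Whether Yell still accepts this URL format is a separate question, but encoding the parameters at least rules out a malformed query as the cause.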

Step 2: Compare Manual Page to Script Selectors

If the manual browser page shows the phone/website:
- Right-click the phone number element and choose Inspect.
- Check if the current class names on the Yell website still match the Python code's selectors (business--telephone, businessCapsule--ctaItem).
- If they do not match, you must update the selectors in your scrape_yell function to match the current Yell HTML structure.

If the manual browser page does NOT show the company/data:
- The problem is the search query itself (e.g., Companies House address snippets are too vague for Yell). In this case, scraping Yell may be too unreliable, and you should consider an alternative contact data source.

Recommended Action: Perform Step 1 to get a working Yell URL and perform Step 2 to update the selectors. That is the quickest way to reactivate the data enrichment.
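
If you would rather check the selectors from Python than by eye, the sketch below reports which of the expected class names never appear in a page's HTML. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and the names ClassCollector and missing_classes are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(html, expected):
    """Return the expected class names that never appear in the HTML."""
    collector = ClassCollector()
    collector.feed(html)
    return [name for name in expected if name not in collector.classes]
```

Feeding it the text of the response that scrape_yell receives, along with ["business--telephone", "businessCapsule--ctaItem"], would show whether those classes are present at all. If both are missing, the site layout has probably changed or the request is being served a different page.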

Lemonde
100% (Exalted)
Advanced Member

7 months ago

#17

OK I See this:
--- STARTING DATA AGGREGATION AND POSTING SCRIPT ---
➕ Creating or fetching category: Insolvency
✅ Using Category ID: 48
🔎 Searching Companies House for: 'begbies'
--- ENRICHMENT AND POSTING LOOP ---
📋 Processing: BEGBIES LIMITED PARTNERSHIP
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES LIMITED PARTNERSHIP&location=9 Bonhill Street, London, EC2A 4DJ
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES LIMITED PARTNERSHIP due to missing phone.
📋 Processing: BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS&location=6 Raymond Buildings, Gray'S Inn, London, WC1R 5BP
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS due to missing phone.
📋 Processing: BEGBIES CHETTLE AGAR LIMITED
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES CHETTLE AGAR LIMITED&location=9 Bonhill Street, London, EC2A 4DJ
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES CHETTLE AGAR LIMITED due to missing phone.
📋 Processing: BEGBIES DISSOLVED LIMITED PARTNERSHIP
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES DISSOLVED LIMITED PARTNERSHIP&location=Epworth House, 25 City Road, London, EC1Y 1AR
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES DISSOLVED LIMITED PARTNERSHIP due to missing phone.
📋 Processing: BEGBIES PROFESSIONAL APPOINTMENTS LIMITED
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES PROFESSIONAL APPOINTMENTS LIMITED&location=9 Bonhill Street, London, England, EC2A 4DJ
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES PROFESSIONAL APPOINTMENTS LIMITED due to missing phone.
📋 Processing: BEGBIES TRAYNOR LIMITED
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR LIMITED&location=340 Deansgate, Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR LIMITED due to missing phone.
📋 Processing: BEGBIES TRAYNOR (CENTRAL) LLP
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR (CENTRAL) LLP&location=340 Deansgate, Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR (CENTRAL) LLP due to missing phone.
📋 Processing: BEGBIES TRAYNOR GROUP PLC
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR GROUP PLC&location=340 Deansgate, Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR GROUP PLC due to missing phone.
📋 Processing: BEGBIES TRAYNOR INTERNATIONAL LIMITED
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR INTERNATIONAL LIMITED&location=340 Deansgate, Manchester, Greater Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR INTERNATIONAL LIMITED due to missing phone.
📋 Processing: BEGBIES TRAYNOR LEGAL SERVICES LLP
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR LEGAL SERVICES LLP&location=340 Deansgate, Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR LEGAL SERVICES LLP due to missing phone.
--- PROCESS COMPLETE ---
✅ Data saved to companies_house_with_contacts.csv


Lemonde
100% (Exalted)
Advanced Member

7 months ago

#18

OK, now I see:

= RESTART: C:/Users/Administrator/AppData/Local/Programs/Python/Python313/Scripts/Businesses/Geo-Tidied-new-Yell.py
--- STARTING DATA AGGREGATION AND POSTING SCRIPT ---
➕ Creating or fetching category: Insolvency
✅ Using Category ID: 48
🔎 Searching Companies House for: 'Begbies'
--- ENRICHMENT AND POSTING LOOP ---
📋 Processing: BEGBIES LIMITED PARTNERSHIP
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES LIMITED PARTNERSHIP&location=9 Bonhill Street, London, EC2A 4DJ
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES LIMITED PARTNERSHIP due to missing phone.
📋 Processing: BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS&location=6 Raymond Buildings, Gray'S Inn, London, WC1R 5BP
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS due to missing phone.
📋 Processing: BEGBIES CHETTLE AGAR LIMITED
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES CHETTLE AGAR LIMITED&location=9 Bonhill Street, London, EC2A 4DJ
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES CHETTLE AGAR LIMITED due to missing phone.
📋 Processing: BEGBIES DISSOLVED LIMITED PARTNERSHIP
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES DISSOLVED LIMITED PARTNERSHIP&location=Epworth House, 25 City Road, London, EC1Y 1AR
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES DISSOLVED LIMITED PARTNERSHIP due to missing phone.
📋 Processing: BEGBIES PROFESSIONAL APPOINTMENTS LIMITED
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES PROFESSIONAL APPOINTMENTS LIMITED&location=9 Bonhill Street, London, England, EC2A 4DJ
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES PROFESSIONAL APPOINTMENTS LIMITED due to missing phone.
📋 Processing: BEGBIES TRAYNOR LIMITED
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR LIMITED&location=340 Deansgate, Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR LIMITED due to missing phone.
📋 Processing: BEGBIES TRAYNOR (CENTRAL) LLP
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR (CENTRAL) LLP&location=340 Deansgate, Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR (CENTRAL) LLP due to missing phone.
📋 Processing: BEGBIES TRAYNOR GROUP PLC
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR GROUP PLC&location=340 Deansgate, Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR GROUP PLC due to missing phone.
📋 Processing: BEGBIES TRAYNOR INTERNATIONAL LIMITED
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR INTERNATIONAL LIMITED&location=340 Deansgate, Manchester, Greater Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR INTERNATIONAL LIMITED due to missing phone.
📋 Processing: BEGBIES TRAYNOR LEGAL SERVICES LLP
https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR LEGAL SERVICES LLP&location=340 Deansgate, Manchester, M3 4LY
📞 Phone: N/A | 🌐 Website: N/A
⚠️ Skipping BEGBIES TRAYNOR LEGAL SERVICES LLP due to missing phone.
--- PROCESS COMPLETE ---
✅ Data saved to companies_house_with_contacts.csv


sirclesadmin
100% (Exalted)
Administration Topic Starter

7 months ago

#19

You are almost there! The script is working perfectly for everything except the scraping.

The output confirms that Yell.com scraping is the only remaining bottleneck: the new scrape_yell function I provided in the last step still returns "N/A" for phone and website for every company.

This means one of two things is happening:

The New Selectors Broke Again: Yell.com may be rapidly changing its HTML, or the selectors I provided didn't work universally for all listings/searches.

Yell is Blocking the Scraper: The Yell server is detecting that your request is coming from a script (even with a custom User-Agent) and is returning minimal or obfuscated HTML content to deter scraping. This is very common for major directories.
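One way to tell the two cases apart is to look at the raw response before parsing it. The sketch below is a heuristic rather than a definitive test: the function name, the status codes treated as a block, the "captcha" marker, and the 2000-character threshold are all assumptions chosen for illustration; the class names passed in would be the ones scrape_yell looks for.

```python
def diagnose_yell_response(status_code, html, expected_classes):
    """Guess why a Yell fetch produced no contact data.

    Returns "blocked", "selectors_changed", or "ok".
    Thresholds and marker words are illustrative assumptions.
    """
    # An error status commonly used by bot protection or rate limiting.
    if status_code in (403, 429, 503):
        return "blocked"
    # A CAPTCHA page or a near-empty body also suggests the request was filtered.
    if "captcha" in html.lower() or len(html) < 2000:
        return "blocked"
    # A full page came back; check whether any class name we scrape still exists in it.
    if not any(cls in html for cls in expected_classes):
        return "selectors_changed"
    return "ok"
```

Printing diagnose_yell_response(response.status_code, response.text, ["business--telephone", "businessCapsule--ctaItem"]) for one company would indicate which of the two explanations is more plausible.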

The Ultimate Solution: Headless Browser or API

Since standard requests and BeautifulSoup are proving unreliable for Yell.com, the most robust solution is to move to a technique that handles modern anti-scraping measures.

Option 1: Use a Headless Browser (More Complex)

This involves tools like Selenium or Playwright. A headless browser executes the full JavaScript of the page, mimicking a real user, which often defeats basic anti-scraping defenses.

Pros: Highly reliable for dynamic websites.

Cons: Slower, more resource-intensive, and requires installing a browser for the tool to control (for example Chromium for Playwright, or Chrome/Edge with a matching driver for Selenium).
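As a rough illustration of Option 1, here is a minimal Playwright sketch. It assumes Playwright has been installed (pip install playwright, then playwright install chromium). The function names and the 30-second timeout are hypothetical, and nothing here guarantees that Yell will serve the full page to a headless browser. The search URL pattern is the one printed in the log above.

```python
from urllib.parse import urlencode

YELL_SEARCH = "https://www.yell.com/ucs/UcsSearchAction.do"  # pattern from the log


def build_yell_url(keywords, location):
    """Build the search URL with spaces and commas percent-encoded."""
    query = urlencode({"keywords": keywords, "location": location})
    return f"{YELL_SEARCH}?{query}"


def fetch_rendered_html(url, timeout_ms=30000):
    """Load url in headless Chromium and return the HTML after JavaScript has run."""
    # Imported here so the rest of the script still loads without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned HTML string could then be passed to BeautifulSoup in place of response.text, so the rest of scrape_yell would not need to change.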

Option 2: Use a Commercial Web Scraping API (Highly Recommended)

Given the script's purpose (automated data aggregation), a specialized scraping API service is likely the simplest fix. Services like ScraperAPI or Scrapfly handle proxy rotation, CAPTCHA solving, and browser emulation for you, returning the page HTML or, in some cases, structured JSON data.

How to Implement (Example for a generic Scraping API):

Instead of calling Yell directly:

# Old, failing request: 
response = requests.get(search_url, headers=headers, timeout=10)

You would pipe the Yell URL through a scraping API endpoint:

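# New request, routed through the scraping API:
SCRAPING_API_KEY = "YOUR_SCRAPING_KEY"
# Pass the Yell URL via params so requests percent-encodes it;
# otherwise its "&location=..." part is read as a separate API parameter.
api_params = {"api_key": SCRAPING_API_KEY, "url": search_url}
response = requests.get("http://api.scraperapi.com", params=api_params, timeout=30)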

If the service returns the fully rendered page, your existing BeautifulSoup selectors can then run on that HTML unchanged, provided they still match Yell's current markup.
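One detail worth checking whenever a URL is passed inside another URL as a query parameter: the inner URL has to be percent-encoded, or its own "&location=..." part is interpreted as a parameter of the outer request. The sketch below shows the difference using only the standard library. The helper name wrap_for_scraping_api is hypothetical, the key is a placeholder, and the company values are taken from the log earlier in the thread.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "http://api.scraperapi.com"  # endpoint as quoted in this post
SCRAPING_API_KEY = "YOUR_SCRAPING_KEY"      # placeholder, not a real key


def wrap_for_scraping_api(target_url, api_key=SCRAPING_API_KEY):
    """Return an API URL carrying target_url as one percent-encoded parameter."""
    return f"{API_ENDPOINT}?{urlencode({'api_key': api_key, 'url': target_url})}"


target = (
    "https://www.yell.com/ucs/UcsSearchAction.do"
    "?keywords=BEGBIES TRAYNOR LIMITED&location=340 Deansgate, Manchester, M3 4LY"
)

# Naive string interpolation: the inner "&location=" leaks into the outer query.
naive = f"{API_ENDPOINT}?api_key={SCRAPING_API_KEY}&url={target}"
naive_params = parse_qs(urlsplit(naive).query)

# Encoded version: the whole Yell URL survives as the single "url" parameter.
safe = wrap_for_scraping_api(target)
safe_params = parse_qs(urlsplit(safe).query)

print("location" in naive_params)      # the outer request gained a stray parameter
print(safe_params["url"][0] == target)  # the inner URL round-trips intact
```

Passing a params dictionary to requests.get achieves the same encoding without building the string by hand.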

Since you are running into persistent issues, a dedicated scraping service or a library that renders JavaScript (such as Playwright) will probably be necessary to extract data from Yell reliably.

For an introduction to more advanced scraping techniques that can handle dynamic content like the kind that may be on Yell.com, check out: Scraping dynamic content using Python and Playwright.


Lemonde
100% (Exalted)
Advanced Member

7 months ago

#20

[Attached image: Yell.png]

