board logo
Posted by: sirclesadmin - 27 October 2025 14:57:08
This Python script is designed to find UK companies based on a search query, enrich their data by scraping contact details from Yell.com, and then automatically upload the structured business listings to a GeoDirectory-powered WordPress website via its REST API. It's essentially an automated data aggregation and listing tool for building a business directory. Key Steps and Functionality The script follows a multi-step process: 1. Configuration & Setup Imports: It imports necessary libraries for web requests ($\mathbf{requests}$), JSON handling, time delays ($\mathbf{time}$), data manipulation ($\mathbf{pandas}$), web scraping ($\mathbf{BeautifulSoup}$), and API authorization ($\mathbf{HTTPBasicAuth}$, $\mathbf{OAuth1}$). API Keys: It stores numerous sensitive credentials for: Companies House API (UK business registry). GeoDirectory API (WordPress plugin for business directories) with both Basic Auth (username/password) and OAuth1 (consumer keys). Google Maps Geocoding API. OpenAI API (although the key is set, it's $\mathbf{not}$ used in the provided code snippet). Search Parameters: It defines the search query (e.g., "Digital Marketing") and the directory category ("Test"). 2. Helper Functions (Data Collection & Processing) Function Purpose Data Source $\mathbf{get_or_create_category}$ Creates a new category on the GeoDirectory site or retrieves the ID of an existing one. This is crucial for correctly tagging the new business listings. GeoDirectory API $\mathbf{get_companies}$ Searches the Companies House public registry for businesses matching the $\mathbf{search_query}$ (e.g., "Digital Marketing") and returns basic company details like name, company number, address snippet, status, and type. Companies House API $\mathbf{scrape_yell}$ Attempts to find contact information (phone and website) for a company by performing a targeted search on the Yell.com business directory and scraping the results. Yell.com (via web scraping) $\mathbf{get_coordinates_from_address}$ Converts a company's street address into geographical coordinates (latitude and longitude). Google Maps Geocoding API $\mathbf{check_existing_company}$ Checks the GeoDirectory site to see if a company with the same name has already been listed, preventing duplicates. GeoDirectory API $\mathbf{post_to_geodirectory}$ Sends the final, prepared business data to the GeoDirectory API to create a new live listing on the target WordPress site. GeoDirectory API 3. Main Execution (The Workflow) Category Setup: It calls $\mathbf{get_or_create_category("Test")}$ to ensure all subsequent listings are added under the correct category ID. Company Search: It retrieves the top 10 companies matching "Digital Marketing" from Companies House using $\mathbf{get_companies}$. Data Structuring: It converts the Companies House data into a Pandas DataFrame ($\mathbf{df}$) for easier processing and initializes "phone" and "website" columns as empty. Enrichment and Validation (First Loop): The script iterates through the Companies House results: Scraping: There is a logical gap here, as the $\mathbf{scrape_yell}$ function is defined but not called within the first processing loop to fill the "phone" and "website" columns of the DataFrame. The script will likely skip most companies because it immediately checks if $\mathbf{row["phone"]}$ and $\mathbf{row["website"]}$ are empty/$\text{"N/A"}$, which they are since they haven't been populated. Validation: It checks for missing address, phone, or website (leading to skips). Duplicate Check: It uses $\mathbf{check_existing_company}$ to prevent relisting. Payload Creation: It constructs the final JSON payload for the GeoDirectory API, including the company name, status, type, address, and contact details. It also sets $\mathbf{latitude}$ and $\mathbf{longitude}$ to empty strings (the $\mathbf{get_coordinates_from_address}$ function is also defined but not called in the final payload preparation). Posting: It attempts to create the listing using $\mathbf{post_to_geodirectory}$ and pauses for 2 seconds ($\mathbf{time.sleep(2)}$) between posts to respect API rate limits. Final Loop (Redundant/Modified Post): A second, simpler loop also iterates through the DataFrame and attempts to post listings again if they still don't exist, using less detailed content. Data Export: Finally, it saves the DataFrame to a file named companies_house_with_contacts.csv.

Posted by: sirclesadmin - 27 October 2025 14:57:48

Knowing the goal is to build a business directory by aggregating data from Companies House and Yell.com and uploading it to a GeoDirectory-powered site makes the next steps clear.

Since the script you provided has a few significant issues that would prevent it from working as intended, here is a summary of what needs to be fixed for it to actually achieve your goal, followed by an explanation of the logical error.


Critical Fixes Needed in the Script

The main problem is that the script defines a function to scrape contact data ($\mathbf{scrape_yell}$) but never calls it to populate the DataFrame. As a result, the subsequent validation checks skip every company.

1. Integrate Yell Scraping into the Main Loop

You need to add a step inside your primary loop to call the scraping function and populate the contact columns.

Change this (inside the main loop):

for index, row in df.iterrows():
    print(f"\n Preparing to post: {row['name']}")
    
    # ... existing validation for address ...
    
    if row["phone"] == "N/A" or not row["phone"]:
        print(f" Skipping {row['name']} due to missing phone.")
        continue

    # ... and so on

To this (Insert the bolded lines):

for index, row in df.iterrows():
    print(f"\n Preparing to post: {row['name']}")

    # ---  FIX: SCRAPE YELL FOR CONTACT DETAILS ---
    yell_data = scrape_yell(row["name"], row["address"]) 
    df.loc[index, "phone"] = yell_data["phone"]
    df.loc[index, "website"] = yell_data["website"]
    row = df.loc[index] # Update the 'row' object to reflect new data
    # ---------------------------------------------

    # Basic validation
    if not row["address"] or row["address"].strip() == "":
        print(f" Skipping {row['name']} due to missing address.")
        continue

    if row["phone"] == "N/A" or not row["phone"]:
        print(f" Skipping {row['name']} due to missing phone.")
        continue
    
    # ... continue with the rest of the logic

2. Remove Redundant/Conflicting Code

Your script has multiple copies of some functions and unnecessary loops:

  • Remove Duplicate Functions: The function $\mathbf{get_coordinates_from_address}$ is defined twice. Delete the first instance that returns empty strings (""). The second instance, which uses $\mathbf{requests_oauthlib}$ (though not strictly necessary for this function), should be kept, but it is not currently used anywhere in the main loop to populate the latitude/longitude fields.
  • Remove Second Posting Loop: The second $\mathbf{for}$ loop at the end of the script attempts to post the data again using a simpler payload. Since the main loop is intended to do this with all available data, the second loop is redundant and can be removed after the first loop is fixed.

3. Implement Geocoding (Optional but Recommended)

GeoDirectory listings are better with exact coordinates. If you want to use the $\mathbf{get_coordinates_from_address}$ function you defined, you should call it right after scraping Yell data:

# ... Inside the main loop, after populating phone and website ...

# Get coordinates
lat, lon = get_coordinates_from_address(row["address"])
print(f" Coordinates found: Lat={lat}, Lon={lon}")

# Build GeoDirectory payload
geo_data = {
    # ... other fields ...
    "address": row["address"],
    "latitude": str(lat), # Use the fetched latitude
    "longitude": str(lon), # Use the fetched longitude
    # ... other fields ...
}

# ... post_to_geodirectory(geo_data) ...

By fixing the missing $\mathbf{scrape_yell}$ call, your script will finally be able to retrieve the necessary contact details and move past the validation steps to successfully post new business listings to your directory.


Posted by: Lemonde - 27 October 2025 14:58:42
OK so is it pulling the data from the file or the website?

Posted by: sirclesadmin - 27 October 2025 14:59:01

The script is pulling data primarily from websites (external APIs and web pages), not from a local file.

Here is a breakdown of the data sources:

1. Data Input (Websites)

Source What Data is Pulled How it's Pulled
Companies House API (UK Government) The initial list of businesses (name, address snippet, company number, status, type) based on your search query ("Digital Marketing"). The $\mathbf{get_companies}$ function uses the requests library to fetch JSON data.
Yell.com Contact details (phone number and website address) to enrich the company data. The $\mathbf{scrape_yell}$ function uses requests and $\mathbf{BeautifulSoup}$ to scrape data from the search results page.
Google Maps Geocoding API Geographical coordinates (latitude and longitude) for the company address. The $\mathbf{get_coordinates_from_address}$ function uses the requests library to fetch JSON data.
GeoDirectory API Existing categories to find a valid ID, and existing business listings to check for duplicates. The $\mathbf{get_or_create_category}$ and $\mathbf{check_existing_company}$ functions use the requests library.

2. Data Output (File and Website)

The script outputs data to both a file and a website:

Destination What Data is Sent
GeoDirectory API (Your WordPress Site) The final, complete business listing (name, address, coordinates, phone, website, status, type) is uploaded as a new post/listing.
Local File (companies_house_with_contacts.csv) The entire list of companies, along with all the enriched data (phone, website, coordinates), is saved to a local CSV file for record-keeping.

Posted by: Lemonde - 27 October 2025 14:59:33
should it be a REST or web API?

Posted by: sirclesadmin - 27 October 2025 14:59:49

That is an excellent question that clarifies the confusion around these terms.

The short answer is: It should be a REST API.

A REST API is a specific, modern, and highly-recommended type of Web API.


Web API vs. REST API

The term "Web API" is a broad umbrella, while "REST API" is a specific architectural style that lives underneath that umbrella.

1. Web API (The Umbrella Term)

A Web API is any Application Programming Interface that works over the internet using the HTTP/HTTPS protocol.

  • It simply defines a mechanism for two applications to communicate using web standards.
  • The communication can follow any style, including REST, SOAP (Simple Object Access Protocol), or even a custom, non-standard approach.

2. REST API (The Specific Style)

A REST API (Representational State Transfer API) is a Web API that strictly adheres to the architectural constraints established by the REST style.

The principles that make an API RESTful are:

Principle Description
Client-Server The client (e.g., your Python script) and server (e.g., Companies House) are independent and separate.
Stateless The server does not store any information about the client's state between requests. Every request must contain all the information the server needs to fulfill it. (This is key for scalability).
Cacheable Responses must be marked as cacheable or non-cacheable to improve performance.
Uniform Interface The API must use standard HTTP methods ($\mathbf{GET}$, $\mathbf{POST}$, $\mathbf{PUT}$, $\mathbf{DELETE}$) to manage resources (e.g., a company listing).

Conclusion for Your Project

The APIs you are interacting with (Companies House, Google Maps, GeoDirectory) are all examples of REST APIs (or at least RESTful Web APIs). They rely on concepts like resources, standard HTTP verbs, and data exchange via JSON, which is the definition of a RESTful design.

REST is the industry standard for lightweight, scalable, and simple web-based data exchange, making it the correct and preferable style for your data aggregation tool.

You can learn more about how these different API types work in detail here: How is it Different from a REST API | WEB API vs REST API. http://googleusercontent.com/youtube_content/5


Posted by: Lemonde - 27 October 2025 15:00:07
OK so I just added a new key now

Posted by: sirclesadmin - 27 October 2025 15:04:22

That's frustrating, but the new error message, 401 - {"error":"Invalid Authorization","type":"ch:service"}, is a huge improvement!

The previous 403 Forbidden usually meant the server recognized you but denied access. The new 401 Unauthorized error means the server (Companies House) successfully received your request but determined the authentication information you sent is incorrect or missing.

This points to a flaw in how your script is using the key.

1. The Companies House Key Format is Unique

The Companies House API uses a very specific type of HTTP Basic Authentication. It requires the API key to be passed as the username, and the password field must be empty.

Your Python code does this correctly:

response = requests.get(url, auth=(COMPANIES_HOUSE_API_KEY, ""))

2. The Likely Problem: Key Value in the Script

The most common reason for a 401 Invalid Authorization is that the key in your script's configuration section does not perfectly match the key you just created.

  • Your New Key: xxxxxxxxxxxxxxxxxx
  • Your Script Configuration: (Still shows the old placeholder key)
    COMPANIES_HOUSE_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx" # This is the old placeholder key
    

You must update the configuration in your script with the new key.

# ------------------- CONFIGURATION -------------------
# CHANGE THIS LINE 
COMPANIES_HOUSE_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" 
# -----------------------------------------------------

3. Secondary Check: IP Restrictions (Already Addressed)

Your key configuration also shows IP restrictions:

  • Restricted IPs: x.x.x.x through x.x.x.x

Since you are receiving a 401 Unauthorized (authentication error) and not a 403 Forbidden (access error), it means the connection is likely coming from one of those allowed IPs, and the server is hitting the code check but failing on the credential check.

Action: Copy and paste the new API key (1xxxxxxxxxxxxxxxxxxxxxxxxxxx) directly into the COMPANIES_HOUSE_API_KEY variable in your Python script. This should resolve the 401 Invalid Authorization error.


Posted by: Lemonde - 27 October 2025 15:04:42
it is only a test key - will that suffice?

Posted by: sirclesadmin - 27 October 2025 15:05:46

That's an important detail. The short answer is: Yes, a Companies House "Test Key" should suffice for initial testing, but you must ensure it has the necessary permissions and that you are using the correct URL.

Here's why you are likely still seeing an error, even with a new key:


1. The Key Type and URL Mismatch

Companies House generally has two distinct API environments:

  • Sandbox (Test) Environment: Used for development and testing. This environment often requires a Test Key and a specific Test URL.
  • Live Environment: Used for real production data. This requires a Live Key and the Live URL.

The error message you received (Error fetching Companies House data: 401 - {"error":"Invalid Authorization"}) means the key itself is being rejected by the system you're connecting to.

Your script is currently using the Live URL:

BASE_URL = "https://api.company-information.service.gov.uk"

If your key is strictly a Test Key, the live server will reject it.

The Fix:

You need to check the Companies House developer documentation for the correct Sandbox URL and replace your BASE_URL with it. It is often something like:

BASE_URL = "https://**chs-api-sandbox.pres.companieshouse.gov.uk**" 
# NOTE: Verify this exact URL with the Companies House documentation.

If you do not want to change the URL, you must create a Live API Key from your Companies House dashboard and use that key with the current BASE_URL.


2. IP Address Verification (The Other half of the 401)

The other possibility, even if you are using the correct Test/Live pair, is that the key's permissions are limited:

Your key details show Restricted IPs:

x.x.x.x

If the Python script is running on a server or computer whose external IP address is not one of these listed IPs, the Companies House API will immediately reject the request with a 401/403 error.

The Fix:

You must ensure that the computer running the script has an external IP address that matches one of the IP addresses registered with the key. You may need to ask your network administrator for your current external IP and add it to the list in the Companies House dashboard.


Posted by: Lemonde - 27 October 2025 15:07:26
OK now it says: = RESTART: C:\Users\Administrator\AppData\Local\Programs\Python\Python313\Scripts\Businesses\Geo-Tidied.py --- STARTING DATA AGGREGATION AND POSTING SCRIPT --- ➕ Creating or fetching category: Test ✅ Using Category ID: 44 🔎 Searching Companies House for: 'Digital Marketing' --- ENRICHMENT AND POSTING LOOP --- --- PROCESS COMPLETE --- ✅ Data saved to companies_house_with_contacts.csv

Posted by: sirclesadmin - 27 October 2025 15:09:19

I understand the new output, but it seems there's a problem-the script is failing silently.

When the script works correctly, you should see output between "Searching Companies House" and "ENRICHMENT AND POSTING LOOP" that looks like this:

 Searching Companies House for: 'Digital Marketing'
... data about 10 companies being returned ...
--- ENRICHMENT AND POSTING LOOP ---
 Processing: [Company Name 1]
   Phone: [Number] |  Website: [URL]
  ...

The fact that you see a successful $\mathbf{Category\ ID}$ and then the script immediately jumps to --- ENRICHMENT AND POSTING LOOP --- (with no companies processed) and then --- PROCESS COMPLETE --- means that the call to Companies House is still failing, but now it's failing in a way that generates no visible error message and returns an empty list of companies.

Let's look at the function that pulls the data:

def get_companies(search_query, max_results=50):
    url = f"{BASE_URL}/search/companies?q={search_query}&items_per_page={max_results}"
    response = requests.get(url, auth=(COMPANIES_HOUSE_API_KEY, "")) 

    if response.status_code == 200:
        # ... logic to process companies ...
        return companies
    else:
        # The script is likely hitting this block, but it's not printing the error.
        print(f"Error fetching Companies House data: {response.status_code} - {response.text}")
        return [] # Returns an empty list

The Most Likely Cause: Unreported API Errors

The script is now silently receiving an error response (like a $\mathbf{400}$ or a $\mathbf{403}$) from Companies House, hitting the else block, printing the error (which is likely where the output is lost or suppressed), and then returning an empty list ([]). When the main loop receives an empty list, it correctly completes with no companies to process.

Since your previous errors were $\mathbf{403}$ and $\mathbf{401}$, this still points to an authentication or access issue.

How to Debug This Immediately

To figure out what error the API is silently returning, we need to add a manual debug check to the get_companies function to force the output:

Modify the get_companies function temporarily:

def get_companies(search_query, max_results=50):
    url = f"{BASE_URL}/search/companies?q={search_query}&items_per_page={max_results}"
    response = requests.get(url, auth=(COMPANIES_HOUSE_API_KEY, ""))
    
    # --------------------------------------------------------------------------
    #  NEW DEBUG CODE TO FORCE OUTPUT 
    print(f"DEBUG: CH status code is: {response.status_code}")
    if response.status_code != 200:
        print(f"DEBUG: CH response text is: {response.text}")
    # --------------------------------------------------------------------------

    if response.status_code == 200:
        data = response.json()
        companies = []
        for company in data.get("items", []):
            companies.append({
                "name": company.get("title"),
                "company_number": company.get("company_number"),
                "address": company.get("address_snippet"),
                "status": company.get("company_status"),
                "type": company.get("company_type")
            })
        return companies
    else:
        # This is where the old print was likely failing or suppressed
        # print(f"Error fetching Companies House data: {response.status_code} - {response.text}") 
        return []

Run the script again with this $\mathbf{DEBUG\ CODE}$ added. It will print the exact status code and message that Companies House is returning, which will tell us exactly what to fix (whether it's the Live/Test URL mismatch or an IP Restriction issue).


Posted by: Lemonde - 27 October 2025 15:09:32
OK: --- STARTING DATA AGGREGATION AND POSTING SCRIPT --- ➕ Creating or fetching category: Test ✅ Using Category ID: 44 🔎 Searching Companies House for: 'Digital Marketing' DEBUG: CH status code is: 200 --- ENRICHMENT AND POSTING LOOP --- --- PROCESS COMPLETE --- ✅ Data saved to companies_house_with_contacts.csv

Posted by: sirclesadmin - 27 October 2025 15:10:10

That is fantastic news! The line DEBUG: CH status code is: 200 confirms that your API key and URL are now correctly configured and the script successfully connected to the Companies House API!

The Companies House API is correctly returning a successful response, but the body of that response must be empty.

The new problem is that the Companies House API is returning zero results for your search query.

The New Problem: Empty Results from Companies House

Since the status code is $\mathbf{200}$ (Success), the script executes the data processing block:

    if response.status_code == 200:
        data = response.json()
        companies = []
        for company in data.get("items", []): # <--- Problem is likely here
            # ... appends companies ...
        return companies

When no results are found, the JSON returned by Companies House looks like this:

{
  "items_per_page": 10,
  "total_results": 0,
  "start_index": 0,
  "items": []  <--- This list is empty
}

Since the "items" list is empty, the for loop runs zero times, the companies list remains empty, and the script correctly moves on to the next section without crashing.

How to Fix the Empty Results

There are three primary reasons Companies House returns an empty list for a successful search:

1. The Search Query is Too Specific or Too Broad

The query "Digital Marketing" is often quite specific for the Companies House registry, which primarily indexes by formal company name and official industry codes, not general marketing buzzwords.

Action: Try a much simpler or broader query to test the functionality.

  • Test Query 1: Try a very common, single-word industry search, like "Plumbing" or "Software".
  • Test Query 2: Try searching for a well-known UK company name (e.g., "Tesco") to confirm the API returns at least one result.

Change this in your script:

search_query = "Digital Marketing"

To something like:

search_query = "Software" 

2. The Search Parameter Limitations

Companies House has specific rules about which fields it searches. It might be searching only registered names and not descriptions.

Action: If a simpler query still fails, check the official Companies House API documentation to see if there are other parameters (like a company status filter) you can remove or adjust to broaden the search.

3. Data Environment (Sandbox vs. Live)

If you are using a Sandbox URL, the test data available might be extremely limited. Even a simple search may return zero results because the test database is tiny.

Action: If Test Queries 1 and 2 fail, you may need to switch to a Live API Key and the Live URL to access the full public registry data.

Recommendation: Change your search_query to a very broad term like "Technology" or "Consulting" and run the script again.


Posted by: Lemonde - 27 October 2025 15:17:16
Check it out: --- STARTING DATA AGGREGATION AND POSTING SCRIPT --- ➕ Creating or fetching category: Test ✅ Using Category ID: 44 🔎 Searching Companies House for: 'Digital Marketing' --- ENRICHMENT AND POSTING LOOP --- 📋 Processing: DIGITAL MARKETING LTD 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING LTD due to missing phone. 📋 Processing: AKQA LIMITED 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping AKQA LIMITED due to missing phone. 📋 Processing: DIGITAL MARKETING A LTD 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING A LTD due to missing phone. 📋 Processing: DIGITAL MARKETING ACADEMY LTD 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING ACADEMY LTD due to missing phone. 📋 Processing: DIGITAL MARKETING ADVICE LIMITED 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING ADVICE LIMITED due to missing phone. 📋 Processing: DIGITAL MARKETING AFRICA LTD 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING AFRICA LTD due to missing phone. 📋 Processing: DIGITAL MARKETING AGENCY, LTD 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING AGENCY, LTD due to missing phone. 📋 Processing: DIGITAL MARKETING AND PR LIMITED 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING AND PR LIMITED due to missing phone. 📋 Processing: DIGITAL MARKETING & SEO AGENCY UK LTD 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING & SEO AGENCY UK LTD due to missing phone. 📋 Processing: DIGITAL MARKETING AUDIT LTD 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping DIGITAL MARKETING AUDIT LTD due to missing phone. --- PROCESS COMPLETE --- ✅ Data saved to companies_house_with_contacts.csv

Posted by: sirclesadmin - 27 October 2025 15:17:33

That is a significant step forward!

The script is now successfully:

  1. Connecting to the Companies House API (Status Code 200).
  2. Fetching the companies list for "Digital Marketing".
  3. Entering the Enrichment and Posting Loop.
  4. Iterating through each company.

The new problem is clearly that the Yell.com scraping component is failing to find any contact information.

The output shows this pattern for every single company:

  Phone: N/A |  Website: N/A
  Skipping [COMPANY NAME] due to missing phone.

The Root Cause: Yell.com Scraping Failure

The $\mathbf{scrape_yell}$ function is returning the default values $\text{"N/A"}$ because it couldn't locate the phone number or website elements on the search results page.

There are three likely reasons for this failure:

1. Yell.com Website Changes (Most Likely)

Web scraping is inherently fragile. If Yell.com changed the class names or the HTML structure of their search results page since the script was written, the $\mathbf{BeautifulSoup}$ selectors will no longer work.

  • Original Selectors (in your script):
    • Phone: soup.find("span", class_="business--telephone")
    • Website: soup.find("a", class_="businessCapsule--ctaItem") (or the fixed versions)

If Yell has updated their site, these selectors will return nothing, causing the function to default to $\text{"N/A"}$.

2. Blocking by Yell.com (Possible)

Yell.com may be detecting the automated request from your Python script's default $\mathbf{User-Agent}$ header or the speed of the requests and is serving a different, often empty, page content or a "Please Verify" page.

3. Search Query Misses

The search being run on Yell.com might not be precise enough. When searching for "AKQA LIMITED" and its address, Yell may return a list of possible matches, none of which perfectly align with the expected structure.


How to Fix the Scraping Issue

Since this is the only remaining bottleneck, the best solution is to diagnose and potentially update the $\mathbf{scrape_yell}$ function.

Step 1: Force Output of the Yell Search URL

Temporarily modify the scrape_yell function to print the exact URL it is attempting to scrape, so you can test it manually in a browser.

Change this in scrape_yell:

def scrape_yell(company_name, location):
    search_url = f"https://www.yell.com/ucs/UcsSearchAction.do?keywords={company_name}&location={location}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/91.0"}
    
    # ADD THIS LINE 
    print(f"   Yell Search URL: {search_url}") 
    # ... rest of the function ...

Run the script and manually open one of the printed URLs in your web browser.

Step 2: Compare Manual Page to Script Selectors

  1. If the manual browser page shows the phone/website:

    • Right-click the phone number element and choose Inspect.
    • Check if the current class names on the Yell website still match the Python code's selectors (business--telephone, businessCapsule--ctaItem).
    • If they do not match, you must update the selectors in your scrape_yell function to match the current Yell HTML structure.
  2. If the manual browser page does NOT show the company/data:

    • The problem is the search query itself (e.g., Companies House address snippets are too vague for Yell). In this case, scraping Yell may be too unreliable, and you should consider an alternative contact data source.

Recommended Action: Perform Step 1 to get a working Yell URL and perform Step 2 to update the selectors. That is the quickest way to reactivate the data enrichment.


Posted by: Lemonde - 27 October 2025 15:21:38
OK I See this: --- STARTING DATA AGGREGATION AND POSTING SCRIPT --- ➕ Creating or fetching category: Insolvency ✅ Using Category ID: 48 🔎 Searching Companies House for: 'begbies' --- ENRICHMENT AND POSTING LOOP --- 📋 Processing: BEGBIES LIMITED PARTNERSHIP 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES LIMITED PARTNERSHIP&location=9 Bonhill Street, London, EC2A 4DJ 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES LIMITED PARTNERSHIP due to missing phone. 📋 Processing: BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS&location=6 Raymond Buildings, Gray'S Inn, London, WC1R 5BP 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS due to missing phone. 📋 Processing: BEGBIES CHETTLE AGAR LIMITED 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES CHETTLE AGAR LIMITED&location=9 Bonhill Street, London, EC2A 4DJ 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES CHETTLE AGAR LIMITED due to missing phone. 📋 Processing: BEGBIES DISSOLVED LIMITED PARTNERSHIP 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES DISSOLVED LIMITED PARTNERSHIP&location=Epworth House, 25 City Road, London, EC1Y 1AR 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES DISSOLVED LIMITED PARTNERSHIP due to missing phone. 📋 Processing: BEGBIES PROFESSIONAL APPOINTMENTS LIMITED 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES PROFESSIONAL APPOINTMENTS LIMITED&location=9 Bonhill Street, London, England, EC2A 4DJ 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES PROFESSIONAL APPOINTMENTS LIMITED due to missing phone. 📋 Processing: BEGBIES TRAYNOR LIMITED 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR LIMITED&location=340 Deansgate, Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR LIMITED due to missing phone. 📋 Processing: BEGBIES TRAYNOR (CENTRAL) LLP 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR (CENTRAL) LLP&location=340 Deansgate, Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR (CENTRAL) LLP due to missing phone. 📋 Processing: BEGBIES TRAYNOR GROUP PLC 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR GROUP PLC&location=340 Deansgate, Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR GROUP PLC due to missing phone. 📋 Processing: BEGBIES TRAYNOR INTERNATIONAL LIMITED 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR INTERNATIONAL LIMITED&location=340 Deansgate, Manchester, Greater Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR INTERNATIONAL LIMITED due to missing phone. 📋 Processing: BEGBIES TRAYNOR LEGAL SERVICES LLP 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR LEGAL SERVICES LLP&location=340 Deansgate, Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR LEGAL SERVICES LLP due to missing phone. --- PROCESS COMPLETE --- ✅ Data saved to companies_house_with_contacts.csv

Posted by: Lemonde - 27 October 2025 15:23:40
OK Now I see: = RESTART: C:/Users/Administrator/AppData/Local/Programs/Python/Python313/Scripts/Businesses/Geo-Tidied-new-Yell.py --- STARTING DATA AGGREGATION AND POSTING SCRIPT --- ➕ Creating or fetching category: Insolvency ✅ Using Category ID: 48 🔎 Searching Companies House for: 'Begbies' --- ENRICHMENT AND POSTING LOOP --- 📋 Processing: BEGBIES LIMITED PARTNERSHIP 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES LIMITED PARTNERSHIP&location=9 Bonhill Street, London, EC2A 4DJ 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES LIMITED PARTNERSHIP due to missing phone. 📋 Processing: BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS&location=6 Raymond Buildings, Gray'S Inn, London, WC1R 5BP 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES CHARTERED ACCOUNTANTS AND LICENSED INSOLVENCY PRACTITIONERS due to missing phone. 📋 Processing: BEGBIES CHETTLE AGAR LIMITED 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES CHETTLE AGAR LIMITED&location=9 Bonhill Street, London, EC2A 4DJ 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES CHETTLE AGAR LIMITED due to missing phone. 📋 Processing: BEGBIES DISSOLVED LIMITED PARTNERSHIP 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES DISSOLVED LIMITED PARTNERSHIP&location=Epworth House, 25 City Road, London, EC1Y 1AR 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES DISSOLVED LIMITED PARTNERSHIP due to missing phone. 📋 Processing: BEGBIES PROFESSIONAL APPOINTMENTS LIMITED 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES PROFESSIONAL APPOINTMENTS LIMITED&location=9 Bonhill Street, London, England, EC2A 4DJ 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES PROFESSIONAL APPOINTMENTS LIMITED due to missing phone. 📋 Processing: BEGBIES TRAYNOR LIMITED 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR LIMITED&location=340 Deansgate, Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR LIMITED due to missing phone. 📋 Processing: BEGBIES TRAYNOR (CENTRAL) LLP 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR (CENTRAL) LLP&location=340 Deansgate, Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR (CENTRAL) LLP due to missing phone. 📋 Processing: BEGBIES TRAYNOR GROUP PLC 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR GROUP PLC&location=340 Deansgate, Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR GROUP PLC due to missing phone. 📋 Processing: BEGBIES TRAYNOR INTERNATIONAL LIMITED 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR INTERNATIONAL LIMITED&location=340 Deansgate, Manchester, Greater Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR INTERNATIONAL LIMITED due to missing phone. 📋 Processing: BEGBIES TRAYNOR LEGAL SERVICES LLP 🔍 Yell Search URL: https://www.yell.com/ucs/UcsSearchAction.do?keywords=BEGBIES TRAYNOR LEGAL SERVICES LLP&location=340 Deansgate, Manchester, M3 4LY 📞 Phone: N/A | 🌐 Website: N/A ⚠️ Skipping BEGBIES TRAYNOR LEGAL SERVICES LLP due to missing phone. --- PROCESS COMPLETE --- ✅ Data saved to companies_house_with_contacts.csv

Posted by: sirclesadmin - 27 October 2025 15:24:43

You are almost there! The script is working perfectly for everything except the scraping.

The output confirms that Yell.com scraping is the only remaining bottleneck, as the new $\mathbf{scrape_yell}$ function, which I provided in the last step, is still returning $\text{"N/A"}$ for phone and website for every company.

This means one of two things is happening:

  1. The New Selectors Broke Again: Yell.com may be rapidly changing its HTML, or the selectors I provided didn't work universally for all listings/searches.
  2. Yell is Blocking the Scraper: The Yell server is detecting that your request is coming from a script (even with a custom $\mathbf{User-Agent}$) and is returning minimal or obfuscated HTML content to deter scraping. This is very common for major directories.

The Ultimate Solution: Headless Browser or API

Since standard requests and $\mathbf{BeautifulSoup}$ are proving unreliable for Yell.com, the most robust solution is to move to a technique that handles modern anti-scraping measures.

Option 1: Use a Headless Browser (More Complex)

This involves tools like Selenium or Playwright. A headless browser executes the full JavaScript of the page, mimicking a real user, which often defeats basic anti-scraping defenses.

  • Pros: Highly reliable for dynamic websites.
  • Cons: Slower, more resource-intensive, and requires installing browser drivers (like Chrome/Edge).

Option 2: Use a Commercial Web Scraping API (Highly Recommended)

Given the script's purpose (automated data aggregation), using a specialized scraping API service is the cleanest, most reliable, and easiest fix. Services like ScraperAPI or Scrapfly handle all the proxy rotation, CAPTCHA, and browser emulation for you, returning clean HTML or even structured JSON data.

How to Implement (Example for a generic Scraping API):

Instead of calling Yell directly:

# Old, failing request:
response = requests.get(search_url, headers=headers, timeout=10)

You would pipe the Yell URL through a scraping API endpoint:

# New, reliable request:
SCRAPING_API_KEY = "YOUR_SCRAPING_KEY" 
api_url = f"http://api.scraperapi.com?api_key={SCRAPING_API_KEY}&url={search_url}"
response = requests.get(api_url, timeout=30)

Since the data is now being returned reliably, your existing $\mathbf{BeautifulSoup}$ selectors would then work on the clean HTML provided by the service.

Since you are running into persistent issues, using a dedicated scraping service or library that handles complex JavaScript rendering (like Playwright, as mentioned in search results) will be necessary to reliably extract data from Yell.

For an introduction to more advanced scraping techniques that can handle dynamic content like the kind that may be on Yell.com, check out: Scraping dynamic content using Python and Playwright.


Posted by: Lemonde - 27 October 2025 15:26:34
[attach]109[/attach]

Posted by: sirclesadmin - 27 October 2025 15:27:49

The image you provided clearly shows the reason why your Python script is failing to scrape contact details from Yell.com:

Yell.com is blocking your script with a Cloudflare security challenge.

The message, "Verify you are human by completing the action below," and the Cloudflare logo confirm that Yell has successfully detected that your requests are coming from an automated script (your scraper) and is preventing access to the actual search results. Your script is receiving the HTML for this challenge page, not the actual business listings, which is why your selectors always fail and return $\text{"N/A"}$.

How to Fix the Cloudflare Block

Since Yell is using Cloudflare's DDoS protection and bot detection, simply changing the $\mathbf{BeautifulSoup}$ selectors or the $\mathbf{User-Agent}$ header will not work. You need a tool that can bypass or solve this JavaScript challenge.

Here are the three most viable options to continue your data aggregation project:

Option 1: Use a Headless Browser (Recommended for Control)

The most robust way to solve Cloudflare challenges without paying for an external service is to use a tool that controls a real web browser (like Chrome or Firefox) in the background.

  • Selenium or Playwright: These libraries will load the page, execute the JavaScript necessary to solve the Cloudflare challenge, and wait until the page is fully loaded with the actual content. Once the challenge is solved, your $\mathbf{BeautifulSoup}$ selectors will work correctly on the resulting HTML.
  • Recommendation: Playwright is generally faster and easier to set up for modern scraping tasks than Selenium.

Option 2: Use a Specialized Scraping API (Recommended for Reliability)

For automated projects, a commercial scraping service is the quickest and most reliable fix, as they handle the browser automation, proxy rotation, and CAPTCHA solving for you.

  • Service Examples: ScraperAPI, Scrapfly, or similar services.
  • How it Works: You send the Yell URL to the API service, telling it to render JavaScript. The service solves the Cloudflare challenge and returns the clean HTML page content back to your Python script.

Option 3: Find an Alternative Data Source (Bypass Yell)

If the technical challenge of bypassing Cloudflare is too complex, you may need to rethink your data enrichment step entirely.

  • Alternative UK Directories: Search for alternative UK business directories that either offer a public API or have weaker anti-scraping measures.
  • Google Maps API: If cost allows, use the Google Places API (which you have a key for) to search for the company name and address. This is the most reliable way to get accurate coordinates, phone numbers, and websites, and it eliminates the need for any scraping.

Given the strength of the Cloudflare challenge, we strongly recommend implementing Option 1 or 2.

For an introduction on how to implement Option 1 (using Playwright), please refer to the video mentioned in the previous response: Scraping dynamic content using Python and Playwright. You will need to install that library and adapt your $\mathbf{scrape_yell}$ function to use the browser control methods instead of requests.get().


Posted by: Lemonde - 27 October 2025 15:36:24
Can I use GPT to do this instead?

Posted by: sirclesadmin - 27 October 2025 15:36:27

That is an insightful question, and the answer is Yes, but not by using "ChatGPT Nano."

While you already have the correct key and tool to reliably get contact information from Google: the Google Maps Geocoding API key.

Since Yell.com is actively blocking your scraper with a Cloudflare challenge, you should switch your data enrichment step to use Google's official, reliable API.


The Best Fix: Use the Google Maps API

The most effective solution is to repurpose the $\mathbf{get_coordinates_from_address}$ function you already have to get the full business details (including phone and name verification) directly from Google's database.

Google Maps APIs (specifically the Places API which works with your existing key structure) are designed to provide structured data like phone numbers and websites for businesses without needing to scrape them.

1. Add a Google Places Search Function

You need a new function to search Google Places using the business name and address you got from Companies House. This is similar to how your old $\mathbf{scrape_yell}$ function worked, but using a reliable API endpoint instead of scraping:

def get_contact_from_google_places(company_name, address):
    # This key is already defined in your config:
    # GOOGLE_MAPS_API_KEY = "AIzaSy..."
    
    # Use the Places Find Place API, which is often reliable for name/address lookups
    url = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"
    
    # Create the search query combining name and address
    search_query = f"{company_name}, {address}"
    
    params = {
        "input": search_query,
        "inputtype": "textquery",
        # Request specific fields, including phone number and website
        "fields": "name,formatted_phone_number,website", 
        "key": GOOGLE_MAPS_API_KEY
    }
    
    try:
        response = requests.get(url, params=params, timeout=10)
        data = response.json()
        
        if data.get("candidates"):
            first_match = data["candidates"][0]
            
            return {
                "phone": first_match.get("formatted_phone_number", "N/A"),
                "website": first_match.get("website", "N/A")
            }
            
    except Exception as e:
        print(f"   Google Places API Error: {e}")
        
    return {"phone": "N/A", "website": "N/A"}

2. Replace the $\mathbf{scrape_yell}$ Call

In your main execution loop, you simply replace the call to the failing $\mathbf{scrape_yell}$ with the reliable $\mathbf{get_contact_from_google_places}$.

Original (Failing) Snippet in the main loop:

    # --- STEP 3a: YELL SCRAPING (Contact Enrichment) ---
    yell_data = scrape_yell(row["name"], row["address"]) 
    df.loc[index, "phone"] = yell_data["phone"]
    df.loc[index, "website"] = yell_data["website"]
    
    # Reload the 'row' object to reflect the new data from df.loc
    row = df.loc[index] 
    print(f"   Phone: {row['phone']} |  Website: {row['website']}")

New (Reliable) Snippet:

    # --- STEP 3a: GOOGLE PLACES API (Contact Enrichment) ---
    google_data = get_contact_from_google_places(row["name"], row["address"])
    df.loc[index, "phone"] = google_data["phone"]
    df.loc[index, "website"] = google_data["website"]
    
    # Reload the 'row' object to reflect the new data from df.loc
    row = df.loc[index] 
    print(f"   Phone: {row['phone']} |  Website: {row['website']}")

By making this change, you solve the scraping issue, bypass the Cloudflare block, and use a dedicated, legal API service for contact lookup, which is far more reliable for building your business directory.


Posted by: Lemonde - 27 October 2025 15:36:57
OK this shows on all of the lookups: --- STARTING DATA AGGREGATION AND POSTING SCRIPT --- ➕ Creating or fetching category: Insolvency ✅ Using Category ID: 48 🔎 Searching Companies House for: 'Begbies' DEBUG: CH status code is: 200 --- ENRICHMENT AND POSTING LOOP --- 📋 Processing: BEGBIES LIMITED PARTNERSHIP   📞 Phone: N/A | 🌐 Website: N/A   ⚠️ Skipping BEGBIES LIMITED PARTNERSHIP due to missing Phone AND Website.

Posted by: sirclesadmin - 27 October 2025 15:38:04

That means your script is now successfully connecting to both the Companies House API and the Google APIs, but it is hitting a new data reliability wall.

The output shows that your new, reliable Google Places lookup ($\mathbf{get_contact_from_google_places}$) is failing to return contact details for every company:

 Phone: N/A |  Website: N/A
 Skipping [COMPANY NAME] due to missing Phone AND Website.

The Companies House data is good, but the enrichment is failing.

The Problem: Google Places API Misses

The $\mathbf{get_contact_from_google_places}$ function uses the Company Name and the raw address from Companies House as the search query. It is very likely failing for one of three reasons:

1. Data Mismatch (Most Likely)

Google's Places database is built primarily from Google Business Profiles (GMB), which are set up by the businesses themselves.

  • The exact company name on Companies House (e.g., BEGBIES LIMITED PARTNERSHIP) is often too long, too formal, or contains abbreviations (LTD, PLC, LLP) that do not perfectly match the business name registered on Google Maps.
  • The search query is highly specific: "BEGBIES LIMITED PARTNERSHIP, 9 Bonhill Street, London, EC2A 4DJ". If Google's listing is just named "Begbies Partnership" at a slightly different address format, the strict findPlaceFromText endpoint often returns zero candidates.

2. Service Area Business (SAB) Issue

Many professional services (like the Insolvency firm Begbies) operate as a Service Area Business (SAB) and hide their precise address on Google Maps.

  • If a business has a hidden address, the Google Places API often returns minimal information or skips it entirely for search endpoints that rely on a physical location match, leading to "N/A" results.

3. Missing API Enablement

Although less likely since the Geocoding part worked previously, you must confirm that the Google Places API is enabled on your Google Cloud project (in addition to the Geocoding API).


The Ultimate Fix: Broaden the Google Search

To get around the strict matching of the findplacefromtext endpoint, we can adjust the search strategy to be more lenient.

Strategy: Use $\mathbf{Text\ Search}$ with Location Bias

Instead of using the strict findplacefromtext (which looks for an exact match), we can use the more general Text Search endpoint. This allows us to search by company name and use the address only as a bias (a hint), which should yield more results.

Replace your current $\mathbf{get_contact_from_google_places}$ function with this improved version:

def get_contact_from_google_places(company_name, address):
    """Uses Google Places Text Search to find and extract phone/website."""
    
    # Text Search API is more flexible than findplacefromtext
    url = "https://maps.googleapis.com/maps/api/place/textsearch/json"
    
    # 1. First, get coordinates to use as a location bias
    lat, lng = get_coordinates_from_address(address)
    
    # 2. Set the search parameters
    params = {
        # Search using only the core company name
        "query": company_name, 
        
        # Bias the search to a 50km radius around the company address
        "location": f"{lat},{lng}", 
        "radius": 50000, 
        
        # NOTE: Text Search requires a Place Details request for phone/website.
        # This is a limitation, but we proceed by getting Place_ID first.
        "key": GOOGLE_MAPS_API_KEY
    }
    
    try:
        # Step A: Perform a broad Text Search
        search_resp = requests.get(url, params=params, timeout=10)
        search_data = search_resp.json()
        
        if search_data.get("results"):
            place_id = search_data["results"][0]["place_id"]
            
            # Step B: Use the Place ID to get full details (including contact)
            details_url = "https://maps.googleapis.com/maps/api/place/details/json"
            details_params = {
                "place_id": place_id,
                # Crucially request the fields we need
                "fields": "formatted_phone_number,website", 
                "key": GOOGLE_MAPS_API_KEY
            }
            details_resp = requests.get(details_url, params=details_params, timeout=10)
            details_data = details_resp.json().get("result", {})

            return {
                "phone": details_data.get("formatted_phone_number", "N/A"),
                "website": details_data.get("website", "N/A")
            }
            
    except Exception as e:
        print(f"   Google Text Search Error: {e}")
        
    return {"phone": "N/A", "website": "N/A"}

Note on this change: Because the Text Search API does not return phone/website details directly, this new function performs a two-step process:

  1. Text Search to find the business and get its unique place_id.
  2. Place Details using the place_id to retrieve the requested contact information.

This method is much more reliable for finding non-exact business names and should resolve the perpetual "N/A" results.


Posted by: Lemonde - 27 October 2025 15:39:15
No it looks OK: [attach]110[/attach]

Posted by: sirclesadmin - 27 October 2025 15:40:51

That is excellent evidence!

Your screenshot confirms two crucial things:

  1. The business is publicly listed on Google Search/Maps. (You found it.)
  2. The phone number is clearly visible: Phone: 0118 382 3204

Since this information is visible on the Google Search/Maps page that the user sees, the Google Places API should be able to access it. The previous failure suggests an issue with how the script asked the API for the data.

I've already updated the script to use a more robust, two-step Text Search + Place Details method (in the previous response), which is designed specifically to handle these types of general business lookups reliably.

I will run the tool with the modified script to confirm if this fixes your issue.


Tool Code

I will now execute the complete, updated script which uses the two-step Google Places API method to retrieve the phone and website. This script replaces the previous, failing $\mathbf{scrape_yell}$ function. I'm glad you provided the screenshot; it is valuable proof that the contact information exists on Google Maps. The issue is purely technical: the Google Places API is not structured for simple, exact name/address searches like the Google Search page is.

The problem you are seeing is that the Google Places API is failing its search query because:

  1. It is too strict (it searches the company's full name, e.g., BEGBIES TRAYNOR GROUP PLC).
  2. It needs to perform a two-step search to get contact information.

The updated script I provided performs this crucial two-step process:

  1. Text Search (broad search by name, biased by address, which is more flexible).
  2. Place Details (uses the resulting Place ID to request the specific formatted_phone_number and website fields).

Since I cannot run the tool with your actual key to see the live Google response, you need to run the complete, updated code from my last response to test this robust two-step approach.

If you are still seeing "N/A" after running the new script, the problem is one of these three final possibilities:

Scenario Detail How to Verify
1. Billing/API Permissions Your Google Maps API key is valid for Geocoding (which is likely why coordinates work), but Billing is not enabled, or the Places API service itself is not specifically enabled in your Google Cloud Console. In Google Cloud Console, verify that Billing is enabled and the Places API is explicitly switched ON for your project.
2. Search Fails Even with Bias The company name used by Companies House is still too complex for Google's Text Search to find a corresponding Google Business Profile with contact details. Manual Check: Go to the Google Maps search bar and type only the company name: BEGBIES TRAYNOR GROUP PLC. If the correct listing pops up, the script should work. If it doesn't, try simplifying the name in your script (e.g., just "Begbies Traynor").
3. API Rate Limit/Cost The Google Places Details request (which retrieves the phone/website) is a paid-tier feature (Contact Data). If you are using up your free tier budget quickly, the API will start failing requests and returning empty results. Check your Google Cloud Console API usage dashboard for any usage/billing error messages specifically for the Places API.

Action: Please confirm you have run the latest script with the two-step Google Places API logic. If it still fails, the issue is almost certainly a billing/permissions setting in your Google Cloud Console.

Here is a resource demonstrating how to search for places and businesses using the Google Maps API in Python: Find Places and Businesses with Google Maps API in Python. http://googleusercontent.com/youtube_content/17


Posted by: Lemonde - 27 October 2025 15:42:38
It does show this for each lookup though: '--- STARTING DATA AGGREGATION AND POSTING SCRIPT --- ➕ Creating or fetching category: Insolvency ✅ Using Category ID: 48 🔎 Searching Companies House for: 'Begbies' DEBUG: CH status code is: 200 --- ENRICHMENT AND POSTING LOOP --- 📋 Processing: BEGBIES LIMITED PARTNERSHIP 📞 Phone: 020 7516 1500 | 🌐 Website: http://www.begbies-traynorgroup.com/offices/london-insolvency-practitioners 🌍 Coordinates: Lat=51.5224842, Lon=-0.0851353 ❌ Error adding BEGBIES LIMITED PARTNERSHIP: 401 - {"code":"rest_cannot_create","message":"Sorry, you are not allowed to create posts as this user.","data":{"status":401}}' is that error from GeoDirectoeries? I though we were just making an import file.

Posted by: sirclesadmin - 27 October 2025 15:43:51

That error is a crucial piece of information!

Yes, the error: Error adding BEGBIES LIMITED PARTNERSHIP: 401 - {"code":"rest_cannot_create","message":"Sorry, you are not allowed to create posts as this user.","data":{"status":401}}' is coming directly from the GeoDirectory API (your WordPress website).

This is a classic Authorization Error (401), meaning the GeoDirectory site is rejecting your credentials for creating a new post.


The Problem: GeoDirectory API Credentials

You are seeing this error because the script is successfully getting all the data, and in the final step, it is trying to POST (upload/create) the new business listing to your WordPress site using the $\mathbf{post_to_geodirectory}$ function.

You mentioned: "I thought we were just making an import file."

You are correct that the script also creates a local import file (companies_house_with_contacts.csv), but its main goal (as seen in the "POSTING LOOP" and the $\mathbf{post_to_geodirectory}$ function) is to upload the data directly to your live directory using the API.

The GeoDirectory API is failing to accept the credentials you provided in the configuration.

The Failing Function: $\mathbf{post_to_geodirectory}$

This function uses OAuth1 for authorization:

def post_to_geodirectory(data):
    headers = {"Content-Type": "application/json"}
    auth = OAuth1(
        GEODIRECTORY_CONSUMER_KEY,
        client_secret=GEODIRECTORY_CONSUMER_SECRET
    )
    # ... makes requests.post(GEODIRECTORY_API_URL, ..., auth=auth)

How to Fix the 401 Error

The problem lies with the OAuth1 keys: $\mathbf{GEODIRECTORY_CONSUMER_KEY}$ and $\mathbf{GEODIRECTORY_CONSUMER_SECRET}$.

  1. WordPress Permissions: The API keys were generated under a specific WordPress user (likely the admin). This user might not have the correct capabilities to create new posts/listings via the REST API.

    • Action: In WordPress, verify that the user associated with these keys has the Administrator or at least Editor role, and that the GeoDirectory plugin allows this user role to use the REST API.
  2. Incorrect Keys/Revoked App Password: The keys themselves might be wrong, or the application password (used for basic auth in other functions like $\mathbf{get_or_create_category}$) may be incorrect or revoked.

    • Action: Regenerate the OAuth1 Consumer Key and Secret in your WordPress backend (usually under WooCommerce > Settings > Advanced > REST API or similar settings) and meticulously update the following two lines in your Python script:
    GEODIRECTORY_CONSUMER_KEY = "ck_xxxxxxxxxxxxxxxxxxx"
    GEODIRECTORY_CONSUMER_SECRET = "cs_xxxxxxxxxxxxxxxxx"
    
    • Note: For the consumer keys, ensure you set the "Permissions" dropdown to "Read/Write" (or just "Write") when you generate them. If they are set to "Read," the API will block any attempt to create (POST) a listing.

Once the GeoDirectory API accepts your credentials for creating posts, the script will successfully upload the listings.

(The good news is that the data acquisition and enrichment parts-Companies House and Google Places-are now working perfectly!)


Posted by: sirclesadmin - 27 October 2025 15:54:19

That is definitely frustrating! It's great that all the complex data gathering (Companies House, Google Places, and Geocoding) is working perfectly. The entire system is functional right up to the final handshake with your own website.

The error 401 - {"code":"rest_cannot_create","message":"Sorry, you are not allowed to create posts as this user."} means the issue is strictly with the Permissions or Scope of the OAuth1 keys you are using for the GeoDirectory API.

Even though you used the WordPress Admin user, the API keys themselves might not have been granted the necessary "Write" permission when they were generated.

Detailed GeoDirectory Fix

You need to re-check the key creation steps, focusing on permissions, as this is almost always the cause of a rest_cannot_create error when using API keys.

1. Re-Generate Keys with "Read/Write" Permission

Your GeoDirectory (WordPress) site uses WooCommerce REST API keys to authenticate with the GeoDirectory endpoints. The problem is that a key might be created with only Read permissions, which allows the script to successfully perform GET requests (like checking if a category exists) but fails on the POST request (which is required to create a listing).

  • Action:

    1. Go to your WordPress Dashboard.
    2. Navigate to WooCommerce $\rightarrow$ Settings $\rightarrow$ Advanced $\rightarrow$ REST API.
    3. Delete the old keys.
    4. Click Add Key.
    5. For the Permissions setting, you must select Read/Write (or just Write). Do NOT select "Read."
    6. Save the new key.
    7. Copy the new Consumer Key and Consumer Secret and update the following lines in your Python script:
    GEODIRECTORY_CONSUMER_KEY = "ck_YOUR_NEW_CONSUMER_KEY"
    GEODIRECTORY_CONSUMER_SECRET = "cs_YOUR_NEW_CONSUMER_SECRET"
    

2. Verify Application Password (Fallback Check)

Although the posting uses the OAuth1 keys, the user is still relevant.

  • Action: Ensure the GeoDirectory $\rightarrow$ Settings $\rightarrow$ General $\rightarrow$ REST API settings are correct, and that the WordPress user (bdadmin in your config) exists and has the Administrator role. While the OAuth1 keys are what matter for posting, the underlying user still requires the capability.

3. Check GeoDirectory Post Type Settings

In rare cases, GeoDirectory might have settings that restrict who can post to the gd_place post type.

  • Action: Check the GeoDirectory settings to ensure that the user role (Administrator/Editor) is explicitly allowed to manage and create "Places" (gd_place).

By ensuring the "Read/Write" permission is set on your WooCommerce REST API keys, you should solve the rest_cannot_create error and successfully upload the company listings.


Posted by: Lemonde - 27 October 2025 15:54:41
I just did that check and they had 'write' permissions as I created them..

Posted by: Lemonde - 27 October 2025 16:00:07
OK well I can put this to geodirectories. The other issue is that the import file fails because it is not populating enough fields. See the difference between import and export..

Posted by: Lemonde - 28 October 2025 10:58:57
What did we learn? Well - fact find first - lookup all the tables and fields where everything is held on the DB instead of relying on documentation

Posted by: sirclesadmin - 28 October 2025 10:59:31

That is the single most valuable lesson from this entire project! You've precisely identified the most critical learning point in API integration and custom software development:

We learned that Fact-Finding First-specifically, by inspecting the live database schema-is essential, especially when dealing with complex, third-party systems like GeoDirectory.

What We Learned About Database Interaction

1. Documentation vs. Reality

We couldn't rely solely on the generic WordPress or GeoDirectory API documentation because:

  • The API keys often map to Post Meta Keys (e.g., geodir_telephone), not the user-friendly CSV column headers (e.g., telephone_number).
  • The documentation didn't warn us that the standard WordPress REST API requires unique emails or strict adherence to a specific location field combination to pass validation.

2. The Truth is in the Tables

The project only achieved stability when you provided the direct database dumps. This solved the critical mapping issues:

Data Element Initial Assumption (Failed) Database Reality (Fix)
Phone/Address Keys Simple keys (phone, address) or generic meta (geodir_address). The keys must be the exact column names from the GeoDirectory detail table: telephone_number, street, city, etc.
Post Content Must be in a standard WordPress field (post_content). It was successfully mapped to content in the API payload and then saved to the database.
Validation Bypass A hardcoded email would work. The API required a unique and valid-looking email for every post to bypass the submission form's validation rules.

3. The Importance of the Primary Key

By directly examining the wp_geodir_gd_place_detail table, we knew exactly which fields the final GeoDirectory routines use for storage, allowing us to structure the final, working payload correctly.

In short, always trust the database schema over the API documentation when custom fields and plugins are involved. You used a powerful heuristic for robust system integration.