moleculeresolver¶

Submodules¶

Attributes¶

__version__

Classes¶

`Molecule`	Represents a molecule with various properties and identifiers.
`MoleculeResolver`

Package Contents¶

class moleculeresolver.Molecule¶

Represents a molecule with various properties and identifiers.

Attributes:

SMILES (Optional[str]): The SMILES (Simplified Molecular Input Line Entry System) representation of the molecule.

synonyms (Optional[list[str]]): A list of alternative names or synonyms for the molecule.

CAS (Optional[list[str]]): A list of CAS (Chemical Abstracts Service) registry numbers for the molecule.

additional_information (Optional[str]): Any additional information about the molecule.

mode (Optional[str]): The mode associated with the molecule.

service (Optional[str]): The service associated with the molecule.

number_of_crosschecks (Optional[int]): The number of cross-checks performed on the molecule.

identifier (Optional[str]): A unique identifier for the molecule.

found_molecules (Optional[list]): A list of related molecules found during processing.

SMILES: str | None = None¶

synonyms: List[str] | None = []¶

CAS: List[str] | None = []¶

additional_information: str | None = ''¶

mode: str | None = ''¶

service: str | None = ''¶

number_of_crosschecks: int | None = 1¶

identifier: str | None = ''¶

found_molecules: list | None = []¶

to_dict(found_molecules: str | None = 'recursive') → Dict[str, Any]¶

Convert the Molecule object to a dictionary.

Args:

found_molecules (Optional[str]): Determines how ‘found_molecules’ are handled.

If ‘remove’, the ‘found_molecules’ field will be excluded.
If ‘recursive’, ‘found_molecules’ will be recursively converted to dictionaries.

Returns:

Dict[str, Any]: A dictionary representation of the Molecule object.

Note:

This method creates a deep copy of the object’s __dict__ attribute. Depending on the found_molecules parameter, it may exclude or recursively convert the ‘found_molecules’ field before returning the dictionary.
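The two documented options can be illustrated with a minimal sketch. It uses a stand-in dataclass, not the library's own class: `MoleculeSketch` and its reduced field set are assumptions made for illustration, and only the 'remove' and 'recursive' values come from the description above.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MoleculeSketch:
    """Hypothetical stand-in for Molecule with a reduced set of fields."""

    SMILES: Optional[str] = None
    service: Optional[str] = ""
    found_molecules: List[Any] = field(default_factory=list)

    def to_dict(self, found_molecules: Optional[str] = "recursive") -> Dict[str, Any]:
        # Deep-copy so the caller cannot mutate the object through the result.
        result = copy.deepcopy(self.__dict__)
        if found_molecules == "remove":
            result.pop("found_molecules", None)
        elif found_molecules == "recursive":
            result["found_molecules"] = [
                m.to_dict("recursive") if isinstance(m, MoleculeSketch) else m
                for m in self.found_molecules
            ]
        return result


child = MoleculeSketch(SMILES="CCO", service="pubchem")
parent = MoleculeSketch(SMILES="CCO", service="combined", found_molecules=[child])
flat = parent.to_dict(found_molecules="remove")       # no 'found_molecules' key
nested = parent.to_dict(found_molecules="recursive")  # children become dicts
```

With 'remove' the result is a flat record that is convenient for tabular export, while 'recursive' keeps the provenance of each contributing molecule in a JSON-friendly form.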

class moleculeresolver.MoleculeResolver(available_service_API_keys: dict[str, str | None] | None = None, molecule_cache_db_path: str | None = None, molecule_cache_expiration: datetime.datetime | None = None, standardization_options: dict | None = None, differentiate_isomers: bool | None = True, differentiate_tautomers: bool | None = True, differentiate_isotopes: bool | None = True, check_for_resonance_structures: bool | None = None, show_warning_if_non_unique_structure_was_found: bool | None = False)¶

chunker(seq: list, size: int) → set¶

Split a sequence into chunks of a specified size.

Args:

seq (list): The sequence to be chunked.

size (int): The size of each chunk.

Returns:

set: A set containing subsequences (chunks) from the input sequence.

Example:

>>> list(self.chunker([1, 2, 3, 4, 5, 6], 2))
[(1, 2), (3, 4), (5, 6)]
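For readers who want to reproduce the chunking behavior outside the class, the following standalone sketch yields the same tuples as the example above. `chunker_sketch` is a hypothetical name, and the handling of a trailing partial chunk (kept, shorter than `size`) is an assumption, since the documentation does not say what the library does in that case.

```python
from typing import Iterator, Sequence, Tuple


def chunker_sketch(seq: Sequence, size: int) -> Iterator[Tuple]:
    """Yield consecutive chunks of `seq`; the last chunk may be shorter."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(seq), size):
        yield tuple(seq[start:start + size])


even_chunks = list(chunker_sketch([1, 2, 3, 4, 5, 6], 2))
uneven_chunks = list(chunker_sketch([1, 2, 3, 4, 5, 6, 7], 3))
```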

take_most_common(container: list, number_to_take: int | None = None) → list¶

Select the most common elements from a container.

Identifies and returns the most frequently occurring elements in the given container. Handles both case-sensitive and case-insensitive comparisons for string elements.

Args:

container (list): The input list of elements to process.

number_to_take (Optional[int]): The number of most common elements to return. If None, returns all elements sorted by frequency. Defaults to None.

Returns:

list: A list of the most common elements, preserving the original case for strings.

Notes:

If the container has fewer than 2 elements, it returns the container as is.
For string elements, comparisons are case-insensitive, but the original case is preserved in the output.
The method maintains the order of elements based on their frequency, with ties broken by the order of appearance in the original container.
Whitespace is stripped from string elements before comparison.
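The rules in the notes above can be sketched with `collections.Counter`. This is an illustration under stated assumptions, not the library's implementation: `take_most_common_sketch` is a hypothetical name, and returning the first-seen spelling (stripped of whitespace) for each group is one plausible reading of "preserving the original case".

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def take_most_common_sketch(container: List[Any], number_to_take: Optional[int] = None) -> List[Any]:
    if len(container) < 2:
        return container

    def key_of(item: Any) -> Any:
        # Strings are compared stripped and case-insensitively.
        return item.strip().lower() if isinstance(item, str) else item

    counts: Counter = Counter()
    first_seen: Dict[Any, Any] = {}
    for item in container:
        key = key_of(item)
        counts[key] += 1
        # Remember the first spelling so the original case is preserved.
        first_seen.setdefault(key, item.strip() if isinstance(item, str) else item)

    # sorted() is stable, so ties keep their order of first appearance.
    ranked = sorted(counts, key=lambda k: -counts[k])
    return [first_seen[k] for k in ranked][:number_to_take]


names = ["Ethanol", "ethanol ", "EtOH", "Water", "etoh", "ETHANOL"]
top_two = take_most_common_sketch(names, 2)
everything = take_most_common_sketch(names)
```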

_module_path¶

molecule_cache_db_path = None¶

molecule_cache_expiration = None¶

molecule_cache¶

_OPSIN_executable_path = None¶

available_service_API_keys = None¶

_standardization_options¶

_differentiate_isomers = True¶

_differentiate_tautomers = True¶

_differentiate_isotopes = True¶

_check_for_resonance_structures = None¶

_show_warning_if_non_unique_structure_was_found = False¶

_available_services_with_batch_capabilities = ['srs', 'comptox', 'pubchem']¶

_message_slugs_shown = []¶

_session = None¶

_session_CompTox = None¶

_java_path = None¶

_OPSIN_tempfolder = None¶

supported_modes_by_services¶

_available_services¶

supported_modes = []¶

supported_services_by_mode¶

_service_adapters¶

CAS_regex_with_groups¶

CAS_regex = '(\\d{2,7}-\\d{2}-\\d)'¶

empirical_formula_regex_compiled¶

formula_bracket_group_regex_compiled¶

non_generic_SMILES_regex_compiled¶

InChI_regex_compiled¶

InChIKey_regex_compiled¶

chemeo_API_token_regex_compiled¶

cas_registry_API_token_regex_compiled¶

comptox_API_token_regex_compiled¶

html_tag_regex_compiled¶

tighten_commas_on_N_regex_compiled¶

tighten_commas_on_enclosed_numbers_regex_compiled_list¶

tighten_commas_at_the_beginning_of_the_name_regex_compiled¶

__enter__() → MoleculeResolver¶

Enter the runtime context for the MoleculeResolver.

This method is called when entering a ‘with’ statement. It sets up the necessary resources for the MoleculeResolver to function.

Returns:

MoleculeResolver: The instance of the class (self).

Raises:

Exception: Any exceptions raised during the setup process.

Notes:

Performs the following actions:
1. Disables the RDKit logger.
2. Initializes the molecule cache.
3. Sets up a temporary folder for OPSIN if it’s available.

_enter_rdkit_log_context() → None¶: Start suppressing RDKit logs for the resolver runtime.

_enter_molecule_cache_context() → None¶: Create and enter the molecule cache context when needed.

_create_opsin_tempfolder() → None¶: Create the OPSIN temp folder when OPSIN batch mode is enabled.

_cleanup_opsin_tempfolder(*, error_ocurred: bool) → None¶: Cleanup OPSIN temp folder unless an exception is currently bubbling up.

_teardown_runtime_contexts(*, error_ocurred: bool) → None¶: Teardown all runtime contexts in a single lifecycle helper.

__exit__(exception_type, exception_value, exception_traceback) → None¶

Exit the runtime context for the MoleculeResolver.

This method is called when exiting a ‘with’ statement. It cleans up resources used by the MoleculeResolver.

Args:

exception_type (Type[BaseException] or None): The type of the exception that caused the context to be exited.

exception_value (BaseException or None): The instance of the exception that caused the context to be exited.

exception_traceback (TracebackType or None): A traceback object encoding the stack trace.

Returns:

None

Notes:

Performs the following cleanup actions:
1. Determines if an error occurred during execution.
2. Exits the RDKit logger disabling context.
3. Exits the molecule cache context.
4. Cleans up the OPSIN temporary folder if no error occurred.
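The enter/exit pairing described above, and in particular the rule that the temporary folder is only removed when no error occurred, can be sketched with a small stand-in class. `ResolverLifecycleSketch` is hypothetical and models only the temp-folder part of the lifecycle; the RDKit logger and molecule cache steps are left out.

```python
import os
import shutil
import tempfile
from typing import Optional


class ResolverLifecycleSketch:
    """Hypothetical stand-in showing only the temp-folder part of the lifecycle."""

    def __init__(self) -> None:
        self.tempfolder: Optional[str] = None

    def __enter__(self) -> "ResolverLifecycleSketch":
        self.tempfolder = tempfile.mkdtemp(prefix="opsin_sketch_")
        return self

    def __exit__(self, exception_type, exception_value, exception_traceback) -> None:
        error_occurred = exception_type is not None
        # Keep the folder for inspection when an exception is propagating.
        if not error_occurred and self.tempfolder is not None:
            shutil.rmtree(self.tempfolder, ignore_errors=True)
        # Returning None lets any exception continue to propagate.


with ResolverLifecycleSketch() as clean_run:
    clean_folder = clean_run.tempfolder

try:
    with ResolverLifecycleSketch() as failed_run:
        failed_folder = failed_run.tempfolder
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

clean_folder_exists = os.path.isdir(clean_folder)
failed_folder_exists = os.path.isdir(failed_folder)
shutil.rmtree(failed_folder, ignore_errors=True)  # tidy up after the demo
```

Leaving the folder in place after a failure presumably helps with debugging the intermediate files, which is why the sketch removes it by hand at the end.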

_register_default_service_adapters() → None¶: Register built-in adapters used by find_single_molecule.

_resolve_service_with_adapter(service: str, flattened_identifiers: list[str], flattened_modes: list[str], required_formula: str | None, required_charge: int | None, required_structure_type: str | None) → moleculeresolver.services.ServiceSearchResult | None¶: Resolve one service by delegating to its configured adapter.

_resolve_identifier_with_adapter(service: str, identifier: str, mode: str, required_formula: str | None, required_charge: int | None, required_structure_type: str | None) → moleculeresolver.services.ServiceSearchResult | None¶: Resolve one identifier/mode pair via the configured service adapter.

static _service_result_to_exhaustive_candidate(service: str, result: moleculeresolver.services.ServiceSearchResult) → dict[str, Any]¶: Normalize a service adapter result into the exhaustive candidate payload.

query_molecule_cache(service: str, identifier_mode: str, identifier: str) → Generator[tuple[bool, list[moleculeresolver.molecule.Molecule]], None, None]¶

Query the molecule cache for a given identifier and yield the results.

Searches the molecule cache for a specific identifier using the provided service and identifier mode. Yields whether an entry is available and the list of molecules found. After the context is exited, it handles saving new molecules to the cache if necessary.

Args:

service (str): The service used for querying (e.g., “cts”, “cir”).

identifier_mode (str): The mode of identification used.

identifier (str): The specific identifier to search for.

Returns:

tuple[bool, list[Molecule]]: A tuple containing:

bool: True if an entry is available in the cache, False otherwise.
list[Molecule]: The list of molecules found in the cache (empty if not found).

Raises:

Exception: Any exceptions raised by the underlying cache operations.

Notes:

Uses a context manager to ensure proper handling of cache operations.
Handles special cases for CTS and CIR services when they are down.
New molecules are saved to the cache after the context is exited, unless specific conditions prevent saving (e.g., service is down).

query_molecule_cache_batchmode(service: str, identifier_mode: str, identifiers: list[str], save_not_found: bool | None = True) → Generator[tuple[list[str], list[int], list[list[moleculeresolver.molecule.Molecule] | None] | None], None, None]¶

Query the molecule cache for multiple identifiers in batch mode.

Searches the cache for molecules matching the given service, identifier mode, and list of identifiers. Yields information about identifiers to search, their indices, and the results. After the context is exited, it saves new molecules to the cache.

Args:

service (str): The service used for querying (e.g., “cts”, “cir”).

identifier_mode (str): The mode of identification used.

identifiers (list[str]): The list of identifiers to search for.

save_not_found (Optional[bool]): Whether to save entries for identifiers not found. Defaults to True.

Returns:

tuple[list[str], list[int], Optional[list[Optional[list[Molecule]]]]]: A tuple containing:

list[str]: Identifiers that need to be searched (not found in cache).
list[int]: Indices of the identifiers to be searched.
Optional[list[Optional[list[Molecule]]]]: Results from the cache search.

_init_session(pool_connections: int | None = None, pool_maxsize: int | None = None) → None¶

Initialize HTTP sessions for making requests.

Sets up two sessions: a general session and a specific session for CompTox. Configures connection pooling and SSL contexts for these sessions.

Args:

pool_connections (Optional[int]): The number of connection pools to cache. If None, it’s set to twice the number of available services.

pool_maxsize (Optional[int]): The maximum number of connections to save in the pool. If None, it’s set to 10. The minimum value is always 10.

Notes:

This method is idempotent; it will not reinitialize existing sessions.
The CompTox session uses a custom SSL context to handle specific SSL requirements.

_resilient_request(url: str, kwargs: dict[str, Any] | None = None, request_type: str | None = 'get', accepted_status_codes: list[int] = [200], rejected_status_codes: list[int] = [404], offline_status_codes: list[int] = [], max_retries: int | None = 10, sleep_time: int | float = 2, allow_redirects: bool | None = False, json: str | None = None, return_response: bool | None = False) → str | None¶

Make a resilient HTTP request with retry logic.

Attempts to make an HTTP request, handling various error conditions and retrying the request if necessary.

Args:

url (str): The URL to send the request to.

kwargs (Optional[dict[str, Any]]): Additional keyword arguments for the request.

request_type (Optional[str]): The type of HTTP request (‘get’ or ‘post’). Defaults to ‘get’.

accepted_status_codes (list[int]): List of HTTP status codes to accept. Defaults to [200].

rejected_status_codes (list[int]): List of HTTP status codes to reject. Defaults to [404].

max_retries (Optional[int]): Maximum number of retry attempts. Defaults to 10.

sleep_time (Union[int, float]): Time to sleep between retries in seconds. Defaults to 2.

allow_redirects (Optional[bool]): Whether to allow URL redirection. Defaults to False.

json (Optional[str]): JSON data to send in the request body. Defaults to None.

return_response (Optional[bool]): If True, return the full response object instead of the text. Defaults to False.

Returns:

Optional[str]: The response text if successful, or None if the request failed.

Raises:

ValueError: If an invalid request_type is provided.

requests.exceptions.ConnectionError: If connection errors persist after maximum retries.

Notes:

Automatically sets a user agent if not provided in the headers.
Uses different sessions for CompTox and other services.
Implements exponential backoff for retries.
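The retry logic can be sketched without any network access by passing in a callable that stands for "send the request and return the status code". This is a simplified illustration, not the method itself: `resilient_call_sketch`, the injected `sleep` parameter, and the exact backoff formula (`sleep_time * 2 ** attempt`) are assumptions, and the sketch returns None after the last retry instead of re-raising a connection error.

```python
import time
from typing import Callable, Optional


def resilient_call_sketch(
    send: Callable[[], int],
    accepted_status_codes=(200,),
    rejected_status_codes=(404,),
    max_retries: int = 10,
    sleep_time: float = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call `send` until it returns an accepted or rejected status code."""
    for attempt in range(max_retries):
        try:
            status = send()
        except ConnectionError:
            status = None
        if status in accepted_status_codes:
            return status
        if status in rejected_status_codes:
            return None  # a definitive "not found"; retrying will not help
        # Exponential backoff: sleep_time, 2 * sleep_time, 4 * sleep_time, ...
        sleep(sleep_time * 2 ** attempt)
    return None


responses = iter([503, 503, 200])
waits = []  # record the requested delays instead of actually sleeping
result = resilient_call_sketch(lambda: next(responses), sleep=waits.append)
rejected = resilient_call_sketch(lambda: 404, sleep=waits.append)
```

Injecting `sleep` keeps the example fast and makes the delay schedule easy to inspect.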

try_disconnect_more_metals(SMILES: str) → str¶

Attempt to disconnect additional metal atoms in a molecule represented by a SMILES string.

This method performs a more extensive metal disconnection than the standard RDKit metal disconnector. It covers more metals and handles specific cases such as mercury (Hg) correctly; see https://github.com/rdkit/rdkit/discussions/6729. If the Hg behavior turns out to be an RDKit bug, this method should be replaced by the metal disconnector from RDKit.

Args:

SMILES (str): The input SMILES string representing the molecule.

Returns:

str: The SMILES string of the molecule after attempting to disconnect metals. If no changes are made, the original SMILES string is returned.

Notes:

Uses caching to improve performance for repeated calls with the same input.
Employs a custom SMARTS pattern to identify metal-nonmetal bonds.
If the input SMILES is invalid or no metal disconnection is possible, the original SMILES is returned.
Particularly useful for handling cases where the standard RDKit metal disconnector may not be sufficient or may have known issues (e.g., with mercury compounds).

Standardize a SMILES string representation of a molecule.

Applies various standardization procedures to a given SMILES string, including metal disconnection, normalization, reionization, uncharging, stereochemistry assignment, and atom mapping number removal.

Args:

SMILES (str): The input SMILES string to be standardized.

disconnect_metals (Optional[bool]): Whether to disconnect metals. Defaults to None.

normalize (Optional[bool]): Whether to normalize the molecule. Defaults to None.

reionize (Optional[bool]): Whether to reionize the molecule. Defaults to None.

uncharge (Optional[bool]): Whether to uncharge the molecule. Defaults to None.

try_assign_sterochemistry (Optional[bool]): Whether to attempt stereochemistry assignment. Defaults to None.

remove_atom_mapping_number (Optional[bool]): Whether to remove atom mapping numbers. Defaults to None.

Returns:

Optional[str]: The standardized SMILES string, or None if standardization fails.

Notes:

Uses caching to improve performance for repeated calls with the same input.
The standardization process uses the RDKit library for molecular operations.

convert_zwitterion_to_sulfynil(mol: rdkit.Chem.Mol) → rdkit.Chem.Mol¶

Converts the zwitterionic form [S+][O-] in a given RDKit molecule to the sulfinyl group O=S. Matches are found via SMARTS, and the modifications are applied directly to the molecule.

Args:

mol (rdkit.Chem.Mol): The RDKit molecule object to be modified.

Returns:

rdkit.Chem.Mol: The modified RDKit molecule object, with the zwitterionic form converted to the sulfinyl group.

Standardize an RDKit molecule object.

Applies various standardization procedures to a given RDKit molecule, including metal disconnection, normalization, reionization, uncharging, stereochemistry assignment, and atom mapping number removal.

Args:

mol (Chem.rdchem.Mol): The input RDKit molecule to be standardized.

disconnect_metals (Optional[bool]): Whether to disconnect metals. Defaults to None.

normalize (Optional[bool]): Whether to normalize the molecule. Defaults to None.

reionize (Optional[bool]): Whether to reionize the molecule. Defaults to None.

uncharge (Optional[bool]): Whether to uncharge the molecule. Defaults to None.

try_assign_sterochemistry (Optional[bool]): Whether to attempt stereochemistry assignment. Defaults to None.

remove_atom_mapping_number (Optional[bool]): Whether to remove atom mapping numbers. Defaults to None.

fix_rdkit_normalization_exceptions (Optional[bool]): Whether to fix the normalization issues of RDKit (e.g., converting the zwitterionic form [S+][O-] to the sulfinyl group O=S). Defaults to True.

Returns:

Optional[Chem.rdchem.Mol]: The standardized RDKit molecule, or None if standardization fails.

Raises:

Warning: If reionization step cannot be performed.

Notes:

If any standardization option is None, it defaults to the values set on class creation or default values.
For molecules with multiple fragments, each fragment is standardized separately.
Special handling is implemented for certain molecules (e.g., DMSO) during normalization: the zwitterionic form [S+][O-] is converted to the sulfinyl group O=S. This is controlled by the fix_rdkit_normalization_exceptions flag.
Uses RDKit’s standardization tools (e.g., MetalDisconnector, Normalize, Reionizer).

has_isotopes(mol: rdkit.Chem.rdchem.Mol)¶

Check if the molecule contains any isotopes.

Examines each atom in the given molecule to determine if any of them have a non-zero isotope value.

Args:

mol (Chem.rdchem.Mol): The input RDKit molecule to check for isotopes.

Returns:

bool: True if the molecule contains at least one isotope, False otherwise.

Notes:

Uses RDKit’s GetIsotope() function to check each atom’s isotope value.
An isotope value of 0 indicates the most common isotope for that element.

remove_isotopes(mol: rdkit.Chem.rdchem.Mol) → rdkit.Chem.rdchem.Mol¶

Remove all isotope information from the molecule.

Sets the isotope value of all atoms in the molecule to 0, effectively removing any isotope information.

Args:

mol (Chem.rdchem.Mol): The input RDKit molecule from which to remove isotopes.

Returns:

Chem.rdchem.Mol: A new RDKit molecule with all isotope information removed and explicit hydrogens removed.

Notes:

Setting an atom’s isotope to 0 indicates the most common isotope for that element.
Modifies the input molecule in-place before returning a new molecule with explicit hydrogens removed.
Removing explicit hydrogens can change the molecule’s representation but not its chemical identity.

_check_and_flatten_identifiers_and_modes(identifiers: str | list[str] | list[list[str]], modes: str | list[str]) → tuple[list[str], list[str], list[str], list[str], str]¶

Validate and flatten the input identifiers and modes.

Processes the input identifiers and modes to ensure they are in the correct format and flattens nested structures. It also performs validation checks on the inputs.

Args:

identifiers (Union[str, list[str], list[list[str]]]): The input identifiers, which can be a single string, a list of strings, or a list of lists of strings.

modes (Union[str, list[str]]): The input modes, which can be a single string or a list of strings.

Returns:

tuple[list[str], list[str], list[str], list[str], str]: A tuple containing:

list[str]: Flattened list of identifiers.
list[str]: Flattened list of modes.
list[str]: List of unique identifiers.
list[str]: List of unique modes.
str: A string representation of the unique modes.

Raises:

TypeError: If an identifier is not of type int, str, list, or tuple.

ValueError: If the number of modes doesn’t match the number of identifier groups.

Notes:

Handles various input formats and normalizes them for further processing.
Ensures that each identifier has a corresponding mode.
Strips whitespace from identifiers and converts them to strings.
Converts modes to lowercase for consistency.
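The flattening step can be pictured with the sketch below, which pairs each identifier group with its mode and expands nested groups so that every identifier ends up with a mode. It is a reduced illustration under assumptions: `flatten_identifiers_and_modes_sketch` is a hypothetical name, it returns only the two flattened lists (not the unique lists or the mode string), and the error messages are invented for the example.

```python
from typing import List, Sequence, Tuple, Union

Identifiers = Union[str, Sequence[Union[str, Sequence[str]]]]


def flatten_identifiers_and_modes_sketch(
    identifiers: Identifiers, modes: Union[str, Sequence[str]]
) -> Tuple[List[str], List[str]]:
    if isinstance(identifiers, str):
        identifiers = [identifiers]
    if isinstance(modes, str):
        modes = [modes]
    if len(identifiers) != len(modes):
        raise ValueError("The number of modes must match the number of identifier groups.")

    flattened_identifiers: List[str] = []
    flattened_modes: List[str] = []
    for group, mode in zip(identifiers, modes):
        if not isinstance(group, (int, str, list, tuple)):
            raise TypeError("An identifier must be an int, str, list, or tuple.")
        members = group if isinstance(group, (list, tuple)) else [group]
        for identifier in members:
            flattened_identifiers.append(str(identifier).strip())
            flattened_modes.append(mode.lower())  # every identifier gets a mode
    return flattened_identifiers, flattened_modes


ids, mds = flatten_identifiers_and_modes_sketch([["ethanol ", "EtOH"], "64-17-5"], ["Name", "CAS"])
```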

_expand_identifier_mode_pairs(flattened_identifiers: list[str], flattened_modes: list[str], search_strategy: SearchStrategy) → list[tuple[str, str]]¶: Expand identifier/mode pairs for single-molecule search strategies.

_resolve_single_service_candidate(service: str, identifier: str, mode: str, required_formula: str | None, required_charge: int | None, required_structure_type: str | None) → dict[str, Any] | None¶: Backwards-compatible wrapper for single-pair adapter resolution.

_is_list_of_list_of_str(value: list[list[str]]) → bool¶

Check if the input is a valid list of lists of strings.

Verifies that the input value is a list containing only lists, and that each nested list contains only string elements.

Args:

value (list[list[str]]): The input to be checked.

Returns:

bool: True if the input is a valid list of lists of strings, False otherwise.

Notes:

Performs a two-level deep check on the input structure.
First ensures that all elements of the outer list are themselves lists.
Then checks that all elements within each nested list are strings.
An empty list or a list containing empty lists will return True.
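The two-level check, including the empty-list edge cases from the notes above, fits in a few lines. `is_list_of_list_of_str_sketch` is a hypothetical standalone version written for illustration; the initial `isinstance(value, list)` guard is an assumption about how non-list input is treated.

```python
from typing import Any


def is_list_of_list_of_str_sketch(value: Any) -> bool:
    if not isinstance(value, list):
        return False
    # all() over an empty iterable is True, which covers [] and [[]].
    return all(
        isinstance(inner, list) and all(isinstance(item, str) for item in inner)
        for inner in value
    )


valid = is_list_of_list_of_str_sketch([["ethanol", "EtOH"], ["water"]])
mixed = is_list_of_list_of_str_sketch([["ethanol"], "water"])
```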

_check_parameters(*, modes=None, services=None, identifiers=None, required_formulas=None, required_charges=None, required_structure_types=None, context='get_molecule')¶

Validate input parameters for molecule retrieval and processing.

Performs extensive validation on various input parameters used in molecule retrieval and processing operations. It checks for type correctness, value validity, and consistency across different parameters based on the context of the operation.

Args:

modes (Optional[Union[str, list[str]]]): The mode(s) of molecule identification.

services (Optional[Union[str, list[str]]]): The service(s) to be used for retrieval.

identifiers (Optional[Union[str, list[str], list[list[str]]]]): The molecule identifier(s).

required_formulas (Optional[Union[str, list[str]]]): Required chemical formula(s).

required_charges (Optional[Union[str, int, list[Union[str, int]]]]): Required charge(s).

required_structure_types (Optional[Union[str, list[str]]]): Required structure type(s).

context (str): The context of the operation. Defaults to “get_molecule”.

Raises:

TypeError: If any parameter is of an incorrect type.

ValueError: If any parameter has an invalid value or if there are inconsistencies between parameters.

Notes:

The method’s behavior varies based on the ‘context’ parameter.
For ‘get_molecule’ and ‘get_molecules_batch’ contexts, it expects single values for modes and services.
For ‘batch’ and ‘find_single’ contexts, it expects lists for modes and services.
Checks for compatibility between modes and services.
Performs specific validations for charges, structure types, and formulas.
Ensures consistency in the lengths of certain parameters when applicable.

replace_non_printable_characters(string_input: str) → str¶

Replaces whitespace variants with a plain space and removes all non-printable characters from the given string.

This function first replaces the different whitespace types with a simple space. It then iterates through each character of the input string and keeps only the printable ones, which removes non-printable characters such as control characters and certain Unicode characters.

Args:

string_input (str): The input string to be processed.

Returns:

str: A new string containing only the printable characters from the input.
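The two steps described above map directly onto `str.isspace()` and `str.isprintable()`. The following is a sketch of that idea, assuming these built-in predicates define "whitespace" and "printable"; the library may use a different definition, and `replace_non_printable_characters_sketch` is a hypothetical name.

```python
def replace_non_printable_characters_sketch(string_input: str) -> str:
    # Step 1: map every whitespace character (tab, newline, no-break space, ...) to " ".
    spaced = "".join(" " if ch.isspace() else ch for ch in string_input)
    # Step 2: drop whatever is still non-printable, such as control characters.
    return "".join(ch for ch in spaced if ch.isprintable())


# Tab, NUL, zero-width space, and no-break space are all normalized or removed.
cleaned = replace_non_printable_characters_sketch("2-propanol\t(IPA)\x00\u200b\u00a0pure")
```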

Clean and standardize a chemical name string.

Processes a chemical name to standardize its format, remove unwanted characters, and optionally convert special characters or prepare it for use in filenames.

Args:

chemical_name (str): The chemical name to be cleaned.

normalize (Optional[bool]): If True, normalizes Unicode characters. Defaults to True.

unescape_html (Optional[bool]): If True, unescapes HTML entities. Defaults to True.

spell_out_greek_characters (Optional[bool]): If True, replaces Greek letters with their spelled-out names. Defaults to False.

for_filename (Optional[bool]): If True, prepares the name for use as a filename by removing non-alphanumeric characters. Defaults to False.

Returns:

str: The cleaned and standardized chemical name.

Notes:

Strips leading and trailing whitespace from the input.
Replaces various special characters with standardized alternatives.
Optionally unescapes HTML entities (e.g., ‘&#39;’ to ‘'’).
Can spell out Greek letters (e.g., ‘α’ to ‘alpha’).
Normalizes Unicode characters to their closest ASCII representation if normalize is True.
When preparing for filenames, removes all non-alphanumeric characters and converts to lowercase.
This method is cached for performance optimization.

Example:

>>> clean_chemical_name("α-Pinene", spell_out_greek_characters=True)
'alpha-Pinene'
>>> clean_chemical_name("Sodium chloride, ≥99%", for_filename=True)
'sodiumchloride99'

filter_and_sort_synonyms(synonyms: list[str], number_of_synonyms_to_take: int | None = 5, strict: bool | None = False) → list[str]¶

Filter and sort a list of synonyms based on various criteria.

Processes a list of synonyms, applying filters to remove unwanted entries and sorting the results to return the most relevant synonyms.

Args:

synonyms (list[str]): A list of synonym strings to process.

number_of_synonyms_to_take (Optional[int]): The maximum number of synonyms to return. Defaults to 5.

strict (Optional[bool]): If True, returns an empty list when no synonyms pass the filters. If False, returns [“Noname”] when no synonyms pass the filters. Defaults to False.

Returns:

list[str]: A list of filtered and sorted synonyms, with a maximum length of number_of_synonyms_to_take.

Notes:

Synonyms are split on ‘|’ characters and each part is processed separately.
Various heuristics are applied to filter out unwanted synonyms, including:
- Rejecting synonyms containing specific tokens (e.g., “SCHEMBL”, “AKOS”, etc.)
- Filtering based on character case and specific regex patterns.
Attempts to standardize the format of certain types of synonyms.
If no synonyms pass the filters and strict is False, returns [“Noname”].
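A reduced sketch of the filtering flow is shown below. It covers only what the notes above state explicitly: splitting on ‘|’, rejecting entries that contain “SCHEMBL” or “AKOS”, and the “Noname” fallback. The case-insensitive de-duplication and the choice to keep input order instead of applying the library's sorting heuristics are assumptions, and `filter_synonyms_sketch` is a hypothetical name.

```python
from typing import List, Optional

REJECTED_TOKENS = ("SCHEMBL", "AKOS")  # the two tokens named in the notes above


def filter_synonyms_sketch(
    synonyms: List[str], number_of_synonyms_to_take: Optional[int] = 5, strict: bool = False
) -> List[str]:
    kept: List[str] = []
    seen = set()
    for synonym in synonyms:
        for part in synonym.split("|"):
            part = part.strip()
            if not part or any(token in part.upper() for token in REJECTED_TOKENS):
                continue
            if part.lower() not in seen:  # case-insensitive de-duplication
                seen.add(part.lower())
                kept.append(part)
    if not kept:
        return [] if strict else ["Noname"]
    return kept[:number_of_synonyms_to_take]


result = filter_synonyms_sketch(["ethanol|Ethyl alcohol", "SCHEMBL1234", "Ethanol", "AKOS000269045"])
fallback = filter_synonyms_sketch(["SCHEMBL1234"])
strict_fallback = filter_synonyms_sketch(["SCHEMBL1234"], strict=True)
```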

Expand and standardize a chemical name using various heuristics.

Applies a series of transformations to standardize and potentially expand a given chemical name.

Args:

name (str): The chemical name to process.

prefixes_to_delete (Optional[list[str]]): Prefixes to remove from the name.

suffixes_to_use_as_prefix (Optional[list[str]]): Suffixes to move to the beginning of the name as prefixes.

suffixes_to_delete (Optional[list[str]]): Suffixes to remove from the name.

parts_to_delete (Optional[list[str]]): Specific parts to remove from the name.

maps_to_replace (Optional[dict[str, str]]): A dictionary of string replacements to apply to the name.

Returns:

str: The processed chemical name after applying all specified transformations.

Notes:

Unescapes HTML entities in the name.
Handles various formatting issues, such as reversing comma-separated parts.
Default lists are provided for prefixes, suffixes, and parts to delete if not specified.
Applies a series of regex-based transformations to standardize the name format.
Special handling is implemented for stereochemistry indicators (e.g., cis/trans, E/Z).

filter_and_sort_CAS(synonyms: list[str]) → list[str]¶

Filter and sort CAS (Chemical Abstracts Service) registry numbers from a list of synonyms.

Processes a list of synonyms to extract valid CAS registry numbers, remove duplicates, and sort them by frequency and by heuristic rules.

Args:

synonyms (list[str]): A list of strings that may contain CAS registry numbers.

Returns:

list[str]: A list of unique, valid CAS registry numbers, sorted by frequency and heuristic rules. If no valid CAS numbers are found, returns an empty list.

Notes:

CAS numbers are validated using the is_valid_CAS method.
Duplicate CAS numbers are removed, keeping only unique entries.
If all CAS numbers in the input are identical, only one is returned.
If multiple unique CAS numbers occur with equal frequency, they are sorted by the numeric value of their first segment, so the CAS number with the lowest first segment is returned first.
If CAS numbers have different frequencies, they are sorted by frequency (most common first).
Whitespace is stripped from each synonym before processing.
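The validation and ordering rules above can be sketched as follows. The regular expression is the one documented as `CAS_regex`, and the check-digit rule is the standard CAS checksum (digits weighted 1, 2, 3, ... from right to left, summed modulo 10). The function names are hypothetical, and the sketch only accepts strings that are a CAS number in their entirety, which may be stricter than the library's `is_valid_CAS`.

```python
import re
from collections import Counter
from typing import List

CAS_REGEX = re.compile(r"(\d{2,7}-\d{2}-\d)")  # pattern documented as CAS_regex


def is_valid_cas_sketch(cas: str) -> bool:
    if not CAS_REGEX.fullmatch(cas):
        return False
    digits = cas.replace("-", "")
    body, check_digit = digits[:-1], int(digits[-1])
    # Weight the digits 1, 2, 3, ... from right to left and compare modulo 10.
    checksum = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return checksum % 10 == check_digit


def filter_and_sort_cas_sketch(synonyms: List[str]) -> List[str]:
    candidates = [s.strip() for s in synonyms]
    counts = Counter(c for c in candidates if is_valid_cas_sketch(c))
    # Most frequent first; ties go to the smaller first segment.
    return sorted(counts, key=lambda c: (-counts[c], int(c.split("-")[0])))


ranked = filter_and_sort_cas_sketch(
    ["ethanol", " 64-17-5", "7732-18-5", "64-17-5", "64-17-6", "67-56-1"]
)
```

In this input, 64-17-5 appears twice and therefore ranks first, 64-17-6 fails the checksum and is dropped, and the remaining two single-occurrence numbers are ordered by their first segment.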

combine_molecules(grouped_SMILES: str, molecules: list[moleculeresolver.molecule.Molecule]) → moleculeresolver.molecule.Molecule | None¶

Combine multiple Molecule objects into a single Molecule.

Merges information from multiple Molecule objects that represent the same chemical structure.

Args:

grouped_SMILES (str): The SMILES representation of the combined molecule.

molecules (list[Molecule]): A list of Molecule objects to combine.

Returns:

Optional[Molecule]: A new Molecule object combining information from all input molecules, or None if combination fails.

Raises:

ValueError: If molecules have different structures or identifiers when from the same service.

Notes:

Combines synonyms, CAS numbers, and other information from all input molecules.
Prioritizes CAS numbers from official registry if available.
Handles cases where molecules are from the same or different services.

group_molecules_by_structure(molecules: list[moleculeresolver.molecule.Molecule], group_also_by_services: bool | None = True, differentiate_isomers: bool | None = None, differentiate_tautomers: bool | None = None, differentiate_isotopes: bool | None = None, check_for_resonance_structures: bool | None = None) → dict[str, list[moleculeresolver.molecule.Molecule]]¶

Group a list of Molecule objects by their structural similarity.

Organizes molecules into groups based on their structural equivalence, optionally considering the services they come from.

Args:

molecules (list[Molecule]): A list of Molecule objects to group.

group_also_by_services (Optional[bool]): If True, groups molecules by both structure and service. Defaults to True.

Returns:

dict[str, list[Molecule]]: A dictionary where keys are SMILES strings and values are lists of structurally equivalent Molecule objects.

Raises:

ValueError: If input contains molecules from different sources that were previously combined.

Notes:

Uses structural comparison to group molecules.
Can handle molecules from primary sources (not previously combined).
When all molecules are from the same service and mode, it may select the most common structure as the representative.

filter_molecules(molecules: list[moleculeresolver.molecule.Molecule], required_formula: str | None = None, required_charge: int | str | None = None, required_structure_type: str | None = None) → list[moleculeresolver.molecule.Molecule]¶

Filter a list of Molecule objects based on specified criteria.

Filters molecules based on their chemical formula, charge, and structure type.

Args:

molecules (list[Molecule]): A list of Molecule objects to filter.

required_formula (Optional[str]): The required chemical formula.

required_charge (Optional[Union[int, str]]): The required molecular charge.

required_structure_type (Optional[str]): The required structure type.

Returns:

list[Molecule]: A list of Molecule objects that meet the specified criteria.

Notes:

Filters out molecules with invalid or None SMILES.
Applies standardization to SMILES strings of valid molecules.
For ‘salt’ structure type, attempts to disconnect more metals if initial check fails.
Uses the check_SMILES method to validate molecules against criteria.

Filter and combine a list of molecules based on specified criteria.

Processes a list of molecules (or a single molecule) by applying filters and then combining the filtered results into a single molecule.

Args:

molecules (Union[list[Molecule], Molecule]): A list of Molecule objects or a single Molecule object to process.

required_formula (Optional[str]): The required chemical formula to filter by. Defaults to None.

required_charge (Optional[Union[int, str]]): The required molecular charge to filter by. Defaults to None.

required_structure_type (Optional[str]): The required structure type to filter by. Defaults to None.

Returns:

Optional[Molecule]: A single Molecule object that results from filtering and combining the input molecules. Returns None if no molecules pass the filtering process or if the combination process fails.

Notes:

If a single Molecule is provided, it’s converted to a list for processing.
First filters the molecules using filter_molecules.
Filtered molecules are then grouped by structure using group_molecules_by_structure.
For each group of structurally similar molecules, combine_molecules is called.
If only one final molecule remains after combining, it is returned.
If no molecules pass the filters or if multiple distinct molecules remain after combining, the method returns None.
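
The steps above can be sketched as plain control flow. This is a minimal illustration, not the library's implementation: `filter_and_combine_sketch`, `group_identical`, and the `keep`, `group`, and `combine` callables are hypothetical stand-ins for the filtering, grouping, and combining methods named in the notes, and plain strings stand in for Molecule objects.

```python
from typing import Any, Callable, Optional


def group_identical(items: list) -> list[list]:
    """Hypothetical grouping step: bucket items that compare equal."""
    buckets: dict[Any, list] = {}
    for item in items:
        buckets.setdefault(item, []).append(item)
    return list(buckets.values())


def filter_and_combine_sketch(
    molecules: Any,
    keep: Callable[[Any], bool],
    group: Callable[[list], list[list]],
    combine: Callable[[list], Any],
) -> Optional[Any]:
    # A single molecule is wrapped in a list before processing.
    if not isinstance(molecules, list):
        molecules = [molecules]
    # Step 1: filter.
    filtered = [m for m in molecules if keep(m)]
    if not filtered:
        return None
    # Steps 2 and 3: group by structure, then combine each group.
    combined = [combine(g) for g in group(filtered)]
    # Step 4: only an unambiguous result is returned.
    return combined[0] if len(combined) == 1 else None
```

With these stand-ins, two identical inputs collapse to one result, while two distinct inputs yield None because the outcome is ambiguous.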

formula_to_dictionary(formula: str, allow_duplicate_atoms: bool | None = True) → dict[str, int | float]¶

Convert a chemical formula string to a dictionary of element counts.

Parses a chemical formula and returns a dictionary where keys are element symbols and values are their respective counts in the formula.

Args:

formula (str): The chemical formula to parse.

allow_duplicate_atoms (Optional[bool]): If True, allows multiple occurrences of the same element in the formula. If False, raises an error for duplicate elements. Defaults to True.

Returns:

dict[str, Union[int, float]]: A dictionary where keys are element symbols and values are their counts. Values are integers if all counts are whole numbers, otherwise floats.

Raises:

ValueError: If allow_duplicate_atoms is False and an element appears more than once.

Notes:

Supports nested parentheses, square brackets, and curly braces in formulas.
Handles hydrates and solvates (formulas connected with ‘*’).
Processes fractional counts (e.g., 0.5) and converts to integers when possible.
Uses recursion to handle nested structures in the formula.
Caches results for improved performance on repeated calls with the same formula.
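
To illustrate the recursive parsing described in the notes, here is a simplified sketch. It is an assumption-laden stand-in, not the library's parser: the names `formula_to_dictionary_sketch`, `_parse_group`, and `_read_number` are hypothetical, caching and the allow_duplicate_atoms option are omitted, and unbalanced brackets are not reported.

```python
import re
from collections import Counter

_OPEN, _CLOSE = "([{", ")]}"
_NUMBER = re.compile(r"\d+(?:\.\d+)?")
_ELEMENT = re.compile(r"[A-Z][a-z]?")


def _read_number(text, pos):
    """Return (value, new_pos); the count defaults to 1 when none is written."""
    match = _NUMBER.match(text, pos)
    if match is None:
        return 1.0, pos
    return float(match.group()), match.end()


def _parse_group(text, pos):
    """Parse until a closing bracket or the end; return (counts, new_pos)."""
    counts = Counter()
    while pos < len(text):
        char = text[pos]
        if char in _OPEN:
            # Recurse into the bracketed group, then apply its multiplier.
            inner, pos = _parse_group(text, pos + 1)
            factor, pos = _read_number(text, pos + 1)  # +1 skips the closer
            for element, n in inner.items():
                counts[element] += n * factor
        elif char in _CLOSE:
            return counts, pos
        else:
            match = _ELEMENT.match(text, pos)
            if match is None:
                raise ValueError(f"Unexpected character {char!r} in formula")
            n, pos = _read_number(text, match.end())
            counts[match.group()] += n
    return counts, pos


def formula_to_dictionary_sketch(formula):
    total = Counter()
    # Hydrates and solvates: each '*'-separated part may carry a multiplier.
    for part in formula.replace(" ", "").split("*"):
        factor, start = _read_number(part, 0)
        counts, _ = _parse_group(part, start)
        for element, n in counts.items():
            total[element] += n * factor
    # Integers if all counts are whole numbers, otherwise floats.
    if all(n.is_integer() for n in total.values()):
        return {element: int(n) for element, n in total.items()}
    return dict(total)
```

For instance, `formula_to_dictionary_sketch("CuSO4*5H2O")` counts the five waters into the totals for H and O.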

to_hill_formula(mol_or_dictionary: rdkit.Chem.rdchem.Mol | dict[str, int | float]) → str¶

Convert a molecule or element dictionary to a Hill formula string.

Generates a Hill formula representation of a molecule, either from an RDKit molecule object or a dictionary of element counts.

Args:

mol_or_dictionary (Union[Chem.rdchem.Mol, dict[str, Union[int, float]]]): Either an RDKit molecule object or a dictionary where keys are element symbols and values are their counts.

Returns:

str: The Hill formula string representation of the molecule.

Notes:

If input is an RDKit molecule, hydrogens are added explicitly before processing.
The Hill system orders elements as follows:
1. Carbon (C) always comes first if present.
2. Hydrogen (H) comes second if carbon is present.
3. All other elements follow in alphabetical order.
For elements with a count of 1, the number is omitted in the formula.
Handles both integer and float counts, though typically counts are integers.
If a count is a float but effectively an integer (e.g., 2.0), it’s treated as an integer.

Example:

>>> mol = Chem.MolFromSmiles("CCO")
>>> to_hill_formula(mol)
'C2H6O'
>>> to_hill_formula({"C": 2, "H": 6, "O": 1})
'C2H6O'
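
The ordering rules in the notes can also be written out in a few lines for the dictionary case. This is a sketch under the assumption that the input is an element-count dictionary; `hill_formula_sketch` is a hypothetical name and the RDKit branch is not covered.

```python
def hill_formula_sketch(counts):
    def fmt(element):
        n = counts[element]
        # A float that is effectively an integer (e.g. 2.0) is shown as one.
        if float(n).is_integer():
            n = int(n)
        # A count of 1 is omitted.
        return element if n == 1 else f"{element}{n}"

    elements = sorted(counts)
    if "C" in counts:
        # Carbon first, hydrogen second, everything else alphabetically.
        leading = ["C"] + (["H"] if "H" in counts else [])
        elements = leading + [e for e in elements if e not in ("C", "H")]
    return "".join(fmt(e) for e in elements)
```

Without carbon, hydrogen is sorted alphabetically with the other elements, so a dictionary for water gives H before O, and sodium chloride gives Cl before Na.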

check_formula(mol_or_formula: rdkit.Chem.rdchem.Mol | str, required_formula: str | None, differentiate_deuterium: bool = True) → bool¶

Check if a molecule or formula matches a required chemical formula.

Compares the chemical formula of a given molecule or formula string against a required formula.

Args:

mol_or_formula (Union[Chem.rdchem.Mol, str]): The molecule (as an RDKit Mol object) or a chemical formula string to check.

required_formula (Optional[str]): The required chemical formula to match against. If None, the check always returns True.

differentiate_deuterium (bool): Whether to treat deuterium as distinct from hydrogen. If False, deuterium is counted as hydrogen, so that, e.g., CDCl3 matches CHCl3. Defaults to True.

Returns:

bool: True if the molecule or formula matches the required formula, False otherwise.

Notes:

If required_formula is None, the method always returns True.
For RDKit Mol objects, the molecule is standardized before calculating its formula.
The comparison is done by converting both the input and required formulas to dictionaries using formula_to_dictionary.
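
The dictionary comparison, including the deuterium option, might look like the sketch below. It assumes deuterium appears under the key "D" in the element dictionaries, which is an assumption for illustration; `formulas_match_sketch` is a hypothetical name and works on dictionaries rather than on Mol objects or formula strings.

```python
def formulas_match_sketch(counts, required_counts, differentiate_deuterium=True):
    # No requirement means the check always passes.
    if required_counts is None:
        return True

    def normalise(element_counts):
        element_counts = dict(element_counts)  # do not mutate the caller's dict
        if not differentiate_deuterium and "D" in element_counts:
            # Fold deuterium into hydrogen before comparing.
            element_counts["H"] = element_counts.get("H", 0) + element_counts.pop("D")
        return element_counts

    return normalise(counts) == normalise(required_counts)
```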

check_charge(mol: rdkit.Chem.rdchem.Mol, required_charge: int | str | None) → bool¶

Check if a molecule’s charge matches a required charge.

Compares the formal charge of a given molecule against a required charge.

Args:

mol (Chem.rdchem.Mol): The molecule (as an RDKit Mol object) to check.

required_charge (Optional[Union[int, str]]): The required charge to match against. Can be an integer or one of the strings: “zero”, “non_zero”, “positive”, “negative”. If None, the check always returns True.

Returns:

bool: True if the molecule’s charge matches the required charge, False otherwise.

Notes:

If required_charge is None, the method always returns True.
Uses RDKit’s GetFormalCharge to calculate the molecule’s charge.
String inputs for required_charge allow for more flexible charge requirements:
“zero”: Charge must be exactly 0.
“non_zero”: Charge must not be 0.
“positive”: Charge must be greater than 0.
“negative”: Charge must be less than 0.
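
As an illustration of the charge rules, the sketch below applies them to a plain integer charge instead of an RDKit molecule. `charge_matches_sketch` is a hypothetical name, and raising ValueError for an unknown string is an assumption; the source does not say how such input is handled.

```python
def charge_matches_sketch(charge, required_charge):
    # No requirement means the check always passes.
    if required_charge is None:
        return True
    # An integer requirement must match exactly.
    if isinstance(required_charge, int):
        return charge == required_charge
    rules = {
        "zero": charge == 0,
        "non_zero": charge != 0,
        "positive": charge > 0,
        "negative": charge < 0,
    }
    if required_charge not in rules:
        # Assumption: unknown keywords are rejected.
        raise ValueError(f"Unknown charge requirement: {required_charge!r}")
    return rules[required_charge]
```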

check_molecular_mass(mol: rdkit.Chem.rdchem.Mol, required_molecular_mass: float | int, percentage_deviation_allowed: float | None = 0.001) → bool¶

Check if a molecule’s mass matches a required molecular mass within a specified tolerance.

Compares the calculated molecular mass of a given molecule against a required molecular mass, allowing for a specified percentage of deviation.

Args:

mol (Chem.rdchem.Mol): The molecule to check.

required_molecular_mass (Union[float, int]): The required molecular mass to match.

percentage_deviation_allowed (Optional[float]): Allowed percentage deviation between the calculated and required molecular mass. Defaults to 0.001 (0.1%).

Returns:

bool: True if the molecule’s mass is within the allowed deviation of the required mass, False otherwise.

Raises:

TypeError: If required_molecular_mass is not a float or an integer.

Notes:

Uses RDKit’s Descriptors.MolWt to calculate the molecule’s mass.
The comparison uses the absolute relative difference between the calculated and required mass, allowing for the specified percentage of deviation.
The minimum value between the required and calculated mass is used as the denominator when calculating the relative difference to handle cases where one of the values might be zero or very small.
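
The tolerance test described above can be written out as follows. This is a sketch that takes an already calculated mass instead of an RDKit molecule; `mass_within_tolerance_sketch` is a hypothetical name, both masses are assumed to be positive, and the use of `<=` rather than `<` is an assumption.

```python
def mass_within_tolerance_sketch(calculated_mass, required_molecular_mass,
                                 percentage_deviation_allowed=0.001):
    if not isinstance(required_molecular_mass, (int, float)):
        raise TypeError("required_molecular_mass must be a float or an integer")
    # The smaller of the two masses is the denominator of the relative difference.
    denominator = min(calculated_mass, required_molecular_mass)
    relative_difference = abs(calculated_mass - required_molecular_mass) / denominator
    return relative_difference <= percentage_deviation_allowed
```

With the default tolerance, 18.015 against a required 18.02 passes (relative difference of roughly 0.0003), while 18.015 against 18.5 does not.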

check_SMILES(SMILES: str, required_formula: str | None = None, required_charge: int | str | None = None, required_structure_type: str | None = None) → bool¶

Validate a SMILES string against specified requirements.

Checks if a given SMILES string represents a valid molecule and meets the optional requirements for formula, charge, and structure type.

Args:

SMILES (str): The SMILES string to validate.

required_formula (Optional[str]): The expected molecular formula. Defaults to None.

required_charge (Optional[Union[int, str]]): The expected molecular charge. Defaults to None.

required_structure_type (Optional[str]): The expected structure type. Defaults to None.

Returns:

bool: True if the SMILES string is valid and meets all specified requirements, False otherwise.

Notes:

Uses caching to improve performance for repeated calls with the same arguments.
Performs a quick validity check on the SMILES string.
If the SMILES string is valid, creates a molecule object for further checks.
Verifies the formula, charge, and structure type if respective requirements are provided.

get_structure_type_from_SMILES(SMILES: str) → str | None¶

Determine the structure type of a molecule from its SMILES representation.

Analyzes the SMILES string to classify the molecule into different structure types based on its composition and charge distribution.

Args:

SMILES (str): The SMILES representation of the molecule.

Returns:

Optional[str]: A string representing the structure type of the molecule. Possible values are:

“mixture_neutrals”: A mixture of neutral molecules.
“neutral”: A single neutral molecule.
“salt”: A salt (mixture of oppositely charged ions with total charge of 0).
“mixture_ions”: A mixture of ions with non-zero total charge.
“ion”: A single charged molecule.
“mixture_neutrals_salts”: A mixture of neutral molecules and salts.
“mixture_neutrals_ions”: A mixture of neutral molecules and ions.
None: If the SMILES cannot be parsed.

Notes:

Splits the SMILES on ‘.’ to handle mixtures.
Uses RDKit to calculate formal charges for each part of the SMILES.
Classification is based on the number of parts and their charges.
Returns None if any part of the SMILES fails to parse.
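
One possible reading of the classification, given the formal charge of each '.'-separated part, is sketched below. The decision rules are inferred from the descriptions of the return values and are an assumption, not the library's code; `classify_structure_sketch` is a hypothetical name, and parsing the SMILES into charges is left out.

```python
from typing import Optional, Sequence


def classify_structure_sketch(fragment_charges: Sequence[int]) -> Optional[str]:
    if not fragment_charges:
        return None
    # A single part is either a neutral molecule or an ion.
    if len(fragment_charges) == 1:
        return "neutral" if fragment_charges[0] == 0 else "ion"
    ionic = [charge for charge in fragment_charges if charge != 0]
    if not ionic:
        return "mixture_neutrals"
    has_neutrals = len(ionic) < len(fragment_charges)
    balanced = sum(ionic) == 0  # oppositely charged ions cancel out
    if has_neutrals:
        return "mixture_neutrals_salts" if balanced else "mixture_neutrals_ions"
    return "salt" if balanced else "mixture_ions"
```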

check_structure_type(SMILES: str, required_structure_type: str) → bool¶

Check if a molecule’s structure type matches a required structure type.

Compares the structure type of a molecule (determined from its SMILES) against a required structure type.

Args:

SMILES (str): The SMILES representation of the molecule to check.

required_structure_type (str): The required structure type to match against. Must be one of the following: “mixture_neutrals”, “mixture_ions”, “neutral”, “salt”, “ion”, “mixture_neutrals_salts”, “mixture_neutrals_ions”.

Returns:

bool: True if the molecule’s structure type matches the required type, False otherwise.

Raises:

ValueError: If required_structure_type is not one of the allowed values.

Notes:

If required_structure_type is None, the method always returns True.
Uses get_structure_type_from_SMILES to determine the actual structure type.
The comparison is case-sensitive and must match exactly.

InChI_to_SMILES(inchi: str) → str | None¶

Convert an InChI string to a standardized SMILES string.

Converts an InChI representation of a molecule to its standardized SMILES form.

Args:

inchi (str): The InChI string to convert.

Returns:

Optional[str]: The standardized SMILES string, or None if conversion fails.

Notes:

This method is cached to improve performance for repeated calls with the same InChI.
Uses get_from_InChI to convert InChI to a molecule object.
Returns None if the InChI cannot be converted to a molecule.

SMILES_to_InChI(smiles: str) → str | None¶

Convert a SMILES string to an InChI string.

Converts a SMILES (Simplified Molecular Input Line Entry System) string to its InChI (International Chemical Identifier) representation.

Args:

smiles (str): The SMILES string to convert.

Returns:

Optional[str]: The InChI string if conversion is successful, None otherwise.

Notes:

Uses get_from_SMILES to create an RDKit molecule object from the SMILES.
This method is cached for performance optimization.

get_from_InChI(inchi: str, addHs: bool | None = False) → rdkit.Chem.rdchem.Mol | None¶

Create an RDKit molecule object from an InChI string.

Attempts to create an RDKit molecule object from a given InChI string, with options for handling hydrogens.

Args:

inchi (str): The InChI string to convert.

addHs (Optional[bool]): If True, explicitly adds hydrogen atoms to the molecule. Defaults to False.

Returns:

Optional[Chem.rdchem.Mol]: An RDKit molecule object if conversion is successful, None otherwise.

Notes:

Returns None if the input InChI is None or an empty string.
First attempts to create the molecule with sanitization.
If that fails, attempts to create the molecule without sanitization.
Updates the property cache of the molecule, with fallback to non-strict update.
Calculates the Smallest Set of Smallest Rings (SSSR) for the molecule.
Optionally adds explicit hydrogens based on the addHs parameter.
This method is cached for performance optimization.

get_from_SMILES(SMILES: str, addHs: bool | None = False) → rdkit.Chem.rdchem.Mol | None¶

Create an RDKit molecule object from a SMILES string.

Attempts to create an RDKit molecule object from a given SMILES string, with options for handling multiple parts and hydrogens.

Args:

SMILES (str): The SMILES string to convert.

addHs (bool): If True, explicitly adds hydrogen atoms to the molecule. Defaults to False.

Returns:

Optional[Chem.rdchem.Mol]: An RDKit molecule object if conversion is successful, None otherwise.

Notes:

Returns None if the input SMILES is None.
Handles multi-part SMILES strings (separated by ‘.’) by combining them into a single molecule.
For single-part SMILES, attempts to create the molecule with sanitization first.
If sanitization fails, attempts creation without sanitization for certain elements.
Updates the property cache of the molecule, with fallback to non-strict update.
Calculates the Smallest Set of Smallest Rings (SSSR) for the molecule.
Optionally adds explicit hydrogens based on the addHs parameter.
This method is cached for performance optimization.

to_SMILES(mol: rdkit.Chem.rdchem.Mol, isomeric: bool | None = True, canonicalize_tautomer: bool | None = False, remove_isotopes: bool | None = False) → str¶

Convert an RDKit molecule object to a SMILES string.

Generates a SMILES string from an RDKit molecule object, with options for handling isomers, tautomers, and isotopes.

Args:

mol (Chem.rdchem.Mol): The RDKit molecule object to convert.

isomeric (Optional[bool]): If True, includes isomeric information in the SMILES. Defaults to True.

canonicalize_tautomer (Optional[bool]): If True, canonicalizes the tautomer before generating SMILES. Defaults to False.

remove_isotopes (Optional[bool]): If True, removes isotope information before generating SMILES. Defaults to False.

Returns:

str: The SMILES string representation of the molecule.

Notes:

If remove_isotopes is True, uses remove_isotopes method to strip isotope information.
If canonicalize_tautomer is True, standardizes the tautomeric form of the molecule.
Uses RDKit’s MolToSmiles function to generate the final SMILES string.

is_valid_SMILES_fast(SMILES: str) → bool¶

Quickly check if a string is a potentially valid SMILES representation.

Performs a fast, preliminary check on a string to determine if it could be a valid SMILES representation using regular expressions.

Args:

SMILES (str): The string to check.

Returns:

bool: True if the string passes the preliminary SMILES validity check, False otherwise.

Notes:

Returns False for None or non-string inputs.
Uses a regular expression for validation.
This is a fast check and may not catch all invalid SMILES strings.
The method is cached for performance optimization.
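
A regular-expression pre-check of this kind might look like the sketch below. The pattern is a hypothetical character whitelist chosen for illustration, not the expression the library uses, and `looks_like_smiles_sketch` is a hypothetical name. As the notes say, such a check only rules out obviously malformed input.

```python
import re

# Hypothetical whitelist: atom symbols, ring digits, brackets, bonds, charges,
# stereo markers, '%' for two-digit ring closures, '.' for disconnected parts.
_SMILES_CHARS = re.compile(r"[A-Za-z0-9@+\-\[\]\(\)\\/%=#$:.*]+")


def looks_like_smiles_sketch(text):
    # None and other non-string inputs are rejected up front.
    if not isinstance(text, str):
        return False
    return _SMILES_CHARS.fullmatch(text) is not None
```

A string such as "CCCC(" passes this whitelist even though it cannot be parsed, which is why a full parse is still needed for a definitive answer.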

is_valid_SMILES(SMILES: str) → bool¶

Thoroughly check if a string is a valid SMILES representation.

Performs a comprehensive check to determine if a string is a valid SMILES representation by attempting to parse it into a molecule.

Args:

SMILES (str): The string to check.

Returns:

bool: True if the string is a valid SMILES representation, False otherwise.

Notes:

First calls is_valid_SMILES_fast for a quick preliminary check.
If the fast check passes, attempts to convert the SMILES to a molecule using get_from_SMILES.
Returns True only if both the fast check passes and the conversion to a molecule succeeds.
This method is more thorough but slower than is_valid_SMILES_fast.
The method is cached for performance optimization.

is_valid_InChI(InChI: str) → bool¶

Check if a string is a valid InChI representation.

Determines if a string is a valid InChI representation by performing a format check and attempting to convert it to a molecule.

Args:

InChI (str): The string to check.

Returns:

bool: True if the string is a valid InChI representation, False otherwise.

Notes:

Returns False for None or non-string inputs.
Checks if the string matches the expected InChI format using a regular expression.
If the format check passes, attempts to convert the InChI to a molecule using RDKit.
Returns True only if both the format check passes and the conversion to a molecule succeeds.
The method is cached for performance optimization.

is_valid_InChIKey(InChIKey: str) → bool¶

Check if a string is a valid InChIKey representation.

Determines if a string is a valid InChIKey representation by performing a format check.

Args:

InChIKey (str): The string to check.

Returns:

bool: True if the string is a valid InChIKey representation, False otherwise.

Notes:

Returns False for None or non-string inputs.
Checks if the string matches the expected InChIKey format using a regular expression.
Returns True only if the format check passes; no conversion to a molecule is attempted.
The method is cached for performance optimization.
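
For reference, a format check for InChIKeys can be sketched with a single regular expression. The pattern below encodes the standard layout (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, 1 uppercase letter); whether the library uses exactly this expression is an assumption, and `is_inchikey_format_sketch` is a hypothetical name.

```python
import re

_INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey_format_sketch(text):
    # None and other non-string inputs are rejected up front.
    if not isinstance(text, str):
        return False
    return _INCHIKEY.fullmatch(text) is not None
```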

_get_close_digits_on_keyboard(d: str) → list[str]¶

Returns digits that are physically adjacent to d on a QWERTY keyboard’s top number row.

Args:

d (str): A digit.

Returns:

list[str]: A list of close digits on the keyboard.

is_valid_CAS(cas: str | bool | None) → bool¶

Check if a string is a valid CAS Registry Number.

Validates whether a string represents a valid CAS Registry Number by checking its format and verifying its check digit.

Args:

cas (Union[str, bool, None]): The string to check.

Returns:

bool: True if the string is a valid CAS Registry Number, False otherwise.

Notes:

Returns False for None, boolean inputs, or non-string inputs.
Checks if the string matches the standard CAS format using a regular expression.
Performs the CAS check digit validation algorithm.
The method is cached for performance optimization.

Example:

>>> is_valid_CAS("7732-18-5")
True
>>> is_valid_CAS("7732-18-6")
False
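
The check digit validation works as follows: the digits before the check digit are multiplied by 1, 2, 3, ... counting from the right, and the sum modulo 10 must equal the check digit. The sketch below illustrates this; `is_valid_cas_sketch` is a hypothetical name, and the regular expression is the commonly cited CAS layout, assumed rather than taken from the library.

```python
import re

_CAS_FORMAT = re.compile(r"\d{2,7}-\d{2}-\d")


def is_valid_cas_sketch(cas):
    # Booleans, None, and other non-string inputs are rejected.
    if not isinstance(cas, str) or _CAS_FORMAT.fullmatch(cas) is None:
        return False
    digits = cas.replace("-", "")
    body, check_digit = digits[:-1], int(digits[-1])
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit
```

For "7732-18-5" the weighted sum is 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 = 5 matches the check digit.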

expand_CAS_heuristically(CAS: str | list[str], max_swaps: int = 2) → list[str]¶

Perform a breadth-first search (BFS) over keyboard-adjacent digit swaps to find the valid CAS numbers reachable with the fewest swaps, which may repair a mistyped CAS number.

Each single-digit swap must replace the digit with a QWERTY-adjacent digit from the top row (e.g., ‘3’ can swap with ‘2’ or ‘4’). The search will stop once the minimal number of swaps resulting in a valid CAS is found, or after exceeding max_swaps.

Args:

CAS (str): The input CAS string to be modified.

max_swaps (int, optional): The maximum number of swaps (BFS depth) to explore. Defaults to 2.

Returns:

list[str]: A list of all valid CAS strings found at the minimal swap distance. Returns an empty list if none are found within max_swaps.
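
The search can be sketched level by level. This is an illustration under stated assumptions: `close_digits`, `is_valid_cas`, and `expand_cas_sketch` are hypothetical names, the number row is taken to be 1234567890 with no wrap-around between '1' and '0', only a single string input is handled, and the search starts at one swap without checking the input itself.

```python
import re

_NUMBER_ROW = "1234567890"
_CAS_FORMAT = re.compile(r"\d{2,7}-\d{2}-\d")


def close_digits(d):
    """Digits directly left and right of d on the number row."""
    i = _NUMBER_ROW.index(d)
    return [_NUMBER_ROW[j] for j in (i - 1, i + 1) if 0 <= j < len(_NUMBER_ROW)]


def is_valid_cas(cas):
    if _CAS_FORMAT.fullmatch(cas) is None:
        return False
    digits = cas.replace("-", "")
    total = sum(w * int(d) for w, d in enumerate(reversed(digits[:-1]), start=1))
    return total % 10 == int(digits[-1])


def expand_cas_sketch(cas, max_swaps=2):
    seen = {cas}
    frontier = [cas]
    for _ in range(max_swaps):
        next_frontier = []
        for candidate in frontier:
            for i, char in enumerate(candidate):
                if char not in _NUMBER_ROW:
                    continue  # hyphens are left untouched
                for other in close_digits(char):
                    swapped = candidate[:i] + other + candidate[i + 1:]
                    if swapped not in seen:
                        seen.add(swapped)
                        next_frontier.append(swapped)
        # Stop at the first depth that yields any valid CAS number.
        valid = [c for c in next_frontier if is_valid_cas(c)]
        if valid:
            return valid
        frontier = next_frontier
    return []
```

For the invalid input "7732-18-6", one swap is enough: both "7732-18-5" and "7732-19-6" pass the check digit test, so both are returned and the caller has to decide between them.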

are_InChIs_equal(InChI1: str, InChI2: str, differentiate_isomers: bool | None = None, differentiate_tautomers: bool | None = None, differentiate_isotopes: bool | None = None, check_for_resonance_structures: bool | None = None) → bool¶

Compare two InChI strings for equality based on specified criteria.

Determines if two InChI strings represent the same chemical entity, with options to consider or ignore various structural features.

Args:

InChI1 (str): The first InChI string to compare.

InChI2 (str): The second InChI string to compare.

differentiate_isomers (Optional[bool]): If True, considers isomers as different. Defaults to None.

differentiate_tautomers (Optional[bool]): If True, considers tautomers as different. Defaults to None.

differentiate_isotopes (Optional[bool]): If True, considers isotopes as different. Defaults to None.

check_for_resonance_structures (Optional[bool]): If True, checks for resonance structures. Defaults to None.

Returns:

bool: True if the InChIs are considered equal based on the specified criteria, False otherwise.

Notes:

Returns False if either InChI string is empty or None.
If the InChI strings are identical, returns True immediately.
Converts InChI strings to RDKit molecule objects for comparison.
Uses the are_equal method for detailed comparison based on the specified criteria.
This method is cached for performance optimization.

are_SMILES_equal(smiles1: str, smiles2: str, differentiate_isomers: bool | None = None, differentiate_tautomers: bool | None = None, differentiate_isotopes: bool | None = None, check_for_resonance_structures: bool | None = None) → bool¶

Compare two SMILES strings for equality based on specified criteria.

Determines if two SMILES strings represent the same chemical entity, with options to consider or ignore various structural features.

Args:

smiles1 (str): The first SMILES string to compare.

smiles2 (str): The second SMILES string to compare.

differentiate_isomers (Optional[bool]): If True, considers isomers as different. Defaults to None.

differentiate_tautomers (Optional[bool]): If True, considers tautomers as different. Defaults to None.

differentiate_isotopes (Optional[bool]): If True, considers isotopes as different. Defaults to None.

check_for_resonance_structures (Optional[bool]): If True, checks for resonance structures. Defaults to None.

Returns:

bool: True if the SMILES are considered equal based on the specified criteria, False otherwise.

Notes:

Converts SMILES strings to RDKit molecule objects for comparison.
Uses the are_equal method for detailed comparison based on the specified criteria.
This method is cached for performance optimization.

get_resonance_SMILES(SMILES: str) → list[str]¶

Generate resonance structures for a given molecule represented by a SMILES string.

Takes a SMILES string and returns a list of SMILES strings representing all possible resonance structures of the molecule.

Args:

SMILES (str): The input SMILES string representing the molecule.

Returns:

list[str]: A list of SMILES strings representing the resonance structures. If no resonance structures are found or if the input is invalid, returns a list containing only the input SMILES.

Notes:

Uses RDKit’s ResonanceMolSupplier to generate resonance structures.
Implements a workaround for a known RDKit issue by setting an empty progress callback.
Handles corner cases where ResonanceMolSupplier might return None for all resonance structures.
Filters out any None results from the resonance structure generation.
This method is cached for performance optimization.

Example:

>>> resolver = MoleculeResolver()
>>> resolver.get_resonance_SMILES("C=C-C=C")
['C=C-C=C', 'C=C=C-C', 'C-C=C=C']

are_equal(mol1: rdkit.Chem.rdchem.Mol, mol2: rdkit.Chem.rdchem.Mol, differentiate_isomers: bool | None = None, differentiate_tautomers: bool | None = None, differentiate_isotopes: bool | None = None, check_for_resonance_structures: bool | None = None) → bool¶

Compare two RDKit molecule objects for equality based on specified criteria.

Determines if two molecules are considered equal, with options to consider or ignore various structural features such as isomers, tautomers, isotopes, and resonance structures.

Args:

mol1: The first RDKit molecule object to compare.

mol2: The second RDKit molecule object to compare.

differentiate_isomers: If True, considers isomers as different. If None, uses the class’s default setting. Defaults to None.

differentiate_tautomers: If True, considers tautomers as different. If None, uses the class’s default setting. Defaults to None.

differentiate_isotopes: If True, considers isotopes as different. If None, uses the class’s default setting. Defaults to None.

check_for_resonance_structures: If True, checks for resonance structures. If None, uses the class’s default setting. Defaults to None.

Returns:

True if the molecules are considered equal based on the specified criteria, False otherwise.

Notes:

Returns False if either molecule is None.
Uses class default settings for unspecified comparison criteria.
Performs a series of checks based on the specified criteria:
1. Compares InChI strings if all features are differentiated.
2. Compares canonical SMILES if isotopes are not differentiated.
3. Compares InChI strings without isotope layers if tautomers and isomers are differentiated.
4. Compares InChI strings without stereo layers if only tautomers are differentiated.
5. Compares InChI strings without stereo and tautomer layers if neither are differentiated.
Optionally checks for resonance structures if specified.
Uses RDKit’s MolToInchi and MolToSmiles functions for conversions.
Handles potential RDKit exceptions during InChI generation.

Example:

>>> mol1 = Chem.MolFromSmiles("CC(=O)O")
>>> mol2 = Chem.MolFromSmiles("CC(O)=O")
>>> resolver = MoleculeResolver()
>>> resolver.are_equal(mol1, mol2, differentiate_tautomers=False)
True
>>> resolver.are_equal(mol1, mol2, differentiate_tautomers=True)
False

find_duplicates_in_molecule_dictionary(molecules: dict[Any, moleculeresolver.molecule.Molecule], clustered_molecules: dict | None = None) → Generator[tuple[Any, moleculeresolver.molecule.Molecule, Any, moleculeresolver.molecule.Molecule, bool], None, None]¶

Find duplicate molecules within a dictionary of molecules.

Identifies and yields pairs of duplicate molecules from the input dictionary. It can use a pre-clustered dictionary of molecules for efficiency.

Args:

molecules (dict[Any, Molecule]): A dictionary of molecules to search for duplicates. The keys can be of any type, and the values are Molecule objects.

clustered_molecules (Optional[dict]): A pre-clustered dictionary of molecules, grouped by molecular formula. If None, the method will generate this clustering. Defaults to None.

Yields:

tuple[Any, Molecule, Any, Molecule, bool]: Pairs of duplicate molecules found in the input dictionary.

Notes:

If clustered_molecules is not provided, it generates the clustering using group_molecule_dictionary_by_formula.
Uses the intersect_molecule_dictionaries method to find duplicates.
This method is particularly useful for efficiently identifying duplicate molecules in large datasets.

Example:

>>> molecules = {
...     "mol1": Molecule("CC"),
...     "mol2": Molecule("CCC"),
...     "mol3": Molecule("CC")  # Duplicate of mol1
... }
>>> resolver = MoleculeResolver()
>>> duplicates = list(resolver.find_duplicates_in_molecule_dictionary(molecules))
>>> print(duplicates)
[('mol1', Molecule(...), 'mol3', Molecule(...), True)]

group_molecule_dictionary_by_formula(molecules: dict[Any, moleculeresolver.molecule.Molecule]) → dict[str, dict[Any, tuple[rdkit.Chem.rdchem.Mol, moleculeresolver.molecule.Molecule]]]¶

Group a dictionary of molecules by their molecular formulas.

Takes a dictionary of molecules and returns a new dictionary where molecules are grouped by their molecular formulas.

Args:

molecules (dict[Any, Molecule]): A dictionary of molecules to be grouped. The keys can be of any type, and the values are Molecule objects.

Returns:

dict[str, dict[Any, tuple[Chem.rdchem.Mol, Molecule]]]: A dictionary where:

Keys are molecular formulas (str).

Values are dictionaries whose keys are the original keys from the input dictionary and whose values are tuples of (an RDKit Mol object, the original Molecule object).

Notes:

Uses the get_from_SMILES method to convert SMILES to RDKit Mol objects.
Calculates molecular formulas using RDKit’s CalcMolFormula function.
This grouping is useful for efficient comparison and searching of molecules with the same formula.

Example:

>>> molecules = {
...     "mol1": Molecule("CC"),
...     "mol2": Molecule("CCC"),
...     "mol3": Molecule("CCO")
... }
>>> resolver = MoleculeResolver()
>>> grouped = resolver.group_molecule_dictionary_by_formula(molecules)
>>> for formula, mol_dict in grouped.items():
...     print(f"{formula}: {list(mol_dict.keys())}")
C2H6: ['mol1']
C3H8: ['mol2']
C2H6O: ['mol3']

intersect_molecule_dictionaries(molecules: dict[Any, moleculeresolver.molecule.Molecule], other_molecules: dict[Any, moleculeresolver.molecule.Molecule], mode: str | None = 'SMILES', clustered_molecules: dict | None = None, clustered_other_molecules: dict | None = None, report_same_keys: bool | None = True) → Generator[tuple[Any, moleculeresolver.molecule.Molecule, Any, moleculeresolver.molecule.Molecule, bool], None, None]¶

Intersect two dictionaries of molecules based on structural similarity.

Compares molecules from two dictionaries and yields pairs of molecules that are structurally similar, along with their keys and isomer information.

Args:

molecules (dict[Any, Molecule]): First dictionary of molecules to compare.

other_molecules (dict[Any, Molecule]): Second dictionary of molecules to compare.

mode (Optional[str]): Comparison mode, either “SMILES” or “inchi”. Defaults to “SMILES”.

clustered_molecules (Optional[dict]): Pre-clustered version of molecules. If None, clustering will be performed. Defaults to None.

clustered_other_molecules (Optional[dict]): Pre-clustered version of other_molecules. If None, clustering will be performed. Defaults to None.

report_same_keys (Optional[bool]): If True, report matches even if the keys are identical. Defaults to True.

Yields:

tuple[Any, Molecule, Any, Molecule, bool]: A tuple containing:

Key from the first dictionary.

Molecule from the first dictionary.

Key from the second dictionary.

Molecule from the second dictionary.

Boolean indicating if the molecules have the same isomer information.

Raises:

ValueError: If an unsupported mode is specified.

Notes:

Molecules are first grouped by molecular formula for efficient comparison.
Supports comparison using either SMILES or InChI representation.
For SMILES mode, molecules are compared with and without considering isomer information.
For InChI mode, molecules are compared with and without considering stereochemistry.
Uses the are_equal and are_InChIs_equal methods for comparisons.
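
The benefit of grouping by formula first is that the expensive pairwise comparison only runs within groups that share a formula. The sketch below shows the idea generically; `bucket_by` and `intersect_sketch` are hypothetical names, the `bucket_key` and `same` callables stand in for formula calculation and structural comparison, and the isomer flag of the real 5-tuple is omitted.

```python
from collections import defaultdict


def bucket_by(items, bucket_key):
    """Group a dictionary's entries by a cheap key (e.g. molecular formula)."""
    buckets = defaultdict(dict)
    for key, value in items.items():
        buckets[bucket_key(value)][key] = value
    return buckets


def intersect_sketch(first, second, bucket_key, same, report_same_keys=True):
    first_buckets = bucket_by(first, bucket_key)
    second_buckets = bucket_by(second, bucket_key)
    # Only buckets present on both sides can contain matches.
    for shared in first_buckets.keys() & second_buckets.keys():
        for key1, value1 in first_buckets[shared].items():
            for key2, value2 in second_buckets[shared].items():
                if key1 == key2 and not report_same_keys:
                    continue
                if same(value1, value2):
                    yield key1, value1, key2, value2
```

In a toy run with strings as stand-ins, string length can serve as the bucket key and a case-insensitive comparison as the equality test.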

get_SMILES_from_Mol_format(*, molblock: str | None = None, url: str | None = None) → str | None¶

Convert a molecule from MOL format to SMILES representation.

Takes either a MOL block string or a URL pointing to a MOL file, converts it to an RDKit molecule object, standardizes it, and returns the SMILES representation.

Args:

molblock (Optional[str]): A string containing the molecule information in MOL format. Defaults to None.

url (Optional[str]): A URL pointing to a file containing the molecule information in MOL format. Defaults to None.

Returns:

Optional[str]: The SMILES representation of the molecule if conversion is successful, None otherwise.

Raises:

ValueError: If both molblock and url are None, or if both are provided.

TypeError: If molblock or url is provided but is not a string.

Notes:

Either molblock or url must be provided, but not both.
If a URL is provided, the method will attempt to fetch the MOL data from it.
Includes a fix for potential issues in the MOL block format.
If the initial conversion fails, it attempts to create the molecule without sanitization.
The resulting molecule is standardized before converting to SMILES.
Uses RDKit for molecule manipulation and conversion.
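
The argument contract (exactly one of molblock or url, and it must be a string) can be sketched as below. `pick_mol_source_sketch` is a hypothetical helper written for illustration; fetching the URL and the RDKit conversion are not shown.

```python
def pick_mol_source_sketch(*, molblock=None, url=None):
    # Exactly one of the two keyword arguments must be given.
    if (molblock is None) == (url is None):
        raise ValueError("Provide exactly one of molblock or url.")
    kind = "molblock" if molblock is not None else "url"
    source = molblock if molblock is not None else url
    if not isinstance(source, str):
        raise TypeError(f"{kind} must be a string.")
    return kind, source
```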

get_SMILES_from_image_file(image_path: str, engines_order: list[str] | None = ['osra', 'molvec', 'imago'], mode: str | None = 'single') → str | list[str]¶

Extract SMILES representation from a chemical structure image file.

Uses multiple optical structure recognition engines to convert a chemical structure image into a SMILES string.

Args:

image_path (str): The file path to the image containing the chemical structure.

engines_order (Optional[list[str]]): The order in which to try different recognition engines. Defaults to [“osra”, “molvec”, “imago”].

mode (Optional[str]): The extraction mode. Can be either “single” (return the first successful result) or “all” (return results from all successful engines). Defaults to “single”.

Returns:

Union[str, list[str]]: The SMILES representation of the chemical structure. If mode is “single”, returns the first successful SMILES string. If mode is “all”, returns a list of all successful SMILES strings.

Raises:

ValueError: If an invalid mode is specified.

Notes:

This function will throw an error if the services are offline.
Attempts to use the specified engines in the given order.
OSRA is typically the most effective engine and is set as the default first choice.
Uses the molvec.ncats.io API for structure recognition.
If an engine successfully recognizes the structure, the result is converted to SMILES.
In “single” mode, the method stops after the first successful recognition.
Final SMILES strings are standardized before being returned.
If no engines successfully recognize the structure, an empty string or list is returned.
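
The engine fallback and the two modes can be sketched as a simple loop. In this illustration `recognise_sketch` is a hypothetical name and `engines` is a hypothetical mapping from engine name to a callable that returns a SMILES string, or an empty value on failure; the remote service calls and the final standardization are not shown.

```python
def recognise_sketch(image_path, engines,
                     engines_order=("osra", "molvec", "imago"), mode="single"):
    if mode not in ("single", "all"):
        raise ValueError(f"Unsupported mode: {mode!r}")
    results = []
    for name in engines_order:
        smiles = engines[name](image_path)
        if not smiles:
            continue  # this engine failed, try the next one
        if mode == "single":
            return smiles  # stop after the first success
        results.append(smiles)
    # Nothing recognised: empty string in "single" mode, empty list in "all".
    return "" if mode == "single" else results
```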

show_mol_and_pause(mol: rdkit.Chem.rdchem.Mol, name: str | None = None, size: tuple[int, int] | None = (1000, 1000)) → None¶

Display a molecule image with additional information and pause execution.

Generates an image of the molecule, adds title information including the molecule’s name (if provided) and formal charge, and displays the image.

Args:

mol (Chem.rdchem.Mol): The RDKit molecule object to be displayed.

name (Optional[str]): The name of the molecule to be displayed in the title. If None, only the charge is shown. Defaults to None.

size (Optional[tuple[int, int]]): The size of the output image in pixels (width, height). Defaults to (1000, 1000).

Notes:

Adjusts drawing options based on the image size for optimal visualization.
The formal charge of the molecule is always displayed in the title.
The image is displayed using the default image viewer of the system.
Execution is paused after displaying the image (implicit in img.show()).
Uses RDKit’s Draw module for molecule rendering.
The title is added to the image using the PIL (Python Imaging Library) module.
The font size for atom labels and other drawing options are scaled based on the image size.

save_mol_to_PNG(mol: rdkit.Chem.rdchem.Mol, filename: str, atom_infos: list | None = None, atom_infos_format_string: str | None = '%.3f', size: tuple[int, int] | None = (1000, 1000)) → None¶

Save a molecule image to a PNG file with optional atom information.

Generates an image of the molecule, optionally adds atom-specific information, and saves it as a PNG file.

Args:

mol (Chem.rdchem.Mol): The RDKit molecule object to be saved.

filename (str): The path and name of the file where the image will be saved.

atom_infos (Optional[list]): A list of values to be displayed for each atom. If provided, must have the same length as the number of atoms in the molecule. Defaults to None.

atom_infos_format_string (Optional[str]): The format string to use when converting atom_infos values to strings. Defaults to “%.3f”.

size (Optional[tuple[int, int]]): The size of the output image in pixels (width, height). Defaults to (1000, 1000).

Notes:

If atom_infos is provided, each atom in the molecule will be annotated with the corresponding value.
Creates a deep copy of the molecule to avoid modifying the original.
Atom annotations are added as ‘atomNote’ properties to each atom.
The image is saved in PNG format using RDKit’s MolToFile function.
Uses RDKit’s Draw module for molecule rendering.
If atom_infos is provided but doesn’t match the number of atoms, it may lead to unexpected results.
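Because a length mismatch may lead to unexpected results, a caller might validate and pre-format the values before passing them in. The sketch below is an assumption about how such a guard could look; `format_atom_notes` is a hypothetical helper and does not touch RDKit.

```python
def format_atom_notes(atom_infos, n_atoms, format_string="%.3f"):
    """Return one formatted note per atom, or raise if the lengths differ."""
    if len(atom_infos) != n_atoms:
        raise ValueError(
            f"Expected {n_atoms} values, got {len(atom_infos)}"
        )
    # printf-style formatting, matching the default '%.3f' format string.
    return [format_string % value for value in atom_infos]


notes = format_atom_notes([0.12345, -0.5, 1.0], n_atoms=3)
```

Here `notes` contains the strings that would be attached to the three atoms, each with three decimal places.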

normalize_html(html_code: str) → str¶

Normalize HTML content.

This method is a placeholder for HTML normalization logic.

Args:

html_code (str): The HTML content to normalize.

Returns:

str: The normalized HTML content.

Notes:

Currently not implemented.

parse_items_from_html(html_code: str, split_tag: str, properties_regex: list[tuple[str, str, list[int]]], property_indices_required: list[int] | None = None) → list[list]¶

Parse and extract items from HTML content based on specified regex patterns.

Splits the HTML content, applies regex patterns to extract properties, and returns a list of items that meet the specified requirements.

Args:

html_code (str): The HTML content to parse.

split_tag (str): The HTML tag used to split the content into parts. If empty, the entire HTML is treated as one part.

properties_regex (list[tuple[str, str, list[int]]]): A list of tuples, each containing:

source_type (str): ‘html’ or ‘text’ to indicate where to apply the regex.

property_regex (str): The regex pattern to extract the property.

allowed_match_number (list[int]): A list of allowed numbers of matches for the regex.

property_indices_required (Optional[list[int]]): Indices of properties that must be non-None for an item to be included in the result. Defaults to None.

Returns:

list[list]: A list of items, where each item is a list of extracted properties.

Raises:

RuntimeError: If the number of regex matches doesn’t fall within the allowed range for any property.

Notes:

First normalizes the HTML using the normalize_html method.
Splits the HTML based on the provided split_tag.
For each part, applies the regex patterns to extract properties.
Items are only included in the result if they meet the requirements specified by property_indices_required.
The standard-library HTMLParser was considered, but matching opening and closing tags correctly with it is cumbersome. BeautifulSoup was deliberately avoided as a dependency at first, although it may be added in the future. The current regex-based approach is a simple, quick solution that works for the time being.
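The split-then-match flow described above can be illustrated with a small standalone function. This is a sketch under assumptions, not the library's code: `parse_items` is a hypothetical name, the tag stripping for the ‘text’ source is deliberately crude, and only the first match of each pattern is kept.

```python
import re


def parse_items(html_code, split_tag, properties_regex, property_indices_required=None):
    """Split HTML into parts and extract one value per regex from each part."""
    parts = html_code.split(split_tag) if split_tag else [html_code]
    items = []
    for part in parts:
        # Crude tag removal to approximate the visible text of the part.
        text = re.sub(r"<[^>]+>", "", part)
        item = []
        for source_type, pattern, allowed_match_number in properties_regex:
            source = part if source_type == "html" else text
            matches = re.findall(pattern, source)
            if len(matches) not in allowed_match_number:
                raise RuntimeError(
                    f"{len(matches)} matches for {pattern!r}, allowed: {allowed_match_number}"
                )
            item.append(matches[0] if matches else None)
        # Skip items that lack a required property.
        if property_indices_required and any(
            item[i] is None for i in property_indices_required
        ):
            continue
        items.append(item)
    return items


html = (
    '<li><a href="/c/1">ethanol</a> CAS: 64-17-5</li>'
    '<li><a href="/c/2">unknown</a></li>'
)
properties = [
    ("html", r'href="([^"]+)"', [0, 1]),
    ("text", r"CAS: ([\d-]+)", [0, 1]),
]
items = parse_items(html, "<li>", properties, property_indices_required=[0, 1])
```

Only the first list entry carries both a link and a CAS number, so it is the only item that survives the `property_indices_required` filter.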

get_java_path() → str | None¶

Get the full path of the Java executable.

Determines the full path of the Java executable on the system, supporting Windows, Linux, and macOS platforms.

Returns:

Optional[str]: The full path to the Java executable if found, None otherwise.

Raises:

NotImplementedError: If the method is called on an unsupported operating system.

Notes:

On Windows, uses the ‘where’ command to locate Java.
On Linux and macOS, uses the ‘which’ command.
Checks if the found Java paths actually exist on the file system.
If multiple Java installations are found, the first valid path is returned.
Returns None if Java is not found or if the ‘where’/‘which’ command fails.
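A lookup along the lines of these notes might look like the sketch below. It is an approximation using only the standard library; `find_java` is a hypothetical name and the real method may differ in details such as error handling.

```python
import os
import platform
import subprocess


def find_java():
    """Return the first existing Java path reported by where/which, else None."""
    system = platform.system()
    if system == "Windows":
        command = ["where", "java"]
    elif system in ("Linux", "Darwin"):
        command = ["which", "java"]
    else:
        raise NotImplementedError(f"Unsupported operating system: {system}")
    try:
        completed = subprocess.run(command, capture_output=True, text=True, check=False)
    except OSError:
        # The lookup command itself is unavailable.
        return None
    if completed.returncode != 0:
        return None
    for line in completed.stdout.splitlines():
        candidate = line.strip()
        if candidate and os.path.exists(candidate):
            return candidate
    return None


java_path = find_java()
```

The standard library's `shutil.which("java")` offers a shorter cross-platform alternative when only the first match on the PATH is needed.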

_get_and_run_OPSIN_executable(names: tuple[str], allow_uninterpretable_stereo: bool | None = False) → list[str | None]¶

Retrieve a molecule from OPSIN based on its name.

Queries the OPSIN (Open Parser for Systematic IUPAC Nomenclature) service to convert a chemical name into a molecular structure, with optional constraints.

Args:

name (str): The chemical name to be converted to a molecular structure.

required_formula (Optional[str]): The expected molecular formula.

required_charge (Optional[int]): The expected molecular charge.

required_structure_type (Optional[str]): The expected structure type.

allow_warnings (Optional[bool]): If True, accept results with warnings.

Returns:

Optional[Molecule]: A Molecule object if successful, None otherwise.

Notes:

Checks the molecule cache before querying OPSIN.
Uses a resilient request method to handle potential network issues.
The SMILES string returned by OPSIN is checked against the required parameters.
If the SMILES check fails, it attempts to convert the InChI to SMILES.
Standardizes the SMILES string before creating the Molecule object.
Any warnings from OPSIN are included in the Molecule’s additional_information.
The result is filtered and combined with other results before being returned.
If no valid molecule is found, the method returns None.

get_molecule_from_OPSIN_batchmode(names: list[str], allow_uninterpretable_stereo: bool | None = False) → list[list[moleculeresolver.molecule.Molecule] | None]¶

Convert a batch of chemical names to molecules using OPSIN in offline mode.

Uses OPSIN to convert a list of chemical names to molecular structures in batch mode.

Args:

names (list[str]): A list of chemical names to be converted.

allow_uninterpretable_stereo (Optional[bool]): Allows OPSIN to ignore uninterpretable stereochemistry.

Returns:

list[Optional[list[Molecule]]]: A list where each element corresponds to an input identifier. Each element is either a list of Molecule objects (if found) or None (if not found or invalid).

Raises:

FileNotFoundError: If the Java installation could not be found.

RuntimeError: If there was a problem parsing the OPSIN offline output file.

Notes:

Checks a cache for previously processed molecules.
Downloads the latest version of OPSIN if not already present.
Runs OPSIN in offline mode using Java, processing all names in a single batch.
Standardizes the SMILES strings returned by OPSIN.
Uses a temporary directory for OPSIN operations, cleaned up after use.

get_molecule_for_ion_from_partial_pubchem_search(identifier: str, required_formula: str | None = None, required_charge: int | None = None) → list[tuple[moleculeresolver.molecule.Molecule, int]] | None¶

Search PubChem for an ion molecule based on a partial identifier.

Performs a partial search on PubChem using the given identifier and returns matching ion molecules that satisfy the specified formula and charge requirements.

Args:

identifier (str): The partial identifier to search for.

required_formula (Optional[str]): The expected molecular formula.

required_charge (Optional[int]): The expected molecular charge.

Returns:

Optional[list[tuple[Molecule, int]]]: A list of tuples with Molecule objects and their occurrence counts.

Raises:

ValueError: If the parameters fail validation in the _check_parameters method.

Notes:

Cached to improve performance for repeated calls.
Uses a resilient request method to handle potential network issues.
The search is performed on the PubChem Compound database.
For each compound found, checks if it’s a salt or mixture.
Each component is checked against the required formula and charge.
Standardizes the SMILES strings of matching molecules.
Results are sorted by frequency.
Returns None if no matching molecules are found.
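The occurrence counting and frequency sort mentioned in the notes can be sketched with `collections.Counter`. This is an illustration only: `rank_by_frequency` is a hypothetical helper, it works on already standardized SMILES strings rather than Molecule objects, and descending order is an assumption.

```python
from collections import Counter


def rank_by_frequency(smiles_candidates):
    """Return (SMILES, count) pairs, most frequent first, or None if empty."""
    counts = Counter(smiles_candidates)
    return counts.most_common() or None


ranked = rank_by_frequency(["[Na+]", "[Cl-]", "[Na+]", "[Na+]", "[Cl-]", "[K+]"])
```

Each tuple pairs a candidate with the number of compounds in which it occurred, mirroring the `list[tuple[Molecule, int]]` return shape.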

get_molecule_from_ChEBI(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from ChEBI based on the provided identifier and mode.

Queries the ChEBI database to retrieve molecule information based on various identifiers.

Args:

identifier (str): The identifier to search for.

mode (str): The type of identifier (e.g., ‘name’, ‘cas’, ‘formula’, ‘smiles’, ‘inchi’, ‘inchikey’).

required_formula (Optional[str]): The expected molecular formula.

required_charge (Optional[int]): The expected molecular charge.

required_structure_type (Optional[str]): The expected structure type.

Returns:

Optional[Molecule]: A Molecule object if found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation.

Notes:

Checks a cache for previously retrieved molecules.
Uses the ChEBI web services API to query the database.
Extracts SMILES, synonyms, CAS numbers, and ChEBI ID.
Filters results based on required formula, charge, and structure type.
Combines multiple matching molecules into a single result if necessary.
Uses resilient network requests to handle potential connection issues.

get_molecules_using_batchmode_from(identifiers: list[list[str]], modes: list[list[str]], service: str, batch_size: int | None = 1000, progressbar: bool | None = False) → tuple[dict[str, list[moleculeresolver.molecule.Molecule | None]], list[str]]¶

Retrieve molecules in batch mode from a specified service.

Performs batch retrieval of molecules from a supported service using provided identifiers and modes.

Args:

identifiers (list[list[str]]): A list of lists containing identifiers for molecules.

modes (list[list[str]]): A list of lists containing modes corresponding to the identifiers.

service (str): The name of the service to use for retrieval.

batch_size (Optional[int]): Number of identifiers to process in each batch.

progressbar (Optional[bool]): Display a progress bar during processing.

Returns:

tuple[dict[str, list[Optional[Molecule]]], list[str]]: A dictionary of results and a list of supported modes.

Raises:

ValueError: If the specified service does not have batch capabilities or is not supported.

ValueError: If the list of identifiers and modes for a molecule are not of the same length.

NotImplementedError: If batch functionality for the specified service is not implemented.

RuntimeError: If the number of returned values doesn’t match the number of unique values requested.

Notes:

Checks if the service supports batch capabilities.
Groups identifiers by mode for efficient batch processing.
Uses different batch retrieval functions based on the service.
Supports services like PubChem, SRS, CompTox, and OPSIN.
Organizes results by mode.
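Grouping identifiers by mode and slicing each group into batches could look roughly like this. The helper names `group_by_mode` and `chunks` are hypothetical and the sketch omits the service-specific retrieval calls.

```python
def group_by_mode(identifiers, modes):
    """Collect the unique identifiers of every molecule under their mode."""
    grouped = {}
    for molecule_identifiers, molecule_modes in zip(identifiers, modes):
        if len(molecule_identifiers) != len(molecule_modes):
            raise ValueError(
                "identifiers and modes must have the same length for each molecule"
            )
        for identifier, mode in zip(molecule_identifiers, molecule_modes):
            bucket = grouped.setdefault(mode, [])
            if identifier not in bucket:
                bucket.append(identifier)
    return grouped


def chunks(items, batch_size=1000):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


grouped = group_by_mode(
    [["ethanol", "64-17-5"], ["benzene", "71-43-2"], ["ethanol"]],
    [["name", "cas"], ["name", "cas"], ["name"]],
)
batches = list(chunks(grouped["name"], batch_size=1))
```

Deduplicating within each mode keeps the request small, which is also why a mismatch between requested and returned unique values can be detected afterwards.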

get_molecule_from_CompTox(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from CompTox based on the provided identifier and mode.

Queries the CompTox database to retrieve molecule information based on various identifiers.

Args:

identifier (str): The identifier to search for.

mode (str): The type of identifier.

required_formula (Optional[str]): The expected molecular formula.

required_charge (Optional[int]): The expected molecular charge.

required_structure_type (Optional[str]): The expected structure type.

Returns:

Optional[Molecule]: A Molecule object if found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation.

Notes:

Checks a cache for previously retrieved molecules.
Uses the CompTox API to query the database.
Extracts SMILES, synonyms, CAS numbers, and DTXSID.
Filters results based on required formula, charge, and structure type.
Includes CompTox QC level in the additional information.
A newer CompTox API is documented at https://api-ccte.epa.gov/docs/chemical.html#/; the implementation will need to switch to it at some point.

get_molecule_from_CTS(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from CTS (Chemical Translation Service) based on the provided identifier and mode.

Queries CTS to convert various chemical identifiers into molecular structures and associated information.

Args:

identifier (str): The chemical identifier to search for.

mode (str): The type of identifier (e.g., ‘name’, ‘cas’, ‘inchi’, ‘smiles’).

required_formula (Optional[str]): The expected molecular formula.

required_charge (Optional[int]): The expected molecular charge.

required_structure_type (Optional[str]): The expected structure type.

Returns:

Optional[Molecule]: A Molecule object if found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation.

Notes:

Checks a cache for previously retrieved molecules.
Uses the CTS REST API to query the service.
Retrieves SMILES, CAS numbers, and synonyms.
Filters out radicals and mixtures based on the required_structure_type.
Filters results based on required formula, charge, and structure type.
Issues a warning and returns None if CTS is down.
Uses resilient network requests to handle potential connection issues.

get_CompTox_request_unique_id() → str¶

Generate a unique identifier for CompTox requests based on the current timestamp.

Returns:

str: A unique base36-encoded string derived from the current timestamp.

Notes:

Uses the current time in milliseconds for increased uniqueness.
Base36 encoding results in a shorter string compared to decimal representation.
Useful for generating unique, short, and time-based request IDs.
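A base36 encoding of a millisecond timestamp can be written in a few lines. The sketch below is an assumption about the approach: `to_base36` and `request_unique_id` are hypothetical names, and the use of lowercase digits is a guess.

```python
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(number):
    """Encode a non-negative integer using digits 0-9 and letters a-z."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def request_unique_id():
    """Encode the current time in milliseconds as a base36 string."""
    return to_base36(int(time.time() * 1000))


unique_id = request_unique_id()
```

The result can be decoded again with `int(unique_id, 36)`, which makes the shorter representation easy to verify.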

get_molecules_from_CompTox_batchmode(identifiers: list[str], mode: str) → list[moleculeresolver.molecule.Molecule | None]¶

Retrieve molecules from CompTox in batch mode.

This method queries the CompTox (Computational Toxicology) database to retrieve molecule information for multiple identifiers in a single batch request.

Args:

identifiers (list[str]): A list of chemical identifiers to search for in CompTox.

mode (str): The type of identifier. Supported modes are ‘name’, ‘cas’, and ‘inchikey’.

Returns:

list[Optional[Molecule]]: A list of Molecule objects corresponding to the input identifiers. Each element is either a Molecule object if found, or None if not found or invalid.

Raises:

TypeError: If any of the identifiers is not a string.

ValueError: If the input parameters fail validation in the _check_parameters method.

Notes:

The method first checks a cache for previously retrieved molecules.
It uses the CompTox batch search API to query the database.
The search is performed based on the specified mode for all identifiers.
The method retrieves SMILES, synonyms, CAS numbers, IUPAC names, and QC levels.
It creates Molecule objects for valid results, including metadata.
The results are filtered based on the quality of the match and data.
The method uses resilient network requests to handle potential connection issues.
It includes a polling mechanism to wait for the CompTox job to complete.
The results are processed from an Excel file returned by the CompTox API.
Synonyms are extracted from a separate sheet in the Excel file.

Example:

>>> resolver = MoleculeResolver()
>>> identifiers = ["50-00-0", "64-17-5", "71-43-2"]
>>> molecules = resolver.get_molecules_from_CompTox_batchmode(identifiers, mode="cas")
>>> for identifier, molecule in zip(identifiers, molecules):
...     if molecule:
...         print(f"Found molecule for {identifier}: {molecule.get_SMILES()}")
...     else:
...         print(f"No molecule found for {identifier}")
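The polling mechanism mentioned in the notes can be pictured as a loop that repeatedly asks for the job status until it completes. This is a generic sketch, not the CompTox client code: `poll_until_done`, the status strings, and the timeout are all assumptions.

```python
import time


def poll_until_done(check_status, interval_seconds=1.0, timeout_seconds=60.0):
    """Call check_status() until it reports 'done', 'failed', or the timeout hits."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = check_status()
        if status == "done":
            return status
        if status == "failed":
            raise RuntimeError("The remote job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("The remote job did not finish in time")
        time.sleep(interval_seconds)


# A stand-in for the remote job: queued, then running, then done.
statuses = iter(["queued", "running", "done"])
final_status = poll_until_done(lambda: next(statuses), interval_seconds=0.0)
```

In a real client, `check_status` would wrap the HTTP request for the job state, and the result file would be downloaded only after the loop returns.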

get_molecule_from_Chemeo(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from Chemeo based on the provided identifier and mode.

This method queries the Chemeo database to retrieve molecule information using various types of chemical identifiers.

Args:

identifier (str): The chemical identifier to search for in Chemeo.

mode (str): The type of identifier. Supported modes include ‘name’, ‘cas’, and ‘smiles’.

required_formula (Optional[str]): The expected molecular formula. Defaults to None.

required_charge (Optional[int]): The expected molecular charge. Defaults to None.

required_structure_type (Optional[str]): The expected structure type. Defaults to None.

Returns:

Optional[Molecule]: A Molecule object if a matching molecule is found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation in the _check_parameters method.

Notes:

The method first checks a cache for previously retrieved molecules.
It requires a valid Chemeo API token to perform the search.
The search is performed in two steps: first converting the identifier to a Chemeo CID, then searching for the compound details using the CID.
It extracts SMILES, synonyms, CAS numbers, and other relevant information from the API response.
The method prioritizes 3D mol block over 2D mol block, then InChI, and finally SMILES for structure representation.
It filters results to ensure the returned molecule matches the input SMILES (if mode is ‘smiles’).
The results are further filtered based on the required formula, charge, and structure type.
The method uses resilient network requests to handle potential connection issues.

Example:

>>> resolver = MoleculeResolver()
>>> molecule = resolver.get_molecule_from_Chemeo("ethanol", mode="name")
>>> if molecule:
...     print(f"Found molecule: {molecule.get_SMILES()}")
... else:
...     print("No matching molecule found")
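The stated preference order (3D mol block, then 2D mol block, then InChI, then SMILES) amounts to picking the first available representation from a record. In the sketch below the dictionary keys and the `pick_structure` helper are hypothetical and do not reflect the actual Chemeo response fields.

```python
# Hypothetical field names, listed from most to least preferred.
PRIORITY = ("mol3d", "mol2d", "inchi", "smiles")


def pick_structure(record):
    """Return (field, value) for the most preferred non-empty representation."""
    for key in PRIORITY:
        value = record.get(key)
        if value:
            return key, value
    return None


chosen = pick_structure(
    {"smiles": "CCO", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "mol2d": ""}
)
```

Since the record has no usable mol block, the InChI is chosen over the SMILES string.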

get_molecules_from_pubchem_batchmode(original_identifiers: list[str], mode: str) → list[list[moleculeresolver.molecule.Molecule] | None]¶

Retrieve molecules from PubChem in batch mode.

This method queries the PubChem database to retrieve molecule information for multiple identifiers in a single batch request using the PubChem Power User Gateway (PUG).

Args:

original_identifiers (list[str]): A list of chemical identifiers to search for in PubChem.

mode (str): The type of identifier. Supported modes include ‘name’, ‘cas’, ‘formula’, ‘smiles’, ‘inchi’, and ‘inchikey’.

Returns:

list[Optional[list[Molecule]]]: A list where each element corresponds to an input identifier. Each element is either a list of Molecule objects (if found) or None (if not found or invalid). Multiple Molecule objects may be returned for a single identifier if multiple matches are found.

Raises:

TypeError: If any of the identifiers is not a string.

ValueError: If the input parameters fail validation in the _check_parameters method.

Notes:

The method first checks parameters using the _check_parameters method.
It cleans identifiers, especially for ‘name’ mode, removing certain prefixes and suffixes.
The method uses PubChem’s PUG XML API for batch requests.
It performs two main steps: CID search and information retrieval for found CIDs.
The search and retrieval are done in batches to handle large numbers of identifiers efficiently.
It extracts SMILES, synonyms, CAS numbers, and other relevant information from the API response.
The method uses resilient network requests to handle potential connection issues.
It includes a polling mechanism to wait for PubChem jobs to complete.
Results are processed from gzipped files returned by the PubChem API.
The method standardizes SMILES strings and creates Molecule objects with metadata.
API: https://pubchem.ncbi.nlm.nih.gov/docs/power-user-gateway
API: https://pubchem.ncbi.nlm.nih.gov/docs/identifier-exchange-service

Example:

>>> resolver = MoleculeResolver()
>>> identifiers = ["ethanol", "acetone", "benzene"]
>>> results = resolver.get_molecules_from_pubchem_batchmode(identifiers, mode="name")
>>> for identifier, molecules in zip(identifiers, results):
...     if molecules:
...         print(f"Found {len(molecules)} molecule(s) for {identifier}")
...     else:
...         print(f"No molecules found for {identifier}")

get_molecule_from_pubchem(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from PubChem based on the provided identifier and mode.

This method queries the PubChem database to retrieve molecule information using various types of chemical identifiers.

Args:

identifier (str): The chemical identifier to search for in PubChem.

mode (str): The type of identifier. Supported modes include ‘name’, ‘cas’, ‘formula’, ‘smiles’, ‘inchi’, and ‘inchikey’.

required_formula (Optional[str]): The expected molecular formula. Defaults to None.

required_charge (Optional[int]): The expected molecular charge. Defaults to None.

required_structure_type (Optional[str]): The expected structure type. Defaults to None.

Returns:

Optional[Molecule]: A Molecule object if a matching molecule is found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation in the _check_parameters method.

Notes:

The method first checks a cache for previously retrieved molecules.
It uses the PubChem REST API to query the database.
For ‘formula’ mode, it retrieves multiple results and processes them individually.
The method extracts SMILES, synonyms, CAS numbers, and PubChem CID from the API response.
It standardizes SMILES strings and creates Molecule objects with metadata.
The results are filtered based on the required formula, charge, and structure type.
For ‘formula’ mode, if no results are found, it tries searching with the Hill formula.
The method uses resilient network requests to handle potential connection issues.
It includes special handling for different PubChem response formats and data structures.

Example:

>>> resolver = MoleculeResolver()
>>> molecule = resolver.get_molecule_from_pubchem("ethanol", mode="name")
>>> if molecule:
...     print(f"Found molecule: {molecule.get_SMILES()}")
... else:
...     print("No matching molecule found")
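The Hill formula fallback relies on the Hill ordering convention: carbon first, hydrogen second, then all other elements alphabetically, or purely alphabetical when no carbon is present. The sketch below shows that convention for simple formulas; `hill_formula` is a hypothetical helper that ignores parentheses, charges, and isotopes, and is not the library's own conversion.

```python
import re
from collections import Counter


def hill_formula(formula):
    """Reorder a flat molecular formula into Hill notation."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    if "C" in counts:
        others = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + others
    else:
        order = sorted(counts)
    return "".join(
        symbol + (str(counts[symbol]) if counts[symbol] > 1 else "")
        for symbol in order
    )


ethanol = hill_formula("CH3CH2OH")
water = hill_formula("OH2")
```

Repeated element symbols are summed first, so a condensed formula such as the ethanol input above collapses to a single count per element.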

get_molecule_from_CAS_registry(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from CAS Registry based on the provided identifier and mode.

This method queries the CAS (Chemical Abstracts Service) Registry to retrieve molecule information using various types of chemical identifiers.

Args:

identifier (str): The chemical identifier to search for in CAS Registry.

mode (str): The type of identifier. Supported modes include ‘name’, ‘cas’, ‘formula’, ‘smiles’, ‘inchi’, and ‘inchikey’.

required_formula (Optional[str]): The expected molecular formula. Defaults to None.

required_charge (Optional[int]): The expected molecular charge. Defaults to None.

required_structure_type (Optional[str]): The expected structure type. Defaults to None.

Returns:

Optional[Molecule]: A Molecule object if a matching molecule is found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation in the _check_parameters method.

Notes:

The method first checks a cache for previously retrieved molecules.
It uses the CAS Common Chemistry API to query the database.
The search is performed in two steps: first a general search, then a detailed lookup.
It extracts SMILES, synonyms, CAS numbers, and InChI from the API response.
The method standardizes SMILES strings and creates Molecule objects with metadata.
Results are filtered based on the required formula, charge, and structure type.
For ‘smiles’ mode, it performs additional checks to ensure the returned structure matches the input.
If no results are found for ‘smiles’ mode, it attempts to search using the InChI representation.
The method uses resilient network requests to handle potential connection issues.
Rejected status codes (403, 404) are handled gracefully.

Example:

>>> resolver = MoleculeResolver()
>>> molecule = resolver.get_molecule_from_CAS_registry("64-17-5", mode="cas")
>>> if molecule:
...     print(f"Found molecule: {molecule.get_SMILES()}")
... else:
...     print("No matching molecule found")

_match_SRS_results_to_identifiers(identifiers: list[str], mode: str, results: list[dict]) → dict[str, list[tuple]]¶

Match SRS (Substance Registry Services) results to input identifiers.

This method processes the results from an SRS query and matches them to the original input identifiers, organizing the data for easy retrieval.

Args:

identifiers (list[str]): A list of chemical identifiers used in the original query.

mode (str): The type of identifier used. Supported modes are ‘cas’ and ‘name’.

results (list[dict]): A list of dictionaries containing the SRS query results.

Returns:

dict[str, list[tuple]]: A dictionary where keys are the original identifiers and values are lists of tuples. Each tuple contains information about a matched molecule: (SMILES, primary_names, synonyms, CAS numbers, ITN, all_synonyms_lower).

Raises:

RuntimeError: If an ITN (Internal Tracking Number) is encountered more than once in the results.

Notes:

The method processes each result, extracting relevant information such as names, CAS numbers, and SMILES.
It organizes the data into several dictionaries for efficient lookup:
- infos_by_ITN: Stores all information for each ITN.
- ITNs_by_primary_name: Maps primary names to their corresponding ITNs.
- ITNs_by_synonym: Maps synonyms to their corresponding ITNs.
- ITNs_by_all_names: Maps all names (primary and synonyms) to their corresponding ITNs.
- ITNs_by_CAS: Maps CAS numbers to their corresponding ITNs.
The method attempts to standardize SMILES notations and convert InChI to SMILES if necessary.
It implements a matching algorithm that tries to find the best match for each input identifier.
The matching process is iterative and attempts to resolve ambiguities when multiple matches are found.
It uses various strategies to determine the best match, including checking for unique matches, comparing SMILES, and counting synonym occurrences.

Example:

>>> resolver = MoleculeResolver()
>>> identifiers = ["formaldehyde", "ethanol"]
>>> mode = "name"
>>> results = [...]  # SRS query results
>>> matched = resolver._match_SRS_results_to_identifiers(identifiers, mode, results)
>>> for identifier, matches in matched.items():
...     print(f"Matches for {identifier}: {len(matches)}")
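The lookup dictionaries listed in the notes can be pictured with a simplified builder. The record keys (`itn`, `cas`, `primary_name`, `synonyms`) and the `build_lookups` helper are hypothetical; the real SRS response and matching logic are more involved.

```python
def build_lookups(results):
    """Index simplified SRS-like records by ITN, CAS number, and lowercase name."""
    infos_by_ITN = {}
    ITNs_by_CAS = {}
    ITNs_by_all_names = {}
    for record in results:
        itn = record["itn"]
        if itn in infos_by_ITN:
            raise RuntimeError(f"ITN {itn} occurs more than once")
        infos_by_ITN[itn] = record
        ITNs_by_CAS.setdefault(record["cas"], []).append(itn)
        for name in [record["primary_name"], *record["synonyms"]]:
            bucket = ITNs_by_all_names.setdefault(name.lower(), [])
            if itn not in bucket:
                bucket.append(itn)
    return infos_by_ITN, ITNs_by_CAS, ITNs_by_all_names


records = [
    {"itn": "1", "cas": "64-17-5", "primary_name": "Ethanol", "synonyms": ["Ethyl alcohol"]},
    {"itn": "2", "cas": "50-00-0", "primary_name": "Formaldehyde", "synonyms": ["Methanal"]},
]
by_itn, by_cas, by_name = build_lookups(records)
```

An identifier that maps to exactly one ITN in these dictionaries is an unambiguous match; identifiers with several ITNs are the ones that need the iterative disambiguation described above.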

get_molecules_from_SRS_batchmode(identifiers: list[str], mode: str) → list[list[moleculeresolver.molecule.Molecule] | None]¶

Retrieve molecules from SRS (Substance Registry Services) in batch mode.

This method queries the EPA’s Substance Registry Services to retrieve molecule information for multiple identifiers in a single batch request.

Args:

identifiers (list[str]): A list of chemical identifiers to search for in SRS.

mode (str): The type of identifier. Supported modes are determined by the _check_parameters method, typically including ‘name’ and ‘cas’.

Returns:

list[Optional[list[Molecule]]]: A list where each element corresponds to an input identifier. Each element is either a list of Molecule objects (if found) or None (if not found or invalid).

Raises:

TypeError: If any of the identifiers is not a string.

ValueError: If the input parameters fail validation in the _check_parameters method.

Notes:

The method first checks a cache for previously retrieved molecules.
It uses the SRS REST API to query the database.
The search is performed in batches to handle large numbers of identifiers efficiently.
It extracts SMILES, synonyms, CAS numbers, and other relevant information from the API response.
The method uses resilient network requests to handle potential connection issues.
Results are processed and matched to the original identifiers using the _match_SRS_results_to_identifiers method.
The SRS API has a limit on URL length, so the method splits large batches into smaller chunks.
Synonyms are filtered and sorted before being added to the Molecule objects.
API: https://www.postman.com/api-evangelist/workspace/environmental-protection-agency-epa/collection/35240-6b84cc71-ce77-48b8-babd-323eb8d670bd
New API: https://cdxappstest.epacdx.net/oms-substance-registry-services/swagger-ui/

Example:

>>> resolver = MoleculeResolver()
>>> identifiers = ["50-00-0", "64-17-5", "71-43-2"]
>>> molecules = resolver.get_molecules_from_SRS_batchmode(identifiers, mode="cas")
>>> for identifier, molecule in zip(identifiers, molecules):
...     if molecule:
...         print(f"Found molecule for {identifier}: {molecule[0].get_SMILES()}")
...     else:
...         print(f"No molecule found for {identifier}")
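Splitting a batch so that each request stays under a URL length limit could be done as sketched below. The limit of 2000 characters, the `|` separator, and the `split_for_url` helper are assumptions for illustration; the actual values used for SRS are not stated here.

```python
from urllib.parse import quote


def split_for_url(identifiers, max_length=2000, separator="|"):
    """Group identifiers so each joined, URL-encoded group fits in max_length."""
    separator_length = len(quote(separator, safe=""))
    chunks, current, current_length = [], [], 0
    for identifier in identifiers:
        encoded_length = len(quote(identifier, safe=""))
        extra = encoded_length + (separator_length if current else 0)
        if current and current_length + extra > max_length:
            # Close the current chunk and start a new one with this identifier.
            chunks.append(current)
            current, current_length = [], 0
            extra = encoded_length
        current.append(identifier)
        current_length += extra
    if current:
        chunks.append(current)
    return chunks


url_chunks = split_for_url(["50-00-0", "64-17-5", "71-43-2"], max_length=17)
```

Measuring the encoded length matters because characters such as spaces or the separator expand to three characters once percent-encoded.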

get_molecule_from_SRS(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from SRS (Substance Registry Services) based on the provided identifier and mode.

This method queries the EPA’s Substance Registry Services to retrieve molecule information using various types of chemical identifiers.

Args:

identifier (str): The chemical identifier to search for in SRS.

mode (str): The type of identifier. Supported modes are determined by the _check_parameters method, typically including ‘name’ and ‘cas’.

required_formula (Optional[str]): The expected molecular formula. Defaults to None.

required_charge (Optional[int]): The expected molecular charge. Defaults to None.

required_structure_type (Optional[str]): The expected structure type. Defaults to None.

Returns:

Optional[Molecule]: A Molecule object if a matching molecule is found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation in the _check_parameters method.

RuntimeError: If more than one molecule is found for the given identifier.

Notes:

The method first checks parameters using the _check_parameters method.
It then checks a cache for previously retrieved molecules.
The method uses the SRS REST API to query the database.
It constructs the API URL based on the mode and identifier.
The query uses resilient network requests to handle potential connection issues.
If a result is found, it processes the JSON response to extract relevant information.
The method extracts SMILES, primary names, synonyms, CAS numbers, and ITN (Internal Tracking Number).
It filters and sorts synonyms before creating the Molecule object.
The SMILES is checked against required formula, charge, and structure type if specified.
Only one molecule is returned; if multiple are found, it raises a RuntimeError.

Example:

>>> resolver = MoleculeResolver()
>>> molecule = resolver.get_molecule_from_SRS("50-00-0", mode="cas")
>>> if molecule:
...     print(f"Found molecule: {molecule.get_SMILES()}")
... else:
...     print("No matching molecule found")

_get_info_from_CIR(structure_identifier: str, representation: str, resolvers_to_use: tuple[str], expected_number_of_results: int | None = None) → list[str] | None¶

Retrieve chemical information from the Chemical Identifier Resolver (CIR).

This method queries the CIR API to obtain various representations of a chemical structure.

Args:

structure_identifier (str): The chemical identifier to search for.

representation (str): The desired representation of the chemical (e.g., ‘smiles’, ‘iupac_name’).

resolvers_to_use (tuple[str]): A tuple of resolver names to use in the query.

expected_number_of_results (Optional[int]): The expected number of results. Defaults to None.

Returns:

Optional[list[str]]: A list of strings containing the requested chemical information, or None if the query fails or no results are found.

Raises:

RuntimeError: If the number of results doesn’t match the expected number.

requests.exceptions.ConnectionError: If there’s a connection error during the API request.

Notes:

This method is cached to improve performance for repeated queries.
It uses a resilient request mechanism to handle potential network issues.
If CIR is down, it sets a flag to avoid further attempts in the same session.
There’s a 1-second delay between requests to avoid overwhelming the CIR server.
The method can handle connection reset errors by reinitializing the session.
API: https://cactus.nci.nih.gov/chemical/structure_documentation
API: https://search.r-project.org/CRAN/refmans/webchem/html/cir_query.html
API: https://github.com/mcs07/CIRpy
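
Two of the points above, building the request URL and spacing requests at least one second apart, can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `build_cir_url` and `MinIntervalThrottle` are hypothetical names, the URL layout follows the public CIR documentation linked above, and the resolver names are illustrative values.

```python
import time
from urllib.parse import quote, urlencode

CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def build_cir_url(structure_identifier, representation, resolvers=()):
    """Build a CIR request URL; the identifier is percent-encoded."""
    url = f"{CIR_BASE}/{quote(structure_identifier, safe='')}/{representation}"
    if resolvers:
        url += "?" + urlencode({"resolver": ",".join(resolvers)})
    return url


class MinIntervalThrottle:
    """Ensure successive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = MinIntervalThrottle(1.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()       # first request: no delay
fake_now[0] = 0.25    # a quarter of a second later
throttle.wait()       # the remaining 0.75 s is slept

url = build_cir_url("ethanol", "smiles", ("name_by_opsin", "name_by_cir"))
```

Injecting the clock and sleep functions keeps the throttle testable without real delays; in production use the defaults (`time.monotonic` and `time.sleep`) apply.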

Example:

>>> resolver = MoleculeResolver()
>>> smiles = resolver._get_info_from_CIR("ethanol", "smiles", ("name",))
>>> print(smiles)

get_iupac_name_from_CIR(SMILES: str) → str | None¶

Retrieve the IUPAC name for a given SMILES string using the Chemical Identifier Resolver (CIR).

This method uses the CIR API to convert a SMILES representation of a molecule to its IUPAC name.

Args:

SMILES (str): The SMILES string representation of the molecule.

Returns:

Optional[str]: The IUPAC name of the molecule if found, or None if not found or in case of an error.

Notes:

This method is cached to improve performance for repeated queries.
It internally uses the _get_info_from_CIR method to perform the API request.
The method specifically requests the ‘iupac_name’ representation and uses ‘smiles’ as the resolver.
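
The combination of caching and delegation described above can be sketched as follows. This is an assumption-laden illustration: `fake_cir_lookup` stands in for the real `_get_info_from_CIR` call, `functools.lru_cache` stands in for whatever cache the library actually uses, and `iupac_name_from_smiles` is a hypothetical name.

```python
from functools import lru_cache
from typing import Optional

lookups = []  # records how often the stand-in remote lookup runs


def fake_cir_lookup(structure_identifier, representation, resolvers):
    """Stand-in for the remote query; returns a list of strings or None."""
    lookups.append((structure_identifier, representation, resolvers))
    table = {("CCO", "iupac_name"): ["ethanol"]}
    return table.get((structure_identifier, representation))


@lru_cache(maxsize=None)
def iupac_name_from_smiles(smiles: str) -> Optional[str]:
    """Request the 'iupac_name' representation using the 'smiles' resolver."""
    names = fake_cir_lookup(smiles, "iupac_name", ("smiles",))
    return names[0] if names else None


name_first = iupac_name_from_smiles("CCO")
name_again = iupac_name_from_smiles("CCO")  # answered from the cache
```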

Example:

>>> resolver = MoleculeResolver()
>>> iupac_name = resolver.get_iupac_name_from_CIR("CCO")
>>> print(iupac_name)

get_molecule_from_CIR(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from Chemical Identifier Resolver (CIR) based on the provided identifier and mode.

This method queries the CIR to retrieve molecule information using various types of chemical identifiers.

Args:

identifier (str): The chemical identifier to search for in CIR.

mode (str): The type of identifier. Supported modes include ‘formula’, ‘name’, ‘smiles’, ‘inchi’, ‘inchikey’, and ‘cas’.

required_formula (Optional[str]): The expected molecular formula. Defaults to None.

required_charge (Optional[int]): The expected molecular charge. Defaults to None.

required_structure_type (Optional[str]): The expected structure type. Defaults to None.

Returns:

Optional[Molecule]: A Molecule object if a matching molecule is found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation in the _check_parameters method.

Notes:

The method first checks parameters using the _check_parameters method.
It then checks a cache for previously retrieved molecules.
The method uses different resolvers based on the input mode.
It retrieves SMILES representation from CIR using the _get_info_from_CIR method.
If a SMILES is found, it retrieves additional information like names and CAS numbers.
Synonyms and CAS numbers are filtered and sorted before being added to the Molecule object.
The resulting molecule is filtered based on the required formula, charge, and structure type if specified.
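
As an illustration of the CAS filtering and sorting step, the sketch below validates candidates with the standard CAS check-digit rule, removes duplicates, and sorts the remainder. The check-digit rule is the published CAS scheme; the function names and the sort order (shortest first, then lexicographic) are assumptions and may differ from what the library does.

```python
import re

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the format and the check digit of a CAS registry number."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    checksum = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))


def filter_and_sort_cas(candidates):
    """Drop invalid entries and duplicates, then sort (order is an assumption)."""
    unique = {c.strip() for c in candidates if is_valid_cas(c)}
    return sorted(unique, key=lambda c: (len(c), c))


cleaned = filter_and_sort_cas(
    ["64-17-5", " 64-17-5", "64-17-6", "ethanol", "7647-14-5"]
)
```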

Example:

>>> resolver = MoleculeResolver()
>>> molecule = resolver.get_molecule_from_CIR("ethanol", mode="name")
>>> if molecule:
...     print(f"Found molecule: {molecule.get_SMILES()}")
... else:
...     print("No matching molecule found")

get_molecule_from_NIST(identifier: str, mode: str, required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Retrieve a molecule from NIST (National Institute of Standards and Technology) based on the provided identifier and mode.

This method queries the NIST Chemistry WebBook to retrieve molecule information using various types of chemical identifiers.

Args:

identifier (str): The chemical identifier to search for in NIST.

mode (str): The type of identifier. Supported modes include ‘formula’, ‘name’, ‘cas’, ‘inchi’, and ‘smiles’.

required_formula (Optional[str]): The expected molecular formula. Defaults to None.

required_charge (Optional[int]): The expected molecular charge. Defaults to None.

required_structure_type (Optional[str]): The expected structure type. Defaults to None.

Returns:

Optional[Molecule]: A Molecule object if a matching molecule is found and meets all requirements, None otherwise.

Raises:

ValueError: If the input parameters fail validation in the _check_parameters method.

RuntimeError: If the webpage format has changed and cannot be parsed.

Notes:

The method first checks parameters using the _check_parameters method.
It then checks a cache for previously retrieved molecules.
For ‘smiles’ mode, it converts the SMILES to InChI before querying NIST.
The method uses web scraping techniques to extract information from the NIST Chemistry WebBook.
It handles different response formats, including search results and direct molecule pages.
The method extracts InChI, CAS numbers, names, and synonyms from the HTML content.
It converts InChI to SMILES using the InChI_to_SMILES method.
Synonyms and CAS numbers are filtered and sorted before being added to the Molecule object.
The resulting molecule is filtered based on the required formula, charge, and structure type if specified.
For search result pages, it recursively queries each result until a matching molecule is found.
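
The last note, following each entry of a search result page until one matches, can be sketched with synthetic data. Nothing below touches the network or parses real HTML: `FAKE_PAGES`, its keys, and `find_first_match` are hypothetical stand-ins for fetched and parsed WebBook pages.

```python
from typing import Optional

# Synthetic stand-in for fetched pages: a page either lists result links
# or describes a single molecule.
FAKE_PAGES = {
    "search?Formula=C2H6O": {"results": ["ID=C115106", "ID=C64175"]},
    "ID=C115106": {"inchi": "InChI=1S/C2H6O/c1-3-2/h1-2H3", "name": "dimethyl ether"},
    "ID=C64175": {"inchi": "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3", "name": "ethanol"},
}


def find_first_match(page_key, accept, pages=FAKE_PAGES, _seen=None) -> Optional[dict]:
    """Depth-first walk over result pages; return the first accepted molecule."""
    seen = set() if _seen is None else _seen
    if page_key in seen or page_key not in pages:
        return None
    seen.add(page_key)  # guards against result pages that link to each other

    page = pages[page_key]
    if "results" in page:
        for link in page["results"]:
            match = find_first_match(link, accept, pages, seen)
            if match is not None:
                return match
        return None
    return page if accept(page) else None


hit = find_first_match("search?Formula=C2H6O", lambda m: m["name"] == "ethanol")
miss = find_first_match("search?Formula=C2H6O", lambda m: False)
```

The `accept` callback plays the role of the required formula, charge, and structure type checks.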

Example:

>>> resolver = MoleculeResolver()
>>> molecule = resolver.get_molecule_from_NIST("64-17-5", mode="cas")
>>> if molecule:
...     print(f"Found molecule: {molecule.get_SMILES()}")
... else:
...     print("No matching molecule found")

find_salt_molecules_and_stoichometric_coefficients(identifiers: list[str], modes: list[str] | None = ['name'], required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None, services_to_use: list[str] | None = None, search_iupac_name: bool | None = False, interactive: bool | None = False, minimum_number_of_crosschecks: int | None = 1, ignore_exceptions: bool | None = False) → tuple[list, list[int]]¶

Finds salt molecules based on the provided identifiers and criteria.

This method searches for salt molecules using the specified identifiers and optional parameters. It checks the structure type, charge, and other attributes to ensure accurate identification of salt compounds. If the search is unsuccessful, it attempts to derive information from synonyms or provided SMILES strings.

Args:

identifiers (list[str]): A list of identifiers for the salt to search for.

modes (Optional[list[str]]): The modes of identification for the identifiers. Defaults to [‘name’].

required_formula (Optional[str]): A chemical formula that the molecules must match. Defaults to None, meaning no specific formula is required.

required_charge (Optional[int]): The charge that the molecules must possess. Defaults to None, which sets the charge to 0.

required_structure_type (Optional[str]): The required structure type (e.g., “salt”). Defaults to None, which is treated as “salt”.

services_to_use (Optional[list[str]]): Specific services to be used for retrieving data. Defaults to None, meaning all services are available.

search_iupac_name (Optional[bool]): If True, attempts to search using the IUPAC name. Defaults to False.

interactive (Optional[bool]): If True, allows for interactive user input if necessary. Defaults to False.

minimum_number_of_crosschecks (Optional[int]): Minimum number of services to cross-check for validity. Defaults to 1.

ignore_exceptions (Optional[bool]): If True, ignores exceptions that may occur during the search process. Defaults to False.

Returns:

tuple[list, list[int]]: A tuple containing:

A list of found molecules, where each molecule is represented as a tuple of relevant information.

A list of stoichiometric coefficients corresponding to the found molecules.

Raises:

ValueError: If the provided SMILES does not represent a salt.

NotImplementedError: If the functionality for salts with more than 2 ions is invoked.

Example:

>>> find_salt_molecules_and_stoichometric_coefficients(["NaCl", "sodium chloride"], modes=['name', 'name'], required_charge=0)
([salt_molecule, cation_molecule, anion_molecule], [...stoichiometric coefficients...])
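
For a salt made of one kind of cation and one kind of anion, the stoichiometric coefficients follow from charge neutrality. The sketch below shows that arithmetic only; `ion_coefficients` is a hypothetical helper, and how the library derives or orders its coefficients is not specified here.

```python
from math import gcd


def ion_coefficients(cation_charge: int, anion_charge: int):
    """Return the smallest (n_cation, n_anion) that gives a neutral salt."""
    if cation_charge <= 0 or anion_charge >= 0:
        raise ValueError("Expected a positive cation and a negative anion charge")
    common = gcd(cation_charge, -anion_charge)
    # n_cation * cation_charge + n_anion * anion_charge == 0
    return (-anion_charge // common, cation_charge // common)


nacl = ion_coefficients(1, -1)       # Na+ and Cl-
k2so4 = ion_coefficients(1, -2)      # K+ and SO4(2-)
al2_so4_3 = ion_coefficients(3, -2)  # Al(3+) and SO4(2-)
caso4 = ion_coefficients(2, -2)      # Ca(2+) and SO4(2-)
```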

static _validate_resolution_mode(resolution_mode: ResolutionMode) → None¶

Validate the chosen resolution mode.

_collect_opsin_isomer_matches(grouped_molecules: dict[str, list[moleculeresolver.molecule.Molecule]], candidate_smiles: list[str]) → dict[str, bool]¶

Check whether candidate groups have at least one OPSIN-confirmed isomeric match.

find_single_molecule(identifiers: list[str], modes: list[str] | None = ['name'], required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None, services_to_use: list[str] | None = None, search_iupac_name: bool | None = False, interactive: bool | None = False, ignore_exceptions: bool | None = False, search_strategy: SearchStrategy = 'first_hit', resolution_mode: ResolutionMode = 'legacy') → moleculeresolver.molecule.Molecule | None¶

Searches for a single molecule across multiple chemical databases and services.

This method attempts to find a molecule based on the provided identifiers and modes, using various chemical databases and services. It returns the first matching molecule that satisfies all the specified requirements.

Args:

identifiers (list[str]): A list of identifiers for the molecule (e.g., names, formulas, CAS numbers).

modes (Optional[list[str]]): A list of modes corresponding to each identifier. Defaults to [‘name’].

required_formula (Optional[str]): The required molecular formula. Defaults to None.

required_charge (Optional[int]): The required molecular charge. Defaults to None.

required_structure_type (Optional[str]): The required structure type. Defaults to None.

services_to_use (Optional[list[str]]): A list of services to use for the search. If None, all available services will be used. Defaults to None.

search_iupac_name (Optional[bool]): Whether to search for IUPAC names. Defaults to False.

interactive (Optional[bool]): Whether to run in interactive mode. Defaults to False.

ignore_exceptions (Optional[bool]): Whether to ignore exceptions during the search. Defaults to False.

search_strategy (str): Search strategy. “first_hit” keeps legacy behavior; “exhaustive” evaluates all identifier/service combinations.

resolution_mode (str): Resolution mode. Included for API consistency and future expansion. Accepted values are “legacy”, “consensus”, “strict_isomer”.

Returns:

Optional[Molecule]: A Molecule object if found, None otherwise.

Raises:

Various exceptions depending on the services used and error conditions encountered.

Notes:

This method searches through multiple chemical databases and services in a specific order. It stops and returns the first matching molecule that satisfies all the specified requirements. The search order and the exact behavior may depend on the available services and the provided parameters.
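
The difference between the “first_hit” and “exhaustive” strategies can be sketched with stand-in services. This is an illustration only: the `search` function, the service names, and the iteration order (identifiers in the outer loop, services in the inner loop) are assumptions, not the library's documented behavior.

```python
from itertools import product


def search(identifiers, services, strategy="first_hit"):
    """Return (service, identifier, smiles) hits according to the strategy."""
    if strategy not in ("first_hit", "exhaustive"):
        raise ValueError(f"Unknown search strategy: {strategy!r}")

    hits = []
    for identifier, (service_name, lookup) in product(identifiers, services.items()):
        smiles = lookup(identifier)
        if smiles is None:
            continue
        hits.append((service_name, identifier, smiles))
        if strategy == "first_hit":
            break  # stop at the first service that answers
    return hits


# Stand-in services: each maps an identifier to a SMILES string or None.
services = {
    "service_a": lambda identifier: None,
    "service_b": {"ethanol": "CCO"}.get,
    "service_c": {"ethanol": "CCO", "EtOH": "CCO"}.get,
}

first = search(["ethanol", "EtOH"], services)
everything = search(["ethanol", "EtOH"], services, strategy="exhaustive")
```

An exhaustive pass costs more requests but yields the per-service evidence that cross-checking needs.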

find_single_molecule_interactively(identifiers: list[str], modes: list[str] | None = ['name'], required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None) → moleculeresolver.molecule.Molecule | None¶

Interactively searches for a single molecule based on user input.

This method prompts the user to input various identifiers for a molecule and attempts to find a matching molecule that satisfies all specified requirements. It allows for iterative refinement of the search based on user feedback.

Args:

identifiers (list[str]): Initial list of identifiers for the molecule.

modes (Optional[list[str]]): List of modes corresponding to each identifier. Defaults to [‘name’].

required_formula (Optional[str]): The required molecular formula. Defaults to None.

required_charge (Optional[int]): The required molecular charge. Defaults to None.

required_structure_type (Optional[str]): The required structure type. Defaults to None.

Returns:

Optional[Molecule]: A Molecule object if a matching molecule is found and confirmed by the user, None if the search is aborted or no matching molecule is found.

Raises:

Various exceptions may be raised during the molecule search process.

Notes:

The method uses a combination of automated searches and user input.
It supports various input formats including PubChem CID, name, CAS number, SMILES, and InChI.
The user can view and confirm the molecular structure before accepting the result.
If no synonyms are found, the user is prompted to provide a name for the molecule.
The method allows the user to input or confirm the CAS number.
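
Interpreting free-form user input requires deciding which kind of identifier was typed. The heuristic below is a hypothetical sketch of such a classifier, not the library's logic; the mode label 'cid' is an assumption, and SMILES detection is deliberately left out because it needs a cheminformatics parser to be reliable.

```python
import re


def guess_identifier_mode(text: str) -> str:
    """Guess the identifier type from its surface form (heuristic only)."""
    value = text.strip()
    if value.startswith("InChI="):
        return "inchi"
    if re.fullmatch(r"\d{2,7}-\d{2}-\d", value):
        return "cas"
    if value.isdigit():
        return "cid"  # assumed label for a PubChem compound ID
    return "name"     # fallback; SMILES strings would also land here


guesses = [
    guess_identifier_mode(s)
    for s in ["InChI=1S/C2H6O/c1-2-3/h3H,1-2H3", "64-17-5", "702", " ethanol "]
]
```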

static _rank_candidate_evidence(candidate_evidence: list[moleculeresolver.resolution.CandidateEvidence]) → list[moleculeresolver.resolution.CandidateEvidence]¶

Rank candidate evidence by agreement and concordance.

_build_candidate_evidence(grouped_molecules: dict[str, list[moleculeresolver.molecule.Molecule]]) → list[moleculeresolver.resolution.CandidateEvidence]¶

Create base evidence objects with service and identifier concordance.

_build_resolution_result(best_molecule: moleculeresolver.molecule.Molecule | None, grouped_molecules: dict[str, list[moleculeresolver.molecule.Molecule]], selected_smiles: str | None, selection_reason: str) → moleculeresolver.resolution.ResolutionResult¶

Build full include_evidence payload from grouped candidate molecules.

find_single_molecule_crosschecked(identifiers: list[str], modes: list[str] | None = ['name'], required_formula: str | None = None, required_charge: int | None = None, required_structure_type: str | None = None, services_to_use: list[str] | None = None, search_iupac_name: bool | None = False, minimum_number_of_crosschecks: int | None = 1, try_to_choose_best_structure: bool | None = True, ignore_exceptions: bool | None = False, search_strategy: SearchStrategy = 'first_hit', resolution_mode: ResolutionMode = 'legacy', include_evidence: bool = False) → moleculeresolver.molecule.Molecule | None | list[moleculeresolver.molecule.Molecule | None] | moleculeresolver.resolution.ResolutionResult¶

Finds a single molecule with cross-checking across multiple services.

This method searches for a molecule using the provided identifiers and modes, cross-checking the results across multiple chemical services to ensure accuracy.

Args:

identifiers (list[str]): List of identifiers for the molecule.

modes (Optional[list[str]]): List of search modes. Defaults to [‘name’].

required_formula (Optional[str]): Required molecular formula. Defaults to None.

required_charge (Optional[int]): Required molecular charge. Defaults to None.

required_structure_type (Optional[str]): Required structure type. Defaults to None.

services_to_use (Optional[list[str]]): List of services to use. If None, all available services are used.

search_iupac_name (Optional[bool]): Whether to search for IUPAC names. Defaults to False.

minimum_number_of_crosschecks (Optional[int]): Minimum number of services that must agree. Defaults to 1.

try_to_choose_best_structure (Optional[bool]): Whether to attempt to select the best structure. Defaults to True.

ignore_exceptions (Optional[bool]): Whether to ignore exceptions during search. Defaults to False.

search_strategy (str): Search strategy. “first_hit” keeps legacy behavior; “exhaustive” evaluates all identifier/service combinations.

resolution_mode (str): Resolution mode. Accepted values are “legacy”, “consensus”, “strict_isomer”.

include_evidence (bool): If True, return ResolutionResult with evidence payload.

Returns:

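Union[Optional[Molecule], list[Optional[Molecule]], ResolutionResult]: A single Molecule object if a best structure is chosen, a list of Molecule objects if multiple structures are found and no best structure is chosen, or None if no matching molecule is found. If include_evidence is True, a ResolutionResult with the evidence payload is returned instead.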

Raises:

ValueError: If minimum_number_of_crosschecks exceeds the number of services used.

Notes:

The method searches across multiple services and cross-checks the results.
It filters and groups molecules based on structure similarity.
If try_to_choose_best_structure is True, it attempts to select the most reliable structure.
The method uses various heuristics to resolve conflicts between different services.
OPSIN is given preference for name-based searches when available.
ChEBI results are given lower priority in case of conflicts.
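
The core of cross-checking, grouping results by structure and keeping only structures confirmed by enough services, can be sketched as follows. This is a simplified illustration: `crosscheck` is a hypothetical function, structures are compared by exact string equality (which assumes the SMILES were canonicalized beforehand), and the service-specific preferences mentioned above are not modeled.

```python
from collections import defaultdict


def crosscheck(results_by_service, minimum_number_of_crosschecks=1):
    """Group services by the SMILES they returned and filter by agreement."""
    if minimum_number_of_crosschecks > len(results_by_service):
        raise ValueError(
            "minimum_number_of_crosschecks exceeds the number of services used"
        )

    groups = defaultdict(list)
    for service, smiles in results_by_service.items():
        if smiles is not None:  # services that found nothing do not vote
            groups[smiles].append(service)

    accepted = {
        smiles: sorted(found_by)
        for smiles, found_by in groups.items()
        if len(found_by) >= minimum_number_of_crosschecks
    }
    # Most widely confirmed structure first; ties broken alphabetically.
    return sorted(accepted.items(), key=lambda item: (-len(item[1]), item[0]))


results = {"opsin": "CCO", "pubchem": "CCO", "chebi": "COC", "cir": None}
confirmed = crosscheck(results, minimum_number_of_crosschecks=2)
all_candidates = crosscheck(results, minimum_number_of_crosschecks=1)
```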

find_multiple_molecules_parallelized(identifiers: list[str], modes: list[str] | None = ['name'], required_formulas: list[str] | None = None, required_charges: list[int] | None = None, required_structure_types: list[str] | None = None, services_to_use: list[str] | None = None, search_iupac_name: bool | None = False, minimum_number_of_crosschecks: int | None = 1, try_to_choose_best_structure: bool | None = True, progressbar: bool | None = True, max_workers: int | None = 5, ignore_exceptions: bool = True, search_strategy: SearchStrategy = 'first_hit', resolution_mode: ResolutionMode = 'legacy') → list[moleculeresolver.molecule.Molecule | None]¶

Finds multiple molecules in parallel based on provided identifiers and criteria.

This method utilizes multithreading to search for multiple molecules concurrently. It checks various attributes such as required formulas, charges, and structure types, and fetches data from specified services, with options for progress tracking and batch processing.

Args:

identifiers (list[str]): A list of identifiers for the molecules to search for.

modes (Optional[list[str]]): The modes of identification for the identifiers. Defaults to [‘name’].

required_formulas (Optional[list[str]]): A list of chemical formulas that the molecules must match. Defaults to None, which means no specific formula is required for each identifier.

required_charges (Optional[list[int]]): A list of charges that the molecules must possess. Defaults to None, indicating no specific charge is required.

required_structure_types (Optional[list[str]]): A list of required structure types (e.g., “salt”). Defaults to None, meaning no specific structure type is required.

services_to_use (Optional[list[str]]): Specific services to be used for retrieving data. Defaults to None, meaning all available services are considered.

search_iupac_name (Optional[bool]): If True, attempts to search using the IUPAC name. Defaults to False.

minimum_number_of_crosschecks (Optional[int]): Minimum number of services to cross-check for validity. Defaults to 1.

try_to_choose_best_structure (Optional[bool]): If True, attempts to select the best structure among the results. Defaults to True.

progressbar (Optional[bool]): If True, displays a progress bar during the search. Defaults to True.

max_workers (Optional[int]): Maximum number of threads to use for parallel processing. Defaults to 5.

ignore_exceptions (Optional[bool]): If True, ignores exceptions that may occur during the search process. Defaults to True.

search_strategy (str): Search strategy. “first_hit” keeps legacy behavior; “exhaustive” evaluates all identifier/service combinations.

resolution_mode (str): Resolution mode. Accepted values are “legacy”, “consensus”, “strict_isomer”.

Returns:

list[Optional[Molecule]]: A list of found molecules, where each molecule is represented as an instance of the Molecule class, or None if not found.

Raises:

ValueError: If any of the identifiers or parameters are invalid.

Example:

>>> find_multiple_molecules_parallelized(
...     identifiers=["NaCl", "K2SO4"],
...     modes=['name'],
...     required_charges=[0, -2]
... )
[<Molecule instance for NaCl>, <Molecule instance for K2SO4>]
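
The threading pattern described above (a bounded worker pool, results in input order, and exceptions optionally turned into None) can be sketched with the standard library. `resolve_many` and `fake_resolve` are hypothetical stand-ins; the real method performs a cross-checked search per identifier rather than a dictionary lookup.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_many(identifiers, resolve_one, max_workers=5, ignore_exceptions=True):
    """Resolve identifiers concurrently; results keep the input order."""

    def safe(identifier):
        try:
            return resolve_one(identifier)
        except Exception:
            if ignore_exceptions:
                return None  # a failed lookup becomes a missing result
            raise

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(safe, identifiers))


def fake_resolve(identifier):
    """Stand-in for a single-molecule search (no network access)."""
    if identifier == "broken":
        raise RuntimeError("simulated service failure")
    table = {"ethanol": "CCO", "sodium chloride": "[Na+].[Cl-]"}
    return table.get(identifier)


resolved = resolve_many(
    ["ethanol", "broken", "sodium chloride", "unknown"], fake_resolve
)
```

Threads suit this workload because each lookup spends most of its time waiting on network responses.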

moleculeresolver.__version__¶
