Package resokit.datasets
The resokit.datasets package includes tools for loading datasets.
- resokit.datasets.clear_memory(which: str, verbose: bool = True, files: bool = False) -> None
Clear the memory for the specified dataset.
- Parameters:
which (str) – Which dataset (‘eu’, ‘nasa’, ‘datasets’, ‘p’, ‘s’, ‘binary’, ‘all’).
verbose (bool, optional. Default: True.) – Whether to print informational messages.
files (bool, optional. Default: False.) – If True, also removes the files from disk.
- resokit.datasets.load(source: str, from_memory: bool = True, from_zip: str | Path | bool = True, from_file: str | Path | bool = True, dir_path: str | Path | bool | None = True, to_resokit: bool = True, to_df: bool = False, check_age: bool = False, only_index: bool = False, only_rows: list | int = False, verbose: bool = True, store: bool | str = True, store_index: bool | str = True) -> DataFrame | ResokitDataFrame | ResoKitDataset | None
Load the dataset from a specified source.
The dataset is loaded from memory if it is already stored there; otherwise it is read from a ZIP archive or a CSV file. Priority goes to the dataset stored in memory, then to the ZIP archive, and finally to the file.
Note
Storing the dataset in memory is useful for faster access and to avoid reading the file multiple times.
Note
If both from_file and from_zip are provided, the file inside the ZIP archive is assumed to be the one given in from_file. The resulting path is: dir_path / zip_name / file_name.
- Parameters:
source (str) – Identifier for the data source (‘eu’ or ‘nasa’).
from_memory (bool, optional. Default: True.) – If True, loads the dataset from memory if available.
from_zip (str or Path or bool, optional. Default: True.) – Path to the ZIP archive to load the dataset. If True, default ZIP filename is used. If False, the file is not loaded from the ZIP archive.
from_file (str or Path or bool, optional. Default: True.) – Path to the file to load the dataset. If True, default filename is used. If False, the file is not loaded.
dir_path (str, Path or bool, optional. Default: True.) – Directory path to load the dataset from. If True or None the default directory is used.
to_resokit (bool, optional. Default: True.) – If True, returns the dataset including only the columns required by ResoKit.
to_df (bool, optional. Default: False.) – If True, returns the raw dataset as a pandas DataFrame. If False, returns the dataset as a ResoKitDataset.
check_age (bool, optional. Default: False.) – If True, displays the file’s last modified date.
only_index (bool, optional. Default: False.) – If True, loads only the index columns. If p or a string starting with “p”, loads the parsed index columns. Only compatible with from_memory=True. If not previously stored, None is returned.
only_rows (list or int, optional. Default: False.) – If provided, loads only the specified rows. Remember that Python is 0-indexed, so the first row (system) is 0.
verbose (bool, optional. Default: True.) – If True, prints messages about the process.
store (bool, str, optional. Default: True.) – If True, stores the dataset in memory. If a str, any of “f”, “y”, “s” or “o” overwrites the stored dataset.
store_index (bool, str, optional. Default: True.) – If True, stores the dataset index in memory. If a str, any of “f”, “y”, “s” or “o” overwrites the stored index. If only_rows is provided, the index is not stored.
- Returns:
dataset – The loaded dataset as a pandas DataFrame or a ResoKitDataset.
- Return type:
DataFrame or ResoKitDataset or None
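The loading priority and the path layout described for load can be pictured with a small standalone sketch. It does not call ResoKit: pick_origin and member_path are hypothetical helpers, and the directory and file names are made up for illustration only.

```python
from pathlib import Path
from typing import Optional


def pick_origin(in_memory: bool, zip_available: bool, file_available: bool) -> Optional[str]:
    """Return the origin that would win under the documented priority."""
    # Documented order: memory first, then the ZIP archive, then the plain file.
    if in_memory:
        return "memory"
    if zip_available:
        return "zip"
    if file_available:
        return "file"
    return None  # nothing to load from


def member_path(dir_path: str, zip_name: str, file_name: str) -> Path:
    """Compose the path as described in the note: dir_path / zip_name / file_name."""
    return Path(dir_path) / zip_name / file_name


# Hypothetical names, used only to show the composition.
origin = pick_origin(in_memory=False, zip_available=True, file_available=True)
path = member_path("data", "exoplanet_eu.zip", "exoplanet_eu.csv")
```

Here the ZIP archive wins over the file because nothing is stored in memory; how ResoKit resolves its default names is not shown by this sketch.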
- resokit.datasets.download(source: str, to_memory: bool = True, to_file: str | Path | bool = True, to_zip: str | Path | bool = True, dir_path: str | Path | bool | None = True, overwrite: bool = False, soft: bool = False, check_outd: bool = True, only_new_rows: bool = False, to_resokit: None | bool = None, verbose: bool = True, chunk_size: int = 1024, print_size: float = 0.15) -> Path | DataFrame | ResoKitDataset | None | dict
Download a dataset from a specified source and save it locally.
The dataset is downloaded from the online NASA or exoplanet.eu databases and can be stored in a file, a ZIP archive, in memory, and/or simply returned.
Note
Requires the requests library.
- Parameters:
source (str) – Identifier for the data source (‘eu’ or ‘nasa’). If “all” or “both”, downloads both datasets.
to_memory (bool, optional. Default: True.) – If True, stores the dataset in memory.
to_file (str or Path or bool, optional. Default: True.) – Path or str to the file to store the dataset. If True, default filename is used. If False, the file is neither saved nor created.
to_zip (str or Path or bool, optional. Default: True.) – Path or str to the ZIP archive to store the dataset. If True, default ZIP filename is used. If False, the file is neither saved nor created in the ZIP archive.
dir_path (str or Path or bool or None. Default: True) – Directory path to save the dataset, or path to the ZIP archive. If None or True the default directory is used.
overwrite (bool, optional. Default: False.) – If True, overwrites the file if it already exists. The dataset and index stored in memory are always overwritten, regardless of this parameter.
soft (bool, optional. Default: False.) – If True, prints a message instead of raising an error when the file exists and overwrite = False.
check_outd (bool, optional. Default: True.) – Whether to check if the dataset is already up-to-date.
only_new_rows (bool, optional. Default: False.) – If True, queries only the rows updated after the latest local row-update; if no previous local dataset exists, an error is raised. If False, the whole dataset is downloaded.
to_resokit (bool, dict, optional. Default: None.) – If True, returns the dataset as a ResoKitDataset. If False, returns the dataset as a pandas DataFrame. If None, returns the path to the downloaded file.
verbose (bool, optional. Default: True.) – If True, displays messages about the download process.
chunk_size (int, optional. Default: 1024.) – Size of the chunks to download the dataset, in bytes. Default is 1024 bytes (1 KB).
print_size (float, optional. Default: 0.15.) – Update frequency for the download progress bar.
- Returns:
downloaded – Path to the downloaded dataset (and/or ZIP archive), or the dataset if to_resokit is not None.
- Return type:
Path or pd.DataFrame or ResoKitDataset or None
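To make the roles of chunk_size and print_size concrete, here is a minimal sketch of a chunked copy with throttled progress reports. It is not ResoKit's implementation: an in-memory stream stands in for the HTTP response, copy_in_chunks is a hypothetical helper, and treating print_size as a fraction of the total size is an assumption.

```python
import io
from typing import BinaryIO, List, Tuple


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, total: int,
                   chunk_size: int = 1024,
                   print_size: float = 0.15) -> Tuple[int, List[float]]:
    """Copy src to dst chunk by chunk and collect the progress fractions reported."""
    if chunk_size <= 0 or print_size <= 0:
        raise ValueError("chunk_size and print_size must be positive")
    done = 0
    reports: List[float] = []
    next_report = print_size  # assumption: print_size is a fraction of the total
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        done += len(chunk)
        fraction = done / total if total else 1.0
        if fraction >= next_report:
            reports.append(round(fraction, 3))
            # Advance to the first threshold above the current fraction.
            while next_report <= fraction:
                next_report += print_size
    return done, reports


payload = bytes(10 * 1024)  # 10 KB of zeros standing in for a download
sink = io.BytesIO()
copied, progress = copy_in_chunks(io.BytesIO(payload), sink, total=len(payload),
                                  chunk_size=1024, print_size=0.25)
```

A smaller print_size yields more frequent reports, and a larger chunk_size reduces the number of reads at the cost of coarser progress steps.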
- resokit.datasets.update(source: str, load_kwargs: Dict | None = None, to_memory: bool = True, to_file: str | Path | bool = True, to_zip: str | Path | bool = True, dir_path: str | Path | bool | None = True, overwrite: bool = False, check_outd: bool = True, to_resokit: None | bool = None, verbose: bool = True) -> Path | DataFrame | ResoKitDataset | None | dict
Update the local dataset with new rows from a specified source.
This function is a wrapper for resokit.datasets.download(…, only_new_rows=True), but the dataset must already exist locally, because it is loaded first. Download progress is not printed for this function.
Note
Requires the requests library.
- Parameters:
source (str) – Identifier for the data source (‘eu’ or ‘nasa’). If “all” or “both”, downloads both datasets.
load_kwargs (dict or None, optional. Default: None.) – Dictionary with keyword arguments for the resokit.load function. If None, the default arguments are used.
to_memory (bool, optional. Default: True.) – If True, stores the dataset in memory.
to_file (str or Path or bool, optional. Default: True.) – Path or str to the file to store the dataset. If True, default filename is used. If False, the file is neither saved nor created.
to_zip (str or Path or bool, optional. Default: True.) – Path or str to the ZIP archive to store the dataset. If True, default ZIP filename is used. If False, the file is neither saved nor created in the ZIP archive.
dir_path (str or Path or bool or None. Default: True) – Directory path to save the dataset, or path to the ZIP archive. If None or True the default directory is used.
overwrite (bool, optional. Default: False.) – If True, overwrites the file if it already exists. The dataset and index stored in memory are always overwritten, regardless of this parameter.
check_outd (bool, optional. Default: True.) – Whether to check if the dataset is already up-to-date.
to_resokit (bool, dict, optional. Default: None.) – If True, returns the dataset as a ResoKitDataset. If False, returns the dataset as a pandas DataFrame. If None, returns the path to the downloaded file.
verbose (bool, optional. Default: True.) – If True, displays messages about the download process.
- Returns:
updated – Path to the updated dataset (and/or ZIP archive), or the dataset if to_resokit is not None.
- Return type:
Path or pd.DataFrame or ResoKitDataset or None
- resokit.datasets.check_outdated(which: str = 'both', verbose: bool = True, soft=True) -> bool | Tuple[bool, bool]
Check if the specified stored dataset is outdated.
- Parameters:
which (str, optional. Default: 'both') – Which dataset (‘eu’ or ‘nasa’). If ‘both’, then both ‘eu’ and ‘nasa’. If ‘all’, then ‘both’ and both binaries too.
verbose (bool, optional. Default: True.) – Whether to print informational messages.
- Returns:
outdated – Whether the dataset is outdated.
- Return type:
bool or Tuple[bool, bool]
- resokit.datasets.load_binary(which: str | bool, from_memory: bool = True, from_file: str | bool = True, dir_path: str | Path | bool = True, rename_columns: bool = True, ret_header: bool = False, inferr: bool = False, clean: bool = True, verbose: bool = True) -> DataFrame | str
Load a binary dataset.
- Parameters:
which (str, bool) – Which dataset to load: ‘circumbinary’ or ‘c’ or ‘p’ for the p-type circumbinaries dataset, ‘simple’ or ‘s’ for the s-type binaries dataset. If True, loads the default dataset (circumbinary). If False, loads the simple binary dataset.
from_memory (bool, optional. Default: True.) – If True, loads the dataset from memory if available.
from_file (str or bool, optional. Default: True.) – If True, default filename is used. If False, the file is not loaded.
dir_path (str, Path or bool, optional. Default: True.) – Directory path to load the dataset from. If True or None the default directory is used.
rename_columns (bool, optional. Default: True.) – If True, rename the columns for human readability.
ret_header (bool, optional. Default: False.) – If True, return the header. If False, return the data.
inferr (bool, optional. Default: False.) – If False (recommended), the width of the columns is fixed. If True, the width of the columns is inferred while parsing. Use True only if the dataset cannot be parsed with fixed-width columns.
clean (bool, optional. Default: True.) – If True, replace the unknown values with NaN.
verbose (bool, optional. Default: True.) – If True, print the header and messages.
- Returns:
- header (str), if ret_header is True.
The header of the dataset.
- data (pd.DataFrame), if ret_header is False.
The dataset as a pandas DataFrame.
- Return type:
Union[pd.DataFrame, str]
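The accepted values of which can be summarized with a short normalization sketch. It only mirrors the aliases listed in the parameter description; normalize_which is a hypothetical helper, not part of ResoKit, and the case-insensitive matching is an assumption.

```python
from typing import Union

# Aliases taken from the parameter description of load_binary.
P_TYPE_ALIASES = {"circumbinary", "c", "p"}
S_TYPE_ALIASES = {"simple", "s"}


def normalize_which(which: Union[str, bool]) -> str:
    """Map a `which` value to 'p' (circumbinary) or 's' (simple binary)."""
    if isinstance(which, bool):
        # True selects the default (circumbinary); False selects the simple binaries.
        return "p" if which else "s"
    key = which.strip().lower()  # assumption: matching ignores case and whitespace
    if key in P_TYPE_ALIASES:
        return "p"
    if key in S_TYPE_ALIASES:
        return "s"
    raise ValueError(f"unknown binary dataset: {which!r}")


kinds = [normalize_which(w) for w in ("circumbinary", "s", True, False)]
```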
- resokit.datasets.download_binary(which: str, to_file: str | Path | bool = True, dir_path: str | Path | bool | None = True, to_memory: bool = True, return_data: bool = True, overwrite: bool = False, soft: bool = True, verbose: bool = True, chunk_size: int = 1024, print_size: float = 1e-05) -> Path | DataFrame | None | dict
Download a binary dataset from a specified source and save it locally.
The dataset is downloaded from the internet and can be stored in a file, in memory, and/or simply returned.
Note
Requires the requests library.
- Parameters:
which (str) – Which dataset to download: ‘circumbinary’ or ‘c’ or ‘p’ for the p-type circumbinaries dataset, ‘simple’ or ‘s’ for the s-type binaries dataset. If “all” or “both”, downloads both datasets.
to_file (str or Path or bool, optional. Default: True.) – Path or str to the file to store the dataset. If True, default filename is used. If False, the file is neither saved nor created.
dir_path (str or Path or bool or None. Default: True.) – Directory path to save the dataset. If None or True, the default directory is used.
to_memory (bool, optional. Default: True.) – If True, stores the dataset in memory.
return_data (bool, optional. Default: True.) – If True, returns the dataset.
overwrite (bool, optional. Default: False.) – If True, overwrites the file if it already exists. It also overwrites the stored dataset in memory.
soft (bool, optional. Default: True.) – If True, prints a message instead of raising an error when the file exists and overwrite = False.
verbose (bool, optional. Default: True.) – If True, displays messages about the download process.
chunk_size (int, optional. Default: 1024.) – Size of the chunks to download the dataset, in bytes. Default is 1024 bytes (1 KB).
print_size (float, optional. Default: 1e-05.) – Update frequency for the download progress bar.
- Returns:
downloaded – Path to the downloaded dataset, or the dataset if return_data is True, or None.
- Return type:
Path or pd.DataFrame or str or None
- resokit.datasets.check_binary_outdated(which: str | bool = 'both', verbose: bool = True, soft=True) -> bool | Tuple[bool, bool]
Check if the specified stored binary dataset is outdated.
- Parameters:
which (str, bool) – Which dataset: ‘p’ (circumbinary) or ‘s’ (single binary). If ‘both’ or ‘all’, both datasets are checked. If True, circumbinary; if False, single binary.
verbose (bool, optional. Default: True.) – Whether to print informational messages.
- Returns:
outdated – Whether the dataset is outdated.
- Return type:
bool or Tuple[bool, bool]
- resokit.datasets.query_new_rows(source: str, check_outd: bool = True, to_resokit: None | bool = False, verbose: bool = True, rename: bool = True, load_kwargs: Dict | None = None) -> DataFrame | ResoKitDataset | Tuple
Query online the rows updated after the latest local dataset row-update.
The rows are queried according to the corresponding row-update value. The resulting pandas DataFrame is cached for the duration of the session. If querying from NASA, the rows include all new planets, including non-default and controversial ones.
Note
This function does not update the local dataset, but it caches the queries so that a later call to resokit.datasets.update can reuse them.
Note
Requires the requests library.
- Parameters:
source (str) – Identifier for the data source (‘eu’ or ‘nasa’). If “all” or “both”, queries rows from both datasets.
check_outd (bool, optional. Default: True.) – Whether to check if the dataset is already up-to-date. If so, no query is performed.
to_resokit (bool, dict, optional. Default: False.) – Formats the final dataset: If True, as a ResoKitDataset. If False, as a pandas DataFrame. If None, as a ResoKitDataset, using all original columns.
verbose (bool, optional. Default: True.) – If True, displays messages about the query process.
rename (bool, optional. Default: True.) – If True, renames the columns to match the original database column names. Mainly for EU database queries.
load_kwargs (dict, None, optional. Default: None.) – Dictionary with keyword arguments for the resokit.load function. If None, the default arguments are used.
- Returns:
downloaded – The requested rows with specified format; or tuple if both sources requested.
- Return type:
pd.DataFrame or ResoKitDataset or Tuple
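The selection rule behind query_new_rows (keep only rows whose row-update value is later than the latest one in the local dataset) can be sketched without any network access. This is an illustration under assumptions, not the library's code: rows_updated_after is a hypothetical helper, the rowupdate key and the planet names are made up, and plain dictionaries stand in for DataFrame rows.

```python
from datetime import date
from typing import Dict, List


def rows_updated_after(local_rows: List[Dict], remote_rows: List[Dict],
                       key: str = "rowupdate") -> List[Dict]:
    """Return the remote rows updated after the latest local row-update."""
    if not local_rows:
        # Without a local dataset there is no reference date to compare against.
        raise ValueError("no local dataset: nothing to compare against")
    latest_local = max(row[key] for row in local_rows)
    return [row for row in remote_rows if row[key] > latest_local]


local = [
    {"name": "planet-a", "rowupdate": date(2024, 1, 10)},
    {"name": "planet-b", "rowupdate": date(2024, 3, 5)},
]
remote = local + [
    {"name": "planet-b", "rowupdate": date(2024, 4, 1)},  # revised entry
    {"name": "planet-c", "rowupdate": date(2024, 4, 2)},  # new entry
]
new_rows = rows_updated_after(local, remote)
```

Both the revised planet-b entry and the new planet-c entry are selected, since each carries a row-update date later than the latest local one.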