Python examples of flashtext.KeywordProcessor.add_non_word_boundary

Example 1

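from flashtext import KeywordProcessor


def _create_flashtext_object():
    """
    Instantiate a flashtext KeywordProcessor.

    The separators and accented characters below are registered as
    non-word-boundary characters, so they count as part of a word.
    """
    keyword_processor = KeywordProcessor()
    # By default flashtext treats only ASCII letters, digits and "_" as word
    # characters, so accented letters would otherwise split words.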
    for separator in [
            "-",
            "_",
            "/",
            "é",
            "è",
            "ê",
            "â",
            "ô",
            "ö",
            "ü",
            "û",
            "ù",
            "ï",
            "î",
            "æ",
    ]:
        keyword_processor.add_non_word_boundary(separator)
    return keyword_processor
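
The reason for registering accented letters can be illustrated without flashtext itself. The sketch below is a simplified, pure-Python stand-in for the boundary check, not the library's trie-based implementation; the helper name `find_keyword` is hypothetical. It assumes the library's default word characters are ASCII letters, digits and the underscore, and treats every other character as a word boundary.

```python
import string

# Assumed default word characters: ASCII letters, digits and underscore.
DEFAULT_WORD_CHARS = frozenset(string.ascii_letters + string.digits + "_")


def find_keyword(text, keyword, word_chars=DEFAULT_WORD_CHARS):
    """Return True if `keyword` occurs in `text` with a boundary on both sides.

    Hypothetical helper: a character is a boundary when it is NOT in
    `word_chars`. The start and end of the text also count as boundaries.
    """
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        before_ok = start == 0 or text[start - 1] not in word_chars
        after_ok = end == len(text) or text[end] not in word_chars
        if before_ok and after_ok:
            return True
        # Try the next occurrence of the keyword, if any.
        start = text.find(keyword, start + 1)
    return False


# With the default set, "é" is a boundary, so "sent" matches inside "présent".
print(find_keyword("présent", "sent"))  # True

# Registering "é" as a word character suppresses that spurious match.
with_accent = DEFAULT_WORD_CHARS | {"é"}
print(find_keyword("présent", "sent", with_accent))  # False
```

Passing a larger `word_chars` set plays the role that repeated `add_non_word_boundary` calls play in the example above: each added character stops being a place where a keyword may start or end.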

Example 2

File: tokenizer.py Project: AntoineSimoulin/melusine

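from flashtext import KeywordProcessor


def _create_flashtext_object():
    """
    Instantiate a flashtext KeywordProcessor.

    The separators and accented characters below are registered as
    non-word-boundary characters, so they count as part of a word.
    """
    keyword_processor = KeywordProcessor()
    # By default flashtext treats only ASCII letters, digits and "_" as word
    # characters, so accented letters would otherwise split words.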
    for separator in ['-', '_', '/', 'é', 'è', 'ê', 'â', 'ô', 'ö', 'ü', 'û', 'ù', 'ï', 'î', 'æ']:
        keyword_processor.add_non_word_boundary(separator)
    return keyword_processor

Example 3

    async def on_message(self, message):
        """
        Respond to "special commands" found anywhere in a message.

        These commands may span several words (e.g. "welcome to ds") and need no
        prefix, because they are searched for inside the message text. If the
        command is "hello", a message such as "Oh, I forgot, Hello I am iron man"
        triggers the response.

        - Every command and its response are stored in the JSON file "NoPref.json".
        - Commands and responses can be added on the go with ?cc and ?sc
          (deleting commands is not implemented yet).
        """
        if message.author == self.client.user:
            return

        # Iterate through the commands and responses in the dict

        for command, response in self.Dictwithstuff.items():
            # KeywordProcessor comes from the flashtext library imported above.
            # It is a faster way to search through messages, since regex is
            # not needed here.
            # initialize a new keyword_processor
            keyword_processor = KeywordProcessor(case_sensitive=False)
            # set the command as the keyword to search for
            keyword_processor.add_keyword(command)
            # Treat " and ' as word characters so that a quoted command such as
            # "no u" does not trigger a response
            keyword_processor.add_non_word_boundary("'")
            keyword_processor.add_non_word_boundary('"')
            # get the content of the message
            messagecont = message.content
            if command in keyword_processor.extract_keywords(messagecont):
                # if the command is anything other than "im sad"
                # send the usual response
                if command != "im sad":
                    await message.channel.send(response)
                    break
                # if it is "im sad" then send the response generator
                await message.channel.send(self.im_sad_gen())
            # if the message is a ree(with as many e's)
            # we need to regex pattern match it
            elif pattern.match(messagecont):
                await message.channel.send(
                    "https://tenor.com/view/ree-pepe-triggered-angry-ahhhh-gif-13627544"
                )

                break

Example 4

import logging
from flashtext import KeywordProcessor
from importers.repository import elastic

log = logging.getLogger(__name__)

log.info("Initializing keyword processor")
keyword_processor = KeywordProcessor()
# Treat the Swedish letters as word characters rather than word boundaries
for token in 'åäöÅÄÖ':
    keyword_processor.add_non_word_boundary(token)
for t in elastic.load_terms('KOMPETENS'):
    keyword_processor.add_keyword(t['term'], t)
for t in elastic.load_terms('YRKE'):
    keyword_processor.add_keyword(t['term'], t)


def enrich(annonser):
    results = []
    for annons in annonser:
        # Fetch information from title, header and content
        text = "%s %s %s" % (annons.get('header', ''), annons.get(
            'title', {}).get('freetext', ''), annons.get('content', {}).get(
                'text', ''))
        kwords = keyword_processor.extract_keywords(text)
        annons['skills'] = list(
            set([
                ont['concept'].lower() for ont in kwords
                if ont['type'] == 'KOMPETENS'
            ]))
        annons['occupations'] = list(
            set([
                ont['concept'].lower() for ont in kwords

Example 5

File: Flashtext大规模数据清洗的利器.py Project: jiji87432/hello-world

# Out[31]: 'color'

# Get all keywords in the dictionary
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('j2ee', 'Java')
keyword_processor.add_keyword('colour', 'color')
keyword_processor.get_all_keywords()
# output: {'colour': 'color', 'j2ee': 'Java'}

# Any character other than \w ([A-Za-z0-9_]) is treated as a word boundary
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple')
print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
# ['Big Apple']
print(keyword_processor.extract_keywords('I love Big Apple_Bay Area.'))
# []
print(keyword_processor.extract_keywords('I love Big Apple2Bay Area.'))
# []

# Set or add characters as part of word characters
keyword_processor.add_non_word_boundary('/')
print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
# []


def main():
    pass


if __name__ == '__main__':
    main()

Example 6

File: Flashtext大规模数据清洗的利器.py Project: gswyhq/hello-world

# Out[31]: 'color'

# Get all keywords in the dictionary
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('j2ee', 'Java')
keyword_processor.add_keyword('colour', 'color')
keyword_processor.get_all_keywords()
# output: {'colour': 'color', 'j2ee': 'Java'}

# Any character other than \w ([A-Za-z0-9_]) is treated as a word boundary
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple')
print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
# ['Big Apple']
print(keyword_processor.extract_keywords('I love Big Apple_Bay Area.'))
# []
print(keyword_processor.extract_keywords('I love Big Apple2Bay Area.'))
# []

# Set or add characters as part of word characters
keyword_processor.add_non_word_boundary('/')
print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
# []


def main():
    pass


if __name__ == '__main__':
    main()

Example 7

class FlashTextEntityExtractor(EntityExtractor):

    defaults = {
        # text is matched case-insensitively by default
        "case_sensitive": False,
        "non_word_boundaries": "",
    }

    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        return [Tokenizer]

    def __init__(
        self,
        component_config: Optional[Dict[Text, Any]] = None,
        lookups: Optional[Dict[Text, List[Text]]] = None,
    ):
        """This component extracts entities using lookup tables."""

        super().__init__(component_config)
        self.keyword_processor = KeywordProcessor(
            case_sensitive=self.component_config["case_sensitive"]
        )
        for non_word_boundary in self.component_config["non_word_boundaries"]:
            self.keyword_processor.add_non_word_boundary(non_word_boundary)
        if lookups:
            self.keyword_processor.add_keywords_from_dict(lookups)
            self.lookups = lookups

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        lookups = self._extract_lookups(
            training_data,
            use_only_entities=True,
        )

        if len(lookups.keys()) == 0:
            rasa.shared.utils.io.raise_warning(
                "No lookup tables defined in the training data that have a "
                "name equal to any entity in the training data. In order for "
                "this component to work you need to define valid lookup tables "
                "in the training data."
            )
        self.lookups = lookups
        self.keyword_processor.add_keywords_from_dict(lookups)

    def process(self, message: Message, **kwargs: Any) -> None:
        extracted_entities = self._extract_entities(message)
        extracted_entities = self.add_extractor_name(extracted_entities)

        message.set(
            ENTITIES, message.get(ENTITIES, []) + extracted_entities, add_to_output=True
        )

    def _extract_lookups(
        self, training_data: TrainingData, use_only_entities: bool
    ) -> Dict[Text, List[Text]]:
        if not training_data.lookup_tables or len(training_data.lookup_tables) == 0:
            return {}
        return {
            lookup_table["name"]: lookup_table["elements"]
            for lookup_table in training_data.lookup_tables
            if (
                not use_only_entities
                or (
                    use_only_entities and lookup_table["name"] in training_data.entities
                )
            )
        }

    def _extract_entities(self, message: Message) -> List[Dict[Text, Any]]:
        """Extract entities of the given type from the given user message."""
        if len(self.keyword_processor) == 0:
            return []
        matches = self.keyword_processor.extract_keywords(
            message.get(TEXT), span_info=True
        )

        return [
            {
                ENTITY_ATTRIBUTE_TYPE: match[0],
                ENTITY_ATTRIBUTE_START: match[1],
                ENTITY_ATTRIBUTE_END: match[2],
                ENTITY_ATTRIBUTE_VALUE: message.get(TEXT)[match[1] : match[2]],
            }
            for match in matches
        ]

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional[Metadata] = None,
        cached_component: Optional["FlashTextEntityExtractor"] = None,
        **kwargs: Any,
    ) -> "FlashTextEntityExtractor":

        file_name = meta.get("file")
        lookup_file = os.path.join(model_dir, file_name)

        if os.path.exists(lookup_file):
            lookups = rasa.shared.utils.io.read_json_file(lookup_file)
            return FlashTextEntityExtractor(meta, lookups)

        return FlashTextEntityExtractor(meta)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        """Persist this model into the passed directory.
        Return the metadata necessary to load the model again."""
        file_name = f"{file_name}.json"
        lookup_file = os.path.join(model_dir, file_name)
        rasa.shared.utils.io.dump_obj_as_json_to_file(lookup_file, self.lookups)
        return {"file": file_name}

Example 8

File: flashtextexamples.py Project: robutseverywhere/Python

print('j2ee' in keyword_processor)
print(keyword_processor.get_keyword('j2ee'))

keyword_processor['color'] = 'color'

print(keyword_processor['color'])

# To set or add characters as part of word characters.

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword("Big Apple")

print(keyword_processor.extract_keywords("I love the Big Apple/Bay Area."))

keyword_processor.add_non_word_boundary("/")

print(keyword_processor.extract_keywords("I love the Big Apple/Bay Area."))

# Searching for a single word in a document.
document = """Batman is a fictional superhero appearing in American comic books published by DC Comics. The character was created by artist Bob Kane and writer Bill Finger,[4][5] and first appeared in Detective Comics #27 (1939). Originally named the "Bat-Man", the character is also referred to by such epithets as the Caped Crusader, the Dark Knight, and the World's Greatest Detective.[6]

Batman's secret identity is Bruce Wayne, a wealthy American playboy, philanthropist, and owner of Wayne Enterprises. After witnessing the murder of his parents Dr. Thomas Wayne and Martha Wayne as a child, he swore vengeance against criminals, an oath tempered by a sense of justice. Bruce Wayne trains himself physically and intellectually and crafts a bat-inspired persona to fight crime.[7]

Batman operates in the fictional Gotham City with assistance from various supporting characters, including his butler Alfred, police commissioner Gordon, and vigilante allies such as Robin. Unlike most superheroes, Batman does not possess any superpowers; rather, he relies on his genius intellect, physical prowess, martial arts abilities, detective skills, science and technology, vast wealth, intimidation, and indomitable will. A large assortment of villains make up Batman's rogues gallery, including his archenemy, the Joker.

The character became popular soon after his introduction in 1939 and gained his own comic book title, Batman, the following year. As the decades went on, differing interpretations of the character emerged. The late 1960s Batman television series used a camp aesthetic, which continued to be associated with the character for years after the show ended. Various creators worked to return the character to his dark roots, culminating in 1986 with The Dark Knight Returns by Frank Miller. The success of Warner Bros.' live-action Batman feature films have helped maintain the character's prominence in mainstream culture.[8]

An American cultural icon, Batman has garnered enormous popularity and is among the most identifiable comic book characters. Batman has been licensed and adapted into a variety of media, from radio to television and film, and appears on various merchandise sold around the world, such as toys and video games. The character has also intrigued psychiatrists, with many trying to understand his psyche. In 2015, FanSided ranked Batman as number one on their list of "50 Greatest Super Heroes In Comic Book History".[9] Kevin Conroy, Bruce Greenwood, Peter Weller, Anthony Ruivivar, Jason O'Mara, and Will Arnett, among others, have provided the character's voice for animated adaptations. Batman has been depicted in both film and television by Lewis Wilson, Robert Lowery, Adam West, Michael Keaton, Val Kilmer, George Clooney, Christian Bale, and Ben Affleck. """

processor = KeywordProcessor()