[Tut] Python | Split Text into Sentences


✨Summary: There are four different ways to split a text into sentences:
  • Using the nltk module
  • Using re.split()
  • Using re.findall()
  • Using replace()

Minimal Example


text = "God is Great! I won a lottery."

# Method 1 (keeps the punctuation with each sentence)
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))  # ['God is Great!', 'I won a lottery.']

# Method 2
import re
res = [x for x in re.split(r"[.!?]", text) if x != ""]
print(res)

# Method 3
res = re.findall(r"[^.!?]+", text)
print(res)

# Method 4
def splitter(txt, delim):
    # Replace every delimiter with a comma, then split on commas
    for d in delim:
        txt = txt.replace(d, ',')
    res = txt.split(',')
    res.pop()  # drop the empty string after the final delimiter
    return res

sep = ['.', '!']
print(splitter(text, sep))

# Output of Methods 2 to 4: ['God is Great', ' I won a lottery']

Problem Formulation


Problem: Given a string containing several sentences, how do you split it into a list of sentences?

Example: Let’s visualize the problem with the help of an example.

# Input
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
# output
['This is sentence 1', ' This is sentence 2', ' This is sentence 3']

Method 1: Using nltk.tokenize


Natural Language Processing (NLP) uses a process known as tokenization to divide a large body of text into smaller parts called tokens. The Natural Language Toolkit (NLTK) provides the nltk.tokenize package for this purpose, and its sent_tokenize() function splits a given text into sentences.

Code:

from nltk.tokenize import sent_tokenize
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(sent_tokenize(text)) # ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']

Explanation: 

  • Import the sent_tokenize function from nltk.tokenize.
  • sent_tokenize parses the given text and breaks it into individual sentences at punctuation such as periods, exclamation marks, and question marks. The punctuation mark stays attached to each sentence.

Caution: If the tokenizer data is missing, sent_tokenize raises a LookupError even after the nltk package is installed. So, here’s the entire process to set up nltk on your system.

Install nltk using → pip install nltk

Then go ahead and type the following in your Python shell:

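import nltk
nltk.download('punkt')

On NLTK 3.9 and later, the tokenizer data is named punkt_tab. If sent_tokenize still raises a LookupError, run nltk.download('punkt_tab') instead.

That’s it! You are now ready to use sent_tokenize in your code.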

Method 2: Using re.split


The re.split(pattern, string) method matches all occurrences of the pattern in the string and divides the string along the matches resulting in a list of strings between the matches. For example, re.split('a', 'bbabbbab') results in the list of strings ['bb', 'bbb', 'b'].
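The behaviour described above can be checked with a short sketch. It also shows the empty string that re.split() leaves behind when a match sits at the very end of the input, which is why the empty strings are filtered out in the code further down. The sample strings are purely illustrative.

```python
import re

# Splitting between matches: the separators themselves are dropped.
parts = re.split('a', 'bbabbbab')
print(parts)  # ['bb', 'bbb', 'b']

# A match at the very end leaves an empty string as the last element.
trailing = re.split('a', 'bba')
print(trailing)  # ['bb', '']

# Filtering out the empty strings keeps only the real pieces.
cleaned = [p for p in trailing if p != ""]
print(cleaned)  # ['bb']
```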

Approach: Split the given string at the punctuation marks that end a sentence. A character class lists every separator in one expression, like so: re.split(r"[.!?]", text). Inside square brackets, '.', '!' and '?' are matched literally, so neither escaping nor the either-or (|) metacharacter is needed. Whenever one of these characters occurs, re.split() splits the string there. The condition x != "" filters out the empty string left after the final punctuation mark.

Code:

import re
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
res = [x for x in re.split(r"[.!?]", text) if x != ""]
print(res) # ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']
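The split above drops the punctuation and leaves a leading space on every sentence after the first. If you would rather keep the terminators, one possible variation (a sketch, not part of the original article) is to split on the whitespace that follows a punctuation mark, using a lookbehind assertion:

```python
import re

text = "This is sentence 1. This is sentence 2! This is sentence 3?"

# Split on whitespace that follows '.', '!' or '?'.
# The lookbehind (?<=...) checks for the punctuation without consuming it,
# so each sentence keeps its terminator and no leading space remains.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']
```

Like the other regex approaches, this treats every period as a sentence end, so abbreviations such as "Dr." would still cause a split.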

Recommended Read: Python Regex Split

Method 3: Using findall


The re.findall(pattern, string) method scans the string from left to right, searching for all non-overlapping matches of the pattern. It returns a list of strings in the matching order when scanning the string from left to right.

Code:

import re
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
res = re.findall(r"[^.!?]+", text)
print(res) # ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']

Explanation: In the expression re.findall(r"[^.!?]+", text), the character class [^.!?] matches any single character except ‘.’, ‘!’, and ‘?’ (the leading ^ negates the class), and + repeats it one or more times. Each match therefore collects consecutive characters until one of the punctuation marks is reached. The punctuation mark itself is skipped, and the next match starts right after it.
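As a hedged variation on the same idea, the pattern can be extended so that each match also takes the punctuation that follows it, and the leading spaces can be stripped afterwards. The names chunks and sentences are only illustrative:

```python
import re

text = "This is sentence 1. This is sentence 2! This is sentence 3?"

# [^.!?]+ grabs the sentence body, [.!?]* grabs any terminators after it.
chunks = re.findall(r"[^.!?]+[.!?]*", text)

# Remove the leading space left over from the gap between sentences.
sentences = [c.strip() for c in chunks]
print(sentences)
# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']
```

Because the terminator part uses * rather than +, a final fragment without closing punctuation is still returned instead of being dropped.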

Related Read: Python re.findall() – Everything You Need to Know

Method 4: Using replace


Approach: The idea here is to replace all the punctuation marks (‘!’, ‘?’, and ‘.’) present in the given string with a comma (,) and then split the modified string on commas to get the list of substrings. Because the text ends with a punctuation mark, the last element returned is an empty string, which you can remove with the pop() method. Note that this approach only works as intended when the text itself contains no commas.

Code:

def splitter(txt, delim):
    # Replace every delimiter with a comma, then split on commas
    for d in delim:
        txt = txt.replace(d, ',')
    res = txt.split(',')
    res.pop()  # drop the empty string after the final delimiter
    return res

sep = ['.', '!', '?']
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(splitter(text, sep)) # ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']
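Since a comma is a common character in ordinary text, using it as the stand-in separator can split sentences in the wrong place. A minimal sketch of one way around this, assuming the text never contains a NUL character, is shown below. The helper name split_on_delims and its marker parameter are hypothetical and not part of the original article.

```python
def split_on_delims(txt, delims, marker="\x00"):
    """Hypothetical helper: split txt on any character in delims."""
    # Use a marker that is assumed not to occur in normal text.
    for d in delims:
        txt = txt.replace(d, marker)
    # Drop empty and whitespace-only pieces instead of calling pop().
    return [piece for piece in txt.split(marker) if piece.strip()]

text = "Well, this is sentence 1. This is sentence 2!"
result = split_on_delims(text, ['.', '!', '?'])
print(result)
# ['Well, this is sentence 1', ' This is sentence 2']
```

Filtering out the empty pieces also means the function behaves sensibly when the text does not end with a punctuation mark, where an unconditional pop() would remove a real sentence.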

Related Read: Python String replace()

Conclusion


We have successfully solved the given problem using different approaches. I hope this article helped you in your Python coding journey. Please subscribe and stay tuned for more interesting articles.

Happy coding!


https://www.sickgaming.net/blog/2022/12/...sentences/