How to build your own personal voice assistant like Siri or Alexa using Python

GUVI Geeks
9 min read · Feb 21, 2022

The world we see today is a sci-fi utopia, bending technology, design, and nature together in a way that is harmonious and seamless. We effortlessly use a plethora of technologies. For instance, could you have imagined a decade ago that you would be able to talk to a phone, console, or speaker and have it perform tasks with only your voice commands and no other action on your part? There is still time before we seek companionship and fall in love with our AI operating systems as shown in the movie ‘Her’, but voice assistants have come a long way toward removing the need for computer peripherals. In this blog, we will decode the history of virtual assistants and show how you can program and build your own AI personal virtual assistant using Python.

From Shoebox to Smart Speakers…

For those who don’t know, an AI virtual assistant is a piece of software that understands written or verbal commands and completes tasks assigned by the user. The first attempts at voice assistants can be traced back to the early ’60s, when IBM introduced the Shoebox, the first digital speech recognition tool. While very primitive, it did recognise 16 words and the digits 0–9. The next breakthrough came in the ’90s, when Dragon launched the very first software product that led the way with competent voice recognition and transcription.

Virtual assistants became mainstream when Apple introduced Siri, first as a standalone iOS app in February 2010 and then integrated into the iPhone 4S in October 2011. The team used a mixture of natural language processing and speech recognition to drive the virtual assistant innovation. Siri was trained to activate after a wake-up phrase, “Hey Siri”; a user could then ask a question, for instance, “What’s the weather like in Chennai today?”. The recognised speech was then passed to NLP software to interpret. After Siri, Google Now and Microsoft’s Cortana soon followed the trend.

The next milestone was achieved by Amazon’s Alexa, launched with the Echo smart speaker, ushering in what we today call “The Smart Speaker” era and the birth of Voicebot.ai.

The smart speaker will play out for years to come, but we expect that the voice assistant revolution will eventually morph into an ambient voice revolution, where assistants are no longer constrained to particular devices or explicitly assigned tasks. Instead, they will be embedded into the environments we inhabit.

Let’s get started with Personal Voice-Assistant AI development

Let’s make a distinction before we start. If you want to build voice and NLP capabilities into your own application, you have several cloud and API options. For Apple, you can use the SiriKit API, which requires the $99-per-year Apple Developer Program membership to publish on the App Store. One such example is Swiggy and its voice command UI for tracking the delivery partner. Other cloud options include Amazon’s Alexa with an AWS account, and Google’s assistant platform.

But in case you don’t want to lock yourself into a particular ecosystem, you can develop your own voice assistant system. It’s just a matter of speech recognition, a pipeline, a rules engine, a query parser, and a pluggable architecture with open APIs.

The Components & Python Packages for a Voice Interface

Now we’d like to discuss the basic technologies in AI voice assistants: simply put, what makes one different from a visual interface and characterises it as a voice interface.

A voice assistant has a few core components:

Voice Input/Output

It implies that the user does not need to touch a screen or GUI elements to make a request; a voice command is enough. Our voice assistant software performs the given task using speech-to-text (STT): it converts the voice commands given by the user into text, analyzes them, and acts on them. We will be using the SpeechRecognition and pyttsx3 packages to convert speech to text and vice versa. Both packages support macOS, Linux, and Windows.

NLP & Intelligent Interpretation

Our voice assistant shouldn’t be limited to certain catchphrases; the user should be free to phrase requests naturally. The response is built by picking out the elements of the query that matter to the user. We will be integrating the Wolfram Alpha API to compute expert-level answers using Wolfram’s knowledge base, algorithms, and AI technology, all made possible by the Wolfram Language.

Subprocesses

subprocess is a standard Python library used to process system commands, such as logging off or restarting the machine. We will also use Python’s os library to enable functions that interact with the operating system.

Supporting Libraries

Apart from the essential features, we will use several other Python libraries such as wikipedia, ecapture, time, datetime, requests, and others to enable more functions.

To begin with, it’s necessary to install all the above-mentioned packages on your system using the pip command. If you want to brush up on your Python fundamentals, visit here.

If you seek professional guidance, enroll in GUVI’s IIT-M-curated Python course. Above all, it will help you get ahead with Python programming.

Writing script for Personal Voice Assistants

First of all, let’s install all the libraries from the terminal using the pip command, and then import them at the top of our script. For the sake of clarity, we’ll name our personal voice assistant “JARVIS-One”.

(Any resemblance is uncanny.)

Setting up Speech Engine

We are going to use sapi5, Microsoft’s text-to-speech engine, for the spoken output. The pyttsx3 engine object is stored in a variable named engine. We can set the voice id to either 0 or 1: 0 selects a male voice and 1 a female voice.

Further, we will define a function speak which converts text to speech. The speak function takes the text as an argument and passes it to the engine.

The runAndWait() Method

Just as the name suggests, this method blocks while processing all currently queued commands. It invokes callbacks for engine notifications and returns once all commands queued before the call have been processed and removed from the queue.
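Putting the pieces above together, here is a minimal sketch of the engine setup and a speak() helper. The names make_engine and speak, and the fallback to the platform's default driver, are choices of this sketch rather than a fixed API; the pyttsx3 import is kept inside the factory so the rest of the script loads even where the package isn't installed.

```python
def make_engine(voice_index=0):
    """Create the text-to-speech engine. sapi5 is Windows-only, so we
    fall back to the platform's default driver elsewhere."""
    import pyttsx3  # imported here so the rest of the script loads without the package
    try:
        engine = pyttsx3.init('sapi5')
    except Exception:
        engine = pyttsx3.init()
    voices = engine.getProperty('voices')
    # On Windows/sapi5, index 0 is typically a male voice and 1 a female voice
    engine.setProperty('voice', voices[voice_index].id)
    return engine


def speak(engine, text):
    """Queue `text` and block until it has been spoken."""
    engine.say(text)
    engine.runAndWait()
```

The runAndWait() call is what makes speak() synchronous: it returns only after the queued utterance has been spoken.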

Greeting the User

We define a wishMe function so that the personal voice assistant can greet the user. The datetime.now().hour expression extracts the hour from the current time.

If the hour is between 0 and 12, the voice assistant greets you with the message “Good Morning <F_name>”.

If the hour is between 12 and 18, the voice assistant greets you with the message “Good Afternoon <F_name>”.

Otherwise, it voices the message “Good Evening”.
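The greeting rules above can be captured in a small sketch; greeting_for is a helper name chosen here, and F_NAME stands in for the user's first name as elsewhere in the text.

```python
from datetime import datetime


def greeting_for(hour):
    """Pick a greeting for a given hour of the day (0-23)."""
    if 0 <= hour < 12:
        return "Good Morning"
    elif 12 <= hour < 18:
        return "Good Afternoon"
    return "Good Evening"


def wish_me(name="F_NAME"):
    greeting = greeting_for(datetime.now().hour)
    print(f"{greeting} {name}")  # swap print for speak(...) once the engine is set up
```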

Setting up command function for our personal voice assistant

Now we need to define a function takecommand so the personal voice assistant can understand and analyze human language. The microphone captures the voice input, and the recognizer converts the speech into text for a response.

We will also incorporate exception handling to deal with errors at run time. The recognize_google() function uses Google’s speech recognition service to transcribe the audio.
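A minimal version of takecommand might look like the following, assuming the SpeechRecognition package (plus PyAudio for microphone access) is installed; the import sits inside the function so the file loads without a microphone stack.

```python
def take_command():
    """Listen on the default microphone and return the recognised text in lower case."""
    import speech_recognition as sr  # pip install SpeechRecognition pyaudio
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening...")
        audio = recognizer.listen(source)
    try:
        statement = recognizer.recognize_google(audio, language='en-in')
        print(f"You said: {statement}")
        return statement.lower()
    except sr.UnknownValueError:
        # Speech was unintelligible
        return ""
    except sr.RequestError:
        # Google's recognition service could not be reached
        return ""
```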

The Main Function

The main function starts from here: the command given by the user is stored in the variable statement.

The voice assistant JARVIS can now listen for trigger words assigned by the user.
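The main loop can be sketched as a trigger-word dispatcher; run and the handlers dictionary are illustrative names of this sketch, not part of any library.

```python
def run(statement_source, handlers):
    """Minimal command loop: fetch a statement, dispatch on trigger words,
    and stop when the user says good bye."""
    while True:
        statement = statement_source()
        if not statement:
            continue  # nothing was recognised; listen again
        statement = statement.lower()
        if 'good bye' in statement or 'stop' in statement:
            return
        for trigger, handler in handlers.items():
            if trigger in statement:
                handler(statement)
                break
```

In the full assistant, statement_source would be the takecommand function and handlers would map trigger words like 'time' or 'wikipedia' to the skills built below.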

Summoning Skills/Powers

Now that we have finished setting up the voice assistant, we will build the essential skills.

1. Accessing the Web Browser: Gmail, Google Chrome & YouTube

The open_new_tab() function from Python’s webbrowser module accepts the URL to be opened as a parameter, while Python’s time.sleep() function delays execution of the program for a given number of seconds.
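A sketch of this skill using only the standard webbrowser and time modules; the SITES table and the open_site name are hypothetical choices for illustration.

```python
import time
import webbrowser

# Hypothetical trigger-word → URL table; extend it with your own sites
SITES = {
    'gmail': 'https://mail.google.com',
    'google': 'https://www.google.com',
    'youtube': 'https://www.youtube.com',
}


def open_site(trigger):
    """Open the matching site in a new browser tab; False for unknown triggers."""
    url = SITES.get(trigger)
    if url is None:
        return False
    webbrowser.open_new_tab(url)
    time.sleep(2)  # small pause so the browser comes up before we listen again
    return True
```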

2. Fetching Data with Wikipedia API

Once we have successfully imported the wikipedia package, we can use it to extract data. The wikipedia.summary() function lets the user ask for any trivia and returns a short summary, which we store in a variable result.
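Assuming the wikipedia package is installed, the skill might be wrapped like this; wiki_summary is an illustrative name, and the import is deferred so the script loads without the package.

```python
def wiki_summary(statement, sentences=3):
    """Strip the trigger word and return a short Wikipedia summary of what is left."""
    import wikipedia  # pip install wikipedia
    topic = statement.replace('wikipedia', '').strip()
    result = wikipedia.summary(topic, sentences=sentences)
    return result
```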

3. Time Prediction

JARVIS-One can tell the current time using the datetime.now() function, formatting the hour, minute & second into a variable named strTime.
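A minimal sketch using the standard datetime module; current_time is a name chosen here.

```python
from datetime import datetime


def current_time():
    """Return the current time as an HH:MM:SS string, ready to be spoken."""
    str_time = datetime.now().strftime('%H:%M:%S')
    return str_time
```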

4. Clicking Pictures

The ec.capture() function enables JARVIS-One to click pictures from your camera. It has 3 parameters: camera index, window name & save name.

If there are two webcams, the first is indexed with 0 and the second with 1. The window name can be either a string or a variable; in case you don’t want to show this window, pass False.

You can also give a name to the clicked image; if you don’t wish to save the image, pass False.
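Assuming the ecapture package is installed and a webcam is attached, the skill can be wrapped as follows; click_picture is an illustrative name, and the import is deferred since the call needs real camera hardware.

```python
def click_picture(save_name='jarvis_photo.jpg'):
    """Take a photo with the first webcam; needs physical camera hardware to run."""
    from ecapture import ecapture as ec  # pip install ecapture
    # Parameters: camera index (0 = first webcam), window name (False hides the
    # preview window), and the file name to save to (False skips saving).
    ec.capture(0, 'JARVIS-One Camera', save_name)
```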

5. Fetching the Latest News

JARVIS-One is programmed to fetch top headline news from the Times of India using the webbrowser function.

6. Fetching Data from the Web

The open_new_tab() function can also search for and open data in a web browser. For instance, you can search for pictures of blue dandelions, and JARVIS-One will open Google Images and fetch them.
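A sketch of this skill that builds a Google Images URL with the standard library; image_search and the query-string layout are assumptions of this sketch.

```python
import webbrowser
from urllib.parse import quote_plus


def image_search(query):
    """Open Google Images for the query; returns the URL that was opened."""
    url = 'https://www.google.com/search?tbm=isch&q=' + quote_plus(query)
    webbrowser.open_new_tab(url)
    return url
```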

7. Wolfram Alpha API for geographical and computational questions

The third-party Wolfram Alpha API enables JARVIS-One to answer computational and geographical questions. However, to access the API, you need to create an account and obtain a unique App ID from the official Wolfram Alpha website. The client is an instance of the wolframalpha Client class, whereas the res variable stores the response returned by Wolfram Alpha.
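Assuming the wolframalpha package is installed and you have a valid App ID, the query might be wrapped like this; ask_wolfram is an illustrative name, and the import is deferred so the script loads without the package.

```python
def ask_wolfram(question, app_id):
    """Send a question to Wolfram Alpha and return the first plain-text result."""
    import wolframalpha  # pip install wolframalpha
    client = wolframalpha.Client(app_id)  # app_id from the Wolfram Alpha developer portal
    res = client.query(question)
    try:
        return next(res.results).text
    except StopIteration:
        return "I don't know the answer to that."
```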

8. Weather Forecasting

With an API key from OpenWeatherMap, an online service that offers weather data for any location, your personal voice assistant can report the weather. The city name is captured from the user via the takecommand() function.
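A sketch using the requests package against OpenWeatherMap's current-weather endpoint; get_weather and the returned phrasing are choices of this sketch, and a real API key is required at run time.

```python
def get_weather(city_name, api_key):
    """Fetch the current weather for a city from OpenWeatherMap; None on failure."""
    import requests  # pip install requests
    resp = requests.get(
        'https://api.openweathermap.org/data/2.5/weather',
        params={'q': city_name, 'appid': api_key, 'units': 'metric'},
    )
    data = resp.json()
    if data.get('cod') != 200:
        return None  # bad key, unknown city, etc.
    return f"{data['main']['temp']} degrees Celsius with {data['weather'][0]['description']}"
```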


9. Credits

It adds an element of fun to program JARVIS-One to answer questions such as “What can you do?” and “Who created you?”.

```python
elif 'who are you' in statement or 'what can you do' in statement:
    speak('I am JARVIS-one version 1 point O, your personal assistant. I am programmed for minor tasks like '
          'opening YouTube, Google Chrome, Gmail and Stack Overflow, telling the time, taking a photo, '
          'searching Wikipedia, and predicting the weather in different cities. I can also fetch the top '
          'headline news from the Times of India, and you can ask me computational or geographical questions too!')

elif "who made you" in statement or "who created you" in statement or "who discovered you" in statement:
    speak("I was built by F_NAME")
    print("I was built by F_NAME")
```

10. Subprocesses: Log Off Your System

The subprocess.call() function is used here to invoke the system command that logs you off or shuts down your PC, letting your AI assistant do it automatically.

```python
elif "log off" in statement or "sign out" in statement:
    speak("Ok, your PC will log off in 10 seconds; make sure you exit from all applications")
    time.sleep(10)  # give the user time to close applications before logging off
    subprocess.call(["shutdown", "/l"])  # Windows: /l logs the current user off
```

Now that you have got the hang of it, you can build your own personal voice assistant from scratch. Similarly, you can incorporate many other free APIs to enable more functionality. In case you want to review the full code, visit this Git repository. (All credit goes to the developer.)
