

With AI now able to see, hear, and understand the world in strikingly human ways, Clay Garner asks whether this new frontier could make our cities smarter, safer, and more human than ever before.

We’ve just entered a new era of artificial intelligence for cities and residents – one where AI won’t just read and write, but will also see, hear, and understand the world around us in strikingly human ways.

The recently unveiled GPT-4o from OpenAI represents a huge leap in what AI can do. The "o" stands for "omni": the new model can perceive images, audio, and video inputs beyond just text. It responds with human-like speed, generating outputs of not just text but images and audio as well.

What does it mean for cities?

This multimodal AI breakthrough could transform how cities operate and serve residents. Imagine a resident encountering a hazard like a downed power line and simply pointing their camera at it.

The AI would instantly analyse the visuals and any audio context to grasp the precise emergency. It could then relay safety instructions in the resident’s native language while seamlessly alerting the proper department with all relevant details for rapid dispatch.
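The reporting flow described above can be sketched as a simple triage-and-dispatch pipeline. Everything in this sketch is illustrative: the `classify_report` function is a keyword stub standing in for a multimodal model such as GPT-4o, and the department names are invented, not drawn from any real city's systems.

```python
from dataclasses import dataclass

# Hypothetical routing table from incident category to city department
# (illustrative only; a real deployment would use the city's own directory).
DEPARTMENTS = {
    "power_line": "Utilities / Emergency Dispatch",
    "pothole": "Public Works",
    "flooding": "Water Management",
}

@dataclass
class HazardReport:
    transcript: str      # what the resident said (speech-to-text)
    image_summary: str   # what the model "saw" in the camera feed
    language: str        # resident's language, for a localised reply

def classify_report(report: HazardReport) -> str:
    """Stub for the multimodal model call: in practice the raw audio and
    video would go to a model like GPT-4o; a keyword match stands in here."""
    text = f"{report.transcript} {report.image_summary}".lower()
    if "power line" in text or "cable" in text:
        return "power_line"
    if "pothole" in text:
        return "pothole"
    return "flooding"

def dispatch(report: HazardReport) -> dict:
    """Route the report to a department and note the reply language."""
    category = classify_report(report)
    return {
        "department": DEPARTMENTS[category],
        "category": category,
        "reply_language": report.language,
        "details": report.image_summary,
    }

ticket = dispatch(HazardReport(
    transcript="There is a downed power line across the street",
    image_summary="Snapped utility cable lying on wet asphalt",
    language="es",
))
print(ticket["department"])  # Utilities / Emergency Dispatch
```

The design point is separation of concerns: the model handles perception and classification, while routing and localisation stay in ordinary, auditable city software.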

Or consider a resident needing to reference a dense permitting form or zoning document. Instead of wading through legalese, they could photograph the document. The multimodal model would ingest the full text and imagery, then converse naturally to guide the resident through it using visuals and audio explanations.
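Mechanically, "photographing the document" means sending the image alongside a text question in a single request. The sketch below only constructs the shape of such a message, using the text-plus-image content parts that multimodal chat APIs like OpenAI's accept for models such as gpt-4o; the photo bytes are a placeholder, and actually sending the request would need an API key and network access.

```python
import base64
import json

# Placeholder bytes standing in for a resident's photo of a zoning document.
fake_photo = b"...scanned permit page..."
data_url = "data:image/jpeg;base64," + base64.b64encode(fake_photo).decode()

# One chat message mixing a plain-language question with the photographed page.
message = {
    "role": "user",
    "content": [
        {"type": "text",
         "text": "Explain in plain language what this permit form asks of me."},
        {"type": "image_url",
         "image_url": {"url": data_url}},
    ],
}

# The request body pairs the message with a model name; here we only
# inspect the structure rather than calling the API.
request_body = {"model": "gpt-4o", "messages": [message]}
print(json.dumps(request_body, indent=2)[:120])
```

Because text and imagery travel in one message, the model can ground its conversational answers in the specific form the resident is holding, rather than in generic guidance.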

The possibilities span virtually every facet of municipal services. For intelligent traffic management, AI would fuse vehicle data with crowdsourced photos, videos, and audio describing real-time conditions – a unified view to reroute traffic smoothly around disruptions. In urban planning, it could ingest 3D maps, aerial photos, official documents, and resident feedback in any format to model development scenarios grounded in physical and regulatory reality.

But perhaps the biggest breakthrough is how multimodal AI could make interacting with cities truly natural for residents and officials alike. Instead of awkwardly typing queries into one-directional chatbots or digging through apps for answers, we could simply talk to our city’s AI assistant like we would another person, sharing documents or capturing photos and videos to convey our full intent. The AI would understand not just our words but our visual and audio context.

Need for robust governance

Of course, like any new technology, multimodal AI raises important considerations around privacy, security, and ethical deployment. Cities will need robust governance to ensure models like GPT-4o are implemented responsibly and equitably – particularly as the amount and variety of resident-related information expands.

There must be clear policies on data practices, human oversight, and guarding against harmful bias or misuse. Still, the power of multimodal AI to enhance municipal operations and services is undeniable. Computer vision, speech recognition, and general multimedia understanding have been AI frontiers for decades. Now, the latest large language models are finally bringing all those capabilities together into stunningly coherent, easy-to-use systems.

For cities, GPT-4o and the next wave of multimodal AI offer the tantalising prospect of artificial intelligences that can finally understand and interact with the full richness of human experiences – our words, our images, our voices.

With great possibility comes great responsibility. But embraced thoughtfully, this new AI frontier could make our cities smarter, safer, and more human than ever before.


Author(s): Clay Garner

Source: Smart Cities World, 20.05.2024
