Twitter Spark Streaming using PySpark

Testing Environment (on a Virtual Machine):
  • Ubuntu 12.04 (32-bit)
  • 8 GB RAM
  • Intel i7 quad-core

Contents of the folder:
  • tweepy_kafka_producer.py (publishes tweets to a Kafka topic)
  • twitter_stream.py (processes the tweets)
  • tweet_task.csv (sample output file)
  • get_intent.py (classifies tweets into topics)
  • generate_Map.py (visualizes geo points on a map)
  • draw_Pies.py (draws pie charts)
  • generate_cloud.py (draws a word cloud)
  • my_map.html (map with plotted locations)
  • figure_1.png (word cloud)
  • figure_2.png (pie plot for the top 17 locations)
  • tweet_task_intent.csv (contains tweet, intent, and topic)

Pipeline Structure

Data is ingested in real time using Tweepy (a Twitter API client) and sent to the producer, which publishes it to a user-defined topic in Kafka.

We then create a consumer, which subscribes to that topic and receives the data.
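The producer half of the pipeline above can be sketched as follows. This is a minimal illustration, not the repository's actual tweepy_kafka_producer.py: the credential strings, the tracked keyword, the message schema in `tweet_to_message`, and the use of the kafka-python package are all assumptions. It uses Tweepy's pre-4.0 `StreamListener` API, matching the Python 2.7 era of this project.

```python
import json


def tweet_to_message(status):
    """Keep only the fields downstream consumers need (hypothetical schema)."""
    return json.dumps({
        "text": status.get("text", ""),
        "user": status.get("user", {}).get("screen_name", ""),
        "coordinates": status.get("coordinates"),
    })


def publish_tweet(producer, topic, status):
    """Serialize one tweet and send it to the given Kafka topic."""
    producer.send(topic, tweet_to_message(status).encode("utf-8"))


if __name__ == "__main__":
    # Requires tweepy (< 4.0) and kafka-python; all credentials below
    # are placeholders.
    import tweepy
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    class KafkaForwarder(tweepy.StreamListener):
        def on_data(self, raw):
            publish_tweet(producer, "socialcops", json.loads(raw))
            return True

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    tweepy.Stream(auth, KafkaForwarder()).filter(track=["spark"])
```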

Technologies Used

This application relies on a number of open-source technologies:

  • Apache Spark 1.5.0 - Big Data Processing Engine
  • Tweepy - Twitter API Client
  • Kafka - Messaging System
  • Sublime Text - Text Editor
  • Python 2.7 - Programming Language
  • Sentiment140 - Sentiment Analysis API

Python Packages Used

  • matplotlib, wordcloud, scipy, geocoder
  • re, pylab, webbrowser, pygmaps
  • json, nltk, collections, urllib2
  • gensim, monkeylearn, random, csv

Kafka Commands

Start the ZooKeeper server:

$ bin/zookeeper-server-start.sh config/zookeeper.properties

Start the Kafka server:

$ nohup ~/kafka/bin/kafka-server-start.sh ~/kafka/config/server.properties > ~/kafka/kafka.log 2>&1 &

Create a topic (here named socialcops):

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic socialcops

Start the producer:

$ python tweepy_kafka_producer.py

Start the consumer (in another terminal):

From inside the Spark directory -

$ bin/spark-submit --master local[3] --jars external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly-1.5.0.jar SocialCops_Task_SoftwareChallenge/twitter_stream.py localhost:2181 socialcops
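The consumer side that the `spark-submit` command above launches can be sketched roughly as follows, assuming Spark 1.5's receiver-based `KafkaUtils.createStream`. The batch interval, consumer group id, message schema, and `parse_message` helper are illustrative assumptions, not the repository's actual twitter_stream.py.

```python
import json


def parse_message(raw):
    """Decode one Kafka message value into a tweet dict; None if malformed."""
    try:
        return json.loads(raw)
    except ValueError:
        return None


if __name__ == "__main__":
    # Requires PySpark 1.5 with the spark-streaming-kafka assembly jar
    # on the classpath (passed via --jars, as in the command above).
    import sys
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    zk_quorum, topic = sys.argv[1], sys.argv[2]
    sc = SparkContext(appName="TwitterStream")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches (assumed)

    # createStream yields (key, value) pairs; we only need the value.
    stream = KafkaUtils.createStream(ssc, zk_quorum, "twitter-consumer", {topic: 1})
    tweets = stream.map(lambda kv: parse_message(kv[1])).filter(lambda t: t is not None)
    tweets.map(lambda t: t.get("text", "")).pprint()

    ssc.start()
    ssc.awaitTermination()
```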

Analyse Intent and Topic (saves to tweet_task_intent.csv):

$ python get_intent.py tweet_task.csv
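Since the technology list names Sentiment140 as the sentiment API, the classification step might look roughly like this. The endpoint URL, payload shape, and example text are assumptions to be checked against Sentiment140's current documentation; only the 0/2/4 polarity coding is its documented convention.

```python
import json

# Sentiment140 encodes polarity as 0 (negative), 2 (neutral), 4 (positive).
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def polarity_to_label(polarity):
    """Map a Sentiment140 polarity code to a human-readable label."""
    return POLARITY_LABELS.get(polarity, "unknown")


if __name__ == "__main__":
    # Hypothetical bulk call; urllib2 matches this repo's Python 2.7 stack.
    import urllib2

    payload = json.dumps({"data": [{"text": "spark streaming is great"}]})
    req = urllib2.Request("http://www.sentiment140.com/api/bulkClassifyJson", payload)
    resp = json.loads(urllib2.urlopen(req).read())
    for item in resp["data"]:
        print(item["text"], polarity_to_label(item["polarity"]))
```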

Draw Pie Charts:

$ python draw_Pies.py tweet_task.csv

Plot Locations on a Map:

$ python generate_Map.py tweet_task.csv

Draw a Word Cloud:

$ python generate_cloud.py tweet_task.csv
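A word-cloud step like the one above typically reduces the tweets to word frequencies and hands them to the wordcloud package. This is a minimal sketch, not the repository's generate_cloud.py: the tiny stopword set, the cleaning regexes, and the plot parameters are illustrative assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "to", "of", "rt"}  # tiny illustrative set


def word_frequencies(tweets):
    """Count lowercase word frequencies across tweets, skipping stopwords,
    URLs, and @mentions (a simplified version of typical tweet cleaning)."""
    counts = Counter()
    for text in tweets:
        text = re.sub(r"https?://\S+|@\w+", "", text.lower())
        counts.update(w for w in re.findall(r"[a-z']+", text) if w not in STOPWORDS)
    return counts


if __name__ == "__main__":
    # Requires the wordcloud and matplotlib packages.
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    freqs = word_frequencies(["RT @bob check spark http://t.co/x",
                              "spark streaming"])
    wc = WordCloud(width=800, height=400).generate_from_frequencies(freqs)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```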

License

Prakhar Mishra
