user-level-audio-auditor (Transcriptions-Only)

Paper: The Audio Auditor: User-Level Membership Inference in Internet of Things Voice Services

Published: PoPETS 2019

Methodology

Transcription-only black-box access to ASR model:

Input: audio & its true transcription
Output: its predicted transcription

User-level Membership Inference Attack:

A user is a user-level member of the training set if any of that user's data was used to train the target model. This holds even when the specific audios used to query the model are not themselves members of the training set.
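The definition above can be spelled out as a minimal sketch. It assumes sentence ids follow the 'user#-chapter#-sentence#' pattern used later in this README; the function name and the toy id set are hypothetical and not part of the repository. It only states the ground-truth label, since the auditor itself never sees the training set.

```python
def is_user_level_member(user, training_ids):
    """Return True if any training sentence id belongs to this user.

    Assumes ids look like 'user-chapter-sentence', e.g. '777-126732-0046'.
    """
    prefix = user + "-"
    return any(sentence_id.startswith(prefix) for sentence_id in training_ids)


# Toy training set (hypothetical): the target model saw one sentence from user 777.
training_ids = {"777-126732-0046", "1272-128104-0000"}

# The query sentence is not in the training set, but its speaker is,
# so user 777 still counts as a user-level member.
query_id = "777-126732-0001"
print(query_id in training_ids)                    # sentence-level membership
print(is_user_level_member("777", training_ids))   # user-level membership
```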

Fig. 2 depicts a workflow of our audio auditor auditing an ASR model. Generally, there are two processes, i.e., training and auditing. The former process is to build a binary classifier as a user-level membership auditor A_audit using a supervised learning algorithm. The latter uses this auditor to audit an ASR model F_tar by querying a few audios spoken by one user u. In Section 4.4, we show that only a small number of audios per user can determine whether u ∈ U_tar or u ∉ U_tar. Furthermore, a small number of users used to train the auditor is sufficient to provide a satisfactory result.
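The two processes can be illustrated with a toy sketch. This is not the repository's auditor: a single-feature threshold rule stands in for the supervised learning algorithm, and it assumes, purely for illustration, that member users tend to get higher similarity scores than nonmembers. All names and numbers below are hypothetical.

```python
from statistics import mean


def fit_threshold_auditor(scores, labels):
    """Training: choose the score threshold with the best training accuracy.

    scores: one mean similarity per user whose membership is known.
    labels: 1 if that user is a member, else 0.
    """
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(scores)):
        accuracy = mean(
            1.0 if (score >= candidate) == bool(label) else 0.0
            for score, label in zip(scores, labels)
        )
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold


def audit_user(threshold, similarities):
    """Auditing: classify one user from a few queried audios."""
    return mean(similarities) >= threshold


# Training on a handful of labelled users (toy values).
threshold = fit_threshold_auditor([0.95, 0.9, 0.6, 0.5], [1, 1, 0, 0])

# Auditing two unseen users with two queried audios each.
member_guess = audit_user(threshold, [0.97, 0.93])
nonmember_guess = audit_user(threshold, [0.55, 0.6])
print(threshold, member_guess, nonmember_guess)
```

A real auditor would use the per-user statistics produced in the preprocessing steps below as its feature vector rather than one score.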

Data Prepare

Each matrix has 5 columns, the sentence id followed by 4 features: {id, transcript_txt, frame_length, txt, txt_length}.

For record id=777-126732-0046, under the folder ./testing_set_for_auditor, extract elements from /decode_dev_clean_2out_dnn2/decode.1.log and dev-clean-2-true-txt.txt

{777-126732-0046, "IN ANY CASE HE HAD NOT THE TIME", 223, "IN ANY CASE HE HAD NOT THE TIME", 31}
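As a small sketch of how such a record could be assembled, the helper below builds the 5-element row from its parts. The function name is hypothetical, and it assumes txt_length is the character count of the true transcription, which is consistent with the value 31 in the example record.

```python
def build_record(sentence_id, predicted_txt, frame_length, true_txt):
    """Assemble one row: id, predicted text, frame length, true text, true text length."""
    # Assumption: txt_length counts characters (spaces included) of the true text.
    return (sentence_id, predicted_txt, frame_length, true_txt, len(true_txt))


record = build_record(
    "777-126732-0046",
    "IN ANY CASE HE HAD NOT THE TIME",
    223,
    "IN ANY CASE HE HAD NOT THE TIME",
)
print(record)
```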

log to txt.

decodelog2txt.sh:

input: dataset = $1 = testing_set_for_auditor/decode_test_clean_2_user_out_dnn2; label = $2 = nonmember_test_clean_2_user
output: txt_f = testing_set_for_auditor/nonmember_test_clean_2_user.txt
Strip irrelevant information from the raw transcription results.

$ ./decodelog2txt.sh testing_set_for_auditor/decode_train_clean_100_user_out_dnn2 member_train_clean_100_user \
    >> out_log/decodelog2txt_member_train_clean_100_user.txt 2>&1 && echo 's' || echo 'e'
$ cp out_log/decodelog2txt_member_train_clean_100_user.txt testing_set_for_auditor/

txt to csv.

txt2csv.py:

input: txt_in = "testing_set_for_auditor/nonmember_test_clean_2_user.txt"; true_in = "testing_set_for_auditor/test-clean-2-user-true-txt.txt"
output: csv_file = "data/nonmember_test-clean-2-user.csv"
Convert txt_in (transcription results) to a matrix keyed by sentence id.
Extract 4 features (trans_txt, frame_length, true_txt, true_txt_length) for each sentence id.
Save as .csv with one row per sentence id and header = ['id', 'predicted_txt', 'true_txt', 'true_txt_length', 'frame_length'].

$ python ./txt2csv.py
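The conversion step can be sketched with the standard csv module. This is an illustration under stated assumptions, not the contents of txt2csv.py: the function name and the two input dictionaries are hypothetical, and true_txt_length is again taken to be a character count. Only the header is copied from the description above.

```python
import csv
import io

HEADER = ["id", "predicted_txt", "true_txt", "true_txt_length", "frame_length"]


def write_sentence_csv(predicted, true_txt, out):
    """Write one CSV row per sentence id.

    predicted: {id: (predicted_txt, frame_length)}
    true_txt:  {id: true transcription}
    """
    writer = csv.writer(out)
    writer.writerow(HEADER)
    for sentence_id in sorted(predicted):
        if sentence_id not in true_txt:
            continue  # skip ids without a reference transcription
        pred, frames = predicted[sentence_id]
        ref = true_txt[sentence_id]
        writer.writerow([sentence_id, pred, ref, len(ref), frames])


predicted = {"777-126732-0046": ("IN ANY CASE HE HAD NOT THE TIME", 223)}
true_txt = {"777-126732-0046": "IN ANY CASE HE HAD NOT THE TIME"}

buffer = io.StringIO()  # stands in for the output .csv file
write_sentence_csv(predicted, true_txt, buffer)
print(buffer.getvalue())
```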

Data Preprocess (feature extraction)

Convert per-sentence records (keyed by 'id') into per-user records (keyed by 'user'). The 2 string-type features, 'predicted_txt' and 'true_txt', are reduced to a numeric similarity score. That score and the remaining numeric features are then summarized statistically for each user.

Word2Vec Model Training

word_embedding.py:

input: predicted_path = testing_set_for_auditor//.log; true_label_path = True_transcripts/*.txt
output: w2vModel = word2vec_libri.model
Train a Word2Vec model on 2 kinds of vocabularies (the logs and the true_txt files) and save it as a .model file.
Alternatively, update a pretrained model (word2vec_*.model) with the additional samples (total_samples).

$ python ./word_embedding.py

Word2Vec Model Update

When new log files are found, repeat the Word2Vec Model Training step above:

$ python ./word_embedding.py

Similarity Score Between 'predicted_txt' and 'true_txt'

feature_sentence.py:

input: csv = data/member_dev-clean-2.csv
output: feats_file = data/member_feats3_dev-clean-2.csv
Load pretrained Word2Vec model (word2vec_*.model)
Initialize the list used to turn the original 4 features (all columns except 'id') into 3 features.
Convert the 2nd and 3rd columns (string-type features) into word vectors, 1 vector per word, and compute 1 similarity score from them.
Specifically, ['id', 'predicted_text', 'true_text', 'true_text_length', 'frame_length'] ==> ['id', 'similarity', 'frame_length', 'speed']
Save the initial features as .csv with one row per sentence id.

$ python ./feature_sentence.py
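A minimal sketch of the per-sentence features follows. It rests on several assumptions that the README does not confirm: word vectors are averaged into one sentence vector, similarity is the cosine between the two sentence vectors, and speed is the true text length divided by the frame length. The tiny embedding table is a hypothetical stand-in for the pretrained Word2Vec model.

```python
import math


def sentence_vector(text, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[word] for word in text.split() if word in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sentence_features(row, embeddings):
    """Map one 5-column row to ['id', 'similarity', 'frame_length', 'speed']."""
    pred_vec = sentence_vector(row["predicted_text"], embeddings)
    true_vec = sentence_vector(row["true_text"], embeddings)
    if pred_vec is None or true_vec is None:
        similarity = 0.0
    else:
        similarity = cosine(pred_vec, true_vec)
    # Assumption: speed is measured in characters per frame.
    speed = row["true_text_length"] / row["frame_length"]
    return {
        "id": row["id"],
        "similarity": similarity,
        "frame_length": row["frame_length"],
        "speed": speed,
    }


# Hypothetical 2-dimensional embeddings in place of word2vec_*.model.
toy_embeddings = {"HE": [1.0, 0.0], "HAD": [0.0, 1.0], "TIME": [1.0, 1.0]}
row = {
    "id": "777-126732-0046",
    "predicted_text": "HE HAD TIME",
    "true_text": "HE HAD TIME",
    "true_text_length": 11,
    "frame_length": 110,
}
features = sentence_features(row, toy_embeddings)
print(features)
```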

Similarity Statistic for Each User

feature_speaker.py:

input: feats_csv = data/member_feats3_dev-clean-2.csv
output: feats_user_file = data/member_feats3_user_dev-clean-2.csv
Compute per-user statistics over the 3 features, where 'id' = user#-chapter#-sentence#.
Specifically, ['id', 'similarity', 'frame_length', 'speed'] ==> ['user', 'similarity_statistics', 'frame_length_statistics', 'speed_statistics']
Save the processed features as .csv with one row per user (speaker) id.

$ python ./feature_speaker.py
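The per-user aggregation can be sketched as a group-by on the user part of the sentence id. The choice of statistics (mean, minimum, maximum) is an assumption made for illustration, since the README does not list which statistics feature_speaker.py computes; the function names are hypothetical as well.

```python
from collections import defaultdict
from statistics import mean


def user_of(sentence_id):
    """Extract the user from an id of the form user#-chapter#-sentence#."""
    return sentence_id.split("-")[0]


def aggregate_by_user(rows, columns=("similarity", "frame_length", "speed")):
    """Group sentence rows by user and summarize each numeric column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[user_of(row["id"])].append(row)

    summary = {}
    for user, user_rows in grouped.items():
        stats = {}
        for column in columns:
            values = [r[column] for r in user_rows]
            stats[column + "_mean"] = mean(values)
            stats[column + "_min"] = min(values)
            stats[column + "_max"] = max(values)
        summary[user] = stats
    return summary


# Toy sentence-level features for two users.
rows = [
    {"id": "777-126732-0046", "similarity": 1.0, "frame_length": 200, "speed": 0.25},
    {"id": "777-126732-0047", "similarity": 0.5, "frame_length": 400, "speed": 0.75},
    {"id": "84-121123-0001", "similarity": 0.25, "frame_length": 100, "speed": 0.5},
]
per_user = aggregate_by_user(rows)
print(per_user["777"])
```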

User-level Audio Auditor Model

$ python ./audit_speaker.py

