
[๋”ฅ๋Ÿฌ๋‹] Optical Music Recognition(OMR) ๋“œ๋Ÿผ ์•…๋ณด ์ธ์‹ ๋ชจ๋ธ

Jaeseo Kim 2024. 5. 20. 09:39

๐Ÿƒโ€โ™‚๏ธ ํ•ด๋‹น ๊ธ€์€ Tensorflow๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ์— ํ™˜๊ฒฝ์ด ๊ตฌ์ถ•๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Anaconda3 + tensorflow ํ‚ค์›Œ๋“œ๋กœ ๊ตฌ๊ธ€๋งํ•ด์„œ ๋‚˜์˜ค๋Š” ๋ธ”๋กœ๊ทธ๋“ค์„ ์ฐธ๊ณ  ๋ฐ”๋ž๋‹ˆ๋‹ค. :)


 

00. Goal

Given a drum sheet music image, we build a model that recognizes the notes (in this post, we distinguish pitch only).

•  Input: sheet music image
•  Output: pitch

 

Below is a sample of the dataset used in this post. It consists of an image and a label for each measure.

Drum Sheet Dataset.zip (0.12 MB)

Drum dataset

 

๋ฐ์ดํ„ฐ์…‹์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

ํŠนํžˆ, ๋ผ๋ฒจ์€ Alfaro๊ฐ€ ๋‹จ์ผ์Œํ–ฅ ์Œ์•…์„ ์ขŒ์—์„œ ์šฐ๋กœ ์ฝ๋Š” 1์ฐจ์› ์‹œํ€€์Šค๋กœ ๋‚˜ํƒ€๋‚ด๊ธฐ ์œ„ํ•ด ์ œ์•ˆํ•œ ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค.

์ด ์ธ์ฝ”๋”ฉ์€ ๊ฐ ์ฐจ๋ก€๋Œ€๋กœ ๋‚˜ํƒ€๋‚˜๋Š” note์™€ symbol ์‚ฌ์ด์— '+' ๊ธฐํ˜ธ๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ , ์ฝ”๋“œ์˜ ๊ฐœ๋ณ„ ์Œํ‘œ๋ฅผ ์•„๋ž˜์—์„œ ์œ„ ์ˆœ์„œ๋Œ€๋กœ ๋‚˜์—ดํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์Œํ‘œ๊ฐ€ ๋™์‹œ์— ๋‚˜์˜จ ๊ฒฝ์šฐ๋Š” '|' ๊ธฐํ˜ธ๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

ํ•ด๋‹น ๊ธ€์—์„œ๋Š” ์Œ์ •๋งŒ์„ ๊ตฌ๋ถ„ํ•˜๊ธฐ ๋•Œ๋ฌธ์— non-note ์Œ์•… ๊ธฐํ˜ธ(clefs, key signatures, time signatures, and barlines)์™€ ์‰ผํ‘œ๋Š” nonote๋กœ ๊ตฌ๋ถ„ํ•ฉ๋‹ˆ๋‹ค.

์•…๋ณด ์ด๋ฏธ์ง€

clef-percussion+note-F4_quarter|note-A5_quarter+note-C5_eighth|note-G5_eighth+note-G5_eighth+note-F4_eighth|note-G5_eighth+note-F4_eighth|note-G5_eighth+note-C5_eighth|note-G5_eighth+note-G5_eighth+barline

Label
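
To make the encoding concrete, here is a minimal sketch (an illustration only, not part of the pipeline below) that tokenizes the example label above: '+' separates symbols that appear in sequence, and '|' groups notes struck simultaneously.

# Minimal sketch: tokenize the example label above.
label = ("clef-percussion+note-F4_quarter|note-A5_quarter+note-C5_eighth|note-G5_eighth"
         "+note-G5_eighth+note-F4_eighth|note-G5_eighth+note-F4_eighth|note-G5_eighth"
         "+note-C5_eighth|note-G5_eighth+note-G5_eighth+barline")

for step in label.split("+"):   # '+' separates symbols played one after another
    chord = step.split("|")     # '|' separates notes struck at the same time
    print(chord)
# ['clef-percussion'], ['note-F4_quarter', 'note-A5_quarter'], ['note-C5_eighth', 'note-G5_eighth'], ...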

 

(Reference) Pitch table

Pitch information (https://musictheoryde-mystified.com/text-notation-pitch-and-octave-numbering/)


01. Background

Let's take a quick look at how the model learns and distinguishes notes.

Three algorithms are used for training.

  1. CNN (Convolutional Neural Network)
  2. RNN (Recurrent Neural Network)
  3. CTC Algorithm (Connectionist Temporal Classification)

 

Baoguang Shi, Xiang Bai, Cong Yao. 2015. "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition"

 

CNN์„ ํ†ตํ•ด ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ Feature Sequence๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
์ถ”์ถœํ•œ Feature Sequence๋“ค์„ RNN์˜ Input์œผ๋กœ ํ•˜์—ฌ ์ด๋ฏธ์ง€์˜ Text Sequence๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

์ด๋ ‡๊ฒŒ CNN + RNN ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๋ชจ๋ธ์„ CRNN์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

So why use a CRNN for sequence modeling?

A CNN has the limitation that each feature only reflects a particular part of the image, so it cannot capture information about the image as a whole.

To compensate, an RNN, which processes its inputs and outputs as sequences, is used to aggregate that information and predict the characters.

https://tv.kakao.com/channel/3150758/cliplink/391419266


CTC๋Š” ์Œ์„ฑ ์ธ์‹๊ณผ ๋ฌธ์ž ์ธ์‹์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค.

์Œ์„ฑ ํ˜น์€ ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ ์–ด๋””์„œ๋ถ€ํ„ฐ ์–ด๋””๊นŒ์ง€๊ฐ€ ํ•œ ๋ฌธ์ž์— ํ•ด๋‹นํ•˜๋Š”์ง€ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ์ด ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์— ๊ด€๊ณ„ ์ •๋ ฌ์„ ์œ„ํ•ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

Sequence Modeling With CTC (https://distill.pub/2017/ctc/)

 

CTC๋Š” ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด ์ž„์˜๋กœ ๋ถ„ํ• ๋œ ๊ฐ ์˜์—ญ๋งˆ๋‹ค์˜ ํŠน์ง•์— ๋Œ€ํ•ด ํ™•๋ฅ ์ ์œผ๋กœ ์˜ˆ์ธกํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
P(Y|X) ์ฆ‰, ์ฃผ์–ด์ง„ X์— ๋Œ€ํ•ด์„œ Y์ผ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•ด ์ฃผ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.


์œ„ ๊ทธ๋ฆผ์—์„œ ฯต์„ Blank Token์ด๋ผ๊ณ  ๋ถ€๋ฅด๋Š”๋ฐ, ๋ฌธ์ž ์ด๋ฏธ์ง€๊ฐ€ ์—†๋Š” ๋ถ€๋ถ„์€ ๋นˆ์นธ(Blank)์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ณ 

๊ฐ ๋‹จ๊ณ„๋ณ„ ์˜ˆ์ธก๋œ ์ค‘๋ณต ๋ฌธ์ž๋“ค์„ ํ•ฉ์ณ์„œ ์ตœ์ข… ๋ฌธ์ž๋ฅผ ์–ป๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. 


02. Library

Let's import the required libraries.

import glob
import numpy as np
import matplotlib.pyplot as plt 
import tensorflow as tf 
from tensorflow import keras 
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split


03. Data Load

Let's create lists holding the image and label file paths. (The dataset_path below is an assumed location; point it at wherever you unzipped the dataset.)

# Path to the unzipped dataset (an assumed location; adjust to your environment)
dataset_path = "./Drum Sheet Dataset"

x_dataset_path = f"{dataset_path}/measure/"
x_all_dataset_path = glob.glob(f"{x_dataset_path}/*")
x_file_list = [file for file in x_all_dataset_path if file.endswith(".png")]
x_file_list.sort()

y_dataset_path = f"{dataset_path}/annotation/"
y_all_dataset_path = glob.glob(f"{y_dataset_path}/*")
y_file_list = [file for file in y_all_dataset_path if file.endswith(".txt")]
y_file_list.sort()

images = x_file_list
labels = y_file_list

print("Total number of images: ", len(images))
print("Total number of labels: ", len(labels))

Output


๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ, ์ด๋ฏธ์ง€ ํฌ๊ธฐ ๋“ฑ๋„ ์ง€์ •ํ•ด์ค๋‹ˆ๋‹ค.

# Batch size
batch_size = 16

# Image size
img_width = 256
img_height = 128


# Length of the longest label
max_length = 24


04. Data Pre-Processing

๋ฌธ์ž๋ฅผ ์ˆซ์ž๋กœ encoding ํ•˜๊ณ , ์ˆซ์ž๋ฅผ ๋ฌธ์ž๋กœ decoding ํ•˜๊ธฐ ์œ„ํ•œ char_to_num๊ณผ num_to_char๋ฅผ ๋งŒ๋“ค์–ด์ฃผ๊ฒ ์Šต๋‹ˆ๋‹ค.

์šฐ์„  Pitch(์Œ์ •)์— ๋Œ€ํ•ด ๊ตฌ๋ถ„ํ•˜๋Š” ๋ชจ๋ธ์„ ์œ„ํ•ด ํ•„์š”ํ•œ vocabulary๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์ •์˜ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

char_to_int_mapping = [
    "|",  #1
    "nonote",#2
    "note-D4",#3
    "note-E4",#4
    "note-F4",#5
    "note-G4",#6
    "note-A4",#7
    "note-B4",#8
    "note-C5",#9
    "note-D5",#10
    "note-E5",#11
    "note-F5",#12
    "note-G5",#13
    "note-A5",#14
    "note-B5",#15
]

# ๋ฌธ์ž๋ฅผ ์ˆซ์ž๋กœ ๋ณ€ํ™˜
char_to_num = layers.StringLookup(
    vocabulary=list(char_to_int_mapping), mask_token=None
)

# ์ˆซ์ž๋ฅผ ๋ฌธ์ž๋กœ ๋ณ€ํ™˜
num_to_char = layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), mask_token=None, invert=True
)
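
As a quick sanity check (a hypothetical snippet, not in the original post), a few tokens can be round-tripped through the two lookups:

# Hypothetical round-trip check: encode tokens, then decode them back.
tokens = tf.strings.split("note-F4 | note-A5")
encoded = char_to_num(tokens)
decoded = num_to_char(encoded)
print(encoded.numpy())                        # integer ids from the vocabulary
print([t.decode() for t in decoded.numpy()])  # ['note-F4', '|', 'note-A5']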

 

You can check the configured vocabulary with char_to_num.get_vocabulary().

Result

 

ํ˜„์žฌ ๋ผ๋ฒจ์—” ์Œํ‘œ, ์‰ผํ‘œ, ๋งˆ๋””์„  ๋“ฑ ๋ชจ๋“  ๊ฒŒ ๋‹ค ํฌํ•จ๋˜์–ด ์žˆ๊ธฐ ๋–„๋ฌธ์— ์ด์ค‘์—์„œ ์Œํ‘œ ์ •๋ณด๋งŒ ๊ฐ€์ ธ์™€ vocabulary์— ๋งคํ•‘๋˜๋„๋ก ์ฒ˜๋ฆฌํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

# ๊ฐ token์— ๋งž๋Š” string list๋กœ ๋งŒ๋“ค๊ธฐ
def map_pitch(note):
    pitch_mapping = {
        "note-D4": 1,
        "note-E4": 2,
        "note-F4": 3,
        "note-G4": 4,
        "note-A4": 5,
        "note-B4": 6,
        "note-C5": 7,
        "note-D5": 8,
        "note-E5": 9,
        "note-F5": 10,
        "note-G5": 11,
        "note-A5": 12,
        "note-B5": 13,
    }
    return "nonote" if note not in pitch_mapping else note

def map_rhythm(note):
    duration_mapping =  {
        "[PAD]":0,
        "+": 1,
        "|": 2,
        "barline": 3,
        "clef-percussion": 4,

        "note-eighth": 5,
        "note-eighth.": 6,
        "note-half": 7,
        "note-half.": 8,

        "note-quarter": 9,
        "note-quarter.": 10,
        "note-16th": 11,
        "note-16th.": 12,

        "note-whole": 13,
        "note-whole.": 14,

        "rest_eighth": 15,
        "rest-eighth.": 16,
        "rest_half": 17,
        "rest_half.": 18,

        "rest_quarter": 19,
        "rest_quarter.": 20,
        "rest_16th": 21,
        "rest_16th.": 22,

        "rest_whole": 23,
        "rest_whole.": 24,
        
        "timeSignature-4/4": 25
    }
    return note if note in duration_mapping else "<unk>"

def map_lift(note):
    lift_mapping =  {
        "lift_null" : 1,
        "lift_##"   : 2,
        "lift_#"    : 3,
        "lift_bb"   : 4,
        "lift_b"    : 5,
        "lift_N"    : 6
    }
    return "nonote" if note not in lift_mapping else note
    
def symbol2pitch_rhythm_lift(symbol_lift, symbol_pitch, symbol_rhythm):
    return map_lift(symbol_lift), map_pitch(symbol_pitch), map_rhythm(symbol_rhythm)

def note2pitch_rhythm_lift(note):
    # note-G#3_eighth
    note_split = note.split("_") # (note-G#3) (eighth)
    note_pitch_lift = note_split[:1][0]
    note_rhythm = note_split[1:][0]
    rhythm=f"note-{note_rhythm}"

    note_note, pitch_lift = note_pitch_lift.split("-") # (note) (G#3)
    if len(pitch_lift)>2:
        pitch = f"note-{pitch_lift[0]+pitch_lift[-1]}" # (G3)
        lift = f"lift_{pitch_lift[1:-1]}"
    else:
        pitch = f"note-{pitch_lift}" 
        lift = f"lift_null"
    return symbol2pitch_rhythm_lift(lift, pitch, rhythm)

def rest2pitch_rhythm_lift(rest):
    # rest-quarter
    return symbol2pitch_rhythm_lift("nonote", "nonote", rest)

def map_pitch2isnote(pitch_note):
    group_notes = []
    note_split = pitch_note.split("+")
    for note_s in note_split:
        if "nonote" in note_s:
            group_notes.append("nonote")
        elif "note-" in note_s:
            group_notes.append("note")
    return "+".join(group_notes)


def map_notes2pitch_rhythm_lift_note(note_list):
    result_lift=[]
    result_pitch=[]
    result_rhythm=[]
    result_note=[]

    for notes in note_list:
        group_lift = []
        group_pitch = []
        group_rhythm = []
        group_notes_token_len=0

        # First split on '+', then check whether each piece contains '|' (a chord) and handle that first
        # e.g. note-G#3_eighth + note-G3_eighth + note-G#3_eighth|note-G#3_eighth + rest-quarter
        note_split = notes.split("+")
        for note_s in note_split:
            if "|" in note_s:
                mapped_lift_chord = []
                mapped_pitch_chord = []
                mapped_rhythm_chord = []
                
                # note-G#3_eighth|note-G#3_eighth
                note_split_chord = note_s.split("|") # (note-G#3_eighth) (note-G#3_eighth)
                for idx, note_s_c in enumerate(note_split_chord):
                    chord_lift, chord_pitch, chord_rhythm = note2pitch_rhythm_lift(note_s_c)

                    mapped_lift_chord.append(chord_lift)
                    mapped_pitch_chord.append(chord_pitch)
                    mapped_rhythm_chord.append(chord_rhythm)

                    # --> '|' ๋„ token์ด๊ธฐ ๋•Œ๋ฌธ์— lift, pitch์—” nonote ์ถ”๊ฐ€ํ•ด์ฃผ๊ธฐ
                    if idx != len(note_split_chord)-1:
                        mapped_lift_chord.append("nonote")
                        # mapped_pitch_chord.append("nonote")

                group_lift.append(" ".join(mapped_lift_chord))
                group_pitch.append(" | ".join(mapped_pitch_chord))
                group_rhythm.append(" | ".join(mapped_rhythm_chord))

                # --> '|' ๋„ token์ด๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๊ฐ€๋œ token ๊ฐœ์ˆ˜ ๋”ํ•˜๊ธฐ
                # ๋™์‹œ์— ์นœ ๊ฑธ ํ•˜๋‚˜์˜ string์œผ๋กœ ํ•ด๋ฒ„๋ฆฌ๋Š” ๊ฑฐ๋‹ˆ๊นŒ ์ฃผ์˜ํ•˜๊ธฐ
                group_notes_token_len+=len(note_split_chord) + len(note_split_chord)-1

            elif "note" in note_s:
                if "_" in note_s:
                    # note-G#3_eighth
                    note2lift, note2pitch, note2rhythm = note2pitch_rhythm_lift(note_s)
                    group_lift.append(note2lift)
                    group_pitch.append(note2pitch)
                    group_rhythm.append(note2rhythm)
                    group_notes_token_len+=1
            
            elif "rest" in note_s:
                if "_" in note_s:
                    # rest_quarter
                    rest2lift, rest2pitch, rest2rhythm =rest2pitch_rhythm_lift(note_s)
                    group_lift.append(rest2lift)
                    group_pitch.append(rest2pitch)
                    group_rhythm.append(rest2rhythm)
                    group_notes_token_len+=1
            else:
                # clef-F4+keySignature-AM+timeSignature-12/8
                symbol2lift, symbol2pitch, symbol2rhythm = symbol2pitch_rhythm_lift("nonote", "nonote", note_s)
                group_lift.append(symbol2lift)
                group_pitch.append(symbol2pitch)
                group_rhythm.append(symbol2rhythm)
                group_notes_token_len+=1

        toks_len= group_notes_token_len

        # lift, pitch
        emb_lift= " ".join(group_lift)
        emb_pitch= " ".join(group_pitch)

        # rhythm
        emb_rhythm= " ".join(group_rhythm)

        # Pad whatever length remains
        if toks_len < max_length :
            for _ in range(max_length - toks_len ):
                emb_lift+=" [PAD]"
                emb_pitch+=" [PAD]"        
                emb_rhythm+=" [PAD]"

        result_lift.append(emb_lift)
        result_pitch.append(emb_pitch)
        result_rhythm.append(emb_rhythm)
        result_note.append(map_pitch2isnote(emb_pitch))
    return result_lift, result_pitch, result_rhythm, result_note

def read_txt_file(file_path):
    # Read a text file and return its contents
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.readlines()
        # Strip the newline character from each line
        content = [line.strip() for line in content]
    return content[0]

contents = []
# Read each annotation file and append its contents to the list
for annotation_path in labels:
    content = read_txt_file(annotation_path)
    # Join the tokens with '+' in between
    content = content.replace(" ", "+")
    content = content.replace("\t", "+")
    contents.append(content)
    
result_lift, result_pitch, result_rhythm, result_note = map_notes2pitch_rhythm_lift_note(contents)
labels = result_pitch
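
As a quick illustration (hypothetical calls, not in the original post), a single token decomposes like this, and every processed label should now contain exactly max_length tokens:

# Hypothetical example: decompose one note token into (lift, pitch, rhythm).
print(note2pitch_rhythm_lift("note-F4_quarter"))
# ('lift_null', 'note-F4', 'note-quarter')

# Pitches outside the drum vocabulary (D4 to B5) fall back to 'nonote':
print(note2pitch_rhythm_lift("note-G#3_eighth"))
# ('lift_#', 'nonote', 'note-eighth')

# Each padded label should have exactly max_length tokens (assuming none exceed it):
print({len(l.split()) for l in labels})  # {24}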

 

์ฒ˜๋ฆฌ๋œ ๊ฑธ ํ™•์ธํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

print(contents[0])
print(labels[0]) 
print(char_to_num(tf.strings.split(labels[0])))

Output


sklearn์˜ train_test_split()์„ ์ด์šฉํ•ด Data๋ฅผ Train Set๊ณผ Validation Set์œผ๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ๋ณ€์ˆ˜์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

์ด Dataset์—์„œ 90%๋ฅผ Train Set์œผ๋กœ ์‚ฌ์šฉํ•˜๊ณ  10%๋ฅผ Validation Set์œผ๋กœ ์ง€์ •ํ•ด ์ฃผ๊ธฐ ์œ„ํ•ด test_size๋ฅผ 0.1๋กœ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

x_train, x_valid, y_train, y_valid = train_test_split(np.array(images), np.array(labels), test_size=0.1)


Finally, we define the encode_single_sample() function that will be applied when building the Dataset. It converts each sample into a form suitable for TensorFlow.

The image file is read and decoded as a grayscale image, then resized to the dimensions specified above. The image is originally wider than it is tall, and since we want it interpreted sequentially from the first note, its width and height are swapped so that the image can be read from top to bottom.

Each label string is split into tokens, which are then encoded as numbers.

def encode_single_sample(img_path, label):
    # 1. Read the image file
    img = tf.io.read_file(img_path)
    # 2. Decode the PNG as a grayscale image
    img = tf.io.decode_png(img, channels=1)
    # 3. Convert the integer range [0, 255] to the float range [0, 1]
    img = tf.image.convert_image_dtype(img, tf.float32)
    # 4. Resize the image
    img = tf.image.resize(img, [img_height, img_width])
    # 5. Swap the image's width and height
    img = tf.transpose(img, perm=[1, 0, 2])

    # 6. Encode the label characters as numbers
    label_r = char_to_num(tf.strings.split(label))

    # 7. Return as a dictionary
    return {"image": img, "label": label_r}


05. Creating the Dataset Objects

We use tf.data.Dataset to build a Dataset from numpy arrays or tensors.

We apply the encode_single_sample function defined above and build the train and validation Datasets with the batch size specified earlier.

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = (
    train_dataset.map(
        encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE
    )
    .batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)

validation_dataset = tf.data.Dataset.from_tensor_slices((x_valid, y_valid))
validation_dataset = (
    validation_dataset.map(
        encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE
    )
    .batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
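
To verify the pipeline (a hypothetical check), you can inspect the structure the Dataset will yield:

# Hypothetical check: each element is a dict holding image and label tensors.
print(train_dataset.element_spec)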


06. Data Visualization

Let's check the images and labels of the Datasets we created.

The images were transposed earlier, so the code below applies .T in imshow(img[:, :, 0].T) to transpose them back for easier viewing.

_, ax = plt.subplots(4, 1)

for batch in train_dataset.take(1):
    images = batch["image"]
    labels = batch["label"]
    for i in range(4):
        img = (images[i] * 255).numpy().astype("uint8")
        label = tf.strings.join(num_to_char(labels[i]), separator=' ').numpy().decode("utf-8").replace('[UNK]', '')
        print(labels[i])
        ax[i].imshow(img[:, :, 0].T, cmap="gray")
        ax[i].set_title(label)
        ax[i].axis("off")
plt.show()

 

Visualization of the images in one batch of the train dataset


07. Model

Let's implement a CTC layer class for computing the CTC loss.

The CTC loss can be computed with keras.backend.ctc_batch_cost.

class CTCLayer(layers.Layer):
    def __init__(self, name=None):
        super().__init__(name=name)
        self.loss_fn = keras.backend.ctc_batch_cost

    def call(self, y_true, y_pred):
        batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
        input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
        label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

        input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
        label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

        loss = self.loss_fn(y_true, y_pred, input_length, label_length)
        self.add_loss(loss)
        
        return y_pred
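
As a minimal sketch (random values for shapes only, not part of the post's pipeline), these are the shapes ctc_batch_cost expects:

# Shape sketch for keras.backend.ctc_batch_cost (random values, shapes only).
batch, timesteps, num_classes = 2, 64, 17  # 64 = img_width // 4; 17 = vocabulary size + 1 (CTC blank)
y_pred = tf.random.uniform((batch, timesteps, num_classes))     # stand-in for per-step softmax outputs
y_true = tf.random.uniform((batch, 24), 1, 16, dtype=tf.int32)  # stand-in encoded labels (max_length = 24)
input_length = tf.fill((batch, 1), timesteps)                   # prediction length per sample
label_length = tf.fill((batch, 1), 24)                          # label length per sample
loss = keras.backend.ctc_batch_cost(
    tf.cast(y_true, tf.float32), y_pred, input_length, label_length
)
print(loss.shape)  # (2, 1): one CTC loss value per sample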


CRNN ๋ชจ๋ธ์„ ๊ตฌํ˜„ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. 2๊ฐœ์˜ Convolution block๊ณผ 2๊ฐœ์˜ LSTM ๋ชจ๋ธ์ด ๊ฒฐํ•ฉํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

ํ•™์Šต์— ์‚ฌ์šฉ๋˜๋Š” loss๋Š” ์œ„์—์„œ ์ง€์ •ํ•œ CTC Layer์˜ loss๋ฅผ ์ด์šฉํ•ด์„œ ํ•™์Šตํ•˜๋„๋ก ์„ค์ •ํ•ด ์ค๋‹ˆ๋‹ค.

def build_model():
    # Inputs
    input_img = layers.Input(
        shape=(img_width, img_height, 1), name="image", dtype="float32"
    )
    labels = layers.Input(name="label", shape=(None,), dtype="float32")

    # First convolution block
    x = layers.Conv2D(
        32,
        (3, 3),
        activation="relu",
        kernel_initializer="he_normal",
        padding="same",
        name="Conv1",
    )(input_img)
    x = layers.MaxPooling2D((2, 2), name="pool1")(x)

    # Second convolution block
    x = layers.Conv2D(
        64,
        (3, 3),
        activation="relu",
        kernel_initializer="he_normal",
        padding="same",
        name="Conv2",
    )(x)
    x = layers.MaxPooling2D((2, 2), name="pool2")(x)

    # The two convolution blocks above apply (2, 2) max pooling twice in total,
    # so the feature map is downsampled to 1/4 of its size in each dimension.
    # The last conv layer has 64 filters; reshape before feeding the RNN.
    new_shape = ((img_width // 4), (img_height // 4) * 64)
    x = layers.Reshape(target_shape=new_shape, name="reshape")(x)
    x = layers.Dense(64, activation="relu", name="dense1")(x)
    x = layers.Dropout(0.2)(x)

    # RNNs
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.25))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.25))(x)

    # Output layer
    x = layers.Dense(
        len(char_to_num.get_vocabulary()) + 1, activation="softmax", name="dense2"
    )(x)

    # ctc loss
    output = CTCLayer(name="ctc_loss")(labels, x)
    
    # Model
    model = keras.models.Model(
        inputs=[input_img, labels], outputs=output, name="omr"
    )
    
    # Optimizer
    opt = keras.optimizers.Adam()

    model.compile(optimizer=opt)
    return model

# Model
model = build_model()
model.summary()


08. Train

๊ทธ๋Ÿผ ์ด์ œ epoch๋ฅผ 200์œผ๋กœ ์„ค์ •ํ•˜๊ณ  early stopping์€ patience๋ฅผ 10์œผ๋กœ ์ง€์ •ํ•˜์—ฌ ํ•™์Šต์„ ํ•ด๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

epochs = 200
early_stopping_patience = 10
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=early_stopping_patience, restore_best_weights=True
)

history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=epochs,
    callbacks=[early_stopping],
)
Epoch 1/200
6/6 [==============================] - 9s 339ms/step - loss: 99.2571 - val_loss: 62.9635
Epoch 2/200
6/6 [==============================] - 0s 49ms/step - loss: 58.3946 - val_loss: 49.3992
Epoch 3/200
6/6 [==============================] - 0s 42ms/step - loss: 51.5517 - val_loss: 46.9572
Epoch 4/200
6/6 [==============================] - 0s 42ms/step - loss: 49.1146 - val_loss: 45.2404
Epoch 5/200
6/6 [==============================] - 0s 40ms/step - loss: 47.8802 - val_loss: 43.4212
Epoch 6/200
6/6 [==============================] - 0s 41ms/step - loss: 45.6561 - val_loss: 41.9743
Epoch 7/200
6/6 [==============================] - 0s 40ms/step - loss: 44.0756 - val_loss: 41.5652
Epoch 8/200
6/6 [==============================] - 0s 39ms/step - loss: 43.6151 - val_loss: 38.8865
Epoch 9/200
6/6 [==============================] - 0s 39ms/step - loss: 41.6590 - val_loss: 39.4570
Epoch 10/200
6/6 [==============================] - 0s 40ms/step - loss: 41.6629 - val_loss: 38.0012
Epoch 11/200
6/6 [==============================] - 0s 40ms/step - loss: 40.3495 - val_loss: 37.4150
Epoch 12/200
6/6 [==============================] - 0s 38ms/step - loss: 39.3751 - val_loss: 38.2025
Epoch 13/200
6/6 [==============================] - 0s 39ms/step - loss: 38.5040 - val_loss: 37.1740
...
Epoch 171/200
6/6 [==============================] - 0s 39ms/step - loss: 2.8438 - val_loss: 2.2756
Epoch 172/200
6/6 [==============================] - 0s 40ms/step - loss: 2.7682 - val_loss: 2.1957


09. Predict

ํ•™์Šต๋œ ๋ชจ๋ธ๋กœ Validation Data๋ฅผ ์Œ์ •์œผ๋กœ ์ถœ๋ ฅํ•˜๊ธฐ ์œ„ํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด์ค๋‹ˆ๋‹ค.

์ถœ๋ ฅ๊ฐ’์„ Decodingํ•˜๊ธฐ ์œ„ํ•ด decode_batch_predictions๋ผ๋Š” ํ•จ์ˆ˜๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค.

# Prediction Model
prediction_model = keras.models.Model(
    model.get_layer(name="image").input, model.get_layer(name="dense2").output
)
prediction_model.summary()

# Decoding
def decode_batch_predictions(pred):
    input_len = np.ones(pred.shape[0]) * pred.shape[1]
    results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][
        :, :max_length
    ]
    output_text = []
    for res in results:
        print(res)
        res = tf.strings.join(num_to_char(res), separator=' ').numpy().decode("utf-8").replace('[UNK]', '')
        output_text.append(res)
    return output_text


10. Checking the Predictions

Let's feed one batch of validation_dataset into prediction_model and visualize the results.

# Visualize one batch from the validation dataset
for batch in validation_dataset.take(1):
    batch_images = batch["image"]
    batch_labels = batch["label"]

    preds = prediction_model.predict(batch_images)
    pred_texts = decode_batch_predictions(preds)

    orig_texts = []
    for label in batch_labels:
        label = tf.strings.join(num_to_char(label), separator=' ').numpy().decode("utf-8").replace('[UNK]', '')
        orig_texts.append(label)

    _, ax = plt.subplots(10, 1, figsize=(100, 50))
    for i in range(len(pred_texts)):
        img = (batch_images[i, :, :, 0] * 255).numpy().astype(np.uint8)
        img = img.T
        title = f"Prediction: {pred_texts[i]}"
        ax[i].imshow(img, cmap="gray")
        ax[i].set_title(title)
        ax[i].axis("off")
plt.show()

 

Prediction results


ํ•ด๋‹น ๊ธ€์€ ์Œ์ • ๊ตฌ๋ถ„์„ ์œ„ํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

์‰ผํ‘œ ๋ฐ ๋งˆ๋””์„  ๋“ฑ์„ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•ด์„  ์ถ”๊ฐ€ ๋ผ๋ฒจ๋ง ์ž‘์—…์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ์…‹ ์ƒ˜ํ”Œ๋„ 92๊ฐœ๋ฐ–์— ์—†์–ด, ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์€๋ฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋Œ€์šฉ๋Ÿ‰์„ ์ƒ์„ฑํ•˜์—ฌ ํ•™์Šตํ•˜๋ฉด ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ผ ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒํ•ฉ๋‹ˆ๋‹ค.

 

๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

 

References

Jorge Calvo-Zaragoza, David Rizo. 2018. "End-to-End Neural Optical Music Recognition of Monophonic Scores." https://doi.org/10.3390/app8040606