nlp
API reference for CivicLens's natural language processing toolkit.
Reference
comments
assign_clusters(df, clusters)
Inserts cluster information into the polars df of data from the initial pull.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | df from initial pull | required |
clusters | list[set[int]] | clusters from Louvain Communities | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: updated df |
Source code in civiclens/nlp/comments.py
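As a rough illustration of the mapping assign_clusters performs, the sketch below inverts a `list[set[int]]` of clusters into a per-comment label and attaches it to each row. It uses plain dictionaries instead of polars so that it runs without dependencies; the helper name `cluster_labels` and the `index`, `comment`, and `cluster` column names are assumptions made for illustration, not taken from civiclens/nlp/comments.py.

```python
def cluster_labels(clusters: list[set[int]]) -> dict[int, int]:
    """Map each comment index to the position of the cluster containing it."""
    labels: dict[int, int] = {}
    for cluster_id, members in enumerate(clusters):
        for node in members:
            labels[node] = cluster_id
    return labels


# Rows standing in for the df from the initial pull (hypothetical columns).
rows = [
    {"index": 0, "comment": "Please extend the deadline."},
    {"index": 1, "comment": "Extend the deadline, please."},
    {"index": 2, "comment": "I oppose this rule."},
]
clusters = [{0, 1}, {2}]

labels = cluster_labels(clusters)
# Comments that appear in no cluster get None rather than a guessed label.
rows_with_clusters = [{**row, "cluster": labels.get(row["index"])} for row in rows]
```

Inverting the clusters once keeps the per-row lookup constant time, which matters when a document has many comments.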
build_graph(df)
Builds a network graph with comments as nodes and their similarities as edge weights.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | df with pairs of comment indices and a cosine similarity | required |
Returns:
Type | Description |
---|---|
Graph | nx.Graph: network graph with comments as nodes and their similarities as edge weights |
Source code in civiclens/nlp/comments.py
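The graph structure build_graph describes can be sketched without networkx as a dictionary of dictionaries. This is only an illustration of the idea (comment indices as nodes, cosine similarity as the edge weight); the helper name `build_adjacency` and the `(idx1, idx2, similarity)` tuple layout are assumptions, and the real function returns an `nx.Graph`.

```python
from collections import defaultdict


def build_adjacency(pairs):
    """Build an undirected weighted graph as {node: {neighbor: weight}}."""
    graph = defaultdict(dict)
    for idx1, idx2, similarity in pairs:
        # Store the edge in both directions because the graph is undirected.
        graph[idx1][idx2] = similarity
        graph[idx2][idx1] = similarity
    return dict(graph)


# Hypothetical pairs of comment indices with a cosine similarity.
pairs = [(0, 1, 0.92), (1, 2, 0.81), (0, 2, 0.40)]
graph = build_adjacency(pairs)
```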
comment_similarity(df, model)
Creates a df with comment mappings and their semantic similarity scores, using the SBERT paraphrase mining method with the all-mpnet-base-v2 model from Hugging Face.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | df with comment data | required |
model | SentenceTransformer | SBERT sentence transformer model | required |
Returns:
Type | Description |
---|---|
DataFrame | df_paraphrase, df_form_letter (tuple[pl.DataFrame]): cosine similarities for form letters and non-form letters |
Source code in civiclens/nlp/comments.py
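The similarity score behind comment_similarity is the cosine of the angle between two embedding vectors. The sketch below computes it in plain Python for a few toy vectors and ranks every pair from most to least similar. It is a brute-force stand-in for illustration, not the SBERT paraphrase mining implementation; the vectors and the `(score, i, j)` tuple layout are assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-d "embeddings" keyed by comment index (real embeddings are much longer).
embeddings = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}

# Score every unordered pair, then sort so the most similar pair comes first.
scored = sorted(
    ((cosine_similarity(embeddings[i], embeddings[j]), i, j)
     for i, j in combinations(sorted(embeddings), 2)),
    reverse=True,
)
```

Comparing every pair is quadratic in the number of comments, so this form is only practical for small examples.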
compute_similiarity_clusters(embeds, sim_threshold)
Extracts form letters from a corpus of comments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeds | ndarray | array of embeddings representing the documents | required |
sim_threshold | float | distance threshold used to divide clusters | required |
Returns:
Type | Description |
---|---|
ndarray | Array of docs by cluster |
Source code in civiclens/nlp/comments.py
count_unique_comments(df)
Counts the number of unique comments identified by performing paraphrase mining on a corpus of comments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | dataframe of similar comments | required |
Source code in civiclens/nlp/comments.py
find_central_node(G, clusters)
Find the most representative comment in a cluster by identifying the most central node.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
G | Graph | network graph with comments as nodes and their similarities as weights | required |
clusters | list[set[int]] | clusters from Louvain Communities | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | dictionary with the central comment id as the key and the degree centrality as the value |
Source code in civiclens/nlp/comments.py
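To make the "most central node" idea concrete, the sketch below picks, for each cluster, the node with the highest degree centrality, computed here as the number of neighbors inside the cluster divided by the cluster size minus one. Whether find_central_node measures centrality within each cluster or over the whole graph is not stated above, so the within-cluster choice is an assumption, as are the helper name `central_nodes` and the dict-of-dicts graph used in place of an `nx.Graph`.

```python
def central_nodes(graph, clusters):
    """Return {central_node: degree_centrality} with one entry per cluster."""
    result = {}
    for members in clusters:
        if not members:
            continue
        if len(members) == 1:
            # Centrality is undefined for a single node; report 0.0 (assumption).
            result[next(iter(members))] = 0.0
            continue
        best_node, best_score = None, -1.0
        # Sorted iteration makes ties resolve to the smallest node id.
        for node in sorted(members):
            degree = sum(1 for neighbor in graph.get(node, {}) if neighbor in members)
            score = degree / (len(members) - 1)
            if score > best_score:
                best_node, best_score = node, score
        result[best_node] = best_score
    return result


# Hypothetical graph: node 0 is linked to both 1 and 2; nodes 3 and 4 form a pair.
graph = {
    0: {1: 0.9, 2: 0.8},
    1: {0: 0.9},
    2: {0: 0.8},
    3: {4: 0.7},
    4: {3: 0.7},
}
centers = central_nodes(graph, [{0, 1, 2}, {3, 4}])
```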
find_form_letters(df, model, form_threshold)
Finds and extracts form letters by clustering, and counts the number of unique comments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | dataframe of comments to extract form letters from | required |
model | SentenceTransformer | vectorizer model for text embeddings | required |
form_threshold | int | threshold to consider a comment a form letter | required |
Returns:
Type | Description |
---|---|
tuple[list[dict], int] | List of form letters, number of unique comments |
Source code in civiclens/nlp/comments.py
get_clusters(G)
Defines clusters based on the Louvain Communities algorithm.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
G | Graph | network graph with comments as nodes and their similarities as weights | required |
Returns:
Type | Description |
---|---|
list[set[int]] | sets are clusters of comment nodes |
Source code in civiclens/nlp/comments.py
get_doc_comments(id)
Pulls all comments for a set of documents and preprocesses them into a polars dataframe.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | int | document id | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: formatted polars df |
Source code in civiclens/nlp/comments.py
rep_comment_analysis(comment_data, df, model)
Runs all representative comment code for a document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
comment_data | RepComments | empty RepComments object | required |
df | DataFrame | dataframe of comments pertaining to a document | required |
model | SentenceTransformer | SBERT model for embeddings | required |
Returns:
Name | Type | Description |
---|---|---|
RepComments | RepComments | dataclass with comment data |
Source code in civiclens/nlp/comments.py
representative_comments(G, clusters, df, form_letter)
Creates a dataframe with the text of the representative comments along with the number of comments that are semantically represented by that text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
G | Graph | network graph with comments as nodes and their similarities as weights | required |
clusters | list[set[int]] | clusters from Louvain Communities | required |
df | DataFrame | df from initial pull with added cluster info | required |
Returns:
Name | Type | Description |
---|---|---|
output_df | DataFrame | df with representation information |
Source code in civiclens/nlp/comments.py
titles
TitleChain
Creates more accessible titles for regulation documents.
Source code in civiclens/nlp/titles.py
get_doc_summary(id)
Gets the id and summary for a given document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | int | document id | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: formatted polars df |
Source code in civiclens/nlp/titles.py
tools
Comment
Source code in civiclens/nlp/tools.py
to_dict()
Converts comment object to dictionary.
Source code in civiclens/nlp/tools.py
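A conversion like to_dict is commonly written with `dataclasses.asdict`. The sketch below shows that pattern on a stand-in class. The fields `id`, `text`, and `num_represented` are hypothetical placeholders, since the attributes of Comment are not listed here, and the real method may build its dictionary differently.

```python
from dataclasses import asdict, dataclass


@dataclass
class Comment:
    # Hypothetical fields for illustration only.
    id: str
    text: str
    num_represented: int = 1

    def to_dict(self) -> dict:
        """Return the comment's fields as a plain dictionary."""
        return asdict(self)


example = Comment(id="c-1", text="Please extend the deadline.", num_represented=12)
example_dict = example.to_dict()
```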
RepComments
Source code in civiclens/nlp/tools.py
get_nonrepresentative_comments()
Converts nonrepresentative comments to a list of Comment objects.
Source code in civiclens/nlp/tools.py
to_list()
Converts representative comments to a list of Comment objects.
Source code in civiclens/nlp/tools.py
sentiment_analysis(comment, pipeline)
Analyzes the sentiment of a comment.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
comment | Comment | Comment object | required |
pipeline | pipeline | Hugging Face pipeline for conducting sentiment analysis | required |
Returns:
Type | Description |
---|---|
str | Sentiment label as string (e.g. 'positive', 'negative', 'neutral') |
Source code in civiclens/nlp/tools.py
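Hugging Face text-classification pipelines are callables that return a list of dictionaries with `label` and `score` keys. The sketch below shows how a function shaped like sentiment_analysis could read the label from that result, using a keyword-based stub in place of a real model so it runs offline. The stub `fake_pipeline`, the helper name `sentiment_label`, and the `text` attribute on Comment are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str  # hypothetical attribute holding the comment body


def fake_pipeline(text):
    """Stand-in for a sentiment pipeline: one {'label', 'score'} dict per input."""
    lowered = text.lower()
    if "support" in lowered:
        label = "positive"
    elif "oppose" in lowered:
        label = "negative"
    else:
        label = "neutral"
    return [{"label": label, "score": 1.0}]


def sentiment_label(comment, pipeline):
    """Run the pipeline on the comment text and return only the label string."""
    result = pipeline(comment.text)
    return result[0]["label"]


label = sentiment_label(Comment("I strongly support this rule."), fake_pipeline)
```

Passing the pipeline in as an argument, as the signature above does, lets callers load the model once and reuse it across comments.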
topics
HDAModel
Performs LDA topic modeling
Source code in civiclens/nlp/topics.py
generate_search_vector()
Creates an array of topics to use in the Django search model.
Source code in civiclens/nlp/topics.py
get_terms()
Returns terms for all topics.
Source code in civiclens/nlp/topics.py
run_model(comments)
Runs HDA topic analysis.
Source code in civiclens/nlp/topics.py
LabelChain
Source code in civiclens/nlp/topics.py
generate_label(terms)
Create better topic terms.
Source code in civiclens/nlp/topics.py
create_topics(comments)
Condenses topics for the document summary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
comments | list[Comment] | list of Comment objects | required |
Returns:
Type | Description |
---|---|
dict | Dictionary of topics and corresponding sentiment data |
Source code in civiclens/nlp/topics.py
label_topics(topics, model)
Generates a label for all topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topics | dict[int, list] | dictionary of topics, as lists of terms | required |
model | LabelChain | LLM model to generate labels | required |
Returns:
Type | Description |
---|---|
dict[int, str] | Dictionary of topics and labels |
Source code in civiclens/nlp/topics.py
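The input and output shapes of label_topics (`dict[int, list]` in, `dict[int, str]` out) can be illustrated with a stub labeler. The sketch below calls a `generate_label(terms)` method once per topic, mirroring the LabelChain method of the same name documented above; the `EchoLabeler` class, which joins terms instead of calling an LLM, and the helper name `label_all` are assumptions.

```python
class EchoLabeler:
    """Hypothetical stand-in for LabelChain that needs no LLM access."""

    def generate_label(self, terms):
        # Join the first two terms instead of asking a model for a label.
        return " / ".join(terms[:2])


def label_all(topics, model):
    """Return {topic_id: label}, calling the labeler once per topic."""
    return {topic_id: model.generate_label(terms) for topic_id, terms in topics.items()}


topics = {
    0: ["deadline", "extension", "time"],
    1: ["cost", "burden", "small business"],
}
labels = label_all(topics, EchoLabeler())
```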
stopwords(model_path)
Loads a pickled set of stop words for text processing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_path | Path | path from downloaded model | required |
Returns:
Type | Description |
---|---|
set[str] | Set of stop words. |
Source code in civiclens/nlp/topics.py
topic_comment_analysis(comment_data, model=None, labeler=None, sentiment_analyzer=None)
Runs topic and sentiment analysis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
comment_data | RepComments | RepComments object | required |
model | HDAModel | instance of the topic model class | None |
labeler | LabelChain | chain for generating topic labels | None |
sentiment_analyzer | Callable | function to analyze comment text sentiment | None |
Returns:
Type | Description |
---|---|
RepComments | RepComments object with full topic analysis complete |
Source code in civiclens/nlp/topics.py
text
clean_text(text, patterns=None)
String cleaning function for comments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | comment text | required |
patterns | list[str] | optional list of regular expression patterns to pass in (e.g. [(r'\w+', "-")]) | None |
Returns:
Type | Description |
---|---|
str | Cleaned version of text |
Source code in civiclens/utils/text.py
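The example value for `patterns` above pairs each regular expression with a replacement string. The sketch below applies such pairs in order with `re.sub`. The final whitespace collapse is an assumption about what a cleaning step might do, not a description of the built-in rules in clean_text, and the helper name `apply_patterns` is hypothetical.

```python
import re


def apply_patterns(text, patterns=None):
    """Apply (regex, replacement) pairs in order, then tidy whitespace."""
    for pattern, replacement in patterns or []:
        text = re.sub(pattern, replacement, text)
    # Assumed extra step: collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = apply_patterns("Visit <br> our   site", [(r"<br>", " ")])
```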
regex_tokenize(text, pattern='\\W+')
Splits strings into tokens based on a regular expression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | string to tokenize | required |
pattern | str | regular expression to split tokens on, defaults to runs of non-word characters | '\\W+' |
Returns:
Type | Description |
---|---|
list[str] | List of strings representing tokens |
Source code in civiclens/utils/text.py
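Splitting on `\W+` breaks the text at every run of characters that are not letters, digits, or underscores, so punctuation is dropped along with spaces. The sketch below shows this with `re.split`. Filtering out the empty strings that `re.split` yields at the edges is an assumption about the desired output, and the helper name `tokenize` is hypothetical.

```python
import re


def tokenize(text, pattern=r"\W+"):
    """Split text on the pattern and drop empty tokens."""
    return [token for token in re.split(pattern, text) if token]


tokens = tokenize("Hello, world!")
# A whitespace-only pattern keeps punctuation attached to the words.
whitespace_tokens = tokenize("rule-making is slow", r"\s+")
```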
sentence_splitter(text, sep='.')
Splits string into sentences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | string to process | required |
sep | str | value to separate string on, defaults to '.' | '.' |
Returns:
Type | Description |
---|---|
list[str] | List of strings split on the separator value |
Source code in civiclens/utils/text.py
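A separator-based splitter of this kind can be sketched with `str.split`. Stripping surrounding whitespace and discarding empty pieces, as done below, are assumptions about the intended output rather than documented behavior of sentence_splitter; the helper name `split_sentences` is hypothetical.

```python
def split_sentences(text, sep="."):
    """Split text on sep, trimming whitespace and skipping empty pieces."""
    return [part.strip() for part in text.split(sep) if part.strip()]


sentences = split_sentences("Extend the deadline. The rule is too costly. ")
```

Splitting on a literal '.' also breaks abbreviations and decimals such as "U.S." or "3.5", which is worth keeping in mind when choosing `sep`.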
truncate(text, num_words)
Truncates comments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Text of the comment | required |
num_words | int | Number of words to keep | required |
Returns:
Type | Description |
---|---|
str | Truncated comment |
Source code in civiclens/utils/text.py
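Word-based truncation can be sketched by splitting on whitespace and keeping the first `num_words` pieces. Whether truncate appends an ellipsis or preserves the original spacing is not stated above, so the version below does neither for text that is cut; the helper name `truncate_words` is hypothetical.

```python
def truncate_words(text, num_words):
    """Keep at most num_words whitespace-separated words of text."""
    words = text.split()
    if len(words) <= num_words:
        # Short text is returned unchanged.
        return text
    return " ".join(words[:num_words])


short = truncate_words("one two three four", 2)
```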
database_access
Database
Wrapper for the CivicLens Postgres DB.
Source code in civiclens/utils/database_access.py
pull_data(connection, query, schema=None, return_type='df')
Takes a SQL query and returns a polars dataframe or a list of tuples, depending on return_type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | SQL query | required |
schema | list[str] | list of column names for the dataframe | None |
return_type | str | "df" or "list" | 'df' |
Returns:
Type | Description |
---|---|
DataFrame or List[Tuple] | Polars df of comment data or list of comment data |
Source code in civiclens/utils/database_access.py
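The `return_type` switch in pull_data can be illustrated with the standard library's `sqlite3` module standing in for Postgres. In the sketch below, "list" returns the raw row tuples, while any other value returns a column-oriented dictionary as a dependency-free stand-in for a polars dataframe. The helper name `pull_rows`, the table, and the column names are assumptions, and the real function works through the Database wrapper rather than a sqlite connection.

```python
import sqlite3


def pull_rows(connection, query, schema=None, return_type="df"):
    """Run query; return row tuples for "list", else a {column: values} dict."""
    cursor = connection.execute(query)
    rows = cursor.fetchall()
    if return_type == "list":
        return rows
    # Use the caller's schema if given, else the column names from the cursor.
    columns = schema or [column[0] for column in cursor.description]
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


# In-memory database with a hypothetical comments table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, comment TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", [(1, "Support"), (2, "Oppose")])

query = "SELECT id, comment FROM comments ORDER BY id"
as_list = pull_rows(conn, query, return_type="list")
as_columns = pull_rows(conn, query, schema=["id", "text"])
conn.close()
```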
upload_comments(connection, comments)
Uploads comment data to database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
connection | Database | Postgres client | required |
comments | RepComments | comments to be uploaded | required |
Returns:
Type | Description |
---|---|
None | None, uploads comments to database |
Source code in civiclens/utils/database_access.py