| Title: | Text Processing Tools for Turkish E-Commerce Data |
|---|---|
| Description: | Provides several datasets useful for processing and analysis of text in Turkish from an online shopping platform. |
| Authors: | Betul Kan-Kilinc [aut, cre] (ORCID: <https://orcid.org/0000-0002-3746-2327>), Mine Çetinkaya-Rundel [ctb] (ORCID: <https://orcid.org/0000-0001-6452-2420>), Colin Rundel [ctb] (ORCID: <https://orcid.org/0000-0002-6058-8251>) |
| Maintainer: | Betul Kan-Kilinc <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-20 07:02:58 UTC |
| Source: | https://github.com/bkanx/shoppingwords |
This function processes a data frame containing user reviews and removes predefined stopwords.
Each word is first checked against the package's internal stopwords dataset (stopwords_tr);
if no match is found there, the function falls back to the broader stopwords_iso list.
match_stopwords(df)
| Argument | Description |
|---|---|
| df | Data frame containing user reviews, with required columns |
The function converts text to a standardized format by removing accents and special characters, transforming it into basic Latin characters, and making all letters lowercase. It then tokenizes the text, filters out stopwords, and returns the cleaned version.
A modified data frame with an additional cleaned_text column containing stopword-free text.
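The normalization and filtering steps described above can be sketched in base R. This is an illustrative approximation, not the package's implementation: the helper names normalize_text and remove_stopwords are hypothetical, and the two short stopword vectors are made-up stand-ins for stopwords_tr and stopwords_iso.

```r
# Hypothetical stand-ins for the two stopword lists (illustrative values only).
primary_stopwords  <- c("bu", "ama", "gibi")  # plays the role of stopwords_tr
fallback_stopwords <- c("ancak", "cok")       # plays the role of stopwords_iso

# Map Turkish letters to basic Latin, lowercase the text, and replace
# anything that is not a letter, digit, or space with a space.
normalize_text <- function(x) {
  # ç ğ ı ö ş ü Ç Ğ İ Ö Ş Ü, written as \u escapes to stay encoding-independent
  turkish <- "\u00e7\u011f\u0131\u00f6\u015f\u00fc\u00c7\u011e\u0130\u00d6\u015e\u00dc"
  x <- chartr(turkish, "cgiosuCGIOSU", x)
  x <- tolower(x)
  x <- gsub("[^a-z0-9 ]", " ", x)
  trimws(gsub(" +", " ", x))
}

# Tokenize on spaces and drop a token if it is in the primary list or,
# failing that, in the fallback list.
remove_stopwords <- function(x) {
  tokens <- strsplit(normalize_text(x), " ", fixed = TRUE)
  vapply(tokens, function(tok) {
    is_stop <- tok %in% primary_stopwords | tok %in% fallback_stopwords
    paste(tok[!is_stop], collapse = " ")
  }, character(1))
}

reviews_sample <- data.frame(
  comment = c("Bu ürün xs ancak fiyatı yüksek gibi",
              "Fiyat çok pahalı ama kaliteli iyi"),
  rating = c(4.5, 3.0)
)
reviews_sample$cleaned_text <- remove_stopwords(reviews_sample$comment)
reviews_sample$cleaned_text
```

Because the stopword vectors here are stored in their already-normalized form (for example "cok" rather than "çok"), matching happens after normalization. Whether match_stopwords compares words before or after normalization is not stated in this manual.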
reviews_sample <- tibble::tibble(
  comment = c("Bu ürün xs ancak fiyatı yüksek gibi", "Fiyat çok pahalı ama kaliteli iyi"),
  rating = c(4.5, 3.0)
)
match_stopwords(reviews_sample)
Contains common negative-emotion phrases extracted from user reviews.
phrases
A tbl_df with 205 rows and 1 variable:
ngrams.
phrases
User reviews collected from an e-commerce site.
reviews
A tbl_df with 260,308 rows and 3 variables:
Rating score, out of 5.
Comment text, in Turkish.
Rating ID.
reviews
A sample dataset for testing analysis functions. It is separate from the reviews data.
The text column in this data frame is analogous to the comment column in the reviews
data frame. Note that 170 texts in this data frame also appear, verbatim, as comments
in the reviews dataset, because some users wrote identical comments.
The id column shows that these are not the same observations, only identically worded
comments from different reviews.
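A check of this kind can be sketched with toy data. The two data frames below are hypothetical stand-ins with made-up values, not the package datasets. The column names text, comment, and id follow the description above, and the sketch assumes both data frames carry an id column.

```r
# Toy stand-ins for the two datasets (hypothetical values, not package data).
reviews_toy <- data.frame(
  id = c(1, 2, 3),
  comment = c("hizli kargo", "cok iyi", "berbat"),
  stringsAsFactors = FALSE
)
reviews_test_toy <- data.frame(
  id = c(101, 102, 103),
  text = c("cok iyi", "fena degil", "berbat"),
  stringsAsFactors = FALSE
)

# Flag test texts that also occur, verbatim, as a comment in the other data frame.
shared_text <- reviews_test_toy$text %in% reviews_toy$comment
n_shared <- sum(shared_text)

# Ids present in both data frames. An empty result means the shared texts
# come from different reviews rather than duplicated observations.
shared_ids <- intersect(reviews_test_toy$id, reviews_toy$id)

n_shared
length(shared_ids)
```

Comparing on the text alone would suggest duplicated rows, so the id comparison is what separates repeated wording from repeated observations.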
reviews_test
A tbl_df with 1,481 rows and 4 variables:
Rating score, out of 5.
Comment text, in Turkish.
n for negative, p for positive.
Rating ID.
reviews_test
A dataset of stopwords used in Turkish text analysis.
stopwords_tr
A tbl_df with 92 rows and 1 variable:
Stopword, in Turkish.
stopwords_tr