Exporting your Steam Reviews is a hassle, but possible.
I published over 470 reviews on Steam and want to replicate and save them on my website, and also make my positive and favorite recommendations more discoverable.
The plain Steam Community review listing (date-ordered, 10 items per page) is not feasible for this. No filters, no search, no way to differentiate between recommendations.
The GDPR gives you the right to export your personal data, which includes user data associated with your account, such as Steam Reviews.
How to discover this page on the Steam Website…
You can reach the relevant information through
→ Steam Support
→ My Account
→ Data Related to Your Steam Account
Licenses is Structured Data
When you look at your Licenses, for example, it is obviously structured data, as the Privacy Policy claims.
It is a table with four columns (three of them named):
Date          Item                          [action]   Acquisition Method
16 Jun, 2024  Planet Explorers              [Remove]   Complimentary|Retail|Steam Store
…             …                             …          …
20 Nov, 2004  Half-Life 2 Retail Standard              Complimentary
And it’s all data.
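To illustrate how directly that maps to a machine-readable format, here is a minimal Nushell sketch (the field names are mine, the row is the Half-Life 2 example above):

[
  { date: '20 Nov, 2004', item: 'Half-Life 2 Retail Standard', acquisition: 'Complimentary' }
] | to json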
Structured, Incomplete Curator Reviews
The My Curator Reviews page also has a table, but it doesn’t show the full data right away. You have to click “Load More Data” until everything is loaded (see the sketch after the column list). These are its columns:
Curator Group
App Reviewed
Blurb
External Link
Link Text
Created
Updated
Deleted?
Comment Count
Up votes
Click Through
Recommended?
Compensated?
Received Free Copy
…
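Once everything is loaded, saving the page and extracting the table should be straightforward. A sketch with the Nushell query plugin (the filename is mine, and I am assuming --as-table can match the table by these header names; untested):

open --raw curator-reviews.html
| query web --as-table ['Curator Group' 'App Reviewed' 'Blurb' 'External Link' 'Link Text' 'Created' 'Updated' 'Deleted?' 'Comment Count' 'Up votes' 'Click Through' 'Recommended?' 'Compensated?' 'Received Free Copy']
| to json
| save curator-reviews.json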
Unstructured, Incomplete Reviews
The Reviewed Games link takes you to your public Steam Community Reviews page.
You can only see 10 reviews per page (I have 479 right now, so that’s 48 pages)
You can’t see the product name [as text]
The HTML source is a mess
Invalid page indices return pseudo-success results (0 and smaller return the first page, anything higher than the last page with content returns an empty page list; no adequate HTTP status code, etc.)
Multiple values within single elements, some of them optional
Arbitrary value and value label changes (1 person but 3 people, helpful and funny, hrs on record and hrs at review time) (see the parsing sketch after this list)
Arbitrary, unclear date values: Posted 15 June. (will the year appear at some point?)
A ton of whitespace
Elements for layout fixes (a div just for CSS clear: left;)
Elements for layout (leftcol, rightcol, hr)
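As an example of what normalizing these quirks looks like, here is a minimal Nushell sketch for the helpful-count header (the text format is taken from this page’s HTML, the regex is mine):

'5 people found this review helpful'
| parse --regex '(?<count>\d+) (?:person|people) found this review helpful'
| get count.0
| into int
# => 5; the same pattern also matches '1 person found this review helpful'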
When I contacted Steam support, they responded with “it’s HTML, so it’s structured”, which is obviously not in the spirit of the law. And this HTML is of very different quality than the Licenses page.
My disapproval and disagreement were ignored until I asked again a month later (I’ll give them the benefit of the doubt that it got lost or was unintentionally overlooked).
After my follow-up request, which pointed, among other things, to the GDPR requirement of an adequate response within a month, the issue was escalated to the legal team and I received an answer.
However, as was pointed out above, all the data is provided as structured html, which is machine-readable and can be trivially parsed in any common programming language. Data does not have to be provided in an html <table> element to be machine-readable, <div> elements can be parsed just as easily. In order to extract your reviews into a JSON file, for instance, you could use the Python script provided below. If you have no experience running Python scripts, we can send you the result of the script by email; its length exceeds what fits into one of our support messages.
Apparently, they deem this good enough as structured, parseable data, and individually requesting and combining the pages is supposed to constitute their GDPR compliance.
I’m a programmer. And while Python is not my tool of choice, I can do that.
It still took me two days of significant time investment (multiple hours) to actually fetch, clean up, and transform the data.
A lot of that time went into exploring and identifying options in Nushell and experimenting with the implementation, but I count that as part of the effort if this is the supposed workflow.
With a clean data set I would have expected it to take an hour at most.
How is a non-programmer supposed to do this?
Their Python Suggestion (Incomplete Data)
Python code provided by Steam Support…
import requests
from bs4 import BeautifulSoup
import json

def fetch_and_parse_html_to_json(base_url, json_file):
    data = []  # Initialize the list to store all data across pages

    # Loop through all the pages
    for page_number in range(1, 49):
        # Generate the URL for the current page
        url = f"{base_url}?p={page_number}"

        # Send a request to the URL to get the HTML content
        response = requests.get(url)
        response.raise_for_status()  # This will raise an exception for HTTP errors

        # Use BeautifulSoup to parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all relevant divs
        titles = [div.text.strip() for div in soup.find_all('div', class_='title')]
        hours = [div.text.strip() for div in soup.find_all('div', class_='hours')]
        contents = [div.text.strip() for div in soup.find_all('div', class_='content')[1:]]  # Skip the first 'content' div
        posted_dates = [div.text.strip() for div in soup.find_all('div', class_='posted')]

        # Append each item's details to the data list
        for title, hour, content, posted in zip(titles, hours, contents, posted_dates):
            data.append({
                'title': title,
                'hours': hour,
                'content': content,
                'posted': posted
            })

    # Write data to a JSON file after processing all pages
    with open(json_file, 'w', encoding='utf-8') as jfile:
        json.dump(data, jfile, indent=4, ensure_ascii=False)

# Usage
base_url = 'https://steamcommunity.com/id/<userid>/recommended'
json_filename = 'output.json'
fetch_and_parse_html_to_json(base_url, json_filename)
Steam Support
Escalated Legal Review
Exporting Steam Reviews Data With Nushell
For now, I will dump my current script here. In the next iteration, when I export an update of my reviews, I will clean it up further.
One thing I was confused by was that some queries return a single-item list, where I add first. What I would prefer would be single, to assert correctness. But I wasn’t clear on why I get a list in the first place, and uniq didn’t seem to do what I want either.
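To illustrate that first point, a sketch of the behaviour (the HTML snippet here is made up, shaped like the title div documented in the script below):

'<div class="title"><a href="#">Recommended</a></div>'
| query web --query '.title'
# the result is a list with one entry per match, not a single string, hence the trailing first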
Another thing I was confused by was
Nushell Export Script…
# Requires query plugin
# plugin add (($nu.current-exe | path parse | get parent) + '/nu_plugin_query.exe')
#
# Steam Review HTML
# div.review_box
# div.header - '5 people found this review helpful'
# div - {clear left}
# div.review_box_content
# div.leftcol
# a.game_capsule_ctn [href=https://steamcommunity.com/app/<appid>] !-> https://store.steampowered.com/app/<appid>/
# img.game_capsule [src=https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/<appid>/capsule_184x69.jpg?t=1705692327]
# div.rightcol
# div.vote_header
# img.review_source.tooltip
# div.thumb
# a [href=https://steamcommunity.com/id/<userId>/recommended/<appid>/]
# img [src=https://community.akamai.steamstatic.com/public/shared/images/userreviews/icon_thumbsUp.png]
# div.title
# a [href=https://steamcommunity.com/id/<userId>/recommended/<appid>/] - 'Recommended'
# div.hours
# 3.7 hrs on record
# 3.7 hrs on record \n (2.6 hrs at review time)
# div.content
# div.posted - 'Posted 3 June.'
# div.hr
# div.control_block
# div {clear left}
plugin use query
# 1. Download all pages - note max page is magic number and steam returns empty pages after last record
def downloadAllPages [userId, maxPage: int] {
for i in 1..$maxPage { http get $"https://steamcommunity.com/id/($userId)/recommended?p=($i)" | save -f $"p($i).html" }
}
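# Sketch of an alternative (untested): stop at the first page without any .review_box
# instead of hard-coding the max page, relying on the empty-page behaviour described above.
def downloadAllPagesUntilEmpty [userId] {
    mut i = 1
    loop {
        let page = http get $"https://steamcommunity.com/id/($userId)/recommended?p=($i)"
        if ($page | query web --query '.review_box' | is-empty) { break }
        $page | save -f $"p($i).html"
        $i = $i + 1
    }
}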
# 2. Combine all review blocks
def combine [] {
ls p*.html | get name | sort --natural | each {|f| open --raw $f | to text } | each {|c| $c | query web --query '.review_box' --as-html } | flatten | str join | save -f combined.html
}
# 3. Extract data
def parseRecords [filepath: path] {
open $filepath | query web --query '.review_box' --as-html | each {|x| $x | parseRecord }
}
def parseRecord [] {
{
header: ($in | qwebtext '.header' | to text)
game_capsule: ($in | qwebhtml '.game_capsule')
thumb: ($in | qwebhtml '.vote_header .thumb img')
rating: ($in | qwebtext '.vote_header .title')
hours: ($in | qwebhtml '.vote_header .hours')
content: ($in | qwebhtml '.content')
posted: ($in | qwebtext '.posted')
url: ($in | qwebhtml '.vote_header .title' | parse --regex 'href="(?<url>[^"]+)"' | get url | first)
} |
insert hoursNow ($in.hours | parse --regex '(?<hoursNow>[0-9\.]+) hrs on record' | get hoursNow | to text) |
insert hoursAtReview ($in.hours | parse --regex '(?<hoursAtReview>[0-9\.]+) hrs at review time' | get hoursAtReview | to text) |
reject hours |
insert appId ($in.url | parse --regex '/recommended/(?<appId>[0-9]+)/' | get appId | first | into int) |
insert userId ($in.url | parse --regex '/id/(?<userId>[^/]+)/recommended/' | get userId | first) |
insert thumbUrl ($in.thumb | parse --regex 'src="(?<url>[^"]+)"' | get url | to text) |
reject thumb |
insert gameCapsuleUrl ($in.game_capsule | parse --regex 'src="(?<url>[^"]+)"' | get url | to text) |
reject game_capsule
}
def qwebhtml [path: string] { $in | query web --query $path --as-html | first }
def qwebtext [path: string] { $in | query web --query $path | to text | str trim }
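# Earlier exploration below (not called in the usage steps at the bottom).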
def extractReviewRecord [] {
$in | each {|x|
$x |
{
hours: ($x | findHours | parseHours)
appid: ($x | findAppId | get appid)
# should use | uniq
content: ($x | query web --query '.content' --as-html | first)
} |
append ($x | query web --query 'div.title' --as-html |
parse '<div class="title"><a href="https://steamcommunity.com/id/{User}/recommended/{AppId}/">Recommended</a></div>'
) |
flatten }
#$in | each {|x| $x | findAppId }
}
def findHours [] {
$in | query web --query '.hours' --as-html
}
def parseHours [] {
#$in | parse --regex '^(?s).*?(?<hoursNow>[0-9]+[.][0-9]+) hrs on record.*?(?:(?<hoursReview>[0-9]+[.][0-9]+) hrs at review time)?.*?$' }
#$in | parse --regex '(?<hoursReview>[0-9]+[.][0-9]+) hrs at review time' }
$in | {
hoursNow: ($in | parse --regex '(?<hoursNow>[0-9]+[.][0-9]+) hrs on record' | get hoursNow | to text)
hoursAtReview: ($in | parse --regex '(?<hoursAtReview>[0-9]+[.][0-9]+) hrs at review time' | get hoursAtReview | to text)
}
}
def findAppId [] {
$in | query web --query 'a.game_capsule_ctn' --attribute 'href' | parse 'https://steamcommunity.com/app/{appid}'
}
def getAppNameWeb [appId: int] {
#http get $'https://store.steampowered.com/app/($appId)/' | query web --query '#appHubAppName' | to text
http get $'https://steamcommunity.com/app/($appId)' | query web --query '.apphub_AppName' | to text
}
def getAppName [appId: int] {
open SteamAppList.json | get applist.apps | where appid == $appId | get name | to text
}
def insertAppNames [] {
#const appListFilename = 'SteamAppList.json'
#if (SteamAppList.json | not ($in | path exists)) {
#http get https://api.steampowered.com/ISteamApps/GetAppList/v2/ | save SteamAppList.json
#}
#const apps = open SteamAppList.json | get applist.apps
$in | each {|x| $x | insert appName (open SteamAppList.json | get applist.apps | where appid == $x.appId | get name | to text) }
}
# note max page is predefined magic number and steam returns empty pages after last record
downloadAllPages '<userId>' 50
combine
parseRecords 'combined.html' | save combined.json
http get https://api.steampowered.com/ISteamApps/GetAppList/v2/ | save SteamAppList.json
# The records from Steam API are not unique? lol => distinct them
let apps = open SteamAppList.json | get applist.apps | uniq | rename --column {name: appName}
open combined.json | join --left $apps appId appid | save combined-named.json -f
# The steam app list from the Steam API does not have all apps. Missing even store-available apps.
# This command is very slow because of the web requests.
open combined-named.json | where {|x| $x.appName | is-empty} | reject appName | each {|x| insert appName (getAppNameWeb $x.appId) } | save combined-named2.json
# Should use some form of text template
#const mdTemplate = "+++\r\ntitle = \"{Title}\""
open combined-named.json |
each {|x| $x | insert md ($"+++\r\ntitle = \"($x.appName) Review\"\r\ndate = \"($x.posted)\"\r\n+++\r\n($x.rating)\r\n\r\n($x.content)\r\n\r\n[Originally posted as a Steam Review]\(($x.url)\)\r\n") } |
each {|x| $x | get md | save ($'md/($x.appId).md') }
open combined-named2.json |
each {|x| $x | insert md ($"+++\r\ntitle = \"($x.appName) Review\"\r\ndate = \"($x.posted)\"\r\n+++\r\n($x.rating)\r\n\r\n($x.content)\r\n\r\n[Originally posted as a Steam Review]\(($x.url)\)\r\n") } |
each {|x| $x | get md | save ($'md/($x.appId).md') -f }
# I previously used the store app page in getAppNameWeb but that's not always available. I used this third step to fixup those missing. That should no longer be necessary. The community page should always be there.
#open combined-named2.json | where {|x| $x.appName | is-empty} | reject appName | each {|x| insert appName (getAppNameWeb $x.appId) } | save combined-named3.json
#open combined-named3.json |
#each {|x| $x | insert md ($"+++\r\ntitle = \"($x.appName) Review\"\r\ndate = \"($x.posted)\"\r\n+++\r\n($x.rating)\r\n\r\n($x.content)\r\n\r\n[Originally posted as a Steam Review]\(($x.url)\)\r\n") } |
#each {|x| $x | get md | save ($'md/($x.appId).md') -f }
Replicated Reviews
My replicated Steam Reviews are now on this website. In a shitty form. But it’s a start.