I have never used PySpark, but I do know some about python.
How are you installing pyspark? Are there any errors?
I have tried a few things:
Case 1: pip install pyspark together with installing the latest version of Apache Spark leads to errors when calling the pyspark.sql.DataFrame.show() method of DataFrame objects.
Case 2: pip install pyspark together with installing an older version of Apache Spark, i.e. having a version mismatch between PySpark and Apache Spark, leads to errors even when instantiating a SparkSession.
Case 3: pip install pyspark==3.3.4 previously led to an error - the system was unable to build wheels for the package. Now it seems to install that way, but behaves the same as in the previous case.
Case 4: Running ./build/mvn from Bash in the appropriate directory led to: Caused by: org.apache.maven.plugin.PluginExecutionException: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:4.4.0:compile failed.
Running this code after having installed this stuff as in case 3:
from pyspark.sql import SparkSession
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.printSchema()
df.show()
leads to this:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "[python file path]", line 6, in <module>
spark = SparkSession.builder.getOrCreate()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[python file path]", line 269, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[python file path]", line 483, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "[python file path]", line 197, in __init__
self._do_init(
File "[python file path]", line 282, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[python file path]", line 402, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[python file path]", line 1585, in __call__
return_value = get_return_value(
^^^^^^^^^^^^^^^^^
File "[python file path]", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ExceptionInInitializerError
at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:56)
at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264)
at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes(MemoryManager.scala:254)
at org.apache.spark.memory.MemoryManager.$anonfun$pageSizeBytes$1(MemoryManager.scala:273)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.memory.MemoryManager.<init>(MemoryManager.scala:273)
at org.apache.spark.memory.UnifiedMemoryManager.<init>(UnifiedMemoryManager.scala:58)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:464)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.lang.IllegalStateException: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int)
at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:113)
... 25 more
Caused by: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int)
at java.base/java.lang.Class.getConstructor0(Class.java:3784)
at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2955)
at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:71)
... 25 more
SUCCESS: The process with PID 21224 (child process of PID 9020) has been terminated.
SUCCESS: The process with PID 9020 (child process of PID 15684) has been terminated.
SUCCESS: The process with PID 15684 (child process of PID 4980) has been terminated.
Process finished with exit code 1
The system environment variables JAVA_HOME, HADOOP_HOME, and SPARK_HOME are configured. The relevant binary directories are included in the Path system environment variable. PYTHON_SPARK is set to python.
EDIT: Great, and now Maven can't even attempt to build the package and throws the error
Error occurred during initialization of VM
Could not reserve enough space for 2097152KB object heap
Just great.
Just in case: if I install the library the first way (case 1), the logs for the same piece of code start with this:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
root
|-- a: long (nullable = true)
|-- b: double (nullable = true)
|-- c: string (nullable = true)
|-- d: date (nullable = true)
|-- e: timestamp (nullable = true)
24/07/22 19:04:46 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.io.EOFException
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
... 26 more
One Stack Overflow thread mentions running Python 3.12. Are you running 3.12? Does it help if you use Python 3.11?
Otherwise, an interesting clue might be the "Caused by: java.io.EOFException", an end-of-file exception.
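If you want to try 3.11 without touching your main install, something like this might work from a Windows command prompt (just a rough sketch: it assumes the py launcher has a 3.11 interpreter registered, and repro.py stands in for your script above):
py -3.11 -m venv env311
env311\Scripts\activate
pip install pyspark
python repro.py
With the venv activated, python on the Path points at the 3.11 interpreter, so the whole thing should run under 3.11 rather than 3.12.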
That actually worked. Thank you.
I have got to say that I have a special hatred for arcane programming errors like in this case.
I am running 3.12. I have not tried running 3.11. I highly doubt that will change anything, but I guess I'll try it when I'm able to.
I'm not sure what to glean from the EOF exception, though.
It's much easier in Linux.
I do not have a programming setup on my Linux OS yet, although I am considering trying to do this inside a VM. That will be a bit painful, though, as VirtualBox doesn't seem to allow for much graphics memory, meaning the framerate will be low.
If things go dire and I fail to find any sort of way out, I will just have to ask other people to run my code instead, ugh.
You could try WSL; it's basically just a headless Linux VM, so it's ideal for stuff like this. The terminal itself runs on Windows, so there are no issues with framerate or anything: https://learn.microsoft.com/en-us/windows/wsl/install
I'm not really sure how to actually run and debug the relevant Python files using WSL, mostly because I haven't used WSL much (I have only installed one distribution for it and set up the first user). Any help in this regard?
Sorry for the late response.
I can't help with PySpark specifically because I have no experience with it. In general, though, you'll have to install the tooling you need to compile/run the program in WSL. I haven't used Spark in years, so I don't know the specifics, but you'll want at least Java and Python installed there. On Ubuntu, that means the packages default-jdk, python3, python3-pip, and python3-venv (if you're using venv), as well as python-is-python3 for convenience. If you're using venv, you might want to rerun python -m venv env to make sure it has the files Bash needs, then do source env/bin/activate to activate the venv. You might also have to install pyspark from the Bash shell in case it needs to build anything platform-specific. You can set environment variables in ~/.bashrc (that's the home dir in the Linux VM, not Windows, so use the terminal to edit it, e.g. nano ~/.bashrc, or vim ~/.bashrc if you're familiar with vi) in the form export VARIABLE=VALUE (put quotes around VALUE if it has spaces etc.), then start a new shell to load those (do exec bash to replace the currently running shell with a new process).
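Putting it together, a rough sketch of what that might look like in an Ubuntu WSL shell (untested, and the JAVA_HOME path below is just an example of an export, adjust or skip it as needed):
sudo apt update
sudo apt install default-jdk python3 python3-pip python3-venv python-is-python3
python -m venv env                 # recreate the venv so it has the Linux-side files
source env/bin/activate
pip install pyspark                # reinstall in case anything is platform specific
echo 'export JAVA_HOME="/usr/lib/jvm/default-java"' >> ~/.bashrc   # example export only
exec bash                          # fresh shell so the export is picked up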
From there you should be able to just run the code normally but in WSL instead
Thankfully, the problem has been resolved, so after I finish with this project, I will have more time to get an actual programming setup on the non-VM NixOS that I have already installed.
On the road to fully automated luxury gay space communism.
Spreading Linux propaganda since 2020